## More on Word2Vec

parameters `θ`: matrices `U` and `V` (**each word vec is a row**).

The predictions don't take into account the *distance* between center word **c** and outside word **o**.
⇒ all word vecs predict high for the stopwords.
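
For reference, the skip-gram prediction depends only on the dot product between the two vectors, with no notion of position or distance:

$$ P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)} $$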

## Optimization Basics

min loss function: `J(θ)`

### gradient descent

*direction of the gradient = direction where J(θ) increases the most.*
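
So to *minimize* `J(θ)` we take small steps in the opposite direction (`α` is the learning rate / step size):

$$ \theta^{\text{new}} = \theta^{\text{old}} - \alpha \, \nabla_\theta J(\theta) $$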

#### SGD

problem: `J(θ)` sums over *all* words/positions in the corpus, so `∇J(θ)` is very expensive to compute.
⇒ use *random samples* each time
⇒ each time we only sample one window of `2m+1` words

I.e. each time we compute the gradient only on a **minibatch**.
⇒ each time we only update the *word vecs that appear* in the minibatch.
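
A minimal sketch of this loop, with `grad_on_window` as a hypothetical helper that returns gradients only for the word ids appearing in the sampled window:

```python
import random

def sgd(corpus_windows, theta, grad_on_window, lr=0.05, steps=100_000):
    """theta: array of word vectors, one row per word id."""
    for _ in range(steps):
        window = random.choice(corpus_windows)   # sample one window of 2m+1 words
        grads = grad_on_window(theta, window)    # dict: word id -> gradient for that row
        for word_id, g in grads.items():         # sparse update: only rows in the window change
            theta[word_id] -= lr * g
    return theta
```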

## Word2Vec: Model Variants

final embedding of a word = *average* of `u` and `v`
⇒ we can also use only one vec per word; not much difference.

2 main variants of the w2v family:

- skip-gram (**SG**): predict outside word **o** with center word **c**: `P(o|c)`. ← presented in the class.
- continuous bag-of-words (**CBOW**): predict center word **c** using outside words **o**.

## negative sampling

So far: naive softmax, i.e. a sum over *all* words in the vocab, which is *expensive*.

⇒ **negative sampling**.
Use several *binary logistic regressions* on:

- the **true pair** (center word and a true context word)
- several **noise pairs** (center word and a *random* word) ← *random negative pairs*

**logistic regression** = *softmax with vocab size = 2*
sigmoid function: maps an inner-product value to a probability score

⇒ objective function (to maximize):
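
For one (center, outside) pair with `k` sampled negatives, the standard form is (σ is the sigmoid `1/(1+e^{-x})`):

$$ J_t(\theta) = \log \sigma(u_o^\top v_c) + \sum_{i=1}^{k} \mathbb{E}_{j \sim P(w)}\left[\log \sigma(-u_j^\top v_c)\right] $$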

where `P(w)` is the sampling distribution for the negatives.
in practice: `P(w) ∝ U(w)^(3/4)`,

i.e. the 3/4 power of the *unigram* distribution → makes *rare words appear more often*.

⇒ loss function (to minimize):
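
i.e. the negation of the objective above, summed over the `K` sampled negative words:

$$ J_{\text{neg}}(o, v_c, U) = -\log \sigma(u_o^\top v_c) - \sum_{k \in K} \log \sigma(-u_k^\top v_c) $$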

**QUESTION: why put the minus inside the logit?**
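
A minimal numpy sketch of this loss and its (sparse) gradients; the shapes and names here are illustrative, not the assignment's API:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss_and_grads(v_c, u_o, U_neg):
    """Negative-sampling loss for one (center, outside) pair.

    v_c:   (d,)   center word vector
    u_o:   (d,)   true outside word vector
    U_neg: (k, d) vectors of the k sampled negative words
    """
    pos = sigmoid(u_o @ v_c)          # want this close to 1
    neg = sigmoid(-U_neg @ v_c)       # want these close to 1 (negatives pushed away)
    loss = -np.log(pos) - np.log(neg).sum()

    grad_vc = (pos - 1.0) * u_o + U_neg.T @ (1.0 - neg)   # gradient w.r.t. v_c
    grad_uo = (pos - 1.0) * v_c                           # gradient w.r.t. u_o
    grad_Uneg = np.outer(1.0 - neg, v_c)                  # gradients w.r.t. each negative
    return loss, grad_vc, grad_uo, grad_Uneg
```

Only the vectors of the center word, the true outside word, and the k negatives get non-zero gradients, which is why the SGD updates stay sparse.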

## Alternative Methods: cooccurrence counts

**Why not use a co-occurrence count matrix?**

- co-occurrence in windows
- co-occurrence in document

problems with co-occurrence matrix:

- matrix size increases with vocab size
- high dimensional, expensive storage
- Subsequent classification models have sparsity issues

⇒ *model less robust*

#### solution 1: low-dimensional vectors

(singular value decomposition, keeping the k largest singular values)

⇒ popular around the year 2000.
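
A minimal numpy sketch with a toy co-occurrence matrix; row `i` of `word_vecs` is the k-dim vector of word `i`:

```python
import numpy as np

X = np.array([[0., 2., 1.],
              [2., 0., 3.],
              [1., 3., 0.]])          # toy window co-occurrence counts
k = 2
U, S, Vt = np.linalg.svd(X)           # full SVD
word_vecs = U[:, :k] * S[:k]          # keep only the k largest singular values
```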

#### solution 2: hacks to the co-occurrence matrix

- trim stopwords
- count closer words more (weight counts by distance)
- use Pearson correlations instead of raw counts

#### comparison

#### encode meanings

insight: **ratios** of co-occurrence probabilities can encode meaning.

⇒ **make dot products equal to the log of the co-occurrence probability**

⇒ *vector differences* then give ratios of co-occurrence probabilities
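
In symbols, this is the relationship GloVe aims for:

$$ w_i \cdot w_j = \log P(i \mid j) \quad\Rightarrow\quad w_x \cdot (w_a - w_b) = \log \frac{P(x \mid a)}{P(x \mid b)} $$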

## GloVe

Combine the best of both worlds: count method and prediction method.

log-bilinear model, trained with a weighted least-squares objective where the weighting term `f(X_ij)` is capped (see below):
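
A standard statement of the GloVe objective (following Pennington et al.; `x_max ≈ 100` and `α ≈ 3/4` are the paper's defaults):

$$ J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2, \qquad f(x) = \begin{cases} (x/x_{\max})^{\alpha} & x < x_{\max} \\ 1 & \text{otherwise} \end{cases} $$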

- Fast training
- Scalable to huge corpora
- Good performance even with small corpus and small vectors

## Evaluating Word Vectors

*Intrinsic vs extrinsic* in NLP-eval:
**Intrinsic**:

- Eval on a specific/intermediate *subtask*: e.g. word similarity, POS tagging, etc.
- Fast to compute
- Not clear if really helpful unless correlation to a real task is established

**Extrinsic**:

- Eval on a *real* task: e.g. web search / question answering / phone dialog
- Hard to run
- Hard to diagnose

⇒ today: focus on intrinsic word vector evaluation

### Eval on word Analogies

**Semantic** eval: city-in-state (e.g. *"Chicago Illinois Houston Texas"*)
**Syntactic** eval: gram4-superlative (e.g. *"bad worst big biggest"*)
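
A toy sketch of how such an analogy is answered with vector arithmetic; `vecs` is a hypothetical dict mapping words to numpy vectors:

```python
import numpy as np

def analogy(a, b, c, vecs):
    """Answer 'a is to b as c is to ?' by cosine similarity to x_b - x_a + x_c."""
    target = vecs[b] - vecs[a] + vecs[c]
    target = target / np.linalg.norm(target)
    best_word, best_sim = None, -np.inf
    for word, v in vecs.items():
        if word in (a, b, c):                      # exclude the query words themselves
            continue
        sim = float(v @ target) / np.linalg.norm(v)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# e.g. analogy("bad", "worst", "big", vecs)  ->  expected "biggest"
```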

Hyperparameters:

### Eval on word Similarities

## Word Senses

sense *ambiguity*: words have lots of meanings

- crude solution (2012): for each common word, find clusters of the contexts in which it occurs and split the word into pseudowords.
- Linear Algebraic Structure solution (2018): the word vector is a *linear superposition* of its sense vectors:
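
e.g. the lecture's *pike* example, with weights α_i given by the relative frequencies f_i of each sense:

$$ v_{\text{pike}} = \alpha_1 v_{\text{pike}_1} + \alpha_2 v_{\text{pike}_2} + \alpha_3 v_{\text{pike}_3}, \qquad \alpha_1 = \frac{f_1}{f_1 + f_2 + f_3}, \ \ldots $$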

⇒ result: because of ideas from *sparse coding* you can actually separate out the senses (provided they are relatively common).
Actual word/sense vectors are very sparse.

**Extrinsic** word vector evaluation: e.g. NER (finding person/organization/location)

#### Part 2 of series «XCS224N: NLP with deep learning»:

- [XCS224N] Lecture 1 – Introduction and Word Vectors
- [XCS224N] Lecture 2 – Word Vectors and Word Senses
- [XCS224N] Lecture 3 – Neural Networks
- [XCS224N] Lecture 4 – Backpropagation
- [XCS224N] Lecture 5 – Dependency Parsing
- [XCS224N] Lecture 6 – Language Models and RNNs
- [XCS224N] Lecture 7 – Vanishing Gradients and Fancy RNNs
- [XCS224N] Lecture 8 – Translation, Seq2Seq, Attention
