## I - Introduction to Word Embeddings

**Word representation**

So far: representing words with one-hot encodings → relationships between words do not generalize.

⇒ want to learn a *featurized representation* for each word as a high-dimensional vector

→ visualize word embeddings in 2-dim space, e.g. via *t-SNE*
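
A minimal sketch of such a visualization, assuming scikit-learn's `TSNE` and a (here random) embedding matrix `E`:

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy embedding matrix: 300-dim embeddings for a 10000-word vocabulary (random here).
rng = np.random.default_rng(0)
E = rng.normal(size=(300, 10000))

# t-SNE maps each 300-dim word vector down to 2-D so the words can be plotted.
points_2d = TSNE(n_components=2, perplexity=30).fit_transform(E.T[:500])  # first 500 words
print(points_2d.shape)  # (500, 2)
```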

### Using word embeddings

example: NER (named entity recognition)

transfer learning: using pretrained embeddings

- learn word embeddings from large text corpus (or download pre-trained embeddings)
- transfer embedding to new task with smaller training set
- (optional) fine-tune word embeddings with new data

Word embedding ~ the face encoding (embedding) seen in previous weeks (face recognition).

### Properties of word embeddings

*Analogy reasoning* of word embeddings.

e.g. man->woman as king->?

Embedding vectors have the relationship: `e_man - e_woman ~= e_king - e_queen`

Find the nearest-neighbor word (by cosine similarity): `argmax_w sim(e_w, e_king - e_man + e_woman)`.

examples: man:woman ≈ king:queen; big:bigger ≈ tall:taller; Japan:yen ≈ Russia:ruble
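
A minimal numpy sketch of this analogy reasoning; `E` and the `word_to_idx` vocabulary mapping are assumed inputs:

```python
import numpy as np

def cosine_similarity(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(a, b, c, E, word_to_idx):
    """Answer 'a is to b as c is to ?': argmax_w sim(e_w, e_b - e_a + e_c)."""
    target = E[:, word_to_idx[b]] - E[:, word_to_idx[a]] + E[:, word_to_idx[c]]
    best_word, best_sim = None, -np.inf
    for word, i in word_to_idx.items():
        if word in (a, b, c):               # skip the three query words themselves
            continue
        sim = cosine_similarity(E[:, i], target)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# e.g. analogy("man", "woman", "king", E, word_to_idx) should return "queen" with good embeddings
```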

### Embedding matrix

learning an embedding matrix `E` of shape `(embed_dim, vocab_size)`, e.g. embed_dim=300, vocab_size=10000

notation:

- `o_w` = one-hot encoding of a word `w`, dim=(10000, 1)
- `e_w = E * o_w`, dim=(300, 1)

→ Learn `E` by random initialization & gradient descent.

(In practice: use a lookup function instead of the matrix multiplication.)
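
A small sketch contrasting `e_w = E * o_w` with the column lookup used in practice (toy sizes as above):

```python
import numpy as np

embed_dim, vocab_size = 300, 10000
E = np.random.randn(embed_dim, vocab_size)    # embedding matrix, learned by gradient descent

w = 1234                                      # index of some word in the vocabulary
o_w = np.zeros((vocab_size, 1)); o_w[w] = 1   # one-hot encoding o_w, shape (10000, 1)

e_w_matmul = E @ o_w                          # (300, 1): conceptually e_w = E * o_w
e_w_lookup = E[:, [w]]                        # same vector, just read out column w

assert np.allclose(e_w_matmul, e_w_lookup)
```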

## II - Learning Word Embeddings: Word2vec & GloVe

### Learning word embeddings

Some concrete algorithms to learn the embedding matrix `E`: start with a relatively complex one, then show simpler ones that still give good results.

e.g. Neural language model, i.e. *predict the next word in the sequence*.

With a fixed window size (e.g. 4), predict the next word: context = last 4 words, target = next word.

→ Use gradient descent to update the parameters (`E, W, b`).
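
A minimal sketch of such a fixed-window neural language model (layer sizes and names are assumptions; one hidden layer, then a softmax over the vocabulary):

```python
import numpy as np

vocab_size, embed_dim, window, hidden = 10000, 300, 4, 128
rng = np.random.default_rng(0)

E  = rng.normal(scale=0.01, size=(embed_dim, vocab_size))        # embedding matrix
W1 = rng.normal(scale=0.01, size=(hidden, window * embed_dim))   # hidden layer weights
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.01, size=(vocab_size, hidden))           # softmax layer weights
b2 = np.zeros(vocab_size)

def predict_next_word(context_ids):
    """context_ids: indices of the last `window` words (the context)."""
    x = np.concatenate([E[:, i] for i in context_ids])           # stack the 4 embeddings
    h = np.tanh(W1 @ x + b1)
    logits = W2 @ h + b2
    p = np.exp(logits - logits.max())
    return p / p.sum()                                           # softmax over all 10000 words

p = predict_next_word([12, 4051, 7, 993])
print(p.shape, p.sum())  # (10000,) 1.0
```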

Other kinds of (simpler) context/targets:

- context=4 words to the left & right, target=the word in the middle
- context=previous word, target=next word
- context=nearby 1 word (skip-gram model)

### Word2Vec

**Skip-grams**

supervised problem: context/target pairs, given context word, predict target word

(|v| classes in total) → softmax

- context `c`: *randomly* pick a context word
- target `t`: *randomly* pick a target word within a ±4-word window around `c`

**Skip-gram Model**

model: context `c` → one-hot `o_c` → (`E`) → embedding `e_c` → softmax → prediction `yhat` vs. target word `y`

loss = cross-entropy(y, yhat), i.e. log-loss(y, yhat)
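
Written out as a sketch, with `Theta` denoting the (assumed) matrix stacking the output-layer weight vectors `theta_t`:

```python
import numpy as np

def p_target_given_context(e_c, Theta):
    """p(t|c) = exp(theta_t . e_c) / sum_t' exp(theta_t' . e_c), for every t at once."""
    logits = Theta @ e_c             # (vocab_size,)
    logits -= logits.max()           # numerical stability
    p = np.exp(logits)
    return p / p.sum()               # the denominator sums over all vocab_size words

def skipgram_loss(e_c, Theta, t):
    """Cross-entropy / log-loss for the observed target word t."""
    return -np.log(p_target_given_context(e_c, Theta)[t])
```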

**how to sample context c**

uniform sampling for `c`: frequent words would dominate the training set.

→ heuristics to sample less common words more often, via a sampling distribution `P(c)`.

**problem with skip-gram model**

*slow speed*: computing `p(t|c)` with softmax involves *summing over all 10000 (vocab_size) logits* in the softmax denominator:

- method 1: hierarchical softmax — split the vocabulary into buckets / a binary tree: `O(log|v|)` instead of `O(|v|)`
- method 2: *negative sampling* — modify the training objective

### Negative Sampling

To simplify the computation of softmax denominator: a different learning problem.

**new learning problem**

given a pair of words → predict *whether the pair is a context/target pair*

— binary classification instead of |v|-way classification.

- positive examples: sampled as before (sample a context word `c`, then sample a target word `t` within the word window)
- negative examples: sample a context word `c`, then *pick a random word* `t` *from the dictionary*. Sample *k = 5~20* negative examples for each context word `c`.

Model: *logistic regression*

For context word c, there are *|v|=10000 potential binary classification problems* to train

→ *and we train only (k+1) of them*.

**negative sampling**: turn the 10000-way softmax problem into 10000 binary classification problems, and at each iteration train only k+1 of them.
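
A minimal sketch of this objective: one logistic classifier per sampled word, evaluated on the 1 positive and k negative pairs (names are assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(e_c, Theta, t_pos, t_negs):
    """Binary log-loss on k+1 pairs: label 1 for the true target, 0 for the k sampled words."""
    loss = -np.log(sigmoid(Theta[t_pos] @ e_c))         # the one positive pair (c, t)
    for t in t_negs:                                    # k randomly sampled "negative" words
        loss += -np.log(1.0 - sigmoid(Theta[t] @ e_c))
    return loss
```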

**How to sample negative examples**

- sample according to empirical word frequency `P(w) ∝ f(w)` → stop words would be sampled very often
- sample uniformly: `P(w) = 1/|v|`
- empirical best choice (see the sketch below): `P(w) = f(w)^0.75 / sum_w'(f(w')^0.75)`
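
A quick sketch of this sampling heuristic (word counts are assumed inputs):

```python
import numpy as np

def negative_sampling_distribution(word_counts, power=0.75):
    """P(w) = f(w)^0.75 / sum_w' f(w')^0.75 — between pure frequency and uniform sampling."""
    freqs = np.asarray(word_counts, dtype=float) ** power
    return freqs / freqs.sum()

counts = [1_000_000, 50_000, 300]   # e.g. counts of "the", "orange", "durian"
print(negative_sampling_distribution(counts))
```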

### GloVe word vectors

GloVe (Global Vectors for word representation): even simpler than negative sampling.

Previously: sampling skip-grams, i.e. pairs of words (`c`, `t`).

GloVe: count co-occurrences, `X_ij = #(times i appears in the context of j)`.

**GloVe Model**

Trying to make `theta_i' * e_j` approximate `log(X_ij)`: minimize `sum_ij f(X_ij) * (theta_i' * e_j + b_i + b_j' - log(X_ij))^2`, solving for `theta_i` and `e_j` (see the sketch below).

- add a weighting term: `f(X_ij) = 0` if `X_ij = 0` (convention 0*log0 = 0), and `f` gives more weight to less frequent words
- `theta_i` and `e_i` are *symmetric* in the optimization ⇒ final embedding `e_i = (e_i + theta_i)/2`
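
A minimal gradient-descent sketch of this objective on toy random counts (the `x_max = 100`, `alpha = 0.75` weighting constants are from the GloVe paper; everything else is an assumption):

```python
import numpy as np

vocab_size, embed_dim, lr = 100, 50, 0.05
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(vocab_size, vocab_size)).astype(float)  # toy co-occurrence counts X_ij

theta = rng.normal(scale=0.1, size=(vocab_size, embed_dim))
e     = rng.normal(scale=0.1, size=(vocab_size, embed_dim))
b_i   = np.zeros(vocab_size)
b_j   = np.zeros(vocab_size)

def f(x, x_max=100.0, alpha=0.75):
    # weighting term: 0 when X_ij = 0, and down-weights very frequent pairs
    return np.where(x > 0, np.minimum(x / x_max, 1.0) ** alpha, 0.0)

w = f(X)
logX = np.log(np.maximum(X, 1e-12))              # log X_ij (entries with w = 0 are masked out)
for step in range(200):
    diff = theta @ e.T + b_i[:, None] + b_j[None, :] - logX
    grad = 2.0 * w * diff                        # d(loss)/d(theta_i . e_j)
    g_theta, g_e = grad @ e, grad.T @ theta
    theta -= lr * g_theta / vocab_size
    e     -= lr * g_e / vocab_size
    b_i   -= lr * grad.sum(axis=1) / vocab_size
    b_j   -= lr * grad.sum(axis=0) / vocab_size

embedding = (theta + e) / 2                      # theta and e are symmetric ⇒ average them
```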

**Featurization view of word embeddings**

The individual axes of learned embeddings are not guaranteed to align with interpretable features; each axis may be an arbitrary linear combination of such features.

## III - Applications using Word Embeddings

### Sentiment Classification

mapping a piece of text into a rating

**simple model**

sum/average all word embedding vectors into one feature vector → pass it to a softmax classifier.

problem: ignores word order (e.g. a negative review containing "good" many times can look positive).
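
A minimal sketch of this averaging model; the embedding matrix `E`, the classifier weights `W, b`, and the word indices are assumed inputs:

```python
import numpy as np

def sentence_features(word_ids, E):
    """Average the embedding vectors of all words in the text → one fixed-size feature vector."""
    return E[:, word_ids].mean(axis=1)           # (embed_dim,)

def predict_rating(word_ids, E, W, b):
    """Softmax classifier (e.g. over 5 star ratings) on top of the averaged embedding."""
    logits = W @ sentence_features(word_ids, E) + b
    p = np.exp(logits - logits.max())
    return p / p.sum()
```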

**RNN model**

take the embedding vectors of the words → feed the sequence into an RNN

→ use the last time-step's output and feed it to a softmax classifier (a many-to-one RNN).
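
A minimal many-to-one sketch in PyTorch (sizes and names are assumptions; the embedding layer could be initialized with pre-trained vectors):

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim, num_classes = 10000, 300, 128, 5

class RNNSentiment(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # rows can be set to pre-trained embeddings
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, word_ids):                 # word_ids: (batch, seq_len) integer tensor
        e = self.embed(word_ids)                 # (batch, seq_len, embed_dim)
        _, (h_last, _) = self.rnn(e)             # h_last: (1, batch, hidden_dim), last time step
        return self.out(h_last[-1])              # logits over the ratings

logits = RNNSentiment()(torch.randint(0, vocab_size, (2, 12)))
print(logits.shape)  # torch.Size([2, 5])
```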

### Debiasing word embeddings

Eliminate biases in word embeddings, e.g. gender bias, due to biases in training text.

- Identify the bias direction (e.g. gender): take differences of vectors, e.g. `e_he - e_she`, `e_male - e_female`, and average them → gives the bias direction; the remaining directions are non-bias directions
- Neutralize: project the embeddings of *non-definitional words* onto the non-bias directions (see the sketch below)
- Equalize pairs: for definitional words (e.g. grandmother/grandfather, boy/girl), make the pair equidistant from the non-bias axis

→ How to find definitional words: train a classifier.
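
A minimal numpy sketch of the neutralization step (the bias direction `g` and the word vector are assumed inputs):

```python
import numpy as np

def neutralize(e_w, g):
    """Remove the component of e_w along the bias direction g, keeping the non-bias part."""
    e_bias = (e_w @ g) / (g @ g) * g   # projection of e_w onto the bias direction
    return e_w - e_bias

rng = np.random.default_rng(0)
g = rng.normal(size=300)               # bias direction, e.g. average of e_he - e_she, e_male - e_female
e_doctor = rng.normal(size=300)        # embedding of a non-definitional word
print(np.dot(neutralize(e_doctor, g), g))   # ≈ 0: no bias-direction component left
```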

#### Part 15 of series «Andrew Ng Deep Learning MOOC»:

- [Neural Networks and Deep Learning] week1. Introduction to deep learning
- [Neural Networks and Deep Learning] week2. Neural Networks Basics
- [Neural Networks and Deep Learning] week3. Shallow Neural Network
- [Neural Networks and Deep Learning] week4. Deep Neural Network
- [Improving Deep Neural Networks] week1. Practical aspects of Deep Learning
- [Improving Deep Neural Networks] week2. Optimization algorithms
- [Improving Deep Neural Networks] week3. Hyperparameter tuning, Batch Normalization and Programming Frameworks
- [Structuring Machine Learning Projects] week1. ML Strategy (1)
- [Structuring Machine Learning Projects] week2. ML Strategy (2)
- [Convolutional Neural Networks] week1. Foundations of Convolutional Neural Networks
- [Convolutional Neural Networks] week2. Deep convolutional models: case studies
- [Convolutional Neural Networks] week3. Object detection
- [Convolutional Neural Networks] week4. Special applications: Face recognition & Neural style transfer
- [Sequential Models] week1. Recurrent Neural Networks
- [Sequential Models] week2. Natural Language Processing & Word Embeddings
- [Sequential Models] week3. Sequence models & Attention mechanism
