More on Word2Vec
Parameters θ: the matrices U and V (each word vector is a row).
The predictions do not take into account the distance between the center word c and the outside word o ⇒ for every center word, the model predicts high probability for frequent stopwords.
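Recall the naive softmax prediction that these parameters define (standard word2vec notation: u_o = outside vector of o, v_c = center vector of c):

```latex
% Skip-gram softmax: probability of outside word o given center word c
P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in \text{vocab}} \exp(u_w^{\top} v_c)}
```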
Optimization Basics
Goal: minimize the loss function J(θ).
gradient descent
Direction of the gradient = direction in which J(θ) increases the most ⇒ step in the opposite direction.
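The update rule, in standard notation (α is the learning rate):

```latex
% Gradient descent: take a small step against the gradient
\theta^{\text{new}} = \theta^{\text{old}} - \alpha \, \nabla_{\theta} J(\theta)
```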
SGD
Problem: J(θ) is a sum over all windows/positions in the corpus, so ∇J(θ) is very expensive to compute.
⇒ use random samples each time
⇒ each update only samples one window of 2m+1 words
I.e. each update computes the gradient only on a minibatch ⇒ only the word vectors that appear in the minibatch are updated (the gradient is very sparse).
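A minimal sketch of this sparse update, assuming numpy matrices U and V (one word vector per row) and a hypothetical grad_for_window() helper that returns gradients only for the word ids appearing in the sampled window:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, m = 10_000, 100, 2                 # m = window radius
U = rng.normal(scale=0.1, size=(vocab_size, dim))   # outside-word vectors
V = rng.normal(scale=0.1, size=(vocab_size, dim))   # center-word vectors
lr = 0.05                                           # learning rate

def sgd_step(window_ids, grad_for_window):
    """One SGD step on a single window of 2m+1 word ids.

    grad_for_window (hypothetical) returns {word_id: (dU_row, dV_row)} for those ids only.
    """
    grads = grad_for_window(window_ids, U, V)
    for wid, (dU_row, dV_row) in grads.items():
        U[wid] -= lr * dU_row   # only rows for words in the window are touched
        V[wid] -= lr * dV_row
```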
Word2Vec: Model Variants
Final embedding of a word = the average of its u and v vectors.
⇒ we could also use only one vector per word; it does not make much difference.
2 main variants of the word2vec family:
- skip-gram (SG): predict each outside word o from the center word c: P(o|c) ← presented in the class
- continuous bag-of-words (CBOW): predict the center word c from the outside words o
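For concreteness, a sketch of training both variants with the gensim library (assuming gensim ≥ 4.x; in older versions `vector_size` was called `size`):

```python
from gensim.models import Word2Vec

sentences = [["i", "like", "deep", "learning"], ["i", "like", "nlp"], ["i", "enjoy", "flying"]]

# sg=1 -> skip-gram, sg=0 -> CBOW; negative=5 uses negative sampling (next section)
sg_model = Word2Vec(sentences, vector_size=100, window=2, sg=1, negative=5, min_count=1)
cbow_model = Word2Vec(sentences, vector_size=100, window=2, sg=0, negative=5, min_count=1)

vec = sg_model.wv["nlp"]   # learned word vector for "nlp"
```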
negative sampling
So far: naive softmax, i.e. a sum over all words in the vocab ⇒ expensive to compute.
⇒ negative sampling: train several binary logistic regressions on:
- the true pair (center word and a true context word)
- several noise pairs (center word and a random word) ← random negative pairs
Logistic regression = softmax with vocab size = 2; the sigmoid function maps an inner-product value to a probability score.
⇒ objective function (to maximize), for one (center c, outside o) pair with k sampled negatives:
J(θ) = log σ(u_o·v_c) + Σ_{i=1..k} E_{j∼P(w)} [log σ(−u_j·v_c)]
where P(w) is the noise distribution used to draw negative samples.
In practice: P(w) ∝ U(w)^(3/4), i.e. the 3/4 power of the unigram distribution → makes rare words be sampled more often (relative to their raw frequency).
⇒ loss function (to minimize), for one pair:
J(o, v_c, U) = −log σ(u_o·v_c) − Σ_{k ∈ sampled indices} log σ(−u_k·v_c)
QUESTION: why put the minus inside the logit?
(Hint: σ(−x) = 1 − σ(x), so log σ(−u_k·v_c) is the log-probability that the noise pair is classified as not a true context pair.)
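A minimal numpy sketch of this loss for one (center, outside) pair with k sampled negatives, reusing the convention that rows of U are outside vectors and rows of V are center vectors:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(U, V, c, o, neg_ids):
    """Negative-sampling loss for center word c, true outside word o, and sampled negatives."""
    v_c = V[c]
    loss_true = -np.log(sigmoid(U[o] @ v_c))                 # push the true pair score up
    loss_neg = -np.sum(np.log(sigmoid(-U[neg_ids] @ v_c)))   # push noise pair scores down
    return loss_true + loss_neg

def sample_negatives(unigram_counts, k, rng=np.random.default_rng(0)):
    """Draw k negatives from the unigram distribution raised to the 3/4 power."""
    p = unigram_counts ** 0.75
    p /= p.sum()
    return rng.choice(len(unigram_counts), size=k, p=p)
```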
Alternative Methods: co-occurrence counts
Why not use a co-occurrence counts matrix? Two options (a window-based construction is sketched after this list):
- co-occurrence in windows
- co-occurrence in documents
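A minimal sketch of the window-based option on a toy corpus (symmetric window of radius m, raw counts):

```python
import numpy as np

corpus = [["i", "like", "deep", "learning"], ["i", "like", "nlp"], ["i", "enjoy", "flying"]]
m = 1  # window radius

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - m), min(len(sent), i + m + 1)):
            if j != i:
                X[idx[w], idx[sent[j]]] += 1   # count each (word, context) pair in the window
```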
Problems with the co-occurrence matrix:
- matrix size increases with vocab size
- high dimensional, expensive storage
- subsequent classification models have sparsity issues
⇒ models are less robust
Solution 1: low-dimensional vectors
(singular value decomposition, keeping the k largest singular values; sketched below)
⇒ popular in the 2000s.
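A minimal sketch of the SVD reduction, assuming a co-occurrence matrix X like the one built above:

```python
import numpy as np

def svd_embeddings(X, k):
    """Reduce a co-occurrence matrix X (V x V) to V x k word vectors via truncated SVD."""
    u, s, vt = np.linalg.svd(X, full_matrices=False)
    return u[:, :k] * s[:k]   # rows are k-dimensional word vectors

# usage: W = svd_embeddings(X, k=2)
```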
Solution 2: hacks to the co-occurrence matrix
- trim stopwords
- count closer words more (weight counts by distance)
- use Pearson correlations instead of raw counts
Comparison: count-based vs. direct-prediction methods
Encoding meaning
Insight: ratios of co-occurrence probabilities can encode meaning components.
⇒ make dot products equal to the log of the co-occurrence probability ⇒ vector differences then give ratios of co-occurrence probabilities.
GloVe
Combine the best of both worlds: count-based and prediction-based methods.
Log-bilinear model: w_i · w_j = log P(i|j), so vector differences satisfy w_x · (w_a − w_b) = log (P(x|a) / P(x|b)).
Objective: J = Σ_{i,j} f(X_ij) (w_i · w̃_j + b_i + b̃_j − log X_ij)², where the weighting term f(X_ij) is capped for very frequent co-occurrences (see the sketch after the list below).
- Fast training
- Scalable to huge corpora
- Good performance even with small corpus and small vectors
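A sketch of that capped weighting function, using the defaults reported in the GloVe paper (x_max = 100, α = 3/4):

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(X_ij): grows with the co-occurrence count, capped at 1 for frequent pairs."""
    return (x / x_max) ** alpha if x < x_max else 1.0
```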
Evaluating Word Vectors
Intrinsic vs. extrinsic evaluation in NLP.
Intrinsic:
- Eval on a specific/intermediate subtask: e.g. word similarity, POS tagging, etc.
- Fast to compute
- Not clear if really helpful unless correlation to real task is established
Extrinsic:
- Eval on a real task: e.g. web search, question answering, phone dialog systems
- Hard to run
- Hard to diagnose
⇒ today: focus on intrinsic word vector evaluation
Eval on word Analogies
- Semantic eval: city-in-state (e.g. "Chicago Illinois Houston Texas")
- Syntactic eval: gram4-superlative (e.g. "bad worst big biggest")
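The analogy a : b :: c : ? is typically answered by taking the word whose vector is most cosine-similar to x_b − x_a + x_c (excluding the query words). A minimal sketch, assuming a dict `word_vecs` from word to numpy vector:

```python
import numpy as np

def analogy(word_vecs, a, b, c):
    """Return the word d that best completes a : b :: c : d by cosine similarity."""
    target = word_vecs[b] - word_vecs[a] + word_vecs[c]
    target /= np.linalg.norm(target)
    best_word, best_sim = None, -np.inf
    for w, v in word_vecs.items():
        if w in (a, b, c):
            continue  # exclude the query words themselves
        sim = v @ target / np.linalg.norm(v)
        if sim > best_sim:
            best_word, best_sim = w, sim
    return best_word

# usage: analogy(word_vecs, "Chicago", "Illinois", "Houston")  # ideally -> "Texas"
```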
Hyperparameters: analogy accuracy depends on the vector dimensionality, window size, and amount of training data (more data helps; around 300 dimensions is a common sweet spot).
Eval on word Similarities
Compare the model's cosine similarities to human similarity judgments (e.g. WordSim-353) and report the correlation.
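A minimal sketch of this evaluation, assuming a hypothetical list `pairs` of (word1, word2, human_score) triples and the `word_vecs` dict from above:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def similarity_eval(word_vecs, pairs):
    """Spearman correlation between human similarity scores and cosine similarities."""
    human, model = [], []
    for w1, w2, score in pairs:
        if w1 in word_vecs and w2 in word_vecs:
            human.append(score)
            model.append(cosine(word_vecs[w1], word_vecs[w2]))
    rho, _ = spearmanr(human, model)
    return rho
```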
Word Senses
sense ambiguity: words have lots of meanings
- crude solution (2012): for each common word, cluster the contexts in which it occurs and split the word into pseudowords, one per cluster.
- Linear Algebraic Structure solution (2018):
a word vector is a linear superposition (weighted sum) of its sense vectors, e.g. v_pike = α₁·v_pike₁ + α₂·v_pike₂ + α₃·v_pike₃, where α_i is proportional to the frequency of sense i.
⇒ result: because word senses are sparse, ideas from sparse coding make it possible to actually separate out the senses (provided they are relatively common).
Extrinsic word vector evaluation: e.g. NER (finding person/organization/location)
Part 2 of the series «XCS224N: NLP with deep learning»:
- [XCS224N] Lecture 1 – Introduction and Word Vectors
- [XCS224N] Lecture 2 – Word Vectors and Word Senses
- [XCS224N] Lecture 3 – Neural Networks
- [XCS224N] Lecture 4 – Backpropagation
- [XCS224N] Lecture 5 – Dependency Parsing
- [XCS224N] Lecture 6 – Language Models and RNNs
- [XCS224N] Lecture 7 – Vanishing Gradients and Fancy RNNs
- [XCS224N] Lecture 8 – Translation, Seq2Seq, Attention