problems with text:
- often a very rare word is important, e.g. retinopathy
- ambiguity: different words for the same thing, e.g. cat and kitty
→ would need a lot of labeled data ⇒ not realistic.
⇒ unsupervised learning
similar words appear in similar contexts.
embedding: map words to small vectors
measure the closeness by cosine distance: d(u, v) = 1 - (u · v) / (|u| |v|)
initial: random vectors
→ train a model to predict nearby words (the words in the context window); see the sketch below.
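A minimal sketch of the idea, with a hypothetical toy vocabulary and randomly initialized vectors (in practice the vectors are learned by the predict-nearby-words training):

```python
import numpy as np

# Hypothetical toy vocabulary; real models use tens of thousands of words.
vocab = ["cat", "kitty", "dog", "retinopathy"]
dim = 8                                             # embedding size (small vectors)
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), dim))     # initial: random vectors

def cosine_distance(u, v):
    """1 - cosine similarity; 0 means the vectors point in the same direction."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def nearest(word):
    """Return the other words sorted by cosine closeness to `word`."""
    i = vocab.index(word)
    dists = [(cosine_distance(embeddings[i], embeddings[j]), vocab[j])
             for j in range(len(vocab)) if j != i]
    return sorted(dists)

print(nearest("cat"))   # after training, "kitty" should come out closest
```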
pb: too many words in the vocabulary → full softmax too slow
⇒ randomly sample the non-target words instead of normalizing over the whole vocabulary
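One common way to implement "randomly sample the non-target words" is negative sampling (a sampled-softmax variant); a rough sketch of the loss for one (center word, context word) pair, with made-up weight names:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10_000, 128
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))    # center-word vectors
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))   # context-word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(center_id, context_id, num_negatives=5):
    """Instead of a softmax over the whole vocabulary, score the true context
    word against a handful of randomly sampled 'negative' words."""
    v = W_in[center_id]
    pos = W_out[context_id]
    neg_ids = rng.integers(0, vocab_size, size=num_negatives)
    neg = W_out[neg_ids]
    loss = -np.log(sigmoid(pos @ v))                 # pull the true pair together
    loss += -np.sum(np.log(sigmoid(-(neg @ v))))     # push the sampled pairs apart
    return loss

print(negative_sampling_loss(center_id=42, context_id=7))
```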
dimensionality reduction (not PCA) that preserves the neighborhood structure (close vectors → close in 2D as well).
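This description matches t-SNE-style visualization (an assumption on my part, since the method is not named); a sketch using scikit-learn's TSNE, if that library is available:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 128))   # stand-in for learned word vectors

# Project to 2D while trying to keep close vectors close to each other.
coords_2d = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(embeddings)
print(coords_2d.shape)   # (1000, 2), ready to scatter-plot with word labels
```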
RNN: treat variable-length sequences of words.
use the current word (X_i) and the output/state from the previous step as input (sketch below).
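A minimal sketch of one step of a vanilla RNN (tanh cell; weight names are placeholders):

```python
import numpy as np

dim_x, dim_h = 64, 128
rng = np.random.default_rng(0)
W_x = rng.normal(scale=0.1, size=(dim_h, dim_x))   # current input → hidden
W_h = rng.normal(scale=0.1, size=(dim_h, dim_h))   # previous state → hidden
b = np.zeros(dim_h)

def rnn_step(x_t, h_prev):
    """Combine the current word vector with the state carried over from the last step."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

h = np.zeros(dim_h)                        # initial state
for x_t in rng.normal(size=(10, dim_x)):   # a sequence of 10 word vectors
    h = rnn_step(x_t, h)                   # the same weights W are reused at every step
print(h.shape)
```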
backprop for RNN: unroll the network over time (backpropagation through time).
→ the same weights W receive many highly correlated derivatives → not good for SGD.
pb with highly correlated updates: the gradient either explodes or vanishes quickly.
fix for exploding gradients: clip the gradient norm (sketch below).
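A sketch of clipping by global norm (the threshold value here is arbitrary):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale all gradients if their combined norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads

grads = [np.full((3, 3), 100.0), np.full(3, 100.0)]   # pretend exploded gradients
print([np.linalg.norm(g) for g in clip_by_global_norm(grads)])
```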
vanishing gradients: memory loss in the RNN (the distant past stops influencing the model).
fix: in the RNN, replace the simple NN cell by an LSTM cell.
represent the system with memory as a diagram with logical gates:
change the decision variables to continuous ones:
a logistic regression (sigmoid) in each gate controls when to remember and when to forget things; see the sketch below.
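A minimal sketch of one LSTM step; each gate is a logistic regression over the current input and the previous state (weight names are placeholders):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

dim_x, dim_h = 64, 128
rng = np.random.default_rng(0)
# One weight matrix and bias per gate, acting on the concatenation [x_t, h_prev].
W_f, W_i, W_o, W_c = (rng.normal(scale=0.1, size=(dim_h, dim_x + dim_h)) for _ in range(4))
b_f, b_i, b_o, b_c = (np.zeros(dim_h) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    f = sigmoid(W_f @ z + b_f)                    # forget gate: what to erase from memory
    i = sigmoid(W_i @ z + b_i)                    # input gate: what to write to memory
    o = sigmoid(W_o @ z + b_o)                    # output gate: what to read from memory
    c = f * c_prev + i * np.tanh(W_c @ z + b_c)   # new memory cell
    h = o * np.tanh(c)                            # new state / output
    return h, c

h, c = np.zeros(dim_h), np.zeros(dim_h)
h, c = lstm_step(rng.normal(size=dim_x), h, c)
print(h.shape, c.shape)
```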
regularization for LSTM:
- L2 regularization: OK
- dropout: OK when applied to the inputs/outputs (X and Y), but do NOT apply it to the recurrent connections (sketch after this list).
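A sketch of where the dropout masks go: on the input and output of the cell at each step, but not on the recurrent state carried between steps (the keep-probability value is arbitrary, and the cell is a stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, keep_prob=0.8):
    """Standard (inverted) dropout mask."""
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

def rnn_step(x_t, h_prev):
    return np.tanh(x_t + h_prev)        # stand-in for an RNN/LSTM cell

h = np.zeros(16)
outputs = []
for x_t in rng.normal(size=(10, 16)):
    x_t = dropout(x_t)                  # OK: dropout on the input X
    h = rnn_step(x_t, h)                # NOT dropped: the recurrent state is reused as-is
    outputs.append(dropout(h))          # OK: dropout on the output fed to the next layer
print(len(outputs))
```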
beam search is for generating sequences with an RNN.
Greedy approach: at each step, sample from the predicted distribution of the RNN.
better: predict several steps ahead and pick the sequence with the largest probability.
pb with this: the number of possible sequences grows exponentially.
⇒ only keep the few most promising sequences at each step → "beam search" (sketch below).
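A small sketch of beam search; the next-token distribution here is a random stand-in for a trained RNN, and the function names are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 20

def next_token_probs(seq):
    """Stand-in for the RNN: return a distribution over the next token.
    A real model would condition on the whole sequence generated so far."""
    rng2 = np.random.default_rng(hash(tuple(seq)) % (2 ** 32))
    p = rng2.random(vocab_size)
    return p / p.sum()

def beam_search(start_token, steps=5, beam_width=3):
    # Each beam is (log-probability, sequence); only the best few are kept.
    beams = [(0.0, [start_token])]
    for _ in range(steps):
        candidates = []
        for logp, seq in beams:
            probs = next_token_probs(seq)
            for tok in np.argsort(probs)[-beam_width:]:   # expand the likeliest tokens
                candidates.append((logp + np.log(probs[tok]), seq + [int(tok)]))
        # Prune: keep only the beam_width most promising sequences.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams[0][1]   # the single best sequence found

print(beam_search(start_token=0))
```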
seq to seq
RNN (encoder): maps a variable-length sequence to a fixed-length vector.
beam search over an RNN (decoder): sequence generation (maps a fixed-length vector back to a sequence).
concatenate them together: a seq-to-seq system (sketch below).
applications: translation, speech recognition, image captioning.
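A very rough sketch of the two pieces chained together; the encoder/decoder cells are stand-ins for trained models, and the decoder here is greedy (beam search would replace it in practice):

```python
import numpy as np

rng = np.random.default_rng(0)
dim_h = 32

def encoder_step(x_t, h):
    return np.tanh(x_t + h)                 # stand-in for a trained RNN/LSTM cell

def decoder_step(y_prev, h):
    h = np.tanh(y_prev + h)                 # stand-in for a trained RNN/LSTM cell
    return h, h                             # (new state, output used to pick a word)

def seq2seq(input_seq, out_len=5):
    # Encoder: variable-length sequence → fixed-length vector.
    h = np.zeros(dim_h)
    for x_t in input_seq:
        h = encoder_step(x_t, h)
    # Decoder: fixed-length vector → generated sequence.
    y, outputs = np.zeros(dim_h), []
    for _ in range(out_len):
        h, y = decoder_step(y, h)
        outputs.append(y)
    return outputs

out = seq2seq(rng.normal(size=(7, dim_h)))  # e.g. 7 input "word vectors"
print(len(out))
```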