(DeepLearning MOOC) Lesson 4: Deep Models for Text and Sequences

Tue, 07 Jun 2016 deep learning Series Part 4 of «Deep Learning udacity MOOC»

word2vec
tSNE
RNN
backprop for RNN
LSTM
beam search
seq to seq

problems with text:

a very rare word is often the important one, e.g. retinopathy
ambiguity: different words can mean the same thing, e.g. cat and kitty

→ need a lot of labeled data ⇒ not realistic.
⇒ unsupervised learning

similar words appear in similar context.
embedding: map words to small vectors

measure the closeness by cosine distance:
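
A minimal sketch of the cosine measure, in plain Python. The 3-d vectors are made up for illustration (real embeddings are learned and have many more dimensions), and the function names are my own.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Distance version: 0.0 for vectors pointing the same way."""
    return 1.0 - cosine_similarity(a, b)

# Hypothetical embeddings, invented for this example.
cat = [0.9, 0.1, 0.3]
kitty = [0.8, 0.2, 0.35]
car = [-0.2, 0.9, -0.5]

print(cosine_distance(cat, kitty))  # small: similar direction
print(cosine_distance(cat, car))    # larger: different direction
```

Only the direction of the vectors matters here, not their length.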

word2vec

initial: random vector
→ train model to predict nearby word.

problem: too many words in the dictionary → the softmax over all of them is too slow
⇒ randomly sample the non-target words (sampled softmax)
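
A rough sketch of the sampling idea, assuming we already have one raw score per dictionary word. The loss is computed over the target plus a few random non-target words instead of the whole dictionary. The function name and arguments are illustrative, and the sketch leaves out the sampling-bias corrections that library implementations typically apply.

```python
import math
import random

def sampled_softmax_loss(scores, target, num_sampled, rng):
    """Cross-entropy over the target plus a few sampled non-target words.

    scores: list of raw scores (logits), one per dictionary word.
    Returns (loss, sampled_non_target_indices).
    """
    vocab_size = len(scores)
    if num_sampled > vocab_size - 1:
        raise ValueError("not enough non-target words to sample")
    # Draw distinct non-target word indices uniformly at random.
    negatives = set()
    while len(negatives) < num_sampled:
        i = rng.randrange(vocab_size)
        if i != target:
            negatives.add(i)
    candidates = [target] + sorted(negatives)
    # Softmax restricted to the candidates (shifted by the max for stability).
    max_score = max(scores[i] for i in candidates)
    exps = [math.exp(scores[i] - max_score) for i in candidates]
    prob_target = exps[0] / sum(exps)
    return -math.log(prob_target), sorted(negatives)

rng = random.Random(0)
scores = [0.0] * 10000      # hypothetical dictionary of 10000 words
scores[42] = 3.0            # the target word gets a higher score
loss, negatives = sampled_softmax_loss(scores, target=42, num_sampled=5, rng=rng)
print(loss, negatives)
```

The cost per example depends on num_sampled, not on the dictionary size.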

tSNE

dimension reduction (not PCA) that preserves the neighborhood structure (close vector → close in 2d as well).

RNN

handle variable-length sequences of words.
use the current word (Xi) and the state from the previous step (a summary of the past) as input.
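
A minimal sketch of the recurrence, assuming a scalar input and a single scalar value h carried from one step to the next. The names w_x, w_h, b and the tanh nonlinearity are illustrative choices, not taken from the lecture.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One step: combine the current input with what was carried over."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(xs, w_x=0.5, w_h=0.8, b=0.0):
    """Apply the same cell, with the same weights, at every position."""
    h = 0.0  # nothing carried over before the first input
    states = []
    for x in xs:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

# The same weights handle sequences of any length.
print(run_rnn([1.0, 2.0, 3.0]))
print(run_rnn([1.0] * 7))
```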

backprop for RNN

backprop through time applies the derivatives from every time step to the same W → many highly correlated updates, which is not good for SGD.

problem with highly correlated updates: the gradient either explodes or vanishes quickly.
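
A toy illustration of why this happens, assuming the simplest possible recurrence h_t = w * h_{t-1} (no nonlinearity, scalar weight). The gradient of the last state with respect to the first is then a product of one factor w per time step.

```python
def gradient_through_time(w, steps):
    """d h_T / d h_0 for h_t = w * h_{t-1}: a product of `steps` factors w."""
    grad = 1.0
    for _ in range(steps):
        grad *= w
    return grad

print(gradient_through_time(0.5, 50))  # shrinks towards 0 (vanishing)
print(gradient_through_time(1.5, 50))  # blows up (exploding)
```

With a factor below 1 the gradient shrinks geometrically, with a factor above 1 it grows geometrically. Real RNNs multiply matrices rather than scalars, but the same effect shows up.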

fix for exploding gradients: clip the gradient
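
One common way to clip, sketched here as an assumption about what "clip" means in practice: rescale the gradient when its norm exceeds a threshold, keeping its direction. max_norm is a hyperparameter chosen by hand.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)  # small enough: leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]

print(clip_by_norm([3.0, 4.0], 1.0))  # norm 5 is scaled down to norm 1
print(clip_by_norm([0.3, 0.4], 1.0))  # norm 0.5 is left as it is
```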

vanishing gradients: the RNN loses memory of the distant past
⇒ LSTM

LSTM

in RNN: replace the NN by an LSTM cell

represent the system with memory by a diagram with logical gates:

change the decision variables from binary to continuous (values between 0 and 1):

a logistic regression in each gate: controls when to remember and when to forget things.
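
A minimal sketch of one LSTM step with scalar input, output and memory, following the standard formulation (input, forget and output gates plus a tanh candidate). The parameter layout, a dict of (w_x, w_h, b) per gate, is a hypothetical choice for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step. params maps a gate name to (w_x, w_h, b)."""
    def unit(name, squash):
        w_x, w_h, b = params[name]
        return squash(w_x * x + w_h * h_prev + b)

    i = unit("input", sigmoid)         # how much new content to write
    f = unit("forget", sigmoid)        # how much old memory to keep
    o = unit("output", sigmoid)        # how much memory to read out
    g = unit("candidate", math.tanh)   # the new content itself
    c = f * c_prev + i * g             # updated memory
    h = o * math.tanh(c)               # output of the cell
    return h, c

# Hypothetical weights: gates biased to "remember, do not write".
remember = {
    "input": (0.0, 0.0, -20.0),
    "forget": (0.0, 0.0, 20.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 1.0, 0.0),
}
print(lstm_step(1.0, 0.0, 0.7, remember))  # memory stays close to 0.7
```

Each gate is a small logistic regression on the current input and the previous output, so the whole cell stays differentiable and can be trained with backprop.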

http://blog.csdn.net/dark_scope/article/details/47056361

regularization for LSTM:

L2 regularization: OK
dropout: OK on the inputs/outputs (X and Y), but do NOT apply it to the recurrent connections.

beam search

beam search is for generating sequences with an RNN.

Greedy approach: at each step, sample from the predicted distribution of the RNN.

smarter approach:
predict several steps ahead and pick the sequence with the largest total probability.

problem with this: the number of possible sequences grows exponentially
⇒ just keep the few most promising sequences → "Beam search"
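
A small sketch of the idea. The "model" here is a hypothetical lookup table TOY standing in for the RNN's predicted distribution, and next_probs, beam_width and steps are illustrative names.

```python
import math

def beam_search(next_probs, start, steps, beam_width):
    """Keep only the beam_width most probable partial sequences per step.

    next_probs(seq) returns a dict: next token -> probability.
    Returns a list of (sequence, log_probability), best first.
    """
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, logp in beams:
            for token, p in next_probs(seq).items():
                if p > 0.0:
                    candidates.append((seq + [token], logp + math.log(p)))
        # Prune: keep the few most promising sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

# Hypothetical toy distribution: "a" looks best at first,
# but "b" leads to the more probable 2-step sequence.
TOY = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"x": 0.9, "y": 0.1},
}

def toy_next_probs(seq):
    return TOY[seq[-1]]

narrow = beam_search(toy_next_probs, "<s>", steps=2, beam_width=1)
wide = beam_search(toy_next_probs, "<s>", steps=2, beam_width=2)
print(narrow[0][0], math.exp(narrow[0][1]))
print(wide[0][0], math.exp(wide[0][1]))
```

With beam_width=1 this reduces to always taking the single most probable next token. A wider beam can recover sequences that start with a less likely token but end up more probable overall.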