Background (Pre-Neural Machine Translation)

  • machine translation (MT): translate a sentence from a source language to a target language.
  • 1950s: rule based, using bilingual dictionary.

1990s-2010s: Statistical MT (SMT)

using Bayes rule: P(y|x) = P(x|y)*P(y) / P(x)

⇒ The language model we already learnt in prev lectures ⇒ To get the ...
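A minimal sketch of how the Bayes-rule decomposition is used to pick a translation. P(x) is the same for every candidate y, so it drops out of the argmax. The candidate strings, the probability values and the names `translation_model`, `language_model` and `best_translation` are all made up for illustration.

```python
# Toy noisy-channel scoring. All probabilities are made-up illustrative values.
translation_model = {"the house": 0.6, "house the": 0.6, "a home": 0.3}    # P(x | y)
language_model = {"the house": 0.05, "house the": 0.0001, "a home": 0.04}  # P(y)

def best_translation(candidates):
    # P(x) is constant across candidates, so argmax P(y|x) = argmax P(x|y) * P(y).
    return max(candidates, key=lambda y: translation_model[y] * language_model[y])

best = best_translation(list(translation_model))
```

Here the language model breaks the tie between "the house" and "house the", which the toy translation model scores equally.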

Vanishing Gradient Intuition and Proof

ex: grad of loss at position 4 w.r.t. hidden state at position 1

by the chain rule, the grad gets smaller the further back it propagates

If the largest eigenvalue of W_h is less than 1, the gradient ∂J_i/∂h_j shrinks exponentially with the distance between i and j.
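A small numeric sketch of the shrinking effect, assuming a linear RNN so that each backprop step is just a multiplication by W_h (the nonlinearity is ignored). The matrix values and the starting gradient are hypothetical.

```python
import math

def matvec(m, v):
    # Plain matrix-vector product on nested lists.
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

# Hypothetical recurrent weights; symmetric, with eigenvalues 0.6 and 0.4 (both < 1).
W_h = [[0.5, 0.1],
       [0.1, 0.5]]

grad = [1.0, 1.0]  # stand-in for the gradient arriving at the last time step
norms = []
for _ in range(20):
    # One step back in time multiplies by W_h^T (equal to W_h here, since it is symmetric).
    grad = matvec(W_h, grad)
    norms.append(norm(grad))
```

The entries of `norms` decrease geometrically, so after 20 steps almost no gradient signal is left.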

Why Vanishing Gradient ...

Language Modeling

Language Modeling: task of predicting what words come next.

  • i.e. compute the conditional probability distribution of the next word given the preceding words

  • a language model can also be viewed as a system to give probability to a piece of text.

n-gram Language Models

n-gram Language Model: pre-deep learning solution for language modeling.

idea: Collect ...
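A sketch of a count-based bigram model (n = 2), assuming plain maximum-likelihood estimates with no smoothing. The toy corpus and the function name `bigram_prob` are illustrative only.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()  # made-up toy corpus

# Count each word as a context (the last word has no successor, so it is skipped).
context_counts = Counter(corpus[:-1])
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    # Maximum-likelihood estimate: count(prev, word) / count(prev).
    # Raises ZeroDivisionError for an unseen context; real models need smoothing or backoff.
    return bigram_counts[(prev, word)] / context_counts[prev]

p_cat = bigram_prob("the", "cat")
p_mat = bigram_prob("the", "mat")
```

An unseen bigram with a seen context gets probability 0 here, which is one of the sparsity problems of this approach.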

Phrase structure: organize words into nested constituents.

Context-Free Grammars

context-free grammars (CFGs)

  • start with words, words are given a category (part of speech = POS):

  • words combine into phrases with categories like NP (noun phrase) and PP (prep. phrase):

  • Phrases can combine into bigger phrases recursively:

⇒ forms a tree structure:
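One way to picture the resulting tree, as a sketch: nested tuples of the form (category, children...), with POS-tagged words at the leaves. The phrase and the category labels are a hypothetical example, not taken from the lecture.

```python
# Hypothetical phrase-structure tree for "the cat by the door".
tree = ("NP",
        ("NP", ("Det", "the"), ("N", "cat")),
        ("PP", ("P", "by"),
               ("NP", ("Det", "the"), ("N", "door"))))

def leaves(node):
    # A node is (label, child, ...); a POS node is (label, word).
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]
    return [word for child in children for word in leaves(child)]

words = leaves(tree)
```

The NP inside the PP inside the outer NP is the recursion mentioned above.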

Dependency ...

This week: neural net fundamentals

Classification Setup and Notation

training data:

softmax classifier

(linear classifier — hyperplane):

ith row of the param W: weight vector for class i to compute logits:

prediction = softmax of f_y:


goal: for (x, y), maximize p(y|x) ⇒ loss for (x, y) = -log p(y ...
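A minimal sketch of the softmax classifier and its negative log-likelihood loss in plain Python. The weight matrix `W`, the input `x` and the helper names are made-up illustrations; each row of `W` is the weight vector of one class.

```python
import math

def logits(W, x):
    # One logit per class: dot product of that class's row of W with x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def nll_loss(W, x, y):
    # Loss for one example (x, y): negative log of the probability of the true class.
    return -math.log(softmax(logits(W, x))[y])

W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # 3 classes, 2 features (hypothetical)
x = [2.0, 1.0]
probs = softmax(logits(W, x))
loss = nll_loss(W, x, 0)
```

Maximizing p(y|x) and minimizing this loss are the same thing, since -log is decreasing.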

More Matrix Gradients

Deriving Gradients wrt Words

pitfall in retraining word vectors: if some word is not in the training data but its synonyms are ⇒ only the synonyms' word vectors are moved




  • apply (generalized) chain rule
  • re-use shared stuff

computation graph

⇒ Go backwards along edges, pass along ...
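A hand-written sketch of a backward pass on a tiny computation graph, f = (x + y) * z. The graph and the function name `forward_backward` are chosen for illustration; each local gradient is multiplied by the gradient arriving from downstream, and the shared term df/dq is computed once and re-used.

```python
def forward_backward(x, y, z):
    # Forward pass: compute and keep intermediate values.
    q = x + y
    f = q * z

    # Backward pass: go backwards along edges, applying the chain rule.
    df_df = 1.0
    df_dq = df_df * z      # local gradient of q * z w.r.t. q is z
    df_dz = df_df * q      # local gradient of q * z w.r.t. z is q
    df_dx = df_dq * 1.0    # re-uses df_dq; local gradient of x + y w.r.t. x is 1
    df_dy = df_dq * 1.0    # re-uses df_dq again
    return f, {"x": df_dx, "y": df_dy, "z": df_dz}

value, grads = forward_backward(1.0, 2.0, -4.0)
```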

More on Word2Vec

parameters θ : matrix U and V (each word vec is a row):

The predictions don't take into account the distance between center word c and outside word o ⇒ all word vecs give high probability to the stopwords.
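A sketch of the prediction, assuming the usual naive-softmax form p(o|c) ∝ exp(u_o · v_c). The 2-dim vectors are made up; `U` holds outside-word vectors and `V` holds center-word vectors, stored as dicts here rather than matrix rows for readability.

```python
import math

# Hypothetical 2-dim vectors for a 3-word vocabulary.
U = {"the": [0.9, 0.8], "cat": [1.0, -0.5], "sat": [-0.3, 0.7]}  # outside vectors
V = {"the": [0.1, 0.3], "cat": [0.5, 0.2], "sat": [0.4, -0.6]}   # center vectors

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def p_outside_given_center(o, c):
    # Softmax over the whole vocabulary of the scores u_w . v_c.
    scores = {w: math.exp(dot(U[w], V[c])) for w in U}
    return scores[o] / sum(scores.values())

p_the = p_outside_given_center("the", "cat")
```

The function takes no position argument, so the same distribution is predicted for every slot in the context window.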

Optimization Basics

goal: minimize the loss function J(θ)

gradient descent ...
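A minimal sketch of the basic update θ ← θ − α ∇J(θ), on a toy one-parameter loss J(θ) = (θ − 3)². The loss, the learning rate `lr` and the step count are assumptions chosen so that the result is easy to check by hand.

```python
def grad_J(theta):
    # Gradient of the toy loss J(theta) = (theta - 3)^2.
    return 2.0 * (theta - 3.0)

def gradient_descent(theta, lr=0.1, steps=100):
    for _ in range(steps):
        # Take a small step in the direction of the negative gradient.
        theta = theta - lr * grad_J(theta)
    return theta

theta_final = gradient_descent(theta=0.0)
```

With lr = 0.1 the distance to the minimum at θ = 3 shrinks by a factor of 0.8 per step.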

Course intro

Word Meaning and Representation

denotational semantics

wordnet (nltk): word meanings, synonym, relationships, hierarchical

pb: missing nuance, missing new meanings, requires human labor, can't compute word similarity

Traditional NLP (until 2012):

  • words are discrete symbols ("localist representation")
  • use one-hot vectors for encoding

  • pbs with one-hot vectors:
  • large ...
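A short sketch of one-hot encoding over a made-up 3-word vocabulary. It shows two properties that hold for any one-hot scheme: the vector length equals the vocabulary size, and any two different words have dot product 0, so the encoding carries no notion of similarity.

```python
vocab = ["hotel", "motel", "cat"]  # hypothetical toy vocabulary

def one_hot(word):
    # Vector of length |vocab| with a single 1 at the word's index.
    return [1 if w == word else 0 for w in vocab]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

sim_related = dot(one_hot("hotel"), one_hot("motel"))
sim_unrelated = dot(one_hot("hotel"), one_hot("cat"))
```

Both values are 0, even though "hotel" and "motel" are much closer in meaning than "hotel" and "cat".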

This week: seq2seq.

I-Various sequence to sequence architectures

Basic Models

e.g. Machine translation
encoder network: many-to-one RNN
decoder network: one-to-many RNN

This architecture also works for image captioning: use ConvNet as encoder

Difference between seq2seq and generating new text with a language model: seq2seq doesn't randomly choose a translation ...

I - Introduction to Word Embeddings

Word representation
So far: representing words with one-hot encoding → word relationships are not generalized.
⇒ want to learn a featurized representation for each word as a high-dim vector

→ visualize word embeddings in 2-dim space, e.g. via t-SNE

Using word embeddings

example: NER
transfer learning: using ...