Background (Pre-Neural Machine Translation)
- machine translation (MT): translate a sentence from a source language into a target language.
- 1950s: rule-based systems, using bilingual dictionaries.
1990s-2010s: Statistical MT (SMT)
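using Bayes' rule, where x is the source sentence and y the target sentence:
P(y|x) = P(x|y)*P(y) / P(x)
⇒ P(y) is the language model we already learnt in previous lectures
⇒ P(x|y) is the translation model: learn it from a lot of parallel data, e.g. a large corpus of English documents and their French translations,
and break it down with a latent alignment a between source and target words, i.e. model P(x, a|y)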
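The decomposition above can be written out as a short derivation. This is a sketch: the hat notation and the sum over alignments a are notation assumed here, not taken from the notes. Since P(x) does not depend on y, it drops out of the argmax.

```latex
\begin{align*}
\hat{y} &= \arg\max_y P(y \mid x) \\
        &= \arg\max_y \frac{P(x \mid y)\, P(y)}{P(x)} \\
        &= \arg\max_y \underbrace{P(x \mid y)}_{\text{translation model}} \;
                      \underbrace{P(y)}_{\text{language model}}, \\
P(x \mid y) &= \sum_{a} P(x, a \mid y).
\end{align*}
```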
Examples of alignment: a word can have no counterpart, or the alignment can be one-to-many, many-to-one, or many-to-many.
"Decoding": use heuristics to search for the argmax sequence, because enumerating every candidate translation is too expensive
2014: Neural MT (NMT) does machine translation with a single neural network. Architecture: seq2seq, with 2 RNNs (an encoder and a decoder).
In the decoder RNN during training: instead of taking the argmax to generate text, take the negative log probability of the correct translated word at each step as the loss.
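To make the training objective concrete, here is a minimal sketch in plain Python. It assumes the decoder has already produced one probability distribution over a toy vocabulary per target position; the function name `seq2seq_loss` and all the numbers are made up for illustration.

```python
import math

def seq2seq_loss(step_probs, target_ids):
    """Average negative log-probability of the correct target words.

    step_probs: list of T distributions, one per decoder step.
    target_ids: list of T indices of the correct word at each step.
    """
    # Per-step loss is -log P(correct word | previous words, source).
    step_losses = [-math.log(probs[t]) for probs, t in zip(step_probs, target_ids)]
    # Average over the T decoder steps.
    return sum(step_losses) / len(step_losses)

# Hypothetical decoder outputs: 3 target words, vocabulary of 4 words.
probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.20, 0.50, 0.20, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]
loss = seq2seq_loss(probs, [0, 1, 2])
```

The loss shrinks as the decoder puts more probability on the correct word at each step, which is what backpropagation through the decoder and encoder optimizes.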
"Greedy decoding": Always take argmax at each step
⇒ the final greedy sentence might not be argmax over all sentences
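A small illustration of why the greedy result may not be the overall argmax. The "model" below is a hypothetical hand-written table of next-token probabilities, not a real decoder, and the greedy output is compared against the best sequence found by brute-force enumeration of every candidate up to a maximum length.

```python
import itertools

VOCAB = ["a", "b", "<END>"]

def next_token_probs(prefix):
    """Hypothetical toy model: P(next token | prefix), hand-picked numbers."""
    table = {
        (): {"a": 0.6, "b": 0.4, "<END>": 0.0},
        ("a",): {"a": 0.4, "b": 0.3, "<END>": 0.3},
    }
    # Any other prefix ends the sentence with probability 1.
    return table.get(tuple(prefix), {"a": 0.0, "b": 0.0, "<END>": 1.0})

def sequence_prob(tokens):
    """Probability of a whole sequence: product of per-step probabilities."""
    prob = 1.0
    for i, tok in enumerate(tokens):
        prob *= next_token_probs(tokens[:i])[tok]
    return prob

def greedy_decode(max_len=3):
    out = []
    while len(out) < max_len:
        probs = next_token_probs(out)
        tok = max(probs, key=probs.get)  # argmax at this step only
        out.append(tok)
        if tok == "<END>":
            break
    return out

def best_by_enumeration(max_len=3):
    best, best_prob, n_enumerated = None, -1.0, 0
    for length in range(1, max_len + 1):
        for cand in itertools.product(VOCAB, repeat=length):
            n_enumerated += 1  # grows as V^length
            # Only complete sentences count: <END> exactly once, at the end.
            if cand[-1] != "<END>" or "<END>" in cand[:-1]:
                continue
            p = sequence_prob(cand)
            if p > best_prob:
                best, best_prob = list(cand), p
    return best, best_prob, n_enumerated

greedy = greedy_decode()
best, best_prob, n_enumerated = best_by_enumeration()
```

In this toy setup greedy commits to "a" first because 0.6 > 0.4, but the sentence starting with "b" ends up more probable overall. The enumeration already touches 3 + 9 + 27 candidates for a 3-word vocabulary and length 3.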
"Exhaustive search decoding":
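⇒ scoring every possible output sequence has complexity O(V^T) (V = vocabulary size, T = sequence length), too expensive
"Beam search decoding" (not guaranteed to find the optimal solution, but much more efficient than exhaustive search): at each step, keep track of the k most probable hypotheses (partial translations). k = beam size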
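QUESTION: logP is negative, so does logP1*logP2 become positive? No: log-probabilities are added, not multiplied (log(P1*P2) = logP1 + logP2), so a hypothesis score is a sum of negative terms and stays negative.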
Beam Search Example
- In greedy decoding: we stop right after the argmax is <END>
- In beam search:
<END> can be produced at different times by different hypotheses
when <END> is produced, that hypothesis is complete; set it aside and continue exploring the other hypotheses.
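problem: longer hypotheses have lower scores, because every extra token adds another negative log-probability
⇒ normalize scores by sequence length: divide each hypothesis's summed log-probability by its number of tokens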
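The beam search procedure, including the length normalization, can be sketched as follows. The toy model is a hypothetical hand-written probability table, and `beam_search` is only one simple way to organize the bookkeeping (completed hypotheses are set aside, and the final choice uses the length-normalized score); real systems differ in their stopping criteria.

```python
import math

def next_token_logprobs(prefix):
    """Hypothetical toy model: log P(next token | prefix), hand-picked numbers."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"a": 0.4, "b": 0.3, "<END>": 0.3},
    }
    # Any other prefix ends the sentence with probability 1.
    probs = table.get(prefix, {"<END>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

def beam_search(k=2, max_len=4):
    beams = [((), 0.0)]  # (partial translation, summed log-probability)
    completed = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = []
        for prefix, score in beams:
            for tok, logp in next_token_logprobs(prefix).items():
                candidates.append((prefix + (tok,), score + logp))
        # Keep only the k most probable hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for hyp, score in candidates[:k]:
            if hyp[-1] == "<END>":
                completed.append((hyp, score))  # set aside, keep exploring others
            else:
                beams.append((hyp, score))
        if not beams:
            break
    # Pick the winner by length-normalized score: (1/t) * sum of log-probs.
    pool = completed or beams
    return max(pool, key=lambda c: c[1] / len(c[0]))

hyp_k1, score_k1 = beam_search(k=1)  # k=1 behaves like greedy decoding
hyp_k2, score_k2 = beam_search(k=2)
```

With k=1 the search follows the greedy path, while k=2 keeps the initially less likely "b" alive long enough to find the more probable complete sentence.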
NMT Advantages & Disadvantages
Advantages (w.r.t. SMT)
- An end-to-end system: no subsystems to be individually optimized
- Much less human engineering effort
Disadvantages (w.r.t. SMT)
- less interpretable, hard to debug
- difficult to control: can't specify rules or guidelines
Eval metrics for machine translation: BLEU (Bilingual Evaluation Understudy) computes a similarity score between machine translations and one or more human translations.
- based on n-gram precision (n<=4): how many 1/2/3/4-grams of the system translation also appear in the human translations
- brevity penalty: penalty for too-short system translations
- BLEU is useful but imperfect: a good translation can get a low score if it has little n-gram overlap with the human reference
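A simplified sketch of the two BLEU ingredients listed above, clipped n-gram precision and the brevity penalty. It assumes a single sentence and a single reference, with no smoothing; real BLEU is computed over a whole corpus and may use several references, so the function names here (`modified_precision`, `simple_bleu`) are illustrative only.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Clip each candidate n-gram count by its count in the reference,
    # so repeating a matching word cannot inflate the precision.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total

def simple_bleu(candidate, reference, max_n=4):
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    # Geometric mean of the 1- to 4-gram precisions.
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: 1 if the candidate is longer than the reference,
    # otherwise exp(1 - r/c), which penalizes too-short candidates.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * geo_mean

reference = "the cat sat on the mat".split()
too_short = "the cat sat on the".split()
repeated = "the the the the the the".split()
```

Here `too_short` matches every n-gram it contains, so only the brevity penalty lowers its score, while `repeated` shows why the counts are clipped.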
NMT outperformed traditional SMT systems in 2016.
Attention Motivation and Overview
The bottleneck problem with the vanilla seq2seq architecture:
the decoder depends too much on the single vector of the last encoder RNN hidden state, which has to encode the whole source sentence
⇒ only the last hidden state influences decoder behavior.
Core idea of attention: on each step of the decoder, use a direct connection to the encoder to focus on a particular part of the source sequence.
Compute an attention score as the dot product between the current-step decoder hidden state $h_d^{(k)}$ and the encoder hidden state $h_e^{(i)}$ at each step i.
Apply softmax to the attention scores to turn them into an attention distribution, which shows which encoder hidden states we should focus on.
Take the weighted average (according to the attention distribution) of the encoder hidden states as the attention output.
(this is the so-called "soft alignment", as it's a distribution instead of the one-hot alignment in SMT)
Use the attention output to influence the next word prediction in the decoder,
e.g. concatenate the attention output with the decoder's current hidden state, then compute the decoder's word distribution and output a word.
The decoder goes to the next position, and repeats.
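The steps above for a single decoder position can be sketched in plain Python. Hidden states are represented as lists of floats, the numbers are made up, and `attention_step` is a hypothetical helper name; the final projection to a word distribution is left out.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_step(decoder_state, encoder_states):
    # 1. Attention scores: dot product with every encoder hidden state.
    scores = [dot(decoder_state, h) for h in encoder_states]
    # 2. Attention distribution over source positions.
    weights = softmax(scores)
    # 3. Attention output: weighted average of the encoder hidden states.
    dim = len(encoder_states[0])
    output = [sum(w * h[j] for w, h in zip(weights, encoder_states))
              for j in range(dim)]
    # 4. Concatenate with the decoder state; this vector would then feed
    #    the layer that predicts the next word.
    combined = output + list(decoder_state)
    return weights, output, combined

# Toy example: 3 source positions, hidden size 2.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, output, combined = attention_step(decoder_state, encoder_states)
```

In this example the first and third encoder states get the same score (2.0) and therefore the same weight, while the second one (score 0.0) is mostly ignored.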
Benefits of attention:
- significantly improves NMT performance: allows the decoder to focus on certain parts of the source
- solves the bottleneck problem
- helps with the vanishing gradient problem: provides direct connections between decoder and encoder over many timesteps
- provides some interpretability
  - by inspecting the attention distribution, we can see what the decoder was focusing on
  - we get (soft) alignment for free!
  - the network just learned alignment by itself
Generalization and Variants
Attention is a general deep learning technique. You can use attention in many architectures (not just seq2seq) and many tasks (not just MT).
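More general definition of attention: given a set of vector values and a vector query, attention computes a weighted sum of the values, dependent on the query.
We say the "query attends to the values" (in seq2seq, the decoder hidden state is the query and the encoder hidden states are the values).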
- The weighted sum is a selective summary of the information contained in the values, where the query determines which values to focus on
- Attention is a way to obtain a fixed-size representation of an arbitrary set of representations (the values), dependent on some other representation (the query).
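Written as equations, the general recipe looks as follows. The symbols are assumed notation for this sketch: q is the query, h_1, ..., h_N are the values, and score stands for whichever scoring function is chosen.

```latex
\begin{align*}
e_i &= \mathrm{score}(q, h_i), \qquad i = 1, \dots, N
  && \text{(attention scores)} \\
\alpha &= \mathrm{softmax}(e) \in \mathbb{R}^{N}
  && \text{(attention distribution)} \\
a &= \sum_{i=1}^{N} \alpha_i \, h_i
  && \text{(attention output, a fixed-size summary)}
\end{align*}
```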
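Attention variants on how to compute the attention score $e_i$ from the query $q$ and value $h_i$:
- Basic dot-product attention: $e_i = q^\top h_i$ (query and values must have the same dimensionality)
- Multiplicative attention: $e_i = q^\top W h_i$
use a bilinear function of the query and value i
the weight matrix $W$ is a learnable parameter
- Additive attention: $e_i = v^\top \tanh(W_1 h_i + W_2 q)$
weight matrices $W_1$, $W_2$ and weight vector $v$ are learnable; the attention dimensionality $d_3$ is a hyperparameter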
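The three scoring functions can be compared side by side in a small plain-Python sketch. The weights below are hand-picked numbers standing in for learned parameters, and the helper names (`matvec`, `dot_product_score`, and so on) are illustrative, not taken from any library.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in M]

def dot_product_score(q, h):
    # e = q^T h; requires len(q) == len(h)
    return dot(q, h)

def multiplicative_score(q, h, W):
    # e = q^T W h; W has shape (len(q), len(h))
    return dot(q, matvec(W, h))

def additive_score(q, h, W1, W2, v):
    # e = v^T tanh(W1 h + W2 q)
    # W1: (d3, len(h)), W2: (d3, len(q)), v: length d3
    hidden = [math.tanh(a + b) for a, b in zip(matvec(W1, h), matvec(W2, q))]
    return dot(v, hidden)

q = [1.0, 2.0]   # query
h = [3.0, 4.0]   # one value

identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]

e_dot = dot_product_score(q, h)
e_mul_identity = multiplicative_score(q, h, identity)
e_mul_swap = multiplicative_score(q, h, swap)
# Attention dimensionality d3 = 1 in this toy example.
e_add = additive_score(q, h, W1=[[1.0, 0.0]], W2=[[0.0, 0.0]], v=[2.0])
```

With the identity matrix, multiplicative attention reduces to the dot product, which is one way to see it as a generalization; additive attention is the only variant here that lets query and values have different sizes without a bilinear matrix.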
Part 8 of series «XCS224N: NLP with deep learning»：
- [XCS224N] Lecture 8 – Translation, Seq2Seq, Attention