Overview

Background (Pre-Neural Machine Translation)
- machine translation (MT): translate a sentence x from a source language into a sentence y in a target language.
- 1950s: rule-based systems, built around bilingual dictionaries.
- 1990s-2010s: Statistical MT (SMT): learn a probabilistic model from data

Find the most probable target sentence y given source x, using Bayes' rule: P(y|x) = P(x|y)*P(y) / P(x), where P(x) can be dropped from the argmax over y.

⇒ P(y) is the language model we already learned in previous lectures
⇒ P(x|y) is the translation model: learn it from a lot of parallel data, e.g. a large corpus of English documents paired with their French translations,
and break it down further by introducing a latent alignment variable a:

Examples of alignment: a source word may have no counterpart in the target, or align one-to-many, many-to-one, or many-to-many.
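Written out in standard SMT notation (a sketch consistent with the Bayes-rule line above, not a verbatim slide):

```latex
% Pick the y maximizing P(y|x); P(x) does not depend on y, so it drops out of the argmax:
\hat{y} = \arg\max_y \, \underbrace{P(x \mid y)}_{\text{translation model}} \; \underbrace{P(y)}_{\text{language model}}
% The translation model is learned from parallel data by introducing a latent alignment a:
P(x \mid y) = \sum_a P(x, a \mid y)
```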




"Decoding": use heuristics to search for argmax sequence
Seq2Seq Overview
2014: Neural Machine Translation (NMT) does machine translation with a single end-to-end neural network. Architecture: sequence-to-sequence (seq2seq), with two RNNs: an encoder and a decoder.
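A minimal sketch of that two-RNN architecture in PyTorch (the class name, layer sizes, and use of GRUs are illustrative choices, not from the lecture):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder: two GRUs sharing nothing but the final hidden state."""
    def __init__(self, src_vocab, tgt_vocab, emb=256, hidden=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the source; keep only the final hidden state (the "bottleneck" vector).
        _, h_final = self.encoder(self.src_emb(src_ids))
        # Decode conditioned on that vector; tgt_ids are the gold prefix (teacher forcing).
        dec_states, _ = self.decoder(self.tgt_emb(tgt_ids), h_final)
        return self.out(dec_states)   # (batch, tgt_len, tgt_vocab) logits
```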



Training NMT
In the decoder RNN: instead of taking the argmax to generate text, feed in the gold target sentence (teacher forcing) and use as the loss the negative log probability of each correct target word, averaged over the sequence.
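A sketch of that loss with PyTorch's cross-entropy, which is exactly the negative log probability of the correct word (shapes and numbers are toy values; in practice the logits come from the decoder sketched above):

```python
import torch
import torch.nn.functional as F

# Toy shapes: batch of 2 target sentences, length 5, vocabulary of 1000.
logits = torch.randn(2, 5, 1000, requires_grad=True)   # decoder outputs at every step
gold   = torch.randint(0, 1000, (2, 5))                 # the correct word at every step

# Cross-entropy = negative log probability of the correct word, averaged over all steps.
loss = F.cross_entropy(logits.reshape(-1, 1000), gold.reshape(-1))
loss.backward()   # gradients flow end-to-end, through decoder and encoder alike
```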

Decoding Methods
"Greedy decoding": Always take argmax at each step

⇒ the sentence produced greedily is not necessarily the highest-probability sentence overall, and there is no way to undo an early bad choice
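A minimal greedy loop, assuming a hypothetical `step(prev_token, state)` function that returns next-word log probabilities and an updated decoder state:

```python
def greedy_decode(step, start_id, end_id, max_len=50):
    """Always pick the single most probable next word; no backtracking."""
    tokens, state = [start_id], None
    for _ in range(max_len):
        log_probs, state = step(tokens[-1], state)   # log P(next word | prefix, source)
        best = max(range(len(log_probs)), key=log_probs.__getitem__)
        if best == end_id:
            break
        tokens.append(best)
    return tokens[1:]   # drop <START>
```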
"Exhaustive search decoding":

⇒ complexity O(VT) , too expensive
"Beam search decoding" (not guaranteed to find optimal solution, but very efficient.) At each step, keep track the k-most probable hypotheses (partial translations). k = beam size

QUESTION: log P is negative, so does log P1 * log P2 become positive? (Resolution: the score is the sum of log probabilities, i.e. the log of the product of probabilities, not a product of logs; scores stay negative, and we keep the hypotheses with the highest, least negative, scores.)
Beam Search Example
(the lecture walks through a small search tree with beam size k = 2: at each step, expand each hypothesis with its k most probable next words, then keep only the k highest-scoring hypotheses overall)
Stopping criterion
- In greedy decoding: stop as soon as the argmax is <END>.
- In beam search: <END> can be produced at different timesteps. When a hypothesis produces <END>, that hypothesis is complete; set it aside and continue exploring the other hypotheses. Usually we stop once a predefined timestep cutoff T is reached or once at least n hypotheses are complete.

problem: longer hypotheses have lower scores, since every additional word adds another negative log-probability term
⇒ normalize scores by sequence length: score(y_1, …, y_t) = (1/t) · Σ_{i=1..t} log P(y_i | y_1, …, y_{i-1}, x)
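A compact beam-search sketch following the notes above: scores are sums of log probabilities, hypotheses that emit <END> are set aside as complete, and the final ranking uses length-normalized scores. The `step` interface is the same hypothetical one as in the greedy sketch:

```python
import heapq

def beam_search(step, start_id, end_id, k=5, max_len=50):
    # Each hypothesis: (sum of log probs, token list, decoder state)
    beams, finished = [(0.0, [start_id], None)], []
    for _ in range(max_len):
        candidates = []
        for score, tokens, state in beams:
            log_probs, new_state = step(tokens[-1], state)
            # Expand this hypothesis with its k most probable next words.
            for w in heapq.nlargest(k, range(len(log_probs)), key=log_probs.__getitem__):
                candidates.append((score + log_probs[w], tokens + [w], new_state))
        # Keep the k most probable hypotheses overall.
        beams = heapq.nlargest(k, candidates, key=lambda c: c[0])
        # Hypotheses that produced <END> are complete; keep exploring the rest.
        finished += [b for b in beams if b[1][-1] == end_id]
        beams = [b for b in beams if b[1][-1] != end_id]
        if not beams:
            break
    # Normalize by length so longer hypotheses are not unfairly penalized.
    best = max(finished or beams, key=lambda c: c[0] / (len(c[1]) - 1))
    return [t for t in best[1] if t not in (start_id, end_id)]
```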

NMT Advantages & Disadvantages
Advantages
- Better performance: translations tend to be more fluent and make better use of context
- An end-to-end system: the whole network is optimized jointly, with no subsystems to tune individually
- Much less human engineering effort
Disadvantages (w.r.t. SMT)
- less interpretable, hard to debug
- difficult to control: can't specify rules or guidelines
Evaluation
Eval metric for machine translation: BLEU (Bilingual Evaluation Understudy), which computes a similarity score between the machine translation and one or more human reference translations.
- based on n-gram precision (n ≤ 4): how many of the system output's 1-, 2-, 3- and 4-grams also appear in the human reference translations
- brevity penalty: penalty for too-short system translations
- BLEU is useful but imperfect
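A minimal single-reference BLEU sketch to make the two ingredients above concrete: clipped n-gram precisions for n = 1..4, combined by a geometric mean and multiplied by the brevity penalty (real evaluations use established tooling; this is only for illustration):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Single-sentence, single-reference BLEU."""
    if not candidate:
        return 0.0
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())   # clipped counts
        total = max(sum(cand.values()), 1)
        if overlap == 0:
            return 0.0
        log_prec_sum += math.log(overlap / total) / max_n        # geometric mean in log space
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(log_prec_sum)

print(bleu("the cat sat on the mat".split(), "the cat sat on a mat".split()))   # ≈ 0.54
```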
NMT outperformed traditional SMT systems by 2016.

Attention Motivation and Overview
The bottleneck problem with the vanilla seq2seq architecture:
the encoder must compress the entire source sentence into the single vector of its last hidden state
⇒ only that one vector influences the decoder's behavior.

Attention mechanism:
On each step of the decoder, use a direct connection to the encoder to focus on a particular part of the source sequence:
- Compute an attention score as the dot product between the current decoder hidden state s_t and each encoder hidden state h_i: e_i = s_t · h_i
- Apply softmax to the attention scores to turn them into an attention distribution, which shows which encoder hidden states to focus on
- Take the weighted average of the encoder hidden states (weighted by the attention distribution) as the attention output
(this is so-called "soft alignment", since it is a distribution rather than the one-hot alignments of SMT)
- Use the attention output to influence the next word prediction in the decoder, e.g. concatenate the attention output with the current decoder hidden state, then compute the decoder's word distribution and output a word
- Move the decoder to the next position and repeat


Attention Equations
With encoder hidden states h_1, …, h_N and decoder hidden state s_t at step t:
- attention scores: e_t = [s_t · h_1, …, s_t · h_N]
- attention distribution: α_t = softmax(e_t)
- attention output: a_t = Σ_i α_t,i · h_i
- finally concatenate [a_t ; s_t] and proceed as in the non-attention seq2seq decoder
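The same equations for a single decoder step, as a plain-NumPy sketch (function and variable names are illustrative):

```python
import numpy as np

def attention_step(dec_state, enc_states):
    """One decoder step of dot-product attention.
    dec_state:  (h,)    current decoder hidden state s_t
    enc_states: (N, h)  encoder hidden states h_1..h_N stacked row-wise
    """
    scores = enc_states @ dec_state                  # e_i = s_t . h_i, shape (N,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                             # attention distribution (softmax)
    attn_output = alpha @ enc_states                 # a_t = sum_i alpha_i * h_i, shape (h,)
    return np.concatenate([attn_output, dec_state]), alpha   # [a_t; s_t] feeds the output layer

# Tiny usage example with random states.
rng = np.random.default_rng(0)
combined, alpha = attention_step(rng.normal(size=4), rng.normal(size=(6, 4)))
print(alpha.round(3), combined.shape)   # distribution over 6 source positions, (8,) vector
```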
Attention Advantages
- significantly improves NMT performance: allows the decoder to focus on the relevant parts of the source
- solves the bottleneck problem
- helps with the vanishing gradient problem: provides direct connections between decoder and encoder across many timesteps
- provides some interpretability
- by inspecting the attention distribution, we can see what the decoder was focusing on
- we get (soft) alignment for free!
- The network just learned alignment by itself

Generalization and Variants
Attention is a general deep learning technique. You can use attention in many architectures (not just seq2seq) and for many tasks (not just MT).
More general definition of attention:
given a set of vector values and a vector query, attention computes a weighted sum of the values, where the weight on each value depends on the query.
We say the "query attends to the values".

Intuition:
- The weighted sum is a selective summary of the information contained in the values, where the query determines which values to focus on
- Attention is a way to obtain a fixed-size representation of an arbitrary set of representations (the values), dependent on some other representation (the query).
Attention variants on how to compute the attention score e_i between a query s (of dimension d2) and each value h_i (of dimension d1):
- Basic dot-product attention: e_i = s · h_i
this requires the query and values to have the same dimensionality (d1 = d2)
- Multiplicative attention: e_i = s^T W h_i
a bilinear function of the query and value-i;
the weight matrix W is a learnable parameter
- Additive attention: e_i = v^T tanh(W1 h_i + W2 s)
W1, W2 and the weight vector v are learnable parameters, and the attention dimensionality d3 (the size of v) is a hyperparameter
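The three scoring functions side by side as a NumPy sketch; the dimensions d1, d2, d3 follow the notes, while the random vectors and matrices simply stand in for hidden states and learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, d3 = 5, 4, 3       # value (encoder state) dim, query (decoder state) dim, additive dim
h = rng.normal(size=d1)    # one value  h_i (an encoder hidden state)
s = rng.normal(size=d2)    # the query  s   (the decoder hidden state)

# Basic dot product, e_i = s . h_i: only defined when d1 == d2 (not the case here).

# Multiplicative (bilinear): e_i = s^T W h_i, with learnable W of shape (d2, d1).
W = rng.normal(size=(d2, d1))
e_mult = s @ W @ h

# Additive: e_i = v^T tanh(W1 h_i + W2 s), with learnable W1 (d3, d1), W2 (d3, d2), v (d3,).
W1 = rng.normal(size=(d3, d1))
W2 = rng.normal(size=(d3, d2))
v = rng.normal(size=d3)
e_add = v @ np.tanh(W1 @ h + W2 @ s)

print(e_mult, e_add)   # each variant yields one scalar score per (query, value) pair
```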
Part 8 of series «XCS224N: NLP with deep learning»:
- [XCS224N] Lecture 1 – Introduction and Word Vectors
- [XCS224N] Lecture 2 – Word Vectors and Word Senses
- [XCS224N] Lecture 3 – Neural Networks
- [XCS224N] Lecture 4 – Backpropagation
- [XCS224N] Lecture 5 – Dependency Parsing
- [XCS224N] Lecture 6 – Language Models and RNNs
- [XCS224N] Lecture 7 – Vanishing Gradients and Fancy RNNs
- [XCS224N] Lecture 8 – Translation, Seq2Seq, Attention