mx's blog

[XCS224N] Lecture 8 – Translation, Seq2Seq, Attention

Sun, 05 Apr 2020 Category notes deep learning Series Part 8 of «XCS224N: NLP with deep learning»

Overview

Background (Pre-Neural Machine Translation)

machine translation (MT): sentence from source lang to target lang.
1950s: rule based, using bilingual dictionary.

1990s-2010s: Statistical MT (SMT)

using Bayes rule: P(y|x) = P(x|y)*P(y) / P(x)
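
A minimal sketch of how this decomposition picks a translation, assuming two hypothetical candidate translations with made-up probabilities (the names `candidates`, `p_x_given_y` and `p_y` are illustrative only). Since P(x) is the same for every candidate y, it can be dropped when taking the argmax over y:

```python
# Hypothetical scores for two candidate translations y of one source sentence x.
# p_x_given_y plays the role of the translation model P(x|y),
# p_y plays the role of the language model P(y). All numbers are made up.
candidates = {
    "he is hungry": {"p_x_given_y": 0.20, "p_y": 0.010},
    "he has hunger": {"p_x_given_y": 0.30, "p_y": 0.001},
}


def best_translation(cands):
    """Return argmax_y P(x|y) * P(y); P(x) is constant in y and is ignored."""
    return max(cands, key=lambda y: cands[y]["p_x_given_y"] * cands[y]["p_y"])


best = best_translation(candidates)
print(best)
```

Here the second candidate has the higher P(x|y), but the first one wins because it has a much higher P(y).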

⇒ The language model we already learnt in prev lectures ⇒ To get the ...

[XCS224N] Lecture 7 – Vanishing Gradients and Fancy RNNs

Sat, 04 Apr 2020 Category notes deep learning Series Part 7 of «XCS224N: NLP with deep learning»

Vanishing Gradient Intuition and Proof

ex: grad of loss at position 4 w.r.t. hidden state at position 1

by the chain rule, the grad gets smaller the further it backprops

⇒

If the largest eigenvalue of W_h is less than 1, the gradient ∂J_i/∂h_j shrinks exponentially with the distance between positions i and j.
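
A small numerical sketch of this effect, under simplifying assumptions that are not from the notes: the activation is the identity (so each step back in time multiplies the gradient by the transposed recurrent matrix), and the matrix is a randomly generated symmetric one rescaled so that its largest absolute eigenvalue is 0.9.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: a symmetric recurrent matrix, so its eigenvalues are real and
# the largest absolute eigenvalue bounds how much any vector can be stretched.
A = rng.standard_normal((4, 4))
S = (A + A.T) / 2
W_h = 0.9 * S / np.max(np.abs(np.linalg.eigvalsh(S)))  # largest |eigenvalue| is 0.9

# Linear case (identity activation): each step back in time multiplies
# the gradient by W_h transposed.
grad = np.ones(4)
initial_norm = np.linalg.norm(grad)
norms = []
for _ in range(50):
    grad = W_h.T @ grad
    norms.append(np.linalg.norm(grad))

# Gradient norm after 1, 10 and 50 steps back in time.
print(norms[0], norms[9], norms[49])
```

After k steps the norm is bounded by 0.9^k times the initial norm, so the signal from far-away positions becomes negligible.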

Why Vanishing Gradient ...

[XCS224N] Lecture 6 – Language Models and RNNs

Sat, 28 Mar 2020 Category notes deep learning Series Part 6 of «XCS224N: NLP with deep learning»

Language Modeling

Language Modeling: task of predicting what words come next.

i.e. compute the conditional probability distribution of the next word given the preceding words
a language model can also be viewed as a system that assigns a probability to a piece of text.
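
Both views can be sketched on a toy example. The block below is an illustration only: it assumes a tiny made-up corpus and estimates each conditional probability from the previous word alone (a bigram simplification chosen to keep the example short). The function names are hypothetical.

```python
from collections import Counter

# Tiny made-up corpus, already tokenized.
corpus = "the students opened their books . the students opened their minds .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))


def next_word_distribution(prev):
    """Estimate P(next word | previous word) from bigram counts."""
    total = sum(c for (w1, _), c in bigrams.items() if w1 == prev)
    return {w2: c / total for (w1, w2), c in bigrams.items() if w1 == prev}


def text_probability(words):
    """Probability of a piece of text as a product of conditional probabilities."""
    p = unigrams[words[0]] / len(corpus)
    for prev, nxt in zip(words, words[1:]):
        p *= next_word_distribution(prev).get(nxt, 0.0)
    return p


print(next_word_distribution("their"))
print(text_probability("the students opened their books".split()))
```

The first call answers "what comes next?", the second scores a whole sentence by multiplying the per-word predictions.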

n-gram Language Models

n-gram Language Model: pre-deep learning solution for language modeling.

idea: Collect ...

[XCS224N] Lecture 5 – Dependency Parsing

Mon, 23 Mar 2020 Category notes deep learning Series Part 5 of «XCS224N: NLP with deep learning»

Phrase structure: organize words into nested constituents.

Context-Free Grammars

context-free grammars (CFGs)

start with words; each word is given a category (part of speech = POS):

words combine into phrases with categories like NP (noun phrase) and PP (prepositional phrase):

Phrases can combine into bigger phrases recursively:

⇒ forms a tree structure:

Dependency ...

[XCS224N] Lecture 3 – Neural Networks

Sat, 21 Mar 2020 Category notes deep learning Series Part 3 of «XCS224N: NLP with deep learning»

This week: neural net fundamentals

Classification Setup and Notation

training data:

softmax classifier

(linear classifier — hyperplane):

ith row of the param W: weight vector for class i to compute logits:

prediction = softmax of f_y:
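
A minimal sketch of the logits and softmax steps, assuming a made-up 3-class weight matrix `W` and a 2-dimensional input `x` (both hypothetical values chosen for readability):

```python
import numpy as np


def softmax(logits):
    """Turn a vector of logits into a probability distribution."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


# Row i of W is the weight vector for class i (3 classes, 2 features).
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])
x = np.array([2.0, 1.0])

logits = W @ x            # one logit per class
probs = softmax(logits)   # p(class | x) for every class
predicted_class = int(np.argmax(probs))
print(logits, probs, predicted_class)
```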

cross-entropy

goal: for (x, y), maximize p(y|x) ⇒ loss for (x, y) = -log p(y ...

[XCS224N] Lecture 4 – Backpropagation

Sat, 21 Mar 2020 Category notes deep learning Series Part 4 of «XCS224N: NLP with deep learning»

More Matrix Gradients

⇒

Deriving Gradients wrt Words

pitfall in retraining word vectors: if some word is not in the training data but its synonyms are ⇒ only the synonyms' word vectors are moved

takeaway:

Backpropagation

backprop:

apply (generalized) chain rule
re-use shared stuff
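
A hand-written sketch of these two ideas on a tiny function, f = (x + y) * max(y, z), chosen here purely for illustration. The upstream gradients `df_da` and `df_db` are computed once and re-used for every input that feeds the corresponding node:

```python
def forward_backward(x, y, z):
    """Return f = (x + y) * max(y, z) and its gradients w.r.t. x, y, z."""
    # Forward pass: compute and keep the intermediate values.
    a = x + y
    b = max(y, z)
    f = a * b

    # Backward pass: gradients of f w.r.t. the intermediate nodes,
    # computed once and shared below.
    df_da = b
    df_db = a

    # Local gradients of each node w.r.t. its inputs.
    da_dx, da_dy = 1.0, 1.0
    db_dy = 1.0 if y > z else 0.0
    db_dz = 0.0 if y > z else 1.0  # ties are routed to z, an arbitrary choice

    # Chain rule; y feeds two nodes, so its two contributions are summed.
    df_dx = df_da * da_dx
    df_dy = df_da * da_dy + df_db * db_dy
    df_dz = df_db * db_dz
    return f, (df_dx, df_dy, df_dz)


value, grads = forward_backward(1.0, 2.0, 0.0)
print(value, grads)
```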

computation graph

⇒ Go backwards along edges, pass along ...

[XCS224N] Lecture 2 – Word Vectors and Word Senses

Tue, 17 Mar 2020 Category notes deep learning Series Part 2 of «XCS224N: NLP with deep learning»

More on Word2Vec

parameters θ : matrix U and V (each word vec is a row):

and the predictions don't take into account the distance between center word c and outside word o. ⇒ all word vecs predict high for the stopwords.

Optimization Basics

min loss function: J(θ)

gradient descent ...

[XCS224N] Lecture 1 – Introduction and Word Vectors

Mon, 09 Mar 2020 Category notes deep learning Series Part 1 of «XCS224N: NLP with deep learning»

Course intro

Word Meaning and Representation

denotational semantics

wordnet (nltk): word meanings, synonym, relationships, hierarchical

pb: missing nuance, missing new meanings, requires human labor, can't compute word similarity

Traditional NLP (until 2012):

words are discrete symbols: a "localist representation"
use one-hot vectors for encoding
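
A minimal sketch of one-hot encoding, assuming a hypothetical three-word vocabulary. Any two different words get orthogonal vectors, so the encoding by itself carries no notion of similarity:

```python
vocab = ["hotel", "motel", "cat"]  # hypothetical toy vocabulary


def one_hot(word):
    """Vector with a 1 at the word's index in the vocabulary and 0 elsewhere."""
    return [1 if w == word else 0 for w in vocab]


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


print(one_hot("motel"))
print(dot(one_hot("hotel"), one_hot("motel")))
print(dot(one_hot("hotel"), one_hot("hotel")))
```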

pbs with one-hot vectors:
large ...

[Sequential Models] week3. Sequence models & Attention mechanism

Wed, 28 Feb 2018 Category notes deep learning Series Part 16 of «Andrew Ng Deep Learning MOOC»

This week: seq2seq.

I-Various sequence to sequence architectures

Basic Models

e.g. Machine translation
encoder network: many-to-one RNN
decoder network: one-to-many RNN

This architecture also works for image captioning: use ConvNet as encoder

Difference between seq2seq and generating new text with a language model: seq2seq doesn't randomly choose a translation ...

[Sequential Models] week2. Natural Language Processing & Word Embeddings

Mon, 26 Feb 2018 Category notes deep learning Series Part 15 of «Andrew Ng Deep Learning MOOC»

I - Introduction to Word Embeddings

Word representation
So far: representing words with one-hot encoding → word relationships are not generalized.
⇒ want to learn a featurized representation for each word as a high-dim vector

→ visualize word embeddings in 2-dim space, e.g. via t-SNE

Using word embeddings

example: NER
transfer learning: using ...