This week: neural net fundamentals
Classification Setup and Notation
training data: samples {(x_i, y_i)}, i = 1…N, where x_i is an input and y_i is one of C class labels
softmax classifier (a linear classifier: the decision boundary is a hyperplane):
the i-th row of the parameter matrix W is the weight vector for class i and produces that class's logit
prediction = softmax of the logits, i.e. p(y|x) = softmax(f)_y
cross-entropy
goal: for (x, y), maximize p(y|x) ⇒ loss for (x, y) = -log p(y|x)
in our case the true distribution p is one-hot, i.e. p = [0, 0, ..., 1, ..., 0], so the cross entropy H(p, q) = -sum_c p(c) log q(c) reduces to -log q(y|x)
loss over the full training data = average of the per-example losses, where the logits vector is f = Wx:
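Collected in one place (a reconstruction in the standard notation, matching the list above):

$$
\{(x_i, y_i)\}_{i=1}^{N}, \qquad f = Wx, \qquad p(y \mid x) = \frac{\exp(f_y)}{\sum_{c=1}^{C} \exp(f_c)}
$$

$$
J(\theta) = \frac{1}{N} \sum_{i=1}^{N} -\log p(y_i \mid x_i) = \frac{1}{N} \sum_{i=1}^{N} -\log \frac{\exp(f_{y_i})}{\sum_{c=1}^{C} \exp(f_c)}
$$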
Neural Network Classifier
softmax, SVM, and other linear models are not powerful enough ⇒ use a neural network to learn nonlinear decision boundaries
in NLP:
learn both the model parameters W and the representations (word vectors x)
artificial neuron: y = f(Wx), where f is a nonlinear activation function
when f = sigmoid = 1/(1+exp(-x)), the neuron is a binary logistic regression unit
A neural network = running several logistic regressions at the same time
matrix notation:
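For a layer of such neurons (a sketch; a bias term b is included here, which the single-neuron formula above leaves out), with the activation applied element-wise:

$$
z = Wx + b, \qquad a = f(z), \qquad f([z_1, z_2, \dots, z_n]) = [f(z_1), f(z_2), \dots, f(z_n)]
$$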
Without non-linearities f(), deep neural networks can’t do anything more than a linear transform.
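A tiny numpy check of that claim (illustrative only): two stacked linear layers with no activation in between are exactly one linear layer with the combined weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((5, 4))   # first "layer": R^4 -> R^5
W2 = rng.standard_normal((3, 5))   # second "layer": R^5 -> R^3
x = rng.standard_normal(4)

# Two stacked linear layers without a nonlinearity in between...
two_layers = W2 @ (W1 @ x)
# ...collapse into a single linear layer with weight matrix W2 @ W1.
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layers, one_layer))  # True
```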
Named Entity Recognition (NER)
task: find and classify names in text
BIO encoding: each token is tagged B-X (beginning of an entity of type X), I-X (inside/continuation of an entity of type X), or O (outside any entity)
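For example (an illustrative sentence, not from the lecture): `Tim/B-PER Cook/I-PER visited/O New/B-LOC York/I-LOC ./O`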
Binary Word Window Classification
Classify a word in the context window of its neighboring words. Simple idea: concatenate the word vectors of all words in the window into one long vector.
Binary classification with unnormalized scores (Collobert & Weston 2008; Collobert et al. 2011): build true windows and corrupted windows (e.g. with the center word replaced by a random word).
feed-forward computation:
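A sketch of the score function in the usual notation, with the concatenated window as input (e.g. the window “museums in Paris are amazing” referenced below):

$$
x_{\text{window}} = [\,x_{\text{museums}};\, x_{\text{in}};\, x_{\text{Paris}};\, x_{\text{are}};\, x_{\text{amazing}}\,]
$$

$$
s = u^{\top} h, \qquad h = f(W x_{\text{window}} + b)
$$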
intuition: the middle layer learns non-linear interactions between the input words. Example: only if “museums” is the first vector should it matter that “in” is in the second position.
max-margin loss: make the score of the true window larger than the score of the corrupted window by at least a margin delta = 1.
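As a formula, with s the score of the true window and s_c the score of a corrupted window:

$$
J = \max\left(0,\; 1 - s + s_c\right)
$$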
QUESTION: the max-margin loss is continuous but not differentiable at the kink (where 1 - s + s_c = 0), so why can we still use SGD? Because the loss is differentiable almost everywhere, so in practice we just use the (sub)gradient.
SGD: update on sampled windows (or mini-batches): θ_new = θ_old - α ∇_θ J(θ)
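A minimal numpy sketch of these pieces (toy dimensions and random vectors standing in for word vectors; the sigmoid is just one possible choice of f):

```python
import numpy as np

rng = np.random.default_rng(0)
d, window, hidden = 4, 5, 8          # toy sizes: 4-dim word vectors, 5-word window
n = d * window                       # length of the concatenated window vector

W = 0.1 * rng.standard_normal((hidden, n))
b = np.zeros(hidden)
u = 0.1 * rng.standard_normal(hidden)

def score(x_window):
    """s = u^T f(W x + b), here with f = element-wise sigmoid."""
    z = W @ x_window + b
    h = 1.0 / (1.0 + np.exp(-z))
    return u @ h

x_true = rng.standard_normal(n)      # stand-in for a window with a real entity in the center
x_corrupt = rng.standard_normal(n)   # stand-in for the same window with a random center word

s, s_c = score(x_true), score(x_corrupt)
loss = max(0.0, 1.0 - s + s_c)       # max-margin loss with margin delta = 1
print(f"s={s:.3f}  s_c={s_c:.3f}  loss={loss:.3f}")
# The gradients needed for the SGD update are derived by hand in the next section.
```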
Computing Gradients by Hand
multivariable derivatives / matrix calculus
- when f maps R^n → R, the gradient is the vector of partial derivatives
- when f maps R^n → R^m, the Jacobian is an m × n matrix of partial derivatives
chain rule: multiply the Jacobians
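For example, for the composition used below, h = f(z) with z = Wx + b:

$$
\frac{\partial h}{\partial x} = \frac{\partial h}{\partial z}\,\frac{\partial z}{\partial x}
$$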
⇒ the nonlinear activation is applied element-wise, so the Jacobian of h = f(z) is diagonal:
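Written element-wise, with h_i = f(z_i):

$$
\left(\frac{\partial h}{\partial z}\right)_{ij} = \frac{\partial h_i}{\partial z_j} =
\begin{cases} f'(z_i) & i = j \\ 0 & i \neq j \end{cases}
\qquad\Longrightarrow\qquad
\frac{\partial h}{\partial z} = \mathrm{diag}\!\left(f'(z)\right)
$$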
⇒ other Jacobians:
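The other Jacobians used in the next section (standard results, stated without derivation):

$$
\frac{\partial}{\partial x}(Wx + b) = W, \qquad
\frac{\partial}{\partial b}(Wx + b) = I, \qquad
\frac{\partial}{\partial u}\left(u^{\top} h\right) = h^{\top}
$$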
Gradients in Neural Network
notation: s = u^T h, h = f(z), z = Wx + b (the window-classifier scoring network from above)
⇒ apply the chain rule with the Jacobian/gradient formulas from the last section:
⇒ extract the common part and call it the local error signal δ:
⇒ reuse δ for the remaining gradients:
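Spelled out, under the notation above (a reconstruction; δ is a row vector, and ∂s/∂W then has the same shape as W):

$$
\frac{\partial s}{\partial b}
= \frac{\partial s}{\partial h}\,\frac{\partial h}{\partial z}\,\frac{\partial z}{\partial b}
= u^{\top}\,\mathrm{diag}\!\left(f'(z)\right)\, I
$$

$$
\delta := u^{\top}\,\mathrm{diag}\!\left(f'(z)\right)
\qquad\Longrightarrow\qquad
\frac{\partial s}{\partial b} = \delta, \qquad
\frac{\partial s}{\partial W} = \delta^{\top} x^{\top}
$$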
Part 3 of series «XCS224N: NLP with deep learning»:
- [XCS224N] Lecture 1 – Introduction and Word Vectors
- [XCS224N] Lecture 2 – Word Vectors and Word Senses
- [XCS224N] Lecture 3 – Neural Networks
- [XCS224N] Lecture 4 – Backpropagation
- [XCS224N] Lecture 5 – Dependency Parsing
- [XCS224N] Lecture 6 – Language Models and RNNs
- [XCS224N] Lecture 7 – Vanishing Gradients and Fancy RNNs
- [XCS224N] Lecture 8 – Translation, Seq2Seq, Attention