This week: neural net fundamentals
Classification Setup and Notation
training data: $\{x_i, y_i\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^{d}$ is an input and $y_i \in \{1, \dots, C\}$ is its class label.

softmax classifier (a linear classifier, so its decision boundaries are hyperplanes):

$$p(y \mid x) = \mathrm{softmax}(Wx)_y$$

the $i$-th row of the parameter matrix $W$ is the weight vector for class $i$; it computes the logit:

$$f_i = W_{i \cdot}\, x = \sum_{j} W_{ij} x_j$$

prediction = softmax of the logits $f$:

$$p(y \mid x) = \frac{\exp(f_y)}{\sum_{c=1}^{C} \exp(f_c)}$$

cross-entropy
goal: for $(x, y)$, maximize $p(y \mid x)$ ⇒ loss for $(x, y)$ = $-\log p(y \mid x)$


in our case the true distribution is one-hot, i.e. $p = [0, 0, \dots, 1, \dots, 0]$ with the 1 at the true class $y$
⇒ the cross-entropy $H(p, q) = -\sum_{c=1}^{C} p(c) \log q(c)$ collapses to a single term, $-\log q(y \mid x)$: the negative log-probability of the true class
loss over all training data = average of the per-example losses:

$$J(\theta) = \frac{1}{N} \sum_{i=1}^{N} -\log \frac{\exp(f_{y_i})}{\sum_{c=1}^{C} \exp(f_c)}$$

where the logits vector $f$ is:

$$f = f(x_i; W) = W x_i$$

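A minimal numpy sketch of this loss, assuming made-up shapes and data (the matrix `W`, inputs `X`, and labels `y` below are placeholders, not the lecture's):

```python
import numpy as np

def softmax(f):
    """Row-wise softmax with max-subtraction for numerical stability."""
    f = f - f.max(axis=1, keepdims=True)
    e = np.exp(f)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy_loss(W, X, y):
    """Average -log p(y|x) over the dataset: X is (N, d), y is (N,), W is (C, d)."""
    logits = X @ W.T              # f = Wx for every example, shape (N, C)
    probs = softmax(logits)       # p(c|x) for every class
    return -np.log(probs[np.arange(len(y)), y]).mean()

# toy usage: 5 examples, 4 features, 3 classes
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
X = rng.normal(size=(5, 4))
y = rng.integers(0, 3, size=5)
print(cross_entropy_loss(W, X, y))
```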
Neural Network Classifier
softmax, SVM, and other linear models are not powerful enough ⇒ use a neural network to learn nonlinear decision boundaries

in NLP:
learn both the model parameters ($W$) and the input representations (the word vectors $x$)

artificial neuron: $y = f(Wx)$, where $f$ is a nonlinear activation function.

when $f$ is the sigmoid $\sigma(z) = \frac{1}{1 + e^{-z}}$, the neuron is a binary logistic regression unit.
A neural network = running several logistic regressions at the same time

matrix notation:

$$z = Wx + b, \qquad a = f(z)$$

where $f$ is applied element-wise.
Without non-linearities $f(\cdot)$, deep neural networks can't do anything more than a linear transform: stacking linear layers just gives $W_2(W_1 x) = (W_2 W_1)x$, another linear map.
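A small numpy sketch of this point, with arbitrary illustrative shapes and weights: with the identity in place of $f$, the two-layer network is exactly one affine map.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)

def forward(x, f):
    h = f(W1 @ x + b1)            # hidden layer: z = Wx + b, a = f(z)
    return W2 @ h + b2            # output layer

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
identity = lambda z: z

# Without the nonlinearity the network collapses to a single affine map:
W = W2 @ W1
b = W2 @ b1 + b2
print(np.allclose(forward(x, identity), W @ x + b))   # True
print(forward(x, sigmoid))                            # genuinely nonlinear in x
```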
Named Entity Recognition (NER)
task: find and classify names in text

BIO encoding: each token is tagged B-X (beginning of an entity of type X), I-X (inside such an entity), or O (outside any entity), e.g. Sophia/B-PER Loren/I-PER visited/O Paris/B-LOC.

Binary Word Window Classification
Classify a word in the context window of its neighboring words. Simple idea: concatenate the word vectors of all words in the window into one long input vector.

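A sketch of building the window input by concatenation; the embedding matrix `E`, the vocabulary, and the window size below are hypothetical placeholders.

```python
import numpy as np

d, window_size = 4, 2                      # assumed embedding dim and window radius
vocab = {"museums": 0, "in": 1, "Paris": 2, "are": 3, "amazing": 4}
E = np.random.default_rng(2).normal(size=(len(vocab), d))   # one row per word

def window_vector(tokens, center):
    """Concatenate the embeddings of the center word and its neighbors (no padding handled)."""
    idx = range(center - window_size, center + window_size + 1)
    return np.concatenate([E[vocab[tokens[i]]] for i in idx])

tokens = ["museums", "in", "Paris", "are", "amazing"]
x = window_vector(tokens, center=2)        # classify "Paris" in its window
print(x.shape)                             # (5 * d,) = (20,)
```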
Binary classification with unnormalized scores (2008 & 2011): build true windows (e.g. with a location word in the center) and corrupted windows (center word replaced by a random word).

feed-forward computation:

$$s = \mathrm{score}(x) = u^{\top} h, \qquad h = f(Wx + b)$$

where $x$ is the concatenated window vector.
intuition: the middle layer learns non-linear interactions between the input words. Example: only if “museums” is the first vector should it matter that “in” is in the second position.
max-margin loss: make the true window's score $s$ larger, by at least a margin $\Delta = 1$, than the corrupted window's score $s_c$:

$$J = \max(0,\; 1 - s + s_c)$$

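A minimal sketch of the score network and the max-margin objective, assuming the $s = u^{\top} f(Wx + b)$ form above; all parameters and windows below are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)
n_in, n_hidden = 20, 8                     # e.g. 5 concatenated word vectors of dim 4
W, b = rng.normal(size=(n_hidden, n_in)), rng.normal(size=n_hidden)
u = rng.normal(size=n_hidden)

def score(x):
    """s = u^T f(Wx + b), using a sigmoid as the nonlinearity f."""
    h = 1.0 / (1.0 + np.exp(-(W @ x + b)))
    return u @ h

x_true = rng.normal(size=n_in)             # stand-in for a true window
x_corrupt = rng.normal(size=n_in)          # stand-in for a corrupted window

# max-margin loss: want score(true) > score(corrupt) by at least 1
J = max(0.0, 1.0 - score(x_true) + score(x_corrupt))
print(J)
```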
QUESTION: why can we still use SGD when the loss contains a max? (It is continuous and differentiable almost everywhere, so gradient/subgradient steps still work.)
SGD:

$$\theta^{\mathrm{new}} = \theta^{\mathrm{old}} - \alpha \, \nabla_{\theta} J(\theta)$$

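A sketch of the update rule as a minibatch loop; `grad_J` is a hypothetical function that returns $\nabla_{\theta} J$ on a sampled batch, and all names here are assumptions for illustration.

```python
import numpy as np

def sgd(theta, grad_J, data, lr=0.01, epochs=10, batch_size=32):
    """Plain minibatch SGD: theta <- theta - lr * grad_J(theta, batch)."""
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        rng.shuffle(data)                       # shuffle training examples in place
        for i in range(0, len(data), batch_size):
            theta = theta - lr * grad_J(theta, data[i:i + batch_size])
    return theta
```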
Computing Gradients by Hand
multivariable derivatives / matrix calculus
- when $f: \mathbb{R}^n \to \mathbb{R}$, the gradient is the vector of partial derivatives: $\nabla_x f = \left[ \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right]$

- when $f: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian is the $m \times n$ matrix of partial derivatives: $\left( \frac{\partial f}{\partial x} \right)_{ij} = \frac{\partial f_i}{\partial x_j}$

chain rule: multiply the Jacobians. For $z = g(x)$ and $h = f(z)$:

$$\frac{\partial h}{\partial x} = \frac{\partial h}{\partial z} \frac{\partial z}{\partial x}$$

⇒ since the nonlinear (activation) function $h = f(z)$ is applied element-wise, its Jacobian is diagonal:

$$\frac{\partial h}{\partial z} = \mathrm{diag}\!\left( f'(z) \right)$$

⇒ other Jacobians we need:

$$\frac{\partial}{\partial x}(Wx + b) = W, \qquad \frac{\partial}{\partial b}(Wx + b) = I, \qquad \frac{\partial}{\partial u}\!\left( u^{\top} h \right) = h^{\top}$$

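A quick finite-difference check of two of these Jacobians (element-wise sigmoid ⇒ diagonal Jacobian, and $\partial(Wx+b)/\partial x = W$); the sizes and values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, eps = 3, 2, 1e-6
z = rng.normal(size=n)
W, b = rng.normal(size=(m, n)), rng.normal(size=m)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def jacobian(f, x):
    """Numerical Jacobian of f at x via central differences."""
    cols = []
    for j in range(len(x)):
        e = np.zeros_like(x); e[j] = eps
        cols.append((f(x + e) - f(x - e)) / (2 * eps))
    return np.stack(cols, axis=1)

Jh = jacobian(sigmoid, z)
print(np.allclose(Jh, np.diag(sigmoid(z) * (1 - sigmoid(z))), atol=1e-6))  # True: diag(f'(z))
print(np.allclose(jacobian(lambda x: W @ x + b, z), W, atol=1e-6))         # True: d(Wx+b)/dx = W
```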
Gradients in Neural Network
notation (the window-scoring network from above):

$$s = u^{\top} h, \qquad h = f(z), \qquad z = Wx + b$$

⇒ apply the chain rule with the Jacobian/gradient formulas from the last section:

$$\frac{\partial s}{\partial b} = \frac{\partial s}{\partial h} \frac{\partial h}{\partial z} \frac{\partial z}{\partial b} = u^{\top} \,\mathrm{diag}\!\left( f'(z) \right) I$$

$$\frac{\partial s}{\partial W} = \frac{\partial s}{\partial h} \frac{\partial h}{\partial z} \frac{\partial z}{\partial W}$$

⇒ the two gradients share the first two factors; extract this common part and call it the local error signal $\delta$:

$$\delta = \frac{\partial s}{\partial h} \frac{\partial h}{\partial z} = u^{\top} \,\mathrm{diag}\!\left( f'(z) \right) = \left( u \circ f'(z) \right)^{\top}$$

⇒ then

$$\frac{\partial s}{\partial b} = \delta, \qquad \frac{\partial s}{\partial W} = \delta^{\top} x^{\top}$$

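A numpy sanity check of these formulas against finite differences, using a sigmoid nonlinearity and random sizes purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n_in, n_h, eps = 6, 4, 1e-6
x, b = rng.normal(size=n_in), rng.normal(size=n_h)
W, u = rng.normal(size=(n_h, n_in)), rng.normal(size=n_h)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def s(W, b):
    """Window score s = u^T f(Wx + b)."""
    return u @ sigmoid(W @ x + b)

z = W @ x + b
h = sigmoid(z)
delta = u * h * (1 - h)                 # local error signal: u ∘ f'(z)
grad_W = np.outer(delta, x)             # ds/dW = delta^T x^T (outer product)
grad_b = delta                          # ds/db = delta

# finite-difference checks
num_b = np.array([(s(W, b + eps * np.eye(n_h)[i]) - s(W, b - eps * np.eye(n_h)[i])) / (2 * eps)
                  for i in range(n_h)])
print(np.allclose(num_b, grad_b, atol=1e-6))         # True

num_W = np.zeros_like(W)
for i in range(n_h):
    for j in range(n_in):
        dW = np.zeros_like(W); dW[i, j] = eps
        num_W[i, j] = (s(W + dW, b) - s(W - dW, b)) / (2 * eps)
print(np.allclose(num_W, grad_W, atol=1e-6))         # True
```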
Part 3 of series «XCS224N: NLP with deep learning»:
- [XCS224N] Lecture 1 – Introduction and Word Vectors
- [XCS224N] Lecture 2 – Word Vectors and Word Senses
- [XCS224N] Lecture 3 – Neural Networks
- [XCS224N] Lecture 4 – Backpropagation
- [XCS224N] Lecture 5 – Dependency Parsing
- [XCS224N] Lecture 6 – Language Models and RNNs
- [XCS224N] Lecture 7 – Vanishing Gradients and Fancy RNNs
- [XCS224N] Lecture 8 – Translation, Seq2Seq, Attention