More Matrix Gradients
Deriving Gradients wrt Words
pitfall in retraining word vectors: if some word is not in the training data but its synonyms are, only the synonyms' word vectors get moved ⇒ the absent word's vector is left behind and no longer sits near them
takeaway: only fine-tune (retrain) pretrained word vectors when the training dataset is large; with a small dataset, keep them fixed.
Backpropagation
backprop:
- apply (generalized) chain rule
- re-use computations shared between gradients
computation graph
⇒ Go backwards along edges, pass along gradients
receive upstream gradient ⇒ downstream gradient = upstream gradient × local gradient
for a node with multiple inputs: one local gradient per input, so each input gets its own downstream gradient
More on Backpropagation
intuition for common gates (a worked example follows below):
- plus (+) distributes the upstream gradient to each summand unchanged
- max routes the upstream gradient: only the input that was the max receives it
- multiply (*) switches the upstream gradient: each input receives it scaled by the other input's value
efficiency: compute shared parts once and re-use them across gradients
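To make the three intuitions concrete, here is a hand-worked toy example (my own, not from the lecture) for f = (x + y) * max(y, z):

```python
# Hand-written Fprop/Bprop for f = (x + y) * max(y, z) with made-up example values.
x, y, z = 1.0, 2.0, 0.0

# forward pass
a = x + y       # plus gate
b = max(y, z)   # max gate
f = a * b       # multiply gate

# backward pass (gradient of f wrt itself is 1)
df = 1.0
# multiply "switches": each input gets the upstream grad times the OTHER input
da = df * b
db = df * a
# max "routes": the whole upstream grad goes to the larger input only
dy_from_max = db if y > z else 0.0
dz = db if z > y else 0.0
# plus "distributes": both inputs receive the upstream grad unchanged
dx = da
dy_from_plus = da
# gradients arriving at the same variable along different paths are summed
dy = dy_from_plus + dy_from_max

print(dx, dy, dz)   # 2.0, 5.0, 0.0
```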
Backprop in general computation graph
the computation graph is a DAG ⇒ topologically sort its nodes
- Fprop: visit nodes in topological order
- Bprop: visit nodes in reverse topological order
Complexity = O(n): done correctly, Bprop has the same big-O cost as Fprop.
Automatic differentiation: the Bprop formulas can be derived by symbolic computation on the expression of the Fprop. Modern DL frameworks (e.g. TensorFlow, PyTorch) do the graph bookkeeping for you, but whoever writes a node/layer must still provide its Fprop/Bprop formula (a toy sketch of graph-based Bprop follows below).
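As an illustration (a toy sketch of my own, not a real framework), Bprop over a general DAG by visiting nodes in reverse topological order:

```python
# Toy reverse-mode autodiff over a DAG (Node/add/mul names are my own).
class Node:
    def __init__(self, value, inputs=(), local_grads=()):
        self.value = value              # Fprop result at this node
        self.inputs = inputs            # parent nodes
        self.local_grads = local_grads  # d(value)/d(input) for each input
        self.grad = 0.0                 # accumulated upstream gradient

def add(a, b):
    return Node(a.value + b.value, (a, b), (1.0, 1.0))

def mul(a, b):
    return Node(a.value * b.value, (a, b), (b.value, a.value))

def topo_sort(output):
    order, seen = [], set()
    def visit(n):
        if id(n) not in seen:
            seen.add(id(n))
            for p in n.inputs:
                visit(p)
            order.append(n)
    visit(output)
    return order                        # inputs come before outputs

def backprop(output):
    output.grad = 1.0
    for node in reversed(topo_sort(output)):          # reverse topological order
        for parent, local in zip(node.inputs, node.local_grads):
            parent.grad += node.grad * local          # downstream = upstream * local

# usage: f = (x + y) * y
x, y = Node(2.0), Node(3.0)
f = mul(add(x, y), y)
backprop(f)
print(x.grad, y.grad)   # 3.0, 8.0
```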
Backprop Implementations
for each gate, implement the forward/backward API:
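A minimal sketch of one such gate (class and attribute names are mine), following the forward/backward convention:

```python
# A multiply gate exposing the forward/backward API.
class MultiplyGate:
    def forward(self, x, y):
        # cache the inputs: they are the local gradients needed in backward
        self.x, self.y = x, y
        return x * y

    def backward(self, dz):
        # dz is the upstream gradient dL/dz for z = x * y
        dx = dz * self.y   # local gradient dz/dx = y
        dy = dz * self.x   # local gradient dz/dy = x
        return dx, dy
```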
Numeric Gradient
For checking whether the forward/backward implementation is correct, e.g. check f'(x) ≈ (f(x + h) - f(x - h)) / (2h) for a small h
(note: use two-sided gradient checks as above; they are much more accurate than the one-sided (f(x + h) - f(x)) / h)
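A possible two-sided gradient checker (function name and tolerance are illustrative):

```python
import numpy as np

def numeric_gradient(f, x, h=1e-4):
    """Two-sided numeric gradient of a scalar-valued f at array x."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        i = it.multi_index
        old = x[i]
        x[i] = old + h; fp = f(x)   # f(x + h) in coordinate i
        x[i] = old - h; fm = f(x)   # f(x - h) in coordinate i
        x[i] = old                  # restore
        grad[i] = (fp - fm) / (2 * h)
        it.iternext()
    return grad

# compare against the analytic gradient, e.g. for f(x) = sum(x**2) it is 2x
x = np.random.randn(3, 2)
assert np.allclose(numeric_gradient(lambda v: np.sum(v ** 2), x), 2 * x, atol=1e-5)
```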
Regularization
A regularization term is added to the loss function to prevent overfitting, e.g. L2 regularization: J(θ) = J_original(θ) + λ Σ_k θ_k²
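A sketch of adding an L2 penalty to an existing loss and its gradients (function and variable names are mine; lam stands for λ):

```python
import numpy as np

def l2_regularize(loss, grads, params, lam):
    """Add lam * sum(theta_k^2) to the loss and 2 * lam * theta to each gradient."""
    loss = loss + lam * sum(np.sum(W ** 2) for W in params)
    grads = [g + 2 * lam * W for g, W in zip(grads, params)]
    return loss, grads
```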
Vectorization/tensorization
avoid for-loops, use matrix multiplications instead.
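For example (a toy sketch with made-up shapes):

```python
import numpy as np

W = np.random.randn(100, 50)      # weight matrix
X = np.random.randn(50, 10000)    # 10,000 input column vectors

# slow: Python for-loop over columns
Y_loop = np.stack([W.dot(X[:, i]) for i in range(X.shape[1])], axis=1)

# fast: one matrix-matrix multiplication does the same work
Y_mat = W.dot(X)

assert np.allclose(Y_loop, Y_mat)
```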
Nonlinearities
tanh is a rescaled and shifted sigmoid:
tanh(x) = 2 * sigmoid(2x) - 1
newer ("new world") activation functions: ReLU, Leaky ReLU, etc. ⇒ ReLU is usually the default first choice for deep feed-forward networks
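Sketches of these nonlinearities in NumPy (helper names are my own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return 2.0 * sigmoid(2.0 * x) - 1.0        # rescaled/shifted sigmoid, equals np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)       # small slope alpha for negative inputs

x = np.linspace(-3, 3, 7)
assert np.allclose(tanh(x), np.tanh(x))
```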
Parameter Initialization
Initialize the weights to small, random values ⇒ breaks the symmetry between hidden units.
- Init bias to 0
- Init all other weights to Uniform(-r, r).
- Xavier initialization: variance inversely proportional to the sum of the previous layer size (fan-in) and the next layer size (fan-out): Var(W) = 2 / (n_in + n_out); see the sketch below
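A sketch of Xavier (uniform) initialization under the Var(W) = 2 / (n_in + n_out) rule above (function name and layer sizes are illustrative):

```python
import numpy as np

def xavier_init(n_in, n_out):
    """Uniform(-r, r) with r chosen so that Var(W) = 2 / (n_in + n_out)."""
    r = np.sqrt(6.0 / (n_in + n_out))   # Uniform(-r, r) has variance r**2 / 3
    return np.random.uniform(-r, r, size=(n_out, n_in))

W1 = xavier_init(300, 100)   # e.g. 300-dim inputs, 100 hidden units
b1 = np.zeros(100)           # biases initialized to 0
```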
Optimizers and Learning Rates
Usually plain SGD works fine, but the learning rate (lr) needs to be tuned. Adaptive optimizers (e.g. Adagrad, RMSProp, Adam) keep a per-parameter learning rate.
learning rate:
- try powers of 10 (e.g. start around lr = 0.001)
- learning-rate decay: e.g. halve the lr every k epochs, or use a formula such as lr = lr₀ · e^(−kt) at epoch t; see the sketch below
(epoch = one full pass over the training data)
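A minimal runnable sketch of SGD with formula-based lr decay on a toy least-squares problem (all names and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = X @ w_true

w = np.zeros(5)
lr0, k, batch_size = 0.1, 0.1, 20

for epoch in range(50):                              # one epoch = one full pass over the data
    lr = lr0 * np.exp(-k * epoch)                    # exponential decay of the learning rate
    for i in range(0, len(X), batch_size):
        Xb, yb = X[i:i + batch_size], y[i:i + batch_size]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(Xb)    # gradient of the mean squared error
        w -= lr * grad                               # plain SGD update
print(np.round(w - w_true, 3))                       # should be close to zero
```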
Part 4 of series «XCS224N: NLP with deep learning»:
- [XCS224N] Lecture 1 – Introduction and Word Vectors
- [XCS224N] Lecture 2 – Word Vectors and Word Senses
- [XCS224N] Lecture 3 – Neural Networks
- [XCS224N] Lecture 4 – Backpropagation
- [XCS224N] Lecture 5 – Dependency Parsing
- [XCS224N] Lecture 6 – Language Models and RNNs
- [XCS224N] Lecture 7 – Vanishing Gradients and Fancy RNNs
- [XCS224N] Lecture 8 – Translation, Seq2Seq, Attention