# [XCS224N] Lecture 4 – Backpropagation

Pitfall in retraining word vectors: if a word is absent from the training data but its synonyms are present, only the synonyms' word vectors get moved. Takeaway: only fine-tune pretrained word vectors when the training dataset is large.

## Backpropagation

backprop:

• apply (generalized) chain rule
• re-use shared computation: gradients of different parameters share common upstream factors
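A minimal numeric sketch of both points, for a hypothetical one-layer score `s = u · sigmoid(Wx + b)`: the error signal `delta` is computed once and re-used by both `dW` and `db` (all names and values here are illustrative).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy values for a one-layer network s = u . sigmoid(W x + b)
x = np.array([1.0, 2.0])
W = np.array([[0.1, 0.2], [0.3, 0.4]])
b = np.array([0.5, 0.6])
u = np.array([1.0, -1.0])

z = W @ x + b
h = sigmoid(z)
s = u @ h

# Chain rule: ds/dz = u * sigmoid'(z), computed ONCE ...
delta = u * h * (1 - h)
# ... then re-used for both parameter gradients
dW = np.outer(delta, x)   # ds/dW_ij = delta_i * x_j
db = delta                # ds/db_i  = delta_i
```

The shared `delta` is exactly the "shared stuff": without it, each of `dW` and `db` would redo the sigmoid-derivative work.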

### Computation graph

Go backwards along the edges, passing gradients along. For a node with multiple inputs, compute a separate local gradient with respect to each input.

## More on Backpropagation

Intuition:

• plus (`+`) distributes the upstream gradient to both inputs
• `max` routes the upstream gradient to the larger input
• multiply (`*`) switches the inputs: each input's gradient is the upstream gradient times the *other* input

Efficiency: compute the shared part once.
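The three rules above can be checked on scalars (values here are arbitrary illustrations):

```python
x, y, upstream = 3.0, 5.0, 2.0

# plus distributes: d(x+y)/dx = d(x+y)/dy = 1, so both inputs get `upstream`
grad_x_plus, grad_y_plus = upstream * 1.0, upstream * 1.0

# max routes: the gradient flows only to the larger input
grad_x_max = upstream if x > y else 0.0
grad_y_max = upstream if y > x else 0.0

# multiply switches: d(x*y)/dx = y, d(x*y)/dy = x
grad_x_mul, grad_y_mul = upstream * y, upstream * x
```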

### Backprop in general computation graph

The computation graph is a DAG ⇒ sort the nodes topologically.

• Fprop: visit nodes in topological order
• Bprop: visit nodes in reverse topological order

Complexity = O(n), the same order as Fprop.

Automatic differentiation: derive the Bprop computation symbolically from the symbolic expression of Fprop. Modern DL frameworks don't go that far: the Fprop/Bprop formula must be provided for each node type.
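The two passes above can be sketched on a tiny hand-built DAG (the `Node` class and its fields are illustrative, not a real framework). Here the graph is `out = x*y + y`, a DAG because `y` feeds two nodes:

```python
class Node:
    """Illustrative node: a forward fn, local-gradient fn, and parent links."""
    def __init__(self, forward=None, local_grads=None, parents=()):
        self.forward = forward          # computes value from parent values
        self.local_grads = local_grads  # partials w.r.t. each parent value
        self.parents = parents
        self.value = 0.0
        self.grad = 0.0

x = Node(); x.value = 3.0
y = Node(); y.value = 4.0
a = Node(lambda u, v: u * v, lambda u, v: (v, u), (x, y))        # a = x*y
out = Node(lambda u, v: u + v, lambda u, v: (1.0, 1.0), (a, y))  # out = a+y

topo = [x, y, a, out]                   # a topological order of the DAG

for n in topo:                          # Fprop: topological order
    if n.parents:
        n.value = n.forward(*[p.value for p in n.parents])

out.grad = 1.0
for n in reversed(topo):                # Bprop: reverse topological order
    if n.parents:
        parent_vals = [p.value for p in n.parents]
        for p, g in zip(n.parents, n.local_grads(*parent_vals)):
            p.grad += g * n.grad        # accumulate: nodes may have many consumers
```

Note the `+=`: because `y` has two outgoing edges, its gradient is the sum of contributions, which the reverse-order visit collects correctly.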

## Backprop Implementations

For each gate, implement the forward/backward API.

### Numeric Gradient

Used to check whether the forward/backward implementation is correct: e.g. check `f'(x) ≈ (f(x+h) - f(x-h)) / 2h` for small h.
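A sketch of the per-gate forward/backward API plus the central-difference check (the `MultiplyGate` class name is illustrative):

```python
class MultiplyGate:
    """Illustrative gate with the forward/backward API."""
    def forward(self, x, y):
        self.x, self.y = x, y            # cache inputs for the backward pass
        return x * y

    def backward(self, dz):
        # multiply "switches" the inputs
        return dz * self.y, dz * self.x

gate = MultiplyGate()
x, y = 3.0, -2.0
z = gate.forward(x, y)
dx, dy = gate.backward(1.0)

# Numeric gradient check: f'(x) ~= (f(x+h) - f(x-h)) / 2h
h = 1e-5
dx_num = ((x + h) * y - (x - h) * y) / (2 * h)
```

If `dx` and `dx_num` disagree beyond floating-point noise, the backward implementation is wrong.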

## Regularization

A regularization term is added to the loss function to prevent overfitting, e.g. L2: `J(θ) = J_data(θ) + λ Σ_k θ_k²`.
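A minimal sketch of the L2 term above (`regularized_loss` and the strength `lam` are illustrative names/values):

```python
import numpy as np

def regularized_loss(data_loss, params, lam=1e-3):
    """Add lam * sum of squared parameters to the data loss (L2)."""
    return data_loss + lam * sum(np.sum(p ** 2) for p in params)

W = np.ones((2, 2))                      # 4 weights, each squared = 1
loss = regularized_loss(1.0, [W], lam=0.1)
```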

## Vectorization/tensorization

Avoid for-loops; use matrix multiplications instead.

## Nonlinearities

tanh is a rescaled and shifted version of the sigmoid: `tanh(x) = 2 * sigmoid(2x) - 1`
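Both points can be checked numerically: transforming a whole batch with one matrix-matrix multiply instead of a Python loop over columns, and verifying the tanh/sigmoid identity elementwise (shapes and values here are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W = np.arange(12.0).reshape(4, 3) / 10.0
X = np.arange(30.0).reshape(3, 10) / 10.0     # 10 input vectors as columns

# Looped version: one matrix-vector product per column
H_loop = np.stack([np.tanh(W @ X[:, i]) for i in range(X.shape[1])], axis=1)
# Vectorized version: a single matrix-matrix product
H_vec = np.tanh(W @ X)

# tanh(x) = 2*sigmoid(2x) - 1, applied elementwise
H_via_sigmoid = 2.0 * sigmoid(2.0 * (W @ X)) - 1.0
```

The vectorized form computes the same numbers but dispatches one optimized BLAS call instead of an interpreted loop.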

Newer activation functions (e.g. ReLU and its variants) are often preferred in practice.

## Parameter Initialization

Initialize the weights to small, random values ⇒ break the symmetry.

• Init bias to 0
• Init all other weights to Uniform(-r, r).
• Xavier initialization: variance inversely proportional to the sum of the previous and next layer sizes, Var(W) = 2 / (n_in + n_out)
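A sketch of the scheme above: bias at 0, weights from Uniform(-r, r) with r chosen so that Var(W) = 2 / (n_in + n_out), i.e. r = sqrt(6 / (n_in + n_out)) since a Uniform(-r, r) variable has variance r²/3 (function name and sizes are illustrative):

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng=None):
    """Draw an (n_out, n_in) weight matrix from Uniform(-r, r),
    with r = sqrt(6 / (n_in + n_out)) so Var(W) = 2 / (n_in + n_out)."""
    rng = rng or np.random.default_rng(0)
    r = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-r, r, size=(n_out, n_in))

W = xavier_uniform(256, 128)
b = np.zeros(128)          # biases initialized to 0
```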

## Optimizers and Learning Rates

Usually plain SGD works fine, but the learning rate (lr) needs tuning. Adaptive optimizers (e.g. Adagrad, RMSprop, Adam) instead use a per-parameter learning rate.

Learning rate:

• try with powers of 10
• learning-rate decay, e.g. halve the lr every k epochs (epoch = one full pass over the training data)
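The decay schedule above can be sketched as follows (the base rate 0.1 and k = 5 are illustrative choices):

```python
def decayed_lr(base_lr, epoch, halve_every=5):
    """Halve the learning rate every `halve_every` epochs."""
    return base_lr * (0.5 ** (epoch // halve_every))
```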