[Neural Networks and Deep Learning] week4. Deep Neural Network

Deep L-layer neural network

Layer counting:

  • input layer is not counted as a layer, "layer 0"
  • last layer (layer L, output layer) is counted.

notation:

  • layer 0 = input layer
  • L = number of layers (the input layer is not counted)
  • n[l] = size (number of units) of layer l
  • a[l] = activation of layer l: a[l] = g[l]( z[l] )
  • a[0] = x, a[L] = yhat

Forward Propagation in a Deep Network

⇒ general rule, vectorized over all training examples: each matrix stacks the per-example columns, e.g. Z[l] = [z[l](1), ..., z[l](m)], one column per example ⇒

A[0] = X
for l = 1..L:
  Z[l] = W[l]A[l-1] + b[l]
  A[l] = g[l]( Z[l] )
Yhat = A[L]
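
A minimal numpy sketch of this loop, assuming the parameters are stored in a dict as "W1", "b1", ..., "WL", "bL" and that hidden layers use ReLU with a sigmoid output layer (the storage scheme and activation choices are illustrative assumptions, not fixed by the notes):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

def forward_pass(X, params, L):
    """Vectorized forward prop: X has shape (n[0], m), returns A[L] = Yhat."""
    A = X                                        # A[0] = X
    for l in range(1, L + 1):
        W, b = params["W" + str(l)], params["b" + str(l)]
        Z = np.dot(W, A) + b                     # Z[l] = W[l] A[l-1] + b[l]
        A = sigmoid(Z) if l == L else relu(Z)    # A[l] = g[l]( Z[l] )
    return A

# example usage on random data: a 2-layer net, n = [4, 3, 1], m = 5 examples
params = {"W1": np.random.randn(3, 4) * 0.01, "b1": np.zeros((3, 1)),
          "W2": np.random.randn(1, 3) * 0.01, "b2": np.zeros((1, 1))}
Yhat = forward_pass(np.random.randn(4, 5), params, L=2)     # shape (1, 5)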

Getting your matrix dimensions right

Debugging tip: walk through the matrix dimensions of the NN layer by layer; checking the shapes of W[l] and b[l] catches many bugs.

Single training example dimension:
a[l-1].shape = (n[l-1], 1)
z[l].shape = (n[l], 1)
z[l] = W[l] * a[l-1] + b[l], shape = (n[l],1)
W[l].shape = (n[l], n[l-1]), b[l].shape = (n[l],1)

Dimensions with vectorization (m examples):
Z[l] = [z[l](1),...,z[l](m)], stacking the per-example columns.
Z[l].shape = (n[l], m)
Z[l] = W[l] * A[l-1] + b[l], where b[l] keeps shape (n[l], 1) and is broadcast across the m columns.

W[l].shape = (n[l], n[l-1]) as before; Z[l].shape = A[l].shape = (n[l], m)
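
A quick way to sanity-check these shapes with random data (the layer sizes below are arbitrary, just for illustration):

import numpy as np

n = [4, 5, 3, 1]                 # n[0]..n[3]: assumed layer sizes
m = 10                           # number of training examples
A_prev = np.random.randn(n[0], m)
for l in range(1, len(n)):
    W = np.random.randn(n[l], n[l - 1])          # (n[l], n[l-1])
    b = np.zeros((n[l], 1))                      # (n[l], 1)
    Z = np.dot(W, A_prev) + b                    # b broadcast over the m columns
    assert Z.shape == (n[l], m)
    A_prev = Z                                   # g[l] would not change the shape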

Why deep representations?

intuition:

as the depth grows, layers move from simple to complex representations, i.e. from low to high levels of abstraction (e.g. edges → parts of faces → faces in an image model).

Circuit theory: there are functions a small deep network can compute that would require exponentially many hidden units in a shallow network.

Example: computing the parity of n inputs, y = x1 XOR x2 XOR ... XOR xn.

  • Using a deep NN ⇒ build a binary tree of XOR units: depth O(log n), only O(n) units in total.

  • Using a shallow NN (a single hidden layer) ⇒ the hidden layer has to enumerate on the order of 2^n configurations of the inputs, i.e. exponentially many units.

Building blocks of deep neural networks

Fwdprop and backprop, for layer l.

  • Fwdprop: from a[l-1] to a[l]

note: cache z[l] for backprop.

  • Backprop: from da[l] to da[l-1], dW[l] and db[l] (using the cached z[l])

Once the forward and backward functions are implemented for a single layer, chain the layers together: forward pass from layer 1 to L (caching as you go), then backward pass from layer L back to 1.

Forward and Backward Propagation

Fwd prop
input = a[l-1], output = a[l], cache = z[l]

Z[l] = W[l] * A[l-1] + b[l]
A[l] = g[l]( Z[l] )

Back prop
input = da[l], output = da[l-1], dW[l], db[l]
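
Concretely, the standard per-layer formulas for a single example (g[l]' is the derivative of the activation; the first line is an element-wise product):

dz[l] = da[l] * g[l]'( z[l] )
dW[l] = dz[l] * a[l-1]^T
db[l] = dz[l]
da[l-1] = W[l]^T * dz[l]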

note:

  • remember that da is shorthand for dL/da, the derivative of the loss with respect to a (it is not a differential on its own).
  • when backpropagating through a matrix product the transposes appear in reversed order, as in (AB)^T = B^T A^T; that is why dW[l] involves A[l-1]^T and dA[l-1] involves W[l]^T.
  • initial value of backprop: dA[L] = dL/dyhat (the derivative of the loss with respect to the prediction).


Vectorized version:
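
The corresponding vectorized formulas (the 1/m factor appears because the cost averages over the m examples):

dZ[l] = dA[l] * g[l]'( Z[l] )                          (element-wise)
dW[l] = (1/m) * dZ[l] * A[l-1]^T
db[l] = (1/m) * np.sum(dZ[l], axis=1, keepdims=True)
dA[l-1] = W[l]^T * dZ[l]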

Parameters vs Hyperparameters

  • parameters: W[l] and b[l] → trained from data
  • hyperparams:
    • alpha (learning_rate), number of iterations, L, n[l] size of each layer, g[l] at each layer...
    • momentum, minibatch, regularization...

→ the hyperparameters ultimately determine what the parameters W[l] and b[l] end up being.

Choosing them is empirical: try out different hyperparameter values and iterate.

What does this have to do with the brain?

a logistic regression unit ~ a (very loose) analogy to a neuron in the brain

assignment: implementing an L-layer NN

  • parameter initialization (a full loop is sketched after the snippet below):

note: np.random.randn and np.zeros take their dimensions differently:

W = np.random.randn(d0, d1) * 0.01  # dimensions passed as separate arguments
b = np.zeros((d0, d1))              # dimensions must be passed as a single tuple!
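
A full initialization loop along these lines, assuming the layer sizes are given as a list layer_dims = [n[0], ..., n[L]] (the list name and the 0.01 scaling are conventional choices; this is an illustrative sketch):

import numpy as np

def initialize_parameters(layer_dims):
    """layer_dims[l] = n[l]; returns {"W1": ..., "b1": ..., ..., "WL": ..., "bL": ...}."""
    params = {}
    L = len(layer_dims) - 1                      # input layer is not counted
    for l in range(1, L + 1):
        params["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))   # dims as a tuple
    return params

params = initialize_parameters([4, 5, 3, 1])     # e.g. a 3-layer network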
  • activation functions:

np.maximum compares two arrays element-wise, whereas np.max reduces a single array (optionally along an axis), so ReLU(x) = np.maximum(0, x); see the sketch below.
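
A minimal sketch of the two activations, plus the dZ = dA * g'(Z) helpers the backward pass will need (the helper names are illustrative):

import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def relu(Z):
    return np.maximum(0, Z)          # element-wise comparison with 0

def sigmoid_backward(dA, Z):
    s = sigmoid(Z)
    return dA * s * (1 - s)          # dZ = dA * g'(Z), with g'(z) = s(z)(1 - s(z))

def relu_backward(dA, Z):
    dZ = np.array(dA, copy=True)
    dZ[Z <= 0] = 0                   # g'(z) = 0 for z <= 0, 1 otherwise
    return dZ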

  • fwd prop (one linear + activation step per layer, sketched below):
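
A sketch of one forward step (linear part plus activation), reusing the relu/sigmoid sketched above and caching what the backward pass will need; the function name and cache layout are illustrative assumptions:

def linear_activation_forward(A_prev, W, b, activation):
    """One layer: A_prev of shape (n[l-1], m) -> A of shape (n[l], m)."""
    Z = np.dot(W, A_prev) + b                        # linear part
    A = relu(Z) if activation == "relu" else sigmoid(Z)
    cache = (A_prev, W, b, Z)                        # everything backprop needs
    return A, cache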

  • cost (cross-entropy, sketched below):
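
The cross-entropy cost in numpy; AL = A[L] has shape (1, m), Y holds the 0/1 labels, and np.squeeze just turns the 1x1 result into a scalar:

def compute_cost(AL, Y):
    m = Y.shape[1]
    cost = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m
    return np.squeeze(cost)          # e.g. [[0.693]] -> 0.693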

  • backprop formula (one backward step per layer, sketched below):
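
A sketch of one backward step, implementing the vectorized formulas above and mirroring the forward cache (relu_backward / sigmoid_backward are the helpers sketched earlier):

def linear_activation_backward(dA, cache, activation):
    A_prev, W, b, Z = cache
    m = A_prev.shape[1]
    dZ = relu_backward(dA, Z) if activation == "relu" else sigmoid_backward(dA, Z)
    dW = np.dot(dZ, A_prev.T) / m                    # dW[l] = (1/m) dZ[l] A[l-1]^T
    db = np.sum(dZ, axis=1, keepdims=True) / m       # db[l] = (1/m) sum over examples
    dA_prev = np.dot(W.T, dZ)                        # dA[l-1] = W[l]^T dZ[l]
    return dA_prev, dW, db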

  • initial value of backprop, dA[L]:


dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
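
(This is just dL/dA[L] for the cross-entropy cost: for L = -(y log a + (1 - y) log(1 - a)), the derivative with respect to a is -(y/a - (1 - y)/(1 - a)), applied element-wise to AL.)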
