Deep L-layer neural network

Layer counting:

  • the input layer is not counted as a layer; it is "layer 0".
  • the last layer (layer L, the output layer) is counted.

notation:

  • layer 0 = input layer
  • L = number of layers
  • n^[l] = size of layer l
  • a^[l] = activation of layer l = g^[l](z^[l]) → a ...
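
The notation can be made concrete with a small forward pass. This is only a sketch: it assumes the usual definition z^[l] = W^[l] a^[l-1] + b^[l] (not spelled out in these notes), and the layer sizes, weights and activation choices are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def layer_forward(W, b, a_prev, g):
    """Compute a^[l] = g^[l](z^[l]), assuming z^[l] = W^[l] a^[l-1] + b^[l]."""
    z = [sum(w_ij * a_j for w_ij, a_j in zip(row, a_prev)) + b_i
         for row, b_i in zip(W, b)]
    return [g(z_i) for z_i in z]

# Hypothetical network with L = 2: n^[0] = 3, n^[1] = 2, n^[2] = 1.
params = [
    ([[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]], [0.0, 0.1], relu),  # layer 1 (hidden)
    ([[1.0, -1.0]], [0.0], sigmoid),                           # layer 2 (output)
]

a = [1.0, 2.0, 3.0]           # a^[0] = x, the input layer (not counted)
for W, b, g in params:
    a = layer_forward(W, b, a, g)

L = len(params)               # number of layers, input layer excluded
print(L, a)
```

Here `L` comes out as 2 even though three "columns" of units are involved, which matches the counting rule above.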

Neural Networks Overview

new notation:

  • superscript [i] for quantities in layer i. (compared to superscript (i) for ith training example).
  • subscript i for ith unit in a layer

Neural Network Representation


  • a^[i]: activation at layer i.
  • input layer: x, layer 0.
  • hidden layer
  • output layer: prediction (yhat)
  • don ...

This week: logistic regression.

Binary Classification & notation

ex. cat classifier from an image. Image pixels: 64x64x3 ⇒ unroll (flatten) into a feature vector x of dimension n = 64x64x3 = 12288 (the input dimension)
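
A minimal sketch of the unrolling step, assuming the image is stored as a nested list `image[row][col][channel]` (a hypothetical layout; the traversal order used here is also an assumption, any fixed order works as long as it is applied consistently):

```python
# Hypothetical 64x64 RGB image, filled with zeros for illustration.
height, width, channels = 64, 64, 3
image = [[[0 for _ in range(channels)] for _ in range(width)]
         for _ in range(height)]

def flatten(img):
    """Unroll a height x width x channels image into one flat feature vector."""
    return [value for row in img for pixel in row for value in pixel]

x = flatten(image)
n = len(x)      # input dimension
print(n)        # 12288
```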


  • superscript (i) for ith example, e.g. x^(i)
  • superscript [l] for lth layer, e.g. w^[l]
  • m: number of ...

What is a neural network?

Example: housing price prediction.

Each neuron: ReLU function

Stacking multiple layers of neurons: hidden layers represent concepts more general than the input layer, and they are found automatically by the NN.

Supervised Learning with Neural Networks

supervised learning: during training, every input comes with its corresponding output.

Different NN types are ...

problems with text:

  1. often a very rare word is the important one, e.g. retinopathy
  2. ambiguity: different words can mean the same thing, e.g. cat and kitty

→ would need a lot of labeled data ⇒ not realistic.
⇒ use unsupervised learning

similar words appear in similar context.
embedding: map words to small vectors

measure the closeness by cosine distance:
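
As a rough illustration of measuring closeness by the cosine: the sketch below computes the cosine similarity of two vectors (dot product divided by the product of the norms; the cosine distance is commonly taken as 1 minus this value). The three "embeddings" are invented numbers, not learned vectors.

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|); 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings, for illustration only.
cat = [0.9, 0.1, 0.3]
kitty = [0.8, 0.2, 0.35]
car = [-0.5, 0.9, 0.0]

sim_cat_kitty = cosine_similarity(cat, kitty)
sim_cat_car = cosine_similarity(cat, car)
print(sim_cat_kitty > sim_cat_car)   # True: cat is closer to kitty than to car
```

Because only the angle matters, the length of the vectors does not affect the result.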


initial: random vector ...

statistical invariance → weight sharing
e.g. image colors, translation invariance...


A convnet is a NN that shares its weights across space.

convolution: slide a small patch of NN over the image to produce a new "image"
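
The sliding-patch idea can be sketched in a few lines. This is a simplified version under stated assumptions: a single channel, stride 1, no padding, and a purely linear patch (no bias or non-linearity). As is usual in deep learning, the kernel is not flipped, so strictly speaking this is a cross-correlation. The image and kernel values are made up.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding); return the new "image"."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # The same kernel weights are reused at every position (weight sharing).
            total = sum(image[i + di][j + dj] * kernel[di][dj]
                        for di in range(kh) for dj in range(kw))
            row.append(total)
        output.append(row)
    return output

image = [
    [1, 2, 3, 0],
    [4, 5, 6, 0],
    [7, 8, 9, 0],
]
kernel = [
    [1, 0],
    [0, -1],
]
feature_map = convolve2d(image, kernel)
print(feature_map)   # [[-4, -4, 3], [-4, -4, 6]]
```

The output is smaller than the input (2x3 instead of 3x4) because no padding is used.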

A convnet forms a pyramid: each "stack of pancakes" gets larger in depth and smaller in area.

convolutional lingo ...

Linear models

matrix multiplication: fast with GPU
numerically stable
cannot concatenate linear units → equivalent to one big matrix...

⇒ add non-linear units in between

rectified linear units (ReLU)
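
A small check of why the non-linearity is needed, using made-up weights `W1`, `W2` and input `x`: two stacked linear layers give the same result as the single matrix W2·W1, while putting a ReLU in between changes the output.

```python
def matvec(M, v):
    """Matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Matrix-matrix product."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    """Element-wise max(0, x)."""
    return [max(0.0, x) for x in v]

# Hypothetical weights and input, for illustration only.
W1 = [[1.0, -1.0], [-1.0, 1.0]]
W2 = [[1.0, 1.0]]
x = [2.0, 0.0]

stacked = matvec(W2, matvec(W1, x))          # two linear layers in a row
collapsed = matvec(matmul(W2, W1), x)        # one big matrix W2 * W1
with_relu = matvec(W2, relu(matvec(W1, x)))  # ReLU in between

print(stacked, collapsed, with_relu)   # [0.0] [0.0] [2.0]
```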

chain rule: efficient computationally

back propagation

easy to compute the gradient as long as the function Y(X) is made of simple blocks ...
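
A toy illustration of the chain rule on two simple blocks, a linear block followed by a ReLU. The function `forward_backward` and the numbers are hypothetical; the finite-difference check at the end is only there to confirm the analytic gradient.

```python
def forward_backward(w, b, x):
    """Forward pass y = relu(w*x + b), then gradients of y via the chain rule."""
    # Forward: each block keeps what the backward pass will need.
    z = w * x + b            # linear block
    y = max(0.0, z)          # ReLU block
    # Backward: multiply the local derivatives, from the output back to the input.
    dy_dz = 1.0 if z > 0 else 0.0
    dy_dw = dy_dz * x        # dz/dw = x
    dy_db = dy_dz * 1.0      # dz/db = 1
    return y, dy_dw, dy_db

y, dw, db = forward_backward(w=2.0, b=1.0, x=3.0)
print(y, dw, db)   # 7.0 3.0 1.0

# Numerical check of dy/dw with a small finite difference.
eps = 1e-6
y_plus, _, _ = forward_backward(2.0 + eps, 1.0, 3.0)
numeric_dw = (y_plus - y) / eps
```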

These are notes on the Udacity deep learning course. They are very rough, and the course itself is only introductory...

Softmax function

scores y_i ⇒ probabilities p_i

property: smaller scores ⇒ less certain about result
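
A minimal sketch of softmax, p_i = exp(y_i) / Σ_j exp(y_j), with made-up scores. Subtracting the maximum score before exponentiating is a common numerical-stability trick and does not change the result. Scaling the scores down and up illustrates the property above.

```python
import math

def softmax(scores):
    """Turn scores y_i into probabilities p_i = exp(y_i) / sum_j exp(y_j)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]   # shift by max for stability
    total = sum(exps)
    return [e / total for e in exps]

scores = [3.0, 1.0, 0.2]                      # hypothetical scores
p = softmax(scores)
p_small = softmax([s / 10 for s in scores])   # smaller scores: closer to uniform
p_large = softmax([s * 10 for s in scores])   # larger scores: closer to 0/1

print(max(p_small) < max(p) < max(p_large))   # True
```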

Onehot encoding

Cross entropy

measures how well the probability vector S corresponds to the label vector L ⇒ cross entropy D(S,L) (D >= 0, the smaller the better ...
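
A short sketch tying one-hot encoding and cross entropy together, assuming the usual definition D(S, L) = -Σ_i L_i log(S_i). The helper names and the probability vectors are made up for illustration.

```python
import math

def one_hot(index, num_classes):
    """Label vector with 1.0 at the correct class and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

def cross_entropy(S, L):
    """D(S, L) = -sum_i L_i * log(S_i), assuming S holds probabilities."""
    # Skip terms where L_i = 0 so that log(0) is never evaluated for them.
    return -sum(l * math.log(s) for s, l in zip(S, L) if l > 0)

label = one_hot(0, 3)                           # correct class is class 0
good = cross_entropy([0.7, 0.2, 0.1], label)    # confident and right
bad = cross_entropy([0.1, 0.2, 0.7], label)     # confident and wrong

print(good < bad)   # True: the better prediction has the smaller D
```

Note that D(S, L) is not symmetric: S goes inside the logarithm, so it must not contain zeros at positions where L is 1.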