This week: two special applications of ConvNets.
What is face recognition?
Face verification & face recognition
- verification: input = an image and a claimed ID → output whether the image really is that person.
- recognition: database = K persons, input = an image → output = the ID of the person in the image among the K persons, or "not recognized".
→ the verification system's accuracy needs to be very high before it can be used for recognition, since its error is effectively multiplied across the K persons in the database.
One Shot Learning
"one shot": learn from just one example, and able to recognize the person again.
A CNN+softmax is not practical, e.g. when new images are added to database, output_dim will change...
⇒ instead, learn a similarity function.
d(img1, img2) = degree of difference between the two images; compare it against a threshold to decide "same person" vs "different person".
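A minimal sketch of that decision rule, assuming a learned encoding function `f` and a tuned threshold `tau` (both are hypothetical placeholders here, not values from the course):

```python
import numpy as np

def verify(img1, img2, f, tau=0.7):
    """Return True if the two images are judged to be the same person.

    f   : learned encoding function (hypothetical), maps image -> 1-D vector
    tau : decision threshold on the distance d(img1, img2) (illustrative value)
    """
    d = np.sum((f(img1) - f(img2)) ** 2)   # squared L2 distance between encodings
    return d < tau
```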
To learn a dissimilarity function: Siamese Network.
Use a ConvNet + FC layers to encode a picture x into a vector f(x)
⇒ define the dissimilarity function d(x1, x2) as the distance between the encoded vectors.
want to learn NN params for the encoding f(x) such that:
when x1 and x2 are the same person, d(x1, x2) = ||f(x1) - f(x2)||² is small; otherwise it is large.
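A minimal sketch of the shared-encoder ("Siamese") setup; the toy linear map below stands in for the real CNN+FC encoder, and the 128-d output size and weights W are illustrative assumptions, not the course's actual architecture:

```python
import numpy as np

def encode(x, W):
    """Stand-in for the encoder f(x): a linear map + L2 normalization.
    In practice f is a deep ConvNet whose last FC layer outputs e.g. a 128-d vector."""
    v = W @ x.ravel()
    return v / np.linalg.norm(v)

def d(x1, x2, W):
    """Dissimilarity: squared L2 distance between the two encodings (shared weights W)."""
    return np.sum((encode(x1, W) - encode(x2, W)) ** 2)

# toy usage: two random "images" and random encoder weights
rng = np.random.default_rng(0)
W = rng.standard_normal((128, 64 * 64 * 3))
x1, x2 = rng.random((64, 64, 3)), rng.random((64, 64, 3))
print(d(x1, x2, W))
```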
Triplet loss: a triplet = (anchor A, positive P, negative N).
want f(A) similar to f(P) and different from f(N):
d(A, P) - d(A, N) <= 0
⇒ to prevent the NN from learning a trivial solution (i.e. all encodings identical, so d(A, P) = d(A, N) = 0), add a margin alpha:
d(A, P) - d(A, N) + alpha <= 0
Loss function definition:
similar to hinge loss:
L = max(0, d(A,P)-d(A,N) + alpha)
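A minimal sketch of this loss on a batch of precomputed encodings (plain NumPy; during training it would of course be differentiated through the encoder):

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Triplet loss L = max(d(A,P) - d(A,N) + alpha, 0), averaged over the batch.

    f_a, f_p, f_n : (batch, dim) arrays of anchor / positive / negative encodings
    alpha         : margin (illustrative value)
    """
    d_ap = np.sum((f_a - f_p) ** 2, axis=1)   # d(A, P) per triplet
    d_an = np.sum((f_a - f_n) ** 2, axis=1)   # d(A, N) per triplet
    return np.mean(np.maximum(d_ap - d_an + alpha, 0.0))
```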
→ Generate triplets from the dataset and feed them to the NN.
note: here we do need more than one picture per person (Anchor and Positive).
Choosing triplets A, P, N
If A, P, N are chosen randomly, the constraint is easily satisfied → the NN won't learn much.
→ choose A, P, N that are "hard" to train on; this improves the computational efficiency of learning.
(details presented in the FaceNet paper)
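One simple way to pick such a "hard" (here, semi-hard) negative for a given anchor/positive pair, sketched on precomputed encodings; the actual mining strategy in FaceNet is more involved than this:

```python
import numpy as np

def pick_hard_negative(f_a, f_p, f_candidates, alpha=0.2):
    """Among candidate negatives, pick one that (almost) violates the margin:
    d(A, N) small but still > d(A, P) ("semi-hard"); fall back to the hardest one.

    f_a, f_p     : (dim,) anchor / positive encodings
    f_candidates : (n, dim) encodings of images of *other* people
    """
    d_ap = np.sum((f_a - f_p) ** 2)
    d_an = np.sum((f_candidates - f_a) ** 2, axis=1)
    semi_hard = np.where((d_an > d_ap) & (d_an < d_ap + alpha))[0]
    idx = semi_hard[np.argmin(d_an[semi_hard])] if semi_hard.size else np.argmin(d_an)
    return f_candidates[idx]
```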
in practice: better to download a pretrained model.
Face Verification and Binary Classification
Triplet loss is one way of doing face verification.
another approach that works just as well: treat face verification as a binary classification problem.
Given input images x1 and x2 → compute the encodings f(x1) and f(x2), take their element-wise difference |f(x1) - f(x2)|, and feed it to a logistic regression unit that outputs "same person" (1) or "different" (0).
computation trick: precompute the encodings of the images in the database, so at inference time only the new image has to be encoded.
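A minimal sketch of that logistic-regression head, assuming the encodings are already computed and the parameters w, b have been learned (the values passed in here would be placeholders):

```python
import numpy as np

def same_person_prob(f1, f2, w, b):
    """Binary-classification view: y_hat = sigmoid( w . |f(x1) - f(x2)| + b ).

    f1, f2 : (dim,) precomputed encodings of the two images
    w, b   : logistic-regression parameters (learned; hypothetical here)
    """
    z = w @ np.abs(f1 - f2) + b
    return 1.0 / (1.0 + np.exp(-z))
```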
II - Neural Style Transfer
What is neural style transfer?
content image C + style image S → generated image G
What are deep ConvNets learning?
visualize hidden units of different layers
Pick one unit in 1st layer → find the nine image patches that maximize the unit's activation.
In deeper layers, units see larger image patches → gather the nine argmax image patches as before.
Deeper layers detect higher-level shapes and patterns.
Define a cost function for the generated image.
J(G) measures how good the generated image is; it contains two parts:
J_content(C, G): how similar G is to the content image C
J_style(S, G): how similar G's style is to the style image S
→ J(G) = alpha * J_content(C, G) + beta * J_style(S, G)
Find generated image G
(similar to embedding?)
- initialize G randomly
- use gradient descent to minimize J(G)
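A minimal sketch of that outer loop, treating the pixels of G as the parameters; `grad_J` is a placeholder for the gradient of J(G) = alpha·J_content + beta·J_style, which in practice comes from backprop through the pretrained ConvNet (alpha, beta, lr, steps are illustrative values):

```python
import numpy as np

def style_transfer(C, S, grad_J, alpha=10.0, beta=40.0, lr=2.0, steps=200):
    """Gradient descent directly on the pixels of G.

    C, S   : content / style images, arrays of shape (H, W, 3), values in [0, 1]
    grad_J : placeholder function returning dJ/dG for
             J(G) = alpha * J_content(C, G) + beta * J_style(S, G)
             (in practice obtained by autodiff through a pretrained ConvNet)
    """
    G = np.random.uniform(0.0, 1.0, C.shape)      # initialize G randomly
    for _ in range(steps):
        G -= lr * grad_J(G, C, S, alpha, beta)    # one gradient-descent step on the pixels
    return np.clip(G, 0.0, 1.0)
```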
Content Cost Function
Given a pretrained ConvNet, use a hidden layer l to compute J_content; the depth of layer l controls the level of detail the content match focuses on (usually a layer in the middle of the network).
Define J_content(C, G) = difference (e.g. squared L2-norm) between the layer-l activations of image C and of image G.
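A minimal sketch of that content cost on precomputed layer-l activation volumes (the 1/2 scaling is just one common convention):

```python
import numpy as np

def content_cost(a_C, a_G):
    """J_content(C, G) = 1/2 * || a^[l](C) - a^[l](G) ||^2

    a_C, a_G : layer-l activation volumes of shape (n_H, n_W, n_C)
    """
    return 0.5 * np.sum((a_C - a_G) ** 2)
```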
Style Cost Function
Use a hidden layer l to measure "style".
→ style defined as correlation between activations across channels.
e.g. layer l's activation volume has n_C = 5 channels, each a slice of size n_H × n_W.
correlation between 2 channels ≈ which high-level features tend to occur / not occur together in an image.
a_ijk = (scalar) activation at height i, width j, channel k.
- "Style matrix)
G[l], (G for "Gram matrix") shape =
n_C * n_C, measures how correlated any two channels are. (i.e. G[l] measures the degree to which the activations of different feature detectors in layer l vary (or correlate) together with each other.)
G[l]_kk' = correlation between the kth and k'th channels := sum over i, j of (a_ijk * a_ijk')
mathematically, this "correlation" is an unnormalized cross-covariance (without subtracting the mean).
Compute G for both the style image and the generated image.
→ J_style^[l](S, G) = difference (squared Frobenius norm) between G[l](S) and G[l](G)
In practice, take J_style over multiple layers and sum them with weights: J_style(S, G) = sum over l of lambda^[l] * J_style^[l](S, G).
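A minimal sketch of the Gram matrix and the per-layer and multi-layer style costs; the 1/(2·n_H·n_W·n_C)² normalization is one common convention, and the layer weights `lambdas` are assumed inputs:

```python
import numpy as np

def gram_matrix(a):
    """Style ("Gram") matrix G[l] of shape (n_C, n_C): G_kk' = sum_ij a_ijk * a_ijk'.

    a : layer-l activation volume of shape (n_H, n_W, n_C)
    """
    n_H, n_W, n_C = a.shape
    A = a.reshape(n_H * n_W, n_C)      # each column = one channel unrolled over positions
    return A.T @ A

def style_cost_layer(a_S, a_G):
    """J_style^[l](S, G): squared Frobenius norm of G(S) - G(G), normalized."""
    n_H, n_W, n_C = a_S.shape
    diff = gram_matrix(a_S) - gram_matrix(a_G)
    return np.sum(diff ** 2) / (2.0 * n_H * n_W * n_C) ** 2

def style_cost(acts_S, acts_G, lambdas):
    """Total style cost: weighted sum of per-layer style costs over several layers."""
    return sum(lam * style_cost_layer(a_S, a_G)
               for lam, a_S, a_G in zip(lambdas, acts_S, acts_G))
```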
1D and 3D Generalizations
images: 2D data.
→ generalize convolution to 1D and 3D data.
e.g. EKG data (heart voltage).
(note: most 1D data is handled with RNNs / sequence models...)
3D data: has height, width, and depth,
e.g. CT scan data; movie data (frame by frame)
→ the convolution operation generalizes directly from 2D (the filters become 3D as well).
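A minimal sketch of a 1D "valid" convolution over an EKG-like signal; the same output-size rule (n - f + 1 per spatial dimension, for stride 1 and no padding) carries over to 2D and 3D:

```python
import numpy as np

def conv1d_valid(x, w):
    """'Valid' 1D convolution (really cross-correlation, as in ConvNets).

    x : (n,) signal, e.g. an EKG trace;  w : (f,) filter  ->  output length n - f + 1
    """
    n, f = len(x), len(w)
    return np.array([np.dot(x[i:i + f], w) for i in range(n - f + 1)])

# e.g. a 14-sample signal with a 5-wide filter -> 10 outputs; the same
# (n - f + 1) rule applies per dimension for 2D images and 3D volumes (CT, video).
print(conv1d_valid(np.arange(14.0), np.ones(5)).shape)   # (10,)
```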
Part 13 of series «Andrew Ng Deep Learning MOOC»:
- [Neural Networks and Deep Learning] week1. Introduction to deep learning
- [Neural Networks and Deep Learning] week2. Neural Networks Basics
- [Neural Networks and Deep Learning] week3. Shallow Neural Network
- [Neural Networks and Deep Learning] week4. Deep Neural Network
- [Improving Deep Neural Networks] week1. Practical aspects of Deep Learning
- [Improving Deep Neural Networks] week2. Optimization algorithms
- [Improving Deep Neural Networks] week3. Hyperparameter tuning, Batch Normalization and Programming Frameworks
- [Structuring Machine Learning Projects] week1. ML Strategy (1)
- [Structuring Machine Learning Projects] week2. ML Strategy (2)
- [Convolutional Neural Networks] week1. Foundations of Convolutional Neural Networks
- [Convolutional Neural Networks] week2. Deep convolutional models: case studies
- [Convolutional Neural Networks] week3. Object detection
- [Convolutional Neural Networks] week4. Special applications: Face recognition & Neural style transfer
- [Sequence Models] week1. Recurrent Neural Networks
- [Sequence Models] week2. Natural Language Processing & Word Embeddings
- [Sequence Models] week3. Sequence models & Attention mechanism