Computer Vision
CV with DL: rapid progress in past two years.
CV problems:
- image classification
 - object detection: bounding box of objects
 - neural style transfer
 
input features can be very high-dimensional: e.g. 1000x1000 RGB image → 3 million input features ⇒ if the 1st layer has 1000 hidden units → 3 billion params for the first layer alone...
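A quick sanity check of that count (plain arithmetic, using the numbers from the example above):

```python
# Parameter count for a fully connected first layer on a 1000x1000 RGB image.
n_input = 1000 * 1000 * 3       # 3,000,000 input features
n_hidden = 1000                 # hidden units in the 1st layer
n_weights = n_input * n_hidden  # weight matrix alone, biases excluded
print(f"{n_weights:,}")         # 3,000,000,000
```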
Fundamental operation: convolution.
Edge Detection Example
Motivating example for the convolution operation: detecting vertical edges.
Convolve the image with a filter (kernel) matrix:

Each element in the resulting matrix: the sum of the element-wise product of the filter and the image patch it currently covers.
Why can this filter detect vertical edges?
 
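To make this concrete, here is a minimal numpy sketch of the vertical-edge example (my own illustration, assuming a square grayscale image; not the course's code):

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2-D cross-correlation (what DL calls convolution); square inputs."""
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+f, j:j+f] * kernel)
    return out

# 6x6 image: bright (10) on the left half, dark (0) on the right half
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)

# the vertical-edge filter from the lecture
vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]], dtype=float)

print(conv2d(image, vertical_edge))
# columns 1 and 2 of the 4x4 output are 30: exactly where the
# light-to-dark transition sits in the input.
```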
More Edge Detection
Positive vs. negative edges:
light to dark vs. dark to light (the filter's response flips sign)
 
Instead of hand-picking the filter values, treat them as parameters and learn them from data.
Next: discuss some building blocks of CNNs (padding, striding, pooling...)
Padding
Earlier: image shrinks after convolution.
Input n*n image, convolve with f*f filter ⇒ output shape = (n-f+1) * (n-f+1)
downsides:
- the image shrinks with every convolution (after e.g. 100 layers it becomes very small)
 - pixels at the corners/edges contribute to fewer output values
 
⇒ pad the image so that the output keeps the input's shape.
if p = padding amount (width of padded border)
→ output shape = (n+2p-f+1) * (n+2p-f+1) 
Terminology: valid and same convolutions:
- valid convolution: no padding, output shape = (n-f+1) * (n-f+1)
 - same convolution: output size equals input size, i.e. padding
p = (f-1) / 2 (only works when f is odd; odd filter sizes are also a CV convention, partly because an odd filter has a central pixel)
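A quick numeric check of both conventions (a sketch; n=6, f=3 as in the lecture's running example):

```python
n, f = 6, 3

# valid convolution: no padding
print(n - f + 1)          # 4

# same convolution: choose p so the output size equals the input size
p = (f - 1) // 2          # 1 for a 3x3 filter (f must be odd)
print(n + 2 * p - f + 1)  # 6, same as the input
```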
Strided Convolutions
Example stride = 2 in convolution:

(convention: the filter must stay entirely within the (padded) image; stop before it crosses the border.)
if input image n*n, filter size f*f, padding p, stride s
⇒ output shape = (floor((n+2p-f)/s) + 1) * (floor((n+2p-f)/s) + 1) 
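The formula as a small helper function (a sketch; the 7x7 input with f=3, s=2 is an illustrative example):

```python
from math import floor

def conv_output_size(n, f, p=0, s=1):
    """Side length of the output when an n*n input is convolved
    with an f*f filter, padding p and stride s."""
    return floor((n + 2 * p - f) / s) + 1

print(conv_output_size(7, 3, p=0, s=2))  # 3
```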
 
Note on cross-correlation vs. convolution
In math textbooks, "convolution" first flips the filter in both directions (horizontally and vertically) before sliding it over the input.

The operation described above is, strictly speaking, called "cross-correlation"; the DL literature calls it convolution anyway.
(Why math does the flipping: it makes convolution associative: (A*B)*C = A*(B*C).)
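The relationship is easy to verify with scipy.signal, which provides both operations (a sketch with random inputs):

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

image = np.random.rand(5, 5)
kernel = np.random.rand(3, 3)

corr = correlate2d(image, kernel, mode='valid')              # the DL "convolution"
conv = convolve2d(image, kernel, mode='valid')               # textbook convolution
flipped = correlate2d(image, np.flip(kernel), mode='valid')  # flip both axes, then correlate

print(np.allclose(conv, flipped))  # True
```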
Convolutions Over Volume
example: convolution on an RGB image
image size = 6*6*3 (height * width * #channels)
filter size = 3*3*3 (convention: the filter's #channels matches the image's)
output size = 4*4 (*1); the output is 2D for each filter.
 
Multiple filters:
- use more than one filter
 - stack the outputs together to form an output volume.
 
 
Summary of dimensions:
input shape = n*n*n_c
filter shape = f*f*n_c 
#filters = n_c'
⇒ output shape = (n-f+1) * (n-f+1) * n_c'  
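A minimal numpy sketch of convolution over a volume with a stack of filters (my own illustration; here the filter stack is stored as (n_c', f, f, n_c), written n_cp in code — the course's notation may order dimensions differently):

```python
import numpy as np

def conv_volume(image, filters):
    """Valid convolution of an n*n*n_c image with n_c' filters of
    shape f*f*n_c; output shape (n-f+1)*(n-f+1)*n_c'."""
    n, _, n_c = image.shape
    n_cp, f = filters.shape[0], filters.shape[1]
    out = np.zeros((n - f + 1, n - f + 1, n_cp))
    for k in range(n_cp):               # each filter yields one 2-D output slice
        for i in range(n - f + 1):
            for j in range(n - f + 1):
                out[i, j, k] = np.sum(image[i:i+f, j:j+f, :] * filters[k])
    return out

image = np.random.rand(6, 6, 3)
filters = np.random.rand(2, 3, 3, 3)      # 2 filters, each 3*3*3
print(conv_volume(image, filters).shape)  # (4, 4, 2)
```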
One Layer of a Convolutional Network
For each filter's output: add bias b, then apply nonlinear activation function.
One layer of a CNN:
 
Analogy to a normal NN:
- linear operation (matrix multiplication vs. convolution)
 - bias
 - nonlinear activation
 - difference: the number of parameters doesn't depend on the input dimension, even for very large images.
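Putting it together, one conv layer is conv_volume (from the sketch above) plus bias and ReLU. The 10-filter example below shows the fixed parameter count: (3*3*3 + 1) * 10 = 280 params, no matter how large the image is (sizes are illustrative):

```python
import numpy as np

def conv_layer(a_prev, W, b):
    """One CNN layer: convolution (linear step), bias, then ReLU.
    Reuses conv_volume from the previous sketch. W: (n_c', f, f, n_c), b: (n_c',)."""
    z = conv_volume(a_prev, W) + b  # bias broadcasts over each output channel
    return np.maximum(z, 0)         # ReLU

W = np.random.randn(10, 3, 3, 3) * 0.01  # 10 filters: (3*3*3 + 1)*10 = 280 params
b = np.zeros(10)
print(conv_layer(np.random.rand(39, 39, 3), W, b).shape)  # (37, 37, 10)
```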
 
Notation summary:

note the ordering of dimensions: (example index, height, width, #channels).
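For instance, with this ordering a batch of activations would be allocated like so (sizes hypothetical):

```python
import numpy as np

m, n_H, n_W, n_C = 32, 28, 28, 6
A = np.zeros((m, n_H, n_W, n_C))  # (example index, height, width, #channels)
print(A.shape)                    # (32, 28, 28, 6)
```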
Simple Convolutional Network Example
 
general trend: going into later layers, the spatial size shrinks while #channels increases.
Pooling Layers
Pooling layers shrink the representation and make the detected features more robust.
Max pooling
divide input into regions, take max of each region.  
- Hyperparams: filter size f, stride s, padding p.
(common choice) f=2 or 3, s=2, p=0.
- note: no params to learn for a max pooling layer; pooling layers are usually not counted in #layers (conv + pool treated as a single layer)
 

Intuition: a large value indicates a detected feature in that region → it is preserved after pooling.
The dimension formula floor((n+2p-f)/s) + 1 holds for POOL layers as well.
Output of max pooling has the same #channels as the input (max pooling is applied to each channel independently).
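A minimal numpy sketch of max pooling, applied per channel (my own illustration):

```python
import numpy as np

def max_pool(a, f=2, s=2):
    """Max pooling over f*f regions with stride s, each channel independently."""
    n_H, n_W, n_C = a.shape
    out_H, out_W = (n_H - f) // s + 1, (n_W - f) // s + 1
    out = np.zeros((out_H, out_W, n_C))
    for i in range(out_H):
        for j in range(out_W):
            patch = a[i*s:i*s+f, j*s:j*s+f, :]
            out[i, j, :] = patch.max(axis=(0, 1))  # max within the region, per channel
    return out

print(max_pool(np.random.rand(28, 28, 6)).shape)  # (14, 14, 6); no params learned
```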
Average pooling
Less often used than max pooling.
Typical use case: collapse a 7*7*1000 activation volume into 1*1*1000.
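That collapse is just a mean over the spatial dimensions (a sketch of the 7*7*1000 example):

```python
import numpy as np

a = np.random.rand(7, 7, 1000)
collapsed = a.mean(axis=(0, 1), keepdims=True)  # average over height and width
print(collapsed.shape)                          # (1, 1, 1000)
```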
CNN Example
LeNet-5
 
 
Why Convolutions?
Two main advantages of CONV over FC layers: parameter sharing and sparsity of connections (a parameter-count comparison is sketched at the end of this section).
Parameter sharing:
A feature detector useful in one part of the image is probably useful in another part as well.
→ no need to learn separate feature detectors for different parts of the image.
Sparsity of connections:
Each output value depends only on a small number of inputs (the pixels near that position).
- Invariance to translation...
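The parameter-count comparison mentioned above (plain arithmetic; the 32x32x3 → 28x28x6 sizes follow the lecture's example):

```python
# CONV: 6 filters of size 5x5x3, each with one bias term
conv_params = (5 * 5 * 3 + 1) * 6
# FC: every input unit (32*32*3) connected to every output unit (28*28*6)
fc_params = (32 * 32 * 3) * (28 * 28 * 6)
print(f"{conv_params:,}")  # 456
print(f"{fc_params:,}")    # 14,450,688
```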
 
Part 10 of series «Andrew Ng Deep Learning MOOC»:
- [Neural Networks and Deep Learning] week1. Introduction to deep learning
 - [Neural Networks and Deep Learning] week2. Neural Networks Basics
 - [Neural Networks and Deep Learning] week3. Shallow Neural Network
 - [Neural Networks and Deep Learning] week4. Deep Neural Network
 - [Improving Deep Neural Networks] week1. Practical aspects of Deep Learning
 - [Improving Deep Neural Networks] week2. Optimization algorithms
 - [Improving Deep Neural Networks] week3. Hyperparameter tuning, Batch Normalization and Programming Frameworks
 - [Structuring Machine Learning Projects] week1. ML Strategy (1)
 - [Structuring Machine Learning Projects] week2. ML Strategy (2)
 - [Convolutional Neural Networks] week1. Foundations of Convolutional Neural Networks
 - [Convolutional Neural Networks] week2. Deep convolutional models: case studies
 - [Convolutional Neural Networks] week3. Object detection
 - [Convolutional Neural Networks] week4. Special applications: Face recognition & Neural style transfer
- [Sequence Models] week1. Recurrent Neural Networks
 - [Sequence Models] week2. Natural Language Processing & Word Embeddings
 - [Sequence Models] week3. Sequence models & Attention mechanism
 