[Convolutional Neural Networks] week1. Foundations of Convolutional Neural Networks


Computer Vision

CV with DL: rapid progress in past two years.

CV problems:

  • image classification
  • object detection: bounding box of objects
  • neural style transfer

input features could be very high dimension: e.g. 1000x1000 image → 3 million dim input ⇒ if 1st layer has 1000 hidden units → 3 billion params for first layer...

foundamental operation: convolution.

Edge Detection Example

Motivating example for convolution operation: detecting vertical edges.

Convolve image with a filter(kernel) matrix:

Each element in resulting matrix: sum(element-wise multiplication of filter and input image).

Why the filter can detect vertical edge?

More Edge Detection

Positive V.S. negative edges:
dark to light V.S. light to dark

Instead of picking filter by hand, the actual parameters can be learned by ML.
Next: discuss some building blocks of CNN, padding/striding/pooling...


Earlier: image shrinks after convolution.
Input n*n image, convolve with f*f filter ⇒ output shape = (n-f+1) * (n-f+1)

  • image shrinks on every step (if 100 layer → shrinks to very small images)
  • pixels at corner are less used in the output

pad the image so that output shape is invariant.

if p = padding amount (width of padded border)
→ output shape = (n+2p-f+1) * (n+2p-f+1)

Terminology: valid and same convolutions:

  • valid convolution: no padding, output shape = (n-f+1) * (n-f+1)
  • same convolution: output size equals input size. i.e. filter width p = (f-1) / 2 (only works when f is odd — this is also a convention in CV, partially because this way there'll be a central filter)

Strided Convolutions

Example stride = 2 in convolution:

(convention: stop moving if filter goes out of image border.)

if input image n*n, filter size f*f, padding = f, stride = s
⇒ output shape = (floor((n+2p-f)/s) + 1) * (floor((n+2p-f)/s) + 1)

Note on cross-correlation v.s. convolution
In math books, "convolution" involves flip filter in both direction before doing "convolution" operation.

The operation discribed before is called "cross-correlation".

(Why doing the flipping in math: to ensure assosative law for convolution — (AB)C=A(BC).)

Convolutions Over Volume

example: convolutions on RGB image
image size = 663 = height * width * #channels
filter size = 333, (convention: filter's #channels matches the image)
output size = 44 (1) — output is 2D for each filter.

Multiple filters:

  • take >1 filters
  • stack outputs together to form an output volume.

Summary of dimensions:
input shape = n*n*n_c
filter shape = f*f*n_c

filters = n_c'

⇒ output shape = (n-f+1) * (n-f+1) * n_c'

One Layer of a Convolutional Network

For each filter's output: add bias b, then apply nonlinear activation function.

One layer of a CNN:

with analogy to normall NN:

  • linear operation (matrix mul V.S. convolution)
  • bias
  • nonlinear activation
  • difference: Number of parameters doesn't depend on input dimension: even for very large images.

Notation summary:

note: ordering of dimensions: example index, height, width, #channel.

Simple Convolutional Network Example

general trend: as going to later layers, image size shrinks, #channels increases.

Pooling Layers

Pooling layers makes CNN more robust.

Max pooling
divide input into regions, take max of each region.

  • Hyperparams:

(common choice) filter size f=2 or 3, strid size s=2, padding p=0.

  • note: no params to learn for max pooling layer, pooling layer not counted in #layers (conv-pool as a single layer)

Intuition: a large number indicats a detected feature in that region → preseved after pooling.

Formula of dimension floor((n+2p-f+1)/s + 1) holds for POOL layer as well.

Output of max pooling: the same #channels as input (i.e. do maxpooling on each channel).

Average pooling
Less often used than max pooling.
Typical usecase: collapse 771000 activation into 111000.

CNN Example


Why Convolutions?

2 main advantages of CONV over FC: param sharing; sparsity of connections.

Parameter sharing:
A feature detector useful in one part of img is probably useful in another part as well.
→ no need to learn separate feature detectors in different parts.

Sparsity of connections:
For each output value depends only on a small number of inputs (the pixels near that position)

  • Invarance to translation...
comments powered by Disqus