Convolutional Neural Networks

Alex Sanchez, Ferran Reverter and Esteban Vegas

Outline

  • Computer Vision and Deep Learning

  • Convolutions and Feature Selection

  • Convolutional Neural Networks

  • A toy example

I. Computer vision and Deep Learning

We want computers that can see

Goal: Computer systems able to see what is present in the world, but also to predict and anticipate events.

source: MIT Course, http://introtodeeplearning.com/, L3

DNN in computer vision systems

Deep Learning enables many systems to undertake a variety of computer vision related tasks.

source: MIT Course, http://introtodeeplearning.com/, L3

Facial detection and recognition

In particular it enables automatic feature extraction.

source: MIT Course, http://introtodeeplearning.com/, L3

Autonomous driving

Autonomous driving would not be possible without automatic feature extraction.

source: MIT Course, http://introtodeeplearning.com/, L3

Medicine, biology, self care

Nor would automatic disease detection systems be able to distinguish healthy from affected people through images.

source: MIT Course, http://introtodeeplearning.com/, L3

Main tasks in Computer Vision:

  • Regression: Output variable takes continuous value. E.g. Distance to target
  • Classification: Output variable takes class labels. E.g. Probability of belonging to a class

source: MIT Course, http://introtodeeplearning.com/, L3

II. Convolutions and Feature Selection

What (how) do computers see?

  • To a computer, images are, of course, just numbers.

  • A greyscale image is an N x M array of numbers.

source: MIT Course, http://introtodeeplearning.com/, L3

What (how) do computers see?

  • An RGB (for Red, Green, Blue) color image is an N x M x 3 array of numbers

source: Bhupendra Pratap Singh

High level feature detection

  • Each image is characterized by a different set of features.

  • Before attempting to build a computer vision system, we need to be aware of which key features in our data need to be identified and detected.

source: MIT Course, http://introtodeeplearning.com/, L3

How to do feature extraction

  • Manual feature extraction is hard!

  • Feature characterization needs to define a hierarchy of features allowing an increasing level of detail.

  • Deep Neural networks can do this automatically!

source: MIT Course, http://introtodeeplearning.com/, L3

Feature extraction with dense NN

  • Fully connected NN could, in principle, be used to learn visual features

source: MIT Course, http://introtodeeplearning.com/, L3

Accounting for spatial structure

  • Images have a spatial structure.
  • How can this be used to inform the architecture of the Network?

source: MIT Course, http://introtodeeplearning.com/, L3

Extending the idea with patches

source: MIT Course, http://introtodeeplearning.com/, L3

Use filters to extract features

  • Filters can be used to extract local features

    • A filter is a set of weights
  • Different filters can extract different characteristics.

    • Combining filters is an efficient way to characterize an image.
  • A filter that matters in one part of the input should matter elsewhere, so:

    • Parameters of each filter are spatially shared.

A filter for each pattern?

  • By applying different filters, i.e. changing the weights, we can achieve completely different results.
  • In practice, filters are combined to completely characterize the images.

Example: “X or X”?

source: MIT Course, http://introtodeeplearning.com/, L3

  • Images are represented by matrices of pixels, so
  • Literally speaking these images are different.

What are the features of X

  • Look for a set of features that:
    • characterize the images, and
    • are the same in both cases.

source: MIT Course, http://introtodeeplearning.com/, L3

Filters can detect X features

source: MIT Course, http://introtodeeplearning.com/, L3

Is a given pattern in the image?

  • Imagine we want to check whether a (small) pattern is contained in a (larger) image.

  • A slow option is to do a pixel to pixel comparison.

  • A better option is to scan the image using an operation known as convolution of the image and the pattern (here called patch, filter or kernel).

  • It is faster and allows detecting how well the patch matches different regions in the image.

The Convolution Operation

  • Given an input image \(I\) and a filter (kernel) \(K\), the convolution operation is defined as:

    \[ (I * K)(i,j) = \sum_m \sum_n I(i-m, j-n) K(m,n) \]

  • Here:

    • \(I(i,j)\) represents the pixel value at position \((i,j)\) in the image.
    • \(K(m,n)\) represents the kernel values.
    • The summation runs over the dimensions of the kernel.
    • The result \((I * K)(i,j)\) gives a new pixel value after applying the filter at that location.

The Convolution Operation

source: MIT Course, http://introtodeeplearning.com/, L3

Visualizing Convolution

  • Consider a 3x3 kernel applied to a 5x5 image:

    \[ I = \begin{bmatrix} x_{11} & x_{12} & \dots & x_{15} \\ x_{21} & x_{22} & \dots & x_{25} \\ \dots & \dots & \dots & \dots \\ x_{51} & x_{52} & \dots & x_{55} \end{bmatrix}, \quad K = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix} \]

  • The kernel slides across the image, performing the weighted sum:

    • Multiply corresponding elements.
    • Sum the results.
    • Store in a new matrix (feature map).

Visualizing Convolution

  • Suppose we want to compute the convolution of a 5x5 image and a 3x3 filter.

source: MIT Course, http://introtodeeplearning.com/, L3

The Convolution Operation

  1. Slide the 3x3 filter over the input image,
  2. Elementwise multiply and
  3. Add the outputs (see the small R sketch below).
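As an illustration, here is a small R sketch (not from the original slides) that applies these three steps to a made-up 5x5 image, using the 3x3 kernel shown earlier. Like most deep learning libraries, it slides the kernel without flipping it (i.e. it computes a cross-correlation).

# Manual 'valid' convolution (no padding, stride 1); image values are arbitrary
I <- matrix(1:25, nrow = 5, byrow = TRUE)
K <- matrix(c(1, 0, 0,
              0, 1, 0,
              1, 0, 1), nrow = 3, byrow = TRUE)

conv2d <- function(I, K) {
  f <- nrow(K)
  out_dim <- nrow(I) - f + 1                    # n - f + 1
  out <- matrix(0, out_dim, out_dim)
  for (i in 1:out_dim) {
    for (j in 1:out_dim) {
      patch <- I[i:(i + f - 1), j:(j + f - 1)]  # current 3x3 region of the image
      out[i, j] <- sum(patch * K)               # elementwise multiply, then add
    }
  }
  out
}

conv2d(I, K)   # 3x3 feature map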

source: MIT Course, http://introtodeeplearning.com/, L3

The Convolution Operation

Can filters be learned?

  • Different filters can be used to extract different characteristics from the image.
    • Building filters by trial-and-error can be slow.
  • If a NN can learn these filters from the data, then
    • They can be used to classify new images.
  • This is what Convolutional Neural Networks are about.

III. Convolutional Neural Networks

Convolutional Neural Networks

  • CNNs are a type of Deep Neural Network (DNN) that implement the ideas previously introduced.

    • They use convolutions to learn spatial features from input data, such as images or 3D volumes.
    • CNNs are designed to identify increasingly complex traits by concatenating multiple convolutional layers, where each layer learns higher-level features.
    • Convolutional layers are combined with dense layers that perform the final classification using features extracted by the convolutional layers.

Core Concepts for CNNs

  • Before diving into CNNs, it is essential to understand key concepts that enable these models to process data effectively.

  • Operations that control:

    • How data is fed into the network,
    • How it is processed, and
    • How important features are extracted.
  • We’ll cover

    • Padding,
    • Stride,
    • Convolutions over volumes,
    • Pooling

Padding

  • Recall the convolution operation

source: DeepLearning.ai

Padding

  • In general, an \(n\times n\) matrix convolved with an \(f\times f\) filter \(\longrightarrow\) an \((n-f+1)\times(n-f+1)\) matrix.

  • The convolution operation shrinks the matrix if \(f>1\).

  • Applying convolution multiple times

    • Shrinks the image, losing data.
    • Uses edge pixels less than other pixels in the image.
  • To solve these problems we can pad the input image before convolution by adding some rows and columns to it.

Padding

  • An appropriate padding \(p\) prevents the image from shrinking.

  • Size after convolution: \((n+2p-f+1)\times(n+2p-f+1)\).

Strided convolutions

  • The cost of convolution can be decreased by increasing the step size, or stride \(s\), used when moving the filter over the image.

    • In the previous examples \(s=1\).
    • See the next slide for an example with \(s=2\).
  • An \(n\times n\) matrix convolved with an \(f\times f\) filter, with padding \(p\) and stride \(s\), yields a matrix of side \[\left\lfloor\frac{n+2p-f}{s}\right\rfloor+1\]

  • That is, if \((n+2p-f)/s + 1\) is not an integer, we take the floor of this value (see the small helper sketched below).
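The helper below is a small sketch (not part of the original slides) that simply evaluates this formula; the example calls use made-up sizes.

# Output size of a convolution: floor((n + 2p - f) / s) + 1
conv_output_size <- function(n, f, p = 0, s = 1) {
  floor((n + 2 * p - f) / s) + 1
}

conv_output_size(5, 3)                # 3: a 5x5 image with a 3x3 filter
conv_output_size(5, 3, p = 1)         # 5: padding p = 1 keeps the size
conv_output_size(7, 3, p = 0, s = 2)  # 3: stride 2 roughly halves the resolution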

Stride

source: DeepLearning.ai

Convolution over volumes

  • On RGB images the filter is a 3-dimensional object.

  • We will convolve an image of height \(H_I\), width \(W_I\) and # of channels \(C\), with a filter of height \(H_f\), width \(W_f\) and with same number of channels, \(C\).

  • Applying convolution is similar, but at each step,

    • we do \(W_f\times H_f \times C\) multiplications and
    • we sum all the numbers to get 1 output value.
  • The sizing formulas above still apply, because the number of channels \(C\) does not affect the size of the output.

Convolution over volumes

A convolution with several channels does not increase output dimensions. Source: Students Notes on CNN

Convolution with 2 filters

Multiple filters can be used in a convolution layer to detect multiple features. The output of the layer will then have as many channels as there are filters in the layer.
Source: Students Notes on CNN
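A quick way to see this in practice is to build a single convolutional layer with keras and inspect its output shape; the sizes below (a 6x6 RGB input and 2 filters) are chosen only for illustration.

library(keras)

# One convolutional layer with 2 filters applied to a 6x6x3 (RGB) input
m <- keras_model_sequential() %>%
  layer_conv_2d(filters = 2, kernel_size = c(3, 3), input_shape = c(6, 6, 3))

m %>% summary()   # output shape should be (None, 4, 4, 2): one channel per filter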

One Convolution Layer

  • Combining the elements seen above we can build a one-layer convolutional network.

  • It convolves the input image across all \(C\) channels, uses multiple filters, and therefore produces multiple convolution outputs.

  • A bias term \(b\) is then added to the output of the convolution with each filter \(W\), giving the equivalent of the term \(Z\) in a standard neural network.

  • We then apply an activation function such as ReLU to \(Z\) (on all channels), and finally stack the channels into a cube that becomes the layer's output (a small parameter-count sketch follows).
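As a sanity check on the number of parameters such a layer learns, each filter has \(W_f \times H_f \times C\) weights plus one bias, independently of the image size. The sketch below (with made-up sizes, not from the original slides) simply evaluates this count.

# Parameters of one convolutional layer: (f_h * f_w * C + 1) * n_filters
conv_layer_params <- function(f_h, f_w, C, n_filters) {
  (f_h * f_w * C + 1) * n_filters
}

conv_layer_params(3, 3, 3, 16)   # 448 parameters for 16 3x3 filters on an RGB input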

One Convolution Layer

Source: Students Notes on CNN

Pooling to decrease dimension

  • Additional layers can be added to reduce the size of the representations and thus speed up calculations.

  • These are called Pooling Layers

  • They have hyper-parameters such as filter size, stride or pooling type.

  • But they do not add any parameters: there is nothing for gradient descent to learn.

Pooling to decrease dimension

source: MIT Course, http://introtodeeplearning.com/, L3

Other benefits of Pooling

Key objectives of pooling in CNNs:

  1. Dimensionality Reduction: fewer values are passed on to later layers.

  2. Translation Invariance: small shifts of a feature barely change the output.

  3. Robustness to Variations: small distortions or noise have less effect.

  4. Extraction of Salient Features: the most prominent activations in each region are kept (e.g. by max pooling).

  5. Spatial Hierarchy: successive poolings summarize increasingly larger regions of the image.

Common types of pooling

Source: Students Notes on CNN

  • Max pooling: selects the maximum value within each pooling region.
  • Average pooling: calculates the average value within each pooling region (both are sketched below).
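The base-R sketch below (not from the original slides) applies both operations to a made-up 4x4 feature map, using 2x2 pooling regions with stride 2.

# 2x2 pooling with stride 2 on a 4x4 feature map (values are arbitrary)
fmap <- matrix(c(1, 3, 2, 4,
                 5, 6, 1, 0,
                 7, 2, 9, 8,
                 1, 4, 3, 2), nrow = 4, byrow = TRUE)

pool2x2 <- function(x, fun = max) {
  out <- matrix(0, nrow(x) / 2, ncol(x) / 2)
  for (i in seq(1, nrow(x), by = 2)) {
    for (j in seq(1, ncol(x), by = 2)) {
      out[(i + 1) / 2, (j + 1) / 2] <- fun(x[i:(i + 1), j:(j + 1)])
    }
  }
  out
}

pool2x2(fmap, max)    # max pooling: keeps the strongest activation in each region
pool2x2(fmap, mean)   # average pooling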

A CNN Example: LeNet-5

Source: Students Notes on CNN

Summary: CNNs

  • Option of choice for image classification.

  • Usually: one or more [convolution + pooling] layers, followed by one or more fully connected layers.

    • Features learned by convolutions
    • Pooling decreases size with spatial invariance
  • Usually, input size decreases over layers while the number of filters increases.

  • Fully connected layers have the most parameters in the network.

Summary: Why convolutions

CNNs show two main advantages:

  • Parameter sharing: A feature detector (e.g. a vertical edge detector) useful in one part of the image is probably useful in another part of the image.

  • Sparsity of connections: in each layer, each output value depends only on a small number of inputs (a rough comparison follows below).
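To get a feel for the savings, the rough comparison below (with made-up sizes: a 32x32x3 input mapped to six 28x28 feature maps) contrasts a fully connected layer with a convolutional layer using six 5x5 filters.

# Fully connected layer: every input value connects to every output unit
n_in  <- 32 * 32 * 3                  # 3072 input values
n_out <- 28 * 28 * 6                  # 4704 output values
dense_weights <- n_in * n_out + n_out
dense_weights                         # about 14.5 million parameters

# Convolutional layer: six 5x5x3 filters shared across the whole image
conv_weights <- (5 * 5 * 3 + 1) * 6
conv_weights                          # 456 parameters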

IV. A toy example

The MNIST dataset

  • A popular dataset of handwritten digits.
library(keras)
mnist <- dataset_mnist()
  • Made of features (images) and target values (labels)
  • Divided into a training and test set.
x_train <- mnist$train$x; y_train <- mnist$train$y
x_test <- mnist$test$x; y_test <- mnist$test$y
(mnistDims <- dim(x_train))
img_rows <- mnistDims[2];  img_cols <- mnistDims[3]

Data pre-processing (1): Reshaping

  • These images are not in the required shape, as the channel dimension is missing.
  • This can be corrected using the array_reshape() function.
x_train <- array_reshape(x_train, c(nrow(x_train), img_rows, img_cols, 1))
x_test <- array_reshape(x_test, c(nrow(x_test), img_rows, img_cols, 1)) 

input_shape <- c(img_rows, img_cols, 1)

dim(x_train)

Data pre-processing (2): Other transforms

  • Data is first normalized (to values in [0,1])
x_train <- x_train / 255
x_test <- x_test / 255
  • Labels are one-hot-encoded using the to_categorical() function.
num_classes = 10
y_train <- to_categorical(y_train, num_classes)
y_test <- to_categorical(y_test, num_classes)

Modeling (1): Definition

# CNN: one convolution + pooling block, followed by dense layers for classification
model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 16,                    # 16 filters of size 3x3 learn local features
                kernel_size = c(3,3),
                activation = 'relu',
                input_shape = input_shape) %>% 
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%  # downsample the feature maps
  layer_dropout(rate = 0.25) %>%                 # regularization
  layer_flatten() %>%                            # flatten to a vector for the dense layers
  layer_dense(units = 10,
              activation = 'relu') %>% 
  layer_dropout(rate = 0.5) %>% 
  layer_dense(units = num_classes,
              activation = 'softmax')            # one probability per class

Modeling (1): Model Summary

model %>% summary()

Modeling (2): Compilation

  • Categorical cross-entropy as loss function.
  • Adadelta optimizes the gradient descent.
  • Accuracy serves as metric.
model %>% compile(
  loss = loss_categorical_crossentropy,
  optimizer = optimizer_adadelta(),
  metrics = c('accuracy')
)

Model training

  • A mini-batch size of 128 should allow the tensors to fit into the memory of most “normal” machines.
  • The model will run over 12 epochs,
  • With a validation split set at 0.2
batch_size <- 128
epochs <- 12

model %>% fit(
  x_train, y_train,
  batch_size = batch_size,
  epochs = epochs,
  validation_split = 0.2
)

Model evaluation

  • Use test data to evaluate the model.
model %>% evaluate(x_test, y_test)
predictions <- model %>% predict(x_test) # Not shown
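If needed, the predicted probabilities can be turned into digit labels by taking, for each image, the class with the highest probability (a small sketch, not in the original code):

predicted_classes <- apply(predictions, 1, which.max) - 1  # labels 0-9
head(predicted_classes)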

References and Resources

Resources (1)

Courses

Books

Resources (2)

Workshops

Documents