Lviv University
Definition
Convolutional networks (LeCun, 1989), also known as convolutional neural networks, or CNNs, are a specialized kind of neural network for processing data that has a known grid-like topology.
CNNs are a family of models that were originally inspired by how the visual cortex of the human brain works when recognizing objects.
What’s in a name?
Convolutional networks have been tremendously successful in practical applications. The name “convolutional neural network” indicates that the network employs a mathematical operation called convolution. Convolution is a specialized kind of linear operation.
Data Examples
Examples include:
Image processing challenges using feed-forward NNs
Uses
Data properties
Origins
The development of CNNs goes back to the late 1980s and 1990s, when Yann LeCun and his colleagues proposed a novel NN architecture for classifying handwritten digits from images.
Turing award
Several years later, in 2019, Yann LeCun received the Turing award (the most prestigious award in computer science) for his contributions to the field of artificial intelligence (AI), along with two other researchers, Yoshua Bengio and Geoffrey Hinton.
Cats
The original discovery of how the visual cortex of our brain functions was made by David H. Hubel and Torsten Wiesel in 1959, when they inserted a microelectrode into the primary visual cortex of an anesthetized cat.
They measured the electrical responses of individual neurons in the visual cortex of cats while presenting visual stimuli to the cats’ eyes.
Gabor filters
These model the responses of simple cells in the visual cortex. \[ \begin{align*} &G(x,y) = A \exp \left(-\alpha \tilde{x}^2 - \beta \tilde{y}^2\right) \sin \left(\omega \tilde{x} + \psi\right), \; \text{where} \\ &\tilde{x} = (x-x_0)\cos \theta + (y-y_0) \sin \theta, \\ &\tilde{y} = -(x-x_0)\sin \theta + (y-y_0) \cos \theta. \end{align*} \]
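As a minimal sketch, the Gabor function above can be sampled on a grid to produce a filter kernel. All parameter values below (amplitude, decay rates, frequency, grid size) are illustrative choices, not values from the slide.

```python
import numpy as np

def gabor(x, y, A=1.0, alpha=0.5, beta=0.5, omega=2.0, psi=0.0,
          x0=0.0, y0=0.0, theta=0.0):
    """Evaluate the Gabor function G(x, y) defined on the slide."""
    xt = (x - x0) * np.cos(theta) + (y - y0) * np.sin(theta)
    yt = -(x - x0) * np.sin(theta) + (y - y0) * np.cos(theta)
    return A * np.exp(-alpha * xt**2 - beta * yt**2) * np.sin(omega * xt + psi)

# Sample the filter on a small grid, as one would to build a discrete kernel.
xs, ys = np.meshgrid(np.linspace(-2, 2, 9), np.linspace(-2, 2, 9))
kernel = gabor(xs, ys)
```

With \(\psi = 0\) the filter is zero at its center, since \(\sin(0) = 0\).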
Definition
Convolutional networks are neural networks that use convolution in place of general matrix multiplication in at least one of their layers.
Feed-forward NN
\[ z = Wx + b \]
Convolutional NN
\[ \textbf{Z} = \textbf{W} \ast \textbf{X} + b \]
Plan
General definition
An operation on two functions of a real-valued argument.
Spaceship example
Limitations
Notation
\[ s(t) = (x \ast w)(t) \]
Terminology
Discrete convolution
In practice, measurements cannot be continuous; they are discrete. Therefore: \[ s(t) = (x \ast w)(t) = \sum\limits_{a=-\infty}^{+\infty} x(a)w(t-a). \]
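This sum can be evaluated directly for finite signals; a small NumPy sketch (with illustrative input and kernel values) checks the hand-written sum against `np.convolve`:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # input signal x(a)
w = np.array([0.5, 0.25, 0.25])      # kernel w(a)

# Direct evaluation of s(t) = sum_a x(a) w(t - a), over all t with any overlap.
s = np.array([sum(x[a] * w[t - a]
                  for a in range(len(x)) if 0 <= t - a < len(w))
              for t in range(len(x) + len(w) - 1)])

assert np.allclose(s, np.convolve(x, w))  # matches NumPy's implementation
```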
2D
Multiple axes: suppose we have a two-dimensional image \(I\) and thus two-dimensional kernel \(K\). \[ S(i, j) = (I \ast K)(i,j) = \sum\limits_m \sum\limits_n I(m,n)K(i-m,j-n) \] By commutativity: \[ S(i, j) = (K \ast I)(i,j) = \sum\limits_m \sum\limits_n I(i-m,j-n)K(m,n) \]
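A minimal sketch of the two-dimensional sum above, restricted to the fully overlapping ("valid") positions; the input and kernel values are illustrative. It uses the fact that convolution equals cross-correlation with a doubly flipped kernel:

```python
import numpy as np

def conv2d_valid(I, K):
    """2-D convolution S(i,j) = sum_{m,n} I(m,n) K(i-m, j-n), 'valid' part only.
    Implemented as cross-correlation with the kernel flipped on both axes."""
    Kf = K[::-1, ::-1]                      # flip the kernel
    h, w = Kf.shape
    H, W = I.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(I[i:i+h, j:j+w] * Kf)
    return out

I = np.arange(16.0).reshape(4, 4)
K = np.array([[0.0, 1.0], [2.0, 3.0]])
S = conv2d_valid(I, K)   # shape (3, 3)
```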
Commutativity
The commutative property of convolution arises because we have flipped the kernel relative to the input, in the sense that as m increases, the index into the input increases, but the index into the kernel decreases. The only reason to flip the kernel is to obtain the commutative property.
Cross-correlation - Definition
\[ S(i, j) = (I \star K)(i,j) = \sum\limits_m \sum\limits_n I(i+m,j+n)K(m,n) \]
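In one dimension the relationship is easy to verify numerically: cross-correlation is convolution with a flipped kernel. The signal and kernel below are illustrative.

```python
import numpy as np

x = np.array([1.0, 4.0, 2.0, 5.0, 3.0])
w = np.array([1.0, 0.0, -1.0])

corr = np.correlate(x, w, mode='valid')       # sum_k x[i+k] w[k]
conv = np.convolve(x, w[::-1], mode='valid')  # same result via a flipped kernel
assert np.allclose(corr, conv)
```

Many deep learning libraries implement this cross-correlation but still call the operation "convolution", since the learned kernel absorbs the flip.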
Properties
Analogies with weight/bias
Algebraic analogy
Discrete convolution can be viewed as multiplication by a matrix, but the matrix has several entries constrained to be equal to other entries. The matrix is also very sparse, because the kernel is usually much smaller than the input.
In two dimensions, a doubly block circulant matrix corresponds to convolution.
For univariate discrete convolution, each row of the matrix is constrained to be equal to the row above shifted by one element.
\[ \begin{pmatrix} 2 & -1 & 0 & \cdots & \cdots & \cdots & \cdots & 0\\ -1 & 2 & -1 & 0 & & & & \vdots\\ 0 & -1 & 2 & -1 & \ddots & & & \vdots\\ \vdots & 0 & \ddots & \ddots & \ddots & \ddots & & \vdots\\ \vdots & & \ddots & \ddots & \ddots & \ddots & 0 & \vdots\\ \vdots & & & \ddots & -1 & 2 & -1 & 0\\ \vdots & & & & 0 & -1 & 2 & -1\\ 0 & \cdots & \cdots & \cdots & \cdots & 0 & -1 & 2\\ \end{pmatrix} \]
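A small sketch of this correspondence: for the kernel \((-1, 2, -1)\) underlying the matrix above, we can build the (here unpadded, "valid") convolution matrix and check that multiplying by it reproduces `np.convolve`. Note the slide's matrix corresponds to the padded ("same") variant, so its first row starts with the kernel's center.

```python
import numpy as np

def conv_matrix(w, n):
    """Toeplitz matrix T with T @ x == np.convolve(x, w, 'valid'):
    each row equals the row above shifted right by one element."""
    m = len(w)
    T = np.zeros((n - m + 1, n))
    for i in range(n - m + 1):
        T[i, i:i+m] = w[::-1]   # flipped kernel placed along the diagonal
    return T

w = np.array([-1.0, 2.0, -1.0])
x = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
T = conv_matrix(w, len(x))
assert np.allclose(T @ x, np.convolve(x, w, mode='valid'))
```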
Ideas
Note
Complexity notes
For matrix multiplication, in the case of \(m\) inputs and \(n\) outputs: \(O(m \times n)\).
Limiting the number of connections each output has to \(k\) gives \(O(k \times n)\) complexity.
Improvement
It’s possible to obtain good performance while keeping \(k \ll m\).
Sparse connectivity, viewed from below.
Sparse connectivity, viewed from above.
Large receptive field.
Definition
Parameter sharing refers to using the same parameter for more than one function in a model. As a synonym for parameter sharing, one can say that a network has tied weights.
Efficiency of edge detection. The image on the right was formed by taking each pixel in the original image and subtracting the value of its neighboring pixel on the left.
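This edge detector needs only two kernel parameters. A minimal sketch on a tiny synthetic image (the pixel values are illustrative):

```python
import numpy as np

img = np.array([[3.0, 3.0, 9.0, 9.0],
                [3.0, 3.0, 9.0, 9.0]])

# Subtract each pixel's left neighbor; this is cross-correlation of every
# row with the tiny kernel [-1, 1].
edges = img[:, 1:] - img[:, :-1]
```

The output is nonzero only at the vertical edge between the dark and bright halves.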
Definition
To say a function is equivariant means that if the input changes, the output changes in the same way.
Specifically, a function \(f(x)\) is equivariant to a function \(g\) if \(f(g(x)) = g(f(x))\).
In the case of convolution, if we let g be any function that translates the input, that is, shifts it, then the convolution function is equivariant to g.
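A minimal numerical check of this equivariance; circular convolution is used here so that the boundary wraps cleanly and the identity holds exactly (the signal and kernel values are illustrative):

```python
import numpy as np

def circ_conv(v, w):
    """Circular discrete convolution: y[i] = sum_k v[(i - k) mod n] w[k]."""
    n = len(v)
    return np.array([sum(v[(i - k) % n] * w[k] for k in range(len(w)))
                     for i in range(n)])

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
w = np.array([1.0, -1.0])

# Shifting the input, then convolving, gives the shifted convolution output.
lhs = circ_conv(np.roll(x, 1), w)
rhs = np.roll(circ_conv(x, w), 1)
assert np.allclose(lhs, rhs)
```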
Kernel example.
Shape
The collection of kernels defining a discrete convolution has a shape corresponding to some permutation of \((n, m, k_1, \dots, k_N)\), where
Shape
The following properties affect the output size \(o_j\) of a convolutional layer along axis \(j\):
Stride \(s\)=2.
For multiple feature maps, they are convolved with distinct kernels, and the results are summed up elementwise to produce the output feature map.
Note
\[ \begin{align*} &y = x \ast w \\ &y[i] = \sum\limits_{k=-\infty}^{+\infty} x[i-k]w[k] \end{align*} \]
How to deal with infinity?
Padding modes
There are three modes of padding that are commonly used in practice: full, same, and valid.
Pros/cons
Example
Padding with size \(p\), input size \(n\) and filter size \(m\), \(m \leq n\): \[ \begin{gather} y[i] = \sum\limits_{k=0}^{m-1} x^p [i+m-k]w[k] \end{gather} \]
Output size of a convolution is determined by: \[ \begin{align*} &o = \left\lfloor \frac{n+2p-m}{s} \right\rfloor + 1 \end{align*} \]
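The formula is easy to sanity-check in code; the concrete sizes below (a 32-pixel axis, 5- and 3-pixel kernels) are illustrative examples:

```python
def conv_output_size(n, m, p=0, s=1):
    """o = floor((n + 2p - m) / s) + 1 for input size n, kernel size m,
    padding p, and stride s."""
    return (n + 2 * p - m) // s + 1

# 'same' padding: 32-wide input, 5-wide kernel, p = 2 keeps the size at 32.
assert conv_output_size(32, 5, p=2, s=1) == 32
# 'valid' (no padding): 32 - 5 + 1 = 28.
assert conv_output_size(32, 5) == 28
# Stride 2 with p = 1 halves the spatial size: 16.
assert conv_output_size(32, 3, p=1, s=2) == 16
```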
Description
Pooling works very much like a discrete convolution, but replaces the linear combination described by the kernel with some other function.
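A minimal sketch of this idea: slide a window exactly as a convolution would, but apply a summary function instead of the kernel's linear combination. The input values below are illustrative.

```python
import numpy as np

def pool2d(X, size=2, stride=2, op=np.max):
    """Slide a size x size window with the given stride and apply `op`
    (np.max for max pooling, np.mean for average pooling)."""
    H, W = X.shape
    oh = (H - size) // stride + 1
    ow = (W - size) // stride + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = op(X[i*stride:i*stride+size, j*stride:j*stride+size])
    return out

X = np.array([[1.0, 2.0, 5.0, 6.0],
              [3.0, 4.0, 7.0, 8.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 4.0, 1.0, 1.0]])
```

`pool2d(X)` performs max pooling; `pool2d(X, op=np.mean)` performs average pooling on the same windows.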
Average pooling.
Max pooling.
Max/mean pooling
Pooling invariance example
Properties
The following properties affect the output size \(o_j\) of a pooling layer along axis \(j\):
Definition
A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs.
Examples
Why?
In order to make the representation approximately invariant to small translations of the input.
Invariance to translation means that if we translate the input by a small amount, the values of most of the pooled outputs do not change.
When?
Stride: 1 pixel, width: 3 pixels. Bottom row: shifted right.
If we pool over the outputs of separately parameterized convolutions, the features can learn which transformations to become invariant to.
Pooling with downsampling. Here we use max pooling with a pool width of three and a stride between pools of two. This reduces the representation size by a factor of two, which reduces the computational and statistical burden on the next layer. Note that the rightmost pooling region has a smaller size but must be included if we do not want to ignore some of the detector units.
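A sketch of this downsampling pooling in one dimension, keeping a final, possibly smaller, window so that no detector unit is ignored (the input values are illustrative):

```python
import numpy as np

def max_pool1d(x, width=3, stride=2):
    """Max pooling with a pool width of 3 and stride 2; the rightmost
    region may be smaller than `width` but is still included."""
    out = []
    i = 0
    while i < len(x):
        out.append(np.max(x[i:i + width]))
        if i + width >= len(x):
            break
        i += stride
    return np.array(out)

pooled = max_pool1d(np.array([1, 3, 2, 5, 4, 6]))  # 6 units -> 3 units
```

The representation size is roughly halved, as the slide describes.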
Advantages
Layer stages

The components of a typical convolutional neural network layer.
Number of parameters for a CNN: \(m_1 \times m_2 \times 3 \times 5 + 5\). Number of parameters for a fully-connected NN: \((n_1 \times n_2 \times 3) \times (n_1 \times n_2 \times 5)\).
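Both counts are straightforward to compute; a sketch using the slide's 3 input channels and 5 feature maps, with hypothetical sizes (a \(5 \times 5\) kernel and a \(28 \times 28\) image) for the concrete numbers:

```python
def cnn_params(m1, m2, c_in=3, c_out=5):
    """Kernel weights are shared across all spatial positions;
    one bias per output feature map."""
    return m1 * m2 * c_in * c_out + c_out

def fc_params(n1, n2, c_in=3, c_out=5):
    """Every input unit connected to every output unit
    (biases omitted, as on the slide)."""
    return (n1 * n2 * c_in) * (n1 * n2 * c_out)

print(cnn_params(5, 5))    # 5*5*3*5 + 5 = 380 parameters
print(fc_params(28, 28))   # over nine million parameters
```

Parameter sharing is what makes the convolutional count independent of the image size \(n_1 \times n_2\).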
Layer dimensions: