Before diving deep into image recognition, and CNN architecture in particular, let’s talk about how we see images and how our brain decides what it is looking at. Our eyes capture light and color on the retina. The receptors on the retina pass these signals to the optic nerve, which passes them on to the brain to make sense of the information. The whole visual pathway plays an important role in understanding what we see around us. It is this system that allows us to make sense of the picture above, the text in this article and every other visual recognition task we perform every day.

We’ve been doing this since our childhood. We were taught to recognize an umbrella, a dog, a cat or a human being. Can we teach computers to do so? Can we make a machine which can see and understand as well as humans do?

Computers “see” the world differently than we do. They can only “see” an image as an array of numbers. Teaching computers to make sense of this array of numbers is a challenging task. Computer scientists have spent decades building systems, algorithms and models that can understand images. Today, in the era of Artificial Intelligence and Machine Learning, we have achieved remarkable success in identifying objects in images, identifying the context of an image, detecting emotions and so on. One of the most popular algorithms used in computer vision today is the Convolutional Neural Network, or CNN.
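To make "an array of numbers" concrete, here is a minimal sketch using NumPy. The pixel values are made up for illustration and do not come from any image in this article.

```python
import numpy as np

# A hypothetical 5x5 grayscale "image": each entry is a pixel intensity
# from 0 (black) to 255 (white). The values are invented for illustration.
image = np.array([
    [0,   0, 255,   0, 0],
    [0, 255, 255,   0, 0],
    [0,   0, 255,   0, 0],
    [0,   0, 255,   0, 0],
    [0, 255, 255, 255, 0],
], dtype=np.uint8)

print(image.shape)   # (5, 5): height and width
print(image[1, 1])   # intensity of the pixel at row 1, column 1

# A color image adds a third axis, typically one channel each for
# red, green and blue. Here the same values are simply repeated.
rgb_image = np.stack([image, image, image], axis=-1)
print(rgb_image.shape)   # (5, 5, 3)
```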

Convolutional Neural Networks

A CNN is composed of two major parts:

1) Feature Extraction

This is the part of the CNN architecture from which the network derives its name. Convolution is the mathematical operation that is central to the efficacy of this algorithm. Let’s understand at a high level what happens inside the red enclosed region. The input to the red region is the image we want to classify, and the output is a set of features. Think of features as attributes of the image: for instance, an image of a cat might have features like whiskers, two ears and four legs. An image of a handwritten digit might have features such as horizontal and vertical lines, or loops and curves. Let’s see how we extract such features from the image.

Feature Extraction: Convolution

Convolution in a CNN is performed on an input image using a filter, also called a kernel. To understand filtering and convolution, make a small peephole by curling your index finger and thumb together, as you would when making a fist. Now look at your screen through this peephole: you can only see a very small part of the screen at a time. To see the whole screen you have to scan it, starting at the top left and moving right, then moving down a bit after covering the width of the screen, and repeating until you have scanned the whole screen.

Convolution of an image with a kernel works in a similar way. The kernel, or filter, is a small matrix of values that acts as the peephole: it scans the image in the same way and performs a mathematical operation on each part of the image it covers. For instance, suppose the input image and the filter look like this:

The filter (green) slides over the input image (blue) one pixel at a time starting from the top left. The filter multiplies its own values with the overlapping values of the image while sliding over it and adds all of them up to output a single value for each overlap.
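The sliding multiply-and-add described above can be sketched in a few lines of NumPy. The 5x5 image and 3x3 filter values below are assumptions, since the original figure is not reproduced in this text; they were chosen so that the top left output comes out as 4, matching the value discussed in the article. The function name `convolve2d` is only illustrative.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` one pixel at a time, with no padding."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w), dtype=np.result_type(image, kernel))
    for i in range(out_h):
        for j in range(out_w):
            # The part of the image currently under the filter.
            patch = image[i:i + kh, j:j + kw]
            # Multiply element-wise and add everything up.
            output[i, j] = np.sum(patch * kernel)
    return output

# Assumed example values (not taken from the article's figure).
image = np.array([
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
])
kernel = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
])

feature_map = convolve2d(image, kernel)
print(feature_map)
# [[4 3 4]
#  [2 4 3]
#  [2 3 4]]
```

Strictly speaking, this computes a cross-correlation because the kernel is not flipped. This is the operation deep learning libraries conventionally call convolution, and for the symmetric kernel assumed here both give the same result.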

In the above animation, the value 4 (top left) in the output matrix (red) corresponds to the filter overlap on the top left of the image, which is computed as:

Similarly, we compute the other values of the output matrix. Note that the top left value of the output matrix, which is 4, depends only on the 9 values (3x3) at the top left of the original image matrix. It does not change even if the rest of the values in the image change. This region is the receptive field of this output value, or neuron, in our CNN. Each value in our output matrix is sensitive to only a particular region of the original image. Scroll up to the diagram of neurons with overlapping receptive fields. Do you notice the similarity?
Adjacent values (neurons) in the output matrix have overlapping receptive fields, like the red, blue and yellow neurons in the earlier picture. The animation below will give you a better sense of what happens in convolution.
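The receptive field claim can be checked with a small experiment: change every pixel outside the top left 3x3 block and confirm that the top left output stays the same. This is a sketch under assumptions. The random binary image, the kernel values and the helper `convolve2d` are all illustrative and not taken from the article's figures.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` one pixel at a time, with no padding."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w), dtype=np.result_type(image, kernel))
    for i in range(out_h):
        for j in range(out_w):
            output[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return output

rng = np.random.default_rng(0)
image = rng.integers(0, 2, size=(5, 5))          # random 0/1 image (assumed)
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])                   # assumed filter values

before = convolve2d(image, kernel)

# Flip every pixel that lies outside the top left 3x3 block.
outside = np.ones(image.shape, dtype=bool)
outside[:3, :3] = False
changed = image.copy()
changed[outside] = 1 - changed[outside]

after = convolve2d(changed, kernel)

# The top left output only "sees" the top left 3x3 block, so it is unchanged.
print(before[0, 0] == after[0, 0])   # True
```

The output at position (0, 1) reads image columns 1 to 3, so it shares columns 1 and 2 with the output at (0, 0). That shared region is the overlap between neighbouring receptive fields.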

Feature Extraction: Non-Linearity

If you go back and read about a basic neural network, you will notice that each layer computes a linear combination of its inputs. On their own, stacked linear layers still amount to a single linear function. Introducing a non-linearity, also called an activation function, allows us to classify our data even if it is not linearly separable.
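Here is a tiny numeric check of why the non-linearity matters. It is a sketch that ignores bias terms, and the weights `W1`, `W2` and the input `x` are made-up values. Two linear layers applied in sequence give exactly the same result as one combined linear layer, while inserting a ReLU between them does not.

```python
import numpy as np

def relu(z):
    """Replace negative values with 0, keep positive values."""
    return np.maximum(z, 0)

# Made-up input and weights for illustration (biases omitted).
x = np.array([1, -2])
W1 = np.array([[1, 2],
               [3, 1]])
W2 = np.array([[2, 1]])

two_layers = W2 @ (W1 @ x)       # layer 1 followed by layer 2
one_layer = (W2 @ W1) @ x        # a single layer with combined weights
print(np.allclose(two_layers, one_layer))   # True: still just linear

with_relu = W2 @ relu(W1 @ x)    # non-linearity between the two layers
print(np.allclose(with_relu, one_layer))    # False: no longer collapses
```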

Left: Linearly separable vs. Right: Not linearly separable

This leads us to another important operation: non-linearity, or activation. After sliding our filter over the original image, the output is passed through another mathematical function called an activation function. The activation function most commonly used in CNN feature extraction is ReLU, which stands for Rectified Linear Unit. It simply converts all negative values to 0 and keeps the positive values unchanged.
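ReLU is short enough to write in one line. The sketch below applies it to a small made-up feature map; the values are for illustration only.

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: negatives become 0, positives are kept."""
    return np.maximum(x, 0)

# A hypothetical convolution output containing negative values.
feature_map = np.array([
    [ 3, -1,  0],
    [-4,  2, -2],
    [ 1, -5,  6],
])

activated = relu(feature_map)
print(activated)
# [[3 0 0]
#  [0 2 0]
#  [1 0 6]]
```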

So, by convolving a single image with multiple filters, we can get multiple output images. For the handwritten digit here, we applied a horizontal edge extractor and a vertical edge extractor and got two output images. We can apply several other filters to generate more such output images, which are also referred to as feature maps.
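The idea of one image producing several feature maps can be sketched as follows. The two kernels are simple Prewitt-style edge filters chosen for illustration, and they are not necessarily the filters used for the digit in the article. The test image is a made-up 6x6 array containing a single horizontal edge.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` one pixel at a time, with no padding."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w), dtype=np.result_type(image, kernel))
    for i in range(out_h):
        for j in range(out_w):
            output[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return output

# Made-up image: top half dark (0), bottom half bright (1).
image = np.zeros((6, 6), dtype=int)
image[3:, :] = 1

# Assumed edge filters (Prewitt-style), one per edge direction.
horizontal_edge = np.array([[-1, -1, -1],
                            [ 0,  0,  0],
                            [ 1,  1,  1]])
vertical_edge = horizontal_edge.T

# One feature map per filter, stacked along a new first axis.
feature_maps = np.stack([convolve2d(image, k)
                         for k in (horizontal_edge, vertical_edge)])

print(feature_maps.shape)   # (2, 4, 4)
print(feature_maps[0])      # responds along the horizontal edge
print(feature_maps[1])      # all zeros: the image has no vertical edge
```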

Feature Extraction: Pooling

After a convolution layer produces the feature maps, it is common to add a pooling, or sub-sampling, layer. Pooling reduces the spatial size of the feature maps, which reduces the amount of computation and the number of parameters in the layers that follow. This shortens the training time and helps control over-fitting.

The most common type of pooling is max pooling, which takes the maximum value in each window of a specified size. The window slides over the feature map much like the kernel did earlier. This decreases the feature map size while keeping the most significant information.
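Max pooling can be sketched in the same sliding-window style. The 2x2 window with a stride of 2 is a common choice but is an assumption here, as are the function name `max_pool2d` and the feature map values.

```python
import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    """Take the maximum of each `size` x `size` window, moving by `stride`."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    pooled = np.zeros((out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            window = feature_map[r:r + size, c:c + size]
            pooled[i, j] = window.max()
    return pooled

# A made-up 4x4 feature map.
feature_map = np.array([
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
])

pooled = max_pool2d(feature_map)
print(pooled)
# [[6 5]
#  [7 9]]
```

The 4x4 map shrinks to 2x2, and each output keeps only the strongest response in its window.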

2) Classification

Alright, so now we have all the pieces required to build a CNN: convolution, ReLU and pooling. In a CNN these layers are usually repeated more than once, i.e. Convolution -> ReLU -> Max-Pool -> Convolution -> ReLU -> Max-Pool and so on. The output of the final max pooling is fed into the classifier we discussed initially, which is usually a multi-layer perceptron, a.k.a. a set of fully connected layers. We won’t discuss the fully connected layer in this article. You can read this article for a basic intuitive understanding of the fully connected layer.
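To see how the pieces chain together, here is a rough sketch of two Convolution -> ReLU -> Max-Pool stages followed by flattening. It is heavily simplified and rests on several assumptions: a single-channel 28x28 random input, one random 3x3 kernel per stage standing in for learned filters, and 2x2 pooling. A real CNN uses many learned filters per layer, and all function names here are illustrative.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` one pixel at a time, with no padding."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            output[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return output

def relu(x):
    """Replace negative values with 0."""
    return np.maximum(x, 0)

def max_pool2d(x, size=2, stride=2):
    """Take the maximum of each window, moving by `stride`."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            pooled[i, j] = x[r:r + size, c:c + size].max()
    return pooled

def conv_block(x, kernel):
    """One Convolution -> ReLU -> Max-Pool stage."""
    return max_pool2d(relu(convolve2d(x, kernel)))

rng = np.random.default_rng(0)
image = rng.random((28, 28))           # stand-in for a grayscale input image
kernel1 = rng.normal(size=(3, 3))      # random values in place of learned filters
kernel2 = rng.normal(size=(3, 3))

stage1 = conv_block(image, kernel1)    # 28x28 -> 26x26 -> 13x13
stage2 = conv_block(stage1, kernel2)   # 13x13 -> 11x11 -> 5x5
features = stage2.flatten()            # vector handed to the classifier

print(stage1.shape, stage2.shape, features.shape)   # (13, 13) (5, 5) (25,)
```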

Final Remarks

CNN is a very powerful algorithm that is widely used for image classification and object detection. Its hierarchical structure and powerful feature extraction capabilities make it very robust for a variety of image and object recognition tasks. There are many different CNN architectures, such as LeNet, AlexNet, ZFNet, GoogLeNet, VGGNet and ResNet, but the basic idea behind them remains the same.