Convolutional neural networks (CNNs) are a class of deep neural networks that are most commonly used for image analysis. They typically consist of convolutional, pooling and fully connected layers, which will all be explained in this article. The fully connected layers contribute a large number of parameters, which tends to lead to overfitting in larger networks. Therefore, additional regularization and normalization layers are often used to avoid this problem.
This article will explain the functionality of the different layers in convolutional neural networks and show a simple CNN implementation in the commonly used Python machine learning framework keras.
1. Motivation
A question that you might ask yourself is why we are using these special convolutional neural networks at all. Why not just use a normal neural network for the task of image classification?
There are several reasons for that. One very important reason is that images tend to be relatively large, which leads to a lot of parameters in standard neural networks. As an example, let's use a 256x256 pixel color image. This image has 256*256 = 65,536 pixels with 3 RGB color channels, which leads to ~196,000 input values. If we use a first hidden layer with 1024 neurons, we have more than 200 million parameters in our first layer alone. Updating these parameters with backpropagation involves a huge amount of computation. Convolutional neural networks avoid this problem by using the concept of filters.
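The arithmetic above can be checked with a few lines of Python. This is only a sketch: the variable names are made up for illustration, and the convolutional layer used for comparison (64 filters of size 3x3) is an assumed example, not a layer from this article's image.

```python
# Parameter count of a fully connected first layer on a 256x256 RGB image.
height, width, channels = 256, 256, 3
hidden_units = 1024

input_values = height * width * channels   # one input per pixel and channel
weights = input_values * hidden_units      # one weight per (input, neuron) pair
biases = hidden_units                      # one bias per neuron
dense_params = weights + biases

# For comparison: an assumed convolutional layer with 64 filters of size 3x3.
# Each filter has 3*3*channels weights plus one bias, independent of image size.
conv_params = (3 * 3 * channels + 1) * 64

print(input_values)   # 196608
print(dense_params)   # 201327616
print(conv_params)    # 1792
```

The filter weights are shared across all image positions, which is why the convolutional layer's parameter count does not grow with the image size.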
Apart from that, convolutional neural networks are more capable of detecting patterns in spatial areas of an image and can find patterns anywhere in the image. As an example, if we train a standard neural network with a certain cat image and then feed a shifted version of this picture, the neural network will probably not detect the cat because other neurons will be activated. Using the concept of filters, this can be avoided in convolutional neural networks.
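To make the shift argument a bit more concrete, here is a minimal one-dimensional sketch using NumPy. The signal, the pattern and all variable names are invented for illustration. The same filter is slid over a signal and over a shifted copy of it, and the strongest response simply moves along with the pattern.

```python
import numpy as np

# A small 1D "pattern detector": the filter responds most strongly where the
# signal matches its shape.
pattern = np.array([1.0, 2.0, 1.0])

signal = np.zeros(10)
signal[2:5] = pattern            # pattern placed at position 2
shifted = np.roll(signal, 3)     # same pattern, moved to position 5

# Slide the same filter over both signals (cross-correlation, no padding).
response = np.correlate(signal, pattern, mode="valid")
response_shifted = np.correlate(shifted, pattern, mode="valid")

print(int(np.argmax(response)))          # 2
print(int(np.argmax(response_shifted)))  # 5
```

A fully connected layer has a separate weight for every input position, so it would have to learn the pattern again for each location.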
2. keras Introduction
keras is a Python deep learning framework which is very user-friendly but also powerful.
Let us install keras and some other packages we need:
$ pip3 install keras tensorflow numpy mnist
Then, import the required packages into your program:
import numpy as np
import keras
I will use the MNIST dataset in this article. The MNIST dataset consists of 70,000 images of handwritten digits (0 - 9), split into 60,000 training images and 10,000 test images, and is a commonly used beginner dataset. The goal is to classify the digits in the images. Each image is 28x28 pixels large and has only one grayscale channel. First, let's load the MNIST dataset:
from keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

Now, the dataset should be preprocessed so that it can be used easily by keras. First, we add a channel dimension to the images, because the convolutional layers used later expect inputs of shape (height, width, channels). Then, we convert the labels of the images to categorical (one-hot) labels. This converts each single label into a 10-element label vector which contains only zeros and a single one at the position of the digit.
from keras.utils import to_categorical
train_images = np.expand_dims(train_images, axis=3)
test_images = np.expand_dims(test_images, axis=3)
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
print(train_images.shape) # (60000, 28, 28, 1)
print(train_labels.shape) # (60000, 10)
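To see what this preprocessing does, the following NumPy-only sketch mimics it on a tiny made-up example. The helper `one_hot` is a hypothetical stand-in written for illustration. It is not how `to_categorical` is implemented, only roughly what it produces for integer labels.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Turn integer class labels into rows with a single 1 at the label index."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, num_classes), dtype="float32")
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

example_labels = np.array([5, 0, 4])       # made-up labels for illustration
example_images = np.zeros((3, 28, 28))     # three blank 28x28 "images"

encoded = one_hot(example_labels)
with_channel = np.expand_dims(example_images, axis=3)

print(encoded.shape)        # (3, 10)
print(with_channel.shape)   # (3, 28, 28, 1)
```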
Let's start building the convolutional neural network. We first create a Sequential model in keras. Later, we then add the different types of layers to this model.
from keras.models import Sequential
model = Sequential()
3. Layers
3.1 Dense and Flatten
First, let us create a simple standard neural network in keras as a baseline. We start by flattening the image through the use of a Flatten layer. Then, we will use two fully connected layers with 32 neurons each and the 'relu' activation function as hidden layers, and one fully connected softmax layer with ten neurons as our output layer.
from keras.layers import Flatten, Dense
model.add(Flatten(input_shape=(28, 28, 1)))
model.add(Dense(32, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(10, activation='softmax'))
Let us verify that everything worked as expected by viewing the model summary.
model.summary()
OUT:
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
flatten_1 (Flatten) (None, 784) 0
_________________________________________________________________
dense_1 (Dense) (None, 32) 25120
_________________________________________________________________
dense_2 (Dense) (None, 32) 1056
_________________________________________________________________
dense_3 (Dense) (None, 10) 330
=================================================================
Total params: 26,506
Trainable params: 26,506
Non-trainable params: 0
_________________________________________________________________
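The parameter counts in the summary can be reproduced by hand: a fully connected layer has one weight per input-output pair plus one bias per neuron. The helper `dense_params` below is a made-up name for this small check.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Flattened input, two hidden layers, output layer.
layer_sizes = [28 * 28 * 1, 32, 32, 10]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(counts)        # [25120, 1056, 330]
print(sum(counts))   # 26506
```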
To evaluate the performance of this baseline network, we need to set the loss function and choose the optimizer we want to use. Since we have categorical labels in our classification task, we will use the categorical crossentropy loss. As an optimizer, we will use the commonly used Adam optimizer. Then, we train (fit) the network with batch size 128 and 10 epochs.
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(train_images, train_labels,
          batch_size=128,
          epochs=10,
          validation_data=(test_images, test_labels))
OUT:
.....
Epoch 10/10
60000/60000 [==============================] - 2s 26us/step
- loss: 0.2728 - accuracy: 0.9237 - val_loss: 0.3089 - val_accuracy: 0.9233
As we can see, the baseline model already achieves a pretty good test accuracy of 92.33% for this task. However, let us see if we can improve these results by using a convolutional neural network.
3.2 Convolutional Layer
Convolutional neural networks are networks that use convolutional layers. These layers are the main reason why CNNs outperform standard fully connected networks when it comes to image classification. Let us look at how these convolutional layers work.
Convolutional layers consist of a set of filters, which are used to look at spatial areas of the image. For example, if we use a 28 x 28 x 3 image, we could use a 5 x 5 x 3 filter, which looks at a 5x5 pixel area of the image. The last dimension of the filter is always equal to the last input dimension (the number of channels of the image).

The output of the convolutional layer is calculated by performing an element-wise multiplication of the filter with the area of the image that the filter currently looks at. Then, the results of the multiplications are summed up to create one of the outputs. Finally, the filter is moved one pixel (or multiple pixels, see later) to the right and the next output is calculated. This is done for all rows and columns so that the output is a 2D matrix.
(Output dimension for the example above: (28-5+1) x (28-5+1) = 24 x 24).
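The procedure described above can be written as a deliberately naive NumPy loop. This is a sketch for understanding, not how keras computes convolutions. The function name `conv2d_single_filter` is invented, the image and filter values are random, and bias and activation function are left out.

```python
import numpy as np

def conv2d_single_filter(image, kernel, stride=1):
    """Slide one filter over an (H, W, C) image without padding.

    Returns a 2D feature map."""
    h, w, c = image.shape
    kh, kw, kc = kernel.shape
    assert c == kc, "filter depth must match the number of image channels"
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = image[top:top + kh, left:left + kw, :]
            output[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum
    return output

rng = np.random.default_rng(0)
image = rng.random((28, 28, 3))    # random stand-in for a 28 x 28 x 3 image
kernel = rng.random((5, 5, 3))     # one 5 x 5 x 3 filter

feature_map = conv2d_single_filter(image, kernel)
print(feature_map.shape)   # (24, 24)
```

Running the same function once per filter and stacking the results gives the 3-dimensional output of a full convolutional layer.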
Normally, multiple filters are used in a convolutional layer, resulting in a 3-dimensional output of dimensions (outputDIM1 x outputDIM2 x numFilters).
Padding: When using a filter with a size of 2x2 or larger without padding, the output of the convolutional layer is always smaller than its input. To avoid that, we can use a technique called padding. This means that we add zeros around the input of the convolutional layer so that the output becomes larger. For example, if we use a 28x28 image with filter size 3 and a padding of 1 (this adds one layer of zeros around the image), the output of the convolutional layer will also be of dimension 28x28.
Note: In keras, we can specify padding='same' to automatically add padding such that the output and input dimensions are the same (for a stride of 1).
Stride: The stride specifies the number of pixels that the filter is moved between the spatial areas that it looks at. If we use a stride of 2, the filter is moved two pixels to the right to produce the next output.
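Filter size, padding and stride together determine the output size along each spatial dimension. A small sketch of the usual formula, floor((n + 2p - f) / s) + 1, is shown here. The helper name `conv_output_size` is made up, and it assumes the same amount of padding p on both sides of the input.

```python
def conv_output_size(input_size, filter_size, padding=0, stride=1):
    """Output length along one spatial dimension: floor((n + 2p - f) / s) + 1."""
    return (input_size + 2 * padding - filter_size) // stride + 1

print(conv_output_size(28, 5))                       # 24
print(conv_output_size(28, 3, padding=1))            # 28
print(conv_output_size(28, 3, padding=1, stride=2))  # 14
```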
keras:
Now, let us create a simple convolutional neural network in keras and see how it performs. First, I will add two convolutional layers to our current neural network. The first layer uses 64 filters of size 3x3, the second one uses 32 filters of the same size.
from keras.layers import Dense, Conv2D, Flatten
model = Sequential()
model.add(Conv2D(64, (3, 3), strides=(1, 1), padding='same', activation='relu', input_shape=(28, 28, 1)))
model.add(Conv2D(32, (3, 3), strides=(1, 1), padding='same', activation='relu'))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(10, activation='softmax'))
When looking at the summary, we can see that the convolutional layers have very few parameters compared to the fully connected layers in the network. The large number of parameters in the fully connected layers can result in a longer training time and larger memory footprint. Let us train the network and see the results.
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(train_images, train_labels,
          batch_size=128,
          epochs=10,
          validation_data=(test_images, test_labels))
OUT:
.....
Epoch 10/10
60000/60000 [==============================] - 188s 3ms/step
- loss: 0.0137 - accuracy: 0.9956 - val_loss: 0.1022 - val_accuracy: 0.9795
The network now achieves a test accuracy of 97.95%, which is a solid result. However, the training time has increased a lot. Let's see if we can reduce it throughout the rest of this article.
3.3 Pooling Layer
The pooling layer is another neural network layer that is often used in CNNs. Its concept is very simple. Neighboring pixels in images often have similar values and therefore, neighboring outputs of a convolutional layer often contain redundant information. Pooling layers address this problem by removing some of this redundancy.
Pooling layers look at an area of the convolutional output and perform a simple operation on it. Examples of these operations are max, min and average. For example, a max-pooling layer chooses the maximum value from all its input values as output. The area that the pooling layer looks at is defined by the pool size. A pool size of 2 means that the pooling operation looks at a 2x2 area and both output dimensions are halved. For example, applying such a pooling layer to a 24x24 output of a convolutional layer results in a 12x12 pooling layer output. The number of channels remains untouched by the pooling layer.
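As a rough illustration, 2x2 max pooling can be sketched in NumPy with a reshape. The function name `max_pool_2x2` is invented for this example. It assumes non-overlapping 2x2 windows and even input dimensions.

```python
import numpy as np

def max_pool_2x2(feature_maps):
    """2x2 max pooling with stride 2 on an (H, W, C) array; H and W must be even."""
    h, w, c = feature_maps.shape
    # Split rows and columns into (block index, position inside block).
    blocks = feature_maps.reshape(h // 2, 2, w // 2, 2, c)
    return blocks.max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4, 1)
pooled = max_pool_2x2(x)
print(pooled[:, :, 0])
# [[ 5.  7.]
#  [13. 15.]]

conv_output = np.zeros((24, 24, 64))       # stand-in for a convolutional output
print(max_pool_2x2(conv_output).shape)     # (12, 12, 64)
```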

Let's add pooling layers to our neural network and see the results.
from keras.layers import Dense, Conv2D, Flatten, MaxPooling2D
model = Sequential()
model.add(Conv2D(64, (3, 3), strides=(1, 1), padding='same', activation='relu', input_shape=(28, 28, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(32, (3, 3), strides=(1, 1), padding='same', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.summary()
OUT:
Model: "sequential_3"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_3 (Conv2D) (None, 28, 28, 64) 640
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 14, 14, 64) 0
_________________________________________________________________
conv2d_4 (Conv2D) (None, 14, 14, 32) 18464
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 7, 7, 32) 0
_________________________________________________________________
flatten_3 (Flatten) (None, 1568) 0
_________________________________________________________________
dense_5 (Dense) (None, 32) 50208
_________________________________________________________________
dense_6 (Dense) (None, 32) 1056
_________________________________________________________________
dense_7 (Dense) (None, 10) 330
=================================================================
Total params: 70,698
Trainable params: 70,698
Non-trainable params: 0
_________________________________________________________________
The summary shows that we could reduce the number of trainable parameters by more than a factor of 10 compared to the previous network. When training the network using the same command that we used earlier, we can see that the training time was reduced from 188s to 35s per epoch while the test accuracy increased to 98.38%.
OUT:
.....
Epoch 10/10
60000/60000 [==============================] - 35s 584us/step
- loss: 0.0229 - accuracy: 0.9929 - val_loss: 0.0621 - val_accuracy: 0.9838
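The factor of more than 10 mentioned above can be verified by counting the parameters of both networks by hand. The helper names below are made up for this check. The only difference between the two networks is the size of the flattened input to the first fully connected layer.

```python
def conv_params(kernel_h, kernel_w, in_channels, n_filters):
    """Each filter has kernel_h * kernel_w * in_channels weights plus one bias."""
    return (kernel_h * kernel_w * in_channels + 1) * n_filters

def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

def total_params(flattened_size):
    conv = conv_params(3, 3, 1, 64) + conv_params(3, 3, 64, 32)
    dense = (dense_params(flattened_size, 32)
             + dense_params(32, 32)
             + dense_params(32, 10))
    return conv + dense

without_pooling = total_params(28 * 28 * 32)   # 'same' padding keeps 28x28
with_pooling = total_params(7 * 7 * 32)        # two 2x2 poolings: 28 -> 14 -> 7

print(without_pooling)                            # 823338
print(with_pooling)                               # 70698
print(round(without_pooling / with_pooling, 1))   # 11.6
```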
4. Further improvements to the architecture
The architecture we created in this article is still fairly simple. In this part, I want to mention some options to further improve the network. I won't explain these techniques in detail here. If you need more information on them, a simple Google search will help.
Regularization:
The network we created does not seem to overfit the data. If overfitting occurs in your application/network, adding Dropout layers or other regularization techniques (L1, L2, or combined L1 and L2 penalties, see the keras documentation for details) can help reduce it.
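To give an idea of what a dropout layer does, here is a minimal NumPy sketch of the common "inverted dropout" formulation. It is an assumption-laden illustration with made-up names, not the keras implementation. During training a random fraction of the activations is set to zero and the rest is rescaled. At inference time the input passes through unchanged.

```python
import numpy as np

def dropout(activations, rate, training, rng):
    """Inverted dropout: zero a random fraction `rate` of the activations during
    training and rescale the rest so the expected value stays the same."""
    if not training or rate == 0.0:
        return activations
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((4, 8))                     # made-up hidden activations

train_out = dropout(hidden, rate=0.5, training=True, rng=rng)
test_out = dropout(hidden, rate=0.5, training=False, rng=rng)

print(train_out)   # entries are either 0.0 or 2.0
print(test_out)    # unchanged input
```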
Batch Normalization:
Adding BatchNormalization layers can help speed up the training of the neural network.
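The core computation of batch normalization during training can be sketched as follows. This is a simplified illustration: the function name, the value of `eps` and the random input are assumptions, and the running statistics that are used at inference time are left out.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-3):
    """Normalize each feature over the batch axis, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(128, 4))   # made-up activations

normalized = batch_norm_train(batch, gamma=np.ones(4), beta=np.zeros(4))
print(normalized.mean(axis=0))   # close to 0 for every feature
print(normalized.std(axis=0))    # close to 1 for every feature
```

In a real layer, `gamma` and `beta` are learned during training together with the other weights.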
Removing Fully Connected Layers:
Removing the fully connected layers at the end of the network can decrease the computational complexity. In the network from this article, removing the two hidden fully connected layers results in a small reduction of training time while having a comparable accuracy.
Inception Modules:
The inception module was invented by Google and basically is a small network inside the network. It combines multiple parallel convolutional and pooling layers that work well on image classification tasks. The inception module with dimensionality reduction should be used rather than its naive version.
Residual connections:
In very deep networks, adding residual connections between the layers can help the network perform better.
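The idea of a residual connection fits in a few lines. The sketch below uses two small dense transformations instead of convolutional layers to keep it short. All names and shapes are made up for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Two small dense transformations with a skip connection around them."""
    h = relu(x @ w1)
    return relu(h @ w2 + x)   # the input is added back before the activation

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))             # batch of 2 inputs with 4 features
w1 = rng.normal(size=(4, 4)) * 0.1
w2 = rng.normal(size=(4, 4)) * 0.1

out = residual_block(x, w1, w2)
print(out.shape)   # (2, 4)
```

If the weights are all zero, the block simply passes `relu(x)` through, so the layers only have to learn a correction on top of their input.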
5. Conclusion
In this article, we looked at the concept of convolutional neural networks. We first took a closer look at convolutional layers and pooling layers, which are the most important layers in CNNs.
To get some practical experience, we looked at an example implementation of a CNN in keras which already achieved an accuracy of more than 98% on the MNIST test set after only 10 epochs.
I hope that you have learned that CNNs are very powerful tools for image classification tasks and that the article helps you to implement your very own CNN.
Cheers,
Jakob.