In this post I'll demonstrate how to use a convolutional neural network to classify images as either a dog or a cat. In particular, we will look at

  • how to read images
  • how to design a simple convolutional neural network in Keras
  • how to train and evaluate the model

We will use Keras and TensorFlow to build a deep neural network model.

First head over to There you can download the training dataset called You don’t need as it is only for making predictions and submitting the results to Kaggle. We will create our own training and testing set.

First extract the archive. There are 25,000 images of cats and dogs. One thing we immediately notice is that the images have different sizes. Also, images have different contrast and brightness.


First we’ll import the necessary libraries

import matplotlib.pyplot as plt
import keras
import numpy as np
import glob
import os
from PIL import Image
import tqdm

Now we will define some constants. IMG_DIR is the directory where our images are. IM_WIDTH and IM_HEIGHT are the width and height of images after we do some pre-processing.

IMG_DIR = "./train/"
IM_WIDTH = 128
IM_HEIGHT = 128

We will need to read the images and make corresponding labels. So if an image is a dog then its label will be 1 otherwise it will be 0.
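The labelling rule can be sketched in a few lines. This assumes the file names contain the word "dog" or "cat" (for example `dog.12.jpg`), which is an assumption about how the archive names its files; the helper name `label_from_path` is made up for illustration. It looks only at the file name rather than the whole path, so a directory that happens to contain "dog" does not turn every label into 1.

```python
import os

def label_from_path(path):
    """Return 1 for a dog image and 0 otherwise, based on the file name only."""
    name = os.path.basename(path).lower()
    return 1 if "dog" in name else 0

# Hypothetical paths, used only to show the rule
paths = ["./train/dog.12.jpg", "./train/cat.7.jpg", "./dogs_vs_cats/cat.1.jpg"]
labels = [label_from_path(p) for p in paths]
print(labels)  # [1, 0, 0]
```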

We loop through all JPEG files in the directory, read each one and resize it so that every image has the same size. We convert each image to a numpy array and add it to the images list. For the label, we check the file name. After we are done with all images, we return the images and labels as numpy arrays. Note that reading all the files at once like this uses a lot of memory: the final float32 array alone is about 4.9 GB for 25,000 images of 128 x 128 x 3 values. If you have memory issues, create a subset of the data (make sure it contains an equal number of cat and dog images) and read that instead of the full dataset.

def read_images(directory, resize_to=(128, 128)):
    """
    Reads images and labels from the given directory
    :param directory directory from which to read the files
    :param resize_to a tuple of width, height to resize the images
    :returns a tuple of numpy arrays of images and labels
    """
    files = glob.glob(directory + "*.jpg")
    images = []
    labels = []
    for f in tqdm.tqdm_notebook(files):
        im = Image.open(f)
        im = im.resize(resize_to)
        # scale pixel values from [0, 255] to [0, 1]
        im = np.array(im) / 255.0
        im = im.astype("float32")
        images.append(im)
        # the label comes from the file name: 1 for dog, 0 for cat
        label = 1 if "dog" in os.path.basename(f).lower() else 0
        labels.append(label)
    return np.array(images), np.array(labels)

X, y = read_images(directory=IMG_DIR, resize_to=(IM_WIDTH, IM_HEIGHT))
# make sure we have 25000 images if we are reading the full data set.
# Change the number accordingly if you have created a subset
assert len(X) == len(y) == 25000
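If memory is tight, the balanced subset mentioned above could be picked along these lines. This is only a sketch: the helper name `balanced_subset` and the example file list are made up, and it assumes every file name contains either "dog" or "cat". The selected files could then be copied into a separate directory and read from there.

```python
import os
import random

def balanced_subset(files, n_per_class, seed=0):
    """Pick n_per_class dog files and n_per_class cat files at random."""
    dogs = [f for f in files if "dog" in os.path.basename(f).lower()]
    cats = [f for f in files if "dog" not in os.path.basename(f).lower()]
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = rng.sample(dogs, n_per_class) + rng.sample(cats, n_per_class)
    rng.shuffle(subset)        # mix the two classes
    return subset

# Hypothetical file list standing in for the result of glob.glob
files = [f"./train/dog.{i}.jpg" for i in range(100)]
files += [f"./train/cat.{i}.jpg" for i in range(100)]

subset = balanced_subset(files, n_per_class=10)
print(len(subset))  # 20
```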

Now that we have read all the images, let's split them into a training set and a testing set. We will use only the training set to train the model, and the testing set to evaluate it. This tells us how the model performs on unseen data. The sklearn library has a built-in function called train_test_split, which we will use to split our data.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# remove X and y since we don't need them anymore
# otherwise it will just use the memory
del X
del y

After splitting, we have 17,500 images for training and 7,500 for testing.

X_train.shape, X_test.shape
> ((17500, 128, 128, 3), (7500, 128, 128, 3))
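To make the 70/30 split concrete, here is a NumPy-only sketch of a shuffled split on toy data. It illustrates the idea and is not how sklearn implements train_test_split; the function name `shuffled_split` and the toy arrays are made up for the example.

```python
import numpy as np

def shuffled_split(X, y, test_size=0.3, seed=0):
    """Shuffle the sample indices, then cut them into a test and a train part."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data: 50 samples with 2 features each, alternating labels
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50) % 2

Xtr, Xte, ytr, yte = shuffled_split(X_demo, y_demo)
print(Xtr.shape, Xte.shape)  # (35, 2) (15, 2)
```

With 25,000 images the same arithmetic gives 17,500 training and 7,500 testing samples.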

Let's check some images. We'll define a couple of helper functions: one to visualize images in a grid and one to convert numeric labels to strings.

def plot_images(images, labels):
    """Plots the given images in a grid with their labels as titles"""
    n_cols = min(5, len(images))
    n_rows = len(images) // n_cols
    fig = plt.figure(figsize=(8, 8))
    for i in range(n_rows * n_cols):
        sp = fig.add_subplot(n_rows, n_cols, i+1)
        sp.axis("off")
        sp.imshow(images[i])
        sp.set_title(labels[i])
    plt.show()

def humanize_labels(labels):
    """
    Converts numeric labels to human friendly string labels
    :param labels numpy array of int
    :returns numpy array of human friendly labels
    """
    return np.where(labels == 1, "dog", "cat")

plot_images(X_train[:20], humanize_labels(y_train[:20]))


Now we can build our model.
There are two things to note:

  1. We are working with images
  2. We need to classify from two categories (dog or cat) which is called binary classification

When working with images, we typically use convolutional neural networks. We will build a simple one with Keras using the functional API.

from keras.layers import Input, Dense, Conv2D, BatchNormalization, Activation, Flatten, MaxPool2D
from keras.models import Model
image_input = Input(shape=(IM_HEIGHT, IM_WIDTH, 3))
x = Conv2D(filters=32, kernel_size=7)(image_input)
x = Activation("relu")(x)
x = BatchNormalization()(x)
x = MaxPool2D()(x)
x = Conv2D(filters=64, kernel_size=3)(x)
x = Activation("relu")(x)
x = BatchNormalization()(x)
x = MaxPool2D()(x)
x = Conv2D(filters=128, kernel_size=3)(x)
x = Activation("relu")(x)
x = BatchNormalization()(x)
x = MaxPool2D()(x)
x = Flatten()(x)
x = Dense(units=64)(x)
x = Activation("relu")(x)
x = BatchNormalization()(x)
x = Dense(units=1)(x)
x = Activation("sigmoid")(x)
model = Model(inputs=image_input, outputs=x)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         (None, 128, 128, 3)       0         
conv2d_7 (Conv2D)            (None, 122, 122, 32)      4736      
activation_11 (Activation)   (None, 122, 122, 32)      0         
batch_normalization_9 (Batch (None, 122, 122, 32)      128       
max_pooling2d_7 (MaxPooling2 (None, 61, 61, 32)        0         
conv2d_8 (Conv2D)            (None, 59, 59, 64)        18496     
activation_12 (Activation)   (None, 59, 59, 64)        0         
batch_normalization_10 (Batc (None, 59, 59, 64)        256       
max_pooling2d_8 (MaxPooling2 (None, 29, 29, 64)        0         
conv2d_9 (Conv2D)            (None, 27, 27, 128)       73856     
activation_13 (Activation)   (None, 27, 27, 128)       0         
batch_normalization_11 (Batc (None, 27, 27, 128)       512       
max_pooling2d_9 (MaxPooling2 (None, 13, 13, 128)       0         
flatten_3 (Flatten)          (None, 21632)             0         
dense_5 (Dense)              (None, 64)                1384512   
activation_14 (Activation)   (None, 64)                0         
batch_normalization_12 (Batc (None, 64)                256       
dense_6 (Dense)              (None, 1)                 65        
activation_15 (Activation)   (None, 1)                 0         
Total params: 1,482,817
Trainable params: 1,482,241
Non-trainable params: 576

First we need to define an input layer. The shape of the input is image height by width by channels. This is the format used by the TensorFlow backend. If you are using Theano, the shape should be channels by height by width. In our case the shape is (IM_HEIGHT, IM_WIDTH, 3), where 3 is the number of color channels since we are working with RGB images.
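The difference between the two layouts is only the order of the axes. As a small illustration with a dummy batch (the array and its batch size of 4 are made up), a channels-last batch can be rearranged into channels-first with a transpose:

```python
import numpy as np

# Dummy batch of 4 RGB images in (batch, height, width, channels) order
batch_last = np.zeros((4, 128, 128, 3), dtype="float32")

# Move the channel axis to position 1: (batch, channels, height, width)
batch_first = np.transpose(batch_last, (0, 3, 1, 2))

print(batch_last.shape)   # (4, 128, 128, 3)
print(batch_first.shape)  # (4, 3, 128, 128)
```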

Next we define a convolutional layer with 32 filters and a kernel size of 7, and feed the output of image_input to it. Next we define a ReLU activation layer. ReLU is a commonly used activation function because it tends to let the model converge faster.
Then we define a batch normalization layer. This layer normalizes the output of the previous activation layer by subtracting the batch mean and dividing by the batch standard deviation. This improves the stability of the neural network. Next we add a max-pooling layer. This combination of convolutional layer, activation and batch normalization is usually called a convolution block.
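The normalization step itself is easy to write out in NumPy. This is a simplified sketch: it leaves out the learned scale and shift parameters and the moving averages that the Keras layer also keeps, and the small constant `eps` is an assumption added to avoid division by zero.

```python
import numpy as np

def batch_norm(x, eps=1e-3):
    """Normalize each feature (column) using the statistics of the batch."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
# Fake activations: 64 samples, 8 features, far from zero mean and unit variance
activations = rng.normal(loc=5.0, scale=3.0, size=(64, 8))

normalized = batch_norm(activations)
print(normalized.mean(axis=0).round(3))  # each value close to 0
print(normalized.std(axis=0).round(3))   # each value close to 1
```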
We repeat the convolution block two more times with different numbers of filters. We can repeat this as many times as we want, which is how deep neural networks are built. Notice that we change the number of filters after each convolution block. Also, in the model summary we can see that each convolution slightly reduces the spatial dimensions (width and height), while each max pooling halves them.
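The spatial sizes in the model summary can be reproduced by hand. The sketch below assumes what the code above uses implicitly: convolutions with stride 1 and no padding, and max pooling with a 2 x 2 window. The helper names `conv_out` and `pool_out` are made up for the example.

```python
def conv_out(size, kernel):
    """Output size of a convolution with stride 1 and no padding."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Output size of max pooling with a pool x pool window and the same stride."""
    return size // pool

size = 128
sizes = []
for kernel in (7, 3, 3):  # kernel sizes of the three Conv2D layers
    size = conv_out(size, kernel)
    sizes.append(size)
    size = pool_out(size)
    sizes.append(size)

print(sizes)              # [122, 61, 59, 29, 27, 13]
print(size * size * 128)  # 21632, the length of the flattened vector
```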

To understand the concepts of convolution layers, pooling, normalizing etc. I highly recommend you read from . It has all the necessary details about how convolution works, what parameters should we use and such.

After we decide that we have enough convolution blocks, we need a fully connected layer to make the final classification. A fully connected layer is known as a Dense layer in Keras. A Dense layer expects its input as a 1-dimensional array, so we need to flatten the 3-dimensional data to 1-D. In Keras we can do this with a layer called Flatten. Next we add a dense layer with 64 units (neurons). As before, we use ReLU activation and batch normalization. Finally we add another Dense layer that produces the output we want. Since we are doing binary classification, we only need one neuron that produces a value from 0 to 1. For this we use a sigmoid activation, since the output of sigmoid is always between 0 and 1.
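Here is a small NumPy sketch of the flatten, dense and sigmoid steps. For brevity it skips the 64-unit hidden layer and goes straight to a single output neuron; the feature maps and weights are random and untrained, so the outputs mean nothing beyond showing the shapes and the (0, 1) range.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
features = rng.normal(size=(5, 13, 13, 128))  # 5 fake feature maps
flat = features.reshape(len(features), -1)    # what Flatten does: (5, 21632)

weights = rng.normal(scale=0.01, size=(flat.shape[1], 1))  # random, untrained
bias = np.zeros(1)
probs = sigmoid(flat @ weights + bias)        # one value per image

print(flat.shape, probs.shape)  # (5, 21632) (5, 1)
```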

Now we need to define a model. In Keras, we do this by instantiating a Model object and specifying its inputs and outputs. Then we compile the model. Here we specify the optimizer to use. There are many optimization algorithms to choose from, such as SGD, Adam and RMSProp. We will use "adam". Next we need to define the loss. Since this is binary classification, we use binary_crossentropy. We also specify which metrics we would like to see during training. Since we are doing classification, we are interested in accuracy, so we specify "accuracy".
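To see what binary_crossentropy measures, here is a NumPy sketch of the formula: the mean of -(y * log(p) + (1 - y) * log(1 - p)) over the samples. It illustrates the loss and is not Keras's implementation; the clipping constant `eps` is an assumption chosen to avoid log(0), and the example arrays are made up.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(losses))

y_true = np.array([1, 0, 1, 0])
confident = np.array([0.9, 0.1, 0.8, 0.2])  # mostly right and sure about it
unsure = np.array([0.5, 0.5, 0.5, 0.5])     # always answers "50/50"

loss_confident = binary_crossentropy(y_true, confident)
loss_unsure = binary_crossentropy(y_true, unsure)
print(loss_confident)  # about 0.164
print(loss_unsure)     # ln(2), about 0.693
```

The loss gets smaller as the predicted probabilities move towards the true labels, which is what the optimizer tries to achieve.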

You can look into Keras documentation for details about different optimizers, loss, metrics.

Now let's train the model for 3 epochs with a batch size of 64. Out of the 17,500 training images, we take 64 images at a time and give them to the model to train on, then take the next 64, and so on. Each such step is called an iteration. Once all 17,500 images have been given to the model, 1 epoch is finished. If you increase the batch size, you'll need more GPU memory. When building your own models, you will have to experiment with different values and see which performs better.

model.fit(X_train, y_train, batch_size=64, epochs=3)
Epoch 1/3
17500/17500 [==============================] - 77s - loss: 0.5352 - acc: 0.7402    
Epoch 2/3
17500/17500 [==============================] - 73s - loss: 0.4048 - acc: 0.8182    
Epoch 3/3
17500/17500 [==============================] - 73s - loss: 0.3339 - acc: 0.8555
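The number of iterations in one epoch follows directly from the batch size. A quick check with the numbers used here:

```python
import math

n_train = 17500
batch_size = 64

# The last batch is smaller when the batch size does not divide the data evenly
iterations = math.ceil(n_train / batch_size)
last_batch = n_train - (iterations - 1) * batch_size

print(iterations)  # 274
print(last_batch)  # 28
```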

After 3 epochs, we have a training accuracy of about 85%. But this is not what we are actually interested in. The real evaluation of the model should be on data that the model did not see during training. Since we already created a testing set, we'll use that to evaluate our model's performance. Keras models have a function called evaluate which gives us the performance of our model. It returns its outputs in a list, and it may not be clear what the numbers mean, so we print the metrics_names property to see what each number means.

print(model.metrics_names)
model.evaluate(X_test, y_test, batch_size=128)
['loss', 'acc']  
[0.56499939314524328, 0.73893333314259846]

So the first number is the loss and the second number is the accuracy. We achieved about 74% accuracy on our test data, which is lower than the accuracy reported during training. A gap like this suggests the model is overfitting to the training set to some degree. You can change the parameters of the convolution and dense layers to see how they affect the model's performance.
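The accuracy figure is simply the fraction of images whose thresholded prediction matches the true label. The sketch below shows this on made-up numbers, together with a small table of which class gets confused with which; the arrays are invented for illustration and are not model outputs.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
probs = np.array([0.92, 0.30, 0.41, 0.77, 0.65, 0.08, 0.55, 0.49])

# Anything above 0.5 counts as "dog" (1), the rest as "cat" (0)
y_pred = (probs > 0.5).astype(int)
accuracy = float(np.mean(y_pred == y_true))

# Confusion counts: rows are true labels, columns are predicted labels
confusion = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_pred):
    confusion[t, p] += 1

print(accuracy)   # 0.75
print(confusion)
```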

Finally, let's visualize some of the predictions made by our model.

predictions = model.predict(X_test)
predictions = np.where(predictions.flatten() > 0.5, 1, 0)
# plot random 20
p = np.random.permutation(len(predictions))
plot_images(X_test[p[:20]], humanize_labels(predictions[p[:20]]))

prediction image here
