• To train and evaluate a CNN to predict gender from an image of a person

First, download the data-set from here. It is a data-set of simple binary gender labels for a large number of images of human faces. We’ll do some exploratory data analysis first and download the images.


Let’s import the required libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tqdm
%matplotlib inline

The file that you downloaded contains links to the actual images and their labels. We’ll use pandas to read the CSV file and see what it contains.

df = pd.read_csv("./data_source.csv")
print("Total records = ", len(df))

There are 64,084 records in total and many columns that we don’t care about. The column names are also quite long, so we’ll select only a subset of the columns and rename them. We’ll then keep only the rows that have a confidence of 1, which pandas makes very easy. In this data-set, the confidence value indicates how sure the human annotators were when labeling the data. It ranges from 0 to 1, where 1 indicates absolute certainty.

# select only the columns that we are interested in
df = df[["_unit_id", "please_select_the_gender_of_the_person_in_the_picture",
    "please_select_the_gender_of_the_person_in_the_picture:confidence", "image_url"]]
# rename the columns
df.columns = ["id", "gender", "confidence", "url"]
# only select the rows that have a confidence of 1.0
df = df[df["confidence"] == 1]
print("Total records = ", len(df))

Now we have 64,075 rows after keeping only the rows that had a confidence value of 1. Let’s check how many samples/rows we have for each gender.
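The call that produced the counts is not shown in this post; it was presumably a group-by count on the gender column. Here is a minimal sketch on a toy dataframe, where toy_df and its values are made up purely for illustration. On the real data the equivalent call would be df.groupby("gender").count().

```python
import pandas as pd

# Toy stand-in for the filtered dataframe (values are made up for illustration)
toy_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "gender": ["male", "male", "female", "unsure"],
    "confidence": [1.0, 1.0, 1.0, 1.0],
    "url": ["u1", "u2", "u3", "u4"],
})

# count the non-null entries of every remaining column for each gender label
counts = toy_df.groupby("gender").count()
print(counts)

# if only the number of rows per label is needed, value_counts is shorter
per_label = toy_df["gender"].value_counts()
print(per_label)
```

The group-by version repeats the same count in every column, which is why the three columns in the output hold identical numbers.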


gender     id  confidence    url
female   7364        7364   7364
male    47592       47592  47592
unsure   9119        9119   9119

There are far more images for “male” than for the other categories. There are also some images with gender “unsure”. We’ll visualize a few data samples from each category and then sample the data so that each category has roughly the same number of samples. This is important to make sure that the model learns equally about all categories.

# helper function to display image urls
from IPython.display import HTML, display
def display_images(df, category_name="male", count=12):
    filtered_df = df[df["gender"] == category_name]
    # pick `count` random rows from the chosen category
    p = np.random.permutation(len(filtered_df))
    p = p[:count]
    img_style = "width:180px; margin:0px; float:left;border:1px solid black;"
    images_list = "".join(["<img style='{}' src='{}' />".format(img_style, u) for u in filtered_df.iloc[p].url])
    # render the collage inline in the notebook
    display(HTML(images_list))

display_images(df, category_name="female", count=15)
[Image: Female Collage]

display_images(df, category_name="male", count=15)
[Image: Male Collage]

display_images(df, category_name="unsure", count=15)
[Image: Unsure Collage]

Images in the “unsure” category are either not images of a person, contain more than one person, or show a person whose face is not turned toward the camera. There are also some images that could clearly have been labelled as male or female. Similarly, in the “male” and “female” categories we can see some images of cartoons or just text. For now we’ll ignore those for simplicity. If our model does not perform well, we’ll revisit the data cleaning step.

Now let’s create a dataframe that contains an equal number of samples from the male and female categories only.

df_male = df[df["gender"] == "male"]
df_female = df[df["gender"] == "female"]
# to make both categories have equal number of samples
# we'll take the counts of the category that has lowest
# number of samples
min_samples = min(len(df_male), len(df_female))
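# pick min_samples random rows from each category; permuting over the
# full length of each dataframe avoids drawing only from its first rows
df_male = df_male.iloc[np.random.permutation(len(df_male))[:min_samples]]
df_female = df_female.iloc[np.random.permutation(len(df_female))[:min_samples]]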
print("Total male samples = ", len(df_male))
print("Total female samples = ", len(df_female))
df = pd.concat([df_male, df_female])

Total male samples =  7364
Total female samples =  7364

In the remaining part of this post we’ll download the images using the urls provided. We’ll also set aside 30% of the images for testing and use the other 70% for training. Some images can get corrupted while downloading, so we’ll check for those as well. The code to download the images is straightforward but not efficient, since one image is downloaded at a time. We could use multi-processing or async features. I’ll update the post if I implement a better version.

import os
import requests
from io import BytesIO
from PIL import Image

def download_images(df, data_dir="./data"):
    # create one sub-directory per gender label
    genders = df["gender"].unique()
    for g in genders:
        g_dir = "{}/{}".format(data_dir, g)
        if not os.path.exists(g_dir):
            os.makedirs(g_dir)
    # download the images one at a time
    for index, row in tqdm.tqdm_notebook(df.iterrows()):
        filepath = "{}/{}/{}.jpg".format(data_dir, row["gender"], row["id"])
        # skip images that were already downloaded
        if os.path.exists(filepath):
            continue
        try:
            resp = requests.get(row["url"])
            im = Image.open(BytesIO(resp.content))
            # convert to RGB so that the image can be saved as JPEG
            im.convert("RGB").save(filepath)
        except Exception:
            print("Error while downloading %s" % row["url"])

DATA_DIR = "./data"
download_images(df, data_dir=DATA_DIR)
# create train/test folder for each gender
import glob
TRAIN_DIR = DATA_DIR + "/train"
TEST_DIR = DATA_DIR + "/test"
for d in [TRAIN_DIR, TEST_DIR]:
    for g in df["gender"].unique():
        final_dir = "{}/{}".format(d, g)
        if not os.path.exists(final_dir):
            os.makedirs(final_dir)
from random import shuffle
import math
import shutil

split_ratio = 0.7 # we'll reserve 70% of the images for the training set

def validate_and_move(files, target_dir):
    for f in tqdm.tqdm_notebook(files):
        # try to open the file to make sure that it is not corrupted
        try:
            im = Image.open(f)
            shutil.copy(f, target_dir)
        except Exception:
            print("Skipping corrupted file %s" % f)
#             os.remove(f)

for gender in df["gender"].unique():
    gender_dir = "{}/{}".format(DATA_DIR, gender)
    pattern = "{}/*.jpg".format(gender_dir)
    all_files = glob.glob(pattern)
    # shuffle so that the train/test split does not depend on file order
    shuffle(all_files)
    train_up_to = math.ceil(len(all_files) * split_ratio)
    train_files = all_files[:train_up_to]
    test_files = all_files[train_up_to:]
    validate_and_move(train_files, TRAIN_DIR + "/" + gender)
    validate_and_move(test_files, TEST_DIR + "/" + gender)
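
The download step above fetches one image at a time. As a rough sketch of the faster variant mentioned earlier, the same work can be spread over a thread pool. This is not the version used in this post: fetch_bytes, download_one and download_images_threaded are hypothetical names, and the sketch uses urllib from the standard library instead of requests so that it has no extra dependencies. It writes the raw bytes to disk without decoding them and relies on the validation step to catch corrupted files.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_bytes(url, timeout=10):
    # hypothetical helper: return the raw bytes behind a url
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_one(row, data_dir, fetch=fetch_bytes):
    # returns (id, True) on success or if the file already exists,
    # and (id, False) if anything goes wrong
    filepath = "{}/{}/{}.jpg".format(data_dir, row["gender"], row["id"])
    if os.path.exists(filepath):
        return row["id"], True
    try:
        os.makedirs(os.path.dirname(filepath), exist_ok=True)
        content = fetch(row["url"])
        with open(filepath, "wb") as f:
            f.write(content)
        return row["id"], True
    except Exception:
        return row["id"], False


def download_images_threaded(rows, data_dir="./data", max_workers=16, fetch=fetch_bytes):
    # rows: iterable of dicts with "id", "gender" and "url" keys
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda r: download_one(r, data_dir, fetch), rows))
    # return the ids of the images that could not be downloaded
    return [image_id for image_id, ok in results if not ok]
```

With the dataframe from this post, it could be called as failed = download_images_threaded(df.to_dict("records"), data_dir=DATA_DIR). Threads suit this job because the time is spent waiting on the network rather than on computation. The max_workers value of 16 is an arbitrary starting point.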

So far we did some basic visualization and prepared our dataset. We’ll build and train a model in the next part.