
Face Detection on Custom Dataset with Detectron2 and PyTorch using Python

TL;DR Learn how to prepare a custom Face Detection dataset for Detectron2 and PyTorch. Fine-tune a pre-trained model to find face boundaries in images.

Face detection is the task of finding (boundaries of) faces in images. This is useful for

  • security systems (the first step in recognizing a person)
  • autofocus and smile detection for making great photos
  • detecting age, race, and emotional state for marketing (yep, we already live in that world)

Historically, this was a really tough problem to solve. Tons of manual feature engineering, novel algorithms and methods were developed to improve the state-of-the-art.

These days, face detection models are included in almost every computer vision package/framework. Some of the best-performing ones use Deep Learning methods. OpenCV, for example, provides a variety of tools like the Cascade Classifier.

In this guide, you’ll learn how to:

  • prepare a custom dataset for face detection with Detectron2
  • use (close to) state-of-the-art models for object detection to find faces in images
  • build a foundation that you can extend to face recognition

Here’s an example of what you’ll get at the end of this guide:


Detectron 2


Detectron2 is a framework for building state-of-the-art object detection and image segmentation models. It is developed by the Facebook Research team. Detectron2 is a complete rewrite of the first version.

Under the hood, Detectron2 uses PyTorch (compatible with the latest version(s)) and allows for blazing fast training. You can learn more in the introductory blog post by Facebook Research.

The real power of Detectron2 lies in the HUGE number of pre-trained models available at the Model Zoo. But what good would that be if you can’t fine-tune those on your own datasets? Fortunately, that’s super easy! We’ll see how it is done in this guide.

Installing Detectron2

At the time of this writing, Detectron2 is still in an alpha stage. While there is an official release, we’ll clone and compile from the master branch. This should equal version 0.1.

Let’s start by installing some requirements:

!pip install -q cython pyyaml==5.1
!pip install -q -U 'git+'

And download, compile, and install the Detectron2 package:

!git clone detectron2_repo
!pip install -q -e detectron2_repo

At this point, you’ll need to restart the notebook runtime to continue!

!pip install -q -U watermark
%reload_ext watermark
%watermark -v -p numpy,pandas,pycocotools,torch,torchvision,detectron2
CPython 3.6.9
IPython 5.5.0

numpy 1.17.5
pandas 0.25.3
pycocotools 2.0
torch 1.4.0
torchvision 0.5.0
detectron2 0.1
import torch, torchvision
import detectron2
from detectron2.utils.logger import setup_logger

import glob

import os
import ntpath
import numpy as np
import cv2
import random
import itertools
import pandas as pd
from tqdm import tqdm
import urllib.request
import json
import PIL.Image as Image

from detectron2 import model_zoo
from detectron2.engine import DefaultPredictor, DefaultTrainer
from detectron2.config import get_cfg
from detectron2.utils.visualizer import Visualizer, ColorMode
from detectron2.data import DatasetCatalog, MetadataCatalog, build_detection_test_loader
from detectron2.evaluation import COCOEvaluator, inference_on_dataset
from detectron2.structures import BoxMode

import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt
from matplotlib import rc

%matplotlib inline
%config InlineBackend.figure_format='retina'

sns.set(style='whitegrid', palette='muted', font_scale=1.2)

HAPPY_COLORS_PALETTE = ["#01BEFE", "#FFDD00", "#FF7D00", "#FF006D", "#ADFF02", "#8F00FF"]


rcParams['figure.figsize'] = 12, 8


Face Detection Data

Our dataset is provided by Dataturks, and it is hosted on Kaggle. Here’s an excerpt from the description:

Faces in images marked with bounding boxes. Have around 500 images with around 1100 faces manually tagged via bounding box.

I’ve downloaded the JSON file containing the annotations and uploaded it to Google Drive. Let’s get it:

!gdown --id 1K79wJgmPTWamqb04Op2GxW0SW9oxw8KS

Let’s load the file into a Pandas dataframe:

faces_df = pd.read_json('face_detection.json', lines=True)

Each line contains a single face annotation. Note that multiple lines might point to a single image (e.g. multiple faces per image).

Data Preprocessing

The dataset contains only image URLs and annotations. We’ll have to download the images. We’ll also normalize the annotations, so it’s easier to use them with Detectron2 later on:

os.makedirs("faces", exist_ok=True)

dataset = []

for index, row in tqdm(faces_df.iterrows(), total=faces_df.shape[0]):
    # Download the image and decode it with PIL
    img = urllib.request.urlopen(row["content"])
    img = Image.open(img)
    img = img.convert('RGB')

    image_name = f'face_{index}.jpeg'

    img.save(f'faces/{image_name}', "JPEG")

    annotations = row['annotation']
    for an in annotations:

      data = {}

      width = an['imageWidth']
      height = an['imageHeight']
      points = an['points']

      data['file_name'] = image_name
      data['width'] = width
      data['height'] = height

      # Points are stored as fractions of the image size; convert to pixels
      data["x_min"] = int(round(points[0]["x"] * width))
      data["y_min"] = int(round(points[0]["y"] * height))
      data["x_max"] = int(round(points[1]["x"] * width))
      data["y_max"] = int(round(points[1]["y"] * height))

      data['class_name'] = 'face'

      dataset.append(data)
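
To make the normalization step concrete, here is a minimal sketch of the coordinate conversion on its own. The annotation record below is hypothetical (the values are made up for illustration); only the field names imageWidth, imageHeight, and points come from the loop above:

```python
def to_pixel_box(annotation):
    """Convert the two relative corner points of an annotation to pixel coordinates."""
    width = annotation["imageWidth"]
    height = annotation["imageHeight"]
    top_left, bottom_right = annotation["points"][0], annotation["points"][1]
    return (
        int(round(top_left["x"] * width)),
        int(round(top_left["y"] * height)),
        int(round(bottom_right["x"] * width)),
        int(round(bottom_right["y"] * height)),
    )


# Hypothetical annotation: a face covering the middle of a 200x100 image
example_annotation = {
    "imageWidth": 200,
    "imageHeight": 100,
    "points": [{"x": 0.1, "y": 0.2}, {"x": 0.5, "y": 0.8}],
}

pixel_box = to_pixel_box(example_annotation)
print(pixel_box)  # (20, 20, 100, 80)
```

The result is an (x_min, y_min, x_max, y_max) tuple in absolute pixels, which is the same layout the dataframe columns use.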


Let’s put the data into a dataframe so we can have a better look:

df = pd.DataFrame(dataset)
print(df.file_name.unique().shape[0], df.shape[0])
409 1132

We have a total of 409 images (a lot less than the promised 500) and 1132 annotations. Let’s save them to the disk (so you might reuse them):

df.to_csv('annotations.csv', header=True, index=None)

Data Exploration

Let’s see some sample annotated data. We’ll use OpenCV to load an image, add the bounding boxes, and resize it. We’ll define a helper function to do it all:

def annotate_image(annotations, resize=True):
  file_name = annotations.file_name.to_numpy()[0]
  img = cv2.cvtColor(cv2.imread(f'faces/{file_name}'), cv2.COLOR_BGR2RGB)

  for i, a in annotations.iterrows():
    cv2.rectangle(img, (a.x_min, a.y_min), (a.x_max, a.y_max), (0, 255, 0), 2)

  if not resize:
    return img

  return cv2.resize(img, (384, 384), interpolation = cv2.INTER_AREA)

Let’s start by showing some annotated images:

img_df = df[df.file_name == df.file_name.unique()[0]]
img = annotate_image(img_df, resize=False)



img_df = df[df.file_name == df.file_name.unique()[1]]
img = annotate_image(img_df, resize=False)



Those are good ones, the annotations are clearly visible. We can use torchvision to create a grid of images. Note that the images are in various sizes, so we’ll resize them:

sample_images = [annotate_image(df[df.file_name == f]) for f in df.file_name.unique()[:10]]
sample_images = torch.as_tensor(sample_images)
sample_images.shape
torch.Size([10, 384, 384, 3])
sample_images = sample_images.permute(0, 3, 1, 2)
sample_images.shape
torch.Size([10, 3, 384, 384])
plt.figure(figsize=(24, 12))
grid_img = torchvision.utils.make_grid(sample_images, nrow=5)

plt.imshow(grid_img.permute(1, 2, 0))


You can clearly see that some annotations are missing (column 4). That’s real-life data for you. Sometimes you just have to deal with it in some way.

Face Detection with Detectron 2

It is time to go through the steps of fine-tuning a model using a custom dataset. But first, let’s save 5% of the data for testing:

df = pd.read_csv('annotations.csv')

IMAGES_PATH = 'faces'

unique_files = df.file_name.unique()

train_files = set(np.random.choice(unique_files, int(len(unique_files) * 0.95), replace=False))
train_df = df[df.file_name.isin(train_files)]
test_df = df[~df.file_name.isin(train_files)]

The classical train_test_split won’t work here, because we want to split by file name: all annotations of an image must end up in the same set.
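
As a minimal sketch of the same idea without Pandas or NumPy, the split can be done on the unique file names and then applied to the rows. The function name split_by_file, the seed, and the sample rows are assumptions made for illustration:

```python
import random


def split_by_file(file_names, train_fraction=0.95, seed=42):
    """Split unique file names so that no image appears in both sets."""
    unique_files = sorted(set(file_names))
    rng = random.Random(seed)
    rng.shuffle(unique_files)
    n_train = int(len(unique_files) * train_fraction)
    return set(unique_files[:n_train]), set(unique_files[n_train:])


# Hypothetical annotation rows: one file name per annotated face
rows = [
    "face_0.jpeg", "face_0.jpeg", "face_1.jpeg",
    "face_2.jpeg", "face_2.jpeg", "face_3.jpeg",
]

train_files, test_files = split_by_file(rows, train_fraction=0.75)

# Every annotation follows its image into exactly one of the two sets
train_rows = [r for r in rows if r in train_files]
test_rows = [r for r in rows if r in test_files]
```

Splitting on rows instead would let two faces from the same photo land on both sides, which leaks test images into training.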

The next parts are written in a bit more generic way. Obviously, we have a single class - face. But adding more should be as simple as adding more annotations to the dataframe:

classes = df.class_name.unique().tolist()

Next, we’ll write a function that converts our dataset into a format that is used by Detectron2:

def create_dataset_dicts(df, classes):
  dataset_dicts = []
  for image_id, img_name in enumerate(df.file_name.unique()):

    record = {}

    image_df = df[df.file_name == img_name]

    file_path = f'{IMAGES_PATH}/{img_name}'
    record["file_name"] = file_path
    record["image_id"] = image_id
    record["height"] = int(image_df.iloc[0].height)
    record["width"] = int(image_df.iloc[0].width)

    objs = []
    for _, row in image_df.iterrows():

      xmin = int(row.x_min)
      ymin = int(row.y_min)
      xmax = int(row.x_max)
      ymax = int(row.y_max)

      poly = [
          (xmin, ymin), (xmax, ymin),
          (xmax, ymax), (xmin, ymax)
      ]
      # Flatten the corner points into [x1, y1, x2, y2, ...]
      poly = list(itertools.chain.from_iterable(poly))

      obj = {
        "bbox": [xmin, ymin, xmax, ymax],
        "bbox_mode": BoxMode.XYXY_ABS,
        "segmentation": [poly],
        "category_id": classes.index(row.class_name),
        "iscrowd": 0
      }
      objs.append(obj)

    record["annotations"] = objs
    dataset_dicts.append(record)
  return dataset_dicts

We group the annotation rows by image, so each image becomes a single record with a list of annotations. You might also notice that we’re building a polygon that is of the exact same shape as the bounding box. This is required for the image segmentation models in Detectron2.
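
The polygon construction is easy to check in isolation. Below is a small sketch (the helper name box_to_polygon is made up for this example) that turns a box into the flat list of corner coordinates used for the segmentation field:

```python
import itertools


def box_to_polygon(xmin, ymin, xmax, ymax):
    """Return the box corners, clockwise from the top-left, as a flat [x, y, x, y, ...] list."""
    corners = [
        (xmin, ymin), (xmax, ymin),
        (xmax, ymax), (xmin, ymax),
    ]
    return list(itertools.chain.from_iterable(corners))


polygon = box_to_polygon(10, 20, 110, 220)
print(polygon)  # [10, 20, 110, 20, 110, 220, 10, 220]
```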

You’ll have to register your dataset into the dataset and metadata catalogues:

for d in ["train", "val"]:
  DatasetCatalog.register("faces_" + d, lambda d=d: create_dataset_dicts(train_df if d == "train" else test_df, classes))
  MetadataCatalog.get("faces_" + d).set(thing_classes=classes)

statement_metadata = MetadataCatalog.get("faces_train")

Unfortunately, an evaluator for the test set is not included by default. We can easily fix that by writing our own trainer:

class CocoTrainer(DefaultTrainer):

  @classmethod
  def build_evaluator(cls, cfg, dataset_name, output_folder=None):

    if output_folder is None:
        os.makedirs("coco_eval", exist_ok=True)
        output_folder = "coco_eval"

    return COCOEvaluator(dataset_name, cfg, False, output_folder)

The evaluation results will be stored in the coco_eval folder if no folder is provided.

Fine-tuning a Detectron2 model is nothing like writing PyTorch code. We’ll load a configuration file, change a few values, and start the training process. But hey, it really helps if you know what you’re doing 😂

For this tutorial, we’ll use the Mask R-CNN X101-FPN model. It is pre-trained on the COCO dataset and achieves very good performance. The downside is that it is slow to train.

Let’s load the config file and the pre-trained model weights:

cfg = get_cfg()


cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(

Specify the datasets (we registered those) we’ll use for training and evaluation:

cfg.DATASETS.TRAIN = ("faces_train",)
cfg.DATASETS.TEST = ("faces_val",)

And for the optimizer, we’ll do a bit of magic to converge to something nice:

cfg.SOLVER.BASE_LR = 0.001
cfg.SOLVER.MAX_ITER = 1500
cfg.SOLVER.STEPS = (1000, 1500)
cfg.SOLVER.GAMMA = 0.05

Apart from the standard stuff (batch size, max number of iterations, and learning rate) we have a couple of interesting params:

  • WARMUP_ITERS - the learning rate starts close to 0 and ramps up to the preset value over this number of iterations
  • STEPS - the iterations at which the learning rate is multiplied by GAMMA (and thus reduced)
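
To get a feeling for how these parameters interact, here is a rough sketch of a linear-warmup, multi-step schedule in plain Python. It is a simplified model, not Detectron2's actual scheduler code. BASE_LR, STEPS, and GAMMA are the values set above, while warmup_iters=500 and warmup_factor=0.001 are assumptions picked for illustration:

```python
from bisect import bisect_right


def learning_rate_at(iteration, base_lr=0.001, warmup_iters=500,
                     warmup_factor=0.001, steps=(1000, 1500), gamma=0.05):
    """Approximate the learning rate at a given training iteration."""
    if iteration < warmup_iters:
        # Linear ramp from warmup_factor * base_lr up to base_lr
        alpha = iteration / warmup_iters
        warmup = warmup_factor * (1 - alpha) + alpha
    else:
        warmup = 1.0
    # Multiply by gamma once for every step that has already been reached
    decay = gamma ** bisect_right(list(steps), iteration)
    return base_lr * warmup * decay


for it in (0, 250, 500, 999, 1000, 1500):
    print(it, learning_rate_at(it))
```

In this sketch the rate climbs during warmup, stays at the base value until iteration 1000, and is then scaled down by GAMMA at each step.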

Finally, we’ll specify the number of classes and the period at which we’ll evaluate on the test set:

cfg.MODEL.ROI_HEADS.NUM_CLASSES = len(classes)


Time to train, using our custom trainer:

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)

trainer = CocoTrainer(cfg)

Evaluating Object Detection Models

Evaluating object detection models is a bit different when compared to evaluating standard classification or regression models.

The main metric you need to know about is IoU (intersection over union). It measures the overlap between two boundaries - the predicted and ground truth one. It takes values between 0 (no overlap) and 1 (perfect overlap).

$$\text{IoU} = \frac{\text{area of overlap}}{\text{area of union}}$$

Using IoU, one can define a threshold (e.g. >0.5) to classify whether a prediction is a true positive (TP) or a false positive (FP).
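
As a small illustration, IoU for two axis-aligned boxes in (x_min, y_min, x_max, y_max) format can be computed like this. The box values and the 0.5 threshold are example choices, not numbers from the dataset:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping region (0 if the boxes don't touch)
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


predicted = (50, 50, 150, 150)
ground_truth = (100, 100, 200, 200)

score = iou(predicted, ground_truth)
is_true_positive = score > 0.5
print(score, is_true_positive)
```

Here the boxes share only a quarter of their area each, so the IoU is 1/7 and the prediction would count as a false positive at a 0.5 threshold.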

Now you can calculate average precision (AP) by taking the area under the precision-recall curve.

AP@X (e.g. AP50) is simply AP computed at a fixed IoU threshold X (0.5 for AP50). This should give you a working understanding of how object detection models are evaluated.
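
Here is a minimal sketch of the AP computation for a single class at one IoU threshold. It assumes each detection has already been marked as TP or FP, and it uses the plain area under the (monotone) precision-recall curve. The COCO evaluator samples the curve at fixed recall points instead, so its numbers can differ slightly. The function name and the example detections are made up for illustration:

```python
def average_precision(scores, is_tp, num_ground_truth):
    """Area under the precision-recall curve for one class at one IoU threshold."""
    # Process detections from the most to the least confident
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    tp = fp = 0
    precisions, recalls = [], []
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)

    # Make precision non-increasing as recall grows (removes the zigzag)
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])

    # Sum the rectangles under the curve
    ap, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


# Three hypothetical detections for an image set containing two faces
ap = average_precision(
    scores=[0.9, 0.8, 0.7],
    is_tp=[True, False, True],
    num_ground_truth=2,
)
print(ap)
```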

I suggest you read the mAP (mean Average Precision) for Object Detection tutorial by Jonathan Hui if you want to learn more on the topic.

I’ve prepared a pre-trained model for you, so you don’t have to wait for the training to complete. Let’s download it:

!gdown --id 18Ev2bpdKsBaDufhVKf0cT6RmM3FjW3nL
!mv face_detector.pth output/model_final.pth

We can start making predictions by loading the model and setting a minimum threshold of 85% certainty at which we’ll consider the predictions as correct:

cfg.MODEL.WEIGHTS = os.path.join(cfg.OUTPUT_DIR, "model_final.pth")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.85
predictor = DefaultPredictor(cfg)

Let’s run the evaluator with the trained model:

evaluator = COCOEvaluator("faces_val", cfg, False, output_dir="./output/")
val_loader = build_detection_test_loader(cfg, "faces_val")
inference_on_dataset(trainer.model, val_loader, evaluator)

Finding Faces in Images

Next, let’s create a folder and save all images with predicted annotations in the test set:

os.makedirs("annotated_results", exist_ok=True)

test_image_paths = test_df.file_name.unique()
for image_name in test_image_paths:
  file_path = f'{IMAGES_PATH}/{image_name}'
  im = cv2.imread(file_path)
  outputs = predictor(im)
  # Visualizer expects RGB, while OpenCV loads images as BGR
  v = Visualizer(
    im[:, :, ::-1],
    metadata=statement_metadata
  )
  instances = outputs["instances"].to("cpu")
  v = v.draw_instance_predictions(instances)
  # Convert back to BGR before saving with OpenCV
  result = v.get_image()[:, :, ::-1]
  file_name = ntpath.basename(image_name)
  write_res = cv2.imwrite(f'annotated_results/{file_name}', result)

Let’s have a look:

annotated_images = [f'annotated_results/{f}' for f in test_df.file_name.unique()]
img = cv2.cvtColor(cv2.imread(annotated_images[0]), cv2.COLOR_BGR2RGB)



img = cv2.cvtColor(cv2.imread(annotated_images[1]), cv2.COLOR_BGR2RGB)



img = cv2.cvtColor(cv2.imread(annotated_images[3]), cv2.COLOR_BGR2RGB)



img = cv2.cvtColor(cv2.imread(annotated_images[4]), cv2.COLOR_BGR2RGB)



Not bad. Not bad at all. I suggest you explore more images on your own, too!

Note that some faces have multiple bounding boxes (on the second image) with different degrees of certainty. Maybe training the model longer will help? How about adding more or augmenting the existing data?


Congratulations! You now know the basics of Detectron2 for object detection! You might be surprised by the results, given the small dataset we have. That’s the power of large pre-trained models for you 😍

You learned how to:

  • prepare a custom dataset for face detection with Detectron2
  • use (close to) state-of-the-art models for object detection to find faces in images
  • build a foundation that you can extend to face recognition