Curiousily

Reproducible Machine Learning and Experiment Tracking Pipeline with Python and DVC

22.05.2020 — Deep Learning, Machine Learning, DVC, Reproducibility — 5 min read

TL;DR Learn how to build a reproducible ML pipeline using DVC and Python. You’ll build an end-to-end example with 2 experiments and compare model evaluation metrics between them.

In this tutorial, you’ll build a complete reproducible ML pipeline with Python and DVC. The approach is ML library/toolkit agnostic, but we’ll use scikit-learn.

Source code on GitHub

Here’s what we’ll go over:

Why must your work be reproducible?
Overview of DVC
Create a new ML project from scratch
Add the first (baseline) experiment
Add DVC to the project
Build a complete ML pipeline
Add a second experiment
Compare the evaluation metrics between experiments

Reproducibility crisis?

Imagine that a paper is proposing a new method for solving a task and the main objective is improved by 10%. WOW! New SOTA! Or is it?

Reproducing the experiments is the only way to see for yourself. As a bonus, you’ll get a deeper understanding of the method. But how easy is it to do?

Unfortunately, many authors don’t include their source code when publishing a paper. The reproducibility crisis is real! To combat this, some major ML conferences (NeurIPS and ICML) have introduced requirements to ensure reproducibility. The reproducibility checklist is one effort to summarise the main points. Things are getting better, but improvements are still needed.

Experimenting with ML boils down to writing and reading (a lot of) code. And what do you do when you want to find the truth? You go to the source! The source code (Yeah, I am watching too much Dom Mazzetti).

Reproducibility in the real world (a.k.a. your work)

All of this is great, but should you care? After all, you’re using ML in the real world!

You should care even more! The only good way to check if a piece of code is doing what the author intended is to show it to a lot of people. ML projects involve a lot more than “regular” code, though. Making your experiments hard to reproduce is a sure way to make someone give up on the review and go with a “f*ck it, I am out”.

Ok, how do you make your experiments reproducible?

Reproducing ML experiments with DVC

DVC stands for Data Version Control. It is a free and open-source project that helps you version control your experiments, store large files (on a variety of storage services), track metrics, and build completely reproducible pipelines.

Remotes

DVC doesn’t keep your big files in Git. It stores small metafiles that point to the location of the actual files. Those locations are known as remotes. Here are some of the remotes that DVC supports:

Amazon S3
Microsoft Azure Blob Storage
Google Drive
Google Cloud Storage
SSH
Hadoop Distributed File System (HDFS)
HTTP and HTTPS protocols
Directory on your file system (local)
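
To make the idea of a metafile that points at a file more concrete, here is a minimal sketch in plain Python. It is not DVC’s implementation: the helper names (file_hash, make_pointer), the pointer layout, and the choice of SHA-256 are assumptions for illustration. It only shows the general idea of hashing the file contents and keeping a small record that identifies the data, while the data itself can live somewhere else.

```python
import hashlib
import tempfile
from pathlib import Path


def file_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files don't have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_pointer(path: Path) -> dict:
    """Hypothetical pointer record: small enough to commit to Git."""
    return {"path": path.name, "hash": file_hash(path), "size": path.stat().st_size}


CONTENT = b"course_id,num_subscribers\n1,100\n"  # stand-in for a large dataset

with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "udemy_courses.csv"
    data_file.write_bytes(CONTENT)
    pointer = make_pointer(data_file)

print(pointer)
```

The record stays the same size no matter how big the data file gets, which is why it can sit comfortably next to your source code.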

End-to-end example

We’ll have a look at a complete ML experiment and integrate it with DVC.

The data we’re going to use is a list of Udemy courses: 3,682 course listings from 4 different subjects. The objective is to predict the number of students for each course.

Pretty much every ML pipeline can be boiled down to the following steps (this can be a never-ending cycle):

Create dataset
Create features
Train a model
Evaluate the model
Deploy the model (if better than previous)

In this example, we’ll skip the deployment altogether and focus on experimenting.

The first experiment

One of the good things about DVC is that you can put off the integration until the very end of your first experiment. We’ll do just that - start with a plain old Python project.

Here’s the initial file structure:

.
├── assets (dir)
├── Pipfile
├── Pipfile.lock
└── studentpredictor (dir)

The studentpredictor directory will hold the source code, while assets will contain data and DVC related files.

We’ll manage the dependencies using Pipenv. Here are the contents of the Pipfile:

[[source]]
name = "pypi"
url = "https://pypi.org/simple"
verify_ssl = true

[dev-packages]
black = "==19.10b0"
isort = "*"
flake8 = "*"

[packages]
dvc = "*"
gdown = "*"
pandas = "*"
scikit-learn = "*"

[requires]
python_version = "3.8"

Run this command in the root of your project once you add the file:

pipenv install --dev

We’ll store the config as source code in the studentpredictor/config.py file:

from pathlib import Path


class Config:
    RANDOM_SEED = 42
    ASSETS_PATH = Path("./assets")
    ORIGINAL_DATASET_FILE_PATH = ASSETS_PATH / "original_dataset" / "udemy_courses.csv"
    DATASET_PATH = ASSETS_PATH / "data"
    FEATURES_PATH = ASSETS_PATH / "features"
    MODELS_PATH = ASSETS_PATH / "models"
    METRICS_FILE_PATH = ASSETS_PATH / "metrics.json"
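
Since every path hangs off ASSETS_PATH, the scripts never hard-code a location. The short sketch below repeats a trimmed copy of the class to show how the "/" operator of pathlib composes the paths. Note that the paths are relative, so they resolve against the directory you run the scripts from (the project root in this tutorial).

```python
from pathlib import Path


class Config:
    # Trimmed copy of the config above, for illustration only
    ASSETS_PATH = Path("./assets")
    DATASET_PATH = ASSETS_PATH / "data"
    METRICS_FILE_PATH = ASSETS_PATH / "metrics.json"


# The "/" operator joins path segments; the leading "./" is normalized away
print(Config.DATASET_PATH.as_posix())
print((Config.DATASET_PATH / "train.csv").as_posix())

# Relative paths depend on the current working directory
print(Config.METRICS_FILE_PATH.is_absolute())
```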

Create your dataset

The first step is to get the dataset. I’ve already uploaded the CSV file to Google Drive. Add the studentpredictor/create_dataset.py file with the following contents:

import gdown
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

from config import Config

np.random.seed(Config.RANDOM_SEED)

# Make sure the target directories exist
Config.ORIGINAL_DATASET_FILE_PATH.parent.mkdir(parents=True, exist_ok=True)
Config.DATASET_PATH.mkdir(parents=True, exist_ok=True)

# Download the original CSV file from Google Drive
gdown.download(
    "https://drive.google.com/uc?id=1gkYBOIMm8pAGunRoI3OzQHQrgOdaRjfC",
    str(Config.ORIGINAL_DATASET_FILE_PATH),
)

df = pd.read_csv(str(Config.ORIGINAL_DATASET_FILE_PATH))

# A fixed random_state gives the same train/test split on every run
df_train, df_test = train_test_split(
    df, test_size=0.2, random_state=Config.RANDOM_SEED,
)

df_train.to_csv(str(Config.DATASET_PATH / "train.csv"), index=False)
df_test.to_csv(str(Config.DATASET_PATH / "test.csv"), index=False)

We make all necessary directories and split the data into train and test. The resulting data frames are saved as CSV.
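
Fixing random_state is what makes the split repeatable: the same seed yields the same train and test rows on every run. The sketch below illustrates that idea with the standard library only. It is not how scikit-learn implements train_test_split, and the helper seeded_split is a hypothetical stand-in.

```python
import random


def seeded_split(rows, test_size=0.2, seed=42):
    """Shuffle a copy of rows with a private, seeded RNG and cut off the test part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))
train_a, test_a = seeded_split(rows, seed=42)
train_b, test_b = seeded_split(rows, seed=42)

# Same seed, same split
print(train_a == train_b and test_a == test_b)
# Every row lands in exactly one of the two parts
print(sorted(train_a + test_a) == rows)
```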

Create features

We’ll do some simple feature engineering to keep this part easy to understand. Create the studentpredictor/create_features.py file and fill it with this:

from datetime import date

import pandas as pd

from config import Config

Config.FEATURES_PATH.mkdir(parents=True, exist_ok=True)

train_df = pd.read_csv(str(Config.DATASET_PATH / "train.csv"))
test_df = pd.read_csv(str(Config.DATASET_PATH / "test.csv"))


def extract_features(df):
    # Keep only the date part of the publication timestamp
    df["published_timestamp"] = pd.to_datetime(df.published_timestamp).dt.date
    # Note: date.today() makes this feature depend on the day the script runs
    df["days_since_published"] = (date.today() - df.published_timestamp).dt.days
    return df[["num_lectures", "price", "days_since_published", "content_duration"]]


train_features = extract_features(train_df)
test_features = extract_features(test_df)

train_features.to_csv(str(Config.FEATURES_PATH / "train_features.csv"), index=False)
test_features.to_csv(str(Config.FEATURES_PATH / "test_features.csv"), index=False)

# The label is the number of subscribers of each course
train_df.num_subscribers.to_csv(
    str(Config.FEATURES_PATH / "train_labels.csv"), index=False
)
test_df.num_subscribers.to_csv(
    str(Config.FEATURES_PATH / "test_labels.csv"), index=False
)

The only real feature we’re creating is the days_since_published. We get it from the published date of the course. We’re saving the features and labels as CSV files.
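
One thing to keep in mind: date.today() ties days_since_published to the day the script runs, so the feature values shift between runs. If that matters for your experiments, one option is to pin a reference date. The sketch below shows the idea with the standard library. REFERENCE_DATE and days_since are hypothetical names, the date is an arbitrary choice (the publication date of this post), and the timestamp format is assumed to start with an ISO date (YYYY-MM-DD).

```python
from datetime import date, datetime

# Hypothetical fixed reference point; pick whatever suits your dataset
REFERENCE_DATE = date(2020, 5, 22)


def days_since(published_timestamp: str, reference: date = REFERENCE_DATE) -> int:
    """Days between a timestamp (assumed to start with YYYY-MM-DD) and the reference date."""
    published = datetime.strptime(published_timestamp[:10], "%Y-%m-%d").date()
    return (reference - published).days


print(days_since("2020-05-01T10:00:00Z"))  # 21
```

With a pinned date, running the feature stage tomorrow produces the same CSV files as running it today.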

Train a model

We’ll start with a baseline model. In this case - Linear Regression. Put this into studentpredictor/train_model.py:

import pickle

import pandas as pd
from sklearn.linear_model import LinearRegression

from config import Config

Config.MODELS_PATH.mkdir(parents=True, exist_ok=True)

X_train = pd.read_csv(str(Config.FEATURES_PATH / "train_features.csv"))
y_train = pd.read_csv(str(Config.FEATURES_PATH / "train_labels.csv"))

model = LinearRegression()
model = model.fit(X_train, y_train.to_numpy().ravel())

# Use a context manager so the file is closed after writing
with open(str(Config.MODELS_PATH / "model.pickle"), "wb") as outfile:
    pickle.dump(model, outfile)

We dump the trained model with pickle. Ready to evaluate that bad boy!

Evaluation

We’ll focus on two metrics: RMSE and $R^2$. Here is the studentpredictor/evaluate_model.py file:

import json
import pickle

import pandas as pd
from sklearn.metrics import mean_squared_error

from config import Config

X_test = pd.read_csv(str(Config.FEATURES_PATH / "test_features.csv"))
y_test = pd.read_csv(str(Config.FEATURES_PATH / "test_labels.csv"))

with open(str(Config.MODELS_PATH / "model.pickle"), "rb") as infile:
    model = pickle.load(infile)

r_squared = model.score(X_test, y_test)

y_pred = model.predict(X_test)
# mean_squared_error returns the MSE; take the square root to get the RMSE
rmse = mean_squared_error(y_test, y_pred) ** 0.5

with open(str(Config.METRICS_FILE_PATH), "w") as outfile:
    json.dump(dict(r_squared=r_squared, rmse=rmse), outfile)

We’re writing the resulting metrics to a JSON file. How are we going to use that? More on that later.
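
For reference, RMSE is the square root of the mean squared error, and $R^2$ is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes both by hand on made-up numbers (not taken from the Udemy dataset), so you can see what the two values in the metrics file mean.

```python
import math


def rmse(y_true, y_pred):
    """Square root of the mean of the squared errors."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot: improvement over always predicting the mean."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [3, 5, 7]  # toy values
y_pred = [2, 5, 9]

print(round(rmse(y_true, y_pred), 3))  # 1.291
print(r_squared(y_true, y_pred))  # 0.375
```

RMSE is in the same unit as the target (number of students here), while $R^2$ is unitless. A model that always predicts the mean scores an $R^2$ of 0.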

The project structure should now look like this:

.
├── assets (dir)
├── Pipfile
├── Pipfile.lock
└── studentpredictor
    ├── config.py
    ├── create_dataset.py
    ├── create_features.py
    ├── evaluate_model.py
    └── train_model.py

Adding DVC

You’ll interact with DVC mostly via the CLI. It is a tool that plays nicely with Git (it understands tags and branches) and is language agnostic.

Initialize DVC

dvc init

and add remote storage (local in this case)

dvc remote add -d localremote /tmp/dvc-storage

disable analytics (optional)

dvc config core.analytics false

This is a good place for a checkpoint:

git add .
git commit -m "Add DVC config"
git push

Building a pipeline

We’re ready to build the pipeline. DVC creates a graph with dependencies and outputs for each stage.

We’ll use dvc run to make each step reproducible. Let’s start with the dataset:

dvc run -f assets/data.dvc \
    -d studentpredictor/create_dataset.py \
    -o assets/data \
    python studentpredictor/create_dataset.py

Let’s dissect what is happening here:

-f assets/data.dvc saves the metafile used by DVC to reproduce this step
-d studentpredictor/create_dataset.py adds this script as a dependency for this step
-o assets/data tells that the outputs will be stored in that directory

Finally, we invoke the script that will do the actual work.
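
Conceptually, a stage only needs to rerun when one of its dependencies has changed since the last run. The sketch below illustrates that check with content hashes. It is a simplified model, not DVC’s code: fingerprint and needs_rerun are hypothetical helpers, and the choice of SHA-256 is an assumption.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint(paths):
    """Map each dependency path to a hash of its current contents."""
    return {str(p): hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in paths}


def needs_rerun(paths, recorded):
    """A stage is stale when any dependency hash differs from the recorded one."""
    return fingerprint(paths) != recorded


with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "create_dataset.py"
    script.write_text("print('v1')\n")

    recorded = fingerprint([script])  # what a metafile would remember
    unchanged = needs_rerun([script], recorded)

    script.write_text("print('v2')\n")  # edit the dependency
    changed = needs_rerun([script], recorded)

print(unchanged, changed)  # False True
```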

The stage for feature creation looks like this:

dvc run -f assets/features.dvc \
    -d studentpredictor/create_features.py \
    -d assets/data \
    -o assets/features \
    python studentpredictor/create_features.py

Importantly, we add assets/data as a dependency for this step. This will force the execution of the previous step if something has changed.

You can probably figure out the training stage:

dvc run -f assets/models.dvc \
    -d studentpredictor/train_model.py \
    -d assets/features \
    -o assets/models \
    python studentpredictor/train_model.py

The final stage - evaluation:

dvc run -f assets/evaluate.dvc \
    -d studentpredictor/evaluate_model.py \
    -d assets/features \
    -d assets/models \
    -M assets/metrics.json \
    python studentpredictor/evaluate_model.py

Note that this stage doesn’t specify outputs with -o. Instead, -M assets/metrics.json tells DVC that this file is a metrics file (JSON and text files are currently supported).
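
The four stages now form a small dependency graph. As a rough mental model (not DVC’s internals), the sketch below writes the graph down as a dictionary and derives an order in which every stage runs after the stages it depends on. The stage names are shorthand for the .dvc files created above, and execution_order is a hypothetical helper.

```python
# Each stage maps to the stages whose outputs it consumes
STAGES = {
    "data": [],
    "features": ["data"],
    "models": ["features"],
    "evaluate": ["features", "models"],
}


def execution_order(stages):
    """Depth-first topological sort: dependencies come before dependents."""
    order, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in stages[name]:
            visit(dep)
        order.append(name)

    for name in stages:
        visit(name)
    return order


print(execution_order(STAGES))  # ['data', 'features', 'models', 'evaluate']
```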

Your first DVC pipeline is complete. Let’s save the progress:

git add .
git commit -m "Linear Regression experiment with DVC"
git push

We’ll also create a tag for the experiment (you’ll see why in a second):

git tag -a "lr-experiment" -m "Experiment with Linear Regression"

Now we can use some DVC magic to see the evaluation metrics for our model:

dvc metrics show -T

This should output something like this:

lr-experiment:
    assets/metrics.json:
        r_squared: 0.03570513102945361
        rmse: 6777.509886999257

Experimenting with Random Forest

Why did we do all this work? Was it all worth it?

Let’s start a second experiment with a Random Forest regressor. Replace the contents of studentpredictor/train_model.py:

import pickle

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

from config import Config

Config.MODELS_PATH.mkdir(parents=True, exist_ok=True)

X_train = pd.read_csv(str(Config.FEATURES_PATH / "train_features.csv"))
y_train = pd.read_csv(str(Config.FEATURES_PATH / "train_labels.csv"))

# Seed the forest so repeated runs train the same model
model = RandomForestRegressor(
    n_estimators=150, max_depth=6, random_state=Config.RANDOM_SEED
)
model = model.fit(X_train, y_train.to_numpy().ravel())

# Use a context manager so the file is closed after writing
with open(str(Config.MODELS_PATH / "model.pickle"), "wb") as outfile:
    pickle.dump(model, outfile)

Let’s reproduce the complete pipeline using the new regressor:

dvc repro assets/evaluate.dvc

DVC is smart enough to rerun only the steps that have changed and rewrite its internal graph.
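
Here, only train_model.py changed, so the dataset and feature stages are left alone, while training and evaluation run again. The sketch below models that propagation under simplified assumptions. The graph layout and the stale_stages helper are illustrative and not part of DVC.

```python
# Stage -> (script it depends on, upstream stages it depends on)
STAGES = {
    "data": ("create_dataset.py", []),
    "features": ("create_features.py", ["data"]),
    "models": ("train_model.py", ["features"]),
    "evaluate": ("evaluate_model.py", ["features", "models"]),
}


def stale_stages(stages, changed_files):
    """A stage is stale if its script changed or any upstream stage is stale."""
    stale = set()

    def is_stale(name):
        script, upstream = stages[name]
        # Check every upstream stage (no short-circuit) so all of them get marked
        upstream_stale = [is_stale(dep) for dep in upstream]
        if script in changed_files or any(upstream_stale):
            stale.add(name)
        return name in stale

    for name in stages:
        is_stale(name)
    return [name for name in stages if name in stale]


print(stale_stages(STAGES, {"train_model.py"}))  # ['models', 'evaluate']
```

Changing create_dataset.py instead would mark all four stages as stale, since everything downstream depends on the dataset.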

Let’s save the second experiment:

git add .
git commit -m "Add Random Forest experiment"
git push

and create a tag for it:

git tag -a "rf-experiment" -m "Experiment with Random Forest"

We can now compare the two experiments:

dvc metrics show -T

lr-experiment:
    assets/metrics.json:
        r_squared: 0.03570513102945361
        rmse: 6777.509886999257
rf-experiment:
    assets/metrics.json:
        r_squared: 0.15391037892455683
        rmse: 6348.533500735664
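
To put numbers on the comparison, the sketch below computes the change in each metric between the two tags. The values are copied from the output above. The dictionary layout and the metric_deltas helper are assumptions for illustration.

```python
# Metrics copied from the `dvc metrics show -T` output above
metrics = {
    "lr-experiment": {"r_squared": 0.03570513102945361, "rmse": 6777.509886999257},
    "rf-experiment": {"r_squared": 0.15391037892455683, "rmse": 6348.533500735664},
}


def metric_deltas(baseline, candidate):
    """Difference (candidate - baseline) for every metric both runs share."""
    return {k: candidate[k] - baseline[k] for k in baseline if k in candidate}


deltas = metric_deltas(metrics["lr-experiment"], metrics["rf-experiment"])
for name, delta in deltas.items():
    print(f"{name}: {delta:+.4f}")
```

The Random Forest raises $R^2$ by about 0.118 and lowers RMSE by about 429 students compared to the Linear Regression baseline.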

You can do the same thing with branches, too (if that is your thing).

Summary

You can now build complete reproducible ML pipelines with Python and DVC. Note that you can do it with any ML library/toolkit. How would you apply this to your experiments?

Here’s what we did:

Why must your work be reproducible?
Overview of DVC
Create a new ML project from scratch
Add the first (baseline) experiment
Add DVC to the project
Build a complete ML pipeline
Add a second experiment
Compare the evaluation metrics between experiments

Do you make your experiments reproducible? How do you do it? How do you track your metrics? I am waiting for your answers in the comments below!
