TL;DR Learn how to build a reproducible ML pipeline using DVC and Python. You’ll build an end-to-end example with 2 experiments and compare model evaluation metrics between them.
In this tutorial, you’ll build a complete reproducible ML pipeline with Python and DVC. The approach is ML library/toolkit agnostic, but we’ll use scikit-learn.
Here’s what we’ll go over:
Imagine that a paper is proposing a new method for solving a task and the main objective is improved by 10%. WOW! New SOTA! Or is it?
Reproducing the experiments is the only way to see for yourself. As a bonus, you’ll get a deeper understanding of the method. But how easy is it to do?
Unfortunately, many authors don’t include their source code when publishing a paper. The reproducibility crisis is real! To combat this, some major ML conferences (NeurIPS and ICML) have some requirements to ensure reproducibility. The reproducibility checklist is one effort to summarise the main points. Things are getting better, but improvements are still needed.
Experimenting with ML boils down to writing and reading (a lot of) code. And what do you do when you want to find the truth? You go to the source! The source code (Yeah, I am watching too much Dom Mazzetti).
All of this is great, but should you care? After all, you’re using ML in the real world!
You should care even more! The only good way to check if a piece of code is doing what the author intended is to show it to a lot of people. ML projects involve a lot more than “regular” code, though. Making your experiments hard to reproduce is a sure way to make someone give up on the review and go with a “f*ck it, I am out”.
Ok, how do you make your experiments reproducible?
DVC stands for Data Version Control. It is a free and open-source project that helps you version control your experiments, store large files (on a variety of storage services), track metrics, and build completely reproducible pipelines.
DVC doesn’t really handle big file storage for you. It stores metafiles that point to the location of the files. Those places are known as remotes. Here are some of the remotes that DVC supports:
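To make the “metafile that points to the data” idea concrete, here is a rough sketch in plain Python. It hashes a data file and writes a tiny pointer file that could be committed to Git instead of the data itself. The function names (file_md5, write_pointer) and the JSON layout are hypothetical and only illustrate the concept; DVC’s real metafiles are YAML and carry more information.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never have to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path: Path) -> Path:
    """Write a small, Git-friendly pointer file next to the data file.

    The JSON layout is hypothetical; it only illustrates the idea.
    """
    pointer = {"path": data_path.name, "md5": file_md5(data_path)}
    pointer_path = data_path.parent / (data_path.name + ".pointer.json")
    pointer_path.write_text(json.dumps(pointer, indent=2))
    return pointer_path


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in data file; the contents are made up for the example.
    data_file = Path(tmp) / "udemy_courses.csv"
    data_file.write_bytes(b"course_id,num_subscribers\n1,100\n")
    pointer_file = write_pointer(data_file)
    pointer = json.loads(pointer_file.read_text())
    print(pointer)
```

The pointer stays a few bytes long no matter how big the data file gets, which is why it is cheap to version alongside the code.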
We’ll have a look at a complete ML experiment and integrate it with DVC.
The data we’re going to use is a set of Udemy course listings - 3,682 courses from 4 different subjects. The objective is to predict the number of students for each course.
Pretty much every ML pipeline can be boiled down to the following steps (this can be a never-ending cycle):
In this example, we’ll skip the deployment altogether and focus on experimenting.
One of the good things about DVC is that you can put off the integration until the very end of your first experiment. We’ll do just that - start with a plain old Python project.
Here’s the initial file structure:
```
.
├── assets (dir)
├── Pipfile
├── Pipfile.lock
└── studentpredictor (dir)
```
The studentpredictor directory will hold the source code, while assets will contain data and DVC related files.
We’ll manage the dependencies using Pipenv. Here are the contents of the Pipfile:
```toml
[[source]]
name = "pypi"
url = "https://pypi.org/simple"
verify_ssl = true

[dev-packages]
black = "==19.10b0"
isort = "*"
flake8 = "*"

[packages]
dvc = "*"
gdown = "*"
pandas = "*"
scikit-learn = "*"

[requires]
python_version = "3.8"
```
Run this command in the root of your project once you add the file:
```bash
pipenv install --dev
```
We’ll store the config as source code in the studentpredictor/config.py file:
```python
from pathlib import Path


class Config:
    RANDOM_SEED = 42
    ASSETS_PATH = Path("./assets")
    ORIGINAL_DATASET_FILE_PATH = ASSETS_PATH / "original_dataset" / "udemy_courses.csv"
    DATASET_PATH = ASSETS_PATH / "data"
    FEATURES_PATH = ASSETS_PATH / "features"
    MODELS_PATH = ASSETS_PATH / "models"
    METRICS_FILE_PATH = ASSETS_PATH / "metrics.json"
```
The first step is to get the dataset. I’ve already uploaded the CSV file to Google Drive. Add the studentpredictor/create_dataset.py file with the following contents:
```python
import gdown
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

from config import Config

np.random.seed(Config.RANDOM_SEED)

# Make sure the target directories exist before writing into them.
Config.ORIGINAL_DATASET_FILE_PATH.parent.mkdir(parents=True, exist_ok=True)
Config.DATASET_PATH.mkdir(parents=True, exist_ok=True)

# Download the raw CSV from Google Drive.
gdown.download(
    "https://drive.google.com/uc?id=1gkYBOIMm8pAGunRoI3OzQHQrgOdaRjfC",
    str(Config.ORIGINAL_DATASET_FILE_PATH),
)

df = pd.read_csv(str(Config.ORIGINAL_DATASET_FILE_PATH))

# A fixed random_state keeps the 80/20 split identical between runs.
df_train, df_test = train_test_split(
    df, test_size=0.2, random_state=Config.RANDOM_SEED,
)

df_train.to_csv(str(Config.DATASET_PATH / "train.csv"), index=False)
df_test.to_csv(str(Config.DATASET_PATH / "test.csv"), index=False)
```
We make all necessary directories and split the data into train and test. The resulting data frames are saved as CSV.
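Why bother passing a seed to the split? Here is a small standard-library sketch of the idea. It is a stand-in for train_test_split, not how scikit-learn implements it, and split_rows is a hypothetical helper: with a seeded random generator, shuffling and splitting the same rows gives the same result on every run.

```python
import random


def split_rows(rows, test_size=0.2, seed=42):
    """Shuffle a copy of rows with a private, seeded RNG and split it."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    # The first n_test shuffled rows become the test set, the rest the train set.
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))
train_a, test_a = split_rows(rows, seed=42)
train_b, test_b = split_rows(rows, seed=42)

# Same seed, same input: the two splits are identical.
print(train_a == train_b and test_a == test_b)
print(len(train_a), len(test_a))
```

Without a fixed seed, every run could train and evaluate on different rows, and the metrics of two experiments would not be comparable.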
We’ll do some simple feature engineering to keep this part easy to understand. Create the studentpredictor/create_features.py file and fill it with this:
```python
from datetime import date

import pandas as pd

from config import Config

Config.FEATURES_PATH.mkdir(parents=True, exist_ok=True)

train_df = pd.read_csv(str(Config.DATASET_PATH / "train.csv"))
test_df = pd.read_csv(str(Config.DATASET_PATH / "test.csv"))


def extract_features(df):
    # Keep only the calendar date of the publication timestamp.
    df["published_timestamp"] = pd.to_datetime(df.published_timestamp).dt.date
    # Note: date.today() makes this feature depend on the day the script runs.
    df["days_since_published"] = (date.today() - df.published_timestamp).dt.days
    return df[["num_lectures", "price", "days_since_published", "content_duration"]]


train_features = extract_features(train_df)
test_features = extract_features(test_df)

train_features.to_csv(str(Config.FEATURES_PATH / "train_features.csv"), index=False)
test_features.to_csv(str(Config.FEATURES_PATH / "test_features.csv"), index=False)

# The label is the number of subscribers for each course.
train_df.num_subscribers.to_csv(
    str(Config.FEATURES_PATH / "train_labels.csv"), index=False
)
test_df.num_subscribers.to_csv(
    str(Config.FEATURES_PATH / "test_labels.csv"), index=False
)
```
The only real feature we’re creating is the days_since_published. We get it from the published date of the course. We’re saving the features and labels as CSV files.
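The script measures the days relative to date.today(), so the feature values shift every day you rerun it. If you want the feature itself to be reproducible, one option is to pin the reference date. A minimal sketch, assuming ISO-style timestamps such as "2017-01-18T20:58:58Z"; REFERENCE_DATE and the helper function are hypothetical and not part of the project:

```python
from datetime import date, datetime


def days_since_published(published_timestamp: str, reference: date) -> int:
    """Days between an ISO timestamp's calendar date and a reference date."""
    # The first 10 characters of an ISO timestamp are YYYY-MM-DD.
    published = datetime.strptime(published_timestamp[:10], "%Y-%m-%d").date()
    return (reference - published).days


# Hypothetical values, for illustration only.
REFERENCE_DATE = date(2020, 6, 1)
example = days_since_published("2017-01-18T20:58:58Z", REFERENCE_DATE)
print(example)
```

Such a constant could live in the Config class, so that changing it is tracked like any other code change.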
We’ll start with a baseline model. In this case - Linear Regression. Put this into studentpredictor/train_model.py:
```python
import pickle

import pandas as pd
from sklearn.linear_model import LinearRegression

from config import Config

Config.MODELS_PATH.mkdir(parents=True, exist_ok=True)

X_train = pd.read_csv(str(Config.FEATURES_PATH / "train_features.csv"))
y_train = pd.read_csv(str(Config.FEATURES_PATH / "train_labels.csv"))

model = LinearRegression()
# ravel() turns the single-column label frame into a 1D array.
model = model.fit(X_train, y_train.to_numpy().ravel())

# The context manager closes the file even if pickling fails.
with open(str(Config.MODELS_PATH / "model.pickle"), "wb") as outfile:
    pickle.dump(model, outfile)
```
We dump the trained model with pickle. Ready to evaluate that bad boy!
We’ll focus on two metrics: RMSE and R². Here is the studentpredictor/evaluate_model.py file:
```python
import json
import pickle

import pandas as pd
from sklearn.metrics import mean_squared_error

from config import Config

X_test = pd.read_csv(str(Config.FEATURES_PATH / "test_features.csv"))
y_test = pd.read_csv(str(Config.FEATURES_PATH / "test_labels.csv"))

with open(str(Config.MODELS_PATH / "model.pickle"), "rb") as infile:
    model = pickle.load(infile)

# For regressors, score() returns the coefficient of determination (R^2).
r_squared = model.score(X_test, y_test)

y_pred = model.predict(X_test)
# mean_squared_error returns the MSE; take the square root to get the RMSE.
rmse = mean_squared_error(y_test, y_pred) ** 0.5

with open(str(Config.METRICS_FILE_PATH), "w") as outfile:
    json.dump(dict(r_squared=float(r_squared), rmse=float(rmse)), outfile)
```
We’re writing the resulting metrics to a JSON file. How are we going to use that? More on that later.
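If the two metrics are new to you, here is what they compute, written out in plain Python on made-up numbers. RMSE is the square root of the mean squared error, and R² compares the model’s squared error with that of always predicting the mean. The function names and sample values are only for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: the typical size of a prediction error."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def r_squared(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up subscriber counts and predictions.
y_true = [100, 200, 300, 400]
y_pred = [110, 190, 310, 390]

print(rmse(y_true, y_pred))
print(r_squared(y_true, y_pred))
```

RMSE is expressed in the same unit as the target (students, in our case), so lower is better. For R², closer to 1 is better, and 0 means the model is no better than predicting the mean.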
The project structure should now look like this:
```
.
├── assets (dir)
├── Pipfile
├── Pipfile.lock
└── studentpredictor
    ├── config.py
    ├── create_dataset.py
    ├── create_features.py
    ├── evaluate_model.py
    └── train_model.py
```
You’ll interact with DVC mostly via the CLI. It is a tool that plays nicely with Git (it understands tags and branches) and is language agnostic.
Initialize DVC in the root of your project with dvc init, and add remote storage (local in this case):
```bash
dvc remote add -d localremote /tmp/dvc-storage
```
Disable analytics (optional):
```bash
dvc config core.analytics false
```
This is a good place for a checkpoint:
```bash
git add .
git commit -m "Add DVC config"
git push
```
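We’re ready to build the pipeline. DVC creates a graph with dependencies and outputs for each stage.
We’ll use dvc run to make each step reproducible (this command belongs to older DVC releases; newer ones define stages with dvc stage add and a dvc.yaml file). Let’s start with the dataset: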
```bash
dvc run -f assets/data.dvc \
    -d studentpredictor/create_dataset.py \
    -o assets/data \
    python studentpredictor/create_dataset.py
```
Let’s dissect what is happening here:
-f assets/data.dvc saves the metafile used by DVC to reproduce this step
-d studentpredictor/create_dataset.py adds this script as a dependency for this step
-o assets/data tells DVC that the outputs will be stored in that directory
Finally, we invoke the script that will do the actual work.
The stage for feature creation looks like this:
```bash
dvc run -f assets/features.dvc \
    -d studentpredictor/create_features.py \
    -d assets/data \
    -o assets/features \
    python studentpredictor/create_features.py
```
Importantly, we add assets/data as a dependency for this step. This will force the execution of the previous step if something has changed.
You can probably figure out the training stage:
```bash
dvc run -f assets/models.dvc \
    -d studentpredictor/train_model.py \
    -d assets/features \
    -o assets/models \
    python studentpredictor/train_model.py
```
The final stage - evaluation:
```bash
dvc run -f assets/evaluate.dvc \
    -d studentpredictor/evaluate_model.py \
    -d assets/features \
    -d assets/models \
    -M assets/metrics.json \
    python studentpredictor/evaluate_model.py
```
You’ll note that this step doesn’t specify outputs. Instead, we have -M assets/metrics.json. This tells DVC that this is a metrics file (JSON and text files are currently supported).
Your first DVC pipeline is complete. Let’s save the progress:
```bash
git add .
git commit -m "Linear Regression experiment with DVC"
git push
```
We’ll also create a tag for the experiment (you’ll see why in a second):
```bash
git tag -a "lr-experiment" -m "Experiment with Linear Regression"
```
Now we can use some DVC magic to see the evaluation metrics for our model:
```bash
dvc metrics show -T
```
This should output something like this:
```
lr-experiment:
    assets/metrics.json:
        r_squared: 0.03570513102945361
        rmse: 6777.509886999257
```
Why did we do all this work? Was it all worth it?
Let’s start a second experiment with a Random Forest regressor. Replace the contents of studentpredictor/train_model.py with:
```python
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

from config import Config

Config.MODELS_PATH.mkdir(parents=True, exist_ok=True)

X_train = pd.read_csv(str(Config.FEATURES_PATH / "train_features.csv"))
y_train = pd.read_csv(str(Config.FEATURES_PATH / "train_labels.csv"))

# Seeding the forest keeps the trained model identical between runs.
model = RandomForestRegressor(
    n_estimators=150, max_depth=6, random_state=Config.RANDOM_SEED
)
model = model.fit(X_train, y_train.to_numpy().ravel())

# The context manager closes the file even if pickling fails.
with open(str(Config.MODELS_PATH / "model.pickle"), "wb") as outfile:
    pickle.dump(model, outfile)
```
Let’s reproduce the complete pipeline using the new regressor:
```bash
dvc repro assets/evaluate.dvc
```
DVC is smart enough to rerun only the steps that have changed and rewrite its internal graph.
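To get a feel for how this selective rerun can work, here is a toy model of the four stages in plain Python. It is a simplified sketch, not DVC’s implementation: it assumes that a stage that reruns always produces changed outputs, and the STAGES table and stages_to_rerun function are made up for the illustration.

```python
from typing import Dict, List, Set

# Each stage lists what it reads (deps) and what it writes (outs),
# mirroring the four stages defined earlier. Order is upstream to downstream.
STAGES: Dict[str, Dict[str, List[str]]] = {
    "data": {
        "deps": ["studentpredictor/create_dataset.py"],
        "outs": ["assets/data"],
    },
    "features": {
        "deps": ["studentpredictor/create_features.py", "assets/data"],
        "outs": ["assets/features"],
    },
    "models": {
        "deps": ["studentpredictor/train_model.py", "assets/features"],
        "outs": ["assets/models"],
    },
    "evaluate": {
        "deps": [
            "studentpredictor/evaluate_model.py",
            "assets/features",
            "assets/models",
        ],
        "outs": ["assets/metrics.json"],
    },
}


def stages_to_rerun(changed: Set[str]) -> List[str]:
    """Return stages whose deps changed, directly or via an upstream output."""
    dirty = set(changed)
    rerun = []
    for name, stage in STAGES.items():  # relies on the upstream-first order above
        if dirty.intersection(stage["deps"]):
            rerun.append(name)
            # Simplification: assume a rerun stage changes all of its outputs.
            dirty.update(stage["outs"])
    return rerun


print(stages_to_rerun({"studentpredictor/train_model.py"}))
```

Editing train_model.py marks only the training and evaluation stages as stale. The dataset and feature stages are left alone, so the data is not downloaded or processed again.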
Let’s save the second experiment:
```bash
git add .
git commit -m "Add Random Forest experiment"
git push
```
and create a tag for it:
```bash
git tag -a "rf-experiment" -m "Experiment with Random Forest"
```
We can now compare the two experiments:
```bash
dvc metrics show -T
```
```
lr-experiment:
    assets/metrics.json:
        r_squared: 0.03570513102945361
        rmse: 6777.509886999257
rf-experiment:
    assets/metrics.json:
        r_squared: 0.15391037892455683
        rmse: 6348.533500735664
```
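Raw numbers with this many decimals are hard to compare by eye. A few lines of Python turn the two results into deltas. The values are copied from the output above; the dictionary layout and variable names are just one way to arrange them.

```python
# Metrics copied from the `dvc metrics show -T` output.
metrics = {
    "lr-experiment": {"r_squared": 0.03570513102945361, "rmse": 6777.509886999257},
    "rf-experiment": {"r_squared": 0.15391037892455683, "rmse": 6348.533500735664},
}

baseline, candidate = metrics["lr-experiment"], metrics["rf-experiment"]

for name in ("r_squared", "rmse"):
    delta = candidate[name] - baseline[name]
    print(f"{name}: {baseline[name]:.4f} -> {candidate[name]:.4f} ({delta:+.4f})")

# Relative RMSE reduction of the Random Forest over the baseline.
rmse_reduction = 1 - candidate["rmse"] / baseline["rmse"]
print(f"RMSE reduced by {rmse_reduction:.1%}")
```

The Random Forest lowers the RMSE by roughly 6% and raises R² from about 0.04 to about 0.15. That is better than the baseline, but the model still explains only a small part of the variance.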
You can do the same thing with branches, too (if that is your thing).
You can now build complete, reproducible ML pipelines with Python and DVC. Note that you can do it with any ML library/toolkit. How would you apply this to your experiments?
Here’s what we did:
Do you make your experiments reproducible? How do you do it? How do you track your metrics? I am waiting for your answers in the comments below!