Introduction
Keeping track of machine learning experiments, including an evolving code base, dataset and configuration files, can be a challenge. However, being able to reproduce results obtained in the past is an important part of keeping AI workflows transparent and building trust in their solutions. In this tutorial, we will show you how to quickly adapt your environment, code and workflow to automate the logging of all relevant information and create a transparent, reproducible pipeline.
Tutorial tested with DVC 2.9.3 and MLflow 1.22.0.
Process overview
There are many tools and many possible configurations for achieving our goal. Here we chose a simple combination: DVC as a datastore and MLflow as an experiment tracker. Using a datastore requires almost no change to your code and is simply a way to version control your data; instead of referring to your data file directly in the code, you refer to it through the DVC API. Using MLflow requires a little more work, as you need to define in your code what an experiment is and what needs to be logged for each one. This is only a couple of lines of code, and most of the logging can be automated when using common frameworks such as PyTorch or TensorFlow.
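To make concrete what an experiment tracker fundamentally records per run, here is a minimal, dependency-free sketch (illustrative only; the function, file name and fields below are our own invention, and MLflow handles all of this, plus a UI, for us later):

```python
import json
import time
import uuid

def log_run(experiment, params, metrics, store="runs.jsonl"):
    """Append one experiment run (parameters + metrics) to a JSON-lines file."""
    record = {
        "experiment": experiment,
        "run_id": uuid.uuid4().hex,
        "timestamp": time.time(),
        "params": params,    # e.g. data version, hyperparameters
        "metrics": metrics,  # e.g. model scores
    }
    with open(store, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

run = log_run("tuto", {"data_version": "v1", "alpha": 1}, {"score": 0.9})
```

An experiment tracker is essentially this record-keeping, made queryable and shareable.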
DVC as a datastore
Repository setup
The role of a datastore is to store and manage collections of data. DVC works on top of Git and is language- and framework-agnostic. It can store data locally or in storage providers such as AWS S3, Azure Blob Storage, SFTP and HDFS; in our case, we will store it locally. To avoid performing diffs on large and potentially binary files, DVC instead creates an MD5 hash of each file, and those hashes are versioned by Git. Here is how to get started:
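Conceptually, the hash DVC computes is just a content fingerprint: identical files yield identical digests, and any modification yields a new one. A minimal sketch using Python's standard library (DVC's actual implementation differs, e.g. it also handles directories):

```python
import hashlib

def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks
    so that arbitrarily large files fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

This is why Git only ever sees a short, fixed-size string per data file, no matter how large the file is.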
$ git init
$ dvc init
$ git commit -m "initialize dvc"
Just like Git, DVC stores all relevant information in the hidden folder .dvc, so the last step adds this folder to our Git repository (including the DVC config file .dvc/config).
$ dvc remote add -d remote /path/to/remote
$ git commit .dvc/config -m "added remote storage"
That last step allows us to define a folder to push the data to, so that several members of the same project can access consistent data versions, as we will see in a bit. Assuming your dataset currently lives in data/mydata.h5:
$ dvc add data/mydata.h5
$ git add data/.gitignore data/mydata.h5.dvc
$ git commit -m "start tracking dataset"
$ git tag -a "v1" -m "dataset description"
$ dvc push   # optional: only if you defined a remote folder earlier
This created a data/mydata.h5.dvc file containing the MD5 hash and added the actual data file to data/.gitignore. We will see how to interact with the datastore shortly, but first let's assume you have a new version of the dataset, called data/mydata_v2.h5. You can now overwrite data/mydata.h5 with the latest dataset and do
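For reference, the generated data/mydata.h5.dvc file is a small, human-readable YAML pointer; it looks roughly like this (the hash and size below are made up for illustration):

```yaml
outs:
- md5: 22a1a2931c8370d3aeedd7183606fd7f
  size: 123456
  path: mydata.h5
```

This pointer file is what Git versions; the large data file itself stays out of Git's history.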
$ dvc add data/mydata.h5
$ git add data/mydata.h5.dvc
$ git commit -m "data: updated dataset"
$ git tag -a "v2" -m "describe the changes"
$ dvc push   # optional: only if you defined a remote folder earlier
Accessing the data
Accessing a given version directly from the code
Once different versions of the dataset are stored by DVC, you no longer need to maintain all the datasets locally by hand. Using the DVC API, you can refer directly to a specific version of the dataset in your code as follows (DVC also provides dvc.api.open if you prefer to read the file contents as a file object):
import dvc.api

path = "data/mydata.h5"
repo = "path/to/git/repo"
version = "v1"  # this could also be the associated git commit hash

data_url = dvc.api.get_url(
    path=path,
    repo=repo,
    rev=version,
)
...
data = read_data(data_url)  # just change "data/mydata.h5" to "data_url" in that case
Reverting the dataset on the filesystem to a previous version
In case you'd like to derive a new dataset from a past version (e.g., obtain a "v2b" in our example), you first need to pull the version you'd like to start from out of DVC, update it as you like and push the new version back to DVC with the appropriate tag:
$ git checkout tags/v1   # we want to get data/mydata.h5.dvc associated with v1
$ dvc pull               # we pull the actual dataset associated with data/mydata.h5.dvc
$ git checkout master    # back on the latest commit
... (make any changes to the dataset you want, based on that original version)
(the next few steps are the same as when we updated the dataset above)
$ dvc add data/mydata.h5
$ git add data/mydata.h5.dvc
$ git commit -m "data: updated dataset"
$ git tag -a "v2b" -m "describe the changes"
$ dvc push   # optional: only if you defined a remote folder earlier
MLflow to track experiments
Now that we have a way to track and refer to different versions of our dataset, we are ready to set up a complete experiment tracking framework with MLflow.
import dvc.api
import mlflow

path = "data/mydata.h5"
repo = "path/to/git/repo"
version = "v1"  # this could also be the associated git commit hash

data_url = dvc.api.get_url(path=path, repo=repo, rev=version)
data = read_data(data_url)

# do the processing you need
nrows = len(data)
alpha, beta = 1, 0
score = myscoringfunction(data, alpha, beta)

mlflow.set_experiment("tuto")
with mlflow.start_run():
    mlflow.log_param("data_url", data_url)
    mlflow.log_param("data_version", version)
    mlflow.log_param("nrows", nrows)
    mlflow.log_param("alpha", alpha)
    mlflow.log_param("beta", beta)
    mlflow.log_metric("score", score)
After running the above experiment with different dataset versions (by changing the "version" variable) and model parameters alpha and beta, we can check the outcome of the experiments by opening the MLflow UI ($ mlflow ui --port 1234). As you can see, both data_url and data_version are saved for each run, giving us confidence about which dataset led to those results. (Note that for the UI to be accessible, you need to open an SSH tunnel to the login node you run the MLflow UI from.)
As the path and dataset versions are stored within our git repository, we can explore our commit messages to get more information about each dataset version.
$ git log --oneline --grep="data"
12e4749 (HEAD -> master, tag: v2b) data: alternate dataset using source 2
1ca651e (tag: v2) data: updated dataset with data from source 1
ab5c48f (tag: v1) start tracking dataset
Each run can be explored individually and as much information as you want can be stored either in the form of parameters or artifacts.
Exploring the full capabilities of MLflow is beyond the scope of this tutorial, but good resources can easily be found online. The main point of this tutorial was to demonstrate the use of MLflow in combination with DVC as a datastore. The main benefit compared to logging a simple path to the dataset is that a bare path, not associated with a version control commit, makes it impossible to ensure the dataset hasn't been modified since the model was run. Using DVC in this way allows us to keep an immutable history of the dataset for future inspection.
Other resources
- A video demo of this workflow https://www.youtube.com/watch?v=W2DvpCYw22o
- An MLflow x Optuna tutorial https://github.com/StefanieStoppel/pytorch-mlflow-optuna/tree/tutorial-basics