Versioning data seems to be one of the least popular practices in machine learning and data science projects. Fortunately, more people and organizations are starting to version their data, which includes both data sets and trained models. In software engineering, we routinely version our code with Git, for example, but the traditional version control systems used for regular software projects aren't quite sufficient for machine learning: they need to be able to track the data sets and resulting models along with the code itself.
Example structure of data sets and models without any data version control system.
Let’s start with the version control system; for the purpose of this blog post, we will stick to Git. Git is software for tracking changes in any set of files, usually used for coordinating work among programmers collaboratively developing source code during software development [1]. In a nutshell, there’s a central code repository representing the current state of a project. We can copy the project, make some changes locally and push them to the central repository; once the code is reviewed and accepted, it’s deployed to production. We’d like to see similar conventions and standards in data science and machine learning projects, but there this procedure is not yet as obvious and popular as it is in software development.
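The Git flow described above can be sketched with a few commands. This is a minimal local illustration (the file name, commit message and identity are made up); in a real project the repository would be cloned from and pushed back to a central server.

```shell
# Minimal local sketch of the Git workflow: create a repository,
# record a change, inspect the history. All names are hypothetical.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "dev@example.com"   # hypothetical identity
git config user.name  "Dev"
echo "print('training...')" > train.py     # hypothetical project file
git add train.py
git commit -q -m "Add training script"
git log --oneline                          # one commit recorded
```

In a team setting, the same `add`/`commit` steps would be followed by `git push` to the central repository and a code review before deployment.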
In machine learning, a single project can involve hundreds of experiments, and this is where problems arise.
Data version control is a set of tools and processes that tries to adapt the version control process to the data world. Having systems in place that allow people to work quickly and pick up where others have left off would increase the speed and quality of delivered results. It would enable people to manage data transparently, run experiments effectively, and collaborate with others [2].
What is DVC?
DVC stands for Data Version Control. It’s a command-line tool written in Python that helps data scientists manage, track and version data and models, as well as run reproducible experiments.
Basically, it’s a version control system for machine learning projects; you can think of DVC as a kind of Git for ML [3].
It can track large data files (e.g. 10 GB data sets or ML models), version machine learning pipelines, and organize and run experiments, making them self-descriptive and self-documenting. In other words, we know how a model was produced, which commands we need to run to reproduce the experiment, and what metrics the experiment produced as a result.
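As a sketch of that basic workflow, the following commands initialize DVC inside an existing Git repository and put a large data file under version control (the path `data/train.csv` is a hypothetical example; this assumes `dvc` is installed):

```shell
# Start tracking a large data file with DVC inside a Git repository.
# data/train.csv is a hypothetical example path.
dvc init
dvc add data/train.csv     # stores the file in DVC's cache and writes a
                           # small data/train.csv.dvc pointer file
git add data/train.csv.dvc data/.gitignore
git commit -m "Track training data with DVC"
```

Git only versions the tiny `.dvc` pointer file, while the actual data lives in DVC's cache and, optionally, in remote storage.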
– Git-compatible – DVC runs on top of any Git repository
– Storage agnostic – Amazon S3, Google Drive, Google Cloud Storage, SSH/SFTP, HTTP, local or network-attached storage, etc.
– Reproducible – the single ‘dvc repro’ command reproduces experiments end-to-end
– Low-friction branching – DVC fully supports instantaneous Git branching, even with large files
– Metric tracking – DVC includes a command to list all branches, along with metric values
– ML pipeline framework – DVC has a built-in way to connect ML steps into a DAG (directed acyclic graph)
– Language- and framework-agnostic – Python, R, Julia, Scala Spark, Notebooks, flatfiles/TensorFlow, PyTorch, etc.
– HDFS, Hive & Apache Spark – include Spark and Hive jobs in the DVC data versioning
– Track failures – retaining knowledge of failed attempts can save time in the future
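Several of the features above come together in a typical session: configuring a remote, defining a pipeline stage, reproducing it, and inspecting metrics. This is a hedged sketch; the bucket name, stage name, and file paths are all hypothetical, and it assumes a DVC version that provides `dvc stage add` (older releases used `dvc run` instead).

```shell
# Hypothetical remote, stage, and file names, for illustration only.
dvc remote add -d storage s3://my-bucket/dvc-store   # default remote storage
dvc push                                             # upload tracked data/models

# Define a pipeline stage: dependencies, outputs, and a metrics file.
dvc stage add -n train \
    -d train.py -d data/train.csv \
    -o model.pkl -M metrics.json \
    python train.py

dvc repro            # reproduce the pipeline end-to-end
dvc metrics show     # display the recorded metric values
```

Because the stage records its dependencies and outputs, `dvc repro` only re-runs the steps whose inputs have actually changed.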
There are several alternatives to DVC (e.g. Git LFS, MLflow, Pachyderm, Delta Lake, Neptune, lakeFS or Allegro Trains / ClearML). Some of them are more extensive and complete than others, while some lack certain functionalities. What distinguishes DVC from the crowd is its ease of use, its Git-like functionality, and the possibility of integrating it into ML pipelines as part of a larger system. It is definitely worth checking out!