In software development, version control manages different versions of code so that engineers can track changes and reproduce earlier results. Git, usually together with a hosting service such as GitHub, is one of the most common tools, and it works well for the text files in which most software code is stored. Version control for machine learning (ML) is similar but more complex, because we need to store not only the models but also the data and the pipelines that combine them (model + data). ML version control is particularly important in highly regulated industries such as healthcare and insurance, where reproducibility and model compliance are requirements for machine learning systems.
The data can come in many types and formats, such as numerical data, text, or video, and often in much larger volumes than code. Before data can be used for training, it also needs to be cleaned, formatted, and engineered. We therefore have to manage the versions of the data used for training, validating, and testing models. We also have to track how each data version is combined with each model version. That is where pipelines come into play: a pipeline records the code versions, data versions, libraries, system requirements, and configuration needed to reproduce a model.
Data Version Control (DVC) is an open-source machine learning platform, also known as Git for ML. DVC focuses on data versioning, but it can also track pipelines and experiments. It can version and organize large amounts of data and store them in a well-organized, accessible way. With DVC, we do not need to rebuild previous versions of models or redo the data preprocessing to get the same results. We can also share data and ML models through cloud storage, which makes it easier for team members to run experiments and optimize the model with the shared data.
DVC can be installed from conda-forge, for example with mamba:
$ conda install -c conda-forge mamba
$ mamba install -c conda-forge dvc
DVC commands are similar to Git commands. In this post, I will introduce the 10 essential commands used to initialize, manage, and share DVC projects.
DVC initialization depends on Git: we first need to initialize a Git repository and then initialize DVC inside it.
# create a new folder for the project (my-dvc-project is a placeholder name)
$ mkdir my-dvc-project
# go to the project folder
$ cd my-dvc-project
# create a Git repository
$ git init
# initialize DVC inside it
$ dvc init
After DVC is initialized, the project folder contains a directory called .dvc, which stores the configuration file, the cache location, and other internal directories.
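dvc init also stages the files it creates in Git, so a natural next step is to commit them. Here is a minimal sketch, assuming git and dvc are installed and that we are inside the project folder; commit_dvc_setup is a hypothetical helper name used only for illustration, not a DVC command.

```shell
# Hypothetical helper (not part of DVC): commit the files that `dvc init` staged.
commit_dvc_setup() {
  # Refuse to run outside an initialized DVC project.
  [ -d .dvc ] || { echo "no .dvc directory here; run 'dvc init' first" >&2; return 1; }
  git status --short                # shows the files staged by dvc init
  git commit -m "Initialize DVC"    # record the DVC setup in the Git history
}
```

Calling commit_dvc_setup right after dvc init records the DVC setup as its own commit, which keeps later data changes easy to spot in the Git history.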
The get-url command downloads a file or directory from a supported URL (for example ssh:// and other protocols) into the local file system.
$ dvc get-url https://raw.githubusercontent.com/curran/data/gh-pages/Rdatasets/csv/COUNT/fishing.csv
To start tracking a file or directory, use dvc add:
$ dvc add path.csv
DVC stores information about the added file in a special .dvc file, here named path.csv.dvc. It is a small text file in a human-readable format. This metadata file is a placeholder for the original data and can easily be versioned like source code with Git.
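Tracking a file and then versioning its .dvc placeholder with Git can be sketched as a small shell function. This is only an illustration under a few assumptions: track_data_file is a hypothetical helper name, the commit message is an example, and the project is already initialized with both Git and DVC.

```shell
# Hypothetical helper (not part of DVC): track one data file with DVC
# and version its metadata file with Git.
track_data_file() {
  data_file="$1"
  dvc add "$data_file" || return 1   # caches the data and writes "$data_file.dvc"
  cat "$data_file.dvc"               # the metadata file is plain, readable text
  # dvc add also lists the data file in a .gitignore next to it, so Git
  # versions only the small placeholder and not the data itself.
  git add "$data_file.dvc" "$(dirname "$data_file")/.gitignore" &&
    git commit -m "Track $data_file with DVC"
}
```

For the file used in this post, the call would be track_data_file path.csv.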
We can use the dvc status command to check for changes in the project’s pipelines and files, either between the cache and the workspace or between the cache and remote storage.
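As a rough sketch, the two comparisons can be run one after the other. The --cloud flag asks dvc status to compare the cache against remote storage, which assumes a default remote is already configured; check_dvc_status is a hypothetical helper name.

```shell
# Hypothetical helper (not part of DVC): run both status comparisons.
check_dvc_status() {
  dvc status &&          # pipelines, and cache vs. workspace
    dvc status --cloud   # cache vs. remote storage (needs a configured remote)
}
```

Running the local check first is a deliberate choice here: it needs no network access, so it fails fast if the workspace itself is out of date.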
DVC remotes provide a location to store and share data and models with a team, or to keep a copy in remote storage.
# add an S3 location as a remote named myremote; -d makes it the default remote
$ dvc remote add -d myremote s3://path.csv
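Once a remote is configured, sharing typically takes two more steps: commit the remote settings, which dvc remote add writes to .dvc/config, and upload the cached data with dvc push. A teammate who clones the Git repository can then download the data with dvc pull. Below is a minimal sketch, where publish_data and fetch_data are hypothetical helper names and the commit message is only an example.

```shell
# Hypothetical helpers (not part of DVC) for sharing data through a remote.

# Commit the remote configuration, then upload cached data to the remote.
publish_data() {
  git add .dvc/config &&
    git commit -m "Configure remote storage" &&
    dvc push
}

# Download the data referenced by the .dvc files in the current checkout.
fetch_data() {
  dvc pull
}
```

The author of the data would call publish_data once the remote is set up, and a teammate would call fetch_data after cloning the repository.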