DVC, short for Data Version Control, is an experiment management tool for ML projects. DVC is built on top of Git, and its main goal is to codify data, models, and pipelines through the command line.
This is possible because DVC replaces large files (such as datasets and ML models) with small metafiles that point to the original data. These metafiles can then be kept alongside the project’s source code in a repository, while the large files themselves live in remote storage.
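For example, after DVC starts tracking a file, it writes a small YAML metafile next to it, and that metafile is what goes into Git. The hash and size values below are illustrative:

```yaml
# data.xml.dvc — committed to Git in place of the real file
outs:
- md5: 22a1a2931c8370d3aeedd7183606fd7f   # content hash of data.xml (illustrative)
  size: 14445097                          # file size in bytes (illustrative)
  path: data.xml                          # the large file, stored remotely
```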
Since DVC works on top of Git, its syntax and workflow are similar, so you can treat data and model versioning just as you do code versioning. Although DVC can work stand-alone, it is highly recommended to use it alongside Git.
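As a sketch of that Git-like workflow (assuming `dvc` is installed and you are inside an existing Git repository; the file path and remote URL are hypothetical):

```shell
dvc init                        # set up DVC in the repo
dvc add data/raw.csv            # track a large file; writes data/raw.csv.dvc
git add data/raw.csv.dvc data/.gitignore
git commit -m "Track raw dataset with DVC"
dvc remote add -d storage s3://my-bucket/dvc-store   # hypothetical remote
dvc push                        # upload the actual data to the remote
```

Notice the Git-shaped verbs: `dvc add` and `dvc push` mirror `git add` and `git push`, but they move the data itself while Git only sees the tiny `.dvc` metafile.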
DVC can also manage the project’s pipelines to make experiments reproducible for every team member. These pipelines are lightweight and are defined as dependency graphs.
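A pipeline is declared in a `dvc.yaml` file, where each stage lists its command, its dependencies, and its outputs; the scripts and paths below are hypothetical:

```yaml
# dvc.yaml — each stage is a node in the dependency graph
stages:
  prepare:
    cmd: python prepare.py data/raw.csv data/prepared.csv
    deps:
      - prepare.py
      - data/raw.csv
    outs:
      - data/prepared.csv
  train:
    cmd: python train.py data/prepared.csv model.pkl
    deps:
      - train.py
      - data/prepared.csv
    outs:
      - model.pkl
```

Running `dvc repro` walks this graph and re-executes only the stages whose dependencies have changed.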
It’s important to note that DVC is free, open-source, and platform agnostic: it works across operating systems, programming languages, and ML libraries.
If you train models without a feature store, your setup might look something like this:
If your models pull data directly from a data store, you often have annoying duplication
Every model has to access the data and do some transformation to turn it into features, which the model then uses for training.
There’s a lot of duplication in this process — many of the models use many of the same features.
This duplication is one problem a feature store can solve. Every feature can be stored, versioned, and organized in your feature store. This pre-prepared data can then easily be used to train other models in the future. As a result, you’ll avoid recomputing the same features repeatedly. The data you used to train your model will also be available, and the entire training pipeline will be easier to reproduce.
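The caching idea behind this can be shown in a few lines of Python. This is a toy sketch, not a real feature store API: the class and method names are made up for illustration.

```python
from typing import Callable, Dict, List, Tuple

class ToyFeatureStore:
    """Toy illustration: compute each (feature, version) once,
    then let every model reuse the stored result."""

    def __init__(self) -> None:
        self._features: Dict[Tuple[str, str], List] = {}

    def get_or_compute(self, name: str, version: str,
                       compute: Callable[[], List]) -> List:
        key = (name, version)
        if key not in self._features:      # transformation runs only on first request
            self._features[key] = compute()
        return self._features[key]

store = ToyFeatureStore()
raw = [1, 2, 3, 4]

# Two "models" ask for the same feature; the transformation runs once.
f1 = store.get_or_compute("squared", "v1", lambda: [x * x for x in raw])
f2 = store.get_or_compute("squared", "v1", lambda: [x * x for x in raw])
assert f1 == f2 == [1, 4, 9, 16]
```

The version key matters: bumping `"v1"` to `"v2"` when the transformation changes is what keeps old training runs reproducible.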
A feature store breaks the coupling between your models and your data, leading to less duplication
Until recently, feature stores have mainly been used in internal machine learning platforms, such as Uber’s Michelangelo. If you wanted to use a feature store outside a large corporation, you’d have to build your own from scratch. Luckily, the open-source community is already changing that, but the options are still somewhat limited.
DVC isn’t fully comparable to a feature store, although properly versioning your feature files can help solve some of the same issues.
Overall, DVC is a much lower-level solution than FEAST — it stores versions of large data efficiently. This can include your raw data, your features, and even your final model files.
DVC keeps all the different versions of your data, features, and model
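For instance, to restore the feature files exactly as they were at an earlier Git tag (the tag and file names here are hypothetical):

```shell
git checkout v1.0 -- features.dvc   # get the old metafile from Git
dvc checkout features.dvc           # pull the matching data version into the workspace
```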
Because DVC isn’t specifically built as a feature store, it’s missing many of the features you find in platforms like FEAST, especially when it comes to stream processing. Using a Git-like model for version control makes a lot of sense for batch processing, but for machine learning systems that ingest live data (for example, routing systems that take live traffic into account, or fraud detection systems that have to decide whether to block a specific transaction within milliseconds), it can be trickier to keep track of everything.
Platforms like FEAST support online and offline feature stores, using faster, key-value based stores when timing is more important and slower, more structured offline stores for keeping track of historical data over the years. While you could certainly implement something similar on top of DVC, it would take significant custom engineering work compared to using a specialized feature store.
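The online/offline split can be sketched in plain Python. This is a toy contrast, not FEAST’s actual API: the online side is a flat key-value map serving only the latest value, while the offline side keeps a sorted, timestamped history for point-in-time lookups. The keys and values are made up.

```python
from bisect import bisect_right

# Online store: plain key-value map, serves only the latest feature value.
online = {"user:42:txn_count": 7}

# Offline store: append-only (timestamp, value) history per key,
# kept sorted so we can ask "what was the value at time t?".
offline = {"user:42:txn_count": [(100, 3), (200, 5), (300, 7)]}

def latest(key: str):
    """Fast lookup for serving a live model."""
    return online[key]

def as_of(key: str, ts: int):
    """Point-in-time lookup for building historical training sets."""
    history = offline[key]
    i = bisect_right(history, (ts, float("inf")))  # last entry at or before ts
    return history[i - 1][1] if i else None

assert latest("user:42:txn_count") == 7
assert as_of("user:42:txn_count", 250) == 5   # the value as it was at time 250
```

The point-in-time lookup is what prevents training-time leakage: a model trained on data labeled at time 250 must see the feature value 5, not the current value 7.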
Connect with me on Medium.