Machine Learning News Hubb

Data Version Control for Machine Learning Models | by Nicole T. | Jan, 2023

January 13, 2023


Photo by Author at Oceanside Pier, CA

In software development, version control manages different versions of code so that engineers can keep track of changes and reproduce earlier states. Git, usually hosted on a service such as GitHub, is one of the most common tools, and it works well for the text files in which most software code is stored. Version control for machine learning (ML) is similar, but it is more complex in one respect: we need to store not only the models but also the data and the pipeline (model + data). ML version control is particularly important for highly regulated industries such as healthcare and insurance, because reproducibility and model compliance are requirements for machine learning systems in these industries.

https://dvc.org/doc/use-cases/versioning-data-and-models

The data can come in many types and formats, such as numerical data, text, or video, and often in much larger volumes than code. In addition, before data can be used for training, it needs to be cleaned, formatted, and engineered. We therefore have to manage versions of the data used for model training, validation, and testing. We also have to track how each data version is combined with each model version. That is where pipelines come into play. A pipeline records the code versions, data versions, libraries, system requirements, and configuration needed to reproduce a model.

Data Version Control (DVC) is an open-source version control tool for machine learning projects, sometimes described as Git for ML. DVC focuses on data versioning, but it can also track pipelines and experiments. It handles the versioning and organization of large amounts of data and stores them in a well-organized, accessible way. With DVC, we don’t need to rebuild previous versions of a model or redo the data preprocessing to get the same results. We can also share data and ML models through cloud storage, which makes it easier for team members to run experiments and optimize the model with the shared data.


For detailed installation instructions, see the DVC website (https://dvc.org/). I installed DVC with conda (via mamba) so that I can also use it as a Python library.

$ conda install -c conda-forge mamba
$ mamba install -c conda-forge dvc

DVC commands are similar to Git commands. In this post, I introduce 10 essential commands used to initialize, manage, and share DVC projects, starting with the first five.

1. init

By default, DVC initialization depends on Git, so we first initialize a Git repository and then initialize DVC.


# create a new folder for the project
mkdir DVC_test
# go to the project folder
cd DVC_test

# create a Git repository
git init

# initialize DVC inside the Git repository
dvc init

After DVC is initialized, the project folder contains a directory called .dvc, which stores the configuration file, the cache location, and other internal directories.
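To check that both initialization steps took effect, the project folder can be inspected for the directories they create. The following is a minimal Python sketch, assuming the default layout in which git init creates .git and dvc init creates .dvc with a config file inside; the function name describe_project is hypothetical and is not part of DVC.

```python
from pathlib import Path


def describe_project(root: Path) -> dict[str, bool]:
    """Report which initialization markers exist under root (hypothetical helper)."""
    dvc_dir = root / ".dvc"
    return {
        # .git can be a directory or, in some setups, a file, so only test existence
        "git": (root / ".git").exists(),
        # dvc init creates the .dvc directory ...
        "dvc": dvc_dir.is_dir(),
        # ... and places the project configuration file inside it
        "dvc_config": (dvc_dir / "config").is_file(),
    }
```

Calling describe_project(Path("DVC_test")) after the commands above should report all three markers as present if both tools initialized successfully.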

2. get-url

The get-url command downloads a file or directory from a supported URL (for example s3://, ssh://, and other protocols) into the local file system.

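# the raw.githubusercontent.com address serves the CSV itself; the github.com/.../blob/... page is HTML
$ dvc get-url https://raw.githubusercontent.com/curran/data/gh-pages/Rdatasets/csv/COUNT/fishing.csv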
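One detail worth noting when downloading from GitHub over HTTP: a file page whose address contains /blob/ returns an HTML page rather than the file contents, so a downloader needs the raw address instead. The sketch below shows that conversion in Python, assuming the usual github.com/<user>/<repo>/blob/<branch>/<path> shape; the helper name github_blob_to_raw is hypothetical and is not a DVC function.

```python
from urllib.parse import urlsplit, urlunsplit


def github_blob_to_raw(url: str) -> str:
    """Convert a GitHub 'blob' page URL to its raw-content URL (hypothetical helper).

    URLs that do not match the expected shape are returned unchanged.
    """
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Expected shape: <user>/<repo>/blob/<branch>/<path...>
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        return url
    user, repo, _blob, *rest = segments  # rest = [branch, path parts...]
    raw_path = "/" + "/".join([user, repo, *rest])
    return urlunsplit((parts.scheme, "raw.githubusercontent.com", raw_path, "", ""))
```

The returned address can then be passed to dvc get-url in place of the page address.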

3. add

To start tracking a file or directory, use dvc add.

dvc add path.csv

DVC stores information about the added file in a special .dvc file named after it (path.csv.dvc for the command above). It is a small text file in a human-readable format. This metadata file is a placeholder for the original data and can be easily versioned like source code with Git.
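To make the idea of a placeholder file concrete, here is a simplified Python imitation of what tracking a file involves: hash the data, then write a small text file that records the hash, size, and path. This is a sketch under assumptions, not DVC's implementation; the function names are hypothetical, and real .dvc files may contain additional fields.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path: Path) -> Path:
    """Write a '<name>.dvc'-style metadata file next to data_path (simplified layout)."""
    pointer_path = data_path.with_name(data_path.name + ".dvc")
    pointer_path.write_text(
        "outs:\n"
        f"- md5: {md5_of_file(data_path)}\n"
        f"  size: {data_path.stat().st_size}\n"
        f"  path: {data_path.name}\n"
    )
    return pointer_path
```

Because the pointer file is only a few lines of text, Git can version it cheaply, while the data itself stays out of the Git history.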

4. status

We can use the dvc status command to see changes in the project’s pipelines and files, either between the cache and the workspace or between the cache and remote storage.
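The comparison behind a status check can be sketched as follows: each tracked file has a recorded hash, and a file counts as changed when its current hash differs or the file is missing. This Python sketch only illustrates the workspace side of that idea; the function names and the report labels are assumptions for illustration and do not reproduce DVC's output.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the MD5 hex digest of a file's contents."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def workspace_status(recorded: dict[str, str], workspace: Path) -> dict[str, str]:
    """Map each tracked path to 'deleted' or 'modified'; unchanged files are omitted."""
    report = {}
    for rel_path, recorded_hash in recorded.items():
        target = workspace / rel_path
        if not target.exists():
            report[rel_path] = "deleted"
        elif file_md5(target) != recorded_hash:
            report[rel_path] = "modified"
    return report
```

An empty result means the workspace matches what was recorded, which is the situation in which a status check has nothing to report.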

5. remote

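DVC remotes provide a location to store and share data and models with a team, or to keep a copy in remote storage. The remote URL points to a storage location such as a bucket and folder, not to a single data file; mybucket/dvcstore below is a placeholder. The -d flag makes the new remote the default one.

dvc remote add -d myremote s3://mybucket/dvcstore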
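Conceptually, a remote is a shared store where data is saved under a name derived from its hash, so that teammates can fetch exactly the version a metadata file points to. The Python sketch below uses a local directory as a stand-in for remote storage. The two-character prefix layout and the push and pull function names are assumptions for illustration, not DVC's exact layout or API.

```python
import hashlib
import shutil
from pathlib import Path


def push(data_path: Path, remote_dir: Path) -> Path:
    """Copy a file into the store under a path derived from its MD5 hash."""
    digest = hashlib.md5(data_path.read_bytes()).hexdigest()
    target = remote_dir / digest[:2] / digest[2:]
    if not target.exists():  # identical content is stored only once
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(data_path, target)
    return target


def pull(digest: str, remote_dir: Path, destination: Path) -> Path:
    """Fetch the content identified by digest from the store into destination."""
    shutil.copyfile(remote_dir / digest[:2] / digest[2:], destination)
    return destination
```

Storing by hash means two files with the same content occupy one slot in the store, and a recorded hash is enough to retrieve the matching data later.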

To be cont.…

https://dvc.org/


