Machine Learning News Hubb
The DVC Guide: Data Version Control For All Your Data Science Projects | by Yash Prakash | Feb, 2023

February 17, 2023


Become familiar with data versioning just like code versioning

Photo by Dmitri Sobolevski on Unsplash

As data scientists, we experiment with different versions of code, models, and data. We already use a version control system like Git to manage our code, track versions, move forward and backward in time, and share our code with our teams.

Versioning code is important because it makes software reproducible at scale. Versioning data is important because it lets any developer in your team or organization rebuild a machine learning model and get similar metrics at any given time.

Therefore, it is crucial to version your models as well as data. But veteran software engineers will know that using Git for storing large files is a big no.

Not only is Git inefficient with large files, it is also not where large data files usually live. Most data is stored in AWS S3 buckets, Google Cloud Storage, or an institutional remote storage server.

So how do we version data? Enter DVC.

DVC is a system for Data Version Control that works hand in hand with Git to track our data files. Its syntax is similar to Git's, so it's quite easy to learn.

Let's take a look at some of the great data versioning features of DVC in this article. But first, let's make a new project folder and a virtual environment, and install DVC as a Python package:

$ pip install "dvc[all]"

or if you use Pipenv:

$ pipenv shell
$ pipenv install "dvc[all]"

You should see an output like so:

installing DVC: image from author

Now, let's initialize a Git repository with git init and then set up DVC inside it with dvc init. You should see the following output:

dvc init: image from author

Perfect! We can now go ahead and add our data to DVC.

I have one data file in my project’s data folder like so:

folder structure: image from author

To run a size check from the terminal, use:

$ ls -lh data

You'll see the following output, which shows that the data file is 5.2 MB.

data file check: image from author

We can now add this data file to DVC. Run:

$ dvc add data/train_shakespeare.txt

You’ll see the following output, prompting us to run the git add command:

dvc added data: image from author
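
For intuition about what the generated .dvc file holds: it is a small text pointer that records a checksum, a size, and a path for the data, and that pointer is what Git tracks in place of the data itself. The sketch below only imitates that shape with standard tools. It assumes a Linux-style md5sum is available, uses a made-up sample.txt rather than the article's dataset, and is not DVC's own hashing routine, so the exact fields and checksum written by your DVC version may differ.

```shell
# Create a small stand-in file (hypothetical content, not the real dataset)
workdir=$(mktemp -d)
printf 'To be, or not to be\n' > "$workdir/sample.txt"

# Compute an MD5 checksum of the content and its size in bytes
checksum=$(md5sum "$workdir/sample.txt" | cut -d ' ' -f 1)
size=$(wc -c < "$workdir/sample.txt" | tr -d ' ')

# Print the values in roughly the shape of a .dvc pointer file
printf 'outs:\n- md5: %s\n  size: %s\n  path: %s\n' "$checksum" "$size" "sample.txt"
```

Because the pointer is only a few lines long, committing it to Git is cheap no matter how large the data file is.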

We will now run the git add command:

$ git add data/train_shakespeare.txt.dvc data/.gitignore

Now that Git is tracking our new .dvc file, we can go ahead and commit it:

$ git commit -m "added data."

We can simply utilize Google Drive for storing our versioned datasets and in this tutorial we’re going to do exactly that.

Let's create a new folder in our Google Drive and look at its URL:

https://drive.google.com/drive/u/0/folders/cVtFRMoZKxe5iNMd-K_T50Ie

The last segment of the URL (cVtFRMoZKxe5iNMd-K_T50Ie here) is the ID of the folder. We copy it to our terminal so that DVC can store our data in that newly created Drive folder.

Let’s do that:

$ dvc remote add -d storage gdrive://cVtFRMoZKxe5iNMd-K_T50Ie
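
If you would rather not copy the ID by hand, it can be cut out of the URL with plain shell parameter expansion. This is a minimal sketch that assumes the folder URL has the same form as the one above, with the ID as the final path segment; the variable names are only illustrative.

```shell
# Folder URL as copied from the browser (the example URL from this article)
url="https://drive.google.com/drive/u/0/folders/cVtFRMoZKxe5iNMd-K_T50Ie"

# Strip everything up to and including the last slash to keep only the ID
folder_id="${url##*/}"

# Build the remote address in the gdrive:// form used by dvc remote add
remote_url="gdrive://${folder_id}"
echo "$remote_url"
```

The printed value is the last argument passed to dvc remote add.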

Time to commit our changes to git:

$ git commit .dvc/config -m "Configured remote storage."

Perfect! Now we can push our data to our remote storage.

$ dvc push 

DVC will either ask for an authentication code or open your browser so you can authenticate there. Follow the instructions and you'll be good to go.

remote data in Google drive: image from author

If you or your colleagues want to access the remotely stored data, it can be done with the pull command.

But first, let’s delete the data and its cache stored locally so that we can pull it from remote:

$ rm -f data/train_shakespeare.txt
$ rm -rf .dvc/cache

Now, pull:

$ dvc pull

You’ll see the following output on pulling the file:

pulled file from remote: image from author

As you can see, once dvc is tracking your data file, pulling it from remote storage is a breeze.

Suppose the data file changes and we want to track the new version. We simply add it to DVC again and then stage the updated .dvc file in Git:

$ dvc add data/train_shakespeare.txt
$ git add data/train_shakespeare.txt.dvc

Now, you'll see that a new version of the .dvc file is ready to be committed to Git:

committing new .dvc file changes: image from author

Commit the file.

Now, we can push our latest dataset to remote storage:

$ dvc push

Looking at our Google Drive, we can see that we have two versions of our data stored:

Google drive data versions: image from author
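
Two versions show up because the data is stored by the checksum of its content, so a changed file gets a new object instead of overwriting the old one. The toy store below illustrates that idea only. The put function and the temporary directories are hypothetical, the two-character prefix directory merely resembles how DVC lays out its cache, and md5sum is assumed to be available.

```shell
# A throwaway directory standing in for remote storage
store=$(mktemp -d)

# Copy a file into the store under a name derived from its MD5 checksum
put() {
    digest=$(md5sum "$1" | cut -d ' ' -f 1)
    prefix=$(printf '%s' "$digest" | cut -c 1-2)
    rest=$(printf '%s' "$digest" | cut -c 3-)
    mkdir -p "$store/$prefix"
    cp "$1" "$store/$prefix/$rest"
    printf '%s\n' "$digest"
}

# Store two versions of the same file
data=$(mktemp)
printf 'version one\n' > "$data"
v1=$(put "$data")
printf 'version two\n' > "$data"
v2=$(put "$data")

# Each distinct content ends up as its own object
count=$(find "$store" -type f | wc -l | tr -d ' ')
echo "$count objects stored"
```

Since the old object is still there, going back to it later is a matter of knowing its checksum, which is exactly what the older .dvc file records.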

With DVC, it becomes easy to go back in time to an older version of a dataset.

If we look at the git log of our project so far, we see that we have committed two versions of the .dvc file to Git. Git tracks the .dvc file rather than the data itself, so to restore the older dataset we first go back to the previous version of the .dvc file.

First, simply do a Git checkout to an older commit, like so:

$ git checkout HEAD^1 data/train_shakespeare.txt.dvc

Second, do a checkout of dvc:

$ dvc checkout

You’ll see the following output. We have now restored our data file to its previous version!

restoring data to its previous version: image from author
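
To see the Git half of this step in isolation, here is a small self-contained sketch that runs in a throwaway repository. It assumes git is installed. The file name data.dvc, its placeholder contents, and the helper g are made up for the demo, and HEAD~1 with an explicit -- separator is used as an equivalent way of naming the previous commit on a linear history.

```shell
# Throwaway repository so nothing in your real project is touched
repo=$(mktemp -d)

# Run git inside the demo repo with an inline identity (no global config needed)
g() { git -C "$repo" -c user.name="demo" -c user.email="demo@example.com" "$@"; }
g init -q

# First version of a stand-in pointer file
printf 'md5: aaaa\n' > "$repo/data.dvc"
g add data.dvc
g commit -q -m "added data."

# Second version of the pointer file
printf 'md5: bbbb\n' > "$repo/data.dvc"
g commit -q -am "updated data."

# Restore only the pointer file from the previous commit
g checkout HEAD~1 -- data.dvc
restored=$(cat "$repo/data.dvc")
echo "$restored"
```

Only the named file is rewound. The branch itself stays on the latest commit, which is why the restored pointer then shows up as a change that can be committed.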

If you want to keep the restored version of the dataset, simply commit the .dvc file to Git again:

$ git commit data/train_shakespeare.txt.dvc -m "reverted data changes."

Perfect! You have now learned most of the fundamental data versioning features of DVC. Great job!

DVC provides us with a massive helping hand in versioning datasets for our data science projects, and after this article, I hope you have some useful knowledge about getting started with it.

Practising on some sample projects and exploring the DVC documentation will be your best bet to advance your skill with this amazing tool.

If you liked this article, every week I put out a story in which I share little pieces of knowledge from the world of data science and programming in general. Follow me to never miss them! 😄

You can also connect with me on LinkedIn and Twitter.