2. .dvc files
.dvc files use the YAML 1.2 file format, a human-friendly data serialization format that can be read from any programming language.
As I mentioned above, DVC creates one lightweight .dvc file for each file or folder tracked with DVC.
When you take a peek inside images.dvc, you will see the following entries:
The most interesting part is md5. MD5 is a popular hashing function. It takes a file of arbitrary size and uses its contents to produce a fixed-length string (32 hexadecimal characters in our case).
These characters may seem random, but they will always be the same no matter how many times you rerun the hashing function on the same file. If even a single bit of the file changes, however, the resulting hash will be completely different.
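DVC uses these hashes (also called checksums) to tell whether two files are identical. Because each .dvc file records a hash alongside the path of the asset it tracks, a changed hash for the same path tells DVC that it is looking at a new version of that asset.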
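The properties described above can be checked with a few lines of Python. This is a minimal sketch using the standard library's hashlib module; the byte strings stand in for real files, and the helper name md5_hex is made up for this illustration. It is not how DVC itself is implemented.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of `data` as a 32-character hex string."""
    return hashlib.md5(data).hexdigest()


# Stand-in for file contents (hypothetical data, not a real image).
original = b"pretend these bytes are an image"
same_again = b"pretend these bytes are an image"

# Flip the lowest bit of the first byte and leave everything else untouched.
one_bit_flipped = bytes([original[0] ^ 0b00000001]) + original[1:]

# The digest length is fixed regardless of input size.
print(len(md5_hex(original)))

# Identical content always produces an identical hash.
print(md5_hex(original) == md5_hex(same_again))

# A single flipped bit produces a different hash.
print(md5_hex(original) == md5_hex(one_bit_flipped))
```

Comparing two short hashes is much cheaper than comparing two large files byte by byte, which is why a checksum is a convenient way to detect that a tracked asset has changed.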
For example, if I add a new fake image to the images folder, the resulting MD5 hash inside images.dvc will be different:
As mentioned earlier, you should track all .dvc files with Git so that modifications to large assets become a part of your Git commits and history.
$ git add images.dvc
Find out more about how .dvc files work from this page of the DVC user guide.
3. DVC cache
When you call dvc add on a large asset, it gets copied into a special directory called the DVC cache, located under .dvc/cache.
The cache is the place where DVC keeps a pristine record of your data and models at different versions. The .dvc files in the current working directory may point to the latest version of a large asset or to an older one, but the cache holds every state the asset has been in since you started tracking it with DVC.
For example, let’s say you added a 1 GB data.csv file to DVC. By default, the file will be both in your workspace and inside the .dvc/cache folder, taking up twice as much space (2 GB).
Any subsequent changes tracked with dvc add data.csv will create a new version of data.csv with a new hash inside .dvc/cache, taking up another gigabyte of disk space.
So, you might already be asking: isn’t this highly inefficient? And the answer would be yes! At least for single files, but we will see methods to mitigate this problem in the next section.
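To make the storage cost concrete, here is a minimal sketch of a content-addressed cache in Python. It is an illustration only: the function names add_to_cache and cache_size are hypothetical, the directory layout (first two hash characters as a subfolder) is an assumption of this sketch rather than a statement about DVC's exact on-disk format, and a 1,000-byte file stands in for the 1 GB data.csv.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def add_to_cache(file_path: Path, cache_dir: Path) -> Path:
    """Copy `file_path` into `cache_dir` under a name derived from its MD5 hash."""
    digest = hashlib.md5(file_path.read_bytes()).hexdigest()
    # Assumed layout for this sketch: <cache>/<first 2 hex chars>/<remaining 30>
    target = cache_dir / digest[:2] / digest[2:]
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(file_path, target)
    return target


def cache_size(cache_dir: Path) -> int:
    """Total number of bytes stored in the cache."""
    return sum(p.stat().st_size for p in cache_dir.rglob("*") if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    cache = workspace / "cache"
    data = workspace / "data.csv"

    # Version 1 of the file: 1,000 bytes.
    data.write_bytes(b"a" * 1000)
    add_to_cache(data, cache)
    size_after_v1 = cache_size(cache)

    # Version 2 differs by a single byte, yet gets its own full copy.
    data.write_bytes(b"a" * 999 + b"b")
    add_to_cache(data, cache)
    size_after_v2 = cache_size(cache)

    print(size_after_v1, size_after_v2)
```

Because the cache key is the hash of the whole file, even a one-byte edit yields a new key and therefore a second full copy, which is the inefficiency described above.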
As for folders, it is a bit different.
When you track different versions of folders with dvc add dirname, DVC is smart enough to detect which files changed within that directory. This means that unless you update every single file in the directory, DVC will cache new copies of only the changed files, so each new version of the folder takes up little extra space.
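The folder behavior can be sketched the same way. In the example below, an in-memory dictionary keyed by MD5 hash stands in for the cache, and cache_directory is a hypothetical helper written for this illustration; it shows the general idea of per-file deduplication under those assumptions, not DVC's actual implementation.

```python
import hashlib
import tempfile
from pathlib import Path


def cache_directory(directory: Path, cache: dict) -> int:
    """Add every file in `directory` to `cache`; return how many new objects were stored."""
    new_objects = 0
    for path in sorted(directory.iterdir()):
        if not path.is_file():
            continue
        content = path.read_bytes()
        digest = hashlib.md5(content).hexdigest()
        # Files whose content is unchanged are already in the cache and are skipped.
        if digest not in cache:
            cache[digest] = content
            new_objects += 1
    return new_objects


with tempfile.TemporaryDirectory() as tmp:
    images = Path(tmp) / "images"
    images.mkdir()

    # Five fake "images", each with distinct content.
    for i in range(5):
        (images / f"img_{i}.jpg").write_bytes(f"fake image {i}".encode())

    cache = {}
    first_add = cache_directory(images, cache)

    # Edit one file, then track the folder again.
    (images / "img_0.jpg").write_bytes(b"edited fake image 0")
    second_add = cache_directory(images, cache)

    print(first_add, second_add)
```

The first pass stores every file, while the second pass stores only the one file whose content changed, so the cache grows by the size of that file rather than the size of the whole folder.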
In summary, think of the DVC cache as a counterpart to Git’s staging area.
Learn more about the cache and internal DVC files from this user guide section.