Machine Learning News Hubb
How to turn your local (zip) data into a Huggingface Dataset | by Dr. Varshita Sher | Sep, 2022

by admin
September 7, 2022
in Artificial Intelligence


HUGGINGFACE DATASETS

Quickly load your dataset in a single line of code for training a deep learning model

If you have been working in deep learning for some time (or even if you have only recently delved into it), chances are you have come across Huggingface, an open-source ML ecosystem that is a holy grail for all things AI: pretrained models, datasets, inference APIs, GPU/TPU scalability, optimizers, and more.

They also have a dedicated library — 🤗 Datasets for easily accessing and sharing datasets for Natural Language Processing (NLP), computer vision, and audio tasks.

pip install datasets

The library gives you access to 2,500+ ready-made datasets. You can check the list as follows:

from datasets import list_datasets
list_datasets()
*** OUTPUT ***
['acronym_identification',
'ade_corpus_v2',
'adversarial_qa',
'aeslc',
'afrikaans_ner_corpus',
'ag_news',
...
]

To load any of these datasets in your current python script or jupyter notebook, simply pass the name of the dataset to load_dataset(). For instance, let’s try loading a popular audio dataset called superb with the asr (automatic speech recognition) configuration and inspect the first audio file in the train split. The output is a dictionary with six features: chapter_id, file, audio, id, speaker_id, and text.

from datasets import load_dataset
dataset = load_dataset("superb", "asr")
dataset["train"][0]
*** OUTPUT ***
{'chapter_id': 1240,
 'file': 'path/to/file.flac',
 'audio': {
     'array': array([0., 0.003, -0.0002, ...], dtype=float32),
     'path': 'path/to/file.flac',
     'sampling_rate': 16000
 },
 'id': '103-1240-0000',
 'speaker_id': 103,
 'text': 'CHAPTER ONE MISSUS RACHEL LYNDE IS SURPRISED MISSUS RACHEL LYNDE '
}

One of the main reasons I started writing this article was because I wanted to fine-tune a 🤗 Transformer model using the Trainer API on a custom audio dataset (blog to follow shortly). Most tutorials I came across were using one of the popular datasets (such as Superb, Librispeech, etc.) that are available in the library and ready to be used out-of-the-box.

Wanting to work with the Crema-D audio dataset from Kaggle, I thought — wouldn’t it be nice if we could also load our own custom data with a single line of code as above? Something along the lines of:

dataset = load_dataset("my_custom_dataset")

That’s exactly what we are going to learn how to do in this tutorial! So go ahead and click the Download button on this link to follow along. You should see archive.zip, containing the Crema-D audio files, start to download. It contains 7k+ audio files in .wav format.

One main benefit of creating 🤗 datasets is that they are Arrow-backed. In other words, datasets are cached on disk. When needed, they are memory-mapped directly from the disk (which offers fast lookup) instead of being loaded in memory (i.e. RAM). Because of this, machines with relatively smaller (RAM) memory can still load large datasets using Huggingface datasets [Source].

Given that we are working with the custom local Crema-D dataset, meaning it is not yet ready to be loaded out-of-the-box using load_dataset(), we need to write a loading script instead. Each of the ready-made datasets we saw above has its own loading script in the backend. Here is the one for the superb dataset.

A loading script is a .py python script that we pass as input to load_dataset() (instead of a pre-installed dataset name). It contains information about the columns and their data types, specifies the train-test splits for the dataset, handles downloading files if needed, and generates samples from the dataset.

A loading script also helps in decoupling dataset code from model training code for better readability and modularity.

Assuming we have been successful in creating this aforementioned script, we should then be able to load our dataset as follows:

ds = load_dataset(
    dataset_config["LOADING_SCRIPT_FILES"],
    dataset_config["CONFIG_NAME"],
    data_dir=dataset_config["DATA_DIR"],
    cache_dir=dataset_config["CACHE_DIR"],
)

wherein dataset_config is a simple dictionary containing the values:

dataset_config = {
    "LOADING_SCRIPT_FILES": "path/to/loading/script.py",
    "CONFIG_NAME": "clean",
    "DATA_DIR": "path/to/zip/file",
    "CACHE_DIR": "path/to/cache/directory",
}

By passing data_dir while calling load_dataset(), we are telling the loading script where to look for the directory containing the audio files. Furthermore, setting a cache_dir will allow us to re-use the cached version of our dataset on subsequent calls to the load_dataset().

Lastly, we are going to focus on building only one configuration, which we have named clean. However, one can have multiple configs within their dataset. For instance, in the superb example above, we loaded the dataset with a specific configuration, i.e. asr, but they also have five other configurations: ks, ic, si, sd, and er.

Similarly, for this tutorial, in addition to having a clean config that will contain the entire dataset, we could have a second configuration, say small, which could be a reduced dataset for testing purposes, or a third config, say fr, which could contain the French version of this dataset. (Towards the end of this tutorial, I will briefly discuss how to define multiple configs within the same loading script.)

Quick detour

Before we begin writing a custom loading script for our dataset (contained in a zip file), I would like to point out how things would be done differently if we were dealing with the creation of 🤗 datasets from files in a simpler data format like csv, JSON, etc. The examples below are taken directly from the documentation page:

dataset = load_dataset('csv', data_files=['my_file_1.csv', 'my_file_2.csv'])
dataset = load_dataset('json', data_files='my_file.json')
dataset = load_dataset('text', data_files={'train': ['my_text_1.txt', 'my_text_2.txt'], 'test': 'my_test_file.txt'})

my_dict = {'id': [0, 1, 2], 'name': ['mary', 'bob', 'eve'], 'age': [24, 53, 19]}
dataset = Dataset.from_dict(my_dict)

df = pd.DataFrame({"a": [1, 2, 3]})
dataset = Dataset.from_pandas(df)

Writing custom loading script

Coming back to our custom loading script, let’s create a new file called crema.py. This is what a typical loading script will look like for any new dataset:

Figure 1: Generated using the blank template provided by Huggingface.

As you can see, there are three main methods that need modification: _info(), _split_generators() and _generate_examples(). Let’s look at them one by one:

Source: Official Huggingface Documentation

1. _info()

The three most important attributes to specify within this method are:

  • description — a string object containing a quick summary of your dataset.
  • features — think of it as defining the skeleton/metadata for your dataset. That is, what features would you like to store for each audio sample? (Remember how the superb dataset had six features defined for each audio file.)
    For our audio classification task, we simply need to define a file and the corresponding label.
  • homepage — (optional) link to the homepage URL of the dataset.

Few things to consider:

  • Each column name and its type are collectively referred to as Features of the 🤗 dataset. It takes the form of a dict[column_name, column_type].
  • Depending on the column_type, we can have
    — datasets.Value (for integers and strings),
    — datasets.ClassLabel (for a predefined set of classes with corresponding integer labels),
    — datasets.Sequence (for lists of objects),
    — and many more.
  • In our code, for simplicity, both file and label are defined as Value features of type string.
    Note: Apart from string, other data types include int32, bool, timestamp, etc. Check out the complete list here.
  • Apart from description, features, and homepage, you can check here for other attributes that can be specified within _info(), such as version number, supervised_keys, citation, etc.

2. _split_generators()

This is the function that takes care of downloading or retrieving the data files. That’s why in the function definition in Figure 1, the download manager (i.e. dl_manager) is passed as one of the function parameters.

The DownloadManager has a pre-defined function called extract() that takes care of unzipping our dataset and accessing the audio files therein.

def _split_generators(self, dl_manager):
    data_dir = dl_manager.extract(self.config.data_dir)
    ...

Note: If your zip (or tar) data is hosted on an ftp link or a URL (for instance, this is where the superb dataset is currently stored), you can use dl_manager.download_and_extract() to take care of downloading and unzipping the files. Because we have already downloaded the .zip file locally, we simply need to extract files using extract().

extract() takes as input the path to the data directory (i.e. where archive.zip sits). Remember we passed this path as the data_dir argument when calling load_dataset(), so it is available as part of the config and accessible through self.config.data_dir.

The output of the extract() function is a string containing the path to a cache directory where the file has been unzipped. For instance, in our case this will be:/Audio-Classification-Medium /cache_crema/downloads/extracted/d088ccc5a5716........ At this location, you’ll find a newly created folder called AudioWav with all our .wav audio files present.

Lastly, _split_generators() also organizes the data by splits using SplitGenerator. For now, we have only one split, train_splits, returned by this function, whose name we specify as train. Here, gen_kwargs refers to the keyword arguments needed to generate samples from this dataset. It contains two arguments, files and name, both of which will be forwarded to the _generate_examples() method next.

Note: There’s no limit to what can be passed within gen_kwargs. Try gen_kwargs={"files": data_dir, "name": "train", "useless_arg": "helloworld"}. Needless to say, only include kwargs you think will be needed to generate samples within the _generate_examples().

Tip: In the future, if you have separate datasets for test and validation splits, you can return additional SplitGenerator entries, one per split.

3. _generate_examples()

As previously mentioned, this method takes as parameters all things unpacked from gen_kwargs as given in _split_generators. In our case, that will be files and name:

def _generate_examples(self, files, name):
    ...

This method is in charge of generating (key, example) tuples — one by one — from the audio dataset (using yield) where example is a dictionary containing key-value pairs of audio files and labels. Because we don’t have explicit access to the labels, we need to extract them from the filenames (for ex: 1001_DFA_ANG_XX.wav) using split().

file = "1001_DFA_ANG_XX.wav"
label = file.split("_")[-2]
print(label)
**** OUTPUT ****
ANG

Note: According to the official dataset documentation, the file name contains useful metadata (separated by _), including the speaker_id (1001), sentence id (DFA), etc. If you would like to include them as part of the dataset, make sure you update _info() to create new Features for each of them before you can use them in _generate_examples().
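To make the note concrete, the metadata fields can be pulled apart with a single split(); the helper name parse_crema_filename and the field names below are my own, following the dataset documentation's field order:

```python
# Sketch: parse the metadata encoded in a Crema-D filename.
# Field order (speaker id, sentence id, emotion label, intensity)
# follows the official dataset documentation.
def parse_crema_filename(filename):
    stem = filename.rsplit(".", 1)[0]  # drop the .wav extension
    speaker_id, sentence_id, label, intensity = stem.split("_")
    return {
        "speaker_id": speaker_id,
        "sentence_id": sentence_id,
        "label": label,
        "intensity": intensity,
    }

meta = parse_crema_filename("1001_DFA_ANG_XX.wav")
# meta["speaker_id"] == "1001", meta["label"] == "ANG"
```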

Before we can yield one example, we must create a list of all examples. Let’s do so by iterating over all files present in the os.path.join(files, "AudioWav") directory.

Note1: If you’re wondering why we needed os.path.join() above, remember that files is the path to the cache folder ./Audio-Classification-Medium /cache_crema/downloads/extracted/d088ccc5a5716....... — there are no audio files directly at this level! A newly created AudioWav folder at that location contains the needed .wav audio files. It took me hours of debugging to figure this one out! In hindsight, I should use os.walk() next time.

Note2: If you had an explicit csv/json file containing all the metadata, including labels, the code for _generate_examples() would look a bit different. Instead of iterating over all the files, you would need to (a) iterate over the rows in the csv file and (b) convert each row to a dictionary (e.g. using pandas’ .to_dict()) to create the examples. See below for a dummy snippet:

Final code for crema.py.

Few additional changes to consider:

  • We set a class attribute, DEFAULT_WRITER_BATCH_SIZE, that controls how many examples can stay in RAM while writing the dataset to an Arrow file. For memory-heavy data such as images, audio, or videos, it is important to set this to a small value (such as 256) so as not to risk OOM errors or the iterator getting stuck. If we don’t set a value, Arrow’s default batch size (10,000) is used, which is too large for speech samples.
  • We also defined the one and only configuration we are providing for this dataset, i.e. clean, using datasets.BuilderConfig, which is the base class for building configurations.
    (P.S. Towards the end, we will see how to subclass BuilderConfig and add our own properties for defining multiple configurations).

Congrats, you’re now ready to load your dataset

Open a new python script or jupyter notebook:

dataset_config = {
    "LOADING_SCRIPT_FILES": os.path.join(PROJECT_ROOT, "crema.py"),
    "CONFIG_NAME": "clean",
    "DATA_DIR": os.path.join(PROJECT_ROOT, "data/archive.zip"),
    "CACHE_DIR": os.path.join(PROJECT_ROOT, "cache_crema"),
}

ds = load_dataset(
    dataset_config["LOADING_SCRIPT_FILES"],
    dataset_config["CONFIG_NAME"],
    data_dir=dataset_config["DATA_DIR"],
    cache_dir=dataset_config["CACHE_DIR"],
)
print(ds)

********* OUTPUT ********
DatasetDict({
    train: Dataset({
        features: ['file', 'label'],
        num_rows: 7442
    })
})

From here on, you can either choose to use this dataset as-is for model training (which is what I will be doing in my next tutorial) or, if you have ownership of the dataset, upload it to the Huggingface Dataset Hub. Instructions can be found here.

Before finishing, it’s worthwhile to discuss a few things that can be done post the data loading step but before the model training step.

1. Split into train, test, and dev sets

# INTRODUCE TRAIN TEST VAL SPLITS
# 90% train, 10% test + validation
from datasets import DatasetDict

train_testvalid = ds["train"].train_test_split(shuffle=True, test_size=0.1)
# Split the 10% test + valid in half test, half valid
test_valid = train_testvalid["test"].train_test_split(test_size=0.5)
# Gather everything into a single DatasetDict
ds = DatasetDict({
    "train": train_testvalid["train"],
    "test": test_valid["test"],
    "val": test_valid["train"],
})

2. Convert raw audio files into arrays

# CONVERTING RAW AUDIO TO ARRAYS
import librosa

ds = ds.map(
    lambda x: {"array": librosa.load(x["file"], sr=16000, mono=False)[0]}
)

3. Convert labels into ids

ds = ds.class_encode_column("label")

4. Selecting a subset of the dataset for dummy runs

ds["train"] = ds["train"].select(range(50))

P.S. Keep in mind that each map function, though time-consuming the first time around, caches its output, so subsequent map calls during model training won’t take that much time.

At the beginning of the article, I mentioned that we would discuss the code snippet that allows multiple (dummy) configs. For this, we need to introduce a new class, let’s call it CremaConfig, which will be a subclass of datasets.BuilderConfig. Within this class, we define three attributes of our dataset: data_dir, url, and citation.

Now, instead of defining the config as follows:

BUILDER_CONFIGS = [
    datasets.BuilderConfig(name="clean", description="Train Set.")
]

we can now build instances of the CremaConfig class to instantiate multiple configs. This gives us the flexibility to specify the name, data directory, url, etc. of each configuration.

BUILDER_CONFIGS = [
    CremaConfig(name="clean", description="Train Set in English.", data_dir="path/to/english/dir", url="...", citation="..."),
    CremaConfig(name="fr", description="Train Set in French.", data_dir="path/to/french/dir", url="...", citation="..."),
]

A huge shoutout to the pre-existing 🤗 documentation on this topic. I hope this tutorial was able to take the documentation a step further, filter out the technical jargon, and showcase the implementation on a real-world example!

As always, if there’s an easier way to do or explain some of the things mentioned in this article, do let me know. In general, refrain from unsolicited destructive/trash/hostile comments!

Until next time ✨


