A wide range of audio data is available in the real world: speech, animal sounds, instruments, you name it. It is no wonder that audio-based machine learning has found applications across many sectors and industries. Compared to other types of data, however, audio typically requires many time-consuming and resource-demanding processing steps before it can be fed into a machine-learning model. That is why this post focuses on runtime optimization.
By far, the most widely used framework for audio data processing is a combination of the two Python libraries NumPy and Librosa. It is, however, not without competition. In 2019, PyTorch released a library called TorchAudio that promises more efficient signal processing and I/O operations. Moreover, the programming language Julia is slowly gaining more popularity in the field, especially in academic research.
In this post, I am going to let all three frameworks solve a real-world speech recognition problem and compare the runtimes at different steps of the process.
If you just want to see the results, feel free to skim or skip this section; the results should be interpretable to some extent without it.
To compare the three frameworks, I picked a specific real-world speech recognition task and wrote a processing script for each contestant. You can find the scripts in this GitHub repository. For the task, I picked 6 speech commands from Google’s “Speech Commands Dataset” (CC BY 4.0 license), each with around 2,300 examples, for a total dataset size of 14,206. A CSV file was prepared that holds the file path and the class of each example.
To solve the processing task, each program must perform the following steps:
- Load the dataset overview from a CSV file.
- Create an empty array to fill with the extracted features.
- For each audio file: [a] Load the audio file from a local path. [b] Extract a mel spectrogram (1 sec) from the signal. [c] Pad or truncate the mel spectrogram if necessary. [d] Write the mel spectrogram to the feature array.
- Normalize the feature array using Min-Max normalization
- Export the feature array to an appropriate data format.
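Steps 3c and 4 are plain array operations. The sketch below shows them in NumPy; the shapes and parameter values are illustrative and not taken from the benchmark scripts, and the actual mel extraction (step 3b) would come from `librosa.feature.melspectrogram` or its TorchAudio equivalent.

```python
import numpy as np

N_MELS, TARGET_FRAMES = 64, 32  # illustrative shape, not the benchmark's values

def pad_or_truncate(mel: np.ndarray, target: int = TARGET_FRAMES) -> np.ndarray:
    """Force the time axis of a mel spectrogram to a fixed length (step 3c)."""
    mel = mel[:, :target]                      # truncate if too long
    pad = target - mel.shape[1]
    if pad > 0:                                # zero-pad if too short
        mel = np.pad(mel, ((0, 0), (0, pad)))
    return mel

def min_max_normalize(features: np.ndarray) -> np.ndarray:
    """Scale the whole feature array into [0, 1] (step 4)."""
    lo, hi = features.min(), features.max()
    return (features - lo) / (hi - lo)

# Simulate two spectrograms of different lengths, as step 3b would produce them.
short = np.random.rand(N_MELS, 25)
long_ = np.random.rand(N_MELS, 40)
features = np.stack([pad_or_truncate(m) for m in (short, long_)])
features = min_max_normalize(features)
print(features.shape)  # -> (2, 64, 32)
```

Normalizing once over the whole array (rather than per example) matches the step order listed above: all spectrograms are written to the feature array first, then a single Min-Max pass is applied.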
I did my best to implement the algorithm in a comparable way in all three frameworks, down to the smallest detail. However, since I am quite new to Julia and TorchAudio, I cannot guarantee that I found the most efficient possible implementation in those two. You can always look at the code yourself here.
To gain deeper insights into the strengths and weaknesses of each framework, I measured the runtime at different steps of the algorithm:
- After loading the libraries, helper functions, and basic parameters set at the beginning of the script.
- After loading the dataset overview from a CSV file.
- After extracting the mel spectrograms from all examples.
- After normalizing and exporting the data.
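One straightforward way to take such cumulative measurements in Python is a wall-clock checkpoint after each phase; the checkpoint names below are mine, not those used in the benchmark scripts:

```python
import time

t0 = time.perf_counter()
checkpoints = {}

# ... import libraries, define helpers, set parameters ...
checkpoints["setup"] = time.perf_counter() - t0

# ... load the dataset overview from the CSV file ...
checkpoints["overview"] = time.perf_counter() - t0

# ... extract mel spectrograms for all examples ...
checkpoints["extraction"] = time.perf_counter() - t0

# ... normalize and export the feature array ...
checkpoints["export"] = time.perf_counter() - t0

for name, seconds in checkpoints.items():
    print(f"{name}: {seconds:.3f} s")
```

Because each checkpoint records time elapsed since the start, the per-step runtimes are the differences between consecutive checkpoints.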
Furthermore, I duplicated the dataset multiple times to simulate how the algorithms would scale with increasing dataset size:
- 14,206 examples (1x)
- 28,412 examples (2x)
- 42,618 examples (3x)
- 56,824 examples (4x)
- 142,060 examples (10x)
For each dataset size, I ran the algorithm five times and computed the median runtime of each step. Every measurement was rounded to full seconds, so some fast processing steps were recorded as zero seconds. Because there was hardly any variation between runs, no measures of variance are reported. All measurements were made on an Apple MacBook Pro (M1).
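As a small illustration of this aggregation (the run times below are made up, not actual measurements):

```python
from statistics import median

runs = [12.4, 12.1, 12.6, 12.2, 12.3]  # five hypothetical runs of one step, in seconds
step_runtime = round(median(runs))     # median across runs, rounded to full seconds
print(step_runtime)                    # -> 12
```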
Total Runtime Comparison
In the graph below, the total runtimes of the three frameworks are compared at different dataset sizes. Because Librosa sticks out as much slower than the other two, the first subplot has a log-scaled y-axis. This way, it is easier to observe differences between Julia and TorchAudio. Keep in mind that the linear interpolation between the dots means different things in the regular and the log-scaled y-axis. Just use them as a visual aid for spotting trends.
The first thing we observe is that Librosa is slower than the other two frameworks by a large margin: TorchAudio is reliably more than 10x as fast, and so is Julia beyond a dataset size of roughly 30k examples. This came as a major shock to me, since I had used Librosa exclusively for these kinds of tasks for more than three years.
The next thing we can see is that TorchAudio starts out with the fastest runtime, but is slowly overtaken by Julia. It seems that Julia starts to take the lead at around 33k examples. At 140k examples, Julia outclasses TorchAudio by a considerable margin, taking only 60% of TorchAudio’s runtime.
Let us look at the stepwise runtime measurements to see why Julia’s runtime scales so differently from Python’s.
Stepwise Runtime Comparison
The figure below shows the runtime share of each step in the algorithm, for each of the three frameworks.
We can see that for Librosa and TorchAudio, extracting the mel spectrograms takes up nearly all of the runtime. This is expected, since the two scripts share almost exactly the same code outside of the feature extraction step. The other steps are visible at the start of the TorchAudio graph only because its feature extraction is so much faster than Librosa’s; with growing dataset size, both quickly converge to the same runtime distribution.
In contrast, for Julia, the feature extraction step does not become dominant until a dataset size of around 42k. Even at 142k examples, the other steps still make up more than 25% of the runtime. This result is not surprising if you have used both Julia and Python. As an interpreted language, Python gets a library or a function going with low latency, but the actual execution is then rather slow. Julia, on the other hand, is a just-in-time (JIT) compiled language: the JIT compiler adds a startup overhead compared to Python, but the optimized compiled code makes up for it in the long run.
Summary of Results
Here are the main results obtained in this simulation:
- Librosa underperformed by a factor of 10x or greater compared to the other frameworks throughout all dataset sizes.
- TorchAudio was the fastest framework for smaller or medium-sized datasets.
- Julia started out a bit slower than TorchAudio but took the lead with larger datasets.
- Even with 142k audio examples, Julia still spent around 25% of its runtime on loading modules and on loading and exporting the dataset, so its relative advantage should grow even further beyond 142k examples.
Of course, runtime speed is not the only relevant category. Is it worth learning Julia just to get faster signal processing code? Maybe in the long run… But if you are trying to build a quick solution and are familiar with Python, then TorchAudio is certainly the better choice. Even outside of runtime, there are other categories to consider, like software maturity or the potential for collaborating with co-workers, customers, or a community.
Another key limitation is that all tests covered one specific use case. It is unclear what would happen with longer audio files or other audio features. Also, there are many ways to design a feature extraction algorithm, and the one used here is not necessarily the fastest or most widely used one.
Lastly, I am not (yet) an expert in either Julia or TorchAudio. It is likely that my implementations are not the most runtime-efficient ones you could possibly build.
If I had to come up with a conclusion that sits somewhere in the upper-right quadrant of the “true × useful” plane, it would be this one:
Considering nothing but runtime speed, Librosa should never be used, TorchAudio should be used for small or medium-sized datasets, and Julia should be used for larger datasets.
A less bold one — and my preferred conclusion — would be this one:
If you are currently using Librosa, consider replacing parts of your code with TorchAudio functionality, as it appears to be much faster. On top of that, learning Julia may prove useful for greater workloads or for implementing custom signal processing methods that are fast out of the box.