What is Source Separation?
In the field of signal processing, source separation describes the task of decomposing an audio signal into the individual source signals it is made of. This concept is not only relevant for music, but also for speech or machine sounds. For example, you might want to separate the voices of two speakers in a podcast so you can edit each voice independently.
Why is Source Separation so Difficult?
Not everyone is a musician. Even fewer people are musicians with a penchant for data & AI. Oftentimes, when I talk to non-musicians, I get the impression that they think you can simply “take the voice and remove it from the audio”. This assumption is understandable, because why else would there be instrumentals on the B-side of albums, or thousands of karaoke versions of popular songs available at every pub? And in fact, separating vocals from the instrumental really is simple, provided you have access to the individual tracks of the mix…
However, in the real world, all we have is waveforms. A waveform is the closest computer representation we have to a real, physical audio event, and it is also the prerequisite for turning digital audio back into real sound, for example through speakers. Crucially, mixing is essentially additive: the mixture waveform is simply the sum of the source waveforms, and infinitely many different pairs of signals add up to the same mixture. This means that if you want to separate a piece of music into two sources (vocals and instrumental), you need to find a way to take the combined waveform and split it into two separate waveforms, each capturing one of the sources accurately and exclusively, even though the mixture alone does not uniquely determine them.
To highlight this, you can find three waveforms in the figure below. The first one represents a guitar, the second captures vocals sung over the guitar track. The third waveform is the combination of the guitar and vocals, i.e. the full song.
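The relationship between the three waveforms in the figure can be sketched in a few lines of code. This is a toy illustration, not a separation algorithm: the “guitar” and “vocals” here are just hypothetical sine tones, and the sample rate is an arbitrary choice. The point is that the mix is nothing more than the sample-wise sum of its sources, and that undoing that sum is trivial only if you already know one of the sources.

```python
import math

SR = 8000  # sample rate in Hz (arbitrary choice for this sketch)

def sine(freq, seconds, sr=SR, amp=0.5):
    """Generate a sine-tone 'source' as a list of float samples."""
    n = int(seconds * sr)
    return [amp * math.sin(2 * math.pi * freq * t / sr) for t in range(n)]

# Two stand-in sources: a low "guitar" tone and a higher "vocal" tone.
guitar = sine(110.0, 0.01)   # A2
vocals = sine(440.0, 0.01)   # A4

# Mixing is additive: the full song is the sample-wise sum of the sources.
mix = [g + v for g, v in zip(guitar, vocals)]

# With access to one original track, the other falls out by subtraction:
recovered_vocals = [m - g for m, g in zip(mix, guitar)]
assert all(abs(rv - v) < 1e-12 for rv, v in zip(recovered_vocals, vocals))

# Given only `mix`, however, the problem is underdetermined: infinitely
# many pairs of waveforms sum to the same mixture, which is exactly why
# source separation needs more than arithmetic.
```

The subtraction trick at the end is what makes karaoke versions easy for the producer and hard for everyone else: it only works because one of the original tracks is available.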
For me as the producer of this track, providing you with the vocals and instrumental is a trivial task, as I can simply send you the original recordings of both. However, as consumers of music…