Researchers at Sony Computer Science Laboratories (CSL) say they have recently developed a new deep learning method to enhance and restore the quality of heavily compressed songs and audio recordings, such as those compressed by lossy codecs (like MP3) at high compression rates. Lossy audio codecs compress (and decompress) digital audio streams by removing information that tends to be inaudible to human listeners. Under high compression rates, however, such codecs may introduce a variety of audible impairments in the audio signal.
“Many works have tackled the problem of audio enhancement and compression artifact removal using deep learning techniques,” say the researchers. “However, only a few works tackle the restoration of heavily compressed audio signals in the musical domain. In this study, we test a stochastic generator for a generative adversarial network (GAN) architecture for this task.”
GANs are machine learning models in which two neural networks “compete” to make increasingly accurate or reliable predictions. Like other GANs, the model created by the researchers is composed of two separate networks, known as the “generator (G)” and the “critic (D).” The generator receives an excerpt of an MP3-compressed musical audio signal, represented as a spectrogram (i.e., a visual representation of how the signal’s frequency content changes over time).
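To make the spectrogram representation concrete, here is a minimal sketch in plain Python that turns a short signal into a magnitude spectrogram using a Hann window and a naive discrete Fourier transform. The function name `magnitude_spectrogram`, the frame size, the hop size, and the test tone are all illustrative assumptions; the article does not state which spectrogram parameters the researchers used.

```python
import cmath
import math


def magnitude_spectrogram(samples, frame_size=64, hop=32):
    """Return a list of frames; each frame holds magnitudes for bins 0..frame_size//2.

    Illustrative only: frame_size and hop are assumed values, not the paper's.
    """
    # A Hann window tapers each frame to reduce spectral leakage at its edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        bins = []
        for k in range(frame_size // 2 + 1):
            # Naive DFT of the windowed frame at frequency bin k.
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            bins.append(abs(acc))
        frames.append(bins)
    return frames


# Hypothetical input: a 1 kHz tone sampled at 8 kHz.
# Each bin spans 8000 / 64 = 125 Hz, so the tone falls in bin 8.
sample_rate = 8000
tone_hz = 1000.0
signal = [math.sin(2 * math.pi * tone_hz * n / sample_rate) for n in range(256)]

spec = magnitude_spectrogram(signal)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print("frames:", len(spec), "bins per frame:", len(spec[0]), "peak bin:", peak_bin)
```

Each row of `spec` is one time slice and each column one frequency band, which is the kind of two-dimensional input a generator network could consume. In practice a fast Fourier transform from a numerical library would replace the naive inner loop.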
The generator learns to produce a restored version of this compressed input, which is smaller in size than the original recording. Meanwhile, the critic learns to distinguish between the original, high-quality signals and the restored versions by spotting differences between them. The critic’s feedback is then used to improve the generator, so that the music or audio in the restored files is as faithful as possible to the original.
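The opposing goals of the generator and the critic can be sketched as two loss functions. The version below assumes a Wasserstein-style objective, which the term “critic” hints at but the article does not confirm; the function names and the example scores are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def critic_loss(scores_original, scores_restored):
    # Assumed Wasserstein-style objective: the critic wants to score original
    # signals high and restored signals low, so minimizing this value widens
    # the gap between the two groups.
    return mean(scores_restored) - mean(scores_original)


def generator_loss(scores_restored):
    # The generator wants the critic to score its restorations highly,
    # so it minimizes the negated critic score.
    return -mean(scores_restored)


# Hypothetical critic scores for a batch of three excerpts.
scores_original = [0.9, 1.1, 1.0]
scores_restored = [0.2, 0.4, 0.3]

print("critic loss:", critic_loss(scores_original, scores_restored))
print("generator loss:", generator_loss(scores_restored))
```

In a real training loop, both networks would be updated in alternation by gradient descent on these quantities; this sketch only shows how the two objectives pull in opposite directions.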
The researchers evaluated their GAN-based architecture in a series of tests aimed at determining whether the model could improve the quality of MP3 inputs and generate restored samples that are of higher quality, and closer to the original file, than those produced by baseline models. The results were highly promising, say the researchers: the model’s restorations of heavily compressed MP3 files (16 kbit/s and 32 kbit/s) typically sounded better to expert human listeners than the compressed files themselves.
At a lower level of compression (64 kbit/s mono), on the other hand, the team found that their model achieved slightly worse results than the MP3 baseline.
“We perform an extensive evaluation of the different experiments utilizing objective metrics and listening tests,” say the researchers. “We find that the models can improve the quality of audio signals over the MP3 versions for 16 and 32 kbit/s and that the stochastic generators are capable of generating outputs that are closer to the original signals than those of the deterministic generators.”
As part of their study, the researchers also showed that their architecture could successfully generate and add realistic high-frequency content that improved the audio quality of compressed songs. The generated content included percussive elements, a singing voice producing sibilants or plosives (i.e., “s” and “t” sounds), and guitar sounds.
In the future, say the researchers, the model they created could help to reduce the size of MP3 music files significantly without altering their content or creating easily perceivable errors. This could have significant implications for the storage and transmission of music on both streaming apps (e.g., Spotify and Apple Music) and modern electronic devices, including smartphones, tablets and computers.
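As a rough illustration of the storage savings at stake, the sketch below estimates file size from bitrate and duration. It assumes a constant bitrate and ignores headers and metadata; the four-minute song length and the 128 kbit/s comparison point are assumptions added for illustration and do not come from the study.

```python
def mp3_size_bytes(bitrate_kbit_s, duration_s):
    """Approximate size of a constant-bitrate stream, ignoring headers and tags."""
    # 1 kbit = 1000 bits, and 8 bits = 1 byte.
    return bitrate_kbit_s * 1000 * duration_s / 8


SONG_SECONDS = 240  # assumed four-minute song

# 16, 32 and 64 kbit/s are the rates mentioned in the article;
# 128 kbit/s is an assumed reference point for comparison.
for rate in (16, 32, 64, 128):
    size_mb = mp3_size_bytes(rate, SONG_SECONDS) / 1_000_000
    print(f"{rate:>3} kbit/s: {size_mb:.2f} MB")
```

Under these assumptions, file size scales linearly with bitrate, so a song stored at 16 kbit/s takes one eighth of the space it would at 128 kbit/s.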