Can you use AI to create a new record in the style of your favorite artist?
Recent breakthroughs in machine learning have shown that models are now capable of understanding complex data such as text and images. OpenAI’s Jukebox shows that even raw music audio can be modeled by a neural network.
Music is a complex object to model. A model has to capture both low-level features such as tempo, loudness, and pitch, and higher-level features such as lyrics, instrumentation, and musical structure.
Using advanced machine learning techniques, OpenAI has found a way to convert raw audio into a representation that other models can use.
This article will explain what Jukebox can do, how it works, and the current limitations of the technology.
What is Jukebox AI?
Jukebox is a neural net model by OpenAI that can generate music with singing. The model can produce music in a variety of genres and in the styles of many different artists.
For example, Jukebox can produce a rock song in the style of Elvis Presley or a hip-hop tune in the style of Kanye West. You can visit this website to explore how well the model captures the sound of your favorite artists and genres.
The model requires a genre, artist, and lyrics as input. This input conditions a model trained on millions of songs along with their lyrics and metadata.
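To make the idea concrete, you can think of the conditioning input as a small bundle of metadata plus lyric text. The field names below are purely illustrative, a sketch of the shape of the input rather than Jukebox’s actual API, and the lyrics are made up.

```python
# Hypothetical sketch of the conditioning input a Jukebox-style model expects.
# Field names are illustrative only, not the real Jukebox interface.
conditioning = {
    "artist": "Elvis Presley",        # artist label the model was trained to recognize
    "genre": "Rock",                  # genre label
    "lyrics": (
        "Walking down a lonely road tonight\n"
        "Singing to the stars above\n"
    ),                                # lyric text the generated singing should follow
    "total_length_seconds": 180,      # desired length of the generated song
}
```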
How does Jukebox work?
Let’s look at how Jukebox manages to generate novel raw audio from a model trained on millions of songs.
Encoding Process
While some music generation models use MIDI training data, Jukebox is trained directly on raw audio. To compress the audio into a discrete space, Jukebox uses an autoencoder approach known as VQ-VAE.
VQ-VAE stands for Vector Quantized Variational Autoencoder, which might sound a bit complicated, so let’s break it down.
First, let’s try to understand what we want to do here. Compared to lyrics or sheet music, a raw audio file is vastly more complex. If we want our model to “learn” from songs, we have to transform the audio into a more compressed and simplified representation. In machine learning, we call this underlying representation a latent space.
An autoencoder is an unsupervised learning technique that uses a neural network to find non-linear latent representations for a given data distribution. The autoencoder consists of two parts: an encoder and a decoder.
The encoder maps the raw data into the latent space, while the decoder takes the latent representation and tries to reconstruct the original input. The autoencoder essentially learns to compress the raw data in a way that minimizes reconstruction error.
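Here is a minimal sketch of that encoder/decoder loop, assuming short fixed-size audio windows and a simple fully connected network. It is not Jukebox’s architecture, just an illustration of how an autoencoder is trained to minimize reconstruction error.

```python
import torch
import torch.nn as nn

# Minimal autoencoder sketch (illustrative only, not Jukebox's architecture).
# It compresses a 1,024-sample audio window into a 64-dimensional latent vector
# and learns to reconstruct the original window from that latent.
class AutoEncoder(nn.Module):
    def __init__(self, input_dim=1024, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)        # compress the input into the latent space
        return self.decoder(z)     # reconstruct the original input from the latent

model = AutoEncoder()
audio_batch = torch.randn(8, 1024)                          # stand-in for raw audio windows
reconstruction = model(audio_batch)
loss = nn.functional.mse_loss(reconstruction, audio_batch)  # reconstruction error to minimize
```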
Now that we know what an autoencoder does, let’s try to understand what we mean by a “variational” autoencoder. Compared to typical autoencoders, variational autoencoders add a prior to the latent space.
Without diving into the mathematics, adding a probabilistic prior regularizes the latent space so that encodings stay close to a known distribution. The main difference between a VAE and a VQ-VAE is that the latter uses a discrete latent representation rather than a continuous one.
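The quantization step is what makes the representation discrete: each continuous encoder output is snapped to its nearest entry in a learned codebook, and only the index of that entry is kept. The snippet below is a rough sketch of that lookup, not Jukebox’s exact implementation.

```python
import torch

# Illustrative vector quantization step (not Jukebox's exact implementation).
# Each continuous latent vector is replaced by its nearest entry in a learned codebook.
codebook = torch.randn(512, 64)     # 512 discrete codes, each a 64-dimensional embedding
latents = torch.randn(100, 64)      # continuous encoder outputs for 100 timesteps

distances = torch.cdist(latents, codebook)  # distance from every latent to every code
codes = distances.argmin(dim=1)             # index of the nearest code per timestep
quantized = codebook[codes]                 # discrete representation passed to the decoder
```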
Jukebox uses a hierarchy of three VQ-VAE levels, each of which encodes the input independently. The bottom-level encoding compresses the audio the least and produces the highest-quality reconstruction, while the top-level encoding compresses it the most and retains only the essential musical information.
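To get a rough sense of scale, the Jukebox paper describes compression factors of roughly 8x, 32x, and 128x for the three levels. The quick arithmetic below shows how many discrete codes per second of 44.1 kHz audio each level has to model.

```python
# Rough arithmetic on the three-level hierarchy (factors taken from the Jukebox paper).
sample_rate = 44100  # raw audio samples per second
for name, factor in [("bottom", 8), ("middle", 32), ("top", 128)]:
    codes_per_second = sample_rate / factor
    print(f"{name:>6} level: ~{codes_per_second:.0f} codes per second of audio")
```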
Using Transformers
Now that we have the music codes encoded by VQ-VAE, we can try to generate music in this compressed discrete space.
Jukebox uses autoregressive transformers to create the output audio. Transformers are a type of neural network that works well with sequential data. Given a sequence of tokens, a transformer model tries to predict the next token.
Jukebox uses a simplified variant of Sparse Transformers. Once the prior models are trained, the transformers generate new compressed codes, which are then decoded back into raw audio using the VQ-VAE decoder.
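The sampling loop itself is simple in principle: predict a distribution over the next code, sample from it, append the sample, and repeat. The toy model below is only a stand-in for Jukebox’s Sparse Transformer priors, and the final decoding step back to raw audio is left as a comment.

```python
import torch
import torch.nn as nn

vocab_size = 512  # number of discrete VQ-VAE codes

# Toy next-code predictor: embeds the most recent code and outputs logits over the codebook.
# This is a stand-in for Jukebox's Sparse Transformer prior, just to show the sampling loop.
class ToyPrior(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 64)
        self.head = nn.Linear(64, vocab_size)

    def forward(self, tokens):
        return self.head(self.embed(tokens[:, -1]))  # condition only on the last code

prior = ToyPrior()
codes = torch.zeros(1, 1, dtype=torch.long)          # start-of-sequence code

for _ in range(100):                                 # generate 100 compressed codes
    logits = prior(codes)
    probs = torch.softmax(logits, dim=-1)
    next_code = torch.multinomial(probs, num_samples=1)
    codes = torch.cat([codes, next_code], dim=1)

# In Jukebox, the resulting code sequence would then be passed through the
# VQ-VAE decoder to turn it back into raw audio.
```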
Artist and Genre Conditioning in Jukebox
Jukebox’s generative model is made more controllable by providing additional conditioning signals during training.
The first conditioning signals are the artist and genre labels for each song. These reduce the entropy of the audio prediction and allow the model to achieve better quality. The labels also make it possible to steer the model toward a particular style.
Besides the artist and genre, timing signals are added at training time. These signals include the total length of the song, the start time of a particular sample, and the fraction of the song that has elapsed. This additional information helps the model pick up audio patterns that depend on the overall structure of a song.
For example, the model may learn that applause in live recordings tends to come at the end of a song, or that some genres have longer instrumental sections than others.
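A small sketch of how such timing signals could be derived for one training window is shown below, assuming we know the sample rate, the song’s total length, and where the window starts. The exact encoding Jukebox feeds to the model is more involved; this is only meant to show what the signals describe.

```python
# Illustrative timing signals for one training window (not Jukebox's exact encoding).
sample_rate = 44100
total_length = 180 * sample_rate   # a 3-minute song, measured in samples
window_start = 60 * sample_rate    # this training window starts at the 1-minute mark
window_length = 8 * sample_rate    # an 8-second chunk of audio

timing_signal = {
    "total_length": total_length,                     # how long the whole song is
    "offset": window_start,                           # where this sample begins
    "fraction_elapsed": window_start / total_length,  # how far into the song we are
}
print(timing_signal["fraction_elapsed"])              # 0.333... -> one third of the way in
```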
Lyrics
The conditioned models mentioned in the previous section are capable of generating a variety of singing voices. However, these voices tend to sing incoherent, unrecognizable words.
To control what the voices sing, the researchers provide the lyrics as additional context at training time. To map the lyric text to its timing in the actual audio, they used Spleeter to extract vocals from each song and NUS AutoLyricsAlign to obtain word-level alignments of the lyrics.
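For reference, separating vocals with Spleeter looks roughly like the snippet below; the file paths are placeholders, and the word-level alignment step with NUS AutoLyricsAlign is a separate tool that is not shown here.

```python
from spleeter.separator import Separator

# Separate a song into vocal and accompaniment stems with Spleeter's 2-stem model.
# File paths are placeholders; the lyric alignment step (NUS AutoLyricsAlign) is not shown.
separator = Separator("spleeter:2stems")
separator.separate_to_file("song.mp3", "separated/")  # writes vocals and accompaniment stems
```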
Limitations of Jukebox Model
One of the main limitations of Jukebox is its weak grasp of larger musical structure. A short 20-second clip of the output may sound impressive, but listeners will notice that the familiar structure of repeating choruses and verses is absent from the final output.
The model is also slow to render. It takes approximately 9 hours to fully render one minute of audio. This limits the number of songs that can be generated and prevents the model from being used in interactive applications.
Lastly, the researchers have noted that the training dataset is primarily in English and reflects predominantly Western musical conventions. Future research could focus on generating music in other languages and in non-Western styles.
Conclusion
The Jukebox project highlights the growing capability of machine learning models to create accurate latent representations of complex data such as raw audio. Similar breakthroughs are happening in text, as seen in projects like GPT-3, and in images, as seen in OpenAI’s DALL-E 2.
While the research in this space has been impressive, there are still concerns about intellectual property rights and the impact these models may have on creative industries as a whole. Researchers and creatives should continue to collaborate closely as these models improve.
Future generative music models may soon be able to act as a tool for musicians or as an application for creatives who need custom music for their projects.
