Scientific paper by Google about Music generation
DukeRem, 27 January 2023
Google has just presented a scientific paper on the topic of music generation from text, something very similar to what is already happening for images with systems such as DALL-E, Stable Diffusion, and Midjourney. However, Google has decided, at least for the time being, not to make the generative system public, as it could unintentionally plagiarize existing songs (or parts of them) from its training set. This does not detract from the fact that the result is incredible and that, probably in the near future, music will also be produced from text.
The full paper can be accessed and downloaded at the following address:
https://arxiv.org/pdf/2301.11325.pdf
Below is a brief summary of its contents.
Conditional neural audio generation is a rapidly growing field that encompasses a wide range of applications, from text-to-speech and lyrics-conditioned music generation to audio synthesis from MIDI sequences. These tasks rely on a certain level of temporal alignment between the conditioning signal and the corresponding audio output. However, recent work has begun to explore generating audio from sequence-wide, high-level captions, such as "whistling with wind blowing." While these models represent a breakthrough, they are currently limited to simple acoustic scenes and have difficulty generating rich audio sequences with long-term structure and multiple stems, such as a music clip.
One promising approach to addressing these limitations is the AudioLM framework, which casts audio synthesis as a language modelling task in a discrete representation space and leverages a hierarchy of coarse-to-fine audio discrete units (or tokens) to achieve both high-fidelity and long-term coherence over dozens of seconds. Additionally, by making no assumptions about the content of the audio signal, AudioLM can be trained on audio-only corpora without any annotation, making it suitable for a wide range of audio signals. However, a major challenge facing this approach is the scarcity of paired audio-text data, which is in stark contrast to the image domain where the availability of massive datasets has contributed significantly to recent breakthroughs in image generation.
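To make the coarse-to-fine idea more concrete, here is a minimal, illustrative Python sketch of hierarchical token generation. The three stages are random stand-ins for the trained Transformer stages of AudioLM (which models semantic tokens derived from w2v-BERT and acoustic tokens from the SoundStream codec); the vocabulary sizes and token counts below are assumptions chosen for illustration, not values from the paper.

```python
# Illustrative sketch of AudioLM-style coarse-to-fine token generation.
# Each "stage" below is a random stand-in for an autoregressive Transformer
# trained on discrete audio tokens; vocabulary sizes and lengths are made up.
import numpy as np

rng = np.random.default_rng(0)

def toy_stage(context, vocab_size, num_tokens):
    """Stand-in for one autoregressive stage: emits tokens one at a time,
    each conditioned on the running context (previous stages + own output)."""
    tokens = []
    for _ in range(num_tokens):
        # A real stage would compute p(next token | context); we just sample.
        next_token = int(rng.integers(vocab_size))
        tokens.append(next_token)
        context = np.append(context, next_token)
    return np.array(tokens), context

# Stage 1: semantic tokens capture long-term structure (melody, rhythm, ...).
semantic, _ = toy_stage(np.array([], dtype=int), vocab_size=1024, num_tokens=50)
# Stage 2: coarse acoustic tokens, conditioned on the semantic tokens.
coarse, _ = toy_stage(semantic, vocab_size=1024, num_tokens=200)
# Stage 3: fine acoustic tokens, conditioned on semantic + coarse tokens;
# in the real pipeline these are decoded back to a waveform by a neural codec.
fine, _ = toy_stage(np.concatenate([semantic, coarse]), vocab_size=1024, num_tokens=600)
print(len(semantic), len(coarse), len(fine))
```

The point of the hierarchy is that the first stage can focus on long-term structure while the later stages fill in acoustic detail conditioned on it.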
To address this challenge, Google introduces MusicLM, a model for generating high-fidelity music from text descriptions. MusicLM leverages AudioLM's multi-stage autoregressive modelling as the generative component, but extends it to incorporate text conditioning. To address the lack of paired data, they rely on MuLan, a joint music-text model that is trained to project music and its corresponding text description to representations close to each other in an embedding space. This shared embedding space eliminates the need for captions at training time altogether, and allows training on massive audio-only corpora. When trained on a large dataset of unlabeled music, MusicLM learns to generate long and coherent music at 24 kHz, for text descriptions of significant complexity, such as "enchanting jazz song with a memorable saxophone solo and a solo singer" or "Berlin 90s techno with a low bass and strong kick."
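The role of the shared embedding space can be pictured with a short sketch, assuming toy embedding towers in place of the real trained MuLan networks: during training, the model is conditioned on the audio embedding of each unlabeled clip, and at inference time the text embedding of the prompt is dropped into the same slot. All function names and dimensions below are placeholders.

```python
# Illustrative sketch of conditioning through a shared music-text embedding space.
# The toy embedding towers below are hash-based placeholders; the real MuLan towers
# are trained so that a clip and its matching description embed close together.
import hashlib
import numpy as np

EMB_DIM = 128  # placeholder dimensionality

def toy_embed(key: str) -> np.ndarray:
    """Placeholder embedding tower: maps a string to a fixed unit vector."""
    seed = int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "little")
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(EMB_DIM)
    return v / np.linalg.norm(v)

def mulan_audio_embedding(clip_id: str) -> np.ndarray:
    return toy_embed("audio:" + clip_id)   # available for every training clip

def mulan_text_embedding(prompt: str) -> np.ndarray:
    return toy_embed("text:" + prompt)     # computed from the prompt at inference

def generate_tokens(conditioning: np.ndarray) -> np.ndarray:
    """Stand-in for MusicLM's autoregressive stages, conditioned on a MuLan embedding."""
    rng = np.random.default_rng(int(1e6 * abs(conditioning[0])))
    return rng.integers(1024, size=100)    # pretend audio tokens

# Training: condition on the *audio* embedding of each unlabeled clip (no caption needed).
train_cond = mulan_audio_embedding("unlabeled_clip_0042")
# Inference: the *text* embedding of the prompt is used in the same slot.
prompt_cond = mulan_text_embedding("Berlin 90s techno with a low bass and strong kick")
print(generate_tokens(train_cond)[:5], generate_tokens(prompt_cond)[:5])
```

Because matching audio and text land close together in the shared space, training only ever needs the audio side, which is what makes massive audio-only corpora usable.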
To evaluate MusicLM, Google introduces MusicCaps, a new high-quality music caption dataset with 5.5k examples prepared by expert musicians, which they publicly release to support future research. Their experiments show that MusicLM outperforms previous systems in terms of quality and adherence to the caption. Additionally, they demonstrate that MusicLM can be extended to accept an additional melody in the form of audio as conditioning to generate a music clip that follows the desired melody and is rendered in the style described by the text prompt.
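The melody-conditioning extension can be thought of as a second conditioning signal alongside the text: the melody of an input clip (for example, someone humming or whistling) is converted into discrete tokens and supplied together with the text prompt. The sketch below is purely illustrative; the extraction and quantization steps are crude placeholders, not the paper's actual pipeline.

```python
# Toy illustration of melody conditioning: a melody extracted from an input clip is
# turned into discrete tokens and supplied alongside the text prompt.
import numpy as np

rng = np.random.default_rng(0)

def extract_melody_tokens(audio: np.ndarray, vocab_size: int = 64) -> np.ndarray:
    """Placeholder: a real system embeds the melody and quantizes it into tokens."""
    frames = audio.reshape(-1, 400).mean(axis=1)                     # crude framing
    scaled = (frames - frames.min()) / (frames.max() - frames.min() + 1e-9)
    return (scaled * (vocab_size - 1)).astype(int)

hummed_clip = rng.standard_normal(24000)                             # 1 s of fake audio at 24 kHz
conditioning = {
    "melody_tokens": extract_melody_tokens(hummed_clip),             # follows the hummed melody
    "text_prompt": "enchanting jazz song with a memorable saxophone solo",  # sets the style
}
print(len(conditioning["melody_tokens"]), conditioning["text_prompt"])
```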
While music generation has the potential to be a powerful tool, it also poses risks, particularly in terms of the potential misappropriation of creative content. To address these risks, Google conducts a thorough study of memorization, adapting and extending a methodology previously used for text-based large language models. Their findings show that when feeding MuLan embeddings to MusicLM, the sequences of generated tokens differ significantly from the corresponding sequences in the training set, indicating that MusicLM does not simply memorize the training data.
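A rough sketch of this kind of memorization check, under the assumption that one compares generated token sequences against training sequences via exact n-gram matching (the paper's precise matching criterion may differ):

```python
# Illustrative memorization check: measure how often length-n token windows from a
# generated sequence appear verbatim in the training data. The criterion and the
# choice of n are assumptions, not the paper's exact methodology.
import numpy as np

def ngram_overlap(generated, training, n=10):
    """Fraction of length-n windows of `generated` that occur verbatim in `training`."""
    train_ngrams = {tuple(training[i:i + n]) for i in range(len(training) - n + 1)}
    gen_ngrams = [tuple(generated[i:i + n]) for i in range(len(generated) - n + 1)]
    if not gen_ngrams:
        return 0.0
    return sum(g in train_ngrams for g in gen_ngrams) / len(gen_ngrams)

rng = np.random.default_rng(0)
training_tokens = list(rng.integers(1024, size=2000))   # stand-in for training-set tokens
generated_tokens = list(rng.integers(1024, size=500))   # stand-in for generated tokens

print(f"10-gram overlap: {ngram_overlap(generated_tokens, training_tokens):.3f}")
# A low overlap for reasonably long windows suggests the model is not copying
# training examples verbatim; high overlap would flag potential memorization.
```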