Controllable Music Loops Generation with MIDI and Text via Multi-Stage Cross Attention and Instrument-Aware Reinforcement Learning

Guan Yuan Chen, Von Wun Soo

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

The burgeoning field of text-to-music generation models has shown great promise in generating high-quality music aligned with users' textual descriptions. These models effectively capture abstract, global musical features such as style and mood. However, they often fail to render precisely the critical music loop attributes, including melody, rhythm, and instrumentation, which are essential for modern music loop production. To overcome this limitation, this paper proposes a Loops Transformer with a Multi-Stage Cross Attention mechanism that enables cohesive integration of textual and MIDI input specifications. Additionally, a novel Instrument-Aware Reinforcement Learning technique is introduced to ensure the correct adoption of instrumentation. We demonstrate that the proposed model can generate music loops that simultaneously satisfy the conditions specified by both natural-language text and MIDI input, ensuring coherence between the two modalities. We also show that our model outperforms the state-of-the-art baseline, MusicGen, in both objective metrics (lowering the FAD score by 1.3, where lower scores indicate higher quality, and improving the Normalized Dynamic Time Warping Distance to given melodies by 12%) and subjective metrics (+2.56% in OVL, +5.42% in REL, and +7.74% in Loop Consistency). These improvements highlight the model's ability to produce musically coherent loops that satisfy the complex requirements of contemporary music production, representing a notable advancement in the field. Generated music loop samples can be explored at: https://loopstransformer.netlify.app/.
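The abstract describes conditioning generation on both text and MIDI via multi-stage cross attention, but gives no architectural details. The following is only an illustrative NumPy sketch of the general idea, not the paper's implementation: audio-token queries attend first over text-condition embeddings, then over MIDI-condition embeddings in a second stage. All names, shapes, and the two-stage ordering here are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # Scaled dot-product cross attention: each query row attends
    # over all key/value rows from the conditioning modality.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ values

# Hypothetical dimensions: 8 audio-token queries, 5 text-condition
# vectors, 6 MIDI-condition vectors, all 16-dimensional.
rng = np.random.default_rng(0)
audio_q = rng.normal(size=(8, 16))
text_kv = rng.normal(size=(5, 16))
midi_kv = rng.normal(size=(6, 16))

# Stage 1: attend over the text condition; Stage 2: attend over MIDI.
h = cross_attention(audio_q, text_kv, text_kv)
h = cross_attention(h, midi_kv, midi_kv)
print(h.shape)  # → (8, 16)
```

Chaining the two attention stages keeps the audio-token representation the same shape throughout, so further stages (or a final decoder) can be stacked the same way.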

Original language: English
Title of host publication: MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
Publisher: Association for Computing Machinery, Inc
Pages: 6851-6859
Number of pages: 9
ISBN (Electronic): 9798400706868
DOIs
State: Published - 28 Oct 2024
Event: 32nd ACM International Conference on Multimedia, MM 2024 - Melbourne, Australia
Duration: 28 Oct 2024 – 1 Nov 2024

Publication series

Name: MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia

Conference

Conference: 32nd ACM International Conference on Multimedia, MM 2024
Country/Territory: Australia
City: Melbourne
Period: 28/10/24 – 01/11/24

Bibliographical note

Publisher Copyright:
© 2024 ACM.

Keywords

  • controllable music generation
  • loop generation
  • reinforcement learning
  • residual vector quantization
  • text-to-music generation
  • transformer

