Abstract
Audio language modeling remains a challenging problem in machine learning due to the high dimensionality and temporal complexity of raw waveform data. Recent approaches have introduced neural audio codecs that compress audio into discrete token sequences, enabling the use of language modeling techniques. However, many of these methods rely on multiple codebooks or hierarchical structures, which increase system complexity and computational cost.
In this work, we present MQGAN, a compact and efficient vector-quantized spectrogram codec designed to make audio language modeling easy. MQGAN combines an adversarially trained convolutional autoencoder with a vocoder, reconstructing 44.1 kHz audio from a single 1k-entry codebook operating at 86 tokens per second. We demonstrate that these compressed spectrogram tokens can be modeled autoregressively with a simple LSTM, achieving coherent music generation with just 18 million parameters trained in under a day on a single MI300X GPU. Our results suggest that spectrogram-based tokenization offers a simple yet effective path toward efficient audio language models.
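To make the single-codebook design concrete, here is a minimal sketch of a VQ-VAE-style quantization bottleneck of the kind the abstract describes. The class name `VectorQuantizer` and all hyperparameters (a 1,024-entry codebook, 256-dimensional codes, commitment weight 0.25) are illustrative assumptions, not MQGAN's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Single-codebook VQ bottleneck (illustrative, not the MQGAN code):
    snaps each encoder vector to its nearest codebook entry and emits
    the entry's index as a discrete token."""

    def __init__(self, num_codes: int = 1024, dim: int = 256, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment-loss weight

    def forward(self, z: torch.Tensor):
        # z: (batch, time, dim) -- one latent vector per output token
        w = self.codebook.weight                       # (num_codes, dim)
        # squared L2 distance from every latent to every code
        dist = (z.pow(2).sum(-1, keepdim=True)
                - 2 * z @ w.t()
                + w.pow(2).sum(-1))                    # (B, T, num_codes)
        indices = dist.argmin(-1)                      # discrete tokens, (B, T)
        z_q = self.codebook(indices)                   # quantized latents
        # VQ-VAE codebook + commitment losses
        vq_loss = (F.mse_loss(z_q, z.detach())
                   + self.beta * F.mse_loss(z, z_q.detach()))
        # straight-through estimator: gradients flow back to the encoder
        z_q = z + (z_q - z).detach()
        return z_q, indices, vq_loss
```

In this setup the `indices` tensor is the token sequence a language model would consume, while `z_q` feeds the decoder that, together with the vocoder, reconstructs the waveform.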
Inference
Samples
To validate our approach, we trained a small autoregressive baseline, MusicLSTM, on a 500-hour subset of the MTG Jamendo dataset. The examples below present spectrogram reconstructions (image) and their corresponding audio (player) side by side.
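For concreteness, a minimal sketch of what an LSTM prior like MusicLSTM might look like follows; the layer sizes, the two-layer architecture, and the `sample` helper are illustrative assumptions, not the model described above.

```python
import torch
import torch.nn as nn

class MusicLSTM(nn.Module):
    """Minimal autoregressive prior over codec tokens (illustrative):
    predicts the next spectrogram token given the sequence so far."""

    def __init__(self, vocab_size: int = 1024, dim: int = 512, layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, num_layers=layers, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens: torch.Tensor, state=None):
        # tokens: (batch, time) integer codebook indices
        x = self.embed(tokens)
        x, state = self.lstm(x, state)
        return self.head(x), state  # logits over the next token

@torch.no_grad()
def sample(model: MusicLSTM, prompt: torch.Tensor, steps: int, temperature: float = 1.0):
    """Ancestral sampling from the LSTM prior, one token at a time."""
    tokens = prompt  # (1, T0) seed tokens from the codec encoder
    logits, state = model(tokens)
    for _ in range(steps):
        probs = (logits[:, -1] / temperature).softmax(-1)
        nxt = torch.multinomial(probs, 1)          # (1, 1) sampled token
        tokens = torch.cat([tokens, nxt], dim=1)
        logits, state = model(nxt, state)          # reuse recurrent state
    return tokens  # decode with the codec + vocoder to get audio
```

At 86 tokens per second, a 30-second continuation corresponds to roughly 2,580 sampling steps, after which the codec decoder and vocoder turn the token sequence back into a 44.1 kHz waveform.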