CS 798: AI Music Generation

Chris Donahue

Estimated study time: 1 hr 40 min

Why this course exists
UW has strong machine-learning graduate offerings (CS 885, STAT 946) but nothing dedicated to music and audio generation — the subfield that produced Jukebox, MusicGen, Stable Audio, Suno, and Udio in rapid succession after 2020. CS 798 fills that gap by treating audio as a first-class modality: the compression and generation challenges are genuinely different from images and text, and the music-specific structure (pitch, rhythm, timbre, harmony) demands its own set of tools. Chris Donahue built the CMU 15-798 course on exactly this material; the notes follow his F2025 syllabus closely.

Sources and References

Primary textbook — Meinard Müller, Fundamentals of Music Processing: Audio, Analysis, Algorithms, Applications, 2nd ed., Springer, 2021.

Supplementary texts — Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016; Sander Dieleman, “Generative modelling in latent space” (blog, 2025); Yoshua Bengio et al., “A Neural Probabilistic Language Model,” JMLR, 2003.

Key papers — Dhariwal et al. “Jukebox: A Generative Model for Music” (OpenAI, 2020); Rombach et al. “High-Resolution Image Synthesis with Latent Diffusion Models” (CVPR 2022); Evans et al. “Stable Audio: Fast Timing-Conditioned Latent Audio Diffusion” (Stability AI, 2024); van den Oord et al. “Neural Discrete Representation Learning” (VQ-VAE, NeurIPS 2017); Copet et al. “Simple and Controllable Music Generation” (MusicGen, NeurIPS 2023); Huang et al. “Music Transformer” (ICLR 2019); Simon & Oore “Performance RNN” (Magenta, 2017); Huang & Yang “Pop Music Transformer” (ACM MM 2020); Thickstun et al. “Anticipatory Music Transformer” (ICML 2024); Engel et al. “DDSP” (ICLR 2020); Lipman et al. “Flow Matching for Generative Modeling” (ICLR 2023); Song et al. “Consistency Models” (ICML 2023); Liu et al. “DiffSinger” (AAAI 2022); San Roman et al. “AudioSeal” (Meta, 2024).

Online resources — ISMIR proceedings (ismir.net); Sander Dieleman’s blog (sander.ai); Jay Alammar’s illustrated Transformer (jalammar.github.io); CMU 15-798 course materials (Fall 2025, Chris Donahue).

Community notes — Xiaohongshu user 八荒春’s public class notes on music generation AI.


Part I: Foundations of Latent Generative Modeling

Chapter 1: Two-Stage Generative Modeling

Why Compress Before Generating

Almost every modern generative model for images, audio, and video — Stable Diffusion, Suno, Udio, MusicGen — shares the same architectural skeleton: first compress the raw signal into a compact latent representation, then train a generative model on those latents. This two-stage strategy did not emerge from aesthetic preference; it is a hard-won engineering compromise forced by the sheer scale of raw media signals.

Consider what it means to model music directly. A three-minute piece of CD-quality audio plays at 44,100 samples per second, yielding over 7.9 million samples per channel, or 15.9 million for stereo. Any model that must process this raw stream encounters two fundamental obstacles. First, the computational cost is prohibitive: Transformer self-attention scales as \(\mathcal{O}(n^2)\) in sequence length, so doubling the audio duration quadruples the memory. At 44.1 kHz, even a 10-second clip produces a 441,000-step sequence — orders of magnitude beyond what any practical Transformer can process. Second, and more subtly, the learning problem is intractable: meaningful structure in music spans many timescales simultaneously. A model predicting the 15-millionth sample must, in principle, recall what happened at sample 100,000 — a dependency that no fixed-capacity hidden state can reliably maintain.

Two-stage generative modeling cuts through both problems at once. In the first stage, an autoencoder maps raw audio \(x\) to a compact latent representation \(z = E(x)\) and learns a decoder \(D(z) \approx x\). The encoder is trained to discard perceptually irrelevant information — phase micro-variation, inaudible high-frequency noise — while preserving the structure that matters: pitch, rhythm, timbre, harmony. In the second stage, a generative model learns the distribution \(p(z)\) in this compressed space. Because \(z\) is far smaller than \(x\), Stage 2 can be trained efficiently; because the encoder has already organized the representation, \(p(z)\) is a more learnable target than \(p(x)\).

The two stages must be trained separately and sequentially. If Stage 2 gradients were allowed to flow back into Stage 1, the autoencoder’s geometry would drift to serve the generative model’s current objectives rather than ensuring faithful reconstruction — and both stages would learn poorly. Fixing the encoder after Stage 1 guarantees a stable coordinate system for Stage 2.

Reconstruction, Bottleneck, and Prior Losses

A complete two-stage system involves three families of losses, each serving a distinct purpose.

Reconstruction losses keep the encoder–decoder pair honest. The simplest is an L2 regression loss \(\mathcal{L}_{\text{recon}} = \|x - D(E(x))\|_2^2\), which minimizes the mean squared difference between input and reconstruction in the raw signal domain. Regression loss works well for recovering low-frequency content and large-scale structure, but it fails on fine details: because natural signals concentrate most energy in low frequencies, the optimal L2 solution is a blurry average that minimizes squared pixel or sample error without reproducing sharp edges or high-frequency transients.

Perceptual losses correct this deficiency by comparing inputs in the activation space of a frozen pretrained network rather than in raw-signal space. For images the pretrained network is typically VGG or LPIPS; for audio, spectral representations serve the same role. A multi-resolution spectral loss of the form

\[ \mathcal{L}_{\text{spec}} = \mathbb{E}_x \Bigl[\sum_i \bigl\| |\text{STFT}_i(x)| - |\text{STFT}_i(\hat{x})| \bigr\|_2 \Bigr] \]

compares STFT magnitude spectra at multiple window sizes but ignores phase, because phase information is nearly random — perceptually insignificant and statistically unlearnable. By penalizing spectral magnitude mismatch, the loss nudges the decoder to reproduce harmonic structure and timbral detail rather than chasing waveform alignment.
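As a concrete illustration, here is a minimal PyTorch sketch of such a loss; the FFT sizes and hop lengths are illustrative choices rather than any particular system's settings:

```python
import torch

def multi_resolution_stft_loss(x, x_hat, fft_sizes=(512, 1024, 2048)):
    """L2 distance between STFT magnitude spectra at several resolutions.

    x, x_hat: (batch, samples) waveforms. Taking abs() discards phase.
    """
    loss = 0.0
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft, device=x.device)
        spec = lambda s: torch.stft(
            s, n_fft=n_fft, hop_length=n_fft // 4,
            window=window, return_complex=True,
        ).abs()
        loss = loss + torch.mean((spec(x) - spec(x_hat)) ** 2)
    return loss / len(fft_sizes)
```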

Adversarial losses add a discriminator that attempts to distinguish real inputs from reconstructions. The autoencoder is then trained to fool the discriminator — a formulation borrowed from GANs. Unlike regression loss, adversarial loss is not graded by distance: it simply asks whether the output lies on the manifold of real data. This makes it especially effective at recovering high-frequency local statistics and realistic textures, since those are exactly the signals a discriminator can detect. In practice, adversarial loss is often disabled for the first several thousand training steps to prevent instability before the autoencoder has learned basic reconstruction.

Bottleneck losses constrain the latent space directly rather than via reconstruction. The two dominant choices are vector quantization and KL regularization. Vector quantization forces every encoder output to snap to the nearest entry in a learned codebook; this discretizes the latent and prevents it from storing arbitrary noise. KL regularization, following the VAE framework, penalizes the divergence between the posterior \(q_\phi(z|x)\) and a standard Gaussian prior: \(\mathcal{L}_{\text{KL}} = \text{KL}(q_\phi(z|x) \| \mathcal{N}(0, I))\). In modern latent diffusion systems (following Stable Diffusion), the KL coefficient is made extremely small — so small that the KL term no longer functions as a true variational lower bound. What remains is not a principled Bayesian objective but a shape regularizer that suppresses outlier latent values and keeps the dynamic range of \(z\) manageable, making Stage 2 numerically well-conditioned.

Finally, Stage 2 losses train the generative model itself: negative log-likelihood for autoregressive models, or a noise-prediction objective for diffusion models. These losses operate entirely in latent space and do not affect the Stage 1 parameters.

Rate–Distortion–Modelability

Classical information theory gives us the rate–distortion tradeoff: compress more aggressively (lower rate) and you incur larger reconstruction error (higher distortion), with the rate–distortion curve describing the Pareto frontier between them. Two-stage generative modeling introduces a third dimension: modelability, which measures how easy it is for Stage 2 to learn the distribution \(p(z)\).

The tension arises because the compression methods that minimize distortion at a given rate are not, in general, the methods that maximize modelability. Lossless entropy coding, for instance, achieves zero distortion at the minimum bit rate — yet entropy-coded representations are, by design, maximally unstructured. Every bit is statistically independent of every other; there is no spatial correlation, no smooth topology, nothing for a neural network to grab onto. The result is a latent that is information-theoretically optimal but effectively unlearnable.

The tensor size reduction factor (TSR) is a convenient scalar for comparing compression ratios across architectures:

\[ \text{TSR} = \frac{\text{total elements in input}}{\text{total elements in bottleneck}} \]

For a two-dimensional image of shape \(H \times W \times 3\) compressed to \(h \times w \times c\) with downsampling factor \(f = H/h = W/w\), the TSR works out to \(3f^2/c\). For audio, the formula is one-dimensional: a stereo clip of \(L\) samples per channel compressed to a latent of length \(L/f_s\) with \(C\) channels gives TSR \(= 2 f_s / C\). Choosing the right TSR is critical: too high a TSR (aggressive compression) and the decoder cannot faithfully reconstruct; too low a TSR and Stage 2 must model near-raw data, defeating the purpose of latents.
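A quick numerical sanity check of these formulas, using Stable-Diffusion-like and Stable-Audio-like settings (the shapes here are illustrative, not taken from either codebase):

```python
def tsr_image(H, W, f, c, in_channels=3):
    """Input elements / bottleneck elements for an image autoencoder."""
    return (H * W * in_channels) / ((H // f) * (W // f) * c)

def tsr_audio(num_samples, f_s, latent_channels, in_channels=2):
    """Same ratio for a stereo audio autoencoder with temporal downsampling f_s."""
    return (num_samples * in_channels) / ((num_samples // f_s) * latent_channels)

print(tsr_image(512, 512, f=8, c=4))                          # 48.0  (SD-style image VAE)
print(tsr_audio(44100 * 95, f_s=1024, latent_channels=64))    # ~32   (Stable-Audio-style VAE)
```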

Beyond rate and distortion, the following three properties govern how well-suited a latent space is for Stage 2 learning:

  1. Capacity — how many bits of information are stored per latent position. Controlled by three knobs: the spatial or temporal downsampling factor (fewer latent positions), the channel dimension (more bits per position), and for discrete latents, the codebook size (bits per code).

  2. Curation — which bits of information from \(x\) make it into \(z\). A well-curated latent contains musically relevant information (pitch, rhythm, timbre) but not perceptually irrelevant noise (phase jitter, inaudible transients).

  3. Shaping — the geometric organization of information within \(z\). Retaining the spatial or temporal grid structure of the input is particularly important: convolutional and Transformer-based generative models exploit local correlations and position encodings, both of which require that the latent positions have a meaningful spatial layout.

Curating and Shaping the Latent Space

The key practical insight is that preserving grid topology is worth the efficiency cost. A latent \(z\) that maintains an \(H' \times W'\) grid for images (or \(T'\) time steps for audio) is somewhat wasteful — an information-sparse area of the image still gets as many latent positions as an information-dense one — but it gives Stage 2 a Euclidean manifold with local correlations that convolutional networks and Transformers are specifically built to exploit. Global latents (a single vector encoding the entire input) force Stage 2 to learn entirely without spatial structure, which is far harder.

Two advanced techniques regularize for modelability explicitly. Co-training a generative prior feeds Stage 2’s loss back into Stage 1’s encoder, nudging the encoder to organize its latent in a way that Stage 2 finds learnable, not merely in a way that Stage 1 finds reconstructable. Pretrained representation supervision (e.g., matching DINO features) tells the encoder “align your latents with representations that are already known to be semantically structured and learnable,” rather than discovering that structure from scratch. Both techniques trade some reconstruction quality for improved Stage 2 training dynamics.

A third technique is equivariance regularization: if a deterministic transformation \(T\) is applied to the input (such as a pitch transposition or a temporal stretch), the encoder should respond predictably — \(E(T(x)) \approx \hat{T}(E(x))\) for a known latent-space transformation \(\hat{T}\). Enforcing equivariance keeps the latent manifold geometrically regular: transformations that Stage 2 architectures already exploit — translation equivariance in convolutions, permutation equivariance in attention — are built into Stage 1 rather than left for Stage 2 to rediscover from data. In audio, a pitch-equivariant encoder whose latent shifts by a constant offset in response to a semitone transposition gives Stage 2 a natural symmetry to exploit and improves generalization across musical keys without requiring explicit transposition augmentation at every training step.

Sander Dieleman has also explored diffusion decoders — replacing Stage 1’s single-pass decoder with a denoising diffusion model that refines \(\hat{x}\) from \(z\) iteratively. Diffusion decoders can produce higher-fidelity reconstructions and avoid the instabilities of GAN discriminators. Their cost is inference latency: a multi-step decoding loop at the reconstruction stage partially cancels the efficiency gains of operating in latent space. DALL-E 3’s Consistency Decoder addresses this with knowledge distillation — compressing the diffusion decoder to just two steps while preserving most of the quality.


Chapter 2: Jukebox and Hierarchical Audio VQ-VAE

Hierarchical Discrete Autoencoders

OpenAI’s Jukebox (2020) was the first system to demonstrate realistic music generation in diverse styles with novel control inputs — artist, genre, lyrics, and timing — all from raw audio waveforms. Its architectural strategy illustrates nearly everything from Chapter 1 in concrete form.

The fundamental difficulty Jukebox had to solve is raw-audio sequence length. Four minutes of CD-quality audio at 44,100 Hz yields roughly \(10^7\) time steps — over ten million dimensions of raw input. Even if a model could process such sequences, the sheer distance between meaningful musical events (a chorus returning after two minutes, a lyric phrase resolving its harmonic suspension) makes end-to-end learning hopeless.

Jukebox’s Stage 1 trains three independent VQ-VAEs with hop lengths (temporal downsampling factors) of 128×, 32×, and 8×. The resulting token rates are approximately 345 Hz, 1,378 Hz, and 5,513 Hz respectively. A 4-minute piece of music at 44,100 Hz becomes sequences of roughly 83,000, 331,000, and 1.3 million tokens at these three levels — still long, but manageable with sparse attention techniques.

Why train three entirely separate autoencoders rather than a single hierarchical VQ-VAE? The answer is hierarchy collapse. In a hierarchical architecture trained end-to-end, the bottom level (finest resolution) quickly learns to reconstruct the input nearly perfectly on its own; the upper levels, finding that their codes are barely needed, degenerate — producing latent representations with little meaningful structure. Training independently, with different compression rates, forces each level to encode a distinct facet of the signal: the 128× top level captures melodic shape and long-range harmonic progressions; the 32× middle level encodes mid-range textural patterns; the 8× bottom level preserves fine acoustic details and timbre.

Each VQ-VAE is built around the standard VQ loss:

\[ \mathcal{L} = \mathcal{L}_{\text{recons}} + \mathcal{L}_{\text{codebook}} + \beta \mathcal{L}_{\text{commit}} \]

where \(\mathcal{L}_{\text{recons}}\) measures reconstruction fidelity, \(\mathcal{L}_{\text{codebook}}\) pulls codebook entries toward encoder outputs (with EMA updates for stability), and \(\mathcal{L}_{\text{commit}}\) pulls encoder outputs toward codebook entries to prevent them from “wandering.” Each codebook has 2,048 entries. A standard \(\ell_2\) reconstruction loss on raw waveforms produces muddy, dull audio — the network learns to reproduce low-frequency energy while abandoning perceptually critical high-frequency details. To counteract this, Jukebox adds a multi-resolution spectral loss: it computes STFTs at multiple window sizes, compares magnitude spectra (not phase), and penalizes discrepancies. Phase carries very little perceptual information — the human auditory system is largely phase-blind — but amplitude spectra directly determine timbre and harmonic richness. The spectral loss acts like a perceptual high-pass filter, forcing the decoder to reconstruct not just the right average waveform but the right instantaneous frequency content.

A persistent failure mode in discrete autoencoders is codebook collapse: some codes are never selected, reducing effective vocabulary. Jukebox fixes this with random restarts — any code with below-threshold usage is re-initialized to a randomly selected encoder output from the current batch, forcing it to relearn a relevant prototype.

Sparse-Attention Cascaded Priors

Stage 2 trains a cascade of three autoregressive language models, one per VQ-VAE level:

\[ p(z) = p(z_{\text{top}}) \, p(z_{\text{middle}} \mid z_{\text{top}}) \, p(z_{\text{bottom}} \mid z_{\text{middle}}, z_{\text{top}}) \]

Even the top-level sequences (the most compressed, at 345 Hz) run to about 24,000 tokens for a 70-second excerpt, while the prior’s context window tops out at 8,192 tokens. To make attention over even that window affordable, Jukebox employs sparse attention: the 8,192-token window is reshaped into a 2D array (e.g., 128 rows × 64 columns), and attention is computed with axis-aligned patterns — attending within the current row, along the current column, and across a strided sample of prior rows. This captures both local and coarse long-range dependencies without paying \(\mathcal{O}(n^2)\) on the full window.

The upsamplers — the middle and bottom level priors — receive the higher-level tokens as conditioning signals. Each upsampler runs a conditioning network (a deep residual WaveNet) over the upper tokens, applies strided convolutions to upsample them to the target resolution, and injects the result as positional conditioning into the lower-level Transformer. This top-down conditioning mechanism mirrors how a conductor first sketches the orchestral structure and then fills in instrumental details: the broad melodic arc is determined at the top, and progressively finer detail is added conditionally.

Conditioning on Artist, Genre, Lyrics, Timing

All three priors model a conditional distribution \(p(z \mid c)\) where \(c = (\text{artist}, \text{genre}, \text{lyrics}, \text{timing})\). Each conditioning signal is injected as an additional embedding that sums with the token embedding at every position.

Artist and genre conditioning uses a closed vocabulary of known artists and genres. The artist embedding is substantially stronger than the genre embedding, because the most salient invariant of an artist — their vocal timbre and style — dominates the acoustic signal. When you mix a country singer’s artist embedding with a hip-hop genre embedding, the generated audio sounds country. Voice is a strong prior; genre is a weak one.

Timing conditioning provides three scalar signals: total song duration, the absolute chunk offset in the original song, and the relative offset (chunk offset / total duration). This tells the model “you are in the verse of a three-minute song” rather than “you are 90 seconds into some audio” — enabling it to mimic the statistical patterns of introductions, bridges, and outros.

Lyrics conditioning is architecturally the most complex. Without it, models conditioned only on artist and genre generate something that sounds like singing but produces incomprehensible babbling. To inject lyrics, Jukebox uses forced alignment to identify which lyrics fragment corresponds to each training chunk, trains a character-level lyrics encoder, and converts the top-level prior into a sequence-to-sequence encoder–decoder model, with cross-attention from audio tokens to lyrics encoder representations. Crucially, the cross-attention layers are added with zero initialization — so that at the start of lyrics fine-tuning, the model behaves identically to the unconditional prior, and the newly added parameters learn incrementally without disrupting the already-learned music distribution.

Compute, Limitations, and Evaluation Critiques

Jukebox’s compute budget is staggering: approximately 24,000 V100-days of total training (equivalent to roughly 75 metric tons of CO₂ — about 17.5 car-years of driving). The 5-billion-parameter top prior required 512 V100s for four weeks for music alone, plus two more weeks for the lyrics-conditioned variant.

Inference is correspondingly slow. Generating one minute of top-level tokens requires about one hour of compute; upsampling to the full three-level resolution adds another eight hours on top. These are upper bounds, and speculative decoding or distillation techniques could reduce them substantially — but as published, Jukebox is not an interactive system.

What is perhaps most surprising is that, for all its scale, Jukebox appeared without a single standard quantitative evaluation. No FAD scores, no KAD, no comparison against held-out human judgments. The paper contains only informal listening examples. Retrospectively, this represents a significant gap — there is no way to know whether a model that took 24,000 V100-days to train is better or worse than one that took 10,000. Rigorous evaluation of generated music remains an open problem to this day, and Jukebox’s omission of it was noticed.


Part II: Continuous-Latent Diffusion for Music

Chapter 3: Latent Diffusion Models

From Pixel-Space DDPM to Latent Diffusion

Denoising Diffusion Probabilistic Models (DDPM; Ho et al., NeurIPS 2020) define a forward Markov chain that progressively corrupts a data sample \(x_0\) into Gaussian noise:

\[ q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\, \sqrt{1-\beta_t}\, x_{t-1},\; \beta_t I\right) \]

A key property of this Gaussian Markov chain is the closed-form marginal:

\[ q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\, \sqrt{\bar{\alpha}_t}\, x_0,\; (1-\bar{\alpha}_t) I\right), \quad \bar{\alpha}_t = \prod_{s=1}^t (1-\beta_s) \]

so any \(x_t\) can be sampled directly from \(x_0\) in a single step without unrolling the chain. The reverse process is parameterized as another Gaussian:

\[ p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\; \mu_\theta(x_t, t),\; \Sigma_\theta(x_t, t)\right) \]

Minimizing the evidence lower bound (ELBO) over the reverse chain simplifies — after reparameterization — to a noise prediction objective:

\[ \mathcal{L}_{\text{DDPM}} = \mathbb{E}_{x_0, \epsilon \sim \mathcal{N}(0,I), t}\!\left[\|\epsilon - \epsilon_\theta(x_t, t)\|_2^2\right] \]

The model \(\epsilon_\theta\) is typically a U-Net that takes the noisy sample and a time embedding as input and predicts the noise that was added. DDPM produces excellent samples, but pixel-space DDPM is expensive: the U-Net must process full-resolution images at every denoising step, and sampling requires hundreds of such forward passes.
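A minimal sketch of the corresponding training step, assuming a noise-prediction network `model(x_t, t)` and a precomputed \(\bar{\alpha}\) schedule (names are illustrative):

```python
import torch

def ddpm_loss(model, x0, alphas_bar):
    """One DDPM training step: sample t, jump to x_t in closed form, predict the noise.

    x0: (batch, ...) clean samples; alphas_bar: (T,) precomputed cumulative products.
    """
    b = x0.shape[0]
    alphas_bar = alphas_bar.to(x0.device)
    t = torch.randint(0, len(alphas_bar), (b,), device=x0.device)
    a_bar = alphas_bar[t].view(b, *([1] * (x0.dim() - 1)))   # broadcast over data dims
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps        # sample q(x_t | x_0) directly
    return torch.mean((eps - model(x_t, t)) ** 2)
```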

Latent Diffusion Models (LDM; Rombach et al., CVPR 2022) solve this cost problem by replacing pixel-space operations with latent-space ones. A Stage 1 autoencoder compresses images to a latent \(z = E(x) \in \mathbb{R}^{h \times w \times c}\) with spatial downsampling factor \(f\) (so \(h = H/f\), \(w = W/f\)). The Stage 2 diffusion model then operates entirely on \(z\), and the full training objective becomes:

\[ \mathcal{L}_{\text{LDM}} = \mathbb{E}_{E(x), \epsilon \sim \mathcal{N}(0,I), t}\!\left[\|\epsilon - \epsilon_\theta(z_t, t)\|_2^2\right] \]

Because \(z\) preserves the 2D spatial structure of the image (unlike a fully-compressed global code), the U-Net’s convolutional inductive bias remains intact: local spatial correlations are still present in \(z\), enabling the same skip-connection architecture that works on raw images.

Perceptual Compression Stage

The Stage 1 autoencoder in LDM is a KL-regularized VAE with perceptual and adversarial losses. A pure L2 reconstruction loss produces blurry outputs because the model learns to predict the average over all plausible reconstructions consistent with the latent; adding perceptual loss (VGG features) and patch-based adversarial loss pushes the decoder to produce locally realistic textures even at the cost of pixel-level accuracy.

The compression factor \(f\) controls the TSR: with \(f = H/h = W/w\) and \(c\) latent channels,

\[ \text{TSR} = \frac{H \cdot W \cdot 3}{h \cdot w \cdot c} = \frac{3f^2}{c} \]

Rombach et al. find that \(f \in \{4, 8, 16\}\) are all viable, with \(f = 8\) offering the best balance. Below \(f = 4\), the latent is nearly pixel-space — expensive and slow to process. Above \(f = 16\), too much information is lost in the compression and the decoder cannot faithfully reconstruct fine details.

Crucially, a mild compression ratio preserves the spatial inductive bias that Stage 2 depends on. Earlier discrete approaches (DALL-E 1) aggressively compressed images to 32×32 token grids and autoregressively modeled the 1D flattened sequence — severing the image’s 2D spatial structure. LDM keeps \(z\) as a 2D tensor, letting the U-Net operate with standard convolutional locality assumptions intact.

Cross-Attention Conditioning and Classifier-Free Guidance

To condition generation on text or any other modality, LDM introduces a domain expert encoder \(\tau_\theta(y)\) that maps the conditioning signal to an intermediate representation, which is then injected into the U-Net via cross-attention. At each attention layer:

\[ \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_h}}\right) V \]

where queries \(Q = W_Q^{(i)} \cdot \phi_i(z_t)\) come from the U-Net’s intermediate features, and keys and values \(K = W_K^{(i)} \cdot \tau_\theta(y)\), \(V = W_V^{(i)} \cdot \tau_\theta(y)\) come from the condition. For text conditioning, \(\tau_\theta\) is typically a CLIP ViT-L/14 text encoder.
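A hedged sketch of such a cross-attention layer in PyTorch; the dimensions, head count, and projection names are illustrative and not taken from any specific codebase:

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Queries from U-Net features, keys/values from the condition encoder tau(y)."""

    def __init__(self, d_unet, d_cond, d_head=64, n_heads=8):
        super().__init__()
        inner = d_head * n_heads
        self.n_heads, self.d_head = n_heads, d_head
        self.to_q = nn.Linear(d_unet, inner, bias=False)
        self.to_k = nn.Linear(d_cond, inner, bias=False)
        self.to_v = nn.Linear(d_cond, inner, bias=False)
        self.to_out = nn.Linear(inner, d_unet)

    def forward(self, z_feats, cond):
        # z_feats: (B, N_latent, d_unet) flattened U-Net features; cond: (B, N_cond, d_cond)
        def split(x):  # (B, N, inner) -> (B, heads, N, d_head)
            B, N, _ = x.shape
            return x.view(B, N, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.to_q(z_feats)), split(self.to_k(cond)), split(self.to_v(cond))
        attn = torch.softmax(q @ k.transpose(-1, -2) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(z_feats.shape[0], z_feats.shape[1], -1)
        return self.to_out(out)
```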

During training, conditioning is randomly dropped to train an unconditional model simultaneously. At inference, classifier-free guidance (CFG) amplifies the conditional signal:

\[ \epsilon_{\text{guided}} = \epsilon_{\text{uncond}} + s \cdot (\epsilon_{\text{cond}} - \epsilon_{\text{uncond}}) \]

A guidance scale \(s > 1\) sharpens adherence to the prompt at the cost of reduced diversity. Setting \(s = 7.5\) is typical for text-to-image; \(s \approx 3.0\) is common for text-to-music generation.
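In code, CFG at sampling time is a two-forward-pass combination of noise predictions; `null_cond` stands for whatever "empty" conditioning the model was trained with, and the `model(x_t, t, cond)` signature is an assumption of this sketch:

```python
import torch

@torch.no_grad()
def cfg_eps(model, x_t, t, cond, null_cond, scale=7.5):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward the conditional one by a factor `scale`."""
    eps_cond = model(x_t, t, cond)
    eps_uncond = model(x_t, t, null_cond)   # same model, conditioning dropped
    return eps_uncond + scale * (eps_cond - eps_uncond)
```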

One difference between images and audio that affects diffusion’s generalizability: images are spatially smooth (adjacent pixels are highly correlated), whereas audio is temporally high-frequency (a 44.1 kHz waveform oscillates tens of thousands of times per second). The same U-Net architecture that exploits 2D spatial smoothness in images must be substantially redesigned to handle audio’s much faster temporal oscillations.

Riffusion: Image Diffusion on Mel-Spectrograms

Riffusion (Forsgren & Martiros, 2022) illustrates a revealing shortcut: music can be generated by running a standard image diffusion model on mel-spectrograms rather than photographs. A mel-spectrogram — a 2D array with time on the horizontal axis and mel-frequency on the vertical axis — looks enough like an image that Stable Diffusion’s U-Net, fine-tuned on a small music dataset, learns to synthesize plausible mel-spectrograms from text prompts. A Griffin-Lim reconstruction or a pretrained vocoder then converts the spectrogram back to audio.

Riffusion works because the mel-spectrogram’s 2D spatial structure matches the U-Net’s inductive biases almost perfectly: pitch corresponds to vertical position, rhythm to horizontal patterns, and timbral texture to fine-grained local structure — all learnable via standard convolutions. The system required no audio-specific architecture changes whatsoever.

Its importance is more conceptual than practical. Audio quality is limited by the lossy spectrogram-to-waveform conversion and by the domain mismatch between natural image statistics and music spectrogram statistics. But Riffusion concretely demonstrates the representation as inductive bias principle from Chapter 1: by choosing a 2D grid representation that aligns the problem geometry with an existing generative model’s strengths, a working music generator can be assembled without audio-specific design. Every system in this chapter makes the same bet — the decisive choice is always how to represent the signal, not how to redesign the network.


Chapter 4: Stable Audio

Variable-Length Generation via Timing Conditioning

Stable Audio (Evans et al., Stability AI, 2024) extends the LDM pipeline to 44.1 kHz stereo audio generation with a critical innovation: timing conditioning that decouples generation quality from audio duration.

Previous audio diffusion models generated fixed-length outputs by design — a model trained on 10-second clips could not generalize to 30 seconds without retraining. In music, this is a fundamental limitation: a listener can immediately tell whether a piece has a meaningful beginning, middle, and end, or whether it simply stops arbitrarily. Stable Audio addresses this by training the diffusion model to generate a fixed 95.1-second window while conditioning on two scalar signals:

  • seconds_start: the temporal position of the output within a hypothetical full song (e.g., a value of 30 means “this clip starts 30 seconds into the song”).
  • seconds_total: the intended total duration of the song.

These scalars are converted to per-second embeddings and concatenated with the text embedding as additional condition tokens fed to the U-Net via cross-attention. The model thereby learns structural associations — that intros tend to arrive when seconds_start is near zero and seconds_total is large, that outros involve a gradual density decrease near seconds_start ≈ seconds_total, and so on. Generating a 30-second clip requires the model to run the full 95.1-second computation (with the final ~65 seconds collapsing to silence), so the real-time factor does not scale linearly with target duration.
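A hedged sketch of how such timing conditioning could be implemented; the use of learned per-second embeddings, the module name, and the dimensions are assumptions for illustration, not details from the Stable Audio codebase:

```python
import torch
import torch.nn as nn

class TimingConditioner(nn.Module):
    """Map (seconds_start, seconds_total) to extra condition tokens, one embedding per second."""

    def __init__(self, max_seconds=512, dim=768):
        super().__init__()
        self.embed = nn.Embedding(max_seconds, dim)

    def forward(self, seconds_start, seconds_total):
        # seconds_start, seconds_total: (batch,) integer tensors
        cap = self.embed.num_embeddings - 1
        start_tok = self.embed(seconds_start.clamp(max=cap))
        total_tok = self.embed(seconds_total.clamp(max=cap))
        # Two extra condition tokens, concatenated with the text tokens for cross-attention.
        return torch.stack([start_tok, total_tok], dim=1)   # (batch, 2, dim)
```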

Stereo 44.1 kHz VAE Design

The Stage 1 autoencoder is a 133M-parameter fully-convolutional VAE derived from the Descript Audio Codec architecture, with the vector quantizer removed in favor of a continuous latent. It processes 44.1 kHz stereo input with a temporal downsampling factor of 1024, producing a latent of shape \(z \in \mathbb{R}^{64 \times (L/1024)}\) — 64 channels per latent time step. The tensor size reduction factor is 32×.

The Stage 1 loss combines four terms:

\[ \mathcal{L}_{\text{VAE}} = 1.0 \cdot \mathcal{L}_{\text{spec}} + 0.1 \cdot \mathcal{L}_{\text{adv}} + 5.0 \cdot \mathcal{L}_{\text{fm}} + 10^{-4} \cdot \mathcal{L}_{\text{KL}} \]

The spectral loss \(\mathcal{L}_{\text{spec}}\) is a multi-resolution STFT magnitude loss with window sizes 2048, 4096, and 8192, with A-weighting applied before the STFT to emphasize perceptually important frequency bands. The adversarial loss uses a multi-scale STFT discriminator with hinge GAN loss. The feature-matching loss \(\mathcal{L}_{\text{fm}} = \sum_\ell \|D_\ell(x) - D_\ell(\hat{x})\|\) matches intermediate discriminator activations, providing a stable signal that supplements adversarial training. The tiny KL coefficient (1e-4) acts purely as a scale regularizer, not as a proper variational objective.

A noteworthy engineering detail: after 460,000 training steps, Stability froze the encoder and continued fine-tuning only the decoder for an additional 640,000 steps. Freezing the encoder at this point fixes the latent geometry — further changes would invalidate any Stage 2 model trained on those latents — while allowing the decoder to keep improving reconstruction quality independently.

Fast Sampling with DPM-Solver++

The Stage 2 diffusion model is a 907M-parameter U-Net with four resolution levels (channel counts 1024–1280, downsampling factors 1–4) trained with a v-prediction objective rather than the standard noise-prediction formulation. V-prediction reparameterizes the denoising target as the “velocity” — the derivative of the data along the noising trajectory — which improves numerical stability at high noise levels:

\[ \mathcal{L}_{\text{diff}} = \mathbb{E}_{z_0, t, \epsilon}\!\left[\|v_t - v_\theta(z_t, t, c)\|^2\right], \quad v_t = \sqrt{\bar{\alpha}_t}\, \epsilon - \sqrt{1-\bar{\alpha}_t}\, z_0 \]
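A minimal sketch of this objective, assuming a network `model(z_t, t, cond)` and an \(\bar{\alpha}_t\) tensor already broadcast to the latent's shape (both names are assumptions):

```python
import torch

def v_target(z0, eps, a_bar):
    """Velocity target: v = sqrt(a_bar) * eps - sqrt(1 - a_bar) * z0."""
    return a_bar.sqrt() * eps - (1.0 - a_bar).sqrt() * z0

def v_prediction_loss(model, z0, a_bar_t, t, cond):
    eps = torch.randn_like(z0)
    z_t = a_bar_t.sqrt() * z0 + (1.0 - a_bar_t).sqrt() * eps   # noised latent
    return torch.mean((v_target(z0, eps, a_bar_t) - model(z_t, t, cond)) ** 2)
```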

For inference, Stable Audio uses DPM-Solver++ (Lu et al., 2022), a high-order ODE solver for diffusion models that typically achieves good quality in 25–50 function evaluations rather than the 1,000 steps DDPM requires. At 8 seconds of compute for 95 seconds of audio, the real-time factor is \(\text{RTF} = 8/95 \approx 0.084\) — roughly twelve times faster than real time, a practical milestone for commercial deployment.

Text conditioning is provided by a self-trained 108M-parameter CLAP encoder (using the next-to-last layer’s embeddings, which the authors found provided stronger controllability than the final layer). Using domain-matched CLAP embeddings rather than a general-purpose T5 encoder improves text-audio alignment on stock-music-style prompts but may generalize less well to open-ended natural language descriptions.


Chapter 5: Beyond DDPM — Score Matching, Consistency, and Flow

Score-Based Generative Modeling

Score-based generative modeling (Song & Ermon, NeurIPS 2019) provides a complementary theoretical framework for diffusion. Instead of learning a denoiser, a score model learns the score function \(s_\theta(x, t) \approx \nabla_x \log p_t(x)\) — the gradient of the log data density at noise level \(t\). Sampling then proceeds by integrating the reverse-time stochastic differential equation (SDE):

\[ dx = -\frac{1}{2}\beta(t)\, x\, dt - \beta(t)\, \nabla_x \log p_t(x)\, dt + \sqrt{\beta(t)}\, dW \]

Song et al. (ICLR 2021) showed that DDPM and score matching are two views of the same process — DDPM’s noise prediction is equivalent to score matching up to a scaling constant. More consequentially, they also derived a deterministic probability flow ODE that generates the same marginal densities as the SDE but allows fast numerical integration, connecting generative modeling to neural ODEs and continuous normalizing flows.
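A sketch of reverse-time Euler–Maruyama sampling for the VP-SDE above, assuming a trained score network `score_model(x, t)` and a callable noise schedule `beta(t)` (both names and the step count are illustrative):

```python
import torch

@torch.no_grad()
def vp_sde_sample(score_model, shape, beta, n_steps=1000, device="cpu"):
    """Integrate the reverse-time VP-SDE from t=1 down to t=0 with Euler-Maruyama steps."""
    x = torch.randn(shape, device=device)
    dt = 1.0 / n_steps
    for i in range(n_steps, 0, -1):
        t = i / n_steps
        b = beta(t)
        t_batch = torch.full(shape[:1], t, device=device)
        drift = -0.5 * b * x - b * score_model(x, t_batch)     # f(x,t) - g(t)^2 * score
        noise = torch.randn_like(x) if i > 1 else 0.0          # no noise on the last step
        x = x - drift * dt + (b * dt) ** 0.5 * noise           # step backward in time
    return x
```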

Consistency Models and Few-Step Distillation

A fundamental inefficiency in diffusion sampling is that the reverse process must be integrated in many small steps. Consistency models (Song et al., ICML 2023) address this by training a function \(f_\theta(x_t, t)\) that directly maps any noisy sample back to the original clean data — satisfying the self-consistency property: for any two points \((x_t, t)\) and \((x_{t'}, t')\) on the same ODE trajectory, \(f_\theta(x_t, t) = f_\theta(x_{t'}, t')\). If this is achieved, the model can generate in a single step, or use two steps for significantly improved quality.

Training consistency models can be done either by distillation from a pretrained diffusion model (learning to match the ODE solver’s multi-step output in one step) or by direct consistency training without a pretrained teacher. DALL-E 3’s Consistency Decoder — a drop-in replacement for Stable Diffusion’s VAE decoder — uses the distillation approach, compressing what would be a 50-step diffusion decoding into two steps while maintaining visual fidelity.

Flow Matching

Flow matching (Lipman et al., ICLR 2023) trains a vector field \(u_\theta(x, t)\) to transform a simple source distribution (Gaussian noise) into the data distribution via an ODE \(dx = u_\theta(x, t)\, dt\). The key insight is that, for certain carefully chosen probability paths (e.g., linear interpolation between noise and data), the target vector field can be computed analytically at every training point without solving an ODE during training:

\[ \mathcal{L}_{\text{FM}} = \mathbb{E}_{t, x_0, x_1}\!\left[\|u_\theta(x_t, t) - (x_1 - x_0)\|^2\right], \quad x_t = (1-t)\, x_0 + t\, x_1 \]

This simple regression loss is not only easy to train but produces straighter, faster ODE trajectories than DDPM noise schedules — allowing high-quality sampling in 10–20 function evaluations. Flow matching is now the training framework behind Meta’s Voicebox (Le et al., 2023) and Audiobox (Vyas et al., 2023), both of which condition on text and audio prompts to generate speech and sound effects. The transition from DDPM to flow matching in production audio systems follows the same arc as the image domain’s transition from DDPM to rectified flows in Stable Diffusion 3 and FLUX.
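The training loop really is a plain regression; a minimal sketch with linear interpolation paths, where the `model(x_t, t)` signature is an assumption:

```python
import torch

def flow_matching_loss(model, x1):
    """Conditional flow matching with linear paths x_t = (1-t) x0 + t x1.

    x1: (batch, ...) data samples; x0 is Gaussian noise; the regression target is x1 - x0.
    """
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], device=x1.device).view(-1, *([1] * (x1.dim() - 1)))
    x_t = (1 - t) * x0 + t * x1
    target = x1 - x0                      # analytic vector field along the linear path
    return torch.mean((model(x_t, t.flatten()) - target) ** 2)
```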


Part III: Discrete-Token Codec Language Models

Chapter 6: Neural Audio Codecs and Neural Vocoders

VQ-VAE: The Foundation

VQ-VAE (van den Oord et al., NeurIPS 2017) introduced discrete latent variables to the VAE framework by replacing the reparameterized Gaussian posterior with a nearest-codebook-entry lookup. The codebook \(\mathcal{C} = \{e_1, \ldots, e_K\} \subset \mathbb{R}^D\) is a set of \(K\) learned prototype vectors. The encoder maps each input to a continuous feature map \(z_e(x)\); the quantizer replaces each \(D\)-dimensional position in \(z_e\) with its nearest codebook entry:

\[ z_q(x) = \text{Quantize}(E(x)) = e_k, \quad k = \arg\min_j \|z_e - e_j\|_2 \]

The decoder then operates on \(z_q\) rather than \(z_e\).

Why discrete latents? Continuous VAEs suffer from posterior collapse when paired with powerful decoders: the model learns to ignore the latent entirely, with \(q_\phi(z|x)\) collapsing to the prior \(p(z)\). This happens because the KL term in the ELBO rewards making \(q \approx p\), and a strong decoder can achieve low reconstruction loss without using the latent at all. VQ-VAE sidesteps this by making the posterior deterministic — a one-hot assignment to the nearest codebook entry — so the KL term becomes constant and can be dropped from optimization. The latent must carry information because the decoder has no other source.

The non-differentiability of the argmin lookup is handled with the straight-through estimator: during the forward pass, the quantized value \(z_q\) is used; during the backward pass, gradients flow directly from \(z_q\) to \(z_e\) as if no quantization had occurred:

\[ \frac{\partial \mathcal{L}}{\partial z_e} \approx \frac{\partial \mathcal{L}}{\partial z_q} \]

The complete training objective has three components:

\[ \mathcal{L} = \underbrace{\|x - D(e_k)\|_2^2}_{\text{reconstruction}} + \underbrace{\|\text{sg}[z_e] - e_k\|_2^2}_{\text{codebook loss}} + \beta\underbrace{\|z_e - \text{sg}[e_k]\|_2^2}_{\text{commitment loss}} \]

The codebook loss (with \(z_e\) stop-gradients) moves codebook entries toward encoder outputs; the commitment loss (with \(e_k\) stop-gradients) moves encoder outputs toward codebook entries. Jointly, they ensure that the codebook and encoder remain synchronized throughout training. EMA codebook updates — a running average of encoder outputs assigned to each code — are more stable than gradient descent and are used in nearly all practical implementations.
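A compact PyTorch sketch of the quantizer, the straight-through estimator, and the two auxiliary losses; gradient-based codebook updates are used here for brevity (an EMA variant is sketched in the next subsection), and all sizes are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-codebook-entry quantization with straight-through gradients."""

    def __init__(self, num_codes=2048, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z_e):
        # z_e: (batch, steps, dim) continuous encoder outputs
        d = (z_e.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)   # (batch, steps, K)
        codes = d.argmin(dim=-1)                                        # (batch, steps)
        z_q = self.codebook(codes)
        codebook_loss = F.mse_loss(z_q, z_e.detach())   # pull codes toward encoder outputs
        commit_loss = F.mse_loss(z_e, z_q.detach())     # pull encoder outputs toward codes
        z_q = z_e + (z_q - z_e).detach()                # straight-through estimator
        return z_q, codes, codebook_loss + self.beta * commit_loss
```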

VQ-VAE-2 scales this to a hierarchy: a top encoder captures global structure (object layout, harmonic sketch), a bottom encoder adds local detail, and the two levels are quantized with separate codebooks. Once the autoencoder is trained, a PixelCNN-style prior is fit to the top codes and a conditional prior to the bottom codes given the top; sampling from these priors and decoding enables full generative sampling.

Residual Vector Quantization: SoundStream and EnCodec

VQ-VAE produces a single discrete code per temporal position, limiting its representational power. Residual Vector Quantization (RVQ), used in SoundStream (Zeghidour et al., 2021) and EnCodec (Défossez et al., 2022), cascades multiple codebooks to refine the quantization residual:

  1. Quantize the encoder output \(z\) with codebook \(\mathcal{C}_1\), yielding code \(c_1\) and residual \(r_1 = z - e_{c_1}\).
  2. Quantize \(r_1\) with codebook \(\mathcal{C}_2\), yielding code \(c_2\) and residual \(r_2 = r_1 - e_{c_2}\).
  3. Continue for \(K\) levels.

This produces a \(K\)-vector of codes \((c_1, \ldots, c_K)\) per temporal position where \(c_k\) depends on all prior levels. Crucially, the residual dependency is real: \(c_2\) encodes what \(c_1\) failed to capture, so \(c_2\) is meaningless without knowing \(c_1\). This dependency structure poses a significant challenge for generative modeling, as we will see in Chapter 7.

EnCodec — in the 32 kHz configuration used by MusicGen — runs at a frame rate of 50 Hz (one code group per 20 ms) with \(K = 4\) codebooks of size 2048. The bitrate works out to \(32000/640 \times 4 \times \log_2(2048) = 2.2\) kbps — far below high-quality MP3 (~320 kbps), yet sufficient to reconstruct speech and music at high perceptual quality.
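A minimal sketch of the RVQ encoding loop and of the bitrate arithmetic above; codebooks are plain tensors here for clarity rather than trained modules:

```python
import torch

def rvq_encode(z, codebooks):
    """Residual vector quantization: each codebook quantizes what the previous ones missed.

    z: (batch, steps, dim); codebooks: list of (K, dim) tensors. Returns codes and z_q.
    """
    residual, z_q, codes = z, torch.zeros_like(z), []
    for cb in codebooks:
        d = (residual.unsqueeze(-2) - cb).pow(2).sum(-1)   # (batch, steps, K) distances
        idx = d.argmin(dim=-1)
        quantized = cb[idx]                                 # (batch, steps, dim)
        codes.append(idx)
        z_q = z_q + quantized                               # running sum of all levels
        residual = residual - quantized                     # pass the leftover to the next level
    return torch.stack(codes, dim=-1), z_q                  # codes: (batch, steps, n_codebooks)

# Bitrate check for the configuration above:
# 50 frames/s * 4 codebooks * log2(2048) bits = 2,200 bits/s = 2.2 kbps.
```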

Codebook Collapse and EMA Stability

Codebook collapse — a small fraction of codes capturing most of the usage while the rest sit idle — remains a persistent training challenge. The most effective countermeasure is random restarts: any code that falls below a minimum usage threshold is re-initialized by randomly selecting an encoder output from the current batch, forcing the dead code back into active service. This “rescue” mechanism is combined with EMA updates for the active codes:

\[ N_i^{(t)} = \gamma N_i^{(t-1)} + (1-\gamma) n_i^{(t)}, \quad m_i^{(t)} = \gamma m_i^{(t-1)} + (1-\gamma) \sum_{j} z_{i,j}^{(t)}, \quad e_i^{(t)} = m_i^{(t)} / N_i^{(t)} \]

where \(n_i^{(t)}\) is the count of encoder outputs assigned to code \(i\) in the current batch, and \(z_{i,j}^{(t)}\) are those outputs. The exponential moving average \(\gamma\) (typically 0.99) provides smooth online updates without requiring the codebook to be included in gradient calculations.
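A sketch of the EMA update above written as in-place tensor operations; the buffer names and the flattened `(N, D)` batch layout are illustrative:

```python
import torch

@torch.no_grad()
def ema_codebook_update(codebook, cluster_size, embed_sum, z_e, codes, gamma=0.99, eps=1e-5):
    """EMA update of codebook entries from the encoder outputs assigned to them.

    codebook: (K, D); cluster_size: (K,) running N_i; embed_sum: (K, D) running m_i;
    z_e: (N, D) encoder outputs in the current batch; codes: (N,) assigned code indices.
    """
    K = codebook.shape[0]
    one_hot = torch.nn.functional.one_hot(codes, K).type_as(z_e)      # (N, K)
    cluster_size.mul_(gamma).add_(one_hot.sum(0), alpha=1 - gamma)    # N_i update
    embed_sum.mul_(gamma).add_(one_hot.t() @ z_e, alpha=1 - gamma)    # m_i update
    codebook.copy_(embed_sum / (cluster_size.unsqueeze(1) + eps))     # e_i = m_i / N_i
```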

Neural Vocoders: HiFi-GAN, BigVGAN, and Vocos

Neural vocoders synthesize high-fidelity audio waveforms from intermediate representations such as mel-spectrograms or codec latents. They sit at the final decoding stage of nearly every audio generation pipeline.

HiFi-GAN (Kong et al., NeurIPS 2020) trains a convolutional generator alongside two families of discriminators: multi-scale discriminators that operate on the raw waveform at different temporal resolutions, and multi-period discriminators that reshape the waveform into 2D periodic structures and apply 2D convolutions. The combined adversarial training forces the generator to produce audio that is realistic at every temporal scale and every oscillation frequency — avoiding the “blurry” mean waveform that regression-only training would produce. HiFi-GAN synthesizes 22.05 kHz speech far faster than real time on a single GPU (and faster than real time even on CPU), with quality essentially indistinguishable from ground truth.

BigVGAN (Lee et al., 2023) scales HiFi-GAN substantially (up to 112M parameters), replacing standard activation functions with snake activation and anti-aliased processing to prevent aliasing artifacts in high-frequency generation. It demonstrates that scaling vocoder capacity meaningfully improves quality for out-of-distribution inputs — a critical property for text-to-music systems where the mel-spectrogram conditioner may be far from any training distribution.

Vocos (Siuzdak, 2023) takes a different approach: instead of synthesizing the raw waveform sample by sample through upsampling layers, it predicts STFT coefficients and reconstructs the waveform with an inverse STFT, so every network operation stays at the low frame rate while still producing high-quality audio. A ConvNeXt backbone built from large-kernel depthwise convolutions replaces HiFi-GAN’s stack of transposed-convolution upsamplers and dilated convolutions, achieving comparable quality with fewer parameters and faster inference.


Chapter 7: MusicGen and Codebook Interleaving Patterns

The Road to MusicGen: AudioLM and MusicLM

Before MusicGen unified audio generation into a single Transformer stage, two systems from Google established the dominant two-stage semantic paradigm for audio codec language modeling.

AudioLM (Borsos et al., 2023) observed that high-quality long-form speech and piano could be generated by decoupling what is being played from how it sounds acoustically. It uses two token streams at different levels of abstraction. Semantic tokens derived from a self-supervised model (w2v-BERT) capture high-level phonetic or melodic structure — they change slowly, track large-scale musical events, and form a compact sequence well-suited to language modeling. Acoustic tokens from SoundStream reproduce fine-grained waveform details — timbre, room acoustics, articulatory texture. The generative model factorizes the joint distribution as

\[ p(\text{acoustic tokens}) = p(\text{semantic tokens}) \cdot p(\text{acoustic tokens} \mid \text{semantic tokens}) \]

so the system first samples a semantic skeleton (capturing what), then conditions acoustic generation on it (producing how). This factorization is tractable because semantic tokens are far fewer per second and statistically more structured than raw acoustic tokens.

MusicLM (Agostinelli et al., 2023) extended the AudioLM framework to text-conditional music generation. Text conditioning enters through MuLan, a joint audio-text embedding trained contrastively on 370,000 hours of music paired with free-text descriptions. At inference time, the MuLan text embedding conditions the semantic-token generation stage; the acoustic stages run conditioned on the semantics and text jointly. A melody-conditioning variant accepts a hummed or whistled audio clip, encodes it with a separate melody embedding model, and uses the resulting embedding as a melodic prior — enabling the first practical system for melody-guided text-to-music generation. MusicLM was the first model to produce multi-minute, stylistically coherent music from free-form text prompts at reasonable quality, and its release prompted significant commercial and research interest.

The price of this quality was a three-stage cascade — semantic prior, coarse acoustic model, fine acoustic model — each with its own architecture and training objective. Errors compound across stages: a poor semantic prediction propagates into a globally incoherent acoustic realization that the downstream acoustic model cannot correct. MusicGen’s central contribution was to show that, with the right token serialization strategy, a single Transformer could jointly model all four RVQ codebooks in one pass, eliminating the cascade while matching or exceeding MusicLM’s quality.

The Challenge of Multi-Stream Token Modeling

MusicGen (Copet et al., NeurIPS 2023) achieves text-conditioned and melody-conditioned music generation with a single-stage autoregressive Transformer — no cascading stages, no separate semantic and acoustic models. Its core contribution is a systematic analysis of how to model RVQ’s multi-stream token structure.

The problem is this: EnCodec produces a \(T' \times K\) matrix of tokens for audio of duration \(T\) seconds (at 50 Hz and \(K = 4\) codebooks, 30 seconds of audio yields \(1500 \times 4 = 6000\) tokens). Transformers natively model 1D sequences. How should the 2D token matrix be serialized into a 1D sequence for next-token prediction?

Three strategies are possible, each corresponding to a different factorization of the joint distribution \(p(c_{1,1}, c_{1,2}, \ldots, c_{T,K})\).

Flattening serializes all tokens strictly row-major: \(c_{1,1}, c_{1,2}, c_{1,3}, c_{1,4}, c_{2,1}, \ldots\). This is the exact AR factorization — each token has access to all prior tokens, including the cross-stream dependencies that RVQ encodes. But the sequence length blows up to \(T \times K\): for 30 seconds at \(K = 4\), attention memory scales as \(\mathcal{O}((T \cdot K)^2) = \mathcal{O}(6000^2) \approx 36 \times 10^6\) cells, which is prohibitively expensive.

Parallel prediction predicts all \(K\) codebooks at each time step simultaneously: \((c_{t,1}, \ldots, c_{t,K})\) are all produced independently conditioned only on \(c_{\lt t,\cdot}\). This collapses the sequence length back to \(T\) but violates the fundamental RVQ dependency: \(c_{t,2}\) encodes the residual after removing the contribution of \(c_{t,1}\), so predicting \(c_{t,2}\) without knowing \(c_{t,1}\) is ill-posed. In practice, parallel prediction produces poor audio quality because high-frequency detail (encoded in later codebooks) cannot be made consistent with the low-frequency skeleton (encoded in the first codebook).

Delay pattern (the MusicGen solution) introduces a time offset: codebook \(k\) is shifted forward by \(k-1\) steps relative to codebook 1. The model predicts, at step \(s\):

\[ (c_{s,1},\; c_{s-1,2},\; c_{s-2,3},\; c_{s-3,4}) \]

This means that when predicting \(c_{s,2}\), the model has already seen \(c_{s,1}\) (shifted one step earlier) — restoring the within-step dependency structure at the cost of only \(K-1\) additional time steps. A corrected count for 30 seconds of audio: \(T + (K-1) = 1500 + 3 = 1503\) steps, not 1500 as the paper incorrectly states.
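A sketch of the delay-pattern serialization; the padding token used to fill the staggered boundary positions is an assumption of this illustration:

```python
import torch

def apply_delay_pattern(codes, pad_token):
    """Shift codebook k forward by k steps (0-indexed) so that, at each Transformer step,
    codebook k's token from an earlier frame is aligned with codebook 0's current frame.

    codes: (batch, T, K) RVQ tokens. Returns (batch, T + K - 1, K) delayed tokens.
    """
    B, T, K = codes.shape
    out = torch.full((B, T + K - 1, K), pad_token, dtype=codes.dtype, device=codes.device)
    for k in range(K):
        out[:, k : k + T, k] = codes[:, :, k]   # codebook k delayed by k steps
    return out

# For T = 1500 frames and K = 4 codebooks, the delayed sequence has 1503 steps.
```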

The delay pattern was not invented for music: it originated in pGSLM (Kharitonov et al., 2021), a generative spoken language model that used the same time-offset technique to handle its parallel unit, duration, and pitch token streams. MusicGen adapted pGSLM’s strategy to the RVQ setting, where the inter-stream dependency is the codebook residual relationship rather than co-articulation between speech streams.

The delay pattern is elegantly simple but arithmetically non-obvious; the sequence-count error that Donahue caught in the paper is a good example of how much the details of discretization matter.

Single-Stage Transformer Architecture

MusicGen’s Transformer is a decoder-only model (GPT-style) with causal masking. The input at each step is the sum of the \(K\) delayed codebook embeddings:

\[ e_s = \sum_{k=1}^{K} \text{Embed}_k(c_{s-k+1, k}) + \text{pos}(s) \]

using sum rather than concatenation because the delay pattern aligns different codebooks to the same logical time step — summing effectively creates a single embedding that fuses information from all codebooks at that step. The output is \(K\) separate linear heads, one per codebook, each predicting a distribution over 2,048 codes.

Sampling uses top-\(k = 250\) filtering (retaining only the 250 highest-probability tokens at each step) without nucleus sampling, which Copet et al. find unnecessary for music. CFG is applied during inference: the same model handles both conditional and unconditional generation by randomly dropping conditioning during training at a 20% rate, and at inference by combining logits as

\[ \text{Logits}_{\text{final}} = \text{Logits}_{\text{uncond}} + w \cdot (\text{Logits}_{\text{cond}} - \text{Logits}_{\text{uncond}}), \quad w = 3.0 \]
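A sketch of one decoding step with logit-space CFG and top-k filtering; the `model(tokens, cond)` signature and the `null_cond` placeholder are assumptions of this illustration:

```python
import torch

@torch.no_grad()
def sample_step_with_cfg(model, tokens, cond, null_cond, w=3.0, top_k=250, temperature=1.0):
    """Combine conditional/unconditional logits for the next position, then top-k sample."""
    logits_cond = model(tokens, cond)[:, -1]          # (batch, vocab) at the last position
    logits_uncond = model(tokens, null_cond)[:, -1]
    logits = logits_uncond + w * (logits_cond - logits_uncond)
    topk_vals, topk_idx = logits.topk(top_k, dim=-1)
    probs = torch.softmax(topk_vals / temperature, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)            # index into the top-k set
    return topk_idx.gather(-1, choice).squeeze(-1)              # actual token ids, (batch,)
```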

Text and Melody Conditioning

Text conditioning is provided via cross-attention to T5-XXL embeddings (4.6B parameters). The choice of T5 over CLAP reflects a deliberate tradeoff: T5 provides richer instruction-following and better adherence to unusual textual descriptions, whereas CLAP’s shared audio-text embedding space is more suitable for multi-prompt interpolation and direct music retrieval. MusicGen uses T5 as the default and shows CLAP in an ablation — a pragmatic acknowledgment that both have their place.

Melody conditioning is MusicGen’s most creative contribution. Naïvely conditioning on a reference audio clip would lead the model to copy-paste the reference rather than generate music in a different style. The fix is an information bottleneck: the reference audio is first separated into its melodic component using Demucs, then a chromagram is computed using a large time window (blurring temporal detail and audio-specific texture), retaining only the coarse pitch class sequence. The resulting chromagram — essentially “what keys are being played and roughly when” — is rich enough to constrain the melody but impoverished enough to force the model to “hallucinate” a fresh arrangement in a new timbre and style. Style transfer via structured information bottleneck is a recurring theme in creative AI.

Non-Autoregressive Codec Generation: SoundStorm and MAGNeT

MusicGen’s delay pattern is an autoregressive solution: it generates tokens one step at a time in a fixed left-to-right order. Two contemporaneous systems demonstrated that non-autoregressive (NAR) generation can produce comparable quality substantially faster.

SoundStorm (Borsos et al., Google, 2023) applies the MaskGIT masked prediction paradigm to audio codec tokens. During training, a random subset of acoustic tokens is masked, and the model is trained to predict all masked positions simultaneously conditioned on the unmasked tokens and on pre-computed semantic tokens from AudioLM. During inference, generation proceeds by confidence-ranked iterative refinement: the model predicts all masked tokens in parallel, then re-masks the lowest-confidence predictions and re-runs — repeating for a small number of rounds (typically 16 steps) rather than generating thousands of tokens sequentially. SoundStorm achieves a roughly 100× speedup over autoregressive decoding while maintaining perceptual quality on speech and piano comparable to the cascaded AudioLM baseline.

MAGNeT (Ziv et al., Meta, 2024) applies a similar masked prediction strategy directly to MusicGen’s delay-patterned token streams, eliminating the two-stage semantic-then-acoustic pipeline entirely. The model uses a joint prediction objective over all \(K\) codebooks simultaneously with a token-level classifier-free guidance variant that conditions on text and the current unmasked token context. A key design finding is that different codebook levels benefit from different mask rates: the first codebook (capturing broad musical structure) should be masked sparsely so the model can plan globally, while higher codebooks (fine acoustic detail) tolerate aggressive masking. MAGNeT generates ten seconds of music in roughly 1.5 seconds of compute — approximately 7× faster than MusicGen-Large at comparable FAD — while supporting variable-length output natively.

The fundamental tradeoff between autoregressive and non-autoregressive generation is well-studied in text. AR models are maximally flexible (each token can attend to all prior tokens) but are inherently sequential; NAR models parallelize trivially but sacrifice some conditional precision. In the music codec setting, the RVQ residual dependency adds a complication that delay-pattern AR handles by temporal offsetting, and that NAR systems handle by jointly predicting all codebooks with shared context. Neither approach has decisively won: AR models hold a quality edge on long-form generation with complex structural repetition, while NAR models are the practical choice for low-latency interactive applications.


Part IV: Symbolic Music Generation

Chapter 8: From Language Models to Music Transformers

Statistical Language Modeling and the Curse of Dimensionality

Statistical language modeling begins with a deceptively simple definition. A language model assigns a probability to any sequence of tokens \(w = [w_1, w_2, \ldots, w_T] \in V^T\) drawn from a vocabulary \(V\). By the chain rule of probability, this joint distribution factorizes as:

\[ P(w_1^T) = \prod_{t=1}^T P(w_t \mid w_1^{t-1}) \]

so learning a language model reduces to learning the conditional distribution \(P(w_t \mid w_{\lt t})\) — the probability of the next token given all prior tokens. For symbolic music, the tokens are note events (pitch, velocity, timing); the language model predicts the next event given the musical context built up so far.

The earliest approach, n-gram models, assumes a Markov property: the current token depends only on the \(N-1\) preceding tokens:

\[ P(w_t \mid w_{\lt t}) \approx P(w_t \mid w_{t-N+1}, \ldots, w_{t-1}) \]

This is estimated by counting co-occurrences in a large corpus. The fatal problem is the curse of dimensionality: the space of all possible \(N\)-grams over a vocabulary of size \(|V|\) has \(|V|^N\) entries. For a vocabulary of 16,000 words, an 8-gram table requires \(16000^8 \approx 4 \times 10^{33}\) parameters — orders of magnitude beyond what any corpus can populate. With small \(N\) the model lacks context; with large \(N\) the counts are overwhelmingly sparse.

Bengio’s Neural Probabilistic Language Model

Bengio et al. (JMLR 2003) solved the dimensionality problem with distributed representations. The core idea: instead of treating each word as an isolated symbol, map every word \(w_i\) to a continuous real vector \(C(w_i) \in \mathbb{R}^m\) where \(m \ll |V|\). The joint embedding space captures semantic similarity — “dog” and “cat” receive similar vectors because they appear in similar linguistic contexts. A neural network \(g_\omega\) then predicts the next word from the concatenated embeddings of the context:

\[ P_\theta(w_i \mid w_{i-N+1}, \ldots, w_{i-1}) = \text{Softmax}(g_\omega(C(w_{i-1}), \ldots, C(w_{i-N+1}))) \]

The embedding matrix \(C\) and the network parameters \(\omega\) are trained jointly by minimizing the negative log-likelihood:

\[ \mathcal{L}_{\text{NLL}} = \frac{1}{|W|} \sum_{w_i \in W} -\log P_\theta(w_i \mid w_{\lt i}) \]

Because similar words share similar embeddings, training on one phrase (e.g., “The cat is walking”) automatically improves the model’s probability estimate for semantically related phrases (e.g., “The dog is walking”) — the model generalizes across the exponential space of unseen n-grams. Bengio reported a perplexity of 252 on the Brown corpus versus 312 for the best n-gram baseline — a 19% relative improvement, modest by modern standards but revolutionary in its principle.

Perplexity (PPL) is a human-readable summary of model performance:

\[ \text{PPL}(W_{\text{test}}) = e^{\text{NLL}(W_{\text{test}})} \]

A perplexity of 252 means the model, at each step, is equivalent to uniformly guessing among 252 equally probable next tokens. A fresh, randomly initialized model should have PPL equal to the vocabulary size — a useful sanity check at the start of training.
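As a quick concreteness check (not from the course materials), here is a minimal PyTorch sketch verifying that a uniform next-token predictor has perplexity equal to the vocabulary size; the vocabulary size and dummy targets are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

vocab_size = 16000                                 # hypothetical |V|
targets = torch.randint(0, vocab_size, (1000,))    # dummy "next tokens"

# A model that assigns 1/|V| to every token (here: all-zero logits).
uniform_logits = torch.zeros(1000, vocab_size)

nll = F.cross_entropy(uniform_logits, targets)     # mean negative log-likelihood
ppl = torch.exp(nll)
print(ppl.item())   # ~16000.0, i.e. PPL equals the vocabulary size, as expected
```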

Bengio’s paper was the seed from which word2vec, GloVe, ELMo, and ultimately the token embeddings of modern Transformers grew.

Performance RNN

Performance RNN (Simon & Oore, Magenta 2017) was the first application of recurrent language modeling to expressive piano music. Its architectural contribution is modest — an LSTM applied to a symbol sequence — but its representation is creative.

The key insight is that musical performance (the actual playing of a piece) is qualitatively different from the score (the notated pitches and rhythms). A human pianist brings phrasing, rubato (intentional temporal deviation), dynamic variation, and pedaling that are completely absent from the score. Earlier symbolic generation systems worked from scores, producing mechanical, robotically even output. Performance RNN models the performance directly, in a 388-token event vocabulary:

  • 128 NOTE_ON tokens (one per MIDI pitch, initiating a note press)
  • 128 NOTE_OFF tokens (one per MIDI pitch, releasing a note)
  • 100 TIME_SHIFT tokens (10 ms increments from 10 ms to 1 second, encoding micro-timing deviations)
  • 32 VELOCITY tokens (quantizing MIDI velocity 0–127 into 32 bins)

This representation is dense: a one-second passage with many overlapping notes might produce 50–100 tokens. But it captures the micro-temporal and dynamic nuance that distinguishes human performance from score rendering.
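To make the event vocabulary concrete, here is a toy encoder for the 388-token scheme above. The specific token-ID layout (NOTE_ON first, then NOTE_OFF, TIME_SHIFT, VELOCITY) is my own illustrative choice, not Magenta's released implementation.

```python
# Offsets into a 388-token vocabulary: 128 + 128 + 100 + 32 = 388.
NOTE_ON, NOTE_OFF, TIME_SHIFT, VELOCITY = 0, 128, 256, 356

def note_on(pitch):   return NOTE_ON + pitch                    # IDs 0..127
def note_off(pitch):  return NOTE_OFF + pitch                   # IDs 128..255
def time_shift(ms):   return TIME_SHIFT + min(ms, 1000) // 10 - 1   # 10 ms bins, IDs 256..355
def velocity(v):      return VELOCITY + v * 32 // 128           # 32 bins, IDs 356..387

# A single quarter note: set velocity, press C4, wait 500 ms, release C4.
tokens = [velocity(80), note_on(60), time_shift(500), note_off(60)]
print(tokens)   # [376, 60, 305, 188]
```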

Training data comes from the Yamaha e-Piano Competition — 1,186 recordings of competitive piano performances. This dataset is ideal for several reasons: all recordings are from a single instrument (piano), removing the confound of timbre variation; all are competitive repertoire, imposing coherent statistical structure; and all timing and velocity details are captured from the physical keys, not estimated post-hoc.

Performance RNN’s limitations are instructive. The LSTM architecture provides approximately 30 seconds of effective context before long-range dependencies fade. Beyond that, the model begins to “noodle” — generating locally plausible events with no connection to earlier material. It has no mechanism for generating a recurring theme, returning to a key area, or building toward a climax. These failures motivate the next generation of symbolic models.

Music Transformer and Relative Self-Attention

Music Transformer (Huang et al., ICLR 2019) replaces the LSTM with a Transformer, achieving the first reliable long-range structural coherence in piano generation — repeating motifs across hundreds of tokens, developing and varying themes in ways Performance RNN never managed. The architectural contribution is efficient relative self-attention.

The vanilla Transformer uses absolute positional encodings: each position \(i\) gets a sinusoidal embedding \(p_i = [\sin(f_1 i), \cos(f_1 i), \ldots, \sin(f_J i), \cos(f_J i)]\), where frequencies \(f_j\) span several orders of magnitude (analogous to a clock with hands running at different speeds). This encodes “where in the sequence,” but music often cares more about “how far apart” — the relative distance between a motif and its variation is more meaningful than their absolute positions.

Shaw et al. (NAACL 2018) introduced relative attention by adding, to each attention score, a learned embedding of the relative offset \(r = j - i\) between query position \(i\) and key position \(j\):

\[ \text{RelativeAttention} = \text{softmax}\!\left(\frac{QK^\top + S_{\text{rel}}}{\sqrt{D_h}}\right)V \]

where \(S_{\text{rel}}[i,j] = q_i^\top e_{j-i}\), with one learnable embedding \(e_r \in \mathbb{R}^{D_h}\) per relative offset \(r\) (stacked into a matrix \(E_r \in \mathbb{R}^{L \times D_h}\)). The problem: storing all relative embeddings in an intermediate \(L \times L \times D_h\) tensor requires \(\mathcal{O}(L^2 D)\) memory — unacceptable for long sequences.

Music Transformer’s key contribution is a skewing trick that reduces this to \(\mathcal{O}(LD)\). Instead of building the full intermediate tensor, the matrix \(QE_r^\top \in \mathbb{R}^{L \times L}\) is computed directly — each element \((i, r)\) is the inner product of query \(i\) with relative embedding \(r\). But the positions in this matrix are indexed by \((i, r)\), not \((i, j)\). The skewing trick applies a change of coordinates \(j = r - (L-1) + i\) to efficiently reindex \(QE_r^\top\) into the correct \(S_{\text{rel}}^{ij}\) matrix without constructing the intermediate 3D tensor. Memory drops from \(\mathcal{O}(L^2 D)\) to \(\mathcal{O}(LD)\) — a factor of \(L\) improvement that makes relative attention tractable for sequences of thousands of tokens.
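A minimal PyTorch sketch of the skewing procedure described above (shapes, names, and the smoke test are mine, not the authors' code):

```python
import torch

def skew(qe):
    """Reindex qe[..., i, r] = q_i . e_r (r = relative offset) into s_rel[..., i, j].

    qe: (batch, heads, L, L) scores of each query against the L relative embeddings.
    Returns a tensor of the same shape whose entry [i, j] holds q_i . e_{j-i} for
    j <= i; entries above the diagonal are garbage and are removed by the causal
    mask. Peak memory stays O(L^2) per head rather than O(L^2 * D_h).
    """
    *batch, L, _ = qe.shape
    padded = torch.nn.functional.pad(qe, (1, 0))    # prepend a zero column
    reshaped = padded.reshape(*batch, L + 1, L)     # the "skew" via reshape
    return reshaped[..., 1:, :]                     # drop the first row

# Tiny smoke test: one head, L = 4, D_h = 8.
q  = torch.randn(1, 1, 4, 8)
er = torch.randn(4, 8)          # one embedding per relative offset -(L-1)..0
s_rel = skew(q @ er.T)
print(s_rel.shape)              # torch.Size([1, 1, 4, 4])
```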

Music Transformer was evaluated on two datasets: JSB Chorales (four-part Bach counterpoint, quantized to a 16th-note grid) and the Yamaha e-Piano Competition recordings (event-based, following Performance RNN). The Bach chorale task shows much lower NLL than piano performance — unsurprisingly, given that strict counterpoint rules (voice leading, cadence formulas, harmonic grammar) provide powerful predictive structure, whereas expressive piano performance is substantially harder to predict.


Chapter 9: Token Vocabularies and the Anticipatory Transformer

Symbolic Representation Schemes

The choice of token vocabulary for symbolic music is a design decision with significant downstream consequences. Several schemes have been proposed, each encoding different aspects of musical structure.

MIDI-like / event-based representation (as in Performance RNN) encodes note-on and note-off events with velocity and time-shift tokens. It is maximally expressive for performance data but produces variable-length sequences that grow rapidly with polyphony. The time-shift vocabulary creates implicit temporal quantization: sequences of time-shift tokens represent durations, and fine-grained timing requires many tokens for long notes.

REMI (Huang & Yang, ACM MM 2020) — Revamped MIDI-derived events — reintroduces explicit bar and beat-position tokens, making meter structure explicit in the token stream. A REMI sequence begins each bar with a Bar token and encodes note onsets with a Position token specifying their location within the bar. This enables the Pop Music Transformer to learn bar-level and beat-level regularities (like the four-measure phrase structure ubiquitous in popular music) that were invisible in time-shift-based representations.
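A toy sketch of what a REMI-style token stream looks like for a single 4/4 bar; token names and the 16-slot quantization grid are illustrative assumptions, not the released Pop Music Transformer tokenizer.

```python
def remi_encode(bar_notes):
    """bar_notes: list of (position_1_to_16, midi_pitch, duration_in_16ths)."""
    tokens = ["Bar"]
    for pos, pitch, dur in sorted(bar_notes):
        tokens += [f"Position_{pos}/16", f"Note_On_{pitch}", f"Note_Duration_{dur}"]
    return tokens

print(remi_encode([(1, 60, 8), (9, 64, 8)]))
# ['Bar', 'Position_1/16', 'Note_On_60', 'Note_Duration_8',
#  'Position_9/16', 'Note_On_64', 'Note_Duration_8']
```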

Compound word representation (Hsiao et al., 2021) packs multiple note attributes — pitch, duration, velocity, beat position — into a single “compound” token by jointly embedding them. This dramatically reduces sequence length (compared to event-based streams) while preserving all the information, though at the cost of a larger embedding vocabulary.

The non-Western limitation of all these schemes deserves emphasis. MIDI’s 128-pitch vocabulary assumes Western 12-tone equal temperament — it cannot represent the microtonal ornaments of Carnatic music (gamaka), the quarter tones of Middle Eastern maqam, or the glissando techniques of the Chinese guqin. Symbolic generation systems trained on MIDI are therefore structurally biased toward Western music, and this bias is difficult to audit or correct without first redesigning the representation.

Anticipatory Music Transformer

Anticipatory Music Transformer (Thickstun et al., ICML 2024) addresses the problem of controllable symbolic generation — how to generate music that satisfies user-specified constraints (e.g., “this measure should end with an A-minor chord”) without the awkward engineering of a separate infilling model.

The key realization is framed as a change of representation. Standard symbolic sequences interleave control tokens at the exact moment they become relevant. The anticipatory formulation shifts control tokens earlier, by a fixed anticipation interval \(\delta\) (the paper uses \(\delta = 5\) seconds). A control token \(u_k\) with scheduled time \(s_k\) is inserted just after the first event in the sequence with onset time \(\geq s_k - \delta\). In other words, the model “sees” upcoming constraints five seconds before they must be realized — like a conductor giving a downbeat gesture five seconds before the downbeat itself.

To support this, the paper uses an absolute-time tokenization: each musical event \(e_i\) is represented as the triple \((t_i, d_i, n_i)\) — absolute onset time, duration, and pitch — serialized as three consecutive tokens:

\[ x_{3i-2} = t_i, \quad x_{3i-1} = d_i, \quad x_{3i} = n_i \]

The interleaved sequence \(a_{1:N+K} = \text{interleave}_\delta(e_{1:N}, u_{1:K})\) places control token \(u_k\) at position \(\tau_{u_k} = k + \arg\min_{0 \leq j \leq N}\{t_j \geq s_k - \delta\}\) in the merged sequence. This formulation requires no modifications to the training objective — it is simply next-token prediction on the interleaved sequence — yet the resulting model naturally conditions generation on future events.
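A small sketch of the interleaving rule, assuming events and controls are already sorted by time; variable names and the toy data are mine.

```python
def interleave(events, controls, delta=5.0):
    """events: (onset, duration, pitch) tuples; controls: (scheduled_time, label) tuples."""
    merged, k = [], 0
    for event in events:
        merged.append(("event", event))
        onset = event[0]
        # Emit each control just after the first event whose onset >= s_k - delta,
        # so the model "sees" it ~delta seconds before it must be realized.
        while k < len(controls) and controls[k][0] - delta <= onset:
            merged.append(("control", controls[k]))
            k += 1
    merged += [("control", c) for c in controls[k:]]   # leftovers go at the end
    return merged

events   = [(0.0, 0.5, 60), (2.0, 0.5, 62), (4.5, 1.0, 64), (8.0, 2.0, 65)]
controls = [(9.0, "A-minor cadence")]
print(interleave(events, controls))
# the control scheduled for t = 9.0 appears right after the event at t = 4.5
```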

Thickstun et al. frame this problem formally as a temporal point process (TPP), providing a principled probabilistic language for evaluating when and whether generation stops — a subtle but important contribution for open-ended generation tasks.

Cross-Modal Translation: Score, Symbolic, and Audio Domains

Music exists simultaneously in at least three qualitatively different representational registers: the visual score (a 2D image of notation), the symbolic domain (a discrete, time-ordered event stream such as MIDI), and the audio domain (a continuous waveform). Each representation exposes different aspects of the music, and a complete computational pipeline must be able to translate faithfully between all three.

The translation graph has six directed edges. Two are relatively mature: score-to-symbolic via optical music recognition (OMR), and symbolic-to-audio via synthesis. Two are active research frontiers: audio-to-symbolic via automatic music transcription (AMT), and audio-to-score (which composes AMT with automatic score engraving). The remaining two — symbolic-to-score (notation rendering) and score-to-audio — are largely solved by classical software (LilyPond, MuseScore, and conventional samplers).

Optical music recognition is the score-to-symbolic direction. A document image passes through staff-line detection, symbol segmentation, and pitch/rhythm classification before a grammar-based decoder assembles the full score. Traditional OMR pipelines (Audiveris, SheetVision) break into hand-crafted stages; deep OMR systems, trained on datasets such as DeepScores and MUSCIMA++, treat the problem as object detection followed by graph assembly. The main difficulty is the combinatorial explosion of symbol interactions: two noteheads at the same x-position may form a chord or belong to different voices, and this ambiguity cascades through downstream processing. End-to-end sequence models trained on paired (image, MusicXML) data have substantially narrowed this gap, but real-world printed scores with ornaments, cross-staff beams, or non-standard notation remain challenging.

Automatic music transcription is the audio-to-symbolic direction and is widely regarded as an open grand challenge. The fundamental difficulty is the superposition problem: a piano chord produces a single mixed waveform from which the pitches, onset times, durations, and velocities of each note must be disentangled. The state of the art is MT3 (Gardner et al., ICLR 2022), a T5-style encoder-decoder that reads a sequence of spectrogram frames and autoregressively decodes MIDI-like event tokens. MT3 handles multi-instrument music by conditioning on an instrument list, achieving strong results on the MAPS piano dataset and the MusicNet multi-instrument benchmark. The model’s token vocabulary follows a time-shift, note-on, note-off, program-change scheme similar to Performance RNN. Remaining failure modes include polyphony in the extreme upper register, sympathetic resonance between strings (which creates ghost notes), and percussion where pitch is ambiguous.

Symbolic-to-audio synthesis has undergone a revolution with MIDI-DDSP (Wu et al., ICLR 2022). Rather than using a wavetable sampler, MIDI-DDSP runs a DDSP synthesizer conditioned on per-note pitch, onset, duration, and instrument identity, and then applies a learned neural post-filter to add realistic timbre variation. Because the synthesis is differentiable with respect to the MIDI parameters, it enables gradient-based score editing — one can, for instance, optimize the note velocities in a MIDI file to minimize perceptual distance to a target audio recording. The audio quality is competitive with commercial sample libraries while remaining fully controllable.

Cross-modal generation is the synthesis of all three registers simultaneously or in any order. The Score Transformer (Wang et al., ISMIR 2023) takes a lead-sheet image (melody line + chord symbols) and generates a MIDI arrangement, bridging vision and symbolic domains. Systems like Music ControlNet condition diffusion generation on symbolic tracks (chroma, melody, or drumbeat) converted to time-frequency control maps, completing the symbolic-to-audio direction with explicit controllability. The long-term goal is a unified model that can respond to a query in any modality — “here is a score excerpt; generate audio in the style of Ravel” or “here is a recording; produce the most likely MIDI transcription and then continue it for 16 bars” — by routing through a shared latent representation that respects the geometry of all three domains.


Part V: Music Information Retrieval Foundations

Chapter 10: Time-Frequency Analysis and Differentiable Synthesis

Short-Time Fourier Transform and Mel-Spectrograms

Music generation systems live on a foundation of music information retrieval techniques that describe audio in terms humans find meaningful — pitch, rhythm, harmony, timbre — rather than the raw waveform that computers store. The short-time Fourier transform (STFT) is the universal starting point:

\[ X(n, k) = \sum_{m=0}^{N-1} x(m)\, w(m - n)\, e^{-j 2\pi k m / N} \]

where \(w\) is a window function (Hann, Hamming), \(n\) is the frame index, and \(k\) is the frequency bin. The STFT produces a complex-valued spectrogram: taking the magnitude \(|X(n,k)|\) discards phase and yields a time-frequency representation that is visually interpretable and perceptually motivated.

The frequency resolution of the STFT is uniform in Hertz, but human pitch perception is approximately logarithmic: doubling the frequency corresponds to a jump of one octave, regardless of whether that doubling takes you from 200 Hz to 400 Hz or from 2,000 Hz to 4,000 Hz. The mel scale warps the linear frequency axis to match this logarithmic perception, and a mel-filterbank — a set of overlapping triangular filters spaced uniformly on the mel scale — summarizes STFT bins into ~80–128 mel channels. The resulting mel-spectrogram is the near-universal input to audio neural networks.
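A minimal librosa sketch of the STFT-to-mel pipeline; the bundled example clip and the specific parameter values (n_fft, hop_length, n_mels) are common choices, not requirements.

```python
import librosa
import numpy as np

# Load a bundled example clip and compute an 80-channel log-mel-spectrogram.
y, sr = librosa.load(librosa.ex("trumpet"), sr=22050)

S = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80, window="hann"
)
log_S = librosa.power_to_db(S, ref=np.max)   # log compression for visualization
print(log_S.shape)   # (80, n_frames)
```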

Constant-Q Transform and Chroma

The Constant-Q Transform (CQT) is an alternative to the STFT in which the frequency bins are spaced geometrically (constant ratio between adjacent bins) rather than linearly. This gives the CQT the property of transposition invariance: a musical interval (e.g., a perfect fifth) always spans the same number of CQT bins regardless of the absolute pitch. The CQT is therefore natural for harmonic analysis and polyphonic transcription, where octave and interval relationships are central.

Chroma features (or pitch class profiles) project the CQT onto the 12 semitone classes of Western equal temperament, discarding octave information. The result is a 12-dimensional vector per frame that captures the harmonic content — the set of pitch classes active at each moment — without encoding the specific octave or the timbre of the instrument. Chroma is used in key estimation, chord recognition, cover-song detection, and, as we saw in Chapter 7, in MusicGen’s melody conditioning information bottleneck.
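The corresponding CQT and chroma computation, again with librosa and illustrative parameters:

```python
import librosa

y, sr = librosa.load(librosa.ex("trumpet"), sr=22050)

# CQT: geometrically spaced bins, 12 bins per octave over 7 octaves.
C = librosa.cqt(y, sr=sr, n_bins=84, bins_per_octave=12)

# Chroma: fold energy onto the 12 pitch classes, discarding octave information.
chroma = librosa.feature.chroma_cqt(y=y, sr=sr)
print(abs(C).shape, chroma.shape)   # (84, n_frames) (12, n_frames)
```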

DDSP: Differentiable Digital Signal Processing

DDSP (Engel et al., ICLR 2020) bridges the gap between the classical signal processing toolbox and modern deep learning by making traditional DSP operations differentiable and therefore trainable end-to-end.

A classical synthesizer produces sound by combining oscillators (sine waves at specified frequencies and amplitudes), filters (shaping the spectral envelope), noise generators, and reverb processors. Each stage is physically interpretable: the oscillator frequency encodes pitch, the filter envelope encodes timbre and vowel quality, the noise component encodes breathiness or bow noise. Classical synthesis is thus highly controllable but inflexible — it can only produce sounds within the parameters its designer chose.

DDSP keeps these physical synthesis modules but treats their parameters (fundamental frequency \(F_0\), harmonic amplitudes, filter coefficients) as the output of a neural network rather than as hand-specified values. A small encoder-decoder network listens to a few seconds of audio and predicts the synthesis parameters that, when fed through the differentiable synthesis chain, reproduce the input. Because every step is differentiable, the entire system trains with standard gradient descent:

\[ \hat{x} = \text{Synthesis}(F_0(z), A(z), H(z), n(z)) \approx x \]

DDSP achieves roughly a 70× compression factor over raw waveforms, and the learned parameters are interpretable: \(F_0\) tracks pitch, \(A\) tracks loudness. One can smoothly interpolate between two instruments in parameter space to produce a realistic morph between, say, a flute and a violin. DDSP has since been extended to music source separation, timbre transfer, and neural audio synthesis at scales far beyond the original single-instrument demonstration.
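To see why "differentiable synthesis" is trainable by gradient descent, here is a toy additive (harmonic) synthesizer in PyTorch. It is a drastic simplification of DDSP's harmonic branch, with shapes, names, and parameter values of my own choosing.

```python
import math
import torch

def harmonic_synth(f0, amps, sr=16000):
    """Differentiable additive synthesis.

    f0:   (T,) fundamental frequency in Hz, one value per output sample
    amps: (H, T) per-harmonic amplitudes
    Both are torch tensors, so gradients flow back to whatever network predicted them.
    """
    H, T = amps.shape
    harmonics = torch.arange(1, H + 1).unsqueeze(1)            # (H, 1) harmonic numbers
    phase = 2 * math.pi * torch.cumsum(f0 / sr, dim=0)         # instantaneous phase, (T,)
    audio = (amps * torch.sin(harmonics * phase)).sum(dim=0)   # weighted sum of sinusoids
    return audio

# Toy example: a 440 Hz tone with 8 decaying harmonics, 0.5 s long.
T = 8000
f0 = torch.full((T,), 440.0)
amps = torch.linspace(1.0, 0.1, 8).unsqueeze(1).repeat(1, T) / 8
x = harmonic_synth(f0, amps)
print(x.shape)   # torch.Size([8000])
```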


Chapter 11: Rhythm, Structure, and Automatic Transcription

Beat and Downbeat Tracking

Rhythm in music has a hierarchical structure: the fastest regular pulse (the tatum) nests inside beats, which nest inside measures, which nest inside phrases. Beat tracking — automatically identifying the beat times in an audio recording — is one of the oldest and most practically important MIR tasks.

Contemporary beat trackers combine neural networks (typically RNNs with CRF-style decoding or Transformers) trained to predict onset strength functions — the probability that a new musical event begins at each time frame — with dynamic Bayesian network (DBN) post-processing that enforces tempo consistency and periodicity. The DBN models tempo as a state variable that evolves slowly over time, penalizing large tempo jumps and favoring beat placements that align with detected onset peaks.
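For a hands-on baseline, librosa ships a classical onset-envelope plus dynamic-programming beat tracker, a simpler relative of the RNN+DBN systems described above; the example clip is one of librosa's bundled recordings.

```python
import librosa

y, sr = librosa.load(librosa.ex("brahms"), sr=22050)

# Onset strength envelope, then dynamic-programming beat tracking.
onset_env = librosa.onset.onset_strength(y=y, sr=sr)
tempo, beat_frames = librosa.beat.beat_track(onset_envelope=onset_env, sr=sr)
beat_times = librosa.frames_to_time(beat_frames, sr=sr)
print(tempo, beat_times[:4])   # estimated BPM and the first few beat times (s)
```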

Downbeat tracking identifies the first beat of each measure. It requires the model to infer metric structure — knowing, for example, that a 4/4 bar emphasizes beat 1 differently from beats 2, 3, and 4 — from the audio signal alone. Recent neural systems jointly track beats and downbeats via multi-task learning, exploiting the fact that downbeat positions are constrained to be a subset of beat positions.

Structural Segmentation and Musical Form

Structural segmentation divides an audio recording into large-scale sections (verse, chorus, bridge, intro, outro) and identifies the segment boundaries. The standard computational approach uses a self-similarity matrix: a square matrix where element \((i, j)\) measures the acoustic similarity between time frames \(i\) and \(j\). Repeated sections appear as off-diagonal blocks of high similarity — a verse returning three times appears as three parallel stripes at the same mutual offsets.

Boundary detection applies a checkerboard kernel to the self-similarity matrix: a change in content shows up as a sudden drop in local self-similarity, which the kernel amplifies into a peak. Segmentation algorithms typically produce not just boundary times but also structural labels (A, B, C…) indicating which segments are repetitions of which others — capturing the verse-chorus form of pop music or the AABA structure of jazz standards.
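A compact numpy/librosa sketch of the self-similarity and checkerboard-novelty pipeline; the choice of chroma features and the kernel size are mine, not a prescribed recipe.

```python
import numpy as np
import librosa

y, sr = librosa.load(librosa.ex("brahms"), sr=22050)

# Chroma features -> self-similarity matrix (cosine similarity between frames).
chroma = librosa.feature.chroma_cqt(y=y, sr=sr)
X = chroma / (np.linalg.norm(chroma, axis=0, keepdims=True) + 1e-8)
ssm = X.T @ X                       # ssm[i, j] = similarity of frames i and j

# Checkerboard-kernel novelty along the diagonal: peaks mark likely boundaries.
L = 16
kernel = np.kron(np.array([[1, -1], [-1, 1]]), np.ones((L, L)))
novelty = np.array([
    np.sum(kernel * ssm[i - L:i + L, i - L:i + L])
    for i in range(L, ssm.shape[0] - L)
])
print(novelty.argmax() + L)         # frame index of the strongest boundary
```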

Automatic Music Transcription

Automatic music transcription (AMT) converts audio into a symbolic representation — typically a piano roll indicating which pitches are active at each time step, with onset and offset times. Transcription is hard because: (1) many instruments emit harmonics that overlap with the fundamentals of other notes; (2) onset detection requires distinguishing new notes from held notes; and (3) simultaneous notes produce a combined waveform whose decomposition into individual pitches is ill-posed.

Contemporary AMT systems apply neural networks to CQT-based representations, with separate prediction heads for onsets, offsets, and frames. The best systems now achieve near-human accuracy on piano transcription but remain fragile for polyphonic ensembles with overlapping timbres. AMT connects to symbolic generation: high-quality transcription enables training symbolic models on large audio corpora, and generated symbolic output can be rendered to audio to evaluate structural consistency.


Part VI: Specialized Generation Domains

Chapter 12: Singing Voice Synthesis

Pitch-Conditioned Vocal Generation

Singing voice synthesis (SVS) occupies a niche between speech synthesis (TTS) and music audio generation: it must produce natural vocal sound (as in TTS) while following a musical score specifying precise pitches, durations, and lyrics (as in music). The key challenges beyond TTS are: continuous pitch control (singing requires sustaining exact pitches, not the rising-falling intonation of speech), expressive vocal ornaments (vibrato, portamento, melisma), and phoneme-to-note alignment (one phoneme may be stretched across many beats of one sustained note).

Early SVS systems used hand-crafted parametric models. The neural era began with models that predict mel-spectrogram features conditioned on score inputs and synthesize waveforms via vocoders. The critical inputs to an SVS model are: MIDI note (pitch and duration), phoneme sequence with phoneme-to-note alignment, and optional performance parameters such as vibrato depth.

DiffSinger, VISinger, and RVC

DiffSinger (Liu et al., AAAI 2022) applies DDPM to the SVS problem. Rather than predicting a mel-spectrogram in a single forward pass, DiffSinger iteratively denoises from Gaussian noise to a mel-spectrogram conditioned on the acoustic features predicted from the score. A separate “shallow diffusion mechanism” first trains a fast deterministic baseline and then refines it with diffusion only for the hardest (highest-detail) portion of the denoising schedule, reducing required steps from 1,000 to ~100. DiffSinger achieves state-of-the-art naturalness and expressiveness on Mandarin-language singing, where rich tonal variation makes the task particularly challenging.

VISinger (Zhang et al., 2021) adapts the VITS (Variational Inference with adversarial learning for end-to-end TTS) architecture to singing. VITS eliminates the intermediate mel-spectrogram by jointly learning the entire pipeline — text/score encoder, variational latent representation, and waveform decoder — in an end-to-end adversarial framework. The SVS adaptation adds pitch conditioning and phoneme-duration modeling specific to singing.

RVC (Retrieval-based Voice Conversion, community release 2023) is not strictly a synthesis system but a voice conversion tool that became widely used. Given a small reference corpus of a target speaker’s voice and a source audio clip (spoken or sung), RVC extracts the pitch and content features of the source, retrieves the most similar feature vectors from the target corpus, and synthesizes audio that sounds like the target speaking or singing the source content. RVC democratized personalized voice cloning — with the attendant ethical concerns about identity impersonation.


Chapter 13: Audio Editing, Inpainting, and Source Separation

Editing and Inpainting with Diffusion

Generative models trained for full synthesis can often be adapted for editing — modifying specific parts of an audio clip while preserving the rest — through inpainting. In the image diffusion literature, inpainting replaces the unknown region with noise at a chosen noise level and runs the denoising process conditioned on the known region (via a mask). For audio, the procedure is analogous but the semantics are different: “regions” are temporal intervals, and the task might be replacing a wrong note, adding an instrument to a gap, or extending a clip beyond its original length.

AudioLDM2 (Liu et al., 2023) extends the latent diffusion framework to arbitrary audio (music, speech, sound effects) with a text-conditioned inpainting capability. A masked spectrogram or latent region is denoised conditioned on the surrounding context and the text prompt. The system handles variable-length inputs by padding and masking, and the diffusion model learns to treat masked regions as conditionally independent given the text and context.

MusicGen-Edit adapts the MusicGen codec-LM framework for editing by masking tokens in the target region and autoregressively regenerating them conditioned on the surrounding context and a new text description. Unlike diffusion-based inpainting, which refines all masked frames in parallel across its denoising steps, codec-LM editing must generate tokens sequentially, which is slower but produces outputs that are compositionally consistent with the autoregressive training objective.

Source Separation as a Conditioning Primitive

Source separation — decomposing a mixed audio signal into its constituent sources (vocals, drums, bass, other instruments) — is not only a useful end-task but a powerful tool for conditioning music generation. We saw this in Chapter 7 where MusicGen uses Demucs to isolate the melodic stem before computing the chroma information bottleneck. The same idea generalizes: source separation creates “handles” on individual instruments within a mix, enabling target-stem conditioning, stem-level editing, and remixing workflows.

Demucs (Défossez et al., 2021) is a time-domain U-Net that operates directly on waveforms rather than spectrograms. An encoder with strided convolutions and LSTM layers encodes the mixture; a symmetric decoder with skip connections and transposed convolutions produces four output stems (vocals, drums, bass, other). Hybrid Transformer Demucs (HT-Demucs, Défossez et al., 2022) adds a dual-domain processing path that operates in both the time domain and the complex spectrogram domain, fusing the two representations at multiple scales. HT-Demucs achieves state-of-the-art separation quality on the MusDB18 benchmark while running at roughly 3× real time on a consumer GPU.

BandSplitRNN (Luo & Yu, ICASSP 2023) splits the frequency spectrum into non-overlapping sub-bands, applies a separate RNN to each band, and periodically exchanges information across bands via a band-interaction module. This explicit frequency decomposition exploits the largely independent statistics of different frequency regions — bass instruments rarely produce high-frequency harmonics — yielding efficient models that outperform full-band LSTM spectral baselines, particularly on sources whose energy is concentrated in a narrow frequency range.


Chapter 14: Multimodal Music Generation

Video-to-Music and Foley Synthesis

Music generation conditioned on visual input connects two sensory modalities that the human brain integrates effortlessly — a film scene’s emotional character is conveyed jointly by its imagery and soundtrack. Video-to-music (V2M) systems generate a synchronized audio track from a video, learning the statistical association between visual events and musical characteristics.

V2Meow (Su et al., 2023) uses a dual-encoder architecture: a video encoder extracts frame-level visual features, and an autoregressive Transformer generates VQ-VAE audio tokens conditioned on the visual sequence via cross-attention. The model is trained on large corpora of paired video-music data, learning implicit associations between visual dynamism (fast motion, bright colors, emotional expressions) and musical energy, tempo, and mood.

Foley synthesis generates the incidental sounds of a scene — footsteps, breaking glass, ambient wind — to match the visual events. Unlike background music, foley sounds are tightly time-locked to specific visual events, requiring precise temporal alignment. Foley synthesis systems typically predict a timeline of sound events from video features and then synthesize each event using a sound-event-conditioned audio model.

Dance-to-Music Generation

Dance-to-music generation (D2M) takes human body motion — typically represented as 3D skeleton keypoints over time — and generates music that matches the rhythm and character of the movement. D2M-GAN (Zhu et al., 2022) uses a GAN framework where the generator produces music from motion features and a discriminator judges whether the motion-music pairing is coherent. A key challenge is temporal alignment: the music’s beat and meter must sync with the dancer’s movements, requiring the model to simultaneously generate a musically coherent audio signal and time it to the motion sequence’s rhythmic structure.

These multimodal applications illustrate a general principle: music generation is not an isolated problem. Music is embedded in culture, narrative, and embodied movement, and systems that ignore these connections will produce audio that is technically plausible but contextually hollow.


Part VII: Representation, Provenance, Evaluation, and Ethics

Chapter 15: Music Representation Learning

Self-Supervised Audio Encoders

The evaluation and conditioning of generative music models depend critically on learned audio representations — embedding functions that map audio to vectors where musically similar inputs are close. Training such representations requires large labeled datasets, which are expensive to obtain. Self-supervised methods learn representations without manual labels by designing pretext tasks that force the model to capture musically relevant structure.

CLAP (Contrastive Language-Audio Pretraining; Wu et al., 2023) adapts the CLIP framework to audio. An audio encoder and a text encoder are trained jointly on paired audio-caption data, with a contrastive loss that pulls matched audio-text pairs together and pushes mismatched pairs apart in a shared embedding space:

\[ \mathcal{L}_{\text{CLAP}} = -\frac{1}{N} \sum_i \log \frac{\exp(\text{sim}(a_i, t_i)/\tau)}{\sum_j \exp(\text{sim}(a_i, t_j)/\tau)} \]

CLAP embeddings enable zero-shot audio classification (compute the CLAP similarity between audio and text labels), text-to-audio retrieval, and — as we saw in Stable Audio — text conditioning for generation. CLAP score (the average cosine similarity between generated audio and the conditioning text) has emerged as a standard automatic metric for evaluating text-to-audio alignment.
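A self-contained sketch of the contrastive objective above, run on dummy embeddings; real CLAP training also adds the symmetric text-to-audio term and trains the two encoder towers jointly.

```python
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_emb, text_emb, tau=0.07):
    """Audio-to-text direction of the contrastive loss on L2-normalized embeddings.

    audio_emb, text_emb: (N, d) batches of paired embeddings. Here they are random
    tensors; in practice they come from the audio and text encoders.
    """
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / tau                    # cosine similarities / temperature
    targets = torch.arange(len(a))            # the i-th audio matches the i-th text
    return F.cross_entropy(logits, targets)   # mean of -log softmax over matched pairs

loss = clap_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```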

MERT (Li et al., 2023) applies a BERT-style masked acoustic modeling paradigm to music. Random spans of the acoustic input are masked, and the model is trained to predict the masked content. Unlike BERT for text, MERT uses two prediction targets simultaneously: an acoustic teacher (Encodec tokens) and a musical teacher (constant-Q spectral features aligned with pitch) — the argument being that music requires both acoustically faithful and musically structured representations.

MusicFM (Won et al., 2024) is a foundation model for music understanding trained on a mixture of pretraining objectives — masked prediction, contrastive alignment, and next-frame prediction — on a large-scale music corpus. MusicFM features are competitive across a wide range of downstream tasks including music tagging, key estimation, beat tracking, chord recognition, and singer identification, demonstrating that a single model can develop broadly useful musical representations if given sufficient data and a diverse pretraining objective.

Why Representation Learning Underpins Generation

The relationship between representation learning and generation is bidirectional. Good representations make generation easier: latent diffusion models (Chapters 3–4) depend on Stage 1 autoencoders that organize the latent space sensibly, and the autoencoder’s quality directly determines the ceiling on Stage 2 generation quality. Conversely, generative models, by learning to synthesize realistic music, must implicitly learn rich musical representations.

CLAP and MERT have become the standard feature extractors underlying evaluation metrics (FAD, CLAP score) and conditioning mechanisms (text-to-music, audio-to-audio). A weakness of this setup is representation leakage: if the same CLAP encoder is used both for conditioning during generation training and for evaluation, the evaluation metric may overestimate quality because the model has learned to produce outputs that score well under that specific encoder, not necessarily outputs that sound good to human ears.


Chapter 16: Evaluating Generative Music Systems

Automatic Metrics

Fréchet Audio Distance (FAD; Kilgour et al., Interspeech 2019) is the audio analog of FID (Fréchet Inception Distance) for images. FAD extracts embeddings from a large set of generated clips and a large set of reference (real) clips using a pretrained audio feature model (typically VGGish), fits a multivariate Gaussian to each set, and computes the Fréchet distance between the two Gaussians:

\[ \text{FAD} = \|\mu_g - \mu_r\|^2 + \text{Tr}\!\left(\Sigma_g + \Sigma_r - 2\sqrt{\Sigma_g \Sigma_r}\right) \]

A lower FAD indicates that the generated audio distribution is closer to the reference distribution. Like FID, FAD measures distributional overlap rather than individual sample quality: a set of generated clips can match the reference embedding statistics in aggregate, and therefore score well, even if particular clips are musically incoherent.
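A minimal implementation of the FAD formula above, applied to dummy embedding sets; in practice each row would be the embedding of a generated or reference clip from a pretrained model such as VGGish.

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(emb_gen, emb_ref):
    """Fréchet distance between Gaussians fit to two (N, d) embedding sets."""
    mu_g, mu_r = emb_gen.mean(axis=0), emb_ref.mean(axis=0)
    cov_g = np.cov(emb_gen, rowvar=False)
    cov_r = np.cov(emb_ref, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_g @ cov_r, disp=False)   # matrix square root
    covmean = covmean.real                                  # discard tiny imaginary parts
    return float(np.sum((mu_g - mu_r) ** 2)
                 + np.trace(cov_g + cov_r - 2 * covmean))

# Two samples from the same distribution should give a distance near zero.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2000, 16)), rng.normal(size=(2000, 16))
print(frechet_audio_distance(a, b))   # small positive number (sampling noise)
```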

KL divergence between classifier probability distributions (using a pretrained audio classifier) captures whether the generated clips are semantically plausible: a model that generates only one type of sound would show a KL divergence spike. CLAP score measures text-audio alignment directly as the cosine similarity between CLAP embeddings of the generated clip and the input text prompt.

Subjective Evaluation: MOS and MUSHRA

For evaluating audio quality, the field draws on ITU-T P.800’s Mean Opinion Score (MOS): listeners rate each audio clip on a five-point scale (1 = bad, 5 = excellent). For comparing multiple systems to a reference, MUSHRA (Multiple Stimuli with Hidden Reference and Anchor) presents several conditions including a hidden reference and one or more degraded anchors, asking listeners to rate each on a 0–100 scale. By including a known anchor, MUSHRA calibrates each listener’s scale and makes comparisons across sessions more reliable.

What Metrics Miss

Every automatic metric for music generation captures only a fragment of what humans care about. FAD and CLAP score are blind to long-range musical structure: a model that generates a beautiful 10-second loop repeated ad infinitum would score well on both metrics despite producing musically incoherent output. Neither metric captures novelty — whether the generated music contains genuinely new melodic ideas or merely recombines training data — or cultural specificity — whether a piece labeled “Carnatic improvisation” actually follows the grammar and ornamentation conventions of that tradition.

Stable Audio’s evaluation suite (Stability AI, 2024) extended beyond FAD and CLAP score with human ratings for musicality (the expressiveness of melody and harmony), musical structure (the presence of intro/development/outro), and stereo correctness — a small but meaningful step toward multidimensional evaluation. Even so, Chris Donahue notes that “humans are bad at evaluating diversity” in generated outputs, since judging diversity requires listening to and remembering many samples simultaneously. Automatic diversity metrics (distribution coverage in embedding space) may be better calibrated for this dimension, even if they are worse for quality.


Chapter 17: Training-Data Attribution and Watermarking

Memorization and Data Influence

As music generation models grow in scale, the boundary between learning a distribution and memorizing training examples becomes blurrier. Memorization occurs when a model can regenerate near-verbatim copies of training audio given a prompt similar to that audio’s metadata. This is not merely an academic concern: in the context of music, memorization is a copyright infringement mechanism, and detecting it is both technically challenging and legally consequential.

The leading technical approach to memorization detection is nearest-neighbor retrieval: for each generated clip, compute a distance (in embedding space or at the waveform level) to every training example, and flag cases where the minimum distance falls below a threshold. MusicGen’s paper includes a memorization analysis of this kind, finding no significant memorization at their evaluated scale — though the paper acknowledges that the analysis is limited to exact-token matching and may miss paraphrastic memorization (generating audio that is the same song with a slightly different timbre).

Data influence methods (TracIn, Pruthi et al.; TRAK, Park et al.) estimate how much each training example contributed to a specific model output. These methods trace gradients from the output back through the training objective to individual training points, identifying which recordings most influenced the model’s decision to generate a particular clip. Data influence analysis serves both legal attribution (identifying the source material) and scientific purposes (understanding what the model has learned about harmonic grammar from its training corpus).

Neural Audio Watermarking

Neural watermarking embeds imperceptible signals in audio during generation, enabling provenance tracking after deployment. Unlike metadata tags (which can be stripped) or file-format watermarks (which are destroyed by re-encoding), neural watermarks are embedded in the audio signal itself and should survive standard post-processing.

AudioSeal (San Roman et al., Meta, 2024) trains an encoder-decoder pair: the encoder embeds a binary watermark (up to 32 bits) into audio by adding a learned, imperceptible perturbation, and a localized detector network predicts, at each 0.05-second frame, whether a watermark is present and what bits it encodes. The detection is localized in time, meaning the detector can identify when in a long recording the watermarked section appears — useful for detecting watermark removal attempts that clip or rearrange audio. AudioSeal demonstrates <0.1% false positive rate at 97% true positive rate even after OPUS re-encoding, DAC compression, and additive noise.

WavMark (Chen et al., 2023) uses a convolutional watermarker trained with a perceptual loss (ensuring watermark inaudibility) and a capacity-rate objective (maximizing bits per second of audio). SynthID-Audio (Google DeepMind, 2024) integrates similar watermarking directly into the Lyria music generation model, embedding watermarks before the audio is decoded from latent space.


Chapter 18: Generative Music in the World

Building Generative Music Platforms

The gap between a research demo and a production platform is vast. The Suno team’s engineering challenges, presented at this course’s industry lecture, illuminate several dimensions that academic papers typically elide.

Inference efficiency is the foundational constraint. At thousands of concurrent users, the GPU budget per generation is measured in seconds, not minutes. This requires aggressive optimization: quantization, speculative decoding (using a small draft model to propose token sequences that the main model accepts or rejects in batch), flash attention, and careful batching to maximize GPU utilization. A generation that takes three minutes on a research cluster might be intolerable if it cannot be reduced to three seconds on production infrastructure.

Meaningful evaluation without a ground truth is genuinely hard. Training loss is a poor proxy for musical quality: a model with slightly lower cross-entropy than a competitor might produce notably worse music on human listening tests, because the gap between “modeling the distribution of training data” and “producing aesthetically compelling output” is not captured by likelihood-based metrics. Suno evaluates internally with rapid A/B listening tests and external Discord community feedback — informal, but calibrated to actual user preferences.

Anti-bot and anti-misuse measures are necessary from day one. API scraping, synthetic stream generation, and prompt injection for copyright circumvention are real adversarial pressures. Watermarking (Chapter 17) provides one line of defense; rate limiting and prompt filtering provide others.

Cultural diversity in training data is both an ethical and a product quality issue. Western pop and rock are massively overrepresented in publicly available music corpora, so models trained without correction will generate generic Western commercial music by default and will produce low-quality output for prompts requesting Afrobeats, Qawwali, or South Indian classical. Suno uses Discord community feedback as a signal for quality across genres that are harder to evaluate automatically.

Interaction Design for Human–AI Co-Creation

Sara Adkins’ workflow illustrates what human–AI co-creation looks like in practice. She begins by generating a classical harp piece, applies Suno’s cover feature to convert it to an electronic style (obtaining a glitchy music-box sound), extends it with Suno’s Extend feature to add a new section, uses negative prompting — [Negative: Music, Notes, Harmony, Melody] — to generate pure noise textures, and combines all elements in Ableton Live with reversal, drums, and synthesizer layers. The AI is not the author; it is a stochastic instrument that the human musician plays by carefully choosing prompts and iterating.

Brian Eno articulated the philosophy of generative music in 1996, writing about his album Generative Music 1: “All my ambient music is based on this idea — that it is possible to make a system or set of rules, which once set in motion, will make music for you.” Contemporary AI music systems are Eno’s generative philosophy scaled to neural network parameters trained on millions of recordings. The rules are no longer hand-written but learned; the music is no longer a deterministic loop but a sample from a distribution. The underlying attitude — that composition can be a design of processes rather than a specification of outcomes — remains the same.

The Suno “Shimmer” artifact illustrates how generative AI systems develop unintended affordances. Suno v4 had a bug that produced a faint 12 kHz buzzing artifact — the Shimmer — in certain outputs. Users, far from complaining, adopted it as a stylistic element, requesting it deliberately through prompts. The boundary between “defect” and “character” in generative music turns out to be socially negotiated, not technically determined.

Non-Western Music and the Pop-Bias of Training Data

Every trained model encodes the biases of its training corpus. For music generation, the predominant bias is Western commercial pop: most large-scale music datasets are scraped from streaming platforms whose catalogs are dominated by English-language pop, rock, hip-hop, and electronic music. A model trained on this corpus will, when given an underspecified prompt, produce Western pop by default, and will struggle with prompts requesting music from traditions with fundamentally different tonal systems, rhythmic organizations, or performance practices.

The computational representation of non-Western music is itself a challenge. MIDI’s 128-pitch vocabulary assumes 12-tone equal temperament — it cannot represent:

  • Indian classical music’s 22 shrutis (microtonal intervals smaller than a semitone) and ornamental techniques like gamaka
  • Arabic maqam with quarter-tone intervals
  • Chinese guqin sliding ornaments that cannot be discretized to fixed pitches
  • Blues inflections and jazz microtones

The CompMusic project (Serra et al., ISMIR) has developed datasets and annotation schemes specifically for Carnatic, Hindustani, Makam, and Beijing Opera traditions, recognizing that generic MIR tooling transfers poorly. Building generation models that represent these traditions adequately requires not just larger datasets but fundamentally different representations and evaluation criteria designed with knowledge of the specific tradition.

Societal Implications and Open Questions

Music generation AI raises questions that are simultaneously legal, economic, cultural, and philosophical.

Copyright and attribution: The RIAA’s lawsuits against Suno and Udio (2024) argue that training on copyrighted recordings constitutes infringement, regardless of whether generated outputs reproduce specific melodies. The legal outcome will shape the permissible training data for all audio generation systems. Data influence methods (Chapter 17) may eventually provide a technical basis for training-data attribution — enabling users to discover which recordings a model drew on to generate a particular output — but whether attribution creates legal liability, or licensing obligations, remains unsettled.

Labor displacement: Professional session musicians, stock-music composers, and soundtrack producers are among the workers most immediately affected by systems that generate competent background music on demand. The economic impact is real even if legally ambiguous: a music supervisor who once purchased ten stock tracks at $50 each can now generate fifty variants for the cost of API calls. Whether this displacement is offset by new creative roles enabled by AI tools, as the music software industry expanded creative opportunities previously, is a contested empirical question.

Homogenization of aesthetic culture: A model that learns the distribution of existing music is, by construction, biased toward the mean of that distribution. Models trained on streaming platforms’ most-played tracks will tend to produce music that sounds like the current charts — reinforcing existing taste rather than diversifying it. Whether generative AI will eventually enable unprecedented creative diversity (by lowering the cost of exploring unusual combinations) or will accelerate taste homogenization (by amplifying the most statistically typical) remains genuinely open.

Identity and voice: The ability to clone a specific artist’s vocal timbre — already demonstrated by RVC and similar systems — raises questions of artistic identity and consent. A generated song “in the voice of” a living artist, indistinguishable from their actual recordings, can be used for defamation, fraud, or simply to dilute the value of the real artist’s catalog. Watermarking (Chapter 17) and deepfake-detection tools provide partial technical countermeasures, but the deeper question is normative: under what circumstances, if any, is voice cloning permissible, and who should have the right to consent?

These questions do not have easy answers. They are, however, the questions that anyone building or deploying music generation systems has an obligation to engage with. Technical mastery of diffusion models and codec language models is necessary for this field; it is not sufficient.
