How do transformer models trained on music data generate new musical sequences?
A. They replay the training data with random variations added
B. They predict the probability distribution of the next musical token given a preceding sequence, sampling from that distribution to generate new tokens
C. They interpolate linearly between examples in the training set
D. They apply harmonic rules programmed explicitly by the training data engineers
Answer: B
Transformer music models learn to predict the next token (a note, chord, or audio code) from the preceding context. Generation proceeds autoregressively: each sampled token is appended to the context and the model predicts the next one, producing novel sequences that follow learned statistical patterns.
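The autoregressive loop described above can be sketched in a few lines. This is a minimal illustration, not a real music model: the vocabulary `VOCAB` and the scoring function `toy_logits` are hypothetical stand-ins for a trained transformer, and the `temperature` parameter is an assumed sampling control.

```python
import math
import random

# Hypothetical toy vocabulary of note tokens (an assumption for illustration).
VOCAB = ["C4", "E4", "G4", "REST"]

def toy_logits(context):
    """Stand-in for a trained transformer: one score per vocabulary token.

    These scores only discourage repeating the last token. A real model
    would compute them from the whole context using attention.
    """
    last = context[-1] if context else None
    return [0.0 if tok == last else 1.0 for tok in VOCAB]

def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt, n_tokens, temperature=1.0, seed=0):
    """Sample n_tokens new tokens, feeding each one back into the context."""
    rng = random.Random(seed)
    context = list(prompt)
    for _ in range(n_tokens):
        probs = softmax(toy_logits(context), temperature)
        next_token = rng.choices(VOCAB, weights=probs, k=1)[0]
        context.append(next_token)  # the new token becomes part of the context
    return context

sequence = generate(["C4"], n_tokens=8)
print(sequence)
```

Because each token is sampled from a distribution rather than copied from stored examples, different seeds give different sequences, which is why option B is the accurate description.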
Question 2 True / False
True or false: AI source separation tools like Demucs can perfectly isolate individual instruments from a professional mix with no audible artifacts.
T. True
F. False
Answer: False
Current source separation models produce high-quality but imperfect separations. Leakage (faint bleed from other sources), artifacts in complex passages, and degraded quality on heavily processed or unusual sounds are all common. These models are highly useful tools, but they are not perfect isolators.
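One rough way to put a number on leakage is a signal-to-distortion ratio (SDR) between a reference stem and the separated estimate. The sketch below uses the basic energy-ratio definition, which is simpler than the metrics used in published separation benchmarks. The sine-wave "stems" and the 10% bleed level are made-up assumptions, and `sdr_db` is a hypothetical helper name.

```python
import math

def sdr_db(reference, estimate):
    """Basic SDR in dB: reference energy divided by error energy."""
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_energy == 0:
        return float("inf")  # a perfect separation has no error at all
    return 10 * math.log10(signal_energy / error_energy)

# Synthetic stand-ins for two stems (assumption: plain sine waves).
vocals = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]
drums = [math.sin(2 * math.pi * 13 * n / 100) for n in range(100)]

# A separated "vocal" track that still contains 10% of the drum signal.
estimate = [v + 0.1 * d for v, d in zip(vocals, drums)]

score = sdr_db(vocals, estimate)
print(f"SDR with 10% drum bleed: {score:.1f} dB")
```

A perfect isolator would give an infinite SDR under this definition. Any residual bleed yields a finite value, which matches the answer above.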
Question 3 Short Answer
What is the fundamental technical difference between transformer-based and diffusion-based audio generation?
Model answer: Transformer models generate audio autoregressively as sequences of tokens, predicting one token at a time from prior context. Diffusion models learn to reverse a noise process, starting from random noise and iteratively denoising toward structured audio conditioned on a text or audio prompt.
Transformers typically operate in a discrete token space and can maintain long-range structure, but they generate sequentially, one token at a time, which is slow. Diffusion models operate in a continuous signal or latent space and update the whole clip in parallel at each denoising step, although many steps are usually needed. Controlling fine-grained musical structure is more challenging with diffusion.
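To contrast with token-by-token generation, here is a deliberately simplified sketch of iterative denoising. It is not a real diffusion sampler: `toy_denoiser` is a hypothetical stand-in for a trained network and always predicts a fixed sine wave, and the blending schedule is an assumption chosen for clarity rather than a published noise schedule.

```python
import math
import random

def toy_denoiser(x, step, n_steps):
    """Stand-in for a trained network that predicts the clean signal.

    Assumption: the "clean" target is one cycle of a sine wave. A real model
    would infer it from x, the step index, and a conditioning prompt.
    """
    n = len(x)
    return [math.sin(2 * math.pi * i / n) for i in range(n)]

def diffusion_generate(length=16, n_steps=10, seed=0):
    """Start from Gaussian noise and refine the whole clip at every step."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(length)]  # pure noise
    for step in range(n_steps):
        predicted_clean = toy_denoiser(x, step, n_steps)
        blend = 1.0 / (n_steps - step)  # reaches 1.0 on the final step
        # Every sample position is updated in the same step (in parallel).
        x = [xi + blend * (ci - xi) for xi, ci in zip(x, predicted_clean)]
    return x

clip = diffusion_generate()
print([round(v, 3) for v in clip[:4]])
```

The loop runs over denoising steps, not over time positions. Each step touches the entire clip, whereas an autoregressive model needs one forward pass per generated token.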
Question 4 Multiple Choice
A music producer uses an AI mixing assistant to set initial EQ and compression on tracks. What is the most accurate characterization of this workflow?
A. The AI produces the final mix; the producer reviews for copyright compliance
B. The AI provides a starting point based on genre and instrument analysis, which the producer then refines using their own judgment and taste
C. AI mixing is indistinguishable from human mixing and replaces the need for a mix engineer
D. AI tools in mixing only function at the mastering stage
Answer: B
AI mixing tools (iZotope Neutron, LANDR mixing) analyze audio and apply statistically learned processing as a starting point. This removes the blank-slate problem and saves setup time, but professional engineers typically refine the AI's suggestions: the aesthetic, genre, and emotional decisions remain human.