DAC 203: Sound Production and Analysis

Luis Sotelo

Estimated study time: 39 minutes

Sources and References

Primary textbook — Ken C. Pohlmann, Principles of Digital Audio, 6th ed. (McGraw-Hill, 2011). Supplementary texts — David Sonnenschein Sound Design; Biewen & Dilworth Reality Radio; Karen Collins Game Sound; Michel Chion Audio-Vision; R. Murray Schafer The Soundscape; Bobby Owsinski The Mixing Engineer’s Handbook. Online resources — Audacity Reference Manual (manual.audacityteam.org); MIT OCW 21M.380 Music and Technology; BBC Academy podcast resources.

Chapter 1 — The Physics of Sound and Hearing

Sound begins as a mechanical disturbance. A vibrating source — a guitar string, a vocal fold, a speaker cone — pushes on the molecules of the surrounding medium, and those molecules jostle their neighbors, passing a local pattern of compression and rarefaction outward at the speed of sound. In air at room temperature that speed is about 343 metres per second. Sound is therefore not a thing that travels but a behavior of the medium, a traveling pattern of pressure change. Cut away the air and the vibration has nothing to propagate through, which is why space, notoriously, is silent.

Pohlmann opens Principles of Digital Audio by insisting on this mechanical grounding because every later abstraction — numbers in a file, a waveform on a screen, a plug-in parameter in a DAW — is ultimately anchored to a real displacement of real air molecules, converted back into motion by a loudspeaker or headphone driver at the end of the chain. Forgetting that anchor is the fastest way to make bad engineering decisions.

A single pure tone corresponds to a sinusoidal variation in pressure over time. Three parameters describe it: frequency, the number of cycles per second, measured in hertz; amplitude, the magnitude of the pressure deviation, usually expressed logarithmically in decibels of sound pressure level; and phase, the position within a cycle at some reference instant. Frequency maps approximately onto the perceived sensation of pitch, amplitude onto loudness, and phase becomes audible chiefly when two related signals are combined and their interference produces cancellation or reinforcement. Real sounds are almost never pure tones. A sustained note on an oboe is a complex periodic signal that Fourier analysis decomposes into a fundamental plus a series of harmonic partials, and the particular recipe of partials — their relative amplitudes and phases — is most of what the ear registers as timbre. A cymbal crash or a consonant in speech is aperiodic, broadband, and its spectrum is a continuous distribution rather than a picket fence of harmonics.
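The three parameters, and the "recipe of partials" idea, can be sketched directly in code. A minimal Python illustration follows; the partial amplitudes are arbitrary and illustrative, not measured from a real oboe:

```python
import math

SAMPLE_RATE = 48000  # samples per second

def sine(freq_hz, amp, phase_rad, n_samples, rate=SAMPLE_RATE):
    """One pure tone: amplitude * sin(2*pi*f*t + phase)."""
    return [amp * math.sin(2 * math.pi * freq_hz * i / rate + phase_rad)
            for i in range(n_samples)]

def harmonic_tone(fundamental_hz, partial_amps, n_samples):
    """Sum a fundamental plus harmonic partials; the relative amplitudes
    of the partials are most of what the ear registers as timbre."""
    out = [0.0] * n_samples
    for k, amp in enumerate(partial_amps, start=1):
        for i, s in enumerate(sine(k * fundamental_hz, amp, 0.0, n_samples)):
            out[i] += s
    return out

# A 440 Hz tone with decaying partials (an invented recipe for illustration)
tone = harmonic_tone(440.0, [1.0, 0.5, 0.33, 0.25], SAMPLE_RATE // 10)
```

Changing the amplitude list changes the timbre while the perceived pitch stays anchored to the 440 Hz fundamental.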

The human hearing system is a remarkable but nonlinear transducer. The outer ear funnels pressure variations through the ear canal to the eardrum; the middle-ear ossicles perform an impedance match from air to the fluid of the inner ear; and along the basilar membrane of the cochlea, different regions respond preferentially to different frequencies, so the cochlea effectively performs a running spectral analysis before anything reaches the auditory nerve. A healthy young listener responds to frequencies from roughly 20 hertz at the low end to 20 kilohertz at the high end, but the upper bound declines with age and noise exposure, and the low end is felt almost as much as heard. Sensitivity is frequency-dependent: the ear is most acute around two to five kilohertz, where speech consonants live, and comparatively insensitive at the extremes. The equal-loudness contours first mapped by Fletcher and Munson and refined in ISO 226 capture this nonuniformity, and they explain why a mix that sounds balanced at loud monitor level can seem thin and bass-less at whisper volume.

Two other perceptual facts matter constantly in production. Masking means that a loud sound in one frequency region can render nearby quieter sounds inaudible, an effect that underpins perceptual audio coding such as MP3 and AAC and that also informs mixing decisions — a snare can easily be buried by a cymbal in the same spectral neighborhood. Binaural localization means that the brain infers the direction of a sound source from interaural time and level differences plus spectral cues imparted by the shape of the head and pinnae, and these cues are what stereo, surround, and binaural recordings attempt to reproduce. Finally, the ear is enormously tolerant of some distortions and brutally intolerant of others. Small nonlinear harmonic distortion can sound warm and musical; tiny amounts of wow and flutter, clicks, or abrupt gain jumps are noticed instantly. Any production workflow has to be built around what the ear actually cares about, not around what a meter happens to show.

Chapter 2 — From Analog to Digital: Sampling and Quantization

Digital audio is an agreement to represent a continuous waveform by a sequence of numbers. Two independent choices define that representation: how often we measure the waveform, and how finely we can express each measurement. The first is sampling, the second is quantization.

The Nyquist–Shannon sampling theorem, which Pohlmann walks through in detail, states that a signal whose frequency content is strictly limited to below half the sampling rate can be reconstructed exactly from its samples. The Nyquist frequency is therefore half the sampling rate, and anything above it has to be removed before sampling, because frequencies above Nyquist will alias: they fold back down into the audible range as spurious lower-frequency content that no downstream process can tell apart from real signal. A brick-wall anti-alias filter in the analog domain, followed by a matching reconstruction filter on the playback side, is therefore not optional — it is the price of admission to digital audio. Compact-disc audio settled on 44.1 kHz sampling precisely because it leaves a margin above the nominal 20 kHz ceiling of human hearing for a realizable filter skirt. Professional production commonly uses 48 kHz, with 88.2, 96, and 192 kHz available for higher-rate workflows; the benefits at these higher rates come less from capturing extra audible bandwidth than from relaxing filter design and from headroom for pitch and time manipulation.
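The fold-down arithmetic behind aliasing is simple enough to sketch. The hypothetical helper below shows where an unfiltered tone above Nyquist lands after sampling:

```python
def alias_frequency(f_in, sample_rate):
    """Frequency at which a tone is heard after sampling without an
    anti-alias filter: the spectrum repeats every sample_rate, and
    anything above Nyquist folds back down across it."""
    f = f_in % sample_rate        # spectral images repeat every sample_rate
    if f > sample_rate / 2:
        f = sample_rate - f       # fold down across the Nyquist frequency
    return f

# A 30 kHz tone sampled at 44.1 kHz folds down to an audible 14.1 kHz
print(alias_frequency(30_000, 44_100))  # 14100
```

No downstream process can distinguish that 14.1 kHz alias from a real 14.1 kHz component, which is exactly why the filtering must happen before the sampler.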

Quantization assigns each sample to the nearest available level on a finite ladder. With n-bit linear pulse-code modulation you get two-to-the-n levels; each sample rounds to the nearest rung of the ladder, and the rounding error behaves, to a first approximation, like an added noise floor. The classical result is that ideal linear PCM gives roughly six decibels of dynamic range per bit, so 16-bit audio has about 96 dB of signal-to-quantization-noise ratio and 24-bit audio has about 144 dB. Sixteen bits is enough for distribution in a good listening environment; twenty-four bits is the professional working resolution because it buys headroom for gain staging, processing, and summing without flirting with the noise floor.
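The dynamic-range arithmetic can be put in a few lines. The helpers below are illustrative, not from any library; the full-scale-sine formula 6.02n + 1.76 dB is the precise version of the rough "6 dB per bit" rule:

```python
def quantize(sample, bits):
    """Round a sample in [-1.0, 1.0) to the nearest level on an n-bit ladder."""
    levels = 2 ** (bits - 1)                      # signed PCM: -levels .. levels-1
    code = max(-levels, min(levels - 1, round(sample * levels)))
    return code / levels

def snr_db(bits):
    """Signal-to-quantization-noise for a full-scale sine: 6.02*n + 1.76 dB.
    Dropping the constant gives the rule of thumb of about 6 dB per bit."""
    return 6.02 * bits + 1.76

err = abs(quantize(0.3, 16) - 0.3)  # rounding error is at most half an LSB
```

For 16 bits the formula gives about 98 dB, which the rule of thumb rounds down to the familiar 96 dB figure.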

Two refinements deserve mention. Dither is a small, carefully shaped random signal added before requantization. Counter-intuitively, injecting noise improves the situation: without dither, quantization error is correlated with signal and produces ugly low-level distortion products; with properly shaped dither, the error becomes a benign broadband hiss, which the ear ignores far more readily than correlated distortion. Every time a project is bounced from a higher bit depth to a lower one, dither belongs in the chain. Noise shaping goes further, redistributing the dither spectrum away from the frequencies where the ear is most sensitive and into regions where masking hides it, further improving perceived dynamic range at a given word length.
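A sketch of TPDF (triangular probability density function) dither at the word-length reduction step, assuming samples normalized to plus or minus 1.0; the function is illustrative rather than a production dithering stage:

```python
import random

def requantize_with_dither(sample, bits, rng=random):
    """Add triangular dither of +/- 1 LSB before rounding, so the
    quantization error decorrelates from the signal and becomes
    benign broadband hiss instead of correlated distortion."""
    scale = 2 ** (bits - 1)
    lsb = 1.0 / scale
    dither = (rng.random() - rng.random()) * lsb   # triangular pdf over +/- 1 LSB
    code = max(-scale, min(scale - 1, round((sample + dither) * scale)))
    return code / scale
```

Requantizing the same input many times now yields slightly different outputs whose average converges on the true value, which is exactly the trade the ear prefers: a stable, signal-independent hiss in place of distortion products.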

Once the signal is a stream of numbers, the rest of the processing pipeline — filtering, compression, reverberation, pitch shifting, and so on — is digital signal processing. A DSP algorithm is ultimately a recipe that, at each new sample, computes an output sample from the incoming sample and a small amount of internal state. Finite impulse response filters have only a finite memory of past inputs; infinite impulse response filters feed their own outputs back into the computation and so can model resonant behavior with fewer coefficients. Many of the most important effects — equalizers, compressors, reverberators — are built out of combinations of these primitive blocks. A producer does not need to implement them from scratch, but understanding what is happening under the hood makes it much easier to diagnose why, for example, stacking too much linear-phase EQ introduces pre-echo, or why a plug-in introduces latency that Pro Tools then has to compensate for.
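The two filter families can be sketched in a few lines. These are toy examples, a moving-average FIR and a one-pole IIR low-pass, not production filters:

```python
def fir_moving_average(x, n_taps=4):
    """FIR: each output depends only on a finite window of past inputs."""
    out = []
    for i in range(len(x)):
        window = x[max(0, i - n_taps + 1): i + 1]
        out.append(sum(window) / n_taps)
    return out

def iir_one_pole(x, a=0.9):
    """IIR: the output feeds back into the computation, giving an
    infinite (exponentially decaying) memory with a single coefficient.
    y[n] = (1 - a) * x[n] + a * y[n-1], a simple low-pass."""
    y, prev = [], 0.0
    for s in x:
        prev = (1 - a) * s + a * prev
        y.append(prev)
    return y
```

Feeding each filter a single impulse makes the difference visible: the FIR's response goes exactly to zero after four samples, while the IIR's response decays forever without quite reaching zero, which is the resonant, fewer-coefficients behavior the text describes.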

Chapter 3 — The DAW: From Audacity to Pro Tools

A Digital Audio Workstation is the software environment in which recording, editing, mixing, and often mastering happen. DAWs share a common conceptual vocabulary — tracks, clips, transport controls, a mixer with channel strips, inserts, sends, buses, and a master output — and they differ mostly in ergonomics, pricing, and ecosystem. Every modern DAW, from the free and open-source Audacity to the industry-standard Pro Tools, rests on the same foundation laid out in Chapter 2: audio is a stream of samples, processing is DSP, and the user’s job is to arrange, shape, and balance those streams.

Audacity is the natural entry point because it strips the model to essentials. According to the Audacity Reference Manual, a project consists of one or more audio tracks laid out horizontally along a timeline; tracks can be mono or stereo; clips within a track can be imported from files, recorded live from an input device, or generated by built-in tools. The core operations are deceptively powerful: File → Import → Audio brings a file into a track; the spacebar plays and stops; the selection tool picks a range on which cuts, copies, pastes, silences, and effects will act; the Label Track lets you mark points and regions on the timeline by pressing Ctrl+B, which is how podcast editors rough-cut a conversation. Audacity saves work as an .aup3 project file — a SQLite database that keeps references to audio data — and exports finished work to interchange formats such as WAV, FLAC, or MP3 through the Export dialog. The destructive nature of many Audacity operations is a useful teacher: it forces beginners to keep backups, commit to decisions, and think about signal flow.

At the other end of the spectrum, Pro Tools is the long-reigning standard in post-production and commercial music. Its documented strengths are session management, tight hardware integration, automatic plug-in delay compensation, and a mature clip-based nondestructive editing model in which trims, fades, and crossfades are parameters on a region rather than rewritten audio on disk. Its Edit Window and Mix Window separate timeline-centric and console-centric views of the same session, and its scripting and session exchange formats are robust enough that a project can travel between studios without losing automation or routing. Logic Pro, Ableton Live, Reaper, Cubase, and Studio One cover the same functional ground from different angles: Logic leans toward songwriters on Apple hardware, Live toward performance and clip-launching workflows, Reaper toward customizability and price-sensitive users, Cubase toward orchestral scoring, Studio One toward single-window efficiency. Students who understand one DAW thoroughly can usually pick up another in an afternoon.

Regardless of tool, a professional project is built around a small set of habits. Name tracks at creation time, never leaving a session full of “Audio 1” through “Audio 7”. Keep take folders or playlists rather than deleting alternate takes. Commit to a sample rate and bit depth at session start and do not change them mid-project. Use folders, groups, and color to express structure. Save incrementally; a session called mix_v01, mix_v02, and so on is easier to rescue than a single mix_final_FINAL_actually.ptx. These habits are boring, and they are what separate a usable session from an unreplicable one six months later.

Chapter 4 — Microphones, Preamps, and the Signal Chain

A microphone converts acoustic energy into an electrical signal, and every downstream choice in a project depends on what that conversion sounds like. Three transducer principles dominate. Dynamic microphones use a diaphragm attached to a coil suspended in a magnetic field; motion of the diaphragm induces a current in the coil. They are rugged, tolerant of very loud sources, and relatively inexpensive, which is why you find the Shure SM57 and SM58 on almost every stage in the world. Condenser microphones use a charged diaphragm as one plate of a capacitor; motion of the diaphragm changes capacitance and therefore voltage. They need an external polarizing voltage, usually forty-eight-volt phantom power delivered up the XLR cable, and in return they offer higher sensitivity, lower self-noise, and extended high-frequency response, which makes them the default for voice-over, classical recording, and studio vocal work. Ribbon microphones suspend a thin metal ribbon between magnets; they capture transients with a characteristic smoothness and usually present a figure-of-eight polar pattern natively.

Polar patterns describe how a microphone’s sensitivity varies with the angle of arrival. An omnidirectional mic is roughly equally sensitive in all directions and therefore captures a lot of room, which is a blessing in a good room and a curse in a bad one. A cardioid is most sensitive on axis and rejects sound arriving from directly behind, which is why it dominates live and spot-miking applications. Supercardioid and hypercardioid tighten the front lobe further at the cost of a small rear pickup lobe. Figure-of-eight is equally sensitive front and back and dead on the sides, which is the basis of classic stereo techniques such as Blumlein pairs and mid-side. Two effects complicate the picture: proximity effect, the bass boost that directional microphones exhibit when very close to a source, which a vocalist can use expressively or accidentally; and the off-axis coloration that makes any directional microphone sound tonally different from behind or the side than it does from the front.

The microphone is only the first link in a signal chain whose weakest point sets the quality of the whole recording. The mic’s output at a few millivolts travels down a balanced cable to a preamplifier — either a standalone unit, a channel on a console, or the built-in front end of an audio interface — that raises the signal to line level, typically with adjustable gain of sixty decibels or more. From the preamp the signal may pass through analog processing (a compressor, an equalizer) or go straight to an analog-to-digital converter, whose anti-alias filter and quantizer produce the numbers the DAW stores. Gain staging across this chain is a constant preoccupation: set the preamp so that the loudest expected input peaks comfortably below the converter’s clipping point, leaving headroom for surprises, and resist the urge to print EQ or compression that you can just as easily add later. Because digital systems clip hard — any sample whose magnitude exceeds the maximum codeable value is silently truncated and produces audible distortion — the rule of thumb is to track at an average level of roughly minus eighteen decibels on a full-scale meter, with peaks nowhere near zero.
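The decibel arithmetic behind these level rules is worth seeing once. Two small conversion helpers, written here for illustration:

```python
import math

def dbfs(linear_peak):
    """Convert a linear sample magnitude (1.0 = digital full scale) to dBFS."""
    if linear_peak <= 0:
        return float("-inf")
    return 20 * math.log10(linear_peak)

def linear(db):
    """Inverse: a dBFS value back to linear amplitude."""
    return 10 ** (db / 20)

# Tracking around -18 dBFS average leaves about 18 dB of headroom:
# full scale is 1.0, and -18 dBFS corresponds to roughly 0.126
print(round(linear(-18.0), 4))
```

Halving an amplitude costs about 6 dB, which is why the same "6 dB per doubling" figure keeps reappearing across gain staging, bit depth, and fader moves.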

Cables, connectors, and grounding deserve a paragraph of respect. Balanced XLR and TRS cables reject common-mode hum by carrying the signal twice, once inverted, so that ambient interference picked up on both conductors cancels at the differential receiver. Unbalanced TS cables on long runs in an electrically noisy environment are an invitation to hum and buzz. A ground loop created by two pieces of equipment sharing more than one electrical path produces a stubborn sixty- or fifty-hertz whine that no plug-in will fully remove. A rigorously labeled, well-patched studio saves days of debugging over its lifetime.

Chapter 5 — The Recording Studio and Acoustic Treatment

A room is a musical instrument. The walls, ceiling, and floor reflect and absorb sound, and the composite of the direct signal from the source and the reflections arriving at the microphone is what the microphone actually captures. A recording made in a well-treated room sounds like the source. A recording made in a bad room — one that is too small, too reflective, or too modal — captures the room at least as much as the performance, and there is no plug-in that cleanly separates them after the fact.

The relevant acoustic phenomena fall into three groups. Early reflections arrive within roughly the first twenty milliseconds, are not perceived as separate echoes, and produce comb-filter coloration when they combine with the direct sound — that hollow, phasey quality you hear in a tiled bathroom. Reverberation is the statistical wash of late, closely spaced reflections that decays exponentially; its character is described by the time it takes to drop sixty decibels, known as RT60, and by its frequency dependence. Room modes are standing waves between parallel surfaces at frequencies whose wavelengths match the room dimensions, producing huge bass peaks and nulls at specific positions. Small rooms suffer most from modes and early reflections; large rooms suffer from long reverberation times and the practical problem of getting rid of bass energy at all.

Treatment is correspondingly layered. Broadband absorption — thick panels of rigid mineral wool or fibreglass, wrapped in fabric, mounted with an air gap behind them — tames reflections from mid and high frequencies. Bass traps in the corners, where modal pressure maxima concentrate, address the low-frequency problem that thin absorbers cannot touch. Diffusion, implemented with geometric scatterers such as Schroeder quadratic-residue diffusers, breaks up specular reflections into a diffuse field without removing energy, which keeps a room from sounding dead. A typical project studio arranges first-reflection absorbers at the mirror points between the monitors and the listening position, corner bass traps, a reflective or diffusive rear wall, and a speaker setup that places the monitors at the apex of an equilateral triangle with the listener’s head, angled inward, at ear height.

Isolation is the other half of the problem — keeping outside noise out and inside noise in. Isolation is fundamentally about mass, decoupling, and airtight seals, not about foam on the walls. A properly isolated room is built as a room within a room on resilient mounts, with heavy dense walls, floating floors, and specially designed doors and ventilation. For field and home recording, isolation is usually impractical; the pragmatic response is to schedule recording when the neighborhood is quiet and to choose mic positions and polar patterns that minimize off-axis pickup of HVAC, refrigerators, traffic, and other persistent offenders. Owsinski’s Mixing Engineer’s Handbook makes the same point that every seasoned engineer repeats: fix it at the source, not in the mix. The time spent getting a room, a mic position, and a performance right is the cheapest time in the entire production cycle.

Chapter 6 — Field Recording

Field recording is the practice of capturing sound away from a controlled studio: ambiences for film and games, wildlife and environmental soundscapes for documentary and sound art, interviews in homes and offices, the raw material of radio storytelling. It rewards patience, planning, and a small kit that works reliably in the cold, the rain, and the noise of the real world.

A typical field rig has three parts. The recorder — a handheld unit such as the Zoom H5 or H6, a Tascam DR-series, or a higher-end Sound Devices MixPre — provides XLR inputs with phantom power, solid preamps, internal memory or SD card storage, headphone monitoring, and battery power. The microphone is chosen for the job: a shotgun for dialogue and directional ambience, a stereo pair (or a dedicated stereo mic) for musical ambiences and environments, a contact mic or hydrophone for material a traditional mic cannot reach. Accessories are not optional in the field: a shock mount, a proper windscreen — a furry “dead cat” over a basket windshield, not a foam sock — headphones with enough isolation to hear what is actually going onto the recording, spare batteries, spare cards, and a notebook to log takes and slate contents.

Technique in the field is mostly about attention. A field recordist learns to listen through headphones for things the eye would never notice: the hum of a fluorescent fixture, the distant idle of a diesel generator, a wind gust on an unshielded mic capsule, the rustle of a coat sleeve. Levels are set conservatively — the rule of thumb of peaks around minus twelve decibels full scale, with averages much lower, holds here as well — because a distorted take cannot be salvaged while a quiet take can always be normalized. Many recordists keep a safety track, a duplicate channel recorded ten or twelve decibels lower, for unpredictable loud events. Wind management is its own discipline: even a modest breeze across an unshielded microphone produces rumbling low-frequency noise that a high-pass filter cannot fully remove, so the windshield and sometimes a second layer of fur are among the most important pieces of gear the recordist owns.

A good field recordist also keeps documentation. Each take is slated — a verbal announcement at the head of the clip saying what the recording is, where, when, and any relevant conditions — and logged in a spreadsheet or notebook with filename, duration, location, and subjective rating. Six months later, a diligent log is the difference between a usable archive and a hard drive full of anonymous WAV files. The BBC Academy’s field guides and the soundscape-tradition literature descending from Murray Schafer’s The Soundscape both emphasize this ethic of the attentive, note-taking recordist, because what you hear at a location and what ends up on the disc are never quite the same, and only the log lets the editor later reconstruct the original intent.

Chapter 7 — Editing: Cuts, Crossfades, and Comping

Editing is where the raw material becomes a story. The operations are simple in isolation — cut, move, trim, crossfade, fade in, fade out — but the decisions behind them are what distinguish a professional edit from a clumsy one. Every DAW offers a nondestructive clip model in which a region on the timeline points into an underlying audio file; trimming a region reveals or hides more of that file, and the underlying audio is not actually rewritten. This is why edits can be revised cheaply for as long as the project exists.

The cardinal rule of cutting audio is to make cuts at zero-crossings, the instants at which the waveform passes through zero. A cut at a non-zero point introduces a discontinuity — an instantaneous jump in sample value — and the ear hears discontinuities as clicks. DAWs offer a zero-crossing search command for exactly this reason, and they also provide automatic short crossfades at edit points, which ramp the outgoing material down while ramping the incoming material up over a few milliseconds and mask residual discontinuities. For dialogue edits, crossfades of a few milliseconds are standard; for music, longer crossfades at loop points or between takes can be used to hide transitions entirely. The key is consistency: rough cuts without crossfades are a recipe for subtle clicks that will torment a mastering engineer.
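Both ideas are easy to sketch over plain lists of samples; a real DAW performs the same operations nondestructively on the underlying file. The zero-crossing search and the equal-power crossfade below are hypothetical illustrations:

```python
import math

def nearest_zero_crossing(samples, index):
    """Search outward from `index` for the nearest sign change,
    where a cut will not create an audible click."""
    for offset in range(len(samples)):
        for i in (index - offset, index + offset):
            if 0 < i < len(samples) and samples[i - 1] * samples[i] <= 0:
                return i
    return index

def equal_power_crossfade(tail, head):
    """Ramp the outgoing clip down and the incoming clip up with
    cos/sin curves so the summed power stays constant across the join."""
    n = min(len(tail), len(head))
    return [tail[i] * math.cos(0.5 * math.pi * i / (n - 1)) +
            head[i] * math.sin(0.5 * math.pi * i / (n - 1))
            for i in range(n)]
```

The cos/sin pair is the standard equal-power choice: because cos² + sin² = 1, two uncorrelated signals keep constant combined power through the fade, where a pair of straight linear ramps would dip audibly at the midpoint.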

Comping — short for compositing — is the practice of assembling a final performance from pieces of several takes. A vocalist sings the song four times through; the editor keeps verse one from take two, the bridge from take three, and the last chorus from take four; cuts are placed on breaths and consonants; crossfades are tucked under them. A good comp is invisible, and invisibility is earned by matching the takes in level, tone, and delivery before cutting. Pro Tools’ playlists, Logic’s take folders, and Reaper’s take-lanes all exist for this workflow. The same discipline applies to podcast interview editing: you cut out stumbles, ums, false starts, and dead air, and you place fades so the result sounds like a single fluent utterance that the speaker could plausibly have delivered in one breath.

Two ethical lines run through editing. The first is accuracy: in documentary and journalism, edits must not misrepresent what a subject said. Biewen’s chapters in Reality Radio return repeatedly to this point — the edit compresses and clarifies, but it does not invent. If you splice out a qualifier and change the meaning, that is no longer editing, it is fabrication. The second is taste: a smooth edit serves the story, and a flashy edit distracts from it. The best edits are the ones the listener never notices.

Chapter 8 — Mixing Fundamentals

Mixing is the process of balancing individual tracks against one another so that the result tells the story the listener is supposed to hear. A mix is not a restoration of some lost truth in the recording; it is a set of deliberate choices about what to foreground, what to push back, what to color, and what to leave alone. The main tools are level, panning, equalization, dynamics, time-based effects (reverb and delay), and automation over time.

Level is the biggest and crudest lever. Before touching any other processor, set static faders so that the instruments sit in roughly the right proportions against one another. A surprising amount of a mix is just this static balance: a drum kit left slightly too loud or a vocal left slightly too quiet does more damage than any EQ decision you will make later. Panning places sources in the stereo image. Mono elements that anchor the song — kick, snare, bass, lead vocal — usually sit in the center; stereo elements such as overheads, pianos, pads, and background vocals spread outward; and careful symmetry makes the mix sound balanced in headphones as well as on speakers.

Equalization shapes the spectral content of each track. A parametric EQ with adjustable frequency, gain, and Q (bandwidth) per band lets you do two things: corrective work such as rolling off rumble below eighty hertz on a vocal or notching a room resonance out of a guitar, and creative work such as carving space for the kick and bass to coexist or adding presence to a vocal around three kilohertz. The most important EQ technique is not boosting a band to bring something forward; it is cutting a competing band on another instrument to make room. Every frequency shared between two tracks is a fight, and the mix engineer’s job is to negotiate those fights.

Dynamics processors manage the relationship between loud and quiet. A compressor reduces the level of peaks above a threshold by a ratio, effectively shrinking dynamic range, and with careful attack and release settings it can glue a performance together or shape the envelope of a transient. An expander or gate does the opposite, suppressing audio below a threshold to remove bleed and noise. Used tastefully, compression is inaudible and brings a track forward; used carelessly, it flattens music into lifelessness. A limiter is a compressor with a very fast attack and a very high ratio, used to catch occasional peaks without touching the rest of the signal.
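The static transfer curve of a compressor reduces, in code, to a few lines; the threshold and ratio values below are arbitrary examples:

```python
def compressor_gain_db(input_db, threshold_db=-20.0, ratio=4.0):
    """Static compressor curve: above threshold, every `ratio` dB of
    input yields only 1 dB of output; below threshold, unity gain.
    Returns the gain change in dB (negative = gain reduction)."""
    if input_db <= threshold_db:
        return 0.0
    over = input_db - threshold_db
    return (over / ratio) - over

# A peak 12 dB over a -20 dB threshold at 4:1 comes out 3 dB over,
# a gain reduction of 9 dB
print(compressor_gain_db(-8.0))
```

Attack and release, the parameters that make compression musical or lifeless, govern how quickly this gain change is applied and released over time; the static curve above is only the destination the gain moves toward.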

Reverb and delay place sources in a virtual space and in time. Short plate-like reverbs add sheen without pushing a source backward; long hall reverbs simulate distance; slap-back delays create a sense of room without the wash of reverb; rhythmic delays add a musical echo aligned to tempo. Almost always, time-based effects live on sends and buses rather than on individual inserts, so that many tracks share the same reverb and consequently sound as if they are in the same space. Finally, automation — the recording of moves on levels, pans, sends, and plug-in parameters — turns a static mix into a dynamic one, where a chorus comes forward, a verse retreats, and a single consonant is ducked out of the way of another instrument’s entrance.

Owsinski frames mixing as the interplay of six elements: balance, frequency range, panorama, dimension, dynamics, and interest. A mix that handles the first five but has no interest is technically clean and emotionally flat. The job of the engineer, once the mechanical problems are solved, is to leave enough rough edges and moments of surprise that a listener wants to keep listening.

Chapter 9 — Mastering Basics

Mastering is the final production stage, in which finished mixes are prepared for distribution. The mastering engineer is the last set of ears, working on the two-track stereo output rather than on multi-track stems, and bringing specific goals: consistency across a release, translation across playback systems, competitive loudness for the medium, and technical compliance with format standards.

The mastering tool set is a narrower version of the mixer’s. A chain typically includes a gentle corrective equalizer, a careful broadband or multiband compressor, an optional “glue” processor, a brickwall limiter, and a high-quality dither stage. Processing is usually subtle — small fractions of a decibel — because the mix comes in already balanced. If a mastering chain is doing more than three or four decibels of anything, something is probably wrong upstream.

Loudness is the most debated topic in modern mastering. The historical loudness war pushed masters to clip so aggressively that dynamics were crushed out entirely. The rise of loudness normalization on streaming services — Spotify, Apple Music, YouTube, and Tidal play back to a target around minus fourteen LUFS — has defanged that incentive. A mix mastered to minus eight LUFS is turned down to match one mastered to minus fourteen, and the more dynamic mix often sounds better after the adjustment. Current best practice is to master for the platform: podcasts around minus sixteen LUFS integrated; streaming music around minus fourteen; broadcast has its own standards (EBU R128, ATSC A/85).
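The arithmetic of loudness normalization is worth seeing in miniature. A sketch, using the platform figures quoted above; real services differ in whether and how far they turn quiet masters up:

```python
def normalization_gain_db(measured_lufs, target_lufs=-14.0):
    """Gain offset a streaming service would apply to bring a master
    to its loudness target. A loud master is simply turned down."""
    return target_lufs - measured_lufs

print(normalization_gain_db(-8.0))   # the -8 LUFS master loses 6 dB
print(normalization_gain_db(-14.0))  # already at target: no change
```

After normalization the crushed master plays no louder than the dynamic one, which is why the loudness-war incentive has largely collapsed.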

The final product is delivered in the appropriate format — a DDP image for CD replication, 44.1 kHz 16-bit WAV for digital distribution, platform-specific targets for streaming. ISRC codes and metadata are embedded. A reference master is saved, because three months later somebody will want a radio edit or an instrumental version.

Chapter 10 — Listening as a Critical Practice

Every other chapter in this book assumes that you can hear what you are doing, and hearing in the technical, critical sense is a skill that has to be developed deliberately. Murray Schafer, whose The Soundscape launched the acoustic ecology tradition in the 1970s, coined the term ear cleaning for the practice of sitting down and listening to the sonic environment without trying to do anything about it: identifying sources, textures, distances, foregrounds and backgrounds, and the signatures that give a place its unique soundscape. The exercise is not mystical; it is calibration. A listener who has done it for a few hours picks up on room tone, HVAC, electrical hum, and microphone self-noise that an unpracticed listener simply does not notice.

Schafer’s vocabulary is useful because it gives names to things that otherwise slip past. A keynote is the ground-bass of a soundscape, the sound that is always there and that one stops consciously hearing — the traffic hum of a city, the surf on a coast, the wind in a forest. A signal is a foregrounded sound demanding attention — a bell, a siren, a phone. A soundmark, by analogy with a landmark, is a sound particular to a place and valued by the community there. Acoustic ecology asks whether the soundscape of a place has been degraded, whether its soundmarks are in danger, and whether production and design choices contribute to or push back against the homogenization of sonic environments.

Critical listening in the production sense overlaps with but is not identical to acoustic-ecology listening. When evaluating a mix, you listen for balance, frequency distribution, dynamics, spatial placement, and timing, and you learn to describe what you hear in language specific enough to act on. Instead of “the vocal sounds off,” you train yourself to say “there is a buildup around three hundred hertz making the vocal cloudy, and the sibilance around seven kilohertz is fatiguing.” Chion’s Audio-Vision distinguishes several modes of listening — causal (what is making this sound?), semantic (what does it mean?), and reduced (what are its intrinsic qualities as sound, independent of source and meaning?) — and the reduced mode is the one a mix engineer has to be able to switch into on demand. It is also the mode in which most sound art and musique concrète live.

A reflective listening exercise, of the kind many audio courses assign, is an invitation to do all of this in writing: to sit somewhere, listen for twenty or thirty minutes, notice what one notices, and then report on it with care. The payoff is practical — better ears make better mixes — and also ethical, because the practice reminds the producer that the sonic environment is something that can be attended to, described, and changed on purpose.

Chapter 11 — Sound for Reality Radio, Podcasts, and Documentary

Audio storytelling is a distinct craft with a distinct lineage. Radio features in the BBC and CBC traditions, This American Life, the boom in narrative podcasts after Serial, and the contemporary independent podcast ecosystem rest on the same basic idea: that a story told in sound can put a listener in the presence of people they have never met, with an intimacy that text and video cannot quite reach. Biewen and Dilworth’s Reality Radio is the standard anthology on this craft, and its contributors return again and again to two themes: structure and voice.

Structure is how the story is put together. A narrative podcast is not a recorded essay; it is a sequence of scenes joined by narration. A scene is a moment where something happens. Between scenes, the narrator contextualizes and connects. Ira Glass’ formulation breaks the form into anecdote — pulling the listener forward by “and then what happened?” — and reflection — stepping back to ask “why does this matter?” A story that is all anecdote is trivial; a story that is all reflection is a lecture; the craft is the braid. Eric Nuzum’s Make Noise insists that a show needs a clear premise, a clear audience, and a clear promise to that audience.

Voice has two meanings. Narrowly, it is the sound of the person speaking — the tone, rhythm, and personal quality that makes the listener feel spoken to. Broadly, it is the point of view: whose story is this, whose values are in play? Documentary ethics live in this second sense. Whose silences are you cutting out? Whose stumbles are you keeping in? The Reality Radio writers are unanimous that these choices cannot be outsourced to the mix.

Technically, the craft has a short list of staples. Tape is raw recorded material, and tape is king; no amount of clever narration can rescue a piece with thin tape. Ambience or room tone is a few seconds of each scene’s background, used to smooth edits and fill pauses under narration. Tracking is recording the host’s narration, take by take against a script. Scoring is the use of music under scenes to shape pacing, used judiciously because music can manipulate the listener in ways that erode trust. Mixing a podcast targets around minus sixteen LUFS integrated, with dialogue intelligible in a car with windows down. Distribution is solved by RSS: a podcast is at heart an RSS feed that points to MP3 files on a host, read by directories such as Apple Podcasts, Spotify, and Overcast. The creator owns the feed; directories merely link to it.
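
The claim that a podcast is "at heart an RSS feed" can be made concrete. The sketch below builds a minimal RSS 2.0 feed with Python's standard library: a channel whose items each carry an enclosure pointing at a hosted MP3, which is the structure directories actually read. The show title and URLs are placeholders, not a real feed.

```python
import xml.etree.ElementTree as ET

def build_feed(title, episodes):
    """Build a minimal podcast RSS feed. Each episode dict needs a title,
    an MP3 URL, and a byte length for the <enclosure> element. Real feeds
    add iTunes namespace tags, descriptions, and publication dates."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    for ep in episodes:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = ep["title"]
        # The enclosure is the whole point: it is the pointer to the audio.
        ET.SubElement(item, "enclosure", url=ep["mp3_url"],
                      length=str(ep["bytes"]), type="audio/mpeg")
    return ET.tostring(rss, encoding="unicode")

feed = build_feed("Example Show",
                  [{"title": "Episode 1",
                    "mp3_url": "https://host.example/ep1.mp3",
                    "bytes": 12345678}])
```

Ownership follows from the structure: whoever controls the URL where this XML lives controls the show, because Apple Podcasts, Spotify, and Overcast only ever link to it.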

Acoustic justice and digital storytelling for social impact extend this toolkit to communities whose stories the mainstream has historically ignored. Oral history projects, community radio, and participatory podcasts use the same tools to center voices that might otherwise go unheard. The technical skill is identical; the ethical posture — of listening first, of shared authorship, of consent — is the harder part.

Chapter 12 — Music Production Basics

Producing a piece of music means taking it from an idea to a finished master. A common arc runs from pre-production through tracking, editing, mixing, and mastering. Pre-production is arrangement and rehearsal: decisions about tempo, key, form, and instrumentation. A song that is written and arranged well is much easier to produce than one being repaired in the studio.

Tracking is the recording of the performances. A rhythm section might be tracked together in a room, then overdubs added one at a time. An electronic project might never involve a microphone, with every sound originating in software synthesizers. The constants are a metronome reference, careful gain staging, and the habit of capturing multiple takes.

MIDI — Musical Instrument Digital Interface — is the grammar of electronic music production. A MIDI track contains note-on and note-off events, velocities, and controller data, none of which is audio; the sound is generated when those events are sent to a software or hardware instrument. The advantages are obvious: you can edit a wrong note, change the instrument without re-recording, and quantize timing. The disadvantage is that quantization can kill feel, and a heavily quantized part sounds robotic. Good MIDI practice keeps some human timing variance and uses groove templates rather than rigid grid snapping.
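
Partial quantization is easy to express in code. The sketch below is illustrative, not any DAW's algorithm: each note is moved toward the nearest grid line by a strength factor, so full strength snaps hard (the robotic case) while partial strength preserves some of the player's timing.

```python
def quantize(time_beats, grid=0.25, strength=0.6):
    """Move a note-on time toward the nearest grid line.

    time_beats -- note position in beats
    grid       -- grid spacing in beats (0.25 = sixteenth notes in 4/4)
    strength   -- 0.0 leaves the note alone, 1.0 snaps it exactly
    """
    nearest = round(time_beats / grid) * grid
    return time_beats + strength * (nearest - time_beats)

# A note played 70 ms worth of beats late (1.07) at 50% strength
# lands at 1.035 -- tightened, but still humanly early/late.
```

Groove templates take the same idea further by quantizing toward a stored rhythmic feel (say, a drummer's swing) instead of a rigid mathematical grid.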

Arrangement in the DAW is where a song’s shape is sculpted. Intros, verses, choruses, and bridges are moved and duplicated; instrumentation is thinned in some sections and thickened in others; drops, lifts, and transitions are built from automation, risers, and reversed hits. The arranging toolkit of popular music — filter sweeps, drum fills, changes in chord voicing, the introduction and removal of instruments at structural boundaries — is implemented with automation and clip arrangement rather than notation.

The producer is always moving between musical decisions, technical decisions, and aesthetic decisions. The work is iterative, and the best producers can sit with a mix for days without losing perspective.

Chapter 13 — Sound for Digital Games and Interactive Media

Sound for linear media — film, radio, podcasts, recorded music — is fundamentally a timeline problem. You know exactly when the character opens the door, so you place the door sound exactly then. Sound for games is different, because the player decides when the door opens, and the sound has to be ready for any timing, any repetition, any context. Karen Collins’ Game Sound and the documentation for middleware engines such as FMOD and Wwise frame game audio around this interactivity.

The architecture of game audio is typically layered. At the bottom sits a pool of sound assets — short audio files recorded or synthesized ahead of time: footsteps, gunshots, UI clicks, voice lines, musical stems. Above this sits a middleware engine such as FMOD Studio or Audiokinetic Wwise, which exposes the assets to the game engine (Unity, Unreal, or a custom engine). The middleware lets sound designers define events that the game code triggers by name — “player/footstep”, “weapon/pistol/fire” — rather than hard-coding specific files.

That logic is where the craft lives. A footstep event on stone might randomly pick one of six samples to avoid the “machine gun” effect, pitch-shift it by a few cents, vary its level slightly, and pan it according to the player’s camera orientation. A gunshot might layer a close perspective with a distant perspective whose reverb wetness depends on the real-time size of the space the player is in. A musical system might cross-fade between layers of a stem mix as the player’s state changes — calm, tense, combat — with transitions quantized to the next beat.
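
The footstep logic described above can be sketched in a few lines. This is an illustration of the technique, not the FMOD or Wwise API (both implement sample pools, pitch randomization, and level variation natively; the class and parameter names here are invented for the example).

```python
import random

class FootstepEvent:
    """A triggerable event that picks from a sample pool with variation,
    avoiding immediate repeats -- the 'machine gun' fix."""

    def __init__(self, samples, pitch_cents=30.0, level_db=1.5):
        self.samples = list(samples)
        self.pitch_cents = pitch_cents   # max random detune, +/- cents
        self.level_db = level_db         # max random gain change, +/- dB
        self.last = None

    def trigger(self, rng=random):
        # Never play the same sample twice in a row.
        choices = [s for s in self.samples if s != self.last] or self.samples
        sample = rng.choice(choices)
        self.last = sample
        return {
            "sample": sample,
            "pitch_cents": rng.uniform(-self.pitch_cents, self.pitch_cents),
            "gain_db": rng.uniform(-self.level_db, self.level_db),
        }

stone = FootstepEvent([f"footstep_stone_{i}.wav" for i in range(6)])
```

Each call to `trigger` returns playback parameters the engine would apply; six source files multiplied by continuous pitch and level variation yields effectively endless non-repeating footsteps.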

Adaptive and interactive music are two flavors of the same idea. Adaptive music changes in response to game state: combat layer fades in, exploration layer fades out. Interactive music responds directly to player input: a rhythm game scores the player’s timing against a beat grid. Collins traces the lineage from chiptunes through Red Book CD audio to modern middleware, and argues that what makes game sound feel right is that the player cannot predict the moment of any given cue, so the designer has to build a system whose outputs are plausible over the entire space of player actions.
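
Quantizing a transition "to the next beat" is a small scheduling calculation. The helper below is a sketch under simple assumptions (constant tempo, time measured from the start of the music), not middleware code: it returns the next grid boundary at which a layer cross-fade may begin.

```python
import math

def next_transition_time(now_s, bpm, quantize_beats=1.0):
    """Return the time in seconds of the next musical boundary.

    quantize_beats=1.0 waits for the next beat; 4.0 waits for the next
    bar in 4/4. Assumes constant tempo and a clock starting at beat 0.
    """
    grid = (60.0 / bpm) * quantize_beats
    return math.ceil(now_s / grid) * grid

# At 120 BPM a beat is 0.5 s: a combat trigger arriving at t = 1.3 s
# holds the cross-fade until the beat at t = 1.5 s.
```

Deferring the fade this way is what keeps state changes musical: the player hears the music respond, but never hears it lurch mid-beat.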

Spatial audio in games extends stereo and surround techniques to three-dimensional sound using interaural time and level differences, head-related transfer function filtering, and distance-dependent attenuation. VR and AR push this further, requiring audio to update in real time as the player’s head rotates. Games reward iterative discipline: the designer playtests constantly, notices that a sword swing feels weak or that ambient music wears thin after ten loops, and returns to the middleware to adjust.
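
Distance-dependent attenuation, the simplest of these cues, is typically computed with an inverse-distance model of the kind game engines use as a default. The sketch below follows the common "inverse distance clamped" shape: full level inside a reference distance, then gain falling off inversely beyond it. Parameter names are illustrative.

```python
def distance_gain(dist_m, ref_m=1.0, rolloff=1.0):
    """Linear gain (0..1) for a source at dist_m metres.

    Inside ref_m the source plays at full level; beyond it, gain falls
    as ref / (ref + rolloff * (dist - ref)). Higher rolloff values make
    the world feel more acoustically 'dry' and enclosed.
    """
    if dist_m <= ref_m:
        return 1.0
    return ref_m / (ref_m + rolloff * (dist_m - ref_m))
```

A source at twice the reference distance plays at half gain (about minus six decibels), which matches the intuition that doubling distance halves pressure amplitude in a free field; real engines layer low-pass filtering and reverb sends on top of this bare gain curve.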

Chapter 14 — Sound Art and the Acoustic Ecology Tradition

Sound art is a broad category for artistic practices that take sound as their primary medium rather than as a support for something else. Gallery installations, field recording projects, phonographic releases, radio works, sound sculptures, and durational listening pieces all belong here. What unites them is a refusal to subordinate sound to image, narrative, or song form.

The intellectual lineage runs through several overlapping traditions. Musique concrète, founded by Pierre Schaeffer in the 1940s, treated recorded sounds as primary compositional material and introduced reduced listening, the disciplined mode in which one attends to a sound independently of its source. Electroacoustic and computer-music traditions extended this with live processing, spatialization, and algorithmic composition. Acoustic ecology, the Schafer tradition, brought a conservationist ethic to the soundscape. Field recording as an artistic practice treats an unedited or lightly edited recording of a place as a finished work worth an hour of a listener’s attention.

Chion’s audiovisual theory is relevant here too. He distinguishes synchresis — the reflex by which the brain fuses a sound with an image whenever they coincide in time — from the independent expressive power of sound. In sound art, synchresis is often deliberately frustrated; a room-sized installation asks the listener to attend to sound without a visual anchor.

The acoustic ecology lineage carries one last claim. The sonic environment of any community is shaped, for good or ill, by the tools and habits that produce it: urban noise, the default sounds of consumer devices, the musical choices of public spaces, the mix decisions on the news broadcasts that play in every waiting room. Students of sound production are, whether they intend it or not, participants in the making of that environment. The discipline of listening, the habits of the signal chain, the care of the field recordist, the ethics of the documentary editor, and the formal imagination of the sound artist all converge on a single serious claim: that the world sounds the way it sounds in part because people decided, with their microphones and editing tools, that it should. Learning to make those decisions well is what this course is for.

Back to top