1.2.1 Audio-only and Video-only (Prerecorded)

In Plain Language

[1.2.1 Audio-only and Video-only (Prerecorded), Level A]^[1] applies to prerecorded media that carries information in only one sensory channel. If the content is audio-only -- a podcast episode, an interview recording, a voice memo embedded via <audio> -- ship a text transcript that presents the same information in correctly sequenced prose, including speaker labels and meaningful non-speech sounds.

If the content is video-only -- a silent product animation, surveillance footage, an assembly demo with no narration embedded via <video> -- ship either a text description of the visual track or a separate audio track that narrates it. Either alternative has to convey the same information a sighted viewer would extract from the pixels^[1].

Why It Matters

Given an <audio controls src="podcast.mp3"></audio> with no accompanying transcript, a screen reader typically announces "audio" and exposes play and seek controls. No assistive technology can surface the speech inside the file, so a deaf or hard-of-hearing user, or a deaf-blind user reading on a braille display, gets the player chrome and nothing else.
A blind user landing on a <video> element with a silent visual track hears nothing informative when they press play: a video-only file has no soundtrack, or at most incidental background audio. Without a text description or a descriptive audio track, the visual information -- on-screen text, demonstrated steps, spatial relationships -- is unrecoverable from the element itself.
1.2.1 is the transcript criterion, not the caption criterion. Captions live in 1.2.2 Captions (Prerecorded) and apply to synchronized media where audio and video carry complementary information. A podcast episode is audio-only and falls under 1.2.1; a webinar recording with a talking head and slides is synchronized media and falls under 1.2.2. Shipping captions on a podcast does not satisfy 1.2.1 -- the criterion asks for a time-independent text document that a user can read, search, and navigate with a screen reader^[1].
The W3C defines an acceptable alternative as a "document including correctly sequenced text descriptions of time-based visual and auditory information," with speaker identification and other significant audio such as applause, laughter, and audience questions where they carry meaning^[1]. A partial summary or marketing blurb does not clear the bar.
Transcripts compound in value beyond the primary audience: they expose the content to full-text search, they can be indexed by search engines, and they serve users on metered connections or in quiet environments who cannot play audio. The criterion targets deaf and blind users; the downstream benefits reach everyone who would rather read than listen.
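The transcript-versus-caption distinction above can be sketched in markup. This is a minimal illustration, not a conformance template: the files webinar.mp4 and webinar.en.vtt are hypothetical names introduced only for the contrast.

```html
<!-- Audio-only (1.2.1): no captions track; a readable transcript
     sits next to the player -->
<audio controls src="podcast-ep12.mp3"></audio>
<details>
  <summary>Read transcript</summary>
  <p>[Host] Welcome to episode 12. Today we discuss...</p>
</details>

<!-- Synchronized media (1.2.2): captions are attached with <track>.
     webinar.mp4 and webinar.en.vtt are hypothetical file names. -->
<video controls src="webinar.mp4">
  <track kind="captions" src="webinar.en.vtt"
         srclang="en" label="English" default>
</video>
```

The first player is accompanied by text that exists independently of playback; the second relies on timed cues that only appear while the video runs.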

Examples

Do: Provide a transcript for audio-only content

✔ Full transcript provided alongside the audio player

<audio controls src="podcast-ep12.mp3"></audio>

<details>
  <summary>Read transcript</summary>
  <p>[Host] Welcome to episode 12. Today we discuss...</p>
  <p>[Guest] Thanks for having me. The key issue is...</p>
</details>

Don't: Audio-only content with no transcript

✘ No transcript -- deaf and hard-of-hearing users cannot access this content

<!-- FAILS: no transcript provided -->
<audio controls src="podcast-ep12.mp3"></audio>

<!-- Users who cannot hear the audio have
     no way to access the content -->

Do: Provide a text alternative for video-only content

✔ Text description conveys the same visual information

<video controls src="assembly-demo.mp4"></video>

<div class="transcript">
  <h3>Text description</h3>
  <p>Step 1: Attach the base plate to the frame
    using the four corner bolts.</p>
  <p>Step 2: Slide the panel into the guide
    rails from left to right.</p>
</div>

Don't: Video-only content with no alternative

✘ No text alternative -- blind users cannot know what the video demonstrates

<!-- FAILS: no text alternative provided -->
<video controls src="assembly-demo.mp4"></video>

<!-- Blind users have no way to understand
     what is shown in the video -->
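
The criterion also accepts a separate audio track as the alternative for video-only content. A minimal sketch of that option follows, assuming a narrated recording exists; the file name assembly-demo-described.mp3 is hypothetical.

```html
<video controls src="assembly-demo.mp4"></video>

<!-- Hypothetical file: a recording that narrates the same steps
     the silent video demonstrates -->
<h3>Audio description of the assembly demo</h3>
<audio controls src="assembly-demo-described.mp3"></audio>
```

A text description is often the more flexible choice, since it can also be searched and rendered on a braille display, but either form can serve as the alternative.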

How to Fix It

Inventory every prerecorded audio-only and video-only asset. Look for <audio> elements, <video> elements whose source files carry no dialogue track, embedded podcast players, silent GIF-to-MP4 animations, and any media that communicates information in a single sensory channel. Live media is out of scope for 1.2.1 -- live audio-only content is addressed by 1.2.9 Audio-only (Live), a Level AAA criterion.
For each audio-only file, produce a verbatim transcript. Include speaker labels ([Host], [Guest]), the spoken content in sequence, and other sounds that carry meaning (applause, laughter, audience questions, a door slam that ends the scene)^[1]. Background music that is purely atmospheric does not need to be transcribed; a lyric that is part of the content does.
For each video-only file, write a text description of the visual track -- or produce a narrated audio track. The description has to cover on-screen text, demonstrated actions, spatial relationships, and anything else a sighted viewer would pick up. A reader who never watches the video should finish with the same understanding. A one-line summary ("product spins on turntable") is not enough when the video is teaching a procedure.
Expose the alternative in the DOM next to the player. A collapsible <details>/<summary> pair immediately after the media element keeps the transcript discoverable without forcing scroll. A visible <section> with a heading works just as well. A link to a separate transcript page is acceptable if the link is adjacent to the player and its accessible name identifies the target media -- avoid "transcript" on its own when a page has multiple players.
Do not conflate 1.2.1 with 1.2.2. A <track kind="captions"> child of a <video> element satisfies 1.2.2 for synchronized media, not 1.2.1 for audio-only or video-only media. If the asset has only one sensory channel, a time-independent transcript or description is the remediation; a WebVTT caption file on its own is not.
Keep alternatives in sync with the media. When a podcast episode is re-cut or a demo video is reshot, the transcript or description has to be regenerated. Stale alternatives fail the criterion in spirit and mislead the users who rely on them.
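
The linked-transcript option from the placement step above can be sketched as follows for a page with more than one player. The /transcripts/ paths and the podcast-ep13 file are hypothetical, chosen only to show link text that names its target media.

```html
<audio controls src="podcast-ep12.mp3"></audio>
<!-- Link sits directly after the player and names the episode -->
<p><a href="/transcripts/podcast-ep12.html">Transcript: podcast episode 12</a></p>

<audio controls src="podcast-ep13.mp3"></audio>
<p><a href="/transcripts/podcast-ep13.html">Transcript: podcast episode 13</a></p>
```

Because each link's text identifies its episode, a screen reader user browsing a list of links can tell the two transcripts apart without hearing the surrounding content.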

References

[1] W3C (2023). Understanding Success Criterion 1.2.1: Audio-only and Video-only (Prerecorded). Accessed 2026-04-07. https://www.w3.org/WAI/WCAG22/Understanding/audio-only-and-video-only-prerecorded.html