1.2.2 Captions (Prerecorded)

In Plain Language

SC 1.2.2 Captions (Prerecorded)^[1] requires captions for all prerecorded audio content in synchronized media. The scope is the entire audio track, not only dialogue: captions must convey speaker identification, meaningful sound effects, and music cues that carry information, synchronized with the audio they describe^[1].

Auto-generated captions from a hosting platform are a starting draft, not a conformance artifact. Unreviewed machine transcripts routinely drop punctuation, mangle proper nouns and technical terminology, and omit non-speech audio entirely. A track with those defects does not satisfy 1.2.2. The criterion has a single exception: synchronized media that is a media alternative for text already on the page and is clearly labeled as such^[1].

Why It Matters

Deaf and hard-of-hearing users have no other route to the audio track. A video without captions is an opaque block of content; a video with dialogue-only captions hides plot-bearing sound effects and music cues that hearing users get for free.
1.2.2 is distinct from 1.2.4 Captions (Live) and 1.2.5 Audio Description (Prerecorded). 1.2.2 covers the audio track of prerecorded media; 1.2.4 covers the audio track of live media; 1.2.5 covers visual-only information a blind user would miss. A tutorial that passes 1.2.2 can still fail 1.2.5 if the narrator silently points at the screen.
Captions are parsed by search indexers, clip tools, and translation pipelines. A correct WebVTT file is the same artifact that powers transcript search, chapter navigation, and in-video keyword jumping.
Caption consumers extend past the Deaf and hard-of-hearing audience to users in sound-hostile environments (open-plan offices, transit, shared spaces) and users reading in a second language -- but the conformance bar is set by the first group.

Examples

Do: Attach a WebVTT captions track with speaker cues

✔ Captions track included with speaker labels and accurate timing

<video controls>
  <source src="interview.mp4" type="video/mp4">
  <track kind="captions" src="interview-en.vtt"
         srclang="en" label="English" default>
</video>

<!-- interview-en.vtt -->
WEBVTT

00:00:01.000 --> 00:00:04.500
<v Host>Welcome to the program. Today we discuss...

00:00:05.000 --> 00:00:08.200
<v Guest>Thanks for having me. The key point is...
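
To show how the `<v ...>` voice spans above turn into speaker labels, here is a minimal Python sketch that reads cues out of a WebVTT string. The names `Cue`, `parse_cues`, and `SAMPLE` are hypothetical, and the parser is deliberately simplified: it ignores cue settings, NOTE blocks, and STYLE blocks, so treat it as an illustration rather than a conforming WebVTT parser.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# A timing line: start --> end, where the hours field is optional.
TIMING = re.compile(
    r"^((?:\d{2,}:)?\d{2}:\d{2}\.\d{3})\s+-->\s+((?:\d{2,}:)?\d{2}:\d{2}\.\d{3})"
)
# A voice span at the start of the payload, e.g. <v Host> or <v.loud Host>.
VOICE = re.compile(r"^<v(?:\.[^\s>]+)?\s+([^>]+)>")


@dataclass
class Cue:
    start: str
    end: str
    speaker: Optional[str]  # None when the cue has no voice span
    text: str               # payload with markup tags stripped


def parse_cues(vtt: str) -> List[Cue]:
    """Split a WebVTT document into cues (simplified; see assumptions above)."""
    cues: List[Cue] = []
    # Cue blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", vtt.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIMING.match(line.strip())
            if match is None:
                continue  # header line or optional cue identifier
            payload = "\n".join(lines[i + 1:]).strip()
            voice = VOICE.match(payload)
            speaker = voice.group(1).strip() if voice else None
            text = re.sub(r"<[^>]+>", "", payload).strip()
            cues.append(Cue(match.group(1), match.group(2), speaker, text))
            break
    return cues


SAMPLE = """WEBVTT

00:00:01.000 --> 00:00:04.500
<v Host>Welcome to the program.

00:00:05.000 --> 00:00:08.200
<v Guest>Thanks for having me.
"""

cues = parse_cues(SAMPLE)
for cue in cues:
    print(f"{cue.start} {cue.speaker}: {cue.text}")
```

A cue whose `speaker` comes back as `None` in multi-voice media is a candidate for manual review.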

Don't: Ship unreviewed platform auto-captions

✘ Unreviewed auto-captions with errors -- "aria-labelledby" became "aria label a tribute"

<!-- FAILS: auto-captions are inaccurate -->
<video controls src="lecture.mp4"></video>

<!-- Platform auto-captions show:
     "The aria label a tribute helps a sister technology"
     instead of:
     "The aria-labelledby attribute helps assistive technology"
     ASR errors on proper nouns and technical terms
     leave captions technically present but wrong -->
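
One inexpensive way to catch this class of error before a human review pass is to compare a caption draft against a list of terms the speaker is known to use. The sketch below is an assumption-laden illustration: `missing_terms` and the `glossary` list are hypothetical, the glossary has to be supplied by someone who knows the material, and a clean result does not prove the captions are accurate.

```python
from typing import Iterable, List


def missing_terms(caption_text: str, glossary: Iterable[str]) -> List[str]:
    """Return glossary terms that never appear in the caption text.

    Matching is a case-insensitive substring test, so it only flags drafts
    for human review; it cannot confirm that a caption is correct.
    """
    haystack = caption_text.lower()
    return [term for term in glossary if term.lower() not in haystack]


# Hypothetical glossary for the lecture in the example above.
glossary = ["aria-labelledby", "assistive technology"]

auto_draft = "The aria label a tribute helps a sister technology"
reviewed = "The aria-labelledby attribute helps assistive technology"

print(missing_terms(auto_draft, glossary))  # both terms are reported missing
print(missing_terms(reviewed, glossary))    # nothing is reported
```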

Do: Caption the full audio track, not just speech

✔ Sound effects and music cues are captioned alongside dialogue

<!-- WebVTT with meaningful sounds -->
WEBVTT

00:00:10.000 --> 00:00:12.000
[upbeat music playing]

00:00:12.000 --> 00:00:14.500
[applause]

00:00:15.000 --> 00:00:18.000
<v Speaker>Thank you. Let me begin with an overview.
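
As a rough review aid, a short script can list the bracketed non-speech descriptions in a caption file. This sketch assumes the square-bracket convention used in the example above; `non_speech_descriptions` and `SAMPLE` are hypothetical names. An empty result is not a failure by itself, since some media has no meaningful non-speech audio, but it is a reasonable prompt to listen through the recording again.

```python
import re
from typing import List

# A payload line that consists entirely of one [bracketed description].
BRACKETED = re.compile(r"^\[[^\]]+\]$")


def non_speech_descriptions(vtt: str) -> List[str]:
    """Collect lines that are whole bracketed descriptions such as [applause]."""
    return [
        line.strip()
        for line in vtt.splitlines()
        if BRACKETED.match(line.strip())
    ]


SAMPLE = """WEBVTT

00:00:10.000 --> 00:00:12.000
[upbeat music playing]

00:00:12.000 --> 00:00:14.500
[applause]

00:00:15.000 --> 00:00:18.000
<v Speaker>Thank you. Let me begin with an overview.
"""

found = non_speech_descriptions(SAMPLE)
print(found)
```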

Don't: Ship synchronized media with no captions track at all

✘ No captions track -- Deaf and hard-of-hearing users cannot access the audio content

<!-- FAILS: no <track kind="captions"> -->
<video controls src="webinar.mp4"></video>

<!-- A <track kind="subtitles"> does not satisfy 1.2.2
     either -- subtitles assume the viewer can hear the
     audio and only translate dialogue. Captions are the
     audio track rendered as text. -->
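
The markup side of this failure can be audited mechanically. The following sketch uses Python's standard-library `html.parser` to list `<video>` elements that have no `<track kind="captions">` child with a `src`. The class `CaptionTrackAudit`, the function `videos_missing_captions`, and the sample page are hypothetical. The check sees only static markup: it cannot judge caption quality, it misses tracks added by script, and it will also flag silent video that is out of scope for 1.2.2.

```python
from html.parser import HTMLParser
from typing import List, Optional


class CaptionTrackAudit(HTMLParser):
    """Record each <video> that lacks a <track kind="captions" src=...>."""

    def __init__(self) -> None:
        super().__init__()
        self.missing: List[str] = []
        self._in_video = False
        self._has_captions = False
        self._label: Optional[str] = None  # src used to identify the video

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "video":
            self._in_video = True
            self._has_captions = False
            self._label = attributes.get("src")
        elif not self._in_video:
            return
        elif tag == "source" and self._label is None:
            self._label = attributes.get("src")
        elif tag == "track":
            kind = (attributes.get("kind") or "").lower()
            # A missing kind attribute means subtitles, so it does not count.
            if kind == "captions" and attributes.get("src"):
                self._has_captions = True

    def handle_endtag(self, tag):
        if tag == "video" and self._in_video:
            if not self._has_captions:
                self.missing.append(self._label or "(unknown source)")
            self._in_video = False


def videos_missing_captions(html: str) -> List[str]:
    audit = CaptionTrackAudit()
    audit.feed(html)
    audit.close()
    return audit.missing


PAGE = """
<video controls>
  <source src="interview.mp4" type="video/mp4">
  <track kind="captions" src="interview-en.vtt" srclang="en" label="English" default>
</video>
<video controls src="lecture.mp4">
  <track kind="subtitles" src="lecture-fr.vtt" srclang="fr" label="French">
</video>
<video controls src="webinar.mp4"></video>
"""

flagged = videos_missing_captions(PAGE)
print(flagged)
```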

How to Fix It

Inventory every prerecorded synchronized media asset, meaning anything that pairs an audio track with moving images: tutorials, webinars, product demos, onboarding videos, interviews, conference recordings, and marketing reels. Decorative background video with no audio track is out of scope for 1.2.2.
Produce a verbatim transcript of the audio track, including speaker turns, proper nouns, and technical terms. Treat platform auto-captions as a draft to edit, not an output to ship. Automatic speech recognition systematically fails on product names, code identifiers, and acronyms -- the exact vocabulary a technical audience needs.
Add speaker identification for multi-voice media. WebVTT's voice span <v Speaker Name> attaches a speaker label to a cue; plain bracketed prefixes like [Host] work in any caption format. Without speaker cues, dialogue collapses into an undifferentiated wall of text.
Caption non-speech audio that carries meaning. Sound effects ([door slams], [phone ringing]), ambient cues that establish setting, and music that signals tone or scene change all belong in the track. Omit incidental room tone. The test is whether a hearing viewer is getting information from the sound that a Deaf viewer would otherwise miss^[1].
Synchronize cue timing with the audio. A caption that appears a second late or hangs after the speaker has moved on breaks the lip-reading and context cues caption users rely on. WebVTT cue timestamps take the form HH:MM:SS.mmm (the hours field is optional), which gives millisecond precision.
Attach the track via HTML5 <track> with kind="captions", not kind="subtitles". The two kinds are semantically distinct: subtitles assume the viewer hears the audio and only needs a translation of dialogue; captions are a full rendering of the audio track for viewers who cannot hear it. Set srclang, a human-readable label, and default if this is the primary caption track. WebVTT^[2] is the format browsers parse natively through the <track> element; TTML and SRT are acceptable on players that support them, but WebVTT is the safe choice for native HTML video.
Decide open vs closed captions deliberately. Closed captions (delivered via <track>) let the viewer toggle and restyle them; open captions (burned into the video pixels) cannot be turned off, resized, or restyled, and can be cropped or degraded without warning when the video is re-encoded or clipped. Prefer closed captions unless the delivery platform offers no text track support.
Do not rely on the "media alternative for text" exception unless the page genuinely contains the full text the video presents and the video is labeled as an alternative. This is the only carve-out in 1.2.2 and it is narrow^[1].
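
The timing step above lends itself to an automated sanity check. The sketch below is a minimal, assumption-laden lint rather than a conforming WebVTT validator: `to_ms` and `lint_timing` are hypothetical names, and the checks cover only timestamp syntax, cues that end at or before their start, and cues that start earlier than a previous cue. It cannot tell whether a cue is actually aligned with the audio; that still needs a person watching the video.

```python
import re
from typing import List

TIMING = re.compile(r"^(\S+)\s+-->\s+(\S+)")
# Hours are optional; minutes and seconds must be 00-59; three millisecond digits.
STAMP = re.compile(r"^(?:(\d{2,}):)?([0-5]\d):([0-5]\d)\.(\d{3})$")


def to_ms(stamp: str) -> int:
    """Convert a WebVTT timestamp such as 00:00:04.500 to milliseconds."""
    match = STAMP.match(stamp)
    if match is None:
        raise ValueError(f"not a WebVTT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = match.groups()
    total_seconds = (int(hours or 0) * 60 + int(minutes)) * 60 + int(seconds)
    return total_seconds * 1000 + int(millis)


def lint_timing(vtt: str) -> List[str]:
    """Return human-readable problems found in the cue timing lines."""
    problems: List[str] = []
    latest_start = -1
    for number, line in enumerate(vtt.splitlines(), start=1):
        match = TIMING.match(line.strip())
        if match is None:
            continue  # header, identifier, payload, or blank line
        try:
            start, end = to_ms(match.group(1)), to_ms(match.group(2))
        except ValueError as error:
            problems.append(f"line {number}: {error}")
            continue
        if end <= start:
            problems.append(f"line {number}: cue ends at or before it starts")
        if start < latest_start:
            problems.append(f"line {number}: cue starts before a previous cue")
        latest_start = max(latest_start, start)
    return problems


GOOD = "WEBVTT\n\n00:00:10.000 --> 00:00:12.000\n[upbeat music playing]\n\n00:00:12.000 --> 00:00:14.500\n[applause]\n"
BAD = "WEBVTT\n\n00:00:15.000 --> 00:00:14.000\nA\n\n00:00:12.000 --> 00:00:14.500\nB\n"

print(lint_timing(GOOD))
for problem in lint_timing(BAD):
    print(problem)
```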

References

[1] W3C (2023). Understanding Success Criterion 1.2.2: Captions (Prerecorded). W3C. Accessed 2026-04-07. https://www.w3.org/WAI/WCAG22/Understanding/captions-prerecorded.html
[2] W3C (2019). WebVTT: The Web Video Text Tracks Format. W3C. Accessed 2026-04-07. https://www.w3.org/TR/webvtt1/