1.2.4 Captions (Live)

In Plain Language

1.2.4 Captions (Live) is a Level AA criterion that applies to any synchronized media where audio is broadcast live -- webinars, streamed town halls, virtual classes, conference keynotes. A caption track must be rendered synchronously with the audio as the audio is spoken, not stitched in after the event^[1]. The caption track must convey dialogue, speaker identification, and audio cues significant to understanding.

This is the sibling criterion to 1.2.2 (prerecorded captions). The split matters because the production pipeline is different: 1.2.2 is a file-prep problem, 1.2.4 is a live-ingest problem. Post-event captions on a recording do not retroactively satisfy 1.2.4 -- deaf and hard-of-hearing users have to be able to follow the event as it happens, including the Q&A, not read a transcript the next day.

Why It Matters

Two production mechanisms exist. ASR (automatic speech recognition) runs cheaply and scales -- the platform transcribes the audio stream in real time and renders the text as a caption track. Word-error rate climbs on technical vocabulary, proper nouns, accented speech, and overlapping speakers, and ASR has no model of meaning, so a misheard acronym propagates into the caption verbatim. CART (Communication Access Realtime Translation) uses a trained human stenographer typing the audio in real time on a stenotype keyboard; accuracy holds up on jargon, multi-speaker panels, and unusual names because the captioner is listening for sense, not phonemes.
ASR satisfies 1.2.4 only in a narrow technical sense: captions exist and are synchronized to the audio. It frequently falls short of the WCAG definition of captions, which both 1.2.2 and 1.2.4 rely on and which requires conveying the speech and non-speech audio information needed to understand the content^[1]. For unscripted content with domain-specific vocabulary, CART is the more reliable path to conformance.
Deaf and hard-of-hearing users cannot defer participation to a post-event transcript -- the live interaction (Q&A, chat, polls, emergency instructions) is the event. Captions added to the recording 48 hours later address 1.2.2 at best; they do not address 1.2.4.
The caption track also supports users in sound-off environments, second-language users processing written English faster than spoken, and anyone who loses the audio thread to latency, packet loss, or a noisy room. Captions are load-bearing for a much wider audience than the WCAG deaf/hard-of-hearing framing suggests.
Two-way calls are out of scope. The Understanding document states that 1.2.4 is not intended to require captions for two-way multimedia calls between two or more individuals through web apps -- the criterion targets broadcast-style live media, where a host is producing the stream for an audience^[1].

Examples

Do: Use a professional CART captioner for live events

<div class='live-stream'>
  <video id='live-feed' autoplay>...</video>
  <div role='log' aria-live='polite' aria-label='Live captions'>
    <p>[CART captioner output streams here]</p>
  </div>
</div>

✔ Professional CART captioner provides accurate, real-time captions

<div class="live-stream">
  <video id="live-feed" autoplay>
    <source src="stream-url" type="video/mp4">
  </video>

  <!-- CART captioner output displayed in real time -->
  <div role="log" aria-live="polite"
       aria-label="Live captions">
    <p>Good morning, everyone. Today we will review
    the quarterly accessibility audit results.</p>
  </div>
</div>

<!-- A CART (Communication Access Realtime
     Translation) provider types captions live.
     A commonly quoted target is 98%+ accuracy,
     including on technical content, proper nouns,
     and multi-speaker dialogue. -->

Don't: Rely solely on unmonitored auto-captions

<video id='stream' autoplay>...</video>

Speaker says: "The Section 508 refresh aligned with WCAG 2.0 Level AA."

Auto-caption shows: "The section 500 and 8 refresh a line with wag two point oh level double a"

✘ Unreviewed auto-captions garble technical terms and acronyms

<!-- FAILS: auto-captions with no human review -->
<video id="stream" autoplay>
  <source src="live-url" type="video/mp4">
</video>

<!-- Platform auto-captions enabled, but:
     - "Section 508" becomes "section 500 and 8"
     - "WCAG" becomes "wag"
     - Speaker names are not identified
     - No punctuation or capitalization
     Nobody is monitoring or correcting the output -->
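
One way to put a number on the garbling shown above is word-error rate (WER): the word-level edit distance between what was said and what the caption showed, divided by the number of words spoken. The sketch below is a minimal illustration, not a standard scoring tool; the function names and the tokenization rule (lowercase, punctuation stripped, decimals such as "2.0" kept whole) are assumptions made for this example.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase and split into word tokens, keeping decimals like '2.0' whole.

    This tokenization rule is an assumption made for the example.
    """
    return re.findall(r"[a-z0-9]+(?:\.[0-9]+)?", text.lower())


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # word dropped from the caption
                curr[j - 1] + 1,     # extra word inserted in the caption
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the number of words actually spoken."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


spoken = "The Section 508 refresh aligned with WCAG 2.0 Level AA."
shown = "The section 500 and 8 refresh a line with wag two point oh level double a"
print(f"WER: {word_error_rate(spoken, shown):.2f}")  # prints "WER: 1.10"
```

WER can exceed 1.0, as it does here, because every extra word the recognizer inserts counts as an error against a ten-word reference. A single score like this is only a rough signal; it does not capture missing speaker labels or punctuation.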

Do: Embed captions directly in the video player with speaker identification

<video id='webinar' autoplay>
  <track kind='captions' src='live-feed.vtt' srclang='en' label='English' default>
</video>

✔ Live caption track with speaker labels and accurate terminology

<video id="webinar" autoplay>
  <source src="webinar-stream" type="video/mp4">
  <track kind="captions" src="live-feed.vtt"
         srclang="en" label="English" default>
</video>

<!-- Contents of live-feed.vtt (WebVTT streamed in real time): -->

WEBVTT

NOTE Speaker identification helps users follow
multi-person conversations.

00:01:05.000 --> 00:01:08.500
<v Sarah Chen>The compliance deadline is March 15th.

00:01:09.000 --> 00:01:12.000
<v David Park>Which standards apply -- WCAG 2.1 AA
or the full 2.2 update?
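
Cues like the ones above can be produced programmatically from a captioner's output. The following is a minimal sketch, assuming the caption source hands over a start time, end time, speaker name, and text for each phrase; `format_timestamp` and `format_cue` are hypothetical helper names, not part of any platform's API.

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as a WebVTT timestamp (hh:mm:ss.ttt)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def escape_cue_text(text: str) -> str:
    """Escape the characters that have special meaning in WebVTT cue text."""
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def format_cue(start: float, end: float, speaker: str, text: str) -> str:
    """Build one WebVTT cue with a <v Speaker> voice tag.

    Hypothetical helper for illustration; times are seconds from stream start.
    """
    if end <= start:
        raise ValueError("cue must end after it starts")
    if any(ch in speaker for ch in ">\r\n"):
        raise ValueError("speaker name cannot contain '>' or line breaks")
    voice = speaker.replace("&", "&amp;")
    timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
    return f"{timing}\n<v {voice}>{escape_cue_text(text)}"


print(format_cue(65.0, 68.5, "Sarah Chen", "The compliance deadline is March 15th."))
```

Escaping matters in a live feed because the captioner types free text: an unescaped `<` or `>` in a phrase could otherwise be read as markup or as part of a timing line.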

Don't: Promise captions will be available "after the event"

<video id='town-hall' autoplay>...</video>

<p>Captions will be added to the recording within 48 hours.</p>

✘ Deaf users are excluded from the live event -- post-event captions do not satisfy this criterion

<!-- FAILS: no captions during the live event -->
<video id="town-hall" autoplay>
  <source src="townhall-stream" type="video/mp4">
</video>

<p>Captions will be added to the recording
within 48 hours.</p>

<!-- This fails WCAG 1.2.4 because captions must
     be available DURING the live broadcast.
     Post-event captions satisfy 1.2.2 (prerecorded)
     but not 1.2.4 (live). Deaf users cannot
     participate in Q&A or real-time discussion. -->

How to Fix It

Inventory the live synchronized media in scope. This covers webinars, livestreamed town halls, virtual conference sessions, online classes, product launches, and any other broadcast that pairs live audio with video. Two-way calls between individuals are out of scope for 1.2.4^[1], but once a session becomes host-to-audience it is in scope.
Decide the mechanism per event, not per platform. For scripted or low-jargon content (marketing keynotes, tightly scripted webinars), ASR with a human monitor correcting on the fly is often sufficient. For anything unscripted, technical, multi-speaker, or legally sensitive -- public meetings, earnings calls, medical briefings, product technical deep-dives -- engage a CART provider. Do not let platform-default ASR become the fallback by accident.
Route CART output into the platform's caption ingest. Most major meeting and streaming platforms (for example Zoom, Microsoft Teams, Google Meet, YouTube Live, Vimeo) expose live caption integration through a caption track endpoint that a CART provider's software can post to, so the stenographer's text renders in the same caption UI the platform uses for ASR. Treat the endpoint name and auth model as platform-specific and verify against the vendor's current documentation before configuring a live event.
If ASR is the mechanism, assign a human monitor. ASR without a corrective loop passes the synchrony test but often fails the equivalency test on acronyms, names, and jargon. A monitor with edit access to the caption stream can correct recurring misrecognitions in real time; this hybrid is cheaper than full CART and materially closer to what the WCAG captions definition requires^[1].
Render speaker identification. Panels, Q&A, and multi-host events need speaker labels in the caption track -- WebVTT supports this via the <v Speaker Name> cue syntax, and CART providers type speaker changes inline. Without identification, the caption stream collapses distinct voices into one undifferentiated transcript.
Default captions to on. Enable the caption display on the player so viewers do not have to discover a toggle mid-event. If you expose a "CC" control, captions should start in the on state for live synchronized media in scope for 1.2.4, with the control available to turn them off.
Rehearse the caption path. Do a technical check before the live broadcast: verify the caption source (CART or ASR) is connected to the platform's ingest, verify latency is within a few seconds of the audio, verify the caption track is legible at the sizes and contrasts your audience will see, and verify speaker labels resolve correctly. Catch the mis-wired ingest in rehearsal, not in the first ten minutes of the event.
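
Part of the rehearsal step can be scripted. The sketch below assumes someone records a few test phrases during the technical check, noting when each was spoken, when its caption appeared, and which speaker label was shown. The `CaptionSample` type, the `rehearsal_report` function, and the 5-second default threshold are all assumptions for illustration; the text above only asks for latency "within a few seconds".

```python
from dataclasses import dataclass


@dataclass
class CaptionSample:
    """One test phrase observed during rehearsal (hypothetical record shape)."""
    spoken_at: float     # seconds into the stream when the phrase was spoken
    displayed_at: float  # seconds into the stream when its caption appeared
    speaker_label: str   # label shown with the caption, empty if none


def rehearsal_report(samples: list[CaptionSample],
                     max_latency_s: float = 5.0) -> list[str]:
    """Return a list of problems found; an empty list means the check passed.

    max_latency_s is an assumed threshold, not a WCAG-defined value.
    """
    if not samples:
        return ["no captions observed: check that the caption source "
                "is connected to the platform's ingest"]

    problems = []
    worst = max(s.displayed_at - s.spoken_at for s in samples)
    if worst > max_latency_s:
        problems.append(
            f"worst caption latency {worst:.1f}s exceeds {max_latency_s:.1f}s")

    unlabeled = sum(1 for s in samples if not s.speaker_label.strip())
    if unlabeled:
        problems.append(f"{unlabeled} caption(s) shown without a speaker label")
    return problems


samples = [
    CaptionSample(spoken_at=10.0, displayed_at=12.5, speaker_label="Host"),
    CaptionSample(spoken_at=30.0, displayed_at=37.0, speaker_label=""),
]
for problem in rehearsal_report(samples):
    print(problem)
```

Legibility at the audience's sizes and contrasts still needs a human looking at the player; this only covers the parts of the check that reduce to timestamps and labels.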

References

[1] W3C (2023). Understanding Success Criterion 1.2.4: Captions (Live). W3C. Accessed 2026-04-07. https://www.w3.org/WAI/WCAG22/Understanding/captions-live.html