EN 301 549 6.5 -- Video Communication Quality
What It Is
EN 301 549 v3.2.1 clause 6.5 applies where ICT that provides two-way voice communication also includes real-time video. The clause sets three hard floors: the ICT shall support at least QVGA (320x240) resolution, shall support a frame rate of at least 20 frames per second, and shall keep the time offset between speech and video within 100 ms. VGA resolution and 30 fps are called out as preferred. The informative notes attached to the clause prefer end-to-end latency below 400 ms, with preference increasing as delay falls toward 100 ms, and flag that audio leading video harms intelligibility more than video leading audio. Clause 6.5 points at ITU-T F.703 as the underlying model, which treats voice, real-time text, and video as three interleaved media in a single "total conversation" session rather than as independent streams[1].
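The three floors above lend themselves to an automated check. The following is a minimal sketch, not a conformance test: `VideoStats`, its field names, and the sign convention for the offset are hypothetical, and it assumes the stream is measured in landscape orientation. Only the threshold values come from the text above.

```python
from dataclasses import dataclass

# Thresholds as stated in the text above (EN 301 549 clause 6.5).
MIN_WIDTH, MIN_HEIGHT = 320, 240   # QVGA resolution floor
MIN_FPS = 20                       # frame-rate floor
MAX_AV_OFFSET_MS = 100             # ceiling on speech/video time offset


@dataclass(frozen=True)
class VideoStats:
    """Hypothetical snapshot of a live call; field names are illustrative."""
    width: int
    height: int
    fps: float
    av_offset_ms: float  # signed; positive = audio leads video (assumed convention)


def floor_violations(stats: VideoStats) -> list[str]:
    """Return a description of each hard floor the stream misses (empty if none)."""
    problems = []
    if stats.width < MIN_WIDTH or stats.height < MIN_HEIGHT:
        problems.append(f"resolution {stats.width}x{stats.height} is below QVGA")
    if stats.fps < MIN_FPS:
        problems.append(f"frame rate {stats.fps} fps is below {MIN_FPS} fps")
    # The floor applies in both directions, so compare the magnitude.
    if abs(stats.av_offset_ms) > MAX_AV_OFFSET_MS:
        problems.append(
            f"audio-video offset {stats.av_offset_ms} ms exceeds {MAX_AV_OFFSET_MS} ms"
        )
    return problems
```

A monitoring hook could call `floor_violations` on each stats sample and raise an alert when the list is non-empty. The preferred targets (VGA, 30 fps) could be added as a second, softer tier in the same way.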
Why It Matters
Sign languages carry grammar on the face. Brow position marks questions and conditionals, mouth morphemes disambiguate signs that share a handshape, and eye gaze marks referents -- so a video stream that preserves the hands but smears the face strips out much of the linguistic signal. Frame rate is the load-bearing variable: below 20 fps, fingerspelling and rapid handshape transitions fuse into motion blur and the receiver has to guess. Encoders that hit a bandwidth ceiling and respond by dropping frame rate first (instead of dropping resolution first) turn a legible call into an unreadable one without warning. End-to-end latency compounds the problem -- once end-to-end delay climbs past roughly 400 ms, turn-taking collapses because signers can no longer interject or give feedback in time, each has to wait out the other's full utterance before replying, and the call stops feeling like a conversation.
How It Relates to WCAG
WCAG does not set video-call transport thresholds. Clause 6.5 is a telephony-side requirement with no direct WCAG success criterion; it belongs to clause 6 (ICT with two-way voice communication), one of the functional clauses of EN 301 549 that sit alongside, not under, the WCAG-mapped clauses 9 (Web), 10 (Non-web documents), and 11 (Software).
Practical Implications
- Meet the hard floor first: 320x240 minimum, 20 fps minimum, 100 ms maximum audio-video offset. Treat VGA and 30 fps as the real targets.
- Configure adaptive bitrate to drop resolution before frame rate when bandwidth tightens. A 320x240 stream at 25 fps is intelligible to a signer; a 720p stream at 10 fps is not.
- Keep end-to-end latency under 400 ms. Above that, sign-language turn-taking breaks even if every frame is perfect.
- Do not auto-disable video when bandwidth drops. For a Deaf user on a sign-language call, killing the video track ends the conversation.
- Test with native signers on real networks, not with a talking-head benchmark. Rapid hand motion and facial grammar stress encoders in ways a static face does not.
- Prefer codecs and profiles that handle motion without dropping keyframes under congestion.
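The resolution-before-frame-rate rule in the list above can be sketched as a bitrate ladder. This is an illustrative policy only: `Rung`, `LADDER`, and `pick_rung` are hypothetical names, and the bitrate figures are placeholder assumptions, not measurements or values from the standard. A real deployment would derive them from its own encoder and test calls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rung:
    width: int
    height: int
    fps: int
    kbps: int  # assumed bitrate this rung needs; placeholder value


# Hypothetical ladder, best quality first. Resolution steps down to the
# QVGA floor before frame rate is touched, and frame rate never goes
# below the 20 fps floor.
LADDER = [
    Rung(1280, 720, 30, 1500),
    Rung(640, 480, 30, 700),
    Rung(320, 240, 30, 300),
    Rung(320, 240, 25, 250),
    Rung(320, 240, 20, 200),
]


def pick_rung(available_kbps: int) -> Rung:
    """Pick the best rung that fits the estimated bandwidth."""
    for rung in LADDER:
        if rung.kbps <= available_kbps:
            return rung
    # Nothing fits: stay on the floor rung instead of disabling video,
    # since dropping the video track ends a sign-language call.
    return LADDER[-1]
```

For example, `pick_rung(800)` selects the 640x480 rung at 30 fps, and `pick_rung(100)` falls back to 320x240 at 20 fps and leaves the encoder to cope rather than turning video off.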
Related Clauses
Sources
- ETSI EN 301 549 v3.2.1, clause 6.5 Video communication quality[1]
- ITU-T Recommendation F.703 -- Multimedia conversational services ("total conversation")