Streaming & UX

Semantic Turn Endpointing

Decide when a voice user has yielded the floor by classifying the semantic completeness of the partial transcript rather than relying on a fixed silence timeout, so the agent replies quickly without cutting the speaker off mid-thought.

Problem

Acoustic silence is a poor proxy for conversational completeness. A long timeout adds close to a second of latency to every reply and makes the agent feel sluggish, while a short timeout fires on natural hesitations and filler words and cuts the speaker off. The same threshold also cannot tell a backchannel such as 'uh-huh' from a genuine attempt to interrupt, so the agent either talks over the caller or freezes mid-sentence.

Solution

Run a small turn-detection model over the streaming transcript in parallel with voice-activity detection. The model scores whether the user's utterance is a complete thought: a complete utterance commits the turn after a short pause, while an incomplete one waits longer, so the agent answers fast on finished sentences and stays patient through hesitations. While the agent is speaking, a second classifier labels detected user speech as either a backchannel, which is ignored, or a barge-in, which ducks the text-to-speech and yields the floor. The acoustic timeout remains as a backstop, an upper bound on the wait, so the turn always eventually commits.
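The decision logic described above can be sketched roughly as follows. This is a minimal illustration rather than a reference implementation: the threshold values, the names (`score_completeness`, `should_commit_turn`, `handle_overlap`) and the keyword heuristic standing in for the turn-detection model are all assumptions made for the example. A real system would replace the heuristic with a trained model and take the overlap label from the second classifier.

```python
from enum import Enum

# Illustrative values only; the pattern does not prescribe specific numbers.
COMPLETE_THRESHOLD = 0.7   # score at or above this counts as a complete thought
SHORT_PAUSE_S = 0.3        # pause required after a complete utterance
ACOUSTIC_TIMEOUT_S = 1.5   # silence after which the turn commits regardless

# Hypothetical stand-in for the turn-detection model: an utterance that ends
# on a filler or connective word is treated as unfinished.
TRAILING_WORDS = {"and", "but", "so", "because", "to", "the", "um", "uh"}


def score_completeness(partial_transcript: str) -> float:
    """Return a toy completeness score in [0, 1] for a partial transcript."""
    words = partial_transcript.lower().rstrip(".?!, ").split()
    if not words:
        return 0.0
    return 0.2 if words[-1] in TRAILING_WORDS else 0.9


def should_commit_turn(completeness: float, silence_s: float) -> bool:
    """Commit after a short pause if complete, else wait for the acoustic timeout."""
    if completeness >= COMPLETE_THRESHOLD:
        return silence_s >= SHORT_PAUSE_S
    return silence_s >= ACOUSTIC_TIMEOUT_S


class OverlapKind(Enum):
    BACKCHANNEL = "backchannel"
    BARGE_IN = "barge_in"


class AgentAction(Enum):
    KEEP_SPEAKING = "keep_speaking"
    DUCK_AND_YIELD = "duck_and_yield"


def handle_overlap(kind: OverlapKind) -> AgentAction:
    """Map the second classifier's label to what the agent does mid-speech."""
    if kind is OverlapKind.BACKCHANNEL:
        return AgentAction.KEEP_SPEAKING
    return AgentAction.DUCK_AND_YIELD


if __name__ == "__main__":
    for text in ("Book me a flight to Paris.", "Book me a flight to"):
        score = score_completeness(text)
        print(text, "->", should_commit_turn(score, silence_s=0.4))
```

With these assumed values, a finished sentence commits after 0.4 seconds of silence, while the trailing "to" keeps the turn open until the acoustic timeout elapses.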

When to use

A real-time voice agent needs to feel responsive while callers pause, hesitate, and use filler words.
Backchannels and partial interruptions are common and a fixed silence threshold mislabels them.
A streaming transcript is available early enough to score turn completion before the acoustic timeout.
