Semantic Turn Endpointing

also known as Semantic Turn Detection, Model-Based Endpointing

Decide when a voice user has yielded the floor by classifying the partial transcript's semantic completeness rather than a fixed silence timeout, so the agent replies quickly without cutting the speaker off mid-thought.

Context

A voice agent runs a duplex audio loop: streaming speech-to-text transcribes the caller while text-to-speech plays the agent's reply. The loop must decide, many times per turn, whether the caller has finished speaking or merely paused. A fixed voice-activity-detection silence threshold answers this from raw acoustics, waiting for a gap of, say, 800 milliseconds and then committing the turn.
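The fixed-threshold baseline described above can be sketched in a few lines. This is a minimal illustration, not any particular VAD library's API: the 20 ms frame length and the name `fixed_timeout_endpoint` are assumptions made for the example, and 800 ms is the example figure from the text.

```python
from typing import Iterable, Optional

FRAME_MS = 20             # assumed VAD frame length (hypothetical)
SILENCE_TIMEOUT_MS = 800  # the example threshold from the text


def fixed_timeout_endpoint(frames: Iterable[bool]) -> Optional[int]:
    """Return the time in ms at which the turn commits, or None if it never does.

    Each item in `frames` is True when the VAD heard speech in that frame.
    """
    silence_ms = 0
    heard_speech = False
    for index, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silence_ms = 0  # any speech resets the silence counter
        elif heard_speech:
            silence_ms += FRAME_MS
            if silence_ms >= SILENCE_TIMEOUT_MS:
                # Commit purely on acoustics: the words are never consulted.
                return (index + 1) * FRAME_MS
    return None
```

Note that the decision depends only on the length of the trailing gap, which is why the same rule fires on a mid-sentence hesitation and on a genuine end of turn.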

Problem

Acoustic silence is a poor proxy for conversational completeness. A long timeout adds close to a second of latency to every reply and makes the agent feel sluggish, while a short timeout fires on natural hesitations and filler words and cuts the speaker off. The same threshold also cannot tell a backchannel such as 'uh-huh' from a genuine attempt to interrupt, so the agent either talks over the caller or freezes mid-sentence.

Forces

A short silence threshold lowers response latency but commits the turn during natural pauses; a long threshold avoids interrupting but makes every reply feel slow.
Raw energy-based detection is cheap and easy to ship, but it confuses backchannels and hesitations with end-of-turn and with barge-in.
Semantic completeness lives in the words, not the waveform, so reading it needs the partial transcript, which itself arrives with streaming-recognition lag.

Example

A customer calls a support line handled by a voice agent and says 'I'd like to change my flight... to, um, the one on Friday.' A fixed-timeout agent commits after the pause at 'change my flight' and answers the wrong question. A semantic-endpointing agent reads the partial transcript, sees the sentence is unfinished, waits through the 'um', and only answers once the caller has actually finished.

Diagram

flowchart TD
    A[Mic audio] --> B[VAD plus streaming STT]
    B --> C{Utterance complete?}
    C -- yes --> D[Commit turn and reply]
    C -- no --> B
    F[User speech during TTS] --> G{Backchannel or barge-in?}
    G -- backchannel --> H[Ignore and keep speaking]
    G -- barge-in --> I[Duck TTS and yield floor]

Solution

Therefore:

Run a small turn-detection model over the streaming transcript in parallel with voice-activity detection. The model scores whether the user's utterance is a complete thought; a complete utterance commits the turn after a short pause while an incomplete one waits longer, so the agent answers fast on finished sentences and stays patient through hesitations. While the agent is speaking, a second classifier labels detected user speech as either a backchannel, which is ignored, or a barge-in, which ducks the text-to-speech and yields the floor. The acoustic timeout remains as a fallback upper bound on waiting, so the turn always eventually commits.
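One way to picture the commit rule is the sketch below. It assumes the turn-detection model exposes a completeness score between 0 and 1. The names `EndpointConfig` and `should_commit` and every numeric default are hypothetical values chosen for illustration, not figures from this pattern or from any specific library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointConfig:
    # All values are illustrative assumptions, to be tuned per deployment.
    complete_threshold: float = 0.7  # score at or above this counts as complete
    short_pause_ms: int = 200        # silence required after a complete utterance
    fallback_timeout_ms: int = 3000  # acoustic fallback: always commit by here


def should_commit(
    completeness: float,
    silence_ms: int,
    cfg: EndpointConfig = EndpointConfig(),
) -> bool:
    """Decide whether to commit the user's turn.

    `completeness` is the turn-detection model's score for the partial
    transcript; `silence_ms` is the trailing silence reported by the VAD.
    """
    # Fallback: the turn always eventually commits, whatever the model says.
    if silence_ms >= cfg.fallback_timeout_ms:
        return True
    # Otherwise silence alone is not enough; the model must also agree.
    is_complete = completeness >= cfg.complete_threshold
    return is_complete and silence_ms >= cfg.short_pause_ms
```

With these assumed defaults, a finished sentence commits after 200 ms of silence, while an unfinished one is held until either the score rises or the 3-second fallback elapses.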

What this pattern forbids. The agent must not commit a turn on silence duration alone; it may take the floor only after the turn-detection model judges the utterance complete or the fallback timeout elapses. While the agent is speaking, a detected backchannel must not be treated as a barge-in.
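The backchannel rule can be illustrated with a deliberately simple stand-in for the second classifier. The pattern describes a classifier, not a word list; the lexicon lookup below is a swapped-in toy, and `BACKCHANNELS`, `classify_overlap`, and `tts_action` are hypothetical names introduced only for this sketch.

```python
from enum import Enum

# Toy lexicon standing in for a trained classifier (an assumption).
BACKCHANNELS = {"uh-huh", "mm-hmm", "mhm", "yeah", "right", "okay", "ok", "i see"}


class Overlap(Enum):
    BACKCHANNEL = "backchannel"
    BARGE_IN = "barge_in"


def classify_overlap(partial_transcript: str) -> Overlap:
    """Label user speech that was detected while the agent is speaking."""
    text = partial_transcript.strip().lower().strip(".,!?")
    # No recognized words yet: keep speaking until the transcript says
    # otherwise (a design assumption of this sketch).
    if not text or text in BACKCHANNELS:
        return Overlap.BACKCHANNEL
    return Overlap.BARGE_IN


def tts_action(label: Overlap) -> str:
    """Map the label to what the text-to-speech side should do."""
    if label is Overlap.BACKCHANNEL:
        return "keep_speaking"   # a backchannel never counts as a barge-in
    return "duck_and_yield"      # duck the TTS and yield the floor
```

A production system would replace the lookup with a model that also weighs prosody and timing; the point here is only that the two labels lead to different playback actions.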

And the patterns that stand alongside it, or against it:

complements Interruptible Agent Execution ★: Treat pause, resume, and cancel as a first-class control surface on every long-running agent so users can halt expensive or off-track trajectories mid-task while state is preserved for resumption.
complements Multilingual Voice Agent Stack ★: Compose a voice agent as a tightly co-located pipeline of speech-to-text, language-aware LLM reasoning, and text-to-speech, where one vendor owns all three so language and dialect propagate cleanly across stages.
complements Liminal-State Detection ·: Infer the human's attentional state (just-woke, focused, winding-down, distracted) from message timing and tone, and adapt response shape so the agent meets the person where they actually are.