Semantic Turn Endpointing
also known as Semantic Turn Detection, Model-Based Endpointing
Decide when a voice user has yielded the floor by classifying the semantic completeness of the partial transcript rather than by waiting out a fixed silence timeout, so the agent replies quickly without cutting the speaker off mid-thought.
Context
A voice agent runs a duplex audio loop: streaming speech-to-text transcribes the caller while text-to-speech plays the agent's reply. The loop must decide, many times per turn, whether the caller has finished speaking or merely paused. A fixed voice-activity-detection silence threshold answers this from raw acoustics, waiting for a gap of, say, 800 milliseconds and then committing the turn.
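The fixed-threshold baseline described above can be sketched in a few lines. This is a minimal illustration, not any particular library's API: the class name `FixedTimeoutEndpointer`, the 20 ms frame size, and the per-frame boolean from a VAD are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class FixedTimeoutEndpointer:
    """Hypothetical baseline: commit the turn after a fixed run of silence."""

    silence_ms: int = 800      # the fixed threshold from the text
    _silent_for_ms: int = 0    # running length of the current silence

    def on_frame(self, is_speech: bool, frame_ms: int = 20) -> bool:
        """Feed one VAD decision; return True when the turn should commit."""
        if is_speech:
            # Any speech resets the silence counter.
            self._silent_for_ms = 0
            return False
        self._silent_for_ms += frame_ms
        return self._silent_for_ms >= self.silence_ms
```

Note that the decision depends only on how long the caller has been quiet; nothing in it looks at what was said.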
Problem
Acoustic silence is a poor proxy for conversational completeness. A long timeout adds close to a second of latency to every reply and makes the agent feel sluggish, while a short timeout fires on natural hesitations and filler words and cuts the speaker off. The same threshold also cannot tell a backchannel such as 'uh-huh' from a genuine attempt to interrupt, so the agent either talks over the caller or freezes mid-sentence.
Forces
- A short silence threshold lowers response latency but commits the turn during natural pauses; a long threshold avoids interrupting but makes every reply feel slow.
- Raw energy-based detection is cheap and easy to ship, but it confuses backchannels and hesitations with end-of-turn and with barge-in.
- Semantic completeness lives in the words, not the waveform, so reading it needs the partial transcript, which itself arrives with streaming-recognition lag.
Example
A customer calls a support line handled by a voice agent and says 'I'd like to change my flight... to, um, the one on Friday.' A fixed-timeout agent commits after the pause at 'change my flight' and answers the wrong question. A semantic-endpointing agent reads the partial transcript, sees the sentence is unfinished, waits through the 'um', and only answers once the caller has actually finished.
Solution
Therefore:
Run a small turn-detection model over the streaming transcript in parallel with voice-activity detection. The model scores whether the user's utterance is a complete thought: a complete utterance commits the turn after a short pause, while an incomplete one waits longer, so the agent answers fast on finished sentences and stays patient through hesitations. While the agent is speaking, a second classifier labels detected user speech as either a backchannel, which is ignored, or a barge-in, which ducks the text-to-speech and yields the floor. The acoustic timeout remains as a fallback that caps the wait, so the turn always eventually commits.
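One way to picture the commit decision is the sketch below. It is an illustration under stated assumptions, not a reference implementation: `SemanticEndpointer`, its thresholds (200 ms short pause, 3000 ms fallback, 0.5 completeness cut-off), and `toy_score` are all hypothetical. In particular, `toy_score` is a crude word-list heuristic standing in for a real turn-detection model, and here an incomplete utterance simply waits for the fallback timeout.

```python
from dataclasses import dataclass
from typing import Callable

# Toy stand-in for a turn-detection model: a trailing filler or function
# word suggests the speaker is not done. A real system would use a model.
DANGLING = {"um", "uh", "to", "the", "and", "but", "so", "my", "a", "on"}


def toy_score(text: str) -> float:
    """Return a made-up probability that the utterance is complete."""
    words = text.lower().replace(",", " ").replace(".", " ").split()
    if not words or words[-1] in DANGLING:
        return 0.1
    return 0.9


@dataclass
class SemanticEndpointer:
    score_completeness: Callable[[str], float]  # hypothetical model hook
    complete_threshold: float = 0.5
    short_pause_ms: int = 200    # pause required after a complete utterance
    fallback_ms: int = 3000      # acoustic timeout that always commits

    def should_commit(self, partial_transcript: str, silence_ms: int) -> bool:
        if silence_ms >= self.fallback_ms:
            return True          # fallback: the turn always eventually commits
        if silence_ms < self.short_pause_ms:
            return False         # too early, whatever the transcript says
        # Between the two bounds, only a complete utterance commits.
        return self.score_completeness(partial_transcript) >= self.complete_threshold


endpointer = SemanticEndpointer(score_completeness=toy_score)
unfinished = "I'd like to change my flight to, um"
finished = "I'd like to change my flight to, um, the one on Friday"
print(endpointer.should_commit(unfinished, silence_ms=400))
print(endpointer.should_commit(finished, silence_ms=250))
```

With the toy scorer, the unfinished transcript is held back at 400 ms of silence and the finished one commits at 250 ms. Swapping in a real model changes only the `score_completeness` callable.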
What this pattern forbids. The agent must not commit a turn on silence duration alone; it may take the floor only after the turn-detection model judges the utterance complete or the fallback timeout elapses, and a detected backchannel must never count as a barge-in.
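The backchannel rule can be illustrated with a small handler for user speech that arrives while the agent is talking. Everything named here is an assumption made for the example: `classify_overlap` is a word-list stand-in for the second classifier, and `TtsPlayer` is a hypothetical playback handle, not a real text-to-speech API.

```python
from enum import Enum


class UserSpeech(Enum):
    BACKCHANNEL = "backchannel"
    BARGE_IN = "barge_in"


# Toy vocabulary; a real classifier would be a trained model.
BACKCHANNELS = {"uh-huh", "mm-hmm", "mhm", "yeah", "right", "okay", "ok"}


def classify_overlap(text: str) -> UserSpeech:
    """Label overlapping user speech (hypothetical heuristic)."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    # One or two acknowledgement tokens (or nothing at all) is a backchannel.
    if len(words) <= 2 and all(w in BACKCHANNELS for w in words):
        return UserSpeech.BACKCHANNEL
    return UserSpeech.BARGE_IN


class TtsPlayer:
    """Hypothetical handle on the agent's text-to-speech playback."""

    def __init__(self) -> None:
        self.playing = True
        self.volume = 1.0

    def duck_and_stop(self) -> None:
        self.volume = 0.0
        self.playing = False


def on_user_speech_while_agent_speaking(text: str, tts: TtsPlayer) -> UserSpeech:
    label = classify_overlap(text)
    if label is UserSpeech.BARGE_IN:
        tts.duck_and_stop()  # yield the floor only on a genuine interruption
    return label             # a backchannel leaves playback untouched
```

A call with `"uh-huh"` leaves `tts.playing` true, while `"wait, that's the wrong date"` stops playback.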
And the patterns that stand alongside it, or against it:
- complements Interruptible Agent Execution ★: Treat pause, resume, and cancel as a first-class control surface on every long-running agent so users can halt expensive or off-track trajectories mid-task while state is preserved for resumption.
- complements Multilingual Voice Agent Stack ★: Compose a voice agent as a tightly co-located pipeline of speech-to-text, language-aware LLM reasoning, and text-to-speech, where one vendor owns all three so language and dialect propagate cleanly across stages.
- complements Liminal-State Detection ·: Infer the human's attentional state (just-woke, focused, winding-down, distracted) from message timing and tone, and adapt response shape so the agent meets the person where they actually are.