Unified Voice Interface

also known as Voice Abstraction Layer, TTS/STT/STS Unified API, Provider-Agnostic Voice

Expose text-to-speech, speech-to-text, and real-time speech-to-speech through a single interface so a voice agent can swap providers without rewriting the loop.

This pattern helps complete certain larger patterns —

specialises Multilingual Voice Agent Stack ★: Compose a voice agent as a tightly co-located pipeline of speech-to-text, language-aware LLM reasoning, and text-to-speech, where one vendor owns all three so language and dialect propagate cleanly across stages.

Context

A team is building a voice agent at a moment when the provider landscape is moving fast: OpenAI's realtime API, Google's voice models, ElevenLabs for text-to-speech (TTS), Deepgram for speech-to-text (STT), Azure, Amazon Web Services, and a growing set of on-device options. The agent needs some combination of three voice capabilities: TTS, which turns text into audio; STT, which turns audio into text; and real-time speech-to-speech (STS), which takes audio in and produces audio out without the round-trip through text. Capability, price, and quality shift between providers faster than the team can rewrite application code.

Problem

Each provider ships its own software development kit, with its own streaming chunk format, its own audio framing, its own lifecycle events for things like "the user started talking" or "partial transcript ready", and its own way of exposing real-time speech-to-speech versus the older text-to-speech and speech-to-text shapes. Writing the agent loop directly against one of those kits binds the entire application to that vendor's release cadence and pricing, and forecloses a switch for cost, quality, latency, or feature reasons. The team needs one interface that spans all three modes and treats the provider as a configuration choice.

Forces

TTS, STT, and STS have meaningfully different control-flow shapes (one-shot vs streaming vs bidirectional), but the application wants one mental model.
Realtime speech-to-speech needs bidirectional audio framing — half-duplex APIs cannot fully emulate it.
Provider feature parity is incomplete: not every provider offers all three modes or all voices.
Latency budgets in voice are tight (turn-taking under 300 ms), so the abstraction layer must add very little overhead.
Voice-event vocabulary (turn-start, partial-transcript, barge-in, voice-activity) needs to be unified across providers.
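
One way to picture the unified event vocabulary is a small translation table per provider. The sketch below is illustrative only: the provider names (`provider_a`, `provider_b`) and their raw event names are invented for the example and do not correspond to any real SDK, and the unified names are the ones this pattern lists in its Solution.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class VoiceEventType(Enum):
    TURN_START = "turn_start"
    PARTIAL_TRANSCRIPT = "partial_transcript"
    FINAL_TRANSCRIPT = "final_transcript"
    BARGE_IN = "barge_in"
    VOICE_ACTIVITY_START = "voice_activity_start"
    VOICE_ACTIVITY_STOP = "voice_activity_stop"


@dataclass(frozen=True)
class VoiceEvent:
    type: VoiceEventType
    text: str = ""


# Hypothetical raw event names for two made-up providers; real SDKs differ.
PROVIDER_EVENT_MAPS = {
    "provider_a": {
        "speech_started": VoiceEventType.VOICE_ACTIVITY_START,
        "speech_stopped": VoiceEventType.VOICE_ACTIVITY_STOP,
        "transcript.delta": VoiceEventType.PARTIAL_TRANSCRIPT,
        "transcript.done": VoiceEventType.FINAL_TRANSCRIPT,
    },
    "provider_b": {
        "interim": VoiceEventType.PARTIAL_TRANSCRIPT,
        "final": VoiceEventType.FINAL_TRANSCRIPT,
        "user_interrupt": VoiceEventType.BARGE_IN,
    },
}


def normalise(provider: str, raw_name: str, text: str = "") -> Optional[VoiceEvent]:
    """Translate a provider-specific event into the unified vocabulary.

    Returns None when the provider or the raw event name is unknown.
    """
    event_type = PROVIDER_EVENT_MAPS.get(provider, {}).get(raw_name)
    if event_type is None:
        return None
    return VoiceEvent(event_type, text)
```

A dictionary lookup per event keeps the translation cost negligible next to network and audio latency, which matters given the tight turn-taking budget.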

Example

A consumer voice assistant team wants to ship realtime speech-to-speech on iOS, fall back to TTS+STT on platforms where realtime is unavailable, and run STT-only for transcription-mode users. They build their agent loop against a unified Voice interface with `speak`, `listen`, and `converse` methods plus a capability flag for `realtime_sts`. On iOS the loop picks the realtime provider; on Android it falls back to TTS+STT through the same interface; transcription-mode disables `speak` entirely. When a cheaper TTS provider lands, the change is a configuration switch — the agent loop does not move.
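
The platform fallback in this example could be sketched roughly as follows. This is a minimal illustration, not the team's actual code: the `Capability` flag enum, the `Platform` record, and the `choose_mode` function are hypothetical names, and the returned strings only label which interface methods the loop would use.

```python
from dataclasses import dataclass
from enum import Flag, auto


class Capability(Flag):
    TTS = auto()
    STT = auto()
    REALTIME_STS = auto()


@dataclass(frozen=True)
class Platform:
    name: str
    capabilities: Capability


def choose_mode(platform: Platform, transcription_only: bool = False) -> str:
    """Pick a voice mode from capability flags, never from the provider name."""
    if transcription_only:
        # Transcription-mode users only need STT; `speak` stays disabled.
        if Capability.STT in platform.capabilities:
            return "listen"
        raise RuntimeError(f"{platform.name}: STT is not available")
    if Capability.REALTIME_STS in platform.capabilities:
        return "converse"
    needed = Capability.TTS | Capability.STT
    if (platform.capabilities & needed) == needed:
        return "listen+speak"
    raise RuntimeError(f"{platform.name}: no usable voice mode")


ios = Platform("ios", Capability.TTS | Capability.STT | Capability.REALTIME_STS)
android = Platform("android", Capability.TTS | Capability.STT)
```

Under these assumptions `choose_mode(ios)` selects the realtime path, `choose_mode(android)` falls back to the STT-then-TTS pair, and passing `transcription_only=True` selects `listen` alone.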

Diagram

flowchart TD
    L[Agent loop] --> V[Voice interface]
    V --> Cap[Capability flags<br/>tts / stt / sts]
    V --> A1[OpenAI realtime adapter]
    V --> A2[ElevenLabs TTS adapter]
    V --> A3[Deepgram STT adapter]
    V --> A4[Azure / AWS / on-device ...]
    A1 --> P1[(Realtime API)]
    A2 --> P2[(TTS API)]
    A3 --> P3[(STT API)]
    A4 --> Pn[(...)]

Solution

Therefore:

Define a Voice interface with three primary methods: `speak(text) -> AudioStream`, `listen(audio_stream) -> TranscriptStream`, and `converse(audio_stream) -> AudioStream` (the realtime STS path). Add a uniform event vocabulary: `turn_start`, `partial_transcript`, `final_transcript`, `barge_in`, `voice_activity_start`, and `voice_activity_stop`. Each provider implementation declares which modes and voices it supports via capability flags; the agent loop checks capability rather than provider name. Pair with streaming-typed-events (the underlying typed event transport), multilingual-voice-agent (language adaptation on top), and provider-string-routing (string-addressed provider selection). Treat realtime STS as a first-class mode, not a flavour of TTS+STT, because the bidirectional framing differs.
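
A minimal sketch of such an interface is shown below, under several simplifying assumptions. Streams are modelled as plain synchronous iterators (a production version would more likely be asynchronous), capability flags are plain strings, and `EchoVoice`, `UnsupportedMode`, and `run_turn` are hypothetical names invented for the illustration. `EchoVoice` is a toy in-memory adapter that treats UTF-8 bytes as "audio"; it stands in for a real provider adapter.

```python
from typing import FrozenSet, Iterable, Iterator, List, Protocol

AudioStream = Iterator[bytes]
TranscriptStream = Iterator[str]


class Voice(Protocol):
    """The three methods named in the pattern, plus capability flags."""

    capabilities: FrozenSet[str]

    def speak(self, text: str) -> AudioStream: ...
    def listen(self, audio_stream: Iterable[bytes]) -> TranscriptStream: ...
    def converse(self, audio_stream: Iterable[bytes]) -> AudioStream: ...


class UnsupportedMode(RuntimeError):
    """Raised when an adapter is asked for a mode it did not declare."""


class EchoVoice:
    """Toy adapter: supports TTS and STT only, no realtime STS."""

    capabilities = frozenset({"tts", "stt"})

    def speak(self, text: str) -> AudioStream:
        for word in text.split():
            yield word.encode("utf-8")

    def listen(self, audio_stream: Iterable[bytes]) -> TranscriptStream:
        for chunk in audio_stream:
            yield chunk.decode("utf-8")

    def converse(self, audio_stream: Iterable[bytes]) -> AudioStream:
        raise UnsupportedMode("realtime STS is not supported by EchoVoice")


def run_turn(voice: Voice, audio_in: Iterable[bytes]) -> List[bytes]:
    """One agent turn that branches on capability, not on provider name."""
    if "realtime_sts" in voice.capabilities:
        return list(voice.converse(audio_in))
    transcript = " ".join(voice.listen(audio_in))
    reply = transcript.upper()  # placeholder for the LLM reasoning step
    return list(voice.speak(reply))
```

The loop in `run_turn` never names a provider, so swapping `EchoVoice` for another adapter that satisfies the same protocol leaves it unchanged.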

What this pattern forbids. The agent loop must call voice operations through the unified interface and must read provider capability via capability flags; the loop is not allowed to import provider-specific voice SDK classes.
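
One possible way to enforce this rule mechanically is a small static check over the agent loop's source. The sketch below uses Python's standard `ast` module; the set of forbidden module roots is an illustrative assumption and would need to match whichever SDK packages a project actually depends on.

```python
import ast
from typing import List

# Illustrative list of provider SDK top-level modules the loop must not import.
FORBIDDEN_ROOTS = {"openai", "elevenlabs", "deepgram"}


def forbidden_imports(source: str) -> List[str]:
    """Return the provider SDK modules imported by the given source text."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        found.extend(n for n in names if n.split(".")[0] in FORBIDDEN_ROOTS)
    return sorted(found)
```

Run against the agent loop module in a test or pre-commit hook, a non-empty result would signal that provider-specific code has leaked past the interface.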

The smaller patterns that complete this one —

uses Translation Layer ★★: Insert a typed boundary between the agent's clean domain model and a messy or legacy external API.

And the patterns that stand alongside it, or against it —

complements Streaming Typed Events ★★: Push partial results to the client as typed events as they become available, rather than waiting for the full response.
complements Provider-String Routing ★: Select the model and provider for a request through a single namespaced string (`provider/model`) backed by env-var credentials, so the caller specifies what to run with one parameter rather than a typed provider object.
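
To show how a `provider/model` string might select a voice provider, here is a hedged sketch of the parsing step. The `ProviderRef` record, the `resolve` function, the `acme-voice/tts-1` string, and the `<PROVIDER>_API_KEY` environment-variable convention are all assumptions made for the example, not part of any particular library.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ProviderRef:
    provider: str
    model: str
    api_key: Optional[str]


def resolve(spec: str, env: Mapping[str, str] = os.environ) -> ProviderRef:
    """Split 'provider/model' and look up the credential for that provider."""
    provider, sep, model = spec.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"expected 'provider/model', got {spec!r}")
    # Assumed convention: credentials live in <PROVIDER>_API_KEY.
    key_name = f"{provider.upper().replace('-', '_')}_API_KEY"
    return ProviderRef(provider, model, env.get(key_name))
```

With this in place, switching the TTS provider could be a change to one configuration string, for example `resolve("acme-voice/tts-1")`, while the agent loop keeps calling the same Voice interface.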

Provenance

Source: patterns/unified-voice-interface.md on GitHub · commit 7965435
Added to catalog: 2026-05-20
Last updated: 2026-05-21
Contribute: open an issue or PR at github.com/agentpatternscatalog/patterns.