Unified Voice Interface
Expose text-to-speech, speech-to-text, and real-time speech-to-speech through a single interface so a voice agent can swap providers without rewriting the loop.
Problem
Each provider ships its own software development kit, with its own streaming chunk format, its own audio framing, its own lifecycle events for things like "the user started talking" or "partial transcript ready", and its own way of exposing real-time speech-to-speech versus the older text-to-speech and speech-to-text shapes. Writing the agent loop directly against one of those kits binds the entire application to that vendor's release cadence and pricing, and forecloses a switch for cost, quality, latency, or feature reasons. The team needs one interface that spans all three modes and treats the provider as a configuration choice.
Solution
Define a Voice interface with three primary methods: `speak(text) -> AudioStream` (text-to-speech, TTS), `listen(audio_stream) -> TranscriptStream` (speech-to-text, STT), and `converse(audio_stream) -> AudioStream` (the realtime speech-to-speech, or STS, path). All three share a uniform event vocabulary: `turn_start`, `partial_transcript`, `final_transcript`, `barge_in`, `voice_activity_start`, and `voice_activity_stop`. Each provider implementation declares which modes and voices it supports via capability flags; the agent loop checks capability rather than provider name. Pair with streaming-typed-events (the underlying typed event transport), multilingual-voice-agent (language adaptation on top), and provider-string-routing (string-addressed provider selection). Treat realtime STS as a first-class mode, not a flavour of TTS+STT, because the bidirectional framing differs.
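The interface described above could be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the names `Mode`, `VoiceEvent`, `VoiceEventType`, `supports`, and the toy `EchoVoice` provider are assumptions introduced here, and the streams are modelled as plain synchronous iterators where a real provider would most likely use async streams.

```python
from __future__ import annotations

from abc import ABC
from dataclasses import dataclass
from enum import Enum, Flag, auto
from typing import Iterator

# Assumption: audio is modelled as an iterator of byte chunks.
AudioStream = Iterator[bytes]


class VoiceEventType(Enum):
    """The uniform event vocabulary named in the pattern."""
    TURN_START = "turn_start"
    PARTIAL_TRANSCRIPT = "partial_transcript"
    FINAL_TRANSCRIPT = "final_transcript"
    BARGE_IN = "barge_in"
    VOICE_ACTIVITY_START = "voice_activity_start"
    VOICE_ACTIVITY_STOP = "voice_activity_stop"


@dataclass(frozen=True)
class VoiceEvent:
    type: VoiceEventType
    text: str = ""


TranscriptStream = Iterator[VoiceEvent]


class Mode(Flag):
    """Capability flags (hypothetical names) for the three modes."""
    TTS = auto()
    STT = auto()
    REALTIME_STS = auto()


class Voice(ABC):
    """Base class: providers override only the modes they declare."""
    modes: Mode = Mode(0)
    voices: frozenset[str] = frozenset()

    def supports(self, mode: Mode) -> bool:
        return mode in self.modes

    def speak(self, text: str) -> AudioStream:
        raise NotImplementedError("TTS not supported by this provider")

    def listen(self, audio_stream: AudioStream) -> TranscriptStream:
        raise NotImplementedError("STT not supported by this provider")

    def converse(self, audio_stream: AudioStream) -> AudioStream:
        raise NotImplementedError("realtime STS not supported by this provider")


class EchoVoice(Voice):
    """Toy provider for illustration: TTS and STT only, no realtime STS."""
    modes = Mode.TTS | Mode.STT
    voices = frozenset({"default"})

    def speak(self, text: str) -> AudioStream:
        # Pretend each word is one audio chunk.
        for word in text.split():
            yield word.encode("utf-8")

    def listen(self, audio_stream: AudioStream) -> TranscriptStream:
        words: list[str] = []
        yield VoiceEvent(VoiceEventType.TURN_START)
        for chunk in audio_stream:
            words.append(chunk.decode("utf-8"))
            yield VoiceEvent(VoiceEventType.PARTIAL_TRANSCRIPT, " ".join(words))
        yield VoiceEvent(VoiceEventType.FINAL_TRANSCRIPT, " ".join(words))
```

Keeping `converse` as its own method, rather than composing it from `listen` and `speak`, reflects the point that realtime STS is a separate mode with its own framing.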
When to use
- Building voice agents that may switch providers for cost, quality, or latency reasons.
- Multiple voice modes (TTS, STT, realtime STS) are in play in the same product.
- The application UI consumes a uniform voice-event vocabulary independent of provider.
- Provider capability gaps must be discoverable at runtime.
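Runtime capability discovery might look something like the sketch below, in which the agent loop picks the realtime path when a provider declares it and otherwise falls back to an STT, reply, TTS cascade. The `Mode` flags, the two toy providers, and `run_turn` are hypothetical names used only for illustration; `reply` stands in for whatever produces the agent's text response.

```python
from __future__ import annotations

from enum import Flag, auto
from typing import Callable, Iterable, Iterator


class Mode(Flag):
    TTS = auto()
    STT = auto()
    REALTIME_STS = auto()


class CascadeOnlyVoice:
    """Hypothetical provider with TTS and STT but no realtime STS."""
    modes = Mode.TTS | Mode.STT

    def listen(self, audio: Iterable[bytes]) -> Iterator[str]:
        # Toy STT: treat each chunk as a UTF-8 word, yield one final transcript.
        yield " ".join(chunk.decode("utf-8") for chunk in audio)

    def speak(self, text: str) -> Iterator[bytes]:
        # Toy TTS: one "audio" chunk per word.
        for word in text.split():
            yield word.encode("utf-8")


class RealtimeVoice:
    """Hypothetical provider that only offers realtime STS."""
    modes = Mode.REALTIME_STS

    def converse(self, audio: Iterable[bytes]) -> Iterator[bytes]:
        # Toy STS: transform each incoming chunk directly into an outgoing one.
        for chunk in audio:
            yield chunk.upper()


def run_turn(voice, audio: Iterable[bytes],
             reply: Callable[[str], str]) -> tuple[str, list[bytes]]:
    """Choose a path from capability flags, never from the provider name."""
    if Mode.REALTIME_STS in voice.modes:
        return "realtime", list(voice.converse(audio))
    if (Mode.TTS | Mode.STT) in voice.modes:
        transcript = "".join(voice.listen(audio))
        return "cascade", list(voice.speak(reply(transcript)))
    raise RuntimeError("provider supports neither realtime STS nor TTS+STT")
```

Because `run_turn` inspects only `voice.modes`, swapping `CascadeOnlyVoice()` for `RealtimeVoice()` changes the path taken without any change to the loop itself, and a provider with neither capability fails loudly instead of silently degrading.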