Multilingual Voice Agent Stack
also known as Voice-First Multilingual Agent, STT-LLM-TTS Pipeline, Indic Voice Agent
Compose a voice agent as a tightly co-located pipeline of speech-to-text, language-aware LLM reasoning, and text-to-speech, where one vendor owns all three so language and dialect propagate cleanly across stages.
Context
A team is building a voice agent for a market where users speak one of many regional languages and dialects, such as India's 22 scheduled languages or Iberian Spanish and Catalan. The product runs on telephony channels (phone calls, WhatsApp voice) where written input is rare and the agent has to converse in the user's own language at sub-second turn-taking latency.
Problem
Bolting a generic English-trained large language model (LLM) between a generic speech-to-text (STT) component and a generic text-to-speech (TTS) component loses dialect, code-switching, and accent the moment audio is transcribed. Quality losses at each stage compound across the pipeline, the model silently falls back to a pivot language and replies in a slightly off version of the user's language, and end-to-end latency exceeds the roughly one-second budget that natural conversation tolerates. Narrowband telephony audio (8 kHz) makes every stage noisier still.
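The compounding of per-stage losses can be made concrete with a small calculation. This is a minimal sketch, assuming stage errors are independent; the per-stage rates are hypothetical illustrations, not measurements of any real system, and `end_to_end_quality` is a name introduced here.

```python
from math import prod


def end_to_end_quality(stage_quality: list[float]) -> float:
    """Multiply per-stage success rates, assuming the stages fail independently."""
    return prod(stage_quality)


# Hypothetical success rates for STT, LLM, and TTS on dialectal telephony audio.
# These numbers are illustrative only.
generic_stack = end_to_end_quality([0.85, 0.90, 0.90])
tuned_stack = end_to_end_quality([0.95, 0.97, 0.97])

print(f"generic: {generic_stack:.2f}, tuned: {tuned_stack:.2f}")
```

Under these assumed numbers, three individually acceptable stages still lose roughly a third of turns end to end, which is why the weakest multilingual stage tends to dominate the result.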
Forces
- STT, LLM, and TTS each have their own multilingual coverage curve.
- Real conversation tolerates about 1 s of round-trip latency; anything slower breaks the illusion of a live conversation.
- Dialect and code-switching are the norm, not the exception.
- Telephony imposes 8 kHz audio constraints on top.
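The latency force can be checked with a simple budget. The sketch below assumes every hop streams, so the delay the caller perceives is approximately the sum of each hop's time to first output. The hop names and millisecond figures are hypothetical placeholders that a team would replace with its own measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HopLatency:
    name: str
    first_output_ms: float  # time until this hop emits its first usable chunk


def time_to_first_audio(hops: list[HopLatency]) -> float:
    """Approximate perceived delay as the sum of time-to-first-output per hop."""
    return sum(h.first_output_ms for h in hops)


def within_budget(hops: list[HopLatency], budget_ms: float = 1000.0) -> bool:
    """True when the summed hop latencies fit the turn-taking budget."""
    return time_to_first_audio(hops) <= budget_ms


# Hypothetical figures, for illustration only.
hops = [
    HopLatency("stt_end_of_utterance", 250.0),
    HopLatency("llm_first_token", 350.0),
    HopLatency("tts_first_chunk", 200.0),
    HopLatency("telephony_network", 120.0),
]

print(time_to_first_audio(hops), within_budget(hops))
```

Tracking the budget per hop rather than only end to end makes it clear which stage to optimise when the total drifts past one second.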
Example
A food-delivery startup launches a voice ordering line in Spain by chaining a generic English-trained STT, an English LLM, and a generic TTS. Customers speaking Catalan and Andalusian Spanish are misheard, the LLM responds in slightly off Spanish, and the TTS speaks with a flat American accent. The team rebuilds the line as a Multilingual Voice Agent Stack: all three stages come from one vendor that supports Iberian Spanish and Catalan, dialect tags are propagated end-to-end, and the TTS voices are native to the target languages. Order completion rates climb sharply.
Solution
Therefore:
Build the voice agent as a co-located pipeline whose components share language identity and dialect signals end-to-end. Use STT models trained on the target languages and accents. Pass detected language tags as structured metadata to the LLM. Use TTS voices native to the target language; do not translate back to English mid-pipeline. Optimise for streaming at every hop (incremental STT, streaming LLM, streaming TTS) to hit sub-second turn-taking. Treat code-switching as first-class; do not force a single-language assumption.
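One way to picture the tag propagation and per-hop streaming described above is a chain of Python generators. This is a minimal sketch with stand-in stages, not a real vendor API: `Segment`, `NATIVE_VOICES`, the voice IDs, and the request shape are all assumptions made for illustration, and the language tags follow BCP 47 style.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Segment:
    text: str
    language: str            # BCP 47-style tag, e.g. "ca-ES"
    also_heard: tuple = ()   # other languages detected in the turn (code-switching)


# Hypothetical voice catalogue; the IDs are placeholders, not real voices.
NATIVE_VOICES = {"ca-ES": "voice-ca-1", "es-ES": "voice-es-1"}


def stt(partials: Iterable[Segment]) -> Iterator[Segment]:
    """Stand-in for incremental STT: yields tagged partial transcripts as they arrive."""
    yield from partials


def llm_request(seg: Segment) -> dict:
    """Carry the detected language as structured metadata, not as free text."""
    return {
        "input": seg.text,
        "metadata": {"language": seg.language, "also_heard": list(seg.also_heard)},
    }


def llm(segments: Iterable[Segment]) -> Iterator[Segment]:
    """Stand-in for a streaming LLM: the reply keeps the language tag of its input."""
    for seg in segments:
        request = llm_request(seg)
        reply = f"(reply to: {request['input']})"  # placeholder for model output
        yield Segment(reply, seg.language, seg.also_heard)


def tts(segments: Iterable[Segment]) -> Iterator[tuple]:
    """Stand-in for streaming TTS: a missing native voice raises KeyError."""
    for seg in segments:
        yield NATIVE_VOICES[seg.language], seg.text


# One code-switched turn: Catalan first, then Spanish.
turn = [
    Segment("vull una pizza", "ca-ES"),
    Segment("y una cerveza", "es-ES", also_heard=("ca-ES",)),
]
spoken = list(tts(llm(stt(turn))))
print(spoken)
```

Because each stage is a generator, a segment flows to the next hop as soon as it is produced, and the language tag travels with it instead of being re-detected downstream.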
What this pattern forbids. Language identity and dialect tags must propagate through every hop; mid-pipeline silent translation to a pivot language (e.g. English) is forbidden.
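The prohibition can be enforced mechanically at each hop boundary. The sketch below is one possible guard, assuming every stage reports the language tag it received and the tag it emitted. `PivotLanguageError` and `check_hop` are hypothetical names introduced here, and a real deployment may need a looser comparison for code-switched turns.

```python
from typing import Optional


class PivotLanguageError(ValueError):
    """Raised when a hop drops or changes the language tag."""


def check_hop(stage: str, tag_in: Optional[str], tag_out: Optional[str]) -> str:
    """Return the outgoing tag if it matches the incoming one, else raise."""
    if not tag_in or not tag_out:
        raise PivotLanguageError(f"{stage}: language tag was dropped")
    if tag_in.lower() != tag_out.lower():
        raise PivotLanguageError(
            f"{stage}: tag changed from {tag_in} to {tag_out} (possible pivot translation)"
        )
    return tag_out


# A hop that preserves the tag passes through unchanged.
print(check_hop("llm", "ca-ES", "ca-ES"))

# A hop that silently switches to English is rejected.
try:
    check_hop("llm", "ca-ES", "en-US")
except PivotLanguageError as err:
    print(err)
```

Failing loudly at the boundary turns a silent quality regression into an observable error that can be logged and alerted on.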
The smaller patterns that complete this one:
- uses Streaming Typed Events ★★: Push partial results to the client as typed events as they become available, rather than waiting for the full response.
- uses Structured Output ★★: Constrain the model's output to conform to a JSON Schema (or similar typed shape).
- generalises Unified Voice Interface ★: Expose text-to-speech, speech-to-text, and real-time speech-to-speech through a single interface so a voice agent can swap providers without rewriting the loop.
And the patterns that stand alongside it, or against it:
- complements Multi-Model Routing ★★: Send each request to the cheapest model that can handle it well.
- complements Translation Layer ★★: Insert a typed boundary between the agent's clean domain model and a messy or legacy external API.
- alternative to Computer Use ★: Let the model drive a desktop end-to-end via screenshots plus virtual mouse/keyboard tool calls instead of bespoke per-app APIs.
- complements Code-Switching-Aware Agent ★: Treat mixed-language input (e.g. Hinglish in Roman script) as the expected shape, and design tokenisation, language tagging, and tool routing to handle it natively without forcing the user to commit to one language.
- alternative to Delayed Streams Modeling ★: Convert streaming speech tasks into a single decoder-only autoregressive problem by time-aligning the parallel input and output streams with a fixed offset in preprocessing, eliminating the learned read/write policy that cascade pipelines require.