Delayed Streams Modeling
Convert streaming speech tasks into a single decoder-only autoregressive problem by time-aligning the parallel input and output streams with a fixed offset in preprocessing. This replaces the multi-model cascade and removes the need for the learned read/write policy that simultaneous translators rely on.
Problem
Cascading three models (speech-to-text, language model, text-to-speech) adds the latency of each stage to the user-perceived delay, and every handoff between them is a place where errors compound or interruptions break the pipeline. The language model cannot start reasoning until the speech-to-text stage commits to a transcription, and the text-to-speech stage cannot start speaking until the language model commits to a reply. Simultaneous translators add a learned read/write policy on top of this, which is itself a separate model that is hard to train, sensitive to the chosen delay budget, and has its own failure modes. None of these architectures handles full-duplex dialogue, in which both sides talk and listen at once, without additional workarounds.
Solution
In preprocessing, represent each training example as parallel token streams (source and target) interleaved on a shared time axis, with the target stream offset by a fixed delay. The delay is the chosen latency budget, e.g. 1-3 seconds for translation or ~80 ms for full-duplex dialogue. Train a standard decoder-only transformer to autoregressively predict the next interleaved token. At inference, feed source tokens as they arrive and read off target tokens at the offset position. No learned policy decides when to emit; the fixed offset does. The same architecture handles speech-to-text (text stream offset behind audio), text-to-speech (audio stream offset behind text), simultaneous translation (target language offset behind source), and full-duplex dialogue (each speaker's stream offset behind the joint conversation).
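The preprocessing and inference steps described above can be sketched in a few lines of Python. This is a minimal illustration and not the pattern's reference implementation: it assumes one integer token per stream per time frame (real audio codecs typically emit several tokens per frame), a delay expressed in frames (delay in seconds times the tokenizer frame rate), and a hypothetical padding id PAD. The names delay_stream, interleave, build_example, stream_decode and predict_next are invented for this example, and predict_next stands in for a trained decoder-only model.

```python
from typing import Callable, List, Sequence

PAD = 0  # hypothetical padding token id (assumption, not from the source)


def delay_stream(target: Sequence[int], delay: int, length: int,
                 pad: int = PAD) -> List[int]:
    """Shift `target` right by `delay` frames, then pad or cut to `length`."""
    shifted = ([pad] * delay + list(target))[:length]
    return shifted + [pad] * (length - len(shifted))


def interleave(source: Sequence[int], target: Sequence[int]) -> List[int]:
    """Flatten two frame-aligned streams into [s0, t0, s1, t1, ...]."""
    if len(source) != len(target):
        raise ValueError("streams must cover the same number of frames")
    flat: List[int] = []
    for s, t in zip(source, target):
        flat.extend((s, t))
    return flat


def build_example(source: Sequence[int], target: Sequence[int], delay: int,
                  pad: int = PAD) -> List[int]:
    """Preprocessing: one training sequence with the target offset by `delay`."""
    # Extend the time axis so the delayed tail of the target is not cut off.
    length = max(len(source), len(target) + delay)
    padded_source = list(source) + [pad] * (length - len(source))
    return interleave(padded_source, delay_stream(target, delay, length, pad))


def stream_decode(source_frames: Sequence[int],
                  predict_next: Callable[[List[int]], int],
                  delay: int, pad: int = PAD) -> List[int]:
    """Inference: feed source frames as they arrive, read targets at the offset.

    The fixed offset alone decides when output starts; there is no learned
    read/write policy. `predict_next` maps the interleaved context so far to
    the next target token.
    """
    context: List[int] = []
    emitted: List[int] = []
    # After the source ends, feed `delay` padding frames to flush the output.
    frames = list(source_frames) + [pad] * delay
    for step, s in enumerate(frames):
        context.append(s)
        if step < delay:
            # Inside the delay window the target slot is padding by construction.
            context.append(pad)
            continue
        t = predict_next(context)
        context.append(t)
        emitted.append(t)
    return emitted
```

As a usage example, build_example([11, 12, 13], [21, 22, 23], delay=2) yields the interleaved sequence [11, 0, 12, 0, 13, 21, 0, 22, 0, 23]: the first target token appears two frames after the first source token. Changing the latency budget only changes the delay argument in preprocessing. The model and the decoding loop stay the same.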
When to use
- Latency budget is tight (sub-second to few-second).
- Task is naturally a stream-to-stream transduction (speech, translation, dialogue).
- Time-aligned paired data is available or can be synthesized.
- Cascade complexity (STT+LLM+TTS) is dominating engineering cost or latency.