XII · Streaming & UX · Emerging

Delayed Streams Modeling

also known as DSM, Modélisation à flux décalés, Time-Aligned Stream Decoder, Single-Decoder Speech Agent

Convert streaming speech tasks into a single decoder-only autoregressive problem by time-aligning the parallel input and output streams with a fixed offset in preprocessing, removing both the cascade handoffs and the learned read/write policy that simultaneous systems otherwise require.

Context

A team is building a low-latency speech system — a real-time translator, a voice assistant that has to hold a conversation, or a full-duplex dialogue agent where the human and the agent can talk over each other. The conventional architecture is a cascade: a speech-to-text (STT) model transcribes the user's audio, a language model reasons about the text, and a text-to-speech (TTS) model produces the reply audio. Simultaneous-translation systems usually add a separate "read/write policy" that decides at each moment whether to wait for more input or emit the next chunk of output.

Problem

Cascading three models adds the latency of each stage to the user-perceived delay, and every handoff between them is a place where errors compound or interruptions break the pipeline. The language model cannot start reasoning until the speech-to-text stage commits to a transcription, and the text-to-speech stage cannot start speaking until the language model commits to a reply. The learned read/write policy that simultaneous translators add on top is itself a separate model: it is hard to train, sensitive to the chosen delay budget, and has its own failure modes. None of these architectures handles full-duplex dialogue, where both sides talk and listen at once, without additional special-case machinery.

Forces

  • Streaming low-latency speech requires emitting output before input is finished.
  • Cascade architectures accumulate latency across stages.
  • Learned read/write policies are extra training problems with their own failure modes.
  • A single decoder-only model is simpler to train and deploy than a cascade.
  • Time-alignment between streams (e.g. translated speech lagging source speech by a fixed offset) can be enforced in preprocessing instead of learned at inference.

Example

A simultaneous translator app needs French speech out within two seconds of English speech in, on-device. The team trains a single delayed-streams decoder with the target French audio offset two seconds behind the source English audio. At inference the user speaks and French tokens stream out two seconds later from the same model: no separate STT, no separate LLM, no learned read/write policy. The same architecture, retrained with a much smaller offset and both speakers' audio as parallel streams, powers their full-duplex dialogue assistant.
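The latency budget in the example has to become a whole number of token frames before it can be used as an offset. A minimal sketch of that conversion follows; the 12.5 Hz frame rate and the function name are assumptions chosen for illustration, not values taken from this pattern.

```python
def delay_in_frames(delay_seconds: float, frame_rate_hz: float) -> int:
    """Convert a latency budget in seconds to the nearest whole number of frames."""
    return round(delay_seconds * frame_rate_hz)


# Assumed codec frame rate (hypothetical): 12.5 frames per second, i.e. 80 ms per frame.
FRAME_RATE_HZ = 12.5

translation_delay = delay_in_frames(2.0, FRAME_RATE_HZ)  # two-second translation budget
dialogue_delay = delay_in_frames(0.08, FRAME_RATE_HZ)    # ~80 ms dialogue budget

print(translation_delay, dialogue_delay)
```

Under this assumed frame rate the translator's two-second budget is 25 frames and an 80 ms dialogue budget is a single frame, so the two products differ only in one integer.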

Solution

Therefore:

In preprocessing, represent each training example as parallel token streams (source and target) interleaved on a shared time axis, with the target stream offset by a fixed delay. The delay is the chosen latency budget, e.g. 1 to 3 seconds for translation or about 80 ms for full-duplex dialogue. Train a standard decoder-only transformer to autoregressively predict the next interleaved token. At inference, feed source tokens as they arrive and read off target tokens at the offset position. No learned policy decides when to emit; the offset structure does. The same architecture handles speech-to-text (text stream offset behind audio), text-to-speech (audio stream offset behind text), simultaneous translation (target language offset behind source), and full-duplex dialogue (each speaker's stream offset behind the joint conversation).
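The preprocessing step can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: it assumes one token per stream per frame, a hypothetical padding id `PAD`, and a delay already expressed in frames. Real systems typically carry several audio codebook tokens per frame, which this sketch ignores.

```python
from typing import List, Tuple

PAD = -1  # hypothetical padding token id, an assumption for this sketch

Frame = Tuple[int, int]  # (source token, delayed target token) at one time step


def align_streams(source: List[int], target: List[int], delay: int,
                  pad: int = PAD) -> List[Frame]:
    """Place both streams on a shared time axis with the target shifted right."""
    delayed = [pad] * delay + list(target)          # target starts `delay` frames late
    length = max(len(source), len(delayed))
    src = list(source) + [pad] * (length - len(source))
    tgt = delayed + [pad] * (length - len(delayed))
    return list(zip(src, tgt))


def interleave(frames: List[Frame]) -> List[int]:
    """Flatten aligned frames into the single sequence the decoder is trained on."""
    return [tok for frame in frames for tok in frame]


frames = align_streams(source=[10, 11, 12, 13], target=[20, 21, 22], delay=2)
sequence = interleave(frames)
print(frames)
print(sequence)
```

Because the shift happens before training, the decoder only ever sees target tokens that sit `delay` frames after the source frames they depend on. Next-token prediction over `sequence` is then the whole training objective.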

What this pattern forbids. The model must not predict output tokens ahead of the configured offset — emission position is structural, not learned. The architecture forbids inserting a separate read/write policy or cascade stage; the offset is the policy.
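A minimal sketch of an inference loop that respects this constraint follows. The `predict` callable is a stand-in for the trained decoder and `make_echo_model` is a toy stub used only to make the loop runnable; both are hypothetical names. The point is that the loop, not the model, fixes when the first target token appears.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

PAD = -1  # hypothetical padding token id, an assumption for this sketch

Frame = Tuple[int, int]  # (source token, target token) at one time step
PredictFn = Callable[[List[Frame], int], int]


def stream_decode(source: Iterable[int], predict: PredictFn,
                  delay: int) -> Iterator[int]:
    """Emit target tokens exactly `delay` frames behind the incoming source."""
    history: List[Frame] = []
    for t, src_tok in enumerate(source):
        if t < delay:
            tgt_tok = PAD                      # structural: nothing is emitted yet
        else:
            tgt_tok = predict(history, src_tok)
            yield tgt_tok
        history.append((src_tok, tgt_tok))
    for _ in range(delay):                     # flush the tail once the source ends
        tgt_tok = predict(history, PAD)
        yield tgt_tok
        history.append((PAD, tgt_tok))


def make_echo_model(delay: int) -> PredictFn:
    """Toy stand-in for a trained decoder: 'translates' by adding 100 (needs delay >= 1)."""
    def predict(history: List[Frame], src_tok: int) -> int:
        return history[len(history) - delay][0] + 100
    return predict


output = list(stream_decode([1, 2, 3, 4, 5], make_echo_model(2), delay=2))
print(output)
```

There is no branch in which the model chooses to wait or to speak. The only condition on emission is the comparison `t < delay`, which is what "the offset is the policy" means in practice.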

And the patterns that stand alongside it, or against it —

  • alternative-to: Streaming Typed Events ★★. Push partial results to the client as typed events as they become available, rather than waiting for the full response.
  • alternative-to: Multilingual Voice Agent Stack. Compose a voice agent as a tightly co-located pipeline of speech-to-text, language-aware LLM reasoning, and text-to-speech, where one vendor owns all three so language and dialect propagate cleanly across stages.
