Moshi (Kyutai)
Type: full-code · Vendor: Kyutai · Language: Python, Rust · Status: active · Status in practice: experimental · First released: 2024-09-17
Full-duplex speech-text foundation model from Kyutai that models the user's and the agent's audio as two parallel streams, handling turn-taking natively in real time instead of via a separate voice-activity-detection pipeline.
Description. Moshi (arXiv 2410.00037) is described by its authors as the first real-time full-duplex spoken large language model, with a practical latency of roughly 200 ms. It generates speech as tokens from the residual quantizer of Mimi, a streaming neural audio codec; models its own speech and the user's speech as parallel streams; and predicts text tokens (an "inner monologue") that correspond to its own speech. A single model therefore replaces the traditional cascade of voice-activity detection, speech recognition, dialogue, and text-to-speech.
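The latency figure can be put in context with some back-of-the-envelope arithmetic. The sketch below is illustrative only: the sample rate, frame rate, codebook count, codebook size, and one-frame acoustic delay are figures recalled from Kyutai's published material, not stated in this entry, so treat them as assumptions. The function names are hypothetical.

```python
import math

# Assumed Mimi/Moshi parameters (not stated in this entry; verify against Kyutai's docs).
SAMPLE_RATE_HZ = 24_000   # input audio sample rate
FRAME_RATE_HZ = 12.5      # codec frames per second
NUM_CODEBOOKS = 8         # residual quantizer levels per frame
CODEBOOK_SIZE = 2048      # entries per codebook
ACOUSTIC_DELAY_FRAMES = 1 # assumed delay between semantic and acoustic tokens


def frame_duration_ms(frame_rate_hz: float) -> float:
    """Duration of one codec frame in milliseconds."""
    return 1000.0 / frame_rate_hz


def samples_per_frame(sample_rate_hz: float, frame_rate_hz: float) -> float:
    """Number of raw audio samples summarised by one codec frame."""
    return sample_rate_hz / frame_rate_hz


def bitrate_bps(num_codebooks: int, codebook_size: int, frame_rate_hz: float) -> float:
    """Bits per second needed to transmit all codebook indices."""
    bits_per_frame = num_codebooks * math.log2(codebook_size)
    return bits_per_frame * frame_rate_hz


def theoretical_latency_ms(frame_rate_hz: float, delay_frames: int) -> float:
    """One frame to buffer the input plus the assumed acoustic delay."""
    return frame_duration_ms(frame_rate_hz) * (1 + delay_frames)
```

Under these assumptions, one frame covers 80 ms of audio and the theoretical floor is 160 ms, which is consistent with a practical latency of roughly 200 ms once compute and I/O overhead are included.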
Agent loop shape. Single full-duplex speech model rather than a cascade. Moshi continuously models two audio streams — its own speech and the user's — in parallel, predicting speech tokens from the Mimi codec plus a text inner monologue. Because both streams run continuously, the model decides when to speak or yield natively, without a separate voice-activity-detection or endpointing stage.
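The loop shape described above can be sketched as a toy simulation. This is not Moshi's API: `ToyDuplexModel`, `Frame`, and `run_duplex` are hypothetical names, the model emits random tokens, and the codebook count and size are assumptions. The point is only the control flow: every frame, the model both consumes the user's codec tokens and emits its own audio and text tokens.

```python
import random
from dataclasses import dataclass
from typing import List

NUM_CODEBOOKS = 8      # assumption: audio codes per stream per frame
CODEBOOK_SIZE = 2048   # assumption: entries per codebook
TEXT_VOCAB = 100       # hypothetical toy text vocabulary


@dataclass
class Frame:
    user_codes: List[int]    # codec tokens heard from the user this frame
    agent_codes: List[int]   # codec tokens the agent emits this frame
    agent_text: int          # inner-monologue text token for this frame


class ToyDuplexModel:
    """Hypothetical stand-in for the speech-text model; emits random tokens."""

    def __init__(self, seed: int = 0) -> None:
        self.rng = random.Random(seed)
        self.history: List[Frame] = []

    def step(self, user_codes: List[int]) -> Frame:
        # A real model would condition on self.history and user_codes here.
        agent_text = self.rng.randrange(TEXT_VOCAB)
        agent_codes = [self.rng.randrange(CODEBOOK_SIZE) for _ in range(NUM_CODEBOOKS)]
        frame = Frame(list(user_codes), agent_codes, agent_text)
        self.history.append(frame)
        return frame


def run_duplex(model: ToyDuplexModel, user_frames: List[List[int]]) -> List[Frame]:
    # One step per codec frame: listening and speaking happen in the same step.
    # There is deliberately no "has the user finished?" branch.
    return [model.step(codes) for codes in user_frames]
```

In a cascade, a VAD or endpointing stage would gate when the dialogue model runs. In this sketch the model runs on every frame, so staying silent is just another output it can choose.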
Primary use cases
- real-time full-duplex voice conversation
- native turn-taking without a separate VAD or endpointing pipeline
- low-latency speech-to-speech dialogue
Key concepts
- Two parallel audio streams → semantic-turn-endpointing (docs) — Moshi models its own and the user's speech as two streams, enabling simultaneous speech and native turn-taking without separate VAD.
- Mimi neural audio codec (docs) — A streaming neural audio codec; Moshi generates speech tokens from its residual quantizer.
- Inner monologue (docs) — Moshi predicts text tokens corresponding to its own speech alongside the audio tokens, which improves the linguistic quality of the generated speech.
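To illustrate what "residual quantizer" means in the Mimi concept above, here is a minimal sketch of residual quantization in general. It is a scalar toy with hand-picked codebooks, not Mimi's implementation, which operates on learned vector codebooks. `rvq_encode` and `rvq_decode` are hypothetical names.

```python
from typing import List


def rvq_encode(x: float, codebooks: List[List[float]]) -> List[int]:
    """Quantize x in stages; each stage encodes what the previous ones missed."""
    residual = x
    codes = []
    for codebook in codebooks:
        # Pick the entry closest to the current residual.
        idx = min(range(len(codebook)), key=lambda i: abs(residual - codebook[i]))
        codes.append(idx)
        residual -= codebook[idx]
    return codes


def rvq_decode(codes: List[int], codebooks: List[List[float]]) -> float:
    """Reconstruct by summing the chosen entry from every stage."""
    return sum(codebook[i] for codebook, i in zip(codebooks, codes))


# Toy codebooks, coarse to fine (illustrative values only).
TOY_CODEBOOKS = [
    [-1.0, 0.0, 1.0],
    [-0.3, 0.0, 0.3],
    [-0.1, 0.0, 0.1],
]
```

Each stage produces one token per frame, so a stack of N codebooks yields N tokens per frame per audio stream. Those per-frame token stacks are what a model like Moshi predicts for its own speech and reads for the user's.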
Patterns this full-code implements
- ★Semantic Turn Endpointing
Moshi handles turn-taking natively via its dual-stream architecture instead of a silence-threshold VAD — the full-duplex realisation of this pattern.
- ★Unified Voice Interface
A single speech-text model owns the whole speech loop (recognition, dialogue, and speech generation) rather than a cascade of separate providers.