Moshi (Kyutai)
Full-duplex speech-text foundation model from Kyutai that models the user's and the agent's audio as two parallel streams, handling turn-taking natively in real time instead of via a separate voice-activity-detection pipeline.
Description
Moshi (arXiv 2410.00037) is reported as the first real-time full-duplex spoken large language model, with a latency of roughly 200 ms in practice. It generates speech as tokens from the residual quantizer of Mimi, a streaming neural audio codec, and models its own speech and the user's speech as parallel streams. It also predicts text tokens (an inner monologue) that correspond to its own speech. A single model therefore replaces the traditional cascade of voice-activity detection, speech recognition, dialogue, and text-to-speech.
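To make the stream layout concrete, the sketch below shows what a single time step might contain: one text token for the inner monologue, plus codec tokens for Moshi's own audio and for the user's audio. The frame rate (12.5 Hz) and the codebook count (8) are assumptions recalled from the Moshi paper rather than figures stated on this page, and `FrameStep` is a hypothetical name, not part of any Kyutai API.

```python
from dataclasses import dataclass
from typing import List

# Assumed values (recalled from the Moshi paper, not stated on this page).
FRAME_RATE_HZ = 12.5   # Mimi codec frames per second
NUM_CODEBOOKS = 8      # residual quantizer levels per audio stream


@dataclass
class FrameStep:
    """Hypothetical container for everything the model handles at one time step."""
    text_token: int          # inner-monologue token for Moshi's own speech
    agent_audio: List[int]   # one codec token per codebook, Moshi's stream
    user_audio: List[int]    # one codec token per codebook, user's stream

    def num_tokens(self) -> int:
        """Total tokens modelled at this step across all parallel streams."""
        return 1 + len(self.agent_audio) + len(self.user_audio)


def frame_duration_ms(frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Duration of audio covered by one codec frame, in milliseconds."""
    return 1000.0 / frame_rate_hz


step = FrameStep(
    text_token=0,
    agent_audio=[0] * NUM_CODEBOOKS,
    user_audio=[0] * NUM_CODEBOOKS,
)
print(step.num_tokens())    # 17
print(frame_duration_ms())  # 80.0
```

Under these assumptions each step covers 80 ms of audio, so a latency of roughly 200 ms corresponds to between two and three frames.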
Solution
A single full-duplex speech model rather than a cascade. Moshi continuously models two audio streams in parallel, its own speech and the user's, predicting speech tokens from the Mimi codec plus a text inner monologue. Because both streams run continuously, the model itself decides when to speak and when to yield, without a separate voice-activity-detection or endpointing stage.
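The control flow this implies can be sketched as a loop that advances both streams one frame at a time, with no gate in front of the model. This is an illustration only: `ScriptedDuplexModel`, `run_duplex`, and the `SILENCE` token are hypothetical names, each frame is reduced to a single integer, and the hand-written rule inside `step` merely stands in for behaviour that a real model would learn from data.

```python
from typing import Iterable, List, Tuple

SILENCE = 0  # hypothetical token id standing in for a frame with no speech


class ScriptedDuplexModel:
    """Toy stand-in for a learned full-duplex model (not Moshi itself).

    A real model would predict its next frame from the history of both
    streams; here the reply is scripted so the loop is deterministic.
    """

    def __init__(self, reply: List[int], start_after_quiet_frames: int = 2):
        self.reply = list(reply)
        self.start_after = start_after_quiet_frames
        self.quiet_run = 0   # consecutive silent user frames seen so far
        self.cursor = 0      # next scripted reply token to emit

    def step(self, user_frame: int) -> int:
        """Consume one user frame and emit one agent frame (possibly silence)."""
        if user_frame != SILENCE:
            self.quiet_run = 0
            return SILENCE   # yield the floor while the user speaks
        self.quiet_run += 1
        if self.quiet_run > self.start_after and self.cursor < len(self.reply):
            token = self.reply[self.cursor]
            self.cursor += 1
            return token
        return SILENCE


def run_duplex(model: ScriptedDuplexModel, user_stream: Iterable[int]) -> List[Tuple[int, int]]:
    """Advance both streams frame by frame; the loop never waits for an endpoint."""
    return [(user_frame, model.step(user_frame)) for user_frame in user_stream]


user = [5, 6, 0, 0, 0, 0, 7, 0]
timeline = run_duplex(ScriptedDuplexModel(reply=[11, 12, 13]), user)
for user_frame, agent_frame in timeline:
    print(user_frame, agent_frame)
```

The agent emits a frame at every step, including silence, and stops mid-reply when the user starts speaking again at the seventh frame. The decision to speak or yield sits inside `step`, which is the structural point: the outer loop contains no detection or endpointing stage.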
Primary use cases
- real-time full-duplex voice conversation
- native turn-taking without a separate VAD or endpointing pipeline
- low-latency speech-to-speech dialogue