Full-Code · Voice & Conversational · active

Moshi (Kyutai)

Type: full-code · Vendor: Kyutai · Language: Python, Rust · Status: active · Status in practice: experimental · First released: 2024-09-17

Full-duplex speech-text foundation model from Kyutai that models the user's and the agent's audio as two parallel streams, handling turn-taking natively in real time instead of via a separate voice-activity-detection pipeline.

Description. Moshi (arXiv 2410.00037) is described by its authors as the first real-time full-duplex spoken large language model, with a practical latency of roughly 200 ms. It generates speech as tokens from the residual quantizer of Mimi, a streaming neural audio codec, and models its own speech and the user's speech as two parallel streams. It also predicts text tokens corresponding to its own speech (an "inner monologue"). A single model therefore replaces the traditional cascade of voice-activity detection, speech recognition, dialogue, and text-to-speech.
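The roughly 200 ms figure is easier to place with a small latency budget. The sketch below is only arithmetic under assumptions that this entry does not state: that Mimi emits 12.5 frames per second, and that Moshi delays acoustic tokens by one frame. Both figures are recalled from Kyutai's public description of the model and should be checked against it. The function names are hypothetical.

```python
# Assumed figures (not stated in this entry): Mimi emits 12.5 frames per second,
# and acoustic tokens are delayed by one frame before they can be decoded.
FRAME_RATE_HZ = 12.5
ACOUSTIC_DELAY_FRAMES = 1


def frame_duration_ms(frame_rate_hz: float) -> float:
    """Duration of one codec frame in milliseconds."""
    return 1000.0 / frame_rate_hz


def theoretical_latency_ms(frame_rate_hz: float, delay_frames: int) -> float:
    """One frame to buffer incoming audio, plus the delayed frames."""
    return frame_duration_ms(frame_rate_hz) * (1 + delay_frames)


if __name__ == "__main__":
    print(frame_duration_ms(FRAME_RATE_HZ))  # 80.0
    print(theoretical_latency_ms(FRAME_RATE_HZ, ACOUSTIC_DELAY_FRAMES))  # 160.0
```

Under these assumptions the floor is 160 ms, so the practical figure quoted above would leave roughly 40 ms for model compute and audio I/O.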

Agent loop shape. A single full-duplex speech model rather than a cascade. Moshi continuously models two audio streams in parallel, its own speech and the user's, predicting speech tokens from the Mimi codec plus a text inner monologue. Because both streams advance on every step, the model itself decides when to speak and when to yield, without a separate voice-activity-detection or endpointing stage.
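The loop shape can be illustrated with a toy driver. This is a minimal sketch, not Moshi's API: `Frame`, `full_duplex_loop`, `toy_step`, the token ids, and the choice of 8 codec tokens per stream per frame are all hypothetical. The point is only that one step function is called on every frame with the user's audio tokens, and on every frame it emits both a text token and agent audio tokens. No upstream component decides when a turn has ended.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Tuple

CODEBOOKS = 8  # assumed number of codec tokens per stream per frame
PAD_TEXT = 0   # hypothetical text id meaning "no word on this frame"
SILENCE = [0] * CODEBOOKS  # hypothetical token frame standing in for silence


@dataclass
class Frame:
    text: int               # inner-monologue token for the agent's speech
    agent_audio: List[int]  # agent stream codec tokens
    user_audio: List[int]   # user stream codec tokens


# A step maps (history so far, current user frame) to (text token, agent frame).
StepFn = Callable[[List[Frame], List[int]], Tuple[int, List[int]]]


def full_duplex_loop(user_frames: Iterable[List[int]], step: StepFn) -> Iterator[Frame]:
    """Advance both streams one frame at a time; nothing gates the step call."""
    history: List[Frame] = []
    for user_audio in user_frames:
        text, agent_audio = step(history, user_audio)
        frame = Frame(text, agent_audio, user_audio)
        history.append(frame)
        yield frame


def toy_step(history: List[Frame], user_audio: List[int]) -> Tuple[int, List[int]]:
    """Placeholder for the learned model: emit silence while the user is audible."""
    if all(token == 0 for token in user_audio):
        return 1, [7] * CODEBOOKS  # arbitrary "speaking" tokens
    return PAD_TEXT, list(SILENCE)


if __name__ == "__main__":
    user = [[3] * CODEBOOKS, [3] * CODEBOOKS, [0] * CODEBOOKS]
    for frame in full_duplex_loop(user, toy_step):
        print(frame.text, frame.agent_audio)
```

In the real model the hard-coded rule in `toy_step` is replaced by learned next-token prediction over both streams, which is what allows overlap and interruptions rather than strict alternation.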

Primary use cases

real-time full-duplex voice conversation
native turn-taking without a separate VAD or endpointing pipeline
low-latency speech-to-speech dialogue

flowchart TD
    fw["Moshi (Kyutai)"]
    fw --> p1["Semantic Turn Endpointing<br/>(first-class)"]
    fw --> p2["Unified Voice Interface<br/>(supported)"]

Key concepts

Two parallel audio streams → semantic-turn-endpointing (docs) — Moshi models its own and the user's speech as two streams, enabling simultaneous speech and native turn-taking without separate VAD.
Mimi neural audio codec (docs) — A streaming neural audio codec; Moshi generates speech tokens from its residual quantizer.
Inner monologue (docs) — Moshi predicts text tokens corresponding to its own speech, which improves generation quality.
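To make "tokens from a residual quantizer" concrete, here is a toy residual vector quantizer in plain Python. It is a sketch of the general technique only. Mimi's actual codebooks are learned and far larger, and the two tiny 2-D codebooks in `TOY_CODEBOOKS` are invented for illustration. Each stage quantizes what the previous stage left over, so one frame becomes one token per stage.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Index of the codebook entry closest to x (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], x)),
    )


def rvq_encode(x: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Quantize x stage by stage; each stage encodes the previous residual."""
    residual = list(x)
    tokens: List[int] = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> List[float]:
    """Reconstruct by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical codebooks: a coarse first stage and a finer second stage.
TOY_CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]

if __name__ == "__main__":
    tokens = rvq_encode([1.3, 0.9], TOY_CODEBOOKS)
    print(tokens)                             # [1, 1]
    print(rvq_decode(tokens, TOY_CODEBOOKS))  # [1.25, 1.0]
```

The first token carries the coarse shape and later tokens refine it. This is why a speech model can treat the per-frame tokens as an ordered set rather than a single large vocabulary.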

Patterns this full-code implements —

References

Provenance

Last analyzed: 2026-06-27
Last updated: 2026-06-27
Verification status: partial