RouteLLM
Type: full-code · Vendor: LM-Sys · Language: Python · License: Apache-2.0 · Status: active · Status in practice: experimental
Research framework for training and serving LLM routers that dynamically dispatch each query between a stronger, more expensive model and a cheaper but weaker model based on a learned difficulty score and a per-request cost threshold.
Description. RouteLLM is LM-Sys's open-source framework for cost-effective LLM serving. It trains learned routers — matrix factorisation (mf), similarity-weighted Elo ranking (sw_ranking), and a BERT classifier — on preference data so that incoming queries are dispatched to a strong (expensive) or weak (cheap) model based on a predicted win-rate score against a caller-supplied cost threshold. The project ships an OpenAI-compatible server that drops in as a routing layer in front of two model endpoints, plus benchmarks documenting >2x cost reduction at fixed quality on MT-Bench, MMLU, and GSM8K. Companion paper: arXiv 2406.18665.
Agent loop shape. Routing proxy in front of two model endpoints. A trained router (mf / sw_ranking / bert) scores each incoming request for predicted strong-vs-weak win rate; if the score exceeds the request's cost threshold, the query is dispatched to the strong model, otherwise to the weak model. The response is returned through an OpenAI-compatible API. Routing is stateless per request; the router is trained offline on preference data and loaded as a model artefact.
Primary use cases
- research on learned LLM routing
- cost-aware serving across strong/weak model pairs
- drop-in OpenAI-compatible routing proxy
- benchmarking routing strategies on MT-Bench / MMLU / GSM8K
Key concepts
- Matrix factorisation router (mf) (docs) — Recommended router type; a matrix-factorisation model trained on preference data to score strong-vs-weak win rate for a query.
- BERT classifier router (bert) (docs) — BERT classifier trained on preference data to predict whether the strong model would win against the weak model on this query.
- Similarity-weighted Elo router (sw_ranking) (docs) — Weighted Elo calculation where each preference vote is weighted by similarity to the user's prompt.
- Strong vs weak two-model routing (docs) — RouteLLM focuses on routing between exactly two models: one stronger and more expensive, one cheaper but weaker.
- Cost threshold → complexity-based-routing (docs) — Per-request scalar that determines the cost-quality tradeoff; if the router's predicted win rate exceeds it the strong model is used.
- Preference-data training (docs) — Routers are trained offline on human preference data (e.g. Chatbot Arena votes plus augmentations) rather than hand-coded rules.
Patterns this full-code implements —
- ★★Multi-Model Routing
Core function: route every query to one of two configured models (strong / weak) via a learned router.
- ★Complexity-Based Routing
The trained router scores predicted strong-vs-weak win rate (a proxy for query difficulty) and compares it against a caller-supplied cost threshold — the canonical complexity-based routing shape.