RouteLLM

Type: full-code · Vendor: LM-Sys · Language: Python · License: Apache-2.0 · Status: active · Status in practice: experimental

Research framework for training and serving LLM routers that dynamically dispatch each query between a stronger, more expensive model and a cheaper but weaker model based on a learned difficulty score and a per-request cost threshold.

Description. RouteLLM is LM-Sys's open-source framework for cost-effective LLM serving. It trains learned routers on preference data: matrix factorisation (mf), similarity-weighted Elo ranking (sw_ranking), and a BERT classifier (bert). Each router predicts the win rate of the strong model over the weak model for an incoming query, and the query is dispatched to the strong (expensive) or weak (cheap) model by comparing that prediction against a caller-supplied cost threshold. The project ships an OpenAI-compatible server that drops in as a routing layer in front of two model endpoints, plus benchmarks reporting cost reductions of more than 2x at comparable quality on MT-Bench, MMLU, and GSM8K. Companion paper: arXiv 2406.18665.
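Because the server speaks the OpenAI API, the router choice and cost threshold have to travel inside an ordinary request. The sketch below assumes a model string of the form `router-<name>-<threshold>` (for example `router-mf-0.11593`), which is our recollection of the upstream convention and should be checked against the repository. `RouteSpec` and `parse_router_model` are hypothetical helper names, not part of RouteLLM.

```python
from typing import NamedTuple


class RouteSpec(NamedTuple):
    """Router name and cost threshold extracted from a model string."""
    router: str
    threshold: float


def parse_router_model(model: str) -> RouteSpec:
    """Parse 'router-<name>-<threshold>' (assumed format) into its parts."""
    prefix, sep, rest = model.partition("-")
    if prefix != "router" or not sep:
        raise ValueError(f"not a router model string: {model!r}")
    # Router names may contain underscores (e.g. sw_ranking), so split on
    # the LAST hyphen to isolate the threshold.
    name, sep, raw_threshold = rest.rpartition("-")
    if not sep or not name:
        raise ValueError(f"missing router name or threshold: {model!r}")
    threshold = float(raw_threshold)  # raises ValueError if not numeric
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(f"threshold must lie in [0, 1], got {threshold}")
    return RouteSpec(router=name, threshold=threshold)


spec = parse_router_model("router-mf-0.11593")
```

Encoding both values in the `model` field keeps the request schema unchanged, so existing OpenAI clients need no extra parameters.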

Agent loop shape. Routing proxy in front of two model endpoints. A trained router (mf / sw_ranking / bert) scores each incoming request for predicted strong-vs-weak win rate; if the score exceeds the request's cost threshold, the query is dispatched to the strong model, otherwise to the weak model. The response is returned through an OpenAI-compatible API. Routing is stateless per request; the router is trained offline on preference data and loaded as a model artefact.
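The per-request decision described above can be sketched in a few lines. This is a minimal illustration, not RouteLLM's implementation: `ModelPair`, `route`, and `toy_win_rate` are hypothetical names, and the word-count scorer is only a stand-in for a trained router such as mf or bert.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ModelPair:
    """The two endpoints a router chooses between."""
    strong: str
    weak: str


def toy_win_rate(prompt: str) -> float:
    """Stand-in scorer (assumption): longer prompts count as harder.

    A real router would return a learned estimate of the probability
    that the strong model beats the weak model on this prompt.
    """
    return min(1.0, len(prompt.split()) / 50)


def route(
    prompt: str,
    score_fn: Callable[[str], float],
    threshold: float,
    pair: ModelPair,
) -> str:
    """Return the strong model if the predicted win rate exceeds the threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0, 1]")
    # Stateless: the decision depends only on this prompt and this threshold.
    return pair.strong if score_fn(prompt) > threshold else pair.weak


pair = ModelPair(strong="strong-model", weak="weak-model")
short_choice = route("What is 2 + 2?", toy_win_rate, 0.5, pair)
long_choice = route(" ".join(["step"] * 60), toy_win_rate, 0.5, pair)
```

Raising the threshold sends fewer queries to the strong model and lowers cost; lowering it does the opposite.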

Primary use cases

research on learned LLM routing
cost-aware serving across strong/weak model pairs
drop-in OpenAI-compatible routing proxy
benchmarking routing strategies on MT-Bench / MMLU / GSM8K

flowchart TD
    fw["RouteLLM"]
    fw --> p1["Open-Weight Cascade<br/>(core)"]
    fw --> p2["Multi-Model Routing<br/>(first-class)"]
    fw --> p3["Complexity-Based Routing<br/>(first-class)"]

Key concepts

Matrix factorisation router (mf): Recommended router type; a matrix-factorisation model trained on preference data to score the strong-vs-weak win rate for a query.
BERT classifier router (bert): BERT classifier trained on preference data to predict whether the strong model would beat the weak model on a given query.
Similarity-weighted Elo router (sw_ranking): Weighted Elo calculation in which each preference vote is weighted by its similarity to the user's prompt.
Strong vs weak two-model routing: RouteLLM focuses on routing between exactly two models: one stronger and more expensive, one cheaper but weaker.
Cost threshold (see Complexity-Based Routing): Per-request scalar that sets the cost-quality tradeoff; if the router's predicted win rate exceeds it, the strong model is used.
Preference-data training: Routers are trained offline on human preference data (e.g. Chatbot Arena votes plus augmentations) rather than hand-coded rules.
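To make the similarity-weighted Elo idea concrete, here is a simplified sketch: each historical vote updates the two ratings with a step scaled by the cosine similarity between the vote's prompt embedding and the query embedding. It is a plain online Elo update under assumed inputs (pre-computed embeddings, a K-factor of 32), and the project's own fitting procedure may differ. All function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Vote = Tuple[Sequence[float], bool]  # (prompt embedding, strong model won?)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def weighted_strong_win_rate(
    query: Sequence[float], votes: Sequence[Vote], k: float = 32.0
) -> float:
    """Predicted strong-vs-weak win rate for the query embedding."""
    strong, weak = 1000.0, 1000.0
    for embedding, strong_won in votes:
        # Votes on dissimilar prompts contribute little or nothing.
        weight = max(0.0, cosine(query, embedding))
        outcome = 1.0 if strong_won else 0.0
        delta = k * weight * (outcome - expected_score(strong, weak))
        strong += delta
        weak -= delta
    return expected_score(strong, weak)


votes = [([1.0, 0.0], True)] * 3 + [([0.0, 1.0], False)] * 3
near_strong_wins = weighted_strong_win_rate([1.0, 0.0], votes)
near_weak_wins = weighted_strong_win_rate([0.0, 1.0], votes)
```

The same vote history yields different win rates for different queries, which is what lets a single preference dataset drive per-query routing.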