Full-Code · Orchestration Frameworks · active

RouteLLM

Type: full-code  ·  Vendor: LM-Sys  ·  Language: Python  ·  License: Apache-2.0  ·  Status: active  ·  Status in practice: experimental

Research framework for training and serving LLM routers that dynamically dispatch each query between a stronger, more expensive model and a cheaper but weaker model based on a learned difficulty score and a per-request cost threshold.

Description. RouteLLM is LM-Sys's open-source framework for cost-effective LLM serving. It trains learned routers on preference data: matrix factorisation (mf), similarity-weighted Elo ranking (sw_ranking), and a BERT classifier (bert). Each incoming query is dispatched to the strong (expensive) model when the router's predicted strong-model win rate exceeds a caller-supplied cost threshold, and to the weak (cheap) model otherwise. The project ships an OpenAI-compatible server that drops in as a routing layer in front of two model endpoints, plus benchmarks reporting more than 2x cost reduction at fixed quality on MT-Bench, MMLU, and GSM8K. Companion paper: arXiv 2406.18665.

Agent loop shape. Routing proxy in front of two model endpoints. A trained router (mf / sw_ranking / bert) scores each incoming request for predicted strong-vs-weak win rate; if the score exceeds the request's cost threshold, the query is dispatched to the strong model, otherwise to the weak model. The response is returned through an OpenAI-compatible API. Routing is stateless per request; the router is trained offline on preference data and loaded as a model artefact.
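The dispatch rule described above can be sketched in a few lines. This is a minimal illustration, not RouteLLM's actual code: `ModelPair`, `route`, and `toy_win_rate` are hypothetical names, and the word-count scorer is only a stand-in for a trained router (mf / sw_ranking / bert).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ModelPair:
    """Hypothetical container for the two endpoints being routed between."""
    strong: str
    weak: str


def route(
    prompt: str,
    threshold: float,
    pair: ModelPair,
    win_rate: Callable[[str], float],
) -> str:
    """Return the model to call: strong if the predicted win rate exceeds the threshold."""
    score = win_rate(prompt)  # predicted probability that the strong model wins
    return pair.strong if score > threshold else pair.weak


def toy_win_rate(prompt: str) -> float:
    """Stand-in scorer (assumption): longer prompts are treated as harder."""
    return min(len(prompt.split()) / 20.0, 1.0)


pair = ModelPair(strong="strong-model", weak="weak-model")

# Stateless per request: each call depends only on the prompt and the threshold.
easy = route("What is 2 + 2?", 0.5, pair, toy_win_rate)
hard = route(" ".join(["prove"] * 30), 0.5, pair, toy_win_rate)
print(easy, hard)
```

Raising the threshold sends fewer queries to the strong model (cheaper, lower quality); lowering it sends more.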

Primary use cases

  • research on learned LLM routing
  • cost-aware serving across strong/weak model pairs
  • drop-in OpenAI-compatible routing proxy
  • benchmarking routing strategies on MT-Bench / MMLU / GSM8K

Key concepts

  • Matrix factorisation router (mf): Recommended router type; a matrix-factorisation model trained on preference data to score the strong-vs-weak win rate for a query.
  • BERT classifier router (bert): BERT classifier trained on preference data to predict whether the strong model would win against the weak model on a given query.
  • Similarity-weighted Elo router (sw_ranking): Weighted Elo calculation in which each preference vote is weighted by its similarity to the user's prompt.
  • Strong vs weak two-model routing: RouteLLM focuses on routing between exactly two models: one stronger and more expensive, one cheaper but weaker.
  • Cost threshold (complexity-based-routing): Per-request scalar that sets the cost-quality tradeoff; if the router's predicted win rate exceeds it, the strong model is used.
  • Preference-data training: Routers are trained offline on human preference data (e.g. Chatbot Arena votes plus augmentations) rather than hand-coded rules.
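One way to choose a cost threshold is to calibrate it against router scores from a sample of representative queries, so that roughly a target share of traffic reaches the strong model. The sketch below illustrates that idea only; `calibrate_threshold` is a hypothetical helper, the sample scores are made up, and it assumes the "score exceeds threshold" rule described above.

```python
import math
from typing import Sequence


def calibrate_threshold(scores: Sequence[float], strong_fraction: float) -> float:
    """Return a threshold such that about `strong_fraction` of `scores` exceed it.

    `scores` are predicted strong-model win rates on sample queries.
    With tied scores the achieved fraction can be lower than requested.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    if not 0.0 <= strong_fraction <= 1.0:
        raise ValueError("strong_fraction must be in [0, 1]")

    ranked = sorted(scores, reverse=True)
    k = round(strong_fraction * len(ranked))  # queries to send to the strong model
    if k >= len(ranked):
        return -math.inf  # every score exceeds this, so everything goes strong
    # The (k+1)-th highest score: exactly the top k scores lie strictly above it.
    return ranked[k]


sample_scores = [0.1, 0.4, 0.6, 0.9]  # made-up win-rate predictions
threshold = calibrate_threshold(sample_scores, strong_fraction=0.5)
routed_strong = sum(score > threshold for score in sample_scores)
print(threshold, routed_strong)
```

The calibration depends on the query sample, so a threshold tuned on one workload may route a different share of another workload to the strong model.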

Provenance

  • Last analyzed:
  • Last updated:
  • Verification status: verified