vLLM

Type: full-code · Vendor: vLLM Project (originally UC Berkeley Sky Computing Lab) · Language: Python · License: Apache-2.0 · Status: active · Status in practice: mature · First released: 2023-06-20

Links: homepage docs repo

vLLM is an open-source LLM inference and serving engine the operator runs on its own hardware to serve models behind an OpenAI-compatible API.

Description. vLLM is an open-source library for LLM inference and serving that the operator deploys on its own GPUs or CPUs. It manages attention key and value memory with PagedAttention and batches incoming requests continuously to raise throughput. It exposes an OpenAI-compatible API server so existing clients point at a self-hosted endpoint. The project supports a range of hardware including NVIDIA, AMD, and CPU targets, and is released under Apache 2.0.

Agent loop shape. vLLM is the inference backend, not the agent loop itself. An agent or application sends generation requests to a vLLM server through its OpenAI-compatible API; vLLM batches incoming requests continuously, manages attention key and value memory with PagedAttention, and returns generations to the caller, who runs the surrounding agent orchestration.

Primary use cases

self-hosted LLM inference and serving
high-throughput batched model serving
OpenAI-compatible endpoints on operator-owned hardware
running open-weight models across diverse accelerators

flowchart TD fw["vLLM"] fw --> p1["Sovereign Inference Stack<br/>(core)"] fw --> p2["Prompt Caching<br/>(supported)"]

Key concepts

PagedAttention → prompt-caching (docs) — vLLM's memory-management technique that stores the attention key/value cache in non-contiguous pages, so concurrent requests share GPU memory and common prefixes are reused instead of duplicated.
Continuous batching (docs) — The scheduler that admits and retires requests at the token step rather than at the batch boundary, keeping the accelerator busy and raising request throughput under mixed load.
OpenAI-compatible API server (docs) — A drop-in server exposing /v1/chat/completions and related endpoints, so clients written against the OpenAI SDK point at a self-hosted vLLM endpoint without code changes.
LLM / engine entrypoint (docs) — vLLM's in-process generation interface (the LLM class and the underlying engine) for offline batched inference, distinct from the long-running API server.

Patterns this full-code implements —

Neighbourhood

Click any neighbour to follow the lineage. Scroll to zoom, drag to pan.

Alternatives & relatives

app · closed product
OpenRouter
competes-with
OpenRouter routes requests to hosted model providers behind one API, whereas vLLM is the engine an operator runs on its own hardware to be one of those endpoints; they sit on opposite sides of the se…

vLLM

Neighbourhood

Alternatives & relatives

Listed as alternative by (1)

References

Provenance