vLLM
Type: full-code · Vendor: vLLM Project (originally UC Berkeley Sky Computing Lab) · Language: Python · License: Apache-2.0 · Status: active · Status in practice: mature · First released: 2023-06-20
vLLM is an open-source LLM inference and serving engine the operator runs on its own hardware to serve models behind an OpenAI-compatible API.
Description. vLLM is an open-source library for LLM inference and serving that the operator deploys on its own GPUs or CPUs. It manages attention key and value memory with PagedAttention and batches incoming requests continuously to raise throughput. It exposes an OpenAI-compatible API server so existing clients point at a self-hosted endpoint. The project supports a range of hardware including NVIDIA, AMD, and CPU targets, and is released under Apache 2.0.
Agent loop shape. vLLM is the inference backend, not the agent loop itself. An agent or application sends generation requests to a vLLM server through its OpenAI-compatible API; vLLM batches incoming requests continuously, manages attention key and value memory with PagedAttention, and returns generations to the caller, who runs the surrounding agent orchestration.
Primary use cases
- self-hosted LLM inference and serving
- high-throughput batched model serving
- OpenAI-compatible endpoints on operator-owned hardware
- running open-weight models across diverse accelerators
Key concepts
- PagedAttention → prompt-caching (docs) — vLLM's memory-management technique that stores the attention key/value cache in non-contiguous pages, so concurrent requests share GPU memory and common prefixes are reused instead of duplicated.
- Continuous batching (docs) — The scheduler that admits and retires requests at the token step rather than at the batch boundary, keeping the accelerator busy and raising request throughput under mixed load.
- OpenAI-compatible API server (docs) — A drop-in server exposing /v1/chat/completions and related endpoints, so clients written against the OpenAI SDK point at a self-hosted vLLM endpoint without code changes.
- LLM / engine entrypoint (docs) — vLLM's in-process generation interface (the LLM class and the underlying engine) for offline batched inference, distinct from the long-running API server.
Patterns this full-code implements —
- ★Sovereign Inference Stack
vLLM is an open-source LLM inference and serving engine the operator deploys on its own GPUs/CPUs, keeping the entire inference stack inside infrastructure the operator controls.
- ★★Prompt Caching
vLLM serves prefix caching as a built-in serving feature, reusing the cached unchanging prompt prefix across requests so repeated prefixes cut recomputation; the engine is the provider side that make…
Neighbourhood
Click any neighbour to follow the lineage. Scroll to zoom, drag to pan.