Ollama

Type: full-code · Vendor: Ollama · Language: Go · License: MIT · Status: active · Status in practice: mature · First released: 2023-07-07

Links: homepage docs repo

Ollama downloads and runs open-weight language models on the operator's own machine and serves them over a local API, including an OpenAI-compatible endpoint, so applications can run inference without sending prompts to a third party.

Description. Ollama is an open-source runtime for running large language models locally. It pulls model weights from a library, runs inference on the local machine, and exposes a local REST API on port 11434. It also provides an OpenAI-compatible endpoint so existing OpenAI client code can be pointed at the local instance by changing the base URL. Because inference runs locally, prompts and outputs do not leave the operator's machine unless cloud models are explicitly used.

Agent loop shape. Ollama is the inference engine an agent loop calls rather than the loop itself. A model is pulled into the local library and loaded on demand; the application sends a request naming the model to the local API (native or the OpenAI-compatible /v1 path) and receives a completion. The same endpoint serves any pulled model, so the caller selects which model answers by passing its name, and all computation stays on the local host.

Primary use cases

running open-weight LLMs locally
serving local inference over a REST API
OpenAI-compatible drop-in for existing clients
keeping prompts and data on the operator's own machine

flowchart TD fw["Ollama"] fw --> p1["Sovereign Inference Stack<br/>(core)"] fw --> p2["Provider-String Routing<br/>(supported)"] fw --> p3["Tool Use<br/>(supported)"] fw --> p4["Structured Output<br/>(supported)"]

Key concepts

Model library / pull (docs) — A registry of open-weight models that `ollama pull` downloads to the local machine; `ollama ls` lists the local catalog the runtime can load on demand.
Local API (port 11434) → sovereign-inference-stack (docs) — The REST endpoint Ollama serves on the local host for generate/chat/embeddings requests, so applications send a model name and receive completions without leaving the machine.
OpenAI compatibility → provider-string-routing (docs) — A parallel /v1 endpoint that mimics the OpenAI API so existing OpenAI client code reaches a local model just by changing the base URL and model name.
Modelfile (docs) — A declarative file specifying a base model plus parameters, system prompt, and template, used to build customized local models with `ollama create`.