Promptfoo

Type: full-code · Vendor: Promptfoo, Inc. · Language: TypeScript · License: MIT · Status: active · Status in practice: mature · First released: 2023


Promptfoo is an open-source command-line tool that runs declarative assertion-based test suites against prompts, models, and RAG or agent systems, and can red-team them for vulnerabilities.

Description. Promptfoo evaluates prompts, models, and RAG or agent pipelines against a YAML test suite of assertions. Each test case passes or fails, and the run exits with a non-zero code when any test fails, which lets a CI job gate on the result. Assertions include deterministic checks and model-graded checks such as llm-rubric, where an LLM grades the output against custom criteria. A separate red-teaming mode generates simulated adversarial inputs to find vulnerabilities before deployment.
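To make "declarative test suite" concrete, here is a minimal sketch of what such a suite might contain, written as a TypeScript object literal rather than YAML. The key names (prompts, providers, tests, vars, assert, type, value) follow the YAML layout as commonly documented, but the SuiteConfig interface is a hypothetical local type, not one exported by Promptfoo, and the provider IDs and test data are placeholders. Check the configuration reference before relying on any of it.

```typescript
// Hypothetical local type that mirrors the shape of a YAML suite.
// It is NOT a type exported by Promptfoo.
interface SuiteConfig {
  prompts: string[];
  providers: string[];
  tests: {
    vars: Record<string, string>;
    assert: { type: string; value: string }[];
  }[];
}

const suite: SuiteConfig = {
  // One prompt template with a {{text}} variable filled in per test case.
  prompts: ["Summarize in one sentence: {{text}}"],
  // Placeholder provider IDs; listing two runs the same tests against both.
  providers: ["openai:gpt-4o-mini", "anthropic:messages:claude-3-5-sonnet-20241022"],
  tests: [
    {
      vars: { text: "Promptfoo runs assertion-based test suites against LLM outputs." },
      assert: [
        // Deterministic check: the output must contain this substring.
        { type: "contains", value: "Promptfoo" },
        // Model-graded check: an LLM judges the output against this criterion.
        { type: "llm-rubric", value: "Is a single sentence and does not add facts" },
      ],
    },
  ],
};

// Count how many assertions the suite declares across all test cases.
const assertionCount = suite.tests.reduce((n, t) => n + t.assert.length, 0);
```

The same structure expressed in YAML is what the command-line tool would normally read; the literal form is used here only so the shape can be type-checked in isolation.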

Agent loop shape. Promptfoo has no agent loop of its own. It is run from the command line over a configuration that lists prompts, providers, and test cases with assertions. For each test case it calls the configured provider, applies each assertion to the output, and aggregates pass or fail results, exiting non-zero in CI on any failure. In red-team mode it instead generates adversarial inputs and runs them against the target to surface failures.
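The evaluation pass described above can be sketched as a small loop. This is an illustration of the control flow only, not Promptfoo's implementation: the Provider, Assertion, TestCase, and CaseResult types, the render helper, and the stub providers are all hypothetical, and the providers are synchronous stubs where a real run would make asynchronous model calls.

```typescript
// Hypothetical types for illustration; these are not Promptfoo's internal API.
type Provider = { id: string; call: (prompt: string) => string };
type Assertion = { label: string; check: (output: string) => boolean };
type TestCase = { vars: Record<string, string>; assertions: Assertion[] };
type CaseResult = { provider: string; prompt: string; output: string; failed: string[] };

// Fill {{key}} placeholders in a prompt template from the test case's vars.
function render(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_m: string, key: string) => vars[key] ?? "");
}

// For every prompt x provider x test case: call the provider,
// apply each assertion to the output, and record which ones failed.
function runSuite(prompts: string[], providers: Provider[], tests: TestCase[]): CaseResult[] {
  const out: CaseResult[] = [];
  for (const template of prompts) {
    for (const provider of providers) {
      for (const testCase of tests) {
        const prompt = render(template, testCase.vars);
        const output = provider.call(prompt);
        const failed = testCase.assertions
          .filter((a) => !a.check(output))
          .map((a) => a.label);
        out.push({ provider: provider.id, prompt, output, failed });
      }
    }
  }
  return out;
}

// Any failed case makes the run non-zero; the value 1 is a placeholder.
function exitCodeFor(results: CaseResult[]): number {
  return results.some((r) => r.failed.length > 0) ? 1 : 0;
}

// Stub providers standing in for real model endpoints.
const echoProvider: Provider = { id: "stub:echo", call: (p) => `You said: ${p}` };
const upperProvider: Provider = { id: "stub:upper", call: (p) => p.toUpperCase() };

const results = runSuite(
  ["Greet {{person}}"],
  [echoProvider, upperProvider],
  [
    {
      vars: { person: "Ada" },
      assertions: [{ label: "contains Ada", check: (o) => o.indexOf("Ada") !== -1 }],
    },
  ],
);
const exitCode = exitCodeFor(results);
```

In this sketch the echo stub passes and the upper-casing stub fails the case-sensitive check, so exitCode is 1. A CI step would treat that value as a failed build.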

Primary use cases

assertion-based prompt and model evaluation in CI
model-graded scoring of open-ended outputs
red-teaming LLM applications for vulnerabilities

flowchart TD
    fw["Promptfoo"]
    fw --> p1["Eval as Contract<br/>(core)"]
    fw --> p2["LLM-as-Judge<br/>(first-class)"]
    fw --> p3["Red-Team Sandbox Reproduction<br/>(first-class)"]
    fw --> p4["Prompt Variant Evaluation<br/>(first-class)"]

Key concepts

Assertion → eval-as-contract (docs) — A declarative output check attached to a test case; deterministic assertions (contains, cost, latency) and model-graded assertions (llm-rubric) together decide whether the case passes.
llm-rubric → llm-as-judge (docs) — A model-graded assertion type where an LLM grades a free-form output against stated criteria, used when no exact-match check fits the expected behaviour.
Provider → prompt-variant-evaluation (docs) — A configured model or endpoint under test; listing several providers lets one test suite run across multiple models for side-by-side comparison.
Red team (Promptfoo) → red-team-sandbox-reproduction (docs) — A mode that curates and generates a diverse set of malicious intents targeting potential vulnerabilities and runs them against the application, either as one-off scans or continuously in the deployment pipeline.
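The first two concepts above, deterministic assertions and a model-graded rubric jointly deciding a case, can be sketched as follows. Everything here is hypothetical scaffolding rather than Promptfoo's API: the Check union, the Grader signature, and especially stubGrader, which is a keyword heuristic standing in for the LLM call that a real llm-rubric assertion would make.

```typescript
// Hypothetical shapes for illustration; not Promptfoo's API.
type Verdict = { pass: boolean; reason: string };
type Grader = (output: string, rubric: string) => Verdict;
type Observation = { output: string; latencyMs: number };
type Check =
  | { type: "contains"; value: string }
  | { type: "latency"; thresholdMs: number }
  | { type: "rubric"; value: string };

// Deterministic checks are computed directly; the rubric check is delegated to a grader.
function applyCheck(check: Check, obs: Observation, grade: Grader): Verdict {
  switch (check.type) {
    case "contains": {
      const pass = obs.output.indexOf(check.value) !== -1;
      return { pass, reason: pass ? "substring found" : `missing "${check.value}"` };
    }
    case "latency": {
      const pass = obs.latencyMs <= check.thresholdMs;
      return { pass, reason: `${obs.latencyMs} ms vs limit ${check.thresholdMs} ms` };
    }
    case "rubric":
      return grade(obs.output, check.value);
  }
}

// A case passes only if every check passes.
function casePasses(checks: Check[], obs: Observation, grade: Grader): boolean {
  return checks.every((c) => applyCheck(c, obs, grade).pass);
}

// Stub grader: a keyword heuristic used in place of a real LLM judge.
const stubGrader: Grader = (output, _rubric) => {
  const pass = /\b(please|thank)/i.test(output);
  return { pass, reason: pass ? "polite wording detected" : "no polite wording detected" };
};

const checks: Check[] = [
  { type: "contains", value: "refund" },
  { type: "latency", thresholdMs: 2000 },
  { type: "rubric", value: "The reply is polite" },
];

const politeReply = casePasses(
  checks,
  { output: "Thank you, your refund is on its way.", latencyMs: 850 },
  stubGrader,
);
const curtReply = casePasses(checks, { output: "refund denied", latencyMs: 850 }, stubGrader);
```

Passing the grader in as a parameter keeps the deterministic checks testable without a model call, which is one reason to separate the two kinds of assertion in a sketch like this.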
