Meta LlamaFirewall (AlignmentCheck)

Type: full-code · Vendor: Meta · Language: Python · License: MIT · Status: active · Status in practice: emerging · First released: 2025-04-29

LlamaFirewall is a guardrail framework that runs a set of scanners on the inputs, outputs, and reasoning trace of an LLM agent to detect and mitigate prompt injection, goal hijacking, and misalignment before the agent acts.

Description. LlamaFirewall is an open-source security framework from Meta's Purple Llama project that wraps an LLM agent with composable scanners. PromptGuard classifies user inputs and untrusted tool content for direct prompt injection and jailbreak attempts, while AlignmentCheck audits the agent's chain-of-thought in real time for goal hijacking and indirect injection. CodeShield statically analyzes LLM-generated code for security issues, and customizable regex filters cover additional risks. The scanners can be plugged into different stages of an agent's workflow, from raw input ingestion to final output.
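As an illustration of the "customizable regex filters" idea, here is a minimal standalone sketch in plain Python. It does not use the LlamaFirewall package: the names `RegexFinding`, `RULES`, and `regex_scan` and both example patterns are hypothetical, chosen only to show how a rule table can flag text.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RegexFinding:
    rule: str     # name of the rule that matched
    excerpt: str  # the text that triggered the rule


# Hypothetical rules for illustration only; a real deployment would tune its own.
RULES = {
    "ignore_instructions": re.compile(
        r"ignore (all |any )?(previous|prior) instructions", re.IGNORECASE
    ),
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def regex_scan(text: str, rules=RULES) -> list[RegexFinding]:
    """Return one finding per rule that matches anywhere in the text."""
    findings = []
    for name, pattern in rules.items():
        match = pattern.search(text)
        if match:
            findings.append(RegexFinding(rule=name, excerpt=match.group(0)))
    return findings
```

An empty result means no rule fired. Regex rules like these are cheap and transparent but easy to evade, which is why they are described above as covering additional risks alongside the other scanners, not as a substitute for them.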

Agent loop shape. LlamaFirewall acts as a policy engine that orchestrates security scanners across stages of the agent's workflow. Untrusted inputs and tool content are scanned by PromptGuard for direct prompt injection before they reach the model; as the agent reasons, AlignmentCheck inspects the running chain-of-thought for goal hijacking and indirect injection; code outputs are checked by CodeShield. A scanner that flags a risk can block the input or action so the agent does not proceed on poisoned content.
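The loop shape described above can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not LlamaFirewall's API: `Verdict`, `guarded_run`, and the three `toy_*` checks are hypothetical names, and the substring checks merely stand in for PromptGuard (a classifier), AlignmentCheck (an LLM-based audit), and CodeShield (static analysis).

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Verdict:
    blocked: bool
    stage: Optional[str] = None   # stage at which the run was stopped
    reason: Optional[str] = None  # why it was stopped


# A check returns a reason string when it flags the text, otherwise None.
Check = Callable[[str], Optional[str]]


def guarded_run(
    user_input: str,
    steps: Iterable[tuple[str, str]],  # (kind, text); kind is "tool", "thought" or "code"
    input_check: Check,
    trace_check: Check,
    code_check: Check,
) -> Verdict:
    """Scan each stage in order and stop at the first flagged item."""
    reason = input_check(user_input)
    if reason:
        return Verdict(True, "input", reason)

    trace: list[str] = []  # running chain-of-thought seen so far
    for kind, text in steps:
        if kind == "tool":
            stage, reason = "tool", input_check(text)  # untrusted tool content
        elif kind == "code":
            stage, reason = "code", code_check(text)
        else:
            trace.append(text)
            stage, reason = "reasoning", trace_check("\n".join(trace))
        if reason:
            return Verdict(True, stage, reason)
    return Verdict(False)


# Toy stand-ins (assumptions): simple substring tests, not real detectors.
def toy_input_check(text: str) -> Optional[str]:
    return "injection phrase" if "ignore previous instructions" in text.lower() else None


def toy_trace_check(trace: str) -> Optional[str]:
    return "goal drift" if "send the credentials" in trace.lower() else None


def toy_code_check(code: str) -> Optional[str]:
    return "use of eval" if "eval(" in code else None
```

The point of the sketch is the control flow: each stage has its own check, the reasoning check sees the whole trace so far instead of a single step, and a flag at any stage ends the run before the agent acts on the flagged content.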

Primary use cases

scanning agent inputs for prompt injection and jailbreaks
auditing an agent's reasoning trace for goal hijacking
detecting misalignment in multi-step agentic operations
layered guardrails around LLM agents

flowchart TD
    fw["Meta LlamaFirewall (AlignmentCheck)"]
    fw --> p1["Tool Output Poisoning Defense<br/>(core)"]
    fw --> p2["Input/Output Guardrails<br/>(core)"]
    fw --> p3["Prompt Injection Defense<br/>(first-class)"]

Key concepts

PromptGuard → prompt-injection-defense (docs) — A fast BERT-style classifier scanner that flags direct prompt-injection and jailbreak attempts in user inputs and untrusted content such as web and tool data.
AlignmentCheck → tool-output-poisoning (docs) — A scanner that audits the agent's running chain-of-thought with few-shot prompting and semantic analysis to detect goal hijacking, indirect injection, and misalignment before the agent acts.
CodeShield → input-output-guardrails (docs) — A static-analysis scanner that examines LLM-generated code for security issues in real time on the output side of the agent.
Scanner composition (docs) — The framework's plug-in model where individual scanners are attached to chosen workflow stages to compose layered defenses spanning raw input ingestion to final output actions.
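The plug-in model in the last item can be pictured as a registry that maps workflow stages to lists of scanners. The sketch below is a hypothetical illustration in plain Python: `ScannerPipeline`, `attach`, `scan`, and the stage names are assumptions made for this example and are not taken from the LlamaFirewall package.

```python
from collections import defaultdict
from typing import Callable, Optional

# A scanner returns a reason string when it flags the text, otherwise None.
Scanner = Callable[[str], Optional[str]]


class ScannerPipeline:
    """Hypothetical registry attaching scanners to named workflow stages."""

    def __init__(self) -> None:
        self._by_stage: dict[str, list[Scanner]] = defaultdict(list)

    def attach(self, stage: str, scanner: Scanner) -> "ScannerPipeline":
        # Scanners run in the order they were attached; returning self allows chaining.
        self._by_stage[stage].append(scanner)
        return self

    def scan(self, stage: str, text: str) -> list[str]:
        """Run every scanner attached to the stage and collect the reasons."""
        reasons = []
        for scanner in self._by_stage.get(stage, []):
            reason = scanner(text)
            if reason is not None:
                reasons.append(reason)
        return reasons
```

Keeping scanners as plain callables keyed by stage means a layer can be added or removed without touching the others, which is the property the composition model relies on.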

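Patterns this full-code implements: Tool Output Poisoning Defense (core), Input/Output Guardrails (core), Prompt Injection Defense (first-class).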

Anti-patterns avoided

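✕ Goal Hijacking
PromptGuard 2 detects direct jailbreak and prompt-injection attempts in real time and blocks injected instructions that would redirect the agent's goal. AlignmentCheck covers the indirect case by auditing the agent's reasoning trace for signs of hijacking before the agent acts.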
