Framework · Enterprise Platforms

Meta LlamaFirewall (AlignmentCheck)

LlamaFirewall is a guardrail framework that runs a set of scanners on the inputs, outputs, and reasoning trace of an LLM agent to detect and mitigate prompt injection, goal hijacking, and misalignment before the agent acts.

Description

LlamaFirewall is an open-source security framework from Meta's Purple Llama project that wraps an LLM agent with composable scanners. PromptGuard classifies user inputs and untrusted tool content for direct prompt injection and jailbreak attempts, while AlignmentCheck audits the agent's chain-of-thought in real time for goal hijacking and indirect injection. CodeShield statically analyzes generated code for insecure patterns, and customizable regex filters cover additional risks. Scanners can be attached at different stages of an agent's workflow, from raw input ingestion to final output.

Solution

LlamaFirewall acts as a policy engine that orchestrates security scanners across stages of the agent's workflow. Untrusted inputs and tool content are scanned by PromptGuard for direct prompt injection before they reach the model; as the agent reasons, AlignmentCheck inspects the running chain-of-thought for goal hijacking and indirect injection; code outputs are checked by CodeShield. A scanner that flags a risk can block the input or action so the agent does not proceed on poisoned content.
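
The staged orchestration described above can be illustrated with a minimal, self-contained sketch. This is not the LlamaFirewall API: the names ToyFirewall, Stage, ScanResult, and regex_scanner are hypothetical, and simple regex filters stand in for PromptGuard, AlignmentCheck, and CodeShield, which in the real framework are model-based classifiers and a static analyzer rather than pattern matches. The sketch only shows the control flow, assuming each stage has a list of scanners and the first one that flags a risk blocks the step.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class Stage(Enum):
    """Hypothetical workflow stages at which scanners can be attached."""
    INPUT = "input"          # user messages and untrusted tool content
    REASONING = "reasoning"  # the agent's running chain-of-thought
    OUTPUT = "output"        # generated code or final answer


@dataclass
class ScanResult:
    blocked: bool
    reason: str = ""


Scanner = Callable[[str], ScanResult]


def regex_scanner(name: str, pattern: str) -> Scanner:
    """Build a toy scanner that blocks any text matching `pattern`."""
    compiled = re.compile(pattern, re.IGNORECASE)

    def scan(text: str) -> ScanResult:
        if compiled.search(text):
            return ScanResult(True, f"{name} matched")
        return ScanResult(False)

    return scan


class ToyFirewall:
    """Assumed policy engine: runs each stage's scanners in order."""

    def __init__(self, scanners: Dict[Stage, List[Scanner]]):
        self.scanners = scanners

    def scan(self, stage: Stage, text: str) -> ScanResult:
        # The first scanner that flags a risk blocks the step.
        for scanner in self.scanners.get(stage, []):
            result = scanner(text)
            if result.blocked:
                return result
        return ScanResult(False)


# Illustrative patterns only; real detectors are far more capable.
firewall = ToyFirewall({
    Stage.INPUT: [regex_scanner("injection-filter",
                                r"ignore (all )?previous instructions")],
    Stage.REASONING: [regex_scanner("goal-drift-filter",
                                    r"send .* to attacker")],
    Stage.OUTPUT: [regex_scanner("unsafe-code-filter", r"\beval\(")],
})
```

An agent loop would call firewall.scan(Stage.INPUT, ...) on each incoming message or tool result, firewall.scan(Stage.REASONING, ...) on each reasoning step, and firewall.scan(Stage.OUTPUT, ...) on generated code, stopping as soon as a result has blocked set to True. Keeping the stage-to-scanner mapping as plain data is one way to make the layering configurable per deployment.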

Primary use cases

  • scanning agent inputs for prompt injection and jailbreaks
  • auditing an agent's reasoning trace for goal hijacking
  • detecting misalignment in multi-step agentic operations
  • layered guardrails around LLM agents
