Meta Llama Guard 3

Type: full-code · Vendor: Meta · Language: N/A · License: Llama 3.1 Community License · Status: active · Status in practice: mature · First released: 2024-07-23

Links: homepage docs repo

Llama Guard 3 is a fine-tuned Llama 3.1 8B safety classifier that labels prompts and responses as safe or unsafe against a fixed hazard taxonomy.

Description. Meta Llama Guard 3 is an open-weight content-safety classifier from Meta, built by fine-tuning the Llama-3.1-8B pretrained model for safety classification. It takes a prompt or a model response and generates a safe/unsafe label plus, when unsafe, the violated hazard categories. The classification can be applied to both LLM inputs and LLM outputs, giving the host application a signal to refuse out-of-policy content.

Agent loop shape. Llama Guard is a single-pass classifier rather than an agent loop: it takes a conversation prompt or response and emits a safe/unsafe label with violated categories, and the calling application decides whether to allow, block, or refuse based on that output.

Primary use cases

classifying prompts and responses as safe or unsafe
input and output content moderation for LLM apps
emitting machine-readable hazard category codes
gating tool-call and code-interpreter abuse

flowchart TD fw["Meta Llama Guard 3"] fw --> p1["Multimodal Guardrails<br/>(core)"] fw --> p2["Refusal<br/>(core)"] fw --> p3["Typed Refusal Codes<br/>(core)"] fw --> p4["Input/Output Guardrails<br/>(core)"]

Key concepts

Hazard taxonomy (S1-S14) → typed-refusal-codes (docs) — The fixed set of 14 content-safety categories, aligned to the MLCommons hazard taxonomy plus a Code Interpreter Abuse category, that the model emits as the machine-readable violation reason.
Prompt vs response classification → input-output-guardrails (docs) — The two modes the same classifier runs in — classifying the user prompt before the model answers, and classifying the model's response after — so it can guard both sides of a turn.
Safe / unsafe label → refusal (docs) — The single-token binary verdict the model generates for an input, followed by the violated category codes when the verdict is unsafe, which the host app uses as a refusal signal.

Meta Llama Guard 3

Neighbourhood

Alternatives & relatives

Listed as alternative by (2)

References

Provenance