Meta Llama Guard 3
Type: full-code · Vendor: Meta · Language: N/A · License: Llama 3.1 Community License · Status: active · Status in practice: mature · First released: 2024-07-23
Llama Guard 3 is a fine-tuned Llama 3.1 8B safety classifier that labels prompts and responses as safe or unsafe against a fixed hazard taxonomy.
Description. Meta Llama Guard 3 is an open-weight content-safety classifier from Meta, built by fine-tuning the Llama-3.1-8B pretrained model for safety classification. It takes a prompt or a model response and generates a safe/unsafe label plus, when unsafe, the violated hazard categories. The classification can be applied to both LLM inputs and LLM outputs, giving the host application a signal to refuse out-of-policy content.
Agent loop shape. Llama Guard is a single-pass classifier rather than an agent loop: it takes a conversation prompt or response and emits a safe/unsafe label with violated categories, and the calling application decides whether to allow, block, or refuse based on that output.
Primary use cases
- classifying prompts and responses as safe or unsafe
- input and output content moderation for LLM apps
- emitting machine-readable hazard category codes
- gating tool-call and code-interpreter abuse
Key concepts
- Hazard taxonomy (S1-S14) → typed-refusal-codes (docs) — The fixed set of 14 content-safety categories, aligned to the MLCommons hazard taxonomy plus a Code Interpreter Abuse category, that the model emits as the machine-readable violation reason.
- Prompt vs response classification → input-output-guardrails (docs) — The two modes the same classifier runs in — classifying the user prompt before the model answers, and classifying the model's response after — so it can guard both sides of a turn.
- Safe / unsafe label → refusal (docs) — The single-token binary verdict the model generates for an input, followed by the violated category codes when the verdict is unsafe, which the host app uses as a refusal signal.
Patterns this full-code implements —
- ★Multimodal Guardrails
A natively multimodal safety classifier evaluates mixed text-and-image prompts and responses together, classifying each as safe or unsafe across modalities rather than text only.
- ★★Refusal
A safety classifier evaluates each prompt and response against a hazard taxonomy and emits safe/unsafe plus the violated categories, giving the host application the signal to refuse out-of-policy req…
- ★Typed Refusal Codes
Defines a single fixed taxonomy of machine-readable hazard codes (S1-S14, aligned to the MLCommons taxonomy) that the guard model emits as the refusal/block reason, so unsafe classifications are tria…
- ★★Input/Output Guardrails
The same classifier model is applied to both the prompt (input guardrail) and the model response (output guardrail), classifying content on both sides of the LLM call.
Neighbourhood
Click any neighbour to follow the lineage. Scroll to zoom, drag to pan.