Meta Llama Guard 3

Type: full-code · Vendor: Meta · Language: N/A · License: Llama 3.1 Community License · Status: active · Status in practice: mature · First released: 2024-07-23

Llama Guard 3 is a fine-tuned Llama 3.1 8B safety classifier that labels prompts and responses as safe or unsafe against a fixed hazard taxonomy.

Description. Meta Llama Guard 3 is an open-weight content-safety classifier from Meta, built by fine-tuning the Llama-3.1-8B pretrained model for safety classification. It takes a prompt or a model response and generates a safe/unsafe label plus, when unsafe, the violated hazard categories. The classification can be applied to both LLM inputs and LLM outputs, giving the host application a signal to refuse out-of-policy content.

Agent loop shape. Llama Guard is a single-pass classifier rather than an agent loop: it takes a conversation prompt or response and emits a safe/unsafe label with violated categories, and the calling application decides whether to allow, block, or refuse based on that output.
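The single-pass shape described above can be sketched as a small parser plus a host-side decision. This is a minimal sketch, not Meta's reference code: it assumes the raw generation is `safe`, or `unsafe` followed by a line of comma-separated codes such as `S1,S10`. The names `Verdict`, `parse_verdict`, and `decide` are hypothetical.

```python
from dataclasses import dataclass
import re


@dataclass(frozen=True)
class Verdict:
    safe: bool
    categories: tuple[str, ...] = ()


def parse_verdict(raw: str) -> Verdict:
    """Parse classifier text such as 'safe' or 'unsafe\\nS1,S10' (assumed format)."""
    lines = [line.strip() for line in raw.strip().splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty classifier output")
    label = lines[0].lower()
    if label == "safe":
        return Verdict(safe=True)
    if label == "unsafe":
        # Assumption: category codes, if present, follow on the next line(s).
        codes = re.findall(r"S\d+", " ".join(lines[1:]))
        return Verdict(safe=False, categories=tuple(codes))
    raise ValueError(f"unexpected label: {lines[0]!r}")


def decide(verdict: Verdict) -> str:
    """Host-side policy: the classifier only labels; the application acts."""
    return "allow" if verdict.safe else "block"
```

Raising on an unrecognised label, rather than defaulting to "safe", keeps a malformed generation from silently passing the gate; a host could equally choose to treat it as a block.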

Primary use cases

  • classifying prompts and responses as safe or unsafe
  • input and output content moderation for LLM apps
  • emitting machine-readable hazard category codes
  • flagging code-interpreter abuse in tool-call workflows
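Input and output moderation for an LLM app might be wired up roughly as follows. This is a sketch under stated assumptions: `classify` and `generate` are hypothetical callables supplied by the host (for example, thin wrappers around a Llama Guard 3 deployment and the main model), and the refusal text is a placeholder.

```python
from typing import Callable, Tuple

# Hypothetical signature: classify(role, text) -> (is_safe, category_codes)
Classifier = Callable[[str, str], Tuple[bool, Tuple[str, ...]]]
Generator = Callable[[str], str]

REFUSAL = "Sorry, I can't help with that."  # placeholder wording


def guarded_turn(prompt: str, generate: Generator, classify: Classifier) -> str:
    # Input moderation: check the user prompt before the model sees it.
    safe, _codes = classify("user", prompt)
    if not safe:
        return REFUSAL
    response = generate(prompt)
    # Output moderation: check the model response before the user sees it.
    safe, _codes = classify("assistant", response)
    return response if safe else REFUSAL
```

Passing the classifier in as a callable keeps the moderation policy separate from how the model is hosted, so the same wrapper works with a local model or a remote endpoint.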

Key concepts

  • Hazard taxonomy (S1-S14) · typed-refusal-codes: The fixed set of 14 content-safety categories (13 aligned to the MLCommons hazard taxonomy, plus a Code Interpreter Abuse category) whose codes the model emits as the machine-readable violation reason.
  • Prompt vs response classification · input-output-guardrails: The two modes the same classifier runs in: classifying the user prompt before the model answers, and classifying the model's response afterwards, so it can guard both sides of a turn.
  • Safe / unsafe label · refusal: The single-token binary verdict the model generates for an input, followed by the violated category codes when the verdict is unsafe, which the host app uses as a refusal signal.
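To turn the emitted codes into something a log line or refusal message can use, a host might keep a lookup table like the one below. The category names are reproduced from memory of the Llama Guard 3 model card and should be checked against the official documentation before use; `HAZARD_CATEGORIES` and `describe` are hypothetical names.

```python
# Assumed S1-S14 names; verify against the official model card before relying on them.
HAZARD_CATEGORIES = {
    "S1": "Violent Crimes",
    "S2": "Non-Violent Crimes",
    "S3": "Sex-Related Crimes",
    "S4": "Child Sexual Exploitation",
    "S5": "Defamation",
    "S6": "Specialized Advice",
    "S7": "Privacy",
    "S8": "Intellectual Property",
    "S9": "Indiscriminate Weapons",
    "S10": "Hate",
    "S11": "Suicide & Self-Harm",
    "S12": "Sexual Content",
    "S13": "Elections",
    "S14": "Code Interpreter Abuse",
}


def describe(codes):
    """Map category codes to readable labels, tolerating unknown codes."""
    return [f"{c}: {HAZARD_CATEGORIES.get(c, 'Unknown category')}" for c in codes]
```

For example, `describe(["S14"])` yields a one-element list naming the Code Interpreter Abuse category, and an unrecognised code is labelled rather than raising.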
