Methodology · Prompt Engineering · emerging · partial

Automatic Prompt Optimization

also known as prompt optimization, DSPy-style compilation, programmatic prompting

Applies to: llm-app, agent, classification-task, pipeline

Tags: prompt-engineering, optimization, dspy

Stop hand-tuning the prompt. Define the task as inputs, outputs, and a metric, then let an optimizer search the prompt space for you. The optimizer proposes prompt variants, scores them on your metric, and keeps the winners. It pays off once you have a clear metric and a labelled example set, and it scales past what a human can tune by hand.

Methodology process overview

flowchart TD
    sig[Typed task signature] --> prog[Express task as a program]
    ex[Labelled example set] --> opt[Run optimizer]
    metric[Metric] --> opt
    prog --> opt
    opt --> propose[Propose prompt variants]
    propose --> score[Score on metric]
    score -->|keep winners| propose
    score --> lock[Lock prompt + check held-out]
    lock --> report[Score report]

Intent. Replace manual prompt tweaking with a metric-driven search that an optimizer runs over the prompt space.

When to apply. Use this when you have a measurable metric, a labelled example set of a few dozen cases or more, and a task stable enough to be worth optimising. It earns its keep on prompts in pipelines and on tasks too fiddly to tune by hand. Don't apply it when you have no metric, since the optimizer has nothing to climb.

Example scenario

A team runs a classifier prompt that tags incoming tickets into one of eight categories. Hand-tuning had stalled around 78 percent accuracy and every wording tweak helped one category while hurting another. They expressed the task as a typed signature: ticket text in, category label out. They labelled 200 tickets, holding back 60. The metric was exact-match accuracy on the label. They ran a DSPy optimizer that searched over instructions and few-shot demonstration sets, scoring each candidate on the training split. The search lifted training accuracy to 89 percent; on the held-out 60 it landed at 86 percent, well past the hand-tuned ceiling. They locked the winning prompt and demonstrations, and kept the held-out set as a regression check for the next model upgrade. They learned the metric mattered most: their first metric ignored a rare but important category, so the optimizer happily ignored it too until they reweighted it.
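The closing lesson of the scenario can be sketched in a few lines. The scenario does not say how the team reweighted the metric, so the macro-averaged recall below is only one possible choice, and the category names and counts are hypothetical. It shows how plain accuracy can look healthy while a rare category is missed entirely.

```python
from collections import defaultdict

def accuracy(pairs):
    """Plain exact-match accuracy over (label, prediction) pairs."""
    return sum(label == pred for label, pred in pairs) / len(pairs)

def macro_recall(pairs):
    """Mean of per-category recall, so every category counts equally."""
    hits, totals = defaultdict(int), defaultdict(int)
    for label, pred in pairs:
        totals[label] += 1
        hits[label] += int(label == pred)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Hypothetical results: 95 common tickets all correct, 5 rare tickets all missed.
pairs = [("billing", "billing")] * 95 + [("security", "billing")] * 5

plain = accuracy(pairs)         # dominated by the common category
balanced = macro_recall(pairs)  # exposes the ignored rare category
```

With these made-up numbers, `plain` stays high while `balanced` drops to one half, which is the signal an optimizer needs before it stops ignoring the rare category.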

Inputs

Typed task signature — The task stated as named inputs and outputs, so a program can call it and check the result.
Labelled example set — Input-output pairs the optimizer scores against, split into training and held-out parts.
Metric — A function that scores an output, from exact match to an LLM judge with a rubric.
An optimizer — The search procedure, such as a DSPy optimizer, that proposes and scores prompt variants.
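As a rough illustration, the first three inputs can be written down as plain Python before any optimizer is involved. This is a sketch under assumptions: the names `Example`, `Metric`, `exact_match`, and `split` are hypothetical and not taken from DSPy or any other library, and the case and whitespace normalisation in the metric is a design choice, not part of the methodology.

```python
import random
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Example:
    """Typed signature for the ticket task: one named input, one named output."""
    ticket_text: str  # input
    category: str     # output (the label)

# A metric maps (example, predicted category) to a score between 0 and 1.
Metric = Callable[[Example, str], float]

def exact_match(example: Example, predicted: str) -> float:
    """Score 1.0 on a label match, ignoring case and surrounding whitespace (an assumption)."""
    return 1.0 if predicted.strip().lower() == example.category.lower() else 0.0

def split(examples, held_out_size, seed=0):
    """Shuffle deterministically, then return (training, held_out)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[held_out_size:], shuffled[:held_out_size]

# Placeholder data with the sizes from the example scenario: 200 labelled, 60 held back.
examples = [Example(f"ticket {i}", "billing") for i in range(200)]
train, held_out = split(examples, held_out_size=60)
```

Fixing the seed keeps the held-out set stable across runs, so later scores stay comparable.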

Outputs

Optimised prompt — The winning prompt text or few-shot demonstration set the optimizer found.
Score report — The metric scores on held-out data that justify the chosen variant.

Steps (5)

Express the task as a program
State the task as a typed signature of named inputs and outputs so an optimizer can call it and check results.
Uses: Structured Output
Assemble a labelled example set
Collect input-output pairs and split them into a training set the optimizer learns on and a held-out set for honest scoring.
Pick a metric
Choose a function that scores an output against the label, from exact match to a rubric-driven LLM judge.
Uses: Evaluator-Optimizer
Choose and run an optimizer
Pick an optimizer and let it propose prompt variants and few-shot demonstrations, scoring each on the metric.
Uses: Prompt Variant Evaluation, Automatic Workflow Search
Lock the winner and re-check
Freeze the best-scoring prompt and confirm it holds up on the held-out set, not just the training set.
Uses: Prompt/Response Optimiser
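The five steps can be sketched end to end with a toy search. This is not how a DSPy optimizer works internally: it is an exhaustive search over a tiny candidate space, and `stub_model` is a hypothetical stand-in for an LLM call so the example runs offline. All data, instructions, and function names are made up for illustration.

```python
import itertools

def stub_model(instruction, demos, ticket_text):
    """Toy stand-in for an LLM call (hypothetical): copy the label of the
    first demonstration that shares a word with the ticket."""
    words = set(ticket_text.split())
    for demo_text, demo_label in demos:
        if words & set(demo_text.split()):
            return demo_label
    # No demonstration matched: fall back on a default named in the instruction.
    return "billing" if "billing" in instruction else "unknown"

def exact_match(label, predicted):
    return 1.0 if label == predicted else 0.0

def score(candidate, examples, metric):
    """Mean metric value of one (instruction, demos) candidate over examples."""
    instruction, demos = candidate
    total = sum(metric(label, stub_model(instruction, demos, text))
                for text, label in examples)
    return total / len(examples)

def optimize(instructions, demo_pool, train, metric, demos_per_prompt=2):
    """Propose every instruction x demo-set pair and keep the best on train."""
    candidates = [
        (instruction, demos)
        for instruction in instructions
        for demos in itertools.combinations(demo_pool, demos_per_prompt)
    ]
    return max(candidates, key=lambda c: score(c, train, metric))

instructions = ["Classify the ticket.",
                "Classify the ticket; if unsure, answer billing."]
demo_pool = [("hello there", "other"),
             ("invoice overdue", "billing"),
             ("reset my password", "account")]
train = [("refund my invoice", "billing"), ("password reset fails", "account"),
         ("invoice is wrong", "billing"), ("cannot reset login", "account")]
held_out = [("invoice missing", "billing"), ("reset link broken", "account")]

best = optimize(instructions, demo_pool, train, exact_match)  # step 4
train_score = score(best, train, exact_match)
held_out_score = score(best, held_out, exact_match)           # step 5
```

Real optimizers sample or iterate instead of enumerating, but the shape is the same: propose candidates, score them on the training split, lock the winner, then score it once on the held-out split.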

Principles

The optimizer climbs exactly what you measure, so the metric is the design.
Always keep a held-out split; tuned prompts over-fit too.
A clear typed signature makes the task optimisable in the first place.
Automate the search, but a human still chooses the metric and reads the failures.
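The held-out principle can be turned into a small check to rerun after each optimisation or model upgrade. This is a sketch under assumptions: `holds_up` is a hypothetical helper, the 0.05 gap tolerance is an arbitrary choice, and the baseline is assumed to be measured on comparable data. The numbers are the ones from the example scenario.

```python
def holds_up(train_score, held_out_score, baseline, max_gap=0.05):
    """True if the tuned prompt beats the baseline on held-out data and the
    train/held-out gap stays within max_gap (a sign it has not over-fit badly)."""
    gap = train_score - held_out_score
    return held_out_score > baseline and gap <= max_gap

# Scenario figures: 89 percent on train, 86 percent held-out, 78 percent hand-tuned.
ok = holds_up(train_score=0.89, held_out_score=0.86, baseline=0.78)

# A variant that shines on train but collapses on held-out data fails the check.
overfit = holds_up(train_score=0.95, held_out_score=0.80, baseline=0.78)
```

A failed check is a cue for a human to read the held-out failures, not to rerun the search until the number passes.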
