Automatic Prompt Optimization
Also known as: prompt optimization, DSPy-style compilation, programmatic prompting
Stop hand-tuning the prompt. Define the task as inputs, outputs, and a metric, then let an optimizer search the prompt space for you. The optimizer proposes prompt variants, scores them on your metric, and keeps the winners. It pays off once you have a clear metric and a labelled example set, and it scales past what a human can tune by hand.
Methodology process overview
Intent. Replace manual prompt tweaking with a metric-driven search that an optimizer runs over the prompt space.
When to apply. Use this when you have a measurable metric, a labelled example set of a few dozen cases or more, and a task stable enough to be worth optimising. It earns its keep on prompts in pipelines and on tasks too fiddly to tune by hand. Don't apply it when you have no metric, since the optimizer has nothing to climb.
Inputs
- Typed task signature — The task stated as named inputs and outputs, so a program can call it and check the result.
- Labelled example set — Input-output pairs the optimizer scores against, split into training and held-out parts.
- Metric — A function that scores an output, from exact match to an LLM judge with a rubric.
- An optimizer — The search procedure, such as a DSPy optimizer, that proposes and scores prompt variants.
Outputs
- Optimised prompt — The winning prompt text or few-shot demonstration set the optimizer found.
- Score report — The metric scores on held-out data that justify the chosen variant.
Steps (5)
Express the task as a program
State the task as a typed signature of named inputs and outputs so an optimizer can call it and check results.
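A minimal sketch of what a typed signature can look like in plain Python, assuming a sentiment-labelling task chosen purely for illustration. The names `SentimentInput`, `SentimentOutput`, `check_output`, and `keyword_baseline` are hypothetical and do not come from any framework; a library such as DSPy has its own signature classes.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical task: label a product review as positive or negative.
ALLOWED_LABELS = {"positive", "negative"}


@dataclass(frozen=True)
class SentimentInput:
    review: str


@dataclass(frozen=True)
class SentimentOutput:
    label: str  # expected to be one of ALLOWED_LABELS


# Any callable with this shape counts as an implementation of the task,
# so an optimizer can swap prompts behind it without changing callers.
Task = Callable[[SentimentInput], SentimentOutput]


def check_output(out: SentimentOutput) -> bool:
    """Structural check a program can run on every result."""
    return out.label in ALLOWED_LABELS


def keyword_baseline(x: SentimentInput) -> SentimentOutput:
    """Stand-in for a prompted model call (assumption: no real LLM here)."""
    label = "positive" if "good" in x.review.lower() else "negative"
    return SentimentOutput(label=label)


result = keyword_baseline(SentimentInput(review="Good value for money"))
```

Because inputs and outputs are named and typed, the same `Task` shape can be called by an optimizer and validated with `check_output` regardless of which prompt sits behind it.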
Assemble a labelled example set
Collect input-output pairs and split them into a training set the optimizer learns on and a held-out set for honest scoring.
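One way to make the split reproducible is a seeded shuffle. The sketch below assumes examples are simple `(input, label)` tuples; the function name `split_examples` and the 30% held-out fraction are illustrative choices, not a recommendation from the source.

```python
import random


def split_examples(examples, held_out_fraction=0.3, seed=0):
    """Shuffle with a fixed seed, then cut into (train, held_out)."""
    if not 0.0 < held_out_fraction < 1.0:
        raise ValueError("held_out_fraction must be between 0 and 1")
    shuffled = list(examples)  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    n_held_out = max(1, round(len(shuffled) * held_out_fraction))
    return shuffled[n_held_out:], shuffled[:n_held_out]


# Placeholder data standing in for real labelled pairs.
examples = [(f"input {i}", f"label {i}") for i in range(40)]
train, held_out = split_examples(examples)
```

Fixing the seed means the held-out set stays the same across optimizer runs, so scores from different runs remain comparable.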
Pick a metric
Choose a function that scores an output against the label, from exact match to a rubric-driven LLM judge.
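At the simple end of that range, a metric can be a few lines. This sketch assumes string outputs and treats case and surrounding whitespace as irrelevant, which is a design choice rather than a rule; `exact_match` and `mean_score` are hypothetical names.

```python
def exact_match(prediction: str, label: str) -> float:
    """Return 1.0 on a match after trimming and lower-casing, else 0.0."""
    return 1.0 if prediction.strip().lower() == label.strip().lower() else 0.0


def mean_score(metric, predictions, labels) -> float:
    """Average a per-example metric over a dataset."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot score an empty dataset")
    return sum(metric(p, l) for p, l in zip(predictions, labels)) / len(labels)


score = mean_score(
    exact_match,
    ["Paris", "berlin ", "Rome"],
    ["paris", "Berlin", "Madrid"],
)
```

An LLM judge would slot in behind the same `(prediction, label) -> float` shape, so the rest of the pipeline would not need to change.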
Choose and run an optimizer
Pick an optimizer and let it propose prompt variants and few-shot demonstrations, scoring each on the metric.
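The propose-score-keep loop can be sketched as an exhaustive search over a fixed list of candidate instructions. Everything here is an assumption for illustration: `stub_model` is a toy stand-in for an LLM call that reacts to a single keyword, and real optimizers generate candidates and few-shot demonstrations rather than reading them from a hand-written list.

```python
def stub_model(prompt: str, text: str) -> str:
    """Hypothetical stand-in for an LLM: uppercases only when asked to."""
    return text.upper() if "uppercase" in prompt.lower() else text


def exact_match(prediction: str, label: str) -> float:
    return 1.0 if prediction == label else 0.0


def score_prompt(prompt, examples, model, metric) -> float:
    """Mean metric score of one prompt over a list of (input, label) pairs."""
    return sum(metric(model(prompt, x), y) for x, y in examples) / len(examples)


def optimize(candidates, trainset, model, metric):
    """Score every candidate prompt on the training set and keep the best."""
    if not candidates or not trainset:
        raise ValueError("need at least one candidate and one training example")
    scored = [(score_prompt(p, trainset, model, metric), p) for p in candidates]
    best_score, best_prompt = max(scored, key=lambda pair: pair[0])
    return best_prompt, best_score


trainset = [("cat", "CAT"), ("dog", "DOG"), ("owl", "OWL")]
candidates = [
    "Repeat the word.",
    "Rewrite the word in uppercase.",
    "Shorten the word.",
]
best_prompt, best_score = optimize(candidates, trainset, stub_model, exact_match)
```

Few-shot demonstration sets could be searched the same way by making each candidate a pair of instruction and demonstrations; the scoring loop itself would stay unchanged.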
Lock the winner and re-check
Freeze the best-scoring prompt and confirm it holds up on the held-out set, not just the training set.
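A simple acceptance check compares the two scores before the prompt is frozen. The thresholds below (`max_gap` and `min_held_out`) are illustrative assumptions, as is the function name `accept_winner`; suitable values depend on the task and on how noisy the metric is.

```python
def accept_winner(train_score: float, held_out_score: float,
                  max_gap: float = 0.1, min_held_out: float = 0.7) -> bool:
    """Accept the prompt only if it scores well on held-out data and
    the drop from training to held-out stays within max_gap."""
    gap = train_score - held_out_score
    return held_out_score >= min_held_out and gap <= max_gap


# Scores are made-up numbers for illustration only.
holds_up = accept_winner(train_score=0.95, held_out_score=0.90)
overfit = accept_winner(train_score=0.98, held_out_score=0.60)
```

A large gap between the two scores suggests the optimizer fitted quirks of the training examples, in which case the search is worth rerunning with more data or a smaller candidate space.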
Principles
- The optimizer climbs exactly what you measure, so the metric is the design.
- Always keep a held-out split; tuned prompts overfit their examples just as trained models do.
- A clear typed signature makes the task optimisable in the first place.
- Automate the search, but a human still chooses the metric and reads the failures.
Known failure modes (2)
Related patterns (5)
- ★★ Prompt Variant Evaluation
  Author multiple variants of the same prompt node, run them as a batch against a shared dataset, and let an automated evaluation flow score them so the winning variant is selected by measurement.
- ★★ Prompt/Response Optimiser
  At runtime, transform user inputs and model outputs into standardised, template-aligned prompts and responses against predefined constraints, so the agent and its downstream consumers see consistent shapes.
- ★★ Evaluator-Optimizer
  One LLM generates; another evaluates and feeds back; loop until criteria are met.
- · Automatic Workflow Search
  Treat the agent's workflow (a graph of LLM-invoking nodes) as an artefact to search; use Monte Carlo Tree Search guided by an eval benchmark to discover the best workflow, then deploy it.
- ★ Best-of-N Sampling
  Sample N candidate outputs and select the highest-ranked by a reward model or scorer.
Related methodologies (2)
Sources (2)
Provenance
- Verification status: partial