Iterative Prompt Refinement Loop
Also known as: prompt engineering lifecycle, test-and-iterate prompting.
Treat a prompt like an experiment, not a one-shot write. Draft the simplest prompt that could work, run it on real inputs, read the failures, change one thing, and run again. Stop when the prompt clears a quality bar you set before you started. Each pass costs little more than a re-run on a handful of inputs, and it replaces guessing at wording with evidence.
Methodology process overview
Intent. Turn prompt writing into a measured loop so every change is judged against real outputs instead of a hunch.
When to apply. Use this whenever a single prompt drives a feature and its output quality matters. Reach for it the moment a prompt 'mostly works but sometimes fails'. Don't apply it to a throwaway one-off prompt you will never run again.
Inputs
- Task and success bar — What the prompt must produce, and the measurable bar that counts as good enough.
- Real example inputs — A handful of genuine inputs the prompt will face in production, including the awkward ones.
- Target model — The model the prompt will run against, since wording that helps one model can hurt another.
- A way to judge output — A human reader or an automated grader that can score each run against the bar.
Outputs
- A prompt that clears the bar — The frozen prompt text that passed on the real inputs.
- Failure log — The short record of which inputs broke which version and why.
- Regression example set — The example inputs, kept so the prompt can be re-checked after any later edit.
Steps (6)
Draft the simplest prompt
Write the shortest prompt that could plausibly work. Resist adding rules you have not yet seen fail.
Run on real inputs
Run the prompt against the genuine example inputs, not imagined ones. Capture every output.
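A minimal sketch of this step in Python, assuming the prompt is a template with a single `{input}` placeholder and the model sits behind a plain callable. The names `Run`, `run_prompt`, and `fake_model` are hypothetical and belong to no real SDK. Swap `fake_model` for a real client call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Run:
    """One captured input/output pair for a prompt version."""
    input_text: str
    output: str


def run_prompt(prompt: str, inputs: List[str],
               call_model: Callable[[str], str]) -> List[Run]:
    """Run one prompt version over every example input and keep each output."""
    runs = []
    for text in inputs:
        # Assumption: the template marks the input slot with the literal "{input}".
        full_prompt = prompt.replace("{input}", text)
        runs.append(Run(input_text=text, output=call_model(full_prompt)))
    return runs


def fake_model(full_prompt: str) -> str:
    """Hypothetical stand-in for a model call; it only upper-cases the prompt."""
    return full_prompt.upper()
```

Passing the model in as a callable keeps the harness independent of any one provider, so the same inputs can be replayed against a different target model later.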
Read and bucket the failures
Read each bad output and group the failures by cause. The largest bucket tells you what to fix first.
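Bucketing can be as simple as counting hand-written cause labels. The sketch below assumes a reader has already tagged each bad output with a short cause string. The function name, input IDs, and causes are made up for illustration.

```python
from collections import Counter
from typing import List, Tuple


def bucket_failures(labelled: List[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Count failures per cause and return the buckets, largest first.

    `labelled` holds (input_id, cause) pairs written while reading bad outputs.
    """
    return Counter(cause for _, cause in labelled).most_common()


# Hypothetical labels from one read-through of the failures.
failures = [
    ("ticket-03", "ignored length limit"),
    ("ticket-07", "invented a field"),
    ("ticket-11", "ignored length limit"),
]
buckets = bucket_failures(failures)
# buckets == [("ignored length limit", 2), ("invented a field", 1)]
```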
Change one thing
Make a single change aimed at the biggest bucket: a clearer instruction, one example, or a tighter output format.
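One way to keep yourself honest about "one thing" is to diff the two prompt versions before running the new one. This is a sketch using the standard library's `difflib.SequenceMatcher`. The function name and the rule that one contiguous edit counts as one change are assumptions, not part of the methodology.

```python
import difflib
from typing import List, Tuple


def prompt_changes(old_prompt: str,
                   new_prompt: str) -> List[Tuple[str, List[str], List[str]]]:
    """Return each contiguous edit between two prompt versions, line by line."""
    old_lines = old_prompt.splitlines()
    new_lines = new_prompt.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # tag is "insert", "delete", or "replace"
            changes.append((tag, old_lines[i1:i2], new_lines[j1:j2]))
    return changes


v1 = "Summarize the ticket.\nBe concise."
v2 = "Summarize the ticket.\nBe concise.\nUse at most 50 words."
changes = prompt_changes(v1, v2)
# changes == [("insert", [], ["Use at most 50 words."])]
```

If `changes` holds more than one entry, the edit touched several places, and a score change can no longer be pinned on a single cause.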
Re-run and compare
Run the new version on the same inputs and compare scores against the bar. If the score is still below the bar, go back to reading the failures.
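The comparison can be sketched as a pass rate per version checked against the bar. This assumes the judge reduces to a yes/no check per output. The `passes` callable, the function names, and the sample outputs are all hypothetical.

```python
from typing import Callable, Dict, List


def pass_rate(outputs: List[str], passes: Callable[[str], bool]) -> float:
    """Fraction of outputs the judge accepts."""
    if not outputs:
        raise ValueError("need at least one output to score")
    return sum(1 for out in outputs if passes(out)) / len(outputs)


def compare(old_outputs: List[str], new_outputs: List[str],
            passes: Callable[[str], bool], bar: float) -> Dict[str, object]:
    """Score both versions on the same inputs and check the new one against the bar."""
    old = pass_rate(old_outputs, passes)
    new = pass_rate(new_outputs, passes)
    return {"old": old, "new": new, "improved": new > old, "clears_bar": new >= bar}


# Hypothetical judge: an output passes if it is at most 5 characters long.
result = compare(
    old_outputs=["short", "too long", "ok", "way too long"],
    new_outputs=["short", "fine", "ok", "too long"],
    passes=lambda out: len(out) <= 5,
    bar=0.9,
)
# result == {"old": 0.5, "new": 0.75, "improved": True, "clears_bar": False}
```

In this made-up run the change helped but the prompt is still below the bar, so the loop returns to reading failures.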
Freeze and keep the examples
When the prompt clears the bar, freeze it and keep the example inputs as a regression set for future edits.
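Freezing can be as light as writing the prompt, a content hash, and the example inputs to one file. The sketch below uses only the standard library. The file layout and function names are assumptions, not a prescribed format.

```python
import hashlib
import json
from pathlib import Path
from typing import List


def _digest(prompt: str) -> str:
    """SHA-256 hex digest of the exact prompt text."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def freeze(prompt: str, inputs: List[str], path: Path) -> str:
    """Write the prompt, its hash, and the regression inputs to a JSON file."""
    record = {
        "prompt": prompt,
        "sha256": _digest(prompt),
        "regression_inputs": inputs,
    }
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return record["sha256"]


def is_unchanged(prompt: str, path: Path) -> bool:
    """True if the given prompt text matches the frozen hash on disk."""
    record = json.loads(path.read_text(encoding="utf-8"))
    return _digest(prompt) == record["sha256"]
```

A later edit that makes `is_unchanged` return `False` is the signal to re-run the stored `regression_inputs` before shipping the new text.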
Principles
- Change one variable at a time, or a score change tells you nothing.
- Judge against real inputs, never imagined ones.
- Keep every version and the inputs that broke it.
- Stop at a bar you set before you started, not when you get bored.
Related patterns (5)
- ★★Chain of Thought
Elicit multi-step reasoning by prompting the model to produce intermediate steps before its final answer.
- ★★Structured Output
Constrain the model's output to conform to a JSON Schema (or similar typed shape).
- ★★Prompt Versioning
Treat prompts as immutable, hashed, semver'd artefacts in a registry; deploy and roll back like code.
- ★Sampled Prompt Trace Eval
Capture full prompt/response/metadata traces from production into a monitoring dataset, but only run LLM-judge evaluation on a random sample so monitoring cost stays bounded as traffic grows.
- ★Own Your Prompts (12-Factor Agents)
Every prompt in a production agent is versioned, tested, and owned by the team in the application repo — never inherited as a framework default.
Provenance
- Added to catalog:
- Last updated:
- Verification status: partial