Methodology · Prompt Engineering · proven · partial

Iterative Prompt Refinement Loop

also known as prompt engineering lifecycle, test-and-iterate prompting

Applies to: llm-app, agent, coding-agent, rag-pipeline

Tags: prompt-engineering, iteration, evaluation

Treat a prompt like an experiment, not a one-shot write. Draft the simplest prompt that could work, run it on real inputs, read the failures, change one thing, and run again. Stop when the prompt clears a quality bar you set before you started. The loop is cheap, and it beats guessing at wording.

Methodology process overview

flowchart TD
    bar[Task + success bar] --> draft[Draft simplest prompt]
    inputs[Real example inputs] --> run[Run on real inputs]
    draft --> run
    run --> read[Read and bucket failures]
    read --> change[Change one thing]
    change --> rerun[Re-run and compare]
    rerun -->|below bar| read
    rerun -->|clears bar| freeze[Freeze prompt + keep examples]
    freeze --> reg[Regression example set]

Intent. Turn prompt writing into a measured loop so every change is judged against real outputs instead of a hunch.

When to apply. Use this whenever a single prompt drives a feature and its output quality matters. Reach for it the moment a prompt 'mostly works but sometimes fails'. Don't apply it to a throwaway one-off prompt you will never run again.

Example scenario

A two-person team added a feature that summarises a long support thread into three bullet points. They wrote the first prompt by feel; it looked fine on one thread and shipped. A week later users complained the summaries dropped the customer's actual question. The team switched to the loop. They collected twelve real threads, including two messy ones, and set a bar: the question must appear in bullet one on all twelve. The first prompt passed seven. They read the five misses, saw the model led with the agent's reply, and changed exactly one thing: they added a line saying bullet one is always the customer's question. Re-run: eleven of twelve. The last miss was a thread with two questions, so they added a single example showing how to handle that. Re-run: twelve of twelve. They froze the prompt and kept the twelve threads as a regression set, which caught a later edit that quietly broke the two-question case.
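The bar in this scenario ("the question appears in bullet one") can be checked mechanically. The sketch below is one possible way to do it, not the team's actual grader: the `Thread` record, the hand-labelled `expected_question` field, and the keyword-overlap rule are all assumptions made for illustration. A human reader or a stricter grader could stand in for `passes_bar`.

```python
import re
from dataclasses import dataclass


@dataclass
class Thread:
    """One real support thread plus a hand-written label (hypothetical schema)."""
    thread_id: str
    text: str
    expected_question: str  # what a human says the customer actually asked


def _words(text: str) -> list[str]:
    """Lower-case word tokens, punctuation dropped."""
    return re.findall(r"[a-z0-9']+", text.lower())


def first_bullet(summary: str) -> str:
    """Return the first non-empty line of a summary with bullet markers stripped."""
    for line in summary.splitlines():
        stripped = line.strip().lstrip("-*• ").strip()
        if stripped:
            return stripped
    return ""


def passes_bar(summary: str, thread: Thread) -> bool:
    """Crude rule: every longer word of the labelled question appears in bullet one."""
    bullet_words = set(_words(first_bullet(summary)))
    keywords = [w for w in _words(thread.expected_question) if len(w) > 3]
    return bool(keywords) and all(w in bullet_words for w in keywords)


def score(summaries: dict[str, str], threads: list[Thread]) -> tuple[int, list[str]]:
    """Return (number of threads that passed, ids of the threads that missed)."""
    misses = [t.thread_id for t in threads if not passes_bar(summaries[t.thread_id], t)]
    return len(threads) - len(misses), misses
```

Calling `score` after each prompt change gives the "seven of twelve", "eleven of twelve" style of count the scenario describes, and the list of missed ids says which threads to read next.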

Inputs

Task and success bar — What the prompt must produce, and the measurable bar that counts as good enough.
Real example inputs — A handful of genuine inputs the prompt will face in production, including the awkward ones.
Target model — The model the prompt will run against, since wording that helps one model can hurt another.
A way to judge output — A human reader or an automated grader that can score each run against the bar.

Outputs

A prompt that clears the bar — The frozen prompt text that passed on the real inputs.
Failure log — The short record of which inputs broke which version and why.
Regression example set — The example inputs, kept so the prompt can be re-checked after any later edit.
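The failure log does not need tooling; a few lines per failure are enough. As a minimal sketch, assuming a JSON Lines file and a hypothetical three-field record (prompt version, input id, cause), it could look like this:

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FailureEntry:
    """One line of the failure log (field names are illustrative, not a standard)."""
    prompt_version: str
    input_id: str
    cause: str


def to_jsonl(entries: list[FailureEntry]) -> str:
    """Serialise the log as one JSON object per line."""
    return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in entries)


def from_jsonl(text: str) -> list[FailureEntry]:
    """Parse a log written by to_jsonl, skipping blank lines."""
    return [FailureEntry(**json.loads(line)) for line in text.splitlines() if line.strip()]


def broken_by(entries: list[FailureEntry], prompt_version: str) -> list[str]:
    """Ids of the inputs that a given prompt version failed on, sorted."""
    return sorted({e.input_id for e in entries if e.prompt_version == prompt_version})
```

Keeping the input id in every record ties the log back to the regression example set, so a later edit that fails an old input is easy to spot.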

Steps (6)

Draft the simplest prompt
Write the shortest prompt that could plausibly work. Resist adding rules you have not yet seen fail.
Run on real inputs
Run the prompt against the genuine example inputs, not imagined ones. Capture every output.
uses: Sampled Prompt Trace Eval
Read and bucket the failures
Read each bad output and group the failures by cause. The buckets tell you what to fix.
Change one thing
Make a single change aimed at the biggest bucket: a clearer instruction, one example, or a tighter output format.
uses: Structured Output, Chain of Thought
Re-run and compare
Run the new version on the same inputs and compare scores against the bar. If still below, return to reading failures.
uses: Prompt Versioning
Freeze and keep the examples
When the prompt clears the bar, freeze it and keep the example inputs as a regression set for future edits.
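The six steps can be sketched as a small driver. This is an illustration under stated assumptions, not a prescribed implementation: `call_model`, `grade`, and `revise` are hypothetical callables you would supply (the model client, the judge, and the human or tool that makes the single change), and `grade` is assumed to return `None` for a pass or a short cause label for a failure.

```python
from collections import Counter
from typing import Callable, Optional


def run_round(
    prompt: str,
    inputs: list[str],
    call_model: Callable[[str, str], str],
    grade: Callable[[str, str], Optional[str]],
) -> tuple[int, Counter]:
    """Run one prompt version on every input; return pass count and failure buckets."""
    passed = 0
    buckets: Counter = Counter()
    for item in inputs:
        output = call_model(prompt, item)
        cause = grade(item, output)  # None means the output cleared the bar
        if cause is None:
            passed += 1
        else:
            buckets[cause] += 1
    return passed, buckets


def refine(
    prompt: str,
    inputs: list[str],
    call_model: Callable[[str, str], str],
    grade: Callable[[str, str], Optional[str]],
    revise: Callable[[str, str], str],
    bar: int,
    max_rounds: int = 5,
) -> tuple[Optional[str], list[tuple[str, int]]]:
    """Loop: run, bucket, change one thing, re-run. Stop when `bar` inputs pass."""
    history: list[tuple[str, int]] = []
    for _ in range(max_rounds):
        passed, buckets = run_round(prompt, inputs, call_model, grade)
        history.append((prompt, passed))
        if passed >= bar:
            return prompt, history  # freeze this version; keep `inputs` for regression
        if not buckets:
            break  # nothing failed, so the bar itself is unreachable on these inputs
        biggest_cause, _count = buckets.most_common(1)[0]
        prompt = revise(prompt, biggest_cause)  # exactly one change per round
    return None, history
```

The same `inputs` list is reused on every round so scores stay comparable, and `max_rounds` is only a safety stop; in practice the revision step is usually a person reading the biggest bucket and editing the prompt by hand.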