Prompts Deserve the Same Discipline as Code
If a prompt directly affects:
- the quality of answers
- the quality of generated code
- the safety of agent behavior
then it is no longer a “personal trick.” It is part of the working system.
Therefore, prompts should have:
- versions
- change history
- owners
- evaluation criteria
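As a rough illustration, the four attributes above could live together in one small record. This is a minimal sketch, not a prescribed schema: the class name, field names, and sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptRecord:
    """One versioned prompt plus its metadata (all names are hypothetical)."""
    name: str                              # e.g. "code-review-agent"
    version: str                           # e.g. "1.2"
    owner: str                             # person or team accountable for changes
    text: str                              # the prompt itself
    changelog: tuple[str, ...] = ()        # change history, newest last
    eval_criteria: tuple[str, ...] = ()    # what "better" means for this prompt

    def is_reviewable(self) -> bool:
        """True only when owner, history, and eval criteria are all present."""
        return bool(self.owner and self.changelog and self.eval_criteria)


# Illustrative values only.
review_prompt = PromptRecord(
    name="code-review-agent",
    version="1.2",
    owner="platform-team",
    text="Review the diff. Cite a file for every finding.",
    changelog=("Required findings to include file references",),
    eval_criteria=("format compliance", "detects real issues"),
)
```

A check like `is_reviewable()` could run in CI so that a prompt without an owner or evaluation criteria never reaches the main branch.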
Why Gut-Feel Assessment Is Not Enough
Many teams tweak prompts by feel:
- “this version seems better”
- “the responses feel smoother”
- “the agent seems smarter this time”
The problem is that feelings are not reproducible.
A “better prompt” should mean:
- fewer errors
- better format compliance
- less scope creep
- less manual correction needed
How to Version Prompts
The simplest approach:
- store prompts in a repository
- every change goes through a pull request
- document the reason for each change
- if possible, attach before/after examples
Example changelog:
v1.2
- Clarified fallback behavior when data is missing
- Required findings to include file references
- Added length constraint to reduce rambling
Even this much puts a team ahead of those that still prompt from memory.
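Once prompts are files in a repository, code can load them by version instead of copying text around. The sketch below assumes a hypothetical layout with one file per version (`v1.2.txt`, `v1.10.txt`, and so on) in a single directory. The layout and function names are illustrative, not a standard.

```python
import tempfile
from pathlib import Path


def parse_version(path: Path) -> tuple[int, ...]:
    """'v1.10.txt' -> (1, 10), so versions sort numerically, not alphabetically."""
    return tuple(int(part) for part in path.stem.lstrip("v").split("."))


def load_latest(prompt_dir: Path) -> tuple[str, str]:
    """Return (version, prompt text) for the highest-numbered version file."""
    newest = max(prompt_dir.glob("v*.txt"), key=parse_version)
    return newest.stem, newest.read_text(encoding="utf-8")


# Demo in a temporary directory standing in for a repository folder.
with tempfile.TemporaryDirectory() as tmp:
    prompt_dir = Path(tmp)
    (prompt_dir / "v1.2.txt").write_text(
        "Cite a file for every finding.", encoding="utf-8")
    (prompt_dir / "v1.10.txt").write_text(
        "Cite a file and line for every finding.", encoding="utf-8")
    version, text = load_latest(prompt_dir)
```

Parsing the version into integers matters because a plain string sort would rank `v1.2` above `v1.10`.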
What Is Prompt Evaluation?
An evaluation (eval) is a small test suite that checks whether a prompt achieves its objectives.
For a review agent, an eval might include 5 cases:
- A diff with a null pointer bug
- A diff with a performance regression
- A diff with only formatting changes
- A diff with missing context
- A diff with a security change
The eval does not expect word-for-word identical output. It checks behavior:
- Did it detect the real issue?
- Did it follow the output contract?
- Did it avoid fabricating information when context was missing?
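The three questions above can be turned into automated checks. The following is a minimal sketch under stated assumptions: the output contract (a JSON object with a `findings` list of `file`/`issue` entries), the keyword-based grading, and `stub_agent` (a stand-in for a real model call) are all hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EvalCase:
    name: str
    diff: str
    expected_keyword: Optional[str]  # None means "no finding expected"


def follows_contract(raw: str) -> bool:
    """Assumed contract: JSON object with a 'findings' list of {file, issue}."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or not isinstance(data.get("findings"), list):
        return False
    return all(isinstance(f, dict) and {"file", "issue"} <= f.keys()
               for f in data["findings"])


def grade(case: EvalCase, raw: str) -> dict:
    """Grade one response on behavior, not on exact wording."""
    format_ok = follows_contract(raw)
    findings = json.loads(raw)["findings"] if format_ok else []
    issues = " ".join(str(f["issue"]).lower() for f in findings)
    if case.expected_keyword is None:
        # Formatting-only or missing-context case: any finding counts as invented.
        correct = not findings
    else:
        correct = case.expected_keyword in issues
    return {"case": case.name, "format_ok": format_ok,
            "correct": format_ok and correct}


def run_eval(agent: Callable[[str], str], cases: list[EvalCase]) -> list[dict]:
    return [grade(case, agent(case.diff)) for case in cases]


def stub_agent(diff: str) -> str:
    """Stand-in for a real model call; flags an unguarded attribute access."""
    if "user.name" in diff and "if user" not in diff:
        return json.dumps({"findings": [
            {"file": "app.py", "issue": "Possible null dereference of user"}]})
    return json.dumps({"findings": []})


cases = [
    EvalCase("null_pointer", "+ print(user.name)", "null"),
    EvalCase("formatting_only", "- x=1\n+ x = 1", None),
    EvalCase("missing_context", "+ return helper(x)", None),
]
results = run_eval(stub_agent, cases)
```

Keyword matching is a crude grader. It is used here only because it is deterministic; a real suite would likely need a stricter check per case.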
A Critical Principle: Change One Thing at a Time
When editing a prompt, do not change 5 things at once.
Change one element at a time:
- add an output contract
- fix the fallback behavior
- narrow the scope
Then re-run the eval.
Otherwise you cannot tell which change actually helped and which one did harm.
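With a single change between two runs, a per-case comparison shows exactly what that change fixed or broke. A minimal sketch follows; the case names and pass/fail values are made up for illustration and are not measured results.

```python
def compare_runs(baseline: dict[str, bool],
                 candidate: dict[str, bool]) -> dict[str, list[str]]:
    """Compare per-case pass/fail results from two eval runs of the same suite."""
    fixed = sorted(c for c in baseline
                   if not baseline[c] and candidate.get(c, False))
    regressed = sorted(c for c in baseline
                       if baseline[c] and not candidate.get(c, False))
    return {"fixed": fixed, "regressed": regressed}


# Illustrative results: the candidate differs from the baseline by ONE change
# (a rewritten fallback rule for missing data).
baseline = {
    "null_pointer": True,
    "perf_regression": False,
    "formatting_only": True,
    "missing_context": False,
    "security_change": True,
}
candidate = {
    "null_pointer": True,
    "perf_regression": False,
    "formatting_only": False,
    "missing_context": True,
    "security_change": True,
}
report = compare_runs(baseline, candidate)
```

Because only the fallback rule changed, both the fix and the regression in `report` can be attributed to it. With five simultaneous edits, the same report would be ambiguous.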
Practical Metrics to Start With
You do not need a complex measurement system on day one. Start with simple criteria:
- Rate of outputs matching the required format
- Rate of critical issues correctly identified
- Rate at which users need to re-prompt
- Rate at which the agent goes out of scope
- Rate at which the agent states uncertainty when appropriate
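These rates can be computed from one labeled record per response. The sketch below assumes each response has already been labeled by hand or by an eval. The field names are hypothetical and the sample data is invented. For simplicity, every rate uses all responses as its denominator. A stricter version would count uncertainty only over the cases where uncertainty was appropriate.

```python
from dataclasses import dataclass, fields


@dataclass
class Outcome:
    """Labels for a single agent response (field names are hypothetical)."""
    format_ok: bool
    critical_issue_found: bool
    needed_reprompt: bool
    out_of_scope: bool
    stated_uncertainty_when_needed: bool


def rates(outcomes: list[Outcome]) -> dict[str, float]:
    """Fraction of responses for which each label is True."""
    if not outcomes:
        return {}
    total = len(outcomes)
    return {
        f.name: sum(getattr(o, f.name) for o in outcomes) / total
        for f in fields(Outcome)
    }


# Invented sample of four labeled responses.
sample = [
    Outcome(True, True, False, False, True),
    Outcome(True, False, True, False, True),
    Outcome(False, True, False, True, False),
    Outcome(True, True, False, False, True),
]
summary = rates(sample)
```

Tracking `summary` per prompt version gives the before/after numbers that a changelog entry can cite.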
Key Takeaway
Standardizing prompts without versioning and evaluation only gets you halfway.
A strong prompt is not just a well-written prompt. It is a prompt that is measurable, reviewable, and improvable in a controlled way.
In the final part of the foundations series, we assemble everything into a minimum viable Prompt Standard kit that a team can deploy immediately. Continue to Part 5: A Minimum Viable Prompt Standard Kit.