Evaluating AI Output: How to Test a Non-Deterministic Product
You can't unit-test an LLM the way you test deterministic code. How to build evaluation sets, score non-deterministic output, and catch regressions before they ship.
You change a prompt to fix one problem. Did it fix that problem? Did it break three others? With deterministic code, your test suite tells you. With AI, the output is non-deterministic and judgement-based - there's no obvious assertion. So most teams ship prompt and model changes on vibes, and find out they broke something when a user complains.
Evaluation is the discipline that fixes this. This article covers how to build evaluation for a non-deterministic AI product: evaluation sets, scoring methods, and catching regressions before they ship.
Why you can't unit-test an LLM
A unit test asserts an exact output: expect(add(2,2)).toBe(4). AI output doesn't work that way. Ask the model the same question twice and you may get two differently-worded correct answers. There's no single string to assert against.
So evaluation has to score quality rather than check equality. Is the answer correct? Is it grounded? Is it well-formatted? Does it refuse when it should? These are judgements, and evaluation is the practice of making those judgements systematically and repeatably.
The evaluation set
The foundation is an evaluation set: a collection of representative inputs paired with either expected outputs or quality criteria. Building it well is most of the work.
A good evaluation set includes:
- Representative real inputs - the kinds of things actual users send, not idealised examples
- Edge cases - the unusual inputs that break things
- Adversarial cases - inputs designed to make the model misbehave (probing for hallucination, prompt injection, overstepping)
- Refusal cases - inputs where the right answer is "I don't know" or a refusal
Start small. Even 20-30 well-chosen cases catch most regressions. Grow the set over time, especially by adding any real failure you find in production - every production bug becomes an evaluation case so it never recurs.
Scoring methods
How you score depends on the output type. From most to least objective:
Exact / structural checks
Where output should match a known answer or schema, check it directly. Classification ("is this spam?"), extraction ("does the output JSON have these fields with these values?"), and routing decisions can often be scored with exact or near-exact checks. These are the easiest and most reliable.
Rule-based checks
Where you can express quality as rules: does the answer cite a source? Is it within a length range? Does it avoid forbidden content? Does it include the required disclaimer? Rules catch a lot without needing judgement.
Model-graded evaluation (LLM-as-judge)
For subjective quality (is this summary good? is this answer helpful and accurate?), use another model as the grader. You give the grader the input, the output, and a rubric, and it scores. This scales judgement that would otherwise need a human.
LLM-as-judge has caveats - the grader has its own biases and can be inconsistent - so you calibrate it against human judgement on a sample, and use it for relative comparison (did this change make things better or worse?) more than absolute scores.
Human evaluation
For the highest-stakes quality judgements, a human reviews. Expensive and slow, so reserve it for the cases that matter most and for periodically calibrating your automated scoring.
The workflow
With an evaluation set and scoring in place, the workflow becomes:
- Make a change (new prompt, new model, new retrieval logic)
- Run the evaluation set against the change
- Compare scores to the baseline
- Ship only if it's better (or at least not worse on the things that matter)
This turns "I think this prompt is better" into "this prompt scores higher on our evaluation set." It's the difference between guessing and knowing.
It's also what makes swapping models safe. When a new model ships, run your evaluation set against it. If it scores better, switch. Without evaluation, model swaps are terrifying; with it, they're routine.
Evaluation in production
Offline evaluation (against your set) catches regressions before shipping. Production evaluation catches what your set didn't anticipate:
- Sample real interactions and score them (rule-based or model-graded) to track quality over time
- Capture user signals - thumbs up/down, edits to AI output, abandonment - as implicit quality measures
- Feed failures back into the offline evaluation set
Production is where you discover the inputs you didn't think to test. The loop - find a production failure, add it to the evaluation set, fix it, confirm the fix - is how an AI product's quality compounds over time.
The cost of skipping evaluation
Teams that skip evaluation ship changes blind. Each prompt tweak might help or hurt; nobody knows. Quality drifts. A model swap that should be routine becomes a gamble nobody wants to take, so the product stays on an outdated model. Eventually a quality problem reaches enough users to become visible, and the team scrambles with no way to test fixes systematically.
Evaluation isn't overhead - it's what lets an AI product improve with confidence instead of degrading with uncertainty. It's infrastructure, built in from the start.
What to do next
If you're building an AI product and want it to improve reliably rather than drift, book a 30-minute discovery call. Evaluation is core to how we build.
Read next: Building AI features that don't hallucinate and AI product observability.
Got a Bubble or Canvas app you’d like a second pair of eyes on?
30-minute discovery call. We’ll look at your app live and tell you honestly what we’d do next.
Or grab the Bubble migration playbook PDF.