AI Product Development

Evaluating AI Output: How to Test a Non-Deterministic Product

You can't unit-test an LLM the way you test deterministic code. How to build evaluation sets, score non-deterministic output, and catch regressions before they ship.

Will Driscoll29 April 20269 min read

You change a prompt to fix one problem. Did it fix that problem? Did it break three others? With deterministic code, your test suite tells you. With AI, the output is non-deterministic and judgement-based - there's no obvious assertion. So most teams ship prompt and model changes on vibes, and find out they broke something when a user complains.

Evaluation is the discipline that fixes this. This article covers how to build evaluation for a non-deterministic AI product: evaluation sets, scoring methods, and catching regressions before they ship.

Why you can't unit-test an LLM

A unit test asserts an exact output: expect(add(2,2)).toBe(4). AI output doesn't work that way. Ask the model the same question twice and you may get two differently-worded correct answers. There's no single string to assert against.

So evaluation has to score quality rather than check equality. Is the answer correct? Is it grounded? Is it well-formatted? Does it refuse when it should? These are judgements, and evaluation is the practice of making those judgements systematically and repeatably.

The evaluation set

The foundation is an evaluation set: a collection of representative inputs paired with either expected outputs or quality criteria. Building it well is most of the work.

A good evaluation set includes:

Representative real inputs - the kinds of things actual users send, not idealised examples
Edge cases - the unusual inputs that break things
Adversarial cases - inputs designed to make the model misbehave (probing for hallucination, prompt injection, overstepping)
Refusal cases - inputs where the right answer is "I don't know" or a refusal

Start small. Even 20-30 well-chosen cases catch most regressions. Grow the set over time, especially by adding any real failure you find in production - every production bug becomes an evaluation case so it never recurs.

Scoring methods

How you score depends on the output type. From most to least objective:

Exact / structural checks

Where output should match a known answer or schema, check it directly. Classification ("is this spam?"), extraction ("does the output JSON have these fields with these values?"), and routing decisions can often be scored with exact or near-exact checks. These are the easiest and most reliable.

Rule-based checks

Where you can express quality as rules: does the answer cite a source? Is it within a length range? Does it avoid forbidden content? Does it include the required disclaimer? Rules catch a lot without needing judgement.

Model-graded evaluation (LLM-as-judge)

For subjective quality (is this summary good? is this answer helpful and accurate?), use another model as the grader. You give the grader the input, the output, and a rubric, and it scores. This scales judgement that would otherwise need a human.

LLM-as-judge has caveats - the grader has its own biases and can be inconsistent - so you calibrate it against human judgement on a sample, and use it for relative comparison (did this change make things better or worse?) more than absolute scores.

Human evaluation

For the highest-stakes quality judgements, a human reviews. Expensive and slow, so reserve it for the cases that matter most and for periodically calibrating your automated scoring.

The workflow

With an evaluation set and scoring in place, the workflow becomes:

Make a change (new prompt, new model, new retrieval logic)
Run the evaluation set against the change
Compare scores to the baseline
Ship only if it's better (or at least not worse on the things that matter)

This turns "I think this prompt is better" into "this prompt scores higher on our evaluation set." It's the difference between guessing and knowing.

It's also what makes swapping models safe. When a new model ships, run your evaluation set against it. If it scores better, switch. Without evaluation, model swaps are terrifying; with it, they're routine.

Evaluation in production

Offline evaluation (against your set) catches regressions before shipping. Production evaluation catches what your set didn't anticipate:

Sample real interactions and score them (rule-based or model-graded) to track quality over time
Capture user signals - thumbs up/down, edits to AI output, abandonment - as implicit quality measures
Feed failures back into the offline evaluation set

Production is where you discover the inputs you didn't think to test. The loop - find a production failure, add it to the evaluation set, fix it, confirm the fix - is how an AI product's quality compounds over time.

The cost of skipping evaluation

Teams that skip evaluation ship changes blind. Each prompt tweak might help or hurt; nobody knows. Quality drifts. A model swap that should be routine becomes a gamble nobody wants to take, so the product stays on an outdated model. Eventually a quality problem reaches enough users to become visible, and the team scrambles with no way to test fixes systematically.

Evaluation isn't overhead - it's what lets an AI product improve with confidence instead of degrading with uncertainty. It's infrastructure, built in from the start.

What to do next

If you're building an AI product and want it to improve reliably rather than drift, book a 30-minute discovery call. Evaluation is core to how we build.

Got a Bubble or Canvas app you’d like a second pair of eyes on?

30-minute discovery call. We’ll look at your app live and tell you honestly what we’d do next.

Book a discovery call See how we rescue Canvas apps →

Or grab the Bubble migration playbook PDF.