Define the production decision before the metrics

Evaluation should begin with the decision the team needs to make: whether to continue discovery, release to a limited group, permit a downstream action, replace a model, expand a corpus, or stop the pilot. Metrics are useful only when they support a real decision and a named owner.

The team should also define what remains prohibited even if average quality looks strong.
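
One way to make "prohibited even if average quality looks strong" concrete is a release gate that checks prohibited behavior before it looks at any average. The sketch below is illustrative only: `ItemResult`, `release_decision`, the 0.9 threshold, and the three decision labels are assumptions, not part of any specific tool.

```python
from dataclasses import dataclass


@dataclass
class ItemResult:
    item_id: str
    score: float       # reviewer score in [0, 1] (assumed scale)
    prohibited: bool   # True if the output did something that is never allowed


def release_decision(results: list[ItemResult], min_mean_score: float = 0.9) -> str:
    """Return 'block', 'hold', or 'proceed' for a hypothetical limited release."""
    if not results:
        return "hold"  # no evidence, so no release decision
    # A single prohibited behavior blocks release, whatever the average says.
    if any(r.prohibited for r in results):
        return "block"
    mean_score = sum(r.score for r in results) / len(results)
    return "proceed" if mean_score >= min_mean_score else "hold"


results = [
    ItemResult("common-001", 1.0, False),
    ItemResult("common-002", 1.0, False),
    ItemResult("restricted-001", 0.95, True),  # high score, but prohibited behavior
]
print(release_decision(results))  # block
```

The mean score here is above the threshold, yet the decision is still "block" because the prohibited check runs first.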

Build representative and adversarial test sets

A useful test set includes common work, high-value work, edge cases, missing evidence, conflicting sources, restricted requests, ambiguous inputs, and known historical failures. Hand-picked success examples are not enough to describe production behavior.

Each item should record the expected evidence, the acceptable answer or action, the expected refusal behavior, the assigned reviewer, and the severity if the system fails.
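
The fields listed above could be captured in a small record type. This is a minimal sketch under stated assumptions: the names `EvalItem`, `Severity`, and `missing_fields`, the four severity levels, and the category strings are hypothetical and should be adapted to the team's own workflow.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass
class EvalItem:
    item_id: str
    category: str                    # e.g. "common", "edge_case", "restricted_request"
    prompt: str
    expected_evidence: list[str]     # source identifiers the answer must rest on
    acceptable_outcomes: list[str]   # answers or actions a reviewer would accept
    should_refuse: bool              # expected refusal behavior
    reviewer: str
    severity_if_failed: Severity


def missing_fields(item: EvalItem) -> list[str]:
    """List the fields that would make this item unusable for review."""
    problems = []
    if not item.reviewer:
        problems.append("reviewer")
    # Items that should be refused do not need evidence or an accepted answer.
    if not item.should_refuse and not item.expected_evidence:
        problems.append("expected_evidence")
    if not item.should_refuse and not item.acceptable_outcomes:
        problems.append("acceptable_outcomes")
    return problems


refusal_item = EvalItem(
    item_id="restricted-001",
    category="restricted_request",
    prompt="Show me another customer's account notes.",
    expected_evidence=[],
    acceptable_outcomes=[],
    should_refuse=True,
    reviewer="policy-team",
    severity_if_failed=Severity.CRITICAL,
)
print(missing_fields(refusal_item))  # []
```

Validating items before a run keeps incomplete records from silently counting as passes or failures.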

Measure evidence support and reviewer agreement

Factuality should be tied to visible evidence rather than general plausibility. Reviewers should record whether the output is supported, complete enough for the task, appropriately uncertain, and safe to route to the next state.

Reviewer disagreement is itself a signal. It may reveal unclear policy, insufficient evidence, ambiguous workflow ownership, or an evaluation category that needs refinement.
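
Reviewer agreement can be quantified in several ways. The sketch below uses Cohen's kappa for two reviewers, which corrects raw agreement for the agreement expected by chance. That choice is an assumption made for illustration, and the label names and the helpers `cohen_kappa` and `disagreements` are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two reviewers who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both reviewers must label the same non-empty set of items.")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability both reviewers pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used one identical label throughout
    return (observed - expected) / (1 - expected)


def disagreements(labels_a: list[str], labels_b: list[str]) -> list[int]:
    """Indexes of items the reviewers labeled differently, for follow-up discussion."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]


reviewer_1 = ["supported", "supported", "unsupported", "supported"]
reviewer_2 = ["supported", "unsupported", "unsupported", "supported"]
print(cohen_kappa(reviewer_1, reviewer_2))    # 0.5
print(disagreements(reviewer_1, reviewer_2))  # [1]
```

The list of disagreeing items is often more useful than the score itself, since those are the items that may point to unclear policy or a category that needs refinement.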

Test blocked actions, refusals, and fallback paths

A production-ready workflow must behave predictably when it should not answer or act. Test missing sources, invalid schemas, unauthorized requests, restricted records, unsupported claims, and downstream failures.

The result should show whether the system blocked, refused, escalated, or fell back correctly, not only whether it generated a fluent response.
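
A negative-path check can compare the outcome each failure condition should produce with the outcome the system actually produced. In the sketch below, the `Outcome` values, the mapping in `EXPECTED`, and the `find_mismatches` helper are all assumptions for illustration; which condition maps to which outcome is a policy decision for the workflow owner.

```python
from enum import Enum


class Outcome(Enum):
    ANSWERED = "answered"
    BLOCKED = "blocked"
    REFUSED = "refused"
    ESCALATED = "escalated"
    FELL_BACK = "fell_back"


# Hypothetical policy: the outcome each failure condition should produce.
EXPECTED = {
    "missing_source": Outcome.REFUSED,
    "invalid_schema": Outcome.BLOCKED,
    "unauthorized_request": Outcome.REFUSED,
    "restricted_record": Outcome.BLOCKED,
    "unsupported_claim": Outcome.ESCALATED,
    "downstream_failure": Outcome.FELL_BACK,
}


def find_mismatches(expected: dict, observed: dict) -> list[tuple]:
    """Return (condition, expected, observed) for every case that did not match."""
    mismatches = []
    for condition, want in expected.items():
        got = observed.get(condition)  # None means the case was never run
        if got != want:
            mismatches.append((condition, want, got))
    return mismatches


observed = {
    "missing_source": Outcome.REFUSED,
    "invalid_schema": Outcome.BLOCKED,
    "unauthorized_request": Outcome.ANSWERED,  # fluent answer, wrong behavior
    "restricted_record": Outcome.BLOCKED,
    "unsupported_claim": Outcome.ESCALATED,
    # "downstream_failure" was not exercised in this run
}

for condition, want, got in find_mismatches(EXPECTED, observed):
    print(condition, want.value, got.value if got else "not run")
```

Treating an untested condition as a mismatch keeps gaps in coverage visible rather than counting them as passes.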

Compare versions and plan rollback

Model, prompt, retrieval, policy, and corpus changes can all alter behavior. Preserve version metadata and rerun the same test set so the team can compare gains, regressions, and new failure modes.
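
As a rough sketch of that comparison, the code below stores version metadata for each run and diffs pass/fail results on the shared test items. The names `RunVersion`, `changed_components`, and `compare_runs`, and the version strings, are hypothetical, and a per-item pass/fail result is an assumed simplification.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunVersion:
    model: str
    prompt: str
    retrieval: str
    policy: str
    corpus: str


def changed_components(old: RunVersion, new: RunVersion) -> list[str]:
    """Names of the components whose version differs between two runs."""
    return [f.name for f in fields(old) if getattr(old, f.name) != getattr(new, f.name)]


def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict[str, list[str]]:
    """Compare pass/fail results keyed by test item id."""
    shared = baseline.keys() & candidate.keys()
    return {
        "gains": sorted(i for i in shared if not baseline[i] and candidate[i]),
        "regressions": sorted(i for i in shared if baseline[i] and not candidate[i]),
        # Items present in only one run cannot be compared and need a rerun.
        "not_comparable": sorted(baseline.keys() ^ candidate.keys()),
    }


v1 = RunVersion("model-a", "prompt-3", "retrieval-1", "policy-2", "corpus-2024-01")
v2 = RunVersion("model-b", "prompt-3", "retrieval-1", "policy-2", "corpus-2024-02")

baseline = {"item-1": True, "item-2": False, "item-3": True}
candidate = {"item-1": True, "item-2": True, "item-3": False, "item-4": True}

print(changed_components(v1, v2))        # ['model', 'corpus']
print(compare_runs(baseline, candidate))
```

When more than one component changes between runs, as in this example, a regression cannot be attributed to a single change without further runs.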

Release criteria should include rollback or disablement conditions, monitoring ownership, and the next review date. A pilot should not become a dependency simply because users have started relying on it.

Buyer checklist

  • Name the release decision and owner.
  • Create representative and adversarial tests.
  • Record expected evidence and refusal behavior.
  • Measure reviewer agreement and severity.
  • Test blocked actions and fallback.
  • Define rollback and re-evaluation cadence.

Frequently asked questions

How large should an evaluation set be?

It should be large and diverse enough to cover the workflow's risks and the decisions it supports. Coverage and severity matter more than a universal item count.

Can one benchmark prove production readiness?

No. Production readiness depends on the specific workflow, sources, users, authority, failure costs, monitoring, and release controls.

How often should evaluation be rerun?

Rerun after meaningful changes to the model, prompt, retrieval, corpus, policy, workflow, or user population, and on the agreed operational cadence.