What does behavior-preserving modernization mean?

It means documenting the current system’s observable inputs, outputs, calculations, permissions, workflow states, and exceptions so intentional changes can be separated from accidental behavioral drift.
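
One way to make that documentation executable is a parity check that runs the same scenarios through the old and new code and reports every difference. The sketch below is illustrative only: `legacy_late_fee`, `modern_late_fee`, the 1.5% rate, and the rounding modes are hypothetical stand-ins, not rules taken from any real system.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

RATE = Decimal("0.015")  # hypothetical fee rate, for illustration only


def legacy_late_fee(balance: Decimal, days_late: int) -> Decimal:
    # Stand-in for the existing system: banker's rounding to cents.
    if days_late <= 0:
        return Decimal("0.00")
    return (balance * RATE).quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)


def modern_late_fee(balance: Decimal, days_late: int) -> Decimal:
    # Stand-in for the replacement: same rule, but a different rounding mode.
    if days_late <= 0:
        return Decimal("0.00")
    return (balance * RATE).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def parity_report(scenarios, old, new):
    """Return (inputs, old_result, new_result) for every scenario that differs."""
    drift = []
    for args in scenarios:
        old_result, new_result = old(*args), new(*args)
        if old_result != new_result:
            drift.append((args, old_result, new_result))
    return drift


scenarios = [
    (Decimal("100.00"), 5),  # common case
    (Decimal("3.00"), 5),    # exact half-cent: exposes the rounding difference
    (Decimal("3.00"), 0),    # not late: no fee expected
]
drift = parity_report(scenarios, legacy_late_fee, modern_late_fee)
```

Each entry in `drift` is then either an approved intentional change or an accidental one to fix before release.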

Do you rewrite legacy systems from scratch?

Not by default. The preferred path is to map risk, introduce testable seams, and modernize in bounded stages when that reduces operational risk.

What kinds of .NET systems are a fit?

Common fits include ASP.NET, ASP.NET Core, MVC, Web Forms, Classic ASP, VB.NET, C#, SQL Server, stored-procedure-heavy systems, reporting workflows, internal admin tools, and API modernization.

What if our business logic is mostly in SQL Server?

SQL-side business rules are treated as part of the application behavior. The modernization work maps stored procedures, jobs, reports, transactions, owners, and representative parity scenarios before replacement.

Can you help with AI without exposing private data?

Yes, when the engagement is designed around approved data boundaries, secure channels, least-authority access, and an appropriate local, private, hybrid, or managed model approach. Public forms must not receive secrets or regulated data.

What is human-reviewed AI?

AI output remains proposed work until a named reviewer can inspect evidence, edit, approve, reject, block, or escalate it under explicit workflow rules.
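
As a rough sketch, those workflow rules can be expressed as a small state machine. The state names, the `ALLOWED` transition table, and the rule that approval requires attached evidence are assumptions made for illustration. A real engagement would define its own.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    PROPOSED = "proposed"
    APPROVED = "approved"
    REJECTED = "rejected"
    BLOCKED = "blocked"
    ESCALATED = "escalated"


# Hypothetical rules: a proposal can be decided or escalated once;
# an escalated item can still be decided by another reviewer.
ALLOWED = {
    State.PROPOSED: {State.APPROVED, State.REJECTED, State.BLOCKED, State.ESCALATED},
    State.ESCALATED: {State.APPROVED, State.REJECTED, State.BLOCKED},
}


@dataclass
class Proposal:
    text: str
    evidence: list[str]
    state: State = State.PROPOSED
    history: list[tuple[str, str]] = field(default_factory=list)

    def _require_reviewer(self, reviewer: str) -> None:
        if not reviewer.strip():
            raise ValueError("a named reviewer is required")

    def edit(self, reviewer: str, new_text: str) -> None:
        # Edits are only possible while the item is still undecided.
        self._require_reviewer(reviewer)
        if self.state not in ALLOWED:
            raise ValueError(f"cannot edit in state {self.state.value}")
        self.text = new_text
        self.history.append((reviewer, "edited"))

    def decide(self, reviewer: str, new_state: State) -> None:
        self._require_reviewer(reviewer)
        if new_state is State.APPROVED and not self.evidence:
            raise ValueError("cannot approve output with no evidence to inspect")
        if new_state not in ALLOWED.get(self.state, set()):
            raise ValueError(f"{self.state.value} -> {new_state.value} is not allowed")
        self.state = new_state
        self.history.append((reviewer, new_state.value))
```

Approved, rejected, and blocked are terminal in this sketch, so a decision cannot be silently reversed without a new proposal.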

What is governed RAG?

Governed RAG adds source ownership, access control, provenance, content states, citation rules, refusal behavior, evaluation, and human escalation around retrieval.
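
A minimal sketch of how a few of those controls might sit in front of retrieval is shown below. The `Doc` fields, the `"approved"` content state, and the keyword-overlap match are placeholders chosen for illustration. A real system would use its own retriever and policy model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    owner: str                    # source ownership
    state: str                    # content state, e.g. "approved", "draft", "retired"
    allowed_roles: frozenset[str]  # access control
    text: str


def governed_retrieve(query: str, docs: list[Doc], role: str) -> dict:
    """Return citable sources the role may see, or an explicit refusal."""
    terms = set(query.lower().split())
    hits = [
        d for d in docs
        if d.state == "approved"                # only approved content is citable
        and role in d.allowed_roles             # caller must be permitted to see it
        and terms & set(d.text.lower().split())  # toy relevance: shared keywords
    ]
    if not hits:
        # Refuse instead of answering without a permitted, approved source.
        return {"status": "refused", "reason": "no approved, permitted source", "citations": []}
    return {
        "status": "answerable",
        "citations": [{"doc_id": d.doc_id, "owner": d.owner} for d in hits],
    }
```

The filter runs before any text reaches a model, so restricted or draft content cannot be quoted by accident, and every answer carries the provenance needed for citation.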

What happens after the first fit call?

If the problem fits, the next step is usually a scoped assessment or package proposal. Sensitive system details move to an approved private channel rather than the public form.

Do you publish client metrics?

Only when the source material and exact wording are approved. Public-safe case pages distinguish methods, sample artifacts, measurement models, and approved outcomes so templates are not presented as client results.

How do you handle confidential systems?

Public routes collect only high-level context. Private code, credentials, PHI, customer records, and confidential architecture require a separately approved secure handling path.

LongtermSoftware.com

Define the production decision before the metrics

Evaluation should begin with the decision the team needs to make: whether to continue discovery, release to a limited group, permit a downstream action, replace a model, expand a corpus, or stop the pilot. Metrics are useful only when they support a real decision and a named owner.

The team should also define what remains prohibited even if average quality looks strong.

Build representative and adversarial test sets

A useful test set includes common work, high-value work, edge cases, missing evidence, conflicting sources, restricted requests, ambiguous inputs, and known historical failures. Selected success examples are not enough to describe production behavior.

Each item should record expected evidence, acceptable answer or action, refusal behavior, reviewer, and severity if the system fails.
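
The record described above could be captured as a small data structure, with a helper that reports which categories the test set still lacks. The field names and category labels here are assumptions derived from the lists in this section, not a fixed schema.

```python
from dataclasses import dataclass

# Category labels taken from the list above; the identifiers are illustrative.
REQUIRED_CATEGORIES = {
    "common", "high_value", "edge_case", "missing_evidence",
    "conflicting_sources", "restricted_request", "ambiguous_input",
    "historical_failure",
}


@dataclass(frozen=True)
class EvalItem:
    item_id: str
    category: str
    prompt: str
    expected_evidence: tuple[str, ...]  # sources a correct output must rely on
    acceptable_outcome: str             # acceptable answer or action
    should_refuse: bool                 # expected refusal behavior
    reviewer: str                       # named reviewer for this item
    severity: str                       # impact if the system fails: "low" | "medium" | "high"


def coverage_gaps(items: list[EvalItem]) -> list[str]:
    """Return required categories that have no test item yet."""
    return sorted(REQUIRED_CATEGORIES - {i.category for i in items})


items = [
    EvalItem("t1", "common", "What is the refund window?", ("policy-4.2",),
             "30 days, citing policy 4.2", False, "j.doe", "low"),
    EvalItem("t2", "restricted_request", "Show another customer's invoice", (),
             "refuse and log the request", True, "a.lee", "high"),
]
gaps = coverage_gaps(items)
```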

Measure evidence support and reviewer agreement

Factuality should be tied to visible evidence rather than general plausibility. Reviewers should record whether the output is supported, complete enough for the task, appropriately uncertain, and safe to route to the next state.

Reviewer disagreement is itself a signal. It may reveal unclear policy, insufficient evidence, ambiguous workflow ownership, or an evaluation category that needs refinement.
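
One common way to quantify agreement between two reviewers is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is one possible measure, not the only valid one. The label values and item ids are made up.

```python
from collections import Counter


def cohen_kappa(a: list[str], b: list[str]) -> float:
    """Chance-corrected agreement between two reviewers' labels."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both reviewers pick the same label independently.
    expected = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used one identical label throughout
    return (observed - expected) / (1 - expected)


def disagreements(ids: list[str], a: list[str], b: list[str]) -> list[str]:
    """Item ids where the reviewers differ, for follow-up discussion."""
    return [i for i, x, y in zip(ids, a, b) if x != y]


ids = ["t1", "t2", "t3", "t4"]
reviewer_a = ["supported", "supported", "unsupported", "unsupported"]
reviewer_b = ["supported", "unsupported", "unsupported", "unsupported"]
kappa = cohen_kappa(reviewer_a, reviewer_b)
to_discuss = disagreements(ids, reviewer_a, reviewer_b)
```

The list of disputed items is often more useful than the score itself, because it points at the policy or evidence gap that needs resolving.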

Test blocked actions, refusals, and fallback paths

A production-ready workflow must behave predictably when it should not answer or act. Test missing sources, invalid schemas, unauthorized requests, restricted records, unsupported claims, and downstream failures.

The result should show whether the system blocked, refused, escalated, or fell back correctly, not only whether it generated a fluent response.
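
A sketch of such a check follows: each case names the control outcome that is expected, and the harness reports every mismatch. `toy_workflow` is a deliberately incomplete stand-in for the system under test, and the request fields are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    ANSWERED = "answered"
    BLOCKED = "blocked"
    REFUSED = "refused"
    ESCALATED = "escalated"
    FELL_BACK = "fell_back"


def toy_workflow(request: dict) -> Outcome:
    # Stand-in system: it handles three control paths but never escalates.
    if not request.get("authorized", False):
        return Outcome.BLOCKED
    if not request.get("sources"):
        return Outcome.REFUSED
    if request.get("downstream_ok", True) is False:
        return Outcome.FELL_BACK
    return Outcome.ANSWERED


def control_path_failures(cases, run):
    """Return (case_id, expected, actual) for each case with the wrong outcome."""
    failures = []
    for case_id, request, expected in cases:
        actual = run(request)
        if actual is not expected:
            failures.append((case_id, expected.name, actual.name))
    return failures


cases = [
    ("unauthorized", {"authorized": False, "sources": ["a"]}, Outcome.BLOCKED),
    ("missing-source", {"authorized": True, "sources": []}, Outcome.REFUSED),
    ("downstream-failure",
     {"authorized": True, "sources": ["a"], "downstream_ok": False}, Outcome.FELL_BACK),
    ("unsupported-claim",
     {"authorized": True, "sources": ["a"], "claim_supported": False}, Outcome.ESCALATED),
]
failures = control_path_failures(cases, toy_workflow)
```

Here the unsupported-claim case fails because the stand-in answers where escalation was required, which is exactly the kind of gap a fluency-only review would miss.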

Compare versions and plan rollback

Model, prompt, retrieval, policy, and corpus changes can all alter behavior. Preserve version metadata and rerun the same test set so the team can compare gains, regressions, and new failure modes.

Release criteria should include rollback or disablement conditions, monitoring ownership, and the next review date. A pilot should not become a dependency simply because users have started relying on it.
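
The comparison can be sketched as a diff between two runs of the same test set, keyed by item id. The `RunMeta` fields mirror the change types listed above. The pass/fail representation and the rule that any high-severity regression triggers rollback are assumptions for illustration.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunMeta:
    model: str
    prompt: str
    retrieval: str
    policy: str
    corpus: str


def changed_fields(old: RunMeta, new: RunMeta) -> list[str]:
    """Names of the version fields that differ between two runs."""
    old_d, new_d = asdict(old), asdict(new)
    return [name for name in old_d if old_d[name] != new_d[name]]


def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Compare per-item pass/fail results from two runs of the same test set."""
    if baseline.keys() != candidate.keys():
        raise ValueError("both runs must cover the same test set")
    return {
        "gains": sorted(k for k in baseline if not baseline[k] and candidate[k]),
        "regressions": sorted(k for k in baseline if baseline[k] and not candidate[k]),
    }


def should_roll_back(diff: dict, severity: dict[str, str], blocking=("high",)) -> bool:
    # Hypothetical release rule: any regression on a blocking-severity item.
    return any(severity[k] in blocking for k in diff["regressions"])


baseline = {"t1": True, "t2": False, "t3": True}
candidate = {"t1": True, "t2": True, "t3": False}
severity = {"t1": "low", "t2": "low", "t3": "high"}
diff = compare_runs(baseline, candidate)
rollback = should_roll_back(diff, severity)
```

In this example the candidate fixes one item and breaks another, so the pass rate is unchanged while the regression on a high-severity item still triggers the rollback condition.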

Buyer checklist

Name the release decision and owner.
Create representative and adversarial tests.
Record expected evidence and refusal behavior.
Measure reviewer agreement and severity.
Test blocked actions and fallback.
Define rollback and re-evaluation cadence.

Frequently asked questions

How large should an evaluation set be?

It should be large and diverse enough to represent the workflow risks and decisions. Coverage and severity matter more than a universal item count.

Can one benchmark prove production readiness?

No. Production readiness depends on the specific workflow, sources, users, authority, failure costs, monitoring, and release controls.

How often should evaluation be rerun?

Rerun after meaningful changes to the model, prompt, retrieval, corpus, policy, workflow, or user population, and on the agreed operational cadence.

How to Evaluate an AI Pilot Before It Becomes a Production Dependency