nnenna hacks
Posts
Gen AI Testing is Not Software Testing: Rethinking QA for Intelligent Systems

Gen AI Testing is Not Software Testing: Rethinking QA for Intelligent Systems

Traditional testing won’t cut it. To build intelligent, reliable AI systems, product teams need a new playbook.

Nnenna Ndukwe
May 13, 2025

Editor’s Note: Via invitation, I attended the AI Community Conference at Microsoft in Cambridge, MA. It was there that I had the privilege of listening to Ashish Bhatia, Principal Product Manager at Microsoft, give a session on “Looking Beyond Vibe Checks, for your Intelligent Solutions”. My entire post was transcribed from handwritten notes and photos I captured of his presentation. I hope you enjoy these insightful takeaways on building battle-tested Generative AI solutions architecture, with a focus on testing and evaluations for production-grade software.

Nnenna Ndukwe

Why Gen AI Testing Matters More than Ever

Engineering and product teams are shipping AI into production faster than ever. While this is exciting (for those who are AI enthusiasts and believe in its potential), there are cracks in the systems and processes for proper Gen AI testing. Applying traditional software testing strategies to Gen AI systems leads to blind spots, brittle logic, and broken UX. When you consider even higher stakes, it leads to insecure systems that are relatively trivial to exploit by bad-acting users.

If you're still testing Gen AI features like typical backend services, you're setting your team up for failure.

The Testing Shift: From Deterministic Code to Probabilistic Systems

In software testing, we verify whether code behaves as expected. Every output can be traced back to a function, module, or line. We write unit tests, mock APIs, and enforce strict schemas.

But Generative AI breaks that system.

LLMs are not deterministic. They're probabilistic engines. Even with the same prompt, you might get five different outputs and maybe none (or few) are technically “wrong.” This challenges everything we know about QA.

What Makes Gen AI Testing Different?

Here’s a breakdown of the differences between traditional software testing and generative AI testing:

Aspect	Traditional Software Testing	Generative AI Testing
Nature of Output	Deterministic	Probabilistic
Expected Behavior	Known and fixed	Contextual and flexible
Testing Strategy	Unit, integration, end-to-end tests	Prompt evals, hallucination checks, bias tests
Metrics	Code coverage, pass/fail assertions	Factuality, coherence, style, completeness
Debugging	Step-through and logs	Model selection, prompt design, eval scores
Versioning	Semantic versioning of code	Model versioning, dataset tracking

Software testing asks: “Did this function do what it was supposed to?”

Gen AI testing asks: “Is this response good enough? Why?”

Breaking Down the Gen AI Testing Process

Here’s a closer look at each phase of the Gen AI Testing Process, adapted from traditional testing but optimized for the unique nature of AI-generated outputs.

1. Prompt Selection

Before you test the model, you must get your prompt right.

Prompts are the new programming interface. They're fragile, often opaque, and tightly coupled to model behavior. This step is about:

Selecting representative prompts across use cases
Designing few-shot vs zero-shot strategies
Creating variations for A/B testing
Including edge cases and ambiguous inputs

🛠️ Tools to use: LangSmith, PromptLayer, guidance libraries
📈 What to measure: Prompt effectiveness, sensitivity, completion consistency, etc.

2. Model Selection

Don’t just pick the biggest or cheapest model. This step is about evaluating tradeoffs.

Run the same prompts across multiple models to compare:

Cost per output
Latency
Factual accuracy
Style, tone, and safety

Start with top-tier models to establish a baseline, then work down toward more efficient alternatives using your own evaluation pipeline.

🛠️ Tools to use: OpenRouter, vLLM, HuggingFace Inference
📈 What to measure: ROUGE/BLEU for textual alignment, subjective ratings (via LLM judge), latency/cost vs quality benchmarks

3. Accuracy Testing

This is where the system is stress-tested for hallucinations, missing context, and factual correctness.

Accuracy testing includes:

Comparing model output against ground truth
Testing retrieval-augmented results (RAG)
Validating how well the model understands user intent

For agentic systems, accuracy includes tool selection, workflow steps, and reasoning chains.

🛠️ Tools to use: Ragas, Giskard, TruLens
📈 What to measure: Precision, recall, semantic alignment, eval scores by category (e.g. accuracy vs relevance)

4. Regression Testing

Once your model and prompt are working well, lock it in and protect against silent failure.

Changes in model version, prompt wording, and system configs can all cause unexpected degradation. Regression testing ensures new changes don’t break what used to work.

🛠️ Tools to use: EvalDBs, test set storage, prompt versioning systems
📈 What to measure: Output diffs, drop in eval scores, failure cases caught early

Bonus Tip: Add Human-in-the-Loop (HITL)

Even with automated evals, human reviewers can catch nuance, tone shifts, or implicit bias that models miss. Periodic audits or targeted HITL review can boost trust, especially in high-stakes Gen AI applications.

Testing Strategies for Gen AI Solutions

Testing Gen AI systems is about trust, safety, behavior under pressure, and how well your system holds up across use cases and users.

Here are the 6 testing strategies you must consider when building reliable Gen AI products:

🔦 1. Vibe Check

What are vibe checks? Think of them as quick intuition checks to gauge model behavior.

This is the human litmus test used during ideation and development. You run prompts casually and see: does the model sound right? Does it feel helpful, fluent, or on-brand?

Use it for prototyping new features, early-stage UX alignment, and catching off-tone outputs.

Vibe checks are subjective, but useful. Think of them as AI’s equivalent to "smoke testing”.

🔐 2. Responsible AI Testing

What is responsible AI testing? Ensuring ethical behavior, safety, and fairness.

As AI systems are given the responsibility of making more autonomous decisions, you need to test them against ethical standards.

What should you test for? Toxicity, bias, and harmful suggestions. Accessibility and inclusive outputs are critical, too. And regulatory alignment (think: GDPR, copyright, fairness, etc.)

🛠️ Tools: OpenAI Risk Cards, IBM AI Fairness 360, SafetyGym
Build safety evals into your CI as a priority.

🧪 3. Adversarial Testing

What is adversarial testing? Stress-testing your model by making it fail on purpose.

Challenge your system like an attacker would. Try jailbreak prompts, misleading user inputs, and using data that mimics adversarial behavior (prompt injection, tone shift, etc.)

🧰 Tools: promptbench, red-teaming sandboxes
Think like a threat modeler in order to find weaknesses before users do.

🧬 4. Unit & Batch Testing

What is unit and batch testing? Evaluating individual components and prompt behaviors.

Use batch prompt testing to:

Validate prompt variations
Compare model responses side-by-side
Evaluate thousands of outputs in parallel

Unit testing also applies to specific modules in your agentic system, which includes individual agents, tool calling, memory, and planning components. If you can’t test well at the agent component level, you won’t be able to evaluate well at the integration level and system level.

Thinking in modules and components for unit and batch testing is practice that builds confidence at scale.

🔄 5. Integration Testing

What is integration testing? Validating system behavior across full user flows that leverage multiple components, tools, services, etc.

This tests how all components interact: models, tools, context windows, external APIs, and more. The full production.

Use integration testing to catch context window overflow, tool call failures, and response errors that only emerge during runtime orchestration.

Integration testing is crucial for agent workflows, RAG systems, and multi-step AI experiences.

🧨 6. Regression Testing

Returning from the previous section for reinforcement.

As models update, context shifts, and prompts evolve — regression testing ensures your system doesn’t silently degrade. Use snapshot comparisons, test cases, and behavioral diffs to catch drift.

Insights from Microsoft’s AI Battlefield

Insight #1: Evaluation Is the Product

The output is the product in GenAI systems. That means your evaluation pipeline must be production-grade and as robust as your CI/CD workflow.

❝

“Own your evals before you own your AI.”

- Ashish Bhatia

Before you scale any Gen AI system, own your evaluation pipeline. It’s the only way to compare models for your actual use case (not leaderboard metrics). It’s critical for benchmarking prompt strategies and context formatting. Also, owning your evals pipeline helps to understand tradeoffs for cost, latency, and quality in your specific use cases. Your eval pipeline is your decision-making engine.

Define eval metrics that matter (e.g. relevance, tone, completeness)
Log and store model outputs with context for traceability
Use RAG-based testing to simulate realistic queries
Automate subjective evaluation with LLM-as-a-judge patterns

If you’re not evaluating every model response today, you’ll miss degradation, drift, or even misuse in production.

Insight #2: Start With the Best Model, Then Work Backwards

❝

Start with the best model (e.g. GPT-4o), then work your way down until you find the cheapest viable model for your quality bar.

Start with GPT-4o, Claude Opus, or other frontier models to build your baseline. Once your eval pipeline gives you a clear signal of success, optimize for cost or latency by testing smaller models (Mistral, LLaMA, Phi-3), fine-tuned internal variants, and task-specific routers.

But never guess. Let eval data guide your optimization. Otherwise, you're just shooting in the dark.

Insight #3: Leverage MCPs, the Backbone of Scalable AI Architectures

Raw model calls are not enough. Reliable Gen AI systems are wrapped in orchestration logic, guardrails, and Model Context Protocols (MCPs), a standardized way to deliver context to LLMs.

❝

MCPs define how memory, embeddings, user intent, and system state are packaged into each model request.

As you scale from prototype to production, the complexity of context handling becomes the bottleneck. That’s where Model Context Protocols (MCPs) come in.

MCPs help define how inputs, memory, retrieval-augmented data, and system state are packaged and passed into LLMs across different sessions and agents. Think of it as the API contract for the context itself.

Why MCPs matter:

Standardize how your prompts are structured
Enable model switching without breaking workflows
Improve traceability, caching, and replay for evals

If you’re building agentic or retrieval-heavy systems, MCPs are essential for reliability and observability. MCPs enable multi-modal context (text + embeddings + metadata), better observability and versioning of prompts, and easier model switching and sandboxing.

If you want to scale agentic or RAG systems, MCPs are non-negotiable.

Insight #4: Agentic Systems Must Be Tested Holistically

AI can be a tool. Or it can function as an agent, making decisions, reasoning, even taking action on your behalf.

When is AI a tool? When it enhances a workflow with a defined input/output (e.g., code autocomplete, summarization).
When is AI an agent? When it orchestrates multiple steps, retrieves context, and makes autonomous decisions across systems.

Agentic behavior introduces uncertainty, chaining, and emergent interactions, which all demand new forms of testing. When you go from “AI as a tool” to “AI as an agent,” your testing scope explodes. Instead of simply validating outputs, you're tasked with validating reasoning chains, tool use, and autonomy.

Testing needs to account for:

Multi-step planning and execution
External tool calling or API interactions
Recovery from ambiguity or failure

That means building end-to-end test harnesses for agentic behavior beyond the scope of eval output.

What Should Product Teams Do Next?

Whether you're an engineer, architect, PM, or technical leader, here’s how to move forward:

For Engineers:

Use prompt eval tools like Ragas to track quality
Log and trace every AI interaction, including prompt metadata and responses
Version control your prompts and system messages, just like code

For Architects:

Design modular GenAI layers with well-defined MCPs
Enable model-agnostic routing to optimize for cost vs. performance
Prioritize eval pipelines over shiny demos

For Product Managers:

Define success metrics beyond correctness (e.g., tone, brevity, safety).
Align your roadmap with eval insights. Don't fly blind.
Budget for experimentation with multiple models and retrieval systems.

For Technical Decision Makers:

Demand evaluation infrastructure alongside model development
Favor platforms that allow deep observability into LLM behavior
Build for model portability, not vendor lock-in

Final Takeaways

✅ Traditional tests only work for deterministic systems.
✅ Gen AI testing must be probabilistic, contextual, and output-first.
✅ Model selection without eval data is wasted effort.
✅ Don’t ship agents without agent-level evaluation.
✅ Own your model context protocols — they define your system’s intelligence.

If you’re serious about building with AI, your testing strategy must evolve.

Your AI won’t be perfect. But your approach to Gen AI testing can be systematic.

Ready to Build Testable, Trustworthy Gen AI Systems?

If your team is building AI-powered features or evaluating Gen AI tools, I can help you:

Design scalable, cost-efficient model evaluation pipelines
Integrate GenAI testing into your SDLC and CI/CD
Choose the right model and the right architecture for your goals

Let’s talk. Book a strategy session or connect with me on LinkedIn.

Nnenna Ndukwe is an AI consultant, Full-Stack Software Engineer, Developer Advocate, and an active AI community member in Boston. Connect with her on LinkedIn and X for more discussions on AI, software engineering, and the future of technology.

Reply

or to participate.