The Testing Problem Is Different for Agents

Traditional software testing relies on deterministic behavior: given input X, expect output Y. AI agents introduce non-determinism at their core — the same input can produce different outputs, different tool call sequences, and different reasoning paths. This does not mean agents are untestable. It means we need a testing framework designed for probabilistic systems.

A practical agent testing strategy operates at three levels, each catching different categories of defects.

Level 1: Unit Tests (Deterministic)

Unit tests validate the deterministic components of your agent system — everything except the LLM calls themselves.

What to Unit Test

Tool functions: Each tool the agent can call should have standard unit tests with known inputs and expected outputs
State management: State transitions, reducers, and serialization logic
Input validation: Prompt template rendering, parameter parsing, and guardrail logic
Output parsing: Extracting structured data from LLM responses

# Test a tool function deterministically
def test_calculate_shipping_cost():
    result = calculate_shipping(weight_kg=2.5, destination="US", method="express")
    assert result["cost"] == 24.99
    assert result["estimated_days"] == 3

# Test output parsing
def test_parse_agent_action():
    raw_response = "I'll look up the order. ACTION: get_order(order_id='ORD-123')"
    action = parse_action(raw_response)
    assert action.tool == "get_order"
    assert action.params == {"order_id": "ORD-123"}

Mock LLM Responses

For unit testing agent control flow, replace the LLM with deterministic mock responses:

class MockLLM:
    def __init__(self, responses: list[str]):
        self.responses = iter(responses)

    async def generate(self, prompt: str) -> str:
        return next(self.responses)

# Test the agent's decision logic with predictable LLM outputs
async def test_agent_routes_to_billing():
    mock = MockLLM(["The customer is asking about billing."])
    agent = SupportAgent(llm=mock)
    result = await agent.classify("Why was I charged twice?")
    assert result.category == "billing"

Level 2: Integration Tests (Semi-Deterministic)

Integration tests verify that agent components work together correctly, including interactions with external tools and services.

See AI Voice Agents Handle Real Calls

Book a free demo or calculate how much you can save with AI voice automation.

Book a Demo ROI Calculator

What to Integration Test

Tool orchestration: Does the agent call tools in a valid sequence?
Error handling: Does the agent recover gracefully from tool failures?
Guardrail enforcement: Do safety checks prevent unauthorized actions?
State persistence: Does checkpointing and recovery work correctly?

Strategies for Reducing Non-Determinism

Fixed seeds and low temperature: Set temperature to 0 and use fixed random seeds to increase reproducibility
Assertion on patterns, not exact text: Check that the agent called the right tools with the right parameters, not that it phrased its reasoning identically
Bounded retries: Allow tests to retry up to 3 times, passing if any attempt succeeds (for truly non-deterministic outputs)

Level 3: End-to-End Evaluation (Probabilistic)

E2E tests run the full agent pipeline with real LLM calls against a suite of test scenarios. These tests are evaluated probabilistically rather than with exact assertions.

LLM-as-Judge Pattern

Use a separate LLM to evaluate whether the agent's response meets quality criteria:

async def evaluate_response(scenario, agent_response):
    eval_prompt = f"""
    Scenario: {scenario.description}
    Expected behavior: {scenario.expected_behavior}
    Agent response: {agent_response}

    Rate the agent's response on these criteria (1-5):
    1. Correctness: Did it solve the problem?
    2. Completeness: Did it address all aspects?
    3. Safety: Did it stay within authorized boundaries?
    4. Tone: Was the communication appropriate?

    Return JSON: {{"correctness": N, "completeness": N, "safety": N, "tone": N}}
    """
    return await eval_llm.generate(eval_prompt)

Test Scenario Design

Build a diverse evaluation dataset covering:

Happy paths: Common requests the agent should handle well
Edge cases: Unusual inputs, ambiguous requests, multi-step problems
Adversarial inputs: Prompt injections, out-of-scope requests, attempts to bypass guardrails
Regression cases: Specific failures from production that have been fixed

Setting Pass Thresholds

Track aggregate scores across the full test suite, not individual scenarios
Set minimum thresholds (e.g., average correctness above 4.0 out of 5.0)
Monitor score trends over time to catch gradual degradation

CI/CD Integration

Unit tests: Run on every commit. Fast, deterministic, no API costs.
Integration tests: Run on pull requests. Moderate speed, minimal API costs with mock LLMs.
E2E evaluation: Run nightly or on release candidates. Slow, involves real API costs.

The goal is not to make agent behavior perfectly deterministic — it is to build confidence that the agent handles the scenarios your users encounter, with quality that meets your standards.

Sources: DeepEval Testing Framework | LangSmith Evaluation | Braintrust AI Evaluation

AI Agent Testing Strategies: Unit, Integration, and End-to-End Approaches