
Shadow Mode: How to Test AI Agents Before They Touch Real Data
The first time the Permit Agent parsed a real AHJ email, it got the permit number right but misread the deadline as 48 hours instead of 14 days. The mistake was subtle. A human would have caught it by reading the full paragraph. The agent, thrown off by a slightly ambiguous sentence structure, extracted the wrong number.
This is why shadow mode exists. Before any agent sends a real notification or writes to a live database, it runs in parallel with human processes for weeks — processing the same inputs, producing its own outputs, and logging them for comparison. No live actions. No customer impact. Just observation and measurement.
Shadow mode has three phases.
Phase 1: Observation. The agent reads real inputs but writes outputs to a shadow log instead of the production system. For the Permit Agent, this meant checking the live inbox, parsing emails, and recording what it would have done — which project it would have matched, which status it would have updated, and what notification it would have sent.
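A minimal sketch of the observation phase, in Python. The `parse_fn` callback, the record fields, and the JSONL shadow log are illustrative assumptions, not the production schema; the key point is that the agent's proposed action is recorded, never executed:

```python
import json
import time

def handle_email(email, parse_fn, shadow_log_path):
    """Run the agent on a real input, but append its proposed action
    to a shadow log instead of writing to the production system."""
    result = parse_fn(email)  # agent's proposed action; never executed
    record = {
        "ts": time.time(),
        "input_id": email["id"],
        "proposed_project": result.get("project"),
        "proposed_status": result.get("status"),
        "proposed_notification": result.get("notification"),
        "confidence": result.get("confidence"),
    }
    with open(shadow_log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

Because the log is append-only JSON Lines, the comparison phase can replay it line by line against what the human actually did.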
Phase 2: Comparison. A human reviewer compares the agent's shadow outputs to the actual actions taken by the PM. Every discrepancy is logged: false positives, false negatives, incorrect classifications, and missed edge cases. This comparison produces an accuracy score and a ranked list of failure modes.
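The comparison step can be sketched like this. The record fields and failure-mode labels are assumptions for illustration; the shape is what matters: one pass over the shadow log against the human's actual actions, producing an overall accuracy and failure modes ranked by frequency:

```python
from collections import Counter

def compare(shadow_records, human_records):
    """Compare shadow outputs to the human's actual actions.
    Returns (accuracy, failure modes ranked by frequency)."""
    failures = Counter()
    correct = 0
    for rec in shadow_records:
        truth = human_records.get(rec["input_id"])
        if truth is None:
            failures["false positive"] += 1          # agent acted, human did not
        elif rec["proposed_project"] != truth["project"]:
            failures["wrong project match"] += 1
        elif rec["proposed_status"] != truth["status"]:
            failures["wrong status"] += 1
        else:
            correct += 1
    # Inputs the human handled that the agent never produced output for.
    seen = {r["input_id"] for r in shadow_records}
    missed = sum(1 for k in human_records if k not in seen)
    if missed:
        failures["false negative"] += missed
    total = len(shadow_records) + missed
    accuracy = correct / total if total else 0.0
    return accuracy, failures.most_common()
```

The ranked `most_common()` list is what drives the "fix the top failure mode first" rule later in the checklist.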
Phase 3: Promotion. The agent is promoted to live operation only when accuracy meets a predefined threshold and the remaining errors are acceptable. For the OpsForEnergy agents, the threshold is 90% accuracy with zero high-severity errors — meaning no missed deadlines, no incorrect permit statuses, and no misfiled documents.
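A promotion gate along these lines might look like the sketch below. The function name is invented and the severity labels simply mirror the errors named above; the logic is the stated rule, accuracy at or above threshold and zero high-severity errors:

```python
# High-severity failure modes that block promotion regardless of accuracy.
BLOCKING = frozenset({"missed deadline", "wrong permit status", "misfiled document"})

def ready_to_promote(accuracy, error_labels, threshold=0.90, blocking=BLOCKING):
    """Promote only if accuracy meets the threshold AND none of the
    remaining failure modes are high-severity."""
    has_blocking = any(label in blocking for label in error_labels)
    return accuracy >= threshold and not has_blocking
```

Note that a single high-severity error vetoes promotion even at 99% accuracy: the two conditions are independent.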
The shadow mode checklist I use for every agent:
- Define the success criteria before shadow mode begins
- Log every input, output, and confidence score
- Review a random sample of outputs daily, not just the failures
- Track accuracy by category, not just overall
- Fix the top failure mode before moving to the next
- Require two consecutive weeks above threshold before promotion
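Two of the checklist items, per-category accuracy and the two-consecutive-weeks rule, lend themselves to small helpers. This is a sketch under assumed field names (`category`, `correct`), not the actual tooling:

```python
from collections import defaultdict

def accuracy_by_category(records):
    """Track accuracy per input category, not just overall."""
    stats = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for r in records:
        stats[r["category"]][1] += 1
        if r["correct"]:
            stats[r["category"]][0] += 1
    return {c: correct / total for c, (correct, total) in stats.items()}

def weeks_above_threshold(weekly_accuracy, threshold=0.90):
    """Require the two most recent weekly accuracies to both meet the
    threshold before promotion."""
    return len(weekly_accuracy) >= 2 and all(
        w >= threshold for w in weekly_accuracy[-2:]
    )
```

Per-category tracking matters because a 92% overall score can hide a 60% score on one rare but critical category.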
The metric: The Permit Agent ran in shadow mode for 14 days and processed 47 emails. It achieved 91% accuracy by the end of week two. The Field Agent ran for 21 days on SMS and email inputs, reaching 93% accuracy. The Ops Supervisor, which has more subjective outputs, ran for 10 days and reached 88% accuracy — close enough to promote with a human review loop for the remaining 12%.
An honest limitation: Shadow mode cannot catch errors that only appear under live operation. Some inputs are rare and will not show up during a two-week shadow period. That is why every agent at OpsForEnergy retains a human-in-the-loop for its first month of live operation, with automatic escalation for low-confidence outputs.
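The escalation rule can be expressed as a simple routing function. The 0.80 confidence floor here is an illustrative value, not the actual threshold:

```python
def route_output(result, confidence_floor=0.80):
    """During the first month live, low-confidence outputs go to a
    human reviewer instead of executing automatically."""
    if result.get("confidence", 0.0) < confidence_floor:
        return ("escalate_to_human", result)
    return ("execute", result)
```

Defaulting a missing confidence score to 0.0 means an output with no score at all always escalates, which is the safe failure direction.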
Want to see this in action? Here's the demo →
Shadow Mode Checklist — a step-by-step PDF for testing agents before production.
Get the checklist →