
Shadow Mode: How to Test AI Agents Before They Touch Real Data
The first time the Permit Agent parsed a real AHJ email, it got the permit number right but misread the deadline as 48 hours instead of 14 days. The mistake was subtle. A human would have caught it by reading the full paragraph. The agent, thrown off by a slightly ambiguous sentence structure, extracted the wrong number.
This is why shadow mode exists. Before any agent sends a real notification or writes to a live database, it runs in parallel with human processes for weeks — processing the same inputs, producing its own outputs, and logging them for comparison. No live actions. No customer impact. Just observation and measurement.
Shadow mode has three phases.
Phase 1: Observation. The agent reads real inputs but writes outputs to a shadow log instead of the production system. For the Permit Agent, this meant checking the live inbox, parsing emails, and recording what it would have done — which project it would have matched, which status it would have updated, and what notification it would have sent.
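A minimal sketch of the observation phase, in Python. The `parse_fn` callback, the record fields, and the JSONL shadow log are illustrative assumptions, not the production schema; the key point is that the agent's proposed action is recorded, never executed:

```python
import json
import time

def handle_email(email, parse_fn, shadow_log_path):
    """Run the agent on a real input, but append its proposed action
    to a shadow log instead of writing to the production system."""
    result = parse_fn(email)  # agent's proposed action; never executed
    record = {
        "ts": time.time(),
        "input_id": email["id"],
        "proposed_project": result.get("project"),
        "proposed_status": result.get("status"),
        "proposed_notification": result.get("notification"),
        "confidence": result.get("confidence"),
    }
    with open(shadow_log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

Because the log is append-only JSON Lines, the comparison phase can replay it line by line against what the human actually did.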
Phase 2: Comparison. A human reviewer compares the agent's shadow outputs to the actual actions taken by the PM. Every discrepancy is logged: false positives, false negatives, incorrect classifications, and missed edge cases. This comparison produces an accuracy score and a ranked list of failure modes.
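The comparison step can be sketched like this. The record fields and failure-mode labels are assumptions for illustration; the shape is what matters: one pass over the shadow log against the human's actual actions, producing an overall accuracy and failure modes ranked by frequency:

```python
from collections import Counter

def compare(shadow_records, human_records):
    """Compare shadow outputs to the human's actual actions.
    Returns (accuracy, failure modes ranked by frequency)."""
    failures = Counter()
    correct = 0
    for rec in shadow_records:
        truth = human_records.get(rec["input_id"])
        if truth is None:
            failures["false positive"] += 1          # agent acted, human did not
        elif rec["proposed_project"] != truth["project"]:
            failures["wrong project match"] += 1
        elif rec["proposed_status"] != truth["status"]:
            failures["wrong status"] += 1
        else:
            correct += 1
    # Inputs the human handled that the agent never produced output for.
    seen = {r["input_id"] for r in shadow_records}
    missed = sum(1 for k in human_records if k not in seen)
    if missed:
        failures["false negative"] += missed
    total = len(shadow_records) + missed
    accuracy = correct / total if total else 0.0
    return accuracy, failures.most_common()
```

The ranked `most_common()` list is what drives the "fix the top failure mode first" rule later in the checklist.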
Phase 3: Promotion. The agent is promoted to live operation only when accuracy meets a predefined threshold and the remaining errors are acceptable. For the OpsForEnergy agents, the threshold is 90% accuracy with zero high-severity errors — meaning no missed deadlines, no incorrect permit statuses, and no misfiled documents.
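A promotion gate along these lines might look like the sketch below. The function name is invented and the severity labels simply mirror the errors named above; the logic is the stated rule, accuracy at or above threshold and zero high-severity errors:

```python
# High-severity failure modes that block promotion regardless of accuracy.
BLOCKING = frozenset({"missed deadline", "wrong permit status", "misfiled document"})

def ready_to_promote(accuracy, error_labels, threshold=0.90, blocking=BLOCKING):
    """Promote only if accuracy meets the threshold AND none of the
    remaining failure modes are high-severity."""
    has_blocking = any(label in blocking for label in error_labels)
    return accuracy >= threshold and not has_blocking
```

Note that a single high-severity error vetoes promotion even at 99% accuracy: the two conditions are independent.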
The shadow mode checklist I use for every agent:
- Define the success criteria before shadow mode begins
- Log every input, output, and confidence score
- Review a random sample of outputs daily, not just the failures
- Track accuracy by category, not just overall
- Fix the top failure mode before moving to the next
- Require two consecutive weeks above threshold before promotion
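Two of the checklist items, per-category accuracy and the two-consecutive-weeks rule, lend themselves to small helpers. This is a sketch under assumed field names (`category`, `correct`), not the actual tooling:

```python
from collections import defaultdict

def accuracy_by_category(records):
    """Track accuracy per input category, not just overall."""
    stats = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for r in records:
        stats[r["category"]][1] += 1
        if r["correct"]:
            stats[r["category"]][0] += 1
    return {c: correct / total for c, (correct, total) in stats.items()}

def weeks_above_threshold(weekly_accuracy, threshold=0.90):
    """Require the two most recent weekly accuracies to both meet the
    threshold before promotion."""
    return len(weekly_accuracy) >= 2 and all(
        w >= threshold for w in weekly_accuracy[-2:]
    )
```

Per-category tracking matters because a 92% overall score can hide a 60% score on one rare but critical category.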
The metric: The Permit Agent ran in shadow mode for 14 days and processed 47 emails. It achieved 91% accuracy by the end of week two. The Field Agent ran for 21 days on SMS and email inputs, reaching 93% accuracy. The Ops Supervisor, which has more subjective outputs, ran for 10 days and reached 88% accuracy — close enough to promote with a human review loop for the remaining 12%.
An honest limitation: Shadow mode cannot catch errors that only appear under live operation. Some inputs are rare and will not show up during a two-week shadow period. That is why every agent at OpsForEnergy retains a human-in-the-loop for its first month of live operation, with automatic escalation for low-confidence outputs.
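The escalation rule can be expressed as a simple routing function. The 0.80 confidence floor here is an illustrative value, not the actual threshold:

```python
def route_output(result, confidence_floor=0.80):
    """During the first month live, low-confidence outputs go to a
    human reviewer instead of executing automatically."""
    if result.get("confidence", 0.0) < confidence_floor:
        return ("escalate_to_human", result)
    return ("execute", result)
```

Defaulting a missing confidence score to 0.0 means an output with no score at all always escalates, which is the safe failure direction.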
Want to see this in action? Here's the demo →
Shadow Mode Checklist — a step-by-step PDF for testing agents before production.
Get the checklist →