Implementation guide

How to shadow-test AI agent policies

Writing a new policy is not the uncertain part. Flipping it on in production without knocking over a legitimate workload is the hard part. Shadow mode (also called observe-only or dry-run) is the answer: the policy evaluates against live traffic and logs what it would have done, but never blocks anything. After a week of data you know which legitimate calls the rule would have stopped, and you can tune before enforcement starts. This guide covers whole-SDK shadow, per-rule shadow, the analysis queries that decide when to promote, and the PR workflow that makes the flip reviewable.

  • An SDK initialized in shadow mode, so every decision is logged while each governed tool call still runs.
  • Per-rule shadow flags so you can test a single new rule alongside the rest of the bundle.
  • Analysis queries that surface false positives, dead rules, and surprise hits.
  • A PR workflow that promotes a shadowed rule to enforce with a clean rollback path.

Step 1: Run the SDK in shadow mode

The lowest-risk entry point is whole-SDK shadow. Set mode="shadow" when you construct the client. Every veto.decide call still returns allow so nothing blocks, but the would-have outcome and the rule that produced it land in the decision record under shadow_outcome and shadow_rule_matched.

py
import os
from veto_sdk import Veto

veto = Veto(
    api_key=os.environ["VETO_API_KEY"],
    mode="shadow",
)

decision = veto.decide(
    tool="refund_order",
    args={"order_id": "ord_123", "amount_cents": 75000},
    agent={"id": "ag_001", "role": "support"},
)

# In shadow mode, veto.decide always returns allow, so decision.outcome
# is always "allow" and the action runs.
# The "would-have" outcome lives in decision.shadow_outcome.
print(decision.outcome)  # always "allow"
print(decision.shadow_outcome)  # one of "deny", "require_approval", or "allow"

This mode is right for the first deploy of any policy bundle. After the first clean week, flip the SDK back to enforce and use per-rule shadow for incremental changes.
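One way to make shadow hits visible at the call site is a small wrapper that always runs the tool and logs when the policy would have intervened. This is a minimal sketch, not part of the SDK: ShadowDecision is a hypothetical stand-in for the object veto.decide returns, modeling only the three fields named above, and run_with_shadow_log is an illustrative helper name.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger("shadow")


@dataclass
class ShadowDecision:
    # Hypothetical stand-in for the SDK's decision object; only the
    # fields described in this guide are modeled.
    outcome: str
    shadow_outcome: str
    shadow_rule_matched: Optional[str] = None


def run_with_shadow_log(decision: ShadowDecision, action: Callable[[], object]) -> object:
    """Run the action unconditionally; log when the policy would have intervened."""
    if decision.shadow_outcome != "allow":
        logger.warning(
            "shadow: rule %s would have returned %s",
            decision.shadow_rule_matched,
            decision.shadow_outcome,
        )
    # Shadow mode never blocks, so the tool call always runs.
    return action()


# Example: the refund still goes through, and the would-have outcome is logged.
decision = ShadowDecision("allow", "require_approval", "refunds_above_threshold_v2")
result = run_with_shadow_log(decision, lambda: "refund issued")
```

In a real integration the decision would come from veto.decide rather than being constructed by hand; the wrapper only shows where the logging would sit.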

Step 2: Shadow individual rules

Once the policy bundle is in enforce mode, you rarely want to drop the whole SDK back to shadow. The per-rule mode: shadow flag lets a single new rule run observe-only while every other rule continues to enforce. This is the pattern for the steady state.

yaml
# policies/agents.yaml
- name: refunds_above_threshold_v2
  mode: shadow  # only this rule runs in shadow mode
  match:
    tool: refund_order
  rules:
    - if: args.amount_cents > 30000
      then: require_approval

- name: existing_refund_rule
  match:  # this one still enforces
    tool: refund_order
  rules:
    - if: args.amount_cents > 50000
      then: require_approval

The new rule logs under shadow_outcome while the existing one is the actual decision. You can run both side by side and watch how the new rule would change behavior before you commit to it.
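To see which refunds the two rules above would treat differently, the thresholds can be replayed in plain Python. This is only an illustration of the comparison, assuming the strict greater-than semantics written in the YAML; it does not call the SDK, and evaluate is a hypothetical helper.

```python
NEW_THRESHOLD_CENTS = 30_000       # refunds_above_threshold_v2 (shadow)
EXISTING_THRESHOLD_CENTS = 50_000  # existing_refund_rule (enforced)


def evaluate(amount_cents: int) -> dict:
    """Return the enforced outcome and the shadow outcome for one refund."""
    enforced = "require_approval" if amount_cents > EXISTING_THRESHOLD_CENTS else "allow"
    shadow = "require_approval" if amount_cents > NEW_THRESHOLD_CENTS else "allow"
    return {
        "outcome": enforced,
        "shadow_outcome": shadow,
        # True when promoting the new rule would change what happens.
        "changed": enforced != shadow,
    }


for amount in (20_000, 40_000, 75_000):
    print(amount, evaluate(amount))
```

Only refunds strictly above 30000 cents and at or below 50000 cents come back with changed set to True. Those are the calls worth reviewing before promotion, because they are allowed today and would require approval under the new rule.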

Step 3: Analyze the shadow data

Three queries answer the questions that matter. How many decisions would the rule have changed? Which existing tool calls would the rule have caught? And does the sample look like real blocks, or like false positives? The workspace surfaces all three under the shadow report; the CLI commands below produce the same data for offline analysis.

sh
# What would the new rule have done in the last 7 days?
veto-cli decisions list \
  --policy v17-shadow \
  --since 7d \
  --shadow-only \
  > shadow_v17.jsonl

# Group by shadow outcome
jq -r '.shadow_outcome' shadow_v17.jsonl | sort | uniq -c

# Most common rule that would have fired
jq -r 'select(.shadow_outcome != "allow") | .shadow_rule_matched' shadow_v17.jsonl \
  | sort | uniq -c | sort -rn | head -20

# Sample five denies for human review
jq -c 'select(.shadow_outcome == "deny")' shadow_v17.jsonl | shuf | head -5

The sample of five denies is the highest-signal artifact. Walk through each one with the team that owns the workflow. If they all look like real abuse, the rule is ready. If any look legitimate, the rule needs another pass before enforcement.
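If jq is not available, the same three summaries can be sketched in Python. The function below assumes the export is one JSON object per line and reads only the shadow_outcome and shadow_rule_matched fields described earlier; summarize_shadow is a hypothetical helper, not part of any Veto tooling.

```python
import json
import random
from collections import Counter
from typing import Iterable, Optional


def summarize_shadow(lines: Iterable[str], sample_size: int = 5,
                     seed: Optional[int] = None) -> dict:
    """Summarize shadow decisions from JSONL lines."""
    records = [json.loads(line) for line in lines if line.strip()]

    # Group by shadow outcome.
    by_outcome = Counter(r.get("shadow_outcome") for r in records)

    # Most common rules that would have fired (anything other than allow).
    fired = Counter(
        r.get("shadow_rule_matched")
        for r in records
        if r.get("shadow_outcome") != "allow"
    )

    # Random sample of denies for human review.
    denies = [r for r in records if r.get("shadow_outcome") == "deny"]
    rng = random.Random(seed)
    sample = rng.sample(denies, min(sample_size, len(denies)))

    return {
        "by_outcome": by_outcome,
        "top_rules": fired.most_common(20),
        "deny_sample": sample,
    }
```

A typical call would pass an open file handle, for example summarize_shadow(open("shadow_v17.jsonl")). Passing a seed makes the review sample reproducible, so the same five records can be attached to the promotion PR.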

Step 4: Promote to enforce

Once the data is clean, the promotion is a small YAML change: remove the mode: shadow field. Open a PR with the shadow report attached, get it reviewed, merge. The SDK hot-reloads the file and the next decide call enforces the rule. The old policy_version remains in the decision records so you can roll back with a revert.

yaml
# In policies/agents.yaml: flip the rule from shadow to enforce
- name: refunds_above_threshold_v2
  # mode: shadow  <-- removed
  match:
    tool: refund_order
  rules:
    - if: args.amount_cents > 30000
      then: require_approval

# Then commit, open a PR, get it reviewed, merge.
# The SDK hot-reloads the file on change and starts enforcing on the next call.
# The previous policy_version stays in the decision records for rollback.

For the formal definition of shadow mode and the decision view, see shadow mode validation.

Failure modes to catch

Promoting after one day

Weekend traffic is a different shape from weekday traffic. End-of-quarter is a different shape from mid-quarter. Run shadow for at least a full week. A month is better for low-volume tools.
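A quick guard against promoting too early is to check that the shadow data spans enough distinct days and includes every day of the week. The sketch below takes decision timestamps as plain datetime objects, since the timestamp field name in the decision record is not specified here; window_is_representative is a hypothetical helper.

```python
from datetime import datetime, timedelta
from typing import Iterable


def window_is_representative(timestamps: Iterable[datetime], min_days: int = 7) -> bool:
    """True when decisions cover at least min_days distinct dates and all seven weekdays."""
    stamps = list(timestamps)
    distinct_dates = {ts.date() for ts in stamps}
    weekdays_seen = {ts.weekday() for ts in stamps}
    return len(distinct_dates) >= min_days and len(weekdays_seen) == 7


# Example: one decision per day for seven consecutive days.
start = datetime(2024, 1, 1, 9, 0)
full_week = [start + timedelta(days=i) for i in range(7)]
print(window_is_representative(full_week))      # True
print(window_is_representative(full_week[:3]))  # False
```

For a low-volume tool that sees no traffic on some days, this check keeps failing until enough distinct days accumulate, which matches the advice to run longer for such tools.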

Not reviewing the sample by hand

The aggregate counts are necessary but not sufficient. Five hand-reviewed examples usually surface the one workflow nobody remembered when they wrote the rule.

Mixing shadow and enforce in a single PR

Keep the rule-add PR and the rule-promote PR separate. Two commits, two reviews, two rollback points. Mixing them couples a code review of the rule with a production change for enforcement.
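This separation can be checked mechanically in CI. The sketch below compares two already-parsed policy bundles, represented as lists of rule dicts with name and optional mode keys, mirroring the YAML shown earlier. The helper names and the way the bundles get loaded are assumptions; nothing here is part of the Veto SDK.

```python
from typing import Dict, List


def classify_changes(before: List[dict], after: List[dict]) -> Dict[str, List[str]]:
    """Compare two parsed policy bundles (lists of rule dicts keyed by "name")."""
    old = {rule["name"]: rule for rule in before}
    new = {rule["name"]: rule for rule in after}
    added = sorted(name for name in new if name not in old)
    promoted = sorted(
        name
        for name in new
        if name in old
        and old[name].get("mode") == "shadow"
        and new[name].get("mode") != "shadow"
    )
    # New rules should land in shadow first.
    added_without_shadow = [n for n in added if new[n].get("mode") != "shadow"]
    return {
        "added": added,
        "promoted": promoted,
        "added_without_shadow": added_without_shadow,
    }


def pr_problems(before: List[dict], after: List[dict]) -> List[str]:
    """Return human-readable problems; an empty list means the PR is single-purpose."""
    changes = classify_changes(before, after)
    problems = []
    if changes["added"] and changes["promoted"]:
        problems.append("PR both adds and promotes rules; split it in two")
    for name in changes["added_without_shadow"]:
        problems.append(f"new rule {name} must start with mode: shadow")
    return problems
```

A CI job could load the bundle from the base branch and from the PR branch, call pr_problems, and fail the build when the list is non-empty.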

Production checklist

  • Every new rule lands with mode: shadow in its first PR.
  • Shadow window is at least seven days or one full business cycle.
  • Aggregate report shows a low false-positive rate on shadow denies.
  • Hand-reviewed sample of five denies confirms the rule catches real abuse.
  • Promotion is a separate PR that strips the mode flag and links to the shadow report.

FAQ

How long should I shadow-test before enforcing?

At least seven days of production traffic, or one full business cycle if your traffic follows a longer pattern such as month-end or quarter-end. The goal is to see representative legitimate workloads before enforcement. Shorter windows miss the long-tail edge cases that the policy will then block on day one of enforcement.

What metric tells me the policy is ready?

Two numbers. First, the false-positive rate on shadow_outcome=deny against legitimate-looking traffic. Keep it low enough that reviewers trust the gate. Second, the rule coverage: important rules should have fired at least a few times. A rule that does not fire during the test window is either dead code or the window is too short.
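Both numbers can be computed from the shadow export plus the hand-review labels. In this sketch the reviewed input is a list of (record, is_legitimate) pairs produced by human review, and rule_names is the list of rules you expect to fire; both inputs and both helper names are assumptions for illustration. No threshold is hard-coded, because what counts as low enough is a team decision.

```python
from collections import Counter
from typing import List, Optional, Tuple


def false_positive_rate(reviewed: List[Tuple[dict, bool]]) -> Optional[float]:
    """Share of reviewed shadow denies that reviewers judged legitimate."""
    denies = [
        legit for record, legit in reviewed
        if record.get("shadow_outcome") == "deny"
    ]
    if not denies:
        return None  # nothing reviewed yet, so the rate is undefined
    return sum(1 for legit in denies if legit) / len(denies)


def rules_that_never_fired(records: List[dict], rule_names: List[str]) -> List[str]:
    """Rules with zero non-allow shadow hits in the window."""
    hits = Counter(
        r.get("shadow_rule_matched")
        for r in records
        if r.get("shadow_outcome") != "allow"
    )
    return [name for name in rule_names if hits[name] == 0]
```

A rule returned by rules_that_never_fired is a prompt to either extend the window or question whether the rule is needed at all.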

Can I shadow-test in production but enforce in staging at the same time?

Yes. Set mode by environment. A common setup is to enforce in staging from day one so developers get clear policy feedback, then run shadow in production until the metrics are clean. The two modes share the same YAML so the rules stay in lockstep.
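A minimal sketch of choosing the mode per environment follows. It assumes "enforce" is the accepted mode string for enforcement, which this guide implies but does not show, and APP_ENV is a hypothetical environment variable name. Unknown environments raise an error rather than silently defaulting to either mode.

```python
import os

# Mapping from deployment environment to SDK mode.
# "enforce" as a mode value is an assumption; "shadow" appears in this guide.
MODE_BY_ENV = {
    "staging": "enforce",
    "production": "shadow",
}


def mode_for(environment: str) -> str:
    """Return the SDK mode for an environment, failing loudly on unknown names."""
    try:
        return MODE_BY_ENV[environment]
    except KeyError:
        raise ValueError(f"no mode configured for environment {environment!r}")


# APP_ENV is a hypothetical variable name; default to production here
# so a missing variable falls back to the non-blocking mode.
mode = mode_for(os.environ.get("APP_ENV", "production"))
```

The resulting value would then be passed as the mode argument when constructing the client, as in Step 1.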

Related guides

Enforce policy without breaking the workflow.