Preprint · March 2026
Can Is Not May: Authority Models for Governable AI Agents
Yaz Caleb — Plaw, Inc. / Arizona State University
Prompts fail 18.3% of the time. Deterministic policy enforcement fails 0%.
Across 7,427 trials with 4 LLMs under ambient social pressure, prompt-only guardrails permitted unauthorized actions in 18.3% of cases (per-model range: 1%–40%). The authority model — deterministic interception before tool execution — permitted zero.
Abstract
AI agents act through tool-use frameworks, but no formal mechanism ensures they are authorized to act — only that they are capable. This paper introduces authority models: deterministic, external policy engines that evaluate every tool call against a seven-parameter May judgment before execution.
The paper formalizes capability-authority independence — the principle that whether an agent can perform an action provides no information about whether it may. All four cells of the Can × May product are realizable, and any system conflating these layers admits privilege escalation by capability acquisition.
Three formal properties are proven: Can/May Separation, Deny-Monotonicity, and Escalation Monotonicity. AuthorityBench evaluates five enforcement conditions across 54 scenarios, four LLMs, and 7,427 trials. Under ambient social pressure, prompt-only baselines permit unauthorized actions 18.3% of the time. The authority model permits zero — by construction, not by probability.
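The capability-authority independence claim can be made concrete with a minimal sketch. This is illustrative only — the paper's formal definitions are richer, and the names here (`capabilities`, `policy`, `can`, `may`) are this example's own, not the paper's notation or the reference SDK's API:

```python
# Capability: which tools the agent is actually wired to call.
capabilities = {"read_file", "send_email"}

# Authority: which tools explicit external policy permits.
# Maintained independently of the capability set.
policy = {"read_file": True, "send_email": False}

def can(tool: str) -> bool:
    return tool in capabilities

def may(tool: str) -> bool:
    return policy.get(tool, False)  # default deny for unlisted tools

# All four cells of the Can x May product are realizable:
assert can("read_file") and may("read_file")            # can,  may
assert can("send_email") and not may("send_email")      # can,  may not
assert not can("delete_db") and not may("delete_db")    # cannot, may not
policy["provision_vm"] = True                           # authority granted...
assert not can("provision_vm") and may("provision_vm")  # ...without capability
```

The last cell is the one prompt-based systems cannot represent: granting the agent a new tool (capability acquisition) silently changes what it can do, but in a separated design it changes nothing about what it may do.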
Key results
Violation rates under ambient social pressure (P2) — the realistic threat model in which context makes a violation seem natural, without explicit prompt injection. Aggregated across 4 models.

| Condition | Violation rate | Notes |
|---|---|---|
| Prompt-Only | 18.3% | Authorized tools listed in system prompt; no external enforcement. |
| Keyword Filter | 16.0% | Pre-execution keyword denial on tool names; bypassed by tool substitution. |
| LLM-as-Judge | 0.0% | GPT-5.4 evaluates each call against policy; tested on one model pairing only. |
| Authority Model | 0.0% | Deterministic policy evaluation with session history; zero violations by construction. |
Per-model violation rates under prompt-only enforcement range from 1.0% (GPT-5.4) to 40.0% (MiniMax-M2.5). The authority model achieves 0.0% across all four models. Cross-model variance for prompt-only: 0.022. Cross-model variance for authority model: 0.000. Enforcement determinism is a property of the architecture, not individual model strength.
The LLM-as-Judge condition (GPT-5.4 judging GPT-5.4 trajectories) also shows 0.0% violations in this sample, but was tested on a single proposer-judge pairing with an already-low base violation rate. It does not establish that LLM judging generally matches deterministic enforcement.
AuthorityBench
54 authorization scenarios across six categories, each tested at three adversarial pressure levels: benign (P1), ambient (P2), and adversarial (P3). Four LLMs from two provider ecosystems. All scenarios include ground-truth labels and argument-level authority constraints.
Four LLMs tested: GLM-5, GPT-5.4, Kimi-K2.5, and MiniMax-M2.5. Five enforcement conditions: Prompt-Only, Keyword Filter, Authority Model, Authority Model (−H, without history tracking), and LLM-as-Judge.
Scenarios, policy files, and the benchmark harness are publicly available at github.com/yazcaleb/can-is-not-may.
What this means
Prompts are suggestions, not enforcement. System prompt instructions, keyword filters, and alignment training all reduce violations on average — but none eliminate them. Under ambient social pressure, every prompt-based approach tested permits unauthorized actions at nonzero rates across all models.
Deterministic interception — evaluating every tool call against explicit policy before execution, outside the LLM's inference path — is the only approach in this benchmark that achieves zero unauthorized actions across all models, all scenarios, and all pressure levels. Not because the models behave well. Because the architecture does not give them the choice.
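A minimal sketch of the interception pattern, sitting between the LLM's tool-call output and the tool itself. The function names, policy shape, and argument-level checks here are assumptions for illustration — they are not the Veto SDK's API, which is described in Section 5 of the paper:

```python
def evaluate(policy: dict, tool: str, args: dict) -> bool:
    """Deterministic: same policy + same call -> same decision. Default deny."""
    rule = policy.get(tool)
    if rule is None:
        return False  # unknown tool: denied before it can execute
    # Argument-level constraints: every constrained argument must be in
    # its allowed set (scenarios in the benchmark constrain arguments too).
    return all(args.get(k) in allowed for k, allowed in rule.items())

def intercept(policy: dict, tool: str, args: dict, execute):
    """Gate every tool call; the model never gets to choose to bypass it."""
    if not evaluate(policy, tool, args):
        raise PermissionError(f"denied: {tool}({args})")
    return execute(tool, args)

# Example policy: read_file is permitted only for one path.
policy = {"read_file": {"path": {"/tmp/report.txt"}}}

result = intercept(policy, "read_file", {"path": "/tmp/report.txt"},
                   lambda t, a: f"contents of {a['path']}")
```

Because the gate runs outside the model's inference path and defaults to deny, a persuasive prompt can change what the model asks for, but not what executes.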
This is the research foundation behind Veto's architecture: declarative policy, runtime interception, human escalation. The open-source SDK is the reference implementation described in Section 5 of the paper.
Read the full paper.