Preprint · March 2026
Can Is Not May: Authority Models for Governable AI Agents
Yaz Caleb — Plaw, Inc. / Arizona State University
Prompts fail 18.3% of the time. Deterministic policy enforcement fails 0%.
Across 7,427 trials with 4 LLMs under ambient social pressure, prompt-only guardrails permitted unauthorized actions in 18.3% of cases (per-model range: 1%–40%). The authority model — deterministic interception before tool execution — permitted zero.
Abstract
AI agents act through tool-use frameworks, but no formal mechanism ensures they are authorized to act — only that they are capable. This paper introduces authority models: deterministic, external policy engines that evaluate every tool call against a seven-parameter May judgment before execution.
The paper formalizes capability-authority independence — the principle that whether an agent can perform an action provides no information about whether it may. All four cells of the Can × May product are realizable, and any system conflating these layers admits privilege escalation by capability acquisition.
Three formal properties are proven: Can/May Separation, Deny-Monotonicity, and Escalation Monotonicity. AuthorityBench evaluates five enforcement conditions across 54 scenarios, four LLMs, and 7,427 trials. Under ambient social pressure, prompt-only baselines permit unauthorized actions 18.3% of the time. The authority model permits zero — by construction, not by probability.
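The capability-authority independence claim can be made concrete with a minimal sketch. This is illustrative only — the paper's formal definitions are richer, and the names here (`capabilities`, `policy`, `can`, `may`) are this example's own, not the paper's notation or the reference SDK's API:

```python
# Capability: which tools the agent is actually wired to call.
capabilities = {"read_file", "send_email"}

# Authority: which tools explicit external policy permits.
# Maintained independently of the capability set.
policy = {"read_file": True, "send_email": False}

def can(tool: str) -> bool:
    return tool in capabilities

def may(tool: str) -> bool:
    return policy.get(tool, False)  # default deny for unlisted tools

# All four cells of the Can x May product are realizable:
assert can("read_file") and may("read_file")            # can,  may
assert can("send_email") and not may("send_email")      # can,  may not
assert not can("delete_db") and not may("delete_db")    # cannot, may not
policy["provision_vm"] = True                           # authority granted...
assert not can("provision_vm") and may("provision_vm")  # ...without capability
```

The last cell is the one prompt-based systems cannot represent: granting the agent a new tool (capability acquisition) silently changes what it can do, but in a separated design it changes nothing about what it may do.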
Key results
Violation rates under ambient social pressure (P2) — the realistic threat model in which context makes a violation seem natural, without explicit prompt injection. Aggregated across 4 models.

| Condition | Violation rate | Notes |
|---|---|---|
| Prompt-Only | 18.3% | Authorized tools listed in system prompt; no external enforcement. |
| Keyword Filter | 16.0% | Pre-execution keyword denial on tool names; bypassed by tool substitution. |
| LLM-as-Judge | 0.0% | GPT-5.4 evaluates each call against policy; tested on one model pairing only. |
| Authority Model | 0.0% | Deterministic policy evaluation with session history; zero violations by construction. |
Per-model violation rates under prompt-only enforcement range from 1.0% (GPT-5.4) to 40.0% (MiniMax-M2.5). The authority model achieves 0.0% across all four models. Cross-model variance for prompt-only: 0.022. Cross-model variance for authority model: 0.000. Enforcement determinism is a property of the architecture, not individual model strength.
The LLM-as-Judge condition (GPT-5.4 judging GPT-5.4 trajectories) also shows 0.0% violations in this sample, but was tested on a single proposer-judge pairing with an already-low base violation rate. It does not establish that LLM judging generally matches deterministic enforcement.
AuthorityBench
54 authorization scenarios across six categories, each tested at three adversarial pressure levels: benign (P1), ambient (P2), and adversarial (P3). Four LLMs from two provider ecosystems. All scenarios include ground-truth labels and argument-level authority constraints.
Four LLMs tested: GLM-5, GPT-5.4, Kimi-K2.5, and MiniMax-M2.5. Five enforcement conditions: Prompt-Only, Keyword Filter, Authority Model, Authority Model (−H, without history tracking), and LLM-as-Judge.
Scenarios, policy files, and the benchmark harness are publicly available at github.com/yazcaleb/can-is-not-may.
What this means
Prompts are suggestions, not enforcement. System prompt instructions, keyword filters, and alignment training all reduce violations on average — but none eliminate them. Under ambient social pressure, every prompt-based approach tested permits unauthorized actions at nonzero rates across all models.
Deterministic interception — evaluating every tool call against explicit policy before execution, outside the LLM's inference path — is the only approach in this benchmark that achieves zero unauthorized actions across all models, all scenarios, and all pressure levels. Not because the models behave well. Because the architecture does not give them the choice.
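A minimal sketch of the interception pattern, sitting between the LLM's tool-call output and the tool itself. The function names, policy shape, and argument-level checks here are assumptions for illustration — they are not the Veto SDK's API, which is described in Section 5 of the paper:

```python
def evaluate(policy: dict, tool: str, args: dict) -> bool:
    """Deterministic: same policy + same call -> same decision. Default deny."""
    rule = policy.get(tool)
    if rule is None:
        return False  # unknown tool: denied before it can execute
    # Argument-level constraints: every constrained argument must be in
    # its allowed set (scenarios in the benchmark constrain arguments too).
    return all(args.get(k) in allowed for k, allowed in rule.items())

def intercept(policy: dict, tool: str, args: dict, execute):
    """Gate every tool call; the model never gets to choose to bypass it."""
    if not evaluate(policy, tool, args):
        raise PermissionError(f"denied: {tool}({args})")
    return execute(tool, args)

# Example policy: read_file is permitted only for one path.
policy = {"read_file": {"path": {"/tmp/report.txt"}}}

result = intercept(policy, "read_file", {"path": "/tmp/report.txt"},
                   lambda t, a: f"contents of {a['path']}")
```

Because the gate runs outside the model's inference path and defaults to deny, a persuasive prompt can change what the model asks for, but not what executes.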
This is the research foundation behind Veto's architecture: declarative policy, runtime interception, human escalation. The open-source SDK is the reference implementation described in Section 5 of the paper.
Read the full paper.