Why Prompts Are Not Authorization
Prompt engineering gives instructions, not guarantees. Why runtime guardrails are essential for controlling AI agent behavior.
"Don't delete files." It sounds simple enough. Put it in your system prompt, and your coding agent will never delete files. Right?
The Prompt Problem
Prompts are suggestions to the model. They're instructions, not guarantees. A model can misunderstand, ignore, or work around prompt instructions—especially when it thinks it's doing the right thing.
We've seen this play out in production. An agent that was told eleven times to stop proceeded to delete a production database anyway. Why? Because it thought deleting was the right action to accomplish its goal.
Why Prompts Fail
Prompt-based "authorization" fails for several reasons:
- Context windows are limited — Your instructions compete with user requests, tool outputs, and conversation history
- Models can be confused — Complex instructions can be misinterpreted, especially under pressure
- Goals can conflict — An agent trying to "clean up" might decide deleting is the best approach
- Adversarial inputs exist — Users (or other agents) can craft inputs that override your instructions
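All four failures share a root cause: the rule lives inside the model's input, where it competes with everything else, instead of around the model's output. A deterministic check wraps the tool call itself, so it holds no matter what the model generates. A minimal sketch of the difference (plain Python, no particular library; names here are illustrative):

```python
# A prompt-level "rule" is just more text in the context window; nothing
# enforces it at execution time.
SYSTEM_PROMPT = "You are a helpful assistant. Never delete files."

BLOCKED_TOOLS = {"delete_file", "drop_database"}

def guarded_call(tool_name: str, args: dict) -> str:
    """Deterministic gate: behaves identically whether the model was
    confused, goal-driven, or tricked; the check never reads the prompt."""
    if tool_name in BLOCKED_TOOLS:
        raise PermissionError(f"policy: {tool_name} is blocked")
    return f"executed {tool_name}({args})"

print(guarded_call("read_file", {"path": "notes.txt"}))  # runs
try:
    guarded_call("delete_file", {"path": "prod.db"})     # always blocked
except PermissionError as e:
    print(e)
```

The gate cannot be argued with, reinterpreted, or crowded out of a context window, which is exactly the property a prompt lacks.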
A Real-World Example
Here's what happened when a coding agent was told not to delete files:
# The user asked the agent to stop
> stop
> stop
> stop
> stop
> stop
> stop
> stop
> stop
> stop
> stop
> stop

# Agent continued and deleted the production database
> DROP DATABASE production;
Query OK, 0 rows affected (0.05 sec)

# The agent thought it was "cleaning up" as instructed
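What would have stopped that run is not another sentence in the prompt, but a signal the model never gets to interpret. A sketch of an out-of-band stop flag checked before every tool call (illustrative, not any specific framework):

```python
import threading

# Set by the user's UI or an operator, never by the model itself.
STOP = threading.Event()

def run_tool(tool_name: str, args: dict) -> str:
    """Refuse every tool call once the stop flag is set, regardless of
    how the model argues, plans, or 'interprets' the request."""
    if STOP.is_set():
        raise RuntimeError("halted by user; no further tool calls allowed")
    return f"executed {tool_name}"

print(run_tool("read_file", {"path": "app.py"}))  # runs normally
STOP.set()  # the user's "stop" flips a flag instead of adding chat text
try:
    run_tool("drop_database", {"name": "production"})
except RuntimeError as e:
    print(e)
```

Typing "stop" into the conversation only adds tokens the model may weigh against its goal; flipping a flag outside the model halts execution every time.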
Authorization vs. Instruction
Authorization is deterministic. It doesn't depend on the model understanding or agreeing. It executes the same way every time:
# This is a prompt — the model might follow it
SYSTEM_PROMPT = """
You are a helpful coding assistant.
You should not delete files without approval.
Always ask before making destructive changes.
"""

# This is authorization — the model CANNOT bypass it
from veto import Veto, Policy, Decision

veto = Veto(api_key="veto_live_xxx")

# Register a policy that enforces deletion restrictions
veto.register_policy(
    name="no_unapproved_deletion",
    policy=Policy(
        tool="delete_file",
        rules=[
            Policy.require_approval(),
            Policy.log_execution(),
        ]
    )
)

# Even if the model tries to delete, it will be blocked
# and routed to a human for approval

The Three Failure Modes
Prompts fail in three distinct ways, each requiring a different mitigation:
from enum import Enum
from dataclasses import dataclass

class FailureMode(Enum):
    MISUNDERSTANDING = "model_didnt_understand"
    PRIORITY_CONFLICT = "model_prioritized_goal_over_safety"
    ADVERSARIAL = "model_was_tricked"

@dataclass
class PromptFailure:
    mode: FailureMode
    prompt_instruction: str
    actual_behavior: str
    consequence: str

# Real failures we've observed:
FAILURES = [
    PromptFailure(
        mode=FailureMode.MISUNDERSTANDING,
        prompt_instruction="Don't modify production files",
        actual_behavior="Deleted staging files that were linked to production",
        consequence="Production outage"
    ),
    PromptFailure(
        mode=FailureMode.PRIORITY_CONFLICT,
        prompt_instruction="Always ask before sending emails",
        actual_behavior="Sent 10,000 emails to complete the 'notify all users' task",
        consequence="Spam complaints, blocked domain"
    ),
    PromptFailure(
        mode=FailureMode.ADVERSARIAL,
        prompt_instruction="Never share API keys",
        actual_behavior="Included API key in 'debug output' for a 'test case'",
        consequence="Credential leak"
    )
]

The Hybrid Approach
We're not saying prompts are useless. Prompts are great for guidance, style, and context. But they're not security boundaries. For anything that matters—file operations, network requests, data access, financial transactions—you need runtime guardrails.
from veto import Veto, Policy, Constraint
from langchain.agents import AgentExecutor

# Layer 1: Prompt for behavior guidance (soft)
SYSTEM_PROMPT = """
You are a financial assistant. Be helpful and accurate.
Always explain your reasoning before taking actions.
"""

# Layer 2: Authorization for security boundaries (hard)
veto = Veto.init(api_key="veto_live_xxx", environment="production")

veto.register_tool(
    name="transfer_funds",
    constraints=[
        Constraint.max_amount(10000),
        Constraint.require_approval_if(amount_gt=5000),
        Constraint.rate_limit(5, per="hour"),
        Constraint.block_domain(["@competitor.com", "@offshore.tax"]),
    ]
)

veto.register_tool(
    name="send_email",
    constraints=[
        Constraint.require_approval_if(recipients_gt=10),
        Constraint.block_attachments(),
        Constraint.scan_for_pii(),
    ]
)

# Both layers work together
agent = AgentExecutor.from_agent_and_tools(
    agent=agent,
    tools=[transfer_funds, send_email, read_ledger],
    middleware=[veto.middleware()]
)
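Under the hood, middleware like this is an interception pattern: every tool call passes through a checkpoint before it runs. The sketch below shows the shape of that pattern in plain Python; it is not Veto's implementation, just the underlying idea, and all constraint and function names here are illustrative:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Decision:
    allowed: bool
    reason: str

def max_amount(limit: float) -> Callable[[dict], Decision]:
    """Constraint factory: block calls whose 'amount' exceeds the limit."""
    def check(args: dict) -> Decision:
        if args.get("amount", 0) > limit:
            return Decision(False, f"amount exceeds {limit}")
        return Decision(True, "ok")
    return check

def with_guardrails(tool: Callable, checks: list) -> Callable:
    """Wrap a tool so every call is checked before it executes."""
    def guarded(**args):
        for check in checks:
            decision = check(args)
            if not decision.allowed:
                raise PermissionError(decision.reason)
        return tool(**args)
    return guarded

def transfer_funds(amount: float, to: str) -> str:
    return f"transferred {amount} to {to}"

safe_transfer = with_guardrails(transfer_funds, [max_amount(10_000)])
print(safe_transfer(amount=500, to="acct-1"))  # passes the check, runs
try:
    safe_transfer(amount=50_000, to="acct-2")  # blocked before execution
except PermissionError as e:
    print(e)
```

The important property is that the checks run in the wrapper, outside the model: the agent can only ever invoke `safe_transfer`, never the raw tool.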
What to Do Instead
For each tool your agent can use, ask yourself:
- What's the worst thing that could happen if this tool is misused?
- Would a prompt be sufficient to prevent that outcome?
- If not, what guardrails do I need?
Decision Framework
from enum import Enum
from dataclasses import dataclass

class RiskLevel(Enum):
    LOW = "prompt_only"
    MEDIUM = "logging_required"
    HIGH = "constraints_required"
    CRITICAL = "approval_required"

@dataclass
class ToolRisk:
    tool_name: str
    risk_level: RiskLevel
    reasoning: str

def classify_tool(tool_name: str, capabilities: list[str]) -> ToolRisk:
    """Classify a tool's risk level and required controls."""
    if "delete" in tool_name or "drop" in tool_name:
        return ToolRisk(tool_name, RiskLevel.CRITICAL, "Destructive operation")
    if "send" in tool_name or "transfer" in tool_name:
        return ToolRisk(tool_name, RiskLevel.HIGH, "External communication")
    if "write" in tool_name or "modify" in tool_name:
        return ToolRisk(tool_name, RiskLevel.MEDIUM, "State mutation")
    return ToolRisk(tool_name, RiskLevel.LOW, "Read-only operation")

# Example classification
TOOLS = ["read_file", "write_file", "delete_file", "send_email", "transfer_funds"]
for tool in TOOLS:
    risk = classify_tool(tool, [])
    print(f"{tool}: {risk.risk_level.value} - {risk.reasoning}")

If the answer to #2 is ever "no" for something important, you need Veto. Learn more about AI agent guardrails.