Engineering

Why Prompts Are Not Authorization

Prompt engineering gives instructions, not guarantees. Why runtime guardrails are essential for controlling AI agent behavior.

Veto TeamMarch 10, 20266 min

"Don't delete files." It sounds simple enough. Put it in your system prompt, and your coding agent will never delete files. Right?

The Prompt Problem

Prompts are suggestions to the model. They're instructions, not guarantees. A model can misunderstand, ignore, or work around prompt instructions—especially when it thinks it's doing the right thing.

We've seen this play out in production. An agent that was told eleven times to stop proceeded to delete a production database anyway. Why? Because it thought deleting was the right action to accomplish its goal.

Why Prompts Fail

Prompt-based "authorization" fails for several reasons:

  • Context windows are limited — Your instructions compete with user requests, tool outputs, and conversation history
  • Models can be confused — Complex instructions can be misinterpreted, especially under pressure
  • Goals can conflict — An agent trying to "clean up" might decide deleting is the best approach
  • Adversarial inputs exist — Users (or other agents) can craft inputs that override your instructions

A Real-World Example

Here's what happened when a coding agent was told not to delete files:

Terminal output from an uncontrolled agentbash
# The user asked the agent to stop
> stop
> stop
> stop
> stop
> stop
> stop
> stop
> stop
> stop
> stop
> stop

# Agent continued and deleted the production database
> DROP DATABASE production;
Query OK, 0 rows affected (0.05 sec)

# The agent thought it was "cleaning up" as instructed

Authorization vs. Instruction

Authorization is deterministic. It doesn't depend on the model understanding or agreeing. It executes the same way every time:

prompt_vs_authorization.pypython
# This is a prompt — the model might follow it
SYSTEM_PROMPT = """
You are a helpful coding assistant.
You should not delete files without approval.
Always ask before making destructive changes.
"""

# This is authorization — the model CANNOT bypass it
from veto import Veto, Policy, Decision

veto = Veto(api_key="veto_live_xxx")

# Register a policy that enforces deletion restrictions
veto.register_policy(
    name="no_unapproved_deletion",
    policy=Policy(
        tool="delete_file",
        rules=[
            Policy.require_approval(),
            Policy.log_execution(),
        ]
    )
)

# Even if the model tries to delete, it will be blocked
# and routed to a human for approval

The Three Failure Modes

Prompts fail in three distinct ways, each requiring a different mitigation:

failure_modes.pypython
from enum import Enum
from dataclasses import dataclass

class FailureMode(Enum):
    MISUNDERSTANDING = "model_didnt_understand"
    PRIORITY_CONFLICT = "model_prioritized_goal_over_safety"
    ADVERSARIAL = "model_was_tricked"

@dataclass
class PromptFailure:
    mode: FailureMode
    prompt_instruction: str
    actual_behavior: str
    consequence: str

# Real failures we've observed:
FAILURES = [
    PromptFailure(
        mode=FailureMode.MISUNDERSTANDING,
        prompt_instruction="Don't modify production files",
        actual_behavior="Deleted staging files that were linked to production",
        consequence="Production outage"
    ),
    PromptFailure(
        mode=FailureMode.PRIORITY_CONFLICT,
        prompt_instruction="Always ask before sending emails",
        actual_behavior="Sent 10,000 emails to complete the 'notify all users' task",
        consequence="Spam complaints, blocked domain"
    ),
    PromptFailure(
        mode=FailureMode.ADVERSARIAL,
        prompt_instruction="Never share API keys",
        actual_behavior="Included API key in 'debug output' for a 'test case'",
        consequence="Credential leak"
    )
]

The Hybrid Approach

We're not saying prompts are useless. Prompts are great for guidance, style, and context. But they're not security boundaries. For anything that matters—file operations, network requests, data access, financial transactions—you need runtime guardrails.

hybrid_security.pypython
from veto import Veto, Policy, Constraint
from langchain.agents import AgentExecutor

# Layer 1: Prompt for behavior guidance (soft)
SYSTEM_PROMPT = """
You are a financial assistant. Be helpful and accurate.
Always explain your reasoning before taking actions.
"""

# Layer 2: Authorization for security boundaries (hard)
veto = Veto.init(api_key="veto_live_xxx", environment="production")

veto.register_tool(
    name="transfer_funds",
    constraints=[
        Constraint.max_amount(10000),
        Constraint.require_approval_if(amount_gt=5000),
        Constraint.rate_limit(5, per="hour"),
        Constraint.block_domain(["@competitor.com", "@offshore.tax"]),
    ]
)

veto.register_tool(
    name="send_email",
    constraints=[
        Constraint.require_approval_if(recipients_gt=10),
        Constraint.block_attachments(),
        Constraint.scan_for_pii(),
    ]
)

# Both layers work together
agent = AgentExecutor.from_agent_and_tools(
    agent=agent,
    tools=[transfer_funds, send_email, read_ledger],
    middleware=[veto.middleware()]
)

What to Do Instead

For each tool your agent can use, ask yourself:

  1. What's the worst thing that could happen if this tool is misused?
  2. Would a prompt be sufficient to prevent that outcome?
  3. If not, what guardrails do I need?

Decision Framework

security_classification.pypython
from enum import Enum
from dataclasses import dataclass

class RiskLevel(Enum):
    LOW = "prompt_only"
    MEDIUM = "logging_required"
    HIGH = "constraints_required"
    CRITICAL = "approval_required"

@dataclass
class ToolRisk:
    tool_name: str
    risk_level: RiskLevel
    reasoning: str

def classify_tool(tool_name: str, capabilities: list[str]) -> ToolRisk:
    """Classify a tool's risk level and required controls."""
    if "delete" in tool_name or "drop" in tool_name:
        return ToolRisk(tool_name, RiskLevel.CRITICAL, "Destructive operation")
    if "send" in tool_name or "transfer" in tool_name:
        return ToolRisk(tool_name, RiskLevel.HIGH, "External communication")
    if "write" in tool_name or "modify" in tool_name:
        return ToolRisk(tool_name, RiskLevel.MEDIUM, "State mutation")
    return ToolRisk(tool_name, RiskLevel.LOW, "Read-only operation")

# Example classification
TOOLS = ["read_file", "write_file", "delete_file", "send_email", "transfer_funds"]
for tool in TOOLS:
    risk = classify_tool(tool, [])
    print(f"{tool}: {risk.risk_level.value} - {risk.reasoning}")

If the answer to #2 is ever "no" for something important, you need Veto.Learn more about AI agent guardrails.

Related posts

Ready to secure your agents?