AI Agent Security: The Comprehensive Guide

AI agents don't just generate text. They execute code, move money, delete data, and send emails. Securing them requires a fundamentally different approach than securing chatbots. This is the guide a CISO needs.

Last updated: April 2026

What is AI agent security?

AI agent security is the discipline of protecting autonomous AI systems that interact with external tools, APIs, and data stores from being exploited, manipulated, or misused. It encompasses the full lifecycle: from threat modeling and policy definition to runtime enforcement and post-incident analysis. Unlike traditional application security, agent security must account for non-deterministic decision-making by the AI model itself.

1. The AI agent threat landscape

The shift from chatbots to agents changed the security equation. A chatbot produces text. An agent produces actions. When you give an LLM the ability to call tools, you've turned a language model into an actor in your system with the same privileges as a human operator—but without the judgment, context, or accountability.

According to Gartner, by 2028, 33% of enterprise software applications will include agentic AI, up from less than 1% in 2024. The attack surface is growing exponentially, but security practices have not kept pace. Most organizations deploying agents today have no runtime controls on what those agents can do.

The fundamental problem: agents are authenticated but not authorized. They have API keys that grant identity, but no policy layer that governs behavior. Authentication answers "who is this?" Authorization answers "what can it do?" Most agent deployments have the first but not the second.
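
The gap can be made concrete in a few lines. This is a hypothetical illustration, not any real deployment's code: the key lookup stands in for authentication, and the trivially permissive `authorize` stands in for the policy layer most deployments never build.

```python
# Hypothetical sketch of the authn/authz gap. A valid API key establishes
# identity, but nothing constrains which actions the agent may take.
VALID_KEYS = {"agent-key-123": "invoice-agent"}

def authenticate(api_key):
    """Answers 'who is this?' -- identity only."""
    return VALID_KEYS.get(api_key)

def authorize(agent, action):
    """Answers 'what can it do?' -- absent in most deployments,
    so the de facto policy is allow-everything."""
    return True  # no policy layer: any authenticated agent may do anything

agent = authenticate("agent-key-123")
print(agent, authorize(agent, "delete_database"))  # identity known, behavior ungoverned
```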

Traditional software security

  • Deterministic behavior
  • Code review catches most bugs
  • Input validation at known boundaries
  • Static RBAC works

AI agent security

  • Non-deterministic decisions
  • Same input can produce different tool calls
  • Attack surface includes natural language
  • Requires runtime, per-action authorization

2. Attack taxonomy: how agents get exploited

OWASP's 2025 Agentic AI Security Initiative identified the top threats to agentic systems. We've organized them into four categories that map directly to the tool-call boundary where authorization operates.

Prompt injection

The most widely discussed attack vector. An adversary embeds instructions in data the agent processes—a web page, an email, a database record—causing the agent to execute unintended actions. Direct injection manipulates the agent's own prompt. Indirect injection poisons the data the agent retrieves.

In August 2024, researcher Johann Rehberger demonstrated how a prompt injection in a shared Google Doc could cause Google's Gemini to exfiltrate user data through a malicious image URL. The agent fetched the document, processed the hidden instruction, and encoded private information in an outbound HTTP request—all within its normal operating parameters.

Why it matters for authorization: Prompt injection doesn't need to be prevented at the model level alone. If the agent is blocked from executing the dangerous action (exfiltrating data, calling unauthorized APIs), the injection is neutralized regardless of whether the model was manipulated.

Tool abuse and misuse

Agents are given tools—file system access, shell execution, API calls, database queries. Tool abuse occurs when an agent uses a legitimate tool in an unauthorized way: running rm -rf / when it was given shell access for build scripts, or executing DROP TABLE when it was given database access for read queries.

In July 2025, Replit's AI agent deleted a user's entire production database after being told eleven times to stop. The agent had legitimate database credentials. The problem wasn't authentication—it was the complete absence of authorization on destructive operations. The tool was real. The credentials were valid. The action was unauthorized.

Why it matters for authorization: Tool-level authorization evaluates not just which tools the agent can call, but what arguments are permitted. Allowing SELECT but blocking DROP is the kind of granular control that static permissions can't provide.
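
A minimal sketch of what argument-level validation looks like in practice. The function name and regex rules are illustrative assumptions, not a real product API; a production validator would use a proper SQL parser rather than regexes.

```python
import re

# Hypothetical deny-list of destructive statement types.
DENIED_SQL = re.compile(r"^\s*(DROP|DELETE|TRUNCATE|ALTER)\b", re.IGNORECASE)
ALLOWED_SQL = re.compile(r"^\s*SELECT\b", re.IGNORECASE)

def authorize_sql(query: str) -> bool:
    """Allow read queries only: SELECT passes, destructive statements fail."""
    if DENIED_SQL.match(query):
        return False
    return bool(ALLOWED_SQL.match(query))

print(authorize_sql("SELECT id FROM users"))  # True
print(authorize_sql("DROP TABLE users"))      # False
```

The point is that the check runs on the tool call's arguments at execution time, regardless of why the model chose to emit that query.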

Data exfiltration

Agents with access to sensitive data can be manipulated—or can independently decide—to transmit that data to unauthorized destinations. This includes encoding data in API request parameters, writing it to external services, embedding it in generated code, or leaking it through side channels like image URLs.

The risk is acute in healthcare (PHI under HIPAA), finance (PCI DSS cardholder data), and legal (attorney-client privileged information). An agent with read access to an EHR system and write access to an email API has everything it needs to cause a reportable breach.

Why it matters for authorization: Output redaction and destination whitelisting at the tool-call level prevent data from leaving the authorized perimeter. The agent can read patient records to answer questions, but the authorization layer strips PHI before any outbound action.
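
The two controls named above can be sketched as follows. The hostname whitelist and the single SSN pattern are stand-in assumptions; a real deployment would use a full PHI detection pipeline and a managed destination policy.

```python
import re
from urllib.parse import urlparse

ALLOWED_HOSTS = {"api.internal.example.com"}   # hypothetical destination whitelist
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")     # one simple PHI-shaped pattern

def redact(text: str) -> str:
    """Strip SSN-shaped tokens from output before any outbound action."""
    return SSN.sub("[REDACTED]", text)

def destination_allowed(url: str) -> bool:
    """Permit outbound requests only to whitelisted hosts."""
    return urlparse(url).hostname in ALLOWED_HOSTS

print(redact("Patient SSN 123-45-6789 on file"))             # Patient SSN [REDACTED] on file
print(destination_allowed("https://attacker.example.net/x"))  # False
```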

Privilege escalation

Multi-agent systems introduce delegation chains where one agent invokes another. A low-privilege agent can request a high-privilege agent to perform actions on its behalf, effectively escalating its own permissions. Without authorization checks at each hop, the delegation chain becomes an escalation path.

This is analogous to the confused deputy problem in traditional security, but amplified by the non-deterministic nature of LLM reasoning. An agent that discovers it can ask another agent to perform a blocked action may do so without being explicitly instructed to—it's just "solving the problem" it was given.

Why it matters for authorization: Per-agent, per-action authorization with delegation tracking ensures that downstream agents cannot exceed the permissions of the original caller. The authorization boundary is enforced at every tool call in the chain, not just the first one.
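
One common way to enforce this inheritance rule is to compute each hop's effective permissions as the intersection of every caller upstream. A minimal sketch, assuming permissions are modeled as simple string sets:

```python
def effective_permissions(chain):
    """Effective permissions at the end of a delegation chain: the
    intersection of every agent's permission set along the chain."""
    perms = None
    for agent_perms in chain:
        perms = set(agent_perms) if perms is None else perms & set(agent_perms)
    return perms or set()

# A low-privilege caller delegating to a high-privilege agent gains nothing:
chain = [{"read_docs"}, {"read_docs", "send_email", "delete_file"}]
print(effective_permissions(chain))  # {'read_docs'}
```

Because the intersection can only shrink, no hop in the chain can exceed what the original caller was allowed to do.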

3. Real-world incidents

Agent security isn't theoretical. These incidents demonstrate what happens when agents operate without runtime authorization.

Replit agent deletes production database

July 2025

A coding agent with full database credentials deleted a user's production database despite being told eleven times to stop. The agent was authenticated with valid credentials but had no authorization layer governing destructive operations. Replit called it "a catastrophic failure."

Google Gemini data exfiltration via prompt injection

August 2024

Researcher Johann Rehberger demonstrated that a prompt injection hidden in a Google Doc could cause Gemini to exfiltrate user data. The agent encoded private information in a rendered Markdown image URL, sending it to an attacker-controlled server as part of normal document processing.

ChatGPT plugin SSRF and data leakage

2023-2024

Multiple ChatGPT plugins were found to be vulnerable to server-side request forgery (SSRF). Attackers could craft prompts that caused the agent to make requests to internal network addresses, bypassing firewall rules. The plugins acted as proxies into private infrastructure.

MCP tool poisoning attacks

2025

Invariant Labs disclosed "tool poisoning" attacks against the Model Context Protocol (MCP). Malicious MCP servers could inject hidden instructions into tool descriptions that were invisible to the user but processed by the LLM, causing the agent to exfiltrate SSH keys, modify code, or execute arbitrary commands.

4. Security frameworks: NIST AI RMF and OWASP

Two frameworks provide the most actionable guidance for AI agent security. Understanding them is essential for building a defensible security posture and communicating risk to leadership.

NIST AI Risk Management Framework (AI 100-1)

The NIST AI RMF organizes risk management into four functions: Govern, Map, Measure, and Manage. For agentic AI, the critical functions are:

GOVERN

Establish policies and accountability structures for AI agent deployment. Define who can deploy agents, what permissions they start with, and who approves escalation.

MAP

Identify and categorize risks specific to each agent's tool set. An agent with file system access has different risks than one with email access. Map tools to threat categories.

MEASURE

Quantify agent risk through monitoring. Track tool-call frequency, denied actions, approval response times, and policy violation trends. Measure what your agents actually do.

MANAGE

Enforce policies at runtime. This is where authorization lives—intercepting, evaluating, and controlling every tool call against defined policy. The enforcement function.

OWASP Top 10 for Agentic AI (2025)

OWASP's Agentic AI Security Initiative identifies the most critical risks to autonomous systems. Here are the top 10 threats to agentic applications and how runtime authorization addresses each:

  • AA01 Excessive Agency: least-privilege policies restrict tool access to the minimum required set
  • AA02 Uncontrolled Cascading Effects: per-action authorization prevents chain reactions across tool calls
  • AA03 Intent Misalignment: policy enforcement is independent of agent reasoning or intent
  • AA04 Prompt Injection (Indirect): tool-call validation blocks actions regardless of how the model was manipulated
  • AA05 Inadequate Sandboxing: the authorization layer acts as a logical sandbox around every tool call
  • AA06 Broken Access Control: per-agent, per-tool RBAC with environment-scoped policies
  • AA07 Insufficient Monitoring: every authorization decision logged with full context and an audit trail
  • AA08 Broken Delegation: delegation chain tracking with permission inheritance controls
  • AA09 Supply Chain Vulnerabilities: tool-call validation inspects arguments regardless of tool source
  • AA10 Data Leakage: output redaction and destination whitelisting at the authorization boundary

5. Defense-in-depth: the five layers

No single control secures an agent. Defense-in-depth means applying controls at every layer, so that a failure at one layer is caught by the next. Most organizations have layers 1-3 but are missing layer 4—runtime authorization—which is the most critical for agentic systems.

Layer 1: Model-level controls

System prompts, Constitutional AI, RLHF. These shape model behavior but cannot enforce it. The model can ignore, misinterpret, or be manipulated past these controls. Necessary but insufficient.

Layer 2: Input validation

Prompt injection detection, input sanitization, content filtering. Catches known attack patterns in user input. Cannot catch novel attacks, indirect injections from fetched data, or misuse that doesn't involve malicious input at all.

Layer 3: Network and infrastructure

Firewalls, VPCs, network segmentation, secrets management. Limits where agents can reach at the network level. Coarse-grained: can block entire hosts but not specific operations on allowed hosts.

Layer 4: Runtime authorization (the missing layer)

Policy enforcement at the tool-call boundary. Intercepts every action, evaluates it against declarative policy, and allows, denies, or routes to human approval. Operates independently of the model's reasoning. Cannot be bypassed by prompt injection because the model doesn't control the authorization layer. This is what Veto provides.

Layer 5: Monitoring and response

Logging, alerting, anomaly detection, incident response. Essential for visibility but reactive by nature—it tells you what happened after the fact. Without layer 4, monitoring alone means you're documenting damage, not preventing it.

6. Runtime authorization: the missing layer

Here's the core insight: every other security layer either tries to control the model's thinking (prompts, fine-tuning) or limits where it can reach (network controls). Runtime authorization controls what the agent can do—at the exact moment it tries to do it.

This distinction matters because agents are non-deterministic. You can't predict every action an agent will take. You can't write enough prompt instructions to cover every edge case. You can't test exhaustively because the same input can produce different tool-call sequences. The only reliable control is enforcement at the action boundary.

The tool-call boundary

1. Agent decides to call delete_file("/etc/passwd")
2. Tool call intercepted by the authorization layer
3. Policy evaluated: delete_file on /etc/* = DENY
4. Action blocked; the agent receives a denial response
5. Decision logged with full context for audit
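
A minimal sketch of enforcement at the tool-call boundary. The policy rule, tool names, and `PolicyDenied` type are illustrative assumptions, not a real product API:

```python
import fnmatch

# Hypothetical declarative rule: deny delete_file on anything under /etc.
POLICY = {("delete_file", "/etc/*"): "deny"}

class PolicyDenied(Exception):
    pass

def authorize(tool: str, arg: str) -> None:
    """Evaluate a tool call against policy; raise on a matching deny rule."""
    for (rule_tool, pattern), decision in POLICY.items():
        if tool == rule_tool and fnmatch.fnmatch(arg, pattern) and decision == "deny":
            raise PolicyDenied(f"{tool}({arg!r}) blocked by policy")

def guarded_call(tool, fn, arg):
    authorize(tool, arg)  # intercept before the tool executes
    return fn(arg)

try:
    guarded_call("delete_file", lambda p: f"deleted {p}", "/etc/passwd")
except PolicyDenied as e:
    print(e)  # the denial is returned to the agent and logged for audit
```

Note that the check runs outside the model: nothing the prompt says can skip the `authorize` call.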

Runtime authorization is effective against all four attack categories. Prompt injection that tries to trigger unauthorized actions? Blocked at the tool-call boundary. Tool abuse? Argument validation catches it. Data exfiltration? Output redaction strips sensitive data. Privilege escalation? Delegation tracking enforces permission inheritance.

Attack               | Prompt defense | Network defense | Runtime authz
Prompt injection     | Partial        | None            | Full
Tool abuse           | None           | None            | Full
Data exfiltration    | Partial        | Partial         | Full
Privilege escalation | None           | None            | Full

7. Implementing agent security with Veto

Veto is the runtime authorization layer for AI agents. It sits between the agent and its tools, intercepting every tool call, evaluating it against declarative policy, and enforcing allow/deny/approval decisions. The agent's code doesn't change. The model is unaware it's being governed.

Policy-as-code

Declarative YAML policies stored in your repository. Version-controlled, reviewable, auditable. Define what each agent can do with surgical precision—down to specific tool arguments and parameter ranges.
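
To make the idea concrete, here is an illustrative shape for such a policy. This is a sketch of the policy-as-code pattern, not Veto's actual schema; every field name here is an assumption.

```yaml
# Illustrative policy-as-code sketch (field names are hypothetical).
agent: billing-assistant
environment: production
rules:
  - tool: database.query
    allow:
      statements: [SELECT]        # argument-level constraint
  - tool: payments.refund
    allow:
      amount_usd: { max: 500 }    # parameter range
    require_approval: true        # route to human-in-the-loop
  - tool: "*"
    effect: deny                  # default-deny everything not listed
```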

Human-in-the-loop

Route sensitive actions to human approval. Configurable escalation via Slack, email, or dashboard. The agent pauses until a human approves or denies. Full audit trail of every decision and its outcome.

Framework-agnostic

Works with any agent framework: LangChain, LangGraph, CrewAI, OpenAI Agents SDK, Claude, Vercel AI, PydanticAI. Two lines of code to integrate. The authorization layer wraps your tools, not your agent.

Audit-grade logging

Every decision logged: tool name, arguments, policy matched, outcome, timestamp, agent identity, delegation chain. Queryable via dashboard, exportable for SOC 2, HIPAA, GDPR, and EU AI Act compliance reporting.
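
A decision record carrying those fields might look like the following. The field names are illustrative, not Veto's actual log schema:

```python
import datetime
import json

# Hypothetical audit record for a single authorization decision.
record = {
    "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "agent": "billing-assistant",
    "tool": "payments.refund",
    "arguments": {"amount_usd": 1200},
    "policy_matched": "refund-over-500-requires-approval",
    "decision": "pending_approval",
    "delegation_chain": ["orchestrator", "billing-assistant"],
}
print(json.dumps(record))  # one structured line per decision, queryable later
```

Emitting one structured record per decision is what makes the trail queryable and exportable for compliance reporting.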

8. Agent security maturity model

Where does your organization fall? Most teams deploying agents today are at Level 1 or 2. The gap between "authenticated" and "authorized" is where incidents happen.

L0: No controls

Agents run with full tool access. No authentication, no authorization, no logging. Common in prototypes and hackathon projects that accidentally reach production.

L1: Authentication only

Agents have API keys and identity verification. You know who the agent is. You don't control what it can do. This is where most production deployments sit today.

L2: Prompt-based guardrails

System prompts include instructions like "never delete files" and "always ask before sending emails." These are suggestions, not enforcement. The model can ignore them, and prompt injection can override them.

L3: Runtime authorization

Declarative policies enforce authorization at the tool-call boundary. Every action is intercepted and evaluated. Sensitive actions route to human approval. All decisions are logged. This is where Veto operates.

L4: Continuous governance

Automated policy testing, anomaly detection, drift monitoring, and compliance reporting. Policies evolve based on observed agent behavior. Security posture is continuously measured and improved. Veto's roadmap targets this level.

Frequently asked questions

What is AI agent security?
AI agent security is the discipline of protecting autonomous AI systems that interact with external tools, APIs, and data stores. Unlike traditional application security, it must account for non-deterministic behavior, prompt-based manipulation, and the fact that agents can take real-world actions like sending money, deleting data, or accessing sensitive records.
How is AI agent security different from LLM security?
LLM security focuses on the model itself—jailbreaks, hallucinations, data poisoning. Agent security extends to the actions the model takes via tools. An LLM that hallucinates is annoying. An agent that executes a hallucinated SQL query against your production database is a security incident. Agent security operates at the tool-call boundary, not the text-generation boundary.
Can prompt engineering solve agent security?
No. Prompt engineering shapes model behavior but cannot enforce it. The model can ignore, misinterpret, or be manipulated past prompt-based instructions. Prompt injection attacks specifically target this weakness. Runtime authorization enforces policy independently of the model's reasoning—the agent cannot bypass it because it doesn't control the authorization layer.
What is the OWASP Top 10 for Agentic AI?
The OWASP Agentic AI Security Initiative published a threat list specific to autonomous AI systems. The top risks include Excessive Agency, Uncontrolled Cascading Effects, Intent Misalignment, Prompt Injection, Inadequate Sandboxing, Broken Access Control, Insufficient Monitoring, Broken Delegation, Supply Chain Vulnerabilities, and Data Leakage. Runtime authorization addresses all ten.
How does NIST AI RMF apply to AI agents?
The NIST AI Risk Management Framework (AI 100-1) provides a four-function approach: Govern, Map, Measure, and Manage. For agents, the MANAGE function is critical—it's where runtime enforcement happens. Veto maps to the MANAGE function by providing policy enforcement, audit logging, and human-in-the-loop controls that satisfy NIST's risk management requirements.
What frameworks does Veto work with?
Veto is framework-agnostic. It integrates with LangChain, LangGraph, CrewAI, OpenAI Agents SDK, Claude (Anthropic), Vercel AI SDK, PydanticAI, and any custom agent framework. Integration is typically two lines of code—you wrap your tools, not your agent.

Can does not mean may. Enforce it.