AI Agent Security: The Comprehensive Guide
AI agents don't just generate text. They execute code, move money, delete data, and send emails. Securing them requires a fundamentally different approach than securing chatbots. This is the guide a CISO needs.
Last updated: April 2026
What is AI agent security?
AI agent security is the discipline of protecting autonomous AI systems that interact with external tools, APIs, and data stores from being exploited, manipulated, or misused. It encompasses the full lifecycle: from threat modeling and policy definition to runtime enforcement and post-incident analysis. Unlike traditional application security, agent security must account for non-deterministic decision-making by the AI model itself.
1. The AI agent threat landscape
The shift from chatbots to agents changed the security equation. A chatbot produces text. An agent produces actions. When you give an LLM the ability to call tools, you've turned a language model into an actor in your system with the same privileges as a human operator—but without the judgment, context, or accountability.
According to Gartner, by 2028, 33% of enterprise software applications will include agentic AI, up from less than 1% in 2024. The attack surface is growing exponentially, but security practices have not kept pace. Most organizations deploying agents today have no runtime controls on what those agents can do.
The fundamental problem: agents are authenticated but not authorized. They have API keys that grant identity, but no policy layer that governs behavior. Authentication answers "who is this?" Authorization answers "what can it do?" Most agent deployments have the first but not the second.
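The distinction can be made concrete in a few lines. A minimal sketch (all names hypothetical): authentication maps an API key to an identity, while authorization consults a policy before every action.

```python
# Authentication answers "who is this?"; authorization answers "what can it do?"
AGENT_KEYS = {"key-123": "billing-agent"}       # authentication: key -> identity
POLICY = {"billing-agent": {"read_invoice"}}    # authorization: identity -> allowed tools

def authorize(api_key: str, tool: str) -> bool:
    agent = AGENT_KEYS.get(api_key)             # who is this?
    if agent is None:
        return False                            # unauthenticated: reject outright
    return tool in POLICY.get(agent, set())     # what may it do?

assert authorize("key-123", "read_invoice") is True
assert authorize("key-123", "delete_invoice") is False  # valid key, blocked action
```

The second assertion is the gap most deployments miss: the key is valid, yet the action is denied.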
Traditional software security
- Deterministic behavior
- Code review catches most bugs
- Input validation at known boundaries
- Static RBAC works
AI agent security
- Non-deterministic decisions
- Same input can produce different tool calls
- Attack surface includes natural language
- Requires runtime, per-action authorization
2. Attack taxonomy: how agents get exploited
OWASP's 2025 Agentic AI Security Initiative identified the top threats to agentic systems. We've organized them into four categories that map directly to the tool-call boundary where authorization operates.
Prompt injection
The most widely discussed attack vector. An adversary embeds instructions in data the agent processes—a web page, an email, a database record—causing the agent to execute unintended actions. Direct injection manipulates the agent's own prompt. Indirect injection poisons the data the agent retrieves.
In August 2024, researcher Johann Rehberger demonstrated how a prompt injection in a shared Google Doc could cause Google's Gemini to exfiltrate user data through a malicious image URL. The agent fetched the document, processed the hidden instruction, and encoded private information in an outbound HTTP request—all within its normal operating parameters.
Why it matters for authorization: Prompt injection doesn't need to be prevented at the model level alone. If the agent is blocked from executing the dangerous action (exfiltrating data, calling unauthorized APIs), the injection is neutralized regardless of whether the model was manipulated.
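One way to picture this: an outbound-request gate that only consults a destination allowlist, never the model's reasoning. A sketch, assuming a hypothetical allowlist of approved hosts:

```python
from urllib.parse import urlparse

# Hypothetical allowlist of hosts the agent may contact.
ALLOWED_HOSTS = {"api.internal.example"}

def gate_http_request(url: str) -> bool:
    """Deny any outbound call to a host outside the allowlist, regardless of
    why the model decided to make it -- including injected instructions."""
    return urlparse(url).hostname in ALLOWED_HOSTS

# An injected instruction that encodes data in an attacker URL is denied:
assert gate_http_request("https://attacker.example/leak?d=secret") is False
assert gate_http_request("https://api.internal.example/v1/query") is True
```

The gate never asks whether the model was manipulated; it only asks whether the action is permitted.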
Tool abuse and misuse
Agents are given tools—file system access, shell execution, API calls, database queries. Tool abuse occurs when an agent uses a legitimate tool in an unauthorized way: running rm -rf / when it was given shell access for build scripts, or executing DROP TABLE when it was given database access for read queries.
In July 2025, Replit's AI agent deleted a user's entire production database after being told eleven times to stop. The agent had legitimate database credentials. The problem wasn't authentication—it was the complete absence of authorization on destructive operations. The tool was real. The credentials were valid. The action was unauthorized.
Why it matters for authorization: Tool-level authorization evaluates not just which tools the agent can call, but what arguments are permitted. Allowing SELECT but blocking DROP is the kind of granular control that static permissions can't provide.
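A toy version of that argument-level check, reduced to its essence (a real deployment should parse SQL properly rather than match leading keywords):

```python
def authorize_sql(query: str) -> bool:
    """Argument-level authorization: the agent holds valid database
    credentials, but only read statements are permitted. Sketch only --
    keyword matching is not a substitute for a real SQL parser."""
    first_word = query.lstrip().split(None, 1)[0].upper()
    return first_word == "SELECT"

assert authorize_sql("SELECT * FROM users") is True
assert authorize_sql("DROP TABLE users") is False
assert authorize_sql("  delete from users") is False
```

The point is where the check runs: on the tool call's arguments, after the model has decided and before the database sees the statement.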
Data exfiltration
Agents with access to sensitive data can be manipulated—or can independently decide—to transmit that data to unauthorized destinations. This includes encoding data in API request parameters, writing it to external services, embedding it in generated code, or leaking it through side channels like image URLs.
The risk is acute in healthcare (PHI under HIPAA), finance (PCI DSS cardholder data), and legal (attorney-client privileged information). An agent with read access to an EHR system and write access to an email API has everything it needs to cause a reportable breach.
Why it matters for authorization: Output redaction and destination whitelisting at the tool-call level prevent data from leaving the authorized perimeter. The agent can read patient records to answer questions, but the authorization layer strips PHI before any outbound action.
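In miniature, a redaction pass looks like the following. This is only a sketch of where the control sits, between the data store and any outbound tool call; real PHI detection requires far more than one regex:

```python
import re

# Hypothetical redaction pass run on tool outputs before any outbound action.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact(text: str) -> str:
    """Strip SSN-shaped tokens from text leaving the authorized perimeter."""
    return SSN.sub("[REDACTED]", text)

assert redact("Patient SSN 123-45-6789, follow-up Tuesday") == \
       "Patient SSN [REDACTED], follow-up Tuesday"
```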
Privilege escalation
Multi-agent systems introduce delegation chains where one agent invokes another. A low-privilege agent can request a high-privilege agent to perform actions on its behalf, effectively escalating its own permissions. Without authorization checks at each hop, the delegation chain becomes an escalation path.
This is analogous to the confused deputy problem in traditional security, but amplified by the non-deterministic nature of LLM reasoning. An agent that discovers it can ask another agent to perform a blocked action may do so without being explicitly instructed to—it's just "solving the problem" it was given.
Why it matters for authorization: Per-agent, per-action authorization with delegation tracking ensures that downstream agents cannot exceed the permissions of the original caller. The authorization boundary is enforced at every tool call in the chain, not just the first one.
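The usual remedy is permission attenuation: a downstream agent's effective permissions are the intersection of every caller's permissions along the chain. A sketch with hypothetical agent names:

```python
# Delegation tracking: permissions attenuate along the chain, so a
# low-privilege agent cannot escalate by routing through a high-privilege one.
PERMISSIONS = {
    "intern-agent": {"read_docs"},
    "admin-agent": {"read_docs", "delete_docs"},
}

def effective_permissions(chain: list[str]) -> set[str]:
    """Intersect the permission sets of every agent in the delegation chain."""
    return set.intersection(*(PERMISSIONS[a] for a in chain))

# admin-agent alone may delete; invoked on behalf of intern-agent it may not:
assert "delete_docs" in effective_permissions(["admin-agent"])
assert "delete_docs" not in effective_permissions(["intern-agent", "admin-agent"])
```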
3. Real-world incidents
Agent security isn't theoretical. These incidents demonstrate what happens when agents operate without runtime authorization.
Replit agent deletes production database
July 2025
A coding agent with full database credentials deleted a user's production database despite being told eleven times to stop. The agent was authenticated with valid credentials but had no authorization layer governing destructive operations. Replit called it "a catastrophic failure."
Google Gemini data exfiltration via prompt injection
August 2024
Researcher Johann Rehberger demonstrated that a prompt injection hidden in a Google Doc could cause Gemini to exfiltrate user data. The agent encoded private information in a rendered Markdown image URL, sending it to an attacker-controlled server as part of normal document processing.
ChatGPT plugin SSRF and data leakage
2023-2024
Multiple ChatGPT plugins were found to be vulnerable to server-side request forgery (SSRF). Attackers could craft prompts that caused the agent to make requests to internal network addresses, bypassing firewall rules. The plugins acted as proxies into private infrastructure.
MCP tool poisoning attacks
2025
Invariant Labs disclosed "tool poisoning" attacks against the Model Context Protocol (MCP). Malicious MCP servers could inject hidden instructions into tool descriptions that were invisible to the user but processed by the LLM, causing the agent to exfiltrate SSH keys, modify code, or execute arbitrary commands.
4. Security frameworks: NIST AI RMF and OWASP
Two frameworks provide the most actionable guidance for AI agent security. Understanding them is essential for building a defensible security posture and communicating risk to leadership.
NIST AI Risk Management Framework (AI 100-1)
The NIST AI RMF organizes risk management into four functions: Govern, Map, Measure, and Manage. For agentic AI, the critical functions are:
GOVERN
Establish policies and accountability structures for AI agent deployment. Define who can deploy agents, what permissions they start with, and who approves escalation.
MAP
Identify and categorize risks specific to each agent's tool set. An agent with file system access has different risks than one with email access. Map tools to threat categories.
MEASURE
Quantify agent risk through monitoring. Track tool-call frequency, denied actions, approval response times, and policy violation trends. Measure what your agents actually do.
MANAGE
Enforce policies at runtime. This is where authorization lives—intercepting, evaluating, and controlling every tool call against defined policy. The enforcement function.
OWASP Top 10 for Agentic AI (2025)
OWASP's Agentic AI Security Initiative identifies the ten most critical threats to agentic applications. Runtime authorization addresses each of them at the same place: the tool-call boundary, where a threat becomes an action.
5. Defense-in-depth: the five layers
No single control secures an agent. Defense-in-depth means applying controls at every layer, so that a failure at one layer is caught by the next. Most organizations have layers 1-3 but are missing layer 4—runtime authorization—which is the most critical for agentic systems.
Layer 1: Model-level controls
System prompts, Constitutional AI, RLHF. These shape model behavior but cannot enforce it. The model can ignore, misinterpret, or be manipulated past these controls. Necessary but insufficient.
Layer 2: Input validation
Prompt injection detection, input sanitization, content filtering. Catches known attack patterns in user input. Cannot catch novel attacks, indirect injections from fetched data, or misuse that doesn't involve malicious input at all.
Layer 3: Network and infrastructure
Firewalls, VPCs, network segmentation, secrets management. Limits where agents can reach at the network level. Coarse-grained: can block entire hosts but not specific operations on allowed hosts.
Layer 4: Runtime authorization (the missing layer)
Policy enforcement at the tool-call boundary. Intercepts every action, evaluates it against declarative policy, and allows, denies, or routes to human approval. Operates independently of the model's reasoning. Cannot be bypassed by prompt injection because the model doesn't control the authorization layer. This is what Veto provides.
Layer 5: Monitoring and response
Logging, alerting, anomaly detection, incident response. Essential for visibility but reactive by nature—it tells you what happened after the fact. Without layer 4, monitoring alone means you're documenting damage, not preventing it.
7. Implementing agent security with Veto
Veto is the runtime authorization layer for AI agents. It sits between the agent and its tools, intercepting every tool call, evaluating it against declarative policy, and enforcing allow/deny/approval decisions. The agent's code doesn't change. The model is unaware it's being governed.
Policy-as-code
Declarative YAML policies stored in your repository. Version-controlled, reviewable, auditable. Define what each agent can do with surgical precision—down to specific tool arguments and parameter ranges.
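To give a flavor of policy-as-code at this granularity, here is a hypothetical fragment. The field names below are illustrative only, not Veto's actual schema:

```yaml
# Illustrative policy fragment -- field names are hypothetical, not Veto's schema.
agent: billing-agent
rules:
  - tool: database.query
    allow:
      statement: SELECT              # argument-level constraint
  - tool: database.query
    deny:
      statement: [DROP, DELETE, TRUNCATE]
  - tool: email.send
    require_approval: true           # route to human-in-the-loop
    approvers: ["#security-ops"]
```

Because the policy lives in the repository, a change to what an agent may do goes through the same review process as a change to code.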
Human-in-the-loop
Route sensitive actions to human approval. Configurable escalation via Slack, email, or dashboard. The agent pauses until a human approves or denies. Full audit trail of every decision and its outcome.
Framework-agnostic
Works with any agent framework: LangChain, LangGraph, CrewAI, OpenAI Agents SDK, Claude, Vercel AI, PydanticAI. Two lines of code to integrate. The authorization layer wraps your tools, not your agent.
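The wrapping pattern itself is framework-independent and can be sketched generically. This is an illustrative decorator, not Veto's actual API:

```python
from functools import wraps

def governed(tool_name, policy):
    """Wrap a tool so every invocation is checked against `policy` first.
    The agent's code calls the tool as before; the check is transparent."""
    def wrap(fn):
        @wraps(fn)
        def inner(*args, **kwargs):
            if not policy(tool_name, kwargs):          # evaluate before executing
                raise PermissionError(f"{tool_name} denied by policy")
            return fn(*args, **kwargs)
        return inner
    return wrap

# Toy policy for illustration: block shell execution, allow everything else.
policy = lambda tool, kwargs: tool != "shell.exec"

@governed("files.read", policy)
def read_file(path=""):
    return f"contents of {path}"

@governed("shell.exec", policy)
def run_shell(cmd=""):
    return "ran"

assert read_file(path="notes.txt") == "contents of notes.txt"
try:
    run_shell(cmd="rm -rf /")
    raise AssertionError("should have been denied")
except PermissionError:
    pass
```

Because the wrapper sits on the tool rather than the agent, the same governed tools can be handed to any framework's agent loop.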
Audit-grade logging
Every decision logged: tool name, arguments, policy matched, outcome, timestamp, agent identity, delegation chain. Queryable via dashboard, exportable for SOC 2, HIPAA, GDPR, and EU AI Act compliance reporting.
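A single record with those fields might look like the following; the shape is hypothetical, shown only to make the field list concrete:

```python
import json
from datetime import datetime, timezone

# Illustrative audit record containing the fields listed above.
record = {
    "timestamp": datetime(2026, 4, 1, tzinfo=timezone.utc).isoformat(),
    "agent": "billing-agent",
    "delegation_chain": ["support-agent", "billing-agent"],
    "tool": "database.query",
    "arguments": {"statement": "DROP TABLE invoices"},
    "policy_matched": "deny-destructive-sql",
    "outcome": "denied",
}
print(json.dumps(record, indent=2))
```

Structured records like this are what make the log queryable: an auditor can ask "show every denied destructive operation in Q2" rather than grepping free text.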
8. Agent security maturity model
Where does your organization fall? Most teams deploying agents today are at Level 1 or 2. The gap between "authenticated" and "authorized" is where incidents happen.
Level 1: No controls
Agents run with full tool access. No authentication, no authorization, no logging. Common in prototypes and hackathon projects that accidentally reach production.
Level 2: Authentication only
Agents have API keys and identity verification. You know who the agent is. You don't control what it can do. This is where most production deployments sit today.
Level 3: Prompt-based guardrails
System prompts include instructions like "never delete files" and "always ask before sending emails." These are suggestions, not enforcement. The model can ignore them, and prompt injection can override them.
Level 4: Runtime authorization
Declarative policies enforce authorization at the tool-call boundary. Every action is intercepted and evaluated. Sensitive actions route to human approval. All decisions are logged. This is where Veto operates.
Level 5: Continuous governance
Automated policy testing, anomaly detection, drift monitoring, and compliance reporting. Policies evolve based on observed agent behavior. Security posture is continuously measured and improved. Veto's roadmap targets this level.
Can does not mean may. Enforce it.