How to contain prompt injection in production
Prompt injection is the top entry on the OWASP LLM Top Ten for a reason: when untrusted text changes model behavior, the agent can turn that behavior into a real action. The fix is not a single model patch or a single filter. It is defense in depth: filter what you can at the input layer, validate what the model emits, and authorize each governed tool call below the model in deterministic code. This guide wires all three layers together in Python. The third layer is the one that holds when the first two fail, which is most of the time.
What you'll build
- An input filter that blocks known injection patterns and redacts PII before the prompt hits the model.
- An output validator that parses tool calls as structured data and rejects anything off-schema.
- A Veto decision wrapped around each governed tool call so policy gets the final word.
- YAML rules that mark external user input as untrusted and gate the sensitive tools accordingly.
Step 1: Filter the input
The first layer catches the loud cases. A handful of known injection phrases get blocked. PII gets redacted. This is a triage filter, not a defense; do not treat it as one. The reason it exists is to keep the model from wasting cycles arguing with known adversarial input. The model and the runtime authorization path below it pick up everything else.
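# pip install presidio-analyzer presidio-anonymizer veto-sdk
import os

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from veto_sdk import Veto

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
veto = Veto(api_key=os.environ["VETO_API_KEY"])

KNOWN_INJECTION_PATTERNS = [
    "ignore previous instructions",
    "you are now",
    "system prompt:",
    "developer mode",
]


def strip_pii(text: str, findings: list) -> str:
    # Assumption: the original leaves strip_pii undefined. This version uses
    # Presidio's anonymizer, which replaces each finding with a placeholder.
    return anonymizer.anonymize(text=text, analyzer_results=findings).text


def pre_filter(user_input: str) -> str:
    # Case-insensitive substring match against the known phrases.
    lowered = user_input.lower()
    if any(p in lowered for p in KNOWN_INJECTION_PATTERNS):
        return "[blocked: suspicious instruction pattern]"
    # Redact any PII that Presidio detects before the text reaches the model.
    pii = analyzer.analyze(text=user_input, language="en")
    if pii:
        return strip_pii(user_input, pii)
    return user_input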
Real-world injection attempts mutate across models, prompts, and context sources, so the pattern list is a moving target. Treat the filter as a signal source; anything it catches is also worth logging for the threat-intel pipeline. The defense does not depend on the filter being complete.
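Because attempts vary casing, spacing, and Unicode forms, a slightly more tolerant matcher may be worth considering. The following is a minimal sketch, not a complete detector: it normalizes the text before matching and returns the patterns that fired so they can be logged as signals. The names `normalize` and `match_injection_patterns` are hypothetical and do not come from any library used in this guide.

```python
import re
import unicodedata

KNOWN_INJECTION_PATTERNS = [
    "ignore previous instructions",
    "you are now",
    "system prompt:",
    "developer mode",
]

# Zero-width characters that can split a phrase so a substring check misses it.
_ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def normalize(text: str) -> str:
    """Fold compatibility forms, drop zero-width characters, collapse whitespace, casefold."""
    folded = unicodedata.normalize("NFKC", text).translate(_ZERO_WIDTH)
    return re.sub(r"\s+", " ", folded).strip().casefold()


def match_injection_patterns(user_input: str) -> list[str]:
    """Return every known pattern found in the normalized input (empty list if none)."""
    normalized = normalize(user_input)
    return [p for p in KNOWN_INJECTION_PATTERNS if p in normalized]
```

Returning the matched patterns instead of a boolean keeps the filter useful as a signal source: the caller can block on a non-empty list and log exactly what matched.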
Step 2: Validate model output
The second layer parses model output as structured data. Each governed tool call must match a schema and name a tool from a known allow-list. Free-form prose that asks the system to call a tool by name in natural language gets dropped on the floor. This is where you stop the "please run rm -rf" class of attempts.
import json

from pydantic import BaseModel, ValidationError


class ToolCallProposal(BaseModel):
    tool: str
    arguments: dict


ALLOWED_TOOLS = {"refund_order", "read_ticket", "add_note", "escalate"}


def validate_model_output(raw: str) -> ToolCallProposal | None:
    # Anything that is not valid JSON in the expected shape is rejected.
    try:
        parsed = ToolCallProposal.model_validate_json(raw)
    except (ValidationError, json.JSONDecodeError):
        return None
    # Well-formed proposals must still name a tool on the allow-list.
    if parsed.tool not in ALLOWED_TOOLS:
        return None
    return parsed
Pydantic does the heavy lifting. Anything that does not deserialize is rejected. Anything that names a tool outside ALLOWED_TOOLS is rejected. Log rejections; they are the second signal source for the threat pipeline.
Step 3: Authorize each governed tool call
The third layer is the one that holds. Even if the input filter missed something and the output validator passed it, the Veto decision sees the concrete tool name, the concrete arguments, and the context. It evaluates a YAML rule and returns allow, deny, or require_approval. This is the last line of defense and the only one that does not depend on the model behaving.
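# Assumed to be defined elsewhere in your application: llm (model client),
# TOOL_SCHEMAS (tool definitions sent to the model), TOOLS (name -> callable),
# and log (a structured logger that accepts keyword fields).
def run_agent_turn(user_input: str, agent_id: str):
    # Layer 1: triage filter before the model sees the text.
    safe_input = pre_filter(user_input)
    response = llm.complete(
        messages=[{"role": "user", "content": safe_input}],
        tools=TOOL_SCHEMAS,
    )
    for proposal_raw in response.tool_calls:
        # Layer 2: structural validation of what the model proposed.
        proposal = validate_model_output(proposal_raw)
        if proposal is None:
            log.warning("agent_proposed_invalid_tool_call", raw=proposal_raw)
            continue
        # Layer 3: deterministic policy decision on the concrete call.
        decision = veto.decide(
            tool=proposal.tool,
            args=proposal.arguments,
            agent={"id": agent_id, "role": "support"},
            context={
                "user_input": safe_input,
                "source": "external_user",
            },
        )
        if decision.outcome == "deny":
            yield f"Blocked: {decision.reason}"
            continue
        if decision.outcome == "require_approval":
            approval = veto.approvals.wait(decision.approval_id, timeout=120)
            if approval.status != "approved":
                yield f"Approval rejected: {approval.note}"
                continue
        # Only allowed or approved calls reach the tool.
        yield TOOLS[proposal.tool](**proposal.arguments)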
Notice the context.source field. Marking the input as external_user lets the policy treat actions driven by chat input differently from actions driven by internal cron jobs. The same field carries the signal for indirect injection, where the untrusted text arrives through a document instead of the chat box. The model can be convinced of anything; the policy cannot.
Step 4: Write the injection-defense policy
A short YAML bundle covers the patterns that account for most published incidents: external email exfiltration, destructive deletes driven by chat input, and large refunds proposed by chat sessions. Keep the rule list narrow at first and grow it as your threat model evolves.
# policies/injection-defense.yaml
- name: block_external_emails_unless_approved
  match:
    tool: send_email
  rules:
    - if: not args.to.endsWith("@approved.example")
      then: require_approval
- name: never_let_external_input_drive_a_delete
  match:
    tool: delete_user
  rules:
    - if: context.source == "external_user"
      then: deny
- name: cap_refunds_from_chat_sessions
  match:
    tool: refund_order
  rules:
    - if: context.source == "external_user" and args.amount_cents > 10000
      then: require_approval
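To sanity-check what you expect these three rules to do, it can help to mirror them in plain Python and assert on the outcomes. The sketch below is a hand-written stand-in, not Veto's rule evaluator. The function name `evaluate` and the `internal_cron` source label are hypothetical, and it assumes that a call matching no rule is allowed, which this guide does not state.

```python
def evaluate(tool: str, args: dict, context: dict) -> str:
    """Hand-written mirror of the three YAML rules; returns allow, deny, or require_approval."""
    source = context.get("source")
    if tool == "send_email":
        # block_external_emails_unless_approved
        if not str(args.get("to", "")).endswith("@approved.example"):
            return "require_approval"
    elif tool == "delete_user":
        # never_let_external_input_drive_a_delete
        if source == "external_user":
            return "deny"
    elif tool == "refund_order":
        # cap_refunds_from_chat_sessions
        if source == "external_user" and args.get("amount_cents", 0) > 10000:
            return "require_approval"
    # Assumption: a call that matches no rule is allowed. Confirm the real default.
    return "allow"


# Boundary cases worth pinning down before the policy ships.
cases = [
    ("refund_order", {"amount_cents": 10000}, {"source": "external_user"}, "allow"),
    ("refund_order", {"amount_cents": 10001}, {"source": "external_user"}, "require_approval"),
    ("delete_user", {"user_id": "u1"}, {"source": "external_user"}, "deny"),
    ("send_email", {"to": "ops@approved.example"}, {"source": "external_user"}, "allow"),
]
for tool, args, context, expected in cases:
    assert evaluate(tool, args, context) == expected
```

Note the boundary: a refund of exactly 10000 cents passes without approval because the rule uses a strict greater-than comparison.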
For the broader threat model, read the MCP security guide and the OWASP LLM Top Ten primer in the glossary.
Failure modes to catch
Trusting the model to refuse
System prompts that ask the model to refuse certain actions are wishes, not controls. The defensive assumption should be that models can be coaxed past their own refusals. The deterministic check has to live below the model.
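One way to make "below the model" concrete is to wrap each tool so it cannot run without a decision. The following is a minimal sketch under stated assumptions: `governed`, `ToolDenied`, and the stand-in authorizer `deny_deletes` are hypothetical names, and a real authorizer would call the policy engine and handle approvals instead of returning a hard-coded answer.

```python
import functools
from typing import Any, Callable

# An authorizer takes (tool name, arguments) and returns "allow", "deny", or "require_approval".
Authorizer = Callable[[str, dict], str]


class ToolDenied(Exception):
    """Raised when the authorizer returns anything other than allow."""


def governed(name: str, authorize: Authorizer) -> Callable:
    """Wrap a tool so the authorizer runs on the concrete arguments before the tool does."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(fn)
        def wrapper(**kwargs: Any) -> Any:
            outcome = authorize(name, kwargs)
            if outcome != "allow":
                # Fail closed: only an explicit allow reaches the tool.
                raise ToolDenied(f"{name}: {outcome}")
            return fn(**kwargs)
        return wrapper
    return decorator


# Stand-in authorizer for the example only.
def deny_deletes(tool: str, args: dict) -> str:
    return "deny" if tool == "delete_user" else "allow"


@governed("delete_user", deny_deletes)
def delete_user(user_id: str) -> str:
    return f"deleted {user_id}"


@governed("add_note", deny_deletes)
def add_note(ticket_id: str, text: str) -> str:
    return f"note added to {ticket_id}"
```

Because the check lives in the wrapper, no wording in the prompt or the model output can reach `delete_user` without passing through the authorizer first.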
No context.source on the decision
Without marking the input as external, your policy cannot distinguish an internal cron job from a user chat. Pass context.source on every decide call.
Ignoring indirect injection
If your agent reads webpages, support tickets, or PDFs, the content of those documents is also untrusted input. Mark anything that came from a read tool with source = read_content and apply the same strict rules to actions taken on the back of it.
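A small amount of bookkeeping per turn is enough to carry that label. The sketch below is one possible shape, not part of any SDK: `TurnContext`, the `READ_TOOLS` set, the `internal_cron` label, and the extra `untrusted` key are all hypothetical. The YAML rules in Step 4 test only for external_user, so they would need to test for read_content as well (or for a flag like this one) before the same strictness applies.

```python
# Source labels the policy should treat as untrusted.
UNTRUSTED_SOURCES = {"external_user", "read_content"}

# Hypothetical set of tools whose results bring third-party text into the model context.
READ_TOOLS = {"read_ticket"}


class TurnContext:
    """Tracks whether document content has entered the model context during one turn."""

    def __init__(self, initial_source: str) -> None:
        self.initial_source = initial_source
        self.read_untrusted_content = False

    def record_tool_result(self, tool: str) -> None:
        # Once a read tool has returned, later actions may be driven by its content.
        if tool in READ_TOOLS:
            self.read_untrusted_content = True

    def decision_context(self) -> dict:
        """Build the context to pass along with each authorization decision."""
        source = "read_content" if self.read_untrusted_content else self.initial_source
        return {"source": source, "untrusted": source in UNTRUSTED_SOURCES}
```

The label is sticky by design: after a read tool returns, every later decision in the turn carries read_content, even when the turn started from a trusted internal job.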
Production checklist
- Input filter catches the known injection patterns and runs before the LLM call.
- Output validator parses tool calls as structured data and rejects unknown tools.
- Each governed tool call routes through veto.decide before execution.
- context.source is set on governed decisions and the policy uses it.
- Rejections from all three layers feed a single threat-intel log.
FAQ
Can a prompt filter alone stop injection?
No. Filters catch known patterns, while indirect injection (a malicious instruction smuggled through a webpage the agent reads) can walk past them. Input filtering is necessary, not sufficient. The deterministic check has to live below the model.
What is the difference between guardrails and runtime authorization?
Guardrails is an umbrella term. Input filtering, output validation, content moderation, and runtime authorization are all guardrails. Runtime authorization is the specific subtype that decides allow or deny on a concrete tool call with concrete arguments, in code outside the model. Rephrasing the prompt does not change a policy decision enforced outside the model.
What if my agent reads documents that might contain hidden instructions?
That is indirect prompt injection. The mitigation is the same: do not let the model decide what to do with the read result. Route each side-effecting tool call it proposes through a Veto decision. The policy can read context.source and mark anything sourced from external document content as untrusted, then deny or gate the higher-stakes tools.
Related guides
- Stop tool poisoning attacks at the MCP Gateway layer.
- Block data exfiltration from agents: Argument-level constraints on read tools so injection cannot leak data.
- Add human approval: Gate the highest-stakes tools with a human approver as the final defense.
- MCP security guide: Full threat model for MCP-based agent stacks.
- Agent authorization: Deterministic last line of defense.
Put a deterministic check below your model.