How to contain prompt injection in production

Prompt injection is the top entry on the OWASP LLM Top Ten for a reason: when untrusted text changes model behavior, the agent can turn that behavior into a real action. The fix is not a single model patch or a single filter. It is defense in depth: filter what you can at the input layer, validate what the model emits, and authorize each governed tool call below the model in deterministic code. This guide wires all three layers in Python. The third layer is the one that holds when the first two fail, which is most of the time. By the end you will have:

  • An input filter that strips known injection patterns and PII before the prompt hits the model.
  • An output validator that parses tool calls as structured data and rejects anything off-schema.
  • A Veto decision wrapped around each governed tool call so policy gets the final word.
  • YAML rules that mark external user input as untrusted and gate the sensitive tools accordingly.

Step 1: Filter the input

The first layer catches the loud cases. A handful of known injection phrases get blocked. PII gets redacted. This is a triage filter, not a defense; do not treat it as one. The reason it exists is to keep the model from wasting cycles arguing with known adversarial input. The model and the runtime authorization path below it pick up everything else.

py
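# pip install presidio-analyzer presidio-anonymizer veto-sdk
# Presidio's default NLP engine also needs a spaCy model:
#   python -m spacy download en_core_web_lg
import os
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from veto_sdk import Veto

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
veto = Veto(api_key=os.environ["VETO_API_KEY"])

# Lowercase substrings. A triage list, expected to be incomplete.
KNOWN_INJECTION_PATTERNS = [
    "ignore previous instructions",
    "you are now",
    "system prompt:",
    "developer mode",
]

def strip_pii(text: str, findings) -> str:
    # Replace each detected span using presidio-anonymizer's default operator,
    # which substitutes a placeholder such as <EMAIL_ADDRESS>.
    return anonymizer.anonymize(text=text, analyzer_results=findings).text

def pre_filter(user_input: str) -> str:
    lowered = user_input.lower()
    # Loud cases: replace the whole message rather than trying to repair it.
    if any(p in lowered for p in KNOWN_INJECTION_PATTERNS):
        return "[blocked: suspicious instruction pattern]"
    pii = analyzer.analyze(text=user_input, language="en")
    if pii:
        return strip_pii(user_input, pii)
    return user_input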

Real-world injection attempts mutate across models, prompts, and context sources, so the pattern list is a moving target. Treat the filter as a signal source; anything it catches is also worth logging for the threat-intel pipeline. The defense does not depend on the filter being complete.
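
Because attempts mutate, even a triage filter benefits from normalizing text before matching and from reporting which patterns fired, so the hit can be logged rather than silently dropped. The sketch below is one way to do that with the standard library only. The names normalize and detect_injection_signals and the choice of normalization steps are illustrative assumptions, not part of any SDK.

```python
import re
import unicodedata

# Same illustrative phrases as KNOWN_INJECTION_PATTERNS above.
KNOWN_INJECTION_PATTERNS = [
    "ignore previous instructions",
    "you are now",
    "system prompt:",
    "developer mode",
]


def normalize(text: str) -> str:
    """Fold Unicode compatibility forms, case, and whitespace runs before matching."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return re.sub(r"\s+", " ", folded).strip()


def detect_injection_signals(text: str) -> list[str]:
    """Return every known pattern found in the normalized text (hypothetical helper)."""
    normalized = normalize(text)
    return [p for p in KNOWN_INJECTION_PATTERNS if p in normalized]
```

Returning the matched patterns instead of a boolean keeps the filter useful as a signal source: the caller can block on a non-empty list and log exactly which phrases fired.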

Step 2: Validate model output

The second layer parses model output as structured data. Each governed tool call must match a schema and name a tool from a known allow-list. Free-form prose that asks the system to call a tool by name in natural language gets dropped on the floor. This is where you stop the "please run rm -rf" class of attempts.

py
import json
from pydantic import BaseModel, ValidationError

class ToolCallProposal(BaseModel):
    tool: str
    arguments: dict

ALLOWED_TOOLS = {"refund_order", "read_ticket", "add_note", "escalate"}

def validate_model_output(raw: str) -> ToolCallProposal | None:
    try:
        parsed = ToolCallProposal.model_validate_json(raw)
    except (ValidationError, json.JSONDecodeError):
        return None
    if parsed.tool not in ALLOWED_TOOLS:
        return None
    return parsed

Pydantic does the heavy lifting. Anything that does not deserialize is rejected. Anything that names a tool outside ALLOWED_TOOLS is rejected. Log rejections; they are the second signal source for the threat pipeline.
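
ToolCallProposal types arguments as a plain dict, so the check above stops at the tool name. A per-tool argument check closes that gap. The sketch below uses only the standard library to keep it self-contained. The ARG_SPECS table, the check_arguments helper, and every argument name except amount_cents (which appears in the policy later in this guide) are hypothetical.

```python
# Hypothetical per-tool argument specs: argument name -> required Python type.
ARG_SPECS: dict[str, dict[str, type]] = {
    "refund_order": {"order_id": str, "amount_cents": int},
    "read_ticket": {"ticket_id": str},
    "add_note": {"ticket_id": str, "text": str},
    "escalate": {"ticket_id": str},
}


def check_arguments(tool: str, arguments: dict) -> list[str]:
    """Return a list of problems; an empty list means the arguments pass."""
    spec = ARG_SPECS.get(tool)
    if spec is None:
        return [f"unknown tool: {tool}"]
    problems = []
    for name, expected in spec.items():
        if name not in arguments:
            problems.append(f"missing argument: {name}")
            continue
        value = arguments[name]
        # bool is a subclass of int in Python, so reject it explicitly for int fields.
        is_bool_for_int = expected is int and isinstance(value, bool)
        if is_bool_for_int or not isinstance(value, expected):
            problems.append(f"wrong type for {name}: expected {expected.__name__}")
    # Reject extra keys so unexpected parameters never reach the tool.
    for name in arguments:
        if name not in spec:
            problems.append(f"unexpected argument: {name}")
    return problems
```

With Pydantic the same idea is usually expressed as one model per tool. Either way, the list of problems is worth logging alongside the rejected call.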

Step 3: Authorize each governed tool call

The third layer is the one that holds. Even if the input filter missed something and the output validator passed it, the Veto decision sees the concrete tool name, the concrete arguments, and the context. It evaluates a YAML rule and returns allow, deny, or require_approval. This is the last line of defense and the only one that does not depend on the model behaving.

py
# Assumes llm, TOOL_SCHEMAS, TOOLS, and a structlog-style log are defined by the application.
def run_agent_turn(user_input: str, agent_id: str):
    safe_input = pre_filter(user_input)

    response = llm.complete(
        messages=[{"role": "user", "content": safe_input}],
        tools=TOOL_SCHEMAS,
    )

    for proposal_raw in response.tool_calls:
        proposal = validate_model_output(proposal_raw)
        if proposal is None:
            log.warning("agent_proposed_invalid_tool_call", raw=proposal_raw)
            continue

        decision = veto.decide(
            tool=proposal.tool,
            args=proposal.arguments,
            agent={"id": agent_id, "role": "support"},
            context={
                "user_input": safe_input,
                "source": "external_user",
            },
        )

        if decision.outcome == "require_approval":
            approval = veto.approvals.wait(decision.approval_id, timeout=120)
            if approval.status != "approved":
                yield f"Approval rejected: {approval.note}"
                continue
        elif decision.outcome != "allow":
            # Fail closed: deny and any unrecognized outcome both block the call.
            yield f"Blocked: {decision.reason}"
            continue

        # Reached only on allow or an approved require_approval.
        yield TOOLS[proposal.tool](**proposal.arguments)

Notice the context.source field. Marking the input as external_user lets the policy treat actions driven by chat input differently from actions driven by internal cron jobs. The same field is what contains indirect injection once content pulled in by read tools carries its own source value. The model can be convinced of anything; the policy cannot.

Step 4: Write the injection-defense policy

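A short YAML bundle covers patterns that recur in published incidents: external email exfiltration, destructive deletes driven by chat input, and large refunds proposed by chat sessions. Keep the rule list narrow at first and grow it as your threat model evolves. Note that send_email and delete_user are not in the ALLOWED_TOOLS set from Step 2, so their rules only take effect once those tools are added there; until then the output validator rejects them first.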

yaml
# policies/injection-defense.yaml
- name: block_external_emails_unless_approved
  match:
    tool: send_email
  rules:
    - if: not args.to.endsWith("@approved.example")
      then: require_approval

- name: never_let_external_input_drive_a_delete
  match:
    tool: delete_user
  rules:
    - if: context.source == "external_user"
      then: deny

- name: cap_refunds_from_chat_sessions
  match:
    tool: refund_order
  rules:
    - if: context.source == "external_user" and args.amount_cents > 10000
      then: require_approval
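
To make the intended outcomes concrete, the three rules can be mirrored in a few lines of plain Python and exercised in unit tests. This is a local stand-in for illustration, not Veto's evaluator. The function name evaluate and the default of allow when no rule matches are assumptions.

```python
def evaluate(tool: str, args: dict, context: dict) -> str:
    """Mirror the three YAML rules above; returns allow, deny, or require_approval."""
    source = context.get("source")

    if tool == "send_email":
        # block_external_emails_unless_approved
        if not str(args.get("to", "")).endswith("@approved.example"):
            return "require_approval"

    if tool == "delete_user":
        # never_let_external_input_drive_a_delete
        if source == "external_user":
            return "deny"

    if tool == "refund_order":
        # cap_refunds_from_chat_sessions
        if source == "external_user" and args.get("amount_cents", 0) > 10000:
            return "require_approval"

    # Assumption: calls that match no rule are allowed.
    return "allow"
```

A table of (tool, args, context) cases checked against a function like this documents what each rule is meant to do, independent of what the model proposes.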

For the broader threat model, read the MCP security guide and the OWASP LLM Top Ten primer in the glossary.

Failure modes to catch

Trusting the model to refuse

System prompts that ask the model to refuse certain actions are wishes, not controls. The defensive assumption should be that models can be coaxed past their own refusals. The deterministic check has to live below the model.

No context.source on the decision

Without marking the input as external, your policy cannot distinguish an internal cron from a user chat. Pass context.source on every decide call.
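
One way to enforce this is a thin wrapper that refuses to call the decision API at all when the source label is missing or unrecognized. The sketch below is illustrative: KNOWN_SOURCES, internal_cron, and decide_with_source are hypothetical names, and decide_fn stands in for veto.decide as it is called in Step 3.

```python
from typing import Any, Callable

# Hypothetical label set; external_user and read_content come from this guide.
KNOWN_SOURCES = {"external_user", "read_content", "internal_cron"}


def decide_with_source(decide_fn: Callable[..., Any], *, tool: str, args: dict,
                       agent: dict, context: dict) -> Any:
    """Call decide_fn only when context carries a recognized source label."""
    source = context.get("source")
    if source not in KNOWN_SOURCES:
        # Fail closed: a missing label is a programming error, not a default.
        raise ValueError(f"context.source missing or unknown: {source!r}")
    return decide_fn(tool=tool, args=args, agent=agent, context=context)
```

Raising instead of defaulting to a trusted label means a forgotten field shows up in tests rather than in production.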

Ignoring indirect injection

If your agent reads webpages, support tickets, or PDFs, the content of those documents is also untrusted input. Mark anything that came from a read tool with source = read_content and apply the same strict rules to actions taken on the back of it.
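
A minimal way to carry that label through a turn is to remember whether any read tool has returned content and, if so, label every later decision as read_content. The sketch below is an assumption about how an application might do this. TurnProvenance, READ_TOOLS, and the tool names fetch_webpage and read_pdf are hypothetical; read_ticket, external_user, and read_content come from this guide.

```python
from dataclasses import dataclass, field

# Hypothetical: tools whose results bring outside content into the context window.
READ_TOOLS = {"read_ticket", "fetch_webpage", "read_pdf"}


@dataclass
class TurnProvenance:
    """Tracks whether untrusted document content has entered the current turn."""
    initial_source: str
    tainted_by: list[str] = field(default_factory=list)

    def record_tool_result(self, tool: str) -> None:
        # Any read tool result may contain hidden instructions.
        if tool in READ_TOOLS:
            self.tainted_by.append(tool)

    def source(self) -> str:
        # Once read content is in context, later actions are labeled read_content.
        return "read_content" if self.tainted_by else self.initial_source
```

The value of source() would be passed as context.source on each decide call. The Step 4 rules test only for external_user, so a policy using this label would need matching conditions for read_content as well.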

Production checklist

  • Input filter catches the known injection patterns and runs before the LLM call.
  • Output validator parses tool calls as structured data and rejects unknown tools.
  • Each governed tool call routes through veto.decide before execution.
  • context.source is set on governed decisions and the policy uses it.
  • Rejections from all three layers feed a single threat-intel log.
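
The last item can be sketched as one helper that every layer calls, so rejections share a format and a destination. This uses only the standard library. The logger name threat_intel, the layer labels, and the threat_event helper are assumptions for illustration.

```python
import json
import logging
from datetime import datetime, timezone

# Hypothetical shared logger; route it to your log pipeline with a handler.
threat_log = logging.getLogger("threat_intel")

# Hypothetical labels for the three layers described in this guide.
LAYERS = {"input_filter", "output_validator", "authorization"}


def threat_event(layer: str, reason: str, **fields) -> str:
    """Build one JSON line describing a rejection and send it to the shared logger."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    event = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "layer": layer,
        "reason": reason,
        **fields,
    }
    # default=str keeps logging from failing on values JSON cannot encode natively.
    line = json.dumps(event, sort_keys=True, default=str)
    threat_log.warning(line)
    return line
```

For example, the output validator could call threat_event("output_validator", "unknown_tool", tool=name) at the point where it returns None.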

FAQ

Can a prompt filter alone stop injection?

No. Filters catch known patterns, while indirect injection (a malicious instruction smuggled through a webpage the agent reads) can walk past them. Input filtering is useful, not sufficient. The deterministic check has to live below the model.

What is the difference between guardrails and runtime authorization?

Guardrails is an umbrella term. Input filtering, output validation, content moderation, and runtime authorization are all guardrails. Runtime authorization is the specific subtype that decides allow or deny on a concrete tool call with concrete arguments, in code outside the model. Rephrasing the prompt does not change a policy decision enforced outside the model.

What if my agent reads documents that might contain hidden instructions?

That is indirect prompt injection. The mitigation is the same: do not let the model decide what to do with the read result. Route each side-effecting tool call it proposes through a Veto decision. The policy can read context.source, treat anything sourced from external document content as untrusted, and deny or gate the higher-stakes tools.
