In late 2025, the security community stopped treating indirect prompt injection as a theoretical risk. It had spent two years as a tidy lab demonstration; then production systems started getting hit. The OWASP Top 10 for LLM applications now ranks prompt injection as the number-one risk, NIST has called indirect injection generative AI’s greatest security flaw, and academic researchers showed that a single poisoned email could coerce a model into exfiltrating SSH keys in up to 80% of trials, with zero user interaction. The attack needs no malicious binary, no phishing clicks, and no anomalous login. The agent simply reads content and takes action, exactly as designed, and the content was written by an attacker.
The most instructive example is ForcedLeak. In September 2025, researchers at Noma disclosed a critical vulnerability chain (CVSS 9.4) in Salesforce’s Agentforce platform: An attacker embedded malicious instructions in the description field of a routine Web-to-Lead form. The text sat harmlessly in the CRM until an employee later asked the AI agent to process that lead, at which point the agent dutifully executed both the legitimate query and the attacker’s hidden payload, exfiltrating sensitive CRM data to an external server. The detail that should keep you up at night is that the exfiltration destination was a domain still on Salesforce’s trusted allowlist, one that had expired and which the researchers re-registered for about five dollars. Every security control saw legitimate traffic to a trusted domain. Nothing looked wrong.
If your instinct reading that is “we filter for prompt injection,” you’re defending the wrong perimeter. Input filtering is necessary but nowhere near sufficient. The uncomfortable truth is that the injection isn’t the breach; the action is. And almost everything we call “AI security” is aimed at the wrong half of that sentence.
The defense everyone is building
Ask most enterprise AI teams how they secure their agents, and you’ll hear a consistent answer: They sanitize inputs. They harden system prompts with elaborate instructions to ignore conflicting directives. They run classifiers over incoming content to flag adversarial patterns. Some have adopted the more sophisticated training-time defenses the frontier labs have published—instruction hierarchies that teach a model to assign differential trust to different sources and reinforcement-learning approaches that harden models against injection in agentic contexts.
All of this is good work, and none of it should be abandoned. But notice what every one of these techniques shares. They all try to stop the model from being fooled. They assume that if we make the model robust enough at the input layer, the system is safe. That assumption is the vulnerability.
We’ve spent two years trying to make the model unfoolable. The systems that survive contact with production assume it will be fooled anyway.
Why the input layer is the wrong perimeter
Prompt injection isn’t a bug a future model will lack. It’s a structural property of how language models work. The model consumes a single undifferentiated stream of tokens at the moment of inference. Your instructions, the retrieved document, the tool output, and the web page just fetched are indistinguishable channels collapsed into one context. There’s no hardware-enforced boundary between “trusted instruction” and “untrusted data” the way there is between kernel space and user space in an operating system.
This is why the attack surface explodes the moment an agent becomes agentic. A chatbot that only talks is a contained risk. An agent that retrieves from the open web, reads email, queries databases, and calls APIs ingests adversarial content from a dozen sources on every turn, and any one of them can carry an instruction. Researchers cataloging real agent ecosystems have already found hundreds of malicious third-party extensions performing data exfiltration and silent injection without any user awareness. These aren’t laboratory curiosities. They’re the production environment.
So, if you can’t guarantee the model will never be fooled—and you can’t—then architecture that depends on it never being fooled is built on sand. You need a second principle, one distributed systems engineers have understood for decades.
Verify, then trust
The principle is simple to state and hard to retrofit: An agent’s proposed action should be validated against an external, deterministic policy before it executes, regardless of why the agent proposed it. The validator doesn’t ask whether the instruction that produced the action was legitimate. It doesn’t try to detect the injection. It asks a different and far more answerable question: Is this action, on its face, permitted?
This inverts the burden. Detecting a cleverly disguised malicious instruction is open-ended because the adversary gets to be arbitrarily creative. Checking whether a wire transfer exceeds a hard dollar limit is a closed problem with a definite answer. We move the security decision from where the attacker has infinite freedom to where they have almost none.
Crucially, the check must be deterministic code, not another model asking, “Does this look dangerous?” The moment you ask a second LLM to adjudicate, you’ve reintroduced the exact same vulnerability one layer down. The enforcement layer is boring, auditable conventional software, and that’s the point.
Here’s what it looks like in practice. An agent managing procurement proposes an action, and a runtime contract evaluates it before anything reaches a real API:
# agent_contract.yaml
agent_id: “procurement_executor_07”
role: “EXECUTOR”
policy:
approve_invoice:
max_amount_usd: 50000
allowed_vendors: from_approved_registry
require_human_above_usd: 10000
# Runtime, on a proposed action:
ACTION approve_invoice(vendor=’Acme’, amount=1200000)
REJECTED policy violation: max_amount_usd
proposed 1,200,000 / limit 50,000
action discarded, human notified, no API call made
The injected instruction at 2:14am never matters here. The agent can be perfectly, catastrophically fooled, and the wire transfer still doesn’t happen, all because a simple deterministic check stood between the model’s output and the outside world, and the proposed action failed it.
This only works if the action arrives structured, which makes structure a precondition.
The contract inspects approve_invoice (vendor, amount) cleanly only because the action is already typed. If the agent emits prose, “please approve the Acme invoice,” something has to parse it, and the only thing that parses open language is another LLM, so the indeterminacy walks back in. That dictates the design.
A consequential action must cross the boundary as a typed tool call, never as free text. Where the input is unavoidably natural—an email saying, “Wire them their balance” for example—let the model extract a structured value but never let its extraction be self-authorizing. The model proposes the amount; the gate still checks it against the limit, the vendor registry, and the actual balance in the system of record, not the number the email asserted. Extraction is probabilistic, while validation stays deterministic.
A few decisions are pure judgment with no schema, such as “Is this email phishing?” There the model stays in the loop. You bound the consequences instead, with reversibility and human review above a threshold. Contracts protect parameterizable actions, and unparameterizable judgments fall back to containment.
The architecture this implies
Once you accept that the action layer is where security lives, three design commitments follow, and they map almost directly onto principles that hardened distributed systems years ago.
Least privilege for agents, scoped to the action, not the agent. The naive version assumes you can predict what an agent will do and provision it accordingly. For a specialized agent you can: One that only summarizes has no business holding a credential that moves money. But the agents people actually reach for are general. In a single session, I might ask a coding agent to summarize a file, write code, execute it, and query company data—four tasks with four risk profiles, none of which are enumerated in advance. Static least privilege collapses the moment one identity spans that range.
The fix is to make privilege a property of the action, not the agent. The agent holds no dangerous capability by standing grant; it requests narrow, transient elevation per action, which the same deterministic gate approves or denies. Reading a document is auto-approved; querying the warehouse is not. The dangerous credential exists only for the instant the action is permitted, then evaporates. One caveat: This governs what an agent may reach but not what the code it writes then does. Executing code can be gated as a capability, but what executes still needs containment, sandboxing, and egress control, because generativity is a different problem from access.
Zero trust for machine identities. Every action an agent takes should be authenticated and authorized as if it came from an untrusted actor, because, functionally, it might be acting on an attacker’s instructions. The proliferation of agents has expanded the attack surface faster than most identity systems were designed to handle, and treating agent traffic as inherently trusted because it originates inside your own system is precisely the mistake.
Capability contracts at the boundary. Every consequential action passes through a deterministic gate that encodes what is allowed, dollar limits, rate limits, allowlisted destinations, mandatory human review thresholds. The contract is version-controlled, auditable, and lives entirely outside the model.
The trap of normalized deviance
The quieter organizational danger is the slow accumulation of false confidence from connecting insecure agents to real systems and watching nothing bad happen. . .for a while. Researchers have warned about indirect injections for years, but most deployments have gotten away with it. Each uneventful day makes the next risky connection feel safer. This is the normalization of deviance. Every system that eventually failed catastrophically felt the same way: fine, fine, fine, until it wasn’t.
The teams that will weather the coming wave of agent incidents aren’t the ones with the cleverest input filters. They’re the ones who assumed compromise from the start and built the boring enforcement layer anyway, the ones who decided that an agent’s autonomy ends precisely at the point where it tries to do something irreversible.
Where to start on Monday
You don’t need to rearchitect everything. Start by inventorying the actions your agents can take, and sort them by blast radius: What’s the worst thing that happens if this action fires when it shouldn’t? For every high-blast-radius action, write a deterministic contract that gates it and put a human in the loop above a threshold you can defend to your risk team. Then, and only then, keep hardening your inputs.
Prompt injection won’t be solved at the input layer, because it can’t be. But it can be rendered survivable at the action layer, where deterministic code gets the final word. The model’s job is to be useful. Your architecture’s job is to make sure that when the model fails—or worse, when it has been turned against you—the failure stops at the gate.

