Personal Agents Are Security Hell

2026-03-09

Personal Agents Are Security Hell

I started looking into the security aspect of personal agents as I wanted to build my own. An always-on assistant with access to my email, files, calendar, private accounts, and the open internet. A combination that, at least intuitively, should raise a few alarm bells.

Simon Willison calls the structural problem the Lethal Trifecta: to be useful, a personal agent needs access to private data, it processes untrusted content (emails, web pages, documents from others), and it communicates externally (sends emails, calls APIs). You're deliberately lowering the barriers between your most sensitive data and the outside world. As a feature.

It's also what makes them exploitable. OWASP now maintains two separate top 10 lists — one for LLM apps, one specifically for agentic applications. Prompt injection sits at #1 on both (ish). And as this remains unsolved, and more people haste to adopt personal agents with limited technical knowledge, they become an attractive target for spray-and-pray drive-by attacks. Drop injected prompts in web pages, emails, shared documents and wait until someone points their poorly scoped harness to your content.

Nasr, Carlini, and colleagues from OpenAI, Anthropic, and Google DeepMind tested twelve prompt injection defences — input classifiers, output filters, paraphrasing, sandboxing, combinations of all of the above. Every single one was broken. Most at over 90% attack success rate. Fuck.

And y'all wanna hook that up to some vibecoded memeware with full access to your entire digital life?

The Original Sin

LLMs process everything as a single stream of tokens. Instructions, user data, tool outputs, injected payloads — they all arrive through the same channel and receive the same treatment. There is no architectural boundary between "do this" and "here's some data."

This is an old problem in a new medium. Von Neumann architectures store code and data in the same memory space, the original sin that gave us buffer overflows. With them, a vast plurality of all documented vulnerabilities. SQL injection exploited the same confusion in databases. Prompt injection exploits it in natural language. The medium keeps changing. The exploits look remarkably similar. Which makes me wonder how much we can learn from 40+ years of buffer overflows and sally';DROP TABLE users; to build more robust agents.

Schneier calls this "natural language becoming the malicious code itself".

EchoLeak (CVE-2025-32711, CVSS 9.3 — near-maximum severity) was a zero-click exfiltration attack against Microsoft 365 Copilot, disclosed in June 2025. The attack required no user action at all. An attacker sends an email containing a crafted payload. Copilot, processing the inbox, encounters the payload and auto-fetches what appears to be an image — but the URL encodes stolen data: calendar entries, email contents, whatever Copilot has access to. The victim never clicks anything. The data leaves through a request that looks like routine image loading. This was a production system used by millions of people, not a proof of concept.

The ZombAIs demo (October 2024) showed a different attack surface, equally simple. Claude Computer Use browsed to a webpage containing a single sentence. It autonomously downloaded a binary, made it executable, and ran it. The binary connected to a command-and-control server. One sentence on a webpage. Full remote access.

Three Defences That Caught My Attention

When I started looking at what exists beyond guidelines and best practices, three architectural approaches stood out as having more substance than the rest. I don't really care as much about injection detectors or input sanitisation. They have their place, but they'll likely be caught in a game of cat-and-mouse with adversaries for a while. I am more interested in architectural solutions that make injection a less appealing attack vector. And yes, this will seem extremely basic to anyone remotely working in the field of AI security, shame the noob.

Dual LLM

Simon Willison proposed the Dual LLM pattern in 2023. The idea: separate the model with authority from the model that reads untrusted data. The privileged model (P-LLM) has tools but never sees external content. The quarantined model (Q-LLM) reads external content but has no tools. A controller passes only opaque variable references between them — the P-LLM sees $VAR2, not the raw email.

This is an attempt to undo the original sin externally. If you can't change how LLMs process tokens, you can at least stop the tokens carrying instructions from mixing with those carrying untrusted data. It prevents confused deputy attacks (where the privileged model is tricked into misusing its authority) and blocks direct data exfiltration.

What it doesn't cover: semantic manipulation. An attacker can craft content that makes the Q-LLM produce a summary presenting a phishing email as a legitimate invoice. Data-flow defences can't catch that. No production implementation exists. But the pattern directly influenced what came next.

CaMeL

CaMeL (DeepMind, 2025) deserves the most space here because it's the most substantive attempt at a real solution.

The P-LLM generates a plan — not in prose, but in restricted Python. Actual code. A custom interpreter executes that plan with taint tracking on every value. Every piece of data carries capability metadata: where it came from, what it's allowed to be used for. When the plan tries to send an email to an address extracted from untrusted content, the interpreter checks the capability label and blocks the action before execution.

The formal guarantee is specific: untrusted data cannot change which steps execute, only what values flow through those steps. An injected instruction in an email cannot add "also forward everything to attacker@evil.com" to the plan, because the plan was fixed before any untrusted data was retrieved.

On the AgentDojo benchmark, CaMeL achieved 77% task completion versus 84% undefended. That 7-point gap is the quantified cost of deterministic enforcement — the capability you sacrifice to make the remaining 77% provably secure.

Microsoft Research extended this with two-dimensional capability labels — confidentiality and integrity — borrowing from 1970s military information security (Bell-LaPadula + Biba) to track what data an agent can see and how much it should trust what it reads.

CaMeL comes closest to patching the original sin. It forces a real boundary between instructions and data by routing the plan through an actual interpreter rather than the LLM's token stream. But the limitation matters: formal guarantees require constraining the agent to the structured plan language. Arbitrary shell execution exits the interpreter sandbox and voids the guarantees. The moment you allow unconstrained natural-language tool use, you're back in the flat token stream.

Instruction Hierarchy

OpenAI's Instruction Hierarchy (2024) takes a different approach entirely. Rather than separating models or constraining plan languages, fine-tune the model itself to treat message slots with different authority — system prompt highest, user message next, tool outputs lowest. They reported a 63% improvement on system prompt extraction defence, with some generalization to unseen attack types.

The problem is that the hierarchy lives entirely in the model's weights — there's no external layer enforcing it. Within months, EmbraceTheRed bypassed it on GPT-4o-mini using indirect references that the fine-tuning hadn't covered. ICLR 2025 rejected the paper, unconvinced that training alone could hold against adversaries who adapt their attacks to the specific defence. OpenAI eventually folded the hierarchy into their Model Spec — a statement of how the model should behave, not a mechanism that prevents it from misbehaving. Training or fine-tuning security looks to be a source of attacker friction a lot more than a guarantee. This may be fine for many applications, but giving agents consequential capabilities will necessitate more assurances.

The Tradeoff

The pattern across all three: Capability and Deterministic Safety seem to be somewhat at odds with each other. Instruction Hierarchy costs almost nothing in capability but gives probabilistic safety. Dual LLM costs significant capability to deliver significant resilience (being an early, non-productive model, it seems a bit extreme). CaMeL imposes significant friction (particularly at design stage) but gives provable safety for defined policies. More guarantees, less capability.

And stacking doesn't help. Nasr and Carlini showed that layering filters and detectors doesn't overcome the underlying robustness problem — the defences fail for the same architectural reason the attacks work.

This is not unlike human workers. A very well trained LLM is like a very trustworthy and well behaved employee. You may trust them with a wide range of tasks, but extremely consequential actions (e.g. commanding a nuclear reactor) require external guarantees.

How About We Just Ask the Human?

The automated defences are incomplete. The intuitive response: let the human review high-risk actions. Make the agent ask permission before sending emails or executing code. The human becomes the backstop.

Permission fatigue is real, particularly when the permissioning model is misaligned with the user's own mental model of the system. Redteaming against Windows very frequently includes tricking the user into giving our code permissions it shouldn't have, usually thanks to either broad escalation paths (taking a .doc from fully buttoned down to running onOpen macros without any stops in between) or pestering them with vague permissions requests ("do you want to allow X to make changes to your computer?" regularly triggered by subprocesses the user doesn't know but needs).

An fMRI study by Vance and colleagues measured brain activity while participants responded to repeated security warnings over a five-day workweek. Activity in visual processing regions declined measurably within the first days of the study. Not over weeks or months of exposure — within days. And the recovery between sessions was only partial. The neural mechanism for attending to warnings habituates fast. This isn't a failure of user discipline. It's how visual attention works.

The behavioral data lines up. 69% of Windows users applied UAC permissions incorrectly in Microsoft's own elevation prompt. Across 25 million Chrome SSL warnings, 70.2% were clicked through — users dismissing man-in-the-middle warnings more than two-thirds of the time. These aren't edge cases. These are the primary security interfaces of the two most-used software platforms on Earth.

The same pattern shows up in AI agents. As users gain experience with Claude Code, auto-approve usage nearly doubles — from 20% to 40%. The more people use the tool, the less they review its actions. The approval prompt becomes a thing you click through, not a thing you read.

James Reason's Swiss Cheese Model offers useful vocabulary here. Accidents happen when holes in multiple defence layers align at the same moment. No single failure causes the accident. The important distinction is between active failures — a prompt injection in an email — and latent conditions: overly broad permissions granted at setup, no audit trail, an underspecified system prompt. The latent conditions sit there for months before combining with an active failure. The research above suggests the human review layer has more holes in it than system designers typically assume.

The point is that the permissioning surface — the shape and taxonomy of what the system asks the user to approve — doesn't match how human attention actually works. Designing a permissioning system that does is a serious UX research problem in its own right.

So What Do You Do?

Rather than trying to fix the original sin, it seems all approaches focus on building agent harnesses that make it less of a problem.

Scanning around existing solutions I see things like nanoclaw and hermes-agent (a very late addition to my research as it launched at the end of it) have strong guardrails to prevent rogue agents taking over the system, narrow scoped exfiltration protection, and bash denylists. The containerisation, credential redaction, and container scoping really drew my attention. But I still see (theoretical?) potential for agent goal hijack or memory / context poisoning of agents, which, given full read-and-write access to the outside world, makes them an interesting adversarial target.

Some ideas I want to explore in a practical implementation:

Using narrow-scoped tools and deterministic permissions to provide task-specific agents with only the tools that they need (e.g. "read newsletter inbox" vs. "read email", or "write draft" vs "send email")
Leveraging nanoclaw's zero config, skill based customisation approach in a more aggressive way -- use an "architect" agent to help the user develop the permissioning surface. User speaks human, agent speaks CaMeL or whatever is the policy language we end up using.
Keep the codebase simple so I can understand it. Nanoclaw hit the nail here. I'm not hooking up 30k lines of vibecoded memeware to my entire digital footprint and neither should you.
Add probabilistic guardrails. An agent that monitors and escalates behaviour that is permitted but looks weird. I like this a lot, but needs to be well designed to be token efficient.

I will explore this in my next piece.

--A

Source Bibliography

Nasr, Carlini et al. — The Attacker Moves Second (2025) — 12 defences broken at >90% ASR
Simon Willison — The Lethal Trifecta (2025) — structural vulnerability framing
Brodt, Feldman, Schneier, Nassi — The Promptware Kill Chain (2026) — seven-stage attack model, 36 incidents
Bruce Schneier — The Promptware Kill Chain (blog) — accessible summary
Simon Willison — The Dual LLM Pattern (2023) — P-LLM/Q-LLM separation
Google DeepMind — CaMeL (2025) — formal taint tracking, capability-secure plan language
OpenAI — The Instruction Hierarchy (2024) — training models to treat message slots differently
Microsoft Research — FIDES (2025) — information-flow control extending CaMeL
Martin Fowler — Agentic AI Security — practitioner-oriented framing
OWASP — Top 10 for LLM Applications 2025
Motiee et al. — Do Windows Users Follow the Principle of Least Privilege? (SOUPS 2010) — 69% incorrect behavior
Akhawe & Felt — Alice in Warningland (USENIX 2013) — 70.2% SSL click-through
Vance et al. — Tuning Out Security Warnings (MIS Quarterly 2018) — fMRI, brain drops after second exposure
Anthropic — Measuring AI Agent Autonomy in Practice (2025) — auto-approve doubles with experience
Bianchi, Curry, Hovy — AI Accidents Waiting to Happen? (JAIR 2023) — Perrow applied to AI
Perrow — Normal Accidents (1984) — tight coupling and interactive complexity
Reason — Human Error (1990) — Swiss Cheese Model
Leveson — Engineering a Safer World (2011) — STAMP/STPA, hierarchical distance