Context Pollution in Long-Running Agents

While looking at building a hardened personal agent, I began by exploring the most attractive attack vectors to defend against, and a first approach to mitigating them.

One failure mode I've been particularly interested in is what happens when an agent's context lives through 10, 20, 50 compaction cycles. If you've run long dev sessions with Claude Code, you may have noticed that after a while things get weird, especially if you mix tasks (which an OpenClaw kinda guy will, by definition, do).

What I found in the literature, at least, is that agent performance degrades through now well-documented mechanisms as the context window fills and isn't thoughtfully managed.

My main motivation is preventing behavioural and personality drift. While I can't (yet?) apply Anthropic's "Brain Surgery" approach, one thing is clear from their study: drift happens through context accumulation. A fresh agent with no context resets to its default personality and behaviour, so it's clear to me, at least in theory, that careful context management is my way, as a harness developer, to prevent my lil' bot from going insane and wiping my emails.

The standard fix used by coding harnesses like Claude Code, compaction, is lossy and unpredictable in ways that long-running, multi-task agents may find detrimental. Worse, it's a mechanism attackers can exploit for persistence by crafting the right injection language.

This is the context pollution problem: the signal rots over time, and a naïve approach to handling it can make things worse.

The Longer It Runs, the Worse It Gets

The intuition is simple: more material in the context window means more competition for the model's attention. The research quantifies how bad it actually gets.

What happens when you give the model everything it needs to answer correctly, but surround it with additional material? Even with perfect retrieval — the answer sitting right there in the context — Du et al. found performance degraded between 13.9% and 85% depending on the model. Length alone, without distracting content or conflicting instructions, is sufficient to cause this.

Lost in the Middle documented a U-shaped attention curve: models perform best on information near the beginning and end of the context, worst on information in the middle. At sufficient length, performance on middle-positioned information drops below closed-book — the model does worse with the answer in context than with no context at all.

For diverse-task agents, the problem compounds through task-switching interference. Prior-task context can actively pull the model away from desired behaviour. One study measured a single agent dropping from 73.1% to 16.6% accuracy at 80 sequential tasks; even a multi-agent setup degraded from 90.6% to 65.3% because orchestration context still accumulated. Multi-turn interactions show similar patterns: 39% average performance drop, with unreliability more than doubling. Choppy and unpredictable, apparently.

Safety Degrades at Length

Safety is just one more dimension of performance, and it, too, degrades over time. Add to that: context pollution is a persistence vector (more on this later). A personal agent that needs to be truly reliable needs a solid context-cleanup schedule — naive compaction won't cut it.

Most models score below 50% safe response rate at extended context, including on straightforward cases. A broader evaluation (ACL 2025) confirmed this: most fell below 55%, with safety degrading faster than general capability as context grew.

"When Refusals Fail" found that safety behavior shifts unpredictably and in opposite directions at long context. GPT-4.1-nano went from a 5% refusal rate to 40% — becoming overly cautious. Grok 4 Fast went from 80% to 10% — becoming far less cautious. Same test conditions. Opposite behavioral shifts. Can we predict in which direction a given model will degrade?

Anthropic's many-shot jailbreaking research adds a structural explanation. The effectiveness of in-context examples at overriding safety training follows a power law: more examples, more effect, relentlessly. Alignment training raises the number of shots an attacker needs before the jailbreak lands, but it doesn't change the shape of the curve: past that point, effectiveness keeps scaling by the same power law.
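To make the shape concrete, here's a toy model of that scaling. The constants are made up for illustration (the paper's actual per-model fits differ), but the qualitative point holds: alignment lowers the curve's intercept without flattening its exponent.

```python
def attack_success_probability(n_shots: int, c: float, alpha: float) -> float:
    """Illustrative power-law model: success grows as c * n^alpha, capped at 1.
    c and alpha are invented constants, not values from the paper."""
    return min(1.0, c * n_shots ** alpha)

# Alignment training lowers the intercept c (more shots needed to land),
# but the power-law shape, and therefore the eventual breach, persists.
baseline = [attack_success_probability(n, c=0.02, alpha=0.6) for n in (1, 8, 64, 256)]
aligned = [attack_success_probability(n, c=0.005, alpha=0.6) for n in (1, 8, 64, 256)]
```

Under this model, "aligned" is uniformly below "baseline" at every shot count, yet both climb monotonically: the defense buys margin, not immunity. That matches the structural worry for long sessions, where accumulated context keeps supplying shots for free.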

In a personal agent running for hours, accumulated content can serve the same function as those carefully crafted examples — through sheer volume rather than deliberate intent. The model has processed enough varied instructions, tool outputs, and external content that the safety constraints from the system prompt carry less weight relative to everything else.

System/user prompt separation — the mechanism most frameworks rely on for instruction hierarchy — fails to establish reliable priority even with straightforward conflicts. At long context, the hierarchy weakens further.

Compaction Needs to Be Thoughtful

The obvious response to context growth is to compress it. Most agent frameworks implement some form of compaction — summarizing the conversation to free up space for new work. But compaction creates a different, less predictable signal rather than restoring the original.
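To be concrete about what "compaction" means mechanically, here's a minimal sketch of the summarize-and-replace pattern. The threshold, the crude tokenizer, and the `summarize` stub are my assumptions, not Claude Code's actual implementation:

```python
def count_tokens(messages):
    # Crude stand-in for a real tokenizer: roughly 1 token per 4 characters.
    return sum(len(m["content"]) for m in messages) // 4

def summarize(messages):
    # Stand-in for an LLM summarization call. This is the lossy step:
    # whatever it drops (file state, failure signals, safety constraints)
    # is gone for the rest of the session.
    return "Summary of %d earlier messages." % len(messages)

def maybe_compact(messages, limit=100_000, keep_recent=10):
    """Naive compaction: once the window fills, replace everything but
    the last few turns with a single summary message."""
    if count_tokens(messages) < limit:
        return messages
    head, tail = messages[:-keep_recent], messages[-keep_recent:]
    return [{"role": "user", "content": summarize(head)}] + tail
```

Everything in `head` now exists only as whatever `summarize` chose to keep, and the agent has no way to recover what was discarded. That single irreversible call is where the rest of this section's problems live.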

Evaluating context compression approaches, Factory.ai found the best performers scored 2.45 out of 5 on file tracking — keeping track of which files had been created, modified, or deleted during the session. Compression ratios ran between 98.6% and 99.3%. The agent loses track of what it's done.

The Complexity Trap found something counterintuitive: summarization caused trajectory elongation. Agents that had their context summarized took longer to complete tasks, apparently because the summarization stripped the context that would have told the agent it was stuck or going in circles. It lost awareness of its own failure state.

Agents are already stochastic systems. Compaction adds variance on top of this baseline unpredictability. How sensitive are they to context changes? A single space character can change over 500 predictions; switching between a YAML list and a Python list costs 3–6% accuracy. Compaction restructures far more than a space — it reshapes the entire context. Information compression is explicitly identified as a source of stochasticity in agent behavior. Every time you compact, you roll the dice on what the agent does next.

This shows up in production. Claude Code users have documented the effects. GitHub issue #13112 reports context loss after compaction — the agent losing track of what it was working on. Issue #13919 describes skills awareness being lost at around 55,000 tokens — the agent forgetting capabilities it had been using successfully minutes earlier. Reports of 45% coherence loss at 80–95% context saturation paint a picture of agents that degrade sharply as they approach their limits, then degrade differently after compaction resets the window.

The safety implications compound the problem. If safety behavior already shifts unpredictably at long context, and compaction restructures the entire context unpredictably, the result is a system whose safety properties you can't reason about at session length. The "When Refusals Fail" finding — that models shift in opposite directions — means compaction can push safety behavior either way. You might compact and get a safer agent. You might compact and get a less safe one. No way to know in advance.

Drift No More? offers the most useful framing I found for thinking about this. Agent drift — the gradual deviation from intended behavior — behaves more like a controllable equilibrium than an irreversible decay. The drift doesn't accumulate without bound; it stabilizes at a finite level. And periodic reminder interventions — explicitly restating the original goal at regular intervals — reliably pulled agents back toward intended behavior, reducing divergence by 6–12% and improving alignment quality scores by 16–27% across the models they tested.
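That reminder intervention is cheap to implement in a harness. A minimal sketch, assuming a message-list harness and a fixed turn interval (the interval and the wording are mine, not the paper's):

```python
REMINDER_INTERVAL = 10  # turns between goal restatements; tune per model

def with_goal_reminder(messages, original_goal, turn):
    """Periodically re-inject the original goal so it sits near the end
    of the context, where the U-shaped attention curve says the model
    attends best."""
    if turn > 0 and turn % REMINDER_INTERVAL == 0:
        reminder = {
            "role": "user",
            "content": "Reminder of the original goal: " + original_goal,
        }
        return messages + [reminder]
    return messages
```

The paper's future-work direction, adaptive prompting, would replace the fixed `REMINDER_INTERVAL` with a drift detector that fires reminders only when divergence is measured, but the fixed-interval version already captures most of the reported gain.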

So, yeah, thoughtful compaction seems to be the way. You can't eliminate drift, but you can lower its equilibrium level — the question shifts from "how do I prevent this?" to "how do I shift the equilibrium down?" The paper recommends periodic goal restatement at fixed turn intervals, and their future work points toward adaptive prompting — timing reminders based on detected drift rather than a fixed schedule. The distinction between "make the context shorter" and "make the context cleaner" is the foundation of the cleaner agent approach I describe in the companion article.

Attackers Can Use This

Everything above happens without an adversary. Someone actively trying to exploit these dynamics can do worse.

The attack surface is specific: compaction and summarization are lossy transformations that an attacker can predict and shape. If you know the agent will summarize its history, you can craft payloads designed to survive that summarization.

Unit 42 demonstrated this against AWS Bedrock. The attack targets the session summarization step specifically. An injected instruction, planted in content the agent processes, is crafted to survive compaction by mimicking the format of high-priority user context. It persists into long-term memory. From there, it influences every future session — a silent instruction the user never sees. The exfiltration is slow and invisible: data leaked in small amounts across many sessions, never enough to trigger rate limits or anomaly detection.

Rehberger's SpAIware demonstrated the complete chain against ChatGPT. An indirect prompt injection triggers the agent to manipulate its own persistent memory through the summarization pathway. Once the instruction is in memory, the agent exfiltrates data continuously across all future sessions — from a single injection.

ZombieAgent (2026) takes this further — the payload re-writes itself into memory each session, reconstructing from partial fragments even after cleaning. Microsoft discovered over 50 real-world injection payloads from 31 companies in "Summarize with AI" buttons. The memory-focused attacks are equally mature: MINJA (NeurIPS 2025) achieved 98.2% injection success through query-only interaction, and AgentPoison (NeurIPS 2024) demonstrated 82% retrieval success with less than 0.1% of the memory poisoned. The poisoning is invisible to normal operation.

The asymmetry is what makes this dangerous. Compaction is biased in the attacker's favor. Safety constraints — "always confirm before deleting files," "never send credentials over email" — look like procedural overhead to a summarizer. An attacker's payload is crafted to look like important user context: a preference, a standing instruction, a key decision. The summarizer strips the safety constraint because it seems redundant. It preserves the injection because it seems important.

The defenses emerging from this research converge on a few principles I'll explore further.

Context pollution is a convergence of three problems. Quality degrades because context length and task diversity overwhelm the model's attention. Safety degrades because the mechanisms that enforce safe behavior weaken at length and shift unpredictably after compaction. Security degrades because compaction creates a persistence vector that attackers can exploit — and the lossy nature of summarization favors well-crafted payloads over safety constraints.

The "Drift No More?" finding — that drift is a controllable equilibrium — suggests the problem is addressable with the right intervention. That intervention looks more like reconstruction than summarization: building context from what the agent needs to know right now, rather than compressing what it has accumulated.
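As a sketch of what reconstruction could look like, assuming the harness maintains structured state alongside the transcript (every field name here is hypothetical, not the companion article's actual schema):

```python
def reconstruct_context(state, system_prompt, current_task):
    """Reconstruction instead of summarization: build a fresh context
    from structured state the harness maintains, rather than
    compressing the accumulated transcript."""
    parts = [
        system_prompt,  # safety constraints re-enter verbatim, not summarized
        "Current task: " + current_task,
        "Files touched: " + ", ".join(state["files_touched"]),
        "Key decisions: " + "; ".join(state["decisions"]),
        "Open questions: " + "; ".join(state["open_questions"]),
    ]
    return "\n\n".join(parts)
```

The design point is the first line: safety constraints come back verbatim from the system prompt on every rebuild instead of having to survive a lossy summarizer, which is exactly the asymmetry the compaction-persistence attacks exploit.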

In the companion article, I describe an attempt at this — a context cleaner agent that reconstructs. It's a fourth agent role with no tools and no filesystem access, designed to address the problems documented here. Whether it works well enough is an open question. But the problem it's trying to solve is well-established.

— A