Lethal Trifecta
An agent that combines untrusted content, private data access, and external communication is dangerous regardless of how careful its individual decisions appear.
Description
Adapted from Simon Willison's framing. The three traits are:
| Trait | Examples |
|---|---|
| Access to untrusted content | User text, web content, ticket descriptions, emails, external repos, search results |
| Access to private data | Codebases, credentials, customer data, internal documents, financial systems |
| Ability to communicate externally | Send emails, call external APIs, post to forums, create issues, write public files |
The danger is prompt injection: untrusted content carries instructions that cause the agent to read private data and send it out through an external channel. Real incidents have occurred in 2025-2026, including documented npm registry compromises via GitHub Issues prompt injection.
Each trait is safe alone; the three are dangerous together. The response is to detect the combination, then isolate it.
How KCC Handles This
Surface 4 (Declared Tools) requires each tool to declare untrusted_content, private_data_access, and external_communication. The kernel inspects the full tool list at registration. If the tools collectively cover all three categories, even when no single tool does, the agent is flagged lethal-trifecta. The flag has three consequences: Surface 6 is forced to HITL regardless of the declared mode, the Butler always escalates the agent's invocations, and promotion to HOTL requires explicit kernel maintainer approval, logged as an exception.
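The registration check described above can be sketched roughly as follows. This is a minimal illustration, not KCC's actual kernel code: the `ToolDeclaration` class, the `is_lethal_trifecta` and `effective_mode` functions, and the example tool names are all hypothetical. Only the three declaration fields and the HITL/HOTL mode names come from the text. The maintainer-approved exception path is left out.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ToolDeclaration:
    """Hypothetical shape of a Surface 4 tool declaration."""
    name: str
    untrusted_content: bool = False
    private_data_access: bool = False
    external_communication: bool = False


def is_lethal_trifecta(tools: Iterable[ToolDeclaration]) -> bool:
    """True when the tool list as a whole covers all three categories."""
    tools = list(tools)
    return (
        any(t.untrusted_content for t in tools)
        and any(t.private_data_access for t in tools)
        and any(t.external_communication for t in tools)
    )


def effective_mode(declared_mode: str, tools: Iterable[ToolDeclaration]) -> str:
    """Force HITL for a flagged agent, whatever mode it declared."""
    return "HITL" if is_lethal_trifecta(tools) else declared_mode


# Hypothetical tools: no single one has all three traits, but the set does.
tools = [
    ToolDeclaration("read_ticket", untrusted_content=True),
    ToolDeclaration("query_customer_db", private_data_access=True),
    ToolDeclaration("send_email", external_communication=True),
]
print(effective_mode("HOTL", tools))      # HITL
print(effective_mode("HOTL", tools[:2]))  # HOTL
```

The check runs over the whole list rather than per tool because the risk comes from the combination: removing any one of the three hypothetical tools above clears the flag.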
Detect, Then Isolate: The Runtime Sandbox
Reference pattern, non-normative. Forcing HITL governs the decision to proceed; it does not contain the runtime while the agent works. When the three conditions trip, the cell may run the agent inside a sandbox with a declared toolchain, a per-session boundary, a write-allowlist (writes refused outside a scratch volume), a network-allowlist (no egress by default), and a human-approval gate for any sensitive capability beyond the allowlist. Auto-launching a sandbox is safe — where auto-remediation would not be — because isolation never makes the agent more capable; it only removes reach.
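As a rough sketch of the allowlist checks such a sandbox might apply, assuming a POSIX-style scratch volume mounted at `/scratch`: the `SandboxPolicy` class, the `decide` function, the decision strings, and the host names are hypothetical and only illustrate the pattern. The routing is also an assumption: writes outside the scratch volume are refused outright, while anything else beyond the allowlists goes to the human-approval gate.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class SandboxPolicy:
    """Hypothetical per-session sandbox policy."""
    scratch_root: PurePosixPath = PurePosixPath("/scratch")
    allowed_hosts: frozenset = frozenset()  # empty set: no egress by default

    def may_write(self, path: str) -> bool:
        p = PurePosixPath(path)
        # Reject relative paths and any ".." component (no traversal).
        if not p.is_absolute() or ".." in p.parts:
            return False
        return p == self.scratch_root or self.scratch_root in p.parents

    def may_connect(self, host: str) -> bool:
        return host.lower() in self.allowed_hosts


def decide(policy: SandboxPolicy, action: str, target: str) -> str:
    """Return 'allow', 'refuse', or 'needs_approval' for a requested action."""
    if action == "write":
        return "allow" if policy.may_write(target) else "refuse"
    if action == "connect":
        return "allow" if policy.may_connect(target) else "needs_approval"
    # Any capability the policy does not know about goes to a human.
    return "needs_approval"


policy = SandboxPolicy(allowed_hosts=frozenset({"api.internal.example"}))
print(decide(policy, "write", "/scratch/out.txt"))        # allow
print(decide(policy, "write", "/scratch/../etc/passwd"))  # refuse
print(decide(policy, "connect", "evil.example"))          # needs_approval
```

Every branch either allows something already on an allowlist or narrows what the agent can do, which matches the point that isolation only removes reach.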
Why It Cannot Be Solved By "Smarter" Prompting
Instructing the model to refuse prompt-injection attempts does not work reliably: injections are designed to bypass exactly these instructions, and the model cannot reliably distinguish its system instructions from instructions embedded in the content it processes. The structural answer is more reliable: constrain what the agent can do via declared tools. The agent cannot exfiltrate data it cannot reach.
Don't trust the model to refuse the attack. Restrict what the model can do.