KCC 10.1

Lethal Trifecta

An agent that combines untrusted content, private data access, and external communication is dangerous regardless of how careful its individual decisions appear.

Special Concerns · Prompt injection · Declared tools · HITL · Sandbox isolation
Created 2026-06-08 · v0.4.0

Description

Adapted from Simon Willison's framing. An agent that combines all three of the following traits is dangerous regardless of how careful its individual decisions appear:

| Trait | Examples |
| --- | --- |
| Access to untrusted content | User text, web content, ticket descriptions, emails, external repos, search results |
| Access to private data | Codebases, credentials, customer data, internal documents, financial systems |
| Ability to communicate externally | Send emails, call external APIs, post to forums, create issues, write public files |

The danger is prompt injection: untrusted content includes instructions that cause the agent to exfiltrate private data via external communication. Real incidents have occurred in 2025-2026, including documented npm registry compromises via GitHub Issues prompt injection.

[Diagram: three traits that are safe alone and dangerous together; detect, then isolate.]

How KCC Handles This

Surface 4 (Declared Tools) requires each tool to declare three flags: untrusted_content, private_data_access, and external_communication. The kernel inspects the full tool list at registration. If the tools together cover all three categories, even when no single tool does, the agent is flagged lethal-trifecta. The flag has three consequences: Surface 6 is forced to HITL regardless of the declared mode, the Butler always escalates these invocations, and promotion to HOTL requires explicit kernel maintainer approval, logged as an exception.
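The registration-time check described above can be sketched as follows. This is a minimal illustration, not the KCC kernel's actual code: the three flag names come from the text, while `ToolDeclaration`, `is_lethal_trifecta`, `effective_mode`, and the `maintainer_exception` parameter are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ToolDeclaration:
    """Hypothetical shape of a declared tool; only the three flags come from the text."""
    name: str
    untrusted_content: bool = False
    private_data_access: bool = False
    external_communication: bool = False


def is_lethal_trifecta(tools: Sequence[ToolDeclaration]) -> bool:
    """True when the tool list as a whole covers all three categories.

    No single tool needs all three flags; the check is over the full list.
    """
    return (
        any(t.untrusted_content for t in tools)
        and any(t.private_data_access for t in tools)
        and any(t.external_communication for t in tools)
    )


def effective_mode(
    tools: Sequence[ToolDeclaration],
    declared_mode: str,
    maintainer_exception: bool = False,  # assumed stand-in for the logged approval
) -> str:
    """Force HITL for flagged agents unless a maintainer exception is recorded."""
    if is_lethal_trifecta(tools) and not maintainer_exception:
        return "HITL"
    return declared_mode


tools = [
    ToolDeclaration("read_ticket", untrusted_content=True),
    ToolDeclaration("read_repo", private_data_access=True),
    ToolDeclaration("send_email", external_communication=True),
]
```

With this list, `effective_mode(tools, "HOTL")` yields `"HITL"`, while dropping any one of the three tools leaves the declared mode untouched. Evaluating the whole list rather than each tool is the point of the sketch: three individually harmless tools still trip the flag.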

Detect, Then Isolate: The Runtime Sandbox

Reference pattern, non-normative. Forcing HITL governs the decision to proceed; it does not contain the runtime while the agent works. When the three conditions trip, the cell may run the agent inside a sandbox with a declared toolchain, a per-session boundary, a write-allowlist (writes refused outside a scratch volume), a network-allowlist (no egress by default), and a human-approval gate for any sensitive capability beyond the allowlist. Auto-launching a sandbox is safe — where auto-remediation would not be — because isolation never makes the agent more capable; it only removes reach.
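A rough sketch of the write-allowlist and network-allowlist checks might look like the following. Everything here is an assumption for illustration: `SandboxPolicy`, its fields, and the example paths and hosts are hypothetical, and a real sandbox would enforce these limits at the OS or container layer rather than in application code.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class SandboxPolicy:
    """Hypothetical per-session policy: one scratch volume, egress denied by default."""
    scratch_root: PurePosixPath
    allowed_hosts: frozenset = frozenset()  # empty set means no egress at all

    def allows_write(self, path: str) -> bool:
        """Permit writes only at or below the scratch volume."""
        target = PurePosixPath(path)
        # Reject relative paths and any ".." segment instead of trying to
        # normalise them; this lexical check does not account for symlinks.
        if not target.is_absolute() or ".." in target.parts:
            return False
        return target == self.scratch_root or self.scratch_root in target.parents

    def allows_egress(self, host: str) -> bool:
        """Permit outbound traffic only to explicitly allowlisted hosts."""
        return host.lower() in self.allowed_hosts


policy = SandboxPolicy(PurePosixPath("/scratch/session-1"))
```

Under this policy, a write to `/scratch/session-1/out.txt` is allowed, a write to `/etc/passwd` is refused, and every egress request is refused until a host is added to `allowed_hosts`. Anything outside both allowlists would go to the human-approval gate rather than being granted automatically.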

Why It Cannot Be Solved By "Smarter" Prompting

Instructing the model to refuse prompt-injection attempts does not work reliably: injections are designed to bypass exactly these instructions, and the model cannot reliably distinguish system instructions from instructions embedded in the content it processes. The structural answer is more reliable: constrain what the agent can do via declared tools. The agent cannot exfiltrate data it cannot reach.

Don't trust the model to refuse the attack. Restrict what the model can do.
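To make the structural point concrete, here is a minimal sketch of a dispatcher that exposes only declared tools. `ToolRegistry`, `UndeclaredToolError`, and the example tool names are hypothetical and not part of KCC. The idea is that a capability that was never registered cannot be invoked, whatever the prompt says.

```python
from typing import Callable, Dict


class UndeclaredToolError(PermissionError):
    """Raised when the agent requests a tool that was never declared."""


class ToolRegistry:
    """Hypothetical dispatcher: the agent can reach only what is registered here."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., object]] = {}

    def declare(self, name: str, fn: Callable[..., object]) -> None:
        self._tools[name] = fn

    def invoke(self, name: str, *args: object, **kwargs: object) -> object:
        try:
            fn = self._tools[name]
        except KeyError:
            # The refusal is a property of the registry, not of model judgment.
            raise UndeclaredToolError(f"tool {name!r} is not declared") from None
        return fn(*args, **kwargs)


registry = ToolRegistry()
registry.declare("summarize", lambda text: text[:10])
```

If injected content persuades the model to request `send_email`, `registry.invoke("send_email", ...)` raises `UndeclaredToolError` because no such tool exists in this cell. The refusal does not depend on the model recognising the attack.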