KCC 13.2
Butler Scoring
A worked example: the Butler scores a single invocation to decide HITL or HOTL for a high-value, high-risk payment retry feature.
Created 2026-06-08 · v0.4.0
The Scenario
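Cell payments-core wants to generate a spec for a new payment retry feature. The Butler scores the invocation to decide between HITL (human-in-the-loop: a reviewer approves before the agent acts) and HOTL (human-on-the-loop: the agent acts while a reviewer monitors). The reviewer is Alice, who has a strong calibration history.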
value: 0.8            # payment systems are high-value
risk: 0.7             # payment errors have customer impact
complexity: 0.5       # standard retry pattern
confidence: 0.82      # from the agent's expected confidence
trust_history: 0.85   # Alice has strong calibration history
cognitive_load: 0.3   # Alice has 4 reviews so far today
trifecta: false       # no external communication tool involved
Decision Logic
base_score = (
     0.20 * 0.8  +   # value
    -0.25 * 0.7  +   # risk
    -0.15 * 0.5  +   # complexity
     0.20 * 0.82 +   # confidence
     0.15 * 0.85 +   # trust_history
    -0.10 * 0.3      # cognitive_load
)
base_score = 0.160 - 0.175 - 0.075 + 0.164 + 0.1275 - 0.030 = 0.1715 ≈ 0.17
hitl_below = 0.4 ; hotl_above = 0.7
0.17 < 0.4 -> decision = HITL
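The arithmetic above can be checked with a short Python sketch. The weights, thresholds, and inputs are copied from this example; the function names and the `"UNSPECIFIED"` label for scores between the two thresholds are assumptions, since this page does not say how the Butler treats that middle band.

```python
# Weights and thresholds copied from the worked example.
# Negative weights pull the score toward HITL.
WEIGHTS = {
    "value": 0.20,
    "risk": -0.25,
    "complexity": -0.15,
    "confidence": 0.20,
    "trust_history": 0.15,
    "cognitive_load": -0.10,
}
HITL_BELOW = 0.4
HOTL_ABOVE = 0.7


def base_score(inputs):
    """Weighted sum of the six scored inputs."""
    return sum(weight * inputs[name] for name, weight in WEIGHTS.items())


def decide(score):
    """Map a score to a decision using the two thresholds."""
    if score < HITL_BELOW:
        return "HITL"
    if score > HOTL_ABOVE:
        return "HOTL"
    # Assumption: the source does not define behaviour between the thresholds.
    return "UNSPECIFIED"


inputs = {
    "value": 0.8,
    "risk": 0.7,
    "complexity": 0.5,
    "confidence": 0.82,
    "trust_history": 0.85,
    "cognitive_load": 0.3,
}
score = base_score(inputs)
print(f"base_score = {score:.4f}, decision = {decide(score)}")
```

Keeping the weights in one mapping makes it easy to see each input's contribution: here risk contributes -0.175, the largest single term.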
Decision: HITL
Alice reviews before the agent acts. The decision and its inputs are recorded in the decision trace (Surface 9).
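As an illustration of what recording "the decision and its inputs" could look like, the sketch below serializes one trace entry as JSON. The `DecisionTrace` class, its field names, and the `record` helper are hypothetical; the actual Surface 9 schema is not described on this page.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class DecisionTrace:
    """Hypothetical trace entry; field names are illustrative only."""
    cell: str
    reviewer: str
    inputs: dict
    base_score: float
    hitl_below: float
    hotl_above: float
    decision: str
    recorded_at: str  # ISO 8601 timestamp in UTC


def record(trace):
    """Serialize a trace entry to one JSON line with stable key order."""
    return json.dumps(asdict(trace), sort_keys=True)


trace = DecisionTrace(
    cell="payments-core",
    reviewer="Alice",
    inputs={
        "value": 0.8,
        "risk": 0.7,
        "complexity": 0.5,
        "confidence": 0.82,
        "trust_history": 0.85,
        "cognitive_load": 0.3,
        "trifecta": False,
    },
    base_score=0.1715,
    hitl_below=0.4,
    hotl_above=0.7,
    decision="HITL",
    recorded_at=datetime.now(timezone.utc).isoformat(),
)
line = record(trace)
print(line)
```

Storing the thresholds next to the inputs means the entry can be re-evaluated later even if the thresholds have since changed.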
What This Illustrates
- Risk dominates even with good trust: Alice's 0.85 trust and the agent's 0.82 confidence are outweighed by the 0.7 risk of a payment system, which contributes -0.175, the largest single term in the score. Trust earns autonomy in low-risk contexts; high-risk contexts always require human review.
- Confidence is an input, not an override: 0.82 is above threshold, so low confidence was not what forced HITL, but high confidence alone does not grant autonomy either.
- Trifecta was checked first: had the trifecta flag been true (a lethal-trifecta agent), all scoring would have been moot.
- The decision is auditable: six months later, the trace shows exactly why Alice's review was required.
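The ordering described in the list above, with the trifecta check ahead of any scoring, can be sketched as follows. The `gate` function and the `"TRIFECTA_GATE"` and `"UNSPECIFIED"` labels are assumptions for illustration; this page does not state what outcome the Butler assigns to a lethal-trifecta invocation, only that scoring is skipped.

```python
HITL_BELOW = 0.4
HOTL_ABOVE = 0.7


def gate(inputs, score_fn):
    """Return (outcome, score); the score is None when scoring is skipped."""
    if inputs["trifecta"]:
        # Assumption: hypothetical label; the source only says scoring is moot.
        return "TRIFECTA_GATE", None
    score = score_fn(inputs)
    if score < HITL_BELOW:
        return "HITL", score
    if score > HOTL_ABOVE:
        return "HOTL", score
    # Assumption: the band between the thresholds is not defined in the source.
    return "UNSPECIFIED", score


calls = []


def example_score(inputs):
    """Stand-in scorer that records each call and returns this example's score."""
    calls.append(inputs)
    return 0.1715


scored = gate({"trifecta": False}, example_score)
skipped = gate({"trifecta": True}, example_score)
print(scored, skipped, len(calls))
```

Passing the scorer in as a function makes the short-circuit visible: `example_score` runs once for the normal invocation and not at all for the trifecta case.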