KCC 13.2

Butler Scoring

A worked example: the Butler scores a single invocation to decide HITL or HOTL for a high-value, high-risk payment retry feature.

Created 2026-06-08 · v0.4.0

The Scenario

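Cell payments-core wants to generate a spec for a new payment retry feature. The Butler scores the invocation to decide between human-in-the-loop (HITL), where a reviewer approves before the agent acts, and human-on-the-loop (HOTL), where the agent proceeds under human oversight. The reviewer is Alice, who has a strong calibration history.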

value: 0.8                       # payment systems are high-value
risk: 0.7                        # payment errors have customer impact
complexity: 0.5                  # standard retry pattern
confidence: 0.82                 # from the agent's expected confidence
trust_history: 0.85              # Alice has strong calibration history
cognitive_load: 0.3              # Alice has 4 reviews so far today
trifecta: false                 # no external communication tool involved

Decision Logic

base_score = (
   0.20 * 0.8   +    # value
  -0.25 * 0.7   +    # risk
  -0.15 * 0.5   +    # complexity
   0.20 * 0.82  +    # confidence
   0.15 * 0.85  +    # trust_history
  -0.10 * 0.3        # cognitive_load
)
base_score = 0.160 - 0.175 - 0.075 + 0.164 + 0.1275 - 0.030
           = 0.1715 ≈ 0.17

hitl_below = 0.4 ; hotl_above = 0.7
0.17 < 0.4  ->  decision = HITL
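
The weighted sum and threshold comparison above can be sketched in Python as follows. The weights, thresholds, and input values are copied from this example. The function names, the "TRIFECTA" and "UNSPECIFIED" labels, and the handling of scores between the two thresholds are assumptions made for illustration, since this page does not state what happens in those cases.

```python
# Weights and thresholds copied from the worked example.
WEIGHTS = {
    "value": 0.20,
    "risk": -0.25,
    "complexity": -0.15,
    "confidence": 0.20,
    "trust_history": 0.15,
    "cognitive_load": -0.10,
}
HITL_BELOW = 0.4
HOTL_ABOVE = 0.7


def base_score(inputs):
    """Weighted sum of the six scoring inputs."""
    return sum(weight * inputs[name] for name, weight in WEIGHTS.items())


def decide(inputs, trifecta):
    """Return (decision, score). The trifecta flag is checked before scoring."""
    if trifecta:
        # Placeholder label (assumption): the source only says that scoring
        # is moot in this case, not what the outcome is.
        return "TRIFECTA", None
    score = base_score(inputs)
    if score < HITL_BELOW:
        return "HITL", score
    if score > HOTL_ABOVE:
        return "HOTL", score
    # Placeholder label (assumption): the band between the two thresholds
    # is not described in this example.
    return "UNSPECIFIED", score


alice_inputs = {
    "value": 0.8,
    "risk": 0.7,
    "complexity": 0.5,
    "confidence": 0.82,
    "trust_history": 0.85,
    "cognitive_load": 0.3,
}
decision, score = decide(alice_inputs, trifecta=False)
print(decision, round(score, 4))  # -> HITL 0.1715
```

Keeping the weights in one mapping makes each term of the sum easy to compare against the inputs listed in the scenario.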

Decision: HITL

Alice reviews before the agent acts. The decision and its inputs are recorded in the decision trace (Surface 9).
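
As a rough sketch of what such a trace entry could hold, the snippet below bundles the inputs, thresholds, score, and decision into one JSON line. The record layout, the field names, and the `build_trace_record` helper are all hypothetical. The actual Surface 9 schema is not shown on this page.

```python
import json


def build_trace_record(cell, reviewer, inputs, trifecta, base_score,
                       thresholds, decision):
    """Bundle a decision and everything that produced it (hypothetical layout)."""
    return {
        "cell": cell,
        "reviewer": reviewer,
        "inputs": dict(inputs),
        "trifecta": trifecta,
        "base_score": base_score,
        "thresholds": dict(thresholds),
        "decision": decision,
    }


record = build_trace_record(
    cell="payments-core",
    reviewer="Alice",
    inputs={
        "value": 0.8,
        "risk": 0.7,
        "complexity": 0.5,
        "confidence": 0.82,
        "trust_history": 0.85,
        "cognitive_load": 0.3,
    },
    trifecta=False,
    base_score=0.1715,  # weighted sum of the inputs above
    thresholds={"hitl_below": 0.4, "hotl_above": 0.7},
    decision="HITL",
)

# One JSON object per decision; sorted keys keep entries easy to diff.
trace_line = json.dumps(record, sort_keys=True)
print(trace_line)
```

Storing the thresholds next to the inputs means a later reader can re-derive the decision even if the thresholds have since changed.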

What This Illustrates

  • Risk dominates even with good trust: Alice's 0.85 trust and the agent's 0.82 confidence are outweighed by the 0.7 risk of a payment system. Trust earns autonomy in low-risk contexts; high-risk contexts always require human review.
  • Confidence is an input, not an override: 0.82 clears the confidence threshold, so confidence on its own did not force HITL, but high confidence on its own does not grant autonomy either.
  • Trifecta was checked first: had the invocation met the lethal-trifecta condition (trifecta: true), all scoring would have been moot.
  • The decision is auditable: six months later, the trace shows exactly why Alice's review was required.
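
The first point can be checked numerically. The sketch below computes the best score any invocation could reach at a fixed risk, using the weights from this example and assuming every input lies in [0, 1] (an assumption; the page does not state input ranges). The `best_case_score` helper is illustrative only.

```python
# Weights and lower threshold copied from the worked example.
WEIGHTS = {
    "value": 0.20,
    "risk": -0.25,
    "complexity": -0.15,
    "confidence": 0.20,
    "trust_history": 0.15,
    "cognitive_load": -0.10,
}
HITL_BELOW = 0.4


def best_case_score(risk):
    """Upper bound on base_score at a fixed risk.

    Assumes inputs lie in [0, 1]: positively weighted inputs are set to 1.0
    and negatively weighted inputs (other than risk) to 0.0.
    """
    total = WEIGHTS["risk"] * risk
    for name, weight in WEIGHTS.items():
        if name == "risk":
            continue
        total += weight * (1.0 if weight > 0 else 0.0)
    return total


ceiling = best_case_score(0.7)
print(round(ceiling, 3), ceiling < HITL_BELOW)  # -> 0.375 True
```

Under these assumptions, even perfect value, confidence, and trust leave a 0.7-risk invocation at 0.375, below hitl_below. The same arithmetic (0.55 - 0.25 * risk < 0.4) shows that any risk above 0.6 stays below that threshold.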