
Human-in-the-loop boundaries that actually hold

Human-in-the-loop is the most common control in agentic AI risk plans. It is also the control most often specified in a way that does not hold. This piece defines what a robust HITL boundary requires — and the failure modes that hollow it out.

Drel Research | 25 May 2025 | 11 min read

Human-in-the-loop (HITL) is the control that appears in more agentic AI risk plans than any other. When a security team asks how a high-risk agent action is governed, the answer is almost always some variation of “a human approves it before it executes.” This is the right instinct. The problem is that “a human approves it” is a description of intent, not a specification of a control. Most HITL implementations do not hold under operational pressure.

Approval boundary requirements by action type

Action type	Required approval gate	Evidence required	Who approves
Read-only query	None required	Access log showing query scope; no state changes	No approval — automated logging only
Low-value external call	Session-level pre-authorisation	Pre-auth scope documented; call within approved endpoint list	User at session start; system enforces scope
High-value external call	Per-action explicit approval	Proposed call parameters surfaced in full; action not executed until approved	Named human reviewer with authority for the external system
Data write	Per-action explicit approval for sensitive data; session-level for non-sensitive	Write target, content scope, and reversibility stated; approval token issued	Data owner or delegate; security architect for sensitive writes
Autonomous code execution	Per-action mandatory approval — no session-level pre-auth permitted	Code surfaced in full before execution; sandbox configuration confirmed; approval token required at tool layer	Security-cleared engineer; no self-approval by the session user
Multi-step decision chain	Plan-level approval before any consequential step executes	Full plan surfaced and legible; each consequential step identified; human confirms before first execution	Accountable business owner; security architect for high-risk chains
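One way to keep a table like this enforceable is to hold it as data that the tool layer consults, rather than as prose in a risk plan. The sketch below is illustrative only: the action-type keys, the `Gate` enum, and the `required_gate` helper are hypothetical names, and the fail-closed default for unknown action types is an assumption, not something the table states.

```python
from dataclasses import dataclass
from enum import Enum


class Gate(Enum):
    NONE = "none required"
    SESSION_PREAUTH = "session-level pre-authorisation"
    PER_ACTION = "per-action explicit approval"
    PLAN_LEVEL = "plan-level approval"


@dataclass(frozen=True)
class GatePolicy:
    gate: Gate
    approver: str


# Hypothetical action-type keys; gates and approvers mirror the table above.
POLICY = {
    "read_only_query": GatePolicy(Gate.NONE, "no approval, automated logging only"),
    "low_value_external_call": GatePolicy(Gate.SESSION_PREAUTH, "user at session start"),
    "high_value_external_call": GatePolicy(Gate.PER_ACTION, "named human reviewer"),
    "non_sensitive_data_write": GatePolicy(Gate.SESSION_PREAUTH, "data owner or delegate"),
    "sensitive_data_write": GatePolicy(Gate.PER_ACTION, "security architect"),
    "code_execution": GatePolicy(Gate.PER_ACTION, "security-cleared engineer"),
    "multi_step_chain": GatePolicy(Gate.PLAN_LEVEL, "accountable business owner"),
}

# Assumption: anything not listed is treated as the strictest case.
FAIL_CLOSED = GatePolicy(Gate.PER_ACTION, "security architect")


def required_gate(action_type: str) -> GatePolicy:
    """Look up the gate for an action type; unknown types fail closed."""
    return POLICY.get(action_type, FAIL_CLOSED)
```

The point of the default is that a new action type added without a policy entry gets gated rather than silently allowed.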

The HITL promise

Human-in-the-loop works as a control when it means exactly what it says: a human reviews the proposed action, understands its consequences, and explicitly approves or rejects it before the agent proceeds. Under this model, the agent cannot execute a consequential action without human authorization. The human is a mandatory gate, not a notification.

When this model holds, it is a strong control. Agentic AI systems can reason, plan, and prepare for complex actions — but they cannot execute those actions unilaterally. The human approval requirement caps the blast radius at whatever the human is willing to explicitly authorize. An attacker who has hijacked the agent's reasoning loop can generate a proposed action, but cannot execute it without a human reviewing and approving it.

This promise is real. The control can work. The failure is in implementation, not in the underlying model.

Why HITL fails in practice

HITL implementations degrade along predictable paths. Understanding those paths is necessary for specifying a robust implementation — and for verifying, during a security review, that the implementation on a given assessed system is not on one of those paths.

The degradation is rarely deliberate. Teams do not set out to create a HITL that does not work. The failure modes emerge from the interaction between good-faith product decisions and the realities of how humans behave under load.

The four failure modes

Failure mode 1: The approval UX makes rubber-stamping inevitable.

When HITL is implemented as a UI notification — a dialog, a confirmation button, an inline “approve?” prompt — the UX design determines whether the human is reviewing or rubber-stamping. If the approval dialog shows the agent's proposed action in a format that requires interpretation, with no time to do that interpretation, and is presented alongside high-velocity operational work, humans will click approve. Every time.

The diagnostic is simple: does the approval UI surface enough information for a human to meaningfully evaluate the action in the time they have available? If not, the HITL is a click-through, not a control.

Failure mode 2: The boundary is enforced at the UI, not at the infrastructure.

If HITL is enforced only in the application's UI — the action is blocked at the UI layer pending approval, but the underlying tool call can be made independently of the UI — then bypassing the UI bypasses the control. This is the same failure mode as client-side input validation: it catches ordinary inputs but not adversarial ones.

A HITL that is enforced only in the UI is not a security control. It is a UX feature. Security reviewers should verify that the HITL enforcement is at the tool layer, the API gateway, or the model serving infrastructure — not just in the front-end code.
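To make the distinction concrete, here is a minimal sketch of tool-layer enforcement, assuming an approval service that signs the exact call a human approved. Everything named here (`APPROVAL_KEY`, `issue_token`, `execute_tool`, the `registry` of tools) is hypothetical; the only real library calls are Python's standard `hmac`, `hashlib`, and `json`.

```python
import hashlib
import hmac
import json
from typing import Optional

# Hypothetical secret shared by the approval service and the tool layer only.
# The agent and the UI never hold it.
APPROVAL_KEY = b"replace-with-a-managed-secret"


class ApprovalRequired(PermissionError):
    """Raised when a gated tool call arrives without a valid approval token."""


def _canonical(tool: str, params: dict) -> bytes:
    """Canonical bytes for the exact call being approved."""
    return json.dumps({"tool": tool, "params": params}, sort_keys=True).encode()


def issue_token(tool: str, params: dict) -> str:
    """Called by the approval service after a human approves this exact call."""
    return hmac.new(APPROVAL_KEY, _canonical(tool, params), hashlib.sha256).hexdigest()


def execute_tool(tool: str, params: dict, token: Optional[str], registry: dict):
    """Tool-layer entry point: refuses any call whose token does not match."""
    expected = issue_token(tool, params)
    if token is None or not hmac.compare_digest(expected.encode(), token.encode()):
        raise ApprovalRequired(f"{tool}: no valid approval token for these parameters")
    return registry[tool](**params)
```

Because the token is bound to the parameters, an approval for one recipient cannot be reused for another. A production token would also need an expiry and single-use protection, which this sketch omits.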

Failure mode 3: Scope creep removes HITL after it is approved.

HITL is specified and reviewed at approval time. After approval, the system evolves: new agent capabilities are added, the scope of “what the agent can do” expands, and the HITL boundaries are not re-evaluated. The initial approval covered a narrower set of consequential actions than the system currently supports.

The control for this failure mode is re-assessment triggers in the risk disposition: any addition of new agent capabilities, new tool manifest entries, or new deployment contexts triggers a re-review of the HITL scope. Without explicit re-assessment triggers, HITL scope drift is not detected until after an incident.
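A re-assessment trigger can be as simple as diffing the current tool manifest against the one that was reviewed. The sketch below assumes a hypothetical manifest shape (tool name mapped to a list of capability strings); real manifests will differ, and deployment-context changes would need their own check.

```python
def reassessment_reasons(reviewed: dict, current: dict) -> list:
    """Return the reasons a HITL re-review is triggered; empty means no drift.

    Both arguments map tool name -> list of capability strings (assumed shape).
    """
    reasons = []
    # Tools that did not exist when the HITL scope was reviewed.
    for tool in sorted(current.keys() - reviewed.keys()):
        reasons.append(f"new tool: {tool}")
    # Tools that existed at review time but can now do more.
    for tool in sorted(current.keys() & reviewed.keys()):
        added = sorted(set(current[tool]) - set(reviewed[tool]))
        if added:
            reasons.append(f"{tool} gained capabilities: {added}")
    return reasons
```

Run in CI against the manifest recorded in the risk disposition, a non-empty result could block deployment until the HITL scope is re-reviewed.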

Failure mode 4: Time pressure overrides the control in practice.

In production deployments under performance pressure, the HITL timeout — how long the agent waits for human approval before timing out — is often set too short to allow genuine review. When the timeout fires, the system either cancels the action (reducing perceived agent utility) or proceeds without approval (defeating the control).

Teams under pressure to demonstrate agentic performance frequently choose “proceed without approval” as the timeout behavior. This is a product decision that is rarely surfaced as a security decision. The security review must ask: what happens when the approval times out, and is that behavior consistent with the HITL being a real control?

A HITL with a timeout that defaults to “proceed” is not a control. It is a delay. Under pressure, it will consistently resolve to “proceed” — which is the same outcome as having no HITL at all.
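The difference between a control and a delay comes down to a few lines of timeout handling. This is a minimal sketch, assuming human decisions arrive on an in-process queue; the `Decision` enum and the `run_gated` helper are hypothetical names.

```python
import queue
from enum import Enum


class Decision(Enum):
    APPROVED = "approved"
    REJECTED = "rejected"
    CANCELLED_ON_TIMEOUT = "cancelled_on_timeout"


def await_decision(decisions: queue.Queue, timeout_s: float) -> Decision:
    """Block until a human decides; a timeout resolves to cancel, never approve."""
    try:
        return decisions.get(timeout=timeout_s)
    except queue.Empty:
        return Decision.CANCELLED_ON_TIMEOUT


def run_gated(action, decisions: queue.Queue, timeout_s: float):
    """Execute `action` only on an explicit APPROVED decision."""
    decision = await_decision(decisions, timeout_s)
    if decision is Decision.APPROVED:
        return decision, action()
    return decision, None
```

The only path that reaches `action()` is an explicit approval. Silence, rejection, and timeout all end in the same place: nothing executes.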

Robust HITL design

A robust HITL implementation addresses all four failure modes:

Enforcement at the tool layer, not the UI. The approval gate is enforced at the tool call layer or the model serving infrastructure. The agent cannot execute the tool call without an authorization token issued by the approval mechanism. The UI is a rendering surface for the approval workflow, not the enforcement point.
Approval UX designed for genuine review. The approval dialog surfaces the specific action proposed, the parameters of that action, the context that led to it, and a clear statement of what will happen if approved. It is designed to be evaluated in a realistic time window for the human reviewer's role.
Timeout defaults to cancel, not proceed. When the approval timeout fires, the default action is cancellation. If the operation is time-sensitive, the system surfaces that urgency to the human reviewer — it does not remove the human from the loop.
Re-assessment triggers for capability scope changes. The HITL scope is defined in the risk disposition. Any change that expands the agent's capability set triggers a re-review of whether the HITL coverage is still adequate for the new capability set.
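The approval UX requirement in the list above can be partly enforced in the payload itself: if a request cannot state what it does and why, it should not reach a reviewer. A minimal sketch, with hypothetical field names and a plain-text rendering standing in for whatever UI the deployment actually uses:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApprovalRequest:
    action: str        # the specific action proposed
    parameters: dict   # the exact parameters of that action
    context: str       # what led the agent to propose it
    consequence: str   # plain statement of what happens if approved
    reversible: bool

    def render(self) -> str:
        """Plain-text review card; refuses to render if a required field is blank."""
        for name in ("action", "context", "consequence"):
            if not getattr(self, name).strip():
                raise ValueError(f"approval request missing {name}")
        lines = [f"ACTION: {self.action}"]
        lines += [f"  {k} = {v!r}" for k, v in sorted(self.parameters.items())]
        lines.append(f"WHY: {self.context}")
        lines.append(f"IF APPROVED: {self.consequence}")
        lines.append("REVERSIBLE: " + ("yes" if self.reversible else "NO"))
        return "\n".join(lines)
```

This does not make a reviewer read the card, but it removes one excuse: the information needed for a real decision is always present and always in the same place.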

Where to place the HITL boundary

Not every action an agent takes requires human approval. Placing the HITL boundary in the right location requires defining what constitutes a “consequential action” for the specific deployment.

A useful framework for defining the boundary:

Irreversibility: Actions that cannot be undone (sent emails, submitted payments, deleted records) are higher priority for HITL than actions that can be reversed
Blast radius: Actions with large downstream consequences (posting to external systems, modifying shared data stores, triggering external workflows) require HITL; read-only actions with no external effects typically do not
Scope of authority: Actions that exceed the user's stated intent for the session — things the agent is doing that the user did not explicitly ask for — require human confirmation
Data sensitivity: Actions that access, copy, or transmit sensitive data to external destinations require HITL regardless of reversibility

The HITL boundary specification should be a documented list of action categories, not a general principle. “Consequential actions require approval” is not a specification. “Any tool call to an external API that writes or deletes data requires approval before execution” is a specification.
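A documented list can be literal. The sketch below encodes the four criteria from the framework above as a per-tool specification; the tool names and their classifications are made up for illustration, and treating unlisted tools as requiring approval is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionProfile:
    irreversible: bool              # cannot be undone once executed
    external_effect: bool           # posts to external systems or shared stores
    sends_sensitive_data_out: bool  # transmits sensitive data externally


# Hypothetical boundary specification: one documented entry per tool call.
BOUNDARY_SPEC = {
    "search_knowledge_base": ActionProfile(False, False, False),
    "send_email": ActionProfile(True, True, False),
    "delete_record": ActionProfile(True, True, False),
    "export_customer_data": ActionProfile(False, True, True),
}


def approval_reasons(tool_call: str, user_requested: bool = True) -> list:
    """Return why this call needs approval; an empty list means it does not."""
    profile = BOUNDARY_SPEC.get(tool_call)
    if profile is None:
        # Assumption: anything outside the documented list is gated.
        return ["not in boundary specification"]
    checks = [
        (profile.irreversible, "irreversibility"),
        (profile.external_effect, "blast radius"),
        (not user_requested, "scope of authority"),
        (profile.sends_sensitive_data_out, "data sensitivity"),
    ]
    return [reason for flagged, reason in checks if flagged]
```

Returning the reasons rather than a bare boolean gives the reviewer, and later the auditor, a record of which criterion put the action behind the gate.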

Approval granularity

A critical design question for HITL is the granularity of approval: does the human approve each tool call individually, approve a multi-step plan, or approve a task category?

Per-action approval is the most granular and the most secure: each tool call in a gated category requires its own explicit approval. It is also the most disruptive. In a high-volume agentic deployment, it can create an approval backlog that defeats the purpose of the agent.

Plan-level approval requires human review and approval of the agent's proposed multi-step plan before any consequential step executes. This is more practical for complex agentic workflows: a human reviews “the agent plans to: (1) retrieve X, (2) draft Y, (3) send Z” and approves or modifies the plan before execution. It requires that the plan is complete and legible before the agent begins execution.
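Plan-level approval only holds if the plan that executes is the plan that was approved. One way to sketch that binding, assuming steps are plain dictionaries and the approval mechanism records a hash of the approved plan (`plan_hash` and `ApprovedPlan` are hypothetical names):

```python
import hashlib
import json


def plan_hash(steps: list) -> str:
    """Stable hash of the full ordered plan shown to the reviewer."""
    return hashlib.sha256(json.dumps(steps, sort_keys=True).encode()).hexdigest()


class ApprovedPlan:
    """Authorizes only the steps a human approved, in the approved order."""

    def __init__(self, steps: list, approved_hash: str):
        if plan_hash(steps) != approved_hash:
            raise PermissionError("plan differs from the one that was approved")
        self._steps = list(steps)
        self._next = 0

    def authorize(self, step: dict) -> None:
        """Raise unless `step` is exactly the next approved step."""
        if self._next >= len(self._steps) or step != self._steps[self._next]:
            raise PermissionError(f"step not in approved plan: {step}")
        self._next += 1
```

If the agent revises its plan mid-run, the hash no longer matches and the revised plan goes back to the human instead of executing.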

Task-category approval pre-authorizes specific categories of tasks for a session: "approve sending up to three emails in this session, to addresses in the user's existing contacts." This reduces per-action friction while maintaining a defined scope. It requires that the categories are well-defined and that the pre-authorization mechanism is enforced at the tool layer.
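The email example translates almost directly into a tool-layer check. This is a sketch under assumptions: `EmailPreAuth` is a hypothetical class, addresses are compared case-insensitively, and a denied send does not consume the session budget.

```python
class EmailPreAuth:
    """Session-scoped pre-authorization, checked at the tool layer on every send."""

    def __init__(self, contacts: set, max_sends: int = 3):
        self._contacts = frozenset(address.lower() for address in contacts)
        self._remaining = max_sends

    def authorize_send(self, to: str) -> None:
        """Raise unless this send is inside the pre-authorized scope."""
        if to.lower() not in self._contacts:
            raise PermissionError(f"{to} is not in the user's existing contacts")
        if self._remaining <= 0:
            raise PermissionError("session email budget exhausted")
        self._remaining -= 1
```

Anything outside the scope, such as a fourth email or an unknown recipient, falls back to per-action approval rather than being silently allowed.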

The right granularity depends on the deployment context. The security review should assess whether the chosen granularity is consistent with a genuine human review — or whether it is granular enough to look like HITL while being coarse enough to make meaningful review impossible.

Verification method

Verifying that a HITL implementation is a genuine control — not a rubber-stamp mechanism — requires testing the four failure modes directly:

UX test: Present a reviewer with a sample approval request and measure whether they can evaluate it meaningfully in the time available. Present a realistic number of approval requests in sequence and observe whether approval quality degrades. This is a human-factors test, not a technical test.
Infrastructure enforcement test: Attempt to invoke the gated tool call directly via API or via the underlying tool interface, bypassing the UI approval flow. The tool call should fail with an authorization error, not succeed. If it succeeds, the control is UI-only and fails this test.
Timeout behavior test: Trigger an approval request and allow it to time out without human action. Observe the outcome. If the agent proceeds without approval, document this as a control failure.
Scope coverage test: Review the current tool manifest against the HITL boundary specification. Confirm that every tool call in the “consequential action” category is covered by the HITL mechanism. Any tool call in that category that is not covered is a control gap.
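Two of the four tests above are mechanical enough to automate. The helpers below are a sketch: the manifest shape (tool name mapped to a boolean "consequential" flag) is assumed, and `direct_call_is_blocked` assumes the tool layer signals a refused call by raising `PermissionError`.

```python
def coverage_gaps(manifest: dict, gated_tools: set) -> list:
    """Scope coverage test: consequential tools that the HITL gate does not cover.

    `manifest` maps tool name -> True if the tool is a consequential action.
    """
    return sorted(
        tool
        for tool, consequential in manifest.items()
        if consequential and tool not in gated_tools
    )


def direct_call_is_blocked(invoke_without_token) -> bool:
    """Infrastructure enforcement test: a direct call must be refused, not succeed."""
    try:
        invoke_without_token()
    except PermissionError:
        return True
    return False
```

A non-empty `coverage_gaps` result or a `False` from `direct_call_is_blocked` is a finding for the dossier. The UX test stays a human-factors exercise and is not something code like this can replace.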

These tests should be documented as part of the HITL control evidence in the agentic AI security review dossier. Without behavioral verification, the HITL claim in the risk disposition is an intent statement, not an evidence-backed control.

