LLM output validation — the controls that actually work
Prompt injection and hallucination are symptoms. The root cause is missing output validation at the right lifecycle gates. This piece maps the validation controls that close the gap, with examples from assessed systems.
Most LLM security work is spent on the input side. Prompt firewalls, jailbreak classifiers, system-prompt hardening, retrieval sanitisation — all of it pushed up against the model entrance. The intuition is the familiar perimeter intuition from web security: stop the attacker at the gate. The intuition is wrong for LLMs in one consequential way. An LLM is not a deterministic parser. It can produce harmful output from inputs that look completely innocuous, and it can produce safe output from inputs that look adversarial. The input boundary is too leaky to carry the weight of the security argument by itself.
Output validation is the more durable fix. It does not depend on guessing every possible adversarial input. It works at the place where the model meets the rest of the system. If output validation is correctly placed and reasonably specific, the worst that prompt injection can produce is output that the validation layer catches before it reaches anywhere consequential. That is the property worth investing in.
Prompt injection and hallucination are symptoms of the same root cause: LLM output reaching downstream systems without validation. Treat the symptoms by treating the root cause.
The symptom-cause confusion
Walk into the security review of a typical LLM application and you will hear two concerns named most often: prompt injection and hallucination. Both are real and both matter. But both are symptoms, and a control plan that targets them directly will end up chasing an open-ended problem.
Prompt injection is a symptom of the LLM treating untrusted text as instructions. Hallucination is a symptom of the LLM producing content not grounded in any verifiable source. In both cases, the dangerous event is not the model behaviour itself — it is what happens next. A model that produces a hallucinated answer in isolation is a harmless statistical artefact. A model that produces a hallucinated answer that is rendered to a paying customer as if it were authoritative guidance is a liability event. The difference is everything that happens after the model returns its tokens.
The same logic applies to prompt injection. A model that “follows the injection” in its internal reasoning is not, by itself, an incident. A model whose tool-calling layer takes the resulting unsafe parameters and executes them against a live API is the incident. The injection is the trigger; the consequential event is the absence of a check between the model output and the side-effectful action.
Once you frame it this way, the control plan shifts. The question is no longer “how do we stop the LLM from producing bad output?”, which has no bounded answer, but “what does the system do with whatever output the LLM produces?” The second question has answers that can be designed, implemented, tested, and audited. Those answers fall into five categories, which make up the rest of this piece.
Where output flows
Before naming the validation types, you have to name the destinations. The validation a piece of output needs depends entirely on where it is going next. A free-text answer rendered into a chat UI has different needs from a JSON payload that becomes the argument to a SQL query. In every assessment we run, we see teams treat all output as if it were going to a single destination — and then under-protect the destinations with the highest blast radius.
Five destinations cover the vast majority of LLM application architectures:
- Rendered HTML. The output is displayed to a human user in a chat surface, email, document, or report. The risk class is what the user sees — script tags, malicious links, false claims, leaked content.
- Shell, SQL, or code execution. The output is parsed and executed. The risk class is the destruction or escape of the execution environment, and lateral movement into systems the model was not supposed to reach.
- Structured-data APIs. The output is parsed as a structured payload — JSON, XML, function call arguments — and forwarded to a downstream API. The risk class is malformed payloads, oversized fields, injection into the downstream system.
- Downstream agents. In multi-agent setups, one model's output becomes another model's input. The risk class is cross-agent injection, instruction propagation, and unbounded reasoning chains.
- Persistent memory. The output is written to long-term storage: a vector store, a conversation log, a knowledge graph. The risk class is future retrieval of the stored content, where the LLM's earlier output becomes part of a later prompt.
Every output flows to one or more of these. A single LLM response can go to several at once — for instance, a structured payload that becomes both a tool call and a UI-rendered answer. The validation has to cover each destination independently. A single validator that “checks the output” without naming the destination is a placeholder, not a control.
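One way to make "validate per destination" concrete is a registry keyed by destination, so that no output can be checked without naming where it is going. The sketch below is illustrative only: the `Destination` enum mirrors the five destinations above, while the validator functions, the length limits, and the `VALIDATORS` registry are hypothetical names chosen for the example.

```python
from enum import Enum
from typing import Callable


class Destination(Enum):
    RENDERED_HTML = "rendered_html"
    CODE_EXECUTION = "code_execution"
    STRUCTURED_API = "structured_api"
    DOWNSTREAM_AGENT = "downstream_agent"
    PERSISTENT_MEMORY = "persistent_memory"


# A validator takes the raw model output and returns a list of problems.
# An empty list means the output passed that check.
Validator = Callable[[str], list[str]]


def no_script_tags(output: str) -> list[str]:
    return ["script tag in output"] if "<script" in output.lower() else []


def max_length(limit: int) -> Validator:
    def check(output: str) -> list[str]:
        return [f"output longer than {limit} chars"] if len(output) > limit else []
    return check


# Hypothetical registry (assumption): only two destinations are wired up here.
VALIDATORS: dict[Destination, list[Validator]] = {
    Destination.RENDERED_HTML: [no_script_tags, max_length(4000)],
    Destination.PERSISTENT_MEMORY: [max_length(2000)],
}


def validate_for(output: str, destinations: list[Destination]) -> dict[Destination, list[str]]:
    """Run each destination's validators independently and report per destination."""
    results: dict[Destination, list[str]] = {}
    for dest in destinations:
        if dest not in VALIDATORS:
            # Fail closed: an unregistered destination is a missing control, not a pass.
            results[dest] = ["no validators registered for destination"]
            continue
        problems: list[str] = []
        for validator in VALIDATORS[dest]:
            problems.extend(validator(output))
        results[dest] = problems
    return results
```

The design choice worth noting is the fail-closed branch: a destination with no registered validators is reported as a problem rather than silently passing.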
1. Structural validation
Structural validation asks one question: is the output the shape we expected? For free-text output the question barely applies — text is text. For everything that becomes structured data downstream — tool calls, function arguments, JSON-mode responses, anything serialised — structural validation is non-negotiable.
The checks in this category include:
- JSON schema validation. The output parses as JSON, has the expected fields, the fields have the expected types, the values are within the expected ranges.
- Type checking. Numeric fields are numeric. Date fields parse as dates. Enum fields are within the declared enum.
- Length and shape limits. No field exceeds its declared max length. No array exceeds its declared max size. Nesting depth is bounded.
- Required-field enforcement. Every required field is present and non-null. Optional fields are either absent or well-formed.
Structural validation is the cheapest control to implement and the most reliable to test. Modern LLM SDKs support constrained generation that emits valid JSON against a supplied schema, which removes most of the failure cases at the source. A defence in depth posture pairs constrained generation with post-hoc schema validation — the model produces well-formed JSON, and the validator confirms it on the way out the door. The cost of running both is negligible; the failure mode of running only one is real.
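A post-hoc structural check can be small. The following is a minimal sketch using only the Python standard library; the `SCHEMA` for a ticket-creation payload and its field names are hypothetical, and a real deployment would more likely use a full JSON Schema validator. It covers parsing, required fields, types, enums, lengths, and ranges, but not nesting depth.

```python
import json

# Hypothetical schema (assumption) for a flat "create ticket" payload.
SCHEMA = {
    "title": {"type": str, "required": True, "max_len": 120},
    "priority": {"type": str, "required": True, "enum": {"low", "medium", "high"}},
    "estimate_hours": {"type": int, "required": False, "min": 1, "max": 40},
}


def validate_structure(raw: str, schema: dict) -> list[str]:
    """Return a list of structural problems; an empty list means the payload passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]

    errors = [f"unexpected field: {name}" for name in data if name not in schema]
    for name, rule in schema.items():
        if name not in data:
            if rule["required"]:
                errors.append(f"missing required field: {name}")
            continue
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        # (This sketch has no boolean fields; null also fails the type check.)
        if isinstance(value, bool) or not isinstance(value, rule["type"]):
            errors.append(f"{name}: wrong type")
            continue
        if "max_len" in rule and len(value) > rule["max_len"]:
            errors.append(f"{name}: exceeds max length")
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"{name}: value not in enum")
        if "min" in rule and value < rule["min"]:
            errors.append(f"{name}: below minimum")
        if "max" in rule and value > rule["max"]:
            errors.append(f"{name}: above maximum")
    return errors
```

Running this after constrained generation is the defence-in-depth pairing described above: the generator is expected to produce conforming JSON, and the validator confirms it independently.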
2. Content validation
Content validation asks: does the output contain things it should not? This is the category that maps most naturally to traditional output-filtering work — PII scrubbing, secrets detection, categorical content classification. The newer elements are the ones specific to LLM behaviour: scrubbing of system-prompt fragments, detection of training-data regurgitation, detection of cross-conversation leakage.
A working content-validation stack typically combines three approaches:
- Pattern-based detection. Regex and similar literal-pattern matching. Effective for known formats — credit card numbers, national IDs, SSH keys, API tokens with predictable structures. Cheap to run. Misses anything novel.
- Heuristic detection. Rules that look for combinations of signals (“contains a 16-digit number AND a date AND the word card”) and assign a confidence score. Better recall than pure pattern matching. Higher maintenance cost.
- Small-classifier inference. A small dedicated model that classifies the output text into safe/unsafe categories. Useful for content that is hard to specify literally — tone, intent, presence of weaponised instructions. Adds latency and another dependency.
Content validation is most often the first place teams stop. The risk is that teams write a list of regexes, run them, and call it done. Regexes catch what you have already seen. They do not catch new categories. The robust posture is to combine all three approaches and to budget for ongoing classifier re-training as the threat surface evolves.
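The first two approaches can be sketched in a few lines. This example is an illustration under stated assumptions, not a production scanner: the two regexes, the score weights, and the `CARD_THRESHOLD` value are all hypothetical choices, and the third approach (a small classifier) is deliberately left out because it depends on a model that is not specified here.

```python
import re

# Pattern-based detection: literal formats with predictable structure.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}

# Heuristic detection: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_CANDIDATE = re.compile(r"(?<!\d)\d(?:[ -]?\d){12,18}(?!\d)")
EXPIRY = re.compile(r"\b(0[1-9]|1[0-2])/\d{2}\b")
CARD_THRESHOLD = 60  # assumed cut-off on a 0-100 score


def luhn_valid(digits: str) -> bool:
    """Luhn checksum, used to separate card-like numbers from arbitrary digit runs."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scan_output(text: str) -> dict:
    findings = [name for name, pattern in PATTERNS.items() if pattern.search(text)]

    # Combine signals into a score: checksum-valid number, expiry-like date, the word "card".
    score = 0
    for match in CARD_CANDIDATE.finditer(text):
        if luhn_valid(re.sub(r"\D", "", match.group())):
            score = 60
            break
    if score:
        if EXPIRY.search(text):
            score += 20
        if re.search(r"\bcard\b", text, re.IGNORECASE):
            score += 20

    return {
        "findings": findings,
        "card_score": score,
        "blocked": bool(findings) or score >= CARD_THRESHOLD,
    }
```

Integer scores are used instead of floats so that threshold comparisons stay exact.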
3. Behavioural validation
Behavioural validation is the category most LLM applications skip entirely. It asks: does the output actually reflect the intended task? Not whether the output is well-formed (structural) or contains forbidden tokens (content), but whether the answer is the answer the system was supposed to produce given the inputs.
Three checks live here:
- Citation verification. If the output contains references to source material, those references actually exist in the retrieved content and the cited passage actually says what the answer says it says. Hallucinated citations and paraphrase errors are caught here.
- Claim consistency. If the system is grounded in retrieved content, the claims in the answer must be consistent with that content. The check is a cross-check between the retrieved chunks and the assertions in the answer. Models exist that do this as a small additional pass; quality varies but the cost is bounded.
- Refusal handling. If the LLM refuses the task (“I cannot help with that”), the downstream system should not silently try a second model, route to a fallback that does answer, or treat the refusal as an empty answer to be passed along. Refusal-bypass via downstream plumbing is a recurring pattern in incident reports.
Behavioural validation is the category that closes the hallucination gap most directly. The gap is not that the model hallucinates — that is a property of statistical generation. The gap is that the system does not check the hallucination before serving it. Adding the check transforms hallucination from an unbounded liability into a detectable, measurable failure mode that can be tracked, escalated, and improved over time.
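Two of the three checks lend themselves to a short sketch. The example below assumes, purely for illustration, that the model returns citations as a chunk identifier plus a verbatim quote, and that refusals start with one of a few known phrases; `Citation`, `REFUSAL_MARKERS`, and `check_answer` are hypothetical names. Exact-substring matching catches invented sources and invented quotes, but not paraphrase errors, which need a claim-consistency pass.

```python
import re
from dataclasses import dataclass


@dataclass
class Citation:
    chunk_id: str  # identifier of the retrieved chunk the answer claims to cite
    quote: str     # passage the answer attributes to that chunk


def normalise(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip().lower()


def verify_citations(citations: list[Citation], retrieved: dict[str, str]) -> list[str]:
    """Check that each cited chunk was retrieved and contains the quoted passage."""
    problems = []
    for c in citations:
        chunk = retrieved.get(c.chunk_id)
        if chunk is None:
            problems.append(f"{c.chunk_id}: cited source was not retrieved")
        elif normalise(c.quote) not in normalise(chunk):
            problems.append(f"{c.chunk_id}: quoted passage not found in source")
    return problems


# Assumed refusal phrasings; a real system would need a broader detector.
REFUSAL_MARKERS = ("i cannot help with that", "i can't help with that", "i'm unable to")


def is_refusal(output: str) -> bool:
    return normalise(output).startswith(REFUSAL_MARKERS)


def check_answer(output: str, citations: list[Citation], retrieved: dict[str, str]) -> dict:
    if is_refusal(output):
        # Surface the refusal as a refusal; do not reroute to a fallback model.
        return {"status": "refused", "answer": None}
    problems = verify_citations(citations, retrieved)
    if problems:
        return {"status": "rejected", "answer": None, "problems": problems}
    return {"status": "ok", "answer": output}
```

Returning an explicit `refused` status, rather than an empty answer, is what keeps downstream plumbing from treating a refusal as something to retry elsewhere.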
Validation types mapped to output destinations
| Validation type | Destination it protects | Typical check |
|---|---|---|
| Structural | Rendered HTML, structured-data APIs, downstream agents | JSON schema check; field type and length limits; required-field enforcement. |
| Content | Rendered HTML, persistent memory, structured-data APIs | PII filtering; secrets scan; categorical content classifier on output text. |
| Behavioural | All destinations where claim accuracy matters | Citation existence check; claim-source consistency check; refusal-bypass detection. |
| Authorisation | Multi-tenant rendered HTML, downstream agents, persistent memory | Tenant-scope check; per-user redaction; cross-tenant content blocked at output. |
| Safety | Shell/SQL/code execution, downstream agents, tool calls | Parameter-manifest match; destination allowlist; scope-within-grant check. |
5. Safety validation
Safety validation is the category that prevents an LLM output from causing a real-world side effect that should not have happened. It applies whenever the output is going to a destination that executes something — a tool call, a code execution sandbox, a downstream agent that itself has tools, a shell, a SQL interface, an API that performs writes.
The core control is tool-call validation, and it has three components:
- Parameters match the manifest. Each tool has a declared signature. The arguments coming out of the model are validated against the manifest before the tool is invoked. Type, range, and required-field checks at minimum. Reject the call if validation fails, with a clear error.
- Destination is in the allowlist. Any argument that identifies an external endpoint — a URL, a database identifier, a file path, an email address — is checked against an allowlist before the call runs. The allowlist is documented and reviewed; it is not a one-line regex hidden in code.
- Scope is within authorisation. The call is within the permission grant for this session. A tool that can run with broad permissions should still respect the user's, the agent's, or the application's scoped grant for the request. Out-of-scope calls are rejected, logged, and surfaced to the operator.
Safety validation also includes side-effect approval boundaries: destructive operations — delete, send, write, execute, transfer — require an explicit human-in-the-loop approval before they run. The approval is not a rubber stamp; the operator sees the parameters, the destination, and the rationale before confirming. The approval is logged with the parameters.
Safety validation is the one category where there is no acceptable failure mode. A model that produces a hallucinated answer is a bad day. A model that successfully invokes a destructive tool because the safety layer was not in place is an incident.
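The three tool-call components and the approval boundary can be combined into a single pre-invocation check. This is a minimal sketch under assumptions: the `MANIFEST`, the tool names, the scope strings, and `ALLOWED_HOSTS` are all hypothetical, and only URL arguments are checked against the allowlist here (a fuller version would also cover file paths, database identifiers, and email recipients).

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class ToolSpec:
    params: dict          # argument name -> expected Python type
    required_scope: str   # permission the session must hold
    destructive: bool = False


# Hypothetical manifest and allowlist (assumptions for the example).
MANIFEST = {
    "fetch_url": ToolSpec(params={"url": str}, required_scope="net:read"),
    "send_email": ToolSpec(
        params={"to": str, "body": str}, required_scope="mail:send", destructive=True
    ),
}
ALLOWED_HOSTS = {"api.example.com"}


def check_tool_call(name: str, args: dict, granted_scopes: set, approved: bool = False) -> list[str]:
    """Return reasons to reject the call; an empty list means it may run."""
    spec = MANIFEST.get(name)
    if spec is None:
        return [f"unknown tool: {name}"]
    errors = []

    # 1. Parameters match the manifest.
    for pname, ptype in spec.params.items():
        if pname not in args:
            errors.append(f"missing argument: {pname}")
        elif not isinstance(args[pname], ptype):
            errors.append(f"wrong type for argument: {pname}")
    errors.extend(f"unexpected argument: {p}" for p in args if p not in spec.params)

    # 2. Destination is in the allowlist (URL arguments only in this sketch).
    url = args.get("url")
    if isinstance(url, str):
        parsed = urlparse(url)
        if parsed.scheme != "https" or parsed.hostname not in ALLOWED_HOSTS:
            errors.append("destination not in allowlist")

    # 3. Scope is within the session's grant.
    if spec.required_scope not in granted_scopes:
        errors.append(f"scope {spec.required_scope} not granted")

    # 4. Destructive operations need explicit human approval.
    if spec.destructive and not approved:
        errors.append("destructive call requires human approval")
    return errors
```

The `approved` flag stands in for a real approval workflow in which the operator sees the parameters and destination before confirming, and the decision is logged alongside them.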
Lifecycle gates for output validation
Not every validation type needs to be in place from day one. A reasonable lifecycle staging plan ties each validation type to a clearance gate. The five validation types do not all carry the same weight; the gates reflect that.
- Before pilot, structural + content. Even a small internal pilot needs structural validation on any structured output and content validation for PII and secrets. Skipping these is not a defensible decision; the cost is low and the failure modes are well documented.
- Before pilot for high-stakes, behavioural + authorisation. Any pilot where the wrong answer carries reputational, regulatory, or contractual risk needs citation verification and per-user output filtering in place from the first user. For lower-stakes pilots (internal tools where the audience tolerates errors), these can be deferred to the production gate, with explicit acceptance of the residual risk.
- Before production for general, behavioural + authorisation. Anything reaching production traffic needs both. Without behavioural validation, the system has no defensible answer when users report incorrect content. Without authorisation validation, the system is one cross-tenant leak away from an incident.
- Always before any pilot involving tool execution, safety validation. This is the one non-negotiable gate. A tool-executing LLM without safety validation is not pilot-ready. There is no pilot small enough to skip this layer.
The gate matters because it is the moment the absence of a control becomes a formal decision. If structural validation is missing at the pilot gate, the pilot does not proceed until it is added — or the disposition explicitly records that the gap is acceptable and why. The gate forces the conversation. Without explicit gates, validation gets pushed into “we'll do it before production,” and then production arrives faster than expected.
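The gate logic above can be written down as a small decision function. The sketch assumes the gates are cumulative (the high-stakes pilot and production gates include the earlier structural and content requirements), which the text implies but does not state outright; the gate names and the `gate_decision` function are hypothetical.

```python
# Required validation types per gate, assumed to be cumulative.
GATES = {
    "pilot": {"structural", "content"},
    "pilot_high_stakes": {"structural", "content", "behavioural", "authorisation"},
    "production": {"structural", "content", "behavioural", "authorisation"},
}


def gate_decision(gate: str, in_place: set, executes_tools: bool, accepted_gaps: dict = None) -> dict:
    """Decide whether a system may pass a gate.

    accepted_gaps maps a missing validation type to the recorded reason the
    gap is acceptable. Safety validation can never be waived this way.
    """
    required = set(GATES[gate])
    if executes_tools:
        required.add("safety")
    missing = required - set(in_place)
    accepted_gaps = accepted_gaps or {}

    unresolved = sorted(
        c for c in missing if c == "safety" or c not in accepted_gaps
    )
    accepted = {
        c: accepted_gaps[c] for c in sorted(missing) if c != "safety" and c in accepted_gaps
    }
    return {"proceed": not unresolved, "unresolved": unresolved, "accepted": accepted}
```

A missing control either blocks the gate or appears in `accepted` with its recorded reason, which is the "formal decision" the gate is meant to force.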
What goes in evidence
The last point is the one that gets least attention and produces the largest differences between systems that survive a regulator review and systems that do not. The evidence for output validation is not the existence of the controls — it is the test records that demonstrate the controls are exercised.
For each validation type, the evidence pack should include:
- Implementation reference. The specific code path, library, or service that performs the validation. A file path and function name, not a vague description.
- Test cases. The inputs the validation is exercised against, including known-good cases and adversarial cases. Adversarial cases are the ones auditors care about — they demonstrate that the validator catches the patterns it is supposed to catch, not just that it runs without error.
- Test run record. The result of running the tests, with a timestamp and a pass/fail summary. Stored such that a reviewer can fetch the record without help from the engineering team.
- Metric, if applicable. A measured rate of catches — injection attempt detection rate, citation pass rate, refusal-bypass rate — for the validators where measurement is meaningful. A number is more defensible than a statement.
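The four items above map naturally onto a record type with a completeness check. This is one possible shape, offered as a sketch: the `EvidenceRecord` fields and the `evidence_gaps` function are hypothetical names, and the metric is treated as optional because the text marks it "if applicable".

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class EvidenceRecord:
    validation_type: str                 # e.g. "structural", "safety"
    implementation_ref: str              # file path and function name
    known_good_cases: list[str]
    adversarial_cases: list[str]
    last_run_at: Optional[datetime] = None
    last_run_passed: Optional[bool] = None
    metric: Optional[tuple[str, float]] = None  # (name, value), where meaningful


def evidence_gaps(record: EvidenceRecord) -> list[str]:
    """List what a reviewer would find missing from this evidence record."""
    gaps = []
    if not record.implementation_ref.strip():
        gaps.append("implementation reference missing")
    if not record.known_good_cases:
        gaps.append("no known-good test cases")
    if not record.adversarial_cases:
        gaps.append("no adversarial test cases")
    if record.last_run_at is None or record.last_run_passed is None:
        gaps.append("no test run record")
    return gaps
```

A record with an empty gap list is not proof the control is good, only that the evidence a reviewer would ask for exists and can be fetched.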
The same logic applies in adjacent governance work. Frameworks like the OWASP Agentic Top 10 enumerate the threats; the control map for those threats converts them into checks, gates, and evidence. Output validation is one of the densest sections of any such map because it sits across so many threat classes.
The composite picture: prompt injection prevention is a useful complement, not a substitute. Output validation is where most of the load is carried. A control plan that names the five validation types, the destinations they protect, the gates at which they must be in place, and the evidence required to verify them is a defensible plan. A control plan that says “we filter outputs” is not.
A note on scope: Drel reviews assessed systems against documented architecture, configuration and intent. It does not ingest live telemetry from production environments. Dispositions reflect the assessed system at the time of review and the re-assessment triggers that govern when the disposition must be revisited.