
What evidence an AI security review should produce

A review that produces only a slide deck is not a review. The evidence an AI security review produces must survive a regulator question, a procurement audit, or a post-incident inquiry. Here is what that evidence needs to include.

Drel Research · 26 October 2025 · 10 min read

There is a common pattern in AI security reviews we examine after the fact: the review happened, in the sense that meetings occurred and people discussed risks. But the output was a slide deck, a meeting note, or a paragraph in a design doc. When the procurement team, the regulator, or the post-incident investigator asks for the evidence that the system was properly assessed, there is nothing to show that could withstand scrutiny.

A review that does not produce a defensible evidence set is not a review. It is a discussion. Discussions are useful. They are not defensible records.

This piece maps the evidence an AI security review must produce to be defensible — what each artefact is, what it must contain, and how it connects to the governance frameworks that will ask for it. The difference between a completed checklist and a defensible evidence set is the difference between a record that answers questions and one that raises them.

Why the evidence set matters

AI security evidence serves three audiences, and it needs to work for all three:

The governance committee that issues the clearance decision. They need evidence that is organised enough to read, specific enough to act on, and complete enough to support a defensible decision.
External auditors and regulators. They will ask for the evidence that a risk management process existed and was followed. They are looking for proof of process, not perfection of outcome.
Post-incident investigators. When something goes wrong, the evidence set establishes what was known, what was controlled, and what was consciously accepted. It is the basis for determining whether the incident was a failure of process or a failure of execution.

The test of an evidence set is not whether it shows that nothing can go wrong. It is whether it shows that the right questions were asked, answered honestly, and recorded with enough specificity that the answers can be verified.

A completed checklist fails this test. Checking boxes demonstrates that the checklist was completed. It does not demonstrate that the threats were real, the controls were verified, or the residual risk was consciously accepted. The evidence artefacts described below provide the substance that makes the record defensible rather than merely complete.

Evidence requirements — what each artefact documents

Evidence item	What it documents	Who produces it	Shelf life
System description	What was reviewed: components, boundary, data flows, model spec, deployment context	System owner / AI architect	Superseded on each re-assessment; version and date required
Threat model	Risks identified, methodology used, likelihood/impact ratings, risk threshold, threats found below threshold	Security team	Valid while system scope and deployment context are unchanged
Control plan	Required controls with owner, lifecycle gate, verification method; control gap log with closure plans	Security team	Living document; updated when controls are verified or gaps closed
Risk disposition	Clearance decision, rationale, residual risk acceptances (named), re-assessment triggers, sign-off log	Governance lead + sign-off authority	Valid until a re-assessment trigger fires or is superseded
Evidence gaps log	Controls not yet verified: the gap, reason for acceptance, closure plan with a date	Security team	Closed when the gap is remediated; open gaps reference the disposition condition
Re-assessment trigger register	Specific conditions that invalidate the clearance; owner responsible for evaluating each trigger event	Governance lead	Reviewed when a potential trigger event occurs; updated on each re-assessment
Sign-off record	Name, role, status, date, and any caveats for each signatory to the disposition	Sign-off authority	Archived with the evidence pack; superseded on re-assessment
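The shelf-life column makes most of the pack conditional on the re-assessment trigger register. Below is a minimal Python sketch of that register. The `Trigger` structure, the event format and the three example conditions are hypothetical illustrations, not a prescribed schema. Each trigger pairs its stated condition and owner with a predicate, so evaluating a candidate event becomes a mechanical check rather than a judgement made from memory.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical event model: a proposed change to the assessed system, as key/value facts.
Event = Dict[str, str]


@dataclass(frozen=True)
class Trigger:
    """One row of a re-assessment trigger register (illustrative fields only)."""
    trigger_id: str
    condition: str                   # the specific condition that invalidates the clearance
    owner: str                       # who evaluates candidate trigger events
    fires: Callable[[Event], bool]   # predicate: does this event meet the condition?


def evaluate(register: List[Trigger], event: Event) -> List[str]:
    """Return the IDs of every trigger the event fires; any hit means re-assessment."""
    return [t.trigger_id for t in register if t.fires(event)]


# Illustrative register; these conditions are assumptions, not a recommended set.
REGISTER = [
    Trigger("T1", "Model provider or model version changes", "AI architect",
            lambda e: e.get("change") == "model_version"),
    Trigger("T2", "New external integration with write access", "Security lead",
            lambda e: e.get("change") == "integration" and e.get("access") == "write"),
    Trigger("T3", "User population extends beyond internal staff", "Governance lead",
            lambda e: e.get("change") == "user_population" and e.get("scope") != "internal"),
]

if __name__ == "__main__":
    fired = evaluate(REGISTER, {"change": "integration", "access": "write"})
    print(fired)  # ['T2']
    print("re-assessment required" if fired else "clearance still valid")
```

Encoding each condition as a predicate is a deliberate choice. It forces every trigger to be specific enough to test, which is the property the register needs.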

The complete evidence set

A defensible AI security review produces five categories of evidence:

System description — what was reviewed, in enough detail to reconstruct the scope.
Threat model document — what risks were identified, assessed, and prioritised.
Control verification evidence — that specified controls are actually in place.
Disposition record — the clearance decision, rationale, residual risk acceptance, and sign-off log.
Chain of custody — who produced each artefact, when, and from what inputs.

Each of these has a right version and a wrong version. The sections below describe both.

The system description

The system description is the primary input to the threat model. It must be specific enough that a competent reviewer who was not involved in building the system can understand its risk surface without additional context. Vague system descriptions produce vague threat models.

A defensible system description includes:

Functional description. What does the system do, in plain language? What is its primary purpose and its intended user population?
System boundary. Which components are in scope, which are explicitly excluded, and which baseline is referenced for each excluded component.
Data flows. What data enters the system, where it comes from, how it is processed, and where it goes. Data classification for each flow.
Model specification. Which model, which provider, which version, in what configuration (including system prompt summary).
External integrations. What external systems does the AI component connect to, and with what level of access?
Deployment context. User population, operational consequence, regulatory regime.
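One way to hold a system description to this standard is to give it a fixed schema and flag anything left empty. The sketch below assumes a hypothetical `SystemDescription` record, and its field names are illustrative. Integrations and exclusions are treated as optional. Where they are present, however, the sketch requires a baseline for each exclusion, a classification for each data flow and an access level for each integration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataFlow:
    source: str
    destination: str
    classification: str  # e.g. "public", "internal", "personal" (illustrative labels)


@dataclass
class SystemDescription:
    """Hypothetical schema mirroring the six elements above, plus version and date."""
    version: str
    date: str
    functional_description: str = ""
    in_scope: List[str] = field(default_factory=list)
    exclusions: Dict[str, str] = field(default_factory=dict)    # component -> referenced baseline
    data_flows: List[DataFlow] = field(default_factory=list)
    model_spec: str = ""                                         # model, provider, version, config
    integrations: Dict[str, str] = field(default_factory=dict)  # external system -> access level
    deployment_context: str = ""                                 # users, consequence, regime


# Elements that must never be empty; integrations and exclusions may legitimately be empty.
REQUIRED = ("version", "date", "functional_description", "in_scope",
            "data_flows", "model_spec", "deployment_context")


def missing_elements(desc: SystemDescription) -> List[str]:
    """List what a reviewer who did not build the system would still have to ask about."""
    gaps = [name for name in REQUIRED if not getattr(desc, name)]
    gaps += [f"baseline for excluded component '{c}'"
             for c, baseline in desc.exclusions.items() if not baseline]
    gaps += [f"classification for flow {d.source} -> {d.destination}"
             for d in desc.data_flows if not d.classification]
    gaps += [f"access level for integration '{s}'"
             for s, access in desc.integrations.items() if not access]
    return gaps


if __name__ == "__main__":
    desc = SystemDescription(
        version="1.2", date="2025-10-01",
        functional_description="Internal HR policy assistant for employees",
        in_scope=["chat UI", "retriever", "LLM gateway"],
        exclusions={"SSO provider": ""},
        data_flows=[DataFlow("HR wiki", "retriever", "internal")],
        deployment_context="Internal staff; advisory answers only",
    )
    print(missing_elements(desc))
    # ['model_spec', "baseline for excluded component 'SSO provider'"]
```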

The threat model document

The threat model is the evidence that the risks were identified systematically, not selectively. It must show the process, not just the conclusions. A list of threats with no methodology attached is not a threat model — it is a threat list. A reviewer looking at it cannot tell whether the list is comprehensive or whether important threats were missed because the modelling was incomplete.

A defensible threat model document contains:

Methodology reference. What framework was used to identify threats? STRIDE, OWASP LLM Top 10, OWASP Agentic Top 10, or a hybrid? Naming the methodology makes the threat list checkable.
Threat register. For each identified threat: a description, a likelihood assessment, an impact assessment, and a risk rating. The rating methodology should be stated once and applied consistently.
Risk threshold statement. What rating triggers a required control? This makes the link between the threat register and the control plan explicit and reviewable.
Threats assessed and found below threshold. Including risks that were assessed and found acceptable is as important as including risks that required controls. The absence of a threat from the control plan should be traceable to a below-threshold rating, not to an oversight.
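Once the rating methodology is fixed, the link between the threat register, the threshold and the control plan can be made mechanical. This sketch makes three assumptions: likelihood and impact are each scored on a 1 to 5 scale, the rating is likelihood multiplied by impact, and the threshold is 10. None of these values comes from the source. A real methodology should state its own values once and apply them consistently. The `Threat` record and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed scheme: likelihood and impact each scored 1-5, rating = likelihood * impact.
RISK_THRESHOLD = 10  # assumed: a rating at or above this requires a control


@dataclass(frozen=True)
class Threat:
    threat_id: str
    description: str
    source: str      # where the methodology surfaced it, e.g. "OWASP LLM01" or "STRIDE: Tampering"
    likelihood: int  # 1 (rare) .. 5 (almost certain)
    impact: int      # 1 (negligible) .. 5 (severe)

    @property
    def rating(self) -> int:
        if not (1 <= self.likelihood <= 5 and 1 <= self.impact <= 5):
            raise ValueError(f"{self.threat_id}: scores must be between 1 and 5")
        return self.likelihood * self.impact


def partition(register: List[Threat]) -> Tuple[List[Threat], List[Threat]]:
    """Split the register into threats that require controls and threats below threshold.

    Both lists belong in the threat model document: the second one explains why a
    threat is absent from the control plan.
    """
    required = [t for t in register if t.rating >= RISK_THRESHOLD]
    below = [t for t in register if t.rating < RISK_THRESHOLD]
    return required, below


if __name__ == "__main__":
    register = [
        Threat("T-01", "Indirect prompt injection via retrieved documents", "OWASP LLM01", 4, 4),
        Threat("T-02", "Retrieval of documents outside the user's organisation",
               "STRIDE: Information disclosure", 3, 5),
        Threat("T-03", "Error messages reveal the system prompt",
               "STRIDE: Information disclosure", 3, 2),
    ]
    required, below = partition(register)
    print([t.threat_id for t in required])  # ['T-01', 'T-02']
    print([t.threat_id for t in below])     # ['T-03']
```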

The threat model document is the most technically demanding artefact in the evidence set. It requires domain knowledge of both the attack surface of the AI system type (LLM, RAG, agentic) and the specific deployment context. A generic threat model that was not calibrated to the system's actual data flows and user population will be recognisably generic to an experienced reviewer.

Control verification evidence

The control plan describes the controls that should be in place. Control verification evidence shows that they are. These are different things, and a governance record that contains only the plan — with no verification evidence — has not demonstrated that anything was actually implemented.

For each required control, the verification evidence should demonstrate:

That the control exists. A pointer to the specific implementation: a code review reference, a configuration extract, a test result, an architecture review note.
That it does what it claims. The verification method from the control plan entry, with its output. “Access control test: queried for documents outside user organisation, result: empty response” is verification. “Implemented” is not.
The lifecycle gate status. Controls required before pilot have a different urgency from controls required before production. The evidence set should show the status at the time of the clearance decision.
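The distinction between a plan entry and verification evidence can be enforced for each control. In this hypothetical sketch, a control counts as verified only if it carries an implementation pointer, a verification method and a concrete output. Bare status words such as "Implemented" are rejected. The `Gate` values, the `BARE_CLAIMS` set and the field names are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Gate(Enum):
    PILOT = "before pilot"
    PRODUCTION = "before production"


@dataclass
class ControlEvidence:
    control_id: str
    gate: Gate
    implementation_ref: str   # code review, config extract, test result or architecture note
    verification_method: str  # copied from the control plan entry
    verification_output: str  # what running the method actually returned


# Assumed list of status words that assert a control exists without demonstrating it.
BARE_CLAIMS = {"implemented", "done", "in place", "yes", "complete"}


def is_verified(ev: ControlEvidence) -> bool:
    """True only with an implementation pointer, a method, and a concrete result."""
    output = ev.verification_output.strip().lower()
    return (bool(ev.implementation_ref.strip())
            and bool(ev.verification_method.strip())
            and bool(output) and output not in BARE_CLAIMS)


def gate_status(evidence: List[ControlEvidence], gate: Gate) -> Dict[str, bool]:
    """Verification status, at clearance time, of each control required before `gate`."""
    return {ev.control_id: is_verified(ev) for ev in evidence if ev.gate == gate}


if __name__ == "__main__":
    evidence = [
        ControlEvidence("C-01", Gate.PILOT, "retriever ACL config extract (illustrative)",
                        "Query for documents outside the user's organisation", "empty response"),
        ControlEvidence("C-02", Gate.PILOT, "output filter module (illustrative)",
                        "Prompt-injection test suite", "Implemented"),
        ControlEvidence("C-03", Gate.PRODUCTION, "", "Rate-limit load test", ""),
    ]
    print(gate_status(evidence, Gate.PILOT))  # {'C-01': True, 'C-02': False}
```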

The disposition record

The disposition record is the governance output of the review. It is the document that the sign-off authority signs, and it is the primary artefact a regulator or procurement auditor will ask for.

A defensible disposition record contains:

The clearance decision. Proceed, conditional, restricted pilot, hold, or decline. Not a narrative description of the decision — the decision itself, as a specific category.
The rationale. Two to four sentences naming the headline risk, the headline control, and the residual exposure that was accepted.
Required controls summary. A reference to the full control plan with a summary of the lifecycle gate status at the time of clearance.
Residual risk acceptance. For each risk accepted without full control closure: the risk description, the named acceptor (name and role), and the condition under which the acceptance holds.
Re-assessment triggers. Specific conditions that will invalidate the clearance and require a new review.
Sign-off log. Name, role, date, and status for each signatory.
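The six elements above translate directly into a record that can be checked before it goes to the sign-off authority. The sketch below is a hypothetical Python model. The class names, the sign-off status values and the sentence-counting heuristic for the rationale are all assumptions. A passing check shows only that the record is complete, not that its judgements are sound.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Decision(Enum):
    """The five clearance categories named above; a record carries exactly one."""
    PROCEED = "proceed"
    CONDITIONAL = "conditional"
    RESTRICTED_PILOT = "restricted pilot"
    HOLD = "hold"
    DECLINE = "decline"


@dataclass
class RiskAcceptance:
    risk: str
    acceptor_name: str
    acceptor_role: str
    condition: str  # the condition under which the acceptance holds


@dataclass
class SignOff:
    name: str
    role: str
    date: str
    status: str  # e.g. "approved" or "approved with caveats" (illustrative values)


@dataclass
class DispositionRecord:
    decision: Decision
    rationale: str
    control_plan_ref: str  # pointer to the full control plan and its gate status
    acceptances: List[RiskAcceptance] = field(default_factory=list)
    triggers: List[str] = field(default_factory=list)
    sign_offs: List[SignOff] = field(default_factory=list)


def defects(rec: DispositionRecord) -> List[str]:
    """Return the reasons a record falls short of the list above; empty means none found."""
    out = []
    if not isinstance(rec.decision, Decision):
        out.append("decision is not one of the five categories")
    # Crude sentence count, splitting on full stops; good enough for a sketch.
    sentences = [s for s in rec.rationale.split(".") if s.strip()]
    if not 2 <= len(sentences) <= 4:
        out.append("rationale is not two to four sentences")
    if not rec.control_plan_ref:
        out.append("no reference to the control plan")
    for a in rec.acceptances:
        if not (a.acceptor_name and a.acceptor_role):
            out.append(f'acceptance of "{a.risk}" has no named acceptor')
        if not a.condition:
            out.append(f'acceptance of "{a.risk}" has no holding condition')
    if not rec.triggers:
        out.append("no re-assessment triggers")
    if not rec.sign_offs:
        out.append("empty sign-off log")
    for s in rec.sign_offs:
        if not all((s.name, s.role, s.date, s.status)):
            out.append(f"incomplete sign-off entry for {s.name or 'unnamed signatory'}")
    return out


if __name__ == "__main__":
    rec = DispositionRecord(
        decision=Decision.CONDITIONAL,
        rationale=("Headline risk is cross-tenant retrieval. Tenant-scoped ACLs mitigate it. "
                   "Residual prompt-injection exposure is accepted for the pilot."),
        control_plan_ref="CP v3 (illustrative)",
        acceptances=[RiskAcceptance("Indirect prompt injection", "", "CISO",
                                    "pilot limited to internal users")],
        triggers=["model version change"],
        sign_offs=[SignOff("A. Example", "CISO", "2025-10-20", "approved")],
    )
    print(defects(rec))  # ['acceptance of "Indirect prompt injection" has no named acceptor']
```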

The evidence pack: tying it together

The evidence pack is the complete, assembled set of review artefacts for a specific system at a specific point in time. It is not a summary — it is the full record. When a regulator or auditor asks to see the evidence, the evidence pack is what you provide.

The structure of a complete evidence pack:

Disposition record (the clearance decision and governance output)
System description (the scope and boundary document)
Threat model document (the risk identification and assessment)
Control plan (the required controls with owners, gates, and verification methods)
Control verification evidence (the proof that controls are in place)
Open gaps log (managed control gaps with closure plans)
DPO advisory (where personal data is in scope)
External assessment references (pentest report, provider certifications)

Each item in the pack should be dated and versioned. The pack as a whole should carry the review date and the version of the system it was conducted against. When a re-assessment is triggered, the old pack is archived and a new pack is created. The new pack references the old one and notes what changed.
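Versioning and archiving can be modelled as immutable packs that point back to the pack they replace. In this sketch the `EvidencePack` and `Artefact` types are hypothetical. `CORE_ITEMS` mirrors the list above, with the DPO advisory required only when personal data is in scope. `reassess` refuses to create a new pack unless a note of what changed is supplied.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Artefact:
    name: str
    version: str
    dated: date


@dataclass(frozen=True)
class EvidencePack:
    system_version: str   # the version of the system the review was conducted against
    review_date: date
    artefacts: Tuple[Artefact, ...]
    supersedes: Optional["EvidencePack"] = None  # the archived pack this one replaces
    change_note: str = ""                         # what changed since the archived pack


# Items listed above; the DPO advisory is only required where personal data is in scope.
CORE_ITEMS = ("Disposition record", "System description", "Threat model document",
              "Control plan", "Control verification evidence", "Open gaps log",
              "External assessment references")


def missing_items(pack: EvidencePack, personal_data: bool) -> List[str]:
    """Return the expected items that the pack does not contain."""
    required = list(CORE_ITEMS) + (["DPO advisory"] if personal_data else [])
    present = {a.name for a in pack.artefacts}
    return [item for item in required if item not in present]


def reassess(old: EvidencePack, system_version: str, review_date: date,
             artefacts: Tuple[Artefact, ...], change_note: str) -> EvidencePack:
    """Create a new pack that references the archived one and states what changed."""
    if not change_note:
        raise ValueError("a new pack must note what changed since the previous one")
    return EvidencePack(system_version, review_date, artefacts, old, change_note)


def history(pack: Optional[EvidencePack]) -> List[str]:
    """Walk back through superseded packs, newest first."""
    versions = []
    while pack is not None:
        versions.append(f"{pack.system_version} ({pack.review_date.isoformat()})")
        pack = pack.supersedes
    return versions


if __name__ == "__main__":
    def build(version: str, when: date) -> Tuple[Artefact, ...]:
        return tuple(Artefact(name, version, when) for name in CORE_ITEMS)

    v1 = EvidencePack("1.0", date(2025, 3, 1), build("1", date(2025, 3, 1)))
    print(missing_items(v1, personal_data=True))  # ['DPO advisory']
    v2 = reassess(v1, "1.1", date(2025, 10, 1), build("2", date(2025, 10, 1)),
                  "Model version upgraded")
    print(history(v2))  # ['1.1 (2025-10-01)', '1.0 (2025-03-01)']
```

Because the dataclasses are frozen, an archived pack cannot be edited in place. Any change produces a new pack with an explicit link to its predecessor.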

Chain of custody

Chain of custody is the meta-evidence: who produced each artefact, from what inputs, and when. It is what allows a reviewer to verify that the system description that drove the threat model was based on the actual system, not on a notional design that had already changed.

A simple chain of custody record for each artefact:

Author — who produced this artefact and in what role.
Inputs — what documentation, system access, or interviews it was based on.
Date — when it was produced.
Review status — whether it was reviewed by a second party before being accepted into the evidence pack.

The chain of custody does not need to be a separate document. It can be a header section on each artefact. What matters is that it is present — because it is the element that converts a collection of documents into a defensible record with a clear relationship to the assessed system.
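A chain of custody header can be as small as the four elements listed above. The sketch below adds one assumption that the source does not make: each input is recorded with a SHA-256 digest. A later reviewer can then check whether the documentation an artefact was built from still matches the system as described. `CustodyHeader` and its fields are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional, Tuple


def digest(content: bytes) -> str:
    """SHA-256 of an input document, so the exact version used can be checked later."""
    return hashlib.sha256(content).hexdigest()


@dataclass(frozen=True)
class CustodyHeader:
    artefact: str
    author: str
    role: str
    inputs: Tuple[Tuple[str, str], ...]  # (input name, SHA-256 digest) pairs
    produced: date
    reviewed_by: Optional[str] = None    # second party who reviewed it before acceptance


def custody_gaps(h: CustodyHeader) -> List[str]:
    """Return the missing elements of the header; empty means all four are present."""
    gaps = []
    if not (h.author and h.role):
        gaps.append("author and role")
    if not h.inputs:
        gaps.append("inputs")
    if h.reviewed_by is None:
        gaps.append("second-party review")
    elif h.reviewed_by == h.author:
        gaps.append("reviewer is the author")
    return gaps


def inputs_unchanged(h: CustodyHeader, current: Dict[str, bytes]) -> bool:
    """True only if every recorded input still matches the current version of that document."""
    return all(name in current and digest(current[name]) == recorded
               for name, recorded in h.inputs)


if __name__ == "__main__":
    design_v7 = b"architecture v7: retriever -> gateway -> model"
    header = CustodyHeader("Threat model document", "J. Analyst", "Security architect",
                           (("architecture doc", digest(design_v7)),), date(2025, 10, 1))
    print(custody_gaps(header))  # ['second-party review']
    # The design has moved on since the threat model was written:
    print(inputs_unchanged(header, {"architecture doc": b"architecture v8: adds web search"}))  # False
```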

The gap between a completed checklist and a defensible evidence set is exactly this: the checklist shows that boxes were ticked. The evidence set shows who ticked them, against what system, based on what evidence, and with what sign-off. The latter is what survives scrutiny.


Produce a defensible evidence set, not just a checklist

Drel assembles the complete evidence pack — system description, threat model, control verification, and disposition record — in a form that survives a procurement audit or regulatory review.


A note on scope: Drel reviews assessed systems against documented architecture, configuration and intent. It does not ingest live telemetry from production environments. Dispositions reflect the assessed system at the time of review and the re-assessment triggers that govern when the disposition must be revisited.