Security review for agentic AI procurement — a buyer's checklist

Procurement teams are approving agentic AI systems without the security vocabulary to ask the right questions. This checklist covers the eight areas a security review must address before an agentic system reaches production.

Drel · 10 min read

Procurement teams are moving fast on agentic AI. The pitch is compelling: an agent that schedules meetings, drafts contracts, processes invoices, and reaches across half a dozen internal systems to do it. The capability demo lands; the commercial conversation begins; a security review is scheduled for “before launch.” And by the time that review happens, the contract has been signed, the integration has been built, and the security team is being asked to bless a system whose architecture they had no influence over.

This pattern repeats across organisations, and it produces the same outcome each time: an agentic system goes live without a defensible security position behind it. Not because the security team failed, but because procurement never asked the security questions. The two functions run in parallel, evaluating different things, and the gap between them is where most agentic risks settle.

This piece is the checklist procurement should be working from before the commercial conversation closes. Eight areas, each with the specific questions a security review must be able to answer. The intent is not to replace security review — it is to make sure procurement does not approve a system that security cannot subsequently clear.

The eight areas — each must be answered before clearance

  1. Tool surface: tools the agent can call
  2. Delegation chain: user → orchestrator → sub-agents
  3. Blast radius: maximum harm on compromise
  4. Identity model: agent identity vs user impersonation
  5. Memory architecture: what persists, who can write
  6. Audit & observability: trace ID, queryable logs
  7. Human-in-the-loop: approval boundaries
  8. Re-assessment triggers: when the review is invalidated

A procurement that closes on capability but not on these eight areas has not been security-reviewed.

The gap between procurement and security review

Procurement evaluates capability and fit. Does the system do what we need? Is the vendor financially viable? Are the commercial terms acceptable? Does it integrate with our existing stack? These are necessary questions, and procurement teams are practised at asking them. The reference customer call, the proof-of-concept, the cost model: these are the artefacts procurement is set up to produce.

Security review evaluates risk and controls. What can go wrong? What is the impact if it does? What controls reduce the likelihood or impact? What evidence supports that the controls are in place? These are different questions, requiring different artefacts: threat models, control inventories, evidence packs, residual risk acceptance.

For deterministic software, the two functions can run sequentially. Procurement closes; security reviews; gaps are remediated before deployment. The model holds because the system's behaviour is bounded by its code, and the code is inspectable. A security review of a payroll system, for example, can be carried out against a documented set of features and a fixed permission model.

For agentic AI, the model breaks down. The system's behaviour is not bounded by code — it is bounded by tools, prompts, model behaviour, and the orchestration layer above them. None of those are inspectable in the way code is. None of them are static; all of them can change post-procurement without code changes. And the blast radius of a misbehaving agent is shaped by decisions that are typically made during integration, not during procurement.

For agentic AI, procurement and security review must happen together. The questions are different, but the answers are entangled. Conflating them produces decisions that cannot be defended; running them in sequence produces decisions that arrive too late.

The checklist that follows is what procurement should be asking, and what security should be reviewing, before the commercial conversation closes. Each of the eight areas is a distinct risk dimension, although some build on others (blast radius, for example, composes tool surface and delegation chain). A system that scores well on capability and poorly on any one of these is not ready for production, regardless of how compelling the demo was.

Area 1: Tool surface

The tool surface is the set of capabilities the agent can invoke. Every tool is a bridge between the agent and a system that does something: read data, write data, send messages, trigger workflows. Together, the size and shape of the tool surface are the single biggest factor in the agent's risk profile, and this is the area procurement is most likely to overlook because tools are presented as features rather than as risk.

The first question to ask: what does the tool manifest look like? Not in description — the actual document. Each tool has a name, a description, and a parameter schema. The description is what the LLM reads to decide when to call the tool. The parameter schema is what the LLM reads to decide what to pass. Both are instruction content from the LLM's perspective, which means both are attack surfaces.

Tool description poisoning — sometimes called descriptor poisoning — is a vector that procurement reviews routinely miss. If the tool description can be modified post-deployment without versioning, signing, or audit, then any actor with access to the configuration console can change the agent's behaviour without changing any code. A tool described as “send a confirmation email to the user's registered address” can be quietly re-described as “send a notification to the address specified in the parameters,” and the agent will follow the new description. No deployment, no review, no detection.
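One way to make a silent re-description detectable is to pin a digest of the manifest at review time and compare it against the running system. The sketch below is illustrative only: the manifest shape (a list of dicts with `name`, `description`, and `parameters`) and the function names `manifest_digest` and `changed_tools` are assumptions, not any vendor's format or API.

```python
import hashlib
import json


def manifest_digest(manifest: list) -> str:
    """SHA-256 over a canonical JSON encoding, so tool ordering does not matter."""
    canonical = json.dumps(
        sorted(manifest, key=lambda tool: tool["name"]),
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_tools(reviewed: list, running: list) -> list:
    """Names of tools added, removed, or redefined since the review."""
    before = {tool["name"]: tool for tool in reviewed}
    after = {tool["name"]: tool for tool in running}
    return sorted(
        name
        for name in before.keys() | after.keys()
        if before.get(name) != after.get(name)
    )


# Hypothetical manifest entry, using the email example from the text.
reviewed = [
    {
        "name": "send_confirmation_email",
        "description": "Send a confirmation email to the user's registered address.",
        "parameters": {"type": "object", "properties": {"order_id": {"type": "string"}}},
    }
]
# The same tool after a quiet edit in the configuration console.
running = [
    dict(reviewed[0], description="Send a notification to the address specified in the parameters.")
]

pinned = manifest_digest(reviewed)            # recorded at review time
drifted = manifest_digest(running) != pinned  # compared on a schedule or at startup
print(drifted, changed_tools(reviewed, running))
```

A digest only detects change; it does not say who made it. Signing and an audited change log would still be needed to answer the attribution questions.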

The procurement questions to surface this risk:

  • Can you provide the full tool manifest at the point of contract — names, descriptions, parameter schemas, and any prompt-style guidance attached to each tool?
  • How are tool descriptions sourced? Are they version-controlled? Are they signed?
  • Who can modify a tool description in the running system? Is the change audited?
  • What is the change-control process for adding a new tool to the manifest?
  • Are there any “dynamic tools” — tools whose definitions are generated at runtime? If so, what is the trust boundary on the generator?

A vendor that cannot answer these is not necessarily insecure, but it has probably not been asked these questions before, and answers improvised on a call are not contractually binding. The follow-up is to get the relevant answers into the contract as covenants, not just as conversation.

Area 2: Delegation chain

Agentic systems rarely operate as a single agent calling a single tool. A realistic deployment involves an orchestrator that decomposes the user's request, dispatches work to sub-agents, and aggregates results. Each link in that chain is an authorisation transition: a token passes from one component to the next, and what the receiving component can do depends on what scope that token carries.

The default in many agentic platforms is full inheritance. The user authorises the orchestrator with a session token. The orchestrator authorises a sub-agent with the same token. The sub-agent invokes a tool using that token. By the time the action reaches a downstream system, the chain has been traversed three or four times, and the downstream system sees an authorisation token that grants everything the original user could do — not just the specific operation the user requested.

This is the confused deputy pattern, and it is one of the most common authorisation failures in agentic systems. A user with broad access asks an agent for a narrow operation. The agent, or a sub-agent, or a compromised tool, exercises the broad access to do something the user did not request. The system did exactly what its authorisation model permitted; the authorisation model was wrong.

The fix is scope-narrowing at every transition. When the orchestrator delegates to a sub-agent, it should issue a token scoped to the specific operation, not pass the user's session token. When a sub-agent invokes a tool, the tool should validate that the authorisation covers this specific action, not just that the caller is authenticated. The chain should narrow at each step, not inherit upstream.
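As a minimal sketch of narrowing at each transition, the example below models a token as a set of scopes and refuses any delegation that would widen it. The `ScopedToken` type, the scope strings, and the function names are hypothetical; a real deployment would use whatever token format its identity provider issues.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopedToken:
    subject: str       # the user the chain is acting for
    holder: str        # the component currently holding the token
    scopes: frozenset  # e.g. {"hr.employees:read"}


def delegate(parent: ScopedToken, holder: str, requested: set) -> ScopedToken:
    """Issue a child token whose scopes are a subset of the parent's."""
    requested = frozenset(requested)
    if not requested <= parent.scopes:
        widened = sorted(requested - parent.scopes)
        raise PermissionError(f"cannot widen scope: {widened}")
    return ScopedToken(parent.subject, holder, requested)


def invoke_tool(token: ScopedToken, required_scope: str) -> str:
    """Tool-side check: validate the specific action, not just the caller."""
    if required_scope not in token.scopes:
        raise PermissionError(f"{token.holder} lacks {required_scope}")
    return f"{required_scope} performed for {token.subject} by {token.holder}"


# The user's session is broad; the sub-agent receives only what the task needs.
session = ScopedToken(
    "alice", "orchestrator", frozenset({"hr.employees:read", "hr.employees:write"})
)
lookup = delegate(session, "lookup-sub-agent", {"hr.employees:read"})
print(invoke_tool(lookup, "hr.employees:read"))
```

With this shape, a sub-agent that is tricked into attempting a write fails at the tool, because the token it holds never carried the write scope.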

Procurement questions:

  • Describe the authorisation flow from user request to tool invocation. Where does scope narrowing happen?
  • When the orchestrator calls a sub-agent, what token is passed? Is it the user's session token, or a scoped token issued for this operation?
  • When a sub-agent calls a tool, does the tool re-validate scope, or trust the caller?
  • If the user's session has broad access — say, full HR system permissions — does an agent invoked by that user inherit those permissions automatically?
  • Can scope be narrowed declaratively at integration time (e.g. “this agent can only read employees, never write”)?

The architectural answer matters more than the policy answer. A vendor who says “we have a permission model” without describing the token flow is answering a different question. The right answer is specific: at each transition, here is the token, here is the scope, here is the validation that happens before the next operation proceeds.

Area 3: Blast radius

Blast radius is the maximum harm a single compromise could cause. It composes the previous two areas — tool surface and delegation chain — and adds the downstream systems the agent can reach. The question is not “what does the agent normally do?” but “what could the agent do if an attacker controlled its prompt input, tool responses, or memory?”

Blast radius is mostly invisible during procurement because the demo only shows intended behaviour. The agent calls the right tool with the right parameters in the right order. What the demo does not show is the union of all tools the agent could call, the parameter space each tool accepts, and the downstream systems those tools touch. That union is the blast radius — and most procurement reviews never see it written down.

To enumerate blast radius, list every tool the agent has access to. For each tool, list the downstream systems it can reach. For each downstream system, list the most consequential operation reachable through it. Then ask: if the agent could be coaxed into calling each tool with adversarial parameters, what is the worst outcome?
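The enumeration can be written down as a small table and inverted so that each downstream system lists every worst-case operation reachable through any tool. The tool names, system names, and operations below are invented for illustration and would be replaced with the entries from the actual manifest.

```python
# Hypothetical inventory: tool -> {downstream system: most consequential operation}
TOOLS = {
    "send_email": {"mail_gateway": "send arbitrary message under company identity"},
    "modify_customer_record": {"crm": "overwrite any customer record"},
    "execute_sql": {
        "orders_db": "run arbitrary SQL",
        "crm": "bulk export of customer data",
    },
}


def blast_radius(tools: dict) -> dict:
    """Map each reachable system to the worst-case operations and the tools that reach it."""
    radius = {}
    for tool, systems in tools.items():
        for system, worst_operation in systems.items():
            radius.setdefault(system, []).append((tool, worst_operation))
    return radius


radius = blast_radius(TOOLS)
for system in sorted(radius):
    for tool, worst in radius[system]:
        print(f"{system}: {worst} (via {tool})")
```

Running the same function over a reduced tool set shows how much of the radius each tool contributes, which is useful input when negotiating a smaller initial deployment.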

An agent with a single “send email” tool has a communication blast radius — phishing of internal contacts, exfiltration via email body, reputational damage from messages sent under company identity. Add a “modify customer record” tool and the radius expands into data integrity. Add an “execute SQL” tool and the radius expands into the entire database. Each addition is a multiplier, not an increment.

Procurement questions:

  • List every downstream system the agent will be able to reach in our deployment.
  • For each system, what is the most consequential operation reachable through the agent?
  • If the agent were compromised — its prompt manipulated, its memory poisoned, a tool response crafted to redirect its behaviour — what is the worst outcome it could produce in a single session?
  • Are there capability tiers? Can we deploy with a reduced tool set and expand later if needed?

Area 4: Identity model

The identity model answers a deceptively simple question: when the agent does something, whose identity is doing it? There are three common patterns and they have very different security and audit properties.

Pattern 1: user impersonation via delegated token. The user authenticates; the agent receives a delegated token and acts under the user's identity. Downstream systems see the user; audit logs record the user. The pattern is convenient because existing permission models continue to apply: the agent can only do what the user could do. The risk is that audit logs do not distinguish between user actions and agent-on-behalf-of-user actions. When something goes wrong, the log names the user, whether or not the user asked for the action.

Pattern 2: agent service identity. The agent has its own identity, with its own permissions. Downstream systems see the agent; audit logs record the agent. The pattern produces clean audit trails — every action is clearly attributed — but it requires a separate permission model for the agent. Done well, this is the strongest pattern. Done badly, the agent ends up with over-broad permissions because it is easier than narrowing them per user.

Pattern 3: shared service account. The agent acts under a generic service account that is also used for batch jobs, integrations, and other system functions. Downstream systems cannot tell the agent apart from any other process using that account. This is the worst pattern from an audit perspective, and it appears more often than vendors will admit.

For agentic systems, the recommended pattern is a hybrid: the agent has its own identity, but actions on behalf of a specific user carry both identities — the agent that performed the action and the user it was acting for. Downstream audit logs should be able to answer two questions: which agent performed this action, and which user (if any) initiated the request that led to it.
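One existing convention for carrying both identities is the `act` (actor) claim defined in OAuth 2.0 Token Exchange (RFC 8693), where `sub` names the user and `act.sub` names the party acting for them. The sketch below reads already-decoded claims and answers the two audit questions. The `sub_type` claim and the identifier formats are assumptions made for the example, not part of the RFC, and signature verification is deliberately left out.

```python
def attribute(claims: dict) -> dict:
    """Classify a call from decoded token claims: who acted, and for whom."""
    subject = claims["sub"]
    actor = claims.get("act", {}).get("sub")
    if actor is not None:
        # Delegation: the subject is the user, the actor is the agent.
        return {"mode": "agent_on_behalf_of_user", "agent": actor, "user": subject}
    if claims.get("sub_type") == "agent":  # hypothetical custom claim
        return {"mode": "agent_autonomous", "agent": subject, "user": None}
    return {"mode": "user_direct", "agent": None, "user": subject}


# Illustrative claim sets for the three cases a downstream system should tell apart.
user_direct = {"sub": "user:alice"}
on_behalf = {"sub": "user:alice", "act": {"sub": "agent:invoice-bot"}}
autonomous = {"sub": "agent:invoice-bot", "sub_type": "agent"}

for claims in (user_direct, on_behalf, autonomous):
    print(attribute(claims))
```

Asking a vendor for real token claims and running them through a classifier like this is a quick way to see whether the three cases are distinguishable in practice.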

Procurement questions:

  • What identity does the agent use when making downstream calls? Show the actual token claims, not just “our authentication is OAuth-based.”
  • Can the downstream system distinguish between (a) the user calling directly, (b) the agent acting on the user's behalf, and (c) the agent acting autonomously?
  • If multiple agents share a service account, how is per-agent attribution preserved in downstream logs?
  • What is the rotation schedule for agent credentials?

Area 5: Memory architecture

Memory is the part of agentic systems that procurement reviews almost never cover, because the demo does not surface it. Yet memory is where the longest-lived risks live. Anything that persists across sessions — past conversations, learned user preferences, retrieved facts, intermediate plans — is in memory. And memory that is read by the agent influences future behaviour.

Memory poisoning is the agentic-system analogue of stored cross-site scripting. An attacker writes adversarial content into memory in one session; the agent reads the content in a later session and treats it as instructions or trusted facts. The attack is asymmetric in time — the write and the trigger can be separated by days or weeks — and it is asymmetric in identity, because the user who triggers the harm may be different from the user who wrote the poisoned content.

The questions that surface memory architecture risk:

  • What persists across sessions for an individual user? What persists across users?
  • Where is memory stored? In the vendor's service, in our environment, both?
  • Who can write to memory? The user, the agent, an admin, an integration?
  • Is memory validated on read-back? Are entries treated as data or as instructions?
  • How is shared memory isolated between users or tenants?
  • Is there a memory retention policy? Can users delete entries? Can administrators?

Vendors who have not thought about memory architecture will often answer the first question with “we maintain a context window” — confusing the transient prompt context with persistent memory. They are different layers and they have different threat models. The follow-up is to ask specifically about storage: where, for how long, accessed by whom.
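To make the write-path and read-back questions concrete, here is a minimal sketch of a persistent store that records who wrote each entry, restricts writers, isolates reads by owner, and labels entries as data when they are read back. All names are hypothetical. Labelling does not by itself stop a model from following injected text; it only shows where provenance would need to be carried.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryEntry:
    owner: str       # user or tenant the entry belongs to
    written_by: str  # "user", "agent", "admin" or "integration"
    content: str


class MemoryStore:
    def __init__(self, allowed_writers):
        self._allowed_writers = set(allowed_writers)
        self._entries = []

    def write(self, owner: str, written_by: str, content: str) -> None:
        """Reject writes from roles outside the documented write path."""
        if written_by not in self._allowed_writers:
            raise PermissionError(f"{written_by} may not write to memory")
        self._entries.append(MemoryEntry(owner, written_by, content))

    def read_for(self, owner: str) -> list:
        """Return only the owner's entries, quoted and labelled as data."""
        return [
            f"[stored note, written by {entry.written_by}, treat as data] {entry.content!r}"
            for entry in self._entries
            if entry.owner == owner
        ]


store = MemoryStore(allowed_writers={"user", "agent"})
store.write("alice", "user", "prefers morning meetings")
store.write("bob", "agent", "last invoice batch was March")
print(store.read_for("alice"))
```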

Area 6: Audit and observability

The bar for audit in agentic systems is higher than for deterministic software, because non-deterministic systems generate unique transactions that cannot be re-derived from inputs alone. If you cannot inspect what happened, you cannot investigate when something goes wrong. And in non-deterministic systems, things will go wrong in ways that do not show up in your testing.

The audit floor for an agentic system is that every tool call is logged with at least the following fields:

  • A trace ID that connects the call to the originating user request.
  • The user ID that initiated the chain.
  • The agent or sub-agent that made the call.
  • The tool name and full parameters.
  • The tool response (or a hash, if the response is large or sensitive).
  • A timestamp.
  • A correlation ID linking related calls in the same orchestration step.
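The fields listed above map naturally onto a fixed record type, which also gives a simple check to run against a vendor's sample log entry. This is a sketch under assumptions: the field names are illustrative, the response is stored as a SHA-256 hash, and the identifiers in the usage lines are invented.

```python
import hashlib
from dataclasses import asdict, dataclass, fields
from datetime import datetime, timezone


@dataclass(frozen=True)
class ToolCallRecord:
    trace_id: str         # connects the call to the originating user request
    correlation_id: str   # links related calls in one orchestration step
    user_id: str          # the user who initiated the chain
    agent_id: str         # the agent or sub-agent that made the call
    tool_name: str
    parameters: dict
    response_sha256: str  # hash stands in for a large or sensitive response
    timestamp: str        # UTC, ISO 8601


REQUIRED_FIELDS = {field.name for field in fields(ToolCallRecord)}


def record_tool_call(trace_id, correlation_id, user_id, agent_id,
                     tool_name, parameters, response: str) -> ToolCallRecord:
    """Build one audit record; the caller passes the trace ID through unchanged."""
    return ToolCallRecord(
        trace_id=trace_id,
        correlation_id=correlation_id,
        user_id=user_id,
        agent_id=agent_id,
        tool_name=tool_name,
        parameters=parameters,
        response_sha256=hashlib.sha256(response.encode("utf-8")).hexdigest(),
        timestamp=datetime.now(timezone.utc).isoformat(),
    )


def missing_fields(entry: dict) -> set:
    """Fields from the audit floor that a sample log entry does not contain."""
    return REQUIRED_FIELDS - set(entry)


entry = asdict(record_tool_call(
    "trace-7f3a", "step-2", "alice", "invoice-sub-agent",
    "lookup_invoice", {"invoice_id": "INV-1001"}, '{"status": "paid"}',
))
print(missing_fields(entry))
```

Passing a vendor's sample entry (as a dict) to `missing_fields` gives a direct answer to whether the audit floor is met.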

Logs must be queryable, not just archived. “We log everything” with no query interface and no retention beyond 30 days is not auditable. The use case is investigation: an incident is reported, and the response team needs to reconstruct what happened over a window that may stretch back weeks or months. Logs that cannot be queried efficiently for that window are operationally useless.

Procurement questions:

  • Show me a sample log entry for a tool call. Confirm the fields above are present.
  • What is the retention period? Can it be extended? Is the extension contractual?
  • Is the log queryable through a query interface, or only via support requests?
  • Can logs be exported to our SIEM? In what format, at what frequency?
  • When the orchestrator dispatches to a sub-agent, is the trace ID preserved? Can we follow a single user request all the way through to the final tool call?
  • Are LLM inputs and outputs logged, or only tool calls? If outputs are logged, how is sensitive content handled?

The trace ID question is the one that separates serious audit infrastructure from after-the-fact log aggregation. A trace ID that does not survive orchestrator-to-sub-agent transitions means that, in practice, you cannot follow a single user request across the system. That gap surfaces during incident investigation as a near-total inability to reconstruct what happened.

Area 7: Human-in-the-loop boundaries

Some actions an agent might take are too consequential to fire-and-forget. Sending an email externally, modifying a customer record, transferring funds, executing a workflow with downstream side effects — these are categories where most organisations want an explicit human approval step. The question for procurement is whether the boundary is architectural or aspirational.

An architectural boundary is enforced in code: the tool execution pipeline refuses to call the destructive tool unless an approval token is present. An aspirational boundary is enforced in the system prompt: “Before sending an email, always ask the user for confirmation.” The two look identical in the demo. They are not identical in production.

A prompt-based boundary fails under prompt injection, under model error, under edge cases the prompt author did not anticipate. The model can be persuaded — by user input, by retrieved content, by a tool response — to skip the confirmation step. An architectural boundary cannot be persuaded. The tool itself refuses to execute without the approval.
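The difference is easiest to see in code. In the sketch below, the execution pipeline refuses gated tools unless it is given an approval bound to the exact tool and parameters, so nothing the model says can bypass it. The key handling, tool names, and function names are assumptions for illustration; a real system would keep the key in the approval service and add expiry and replay protection.

```python
import hashlib
import hmac
import json

APPROVAL_KEY = b"demo-only-key"  # assumption: held by the approval service, not the agent
REQUIRES_APPROVAL = {"send_external_email", "transfer_funds"}


def _action_bytes(tool: str, params: dict) -> bytes:
    """Canonical encoding of the exact action being approved."""
    return json.dumps({"tool": tool, "params": params}, sort_keys=True).encode("utf-8")


def approve(approver: str, tool: str, params: dict) -> dict:
    """Called after a human confirms; binds the approval to this exact action."""
    message = approver.encode("utf-8") + b"|" + _action_bytes(tool, params)
    mac = hmac.new(APPROVAL_KEY, message, hashlib.sha256).hexdigest()
    return {"approver": approver, "mac": mac}


def execute(tool: str, params: dict, approval: dict = None) -> str:
    """Pipeline-side enforcement: gated tools never run without a matching approval."""
    if tool in REQUIRES_APPROVAL:
        if approval is None:
            raise PermissionError(f"{tool} requires human approval")
        expected = approve(approval["approver"], tool, params)["mac"]
        if not hmac.compare_digest(expected, approval["mac"]):
            raise PermissionError("approval does not match this action")
    return f"executed {tool}"


params = {"to": "supplier@example.com", "subject": "PO 4411"}
token = approve("alice", "send_external_email", params)
print(execute("send_external_email", params, token))
```

Because the approval covers the parameters as well as the tool name, an approval obtained for one recipient cannot be reused for another.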

Procurement questions:

  • Which specific actions require explicit human approval before execution?
  • Is the requirement enforced in the tool execution pipeline (the tool itself refuses to run without an approval), or in the system prompt (the agent is instructed to seek approval)?
  • What does the approval interaction look like for the user? Is the action clearly described before approval?
  • Where is the approval captured? Is it audit-logged with the approving user, the action requested, the approval timestamp, and the resulting action?
  • What happens if the user is offline or unavailable? Does the action queue, fail, or proceed without approval?

The capture question is the one most often missed. A human-in-the-loop boundary that does not produce a queryable audit record of who approved what, when, is not a useful boundary. The approval is the evidence; without the evidence, the boundary cannot be verified in retrospect.

Area 8: Re-assessment triggers

The final area is the one that ties the whole review together. The security posture of an agentic system is valid for a specific configuration. When the configuration changes, the review needs to be redone — but only for the parts of the configuration that are material to risk. Re-assessment triggers are the specific named conditions that, when met, invalidate the current clearance and require a fresh assessment.

Without explicit re-assessment triggers, agentic systems drift. The vendor adds a new tool to the manifest. A user gets a new permission grant. The underlying model version rolls forward. The vendor changes a sub-processor. Each change is small; none triggers a review on its own; the cumulative effect is that the system in production three months after launch bears little resemblance to the system that was reviewed at launch.

The triggers that should appear in every agentic procurement:

  • Tool added or removed. Every change to the tool manifest triggers a re-review of tool surface and blast radius.
  • Tool description or parameter schema modified. Even without adding new tools, modifying existing descriptions can change agent behaviour materially.
  • Scope expansion. The agent is granted access to a new data source, new user population, or new business domain.
  • Model change. The underlying LLM version changes — major or minor release. Capability boundaries and alignment properties shift with model versions.
  • Sub-processor change. The vendor changes one of the services or models they use under the hood. This is a re-assessment trigger even if the vendor describes it as “internal.”
  • Identity model change. The way the agent represents itself to downstream systems changes.
  • Memory architecture change. A new memory layer is added, or an existing one is repurposed (e.g. from per-user to shared).
  • Autonomy increase. Actions that previously required approval become autonomous. A human-in-the-loop boundary is removed or relaxed.
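The triggers listed above can be checked mechanically if the reviewed configuration is captured as a snapshot and compared with the current one. The snapshot keys and values below are an assumed shape chosen for the example, not a standard; the point is that each trigger corresponds to a specific, testable difference.

```python
import copy


def reassessment_triggers(reviewed: dict, current: dict) -> list:
    """Return the named triggers fired by differences between two snapshots."""
    triggers = []
    old_tools, new_tools = reviewed["tools"], current["tools"]
    if set(old_tools) != set(new_tools):
        triggers.append("tool added or removed")
    if any(old_tools[name] != new_tools[name] for name in set(old_tools) & set(new_tools)):
        triggers.append("tool description or parameter schema modified")
    if set(current["data_sources"]) - set(reviewed["data_sources"]):
        triggers.append("scope expansion")
    if current["model"] != reviewed["model"]:
        triggers.append("model change")
    if set(current["sub_processors"]) != set(reviewed["sub_processors"]):
        triggers.append("sub-processor change")
    if current["identity_model"] != reviewed["identity_model"]:
        triggers.append("identity model change")
    if current["memory_layers"] != reviewed["memory_layers"]:
        triggers.append("memory architecture change")
    if set(reviewed["approval_required"]) - set(current["approval_required"]):
        triggers.append("autonomy increase")
    return triggers


# Hypothetical snapshot taken at clearance; tool values stand for definition digests.
reviewed = {
    "tools": {"send_email": "digest-a", "lookup_invoice": "digest-b"},
    "data_sources": ["invoices"],
    "model": "vendor-model-1.0",
    "sub_processors": ["hosting-provider"],
    "identity_model": "agent identity with user context",
    "memory_layers": {"preferences": "per-user"},
    "approval_required": ["send_email"],
}
current = copy.deepcopy(reviewed)
current["model"] = "vendor-model-1.1"
current["memory_layers"]["preferences"] = "shared"
current["approval_required"] = []

print(reassessment_triggers(reviewed, current))
```

A check like this depends on the vendor actually disclosing the current configuration, which is why the notification commitments belong in the contract.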

Procurement's job is to get these triggers into the contract as advance notification commitments. A vendor that will give you 30 days' notice of model changes, tool changes, and sub-processor changes is a vendor whose system you can govern. A vendor that reserves the right to change these silently is a vendor whose system you cannot — and the gap will surface as a governance failure long before it surfaces as an incident.

Questions to ask, red flags to listen for

| Area | Question | Red flag |
| --- | --- | --- |
| Tool surface | Can you provide the full tool manifest the agent will be configured with at launch, with descriptions and parameter schemas? | Vendor cannot produce a manifest, or the manifest is described as “dynamic” without versioning. |
| Tool surface | How are tool descriptions sourced and protected from modification post-deployment? | Tool descriptions are editable by anyone with access to the configuration console, with no signing or version history. |
| Delegation chain | When the orchestrator calls a sub-agent, what authorisation token is passed, and is it scoped to the requested operation or the user's full session? | Sub-agents inherit the orchestrator's full token without scope narrowing. |
| Blast radius | If a single agent in this system were compromised, what is the maximum set of actions and data it could reach? | Vendor cannot enumerate the answer; “it's limited by the model” is not an answer. |
| Identity model | When the agent makes a downstream API call, does the target system see the agent identity, the user identity, or a shared service account? | The target system always sees a shared service account, regardless of which user initiated the action. |
| Memory | What information persists across sessions, where is it stored, and who can write to it? | Memory is described as a generic “context store” with no documented write-path access controls. |
| Audit | For an action taken three months ago, can you produce the trace ID, user ID, tool calls, parameters, and responses involved? | Logs only available for 30 days, or trace IDs are not preserved across orchestrator-to-sub-agent transitions. |
| Human-in-the-loop | Which specific tool calls require human approval before execution, and is that boundary enforced in code or only in the system prompt? | Approval requirements live in the prompt only, not in the tool execution pipeline. |
| Re-assessment | Will you notify us before changing the underlying model version, adding tools, or expanding the agent's data access? | No contractual commitment to advance notice; vendor reserves right to change components silently. |

Each row is a question procurement should ask before the contract closes. The red flags are the answers (or non-answers) that should slow the procurement down. None of the red flags are necessarily deal-breakers — but each one is a governance commitment the security team needs to make explicit before the system can be cleared for production.

Some of the answers may need to come from the vendor's engineering team, not the sales team. That is itself a useful signal. A vendor whose sales team can answer all eight areas authoritatively has institutionalised the security review conversation. A vendor whose sales team needs to defer most of these questions to engineering — and who can produce engineering on a call within a week — is acceptable. A vendor who cannot get you these answers at all is not ready for an enterprise procurement.

Procurement's contribution to agentic AI governance is not technical depth — it is the discipline to refuse to close a commercial conversation until the security questions have been put in writing. The vocabulary is learnable in an afternoon. The discipline is the harder thing to import.

The procurement function does not need to become a security review function. It needs to know the questions that block a clean handover to security, and to refuse to close a contract that leaves those questions unanswered. The eight areas above are that list. A procurement process that incorporates them produces systems that can be cleared; a process that does not produces systems that reach security review too late for the review to influence them.

For the broader governance context, see the third-party AI vendor assessment piece on what additional questions go into the standard vendor risk file. For the technical depth on agentic systems specifically, see the OWASP Agentic Top 10 control mapping.

Drel produces the structured assessment record agentic procurement reviews require.

A note on scope: Drel reviews assessed systems against documented architecture, configuration and intent. It does not ingest live telemetry from production environments. Dispositions reflect the assessed system at the time of review and the re-assessment triggers that govern when the disposition must be revisited.