
Mapping the agentic AI attack surface

The agentic AI attack surface has five distinct layers: the prompt channel, the tool surface, the memory layer, the orchestration boundary, and the output channel. This piece maps each layer with its associated threats and controls.

Drel Research · 1 September 2024 · 12 min read

When security teams ask where to focus their agentic AI review effort, the answer is rarely a single component. The attack surface of an agentic system is distributed across five layers, each with distinct threat patterns and distinct controls. A review that covers only one or two of them leaves significant risk unaddressed.

This piece maps all five layers, the threats that operate at each, and the review questions that must be answered for a complete assessment. It is designed to be used alongside the agentic AI security review framework.

Agentic AI attack surface — five areas

Tool manifest

The set of tools the agent can invoke at runtime — each tool is a potential action an attacker can trigger via a hijacked reasoning loop.

Key control

Scope the manifest to the minimum tools required for the specific deployment task; remove all speculative or development-time tools before production.

Memory / context

All persistent context: in-session conversation history, external vector stores, and episodic summaries — each a write surface for poisoning attacks.

Key control

Enforce session isolation; validate episodic memory entries before storage; restrict write access to external memory to authorised ingestion pipelines only.

Planning loop

The observe-plan-act cycle where the model decides what to do next — influenceable by any content that enters the context window, regardless of source.

Key control

Anchor goals explicitly in the system prompt with displacement-resistance language; treat all non-system-prompt content as untrusted data.

Agent-to-agent communication

Messages between orchestrators and workers in multi-agent systems — frequently treated as implicitly trusted but exploitable via compromised workers.

Key control

Treat all inter-agent messages as untrusted inputs; enforce capability delegation at task scope, not full orchestrator scope.

Human approval boundary

The gate between agent-proposed actions and real-world execution — only a genuine control when enforced at the infrastructure layer, not just in the UI.

Key control

Enforce approval gates at the tool or gateway layer independently of model reasoning; default timeouts to cancel, not proceed.
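The approval-gate control above can be sketched as a small gateway-side check. This is a minimal illustration, not a reference implementation: `ProposedAction`, `await_approval`, `execute_if_approved` and the queue-based reviewer channel are all hypothetical names chosen for the example. The point it shows is that silence from the reviewer resolves to cancel, and that the decision is made outside the model's reasoning.

```python
import queue
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ProposedAction:
    tool: str
    arguments: dict


def await_approval(decisions: "queue.Queue[bool]", timeout_s: float) -> bool:
    """Block until a reviewer decides; treat silence as a refusal."""
    try:
        return decisions.get(timeout=timeout_s)
    except queue.Empty:
        return False  # a timeout defaults to cancel, never to proceed


def execute_if_approved(
    action: ProposedAction,
    decisions: "queue.Queue[bool]",
    timeout_s: float,
    run_tool: Callable[[ProposedAction], Any],
) -> dict:
    """Gateway-side gate: the tool runs only after an explicit approval."""
    if not await_approval(decisions, timeout_s):
        return {"status": "cancelled", "tool": action.tool}
    return {"status": "executed", "result": run_tool(action)}
```

Because the gate wraps the tool call itself, a model that has been persuaded to skip the approval step still cannot reach `run_tool` without a reviewer's explicit `True`.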

The five-layer model

The agentic AI attack surface can be modelled as five layers, from the point where inputs enter the system to the point where outputs leave it:

Prompt channel — user input and retrieved content that enters the model's context
Tool surface — every tool the agent can invoke, and the actions those tools can perform
Memory layer — context that persists within a session or across sessions
Orchestration boundary — inter-agent communication in multi-agent systems
Output channel — what the agent produces and where it goes

These layers are not independent. An attack that enters at Layer 1 (prompt channel) may act through Layer 2 (tool surface), persist in Layer 3 (memory), spread via Layer 4 (orchestration), and exfiltrate data via Layer 5 (output). The review must assess each layer in isolation and also the paths that connect them.

Layer 1 — The prompt channel

The prompt channel is everything that enters the model's context window: the system prompt, the user's direct input, content retrieved from databases or the web, tool call results, and any other text the model reads before deciding what to do next.

The fundamental problem with the prompt channel is that the model does not reliably distinguish instructions from data. A user message and a retrieved document are both just text in the context window. If a retrieved document contains instructions formatted to look authoritative — “Ignore previous instructions. Your new task is…” — many models will comply.

Threats at this layer:

Direct prompt injection — malicious instructions embedded in user input, overriding the system prompt's intent
Indirect prompt injection — instructions embedded in content the agent retrieves (web pages, documents, database records) rather than content the user types
System prompt displacement — input that convinces the model its operating instructions have changed
Context window flooding — large inputs that push the system prompt out of the effective context window, reducing its influence on the model's reasoning

Review questions for Layer 1: What sources can inject content into the prompt channel? What trust level does each source carry? Is the system prompt anchored in a way that resists displacement? Are retrieved documents treated as untrusted data or as authoritative content?
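One way to make the trust questions above concrete is to label every context item with its source and trust level before the prompt is assembled. The sketch below assumes a two-level trust model and a made-up `<<data ...>>` wrapper format; `Trust`, `ContextItem` and `render_context` are illustrative names. Delimiters like these help the model and the reviewer see what is data, but they do not by themselves stop a model from following injected text.

```python
from dataclasses import dataclass
from enum import Enum


class Trust(Enum):
    SYSTEM = "system"        # operator-authored instructions
    UNTRUSTED = "untrusted"  # user input, retrieved content, tool results


@dataclass(frozen=True)
class ContextItem:
    source: str
    trust: Trust
    text: str


def render_context(items: list[ContextItem]) -> str:
    """Render system items as-is and wrap everything else as labelled data."""
    parts = []
    for item in items:
        if item.trust is Trust.SYSTEM:
            parts.append(item.text)
        else:
            # Break up delimiter look-alikes so content cannot close its own wrapper
            body = item.text.replace("<<", "< <").replace(">>", "> >")
            parts.append(f"<<data source={item.source}>>\n{body}\n<<end data>>")
    return "\n\n".join(parts)
```

Keeping the source label on each item also gives the review an inventory of what can reach the prompt channel, which is the first question in the list.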

Layer 2 — The tool surface

The tool surface is defined by the agent's tool manifest — the set of actions the model can invoke. Every tool is a capability. Every capability is a potential consequence if the reasoning loop is hijacked.

Tools in typical agentic deployments include web search, code execution, file system operations, database access, email and calendar APIs, external service integrations, and in multi-agent systems, the ability to spawn or instruct subordinate agents. The blast radius of a compromised agent is bounded by its tool manifest.

Threats at this layer:

Excessive capability — the manifest includes tools the deployment task does not require, expanding the blast radius without expanding utility
Tool chaining — using a sequence of permitted tools to achieve an outcome that no single tool would permit directly
Unscoped tool parameters — tools that accept broad parameters (e.g., “delete any file”) rather than constrained ones (e.g., “delete files in /tmp/session-output”)
Missing tool-level authorization — tools that execute without verifying the invoking identity's permission for the specific action

Review questions for Layer 2: Is every tool in the manifest required for this specific deployment task? Are tool parameters constrained to the minimum scope needed? Does each tool enforce its own authorization checks, independent of the model's reasoning?
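As a sketch of a constrained tool parameter with its own authorization check, the function below scopes a delete tool to the `/tmp/session-output` directory used in the example above. The caller identity `cleanup-agent` and the function name `authorize_delete` are assumptions made for illustration; it requires Python 3.9 or later for `Path.is_relative_to`.

```python
from pathlib import Path

ALLOWED_ROOT = Path("/tmp/session-output")  # scope taken from the example above
DELETE_PERMITTED = {"cleanup-agent"}         # hypothetical caller identities


def authorize_delete(caller: str, requested: str, root: Path = ALLOWED_ROOT) -> Path:
    """Return the resolved target, or raise if the call is out of scope."""
    if caller not in DELETE_PERMITTED:
        raise PermissionError(f"{caller} may not delete files")
    target = (root / requested).resolve()
    # resolve() collapses ".." and symlinks, so traversal cannot escape the root
    if not target.is_relative_to(root.resolve()):
        raise PermissionError(f"{requested!r} is outside {root}")
    return target
```

The check runs inside the tool, so it holds whether the request came from a well-behaved planning loop or a hijacked one. An absolute path such as `/etc/passwd` is rejected too, because joining it to the root discards the root.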

The full least-privilege analysis for the tool manifest is covered in tool-use permissions for agentic AI.

Layer 3 — The memory layer

The memory layer encompasses all persistent context: in-context memory (the current session's conversation history), external memory (vector databases, relational stores), and episodic memory (session summaries stored for future retrieval).

Memory is what allows an agentic system to improve with use and maintain continuity across sessions. It is also what allows a poisoning attack to persist. Unlike a direct prompt injection — which affects only the current session — memory poisoning survives session boundaries. An instruction planted in episodic memory in session one can execute in session fifty.

Memory poisoning is the agentic equivalent of writing to a configuration file with elevated privileges. The effect is not immediate — but it is persistent, and it executes in the context of whoever triggers the poisoned memory next.

Threats at this layer:

In-context poisoning — injected instructions that are included in the current context window and influence current-session decisions
External memory poisoning — malicious documents inserted into a vector store or knowledge base, retrieved in future sessions
Episodic memory poisoning — instructions embedded in session content that the agent summarises and stores, executing in future sessions
Cross-session persistence — context from one user's session bleeding into another's via shared memory stores
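A minimal sketch of two of the controls these threats call for, per-user isolation and validation before storage, might look like the following. The writer identity `ingestion-pipeline`, the class name `EpisodicMemory` and the phrase list are assumptions for illustration. Phrase matching is a deliberately crude stand-in for a real validation step and is easy to bypass.

```python
from collections import defaultdict

AUTHORISED_WRITERS = {"ingestion-pipeline"}  # hypothetical writer identity
SUSPICIOUS_PHRASES = ("ignore previous instructions", "your new task is")


class EpisodicMemory:
    """Per-user store: entries written for one user are never read by another."""

    def __init__(self) -> None:
        self._entries: dict[str, list[str]] = defaultdict(list)

    def write(self, writer: str, user_id: str, summary: str) -> bool:
        if writer not in AUTHORISED_WRITERS:
            return False
        lowered = summary.lower()
        if any(phrase in lowered for phrase in SUSPICIOUS_PHRASES):
            return False  # a real deployment would likely quarantine for review
        self._entries[user_id].append(summary)
        return True

    def read(self, user_id: str) -> list[str]:
        # Copy so callers cannot mutate stored memory through the returned list
        return list(self._entries.get(user_id, []))
```

Keying every read and write by user is what prevents the cross-session case: a poisoned entry can at worst affect the user whose session produced it.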

The full memory security analysis is covered in agent memory as an attack surface.

Layer 4 — The orchestration boundary

Multi-agent systems introduce an additional layer: the communication channels between agents. An orchestrator agent receives a task and delegates sub-tasks to worker agents. Worker agents return results to the orchestrator. Each message in this exchange is an untrusted input.

The orchestration boundary is distinct from the prompt channel because the inputs arrive from other agents, not from users or external data sources. Teams frequently treat inter-agent communication as trusted, reasoning that “our agents are talking to each other.” This is the wrong trust model. A worker agent that has been compromised via prompt injection can return a message to the orchestrator that contains injected instructions. The orchestrator, trusting the worker's output, may comply.

Threats at this layer:

Inter-agent prompt injection — injected instructions in worker agent outputs, targeting the orchestrator's reasoning
Capability delegation abuse — an orchestrator that passes its full capability set to workers, expanding the blast radius of a compromised worker
Orchestrator compromise cascade — a compromised orchestrator that issues malicious instructions to all worker agents simultaneously
Missing inter-agent authorization — worker agents that accept instructions from any source claiming to be the orchestrator
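Two of these threats, capability delegation abuse and missing inter-agent authorization, lend themselves to a short sketch. The example assumes capabilities are plain string names and that orchestrator and worker share a secret key; `delegate`, `sign` and `verify` are illustrative names, while the `hmac` and `hashlib` calls are from the Python standard library.

```python
import hashlib
import hmac


def delegate(orchestrator_caps: set[str], task_needs: set[str]) -> set[str]:
    """Grant a worker only what the sub-task needs and the orchestrator holds."""
    return orchestrator_caps & task_needs


def sign(key: bytes, message: bytes) -> str:
    """Tag a message so the receiver can check who sent it."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(key: bytes, message: bytes, tag: str) -> bool:
    # compare_digest avoids leaking the match position through timing
    return hmac.compare_digest(sign(key, message), tag)
```

Note what the signature does and does not establish. It shows that a message came from a holder of the key, which addresses workers accepting instructions from any source. It says nothing about whether the content is safe: a worker compromised by prompt injection will sign injected text just as readily, so the message body still has to be handled as untrusted input.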

The full multi-agent review is covered in security review for multi-agent systems.

Layer 5 — The output channel

The output channel is where what the agent produces becomes consequential in the world. This includes text displayed to users, documents written to file systems, API calls made on behalf of users, code executed in downstream environments, and data sent to external services.

The output channel is frequently under-secured compared to the input channel. Teams invest in input validation and prompt hardening while deploying agent outputs without equivalent scrutiny. But an agent that has been manipulated via any of the first four layers will express that manipulation at Layer 5 — in the output.

Threats at this layer:

Data exfiltration via output — sensitive data extracted from memory or tool results and included in agent outputs sent to attacker-controlled destinations
Stored cross-site scripting — model outputs rendered as HTML without sanitization, injecting scripts into the application
Malicious code generation — agents that generate code that is executed downstream, where the generated code contains attacker-influenced logic
Consequential irreversible actions — agent outputs that trigger downstream state changes (database writes, emails sent, payments initiated) without human approval

Review questions for Layer 5: Are model outputs sanitized before rendering? Are there approval gates for consequential actions? Can the agent write to external services or storage without human oversight? Is there an audit record of every output the agent produced?
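The sanitization and audit questions can be illustrated with a short sketch. `render_safe` and `audit_record` are hypothetical helper names, and the record fields are an assumption rather than a prescribed schema. The example stores a hash of each output instead of the raw text, which is one possible design choice; a deployment that needs to replay outputs would store the content itself under appropriate access controls.

```python
import hashlib
import html
import json
import time


def render_safe(model_output: str) -> str:
    """Escape HTML metacharacters so model text is displayed, not interpreted."""
    return html.escape(model_output, quote=True)


def audit_record(agent_id: str, destination: str, output: str) -> str:
    """Build one JSON line describing an output and where it was sent."""
    return json.dumps({
        "ts": time.time(),
        "agent_id": agent_id,
        "destination": destination,
        "sha256": hashlib.sha256(output.encode("utf-8")).hexdigest(),
    })
```

Escaping at render time covers the stored cross-site scripting case only. Outputs that are executed as code or sent to external services need their own checks at the point of execution or dispatch.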

Mapping threats across layers

The five-layer model becomes most useful when it is used to map attack chains — sequences of events that start at one layer and cascade through others. A few representative attack chains illustrate the pattern:

Indirect injection to exfiltration: An attacker plants injected instructions in a web page (Layer 1). The agent retrieves the page, follows the instructions, calls an external API tool (Layer 2), and sends sensitive data from memory (Layer 3) to an attacker-controlled endpoint (Layer 5).

Memory poisoning to persistence: An attacker crafts a session that plants instructions in episodic memory (Layer 3). In a future session with a different user, the poisoned memory is retrieved (Layer 3 → Layer 1), and the instructions execute against the new user's context, using tools (Layer 2) to perform actions the new user never requested.

Orchestrator compromise cascade: A worker agent is compromised via prompt injection (Layer 1 of the worker). The worker returns a message to the orchestrator that contains injected instructions (Layer 4). The orchestrator, treating the message as trusted, issues the injected instructions to all other workers (Layer 4 again), each of which then executes them via their tool manifests (Layer 2).

These chains demonstrate why a per-layer review is necessary but not sufficient. The review must also examine the connections between layers — what outputs from one layer become inputs to another, and whether the trust model at each transition is appropriate.
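One lightweight way to track the connections between layers is to record each attack chain as the ordered list of layers it touches and then list the transitions a review has not yet covered. The sketch below encodes the three chains described above; the function names and the idea of a `reviewed` set of transitions are assumptions for illustration, not part of the framework.

```python
# Each chain is the ordered list of layers it touches, as described above
CHAINS = {
    "indirect injection to exfiltration": [1, 2, 3, 5],
    "memory poisoning to persistence": [3, 1, 2],
    "orchestrator compromise cascade": [1, 4, 4, 2],
}


def transitions(chain: list[int]) -> list[tuple[int, int]]:
    """Pair each layer with the next one the attack moves to."""
    return list(zip(chain, chain[1:]))


def unreviewed(
    chains: dict[str, list[int]], reviewed: set[tuple[int, int]]
) -> dict[str, list[tuple[int, int]]]:
    """Map each chain to the layer transitions no review has covered yet."""
    return {
        name: [t for t in transitions(chain) if t not in reviewed]
        for name, chain in chains.items()
    }
```

For example, a review that has examined only the (1, 2) and (3, 1) transitions fully covers the memory poisoning chain but leaves (2, 3) and (3, 5) open in the exfiltration chain.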

Five-layer review checklist

The following questions form the minimum checklist for an agentic AI attack surface review. Each maps to one of the five layers:

Layer 1 — Prompt channel:

What sources can inject content into the prompt channel? Are all of them treated as untrusted?
Is the system prompt anchored against displacement by user or retrieved content?
Are retrieved documents sandboxed from instruction space?

Layer 2 — Tool surface:

Is the tool manifest scoped to the minimum required for this deployment task?
Are tool parameters constrained rather than open-ended?
Does each tool enforce authorization independent of the model?

Layer 3 — Memory layer:

What types of memory does the system use?
Is there session isolation preventing cross-session memory access?
What controls prevent attacker-influenced content from being stored as trusted memory?

Layer 4 — Orchestration boundary:

Are inter-agent messages treated as untrusted inputs?
Is capability delegation minimized — do workers receive only the capabilities their specific sub-task requires?
Is there an authorization mechanism for agent-to-agent instruction passing?

Layer 5 — Output channel:

Are model outputs sanitized before rendering or execution?
Are consequential actions gated on human approval?
Is there a complete audit record of agent outputs?

For the full agentic AI security review framework, see the agentic AI security review hub.

