
Model denial of service and cost-exhaustion attacks

LLM denial of service is different from traditional DoS. An attacker does not need to crash the service; they only need to make it expensive to run. Cost-exhaustion attacks are under-defended, and we see the exposure growing in the systems we assess.

Drel Research · 18 May 2025 · 10 min read

Model denial of service (OWASP LLM04 in version 1.1 of the Top 10 for LLM Applications; the 2025 edition covers it under LLM10, Unbounded Consumption) is the risk category that most security teams address last, often because it appears less severe than injection or data disclosure. In practice, cost-exhaustion attacks against LLM applications are among the most accessible attacks for a low-sophistication adversary and among the most damaging for organisations running high-volume inference workloads.

The controls are well understood and implementable. The reason they are missing in many assessed systems is not technical complexity: they need to be designed into a system at build time, and they retrofit poorly. For the full OWASP context, see the OWASP LLM Top 10 Assessment.

Why LLM DoS is different

Traditional denial of service attacks saturate a resource — network bandwidth, CPU, memory, file descriptors — until the service cannot serve legitimate requests. The goal is unavailability. The control is capacity: if the service has more capacity than the attacker can exhaust, the attack fails.

LLM denial of service has two distinct modes that differ from this model:

Cost exhaustion does not require making the service unavailable. It requires making it expensive to run. In a pay-per-token inference model, an attacker who generates sufficient token volume imposes financial cost on the operator regardless of whether the service remains available. The attack succeeds when the cost becomes unsustainable, not when the service goes down.

Resource exhaustion is closer to traditional DoS but operates at the inference layer. An attacker who floods the inference endpoint with large context requests may cause latency degradation for legitimate users even if the service remains technically available — and may exhaust per-user or per-deployment token quotas that affect legitimate access.

A cost exhaustion attack does not need to crash the service. It needs to generate enough token volume that the operator's monthly inference bill becomes the problem. The attacker's marginal cost is near zero; the operator's is not.

This asymmetry means that cost exhaustion attacks are economically attractive at a level of effort that would not bother most adversaries mounting a traditional DoS. An attacker with a few hundred API keys and a script can mount a meaningful cost attack against a public LLM application.
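
To make the asymmetry concrete, the operator-side bill can be estimated with simple arithmetic. The sketch below is illustrative only: the function name, the request volume, the token counts, and the per-million-token prices are all assumed figures, not data from any assessed system or provider price list.

```python
def monthly_attack_cost(requests_per_day, input_tokens, output_tokens,
                        input_price_per_mtok, output_price_per_mtok, days=30):
    """Estimate the operator's bill for a sustained token-volume attack.

    Prices are per million tokens, in whatever currency the operator is billed in.
    """
    per_request = (input_tokens * input_price_per_mtok
                   + output_tokens * output_price_per_mtok) / 1_000_000
    return requests_per_day * days * per_request


# Hypothetical scenario: 50,000 requests per day, each with 8,000 input
# tokens and 2,000 output tokens, priced at 3.00 and 15.00 per million tokens.
cost = monthly_attack_cost(50_000, 8_000, 2_000, 3.00, 15.00)
print(f"{cost:,.2f}")  # prints 81,000.00
```

Under these assumed numbers each request costs the operator about 0.054, which looks negligible until it is multiplied by 1.5 million requests in a month.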

DoS threat matrix — attack types, impact, detection, and control

Attack type	Impact	Detection signal	Control
Token exhaustion via complex prompts	Cost exhaustion; billing spike proportional to input token count	Above-baseline input token volume per user or session	Input token cap per request enforced at gateway
Recursive context expansion	Context window saturation; inference latency degradation for all users	Requests consistently near maximum context length	Context length cap; document size limit before injection
Parallel request flooding	Quota exhaustion affecting all users of the deployment	Spike in concurrent requests from a single identity	Per-user rate limit; concurrency limit; circuit breaker
Repetitive regeneration loops	Output token amplification; cost scaling without proportional user value	High output-to-input token ratio per request or session	Output token cap per request; session-level token budget
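
Two of the detection signals in the matrix, above-baseline input volume and a high output-to-input ratio, can be computed from per-request token logs. The following is a minimal sketch under assumed names: `RequestLog`, `flag_users`, and both threshold parameters are hypothetical, and real thresholds would come from the deployment's own baseline.

```python
from dataclasses import dataclass


@dataclass
class RequestLog:
    user_id: str
    input_tokens: int
    output_tokens: int


def flag_users(logs, max_input_per_user, max_output_ratio):
    """Return {user_id: [reasons]} for users exceeding either threshold."""
    totals = {}
    for r in logs:
        i, o = totals.get(r.user_id, (0, 0))
        totals[r.user_id] = (i + r.input_tokens, o + r.output_tokens)

    flagged = {}
    for user, (i, o) in totals.items():
        reasons = []
        if i > max_input_per_user:
            reasons.append("input_volume")   # long-prompt signal
        if i > 0 and o / i > max_output_ratio:
            reasons.append("output_ratio")   # amplification signal
        if reasons:
            flagged[user] = reasons
    return flagged


logs = [
    RequestLog("a", 100, 50),
    RequestLog("b", 5000, 100),
    RequestLog("c", 10, 400),
]
print(flag_users(logs, max_input_per_user=1000, max_output_ratio=10))
```

Aggregating per user rather than per request is a design choice here: it catches an attacker who spreads volume across many moderately sized requests.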

Cost exhaustion attacks

Cost exhaustion attacks target the token billing dimension of inference. The attacker's goal is to maximise token consumption per request, minimise their own cost per request, and scale the attack to generate significant billing impact.

Long prompt attacks. Input tokens cost money. An attacker who submits requests with very long prompts — pasting a large document into every query, for example, or including a long preamble before the actual question — consumes input tokens at a rate higher than legitimate users. If the application imposes no input token cap, long prompt attacks scale linearly with request volume.

Token amplification attacks. Some prompts reliably produce completions much longer than the prompt itself. A question that always produces a long enumerated response, such as “list every country in the world alphabetically with its capital”, amplifies the output token cost beyond what input volume alone suggests. Output tokens are typically billed at a higher rate than input tokens, making amplification particularly costly.

Repeated extraction requests. An attacker attempting to extract a proprietary model's responses systematically, for example by running thousands of variations of a query to map the output distribution, generates token cost as a side effect of the extraction. This combines model theft risk with cost exhaustion.

Resource exhaustion attacks

Resource exhaustion at the inference layer is distinct from cost exhaustion — it targets throughput and latency rather than billing. Large context requests occupy inference infrastructure for longer than small requests, increasing latency for concurrent legitimate users.

Context flooding. Many LLM applications accept user-provided context — documents to summarise, code to review, emails to analyse. An attacker who submits documents at or near the maximum context window for every request occupies inference capacity that legitimate users cannot access during that time. If the application does not cap context length, context flooding can degrade service quality without necessarily exhausting a spending cap.

Quota exhaustion. Inference providers impose rate limits and token quotas per deployment, per user, or per time window. An attacker who exhausts a deployment-level quota affects all users of that deployment. Per-user quotas limit the blast radius but require the application to track and enforce usage per authenticated user.

The four attack patterns

Attack pattern	Mechanism	Primary impact	Primary control
Long prompt	Submit requests with input tokens near or at the cap, or without a cap	Cost + resource	Input token cap per request
Token amplification	Craft prompts that reliably produce completions much longer than the input	Cost	Output token cap per request
Context flooding	Submit documents at the maximum context window on every request	Resource	Context length cap; document size limit
Quota exhaustion	Send requests at the maximum rate limit until the deployment quota is consumed	Resource (all users)	Per-user quota; circuit breaker

Controls that work

The controls for LLM DoS are not novel — they are the same throttling and quota mechanisms used in any API, adapted to the token-based billing model of LLM inference. What makes them specific to LLMs is that they must be applied at the token layer, not just the request layer.

Input token cap per request. Every request must have a hard limit on the number of input tokens accepted. The cap should be set to the maximum that legitimate use requires, not to the maximum the model accepts. If the typical legitimate query is 500 tokens, a cap of 2000 tokens provides a reasonable buffer while blocking long prompt attacks. The cap should be enforced at the gateway, before the request reaches the inference provider.
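
A gateway-side check for the input cap might look like the sketch below. It is an assumption-laden illustration, not a reference implementation: `enforce_input_cap` and `PromptTooLarge` are hypothetical names, and `count_tokens` stands in for whatever tokeniser or estimator the gateway actually uses. The 2000-token default mirrors the example figure above.

```python
class PromptTooLarge(Exception):
    """Raised when a prompt exceeds the configured input token cap."""


def enforce_input_cap(prompt, count_tokens, max_input_tokens=2000):
    """Reject oversized prompts before they reach the inference provider.

    count_tokens is any callable mapping a string to a token count.
    Returns the measured count so the caller can log or meter it.
    """
    n = count_tokens(prompt)
    if n > max_input_tokens:
        raise PromptTooLarge(
            f"prompt has {n} tokens; cap is {max_input_tokens}"
        )
    return n


# Placeholder counter for demonstration only: whitespace-separated words.
def word_count(text):
    return len(text.split())


print(enforce_input_cap("summarise this short note", word_count))
```

Rejecting the request outright, rather than silently truncating the prompt, keeps the behaviour visible to legitimate users and gives the gateway a clean event to log.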

Output token cap per request. Setting a maximum completion length limits token amplification. This also limits the cost of a single successful amplification request. The cap should be set based on the longest reasonable legitimate response in the deployment context.
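
One way to apply the output cap is to clamp the completion-length field on the outbound request at the gateway, so a client cannot raise or omit it. In this sketch the field name `max_tokens` and the function `clamp_output_cap` are assumptions; the actual parameter name depends on the inference provider's API.

```python
def clamp_output_cap(payload, cap=1024, field="max_tokens"):
    """Return a copy of the request payload with the output cap enforced.

    A missing, non-integer, non-positive, or too-large value is replaced
    by the cap; a smaller valid value requested by the client is kept.
    """
    out = dict(payload)  # shallow copy: the caller's payload is not mutated
    requested = payload.get(field)
    valid = (
        isinstance(requested, int)
        and not isinstance(requested, bool)
        and 0 < requested <= cap
    )
    if not valid:
        out[field] = cap
    return out


print(clamp_output_cap({"prompt": "hello", "max_tokens": 99999}, cap=512))
```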

Per-user rate limits. Rate limits are expressed as requests per time window (e.g., 60 requests per minute). For LLM applications, rate limits should also be expressed in tokens per time window — a user sending one enormous request per minute can exhaust more resources than a user sending many small requests.
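
A limiter that enforces both requests per window and tokens per window can be sketched with a sliding window per user. This is an in-memory, single-process illustration under assumed names (`TokenWindowLimiter`, `allow`); a production gateway would typically need shared state across instances.

```python
import time
from collections import defaultdict, deque


class TokenWindowLimiter:
    """Sliding-window limit on requests AND tokens per user."""

    def __init__(self, max_requests, max_tokens, window_seconds=60.0,
                 clock=time.monotonic):
        self.max_requests = max_requests
        self.max_tokens = max_tokens
        self.window = window_seconds
        self._clock = clock  # injectable so the limiter can be tested
        self._events = defaultdict(deque)  # user_id -> (timestamp, tokens)

    def allow(self, user_id, tokens):
        """Return True and record the request if it fits both limits."""
        now = self._clock()
        events = self._events[user_id]
        # Drop events that have aged out of the window.
        while events and now - events[0][0] >= self.window:
            events.popleft()
        used = sum(t for _, t in events)
        if len(events) >= self.max_requests or used + tokens > self.max_tokens:
            return False  # caller would typically respond with HTTP 429
        events.append((now, tokens))
        return True


limiter = TokenWindowLimiter(max_requests=60, max_tokens=20_000)
print(limiter.allow("user-1", 18_000))  # one large request fits
print(limiter.allow("user-1", 5_000))   # second would exceed the token budget
```

The second call is refused even though the user has sent only one request in the window, which is the case a request-only limit misses.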

Cost alerts. Set alerts at 50%, 75%, and 90% of the expected monthly inference budget. A cost spike that would exhaust the budget in a week rather than a month is a signal that requires investigation — it may represent a cost exhaustion attack, unexpected usage growth, or a runaway process.
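
The alert logic reduces to two small calculations: which budget thresholds have been crossed, and how long the remaining budget lasts at the current burn rate. The function names below are hypothetical, and the figures in the example are made up for illustration.

```python
def crossed_thresholds(spend, budget, thresholds=(0.5, 0.75, 0.9)):
    """Return the budget fractions that month-to-date spend has reached."""
    return [t for t in thresholds if spend >= t * budget]


def days_until_exhausted(spend, days_elapsed, budget):
    """Project days of budget left at the average daily spend so far.

    Returns None when there is not yet enough data to project.
    """
    if spend <= 0 or days_elapsed <= 0:
        return None
    daily = spend / days_elapsed
    return max(budget - spend, 0) / daily


# Hypothetical: 400 spent in the first 2 days of a 1,000 monthly budget.
print(crossed_thresholds(400, 1000))       # no threshold reached yet
print(days_until_exhausted(400, 2, 1000))  # but only 3.0 days of budget remain
```

The example shows why a burn-rate projection is worth alerting on alongside the fixed thresholds: the budget is on course to run out within the week before the 50% alert has fired.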

Circuit breaker. A circuit breaker cuts off inference requests when cost-per-hour or requests-per-minute exceeds a defined threshold. Unlike a rate limit (which limits per-user throughput), a circuit breaker limits total system throughput. It is a blunt instrument — it affects all users when it fires — but it prevents a cost exhaustion attack from fully consuming the budget before human intervention.
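
A cost-based circuit breaker can be sketched as a trailing-hour spend tracker that opens for a cooldown period when the threshold is exceeded. The class name, the cooldown behaviour, and the parameter names are assumptions for illustration; the trigger condition and the action (reject or queue) are deployment decisions.

```python
import time
from collections import deque


class CostCircuitBreaker:
    """Blocks all inference when trailing-hour spend exceeds a threshold."""

    def __init__(self, max_cost_per_hour, cooldown_seconds=300.0,
                 clock=time.monotonic):
        self.max_cost_per_hour = max_cost_per_hour
        self.cooldown = cooldown_seconds
        self._clock = clock
        self._spend = deque()    # (timestamp, cost) for the trailing hour
        self._open_until = None  # None means the breaker has never tripped

    def record(self, cost):
        """Record the cost of a completed request; trip if over threshold."""
        now = self._clock()
        self._spend.append((now, cost))
        while self._spend and now - self._spend[0][0] >= 3600:
            self._spend.popleft()
        if sum(c for _, c in self._spend) > self.max_cost_per_hour:
            self._open_until = now + self.cooldown

    def allow(self):
        """Return False while the breaker is open (system-wide, all users)."""
        if self._open_until is None:
            return True
        return self._clock() >= self._open_until


breaker = CostCircuitBreaker(max_cost_per_hour=10.0)
breaker.record(6.0)
breaker.record(5.0)     # trailing-hour spend is now 11.0
print(breaker.allow())  # False: breaker is open
```

After the cooldown the breaker admits traffic again and re-trips on the next recorded cost if trailing-hour spend is still over the threshold, which gives operators a window to intervene without a permanent outage.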

Rate limiting design

Rate limiting for LLM applications has more dimensions than rate limiting for traditional APIs. The relevant dimensions are:

Requests per time window. Limits request frequency. Effective against volume attacks but not against low-rate long-prompt attacks.
Input tokens per time window. Limits total input token consumption per user per window. More effective than request rate for long prompt attacks.
Output tokens per time window. Limits total completion token consumption. Effective against amplification attacks.
Concurrent requests. Limits how many requests from a single user can be in-flight simultaneously. Limits context flooding impact.
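
The concurrent-requests dimension is the one least like the others, since it counts requests in flight rather than events in a window. A minimal thread-safe sketch, with hypothetical names and in-memory state only:

```python
import threading
from collections import defaultdict


class ConcurrencyLimiter:
    """Caps the number of in-flight requests per user."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self._lock = threading.Lock()
        self._in_flight = defaultdict(int)

    def try_acquire(self, user_id):
        """Reserve a slot; return False if the user is at the limit."""
        with self._lock:
            if self._in_flight[user_id] >= self.max_in_flight:
                return False
            self._in_flight[user_id] += 1
            return True

    def release(self, user_id):
        """Free a slot; call this when the request finishes or fails."""
        with self._lock:
            if self._in_flight[user_id] > 0:
                self._in_flight[user_id] -= 1


limiter = ConcurrencyLimiter(max_in_flight=2)
if limiter.try_acquire("user-1"):
    try:
        pass  # forward the request to the inference provider here
    finally:
        limiter.release("user-1")
```

Releasing in a `finally` block matters: a slot leaked on an error path would gradually lock a legitimate user out.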

In practice, most LLM applications implement only request-level rate limits because that is what off-the-shelf API gateway products support. Token-level rate limits require counting tokens at the gateway layer — which requires either calling the tokeniser or using approximate token counts. The implementation overhead is non-trivial, which is why token-level limits are frequently absent in assessed systems.
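
Where calling the real tokeniser at the gateway is impractical, an approximate count can stand in. The sketch below assumes roughly four bytes of UTF-8 text per token, a common rule of thumb for English prose; that ratio is an assumption, varies by tokeniser, language, and content type, and should be calibrated against the model actually in use.

```python
import math


def approx_token_count(text, bytes_per_token=4.0):
    """Estimate token count from UTF-8 byte length.

    A lower bytes_per_token overestimates the count, which is the
    safer direction for enforcing a cap.
    """
    if bytes_per_token <= 0:
        raise ValueError("bytes_per_token must be positive")
    return math.ceil(len(text.encode("utf-8")) / bytes_per_token)


print(approx_token_count("Summarise the attached report in three bullet points."))
```

Counting bytes rather than characters is deliberate: multi-byte scripts usually tokenise less efficiently than ASCII text, so a byte-based estimate errs upward for them.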

The minimum control set is: input token cap per request, output token cap per request, request-rate limit per user, and cost alerts. Token-per-window limits are recommended for high-sensitivity deployments.

Evidence requirements

A security review that addresses model DoS must verify that controls are in place for both cost exhaustion and resource exhaustion attack patterns. The evidence required:

Gateway configuration showing input token cap per request — specific value, enforced before reaching the inference provider.
Gateway configuration showing output token cap per request.
Rate limit configuration — requests per window and, ideally, tokens per window per user.
Rate limit enforcement test: requests above the limit are rejected with a 429 response.
Cost alert configuration — alert thresholds defined and tested.
Circuit breaker configuration — trigger condition, action (reject or queue), and test showing it engages.

The OWASP LLM Top 10 Assessment maps this evidence against the full OWASP risk framework and produces a clearance recommendation for AI Committee review.

A note on scope: Drel reviews assessed systems against documented architecture, configuration and intent. It does not ingest live telemetry from production environments. Dispositions reflect the assessed system at the time of review and the re-assessment triggers that govern when the disposition must be revisited.