VYPR
Medium severity (5.3) · NVD Advisory · Published May 26, 2026

CVE-2026-9540


Description

A vulnerability was identified in vllm-project vllm 0.19.0. It affects unknown processing in the OpenAI-compatible Serving Path component, and manipulating this processing leads to denial of service. The attack can be launched remotely. A public exploit is available and may be used. The pull request that fixes the issue is awaiting acceptance.

AI Insight

LLM-synthesized narrative grounded in this CVE's description and references.

A single request with high n (number of completions) and logprobs values in vllm 0.19.0 causes compute amplification that stalls co-scheduled requests, leading to denial of service.

Vulnerability

A vulnerability exists in vllm-project vllm 0.19.0 in the V1 scheduler used by the OpenAI-compatible Serving Path. When a request specifies high values for n (number of completions) and logprobs (e.g., n=8, logprobs=20), the scheduler batches the resulting sequences without accounting for the compute cost of the sampling stage. For large-vocabulary models such as Qwen2.5 (~151k tokens), computing per-step logprobs for multiple completions requires a Top-K pass over the full vocabulary for every sequence at every decode iteration. Because all requests in a batch advance together, every co-scheduled request must wait for this sampling work to finish at each step, producing a synchronous compute amplification [1][2].
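
The amplification can be made concrete with a simplified cost model. The sketch below assumes, purely for illustration, that producing logprobs costs one full pass over the vocabulary per sequence per decode step, and it uses the advisory's ~151k vocabulary figure. It does not model vLLM's actual kernels, and the helper names are hypothetical.

```python
# Simplified, illustrative cost model (assumption): requesting logprobs costs one
# full pass over the vocabulary for each sequence at each decode step.
APPROX_VOCAB_SIZE = 151_000  # ~151k tokens, the Qwen2.5 figure quoted in the advisory


def vocab_entries_per_step(n: int, logprobs: int, vocab_size: int = APPROX_VOCAB_SIZE) -> int:
    """Vocabulary entries scanned for logprobs in one decode step (hypothetical helper)."""
    if logprobs <= 0:
        return 0  # no logprobs requested: no extra Top-K pass in this model
    return n * vocab_size  # one full-vocabulary pass per sequence


def vocab_entries_per_request(n: int, logprobs: int, max_tokens: int,
                              vocab_size: int = APPROX_VOCAB_SIZE) -> int:
    """Total entries scanned over a generation of max_tokens decode steps."""
    return vocab_entries_per_step(n, logprobs, vocab_size) * max_tokens


if __name__ == "__main__":
    heavy = vocab_entries_per_step(n=8, logprobs=20)
    plain = vocab_entries_per_step(n=1, logprobs=0)
    print(f"heavy request, per step: {heavy:,}")  # 1,208,000
    print(f"plain request, per step: {plain:,}")  # 0
    print(f"heavy request, 256 tokens: {vocab_entries_per_request(8, 20, 256):,}")  # 309,248,000
```

Under this model the extra work grows linearly with n and with vocabulary size, while a plain request adds none, which is consistent with the advisory's description of one request dominating each step.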

Exploitation

An attacker with network access to the OpenAI-compatible API endpoint can trigger this vulnerability by sending a crafted request with high n and logprobs values that is scheduled into the same batch as other requests. No special authentication is required if the endpoint is publicly exposed, and no elevated privileges are needed: the attacker only has to be able to submit API requests. The issue reproduces with a single request using n=8 and logprobs=20 sent alongside other traffic, where the heavy request dominates GPU compute time at every decode step. A public proof-of-concept is available and is reported to work on vllm 0.18.0 and later [2][3].
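
To make the triggering request shape concrete, the sketch below computes a hypothetical "logprob load" (n multiplied by the number of requested logprob candidates) from a request body. It assumes the Completions request shape (integer logprobs, as in the advisory's logprobs=20) and the Chat Completions shape (boolean logprobs plus integer top_logprobs). The load metric and the MAX_LOGPROB_LOAD threshold are illustrative assumptions, not vLLM settings.

```python
import json

# Hypothetical gateway-side threshold on n * (requested logprob candidates).
# This is NOT a vLLM option; it only illustrates how the request shape can be recognized.
MAX_LOGPROB_LOAD = 40


def requested_top_k(body: dict) -> int:
    """Number of per-token logprob candidates a request asks for (0 if none)."""
    logprobs = body.get("logprobs")
    # Check bool first: in Python, True/False are also instances of int.
    if isinstance(logprobs, bool):
        # Chat Completions style: boolean logprobs plus integer top_logprobs.
        if not logprobs:
            return 0
        return max(int(body.get("top_logprobs") or 0), 1)
    if logprobs is None:
        return 0
    # Completions style: integer logprobs; conservatively count an explicit 0 as 1.
    return max(int(logprobs), 1)


def logprob_load(body: dict) -> int:
    """Hypothetical load metric: completions requested times logprob candidates."""
    n = int(body.get("n") or 1)
    return n * requested_top_k(body)


def should_reject(raw_json: str, limit: int = MAX_LOGPROB_LOAD) -> bool:
    """True if the request body exceeds the illustrative load threshold."""
    return logprob_load(json.loads(raw_json)) > limit


if __name__ == "__main__":
    print(should_reject('{"prompt": "hi", "n": 8, "logprobs": 20}'))  # True (load 160)
    print(should_reject('{"prompt": "hi", "max_tokens": 16}'))        # False (load 0)
```

A reverse proxy in front of the endpoint could apply a check like this, but the threshold and metric here are placeholders, not tuned values.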

Impact

Successful exploitation causes a denial-of-service condition for unrelated ("victim") requests co-scheduled in the same decode batch. Time-to-first-token (TTFT) for plain requests regresses by a reported factor of 76x–423x, rising from ~65 ms to as much as ~9.7 s. Victim requests therefore cannot complete in a reasonable time, which degrades overall serving quality and availability [1][2][3].

Mitigation

A pull request (#37594) [2] proposes a fix that introduces a max_num_batched_logprobs budget in SchedulerConfig and the V1 scheduler. With the budget set to 100, the extreme latency spikes are mitigated and victim TTFT returns to ~65 ms. As of the publication date (2026-05-26), the fix is awaiting acceptance and has not been included in a release. Until an official patched version is released (expected to be 0.19.1 or later), users are advised to apply the patch manually or to restrict access to the API endpoint to trusted clients. No workarounds are documented, and the issue is not listed in CISA KEV [2].
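
The budget idea behind the proposed fix can be sketched as a per-step admission check. The sketch assumes a hypothetical cost of n × logprobs per request and greedy first-come admission; the Request class, the schedule_step function, and the starvation guard are illustrative and are not taken from PR #37594, whose actual accounting may differ. With a budget of 100, the n=8, logprobs=20 request (cost 160) is deferred while plain requests proceed.

```python
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    n: int         # number of parallel completions (sequences)
    logprobs: int  # logprob candidates requested per token (0 = none)

    @property
    def logprob_cost(self) -> int:
        # Hypothetical cost model: one unit per (sequence, logprob candidate).
        return self.n * self.logprobs


def schedule_step(waiting: list[Request], budget: int) -> tuple[list[Request], list[Request]]:
    """Greedy, first-come admission under a per-step logprobs budget.

    Returns (scheduled, deferred). Requests that would push the step over
    budget are deferred; plain requests (cost 0) always fit.
    """
    scheduled: list[Request] = []
    deferred: list[Request] = []
    used = 0
    for req in waiting:
        if used + req.logprob_cost <= budget:
            scheduled.append(req)
            used += req.logprob_cost
        else:
            deferred.append(req)
    # Starvation guard (assumption): if nothing fit, run the first deferred request alone.
    if not scheduled and deferred:
        scheduled.append(deferred.pop(0))
    return scheduled, deferred


if __name__ == "__main__":
    queue = [Request("heavy", 8, 20), Request("victim-1", 1, 0), Request("victim-2", 1, 0)]
    run, wait = schedule_step(queue, budget=100)
    print([r.request_id for r in run])   # ['victim-1', 'victim-2']
    print([r.request_id for r in wait])  # ['heavy']
```

In this sketch the heavy request runs alone once nothing else is queued; a production scheduler would also need a fairness policy so that a steady stream of plain requests cannot defer it indefinitely.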

AI Insight generated on May 26, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.

Affected products

2
  • Vllm/Vllm: 2 versions
    • (no CPE)
    • (no CPE), range: = 0.19.0

Patches

0

No patches discovered yet.


References

6
