CVE-2026-9540
Description
A vulnerability was identified in vllm-project vllm 0.19.0. This issue affects unspecified processing in the OpenAI-compatible Serving Path component. Manipulating it leads to denial of service. The attack can be launched remotely. An exploit is publicly available and may be used. The pull request that fixes this issue is awaiting acceptance.
AI Insight
LLM-synthesized narrative grounded in this CVE's description and references.
In vllm 0.19.0, a single request with high n_completions and logprobs values causes compute amplification that blocks co-scheduled requests, leading to denial of service.
Vulnerability
A vulnerability exists in vllm-project vllm version 0.19.0 within the V1 scheduler of the OpenAI-compatible Serving Path. When a request specifies high values for n (completions) and logprobs (e.g., n=8, logprobs=20), the scheduler batches these sequences without accounting for the compute overhead of the sampling stage. For large-vocabulary models like Qwen2.5 (~151k tokens), computing per-step logprobs for multiple completions requires a massive Top-K sort across the full vocabulary for every sequence at every decode iteration. This design flaw causes a synchronous compute amplification, as all requests in the same batch must wait for the heavy sampling to complete at each step [1][2].
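The amplification described above can be illustrated with a rough cost model. This is a sketch, not vLLM code: the function name is hypothetical, and it assumes that each of the n sequences needs its own top-k pass over the full vocabulary at every decode step, and that no such pass runs when logprobs are not requested.

```python
# Rough cost model (not vLLM code): counts the vocabulary entries the sampler
# has to rank in one decode step for a single request.

def logits_ranked_per_step(n: int, logprobs: int, vocab_size: int) -> int:
    """Vocabulary entries examined per decode step for one request."""
    if logprobs <= 0:
        # Assumption: no top-k pass is needed when logprobs are not requested.
        return 0
    # Assumption: one full-vocabulary top-k per sequence, per step.
    return n * vocab_size


VOCAB = 151_000  # approximate Qwen2.5 vocabulary size quoted in the text

heavy = logits_ranked_per_step(n=8, logprobs=20, vocab_size=VOCAB)
plain = logits_ranked_per_step(n=1, logprobs=0, vocab_size=VOCAB)

print(f"heavy request: {heavy:,} entries ranked per step")
print(f"plain request: {plain:,} entries ranked per step")
```

Under these assumptions the heavy request adds over a million ranked entries to every decode step, and because the batch advances in lockstep, plain requests in the same batch wait for that work each time.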
Exploitation
An attacker with remote access to the OpenAI-compatible API endpoint can trigger this vulnerability by sending a request with high n and logprobs parameters that is scheduled into the same batch as other requests. No special authentication is required if the endpoint is publicly exposed. The attacker does not need high privileges, only the ability to submit API requests. The issue is reproduced by sending a single request with n=8 and logprobs=20 alongside other requests, which causes the heavy request to dominate GPU compute time synchronously at every decode step. A public proof-of-concept is available and confirmed to work on vllm 0.18.0 and later [2][3].
Impact
Successful exploitation results in a denial of service (DoS) condition for innocent "victim" requests co-scheduled in the same decode batch. The time-to-first-token (TTFT) for plain requests can regress by a factor of 76x–423x, increasing from ~65ms to as much as ~9.7s. This effectively blocks the victim requests from completing in a reasonable time, degrading the overall serving quality and availability [1][2][3].
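As a quick check of the quoted figures, the slowdown implied by the two TTFT values given above (~65 ms and ~9.7 s) can be computed directly. The helper name is illustrative only.

```python
def slowdown(baseline_s: float, degraded_s: float) -> float:
    """Ratio of degraded latency to baseline latency."""
    return degraded_s / baseline_s


# TTFT values quoted in the text, converted to seconds.
factor = slowdown(baseline_s=0.065, degraded_s=9.7)
print(f"TTFT regression: about {factor:.0f}x")
```

That pair works out to roughly 149x, which sits inside the 76x–423x range. The measurement pairs behind the two ends of that range are not given in this summary.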
Mitigation
A pull request (#37594) [2] has been submitted to fix this issue by introducing a max_num_batched_logprobs budget in SchedulerConfig and the V1 scheduler. With the budget set to 100, the extreme latency spikes are mitigated, restoring victim TTFT to ~65ms. As of the publication date (2026-05-26), the fix awaits acceptance and has not been merged into a release. Until an official patched version is released (expected to be 0.19.1 or later), users are advised to apply the patch manually or to restrict access to the API endpoint to trusted clients. No other workarounds are documented, and the issue is not listed in CISA KEV [2].
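The idea of a batched-logprobs budget can be sketched as a simple admission loop. This is not the code from the pull request: the Request class, the admit function, and the cost formula n * logprobs are all assumptions made for illustration. Only the name max_num_batched_logprobs and the example value 100 come from the text above.

```python
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    n: int = 1          # number of completions requested
    logprobs: int = 0   # top logprobs requested per token


def logprob_cost(req: Request) -> int:
    # Assumption: cost is the number of logprob entries produced per step.
    return req.n * req.logprobs


def admit(waiting, max_num_batched_logprobs=100):
    """Split waiting requests into (batch, deferred) under a logprobs budget."""
    batch, deferred, used = [], [], 0
    for req in waiting:
        cost = logprob_cost(req)
        # An over-budget request is admitted only into an empty batch, so it
        # runs alone instead of waiting forever (a choice made for this sketch).
        if not batch or used + cost <= max_num_batched_logprobs:
            batch.append(req)
            used += cost
        else:
            deferred.append(req)
    return batch, deferred


waiting = [
    Request("victim-1"),
    Request("heavy", n=8, logprobs=20),  # cost 160, above the budget of 100
    Request("victim-2"),
]
batch, deferred = admit(waiting)
print([r.request_id for r in batch])     # requests scheduled this step
print([r.request_id for r in deferred])  # requests held back
```

In this sketch the heavy request is held back rather than batched with the plain requests, so it no longer stalls them at each decode step. How the actual patch treats requests whose own cost exceeds the budget is not described in this summary.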
[1] [Bug]: n_completions + logprobs Causes Significant TTFT Spike for Co-Scheduled Requests on Cold Cache
[2] [Bugfix] fixd issue#37343: prevent TTFT regression by adding batched logprobs budget to scheduler by Pineberry1 · Pull Request #37594 · vllm-project/vllm
[3] Catching a vLLM Latency Spike with eBPF and an Open-Weight LLM
AI Insight generated on May 26, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.
Affected products: 2
Patches: 0. No patches discovered yet.