vLLM: Six CVEs Disclosed in 21 Hours — Critical Auth Bypass, Code Execution, and GPU Memory Leaks

Key findings

Critical auth bypass (CVE-2026-48746) lets unauthenticated attackers call the OpenAI API without an API key
High-severity code execution (CVE-2026-41523) via assert bypass when Python runs in optimized mode
Audio decompression bomb (CVE-2026-54233) can OOM the server from a 25 MB OPUS file
GGUF dequantization bug (CVE-2026-53923) leaks uninitialized GPU memory in multi-tenant setups
NaN/Infinity temperature values (CVE-2026-54235) bypass validation and cause undefined GPU behavior
Incomplete prior fix (CVE-2026-54236) still leaks PIL memory addresses via the Anthropic router

Six vLLM CVEs Disclosed Together — Auth Bypass, Code Execution, and Memory Leaks

On June 16–17, 2026, six security vulnerabilities in the open-source LLM inference server vLLM were disclosed within a 21-hour window. They span a critical authentication bypass, a high-severity arbitrary code execution flaw, and several medium-severity issues involving denial of service, memory disclosure, and undefined GPU behavior. The batch affects users running vLLM in multi-tenant or API-facing deployments and underscores the growing attack surface of AI inference infrastructure.

Critical: OpenAI API Authentication Bypass

The most severe vulnerability, CVE-2026-48746 (critical), allows an unauthenticated attacker to bypass the OpenAI API AuthenticationMiddleware entirely. The flaw originates in how ASGI web servers and Starlette handle trust boundaries, enabling API calls without the configured VLLM_API_KEY or --api-key. This issue was discovered during a source code audit by X41 Sec. Any vLLM instance exposing the OpenAI-compatible API endpoint is at risk of unauthorized access.

High Severity: Arbitrary Code Execution via Assert Bypass

CVE-2026-41523 (high) describes an assert-based security check in vLLM's activation function loading that can be bypassed when Python runs in optimized mode (python -O or PYTHONOPTIMIZE=1). In that mode, assert statements are removed when the code is compiled to bytecode, so the check never runs. An unauthenticated attacker can then achieve arbitrary code execution by publishing a malicious Hugging Face model that a vLLM instance loads. Users who run vLLM with Python optimizations enabled are directly exposed.
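The effect of optimized mode can be seen in a minimal sketch. The function names and the allow-list below are hypothetical and are not taken from vLLM's code; the only point is that a check written with assert disappears under python -O, while an explicit raise does not.

```python
import sys

# Hypothetical allow-list for illustration; not vLLM's actual list.
ALLOWED_ACTIVATIONS = {"gelu", "relu", "silu"}


def check_with_assert(name: str) -> str:
    # Under `python -O` this statement is compiled away, so any name passes.
    assert name in ALLOWED_ACTIVATIONS, f"unsupported activation: {name!r}"
    return name


def check_explicit(name: str) -> str:
    # An explicit raise is kept at every optimization level.
    if name not in ALLOWED_ACTIVATIONS:
        raise ValueError(f"unsupported activation: {name!r}")
    return name


def asserts_enabled() -> bool:
    # sys.flags.optimize is 0 unless -O or PYTHONOPTIMIZE is in effect.
    return sys.flags.optimize == 0
```

A service could call asserts_enabled() at startup and refuse to run when it returns False, although replacing security-relevant asserts with explicit checks is the more durable fix.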

Denial of Service via Audio Decompression Bomb

CVE-2026-54233 (medium) targets the /v1/audio/transcriptions endpoint. While the endpoint limits compressed upload size to 25 MB (via VLLM_MAX_AUDIO_CLIP_FILESIZE_MB), it does not cap the decoded PCM output. A crafted 25 MB OPUS file can decompress to approximately 14.9 GB of float32 PCM, roughly a 600-fold expansion, causing an out-of-memory condition that can crash the server. This is a classic decompression bomb attack vector.
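The general defense against this class of attack is to budget the decoded size, not only the compressed size. The sketch below is not vLLM's fix. It assumes a hypothetical decoder that yields PCM as byte chunks, and the 512 MiB default is an arbitrary illustrative limit.

```python
from typing import Iterable

BYTES_PER_FLOAT32 = 4


def decoded_pcm_bytes(duration_s: float, sample_rate: int, channels: int = 1) -> int:
    """Size of the audio once decoded to float32 PCM."""
    return int(duration_s * sample_rate) * channels * BYTES_PER_FLOAT32


def read_capped(chunks: Iterable[bytes], max_bytes: int = 512 * 1024**2) -> bytes:
    """Collect decoded chunks, aborting as soon as the budget is exceeded.

    `chunks` stands in for the output of a streaming decoder (an assumption);
    counting while decoding avoids trusting duration metadata in the file.
    """
    total = 0
    parts = []
    for chunk in chunks:
        total += len(chunk)
        if total > max_bytes:
            raise ValueError(f"decoded audio exceeds {max_bytes} bytes")
        parts.append(chunk)
    return b"".join(parts)
```

For scale, decoded_pcm_bytes(3600, 16000) gives 230,400,000 bytes for one hour of mono audio at an assumed 16 kHz sample rate, so a budget in the hundreds of megabytes already covers long legitimate clips.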

GPU Memory Disclosure via GGUF Dequantization

CVE-2026-53923 (medium) involves integer truncation of tensor dimensions in vLLM's GGUF dequantize CUDA kernels (csrc/quantization/gguf/gguf_kernel.cu). The kernel processes only a truncated number of elements, but the output tensor is allocated at full size via torch::empty, which leaves uninitialized GPU memory exposed. In multi-tenant serving environments, this can leak sensitive data from other tenants' GPU memory.
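The arithmetic behind such a bug can be illustrated without a GPU. The sketch below assumes the truncation is a narrowing to a signed 32-bit integer; the advisory text above says only "integer truncation", so the width is an assumption, and the helper names are hypothetical.

```python
def truncate_to_int32(n: int) -> int:
    # Mirrors a C-style narrowing conversion of a 64-bit value to int32_t.
    return ((n + 2**31) % 2**32) - 2**31


def unwritten_elements(allocated: int) -> int:
    """Elements of a full-size output buffer that a kernel never writes
    when its element count has been truncated to 32 bits."""
    processed = truncate_to_int32(allocated)
    # A negative or zero count means the kernel writes nothing at all.
    processed = max(0, min(processed, allocated))
    return allocated - processed
```

With these definitions, a tensor of 2**32 + 1024 elements is processed as if it had 1024, leaving 2**32 elements of the buffer holding whatever was previously in that memory. Allocating with a zero-filling call instead of an uninitialized one would hide stale data, but the underlying fix is to carry the dimensions in a wide enough integer type.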

Undefined GPU Behavior from NaN/Infinity Temperature

CVE-2026-54235 (medium) exploits the fact that all temperature validation gates use comparison operators (<, >). Under IEEE 754 float semantics, every ordered comparison with NaN evaluates to False, and a lower-bound check such as temperature < 0 is also False for positive Infinity, so neither value is rejected. Both pass every guard and propagate to GPU sampling kernels, where they produce undefined behavior or CUDA errors that can crash the server or corrupt outputs.
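The comparison behavior is easy to reproduce. In the sketch below, naive_check is a hypothetical guard written in the style described above, not vLLM's actual validator, and strict_check shows one possible hardening that uses math.isfinite.

```python
import math


def naive_check(temperature: float) -> bool:
    # Hypothetical comparison-only guard: rejects negative values.
    if temperature < 0.0:
        return False
    return True


def strict_check(temperature: float) -> float:
    # Reject NaN and both infinities before any range comparison.
    if not math.isfinite(temperature):
        raise ValueError("temperature must be a finite number")
    if temperature < 0.0:
        raise ValueError("temperature must be non-negative")
    return temperature
```

Here naive_check(float("nan")) and naive_check(float("inf")) both return True, because neither value is less than zero, while strict_check raises ValueError for both.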

Incomplete Fix Leaks PIL Repr Addresses

CVE-2026-54236 (medium) is an incomplete fix for a prior vulnerability (CVE-2026-22778). The earlier patch did not fully close the information leak: the Anthropic API router still exposes the memory addresses embedded in the repr of PIL (Python Imaging Library) objects. An unauthenticated attacker can obtain these addresses, which aids in the construction of further exploits.
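To see why a repr can leak an address, recall that Python's default object repr includes the object's memory address in CPython. The sketch below uses a stand-in class in place of a PIL image and a hypothetical scrub_addresses helper; it illustrates the class of leak and is not the patch applied to vLLM.

```python
import re

# Matches hexadecimal address literals such as 0x7f3a2c1b4d90.
_ADDRESS = re.compile(r"\b0x[0-9a-fA-F]+\b")


class FakeImage:
    """Stand-in for an image object; it keeps Python's default repr,
    which in CPython has the form '<... FakeImage object at 0x...>'."""


def scrub_addresses(message: str) -> str:
    # Replace anything that looks like a pointer before it leaves the server.
    return _ADDRESS.sub("<redacted>", message)
```

Scrubbing is a fallback. Returning a fixed, generic error message to clients and logging the detailed repr only on the server side avoids the problem at its source.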

Patch Status and Mitigations

As of the disclosure date, vLLM maintainers have not yet released a unified patch for all six CVEs. Users are advised to:

Apply the authentication bypass fix for CVE-2026-48746 immediately if exposing the OpenAI API endpoint.
Avoid running vLLM with PYTHONOPTIMIZE=1 or python -O until CVE-2026-41523 is patched.
Restrict access to the /v1/audio/transcriptions endpoint or set stricter upload limits as a workaround for CVE-2026-54233.
Monitor the vLLM GitHub repository for security advisories and updated releases.

Why This Batch Matters

This disclosure event highlights that vLLM, like any rapidly evolving AI infrastructure project, faces a widening attack surface that runs from API authentication and model loading to GPU memory management and audio processing. For organizations running vLLM in production, especially in multi-tenant or internet-facing configurations, the six CVEs together touch authentication, code execution, availability, and data confidentiality, and they warrant prompt attention.