vLLM: GGUF dequantize kernel int truncation exposes uninitialized GPU memory in multi-tenant serving
Description
Summary
Integer truncation of tensor dimensions in vLLM's GGUF dequantize kernels (csrc/quantization/gguf/gguf_kernel.cu) causes partial tensor processing. The output tensor is allocated at full size via torch::empty (uninitialized memory), but the dequantize CUDA kernel processes only a truncated number of elements. The unfilled portion of the output tensor retains whatever was previously in GPU memory. In multi-tenant inference deployments, this residual GPU memory may contain tensor data from other users' inference requests, constituting information disclosure.
Root Cause
The to_cuda_ggml_t function pointer type at ggml-common.h:1067 declares its element count parameter as int (32-bit):
using to_cuda_ggml_t = void (*)(const void * __restrict__ x,
                                dst_t * __restrict__ y,
                                int k,  // 32-bit
                                cudaStream_t stream);
All dequantize kernel functions (dequantize_block_cuda, dequantize_row_q2_K_cuda, etc. in dequantize.cuh) inherit this int k parameter and use it to compute the kernel launch grid size:
static void dequantize_block_cuda(..., const int k, cudaStream_t stream) {
    const int num_blocks = (k + 2*CUDA_DEQUANTIZE_BLOCK_SIZE - 1) / (2*CUDA_DEQUANTIZE_BLOCK_SIZE);
    dequantize_block<<<num_blocks, CUDA_DEQUANTIZE_BLOCK_SIZE, 0, stream>>>(vx, y, k);
}
In ggml_dequantize() at gguf_kernel.cu:85, the caller passes m * n (an int64_t product) to this int k parameter:
at::Tensor DW = torch::empty({m, n}, options); // line 80: full-size, UNINITIALIZED
// ...
to_cuda((void*)W.data_ptr(), (scalar_t*)DW.data_ptr(), m * n, stream); // line 85: m*n truncated to int
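When m * n > INT_MAX, the value no longer fits in int and k receives only the low 32 bits of the product, so k does not match the actual tensor size and can even be zero or negative. For a [65536, 65536] tensor, m * n = 2^32 and the truncated k is 0. The kernel processes at most k elements. The remaining elements in DW are never written and contain stale GPU memory.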
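The truncation arithmetic can be checked on the host. The following is a minimal sketch, not vLLM code: `truncate_to_int` and `unwritten_elements` are hypothetical helpers introduced for illustration, a 32-bit `int` is assumed, and the model assumes that a non-positive `k` results in no elements being written.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// Hypothetical helper: models what an `int k` parameter receives when the
// caller passes an int64_t element count. Only the low 32 bits survive.
inline int truncate_to_int(std::int64_t count) {
    const std::uint32_t low = static_cast<std::uint32_t>(count);  // modulo 2^32
    if (low <= static_cast<std::uint32_t>(INT_MAX)) {
        return static_cast<int>(low);
    }
    // Reinterpret the upper half of the 32-bit range as a negative value.
    return static_cast<int>(static_cast<std::int64_t>(low) - (std::int64_t{1} << 32));
}

// Hypothetical helper: number of output elements left untouched when the
// kernel is driven by the truncated count. Assumption of this model: a
// non-positive k means nothing is written.
inline std::int64_t unwritten_elements(std::int64_t m, std::int64_t n) {
    assert(m >= 0 && n >= 0);
    const std::int64_t total = m * n;
    const int k = truncate_to_int(total);
    const std::int64_t processed = k > 0 ? k : 0;
    return total - processed;
}
```

Under these assumptions, `unwritten_elements(65536, 65536)` reports that the whole 2^32-element tensor stays unwritten, because the truncated count is exactly zero.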
This is a single root cause -- the int type on the k parameter in to_cuda_ggml_t -- with a single fix: change int k to int64_t k. All dequantize functions inherit this type through the same typedef.
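As a rough illustration of why widening the type is sufficient for the grid-size computation, the sketch below repeats the rounding-up division from dequantize_block_cuda in 64-bit arithmetic. It is host-only and not taken from the patch: `num_blocks_for` is a hypothetical name, and the block size of 256 is an assumed value standing in for CUDA_DEQUANTIZE_BLOCK_SIZE.

```cpp
#include <cassert>
#include <cstdint>

// Assumed value for illustration; stands in for CUDA_DEQUANTIZE_BLOCK_SIZE.
constexpr std::int64_t kDequantizeBlockSize = 256;

// Hypothetical helper: the same ceiling division used for num_blocks,
// carried out on a 64-bit element count so that large tensors do not wrap.
inline std::int64_t num_blocks_for(std::int64_t k) {
    assert(k >= 0);
    return (k + 2 * kDequantizeBlockSize - 1) / (2 * kDequantizeBlockSize);
}
```

With a 64-bit count, a 2^32-element tensor maps to 8,388,608 blocks under the assumed block size, which is well within the x-dimension grid limit of current CUDA devices. Any index computed inside the kernel would need the same 64-bit treatment for the fix to be complete.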
Affected Functions
All in csrc/quantization/gguf/gguf_kernel.cu:
| Function | Line | Allocation | Info Disclosure? |
|----------|------|-----------|-----------------|
| ggml_dequantize | 74 | torch::empty({m, n}) at line 80 | Yes -- m*n truncated to int k at line 85 |
| ggml_mul_mat_vec_a8 | 91 | torch::empty({vecs, row}) at line 99 | Yes -- int col = X.sizes()[1] at line 94 |
| ggml_mul_mat_a8 | 207 | torch::empty({batch, row}) at line 215 | Yes -- int col = X.sizes()[1] at line 210 |
| ggml_moe_a8 | 279 | torch::empty({tokens*top_k, row}) at line 289 | Yes -- int col = X.sizes()[1] at line 285 |
All four functions allocate output tensors with torch::empty (uninitialized) and then run CUDA kernels that use truncated dimension values as loop bounds. The unfilled portion of each output tensor retains stale GPU memory.
ggml_moe_a8_vec (line 382) uses torch::zeros instead of torch::empty, so it is not affected by the info disclosure variant.
Impact: Information Disclosure in Multi-Tenant Serving
vLLM is designed for multi-tenant inference serving. GPU memory is reused across requests from different users. When the dequantize kernel partially fills an output tensor:
- The output tensor DW is allocated with torch::empty -- the buffer contains whatever was previously in that GPU memory region
- The dequantize kernel fills only a truncated portion of the buffer
- The unfilled portion retains residual data from prior GPU operations, which may include tensor data from other users' inference requests
- The contaminated tensor proceeds through the model computation
- No error or warning is generated -- the partial fill is silent
This is a confidentiality violation. In shared inference deployments (the primary vLLM use case), one user's inference data can leak into another user's model computation through residual GPU memory.
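The sequence above can be modeled on the host with ordinary vectors. This is a simplified analogy under stated assumptions, not a reproduction of the GPU behavior: `reuse_buffer`, `partial_dequantize`, and `count_stale` are hypothetical names, and a copied vector stands in for an allocator handing back previously used memory without clearing it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an uninitialized allocation: the "new" buffer
// still holds whatever the previous owner of the memory left behind.
inline std::vector<float> reuse_buffer(const std::vector<float>& previous_contents) {
    return previous_contents;
}

// Hypothetical stand-in for a dequantize kernel driven by a truncated
// count: only the first k elements are overwritten.
inline void partial_dequantize(std::vector<float>& out, std::size_t k, float value) {
    assert(k <= out.size());
    for (std::size_t i = 0; i < k; ++i) {
        out[i] = value;
    }
}

// Counts elements that still carry the previous owner's marker value.
inline std::size_t count_stale(const std::vector<float>& out, float stale_marker) {
    std::size_t stale = 0;
    for (float v : out) {
        if (v == stale_marker) {
            ++stale;
        }
    }
    return stale;
}
```

For example, reusing an 8-element buffer filled with a marker value and writing only 3 elements leaves 5 elements of the previous owner's data in the output, with no error raised at any step.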
Attacker Control
The attacker crafts a GGUF model file with weight tensor dimensions whose product exceeds INT_MAX (e.g., a matrix with shape [65536, 65536] gives m * n = 4,294,967,296). The model is hosted on HuggingFace or any model hub. The victim loads the model with vLLM for inference serving. The truncation happens automatically during model weight dequantization.
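Because the dimensions come from an untrusted file, one possible defensive measure is to validate them before any narrowing takes place. The sketch below is an assumption for illustration and is not part of the referenced patch: `element_count_fits_int` and `checked_element_count` are hypothetical helpers, and the division-based comparison avoids overflowing while performing the check itself.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>
#include <stdexcept>

// Hypothetical guard: true when m * n can be represented as a 32-bit int.
// Uses division so the check cannot overflow.
inline bool element_count_fits_int(std::int64_t m, std::int64_t n) {
    if (m < 0 || n < 0) {
        return false;
    }
    if (m == 0 || n == 0) {
        return true;
    }
    return m <= INT_MAX / n;
}

// Hypothetical helper: narrows only after the range check has passed and
// rejects oversized shapes loudly so they cannot be silently truncated.
inline int checked_element_count(std::int64_t m, std::int64_t n) {
    if (!element_count_fits_int(m, n)) {
        throw std::overflow_error("tensor element count exceeds INT_MAX");
    }
    const std::int64_t product = m * n;
    assert(product >= 0 && product <= INT_MAX);
    return static_cast<int>(product);
}
```

A check of this kind turns the silent partial fill into an explicit load-time failure for shapes such as [65536, 65536], while shapes whose product fits in an int pass through unchanged.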
Fix
A fix for this vulnerability was added here: https://github.com/vllm-project/vllm/pull/44971
AI Insight
LLM-synthesized narrative grounded in this CVE's description and references.
Vulnerability mechanics
Root cause
"Integer truncation: the `to_cuda_ggml_t` function pointer type declares its element count as `int` (32-bit), so when `m * n` exceeds `INT_MAX` the truncated value causes the CUDA dequantize kernel to process only a subset of the output tensor, leaving the remainder uninitialized."
Attack vector
An attacker crafts a GGUF model file with weight tensor dimensions whose product exceeds `INT_MAX` (e.g., a matrix with shape `[65536, 65536]` giving `m * n = 4,294,967,296`). The model is hosted on a model hub such as HuggingFace. When a victim loads this model with vLLM for inference serving, the dequantize kernel processes only a truncated number of elements, leaving the remainder of the output tensor filled with stale GPU memory from prior operations. In multi-tenant deployments, this residual data may contain tensor values from other users' inference requests, constituting an information disclosure [ref_id=1].
Affected code
The root cause is the `to_cuda_ggml_t` function pointer type in `ggml-common.h:1067`, which declares its element count parameter as `int` (32-bit). All dequantize kernel functions in `dequantize.cuh` inherit this `int k` parameter and use it to compute the kernel launch grid size. In `gguf_kernel.cu`, `ggml_dequantize` passes the `int64_t` product `m * n` to this `int` parameter at line 85, while `ggml_mul_mat_vec_a8`, `ggml_mul_mat_a8`, and `ggml_moe_a8` narrow `X.sizes()[1]` into `int col` (lines 94, 210, and 285). In each case the output tensor is allocated with `torch::empty` and is only partially written when the narrowed value does not fit in an `int`.
What the fix does
The patch changes the `k` parameter type from `int` to `int64_t` in the `to_cuda_ggml_t` typedef in `ggml-common.h` and in all dequantize kernel functions in `dequantize.cuh` [patch_id=6351924]. Additionally, in `gguf_kernel.cu`, the patch replaces `torch::empty` with `torch::fill_(..., 0.0)` for the output tensors in `ggml_dequantize`, `ggml_mul_mat_vec_a8`, `ggml_mul_mat_a8`, and `ggml_moe_a8`, and changes local dimension variables from `int` to `int64_t`. The type fix ensures the full element count is passed to the kernel, while the zero-fill guarantees that even if a future truncation bug occurs, no stale GPU memory is exposed.
Preconditions
- input: The attacker must provide a GGUF model file with at least one weight tensor whose dimension product exceeds INT_MAX (e.g., shape [65536, 65536]).
- config: The victim must load the malicious model into vLLM on a CUDA-capable GPU.
- config: The GPU memory must have been previously used by other inference requests so that stale data is present in uninitialized allocations.
Generated on Jun 17, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.