CVE-2026-10705
Description
Dask's HLL Handler truncates hash bits, increasing collisions and enabling potential denial-of-service attacks via hash-based routing.
AI Insight
LLM-synthesized narrative grounded in this CVE's description and references.
Vulnerability
A flaw exists in dask's nunique_approx function within dask/dataframe/hyperloglog.py up to version 3.0. The function compute_hll_array hashes rows using pd.util.hash_pandas_object and then casts the 64-bit hash to 32-bit, losing significant hash information and increasing pre-estimation collisions. This issue also affects hash-based partition routing for shuffle and join operations [1].
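To make the effect of the cast concrete, the sketch below models the truncation in plain Python rather than calling Dask or pandas. It assumes that a uint64-to-uint32 cast keeps only the low 32 bits, which is how NumPy's `astype` behaves for unsigned integers. The helper names `truncate_to_uint32` and `expected_colliding_pairs` are hypothetical and exist only for this illustration.

```python
def truncate_to_uint32(h: int) -> int:
    """Keep only the low 32 bits, modelling a uint64 -> uint32 cast."""
    return h & 0xFFFFFFFF


def expected_colliding_pairs(n_keys: int, hash_bits: int) -> float:
    """Birthday-bound estimate: n(n-1)/2 pairs, each colliding with
    probability 2**-hash_bits under a uniform hash."""
    return n_keys * (n_keys - 1) / 2 / 2**hash_bits


# Two distinct 64-bit hashes that differ only in their high 32 bits.
a = 0x0000_0001_DEAD_BEEF
b = 0xFFFF_FFFF_DEAD_BEEF
collide_after_cast = truncate_to_uint32(a) == truncate_to_uint32(b)

# Expected accidental collisions among one million distinct keys.
pairs_32 = expected_colliding_pairs(1_000_000, 32)
pairs_64 = expected_colliding_pairs(1_000_000, 64)
print(collide_after_cast, pairs_32, pairs_64)
```

Under this estimate, one million distinct keys produce on the order of a hundred colliding pairs in a 32-bit space and effectively none in a 64-bit space.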
Exploitation
An attacker can exploit this vulnerability by crafting keys that, because of the hash truncation and the deterministic output of pd.util.hash_pandas_object, concentrate into a single target partition during shuffle or join operations. The attack is highly complex and considered difficult to carry out, but it can be launched remotely [1].
Impact
Successful exploitation can lead to resource consumption and availability degradation due to data skew caused by concentrated partitions. In deployments processing attacker-controlled keys, this can result in a denial-of-service condition by overwhelming a specific partition [1].
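The data skew described above can be illustrated with a toy router. This is a sketch under stated assumptions, not Dask's routing code: the hypothetical `route` function uses `hashlib.blake2b` as a stand-in for a deterministic, unkeyed hash, and `skew_ratio` is an illustrative metric (largest partition divided by the ideal even share).

```python
import hashlib
from collections import Counter


def route(key: str, npartitions: int) -> int:
    """Hypothetical router: stable 64-bit hash of the key, modulo partition count."""
    digest = hashlib.blake2b(key.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little") % npartitions


def skew_ratio(keys, npartitions: int) -> float:
    """Size of the largest partition divided by the ideal even share."""
    sizes = Counter(route(k, npartitions) for k in keys)
    return max(sizes.values()) / (len(keys) / npartitions)


npartitions = 8
benign = [f"user-{i}" for i in range(4000)]

# Because the hash is deterministic and unkeyed, keys can be selected offline
# so that all of them route to the same partition.
candidates = (f"k{i}" for i in range(200_000))
crafted = [k for k in candidates if route(k, npartitions) == 0][:4000]

print(skew_ratio(benign, npartitions), skew_ratio(crafted, npartitions))
```

With benign keys the ratio stays close to 1; with the pre-selected keys every row lands in one partition, so the ratio equals the partition count.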
Mitigation
A pull request to fix this issue is awaiting acceptance [3]. The proposed solution involves preserving the full uint64 output of pd.util.hash_pandas_object in HyperLogLog calculations and introducing a dataframe.shuffle.hash-key configuration option [3]. No patched version has been released as of the available references.
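The idea behind a configurable hash key can be sketched without Dask. The example below is illustrative only: it uses `hashlib.blake2b` in keyed mode as a stand-in for keyed hashing of string values, which is not the algorithm pandas uses, and the function `keyed_hash64` is hypothetical. The 16-byte key length mirrors the "16-character string" requirement in the proposed schema description, and the two fixed keys are the ones used in the pull request's tests.

```python
import hashlib
import secrets


def keyed_hash64(value: str, hash_key: str) -> int:
    """Hypothetical keyed 64-bit hash; the key must encode to exactly 16 bytes."""
    key = hash_key.encode("utf8")
    if len(key) != 16:
        raise ValueError("hash_key must encode to exactly 16 bytes")
    digest = hashlib.blake2b(value.encode("utf8"), key=key, digest_size=8).digest()
    return int.from_bytes(digest, "little")


values = ["a", "b", "c"]
hashes_key_1 = [keyed_hash64(v, "abcdefghijklmnop") for v in values]
hashes_key_2 = [keyed_hash64(v, "zyxwvutsrqponmlk") for v in values]

# A per-deployment secret key: 8 random bytes rendered as 16 hex characters.
random_key = secrets.token_hex(8)
hashes_random = [keyed_hash64(v, random_key) for v in values]
print(hashes_key_1 != hashes_key_2, len(random_key))
```

An attacker who does not know the key cannot precompute which partition a given value will hash to, which is the property the proposed option aims to provide.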
AI Insight generated on Jun 3, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.
Affected products: 1
Patches: 1
8e06781230ff: Merge 7620d876a3049dec06228d0bc05cd37e0641a179 into 41318496642e6c19be81e80e690cc547892c3470
6 files changed · +78 −6
dask/dask-schema.yaml (+10 −0, modified)
@@ -66,6 +66,16 @@ properties:
               Compression algorithm used for on disk-shuffling. Partd, the library used for compression supports ZLib, BZ2, and SNAPPY

+          hash-key:
+            type:
+              - string
+              - "null"
+            description: |
+              Hash key for shuffle partitioning. When set to a 16-character string,
+              this key is passed to pd.util.hash_pandas_object to change hash output
+              for string/object columns. Set to a random value to mitigate
+              HashDoS-style partition hotspotting in adversarial environments.
+
       parquet:
         type: object
dask/dask.yaml (+1 −0, modified)
@@ -11,6 +11,7 @@ dataframe:
   shuffle:
     method: null
     compression: null  # compression for on disk-shuffling. Partd supports ZLib, BZ2, SNAPPY
+    hash-key: null  # Hash key for shuffle partitioning. Set to a random string to mitigate HashDoS.
   parquet:
     metadata-task-size-local: 512  # Number of files per local metadata-processing task
     metadata-task-size-remote: 1  # Number of files per remote metadata-processing task
dask/dataframe/backends.py (+4 −0, modified)
@@ -545,6 +545,10 @@ def get_parallel_type_object(_):
 def hash_object_pandas(
     obj, index=True, encoding="utf8", hash_key=None, categorize=True
 ):
+    if hash_key is None:
+        from dask import config
+
+        hash_key = config.get("dataframe.shuffle.hash-key", None)
     return pd.util.hash_pandas_object(
         obj, index=index, encoding=encoding, hash_key=hash_key, categorize=categorize
     )
dask/dataframe/hyperloglog.py (+6 −6, modified)
@@ -20,24 +20,24 @@
 def compute_first_bit(a):
     "Compute the position of the first nonzero bit for each int in an array."
     # TODO: consider making this less memory-hungry
-    bits = np.bitwise_and.outer(a, 1 << np.arange(32))
+    bits = np.bitwise_and.outer(a, np.uint64(1) << np.arange(64, dtype=np.uint64))
     bits = bits.cumsum(axis=1).astype(bool)
-    return 33 - bits.sum(axis=1)
+    return 65 - bits.sum(axis=1)


 def compute_hll_array(obj, b):
     # b is the number of bits

     if not 8 <= b <= 16:
         raise ValueError("b should be between 8 and 16")
-    num_bits_discarded = 32 - b
+    num_bits_discarded = 64 - b
     m = 1 << b

     # Get an array of the hashes
     hashes = hash_pandas_object(obj, index=False)
     if isinstance(hashes, pd.Series):
         hashes = hashes._values
-    hashes = hashes.astype(np.uint32)
+    hashes = hashes.astype(np.uint64)

     # Of the first b bits, which is the first nonzero?
     j = hashes >> num_bits_discarded
@@ -78,6 +78,6 @@ def estimate_count(Ms, b):
     V = (M == 0).sum()
     if V:
         return m * np.log(m / V)
-    if E > 2**32 / 30.0:
-        return -(2**32) * np.log1p(-E / 2**32)
+    if E > 2**64 / 30.0:
+        return -(2**64) * np.log1p(-E / 2**64)
     return E
dask/dataframe/tests/test_hashing.py (+30 −0, modified)
@@ -82,3 +82,33 @@ def test_hash_object_dispatch(obj):
     result = dd.dispatch.hash_object_dispatch(obj)
     expected = pd.util.hash_pandas_object(obj)
     assert_eq(result, expected)
+
+
+def test_hash_object_dispatch_custom_hash_key():
+    import dask
+
+    obj = pd.DataFrame({"x": ["a", "b", "c"], "y": ["d", "e", "f"]})
+    default_hash = dd.dispatch.hash_object_dispatch(obj, index=False)
+
+    with dask.config.set({"dataframe.shuffle.hash-key": "abcdefghijklmnop"}):
+        keyed_hash = dd.dispatch.hash_object_dispatch(obj, index=False)
+
+    assert not (default_hash == keyed_hash).all()
+
+
+def test_hash_object_dispatch_explicit_key_overrides_config():
+    import dask
+
+    obj = pd.Series(["a", "b", "c"])
+    explicit_hash = dd.dispatch.hash_object_dispatch(
+        obj, index=False, hash_key="abcdefghijklmnop"
+    )
+
+    with dask.config.set({"dataframe.shuffle.hash-key": "zyxwvutsrqponmlk"}):
+        config_hash = dd.dispatch.hash_object_dispatch(obj, index=False)
+        explicit_in_config = dd.dispatch.hash_object_dispatch(
+            obj, index=False, hash_key="abcdefghijklmnop"
+        )
+
+    assert_eq(explicit_hash, explicit_in_config)
+    assert not (explicit_hash == config_hash).all()
dask/dataframe/tests/test_hyperloglog.py (+27 −0, modified)
@@ -94,3 +94,30 @@ def test_larger_data():
         seed=1,
     )
     assert df.nunique_approx().compute() > 1000
+
+
+def test_compute_first_bit_64bit():
+    from dask.dataframe.hyperloglog import compute_first_bit
+
+    arr = np.array([1 << 40, 0, 1], dtype=np.uint64)
+
+    result = compute_first_bit(arr)
+
+    np.testing.assert_array_equal(result, np.array([41, 65, 1]))
+
+
+def test_compute_hll_array_uses_high_uint64_bits(monkeypatch):
+    from dask.dataframe import hyperloglog
+
+    hashes = pd.Series(np.array([1 << 63, 1 << 62], dtype=np.uint64))
+
+    def hash_pandas_object(obj, index=False):
+        return hashes
+
+    monkeypatch.setattr(hyperloglog, "hash_pandas_object", hash_pandas_object)
+
+    state = hyperloglog.compute_hll_array(pd.Series([0, 1]), b=8)
+
+    assert state[128] == 64
+    assert state[64] == 63
+    assert state[0] == 0
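The `estimate_count` hunk in the `hyperloglog.py` diff changes the width used by the large-range correction. The sketch below isolates just that branch in plain Python to show how the width affects the result. The function name `large_range_correction` is hypothetical, and the small-range branch of `estimate_count` is deliberately omitted.

```python
import math


def large_range_correction(E: float, hash_bits: int) -> float:
    """Apply the large-range correction from the diff for a given hash width.

    Estimates above 1/30 of the hash space are adjusted upward to account
    for hash collisions; smaller estimates are returned unchanged.
    """
    space = 2.0**hash_bits
    if E > space / 30.0:
        return -space * math.log1p(-E / space)
    return E


raw_estimate = 2.0e8  # above 2**32 / 30, far below 2**64 / 30
corrected_32 = large_range_correction(raw_estimate, 32)
corrected_64 = large_range_correction(raw_estimate, 64)
print(corrected_32, corrected_64)
```

With a 32-bit space the raw estimate of 2e8 is raised by roughly 2 percent, while with a 64-bit space the same estimate is far below the threshold and passes through unchanged.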
Vulnerability mechanics
Root cause
"The HyperLogLog implementation truncates 64-bit hash outputs to 32-bit, leading to resource exhaustion."
Attack vector
An attacker with low privileges can remotely trigger resource consumption through data processed by the `nunique_approx` function in `dask/dataframe/hyperloglog.py`. Because hashes are truncated to 32 bits, crafted input collides more readily than it would with the full 64-bit hash. Exploitation complexity is high, which makes the attack difficult to carry out [CWE-400].
Affected code
The vulnerability resides in the `compute_first_bit` and `compute_hll_array` functions within `dask/dataframe/hyperloglog.py`. `compute_hll_array` casts 64-bit hash values to `np.uint32`, and both functions hard-code a 32-bit width (`np.arange(32)`, `33 - bits.sum(axis=1)`, `32 - b`). The patch also changes `hash_object_pandas` in `dask/dataframe/backends.py` so that, when no `hash_key` argument is passed, it falls back to the `dataframe.shuffle.hash-key` configuration value.
What the fix does
The patch preserves the full 64-bit output of `pd.util.hash_pandas_object` within the HyperLogLog implementation, preventing truncation to 32-bit [ref_id=1]. Additionally, a new configuration option `dataframe.shuffle.hash-key` was introduced and passed to the hashing function, allowing for custom hash keys to mitigate potential hotspots [ref_id=1]. This change ensures that the hashing mechanism correctly utilizes the full hash output, thereby resolving the resource consumption vulnerability.
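The per-row arithmetic of the patched HyperLogLog code can be mirrored in pure Python. This is a sketch, not the library's implementation: `first_bit_position` and `hll_register_update` are hypothetical names, and the lowest-set-bit trick `(h & -h).bit_length()` is swapped in for the array-based `np.bitwise_and.outer` approach in the diff. The checked values come from the tests added in the pull request.

```python
def first_bit_position(h: int, width: int = 64) -> int:
    """1-indexed position of the lowest set bit, or width + 1 when h == 0."""
    if h == 0:
        return width + 1
    return (h & -h).bit_length()


def hll_register_update(h: int, b: int, width: int = 64):
    """Return (register index, first-bit position) for one hash value.

    The register index is taken from the top b bits of the hash, matching
    `hashes >> num_bits_discarded` with `num_bits_discarded = width - b`.
    """
    j = h >> (width - b)
    return j, first_bit_position(h, width)


positions = [first_bit_position(x) for x in (1 << 40, 0, 1)]
update_high = hll_register_update(1 << 63, b=8)
update_next = hll_register_update(1 << 62, b=8)
print(positions, update_high, update_next)
```

With a 64-bit width, a hash whose only set bit is bit 63 selects register 128, which a 32-bit truncation would never reach because that bit is discarded by the cast.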
Preconditions
- auth: Attacker must have low privileges.
- network: The attack can be carried out remotely.
Generated on Jun 3, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.
References: 6
News mentions: 0
No linked articles in our index yet.