VYPR
Low severity · 3.1 · NVD Advisory · Published Jun 3, 2026

CVE-2026-10705

Description

Dask's HLL Handler truncates hash bits, increasing collisions and enabling potential denial-of-service attacks via hash-based routing.

AI Insight

LLM-synthesized narrative grounded in this CVE's description and references.

Vulnerability

A flaw exists in Dask's nunique_approx function within dask/dataframe/hyperloglog.py up to version 3.0. The function compute_hll_array hashes rows using pd.util.hash_pandas_object and then casts the 64-bit hash to 32 bits, discarding the upper half of the hash output and increasing collisions before the cardinality estimate is computed. This issue also affects hash-based partition routing for shuffle and join operations [1].
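The effect of the cast can be illustrated with a minimal sketch. The two hash values below are made up for illustration and were not produced by pd.util.hash_pandas_object; they differ only in their upper 32 bits, which is exactly the information a cast to 32 bits discards.

```python
import numpy as np

# Hypothetical 64-bit hash values (not real pandas hashes) that differ
# only in their upper 32 bits.
hashes = np.array([0x0000000112345678, 0xABCDEF0012345678], dtype=np.uint64)

# The affected code path cast hashes to 32 bits, keeping only the low half.
truncated = hashes.astype(np.uint32)

# The proposed patch keeps the full 64-bit value.
preserved = hashes.astype(np.uint64)

collide_after_truncation = bool(truncated[0] == truncated[1])
collide_when_preserved = bool(preserved[0] == preserved[1])

print("collide after truncation:", collide_after_truncation)
print("collide when preserved:  ", collide_when_preserved)
```

Two distinct inputs become indistinguishable once only the low 32 bits remain, which is the collision increase the description refers to.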

Exploitation

An attacker can exploit this vulnerability by crafting specific keys that, due to the hash truncation and the deterministic nature of pd.util.hash_pandas_object, concentrate into a single target partition during shuffle or join operations. This requires a high degree of complexity and is known to be difficult to exploit, but can be carried out remotely [1].
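A rough sketch of the key-concentration idea follows. It assumes, for illustration only, that a row is routed to partition `hash % NPARTITIONS`; the partition count, the candidate key format, and the variable names are hypothetical and not taken from the advisory. Because pd.util.hash_pandas_object is deterministic for a given hash key, anyone can precompute which keys land in a chosen partition.

```python
import numpy as np
import pandas as pd

NPARTITIONS = 8  # hypothetical partition count
TARGET = 0       # hypothetical partition the attacker wants to overload

# Candidate keys an attacker could enumerate offline.
candidates = pd.Series([f"key-{i}" for i in range(2000)], dtype=object)

# Deterministic hashing: the same key always yields the same hash.
hashes = pd.util.hash_pandas_object(candidates, index=False).to_numpy()

# Assumed routing rule for this sketch: partition = hash mod NPARTITIONS.
partition = hashes % np.uint64(NPARTITIONS)

# Keep only the keys that route to the target partition.
hot_keys = candidates[partition == TARGET]

# Re-hashing the selected keys confirms they all land in the same partition.
recheck = pd.util.hash_pandas_object(hot_keys, index=False).to_numpy()
all_in_target = bool(((recheck % np.uint64(NPARTITIONS)) == TARGET).all())

print(len(hot_keys), "of", len(candidates), "candidates route to partition", TARGET)
```

A workload fed only such keys would place every row in one partition, which is the data skew described in the Impact section.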

Impact

Successful exploitation can lead to resource consumption and availability degradation due to data skew caused by concentrated partitions. In deployments processing attacker-controlled keys, this can result in a denial-of-service condition by overwhelming a specific partition [1].

Mitigation

A pull request to fix this issue is awaiting acceptance [3]. The proposed solution involves preserving the full uint64 output of pd.util.hash_pandas_object in HyperLogLog calculations and introducing a dataframe.shuffle.hash-key configuration option [3]. No patched version has been released as of the available references.
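Since the `dataframe.shuffle.hash-key` option exists only in the unmerged pull request, the sketch below does not use it. It instead calls pd.util.hash_pandas_object directly with its existing `hash_key` parameter to show the underlying idea: a random 16-character key changes the hash output for string/object values, so precomputed colliding keys no longer apply. The sample data and variable names are illustrative.

```python
import secrets

import pandas as pd

keys = pd.Series(["a", "b", "c"], dtype=object)

# Hashes under the pandas default hash key.
default_hash = pd.util.hash_pandas_object(keys, index=False)

# token_hex(8) yields 16 hexadecimal characters, i.e. 16 bytes in UTF-8,
# which is the key length pandas expects.
random_key = secrets.token_hex(8)

# Hashes under the random key.
keyed_hash = pd.util.hash_pandas_object(keys, index=False, hash_key=random_key)

hashes_differ = not bool((default_hash == keyed_hash).all())
print("hash output changed by random key:", hashes_differ)
```

In a distributed deployment every worker would need to use the same key, otherwise equal rows could be routed to different partitions; how the key is distributed is outside the scope of this sketch.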

AI Insight generated on Jun 3, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.

Affected products

1

Patches

1
8e06781230ff

Merge 7620d876a3049dec06228d0bc05cd37e0641a179 into 41318496642e6c19be81e80e690cc547892c3470

https://github.com/dask/dask · 3em0 · Jun 2, 2026 · via nvd-ref
6 files changed · +78 -6
  • dask/dask-schema.yaml +10 -0 modified
    @@ -66,6 +66,16 @@ properties:
                   Compression algorithm used for on disk-shuffling. Partd, the library used
                   for compression supports ZLib, BZ2, and SNAPPY
     
    +          hash-key:
    +            type:
    +            - string
    +            - "null"
    +            description: |
    +              Hash key for shuffle partitioning. When set to a 16-character string,
    +              this key is passed to pd.util.hash_pandas_object to change hash output
    +              for string/object columns. Set to a random value to mitigate
    +              HashDoS-style partition hotspotting in adversarial environments.
    +
     
           parquet:
             type: object
    
  • dask/dask.yaml +1 -0 modified
    @@ -11,6 +11,7 @@ dataframe:
       shuffle:
         method: null
         compression: null  # compression for on disk-shuffling. Partd supports ZLib, BZ2, SNAPPY
    +    hash-key: null  # Hash key for shuffle partitioning. Set to a random string to mitigate HashDoS.
       parquet:
         metadata-task-size-local: 512  # Number of files per local metadata-processing task
         metadata-task-size-remote: 1  # Number of files per remote metadata-processing task
    
  • dask/dataframe/backends.py +4 -0 modified
    @@ -545,6 +545,10 @@ def get_parallel_type_object(_):
     def hash_object_pandas(
         obj, index=True, encoding="utf8", hash_key=None, categorize=True
     ):
    +    if hash_key is None:
    +        from dask import config
    +
    +        hash_key = config.get("dataframe.shuffle.hash-key", None)
         return pd.util.hash_pandas_object(
             obj, index=index, encoding=encoding, hash_key=hash_key, categorize=categorize
         )
    
  • dask/dataframe/hyperloglog.py +6 -6 modified
    @@ -20,24 +20,24 @@
     def compute_first_bit(a):
         "Compute the position of the first nonzero bit for each int in an array."
         # TODO: consider making this less memory-hungry
    -    bits = np.bitwise_and.outer(a, 1 << np.arange(32))
    +    bits = np.bitwise_and.outer(a, np.uint64(1) << np.arange(64, dtype=np.uint64))
         bits = bits.cumsum(axis=1).astype(bool)
    -    return 33 - bits.sum(axis=1)
    +    return 65 - bits.sum(axis=1)
     
     
     def compute_hll_array(obj, b):
         # b is the number of bits
     
         if not 8 <= b <= 16:
             raise ValueError("b should be between 8 and 16")
    -    num_bits_discarded = 32 - b
    +    num_bits_discarded = 64 - b
         m = 1 << b
     
         # Get an array of the hashes
         hashes = hash_pandas_object(obj, index=False)
         if isinstance(hashes, pd.Series):
             hashes = hashes._values
    -    hashes = hashes.astype(np.uint32)
    +    hashes = hashes.astype(np.uint64)
     
         # Of the first b bits, which is the first nonzero?
         j = hashes >> num_bits_discarded
    @@ -78,6 +78,6 @@ def estimate_count(Ms, b):
             V = (M == 0).sum()
             if V:
                 return m * np.log(m / V)
    -    if E > 2**32 / 30.0:
    -        return -(2**32) * np.log1p(-E / 2**32)
    +    if E > 2**64 / 30.0:
    +        return -(2**64) * np.log1p(-E / 2**64)
         return E
    
  • dask/dataframe/tests/test_hashing.py +30 -0 modified
    @@ -82,3 +82,33 @@ def test_hash_object_dispatch(obj):
         result = dd.dispatch.hash_object_dispatch(obj)
         expected = pd.util.hash_pandas_object(obj)
         assert_eq(result, expected)
    +
    +
    +def test_hash_object_dispatch_custom_hash_key():
    +    import dask
    +
    +    obj = pd.DataFrame({"x": ["a", "b", "c"], "y": ["d", "e", "f"]})
    +    default_hash = dd.dispatch.hash_object_dispatch(obj, index=False)
    +
    +    with dask.config.set({"dataframe.shuffle.hash-key": "abcdefghijklmnop"}):
    +        keyed_hash = dd.dispatch.hash_object_dispatch(obj, index=False)
    +
    +    assert not (default_hash == keyed_hash).all()
    +
    +
    +def test_hash_object_dispatch_explicit_key_overrides_config():
    +    import dask
    +
    +    obj = pd.Series(["a", "b", "c"])
    +    explicit_hash = dd.dispatch.hash_object_dispatch(
    +        obj, index=False, hash_key="abcdefghijklmnop"
    +    )
    +
    +    with dask.config.set({"dataframe.shuffle.hash-key": "zyxwvutsrqponmlk"}):
    +        config_hash = dd.dispatch.hash_object_dispatch(obj, index=False)
    +        explicit_in_config = dd.dispatch.hash_object_dispatch(
    +            obj, index=False, hash_key="abcdefghijklmnop"
    +        )
    +
    +    assert_eq(explicit_hash, explicit_in_config)
    +    assert not (explicit_hash == config_hash).all()
    
  • dask/dataframe/tests/test_hyperloglog.py +27 -0 modified
    @@ -94,3 +94,30 @@ def test_larger_data():
             seed=1,
         )
         assert df.nunique_approx().compute() > 1000
    +
    +
    +def test_compute_first_bit_64bit():
    +    from dask.dataframe.hyperloglog import compute_first_bit
    +
    +    arr = np.array([1 << 40, 0, 1], dtype=np.uint64)
    +
    +    result = compute_first_bit(arr)
    +
    +    np.testing.assert_array_equal(result, np.array([41, 65, 1]))
    +
    +
    +def test_compute_hll_array_uses_high_uint64_bits(monkeypatch):
    +    from dask.dataframe import hyperloglog
    +
    +    hashes = pd.Series(np.array([1 << 63, 1 << 62], dtype=np.uint64))
    +
    +    def hash_pandas_object(obj, index=False):
    +        return hashes
    +
    +    monkeypatch.setattr(hyperloglog, "hash_pandas_object", hash_pandas_object)
    +
    +    state = hyperloglog.compute_hll_array(pd.Series([0, 1]), b=8)
    +
    +    assert state[128] == 64
    +    assert state[64] == 63
    +    assert state[0] == 0
    

Vulnerability mechanics

Root cause

"The HyperLogLog implementation truncates 64-bit hash outputs to 32-bit, leading to resource exhaustion."

Attack vector

An attacker with low privileges who can supply data to a Dask workload can remotely trigger resource consumption, either through the `nunique_approx` code path in `dask/dataframe/hyperloglog.py` or by crafting keys that concentrate in a single partition during hash-based shuffle and join routing. Exploitation complexity is high, which makes the attack difficult to carry out [CWE-400].

Affected code

The vulnerability resides in the `compute_first_bit` and `compute_hll_array` functions within `dask/dataframe/hyperloglog.py`. Specifically, `compute_hll_array` casts 64-bit hash values to `np.uint32`, and `compute_first_bit` and `estimate_count` hard-code a 32-bit range in their calculations. The `hash_object_pandas` function in `dask/dataframe/backends.py`, which already accepted a `hash_key` argument, was also changed to fall back to the `dataframe.shuffle.hash-key` configuration value when no `hash_key` is passed.

What the fix does

The patch preserves the full 64-bit output of `pd.util.hash_pandas_object` within the HyperLogLog implementation instead of truncating it to 32 bits [ref_id=1]. It also introduces a configuration option, `dataframe.shuffle.hash-key`, which `hash_object_pandas` passes to the hashing function when no explicit `hash_key` is given; setting it to a random 16-character string changes hash output for string/object columns and is intended to mitigate partition hotspots [ref_id=1]. Together, these changes are intended to address the resource consumption issue.
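The difference between the 32-bit and 64-bit bit-position logic can be sketched as follows. The `first_bit` helper is a hypothetical generalization written for this illustration, modeled on the `compute_first_bit` lines in the diff; it is not the function shipped in Dask. The input values match those used in the patch's `test_compute_first_bit_64bit` test.

```python
import numpy as np


def first_bit(a, width):
    """Position of the lowest set bit (1-based); width + 1 if no bit is set."""
    masks = np.uint64(1) << np.arange(width, dtype=np.uint64)
    # One column per bit position; non-zero where that bit is set.
    bits = np.bitwise_and.outer(a.astype(np.uint64), masks)
    # After the cumulative sum, every column at or above the lowest set bit is True.
    bits = bits.cumsum(axis=1).astype(bool)
    return (width + 1) - bits.sum(axis=1)


hashes = np.array([1 << 40, 0, 1], dtype=np.uint64)

# Patched behaviour: the full 64-bit value is inspected.
wide = first_bit(hashes, 64)

# Pre-patch behaviour: the value is first cast to 32 bits, so bit 40 is lost
# and the first element becomes indistinguishable from zero.
narrow = first_bit(hashes.astype(np.uint32), 32)

print("64-bit:", wide.tolist())
print("32-bit:", narrow.tolist())
```

With 64 bits the first element reports position 41, while after the 32-bit cast it reports the same "no bit set" value as a zero hash, showing the information the truncation discards.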

Preconditions

  • auth: Attacker must have low privileges.
  • network: The attack can be carried out remotely.

Generated on Jun 3, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.

References

6
