VYPR
Low severity · 3.6 · NVD Advisory · Published Jun 4, 2026 · Updated Jun 4, 2026

CVE-2026-10803

Description

MLflow versions prior to 3.10.0 are vulnerable to predictable hash collisions in dataset digests, allowing data manipulation.

AI Insight

LLM-synthesized narrative grounded in this CVE's description and references.

Vulnerability

A flaw exists in MLflow versions up to 3.10.0 in the dataset digest computation in mlflow/data/digest_utils.py. The function samples deterministically (df.head(10000)) and hashes only selected column types, so an attacker can craft datasets with different semantic content that produce the same digest [1]. This affects all platforms and Python versions 3.10+.

Exploitation

An attacker who can submit datasets to a shared MLflow instance can exploit this vulnerability. The attack keeps the first 10,000 rows identical to a legitimate dataset, keeps the total row count unchanged (the row count is part of the digest), and arbitrarily modifies rows beyond the 10,000th. The resulting dataset has the same digest as the original despite different content, which makes data integrity difficult to verify [1]. Exploitability is assessed as difficult and attack complexity as high; details of the attack have been published.
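
The head-only collision described above can be illustrated with a minimal, stdlib-only sketch. It is a simplified model, not MLflow's code: `legacy_style_digest` is a hypothetical stand-in that mirrors only the shape of the pre-patch logic (hash the first MAX_ROWS rows, the total row count, and the column names, then truncate MD5 to 8 hex characters), and rows are plain tuples rather than a pandas DataFrame.

```python
import hashlib

MAX_ROWS = 10000  # same cap as the constant in mlflow/data/digest_utils.py


def legacy_style_digest(rows, columns):
    """Hypothetical model of the pre-patch digest: head-only sampling, 8 hex chars."""
    md5 = hashlib.md5(usedforsecurity=False)
    for row in rows[:MAX_ROWS]:  # rows past MAX_ROWS are never read
        md5.update(repr(row).encode())
    md5.update(str(len(rows)).encode())  # total row count is part of the digest
    for col in columns:
        md5.update(str(col).encode())
    return md5.hexdigest()[:8]


columns = ["feature", "label"]
legit = [(i, i % 2) for i in range(12000)]
# Same first 10,000 rows and same length, but every label past row 10,000 is flipped.
tampered = legit[:MAX_ROWS] + [(i, 1 - (i % 2)) for i in range(MAX_ROWS, 12000)]

assert legit != tampered
print(legacy_style_digest(legit, columns) == legacy_style_digest(tampered, columns))  # True
```

In this model the two digests match because neither the sampled rows nor the row count differ, which is the condition the attack relies on.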

Impact

Successful exploitation allows an attacker to create datasets with identical digests but different underlying data. This undermines the integrity verification of datasets used for experiment reproducibility and data lineage tracking, potentially impacting compliance requirements like GDPR and HIPAA, and compromising adversarial ML research environments where data integrity is critical [1].

Mitigation

A pull request has been submitted to address this issue by replacing deterministic head-only sampling with head+tail sampling, including all column types, and using SHA-256 for stronger collision resistance [3]. The fixed version is not yet disclosed, and no workaround is provided in the available references. MLflow is still actively developed [2].

AI Insight generated on Jun 4, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.

Affected products

2
  • Mlflow/Mlflow (2 versions)
    • (no CPE)
    • (no CPE), range: <=3.10.0

Patches

1
f71d22ba070b

Merge c475c0c7ff0e7f319c8fece50cf376b5f62d0218 into 2bd05b0f7d2d650d0ea4f8a7cd1f352cfdb656ee

https://github.com/mlflow/mlflow · 3em0 · Apr 7, 2026 · via nvd-ref
1 file changed · +70 −34
  • mlflow/data/digest_utils.py · +70 −34 · modified
    @@ -1,17 +1,20 @@
     import hashlib
     from typing import Any
     
    -from packaging.version import Version
    -
     from mlflow.exceptions import MlflowException
     from mlflow.protos.databricks_pb2 import INVALID_PARAMETER_VALUE
     
     MAX_ROWS = 10000
    +_DIGEST_SIZE = 32
     
     
     def compute_pandas_digest(df) -> str:
         """Computes a digest for the given Pandas DataFrame.
     
    +    Uses head+tail sampling to detect changes beyond the first MAX_ROWS rows,
    +    and includes all column types (not just string/numeric) to prevent
    +    collision attacks via excluded columns.
    +
         Args:
             df: A Pandas DataFrame.
     
    @@ -21,26 +24,29 @@ def compute_pandas_digest(df) -> str:
         import numpy as np
         import pandas as pd
     
    -    # trim to max rows
    -    trimmed_df = df.head(MAX_ROWS)
    +    hashable_elements = []
     
    -    # keep string and number columns, drop other column types
    -    if Version(pd.__version__) >= Version("2.1.0"):
    -        string_columns = trimmed_df.columns[(df.map(type) == str).all(0)]
    +    # For large DataFrames, sample both head and tail to prevent deterministic collision attacks
    +    if len(df) > MAX_ROWS:
    +        sample_size = MAX_ROWS // 2
    +        head_sample = df.head(sample_size)
    +        tail_sample = df.tail(sample_size)
    +        hashable_elements.append(pd.util.hash_pandas_object(head_sample).values)
    +        hashable_elements.append(pd.util.hash_pandas_object(tail_sample).values)
         else:
    -        string_columns = trimmed_df.columns[(df.applymap(type) == str).all(0)]
    -    numeric_columns = trimmed_df.select_dtypes(include=[np.number]).columns
    +        # For small DataFrames, hash all rows
    +        hashable_elements.append(pd.util.hash_pandas_object(df).values)
    +
    +    # Include total row count
    +    hashable_elements.append(np.int64(len(df)))
    +
    +    # Include column names
    +    hashable_elements.extend(str(col).encode() for col in df.columns)
     
    -    desired_columns = string_columns.union(numeric_columns)
    -    trimmed_df = trimmed_df[desired_columns]
    +    # Include dtype information to prevent type-coercion collisions
    +    hashable_elements.extend(str(dtype).encode() for dtype in df.dtypes)
     
    -    return get_normalized_md5_digest(
    -        [
    -            pd.util.hash_pandas_object(trimmed_df).values,
    -            np.int64(len(df)),
    -        ]
    -        + [str(x).encode() for x in df.columns]
    -    )
    +    return _compute_sha256_digest(hashable_elements)
     
     
     def compute_numpy_digest(features, targets=None) -> str:
    @@ -60,14 +66,29 @@ def compute_numpy_digest(features, targets=None) -> str:
     
         def hash_array(array):
             flattened_array = array.flatten()
    -        trimmed_array = flattened_array[0:MAX_ROWS]
    -        try:
    -            hashable_elements.append(pd.util.hash_array(trimmed_array))
    -        except TypeError:
    -            hashable_elements.append(np.int64(trimmed_array.size))
     
    -        # hash full array dimensions
    +        # For large arrays, sample both head and tail
    +        if flattened_array.size > MAX_ROWS:
    +            sample_size = MAX_ROWS // 2
    +            head_sample = flattened_array[:sample_size]
    +            tail_sample = flattened_array[-sample_size:]
    +            try:
    +                hashable_elements.append(pd.util.hash_array(head_sample))
    +                hashable_elements.append(pd.util.hash_array(tail_sample))
    +            except TypeError:
    +                hashable_elements.append(np.int64(head_sample.size))
    +                hashable_elements.append(np.int64(tail_sample.size))
    +        else:
    +            # For small arrays, hash all elements
    +            try:
    +                hashable_elements.append(pd.util.hash_array(flattened_array))
    +            except TypeError:
    +                hashable_elements.append(np.int64(flattened_array.size))
    +
    +        # Hash full array dimensions
             hashable_elements.extend(np.int64(x) for x in array.shape)
    +        # Include dtype to prevent type-coercion collisions
    +        hashable_elements.append(str(array.dtype).encode())
     
         def hash_dict_of_arrays(array_dict):
             for key in sorted(array_dict.keys()):
    @@ -81,27 +102,42 @@ def hash_dict_of_arrays(array_dict):
             else:
                 hash_array(item)
     
    -    return get_normalized_md5_digest(hashable_elements)
    +    return _compute_sha256_digest(hashable_elements)
     
     
    -def get_normalized_md5_digest(elements: list[Any]) -> str:
    -    """Computes a normalized digest for a list of hashable elements.
    +def _compute_sha256_digest(elements: list[Any]) -> str:
    +    """Computes a SHA-256 digest for a list of hashable elements.
     
         Args:
    -        elements: A list of hashable elements for inclusion in the md5 digest.
    +        elements: A list of hashable elements for inclusion in the digest.
     
         Returns:
    -        An 8-character, truncated md5 digest.
    +        A hex digest string truncated to _DIGEST_SIZE characters.
         """
    -
         if not elements:
             raise MlflowException(
    -            "No hashable elements were provided for md5 digest creation",
    +            "No hashable elements were provided for digest creation",
                 INVALID_PARAMETER_VALUE,
             )
     
    -    md5 = hashlib.md5(usedforsecurity=False)
    +    sha = hashlib.sha256()
         for element in elements:
    -        md5.update(element)
    +        sha.update(element)
     
    -    return md5.hexdigest()[:8]
    +    return sha.hexdigest()[:_DIGEST_SIZE]
    +
    +
    +def get_normalized_md5_digest(elements: list[Any]) -> str:
    +    """Computes a normalized digest for a list of hashable elements.
    +
    +    .. deprecated::
    +        This function now uses SHA-256 internally. The name is retained for
    +        backward compatibility. Use _compute_sha256_digest for new code.
    +
    +    Args:
    +        elements: A list of hashable elements for inclusion in the digest.
    +
    +    Returns:
    +        A hex digest string.
    +    """
    +    return _compute_sha256_digest(elements)
    

Vulnerability mechanics

Root cause

"The dataset digest computation uses deterministic sampling and excludes certain column types, enabling predictable hash collisions."

Attack vector

An attacker can craft datasets with different semantic content that produce the same digest. There are two routes: modifying rows beyond the first 10,000 (only the first 10,000 rows are sampled), or altering values in column types that are excluded from the hash, such as datetime or boolean. The attack is possible when users can submit datasets to a shared MLflow instance, or when digests are relied on for experiment reproducibility verification or data lineage compliance. Exploitability is assessed as difficult due to high complexity [1].
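
The excluded-column route can be sketched the same way. This is a simplified stdlib model under stated assumptions, not MLflow's implementation: `filtered_digest` is a hypothetical helper that hashes only columns whose values are all strings or all numbers, while still hashing every column name, which mirrors the filtering visible in the removed lines of the patch.

```python
import hashlib
from datetime import date
from numbers import Number


def filtered_digest(rows, columns):
    """Hypothetical model: values of non-string, non-numeric columns are dropped."""
    kept = [
        idx
        for idx in range(len(columns))
        if all(isinstance(r[idx], str) for r in rows)
        or all(isinstance(r[idx], Number) and not isinstance(r[idx], bool) for r in rows)
    ]
    md5 = hashlib.md5(usedforsecurity=False)
    for r in rows:
        md5.update(repr(tuple(r[i] for i in kept)).encode())
    md5.update(str(len(rows)).encode())
    for col in columns:  # names of all columns are hashed, values of dropped ones are not
        md5.update(str(col).encode())
    return md5.hexdigest()[:8]


columns = ["patient_id", "score", "consent_date"]
original = [("a1", 0.9, date(2024, 1, 5)), ("b2", 0.4, date(2024, 2, 9))]
altered = [("a1", 0.9, date(2030, 1, 5)), ("b2", 0.4, date(1999, 2, 9))]

print(filtered_digest(original, columns) == filtered_digest(altered, columns))  # True
```

Changing a string or numeric value in this model does change the digest; only the dropped column is invisible to it.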

Affected code

The vulnerability lies in the dataset digest computation in mlflow/data/digest_utils.py. Deterministic sampling is implemented via `df.head(MAX_ROWS)`, and `trimmed_df` is then filtered down to string and numeric columns only. The digest is truncated with `md5.hexdigest()[:8]`, which leaves 8 hexadecimal characters, or a 32-bit output space [1].
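
To put the 32-bit output space in perspective, the standard birthday approximation gives a rough estimate of how many distinct datasets are needed before an accidental digest collision becomes likely. The sketch below is generic arithmetic, not taken from the advisory, and assumes digests are uniformly distributed.

```python
import math

digest_bits = 8 * 4  # 8 hex characters, 4 bits each
space = 2 ** digest_bits


def collision_probability(n, space):
    """Birthday approximation: 1 - exp(-n(n-1) / (2 * space))."""
    return 1.0 - math.exp(-n * (n - 1) / (2 * space))


# Number of random digests at which a collision is about as likely as not.
n_half = math.ceil(math.sqrt(2 * space * math.log(2)))

print(space)   # 4294967296
print(n_half)
print(round(collision_probability(n_half, space), 3))
```

Under that assumption the 50% point lands at roughly 77,000 digests, which is small compared with the number of datasets a long-lived shared tracking server may record.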

What the fix does

The proposed fix makes several changes to prevent predictable collisions. For inputs larger than MAX_ROWS it replaces deterministic head-only sampling with head+tail sampling (MAX_ROWS // 2 rows from each end); smaller inputs are hashed in full. All column types are now included in the hash computation, dtype information is mixed in to prevent type-coercion collisions, and the 8-character MD5 digest is replaced by SHA-256 truncated to 32 hexadecimal characters to mitigate birthday attacks [1]. The patch aims to make the digest reflect more of the dataset's content; because it still samples, rows between the head and tail samples of a large input are not hashed.
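
A minimal stdlib model of the head+tail scheme shows what it does and does not detect. As before, this is a sketch under stated assumptions rather than the patched MLflow code: `head_tail_digest` is a hypothetical helper over plain tuples, and `DIGEST_SIZE` stands in for the `_DIGEST_SIZE` constant in the diff.

```python
import hashlib

MAX_ROWS = 10000
DIGEST_SIZE = 32  # hex characters kept from the SHA-256 digest


def head_tail_digest(rows, columns):
    """Hypothetical model of head+tail sampling with a truncated SHA-256 digest."""
    sha = hashlib.sha256()
    if len(rows) > MAX_ROWS:
        half = MAX_ROWS // 2
        sampled = rows[:half] + rows[-half:]  # first and last 5,000 rows
    else:
        sampled = rows  # small inputs are hashed in full
    for row in sampled:
        sha.update(repr(row).encode())
    sha.update(str(len(rows)).encode())
    for col in columns:
        sha.update(str(col).encode())
    return sha.hexdigest()[:DIGEST_SIZE]


columns = ["feature", "label"]
legit = [(i, i % 2) for i in range(12000)]
tail_tampered = legit[:-1] + [(-1, -1)]                       # last row changed
middle_tampered = legit[:6000] + [(-1, -1)] + legit[6001:]    # row 6000 changed

print(head_tail_digest(legit, columns) != head_tail_digest(tail_tampered, columns))    # True
print(head_tail_digest(legit, columns) == head_tail_digest(middle_tampered, columns))  # True
```

In this model a change in the tail is now detected, while a change at row 6000 of a 12,000-row input falls outside both samples and leaves the digest unchanged.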

Preconditions

  • input: Users can submit datasets to a shared MLflow instance.
  • input: Datasets are used for experiment reproducibility verification.
  • input: Data lineage is required for compliance.

Generated on Jun 4, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.

References

6

News mentions

1