CVE-2026-10803
Description
MLflow versions prior to 3.10.0 are vulnerable to predictable hash collisions in dataset digests, allowing data manipulation.
AI Insight
LLM-synthesized narrative grounded in this CVE's description and references.
Vulnerability
MLflow versions prior to 3.10.0 contain a flaw in the dataset digest computation function in the mlflow/data/digest_utils.py file. The digest is built from a deterministic sample (df.head(10000)) and from a filtered subset of columns, so an attacker can craft datasets with different semantic content that produce the same digest [1]. This affects all platforms and Python versions 3.10+.
Exploitation
An attacker with the ability to submit datasets to a shared MLflow instance can exploit this vulnerability. The attack keeps the first 10,000 rows identical to a legitimate dataset while arbitrarily modifying rows beyond the 10,000th. Because the pre-fix code also hashes the total row count, the tampered dataset must keep the same number of rows. The two datasets then share a digest despite differing content, which makes data integrity difficult to verify [1]. Exploitation is assessed as difficult, with high attack complexity, and the exploit details have been published.
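The collision described above can be reproduced with a simplified stand-in for the pre-fix logic. This is a sketch, not MLflow's code: `head_only_digest` is a hypothetical helper that works on plain lists of tuples instead of DataFrames, and hashing `repr(row)` is an assumption chosen to keep the example free of third-party dependencies. It keeps the three properties that matter: a head-only sample capped at `MAX_ROWS`, the row count mixed into the hash, and an MD5 digest truncated to 8 hex characters.

```python
import hashlib

MAX_ROWS = 10_000  # same cap as the pre-fix code


def head_only_digest(rows):
    """Toy model of the pre-fix digest: hash the first MAX_ROWS rows plus the row count."""
    md5 = hashlib.md5(usedforsecurity=False)
    for row in rows[:MAX_ROWS]:          # rows past MAX_ROWS never reach the hash
        md5.update(repr(row).encode())
    md5.update(str(len(rows)).encode())  # only the total length is recorded
    return md5.hexdigest()[:8]


legit = [(i, "ok") for i in range(12_000)]
# Same first 10,000 rows and same length, but every later row is replaced.
tampered = legit[:MAX_ROWS] + [(i, "poisoned") for i in range(MAX_ROWS, 12_000)]

print(legit != tampered)                                      # True: contents differ
print(head_only_digest(legit) == head_only_digest(tampered))  # True: digests match
```

Under these assumptions, the 2,000 replaced rows have no influence on the digest at all.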
Impact
Successful exploitation allows an attacker to create datasets with identical digests but different underlying data. This undermines the integrity verification of datasets used for experiment reproducibility and data lineage tracking, potentially impacting compliance requirements like GDPR and HIPAA, and compromising adversarial ML research environments where data integrity is critical [1].
Mitigation
A pull request has been submitted that replaces deterministic head-only sampling with head+tail sampling, includes all column types in the hash, and uses SHA-256 for stronger collision resistance [3]. The available references do not disclose the fixed version or provide a workaround. MLflow is still actively developed [2].
- [1] [BUG] Deterministic sampling in dataset digest enables predictable collisions
- [2] GitHub - mlflow/mlflow: The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.
- [3] Fix deterministic sampling in dataset digest computation by 3em0 · Pull Request #22420 · mlflow/mlflow
AI Insight generated on Jun 4, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.
Affected products: 2
Patches: 1
f71d22ba070b · Merge c475c0c7ff0e7f319c8fece50cf376b5f62d0218 into 2bd05b0f7d2d650d0ea4f8a7cd1f352cfdb656ee
1 file changed · +70 −34
mlflow/data/digest_utils.py · +70 −34 · modified
@@ -1,17 +1,20 @@
 import hashlib
 from typing import Any

-from packaging.version import Version
-
 from mlflow.exceptions import MlflowException
 from mlflow.protos.databricks_pb2 import INVALID_PARAMETER_VALUE

 MAX_ROWS = 10000
+_DIGEST_SIZE = 32


 def compute_pandas_digest(df) -> str:
     """Computes a digest for the given Pandas DataFrame.

+    Uses head+tail sampling to detect changes beyond the first MAX_ROWS rows,
+    and includes all column types (not just string/numeric) to prevent
+    collision attacks via excluded columns.
+
     Args:
         df: A Pandas DataFrame.

@@ -21,26 +24,29 @@ def compute_pandas_digest(df) -> str:
     import numpy as np
     import pandas as pd

-    # trim to max rows
-    trimmed_df = df.head(MAX_ROWS)
+    hashable_elements = []

-    # keep string and number columns, drop other column types
-    if Version(pd.__version__) >= Version("2.1.0"):
-        string_columns = trimmed_df.columns[(df.map(type) == str).all(0)]
+    # For large DataFrames, sample both head and tail to prevent deterministic collision attacks
+    if len(df) > MAX_ROWS:
+        sample_size = MAX_ROWS // 2
+        head_sample = df.head(sample_size)
+        tail_sample = df.tail(sample_size)
+        hashable_elements.append(pd.util.hash_pandas_object(head_sample).values)
+        hashable_elements.append(pd.util.hash_pandas_object(tail_sample).values)
     else:
-        string_columns = trimmed_df.columns[(df.applymap(type) == str).all(0)]
-    numeric_columns = trimmed_df.select_dtypes(include=[np.number]).columns
+        # For small DataFrames, hash all rows
+        hashable_elements.append(pd.util.hash_pandas_object(df).values)
+
+    # Include total row count
+    hashable_elements.append(np.int64(len(df)))
+
+    # Include column names
+    hashable_elements.extend(str(col).encode() for col in df.columns)

-    desired_columns = string_columns.union(numeric_columns)
-    trimmed_df = trimmed_df[desired_columns]
+    # Include dtype information to prevent type-coercion collisions
+    hashable_elements.extend(str(dtype).encode() for dtype in df.dtypes)

-    return get_normalized_md5_digest(
-        [
-            pd.util.hash_pandas_object(trimmed_df).values,
-            np.int64(len(df)),
-        ]
-        + [str(x).encode() for x in df.columns]
-    )
+    return _compute_sha256_digest(hashable_elements)


 def compute_numpy_digest(features, targets=None) -> str:
@@ -60,14 +66,29 @@ def compute_numpy_digest(features, targets=None) -> str:

     def hash_array(array):
         flattened_array = array.flatten()
-        trimmed_array = flattened_array[0:MAX_ROWS]
-        try:
-            hashable_elements.append(pd.util.hash_array(trimmed_array))
-        except TypeError:
-            hashable_elements.append(np.int64(trimmed_array.size))

-        # hash full array dimensions
+        # For large arrays, sample both head and tail
+        if flattened_array.size > MAX_ROWS:
+            sample_size = MAX_ROWS // 2
+            head_sample = flattened_array[:sample_size]
+            tail_sample = flattened_array[-sample_size:]
+            try:
+                hashable_elements.append(pd.util.hash_array(head_sample))
+                hashable_elements.append(pd.util.hash_array(tail_sample))
+            except TypeError:
+                hashable_elements.append(np.int64(head_sample.size))
+                hashable_elements.append(np.int64(tail_sample.size))
+        else:
+            # For small arrays, hash all elements
+            try:
+                hashable_elements.append(pd.util.hash_array(flattened_array))
+            except TypeError:
+                hashable_elements.append(np.int64(flattened_array.size))
+
+        # Hash full array dimensions
         hashable_elements.extend(np.int64(x) for x in array.shape)
+        # Include dtype to prevent type-coercion collisions
+        hashable_elements.append(str(array.dtype).encode())

     def hash_dict_of_arrays(array_dict):
         for key in sorted(array_dict.keys()):
@@ -81,27 +102,42 @@ def hash_dict_of_arrays(array_dict):
         else:
             hash_array(item)

-    return get_normalized_md5_digest(hashable_elements)
+    return _compute_sha256_digest(hashable_elements)


-def get_normalized_md5_digest(elements: list[Any]) -> str:
-    """Computes a normalized digest for a list of hashable elements.
+def _compute_sha256_digest(elements: list[Any]) -> str:
+    """Computes a SHA-256 digest for a list of hashable elements.

     Args:
-        elements: A list of hashable elements for inclusion in the md5 digest.
+        elements: A list of hashable elements for inclusion in the digest.

     Returns:
-        An 8-character, truncated md5 digest.
+        A hex digest string truncated to _DIGEST_SIZE characters.
     """
-
     if not elements:
         raise MlflowException(
-            "No hashable elements were provided for md5 digest creation",
+            "No hashable elements were provided for digest creation",
             INVALID_PARAMETER_VALUE,
         )

-    md5 = hashlib.md5(usedforsecurity=False)
+    sha = hashlib.sha256()
     for element in elements:
-        md5.update(element)
+        sha.update(element)

-    return md5.hexdigest()[:8]
+    return sha.hexdigest()[:_DIGEST_SIZE]
+
+
+def get_normalized_md5_digest(elements: list[Any]) -> str:
+    """Computes a normalized digest for a list of hashable elements.
+
+    .. deprecated::
+        This function now uses SHA-256 internally. The name is retained for
+        backward compatibility. Use _compute_sha256_digest for new code.
+
+    Args:
+        elements: A list of hashable elements for inclusion in the digest.
+
+    Returns:
+        A hex digest string.
+    """
+    return _compute_sha256_digest(elements)
Vulnerability mechanics
Root cause
"The dataset digest computation uses deterministic sampling and excludes certain column types, enabling predictable hash collisions."
Attack vector
An attacker can craft datasets with different semantic content that produce the same digest. This is achieved either by manipulating rows beyond the first 10,000, which the deterministic head-only sample never covers, or by altering values in excluded column types such as datetime or boolean. The attack is possible when users can submit datasets to a shared MLflow instance, or when digests are relied on for experiment reproducibility verification or data lineage compliance. Exploitation is assessed as difficult due to high attack complexity [1].
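The second route, altering an excluded column, can be illustrated with a toy model of the column filter. This is a sketch under stated assumptions rather than MLflow's implementation: `filtered_digest` is a hypothetical helper operating on a plain dict of column name to list of values, and it assumes that boolean values count as neither string nor numeric, as the advisory's mention of boolean columns implies. As in the pre-fix code, every column name is hashed, but only the values of all-string or all-numeric columns are.

```python
import hashlib
from numbers import Number


def filtered_digest(columns):
    """Toy model: hash values of all-string or all-numeric columns only,
    then the row count and every column name."""
    md5 = hashlib.md5(usedforsecurity=False)
    for values in columns.values():
        all_str = all(isinstance(v, str) for v in values)
        # Assumption of this model: bool is deliberately treated as non-numeric.
        all_num = all(isinstance(v, Number) and not isinstance(v, bool) for v in values)
        if all_str or all_num:
            md5.update(repr(values).encode())
    n_rows = len(next(iter(columns.values())))
    md5.update(str(n_rows).encode())
    for name in columns:
        md5.update(str(name).encode())
    return md5.hexdigest()[:8]


original = {"amount": [10.0, 20.0, 30.0], "is_fraud": [False, False, True]}
relabeled = {"amount": [10.0, 20.0, 30.0], "is_fraud": [True, True, False]}

# True: the boolean label values never reach the hash, only the column name does.
print(filtered_digest(original) == filtered_digest(relabeled))
```

In this model an attacker can flip every label in the boolean column without changing the digest, regardless of dataset size.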
Affected code
The vulnerability lies in the dataset digest computation function in the mlflow/data/digest_utils.py file. Deterministic sampling is implemented via `df.head(MAX_ROWS)`, and `trimmed_df` is then filtered down to string and numeric columns only. The digest space is further limited by `md5.hexdigest()[:8]`, which keeps 8 hex characters and therefore yields a 32-bit output [1].
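To put the 32-bit digest space in perspective, the standard birthday approximation n ≈ sqrt(2 · 2^bits · ln(1 / (1 − p))) estimates how many random digests are needed before two of them collide with probability p. The helper below is a hypothetical illustration, not part of MLflow. It compares the 8-hex-character digest with the 32-hex-character digest size (`_DIGEST_SIZE = 32`) that appears in the patch.

```python
import math


def birthday_bound(bits, p=0.5):
    """Approximate number of random digests needed for a collision with probability p."""
    return math.sqrt(2 * (2 ** bits) * math.log(1 / (1 - p)))


old_bits = 8 * 4   # 8 hex characters, 4 bits each
new_bits = 32 * 4  # 32 hex characters, 4 bits each

print(f"{old_bits}-bit digest: about {birthday_bound(old_bits):,.0f} datasets")
print(f"{new_bits}-bit digest: about {birthday_bound(new_bits):.2e} datasets")
```

For a 32-bit digest the estimate is in the tens of thousands, so accidental or brute-forced collisions are plausible even without exploiting the sampling flaw. For a 128-bit digest the estimate is on the order of 10^19.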
What the fix does
The proposed fix makes several changes to prevent predictable collisions. It replaces deterministic head-only sampling with randomized or head+tail sampling, includes all column types in the hash computation, lengthens the digest to mitigate birthday attacks, and includes dtype information to prevent type-coercion collisions [1]. The patch aims to make the digest reflect the dataset's content more completely, so that malicious manipulation is harder to hide.
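The head+tail strategy can be sketched with the same kind of simplified model. As before, this is an illustration under stated assumptions and not the patched MLflow code: `head_tail_digest` is a hypothetical helper over plain lists, and `repr(row)` stands in for pandas hashing. It follows the patch in sampling `MAX_ROWS // 2` rows from each end when the input exceeds `MAX_ROWS`, hashing the row count, and truncating a SHA-256 hex digest to 32 characters.

```python
import hashlib

MAX_ROWS = 10_000
DIGEST_SIZE = 32  # hex characters kept, matching _DIGEST_SIZE in the patch


def head_tail_digest(rows):
    """Toy model of the patched digest: hash head and tail samples plus the row count."""
    sha = hashlib.sha256()
    if len(rows) > MAX_ROWS:
        half = MAX_ROWS // 2
        sample = rows[:half] + rows[-half:]  # first and last 5,000 rows
    else:
        sample = rows                        # small inputs are hashed in full
    for row in sample:
        sha.update(repr(row).encode())
    sha.update(str(len(rows)).encode())
    return sha.hexdigest()[:DIGEST_SIZE]


legit = [(i, "ok") for i in range(12_000)]

tail_tampered = legit[:-1] + [(11_999, "poisoned")]  # change the last row
mid_tampered = list(legit)
mid_tampered[6_000] = (6_000, "poisoned")            # change a row between the samples

print(head_tail_digest(legit) != head_tail_digest(tail_tampered))  # True: tail edit detected
print(head_tail_digest(legit) == head_tail_digest(mid_tampered))   # True: middle edit not sampled
```

In this simplified model, edits at the end of the dataset now change the digest, while rows that fall between the two samples still do not influence it.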
Preconditions
- input: Users can submit datasets to a shared MLflow instance.
- input: Datasets are used for experiment reproducibility verification.
- input: Data lineage is required for compliance.
Generated on Jun 4, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.
References: 6
News mentions: 1
- MLflow: Critical Credential Leakage Flaw Disclosed Alongside Two Other Vulnerabilities (Vypr Intelligence · Jun 4, 2026)