CVE-2026-10766
Description
mlrun versions up to 1.12.0-rc3 are vulnerable to weak hashing in DataFrame hash calculation, potentially causing data corruption.
AI Insight
LLM-synthesized narrative grounded in this CVE's description and references.
Vulnerability
A vulnerability exists in the mlrun.utils.helpers.calculate_dataframe_hash function in mlrun versions up to and including 1.12.0-rc3. The issue is described as weak hashing: the function hashes DataFrame values without sufficiently accounting for dtype and schema information, so different DataFrames can produce the same hash. Exploitation requires local access and is rated high complexity and difficult to exploit.
Exploitation
An attacker with local access can exploit this vulnerability by crafting DataFrames whose hashes collide. A collision means that calculate_dataframe_hash, and the artifact path resolution built on it, yields the same hash and the same target path for distinct datasets. Exploitation is described as difficult.
Impact
Successful exploitation of this vulnerability can lead to dataset artifact path conflicts and silent data corruption. When different DataFrames produce the same hash and target path, it can result in incorrect data being used or overwritten, potentially impacting the integrity of ML pipelines and applications that rely on these artifacts.
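The path conflict described above can be sketched with a toy model. The helper `resolve_target_path` and the in-memory `store` dictionary are hypothetical stand-ins, not mlrun APIs; only the path shape (artifact path, then the hash, then an optional format suffix) is taken from the expected values in the patched test file.

```python
from typing import Dict, Optional


def resolve_target_path(
    artifact_path: str, dataframe_hash: str, file_format: Optional[str] = None
) -> str:
    # Hypothetical helper: joins the artifact path, the hash, and an optional suffix,
    # matching the shape of the expected targets in tests/artifacts/test_dataset.py.
    suffix = f".{file_format}" if file_format else ""
    return f"{artifact_path}/{dataframe_hash}{suffix}"


# Stand-in for an object store keyed by target path (an assumption for illustration).
store: Dict[str, str] = {}

# Two different datasets that happen to share one hash value.
shared_hash = "0d1c62a76b705b34bb70f355162f83402f3640e3"

first_path = resolve_target_path("v3io://just/regular/path", shared_hash, "csv")
store[first_path] = "dataset A"

second_path = resolve_target_path("v3io://just/regular/path", shared_hash, "csv")
store[second_path] = "dataset B"  # silently replaces dataset A

print(first_path == second_path, len(store), store[first_path])
```

Because the target path is derived only from the hash and format, the second write lands on the first dataset's path and no error is raised.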
Mitigation
A pull request has been submitted to address this issue by reworking the calculate_dataframe_hash function to use SHA-256 over stable DataFrame schema metadata and hashed index/column values [3]. The fixed version and release date are not yet disclosed. No workarounds are mentioned in the available references. The vulnerability affects versions up to 1.12.0-rc3 [2].
AI Insight generated on Jun 3, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.
Affected products
2
Patches
1
2cb55afdacb0 Merge d37eba1c151e6107395109a39200827fc352aef5 into dabe2ff4ae538b8bd449d73187cce71f2a5bd289
2 files changed · +165 −10
mlrun/utils/helpers.py +77 −2 modified
@@ -1648,8 +1648,83 @@ def calculate_local_file_hash(filename):
 
 
 def calculate_dataframe_hash(dataframe: pandas.DataFrame):
-    # https://stackoverflow.com/questions/49883236/how-to-generate-a-hash-or-checksum-value-on-python-dataframe-created-from-a-fix/62754084#62754084
-    return hashlib.sha1(pandas.util.hash_pandas_object(dataframe).values).hexdigest()
+    dataframe_hash = hashlib.sha256()
+    dataframe_hash.update(_get_dataframe_hash_schema(dataframe))
+    _update_hash_with_dataframe_values(dataframe_hash, dataframe)
+    return dataframe_hash.hexdigest()
+
+
+def _update_hash_with_dataframe_values(dataframe_hash, dataframe: pandas.DataFrame):
+    dataframe_hash.update(b"\0index")
+    index_frame = dataframe.index.to_frame(index=False)
+    for index_position in range(len(index_frame.columns)):
+        dataframe_hash.update(f"\0level:{index_position}".encode())
+        _update_hash_with_pandas_object(
+            dataframe_hash, index_frame.iloc[:, index_position]
+        )
+
+    dataframe_hash.update(b"\0columns")
+    for column_position in range(len(dataframe.columns)):
+        dataframe_hash.update(f"\0column:{column_position}".encode())
+        _update_hash_with_pandas_object(
+            dataframe_hash, dataframe.iloc[:, column_position]
+        )
+
+
+def _update_hash_with_pandas_object(
+    dataframe_hash, pandas_object: pandas.Index | pandas.Series
+):
+    pandas_object_hash = pandas.util.hash_pandas_object(
+        pandas_object, index=False
+    ).values
+    dataframe_hash.update(len(pandas_object_hash).to_bytes(8, "big"))
+    dataframe_hash.update(pandas_object_hash.tobytes())
+
+
+def _get_dataframe_hash_schema(dataframe: pandas.DataFrame) -> bytes:
+    schema = {
+        "columns": [
+            {
+                "name": _serialize_pandas_label(column),
+                "dtype": str(dtype),
+                "dtype_repr": repr(dtype),
+            }
+            for column, dtype in zip(dataframe.columns, dataframe.dtypes, strict=True)
+        ],
+        "column_index": {
+            "type": type(dataframe.columns).__qualname__,
+            "names": [
+                _serialize_pandas_label(name) for name in dataframe.columns.names
+            ],
+        },
+        "index": {
+            "type": type(dataframe.index).__qualname__,
+            "names": [_serialize_pandas_label(name) for name in dataframe.index.names],
+            "dtypes": _get_dataframe_index_dtypes(dataframe.index),
+        },
+    }
+    return json.dumps(schema, sort_keys=True, separators=(",", ":")).encode()
+
+
+def _get_dataframe_index_dtypes(index: pandas.Index) -> list[dict[str, str]]:
+    if isinstance(index, pandas.MultiIndex):
+        return [
+            {"dtype": str(dtype), "dtype_repr": repr(dtype)}
+            for dtype in index.to_frame(index=False).dtypes
+        ]
+    return [{"dtype": str(index.dtype), "dtype_repr": repr(index.dtype)}]
+
+
+def _serialize_pandas_label(label) -> dict[str, Any]:
+    if isinstance(label, tuple):
+        value = [_serialize_pandas_label(item) for item in label]
+    else:
+        value = repr(label)
+
+    return {
+        "type": type(label).__qualname__,
+        "value": value,
+    }
 
 
 def template_artifact_path(artifact_path, project, run_uid=None):
tests/artifacts/test_dataset.py +88 −8 modified
@@ -96,8 +96,11 @@ def test_resolve_dataset_hash_path():
                 format="csv",
             ),
             "artifact_path": "v3io://just/regular/path",
-            "expected_hash": "0d1c62a76b705b34bb70f355162f83402f3640e3",
-            "expected_file_target": "v3io://just/regular/path/0d1c62a76b705b34bb70f355162f83402f3640e3.csv",
+            "expected_hash": "63e11ea8f69464ffcacc58e69e22f2d6a0839b6c69c8e4d2d90e22a9ea3a9757",
+            "expected_file_target": (
+                "v3io://just/regular/path/"
+                "63e11ea8f69464ffcacc58e69e22f2d6a0839b6c69c8e4d2d90e22a9ea3a9757.csv"
+            ),
             "expected_error": None,
         },
         {
@@ -109,8 +112,11 @@ def test_resolve_dataset_hash_path():
                 format="parquet",
             ),
             "artifact_path": "v3io://just/regular/path",
-            "expected_hash": "f039fcf3a8b4bd6805b2bec0c6db96c2189eb9e2",
-            "expected_file_target": "v3io://just/regular/path/f039fcf3a8b4bd6805b2bec0c6db96c2189eb9e2.parquet",
+            "expected_hash": "f5470208b890d5e11dc683e0d5bd6d3098db9e3b4dff11445a8e64a21855dc6d",
+            "expected_file_target": (
+                "v3io://just/regular/path/"
+                "f5470208b890d5e11dc683e0d5bd6d3098db9e3b4dff11445a8e64a21855dc6d.parquet"
+            ),
             "expected_error": None,
         },
         {
@@ -119,8 +125,11 @@ def test_resolve_dataset_hash_path():
                 df=pandas.DataFrame({"x": [1, 2]}),
             ),
             "artifact_path": "v3io://just/regular/path",
-            "expected_hash": "0d1c62a76b705b34bb70f355162f83402f3640e3",
-            "expected_file_target": "v3io://just/regular/path/0d1c62a76b705b34bb70f355162f83402f3640e3",
+            "expected_hash": "63e11ea8f69464ffcacc58e69e22f2d6a0839b6c69c8e4d2d90e22a9ea3a9757",
+            "expected_file_target": (
+                "v3io://just/regular/path/"
+                "63e11ea8f69464ffcacc58e69e22f2d6a0839b6c69c8e4d2d90e22a9ea3a9757"
+            ),
             "expected_error": None,
         },
         {
@@ -140,8 +149,11 @@ def test_resolve_dataset_hash_path():
                 format="csv",
             ),
             "artifact_path": "v3io://just/regular/path",
-            "expected_hash": "0d1c62a76b705b34bb70f355162f83402f3640e3",
-            "expected_file_target": "v3io://just/regular/path/0d1c62a76b705b34bb70f355162f83402f3640e3.csv",
+            "expected_hash": "63e11ea8f69464ffcacc58e69e22f2d6a0839b6c69c8e4d2d90e22a9ea3a9757",
+            "expected_file_target": (
+                "v3io://just/regular/path/"
+                "63e11ea8f69464ffcacc58e69e22f2d6a0839b6c69c8e4d2d90e22a9ea3a9757.csv"
+            ),
             "expected_error": None,
         },
     ]:
@@ -163,6 +175,74 @@ def test_resolve_dataset_hash_path():
         assert test_case.get("expected_file_target") == target_path
 
 
+@pytest.mark.parametrize(
+    ("left_dataframe", "right_dataframe"),
+    [
+        (
+            pandas.DataFrame({"A": [1604090909467468979, 2], "B": [4, 4]}),
+            pandas.DataFrame({"A": [1, 2], "B": [3, 4]}),
+        ),
+        (
+            pandas.DataFrame({"flag": [True, False, True], "value": [10, 20, 30]}),
+            pandas.DataFrame({"flag": [1, 0, 1], "value": [10, 20, 30]}),
+        ),
+        (
+            pandas.DataFrame(
+                {
+                    "created_at": pandas.array(
+                        [
+                            pandas.Timestamp("1970-01-01")
+                            + pandas.Timedelta(nanoseconds=100)
+                        ],
+                        dtype="datetime64[ns]",
+                    )
+                }
+            ),
+            pandas.DataFrame({"created_at": pandas.array([100], dtype="int64")}),
+        ),
+        (
+            pandas.DataFrame({"sensor": numpy.array([5, -3, 127], dtype=numpy.int8)}),
+            pandas.DataFrame({"sensor": numpy.array([5, -3, 127], dtype=numpy.int64)}),
+        ),
+        (
+            pandas.DataFrame(
+                {
+                    "is_active": [True, False],
+                    "created_at": pandas.to_datetime(
+                        [
+                            "1970-01-01T00:00:00.000000001",
+                            "1970-01-01T00:00:00.000000002",
+                        ]
+                    ),
+                    "score": numpy.array([10, 20], dtype=numpy.int8),
+                }
+            ),
+            pandas.DataFrame(
+                {
+                    "is_active": [1, 0],
+                    "created_at": [1, 2],
+                    "score": numpy.array([10, 20], dtype=numpy.int64),
+                }
+            ),
+        ),
+    ],
+)
+def test_dataframe_hash_path_avoids_known_collisions(left_dataframe, right_dataframe):
+    artifact = mlrun.artifacts.dataset.DatasetArtifact(format="parquet")
+    left_hash, left_target_path = artifact.resolve_dataframe_target_hash_path(
+        left_dataframe,
+        artifact_path="v3io://just/regular/path",
+    )
+    right_hash, right_target_path = artifact.resolve_dataframe_target_hash_path(
+        right_dataframe,
+        artifact_path="v3io://just/regular/path",
+    )
+
+    assert not left_dataframe.equals(right_dataframe)
+    assert left_hash != right_hash
+    assert left_target_path != right_target_path
+
+
 def test_dataset_stats():
     raw_data = {
         "first_name": ["Jason", "Molly", "Tina", "Jake", "Amy"],
Vulnerability mechanics
Root cause
"The DataFrame hashing mechanism does not sufficiently consider dtype and schema information, leading to collisions."
Attack vector
An attacker must have local access to the environment where mlrun is running. They can then manipulate DataFrame objects such that semantically different DataFrames produce identical hash values. This occurs because the hashing relies on `pandas.util.hash_pandas_object` which can normalize dtypes, causing collisions between, for example, boolean and integer types, or different integer sizes [ref_id=1]. The attack complexity is high, and exploitability is difficult [ref_id=1].
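The boolean-versus-integer case mentioned above can be reproduced with a short sketch. The function name `legacy_dataframe_hash` is made up for this example; its body copies the pre-fix one-liner from the diff, and the two DataFrames are taken from the regression test added by the patch. It assumes a pandas version in which `hash_pandas_object` converts boolean and integer values to the same unsigned 64-bit representation before hashing.

```python
import hashlib

import pandas


def legacy_dataframe_hash(dataframe: pandas.DataFrame) -> str:
    # Same expression as the removed line in mlrun/utils/helpers.py:
    # SHA-1 over the per-row uint64 hashes, with no dtype or schema input.
    return hashlib.sha1(pandas.util.hash_pandas_object(dataframe).values).hexdigest()


# Same values, different dtype for the "flag" column (bool vs int64).
flags_as_bool = pandas.DataFrame({"flag": [True, False, True], "value": [10, 20, 30]})
flags_as_int = pandas.DataFrame({"flag": [1, 0, 1], "value": [10, 20, 30]})

print(flags_as_bool.dtypes["flag"], flags_as_int.dtypes["flag"])
print(flags_as_bool.equals(flags_as_int))
print(legacy_dataframe_hash(flags_as_bool) == legacy_dataframe_hash(flags_as_int))
```

pandas treats the two frames as unequal because the column dtypes differ, yet the legacy hash cannot tell them apart, which is the condition the attack relies on.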
Affected code
The vulnerability resides in the `calculate_dataframe_hash` function within `mlrun/utils/helpers.py`. This function's output is utilized by `DatasetArtifact.resolve_dataframe_target_hash_path` in `mlrun/artifacts/dataset.py` to construct artifact paths based on DataFrame hashes [ref_id=1].
What the fix does
The fix reworks the `calculate_dataframe_hash` function to feed DataFrame schema metadata (column labels, index metadata, and dtype information) into the hash ahead of the per-level index hashes and per-column value hashes. The digest also changes from SHA-1 to SHA-256. As a result, DataFrames that differ only in dtype or schema no longer hash identically, which prevents the artifact path collisions and subsequent data corruption exercised by the new regression test [ref_id=2].
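A reduced sketch of the idea follows. It is not the patched implementation: `schema_aware_hash` is a hypothetical name, it skips the index handling and label serialization that the real patch performs, and it only shows how placing dtype information ahead of the value hashes separates frames that the old approach merged.

```python
import hashlib
import json

import pandas


def schema_aware_hash(dataframe: pandas.DataFrame) -> str:
    # Simplified illustration only: column names and dtypes go into the digest first.
    digest = hashlib.sha256()
    schema = [
        [repr(name), str(dtype)]
        for name, dtype in zip(dataframe.columns, dataframe.dtypes)
    ]
    digest.update(json.dumps(schema, separators=(",", ":")).encode())

    # Then each column's row hashes, length-prefixed so column boundaries are unambiguous.
    for position in range(len(dataframe.columns)):
        column_hashes = pandas.util.hash_pandas_object(
            dataframe.iloc[:, position], index=False
        ).values
        digest.update(len(column_hashes).to_bytes(8, "big"))
        digest.update(column_hashes.tobytes())
    return digest.hexdigest()


flags_as_bool = pandas.DataFrame({"flag": [True, False, True], "value": [10, 20, 30]})
flags_as_int = pandas.DataFrame({"flag": [1, 0, 1], "value": [10, 20, 30]})

print(schema_aware_hash(flags_as_bool) != schema_aware_hash(flags_as_int))
```

The serialized schema differs between the two frames (`bool` versus `int64`), so the digests differ even though the per-column value hashes are the same.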
Preconditions
- input: Local access to the environment.
Generated on Jun 3, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.
References
6