MD5 Hash Collision in run-llama/llama_index
Description
A vulnerability in the DocugamiReader class of the run-llama/llama_index repository, up to version 0.12.28, involves the use of MD5 hashing to generate IDs for document chunks. Because only the chunk text is hashed, structurally distinct chunks that contain identical text receive the same ID, and one chunk overwrites another. This can cause loss of semantically or legally important document content, breakage of parent-child chunk hierarchies, and inaccurate or hallucinated responses in AI outputs. The issue is resolved in version 0.3.1 of the llama-index-readers-docugami package.
AI Insight
LLM-synthesized narrative grounded in this CVE's description and references.
LlamaIndex DocugamiReader uses MD5 hashing for chunk IDs, causing collisions that overwrite document data and break AI response accuracy.
Vulnerability Analysis
The vulnerability in the DocugamiReader class of the run-llama/llama_index repository (up to version 0.12.28) stems from the use of MD5 hashing to generate unique identifiers for document chunks [1]. The function _build_framework_chunk computed the chunk ID as hashlib.md5(dg_chunk.text.encode()).hexdigest(), which only considered the text content of the chunk [3]. When multiple structurally distinct chunks (e.g., from different XML/HTML elements or table cells) contained identical text, they produced the same MD5 hash, leading to hash collisions [3].
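The collision described above can be reproduced in a few lines. This is a minimal sketch, not the library's code: the `Chunk` dataclass is a hypothetical stand-in for the Docugami chunk type, and the XPath strings and cell text are invented for illustration. Only the hashing expression is taken from the description.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for the Docugami chunk type (two fields only)."""
    xpath: str
    text: str


def text_only_id(chunk: Chunk) -> str:
    # Same expression as the pre-patch code: only the text is hashed,
    # so the chunk's position in the document plays no role in the ID.
    return hashlib.md5(chunk.text.encode()).hexdigest()


# Two structurally distinct chunks (different XPaths) with identical text.
seller_cell = Chunk(xpath="/doc/table[1]/tr[2]/td[1]", text="Not applicable")
buyer_cell = Chunk(xpath="/doc/table[1]/tr[3]/td[1]", text="Not applicable")

same_id = text_only_id(seller_cell) == text_only_id(buyer_cell)
print(same_id)  # True: both chunks receive the same ID
```

Note that this is not a cryptographic MD5 collision: the two hash inputs are byte-for-byte identical, so any hash function would return the same digest.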
Exploitation and Impact
An attacker could craft a document with multiple regions sharing the same text but different structures or semantic contexts. During document parsing, these chunks would overwrite each other in storage, causing the loss of one or more chunks [1]. This breaks parent-child chunk hierarchies that rely on unique IDs and can result in an incomplete or corrupted representation of the original document [1]. Downstream AI systems relying on the indexed chunks may then produce inaccurate or hallucinated responses because they miss critical contextual or legal distinctions between the overwritten chunks [1].
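The overwrite can be illustrated with a plain dictionary keyed by chunk ID. This is a sketch under stated assumptions: the source does not say which storage backend is involved, so the `store` dict, the `chunk_id` helper name, and the sample contract text are all hypothetical.

```python
import hashlib


def chunk_id(text: str) -> str:
    # Text-only ID, as in the vulnerable versions.
    return hashlib.md5(text.encode()).hexdigest()


# (xpath, text) pairs as a parser might emit them; values are invented.
parsed_chunks = [
    ("/contract/section[1]/p[1]", "The Supplier shall indemnify the Customer."),
    ("/contract/section[1]/p[2]", "See Schedule A."),
    ("/contract/section[2]/p[1]", "See Schedule A."),
]

# Hypothetical key-value store: a later chunk with the same ID
# silently replaces the earlier one.
store = {}
for xpath, text in parsed_chunks:
    store[chunk_id(text)] = {"xpath": xpath, "text": text}

surviving_xpaths = sorted(entry["xpath"] for entry in store.values())
print(len(parsed_chunks), len(store))  # 3 2
print(surviving_xpaths)
```

Three chunks go in and two come out. The entry for `/contract/section[1]/p[2]` is gone, so any parent or child reference that pointed at it now resolves to the chunk from section 2 instead.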
Mitigation Status
The issue is resolved in version 0.3.1 of the llama-index-readers-docugami package [3]. The fix modifies the hash input to include the chunk's XPath alongside its text (dg_chunk.xpath + "\n" + dg_chunk.text), ensuring that structurally different chunks with identical text generate distinct IDs [3]. Users should update to the patched version to prevent data loss and preserve document integrity [3].
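The effect of the patched hash input can be checked in isolation. In this sketch the `patched_id` function name and the sample XPaths are hypothetical; the concatenation `xpath + "\n" + text` follows the fix as described.

```python
import hashlib


def patched_id(xpath: str, text: str) -> str:
    # Hash input now combines structural position and content,
    # matching the patched expression dg_chunk.xpath + "\n" + dg_chunk.text.
    combined = xpath + "\n" + text
    return hashlib.md5(combined.encode()).hexdigest()


id_a = patched_id("/doc/table[1]/tr[2]/td[1]", "Not applicable")
id_b = patched_id("/doc/table[1]/tr[3]/td[1]", "Not applicable")
id_a_again = patched_id("/doc/table[1]/tr[2]/td[1]", "Not applicable")

print(id_a != id_b)        # True: same text, different XPath, different ID
print(id_a == id_a_again)  # True: the ID is still stable across runs
```

The ID remains deterministic, so re-parsing the same document yields the same IDs, while chunks at different positions no longer share one.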
AI Insight generated on May 19, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.
Affected packages
Versions sourced from the GitHub Security Advisory.
| Package | Affected versions | Patched versions |
|---|---|---|
| llama-index (PyPI) | < 0.12.41 | 0.12.41 |
| llama-index-readers-docugami (PyPI) | < 0.3.1 | 0.3.1 |
Affected products
- Range: <=0.12.28
- Range: <=0.12.28
- run-llama/run-llama/llama_index v5, Range: unspecified
Patches
29b2e07e64ed: Avoid hash collision in XML parsing (#18986)
2 files changed · +4 −3

llama-index-integrations/readers/llama-index-readers-docugami/llama_index/readers/docugami/base.py (+3 −2, modified)
@@ -160,8 +160,9 @@ def _structure_value(node: Any) -> Optional[str]:
     )
 
 def _build_framework_chunk(dg_chunk: Chunk) -> Document:
-    # Stable IDs for chunks with the same text.
-    _hashed_id = hashlib.md5(dg_chunk.text.encode()).hexdigest()
+    # Adding dg_chunk.text + dg_chunk.xpath should prevent hash collision between two chunks that have the same text but a different xpath
+    text = dg_chunk.xpath + "\n" + dg_chunk.text
+    _hashed_id = hashlib.md5(text.encode()).hexdigest()
     metadata = {
         XPATH_KEY: dg_chunk.xpath,
         ID_KEY: _hashed_id,

llama-index-integrations/readers/llama-index-readers-docugami/pyproject.toml (+1 −1, modified)
@@ -26,7 +26,7 @@ dev = [
 
 [project]
 name = "llama-index-readers-docugami"
-version = "0.3.0"
+version = "0.3.1"
 description = "llama-index readers docugami integration"
 authors = [{name = "Your Name", email = "you@example.com"}]
 requires-python = ">=3.9,<4.0"
Vulnerability mechanics
Generated on May 9, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.