VYPR
Moderate severity · NVD Advisory · Published Jul 10, 2025 · Updated Jul 10, 2025

MD5 Hash Collision in run-llama/llama_index

CVE-2025-6211

Description

A vulnerability in the DocugamiReader class of the run-llama/llama_index repository, up to version 0.12.28, arises because IDs for document chunks are generated by MD5-hashing the chunk text alone. Structurally distinct chunks that contain identical text therefore receive the same ID, and one chunk overwrites another. This can cause loss of semantically or legally important document content, breakage of parent-child chunk hierarchies, and inaccurate or hallucinated responses in AI outputs. The issue is resolved in version 0.3.1 of the llama-index-readers-docugami package.

AI Insight

LLM-synthesized narrative grounded in this CVE's description and references.

LlamaIndex DocugamiReader uses MD5 hashing for chunk IDs, causing collisions that overwrite document data and break AI response accuracy.

Vulnerability Analysis

The vulnerability in the DocugamiReader class of the run-llama/llama_index repository (up to version 0.12.28) stems from the use of MD5 hashing to generate unique identifiers for document chunks [1]. The function _build_framework_chunk computed the chunk ID as hashlib.md5(dg_chunk.text.encode()).hexdigest(), which only considered the text content of the chunk [3]. When multiple structurally distinct chunks (e.g., from different XML/HTML elements or table cells) contained identical text, they produced the same MD5 hash, leading to hash collisions [3].
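The collision described above can be reproduced with a minimal sketch. The `Chunk` dataclass here is a hypothetical stand-in that models only the `xpath` and `text` fields seen in the fix commit, and the XPath strings are invented for illustration; only the hashing expression is taken from the pre-patch code.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for the Docugami chunk type (illustration only)."""
    xpath: str
    text: str


def text_only_id(chunk: Chunk) -> str:
    # Same expression as the pre-patch code: only the text feeds the hash.
    return hashlib.md5(chunk.text.encode()).hexdigest()


# Two structurally distinct chunks (invented XPaths) with identical text.
a = Chunk(xpath="/doc/section[1]/clause[2]", text="Not applicable.")
b = Chunk(xpath="/doc/section[4]/table/row[3]/cell[2]", text="Not applicable.")

same_id = text_only_id(a) == text_only_id(b)
print(same_id)  # True
```

Because a hash function is deterministic, identical input always yields an identical digest. The collision therefore follows from hashing only the text, not from a cryptographic weakness specific to MD5.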

Exploitation and Impact

An attacker could craft a document with multiple regions sharing the same text but different structures or semantic contexts. During document parsing, these chunks would overwrite each other in storage, causing the loss of one or more chunks [1]. This breaks parent-child chunk hierarchies that rely on unique IDs and can result in an incomplete or corrupted representation of the original document [1]. Downstream AI systems relying on the indexed chunks may then produce inaccurate or hallucinated responses because they miss critical contextual or legal distinctions between the overwritten chunks [1].
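As a rough illustration of the overwrite, the sketch below keys a plain `dict` by the text-only ID. The dict is an assumption standing in for whatever ID-keyed storage sits downstream of the reader, and the XPaths and values are invented sample data.

```python
import hashlib


def text_only_id(text: str) -> str:
    # Pre-patch behaviour: the ID depends on the chunk text alone.
    return hashlib.md5(text.encode()).hexdigest()


# (xpath, text) pairs; the first two differ in structure but share the same text.
chunks = [
    ("/contract/liability/limit", "USD 1,000,000"),
    ("/contract/insurance/coverage", "USD 1,000,000"),
    ("/contract/term", "24 months"),
]

# Hypothetical ID-keyed store: a later chunk with the same ID replaces the earlier one.
store = {}
for xpath, text in chunks:
    store[text_only_id(text)] = {"xpath": xpath, "text": text}

surviving = sorted(entry["xpath"] for entry in store.values())
print(len(chunks), len(store))  # 3 2
print(surviving)
```

In this sketch the liability-limit chunk is silently replaced by the insurance-coverage chunk, so a query about the liability limit could only be answered from the wrong context.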

Mitigation Status

The issue is resolved in version 0.3.1 of the llama-index-readers-docugami package [3]. The fix modifies the hash input to include the chunk's XPath alongside its text (dg_chunk.xpath + "\n" + dg_chunk.text), ensuring that structurally different chunks with identical text generate distinct IDs [3]. Users should update to the patched version to prevent data loss and preserve document integrity [3].
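The effect of the patched hash input can be checked with a short sketch. The function name `patched_id` and the sample XPaths are assumptions made for illustration; the concatenation `xpath + "\n" + text` follows the fix commit.

```python
import hashlib


def patched_id(xpath: str, text: str) -> str:
    # Post-patch input: the XPath is prepended to the text, separated by a newline.
    combined = xpath + "\n" + text
    return hashlib.md5(combined.encode()).hexdigest()


id_a = patched_id("/doc/section[1]/clause[2]", "Not applicable.")
id_b = patched_id("/doc/section[4]/table/row[3]/cell[2]", "Not applicable.")

print(id_a != id_b)  # True: same text, different structure, different IDs
print(id_a == patched_id("/doc/section[1]/clause[2]", "Not applicable."))  # True: IDs remain stable
```

The IDs stay deterministic for a given (XPath, text) pair, which preserves the "stable IDs" intent of the original code while separating chunks by position in the document.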

AI Insight generated on May 19, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.

Affected packages

Versions sourced from the GitHub Security Advisory.

Package                              Affected versions   Patched versions
llama-index (PyPI)                   < 0.12.41           0.12.41
llama-index-readers-docugami (PyPI)  < 0.3.1             0.3.1
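One way to check an environment against the patched versions listed above is sketched below. It relies only on the standard library's `importlib.metadata`. The helpers `parse_release` and `is_patched` are hypothetical, and the parser is deliberately naive: it reads only the leading numeric components and ignores pre-release or local version suffixes.

```python
from importlib.metadata import PackageNotFoundError, version

PACKAGE = "llama-index-readers-docugami"
PATCHED = (0, 3, 1)  # patched version from the advisory table


def parse_release(v: str) -> tuple:
    """Naive parser: keeps leading numeric dotted components only."""
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_patched(v: str, minimum: tuple = PATCHED) -> bool:
    # Tuple comparison orders releases component by component.
    return parse_release(v) >= minimum


try:
    installed = version(PACKAGE)
    status = "patched" if is_patched(installed) else "vulnerable, upgrade to >= 0.3.1"
    print(f"{PACKAGE} {installed}: {status}")
except PackageNotFoundError:
    print(f"{PACKAGE} is not installed")
```

For production checks, a full version parser such as the one in the third-party `packaging` library would be a more robust choice than this hand-rolled comparison.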

Patches

1
29b2e07e64ed

Avoid hash collision in XML parsing (#18986)

https://github.com/run-llama/llama_index · Clelia (Astra) Bertelli · Jun 5, 2025 · via ghsa
2 files changed · +4 -3
  • llama-index-integrations/readers/llama-index-readers-docugami/llama_index/readers/docugami/base.py +3 -2 modified
    @@ -160,8 +160,9 @@ def _structure_value(node: Any) -> Optional[str]:
                 )
     
             def _build_framework_chunk(dg_chunk: Chunk) -> Document:
    -            # Stable IDs for chunks with the same text.
    -            _hashed_id = hashlib.md5(dg_chunk.text.encode()).hexdigest()
    +            # Adding dg_chunk.text + dg_chunk.xpath should prevent hash collision between two chunks that have the same text but a different xpath
    +            text = dg_chunk.xpath + "\n" + dg_chunk.text
    +            _hashed_id = hashlib.md5(text.encode()).hexdigest()
                 metadata = {
                     XPATH_KEY: dg_chunk.xpath,
                     ID_KEY: _hashed_id,
    
  • llama-index-integrations/readers/llama-index-readers-docugami/pyproject.toml +1 -1 modified
    @@ -26,7 +26,7 @@ dev = [
     
     [project]
     name = "llama-index-readers-docugami"
    -version = "0.3.0"
    +version = "0.3.1"
     description = "llama-index readers docugami integration"
     authors = [{name = "Your Name", email = "you@example.com"}]
     requires-python = ">=3.9,<4.0"
    

Vulnerability mechanics

Generated on May 9, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.

References

4
