VYPR
High severity · NVD Advisory · Published Jul 7, 2025 · Updated Jul 7, 2025

XML Entity Expansion vulnerability in run-llama/llama_index

CVE-2025-3225

Description

An XML Entity Expansion vulnerability, also known as a 'billion laughs' attack, exists in the sitemap parser of the run-llama/llama_index repository, specifically affecting version v0.12.21. This vulnerability allows an attacker to supply a malicious Sitemap XML, leading to a Denial of Service (DoS) by exhausting system memory and potentially causing a system crash. The issue is resolved in version v0.12.29.

AI Insight

LLM-synthesized narrative grounded in this CVE's description and references.

XML Entity Expansion vulnerability in llama_index sitemap parser allows DoS via malicious sitemap; fixed in v0.12.29.

CVE-2025-3225 is an XML Entity Expansion (billion laughs) vulnerability in the sitemap parser of llama_index (run-llama/llama_index) version v0.12.21 [1]. This vulnerability occurs because the parser uses xml.etree.ElementTree without protections against entity expansion, allowing an attacker to craft a sitemap XML that causes exponential entity expansion, leading to memory exhaustion [1].
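To make the "exponential entity expansion" concrete, the following is a minimal sketch of how a nested-entity sitemap stays tiny on the wire while its expanded form grows as fanout to the power of levels. The function names, the entity names (`e0`, `e1`, ...), and the default parameters are illustrative assumptions, not taken from the advisory or from llama_index.

```python
def build_nested_entity_payload(levels: int = 3, fanout: int = 10, base: str = "lol") -> str:
    """Build a sitemap-shaped XML string with `levels` layers of nested entities.

    Hypothetical helper for illustration: each entity e{i} references e{i-1}
    `fanout` times, so expanding e{levels} yields fanout**levels copies of `base`.
    """
    decls = [f'<!ENTITY e0 "{base}">']
    for i in range(1, levels + 1):
        refs = f"&e{i - 1};" * fanout
        decls.append(f'<!ENTITY e{i} "{refs}">')
    dtd = "\n  ".join(decls)
    return (
        '<?xml version="1.0"?>\n'
        f"<!DOCTYPE urlset [\n  {dtd}\n]>\n"
        f"<urlset><url><loc>&e{levels};</loc></url></urlset>"
    )


def expanded_length(levels: int, fanout: int, base: str = "lol") -> int:
    """Number of characters the top-level entity expands to."""
    return len(base) * fanout ** levels


# With 9 levels and a fanout of 10, a payload well under 2 KB
# describes 3,000,000,000 characters of expanded text.
payload = build_nested_entity_payload(levels=9, fanout=10)
print(len(payload), expanded_length(9, 10))
```

Keep the parameters small if you feed such a payload to a parser for testing; a parser that expands entities without limits must materialize the full expanded text in memory.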

The attack can be executed remotely by supplying a malicious Sitemap XML file to the affected parser [1]. No authentication is required; the attacker only needs to provide the malicious XML to a service that parses sitemaps using llama_index's vulnerable parser [1].

The impact is a Denial of Service (DoS): entity expansion can exhaust system memory and potentially crash the system [1]. This can disrupt services that rely on llama_index for document ingestion or indexing.

The issue is resolved in version v0.12.29 [1]. The fix commit (4f6ee062b192) replaces xml.etree.ElementTree with defusedxml.ElementTree in the sitemap reader, the Stripe docs reader, and the PubMed reader; defusedxml guards against entity expansion attacks [3]. Users are advised to upgrade to v0.12.29 or later to mitigate this vulnerability [1].
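The actual fix simply swaps in `defusedxml.ElementTree.fromstring`. For readers who want to see the idea behind that kind of guard without a third-party dependency, here is a standard-library sketch that rejects any document declaring an entity before handing it to ElementTree. The names `parse_sitemap_locs` and `EntityDeclarationError` are hypothetical, and this is not how defusedxml or llama_index is implemented; it only illustrates the "refuse entity declarations" policy.

```python
import xml.etree.ElementTree as ET
from typing import List
from xml.parsers import expat

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


class EntityDeclarationError(ValueError):
    """Raised when the XML input declares an entity (hypothetical name)."""


def _forbid_entity_decl(name, is_parameter_entity, value, base,
                        system_id, public_id, notation_name):
    # expat calls this handler for each <!ENTITY ...> declaration,
    # before any reference to the entity is expanded.
    raise EntityDeclarationError(f"entity declaration not allowed: {name!r}")


def parse_sitemap_locs(raw: bytes) -> List[str]:
    """Return all <loc> values from a sitemap, refusing entity declarations."""
    # First pass: scan with expat and abort on the first entity declaration.
    checker = expat.ParserCreate()
    checker.EntityDeclHandler = _forbid_entity_decl
    checker.Parse(raw, True)

    # Second pass: the document declares no entities, so parse it normally.
    root = ET.fromstring(raw)
    return [loc.text for loc in root.iter(f"{{{SITEMAP_NS}}}loc") if loc.text]
```

The two-pass design trades a second parse for simplicity. In production code, depending on defusedxml as the fix commit does is the more maintainable choice, since it also covers other XML hazards.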

AI Insight generated on May 19, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.

Affected packages

Versions sourced from the GitHub Security Advisory.

Package · Affected versions · Patched versions
llama-index-readers-papers (PyPI) · < 0.3.2 · 0.3.2
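As a rough aid for checking an environment, the sketch below compares an installed version string against the versions that the fix commit bumps each package to in its pyproject.toml changes. Only llama-index-readers-papers appears in the advisory table; the other two entries are read from the commit diff, and the helper names are hypothetical. The comparison assumes plain dotted numeric versions such as "0.3.9".

```python
from typing import Dict, Tuple

# Versions set by the fix commit's pyproject.toml changes.
PATCHED: Dict[str, str] = {
    "llama-index-readers-papers": "0.3.2",
    "llama-index-readers-stripe-docs": "0.3.1",
    "llama-index-readers-web": "0.3.9",
}


def _as_tuple(version: str) -> Tuple[int, ...]:
    # Assumption: versions are purely numeric and dot-separated.
    return tuple(int(part) for part in version.split("."))


def is_patched(package: str, installed: str) -> bool:
    """True if `installed` is at or above the patched version of `package`."""
    return _as_tuple(installed) >= _as_tuple(PATCHED[package])
```

For example, `is_patched("llama-index-readers-web", "0.3.8")` is false, while `"0.3.10"` compares as newer than `"0.3.9"` because the parts are compared as integers rather than as strings.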

Affected products

2

Patches

1
4f6ee062b192

fix: use defusexml instead of xml.etree (#18362)

https://github.com/run-llama/llama_index · Massimiliano Pippi · Apr 3, 2025 · via ghsa
6 files changed · +26 -18
  • llama-index-integrations/readers/llama-index-readers-papers/llama_index/readers/papers/pubmed/base.py +11 -8 modified
    @@ -2,12 +2,14 @@
     
     from typing import List, Optional
     
    +from defusedxml import ElementTree as safe_xml
     from llama_index.core.readers.base import BaseReader
     from llama_index.core.schema import Document
     
     
     class PubmedReader(BaseReader):
    -    """Pubmed Reader.
    +    """
    +    Pubmed Reader.
     
         Gets a search query, return a list of Documents of the top corresponding scientific papers on Pubmed.
         """
    @@ -17,7 +19,8 @@ def load_data_bioc(
             search_query: str,
             max_results: Optional[int] = 10,
         ) -> List[Document]:
    -        """Search for a topic on Pubmed, fetch the text of the most relevant full-length papers.
    +        """
    +        Search for a topic on Pubmed, fetch the text of the most relevant full-length papers.
             Uses the BoiC API, which has been down a lot.
     
             Args:
    @@ -27,10 +30,10 @@ def load_data_bioc(
             Returns:
                 List[Document]: A list of Document objects.
             """
    -        import xml.etree.ElementTree as xml
             from datetime import datetime
     
             import requests
    +        from defusedxml import ElementTree as safe_xml
     
             pubmed_search = []
             parameters = {"tool": "tool", "email": "email", "db": "pmc"}
    @@ -40,7 +43,7 @@ def load_data_bioc(
                 "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
                 params=parameters,
             )
    -        root = xml.fromstring(resp.content)
    +        root = safe_xml.fromstring(resp.content)
     
             for elem in root.iter():
                 if elem.tag == "Id":
    @@ -99,7 +102,8 @@ def load_data(
             search_query: str,
             max_results: Optional[int] = 10,
         ) -> List[Document]:
    -        """Search for a topic on Pubmed, fetch the text of the most relevant full-length papers.
    +        """
    +        Search for a topic on Pubmed, fetch the text of the most relevant full-length papers.
     
             Args:
                 search_query (str): A topic to search for (e.g. "Alzheimers").
    @@ -110,7 +114,6 @@ def load_data(
                 List[Document]: A list of Document objects.
             """
             import time
    -        import xml.etree.ElementTree as xml
     
             import requests
     
    @@ -122,7 +125,7 @@ def load_data(
                 "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
                 params=parameters,
             )
    -        root = xml.fromstring(resp.content)
    +        root = safe_xml.fromstring(resp.content)
     
             for elem in root.iter():
                 if elem.tag == "Id":
    @@ -131,7 +134,7 @@ def load_data(
                     print(url)
                     try:
                         resp = requests.get(url)
    -                    info = xml.fromstring(resp.content)
    +                    info = safe_xml.fromstring(resp.content)
     
                         raw_text = ""
                         title = ""
    
  • llama-index-integrations/readers/llama-index-readers-papers/pyproject.toml +2 -1 modified
    @@ -29,12 +29,13 @@ license = "MIT"
     maintainers = ["thejessezhang"]
     name = "llama-index-readers-papers"
     readme = "README.md"
    -version = "0.3.1"
    +version = "0.3.2"
     
     [tool.poetry.dependencies]
     python = ">=3.9,<4.0"
     arxiv = "^2.1.0"
     llama-index-core = "^0.12.0"
    +defusedxml = "^0.7.1"
     
     [tool.poetry.group.dev.dependencies]
     ipython = "8.10.0"
    
  • llama-index-integrations/readers/llama-index-readers-stripe-docs/llama_index/readers/stripe_docs/base.py +5 -4 modified
    @@ -1,7 +1,7 @@
     import urllib.request
    -import xml.etree.ElementTree as ET
     from typing import List
     
    +from defusedxml.ElementTree import fromstring
     from llama_index.core.readers.base import BaseReader
     from llama_index.core.schema import Document
     from llama_index.readers.web import AsyncWebPageReader
    @@ -13,7 +13,8 @@
     
     
     class StripeDocsReader(BaseReader):
    -    """Asynchronous Stripe documentation reader.
    +    """
    +    Asynchronous Stripe documentation reader.
     
         Reads pages from the Stripe documentation based on the sitemap.xml.
     
    @@ -36,7 +37,7 @@ def _load_sitemap(self) -> str:
         def _parse_sitemap(
             self, raw_sitemap: str, filters: List[str] = DEFAULT_FILTERS
         ) -> List:
    -        root_sitemap = ET.fromstring(raw_sitemap)
    +        root_sitemap = fromstring(raw_sitemap)
             sitemap_partition_urls = []
             sitemap_urls = []
     
    @@ -45,7 +46,7 @@ def _parse_sitemap(
                 sitemap_partition_urls.append(loc)
     
             for sitemap_partition_url in sitemap_partition_urls:
    -            sitemap_partition = ET.fromstring(self._load_url(sitemap_partition_url))
    +            sitemap_partition = fromstring(self._load_url(sitemap_partition_url))
     
                 # Find all <url /> and iterate through them
                 for url in sitemap_partition.findall(f"{{{XML_SITEMAP_SCHEMA}}}url"):
    
  • llama-index-integrations/readers/llama-index-readers-stripe-docs/pyproject.toml +2 -1 modified
    @@ -29,14 +29,15 @@ license = "GPL-3.0-or-later"
     maintainers = ["amorriscode"]
     name = "llama-index-readers-stripe-docs"
     readme = "README.md"
    -version = "0.3.0"
    +version = "0.3.1"
     
     [tool.poetry.dependencies]
     python = ">=3.9,<4.0"
     html2text = "^2024.2.26"
     urllib3 = "^2.1.0"
     llama-index-readers-web = "^0.3.0"
     llama-index-core = "^0.12.0"
    +defusedxml = "^0.7.1"
     
     [tool.poetry.group.dev.dependencies]
     ipython = "8.10.0"
    
  • llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/sitemap/base.py +4 -3 modified
    @@ -1,14 +1,15 @@
     import urllib.request
    -import xml.etree.ElementTree as ET
     from typing import List
     
    +from defusedxml.ElementTree import fromstring
     from llama_index.core.readers.base import BaseReader
     from llama_index.core.schema import Document
     from llama_index.readers.web.async_web.base import AsyncWebPageReader
     
     
     class SitemapReader(BaseReader):
    -    """Asynchronous sitemap reader for web.
    +    """
    +    Asynchronous sitemap reader for web.
     
         Reads pages from the web based on their sitemap.xml.
     
    @@ -34,7 +35,7 @@ def _load_sitemap(self, sitemap_url: str) -> str:
             return sitemap_url_request.read()
     
         def _parse_sitemap(self, raw_sitemap: str, filter_locs: str = None) -> list:
    -        sitemap = ET.fromstring(raw_sitemap)
    +        sitemap = fromstring(raw_sitemap)
             sitemap_urls = []
     
             for url in sitemap.findall(f"{{{self.xml_schema_sitemap}}}url"):
    
  • llama-index-integrations/readers/llama-index-readers-web/pyproject.toml +2 -1 modified
    @@ -47,7 +47,7 @@ license = "GPL-3.0-or-later"
     maintainers = ["HawkClaws", "Hironsan", "NA", "an-bluecat", "bborn", "jasonwcfan", "kravetsmic", "pandazki", "ruze00", "selamanse", "thejessezhang"]
     name = "llama-index-readers-web"
     readme = "README.md"
    -version = "0.3.8"
    +version = "0.3.9"
     
     [tool.poetry.dependencies]
     python = ">=3.9,<4.0"
    @@ -62,6 +62,7 @@ playwright = ">=1.30,<2.0"
     newspaper3k = "^0.2.8"
     spider-client = "^0.0.27"
     llama-index-core = "^0.12.0"
    +defusedxml = "^0.7.1"
     
     [tool.poetry.group.dev.dependencies]
     ipython = "8.10.0"
    

Vulnerability mechanics

Generated on May 9, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.

References

4
