VYPR
High severity · 7.5 · NVD Advisory · Published Jun 3, 2026

Docling: Unsafe XML Entity Expansion in USPTO Patent Backend

CVE-2026-44020

Description

Impact

The USPTO patent XML parser used the standard xml.sax.parseString() without protection against XML External Entity (XXE) attacks. An attacker could craft malicious USPTO patent XML files with external entity references that could:

- Read arbitrary files from the server filesystem
- Perform Server-Side Request Forgery (SSRF) attacks
- Cause denial of service through entity expansion (Billion Laughs attack)

The vulnerability affects three USPTO patent format parsers: ICE (v4.x), Grant v2.5, and Application v1.x.
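
To make the entity expansion risk concrete, the following minimal sketch parses a deliberately tiny, harmless document with the standard library's `xml.sax.parseString()`. The `TextCollector` handler, the `<patent>` element, and the entity names are illustrative assumptions, not Docling code. It only shows that a short entity reference is expanded into a much longer string before the handler sees it; a real Billion Laughs payload nests many more levels.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Hypothetical handler that records all character data it receives."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def characters(self, content: str) -> None:
        self.chunks.append(content)


# Each level references the previous one four times, so the output
# grows geometrically with the nesting depth (two levels here: 4 * 4 = 16).
DOC = """<?xml version="1.0"?>
<!DOCTYPE patent [
  <!ENTITY a "ha">
  <!ENTITY b "&a;&a;&a;&a;">
  <!ENTITY c "&b;&b;&b;&b;">
]>
<patent>&c;</patent>"""


def expanded_text(doc: str) -> str:
    """Parse `doc` and return the character data delivered to the handler."""
    handler = TextCollector()
    xml.sax.parseString(doc, handler)
    return "".join(handler.chunks)


text = expanded_text(DOC)
# The reference "&c;" is 3 characters long; compare with the expanded length.
print(len("&c;"), len(text))
```

Adding further nesting levels multiplies the output again at each level, which is why an unbounded parser can exhaust memory on a small input file.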

Patches

Fixed in version 2.74.0. The parser now uses defusedxml.sax.make_parser() with secure configuration that blocks external entity resolution (feature_external_ges=False, feature_external_pes=False) while allowing DTD declarations required by USPTO files. This prevents XXE attacks while maintaining compatibility with the USPTO XML format.
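
The effect of the two feature flags can be sketched with the standard library alone. This example uses `xml.sax.make_parser()` as a stand-in for `defusedxml.sax.make_parser()` (an assumption made so the snippet runs without extra dependencies). The `TextCollector` handler, the `patent.dtd` system identifier, and the temporary `secret.txt` file are all hypothetical. It illustrates that a document may still carry a DOCTYPE and an external entity declaration, while the referenced resource is never read.

```python
import io
import tempfile
from pathlib import Path
from xml.sax import make_parser  # stand-in for defusedxml.sax.make_parser
from xml.sax.handler import (
    ContentHandler,
    feature_external_ges,
    feature_external_pes,
)


class TextCollector(ContentHandler):
    """Hypothetical handler that records all character data it receives."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def characters(self, content: str) -> None:
        self.chunks.append(content)


def parse_without_external_entities(content: str) -> str:
    """Parse `content` with external general and parameter entities disabled."""
    parser = make_parser()
    parser.setFeature(feature_external_ges, False)
    parser.setFeature(feature_external_pes, False)
    handler = TextCollector()
    parser.setContentHandler(handler)
    parser.parse(io.StringIO(content))
    return "".join(handler.chunks)


with tempfile.TemporaryDirectory() as tmp:
    # A local file standing in for sensitive data on the server.
    secret = Path(tmp) / "secret.txt"
    secret.write_text("TOP-SECRET", encoding="utf-8")
    doc = (
        '<?xml version="1.0"?>\n'
        '<!DOCTYPE patent SYSTEM "patent.dtd" [\n'
        f'  <!ENTITY leak SYSTEM "{secret.as_uri()}">\n'
        "]>\n"
        "<patent>before &leak; after</patent>"
    )
    text = parse_without_external_entities(doc)

# The surrounding text is kept; the external entity contributes nothing.
print(repr(text))
```

The DOCTYPE declaration is accepted, which matters for USPTO files, but neither the DTD nor the entity target is opened.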

Workarounds

Avoid processing USPTO patent XML files from untrusted sources. Implement resource limits (memory, CPU time) when processing patent documents.
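
One way to apply such resource limits is to run the conversion in a child process with operating-system limits attached. The sketch below assumes a Unix-like system, where Python's `resource` module is available. The helper names `run_limited` and `_limit` and the default limit values are hypothetical, and the child command is a placeholder for whatever script performs the actual patent conversion.

```python
import resource
import subprocess
import sys


def _limit(kind: int, wanted: int) -> None:
    """Set the soft limit for `kind`, never exceeding the existing hard limit."""
    _, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        wanted = min(wanted, hard)
    resource.setrlimit(kind, (wanted, hard))


def run_limited(
    argv: list[str],
    cpu_seconds: int = 30,
    memory_bytes: int = 2 * 1024**3,
    wall_seconds: int = 60,
) -> subprocess.CompletedProcess:
    """Run `argv` in a child process with CPU-time and address-space limits."""

    def apply_limits() -> None:
        # Runs in the child just before exec, so the parent is unaffected.
        _limit(resource.RLIMIT_CPU, cpu_seconds)
        _limit(resource.RLIMIT_AS, memory_bytes)

    return subprocess.run(
        argv,
        preexec_fn=apply_limits,
        capture_output=True,
        text=True,
        timeout=wall_seconds,  # wall-clock guard in addition to CPU time
    )


# Placeholder child command; replace with the script that parses the patent file.
result = run_limited([sys.executable, "-c", "print('parsed')"])
print(result.returncode, result.stdout.strip())
```

A child that exceeds the CPU limit is terminated by a signal, and one that exceeds the memory limit typically fails with a `MemoryError`, so the parent should treat a non-zero return code as a rejected document.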

References

- Fix release: v2.74.0

Patches

1
576bada7b7d5

fix: security vulnerabilities with XML External Entity and related attacks (#3009)

https://github.com/docling-project/docling · Cesar Berrospi Ramis · Feb 17, 2026 · Fixed in 2.74.0 · via llm-release-walk
4 files changed · +177 -70
  • docling/backend/xml/jats_backend.py +48 -23 modified
    @@ -1,8 +1,27 @@
    +"""Backend to parse articles in JATS (Journal Article Tag Suite) XML format.
    +
    +JATS is a standard XML format used by publishers and journal archives including
    +PubMed Central (PMC), bioRxiv, and medRxiv for representing journal articles.
    +
    +Security Note:
    +    This module uses lxml.etree.XMLParser with secure configuration to protect
    +    against XML External Entity (XXE) attacks and XML bombs. The parser is
    +    configured with:
    +
    +    - resolve_entities: False (prevents entity resolution attacks)
    +    - no_network: True (blocks all network access)
    +    - dtd_validation: False (disables DTD validation)
    +    - load_dtd: False (prevents loading external DTDs)
    +
    +    This configuration ensures safe parsing of JATS XML files while blocking
    +    external entity fetching and preventing XXE attacks.
    +"""
    +
     import logging
     import traceback
     from io import BytesIO
     from pathlib import Path
    -from typing import Final, Optional, Union, cast
    +from typing import Final, cast
     
     from bs4 import BeautifulSoup, NavigableString, Tag
     from docling_core.types.doc import (
    @@ -26,11 +45,11 @@
     
     _log = logging.getLogger(__name__)
     
    -JATS_DTD_URL: Final = ["JATS-journalpublishing", "JATS-archive"]
    -DEFAULT_HEADER_ACKNOWLEDGMENTS: Final = "Acknowledgments"
    -DEFAULT_HEADER_ABSTRACT: Final = "Abstract"
    -DEFAULT_HEADER_REFERENCES: Final = "References"
    -DEFAULT_TEXT_ETAL: Final = "et al."
    +JATS_DTD_URL: Final[list[str]] = ["JATS-journalpublishing", "JATS-archive"]
    +DEFAULT_HEADER_ACKNOWLEDGMENTS: Final[str] = "Acknowledgments"
    +DEFAULT_HEADER_ABSTRACT: Final[str] = "Abstract"
    +DEFAULT_HEADER_REFERENCES: Final[str] = "References"
    +DEFAULT_TEXT_ETAL: Final[str] = "et al."
     
     
     class Abstract(TypedDict):
    @@ -87,20 +106,26 @@ class JatsDocumentBackend(DeclarativeDocumentBackend):
         """
     
         @override
    -    def __init__(
    -        self, in_doc: "InputDocument", path_or_stream: Union[BytesIO, Path]
    -    ) -> None:
    +    def __init__(self, in_doc: "InputDocument", path_or_stream: BytesIO | Path) -> None:
             super().__init__(in_doc, path_or_stream)
             self.path_or_stream = path_or_stream
     
             # Initialize the root of the document hierarchy
    -        self.root: Optional[NodeItem] = None
    +        self.root: NodeItem | None = None
             self.hlevel: int = 0
             self.valid: bool = False
             try:
                 if isinstance(self.path_or_stream, BytesIO):
                     self.path_or_stream.seek(0)
    -            self.tree: etree._ElementTree = etree.parse(self.path_or_stream)
    +            parser = etree.XMLParser(
    +                resolve_entities=False,
    +                load_dtd=False,
    +                no_network=True,
    +                dtd_validation=False,
    +            )
    +            self.tree: etree._ElementTree = etree.parse(
    +                self.path_or_stream, parser=parser
    +            )
     
                 doc_info: etree.DocInfo = self.tree.docinfo
                 if doc_info.system_url and any(
    @@ -172,7 +197,7 @@ def convert(self) -> DoclingDocument:
             return doc
     
         @staticmethod
    -    def _get_text(node: etree._Element, sep: Optional[str] = None) -> str:
    +    def _get_text(node: etree._Element, sep: str | None = None) -> str:
             skip_tags = ["term", "disp-formula", "inline-formula"]
             text: str = (
                 node.text.replace("\n", " ")
    @@ -189,9 +214,9 @@ def _get_text(node: etree._Element, sep: Optional[str] = None) -> str:
     
             return text
     
    -    def _find_metadata(self) -> Optional[etree._Element]:
    +    def _find_metadata(self) -> etree._Element | None:
             meta_names: list[str] = ["article-meta", "book-part-meta"]
    -        meta: Optional[etree._Element] = None
    +        meta: etree._Element | None = None
             for name in meta_names:
                 node = self.tree.xpath(f".//{name}")
                 if len(node) > 0:
    @@ -222,7 +247,7 @@ def _parse_abstract(self) -> list[Abstract]:
         def _parse_authors(self) -> list[Author]:
             # Get mapping between affiliation ids and names
             authors: list[Author] = []
    -        meta: Optional[etree._Element] = self._find_metadata()
    +        meta: etree._Element | None = self._find_metadata()
             if meta is None:
                 return authors
     
    @@ -390,7 +415,7 @@ def _parse_element_citation(self, node: etree._Element) -> str:
                 "part-title",
                 "trans-title",
             ]
    -        title_node: Optional[etree._Element] = None
    +        title_node: etree._Element | None = None
             for name in titles:
                 name_node = node.xpath(name)
                 if len(name_node) > 0:
    @@ -493,12 +518,12 @@ def _add_figure_captions(
             self, doc: DoclingDocument, parent: NodeItem, node: etree._Element
         ) -> None:
             label_node = node.xpath("label")
    -        label: Optional[str] = (
    +        label: str | None = (
                 JatsDocumentBackend._get_text(label_node[0]).strip() if label_node else ""
             )
     
             caption_node = node.xpath("caption")
    -        caption: Optional[str]
    +        caption: str | None
             if len(caption_node) > 0:
                 caption = ""
                 for caption_par in list(caption_node[0]):
    @@ -511,7 +536,7 @@ def _add_figure_captions(
     
             # TODO: format label vs caption once styling is supported
             fig_text: str = f"{label}{' ' if label and caption else ''}{caption}"
    -        fig_caption: Optional[TextItem] = (
    +        fig_caption: TextItem | None = (
                 doc.add_text(label=DocItemLabel.CAPTION, text=fig_text)
                 if fig_text
                 else None
    @@ -538,7 +563,7 @@ def _add_metadata(
             return
     
         @staticmethod
    -    def parse_table_data(element: Tag) -> Optional[TableData]:
    +    def parse_table_data(element: Tag) -> TableData | None:
             # TODO, see how to implement proper support for rich tables from HTML backend
             nested_tables = element.find("table")
             if nested_tables is not None:
    @@ -654,7 +679,7 @@ def _add_table(
             label = table_xml_component["label"]
             caption = table_xml_component["caption"]
             table_text: str = f"{label}{' ' if label and caption else ''}{caption}"
    -        table_caption: Optional[TextItem] = (
    +        table_caption: TextItem | None = (
                 doc.add_text(label=DocItemLabel.CAPTION, text=table_text)
                 if table_text
                 else None
    @@ -681,7 +706,7 @@ def _add_tables(
     
             # Caption
             caption_node = node.xpath("caption")
    -        caption: Optional[str]
    +        caption: str | None
             if caption_node:
                 caption = ""
                 for caption_par in list(caption_node[0]):
    @@ -738,7 +763,7 @@ def _walk_linear(
                 # add elements and decide whether to stop walking
                 if child.tag in ("sec", "ack"):
                     header = child.xpath("title|label")
    -                text: Optional[str] = None
    +                text: str | None = None
                     if len(header) > 0:
                         text = JatsDocumentBackend._get_text(header[0])
                     elif child.tag == "ack":
    
  • docling/backend/xml/uspto_backend.py +114 -47 modified
    @@ -3,20 +3,41 @@
     The parsers included in this module can handle patent grants published since 1976 and
     patent applications since 2001.
     The original files can be found in https://bulkdata.uspto.gov.
    +
    +Security Note:
    +    This module uses defusedxml.sax.make_parser() with customized security settings
    +    to protect against XML External Entity (XXE) attacks while allowing USPTO XML files
    +    to be parsed. In addition, it includes safeguards against entity expansion attacks
    +    and entity nesting depth. USPTO files contain DTD declarations that defusedxml
    +    blocks by default, so we configure the parser with:
    +
    +    - feature_external_ges: False (blocks external general entities)
    +    - feature_external_pes: False (blocks external parameter entities)
    +    - forbid_dtd: False (allows DTD declarations in the XML)
    +    - forbid_entities: False (allows entity declarations)
    +    - forbid_external: False (allows external references in declarations)
    +
    +    This configuration permits DTD declarations (required for USPTO files) while the
    +    disabled external entity features prevent actual fetching of external resources,
    +    effectively blocking XXE attacks. The parser processes the XML structure without
    +    accessing any external files or URLs.
     """
     
     import html
     import logging
     import re
    -import xml.sax
    -import xml.sax.xmlreader
     from abc import ABC, abstractmethod
     from enum import Enum, unique
    -from io import BytesIO
    +from io import BytesIO, StringIO
     from pathlib import Path
    -from typing import Final, Optional, Union
    +from typing import Final
    +from xml.sax import SAXParseException
    +from xml.sax.handler import ContentHandler, feature_external_ges, feature_external_pes
    +from xml.sax.xmlreader import AttributesImpl
     
     from bs4 import BeautifulSoup, Tag
    +from defusedxml.common import DefusedXmlException
    +from defusedxml.sax import make_parser
     from docling_core.types.doc import (
         DocItem,
         DocItemLabel,
    @@ -36,7 +57,7 @@
     
     _log = logging.getLogger(__name__)
     
    -XML_DECLARATION: Final = '<?xml version="1.0" encoding="UTF-8"?>'
    +XML_DECLARATION: Final[str] = '<?xml version="1.0" encoding="UTF-8"?>'
     
     
     @unique
    @@ -59,13 +80,11 @@ def __init__(self, _, level: LevelNumber) -> None:
     
     class PatentUsptoDocumentBackend(DeclarativeDocumentBackend):
         @override
    -    def __init__(
    -        self, in_doc: InputDocument, path_or_stream: Union[BytesIO, Path]
    -    ) -> None:
    +    def __init__(self, in_doc: InputDocument, path_or_stream: BytesIO | Path) -> None:
             super().__init__(in_doc, path_or_stream)
     
             self.patent_content: str = ""
    -        self.parser: Optional[PatentUspto] = None
    +        self.parser: PatentUspto | None = None
     
             try:
                 if isinstance(self.path_or_stream, BytesIO):
    @@ -153,7 +172,7 @@ class PatentUspto(ABC):
         """Parser of patent documents from the US Patent Office."""
     
         @abstractmethod
    -    def parse(self, patent_content: str) -> Optional[DoclingDocument]:
    +    def parse(self, patent_content: str) -> DoclingDocument | None:
             """Parse a USPTO patent.
     
             Parameters:
    @@ -177,12 +196,26 @@ def __init__(self) -> None:
             self.handler = PatentUsptoIce.PatentHandler()
             self.pattern = re.compile(r"^(<table .*?</table>)", re.MULTILINE | re.DOTALL)
     
    -    def parse(self, patent_content: str) -> Optional[DoclingDocument]:
    +    def parse(self, patent_content: str) -> DoclingDocument | None:
             try:
    -            xml.sax.parseString(patent_content, self.handler)
    -        except xml.sax._exceptions.SAXParseException as exc_sax:
    -            _log.error(f"Error in parsing USPTO document: {exc_sax}")
    -
    +            parser = make_parser()
    +            parser.setFeature(feature_external_ges, False)
    +            parser.setFeature(feature_external_pes, False)
    +            parser.forbid_dtd = False
    +            parser.forbid_entities = False
    +            parser.forbid_external = False
    +            parser.setContentHandler(self.handler)
    +            parser.parse(StringIO(patent_content))
    +        except SAXParseException as exc_sax:
    +            _log.error(f"Error in parsing USPTO document (malformed XML): {exc_sax}")
    +            return None
    +        except DefusedXmlException as exc_defused:
    +            _log.error(
    +                f"Error in parsing USPTO document (security issue detected): {exc_defused}"
    +            )
    +            return None
    +        except Exception as exc:
    +            _log.error(f"Unexpected error in parsing USPTO document: {exc}")
                 return None
     
             doc = self.handler.doc
    @@ -209,11 +242,11 @@ def parse(self, patent_content: str) -> Optional[DoclingDocument]:
     
             return doc
     
    -    class PatentHandler(xml.sax.handler.ContentHandler):
    +    class PatentHandler(ContentHandler):
             """SAX ContentHandler for patent documents."""
     
    -        APP_DOC_ELEMENT: Final = "us-patent-application"
    -        GRANT_DOC_ELEMENT: Final = "us-patent-grant"
    +        APP_DOC_ELEMENT: Final[str] = "us-patent-application"
    +        GRANT_DOC_ELEMENT: Final[str] = "us-patent-grant"
     
             @unique
             class Element(Enum):
    @@ -247,11 +280,11 @@ def __init__(self, _, is_text: bool) -> None:
             def __init__(self) -> None:
                 """Build an instance of the patent handler."""
                 # Current patent being parsed
    -            self.doc: Optional[DoclingDocument] = None
    +            self.doc: DoclingDocument | None = None
                 # Keep track of docling hierarchy level
                 self.level: LevelNumber = 1
                 # Keep track of docling parents by level
    -            self.parents: dict[LevelNumber, Optional[DocItem]] = {1: None}
    +            self.parents: dict[LevelNumber, DocItem | None] = {1: None}
                 # Content to retain for the current patent
                 self.property: list[str]
                 self.claim: str
    @@ -352,7 +385,7 @@ def characters(self, content):
                             self.text += content
     
             def _start_registered_elements(
    -            self, tag: str, attributes: xml.sax.xmlreader.AttributesImpl
    +            self, tag: str, attributes: AttributesImpl
             ) -> None:
                 if tag in [member.value for member in self.Element]:
                     # special case for claims: claim lines may start before the
    @@ -514,12 +547,26 @@ def __init__(self) -> None:
             self.pattern = re.compile(r"^(<table .*?</table>)", re.MULTILINE | re.DOTALL)
     
         @override
    -    def parse(self, patent_content: str) -> Optional[DoclingDocument]:
    +    def parse(self, patent_content: str) -> DoclingDocument | None:
             try:
    -            xml.sax.parseString(patent_content, self.handler)
    -        except xml.sax._exceptions.SAXParseException as exc_sax:
    -            _log.error(f"Error in parsing USPTO document: {exc_sax}")
    -
    +            parser = make_parser()
    +            parser.setFeature(feature_external_ges, False)
    +            parser.setFeature(feature_external_pes, False)
    +            parser.forbid_dtd = False
    +            parser.forbid_entities = False
    +            parser.forbid_external = False
    +            parser.setContentHandler(self.handler)
    +            parser.parse(StringIO(patent_content))
    +        except SAXParseException as exc_sax:
    +            _log.error(f"Error in parsing USPTO document (malformed XML): {exc_sax}")
    +            return None
    +        except DefusedXmlException as exc_defused:
    +            _log.error(
    +                f"Error in parsing USPTO document (security issue detected): {exc_defused}"
    +            )
    +            return None
    +        except Exception as exc:
    +            _log.error(f"Unexpected error in parsing USPTO document: {exc}")
                 return None
     
             doc = self.handler.doc
    @@ -546,11 +593,11 @@ def parse(self, patent_content: str) -> Optional[DoclingDocument]:
     
             return doc
     
    -    class PatentHandler(xml.sax.handler.ContentHandler):
    +    class PatentHandler(ContentHandler):
             """SAX ContentHandler for patent documents."""
     
    -        GRANT_DOC_ELEMENT: Final = "PATDOC"
    -        CLAIM_STATEMENT: Final = "What is claimed is:"
    +        GRANT_DOC_ELEMENT: Final[str] = "PATDOC"
    +        CLAIM_STATEMENT: Final[str] = "What is claimed is:"
     
             @unique
             class Element(Enum):
    @@ -585,11 +632,11 @@ def __init__(self, _, is_text: bool) -> None:
             def __init__(self) -> None:
                 """Build an instance of the patent handler."""
                 # Current patent being parsed
    -            self.doc: Optional[DoclingDocument] = None
    +            self.doc: DoclingDocument | None = None
                 # Keep track of docling hierarchy level
                 self.level: LevelNumber = 1
                 # Keep track of docling parents by level
    -            self.parents: dict[LevelNumber, Optional[DocItem]] = {1: None}
    +            self.parents: dict[LevelNumber, DocItem | None] = {1: None}
                 # Content to retain for the current patent
                 self.property: list[str]
                 self.claim: str
    @@ -684,7 +731,7 @@ def characters(self, content):
                             self.text += content
     
             def _start_registered_elements(
    -            self, tag: str, attributes: xml.sax.xmlreader.AttributesImpl
    +            self, tag: str, attributes: AttributesImpl
             ) -> None:
                 if tag in [member.value for member in self.Element]:
                     if (
    @@ -887,13 +934,13 @@ class Field(Enum):
         @override
         def __init__(self) -> None:
             """Build an instance of PatentUsptoGrantAps class."""
    -        self.doc: Optional[DoclingDocument] = None
    +        self.doc: DoclingDocument | None = None
             # Keep track of docling hierarchy level
             self.level: LevelNumber = 1
             # Keep track of docling parents by level
    -        self.parents: dict[LevelNumber, Optional[DocItem]] = {1: None}
    +        self.parents: dict[LevelNumber, DocItem | None] = {1: None}
     
    -    def get_last_text_item(self) -> Optional[TextItem]:
    +    def get_last_text_item(self) -> TextItem | None:
             """Get the last text item at the current document level.
     
             Returns:
    @@ -1030,7 +1077,7 @@ def store_content(self, section: str, field: str, value: str) -> None:
                     parent=self.parents[self.level],
                 )
     
    -    def parse(self, patent_content: str) -> Optional[DoclingDocument]:
    +    def parse(self, patent_content: str) -> DoclingDocument | None:
             self.doc = self.doc = DoclingDocument(name="file")
             section: str = ""
             key: str = ""
    @@ -1075,12 +1122,26 @@ def __init__(self) -> None:
             self.pattern = re.compile(r"^(<table .*?</table>)", re.MULTILINE | re.DOTALL)
     
         @override
    -    def parse(self, patent_content: str) -> Optional[DoclingDocument]:
    +    def parse(self, patent_content: str) -> DoclingDocument | None:
             try:
    -            xml.sax.parseString(patent_content, self.handler)
    -        except xml.sax._exceptions.SAXParseException as exc_sax:
    -            _log.error(f"Error in parsing USPTO document: {exc_sax}")
    -
    +            parser = make_parser()
    +            parser.setFeature(feature_external_ges, False)
    +            parser.setFeature(feature_external_pes, False)
    +            parser.forbid_dtd = False
    +            parser.forbid_entities = False
    +            parser.forbid_external = False
    +            parser.setContentHandler(self.handler)
    +            parser.parse(StringIO(patent_content))
    +        except SAXParseException as exc_sax:
    +            _log.error(f"Error in parsing USPTO document (malformed XML): {exc_sax}")
    +            return None
    +        except DefusedXmlException as exc_defused:
    +            _log.error(
    +                f"Error in parsing USPTO document (security issue detected): {exc_defused}"
    +            )
    +            return None
    +        except Exception as exc:
    +            _log.error(f"Unexpected error in parsing USPTO document: {exc}")
                 return None
     
             doc = self.handler.doc
    @@ -1107,10 +1168,10 @@ def parse(self, patent_content: str) -> Optional[DoclingDocument]:
     
             return doc
     
    -    class PatentHandler(xml.sax.handler.ContentHandler):
    +    class PatentHandler(ContentHandler):
             """SAX ContentHandler for patent documents."""
     
    -        APP_DOC_ELEMENT: Final = "patent-application-publication"
    +        APP_DOC_ELEMENT: Final[str] = "patent-application-publication"
     
             @unique
             class Element(Enum):
    @@ -1146,11 +1207,11 @@ def __init__(self, _, is_text: bool) -> None:
             def __init__(self) -> None:
                 """Build an instance of the patent handler."""
                 # Current patent being parsed
    -            self.doc: Optional[DoclingDocument] = None
    +            self.doc: DoclingDocument | None = None
                 # Keep track of docling hierarchy level
                 self.level: LevelNumber = 1
                 # Keep track of docling parents by level
    -            self.parents: dict[LevelNumber, Optional[DocItem]] = {1: None}
    +            self.parents: dict[LevelNumber, DocItem | None] = {1: None}
                 # Content to retain for the current patent
                 self.property: list[str]
                 self.claim: str
    @@ -1245,7 +1306,7 @@ def characters(self, content):
                             self.text += content
     
             def _start_registered_elements(
    -            self, tag: str, attributes: xml.sax.xmlreader.AttributesImpl
    +            self, tag: str, attributes: AttributesImpl
             ) -> None:
                 if tag in [member.value for member in self.Element]:
                     # special case for claims: claim lines may start before the
    @@ -1421,6 +1482,12 @@ def __init__(self, input: str) -> None:
     
             Args:
                 input: The xml content.
    +
    +        Security Note:
    +            This parser uses BeautifulSoup with lxml, which can be vulnerable to XXE.
    +            However, the input here comes from table strings extracted AFTER the main
    +            document has been safely parsed by defusedxml, so the content is already
    +            sanitized and safe to parse.
             """
             self.max_nbr_messages = 2
             self.nbr_messages = 0
    @@ -1678,7 +1745,7 @@ def _parse_table(self, table: Tag) -> TableData:
     
             return dl_table
     
    -    def parse(self) -> Optional[TableData]:
    +    def parse(self) -> TableData | None:
             """Parse the first table from an xml content.
     
             Returns:
    
  • pyproject.toml +2 -0 modified
    @@ -71,6 +71,7 @@ dependencies = [
       'scipy (>=1.6.0,<2.0.0)',
       "accelerate>=1.0.0,<2",
       "polyfactory>=2.22.2",
    +  "defusedxml (>=0.7.1, <0.8.0)",
     ]
     
     [project.urls]
    @@ -132,6 +133,7 @@ dev = [
         "ipywidgets~=8.1",
         "nbqa~=1.9",
         "python-semantic-release~=7.32",
    +    "types-defusedxml (>=0.7.0.20250822, <0.8.0)",
     ]
     docs = [
       "mkdocs-material~=9.5",
    
  • uv.lock +13 -0 modified
    @@ -1030,6 +1030,7 @@ dependencies = [
         { name = "accelerate" },
         { name = "beautifulsoup4" },
         { name = "certifi" },
    +    { name = "defusedxml" },
         { name = "docling-core", extra = ["chunking"] },
         { name = "docling-ibm-models" },
         { name = "docling-parse" },
    @@ -1108,6 +1109,7 @@ dev = [
         { name = "pytest-durations" },
         { name = "pytest-xdist" },
         { name = "python-semantic-release" },
    +    { name = "types-defusedxml" },
         { name = "types-openpyxl" },
         { name = "types-requests" },
         { name = "types-setuptools" },
    @@ -1138,6 +1140,7 @@ requires-dist = [
         { name = "accelerate", marker = "extra == 'vlm'", specifier = ">=1.2.1,<2.0.0" },
         { name = "beautifulsoup4", specifier = ">=4.12.3,<5.0.0" },
         { name = "certifi", specifier = ">=2024.7.4" },
    +    { name = "defusedxml", specifier = ">=0.7.1,<0.8.0" },
         { name = "docling-core", extras = ["chunking"], specifier = ">=2.62.0,<3.0.0" },
         { name = "docling-ibm-models", specifier = ">=3.9.1,<4" },
         { name = "docling-parse", specifier = ">=5.3.2,<6.0.0" },
    @@ -1199,6 +1202,7 @@ dev = [
         { name = "pytest-durations", specifier = "~=1.6.1" },
         { name = "pytest-xdist", specifier = "~=3.3" },
         { name = "python-semantic-release", specifier = "~=7.32" },
    +    { name = "types-defusedxml", specifier = ">=0.7.0.20250822,<0.8.0" },
         { name = "types-openpyxl", specifier = "~=3.1" },
         { name = "types-requests", specifier = "~=2.31" },
         { name = "types-setuptools", specifier = "~=70.3" },
    @@ -6854,6 +6858,15 @@ wheels = [
         { url = "https://files.pythonhosted.org/packages/ab/3d/21a2212b5fcef9e8e9f368403885dc567b7d31e50b2ce393efad3cd83572/types_awscrt-0.31.2-py3-none-any.whl", hash = "sha256:3d6a29c1cca894b191be408f4d985a8e3a14d919785652dd3fa4ee558143e4bf", size = 43340, upload-time = "2026-02-16T02:33:52.109Z" },
     ]
     
    +[[package]]
    +name = "types-defusedxml"
    +version = "0.7.0.20250822"
    +source = { registry = "https://pypi.org/simple" }
    +sdist = { url = "https://files.pythonhosted.org/packages/7d/4a/5b997ae87bf301d1796f72637baa4e0e10d7db17704a8a71878a9f77f0c0/types_defusedxml-0.7.0.20250822.tar.gz", hash = "sha256:ba6c395105f800c973bba8a25e41b215483e55ec79c8ca82b6fe90ba0bc3f8b2", size = 10590, upload-time = "2025-08-22T03:02:59.547Z" }
    +wheels = [
    +    { url = "https://files.pythonhosted.org/packages/13/73/8a36998cee9d7c9702ed64a31f0866c7f192ecffc22771d44dbcc7878f18/types_defusedxml-0.7.0.20250822-py3-none-any.whl", hash = "sha256:5ee219f8a9a79c184773599ad216123aedc62a969533ec36737ec98601f20dcf", size = 13430, upload-time = "2025-08-22T03:02:58.466Z" },
    +]
    +
     [[package]]
     name = "types-openpyxl"
     version = "3.1.5.20250919"
    

Vulnerability mechanics

Root cause

"The USPTO patent XML parser used the standard xml.sax.parseString() without protection against XML External Entity (XXE) attacks."

Attack vector

An attacker can craft malicious USPTO patent XML files containing external entity references. These references can be used to read arbitrary files from the server's filesystem, perform Server-Side Request Forgery (SSRF) attacks, or cause a denial of service through entity expansion (Billion Laughs attack) [CWE-776]. This vulnerability affects three USPTO patent format parsers: ICE (v4.x), Grant v2.5, and Application v1.x.

Affected code

The vulnerability exists in the `docling/backend/xml/uspto_backend.py` file, specifically within the `PatentUsptoIce.parse`, `PatentUsptoGrantV25.parse`, and `PatentUsptoAppV1.parse` methods. These methods previously used `xml.sax.parseString()` which is susceptible to XXE attacks. The fix involves replacing this with `defusedxml.sax.make_parser()` and configuring it securely.

What the fix does

The patch replaces `xml.sax.parseString()` with a parser built by `defusedxml.sax.make_parser()` and explicitly disables external entity resolution by calling `parser.setFeature(feature_external_ges, False)` and `parser.setFeature(feature_external_pes, False)`. It also sets `forbid_dtd`, `forbid_entities`, and `forbid_external` to `False` so that the DTD declarations required by USPTO files are still accepted, and each `parse` method returns `None` when parsing raises a `SAXParseException`, a `DefusedXmlException`, or any other exception [patch_id=4714025]. The fix is included in version 2.74.0.

Preconditions

  • input: The attacker must be able to provide a malicious USPTO patent XML file to the parser.

Generated on Jun 3, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.
