VYPR
Medium severity 6.9 · GHSA Advisory · Published Jun 16, 2026 · Updated Jun 16, 2026

pypdf: Possible large memory usage for form XObjects during text extraction

CVE-2026-49461

Description

A crafted PDF with self-referencing form XObject causes excessive memory consumption in pypdf during text extraction.

AI Insight

LLM-synthesized narrative grounded in this CVE's description and references.

Vulnerability

All pypdf versions before 6.12.2 contain a vulnerability in the text extraction code that processes form XObjects. A specially crafted PDF that includes a self-referencing form XObject can cause uncontrolled memory usage when the text of the containing page is extracted [1][3].

Exploitation

An attacker creates a PDF that includes a form XObject containing self-references. The victim must use pypdf to extract text from the page that includes this XObject. No authentication or special privileges are required; the standard extract_text() operation triggers the vulnerability [1][2].

Impact

Successful exploitation results in excessive memory consumption, potentially leading to denial of service due to resource exhaustion. The attacker can cause the application to crash or become unresponsive [1][4].

Mitigation

The vulnerability is fixed in pypdf version 6.12.2, released on 2026-05-26 [3]. Users should upgrade immediately. As a workaround, users who cannot upgrade can apply the changes from pull request #3805, which improves loop control in the text extraction code [2][4].
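As a rough aid for the upgrade advice above, the following sketch checks whether an installed pypdf predates 6.12.2. The helper names (`FIXED_VERSION`, `parse_release`, `is_affected`, `installed_pypdf_is_affected`) are hypothetical and not part of pypdf; only `importlib.metadata.version` is a real standard-library call. The parser is deliberately simplistic and ignores pre-release ordering, so treat it as an illustration rather than a complete version comparison.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

# First fixed release according to this advisory.
FIXED_VERSION = (6, 12, 2)


def parse_release(version_string: str) -> tuple[int, ...]:
    """Parse the leading numeric release segment, e.g. '6.12.1' -> (6, 12, 1)."""
    parts: list[int] = []
    for piece in version_string.split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if digits:
            parts.append(int(digits))
        if len(digits) < len(piece):
            break  # stop at the first non-numeric suffix such as "rc1"
    return tuple(parts)


def is_affected(version_string: str) -> bool:
    """Return True if the given version string is older than the fixed release."""
    return parse_release(version_string) < FIXED_VERSION


def installed_pypdf_is_affected() -> Optional[bool]:
    """Return None if pypdf is not installed, else whether it looks affected."""
    try:
        return is_affected(version("pypdf"))
    except PackageNotFoundError:
        return None
```

Calling `installed_pypdf_is_affected()` in a deployment check gives a quick signal; a dependency scanner or lock-file audit is the more reliable route.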

AI Insight generated on Jun 16, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.

Affected products

2

Patches

1
0b5b8adf02a2

SEC: Improve loop control in text extraction (#3805)

https://github.com/py-pdf/pypdf · Stefan · May 26, 2026 · via body-scan-shorthand
2 files changed · +59 -18
  • pypdf/_page.py (+35 -18, modified)
    @@ -1678,14 +1678,16 @@ def _debug_for_extract(self) -> str:  # pragma: no cover
     
         def _extract_text(
             self,
    -        obj: Any,
    +        obj: DictionaryObject,
             pdf: Any,
             orientations: tuple[int, ...] = (0, 90, 180, 270),
             space_width: float = 200.0,
             content_key: Optional[str] = PG.CONTENTS,
             visitor_operand_before: Optional[Callable[[Any, Any, Any, Any], None]] = None,
             visitor_operand_after: Optional[Callable[[Any, Any, Any, Any], None]] = None,
             visitor_text: Optional[Callable[[Any, Any, Any, Any, Any], None]] = None,
    +        *,
    +        known_ids: Optional[set[int]] = None,
         ) -> str:
             """
             See extract_text for most arguments.
    @@ -1696,18 +1698,15 @@ def _extract_text(
                     default = "/Content"
     
             """
    +        if known_ids is None:
    +            known_ids = set()
    +
             extractor = TextExtraction()
             font_resources: dict[str, DictionaryObject] = {}
             fonts: dict[str, Font] = {}
     
             try:
    -            objr = obj
    -            while NameObject(PG.RESOURCES) not in objr:
    -                # /Resources can be inherited so we look to parents
    -                objr = objr["/Parent"].get_object()
    -                # If no parents then no /Resources will be available,
    -                # so an exception will be raised
    -            resources_dict = cast(DictionaryObject, objr[PG.RESOURCES])
    +            resources_dict = cast(DictionaryObject, obj.get_inherited(key=PG.RESOURCES, default=DictionaryObject()))
             except Exception:
                 # No resources means no text is possible (no font); we consider the
                 # file as not damaged, no need to check for TJ or Tj
    @@ -1796,16 +1795,31 @@ def _extract_text(
                     except IndexError:
                         pass
                     try:
    -                    xobj = resources_dict["/XObject"]
    -                    if xobj[operands[0]]["/Subtype"] != "/Image":  # type: ignore
    -                        text = self.extract_xform_text(
    -                            xobj[operands[0]],  # type: ignore
    -                            orientations,
    -                            space_width,
    -                            visitor_operand_before,
    -                            visitor_operand_after,
    -                            visitor_text,
    -                        )
    +                    xobj = cast(DictionaryObject, resources_dict["/XObject"])
    +                    xform = cast(EncodedStreamObject, xobj[operands[0]])
    +                    if xform["/Subtype"] != NameObject("/Image"):
    +                        xform_id = id(xform)
    +                        if xform_id in known_ids:
    +                            logger_warning(
    +                                "Detected cyclic form XObject reference, skipping %(operand)s.",
    +                                source=__name__,
    +                                operand=operands[0]
    +                            )
    +                            text = ""
    +                        else:
    +                            known_ids.add(xform_id)
    +                            try:
    +                                text = self.extract_xform_text(
    +                                    xform,
    +                                    orientations,
    +                                    space_width,
    +                                    visitor_operand_before,
    +                                    visitor_operand_after,
    +                                    visitor_text,
    +                                    known_ids=known_ids,
    +                                )
    +                            finally:
    +                                known_ids.discard(xform_id)
                             extractor.output += text
                             if visitor_text is not None:
                                 visitor_text(
    @@ -2071,6 +2085,8 @@ def extract_xform_text(
             visitor_operand_before: Optional[Callable[[Any, Any, Any, Any], None]] = None,
             visitor_operand_after: Optional[Callable[[Any, Any, Any, Any], None]] = None,
             visitor_text: Optional[Callable[[Any, Any, Any, Any, Any], None]] = None,
    +        *,
    +        known_ids: Optional[set[int]] = None,
         ) -> str:
             """
             Extract text from an XObject.
    @@ -2096,6 +2112,7 @@ def extract_xform_text(
                 visitor_operand_before,
                 visitor_operand_after,
                 visitor_text,
    +            known_ids=known_ids,
             )
     
         def _get_fonts(self) -> tuple[set[str], set[str]]:
    
  • tests/test_text_extraction.py (+24 -0, modified)
    @@ -596,3 +596,27 @@ def test_fixed_width_page__excessive_needed_spaces(caplog):
     
         assert result == " " * 10_000 + "X"
         assert caplog.messages == ["Limiting excessive whitespace from 13000 to 10000 characters."]
    +
    +
    +def test_page__extract_text__xform__self_references(caplog):
    +    writer = PdfWriter()
    +    page = writer.add_blank_page(width=10, height=10)
    +
    +    form = ContentStream(stream=None, pdf=writer)
    +    form[NameObject("/Type")] = NameObject("/XObject")
    +    form[NameObject("/Subtype")] = NameObject("/Form")
    +    form.set_data(b"/X1 Do")
    +    form_reference = writer._add_object(form)
    +    form[NameObject("/Resources")] = DictionaryObject({
    +        NameObject("/XObject"): DictionaryObject({
    +            NameObject("/X1"): form_reference
    +        })
    +    })
    +
    +    page[NameObject("/Resources")] = form[NameObject("/Resources")]
    +    content = ContentStream(stream=None, pdf=writer)
    +    content.set_data(b"q /X1 Do Q")
    +    page.replace_contents(content)
    +
    +    assert page.extract_text() == ""
    +    assert caplog.messages == ["Detected cyclic form XObject reference, skipping /X1."]
    

Vulnerability mechanics

Root cause

"Missing cycle detection in Form XObject recursion during text extraction leads to uncontrolled recursion and memory exhaustion."

Attack vector

An attacker crafts a PDF whose page content references a Form XObject that, through its /Resources/XObject dictionary, refers back to itself (a cyclic reference). When pypdf's text extraction processes this XObject, it recurses without bound, eventually exhausting available memory. The attacker must deliver the malformed PDF to a victim application that calls `.extract_text()` on the page. No authentication or special network position is required beyond the ability to supply a PDF document.

What the fix does

The patch adds a keyword-only `known_ids` parameter (a `set[int]`) that is passed through `_extract_text` and `extract_xform_text`. Before recursing into a Form XObject, the code takes `id(xform)` and checks whether it is already in the set; if it is, a warning is logged and an empty string is used as that XObject's text, breaking the cycle. Otherwise the XObject's id is added to the set before the recursive call and removed afterwards (via `try/finally`), so a legitimate XObject that is drawn several times without nesting inside itself is still extracted each time. As a result, a self-referencing Form XObject is skipped on its second visit instead of recursing indefinitely.
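The guard described above can be illustrated outside pypdf with a simplified model. This is a sketch under stated assumptions, not pypdf's implementation: the `Form` class and the `("text", ...)` / `("do", ...)` operation tuples are hypothetical stand-ins for form XObjects and content-stream operators, while the `known_ids` handling mirrors the pattern in the diff.

```python
from typing import Optional


class Form:
    """Hypothetical stand-in for a form XObject (not a pypdf class)."""

    def __init__(self, ops=None, xobjects=None):
        # ops: list of ("text", str) or ("do", name) tuples
        self.ops = ops or []
        # xobjects: name -> Form, like a /Resources /XObject dictionary
        self.xobjects = xobjects or {}


def extract(form: Form, *, known_ids: Optional[set[int]] = None) -> str:
    if known_ids is None:
        known_ids = set()
    output = ""
    for kind, value in form.ops:
        if kind == "text":
            output += value
            continue
        child = form.xobjects[value]
        child_id = id(child)
        if child_id in known_ids:
            # The child is already being processed further up the call
            # stack, so this is a cycle: contribute no text and move on.
            continue
        known_ids.add(child_id)
        try:
            output += extract(child, known_ids=known_ids)
        finally:
            # Remove the id again so the same form can be drawn later
            # from a sibling position without being mistaken for a cycle.
            known_ids.discard(child_id)
    return output
```

Because the id is discarded after each call returns, the set only ever holds the forms on the current recursion path. A form drawn twice in sequence is extracted twice, while a form that draws itself is visited once.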

Preconditions

  • input: The victim application must invoke pypdf's `page.extract_text()` (or any internal code path that reaches `_extract_text` followed by `extract_xform_text`).
  • input: The PDF must contain at least one page whose content stream references a Form XObject that directly or indirectly references itself in its /Resources/XObject dictionary.
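To make "directly or indirectly references itself" concrete, the sketch below finds a reference cycle in a plain dictionary that maps each form name to the form names it invokes. The `find_cycle` function and the dictionary layout are assumptions made for illustration; they do not read real PDF objects and are not part of pypdf.

```python
def find_cycle(references: dict[str, list[str]], start: str) -> list[str]:
    """Return one reference chain from start that revisits a form, or []."""
    path: list[str] = []
    on_path: set[str] = set()

    def visit(name: str) -> list[str]:
        if name in on_path:
            # Slice out the loop portion of the current path and close it.
            return path[path.index(name):] + [name]
        path.append(name)
        on_path.add(name)
        try:
            for target in references.get(name, []):
                cycle = visit(target)
                if cycle:
                    return cycle
        finally:
            path.pop()
            on_path.discard(name)
        return []

    return visit(start)
```

A direct self-reference such as `{"X1": ["X1"]}` and an indirect one such as `{"A": ["B"], "B": ["A"]}` both yield a non-empty chain, whereas a form that is merely used twice does not.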

Generated on Jun 16, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.

References

4
