pypdf: Possible infinite loop when retrieving fonts for layout-mode text extraction
Description
Crafted PDF causes infinite loop in pypdf when extracting text in layout mode, fixed in version 6.13.0.
AI Insight
LLM-synthesized narrative grounded in this CVE's description and references.
Vulnerability
A vulnerability in pypdf prior to version 6.13.0 allows an attacker to craft a PDF that triggers an infinite loop during text extraction in layout mode. The loop occurs when the library collects font information by following the page's `/Parent` chain: if that chain is cyclic, the traversal never terminates, leading to uncontrolled resource consumption. [1][3]
Exploitation
An attacker can exploit this by providing a malicious PDF to any application or service that uses pypdf to extract text in layout mode. No authentication or special privileges are required; the victim only needs to process the crafted PDF. The infinite loop is triggered during the font retrieval step, as detailed in the fix commit. [1][2]
Impact
Successful exploitation results in a denial of service (DoS) due to an infinite loop, causing the application to hang or exhaust CPU resources. No data disclosure, privilege escalation, or remote code execution has been reported. [1]
Mitigation
The issue is fixed in pypdf version 6.13.0, released on 2026-06-05. Users unable to upgrade immediately can apply the changes from pull request #3830 as a workaround. [2][3]
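As a rough aid for the upgrade advice above, the following sketch checks whether the installed pypdf is at least 6.13.0. The helper names (`parse_version`, `is_patched`, `installed_pypdf_is_patched`) are hypothetical and not part of pypdf; the parser is deliberately simple and treats a pre-release such as `6.13.0rc1` like `6.13.0`, which is an assumption you may not want in production.

```python
import itertools
from importlib import metadata
from typing import Optional

# First fixed release according to this advisory.
FIXED_VERSION = (6, 13, 0)


def parse_version(text: str) -> tuple:
    """Turn '6.12.1' into (6, 12, 1), padding missing components with 0.

    Simplification (assumption): only the leading digits of each of the
    first three components are read, so suffixes like 'rc1' are ignored.
    """
    numbers = []
    for piece in text.split(".")[:3]:
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        if not digits:
            break
        numbers.append(int(digits))
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def is_patched(version_text: str) -> bool:
    """Return True if the given version string is >= 6.13.0."""
    return parse_version(version_text) >= FIXED_VERSION


def installed_pypdf_is_patched() -> Optional[bool]:
    """Check the installed pypdf; None means pypdf is not installed."""
    try:
        return is_patched(metadata.version("pypdf"))
    except metadata.PackageNotFoundError:
        return None
```

A deployment check could call `installed_pypdf_is_patched()` at startup and refuse to process untrusted PDFs when it returns `False`.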
AI Insight generated on Jun 16, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.
Affected products (2)
Patches (1)
68822ded066f SEC: Avoid infinite loops for outlines and text extraction (#3830)
4 files changed · +158 −33
pypdf/_page.py +14 −10 modified
@@ -1864,20 +1864,24 @@ def _layout_mode_fonts(self) -> dict[str, Font]:
         """
         # Font retrieval logic adapted from pypdf.PageObject._extract_text()
-        objr: Any = self
+        obj: Any = self
         fonts: dict[str, Font] = {}
-        while objr is not None:
-            try:
-                resources_dict: Any = objr[PG.RESOURCES]
-            except KeyError:
-                resources_dict = {}
+        visited: set[int] = set()
+        while True:
+            obj_id = id(obj)
+            if obj_id in visited:
+                logger_warning("Detected cycle in /Parent hierarchy when retrieving fonts.", source=__name__)
+                break
+            visited.add(obj_id)
+
+            resources_dict: Any = obj.get(PG.RESOURCES, {})
             if "/Font" in resources_dict and self.pdf is not None:
                 for font_name in resources_dict["/Font"]:
                     fonts[font_name] = Font.from_font_resource(resources_dict["/Font"][font_name])
-            try:
-                objr = objr["/Parent"].get_object()
-            except KeyError:
-                objr = None
+
+            if "/Parent" not in obj:
+                break
+            obj = obj["/Parent"].get_object()
         return fonts
pypdf/_writer.py +38 −23 modified
@@ -2788,7 +2788,7 @@ def merge(
         _ro = reader.root_object
         if import_outline and CO.OUTLINES in _ro:
             outline = self._get_filtered_outline(
-                _ro.get(CO.OUTLINES, None), srcpages, reader
+                node=_ro.get(CO.OUTLINES, None), pages=srcpages, reader=reader
             )
             self._insert_filtered_outline(
                 outline, outline_item_typ, None
@@ -3053,54 +3053,69 @@ def _insert_filtered_annotations(
     def _get_filtered_outline(
         self,
+        *,
         node: Any,
         pages: dict[int, PageObject],
         reader: PdfReader,
+        visited: Optional[set[int]] = None,
     ) -> list[Destination]:
         """
         Extract outline item entries that are part of the specified page set.
-        Args:
-            node:
-            pages:
-            reader:
-
         Returns:
             A list of destination objects.
         """
-        new_outline = []
+        if visited is None:
+            visited = set()
+        new_outline: list[Destination] = []
         if node is None:
-            node = NullObject()
+            return new_outline
         node = node.get_object()
         if is_null_or_none(node):
             node = DictionaryObject()
+
         if node.get("/Type", "") == "/Outlines" or "/Title" not in node:
+            node_id = id(node)
+            if node_id in visited:
+                logger_warning("Detected cycle in outlines.", source=__name__)
+                return []
+            visited.add(node_id)
+
             node = node.get("/First", None)
             if node is not None:
                 node = node.get_object()
-                new_outline += self._get_filtered_outline(node, pages, reader)
+                new_outline += self._get_filtered_outline(node=node, pages=pages, reader=reader, visited=visited)
         else:
-            v: Union[None, IndirectObject, NullObject]
-            while node is not None:
+            cloned_page: Union[None, IndirectObject, NullObject]
+            while True:
                 node = node.get_object()
-                o = cast("Destination", reader._build_outline_item(node))
-                v = self._get_cloned_page(cast("PageObject", o["/Page"]), pages, reader)
-                if v is None:
-                    v = NullObject()
-                o[NameObject("/Page")] = v
+                node_id = id(node)
+                if node_id in visited:
+                    logger_warning("Detected cycle in outlines.", source=__name__)
+                    break
+                visited.add(node_id)
+
+                destination = cast("Destination", reader._build_outline_item(node))
+                cloned_page = self._get_cloned_page(cast("PageObject", destination["/Page"]), pages, reader)
+                if cloned_page is None:
+                    cloned_page = NullObject()
+                destination[NameObject("/Page")] = cloned_page
                 if "/First" in node:
-                    o._filtered_children = self._get_filtered_outline(
-                        node["/First"], pages, reader
+                    destination._filtered_children = self._get_filtered_outline(
+                        node=node["/First"], pages=pages, reader=reader, visited=visited
                     )
                 else:
-                    o._filtered_children = []
+                    destination._filtered_children = []
                 if (
-                    not isinstance(o["/Page"], NullObject)
-                    or len(o._filtered_children) > 0
+                    not isinstance(cloned_page, NullObject)
+                    or len(destination._filtered_children) > 0
                 ):
-                    new_outline.append(o)
-                node = node.get("/Next", None)
+                    new_outline.append(destination)
+
+                if "/Next" not in node:
+                    break
+                node = node["/Next"]
         return new_outline
     def _clone_outline(self, dest: Destination) -> TreeObject:
tests/test_text_extraction.py +29 −0 modified
@@ -649,3 +649,32 @@ def test_text_state_params__unicode_decode_error(encoding):
     # Assertions: 'replace' mode changes invalid UTF-8 bytes to '\xfffd'.
     assert parameters.text == "\ufffd"
     assert parameters._decoded_value == "\ufffd"
+
+
+@pytest.mark.timeout(5)
+def test_page_object__layout_mode_fonts__cyclic(caplog) -> None:
+    writer = PdfWriter()
+
+    font = DictionaryObject({
+        NameObject("/Type"): NameObject("/Font"),
+        NameObject("/Subtype"): NameObject("/Type1"),
+        NameObject("/BaseFont"): NameObject("/Helvetica"),
+    })
+    fonts = {"/F1": Font.from_font_resource(font)}
+    page = writer.add_blank_page(width=10, height=10)
+    dictionary2 = DictionaryObject(DictionaryObject({
+        NameObject("/Resources"): DictionaryObject({
+            NameObject("/Font"): DictionaryObject({
+                NameObject("/F1"): font
+            })
+        })
+    }))
+    reference2 = writer._add_object(dictionary2)
+    dictionary3 = DictionaryObject({NameObject("/Parent"): reference2})
+    reference3 = writer._add_object(dictionary3)
+    page[NameObject("/Parent")] = reference3
+    dictionary2[NameObject("/Parent")] = page.indirect_reference
+    page.pdf = writer
+
+    assert page._layout_mode_fonts() == fonts
+    assert caplog.messages == ["Detected cycle in /Parent hierarchy when retrieving fonts."]
tests/test_writer.py +77 −0 modified
@@ -3218,3 +3218,80 @@ def test_encrypt__incremental():
     with pytest.raises(NotImplementedError):
         writer.encrypt(user_password="dummy")
+
+
+@pytest.mark.timeout(5)
+def test_get_filtered_outline__first__cyclic(caplog) -> None:
+    writer = PdfWriter()
+    reader = PdfReader(RESOURCE_ROOT / "crazyones.pdf")
+
+    dictionary1 = DictionaryObject({
+        NameObject("/Type"): NameObject("/Outlines")
+    })
+    reference1 = writer._add_object(dictionary1)
+    dictionary2 = DictionaryObject({
+        NameObject("/Type"): NameObject("/Outlines")
+    })
+    reference2 = writer._add_object(dictionary2)
+    dictionary3 = DictionaryObject({
+        NameObject("/First"): reference2,
+        NameObject("/Type"): NameObject("/Outlines")
+    })
+    reference3 = writer._add_object(dictionary3)
+    dictionary1[NameObject("/First")] = reference3
+    dictionary2[NameObject("/First")] = reference1
+
+    assert writer._get_filtered_outline(node=dictionary1, pages={}, reader=reader) == []
+    assert caplog.messages == ["Detected cycle in outlines."]
+
+
+@pytest.mark.timeout(5)
+def test_get_filtered_outline__next_first__cyclic(caplog) -> None:
+    writer = PdfWriter()
+    reader = PdfReader(RESOURCE_ROOT / "crazyones.pdf")
+
+    dictionary1 = DictionaryObject({
+        NameObject("/Title"): TextStringObject("test")
+    })
+    _reference1 = writer._add_object(dictionary1)
+    dictionary2 = DictionaryObject({
+        NameObject("/Type"): NameObject("/Outlines")
+    })
+    reference2 = writer._add_object(dictionary2)
+    dictionary1[NameObject("/Next")] = reference2
+    dictionary2[NameObject("/First")] = reference2
+
+    assert writer._get_filtered_outline(node=dictionary1, pages={}, reader=reader) == []
+    assert caplog.messages == ["Detected cycle in outlines."]
+
+
+@pytest.mark.timeout(5)
+def test_get_filtered_outline__next_next__cyclic(caplog) -> None:
+    writer = PdfWriter()
+    reader = PdfReader(RESOURCE_ROOT / "crazyones.pdf")
+
+    dictionary1 = DictionaryObject({
+        NameObject("/Title"): TextStringObject("test")
+    })
+    reference1 = writer._add_object(dictionary1)
+    dictionary2 = DictionaryObject({
+        NameObject("/Title"): TextStringObject("test")
+    })
+    reference2 = writer._add_object(dictionary2)
+    dictionary3 = DictionaryObject({
+        NameObject("/Next"): reference2,
+        NameObject("/Title"): TextStringObject("test")
+    })
+    reference3 = writer._add_object(dictionary3)
+    dictionary1[NameObject("/Next")] = reference3
+    dictionary2[NameObject("/Next")] = reference1
+
+    assert writer._get_filtered_outline(node=dictionary1, pages={}, reader=reader) == []
+    assert caplog.messages == ["Detected cycle in outlines."]
+
+
+def test_get_filtered_outline__node_is_none() -> None:
+    writer = PdfWriter()
+    reader = PdfReader(RESOURCE_ROOT / "crazyones.pdf")
+
+    assert writer._get_filtered_outline(node=None, pages={}, reader=reader) == []
Vulnerability mechanics
Root cause
"pypdf traversed PDF outline linked lists and page /Parent chains without tracking previously visited nodes, leading to an infinite loop on malformed cyclic input."
Attack vector
An attacker crafts a malicious PDF whose outline tree or page-object `/Parent` chain contains a cycle (e.g., a `/First` or `/Next` entry points back to an earlier outline node, or a page's `/Parent` chain eventually leads back to an object that was already traversed). When `pypdf` extracts text in layout mode, or merges a document with outline import enabled, the library follows these links without tracking visited nodes and never terminates. The attacker does not need authentication; the payload is delivered through any channel that feeds a PDF to the library (email, upload, etc.).
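The cyclic structure described above can be illustrated with plain Python dictionaries standing in for PDF objects. This is only a sketch: `build_cyclic_parent_chain` and `naive_walk` are hypothetical names, no pypdf types are involved, and the walk carries an artificial `max_steps` cap so that the demonstration terminates, which the pre-patch loop did not have.

```python
def build_cyclic_parent_chain() -> dict:
    """Build page -> a -> b -> page, a /Parent chain that loops back."""
    page = {"name": "page"}
    node_a = {"name": "a"}
    node_b = {"name": "b"}
    page["/Parent"] = node_a
    node_a["/Parent"] = node_b
    node_b["/Parent"] = page  # the cycle: b's parent is the starting page
    return page


def naive_walk(start: dict, max_steps: int) -> tuple[list[str], bool]:
    """Follow /Parent until it is missing, without tracking visited nodes.

    Returns the visited names and a flag telling whether the artificial
    step cap was hit (i.e. the walk would otherwise have kept going).
    """
    names = []
    node = start
    while node is not None:
        if len(names) >= max_steps:
            return names, True
        names.append(node["name"])
        node = node.get("/Parent")
    return names, False
```

On the cyclic chain the walk keeps revisiting `page`, `a`, `b` until the cap stops it; on an acyclic chain it ends as soon as a node has no `/Parent`.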
What the fix does
The patch adds a keyword-only `visited: Optional[set[int]]` parameter to `_get_filtered_outline` and a local `visited: set[int]` in `_layout_mode_fonts`. Each time the code reaches a node while following `/First`, `/Next`, or `/Parent` links, it checks whether `id(node)` is already in the set; if so, it logs a warning (`"Detected cycle in outlines."` or `"Detected cycle in /Parent hierarchy when retrieving fonts."`) and returns or breaks, otherwise it records the ID and continues. This bounds both the recursive outline traversal and the iterative `/Parent` walk. Both changed functions are private (underscore-prefixed), so the library’s public API is unchanged.
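The visited-set guard can be sketched independently of pypdf. The example below mirrors the shape of the patched `_layout_mode_fonts` loop, but it operates on plain dictionaries, uses the hypothetical name `collect_fonts`, and returns a flag instead of logging a warning; it is an illustration of the technique, not the library's code.

```python
def collect_fonts(start: dict) -> tuple[dict, bool]:
    """Walk the /Parent chain collecting /Font entries, stopping on a cycle.

    Returns the collected fonts and whether a cycle was detected.
    """
    fonts: dict = {}
    visited: set[int] = set()
    cycle_detected = False
    node = start
    while True:
        node_id = id(node)
        if node_id in visited:
            # Same object reached twice: the chain is cyclic, so stop.
            cycle_detected = True
            break
        visited.add(node_id)

        # Entries found on ancestors overwrite same-named entries found earlier.
        fonts.update(node.get("/Resources", {}).get("/Font", {}))

        if "/Parent" not in node:
            break
        node = node["/Parent"]
    return fonts, cycle_detected
```

Using `id()` as the key is safe here because every node stays referenced for the duration of the walk, so no identifier can be reused by a different object while the set is in use.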
Preconditions
- Input: The application uses pypdf to open a user-supplied PDF and either extracts text in layout mode or performs a merge that copies outlines.
Generated on Jun 16, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.
References (4)
News mentions (0)
No linked articles in our index yet.