pypdf has possible long runtimes for malformed startxref
Description
pypdf is a free and open-source pure-Python PDF library. Prior to version 6.6.0, an attacker can craft a PDF whose invalid startxref entry leads to possibly long runtimes: when pypdf rebuilds the cross-reference table, files containing large amounts of whitespace characters become problematic. Only the non-strict reading mode is affected. This issue has been patched in version 6.6.0.
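The slowdown comes from the backtracking regex that pre-6.6.0 pypdf used in `_rebuild_xref_table` to locate object headers. On a file that is mostly whitespace, every byte is a valid match start, and the greedy whitespace quantifier consumes the rest of the buffer before backtracking byte by byte, so total work grows quadratically with file size. A rough illustrative sketch (the regex is taken from the patched-out code; `scan_whitespace` is not part of pypdf):

```python
import re
import timeit

# The pre-6.6.0 pattern pypdf used to locate "<num> <gen> obj" headers
# when rebuilding a broken cross-reference table.
OBJECT_PATTERN = re.compile(rb"[\r\n \t][ \t]*(\d+)[ \t]+(\d+)[ \t]+obj")

def scan_whitespace(size: int) -> float:
    """Time one full scan over a whitespace-only buffer of `size` bytes.

    Each failed match position re-scans the remaining whitespace run,
    which is what makes whitespace-heavy malformed files so slow.
    """
    data = b" " * size
    return timeit.timeit(lambda: list(OBJECT_PATTERN.finditer(data)), number=1)
```

Doubling `size` roughly quadruples the scan time, which is why a few megabytes of whitespace after a malformed startxref can stall the reader.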
Affected packages
Versions sourced from the GitHub Security Advisory.
| Package | Affected versions | Patched versions |
|---|---|---|
| pypdf (PyPI) | < 6.6.0 | 6.6.0 |
Affected products
Patches
Commit 294165726b64: SEC: Improve handling of partially broken PDF files (#3594)
5 files changed · +236 −38
`docs/user/security.md` (+13 −0, modified)

```diff
@@ -4,6 +4,8 @@ We strive to provide a library with secure defaults.

 ## Configuration

+### Filters
+
 *pypdf* currently employs output size limits for some filters which are known to
 possibly have large compression ratios. The usual limit is at 75 MB of
 uncompressed data during decompression. If this is too low for your use case, and you are
@@ -15,6 +17,17 @@ aware of the possible side effects, you can modify the following constants which
 For JBIG2 images, there is a similar parameter to limit the memory usage during
 decoding: `pypdf.filters.JBIG2_MAX_OUTPUT_LENGTH` It defaults to 75 MB as well.

+### Reading
+
+*pypdf* currently employs the following reading limits on *PdfReader* instances:
+
+* `root_object_recovery_limit` limits the number of objects to read before stopping with Root object recovery in
+  non-strict mode. It defaults to 10 000. Setting it to `None` will fully disable this limit.
+
+If you want to employ custom limits for the *PdfWriter* as well, the currently preferred way
+is to initialize it from the reader, id est something like
+`PdfWriter(clone_from=PdfReader("file.pdf", root_object_recovery_limit=42))`.
+
 ## Reporting possible vulnerabilities

 Please refer to our [security policy](https://github.com/py-pdf/pypdf/security/policy).
```
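The `root_object_recovery_limit` behavior documented above can be pictured as a capped linear scan for the /Catalog object. A minimal sketch under stated assumptions: the `get_object` callable, plain dicts, and `RuntimeError` stand in for pypdf's internals and are not its actual API.

```python
import sys
from typing import Callable, Optional

def recover_root(
    get_object: Callable[[int], object],
    size: int,
    limit: Optional[int] = 10_000,
) -> Optional[dict]:
    """Scan object numbers 1..size for a /Catalog dictionary, giving up
    once `limit` objects have been queried (None disables the cap)."""
    cap = limit if isinstance(limit, int) else sys.maxsize
    for i in range(size):
        if i >= cap:
            raise RuntimeError("Maximum Root object recovery limit reached.")
        try:
            candidate = get_object(i + 1)
        except Exception:
            # Mirror pypdf's behavior: swallow errors, keep scanning.
            candidate = None
        if isinstance(candidate, dict) and candidate.get("/Type") == "/Catalog":
            return candidate
    return None
```

The cap bounds the worst case for a crafted trailer with a huge /Size; `None` restores the unbounded pre-6.6.0 behavior for callers who need it.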
`pypdf/_doc_common.py` (+4 −0, modified)

```diff
@@ -1170,7 +1170,11 @@ def _flatten(
         for attr in inheritable_page_attributes:
             if attr in pages:
                 inherit[attr] = pages[attr]
+        pages_reference = getattr(pages, "indirect_reference", object())
         for page in cast(ArrayObject, pages[PagesAttributes.KIDS]):
+            if getattr(page, "indirect_reference", object()) == pages_reference:
+                raise PdfReadError("Detected cyclic page references.")
+
             addt = {}
             if isinstance(page, IndirectObject):
                 addt["indirect_reference"] = page
```
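The added check guards `_flatten` against a /Kids entry that points back at its own parent node, which would otherwise recurse forever. The idea, with the page tree reduced to a plain dict (a hypothetical structure for illustration, not pypdf's API):

```python
def iter_leaf_pages(node, tree):
    """Yield leaf pages of a /Pages tree given as {node: kid_list or None},
    refusing a kid that references its own parent (a reference cycle)."""
    kids = tree.get(node)
    if kids is None:  # a leaf /Page object
        yield node
        return
    for kid in kids:
        if kid == node:
            # Same guard as the patch: a kid resolving to its parent.
            raise ValueError("Detected cyclic page references.")
        yield from iter_leaf_pages(kid, tree)
```

Like the patch, this only rejects a direct self-reference at each level; that is enough to break the infinite loop the test below constructs.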
`pypdf/_reader.py` (+112 −37, modified)

```diff
@@ -29,6 +29,7 @@
 import os
 import re
+import sys
 from collections.abc import Iterable
 from io import BytesIO, UnsupportedOperation
 from pathlib import Path
@@ -45,6 +46,7 @@
 from ._doc_common import PdfDocCommon, convert_to_int
 from ._encryption import Encryption, PasswordType
 from ._utils import (
+    WHITESPACES_AS_BYTES,
     StrByteType,
     StreamType,
     logger_warning,
@@ -58,6 +60,7 @@ from .errors import (
     EmptyFileError,
     FileNotDecryptedError,
+    LimitReachedError,
     PdfReadError,
     PdfStreamError,
     WrongPasswordError,
@@ -101,6 +104,9 @@ class PdfReader(PdfDocCommon):
         password: Decrypt PDF file at initialization. If the
             password is None, the file will not be decrypted.
             Defaults to ``None``.
+        root_object_recovery_limit: The maximum number of objects to query
+            for recovering the Root object in non-strict mode. To disable
+            this security measure, pass ``None``.

     """
@@ -109,6 +115,8 @@ def __init__(
         stream: Union[StrByteType, Path],
         strict: bool = False,
         password: Union[None, str, bytes] = None,
+        *,
+        root_object_recovery_limit: Optional[int] = 10_000,
     ) -> None:
         self.strict = strict
         self.flattened_pages: Optional[list[PageObject]] = None
@@ -123,6 +131,11 @@ def __init__(
         self.xref_objStm: dict[int, tuple[Any, Any]] = {}
         self.trailer = DictionaryObject()

+        # Security parameters.
+        self._root_object_recovery_limit = (
+            root_object_recovery_limit if isinstance(root_object_recovery_limit, int) else sys.maxsize
+        )
+
         # Map page indirect_reference number to page number
         self._page_id2num: Optional[dict[Any, Any]] = None
@@ -214,15 +227,17 @@ def root_object(self) -> DictionaryObject:
                 logger_warning("Invalid Root object in trailer", __name__)
         if self._validated_root is None:
             logger_warning('Searching object with "/Catalog" key', __name__)
-            nb = cast(int, self.trailer.get("/Size", 0))
-            for i in range(nb):
+            number_of_objects = cast(int, self.trailer.get("/Size", 0))
+            for i in range(number_of_objects):
+                if i >= self._root_object_recovery_limit:
+                    raise LimitReachedError("Maximum Root object recovery limit reached.")
                 try:
-                    o = self.get_object(i + 1)
+                    obj = self.get_object(i + 1)
                 except Exception:  # to be sure to capture all errors
-                    o = None
-                if isinstance(o, DictionaryObject) and o.get("/Type") == "/Catalog":
-                    self._validated_root = o
-                    logger_warning(f"Root found at {o.indirect_reference!r}", __name__)
+                    obj = None
+                if isinstance(obj, DictionaryObject) and obj.get("/Type") == "/Catalog":
+                    self._validated_root = obj
+                    logger_warning(f"Root found at {obj.indirect_reference!r}", __name__)
                     break
         if self._validated_root is None:
             if not is_null_or_none(root) and "/Pages" in cast(DictionaryObject, cast(PdfObject, root).get_object()):
@@ -1043,58 +1058,118 @@ def _get_xref_issues(stream: StreamType, startxref: int) -> int:
                 return 3
         return 0

+    @classmethod
+    def _find_pdf_objects(cls, data: bytes) -> Iterable[tuple[int, int, int]]:
+        index = 0
+        ord_0 = ord("0")
+        ord_9 = ord("9")
+        while True:
+            index = data.find(b" obj", index)
+            if index == -1:
+                return
+
+            index_before_space = index - 1
+
+            # Skip whitespace backwards
+            while index_before_space >= 0 and data[index_before_space] in WHITESPACES_AS_BYTES:
+                index_before_space -= 1
+
+            # Read generation number
+            generation_end = index_before_space + 1
+            while index_before_space >= 0 and ord_0 <= data[index_before_space] <= ord_9:
+                index_before_space -= 1
+            generation_start = index_before_space + 1
+
+            # Skip whitespace
+            while index_before_space >= 0 and data[index_before_space] in WHITESPACES_AS_BYTES:
+                index_before_space -= 1
+
+            # Read object number
+            object_end = index_before_space + 1
+            while index_before_space >= 0 and ord_0 <= data[index_before_space] <= ord_9:
+                index_before_space -= 1
+            object_start = index_before_space + 1
+
+            # Validate
+            if object_start < object_end and generation_start < generation_end:
+                object_number = int(data[object_start:object_end])
+                generation_number = int(data[generation_start:generation_end])
+
+                yield object_number, generation_number, object_start
+
+            index += 4  # len(b" obj")
+
+    @classmethod
+    def _find_pdf_trailers(cls, data: bytes) -> Iterable[int]:
+        index = 0
+        data_length = len(data)
+        while True:
+            index = data.find(b"trailer", index)
+            if index == -1:
+                return
+
+            index_after_trailer = index + 7  # len(b"trailer")
+
+            # Skip whitespace
+            while index_after_trailer < data_length and data[index_after_trailer] in WHITESPACES_AS_BYTES:
+                index_after_trailer += 1
+
+            # Must be dictionary start
+            if index_after_trailer + 1 < data_length and data[index_after_trailer:index_after_trailer+2] == b"<<":
+                yield index_after_trailer  # offset of '<<'
+
+            index += 7  # len(b"trailer")

     def _rebuild_xref_table(self, stream: StreamType) -> None:
         self.xref = {}
         stream.seek(0, 0)
-        f_ = stream.read(-1)
+        stream_data = stream.read(-1)

-        for m in re.finditer(rb"[\r\n \t][ \t]*(\d+)[ \t]+(\d+)[ \t]+obj", f_):
-            idnum = int(m.group(1))
-            generation = int(m.group(2))
-            if generation not in self.xref:
-                self.xref[generation] = {}
-            self.xref[generation][idnum] = m.start(1)
+        for object_number, generation_number, object_start in self._find_pdf_objects(stream_data):
+            if generation_number not in self.xref:
+                self.xref[generation_number] = {}
+            self.xref[generation_number][object_number] = object_start

         logger_warning("parsing for Object Streams", __name__)
-        for g in self.xref:
-            for i in self.xref[g]:
+        for generation_number in self.xref:
+            for object_number in self.xref[generation_number]:
                 # get_object in manual
-                stream.seek(self.xref[g][i], 0)
+                stream.seek(self.xref[generation_number][object_number], 0)
                 try:
                     _ = self.read_object_header(stream)
-                    o = cast(StreamObject, read_object(stream, self))
-                    if o.get("/Type", "") != "/ObjStm":
+                    obj = cast(StreamObject, read_object(stream, self))
+                    if obj.get("/Type", "") != "/ObjStm":
                         continue
-                    strm = BytesIO(o.get_data())
-                    cpt = 0
+                    object_stream = BytesIO(obj.get_data())
+                    actual_count = 0
                     while True:
-                        s = read_until_whitespace(strm)
-                        if not s.isdigit():
+                        current = read_until_whitespace(object_stream)
+                        if not current.isdigit():
                             break
-                        _i = int(s)
-                        skip_over_whitespace(strm)
-                        strm.seek(-1, 1)
-                        s = read_until_whitespace(strm)
-                        if not s.isdigit():  # pragma: no cover
+                        inner_object_number = int(current)
+                        skip_over_whitespace(object_stream)
+                        object_stream.seek(-1, 1)
+                        current = read_until_whitespace(object_stream)
+                        if not current.isdigit():  # pragma: no cover
                             break  # pragma: no cover
-                        _o = int(s)
-                        self.xref_objStm[_i] = (i, _o)
-                        cpt += 1
-                    if cpt != o.get("/N"):  # pragma: no cover
+                        inner_generation_number = int(current)
+                        self.xref_objStm[inner_object_number] = (object_number, inner_generation_number)
+                        actual_count += 1
+                    if actual_count != obj.get("/N"):  # pragma: no cover
                         logger_warning(  # pragma: no cover
-                            f"found {cpt} objects within Object({i},{g})"
-                            f" whereas {o.get('/N')} expected",
+                            f"found {actual_count} objects within Object({object_number},{generation_number})"
+                            f" whereas {obj.get('/N')} expected",
                             __name__,
                         )
                 except Exception:  # could be multiple causes
                     pass

         stream.seek(0, 0)
-        for m in re.finditer(rb"[\r\n \t][ \t]*trailer[\r\n \t]*(<<)", f_):
-            stream.seek(m.start(1), 0)
+        for position in self._find_pdf_trailers(stream_data):
+            stream.seek(position, 0)
             new_trailer = cast(dict[Any, Any], read_object(stream, self))
             # Here, we are parsing the file from start to end, the new data have to erase the existing.
-            for key, value in list(new_trailer.items()):
+            for key, value in new_trailer.items():
                 self.trailer[key] = value

     def _read_xref_subsections(
```
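The patch replaces both regex scans in `_rebuild_xref_table` with linear `bytes.find`-based scanners. A self-contained sketch of the object-header scanner in the same spirit; `PDF_WHITESPACE` is my approximation of pypdf's `WHITESPACES_AS_BYTES`, and the yielded offset points at the first digit of the object number:

```python
from collections.abc import Iterator

PDF_WHITESPACE = b"\x00\t\n\x0c\r "  # NUL, HT, LF, FF, CR, space

def find_pdf_objects(data: bytes) -> Iterator[tuple[int, int, int]]:
    """Yield (object_number, generation_number, offset) for each
    "<num> <gen> obj" header, walking the buffer once with bytes.find
    instead of a backtracking regex."""
    index = 0
    while True:
        index = data.find(b" obj", index)
        if index == -1:
            return
        pos = index - 1
        # Skip any extra whitespace before the " obj" we found.
        while pos >= 0 and data[pos] in PDF_WHITESPACE:
            pos -= 1
        # Read the generation number backwards.
        generation_end = pos + 1
        while pos >= 0 and ord("0") <= data[pos] <= ord("9"):
            pos -= 1
        generation_start = pos + 1
        # Skip the whitespace between the two numbers.
        while pos >= 0 and data[pos] in PDF_WHITESPACE:
            pos -= 1
        # Read the object number backwards.
        object_end = pos + 1
        while pos >= 0 and ord("0") <= data[pos] <= ord("9"):
            pos -= 1
        object_start = pos + 1
        # Only yield when both numbers are actually present.
        if object_start < object_end and generation_start < generation_end:
            yield (int(data[object_start:object_end]),
                   int(data[generation_start:generation_end]),
                   object_start)
        index += 4  # len(b" obj")
```

Because `bytes.find` never revisits bytes and the backward scans are bounded by the header itself, the whole pass stays linear even on whitespace-heavy input.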
`tests/test_doc_common.py` (+18 −1, modified)

```diff
@@ -11,7 +11,8 @@ import pytest

 from pypdf import PdfReader, PdfWriter
-from pypdf.generic import EmbeddedFile, NullObject, TextStringObject, ViewerPreferences
+from pypdf.errors import PdfReadError
+from pypdf.generic import EmbeddedFile, NameObject, NullObject, TextStringObject, ViewerPreferences
 from tests import get_data_from_url

 TESTS_ROOT = Path(__file__).parent.resolve()
@@ -449,3 +450,19 @@ def test_outline__issue3462():
         "Page 1",
         "Page 2"
     ]
+
+
+def test_flatten__cyclic_references():
+    path = RESOURCES_ROOT / "crazyones.pdf"
+
+    reader = PdfReader(path)
+    assert len(reader.pages) == 1
+    reader._flatten()
+
+    # Make the first child point to the object itself.
+    pages_object = reader.get_object(10)
+    pages_object[NameObject("/Kids")][0].indirect_reference.idnum = 10
+    reader.resolved_objects[(10, 0)] = pages_object
+
+    with pytest.raises(expected_exception=PdfReadError, match=r"^Detected cyclic page references\.$"):
+        reader._flatten()
```
`tests/test_reader.py` (+89 −0, modified)

```diff
@@ -1,5 +1,6 @@
 """Test the pypdf._reader module."""
 import io
+import sys
 import time
 from io import BytesIO
 from pathlib import Path
@@ -17,6 +18,7 @@
     DeprecationError,
     EmptyFileError,
     FileNotDecryptedError,
+    LimitReachedError,
     PdfReadError,
     PdfStreamError,
     WrongPasswordError,
@@ -1889,3 +1891,90 @@ def test_read_standard_xref_table__two_whitespace_characters_between_offset_and_
     reader = PdfReader(BytesIO(get_data_from_url(url, name=name)))
     assert len(reader.pages) == 1
     assert reader.pages[0].extract_text() == "Hello World!"
+
+
+@pytest.mark.enable_socket
+def test_root_object_recovery_limit(caplog):
+    url = "https://github.com/user-attachments/files/24525509/root_object_recovery_limit.pdf"
+    name = "root_object_recovery_limit.pdf"
+    data = get_data_from_url(url, name=name)
+
+    # Default limit.
+    reader = PdfReader(BytesIO(data))
+    with pytest.raises(
+        expected_exception=LimitReachedError, match=r"^Maximum Root object recovery limit reached\.$"
+    ):
+        _ = list(reader.pages)
+    message_numbers = {
+        int(message.split(" ", maxsplit=2)[1])
+        for message in caplog.messages
+        if message.startswith("Object ") and message.endswith(" 0 not defined.")
+    }
+    assert sorted(message_numbers) == list(range(5, 10001))
+
+    # Custom limit.
+    caplog.clear()
+    reader = PdfReader(BytesIO(data), root_object_recovery_limit=42)
+    with pytest.raises(
+        expected_exception=LimitReachedError, match=r"^Maximum Root object recovery limit reached\.$"
+    ):
+        _ = list(reader.pages)
+    message_numbers = {
+        int(message.split(" ", maxsplit=2)[1])
+        for message in caplog.messages
+        if message.startswith("Object ") and message.endswith(" 0 not defined.")
+    }
+    assert sorted(message_numbers) == list(range(5, 43))

+    # No limit. Do not run actual process for speed reasons.
+    reader = PdfReader(BytesIO(data), root_object_recovery_limit=None)
+    assert reader._root_object_recovery_limit == sys.maxsize
+
+    # Strict mode.
+    with pytest.raises(expected_exception=PdfReadError, match=r"^Broken xref table$"):
+        reader = PdfReader(BytesIO(data), strict=True)
+        _ = list(reader.pages)
+
+
+@pytest.mark.timeout(10)
+def test_rebuild_xref_table__speed():
+    total_len = 2_000_790
+    middle = b"\nstartxref 1\n % "
+    leading_len = 0x55E  # 1374
+    leading = b" " * leading_len
+    trailing = b" " * (total_len - leading_len - len(middle))
+    data = leading + middle + trailing
+
+    reader = PdfReader(BytesIO(data))
+    with pytest.raises(expected_exception=PdfReadError, match=r"^Cannot find Root object in pdf$"):
+        _ = list(reader.pages)
+
+
+def test_find_pdf_objects():
+    data = (
+        b"      \n"
+        b" 11 0 obj\n"
+        b"  12 0 obj\n"
+        b"13 1 obj\n"
+        b"ob\n"
+        b"ab obj\n"
+        b"  42 1337 obj \n"
+        b"\n"
+    )
+
+    result = list(PdfReader._find_pdf_objects(data))
+    assert result == [(11, 0, 8), (12, 0, 19), (13, 1, 28), (42, 1337, 49)]
+
+
+@pytest.mark.parametrize(
+    ("data", "expected"),
+    [
+        (b"\n\ntrailer", []),
+        (b"\n\ntrailer abc", []),
+        (b"\n\ntrailer <<", [10]),
+        (b"\n\ntrailer << /Key null >>\n\n trailer  << /Key 42 >>\n", [10, 37])
+    ]
+)
+def test_find_pdf_trailers(data: bytes, expected: list[int]):
+    result = list(PdfReader._find_pdf_trailers(data))
+    assert result == expected
```

Note: exact whitespace runs in the test fixtures above are reconstructed to match the asserted offsets, since the original rendering collapsed them.
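The trailer scanner from the patch works the same way in the forward direction: find each `trailer` keyword, skip whitespace, and accept it only if a dictionary opener follows. A self-contained sketch (again with an assumed `PDF_WHITESPACE` constant standing in for pypdf's `WHITESPACES_AS_BYTES`):

```python
from collections.abc import Iterator

PDF_WHITESPACE = b"\x00\t\n\x0c\r "  # NUL, HT, LF, FF, CR, space

def find_pdf_trailers(data: bytes) -> Iterator[int]:
    """Yield the offset of the "<<" dictionary opener following each
    "trailer" keyword; keywords without a dictionary are skipped."""
    index = 0
    length = len(data)
    while True:
        index = data.find(b"trailer", index)
        if index == -1:
            return
        pos = index + 7  # len(b"trailer")
        # Skip whitespace between the keyword and the dictionary.
        while pos < length and data[pos] in PDF_WHITESPACE:
            pos += 1
        if data[pos:pos + 2] == b"<<":
            yield pos
        index += 7
```

Yielded offsets can be handed straight to a parser that reads the trailer dictionary, which is how the patched `_rebuild_xref_table` consumes them.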
References
- github.com/advisories/GHSA-4f6g-68pf-7vhv (advisory)
- nvd.nist.gov/vuln/detail/CVE-2026-22691 (advisory)
- github.com/py-pdf/pypdf/commit/294165726b646bb7799be1cc787f593f2fdbcf45 (fix commit)
- github.com/py-pdf/pypdf/pull/3594 (pull request)
- github.com/py-pdf/pypdf/releases/tag/6.6.0 (release)
- github.com/py-pdf/pypdf/security/advisories/GHSA-4f6g-68pf-7vhv (vendor advisory)