VYPR
Low severity · OSV Advisory · Published Jan 10, 2026 · Updated Jan 12, 2026

pypdf has possible long runtimes for malformed startxref

CVE-2026-22691

Description

pypdf is a free and open-source pure-Python PDF library. Prior to version 6.6.0, a malformed startxref entry can cause long runtimes. An attacker can exploit this by crafting a PDF with an invalid startxref entry. When pypdf rebuilds the cross-reference table for such a file, large runs of whitespace characters make the rebuild slow. Only the non-strict reading mode is affected. This issue has been patched in version 6.6.0.
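
The description does not say why whitespace is expensive. One plausible reading of the fix commit is that the regular expression removed from `_rebuild_xref_table` backtracks heavily: at every whitespace byte, `[ \t]*` consumes the rest of the run and then gives it back one byte at a time before the match fails. The sketch below times that pattern on runs of spaces using only the standard library. The helper name `scan_seconds` and the input sizes are illustrative choices, not part of pypdf, and timings will vary by machine and Python version.

```python
import re
import time

# Pattern copied from the lines removed by the fix commit; it was used to
# locate "N G obj" headers when rebuilding the cross-reference table.
OLD_OBJECT_PATTERN = re.compile(rb"[\r\n \t][ \t]*(\d+)[ \t]+(\d+)[ \t]+obj")


def scan_seconds(data: bytes) -> float:
    """Return the wall-clock time needed to run the old pattern over `data`."""
    start = time.perf_counter()
    matches = list(OLD_OBJECT_PATTERN.finditer(data))
    elapsed = time.perf_counter() - start
    # The inputs used here are pure whitespace, so nothing should match.
    assert matches == []
    return elapsed


if __name__ == "__main__":
    # Doubling the input size shows how the runtime grows.
    for size in (1_000, 2_000, 4_000, 8_000):
        print(f"{size:>6} spaces: {scan_seconds(b' ' * size):.4f} s")
```

If the measured time roughly quadruples each time the input doubles, that is consistent with quadratic growth, which would explain why a file of a few megabytes of whitespace becomes a problem.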

Affected packages

Versions sourced from the GitHub Security Advisory.

Package        Affected versions   Patched versions
pypdf (PyPI)   < 6.6.0             6.6.0
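
To check an environment against the version range above, a small script along these lines may help. It is a minimal sketch that uses only the standard library; `parse_release` and `is_patched` are hypothetical helper names, and the parser deliberately ignores pre-release suffixes, so treat it as a rough check rather than a full version comparison.

```python
from importlib import metadata

PATCHED_VERSION = (6, 6, 0)  # first fixed release according to the advisory


def parse_release(version: str) -> tuple[int, ...]:
    """Parse the leading numeric components of a version such as '6.5.0'."""
    parts: list[int] = []
    for piece in version.split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        parts.append(int(digits))
    # Pad short versions such as '6.6' so that tuple comparison behaves.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_patched(version: str) -> bool:
    """Return True if `version` is 6.6.0 or later."""
    return parse_release(version) >= PATCHED_VERSION


if __name__ == "__main__":
    try:
        installed = metadata.version("pypdf")
    except metadata.PackageNotFoundError:
        print("pypdf is not installed")
    else:
        status = "patched" if is_patched(installed) else "affected (< 6.6.0)"
        print(f"pypdf {installed}: {status}")
```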

Patches

1
294165726b64

SEC: Improve handling of partially broken PDF files (#3594)

https://github.com/py-pdf/pypdf · Stefan · Jan 9, 2026 · via ghsa
5 files changed · +236 -38
  • docs/user/security.md · +13 -0 · modified
    @@ -4,6 +4,8 @@ We strive to provide a library with secure defaults.
     
     ## Configuration
     
    +### Filters
    +
     *pypdf* currently employs output size limits for some filters which are known to possibly have large compression ratios.
     
     The usual limit is at 75 MB of uncompressed data during decompression. If this is too low for your use case, and you are
    @@ -15,6 +17,17 @@ aware of the possible side effects, you can modify the following constants which
     For JBIG2 images, there is a similar parameter to limit the memory usage during decoding: `pypdf.filters.JBIG2_MAX_OUTPUT_LENGTH`
     It defaults to 75 MB as well.
     
    +### Reading
    +
    +*pypdf* currently employs the following reading limits on *PdfReader* instances:
    +
    +* `root_object_recovery_limit` limits the number of objects to read before stopping with Root object recovery in
    +  non-strict mode. It defaults to 10 000. Setting it to `None` will fully disable this limit.
    +
    +If you want to employ custom limits for the *PdfWriter* as well, the currently preferred way
    +is to initialize it from the reader, id est something like
    +`PdfWriter(clone_from=PdfReader("file.pdf", root_object_recovery_limit=42))`.
    +
     ## Reporting possible vulnerabilities
     
     Please refer to our [security policy](https://github.com/py-pdf/pypdf/security/policy).
    
  • pypdf/_doc_common.py · +4 -0 · modified
    @@ -1170,7 +1170,11 @@ def _flatten(
                 for attr in inheritable_page_attributes:
                     if attr in pages:
                         inherit[attr] = pages[attr]
    +            pages_reference = getattr(pages, "indirect_reference", object())
                 for page in cast(ArrayObject, pages[PagesAttributes.KIDS]):
    +                if getattr(page, "indirect_reference", object()) == pages_reference:
    +                    raise PdfReadError("Detected cyclic page references.")
    +
                     addt = {}
                     if isinstance(page, IndirectObject):
                         addt["indirect_reference"] = page
    
  • pypdf/_reader.py · +112 -37 · modified
    @@ -29,6 +29,7 @@
     
     import os
     import re
    +import sys
     from collections.abc import Iterable
     from io import BytesIO, UnsupportedOperation
     from pathlib import Path
    @@ -45,6 +46,7 @@
     from ._doc_common import PdfDocCommon, convert_to_int
     from ._encryption import Encryption, PasswordType
     from ._utils import (
    +    WHITESPACES_AS_BYTES,
         StrByteType,
         StreamType,
         logger_warning,
    @@ -58,6 +60,7 @@
     from .errors import (
         EmptyFileError,
         FileNotDecryptedError,
    +    LimitReachedError,
         PdfReadError,
         PdfStreamError,
         WrongPasswordError,
    @@ -101,6 +104,9 @@ class PdfReader(PdfDocCommon):
             password: Decrypt PDF file at initialization. If the
                 password is None, the file will not be decrypted.
                 Defaults to ``None``.
    +        root_object_recovery_limit: The maximum number of objects to query
    +            for recovering the Root object in non-strict mode. To disable
    +            this security measure, pass ``None``.
     
         """
     
    @@ -109,6 +115,8 @@ def __init__(
             stream: Union[StrByteType, Path],
             strict: bool = False,
             password: Union[None, str, bytes] = None,
    +        *,
    +        root_object_recovery_limit: Optional[int] = 10_000,
         ) -> None:
             self.strict = strict
             self.flattened_pages: Optional[list[PageObject]] = None
    @@ -123,6 +131,11 @@ def __init__(
             self.xref_objStm: dict[int, tuple[Any, Any]] = {}
             self.trailer = DictionaryObject()
     
    +        # Security parameters.
    +        self._root_object_recovery_limit = (
    +            root_object_recovery_limit if isinstance(root_object_recovery_limit, int) else sys.maxsize
    +        )
    +
             # Map page indirect_reference number to page number
             self._page_id2num: Optional[dict[Any, Any]] = None
     
    @@ -214,15 +227,17 @@ def root_object(self) -> DictionaryObject:
                 logger_warning("Invalid Root object in trailer", __name__)
             if self._validated_root is None:
                 logger_warning('Searching object with "/Catalog" key', __name__)
    -            nb = cast(int, self.trailer.get("/Size", 0))
    -            for i in range(nb):
    +            number_of_objects = cast(int, self.trailer.get("/Size", 0))
    +            for i in range(number_of_objects):
    +                if i >= self._root_object_recovery_limit:
    +                    raise LimitReachedError("Maximum Root object recovery limit reached.")
                     try:
    -                    o = self.get_object(i + 1)
    +                    obj = self.get_object(i + 1)
                     except Exception:  # to be sure to capture all errors
    -                    o = None
    -                if isinstance(o, DictionaryObject) and o.get("/Type") == "/Catalog":
    -                    self._validated_root = o
    -                    logger_warning(f"Root found at {o.indirect_reference!r}", __name__)
    +                    obj = None
    +                if isinstance(obj, DictionaryObject) and obj.get("/Type") == "/Catalog":
    +                    self._validated_root = obj
    +                    logger_warning(f"Root found at {obj.indirect_reference!r}", __name__)
                         break
             if self._validated_root is None:
                 if not is_null_or_none(root) and "/Pages" in cast(DictionaryObject, cast(PdfObject, root).get_object()):
    @@ -1043,58 +1058,118 @@ def _get_xref_issues(stream: StreamType, startxref: int) -> int:
                     return 3
             return 0
     
    +    @classmethod
    +    def _find_pdf_objects(cls, data: bytes) -> Iterable[tuple[int, int, int]]:
    +        index = 0
    +        ord_0 = ord("0")
    +        ord_9 = ord("9")
    +        while True:
    +            index = data.find(b" obj", index)
    +            if index == -1:
    +                return
    +
    +            index_before_space = index - 1
    +
    +            # Skip whitespace backwards
    +            while index_before_space >= 0 and data[index_before_space] in WHITESPACES_AS_BYTES:
    +                index_before_space -= 1
    +
    +            # Read generation number
    +            generation_end = index_before_space + 1
    +            while index_before_space >= 0 and ord_0 <= data[index_before_space] <= ord_9:
    +                index_before_space -= 1
    +            generation_start = index_before_space + 1
    +
    +            # Skip whitespace
    +            while index_before_space >= 0 and data[index_before_space] in WHITESPACES_AS_BYTES:
    +                index_before_space -= 1
    +
    +            # Read object number
    +            object_end = index_before_space + 1
    +            while index_before_space >= 0 and ord_0 <= data[index_before_space] <= ord_9:
    +                index_before_space -= 1
    +            object_start = index_before_space + 1
    +
    +            # Validate
    +            if object_start < object_end and generation_start < generation_end:
    +                object_number = int(data[object_start:object_end])
    +                generation_number = int(data[generation_start:generation_end])
    +
    +                yield object_number, generation_number, object_start
    +
    +            index += 4  # len(b" obj")
    +
    +    @classmethod
    +    def _find_pdf_trailers(cls, data: bytes) -> Iterable[int]:
    +        index = 0
    +        data_length = len(data)
    +        while True:
    +            index = data.find(b"trailer", index)
    +            if index == -1:
    +                return
    +
    +            index_after_trailer = index + 7  # len(b"trailer")
    +
    +            # Skip whitespace
    +            while index_after_trailer < data_length and data[index_after_trailer] in WHITESPACES_AS_BYTES:
    +                index_after_trailer += 1
    +
    +            # Must be dictionary start
    +            if index_after_trailer + 1 < data_length and data[index_after_trailer:index_after_trailer+2] == b"<<":
    +                yield index_after_trailer  # offset of '<<'
    +
    +            index += 7  # len(b"trailer")
    +
         def _rebuild_xref_table(self, stream: StreamType) -> None:
             self.xref = {}
             stream.seek(0, 0)
    -        f_ = stream.read(-1)
    +        stream_data = stream.read(-1)
     
    -        for m in re.finditer(rb"[\r\n \t][ \t]*(\d+)[ \t]+(\d+)[ \t]+obj", f_):
    -            idnum = int(m.group(1))
    -            generation = int(m.group(2))
    -            if generation not in self.xref:
    -                self.xref[generation] = {}
    -            self.xref[generation][idnum] = m.start(1)
    +        for object_number, generation_number, object_start in self._find_pdf_objects(stream_data):
    +            if generation_number not in self.xref:
    +                self.xref[generation_number] = {}
    +            self.xref[generation_number][object_number] = object_start
     
             logger_warning("parsing for Object Streams", __name__)
    -        for g in self.xref:
    -            for i in self.xref[g]:
    +        for generation_number in self.xref:
    +            for object_number in self.xref[generation_number]:
                     # get_object in manual
    -                stream.seek(self.xref[g][i], 0)
    +                stream.seek(self.xref[generation_number][object_number], 0)
                     try:
                         _ = self.read_object_header(stream)
    -                    o = cast(StreamObject, read_object(stream, self))
    -                    if o.get("/Type", "") != "/ObjStm":
    +                    obj = cast(StreamObject, read_object(stream, self))
    +                    if obj.get("/Type", "") != "/ObjStm":
                             continue
    -                    strm = BytesIO(o.get_data())
    -                    cpt = 0
    +                    object_stream = BytesIO(obj.get_data())
    +                    actual_count = 0
                         while True:
    -                        s = read_until_whitespace(strm)
    -                        if not s.isdigit():
    +                        current = read_until_whitespace(object_stream)
    +                        if not current.isdigit():
                                 break
    -                        _i = int(s)
    -                        skip_over_whitespace(strm)
    -                        strm.seek(-1, 1)
    -                        s = read_until_whitespace(strm)
    -                        if not s.isdigit():  # pragma: no cover
    +                        inner_object_number = int(current)
    +                        skip_over_whitespace(object_stream)
    +                        object_stream.seek(-1, 1)
    +                        current = read_until_whitespace(object_stream)
    +                        if not current.isdigit():  # pragma: no cover
                                 break  # pragma: no cover
    -                        _o = int(s)
    -                        self.xref_objStm[_i] = (i, _o)
    -                        cpt += 1
    -                    if cpt != o.get("/N"):  # pragma: no cover
    +                        inner_generation_number = int(current)
    +                        self.xref_objStm[inner_object_number] = (object_number, inner_generation_number)
    +                        actual_count += 1
    +                    if actual_count != obj.get("/N"):  # pragma: no cover
                             logger_warning(  # pragma: no cover
    -                            f"found {cpt} objects within Object({i},{g})"
    -                            f" whereas {o.get('/N')} expected",
    +                            f"found {actual_count} objects within Object({object_number},{generation_number})"
    +                            f" whereas {obj.get('/N')} expected",
                                 __name__,
                             )
                     except Exception:  # could be multiple causes
                         pass
     
             stream.seek(0, 0)
    -        for m in re.finditer(rb"[\r\n \t][ \t]*trailer[\r\n \t]*(<<)", f_):
    -            stream.seek(m.start(1), 0)
    +        for position in self._find_pdf_trailers(stream_data):
    +            stream.seek(position, 0)
                 new_trailer = cast(dict[Any, Any], read_object(stream, self))
                 # Here, we are parsing the file from start to end, the new data have to erase the existing.
    -            for key, value in list(new_trailer.items()):
    +            for key, value in new_trailer.items():
                     self.trailer[key] = value
     
         def _read_xref_subsections(
    
  • tests/test_doc_common.py · +18 -1 · modified
    @@ -11,7 +11,8 @@
     import pytest
     
     from pypdf import PdfReader, PdfWriter
    -from pypdf.generic import EmbeddedFile, NullObject, TextStringObject, ViewerPreferences
    +from pypdf.errors import PdfReadError
    +from pypdf.generic import EmbeddedFile, NameObject, NullObject, TextStringObject, ViewerPreferences
     from tests import get_data_from_url
     
     TESTS_ROOT = Path(__file__).parent.resolve()
    @@ -449,3 +450,19 @@ def test_outline__issue3462():
             "Page 1",
             "Page 2"
         ]
    +
    +
    +def test_flatten__cyclic_references():
    +    path = RESOURCES_ROOT / "crazyones.pdf"
    +
    +    reader = PdfReader(path)
    +    assert len(reader.pages) == 1
    +    reader._flatten()
    +
    +    # Make the first child point to the object itself.
    +    pages_object = reader.get_object(10)
    +    pages_object[NameObject("/Kids")][0].indirect_reference.idnum = 10
    +    reader.resolved_objects[(10, 0)] = pages_object
    +
    +    with pytest.raises(expected_exception=PdfReadError, match=r"^Detected cyclic page references\.$"):
    +        reader._flatten()
    
  • tests/test_reader.py · +89 -0 · modified
    @@ -1,5 +1,6 @@
     """Test the pypdf._reader module."""
     import io
    +import sys
     import time
     from io import BytesIO
     from pathlib import Path
    @@ -17,6 +18,7 @@
         DeprecationError,
         EmptyFileError,
         FileNotDecryptedError,
    +    LimitReachedError,
         PdfReadError,
         PdfStreamError,
         WrongPasswordError,
    @@ -1889,3 +1891,90 @@ def test_read_standard_xref_table__two_whitespace_characters_between_offset_and_
         reader = PdfReader(BytesIO(get_data_from_url(url, name=name)))
         assert len(reader.pages) == 1
         assert reader.pages[0].extract_text() == "Hello World!"
    +
    +
    +@pytest.mark.enable_socket
    +def test_root_object_recovery_limit(caplog):
    +    url = "https://github.com/user-attachments/files/24525509/root_object_recovery_limit.pdf"
    +    name = "root_object_recovery_limit.pdf"
    +    data = get_data_from_url(url, name=name)
    +
    +    # Default limit.
    +    reader = PdfReader(BytesIO(data))
    +    with pytest.raises(
    +            expected_exception=LimitReachedError, match=r"^Maximum Root object recovery limit reached\.$"
    +    ):
    +        _ = list(reader.pages)
    +    message_numbers = {
    +        int(message.split(" ", maxsplit=2)[1])
    +        for message in caplog.messages
    +        if message.startswith("Object ") and message.endswith(" 0 not defined.")
    +    }
    +    assert sorted(message_numbers) == list(range(5, 10001))
    +
    +    # Custom limit.
    +    caplog.clear()
    +    reader = PdfReader(BytesIO(data), root_object_recovery_limit=42)
    +    with pytest.raises(
    +            expected_exception=LimitReachedError, match=r"^Maximum Root object recovery limit reached\.$"
    +    ):
    +        _ = list(reader.pages)
    +    message_numbers = {
    +        int(message.split(" ", maxsplit=2)[1])
    +        for message in caplog.messages
    +        if message.startswith("Object ") and message.endswith(" 0 not defined.")
    +    }
    +    assert sorted(message_numbers) == list(range(5, 43))
    +
    +    # No limit. Do not run actual process for speed reasons.
    +    reader = PdfReader(BytesIO(data), root_object_recovery_limit=None)
    +    assert reader._root_object_recovery_limit == sys.maxsize
    +
    +    # Strict mode.
    +    with pytest.raises(expected_exception=PdfReadError, match=r"^Broken xref table$"):
    +        reader = PdfReader(BytesIO(data), strict=True)
    +        _ = list(reader.pages)
    +
    +
    +@pytest.mark.timeout(10)
    +def test_rebuild_xref_table__speed():
    +    total_len = 2_000_790
    +    middle = b"\nstartxref   1\n % "
    +    leading_len = 0x55E  # 1374
    +    leading = b" " * leading_len
    +    trailing = b" " * (total_len - leading_len - len(middle))
    +    data = leading + middle + trailing
    +
    +    reader = PdfReader(BytesIO(data))
    +    with pytest.raises(expected_exception=PdfReadError, match=r"^Cannot find Root object in pdf$"):
    +        _ = list(reader.pages)
    +
    +
    +def test_find_pdf_objects():
    +    data = (
    +        b"     \n"
    +        b"  11 0 obj\n"
    +        b"  12 0 obj\n"
    +        b"13  1  obj\n"
    +        b"ob\n"
    +        b"ab obj\n"
    +        b"42 1337 obj \n"
    +        b"\n"
    +    )
    +
    +    result = list(PdfReader._find_pdf_objects(data))
    +    assert result == [(11, 0, 8), (12, 0, 19), (13, 1, 28), (42, 1337, 49)]
    +
    +
    +@pytest.mark.parametrize(
    +    ("data", "expected"),
    +    [
    +        (b"\n\ntrailer", []),
    +        (b"\n\ntrailer abc", []),
    +        (b"\n\ntrailer <<", [10]),
    +        (b"\n\ntrailer << /Key null >>\n\n  trailer << /Key 42 >>\n", [10, 37])
    +    ]
    +)
    +def test_find_pdf_trailers(data: bytes, expected: list[int]):
    +    result = list(PdfReader._find_pdf_trailers(data))
    +    assert result == expected
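
The object scan that replaces the regular expression in `_rebuild_xref_table` can be tried outside pypdf. The following is a simplified standalone restatement of the idea, not pypdf's own code: find each `" obj"` with `bytes.find`, then walk backwards over whitespace and digits to recover the generation and object numbers. The names `find_objects`, `skip_back`, and `find_objects_with_old_regex` are hypothetical, and `PDF_WHITESPACE` is assumed to match pypdf's `WHITESPACES_AS_BYTES`. The sample data is taken from `test_find_pdf_objects` in the patch.

```python
import re

# PDF whitespace bytes (NUL, TAB, LF, FF, CR, SPACE); assumed to match
# pypdf's WHITESPACES_AS_BYTES constant.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "
DIGITS = b"0123456789"

# Pattern removed by the fix commit, kept here only for comparison.
OLD_OBJECT_PATTERN = re.compile(rb"[\r\n \t][ \t]*(\d+)[ \t]+(\d+)[ \t]+obj")

SAMPLE = (
    b"     \n"
    b"  11 0 obj\n"
    b"  12 0 obj\n"
    b"13  1  obj\n"
    b"ob\n"
    b"ab obj\n"
    b"42 1337 obj \n"
    b"\n"
)


def skip_back(data: bytes, index: int, allowed: bytes) -> int:
    """Move `index` left while data[index] is one of the `allowed` bytes."""
    while index >= 0 and data[index] in allowed:
        index -= 1
    return index


def find_objects(data: bytes) -> list[tuple[int, int, int]]:
    """Return (object number, generation, offset) for each 'N G obj' header."""
    found = []
    index = data.find(b" obj")
    while index != -1:
        # Walk backwards: whitespace, generation digits, whitespace, object digits.
        cursor = skip_back(data, index - 1, PDF_WHITESPACE)
        generation_end = cursor + 1
        cursor = skip_back(data, cursor, DIGITS)
        generation_start = cursor + 1
        cursor = skip_back(data, cursor, PDF_WHITESPACE)
        object_end = cursor + 1
        cursor = skip_back(data, cursor, DIGITS)
        object_start = cursor + 1
        # Accept the header only if both numbers are non-empty.
        if object_start < object_end and generation_start < generation_end:
            found.append((
                int(data[object_start:object_end]),
                int(data[generation_start:generation_end]),
                object_start,
            ))
        index = data.find(b" obj", index + 4)  # 4 == len(b" obj")
    return found


def find_objects_with_old_regex(data: bytes) -> list[tuple[int, int, int]]:
    """Same result shape, produced with the pre-6.6.0 regular expression."""
    return [
        (int(m.group(1)), int(m.group(2)), m.start(1))
        for m in OLD_OBJECT_PATTERN.finditer(data)
    ]


if __name__ == "__main__":
    print(find_objects(SAMPLE))
    print(find_objects(SAMPLE) == find_objects_with_old_regex(SAMPLE))
```

On this sample both approaches agree with the expectation in the patch's test. They are not guaranteed to agree on every input: the old pattern required a leading whitespace byte and allowed only spaces and tabs between the numbers, whereas the backward scan accepts any PDF whitespace.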
    
