pypdf has possible long runtimes for malformed startxref
Description
pypdf is a free and open-source pure-Python PDF library. Prior to version 6.6.0, an attacker can craft a PDF whose invalid startxref entry leads to possibly long runtimes: when pypdf rebuilds the cross-reference table, files containing large amounts of whitespace characters become problematic. Only the non-strict reading mode is affected. This issue has been patched in version 6.6.0.
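The slowdown comes from the backtracking regex that pre-6.6.0 pypdf used in `_rebuild_xref_table` to locate object headers. On a file that is mostly whitespace, every byte is a valid match start, and the greedy whitespace quantifier consumes the rest of the buffer before backtracking byte by byte, so total work grows quadratically with file size. A rough illustrative sketch (the regex is taken from the patched-out code; `scan_whitespace` is not part of pypdf):

```python
import re
import timeit

# The pre-6.6.0 pattern pypdf used to locate "<num> <gen> obj" headers
# when rebuilding a broken cross-reference table.
OBJECT_PATTERN = re.compile(rb"[\r\n \t][ \t]*(\d+)[ \t]+(\d+)[ \t]+obj")

def scan_whitespace(size: int) -> float:
    """Time one full scan over a whitespace-only buffer of `size` bytes.

    Each failed match position re-scans the remaining whitespace run,
    which is what makes whitespace-heavy malformed files so slow.
    """
    data = b" " * size
    return timeit.timeit(lambda: list(OBJECT_PATTERN.finditer(data)), number=1)
```

Doubling `size` roughly quadruples the scan time, which is why a few megabytes of whitespace after a malformed startxref can stall the reader.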
Affected packages
Versions sourced from the GitHub Security Advisory.
| Package | Affected versions | Patched versions |
|---|---|---|
| pypdf (PyPI) | < 6.6.0 | 6.6.0 |
Affected products
Patches
Commit 294165726b64: SEC: Improve handling of partially broken PDF files (#3594)
5 files changed · +236 −38
`docs/user/security.md` (+13 −0, modified)

```diff
@@ -4,6 +4,8 @@ We strive to provide a library with secure defaults.

 ## Configuration

+### Filters
+
 *pypdf* currently employs output size limits for some filters which are known to
 possibly have large compression ratios. The usual limit is at 75 MB of
 uncompressed data during decompression. If this is too low for your use case, and you are
@@ -15,6 +17,17 @@ aware of the possible side effects, you can modify the following constants which
 For JBIG2 images, there is a similar parameter to limit the memory usage during
 decoding: `pypdf.filters.JBIG2_MAX_OUTPUT_LENGTH` It defaults to 75 MB as well.

+### Reading
+
+*pypdf* currently employs the following reading limits on *PdfReader* instances:
+
+* `root_object_recovery_limit` limits the number of objects to read before stopping with Root object recovery in
+  non-strict mode. It defaults to 10 000. Setting it to `None` will fully disable this limit.
+
+If you want to employ custom limits for the *PdfWriter* as well, the currently preferred way
+is to initialize it from the reader, id est something like
+`PdfWriter(clone_from=PdfReader("file.pdf", root_object_recovery_limit=42))`.
+
 ## Reporting possible vulnerabilities

 Please refer to our [security policy](https://github.com/py-pdf/pypdf/security/policy).
```
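The `root_object_recovery_limit` behavior documented above can be pictured as a capped linear scan for the /Catalog object. A minimal sketch under stated assumptions: the `get_object` callable, plain dicts, and `RuntimeError` stand in for pypdf's internals and are not its actual API.

```python
import sys
from typing import Callable, Optional

def recover_root(
    get_object: Callable[[int], object],
    size: int,
    limit: Optional[int] = 10_000,
) -> Optional[dict]:
    """Scan object numbers 1..size for a /Catalog dictionary, giving up
    once `limit` objects have been queried (None disables the cap)."""
    cap = limit if isinstance(limit, int) else sys.maxsize
    for i in range(size):
        if i >= cap:
            raise RuntimeError("Maximum Root object recovery limit reached.")
        try:
            candidate = get_object(i + 1)
        except Exception:
            # Mirror pypdf's behavior: swallow errors, keep scanning.
            candidate = None
        if isinstance(candidate, dict) and candidate.get("/Type") == "/Catalog":
            return candidate
    return None
```

The cap bounds the worst case for a crafted trailer with a huge /Size; `None` restores the unbounded pre-6.6.0 behavior for callers who need it.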
`pypdf/_doc_common.py` (+4 −0, modified)

```diff
@@ -1170,7 +1170,11 @@ def _flatten(
         for attr in inheritable_page_attributes:
             if attr in pages:
                 inherit[attr] = pages[attr]
+        pages_reference = getattr(pages, "indirect_reference", object())
         for page in cast(ArrayObject, pages[PagesAttributes.KIDS]):
+            if getattr(page, "indirect_reference", object()) == pages_reference:
+                raise PdfReadError("Detected cyclic page references.")
+
             addt = {}
             if isinstance(page, IndirectObject):
                 addt["indirect_reference"] = page
```
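The added check guards `_flatten` against a /Kids entry that points back at its own parent node, which would otherwise recurse forever. The idea, with the page tree reduced to a plain dict (a hypothetical structure for illustration, not pypdf's API):

```python
def iter_leaf_pages(node, tree):
    """Yield leaf pages of a /Pages tree given as {node: kid_list or None},
    refusing a kid that references its own parent (a reference cycle)."""
    kids = tree.get(node)
    if kids is None:  # a leaf /Page object
        yield node
        return
    for kid in kids:
        if kid == node:
            # Same guard as the patch: a kid resolving to its parent.
            raise ValueError("Detected cyclic page references.")
        yield from iter_leaf_pages(kid, tree)
```

Like the patch, this only rejects a direct self-reference at each level; that is enough to break the infinite loop the test below constructs.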
`pypdf/_reader.py` (+112 −37, modified)

```diff
@@ -29,6 +29,7 @@
 import os
 import re
+import sys
 from collections.abc import Iterable
 from io import BytesIO, UnsupportedOperation
 from pathlib import Path
@@ -45,6 +46,7 @@
 from ._doc_common import PdfDocCommon, convert_to_int
 from ._encryption import Encryption, PasswordType
 from ._utils import (
+    WHITESPACES_AS_BYTES,
     StrByteType,
     StreamType,
     logger_warning,
@@ -58,6 +60,7 @@ from .errors import (
     EmptyFileError,
     FileNotDecryptedError,
+    LimitReachedError,
     PdfReadError,
     PdfStreamError,
     WrongPasswordError,
@@ -101,6 +104,9 @@ class PdfReader(PdfDocCommon):
         password: Decrypt PDF file at initialization. If the
             password is None, the file will not be decrypted.
             Defaults to ``None``.
+        root_object_recovery_limit: The maximum number of objects to query
+            for recovering the Root object in non-strict mode. To disable
+            this security measure, pass ``None``.

     """
@@ -109,6 +115,8 @@ def __init__(
         stream: Union[StrByteType, Path],
         strict: bool = False,
         password: Union[None, str, bytes] = None,
+        *,
+        root_object_recovery_limit: Optional[int] = 10_000,
     ) -> None:
         self.strict = strict
         self.flattened_pages: Optional[list[PageObject]] = None
@@ -123,6 +131,11 @@ def __init__(
         self.xref_objStm: dict[int, tuple[Any, Any]] = {}
         self.trailer = DictionaryObject()

+        # Security parameters.
+        self._root_object_recovery_limit = (
+            root_object_recovery_limit if isinstance(root_object_recovery_limit, int) else sys.maxsize
+        )
+
         # Map page indirect_reference number to page number
         self._page_id2num: Optional[dict[Any, Any]] = None
@@ -214,15 +227,17 @@ def root_object(self) -> DictionaryObject:
                 logger_warning("Invalid Root object in trailer", __name__)
         if self._validated_root is None:
             logger_warning('Searching object with "/Catalog" key', __name__)
-            nb = cast(int, self.trailer.get("/Size", 0))
-            for i in range(nb):
+            number_of_objects = cast(int, self.trailer.get("/Size", 0))
+            for i in range(number_of_objects):
+                if i >= self._root_object_recovery_limit:
+                    raise LimitReachedError("Maximum Root object recovery limit reached.")
                 try:
-                    o = self.get_object(i + 1)
+                    obj = self.get_object(i + 1)
                 except Exception:  # to be sure to capture all errors
-                    o = None
-                if isinstance(o, DictionaryObject) and o.get("/Type") == "/Catalog":
-                    self._validated_root = o
-                    logger_warning(f"Root found at {o.indirect_reference!r}", __name__)
+                    obj = None
+                if isinstance(obj, DictionaryObject) and obj.get("/Type") == "/Catalog":
+                    self._validated_root = obj
+                    logger_warning(f"Root found at {obj.indirect_reference!r}", __name__)
                     break
         if self._validated_root is None:
             if not is_null_or_none(root) and "/Pages" in cast(DictionaryObject, cast(PdfObject, root).get_object()):
@@ -1043,58 +1058,118 @@ def _get_xref_issues(stream: StreamType, startxref: int) -> int:
                 return 3
         return 0

+    @classmethod
+    def _find_pdf_objects(cls, data: bytes) -> Iterable[tuple[int, int, int]]:
+        index = 0
+        ord_0 = ord("0")
+        ord_9 = ord("9")
+        while True:
+            index = data.find(b" obj", index)
+            if index == -1:
+                return
+
+            index_before_space = index - 1
+
+            # Skip whitespace backwards
+            while index_before_space >= 0 and data[index_before_space] in WHITESPACES_AS_BYTES:
+                index_before_space -= 1
+
+            # Read generation number
+            generation_end = index_before_space + 1
+            while index_before_space >= 0 and ord_0 <= data[index_before_space] <= ord_9:
+                index_before_space -= 1
+            generation_start = index_before_space + 1
+
+            # Skip whitespace
+            while index_before_space >= 0 and data[index_before_space] in WHITESPACES_AS_BYTES:
+                index_before_space -= 1
+
+            # Read object number
+            object_end = index_before_space + 1
+            while index_before_space >= 0 and ord_0 <= data[index_before_space] <= ord_9:
+                index_before_space -= 1
+            object_start = index_before_space + 1
+
+            # Validate
+            if object_start < object_end and generation_start < generation_end:
+                object_number = int(data[object_start:object_end])
+                generation_number = int(data[generation_start:generation_end])
+
+                yield object_number, generation_number, object_start
+
+            index += 4  # len(b" obj")
+
+    @classmethod
+    def _find_pdf_trailers(cls, data: bytes) -> Iterable[int]:
+        index = 0
+        data_length = len(data)
+        while True:
+            index = data.find(b"trailer", index)
+            if index == -1:
+                return
+
+            index_after_trailer = index + 7  # len(b"trailer")
+
+            # Skip whitespace
+            while index_after_trailer < data_length and data[index_after_trailer] in WHITESPACES_AS_BYTES:
+                index_after_trailer += 1
+
+            # Must be dictionary start
+            if index_after_trailer + 1 < data_length and data[index_after_trailer:index_after_trailer+2] == b"<<":
+                yield index_after_trailer  # offset of '<<'
+
+            index += 7  # len(b"trailer")

     def _rebuild_xref_table(self, stream: StreamType) -> None:
         self.xref = {}
         stream.seek(0, 0)
-        f_ = stream.read(-1)
+        stream_data = stream.read(-1)

-        for m in re.finditer(rb"[\r\n \t][ \t]*(\d+)[ \t]+(\d+)[ \t]+obj", f_):
-            idnum = int(m.group(1))
-            generation = int(m.group(2))
-            if generation not in self.xref:
-                self.xref[generation] = {}
-            self.xref[generation][idnum] = m.start(1)
+        for object_number, generation_number, object_start in self._find_pdf_objects(stream_data):
+            if generation_number not in self.xref:
+                self.xref[generation_number] = {}
+            self.xref[generation_number][object_number] = object_start

         logger_warning("parsing for Object Streams", __name__)
-        for g in self.xref:
-            for i in self.xref[g]:
+        for generation_number in self.xref:
+            for object_number in self.xref[generation_number]:
                 # get_object in manual
-                stream.seek(self.xref[g][i], 0)
+                stream.seek(self.xref[generation_number][object_number], 0)
                 try:
                     _ = self.read_object_header(stream)
-                    o = cast(StreamObject, read_object(stream, self))
-                    if o.get("/Type", "") != "/ObjStm":
+                    obj = cast(StreamObject, read_object(stream, self))
+                    if obj.get("/Type", "") != "/ObjStm":
                         continue
-                    strm = BytesIO(o.get_data())
-                    cpt = 0
+                    object_stream = BytesIO(obj.get_data())
+                    actual_count = 0
                     while True:
-                        s = read_until_whitespace(strm)
-                        if not s.isdigit():
+                        current = read_until_whitespace(object_stream)
+                        if not current.isdigit():
                             break
-                        _i = int(s)
-                        skip_over_whitespace(strm)
-                        strm.seek(-1, 1)
-                        s = read_until_whitespace(strm)
-                        if not s.isdigit():  # pragma: no cover
+                        inner_object_number = int(current)
+                        skip_over_whitespace(object_stream)
+                        object_stream.seek(-1, 1)
+                        current = read_until_whitespace(object_stream)
+                        if not current.isdigit():  # pragma: no cover
                             break  # pragma: no cover
-                        _o = int(s)
-                        self.xref_objStm[_i] = (i, _o)
-                        cpt += 1
-                    if cpt != o.get("/N"):  # pragma: no cover
+                        inner_generation_number = int(current)
+                        self.xref_objStm[inner_object_number] = (object_number, inner_generation_number)
+                        actual_count += 1
+                    if actual_count != obj.get("/N"):  # pragma: no cover
                         logger_warning(  # pragma: no cover
-                            f"found {cpt} objects within Object({i},{g})"
-                            f" whereas {o.get('/N')} expected",
+                            f"found {actual_count} objects within Object({object_number},{generation_number})"
+                            f" whereas {obj.get('/N')} expected",
                             __name__,
                         )
                 except Exception:  # could be multiple causes
                     pass

         stream.seek(0, 0)
-        for m in re.finditer(rb"[\r\n \t][ \t]*trailer[\r\n \t]*(<<)", f_):
-            stream.seek(m.start(1), 0)
+        for position in self._find_pdf_trailers(stream_data):
+            stream.seek(position, 0)
             new_trailer = cast(dict[Any, Any], read_object(stream, self))
             # Here, we are parsing the file from start to end, the new data have to erase the existing.
-            for key, value in list(new_trailer.items()):
+            for key, value in new_trailer.items():
                 self.trailer[key] = value

     def _read_xref_subsections(
```
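The patch replaces both regex scans in `_rebuild_xref_table` with linear `bytes.find`-based scanners. A self-contained sketch of the object-header scanner in the same spirit; `PDF_WHITESPACE` is my approximation of pypdf's `WHITESPACES_AS_BYTES`, and the yielded offset points at the first digit of the object number:

```python
from collections.abc import Iterator

PDF_WHITESPACE = b"\x00\t\n\x0c\r "  # NUL, HT, LF, FF, CR, space

def find_pdf_objects(data: bytes) -> Iterator[tuple[int, int, int]]:
    """Yield (object_number, generation_number, offset) for each
    "<num> <gen> obj" header, walking the buffer once with bytes.find
    instead of a backtracking regex."""
    index = 0
    while True:
        index = data.find(b" obj", index)
        if index == -1:
            return
        pos = index - 1
        # Skip any extra whitespace before the " obj" we found.
        while pos >= 0 and data[pos] in PDF_WHITESPACE:
            pos -= 1
        # Read the generation number backwards.
        generation_end = pos + 1
        while pos >= 0 and ord("0") <= data[pos] <= ord("9"):
            pos -= 1
        generation_start = pos + 1
        # Skip the whitespace between the two numbers.
        while pos >= 0 and data[pos] in PDF_WHITESPACE:
            pos -= 1
        # Read the object number backwards.
        object_end = pos + 1
        while pos >= 0 and ord("0") <= data[pos] <= ord("9"):
            pos -= 1
        object_start = pos + 1
        # Only yield when both numbers are actually present.
        if object_start < object_end and generation_start < generation_end:
            yield (int(data[object_start:object_end]),
                   int(data[generation_start:generation_end]),
                   object_start)
        index += 4  # len(b" obj")
```

Because `bytes.find` never revisits bytes and the backward scans are bounded by the header itself, the whole pass stays linear even on whitespace-heavy input.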
`tests/test_doc_common.py` (+18 −1, modified)

```diff
@@ -11,7 +11,8 @@ import pytest

 from pypdf import PdfReader, PdfWriter
-from pypdf.generic import EmbeddedFile, NullObject, TextStringObject, ViewerPreferences
+from pypdf.errors import PdfReadError
+from pypdf.generic import EmbeddedFile, NameObject, NullObject, TextStringObject, ViewerPreferences
 from tests import get_data_from_url

 TESTS_ROOT = Path(__file__).parent.resolve()
@@ -449,3 +450,19 @@ def test_outline__issue3462():
         "Page 1",
         "Page 2"
     ]
+
+
+def test_flatten__cyclic_references():
+    path = RESOURCES_ROOT / "crazyones.pdf"
+
+    reader = PdfReader(path)
+    assert len(reader.pages) == 1
+    reader._flatten()
+
+    # Make the first child point to the object itself.
+    pages_object = reader.get_object(10)
+    pages_object[NameObject("/Kids")][0].indirect_reference.idnum = 10
+    reader.resolved_objects[(10, 0)] = pages_object
+
+    with pytest.raises(expected_exception=PdfReadError, match=r"^Detected cyclic page references\.$"):
+        reader._flatten()
```
`tests/test_reader.py` (+89 −0, modified)

```diff
@@ -1,5 +1,6 @@
 """Test the pypdf._reader module."""
 import io
+import sys
 import time
 from io import BytesIO
 from pathlib import Path
@@ -17,6 +18,7 @@
     DeprecationError,
     EmptyFileError,
     FileNotDecryptedError,
+    LimitReachedError,
     PdfReadError,
     PdfStreamError,
     WrongPasswordError,
@@ -1889,3 +1891,90 @@ def test_read_standard_xref_table__two_whitespace_characters_between_offset_and_
     reader = PdfReader(BytesIO(get_data_from_url(url, name=name)))
     assert len(reader.pages) == 1
     assert reader.pages[0].extract_text() == "Hello World!"
+
+
+@pytest.mark.enable_socket
+def test_root_object_recovery_limit(caplog):
+    url = "https://github.com/user-attachments/files/24525509/root_object_recovery_limit.pdf"
+    name = "root_object_recovery_limit.pdf"
+    data = get_data_from_url(url, name=name)
+
+    # Default limit.
+    reader = PdfReader(BytesIO(data))
+    with pytest.raises(
+        expected_exception=LimitReachedError, match=r"^Maximum Root object recovery limit reached\.$"
+    ):
+        _ = list(reader.pages)
+    message_numbers = {
+        int(message.split(" ", maxsplit=2)[1])
+        for message in caplog.messages
+        if message.startswith("Object ") and message.endswith(" 0 not defined.")
+    }
+    assert sorted(message_numbers) == list(range(5, 10001))
+
+    # Custom limit.
+    caplog.clear()
+    reader = PdfReader(BytesIO(data), root_object_recovery_limit=42)
+    with pytest.raises(
+        expected_exception=LimitReachedError, match=r"^Maximum Root object recovery limit reached\.$"
+    ):
+        _ = list(reader.pages)
+    message_numbers = {
+        int(message.split(" ", maxsplit=2)[1])
+        for message in caplog.messages
+        if message.startswith("Object ") and message.endswith(" 0 not defined.")
+    }
+    assert sorted(message_numbers) == list(range(5, 43))

+    # No limit. Do not run actual process for speed reasons.
+    reader = PdfReader(BytesIO(data), root_object_recovery_limit=None)
+    assert reader._root_object_recovery_limit == sys.maxsize
+
+    # Strict mode.
+    with pytest.raises(expected_exception=PdfReadError, match=r"^Broken xref table$"):
+        reader = PdfReader(BytesIO(data), strict=True)
+        _ = list(reader.pages)
+
+
+@pytest.mark.timeout(10)
+def test_rebuild_xref_table__speed():
+    total_len = 2_000_790
+    middle = b"\nstartxref 1\n % "
+    leading_len = 0x55E  # 1374
+    leading = b" " * leading_len
+    trailing = b" " * (total_len - leading_len - len(middle))
+    data = leading + middle + trailing
+
+    reader = PdfReader(BytesIO(data))
+    with pytest.raises(expected_exception=PdfReadError, match=r"^Cannot find Root object in pdf$"):
+        _ = list(reader.pages)
+
+
+def test_find_pdf_objects():
+    data = (
+        b"      \n"
+        b" 11 0 obj\n"
+        b"  12 0 obj\n"
+        b"13 1 obj\n"
+        b"ob\n"
+        b"ab obj\n"
+        b"  42 1337 obj \n"
+        b"\n"
+    )
+
+    result = list(PdfReader._find_pdf_objects(data))
+    assert result == [(11, 0, 8), (12, 0, 19), (13, 1, 28), (42, 1337, 49)]
+
+
+@pytest.mark.parametrize(
+    ("data", "expected"),
+    [
+        (b"\n\ntrailer", []),
+        (b"\n\ntrailer abc", []),
+        (b"\n\ntrailer <<", [10]),
+        (b"\n\ntrailer << /Key null >>\n\n trailer  << /Key 42 >>\n", [10, 37])
+    ]
+)
+def test_find_pdf_trailers(data: bytes, expected: list[int]):
+    result = list(PdfReader._find_pdf_trailers(data))
+    assert result == expected
```

Note: exact whitespace runs in the test fixtures above are reconstructed to match the asserted offsets, since the original rendering collapsed them.
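The trailer scanner from the patch works the same way in the forward direction: find each `trailer` keyword, skip whitespace, and accept it only if a dictionary opener follows. A self-contained sketch (again with an assumed `PDF_WHITESPACE` constant standing in for pypdf's `WHITESPACES_AS_BYTES`):

```python
from collections.abc import Iterator

PDF_WHITESPACE = b"\x00\t\n\x0c\r "  # NUL, HT, LF, FF, CR, space

def find_pdf_trailers(data: bytes) -> Iterator[int]:
    """Yield the offset of the "<<" dictionary opener following each
    "trailer" keyword; keywords without a dictionary are skipped."""
    index = 0
    length = len(data)
    while True:
        index = data.find(b"trailer", index)
        if index == -1:
            return
        pos = index + 7  # len(b"trailer")
        # Skip whitespace between the keyword and the dictionary.
        while pos < length and data[pos] in PDF_WHITESPACE:
            pos += 1
        if data[pos:pos + 2] == b"<<":
            yield pos
        index += 7
```

Yielded offsets can be handed straight to a parser that reads the trailer dictionary, which is how the patched `_rebuild_xref_table` consumes them.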
References
- github.com/advisories/GHSA-4f6g-68pf-7vhv (advisory)
- nvd.nist.gov/vuln/detail/CVE-2026-22691 (advisory)
- github.com/py-pdf/pypdf/commit/294165726b646bb7799be1cc787f593f2fdbcf45 (fix commit)
- github.com/py-pdf/pypdf/pull/3594 (pull request)
- github.com/py-pdf/pypdf/releases/tag/6.6.0 (release)
- github.com/py-pdf/pypdf/security/advisories/GHSA-4f6g-68pf-7vhv (vendor advisory)