VYPR
Medium severity · NVD Advisory · Published May 28, 2026 · Updated May 28, 2026

CVE-2026-48155

Description

pypdf is a free and open-source pure-python PDF library. Prior to 6.12.0, an attacker can craft a PDF that causes large memory usage when its text is extracted in layout mode, by placing text at large character offsets. This vulnerability is fixed in 6.12.0.

AI Insight

LLM-synthesized narrative grounded in this CVE's description and references.

Crafted PDF with large character offsets in layout mode text extraction causes excessive memory usage in pypdf prior to 6.12.0.

Vulnerability

An excessive memory consumption vulnerability exists in pypdf prior to version 6.12.0. When extracting text in layout mode, the library places no upper bound on the padding it derives from large character offsets, allowing an attacker to craft a malicious PDF that triggers excessive memory consumption [1][2][3].

Exploitation

An attacker must provide a PDF document containing manipulated content with inordinately large character offsets. The victim (or an automated system) then needs to extract text from this PDF using pypdf's layout mode function, which triggers the vulnerable code path. No authentication or special privileges are required; the attacker only needs the ability to supply the crafted PDF file [1][3].

Impact

Successful exploitation results in excessive memory usage by the process using pypdf. This can lead to performance degradation or denial of service (DoS) due to resource exhaustion. The vulnerability primarily affects availability; no confidentiality or integrity impacts have been described [1][3].

Mitigation

The issue is fixed in pypdf version 6.12.0, released on 2026-05-21 [2][3]. Users should upgrade to this version or later. As a workaround, until an upgrade can be performed, applying the changes from Pull Request #3790 is recommended [1][3].
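
A deployment that accepts untrusted PDFs could check whether the installed release already contains the fix. The following is a minimal sketch, assuming plain `X.Y.Z` version strings; `FIXED_VERSION`, `is_patched`, and `installed_pypdf_is_patched` are hypothetical names for illustration, not part of pypdf.

```python
import re
from importlib import metadata

# First release that contains the fix, per the advisory.
FIXED_VERSION = (6, 12, 0)


def is_patched(version: str) -> bool:
    """Return True if an X.Y.Z version string is at or above the fixed release."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups()) >= FIXED_VERSION


def installed_pypdf_is_patched() -> bool:
    """Check the locally installed pypdf distribution, if there is one."""
    try:
        return is_patched(metadata.version("pypdf"))
    except metadata.PackageNotFoundError:
        # pypdf is not installed, so there is nothing patched to rely on.
        return False
```

A service might call `installed_pypdf_is_patched()` at startup and refuse layout-mode extraction of untrusted input when it returns `False`.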

AI Insight generated on May 28, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.

Affected products

  • Py Pdf/Pypdf (2 versions)
    • (no CPE)
    • (no CPE), range: <6.12.0

Patches

1
9d2747057c4a

SEC: Avoid excessive whitespace in layout mode text extraction (#3790)

https://github.com/py-pdf/pypdf · Stefan · May 21, 2026 · via body-scan-shorthand
2 files changed · +134 -9
  • pypdf/_text_extraction/_layout_mode/_fixed_width_page.py (+28 -7, modified)
    @@ -12,6 +12,9 @@
     from ._text_state_manager import TextStateManager
     from ._text_state_params import TextStateParams
     
    +WHITESPACE_LIMIT = 10_000
    +NEWLINE_LIMIT = 1_000
    +
     
     class BTGroup(TypedDict):
         """
    @@ -38,15 +41,15 @@ class BTGroup(TypedDict):
         flip_sort: Literal[-1, 1]
     
     
    -def bt_group(tj_op: TextStateParams, rendered_text: str, dispaced_tx: float) -> BTGroup:
    +def bt_group(tj_op: TextStateParams, rendered_text: str, displaced_tx: float) -> BTGroup:
         """
         BTGroup constructed from a TextStateParams instance, rendered text, and
         displaced tx value.
     
         Args:
             tj_op (TextStateParams): TextStateParams instance
             rendered_text (str): rendered text
    -        dispaced_tx (float): x coordinate of last character in BTGroup
    +        displaced_tx (float): x coordinate of last character in BTGroup
     
         """
         return BTGroup(
    @@ -55,12 +58,12 @@ def bt_group(tj_op: TextStateParams, rendered_text: str, dispaced_tx: float) ->
             font_size=tj_op.font_size,
             font_height=tj_op.font_height,
             text=rendered_text,
    -        displaced_tx=dispaced_tx,
    +        displaced_tx=displaced_tx,
             flip_sort=-1 if tj_op.flip_vertical else 1,
         )
     
     
    -def recurs_to_target_op(
    +def recurse_to_target_op(
         ops: Iterator[tuple[list[Any], bytes]],
         text_state_mgr: TextStateManager,
         end_target: Literal[b"Q", b"ET"],
    @@ -141,6 +144,12 @@ def recurs_to_target_op(
                         excess_tx = round(_tj.tx - last_displaced_tx, 3) * (_idx != bt_idx)
                         # space_tx could be 0 if either Tz or font_size was 0 for this _tj.
                         spaces = int(excess_tx // _tj.space_tx) if _tj.space_tx else 0
    +                    if spaces > WHITESPACE_LIMIT:
    +                        logger_warning(
    +                            "Limiting excessive whitespace from %(actual)d to %(limit)d characters.",
    +                            actual=spaces, limit=WHITESPACE_LIMIT, source=__name__
    +                        )
    +                        spaces = WHITESPACE_LIMIT
                         new_text = f'{" " * spaces}{_tj.txt}'
     
                         last_ty = _tj.ty
    @@ -151,15 +160,15 @@ def recurs_to_target_op(
                     text_state_mgr.reset_tm()
                 break
             if op == b"q":
    -            bts, tjs = recurs_to_target_op(
    +            bts, tjs = recurse_to_target_op(
                     ops, text_state_mgr, b"Q", fonts, strip_rotated
                 )
                 bt_groups.extend(bts)
                 tj_ops.extend(tjs)
             elif op == b"cm":
                 text_state_mgr.add_cm(*operands)
             elif op == b"BT":
    -            bts, tjs = recurs_to_target_op(
    +            bts, tjs = recurse_to_target_op(
                     ops, text_state_mgr, b"ET", fonts, strip_rotated
                 )
                 bt_groups.extend(bts)
    @@ -278,7 +287,7 @@ def text_show_operations(
         tj_ops: list[TextStateParams] = []  # Tj/TJ operator data
         for operands, op in ops:
             if op in (b"BT", b"q"):
    -            bts, tjs = recurs_to_target_op(
    +            bts, tjs = recurse_to_target_op(
                     ops, state_mgr, b"ET" if op == b"BT" else b"Q", fonts, strip_rotated
                 )
                 bt_groups.extend(bts)
    @@ -372,6 +381,12 @@ def fixed_width_page(
                 blank_lines = 0 if fh == 0 else (
                     int(abs(y_coord - last_y_coord) / (fh * font_height_weight)) - 1
                 )
    +            if blank_lines > NEWLINE_LIMIT:
    +                logger_warning(
    +                    "Limiting excessive newlines from %(actual)d to %(limit)d.",
    +                    actual=blank_lines, limit=NEWLINE_LIMIT, source=__name__
    +                )
    +                blank_lines = NEWLINE_LIMIT
                 lines.extend([""] * blank_lines)
     
             line_parts = []  # It uses a list to construct the line, avoiding string concatenation.
    @@ -382,6 +397,12 @@ def fixed_width_page(
                 offset = int(tx // char_width)
                 needed_spaces = offset - current_len
                 if needed_spaces > 0 and ceil(last_disp) < int(tx):
    +                if needed_spaces > WHITESPACE_LIMIT:
    +                    logger_warning(
    +                        "Limiting excessive whitespace from %(actual)d to %(limit)d characters.",
    +                        actual=needed_spaces, limit=WHITESPACE_LIMIT, source=__name__
    +                    )
    +                    needed_spaces = WHITESPACE_LIMIT
                     padding = " " * needed_spaces
                     line_parts.append(padding)
                     current_len += needed_spaces
    
  • tests/test_text_extraction.py (+106 -2, modified)
    @@ -13,9 +13,15 @@
     from pypdf import PdfReader, PdfWriter, mult
     from pypdf._font import Font
     from pypdf._text_extraction import set_custom_rtl
    -from pypdf._text_extraction._layout_mode._fixed_width_page import text_show_operations
    +from pypdf._text_extraction._layout_mode._fixed_width_page import (
    +    BTGroup,
    +    fixed_width_page,
    +    recurse_to_target_op,
    +    text_show_operations,
    +)
    +from pypdf._text_extraction._layout_mode._text_state_manager import TextStateManager
     from pypdf.errors import PdfReadError
    -from pypdf.generic import ContentStream
    +from pypdf.generic import ContentStream, DictionaryObject, NameObject
     
     from . import RESOURCE_ROOT, SAMPLE_ROOT, get_data_from_url
     
    @@ -492,3 +498,101 @@ def test_extract_text_with_missing_font_bbox():
         page = reader.pages[0]
         text = page.extract_text()
         assert "🎉" in text
    +
    +
    +def test_recurse_to_target_op__excessive_intra_group_spacing(caplog):
    +    operators = [
    +        (["/F1", 12], b"Tf"),
    +        ([1, 0, 0, 1, 0, 700], b"Tm"),
    +        ([b"A"], b"Tj"),
    +        ([1, 0, 0, 1, 1000000, 700], b"Tm"),
    +        ([b"B"], b"Tj"),
    +        ([], b"ET")
    +    ]
    +    text_state_manager = TextStateManager()
    +    font = DictionaryObject({
    +        NameObject("/Type"): NameObject("/Font"),
    +        NameObject("/Subtype"): NameObject("/Type1"),
    +        NameObject("/BaseFont"): NameObject("/Helvetica"),
    +    })
    +    fonts = {"/F1": Font.from_font_resource(font)}
    +
    +    bt_groups, _tj_ops = recurse_to_target_op(
    +        ops=iter(operators),
    +        text_state_mgr=text_state_manager,
    +        end_target=b"ET",
    +        fonts=fonts,
    +    )
    +    assert bt_groups == [
    +        {
    +            "displaced_tx": 1000008.004,
    +            "flip_sort": 1,
    +            "font_height": 12.0,
    +            "font_size": 12,
    +            "text": "A" + 10000 * " " + "B",
    +            "tx": 0.0,
    +            "ty": 700.0
    +        }
    +    ]
    +    assert caplog.messages == ["Limiting excessive whitespace from 299757 to 10000 characters."]
    +
    +
    +def test_fixed_width_page__excessive_blank_lines(caplog):
    +    ty_groups = {
    +        100: [
    +            BTGroup(tx=0, text="Top", displaced_tx=3, font_height=1, ty=0, font_size=12, flip_sort=1),
    +        ],
    +        # Creates 1499 blank lines:
    +        # (1600 - 100) / (1 * 1) - 1
    +        # = 1500 - 1
    +        # = 1499
    +        1600: [
    +            BTGroup(tx=0, text="Bottom", displaced_tx=6, font_height=1, ty=0, font_size=12, flip_sort=1)
    +        ],
    +    }
    +
    +    result = fixed_width_page(
    +        ty_groups=ty_groups,
    +        char_width=1,
    +        space_vertically=True,
    +        font_height_weight=1,
    +    )
    +
    +    lines = result.splitlines()
    +
    +    assert lines[0] == "Top"
    +    assert lines[-1] == "Bottom"
    +
    +    # 2 content lines + reduced 1000 blank lines
    +    assert len(lines) == 1002
    +
    +    blank_lines = lines[1:-1]
    +    assert all(line == "" for line in blank_lines)
    +
    +    assert caplog.messages == ["Limiting excessive newlines from 1499 to 1000."]
    +
    +
    +def test_fixed_width_page__excessive_needed_spaces(caplog):
    +    ty_groups = {
    +        100: [
    +            BTGroup(
    +                tx=13_000,
    +                text="X",
    +                displaced_tx=13_370,
    +                font_height=12,
    +                ty=0,
    +                font_size=12,
    +                flip_sort=1,
    +            )
    +        ]
    +    }
    +
    +    result = fixed_width_page(
    +        ty_groups=ty_groups,
    +        char_width=1,
    +        space_vertically=True,
    +        font_height_weight=1,
    +    )
    +
    +    assert result == " " * 10_000 + "X"
    +    assert caplog.messages == ["Limiting excessive whitespace from 13000 to 10000 characters."]
    

Vulnerability mechanics

Root cause

"Missing upper bounds on whitespace and newline calculations derived from PDF character offsets allow an attacker to cause excessive memory allocation during layout-mode text extraction."

Attack vector

An attacker crafts a PDF whose content stream contains text-showing operators (e.g., `Tj`) separated by extremely large character offsets via `Tm` matrix translations. When pypdf extracts text in layout mode, `recurse_to_target_op` computes `spaces = int(excess_tx // _tj.space_tx)` from these offsets, and `fixed_width_page` computes `blank_lines` and `needed_spaces` from coordinate differences. Without a cap, these calculations produce strings with hundreds of thousands of whitespace characters, causing excessive memory allocation. The attacker needs no authentication; the vector is a malicious PDF delivered to a victim who calls `page.extract_text()` in layout mode.
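
The size of the padding can be reproduced without pypdf. The sketch below is a simplified mirror of the `spaces` formula shown in the patch (it omits the `_idx != bt_idx` factor) and uses the offset from the regression test in the same commit. The two metric constants are assumptions, not values stated in the advisory: they correspond to standard Helvetica widths at 12 pt and were chosen because they reproduce the 299757 figure in the test's expected log message. `uncapped_spaces` is a hypothetical name.

```python
# Assumed metrics (not taken from the advisory): Helvetica at 12 pt,
# where a space is 278/1000 em wide and "A" is 667/1000 em wide.
SPACE_TX = 3.336           # width of one space in text-space units
LAST_DISPLACED_TX = 8.004  # x coordinate where the preceding "A" ends


def uncapped_spaces(tx: float, last_displaced_tx: float, space_tx: float) -> int:
    """Pre-patch style calculation: gap width divided by the space width."""
    excess_tx = round(tx - last_displaced_tx, 3)
    # space_tx can be 0 when the font size or horizontal scaling is 0.
    return int(excess_tx // space_tx) if space_tx else 0


# One Tm translation to x = 1,000,000 produces a padding run of this length.
spaces = uncapped_spaces(1_000_000, LAST_DISPLACED_TX, SPACE_TX)
padding = " " * spaces
```

Under these assumptions a few bytes of content stream expand into roughly 300,000 characters of padding, and nothing in the unpatched formula prevents a larger offset from producing a proportionally larger string.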

Affected code

The vulnerability resides in `pypdf/_text_extraction/_layout_mode/_fixed_width_page.py`. The functions `recurse_to_target_op` (previously `recurs_to_target_op`) and `fixed_width_page` compute whitespace padding and blank-line counts from character offsets without any upper bound. The patch introduces `WHITESPACE_LIMIT = 10_000` and `NEWLINE_LIMIT = 1_000` constants and caps the computed values before generating the output string.
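
The two uncapped calculations in `fixed_width_page` can be sketched the same way. The functions below are hypothetical stand-ins that copy only the arithmetic visible in the diff; the inputs are the ones used by the regression tests in the patch.

```python
def uncapped_blank_lines(
    y_coord: float, last_y_coord: float, font_height: float, font_height_weight: float
) -> int:
    """Blank lines inserted between two text rows, with no upper bound."""
    if font_height == 0:
        return 0
    return int(abs(y_coord - last_y_coord) / (font_height * font_height_weight)) - 1


def uncapped_needed_spaces(tx: float, char_width: float, current_len: int) -> int:
    """Spaces needed to pad the current line out to column tx // char_width."""
    offset = int(tx // char_width)
    return offset - current_len


# Rows at y = 100 and y = 1600 with a font height of 1, as in the patch's tests.
blank_lines = uncapped_blank_lines(1600, 100, 1, 1)
# A single glyph placed at x = 13,000 with a character width of 1.
needed_spaces = uncapped_needed_spaces(13_000, 1, 0)
```

Both results scale linearly with coordinates that the PDF author controls, which is why the patch bounds them.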

What the fix does

The patch adds two module-level constants, `WHITESPACE_LIMIT = 10_000` and `NEWLINE_LIMIT = 1_000`, and caps three computed values before they are used to generate output strings. In `recurse_to_target_op`, the `spaces` variable is clamped to `WHITESPACE_LIMIT`. In `fixed_width_page`, `blank_lines` is clamped to `NEWLINE_LIMIT` and `needed_spaces` to `WHITESPACE_LIMIT`. Each cap logs a warning with the actual and limited counts. This prevents an attacker from forcing the library to allocate memory proportional to arbitrary character offsets in the PDF.
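
The capping pattern can be condensed into a single helper. This is an illustrative sketch, not the patch itself: the patch inlines the check at each of the three sites and logs through pypdf's own `logger_warning`, whereas the hypothetical `clamp` below uses the standard `logging` module.

```python
import logging

logger = logging.getLogger(__name__)

# Limits as introduced by the patch.
WHITESPACE_LIMIT = 10_000
NEWLINE_LIMIT = 1_000


def clamp(count: int, limit: int, what: str) -> int:
    """Return count, reduced to limit (with a warning) if it exceeds the limit."""
    if count > limit:
        logger.warning("Limiting excessive %s from %d to %d.", what, count, limit)
        return limit
    return count


# Values taken from the regression tests in the patch.
spaces = clamp(299_757, WHITESPACE_LIMIT, "whitespace")
blank_lines = clamp(1_499, NEWLINE_LIMIT, "newlines")
```

Clamping rather than raising keeps extraction working on legitimate documents with unusual layouts, at the cost of truncated spacing in the output.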

Preconditions

  • config: The victim must call page.extract_text() with layout mode explicitly enabled (extraction_mode="layout"); plain mode is the default.
  • input: The attacker must deliver a crafted PDF to the victim.
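
For orientation, the affected code path is reached through a call of the following shape. This is a minimal sketch: `extract_layout_text` is a hypothetical wrapper, while `PdfReader` and the `extraction_mode="layout"` argument of `extract_text` are pypdf's public API.

```python
def extract_layout_text(pdf_path: str) -> list[str]:
    """Extract text from every page in layout mode (the affected code path)."""
    # Imported lazily so that this sketch can be loaded without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text(extraction_mode="layout") for page in reader.pages]
```

Code that calls `extract_text()` without requesting layout mode does not satisfy the first precondition.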

Generated on May 28, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.

References

3
