Docling: Unsafe URI and Path Handling in HTML Backend
Description
Impact
The HTML backend did not perform sufficient validation during resource handling:
- It accepted file:// URIs, enabling local file system access when enable_local_fetch=True
- Path resolution allowed traversal outside the intended directories via ../ sequences and absolute paths
- It did not block internal network resources when enable_remote_fetch=True
- HTTP redirects were not validated, so a fetch could be redirected to unintended schemes
- There were no resource limits for remote image downloads or data: URIs
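To make "internal network resources" concrete, the following minimal sketch classifies URL targets with Python's standard `ipaddress` module. The helper name `is_internal_target` is hypothetical and is not part of Docling; it only inspects literal IP hosts, and a real check would also need to resolve hostnames, which is left out here to keep the example offline.

```python
import ipaddress
from urllib.parse import urlparse


def is_internal_target(url: str) -> bool:
    """Return True if the URL host is a literal IP in a non-public range.

    Hypothetical helper for illustration only. Hostnames are NOT resolved,
    so this function returns False for them; a production check would
    resolve the name and test every resulting address.
    """
    host = urlparse(url).hostname
    if host is None:
        return False
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not a literal IP address (for example "example.com").
        return False
    return (
        addr.is_loopback        # 127.0.0.0/8, ::1
        or addr.is_private      # 10/8, 172.16/12, 192.168/16, ...
        or addr.is_link_local   # 169.254/16, commonly used by metadata services
        or addr.is_reserved
        or addr.is_multicast
        or addr.is_unspecified  # 0.0.0.0, ::
    )


if __name__ == "__main__":
    for candidate in (
        "http://127.0.0.1/admin",
        "http://169.254.169.254/latest/meta-data/",
        "https://8.8.8.8/logo.png",
    ):
        print(candidate, is_internal_target(candidate))
```

An image `src` pointing at any of the internal ranges above is the kind of request an SSRF check is meant to refuse.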
Patches
Fixed in versions 2.91.0 (initial fixes) and 2.94.0 (additional improvements). The fixes implement:
- Updated local path treatment: absolute paths are always blocked; relative paths require enable_local_fetch=True (default: False) and must resolve inside the directory of the configured base_path, which protects against path traversal
- The file:// scheme is stripped and the remainder is treated as a local path (see above)
- IP address validation to prevent SSRF
- HTTP redirect validation, plus connection and read timeouts
- Size limits for both remote images (with streaming download) and base64-decoded data URIs
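The size limit on streamed downloads can be sketched as follows. This is a minimal illustration, not Docling's implementation: the function name `read_capped` is hypothetical, and only the error message "Resource size exceeds limit" is taken from the test suite in the patch.

```python
from typing import Iterable


def read_capped(chunks: Iterable[bytes], max_bytes: int) -> bytes:
    """Accumulate chunks and abort as soon as the total exceeds max_bytes.

    Hypothetical helper: it stops reading early instead of buffering an
    arbitrarily large body and checking its size afterwards.
    """
    buf = bytearray()
    for chunk in chunks:
        buf.extend(chunk)
        if len(buf) > max_bytes:
            raise ValueError("Resource size exceeds limit")
    return bytes(buf)


if __name__ == "__main__":
    # Two 2-byte chunks fit within a 4-byte limit.
    print(read_capped([b"ab", b"cd"], max_bytes=4))
    # The same chunks exceed a 3-byte limit and are rejected.
    try:
        read_capped([b"ab", b"cd"], max_bytes=3)
    except ValueError as exc:
        print("rejected:", exc)
```

With the `requests` library, the `chunks` argument would typically be `response.iter_content(chunk_size=8192)` on a response opened with `stream=True`, which matches the streaming interface the patched tests mock.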
Workarounds
Keep both enable_local_fetch=False and enable_remote_fetch=False (defaults) when processing untrusted HTML documents.
References
- Initial fixes: v2.91.0
- Additional improvements: v2.94.0
Affected products
- Range: >=2.91.0, <2.94.0
Patches
eb6e1e66096b fix(html): add redirect validation to image fetching (#3407)
3 files changed · +68 −31
docling/backend/html_backend.py +22 −1 modified

    @@ -4323,7 +4323,28 @@ def _load_image_data(self, src_loc: str) -> Optional[bytes]:
                     max_size = self.options.max_remote_image_bytes
                     headers = {"Range": f"bytes=0-{max_size - 1}"}

    -                response = requests.get(
    +                # Create session with redirect limit
    +                session = requests.Session()
    +                session.max_redirects = self.options.max_redirects
    +
    +                # Hook to validate each redirect target
    +                def _check_redirect_safety(response, *args, **kwargs):
    +                    """Validate each redirect target before following it."""
    +                    if response.is_redirect or response.is_permanent_redirect:
    +                        redirect_url = response.headers.get("location")
    +                        if redirect_url:
    +                            # Handle relative redirects
    +                            if not redirect_url.startswith(("http://", "https://")):
    +                                from urllib.parse import urljoin
    +
    +                                redirect_url = urljoin(response.url, redirect_url)
    +
    +                            # Validate the redirect target
    +                            _validate_url_safety(redirect_url)
    +
    +                session.hooks["response"].append(_check_redirect_safety)
    +
    +                response = session.get(
                         src_loc, stream=True, headers=headers, timeout=(5, 30)
                     )
                     response.raise_for_status()
docling/datamodel/backend_options.py +5 −1 modified

    @@ -1,7 +1,7 @@
     from pathlib import Path, PurePath
     from typing import Annotated, Literal, Optional, Union

    -from pydantic import AnyUrl, BaseModel, Field, PositiveInt, SecretStr
    +from pydantic import AnyUrl, BaseModel, Field, PositiveInt, SecretStr, conint


     class BaseBackendOptions(BaseModel):
    @@ -98,6 +98,10 @@ class HTMLBackendOptions(BaseBackendOptions):
             20 * 1024 * 1024,  # 20 MB
             description="The maximum number of bytes for remote image downloads.",
         )
    +    max_redirects: Annotated[int, Field(ge=0)] = Field(
    +        5,
    +        description="Maximum number of HTTP redirects to follow when fetching remote resources. Set to 0 to disable redirects.",
    +    )


     class MarkdownBackendOptions(BaseBackendOptions):
tests/test_backend_html.py +41 −29 modified

    @@ -306,31 +306,37 @@ def test_e2e_html_conversion_with_images(mock_local, mock_remote):
                 num_pic += 1
         assert num_pic == 1, "No embedded picture was found in the converted file"

    -    # fetching image remotely
    -    mock_resp = Mock()
    -    mock_resp.status_code = 200
    -    mock_resp.headers = {}
    -    mock_resp.raise_for_status = Mock()
    -    mock_resp.iter_content = Mock(return_value=[img_bytes])
    -    mock_remote.return_value = mock_resp
    -    source_location = "https://example.com/example_01.html"
    -
    -    backend_options = HTMLBackendOptions(
    -        enable_remote_fetch=True, fetch_images=True, source_uri=source_location
    -    )
    -    converter = DocumentConverter(
    -        allowed_formats=[InputFormat.HTML],
    -        format_options={
    -            InputFormat.HTML: HTMLFormatOption(backend_options=backend_options)
    -        },
    -    )
    -    res_remote = converter.convert(source)
    -    mock_remote.assert_called_once_with(
    -        "https://example.com/example_image_01.png",
    -        stream=True,
    -        headers={"Range": "bytes=0-20971519"},  # 20 MB - 1
    -        timeout=(5, 30),
    -    )
    +    # fetching image remotely - need to mock Session.get instead of requests.get
    +    with patch(
    +        "docling.backend.html_backend.requests.Session.get"
    +    ) as mocked_session_get:
    +        mock_resp = Mock()
    +        mock_resp.status_code = 200
    +        mock_resp.headers = {}
    +        mock_resp.raise_for_status = Mock()
    +        mock_resp.iter_content = Mock(return_value=[img_bytes])
    +        mock_resp.is_redirect = False
    +        mock_resp.is_permanent_redirect = False
    +        mocked_session_get.return_value = mock_resp
    +        source_location = "https://example.com/example_01.html"
    +
    +        backend_options = HTMLBackendOptions(
    +            enable_remote_fetch=True, fetch_images=True, source_uri=source_location
    +        )
    +        converter = DocumentConverter(
    +            allowed_formats=[InputFormat.HTML],
    +            format_options={
    +                InputFormat.HTML: HTMLFormatOption(backend_options=backend_options)
    +            },
    +        )
    +        res_remote = converter.convert(source)
    +        # Verify the session.get was called
    +        assert mocked_session_get.call_count == 1
    +        call_args = mocked_session_get.call_args
    +        assert call_args[0][0] == "https://example.com/example_image_01.png"
    +        assert call_args[1]["stream"] is True
    +        assert call_args[1]["headers"] == {"Range": "bytes=0-20971519"}
    +        assert call_args[1]["timeout"] == (5, 30)
         assert res_remote.document
         num_pic = 0
         for element, _ in res_remote.document.iterate_items():
    @@ -448,16 +454,20 @@ def test_fetch_remote_images(monkeypatch):
                 InputFormat.HTML: HTMLFormatOption(backend_options=backend_options)
             },
         )
    -    with patch("docling.backend.html_backend.requests.get") as mocked_get:
    +    with patch(
    +        "docling.backend.html_backend.requests.Session.get"
    +    ) as mocked_session_get:
             # Mock the response to support the new streaming interface
             mock_resp = Mock()
             mock_resp.headers = {}
             mock_resp.raise_for_status = Mock()
             mock_resp.iter_content = Mock(return_value=[b"fake_image_data"])
    -        mocked_get.return_value = mock_resp
    +        mock_resp.is_redirect = False
    +        mock_resp.is_permanent_redirect = False
    +        mocked_session_get.return_value = mock_resp
             res = converter.convert(source)
    -        mocked_get.assert_called_once()
    +        mocked_session_get.assert_called_once()
         assert res.document

         # image fetching: all conditions apply, local fetching allowed
    @@ -792,7 +802,9 @@ def iter_content(self, chunk_size=8192):
         )

         oversized_response = MockResponse(25 * 1024 * 1024)  # 25 MB, exceeds 20 MB limit
    -    monkeypatch.setattr(requests, "get", lambda *args, **kwargs: oversized_response)
    +    monkeypatch.setattr(
    +        requests.Session, "get", lambda *args, **kwargs: oversized_response
    +    )

         with pytest.raises(ValueError, match="Resource size exceeds limit"):
             backend._load_image_data("http://example.com/huge_image.png")
2bb0fa67bd88 fix(html): improve local file path handling (#3400)
2 files changed · +181 −15
docling/backend/html_backend.py +47 −9 modified

    @@ -1304,19 +1304,53 @@ def _is_remote_url(value: str) -> bool:
             parsed = urlparse(value)
             return parsed.scheme in {"http", "https", "ftp", "s3", "gs"}

    +    @staticmethod
    +    def _is_local_path(value: str) -> bool:
    +        """Check if value is a local filesystem path (not a URI)."""
    +        parsed = urlparse(value)
    +        return not parsed.netloc and (
    +            not parsed.scheme
    +            or (len(parsed.scheme) == 1 and parsed.scheme.isalpha())  # Windows case
    +        )
    +
    +    def _is_absolute_path(self, loc: str) -> bool:
    +        return Path(loc).is_absolute() or (  # Windows-specific absolute paths:
    +            len((parsed_loc := urlparse(loc)).scheme) == 1
    +            and parsed_loc.scheme.isalpha()
    +            and not parsed_loc.netloc
    +        )
    +
         def _resolve_relative_path(self, loc: str) -> str:
    +        loc = loc.strip()
    +
    +        # Strip file:// prefix for validation as local path
    +        if loc.startswith(file_prefix := "file://"):
    +            loc = loc[len(file_prefix) :]
    +
             abs_loc = loc

             if self.base_path:
                 if loc.startswith("//"):
    -                # Protocol-relative URL - default to https
                     abs_loc = "https:" + loc
    -            elif not loc.startswith(("http://", "https://", "data:", "file://", "#")):
    -                if HTMLDocumentBackend._is_remote_url(self.base_path):  # remote fetch
    +            elif not loc.startswith(("http://", "https://", "data:", "#")):
    +                if HTMLDocumentBackend._is_remote_url(self.base_path):
                         abs_loc = urljoin(self.base_path, loc)
    -                elif self.base_path:  # local fetch
    -                    # For local files, resolve relative to the HTML file location
    -                    abs_loc = str(Path(self.base_path).parent / loc)
    +                elif HTMLDocumentBackend._is_local_path(self.base_path):
    +                    if self._is_absolute_path(loc):
    +                        raise ValueError(
    +                            f"Absolute paths are not allowed with local base_path: '{loc}'"
    +                        )
    +
    +                    base_dir = Path(self.base_path).parent.resolve()
    +                    resolved_path = (base_dir / loc).resolve()
    +
    +                    if not resolved_path.is_relative_to(base_dir):
    +                        raise ValueError(
    +                            f"Path traversal blocked: '{loc}' resolves outside base directory"
    +                        )
    +                    abs_loc = str(resolved_path)
    +                else:
    +                    raise ValueError(f"Invalid base_path format: '{self.base_path}'")

             _log.debug(f"Resolved location {loc} to {abs_loc}")
             return abs_loc
    @@ -4319,14 +4353,18 @@ def _load_image_data(self, src_loc: str) -> Optional[bytes]:

                     return decoded_data

    -            if src_loc.startswith("file://"):
    -                src_loc = src_loc[7:]
    -
                 if not self.options.enable_local_fetch:
                     raise OperationNotAllowed(
                         "Fetching local resources is only allowed when set explicitly. "
                         "Set options.enable_local_fetch=True."
                     )
    +
    +            # Require base_path for directory confinement (validation done in _resolve_relative_path)
    +            if not self.base_path:
    +                raise OperationNotAllowed(
    +                    f"Local file access requires base_path for directory confinement: '{src_loc}'"
    +                )
    +
                 if os.path.isfile(src_loc) and os.access(src_loc, os.R_OK):
                     with open(src_loc, "rb") as f:
                         return f.read()
tests/test_backend_html.py +134 −6 modified

    @@ -1,4 +1,5 @@
     import base64
    +import os
     import threading
     import time
     from io import BytesIO
    @@ -22,6 +23,7 @@
         SectionHeaderItem,
     )
     from docling.document_converter import DocumentConverter, HTMLFormatOption
    +from docling.exceptions import OperationNotAllowed

     from .test_data_gen_flag import GEN_TEST_DATA
     from .verify_utils import verify_document, verify_export
    @@ -64,7 +66,10 @@ def test_resolve_relative_path():
         assert html_doc._resolve_relative_path(relative_path) == expected_abs_loc

         absolute_path = "/absolute/path/to/file.html"
    -    assert html_doc._resolve_relative_path(absolute_path) == absolute_path
    +    with pytest.raises(
    +        ValueError, match="Absolute paths are not allowed with local base_path"
    +    ):
    +        html_doc._resolve_relative_path(absolute_path)

         html_doc.base_path = "http://my_host.com"
         protocol_relative_url = "//example.com/file.html"
    @@ -86,9 +91,16 @@ def test_resolve_relative_path():
         expected_abs_loc = "http://example.com/static/images/my_image.png"
         assert html_doc._resolve_relative_path(remote_relative_path) == expected_abs_loc

    +    # when base_path is None, paths pass through unchanged
    +    # (validation happens in _load_image_data for actual file access)
         html_doc.base_path = None
    -    relative_path = "subdir/file.html"
    -    assert html_doc._resolve_relative_path(relative_path) == relative_path
    +
    +    # Paths pass through _resolve_relative_path unchanged
    +    assert html_doc._resolve_relative_path("subdir/file.html") == "subdir/file.html"
    +
    +    # Remote URLs also pass through
    +    remote_url = "https://example.com/file.html"
    +    assert html_doc._resolve_relative_path(remote_url) == remote_url

         # Fragment-only hrefs must pass through unchanged
         html_doc.base_path = "/local/path/to/file.html"
    @@ -463,9 +475,8 @@ def test_fetch_remote_images(monkeypatch):
             pytest.warns(match="a bytes-like object is required"),
         ):
             res = converter.convert(source)
    -        mocked_open.assert_called_once_with(
    -            "tests/data/html/example_image_01.png", "rb"
    -        )
    +        expected_path = os.path.abspath("tests/data/html/example_image_01.png")
    +        mocked_open.assert_called_once_with(expected_path, "rb")
         assert res.document


    @@ -835,3 +846,120 @@ def test_anchor_fragment_links_with_source_uri():
             "[Example](https://example.com)" in md
             or "[Example](https://example.com/)" in md
         )
    +
    +
    +def test_path_traversal_blocked_in_resolve_relative_path():
    +    """Test that path traversal attempts are blocked."""
    +    html_path = Path("./tests/data/html/example_01.html")
    +    options = HTMLBackendOptions(enable_local_fetch=True, fetch_images=True)
    +    in_doc = InputDocument(
    +        path_or_stream=html_path,
    +        format=InputFormat.HTML,
    +        backend=HTMLDocumentBackend,
    +        filename="test",
    +    )
    +    html_doc = HTMLDocumentBackend(
    +        path_or_stream=html_path, in_doc=in_doc, options=options
    +    )
    +    html_doc.base_path = "/tmp/docs/report.html"
    +
    +    # Path traversal with ../ blocked
    +    with pytest.raises(ValueError, match="Path traversal blocked"):
    +        html_doc._resolve_relative_path("../../../../../../../etc/something")
    +
    +    with pytest.raises(ValueError, match="Path traversal blocked"):
    +        html_doc._resolve_relative_path("subdir/../../../../../../etc/something")
    +
    +    # Valid relative paths work
    +    result = html_doc._resolve_relative_path("images/photo.png")
    +    assert "/tmp/docs/images/photo.png" in result
    +    assert "etc" not in result
    +
    +    # Absolute paths blocked with local base_path
    +    with pytest.raises(
    +        ValueError, match="Absolute paths are not allowed with local base_path"
    +    ):
    +        html_doc._resolve_relative_path("/absolute/path/to/file.html")
    +
    +    # file:// URIs blocked
    +    with pytest.raises(
    +        ValueError, match="Absolute paths are not allowed with local base_path"
    +    ):
    +        html_doc._resolve_relative_path("file:///etc/something")
    +
    +    # Windows absolute paths blocked with local base_path (forward slashes)
    +    with pytest.raises(
    +        ValueError, match="Absolute paths are not allowed with local base_path"
    +    ):
    +        html_doc._resolve_relative_path("C:/Windows/System32/config/sam")
    +
    +    with pytest.raises(
    +        ValueError, match="Absolute paths are not allowed with local base_path"
    +    ):
    +        html_doc._resolve_relative_path("D:/sensitive/data.txt")
    +
    +    # Windows absolute paths with backslashes (native Windows separator)
    +    with pytest.raises(
    +        ValueError, match="Absolute paths are not allowed with local base_path"
    +    ):
    +        html_doc._resolve_relative_path(r"C:\Windows\System32\config\sam")
    +
    +    with pytest.raises(
    +        ValueError, match="Absolute paths are not allowed with local base_path"
    +    ):
    +        html_doc._resolve_relative_path(r"D:\Users\Foo\Documents\something.txt")
    +
    +    # Hypothetical single-letter URI schemes (c://, z://) should be rejected as URIs
    +    with pytest.raises(ValueError, match="Invalid base_path format"):
    +        html_doc.base_path = "c://example.com/path"
    +        html_doc._resolve_relative_path("image.png")
    +
    +    # Reset base_path for remaining tests
    +    html_doc.base_path = "/tmp/docs/report.html"
    +
    +    # Filesystem access blocked when base_path is None
    +    html_doc.base_path = None
    +
    +    # Paths pass through unchanged for hyperlinks
    +    assert (
    +        html_doc._resolve_relative_path("../../../etc/something")
    +        == "../../../etc/something"
    +    )
    +    assert html_doc._resolve_relative_path("/etc/something") == "/etc/something"
    +    assert html_doc._resolve_relative_path("image.png") == "image.png"
    +
    +    # But file access is blocked
    +    with pytest.raises(
    +        OperationNotAllowed, match="Local file access requires base_path"
    +    ):
    +        html_doc._load_image_data("../../../etc/something")
    +
    +    with pytest.raises(
    +        OperationNotAllowed, match="Local file access requires base_path"
    +    ):
    +        html_doc._load_image_data("/etc/something")
    +
    +    with pytest.raises(
    +        OperationNotAllowed, match="Local file access requires base_path"
    +    ):
    +        html_doc._load_image_data("image.png")
    +
    +
    +def test_valid_local_paths_still_work():
    +    """Test that valid paths within the base directory still work."""
    +    html_path = Path("./tests/data/html/example_01.html").resolve()
    +    options = HTMLBackendOptions(enable_local_fetch=True, fetch_images=True)
    +    in_doc = InputDocument(
    +        path_or_stream=html_path,
    +        format=InputFormat.HTML,
    +        backend=HTMLDocumentBackend,
    +        filename="test",
    +    )
    +    html_doc = HTMLDocumentBackend(
    +        path_or_stream=html_path, in_doc=in_doc, options=options
    +    )
    +    html_doc.base_path = str(html_path)
    +
    +    resolved = html_doc._resolve_relative_path("example_image_01.png")
    +    assert "tests/data/html" in resolved
    +    assert "example_image_01.png" in resolved
Vulnerability mechanics
Root cause
"The HTML backend failed to adequately validate URIs and paths, allowing for local file access, path traversal, and SSRF."
Attack vector
An attacker can trigger this vulnerability by providing specially crafted HTML content that includes malicious URIs. This can involve `file://` URIs to access local files when `enable_local_fetch` is true, or URIs with `../` sequences and absolute paths to traverse directories. Additionally, by exploiting the lack of validation for internal network resources when `enable_remote_fetch` is true, an attacker can initiate Server-Side Request Forgery (SSRF) attacks. HTTP redirects are not validated, potentially leading to unintended destinations, and there are no resource limits for remote image downloads or `data:` URIs, enabling denial-of-service conditions [CWE-73].
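The path traversal part of this attack can be illustrated with a small sketch of an unchecked join, similar in spirit to the pre-patch expression `str(Path(self.base_path).parent / loc)`. The function name `naive_join` is hypothetical, and the `posixpath.normpath` call is added here only to show the path the operating system would effectively open; it is not part of the original code.

```python
import posixpath
from pathlib import PurePosixPath


def naive_join(base_path: str, loc: str) -> str:
    """Join loc onto the directory of base_path without any validation.

    Hypothetical illustration of an unchecked resolution step. normpath is
    applied only to make the effective target visible.
    """
    joined = PurePosixPath(base_path).parent / loc
    return posixpath.normpath(str(joined))


if __name__ == "__main__":
    base = "/tmp/docs/report.html"
    # A benign relative reference stays inside /tmp/docs.
    print(naive_join(base, "images/a.png"))
    # "../" sequences climb out of the document directory.
    print(naive_join(base, "../../etc/hostname"))
    # An absolute path replaces the base directory entirely.
    print(naive_join(base, "/etc/hostname"))
```

Both the `../` case and the absolute-path case end up outside `/tmp/docs`, which is why the patch rejects absolute paths outright and checks containment after resolution.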
Affected code
The vulnerabilities reside in the HTML backend's resource handling logic, specifically in the functions that resolve relative paths and load image data. The `_resolve_relative_path` method in `docling/backend/html_backend.py` was modified to strip `file://` prefixes, reject absolute paths, and block traversal outside the base directory. The `_load_image_data` method in the same file no longer strips `file://` itself and now refuses local file access when no `base_path` is set [patch_id=4714023]. Remote fetching and redirect validation were improved in `docling/backend/html_backend.py` and `docling/datamodel/backend_options.py` [patch_id=4714024].
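The containment logic added to `_resolve_relative_path` can be condensed into a standalone sketch. The function name `resolve_within` is hypothetical; the checks mirror the patch (strip `file://`, reject absolute paths, resolve, then test containment) but omit the Windows drive-letter handling that the real code includes. `Path.is_relative_to` requires Python 3.9 or later.

```python
from pathlib import Path


def resolve_within(base_file: str, loc: str) -> str:
    """Resolve loc against the directory of base_file, refusing escapes.

    Hypothetical condensed version of the patched checks, for POSIX paths.
    """
    loc = loc.strip()

    # Treat file:// references as plain local paths so they get validated too.
    prefix = "file://"
    if loc.startswith(prefix):
        loc = loc[len(prefix):]

    if Path(loc).is_absolute():
        raise ValueError(
            f"Absolute paths are not allowed with local base_path: '{loc}'"
        )

    # resolve() collapses ".." segments and follows symlinks before the check.
    base_dir = Path(base_file).parent.resolve()
    resolved = (base_dir / loc).resolve()

    if not resolved.is_relative_to(base_dir):
        raise ValueError(
            f"Path traversal blocked: '{loc}' resolves outside base directory"
        )
    return str(resolved)


if __name__ == "__main__":
    base = "/tmp/docs/report.html"
    print(resolve_within(base, "images/photo.png"))
    for bad in ("../../../../etc/something", "file:///etc/something"):
        try:
            resolve_within(base, bad)
        except ValueError as exc:
            print("blocked:", exc)
```

Checking containment after `resolve()` rather than scanning the string for `../` is the important design choice: it also catches inputs such as `subdir/../../x`, where the traversal is not at the start of the path.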
What the fix does
The patches address the vulnerability by tightening input validation and resource handling. For local paths, absolute paths are blocked, and relative paths require `enable_local_fetch=True` and must resolve inside the directory of the configured `base_path`, which prevents traversal. `file://` prefixes are stripped and the remainder is treated as a local path. IP address validation is implemented to prevent SSRF. Each HTTP redirect target is validated before it is followed, the number of redirects is capped by the new `max_redirects` option (default 5), and connection and read timeouts are applied. Finally, size limits are enforced for remote images and `data:` URIs [patch_id=4714024, patch_id=4714023].
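The redirect handling can be sketched without any network access. The patch's hook joins a relative `Location` header onto the current URL and passes the result to `_validate_url_safety`, whose body is not shown in the diff. The function `resolve_redirect` below is a hypothetical stand-in that performs the same join and then applies only a scheme allowlist, which is an assumption about one check such a validator might make.

```python
from urllib.parse import urljoin, urlparse

# Assumption: only plain web schemes are acceptable redirect targets.
ALLOWED_SCHEMES = {"http", "https"}


def resolve_redirect(current_url: str, location: str) -> str:
    """Return the absolute redirect target, or raise if its scheme is not allowed.

    Hypothetical helper: urljoin leaves absolute targets unchanged and
    resolves relative ones against the URL that issued the redirect.
    """
    target = urljoin(current_url, location)
    scheme = urlparse(target).scheme.lower()
    if scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"Redirect to disallowed scheme: '{scheme}'")
    return target


if __name__ == "__main__":
    origin = "https://example.com/a/b.png"
    print(resolve_redirect(origin, "/c.png"))
    print(resolve_redirect(origin, "../img/c.png"))
    try:
        resolve_redirect(origin, "file:///etc/hostname")
    except ValueError as exc:
        print("blocked:", exc)
```

A complete validator would also apply the IP address checks to the resolved target, since a redirect from a public host to an internal address is a common way to bypass a check that only inspects the initial URL.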
Preconditions
- config: The `enable_local_fetch` option must be set to `True` to allow local file access.
- config: The `enable_remote_fetch` option must be set to `True` to allow remote resource fetching.
- input: The attacker must be able to control HTML content processed by the application.
Generated on Jun 3, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.
References
News mentions
- Docling Project: Eight High-Severity Vulnerabilities Disclosed Together, Vypr Intelligence · Jun 3, 2026