Crawlee for Python: SSRF via sitemap-derived URLs
Description
Overview
- Vulnerability type: Blind SSRF
- Affected components: src/crawlee/_utils/sitemap.py, src/crawlee/_utils/robots.py, src/crawlee/request_loaders/_sitemap_request_loader.py, and all built-in HTTP clients.
- Trigger: an attacker-controlled sitemap or robots.txt containing a URL that points to an internal host (layer 1) or uses a non-http scheme (layer 2).
Two-layer SSRF via sitemap-derived URLs:
1) Cross-host HTTP SSRF
Base case, affects every HTTP client. Sitemap entries and robots.txt Sitemap: directives were accepted regardless of the host they pointed to. A sitemap on example.com could push http://internal.corp/admin into the crawler's queue, and the configured HTTP client would dispatch the request.
2) Non-HTTP scheme SSRF
Escalation, only CurlImpersonateHttpClient. Nested-sitemap fetching dispatches the URL straight to the HTTP client, bypassing the Request construction step where Pydantic enforces http(s). Combined with the libcurl-backed CurlImpersonateHttpClient, this lets gopher://, file://, dict://, ftp://, etc., through.
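To make the two layers concrete, the following sketch parses a hostile sitemap index with the standard library and labels each entry by the layer it would exercise. It is not Crawlee code: the XML payload, the host names, and the `classify` and `classify_sitemap` helpers are illustrative assumptions. Only the element names come from the public sitemaps.org schema.

```python
from urllib.parse import urlsplit
from xml.etree import ElementTree

# Hypothetical sitemap index served by an attacker-controlled example.com.
HOSTILE_SITEMAP = """
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
  <sitemap><loc>http://internal.corp/admin</loc></sitemap>
  <sitemap><loc>gopher://internal.corp:6379/</loc></sitemap>
</sitemapindex>
"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def classify(entry_url: str, parent_url: str) -> str:
    """Label an entry by the SSRF layer it would exercise (illustrative only)."""
    entry, parent = urlsplit(entry_url), urlsplit(parent_url)
    if entry.scheme not in ("http", "https"):
        return "layer 2: non-http(s) scheme"
    if entry.hostname != parent.hostname:
        return "layer 1: cross-host http(s)"
    return "same host"


def classify_sitemap(xml_text: str, parent_url: str) -> dict[str, str]:
    """Extract every nested-sitemap location and classify it against the parent URL."""
    root = ElementTree.fromstring(xml_text.strip())
    locs = [el.text.strip() for el in root.findall("sm:sitemap/sm:loc", NS) if el.text]
    return {loc: classify(loc, parent_url) for loc in locs}


report = classify_sitemap(HOSTILE_SITEMAP, "https://example.com/sitemap.xml")
for url, label in report.items():
    print(f"{label:30} {url}")
```

Only the first entry stays on the parent's host. The second is the layer 1 case, and the third is the layer 2 case that matters when a libcurl-backed client is configured.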
Root cause
Crawlee already validates URL schemes through Pydantic's AnyHttpUrl (via validate_http_url in src/crawlee/_utils/urls.py) wherever a crawl target is materialised as a Request: the Request.url field is declared as Annotated[str, BeforeValidator(validate_http_url), Field(frozen=True)]. Anything that becomes a Request is therefore guaranteed to be http(s).
Two parts of the sitemap pipeline sidestepped this property in different ways:
1) Sitemap-derived URLs were enqueued without any host policy
SitemapRequestLoader took every entry, wrapped it in Request.from_url (which accepts any valid http(s) URL), and pushed the result into the request queue. RobotsTxtFile.get_sitemaps() returned every Sitemap: directive verbatim. Neither imposed any host check against the parent sitemap or robots.txt URL, so an attacker controlling that content could push internal-network HTTP URLs into the queue and have them crawled by whichever HTTP client was configured.
2) Nested sitemap fetching bypassed the Request chokepoint entirely
When _XmlSitemapParser encountered …, or when RobotsTxtFile.parse_sitemaps forwarded Sitemap: directives into the same pipeline, _fetch_and_process_sitemap dispatched the URL directly to the HTTP client:
    async with http_client.stream(
        sitemap_url,
        method='GET',
        headers=SITEMAP_HEADERS,
        proxy_info=proxy_info,
        timeout=timeout,
    ) as response:
        ...
No Request was constructed, so the Pydantic validator never ran. Before the fix, the HTTP clients' own send_request() and stream() methods did not call validate_http_url either, so a non-http(s) scheme could pass straight through to the backend client.
The non-HTTP escalation in layer 2 is specific to CurlImpersonateHttpClient, which is backed by curl-cffi / libcurl and speaks gopher, file, dict, ftp, and other non-HTTP protocols. The other clients shipped with Crawlee (HttpxHttpClient, ImpitHttpClient, PlaywrightHttpClient) reject non-http(s) schemes at their own backend layer, regardless of what Crawlee passes in, so they were only affected by layer 1.
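The difference between the validated path and the nested-sitemap path can be modelled in a few lines. This is a simplified stand-in, assuming a frozen dataclass in place of Crawlee's Pydantic model and a recording stub in place of a real HTTP client. `check_http_scheme`, `Request`, `StubClient`, `enqueue`, and `fetch_nested_sitemap` are hypothetical names used for illustration only.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


def check_http_scheme(url: str) -> str:
    """Stand-in for the scheme check: accept only http(s) URLs that have a host."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"not an http(s) URL: {url!r}")
    return url


@dataclass(frozen=True)
class Request:
    """Toy request model: validation happens at construction time."""

    url: str

    def __post_init__(self) -> None:
        check_http_scheme(self.url)


@dataclass
class StubClient:
    """Records every URL it is asked to fetch instead of touching the network."""

    seen: list[str] = field(default_factory=list)

    def stream(self, url: str) -> None:
        self.seen.append(url)


def enqueue(url: str, client: StubClient) -> None:
    # Regular path: the URL must survive Request construction first.
    client.stream(Request(url).url)


def fetch_nested_sitemap(url: str, client: StubClient) -> None:
    # Pre-fix nested path: the raw string goes straight to the client.
    client.stream(url)


client = StubClient()
try:
    enqueue("file:///etc/passwd", client)
except ValueError as exc:
    print("rejected at Request:", exc)

fetch_nested_sitemap("file:///etc/passwd", client)
print("reached client:", client.seen)
```

In this model the `file://` URL never reaches the client through `enqueue`, but it does through `fetch_nested_sitemap`. That is the gap described above: what happens next depends entirely on which schemes the backend is willing to speak.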
Vulnerable paths
Layer 1 — cross-host HTTP (all HTTP clients)
- *Source:* an attacker-controlled sitemap that lists internal URLs in its entries, or an attacker-controlled robots.txt that lists internal URLs under Sitemap:.
- *Sink:* the configured HTTP client issues GET requests against those URLs, either via client.request(url=request.url, …) inside crawl() for regular sitemap URLs, or via client.stream(url, …) inside the nested-sitemap fetch.
Layer 2 — non-HTTP schemes (CurlImpersonateHttpClient only)
- *Source:* a nested-sitemap entry or a robots.txt Sitemap: directive pointing to a non-http(s) URL.
- *Sink:* CurlImpersonateHttpClient.stream(...) hands the URL string verbatim to client.request(url=…, …), which dispatches via libcurl.
Hardening in 1.7.0 was added at both producer and consumer ends — see *Remediation*.
Exploitation preconditions
- The crawler uses sitemap loading: any of SitemapRequestLoader, Sitemap.load/parse_sitemap, discover_valid_sitemaps, or RobotsTxtFile.parse_sitemaps.
- The attacker controls the body of a sitemap or robots.txt that the crawler fetches, typically by being the target site, or by getting a target site to publish a malicious sitemap.
- The crawler's network egress can reach the attacker-chosen destination (e.g., internal services on the same network).
- The targeted endpoint accepts unauthenticated requests. Crawlee does not supply credentials to the forged destination, so authenticated services (IMDSv2 with token, password-protected Redis, protected admin panels) are not reachable through this path.
For layer 2 (non-HTTP), the configured HTTP client must additionally be CurlImpersonateHttpClient.
Impact
Layer 1 — cross-host HTTP (any client)
The crawler can be coerced into issuing GET requests against internal HTTP services on its own network: admin panels, unauthenticated internal APIs, cloud metadata endpoints, etc. Read-back is blind — Crawlee surfaces fetched content only through its local Dataset / KeyValueStore (push_data() etc.) and does not natively forward scraped bodies anywhere external — so direct impact is mostly existence/timing probing and occasional state changes via side-effecting GET endpoints. Read-side leakage of internal content is only exploitable end-to-end if the deployer's own application separately exposes scraped data (for example, a public summariser or aggregator built on top of Crawlee).
Layer 2 — non-HTTP escalation (only CurlImpersonateHttpClient)
Under the affected client, attackers gain the libcurl scheme set:
- gopher:// is the canonical RESP-injection vector: pipeline FLUSHALL, CONFIG SET dir, CONFIG SET dbfilename, SAVE to an unauthenticated Redis on the crawler's network, which is enough to write attacker-controlled bytes to disk and, in the standard escalation, achieve remote code execution on the Redis host.
- file:// allows the crawler to read local files (application secrets, configuration) on the crawler host.
- dict:// and ftp:// permit fingerprinting and limited interaction with text-protocol services.
In both layers, the SSRF is blind in the default configuration. Write-side impact (gopher:// → Redis) and timing-based internal probing do not depend on read-back and remain viable regardless of whether the deployer surfaces scraped content.
Remediation
Both layers are fixed in crawlee==1.7.0. The fix is split across two PRs, applied at the two complementary boundaries of the affected pipeline:
- Producer-side filtering: sitemap and robots.txt loaders (PR #1864). SitemapRequestLoader and RobotsTxtFile.get_sitemaps() now run every nested-sitemap entry, every regular sitemap URL, and every Sitemap: directive through crawlee._utils.urls.filter_url. This applies an EnqueueStrategy (default 'same-hostname') against the parent sitemap / robots.txt URL, so cross-host entries are dropped, and rejects non-http(s) schemes. The strategy is stamped onto the emitted Requests, so BasicCrawler._check_url_after_redirects continues policing the policy across redirects.
- Consumer-side validation: HTTP-client boundary (PR #1862). validate_http_url(url) is now called at the top of send_request() and stream() in ImpitHttpClient, HttpxHttpClient, CurlImpersonateHttpClient, and PlaywrightHttpClient. Non-http(s) schemes raise pydantic.ValidationError before any backend call. crawl() was already covered, because Request.url is validated by Pydantic on construction.
After these changes, validation is enforced both where sitemap-derived HTTP requests are produced (sitemap and robots.txt loaders) and where they are consumed (HTTP clients). A regression at either layer is caught by the other.
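As a rough illustration of the two checkpoints, the sketch below applies a same-hostname and scheme filter where URLs are produced, then repeats the scheme check where they are consumed. It uses only the standard library. `filter_same_hostname` and `guarded_fetch` are hypothetical helpers and do not reproduce the actual `filter_url` or HTTP-client code.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = ("http", "https")


def filter_same_hostname(candidates: list[str], parent_url: str) -> list[str]:
    """Producer side: keep http(s) entries on the parent's hostname, drop the rest."""
    parent_host = urlsplit(parent_url).hostname
    kept = []
    for url in candidates:
        parts = urlsplit(url)
        if parts.scheme in ALLOWED_SCHEMES and parts.hostname == parent_host:
            kept.append(url)
    return kept


def guarded_fetch(url: str, fetch):
    """Consumer side: refuse non-http(s) URLs before calling the backend."""
    if urlsplit(url).scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"refusing non-http(s) URL: {url!r}")
    return fetch(url)


entries = [
    "https://example.com/page-1",
    "http://internal.corp/admin",
    "file:///etc/hostname",
]
safe = filter_same_hostname(entries, "https://example.com/sitemap.xml")
print(safe)

# Even if a producer regressed and let a bad URL through,
# the consumer-side check would still stop it.
try:
    guarded_fetch("file:///etc/hostname", fetch=lambda u: u)
except ValueError as exc:
    print("blocked:", exc)
```

The `fetch` callable stands in for whatever backend the client wraps. The design point is that neither check relies on the other having run.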
Behaviour change for upgraders
SitemapRequestLoader and RobotsTxtFile.get_sitemaps() now default to enqueue_strategy='same-hostname'. Deployers that legitimately relied on cross-host sitemap entries (e.g., a sitemap index on sitemaps.example.com that points to content on www.example.com) must opt in explicitly with enqueue_strategy='same-domain' or enqueue_strategy='all'.
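How the three strategies treat that example can be approximated as follows. This is a simplified sketch: the strategy names come from the text above, but `matches_strategy` and `naive_domain` are hypothetical helpers. The registrable domain is naively taken as the last two labels of the hostname, which is wrong for suffixes such as `co.uk`; a real implementation would need the Public Suffix List.

```python
from urllib.parse import urlsplit


def naive_domain(hostname: str) -> str:
    """Approximate the registrable domain as the last two labels (assumption)."""
    return ".".join(hostname.split(".")[-2:])


def matches_strategy(target_url: str, origin_url: str, strategy: str) -> bool:
    """Return True if target_url would be kept under the given strategy."""
    target, origin = urlsplit(target_url), urlsplit(origin_url)
    if target.scheme not in ("http", "https") or not target.hostname:
        return False  # non-http(s) URLs are rejected under every strategy
    if strategy == "all":
        return True
    if strategy == "same-hostname":
        return target.hostname == origin.hostname
    if strategy == "same-domain":
        return naive_domain(target.hostname) == naive_domain(origin.hostname or "")
    raise ValueError(f"unknown strategy: {strategy!r}")


index = "https://sitemaps.example.com/index.xml"
page = "https://www.example.com/products"

for strategy in ("same-hostname", "same-domain", "all"):
    print(strategy, matches_strategy(page, index, strategy))
```

Under this model the default drops the `www.example.com` entry and `'same-domain'` keeps it. `'all'` also keeps unrelated hosts such as `internal.corp`, so it is presumably only appropriate for sitemaps whose content the deployer trusts.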
Finder credits
- @r0otsu
- @Yuremin (Zhengmin Yu)
- @FORIMOC
- @invoke1442 (Ethan Carter)
- @Arturo0x90 (Arturo Melgarejo)
AI Insight
LLM-synthesized narrative grounded in this CVE's description and references.
Blind SSRF in Crawlee's sitemap processing allows injection of internal HTTP URLs or non-HTTP schemes via malicious sitemap/robots.txt.
What is the vulnerability?
Crawlee's sitemap processing contains a blind Server-Side Request Forgery (SSRF) vulnerability. The root cause is that sitemap-derived URLs and Sitemap: directives from robots.txt are accepted without any host policy or sufficient scheme validation. While normal Request objects enforce HTTP/HTTPS schemes via Pydantic, two code paths bypass this: SitemapRequestLoader and RobotsTxtFile.get_sitemaps() do not restrict URLs to the same host, and nested-sitemap fetching skips Request construction entirely, which lets non-HTTP schemes through when CurlImpersonateHttpClient is the configured client [1][2].
How is it exploited?
An attacker who controls the content of a sitemap or robots.txt (e.g., by hosting a malicious sitemap) can include URLs pointing to internal hosts (e.g., http://internal.corp/admin) or, when the crawler uses CurlImpersonateHttpClient, URLs with non-HTTP schemes such as gopher://, file://, dict://, or ftp://. The crawler then fetches these URLs, potentially revealing internal information or interacting with internal services [1][2].
Impact
A successful blind SSRF attack can allow an attacker to probe internal networks, access sensitive data, or interact with internal services normally unreachable. With non-HTTP schemes, the attacker may achieve local file reads or exploit other protocols. The vulnerability is classified as low severity, but information disclosure is possible [1].
Mitigation
The issue is addressed in security advisory GHSA-3r75-xc34-5f44. Users should update to a patched version of Crawlee. The fix adds host policy validation for sitemap-derived URLs and ensures non-HTTP schemes are blocked even in nested sitemaps [1][2].
AI Insight generated on May 21, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.
Affected products
- Range: >= 1.0.0, < 1.7.0