VYPR
Low severity · NVD Advisory · Published Feb 24, 2024 · Updated Apr 22, 2025

Server-side Request Forgery In Recursive URL Loader

CVE-2024-0243

Description

With the following crawler configuration:

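from bs4 import BeautifulSoup as Soup
from langchain_community.document_loaders.recursive_url_loader import RecursiveUrlLoader

# Seed URL; the crawler follows links found on this page up to max_depth.
url = "https://example.com"
loader = RecursiveUrlLoader(
    url=url, max_depth=2, extractor=lambda x: Soup(x, "html.parser").text
)
docs = loader.load()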

An attacker who controls the contents of https://example.com could place a malicious HTML file there containing links such as "https://example.completely.different/my_file.html", and the crawler would download that file as well, even though prevent_outside=True.

https://github.com/langchain-ai/langchain/blob/bf0b3cc0b5ade1fb95a5b1b6fa260e99064c2e22/libs/community/langchain_community/document_loaders/recursive_url_loader.py#L51-L51

Resolved in https://github.com/langchain-ai/langchain/pull/15559

AI Insight

LLM-synthesized narrative grounded in this CVE's description and references.

A flaw in LangChain's RecursiveUrlLoader allows SSRF by bypassing the prevent_outside check via crafted absolute links that share the base URL as a string prefix.

Vulnerability

Overview

CVE-2024-0243 is an SSRF vulnerability in the RecursiveUrlLoader component of the LangChain Python library. The bug resides in how the loader validates URLs when the prevent_outside=True option is set. The root cause is that the code only checks if a discovered link starts with the base URL string, rather than properly parsing and comparing the full URL components (scheme, netloc, path). This lenient check can be bypassed by a malicious actor who controls the content of the target page [1][2].
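The difference between the two kinds of check can be seen with the standard library alone. The following is a minimal sketch using the URLs from the description; the variable names are illustrative and do not come from the LangChain source.

```python
from urllib.parse import urlparse

base_url = "https://example.com"
link = "https://example.completely.different/my_file.html"

# String-prefix comparison: the attacker's link shares the base URL as a prefix.
prefix_ok = link.startswith(base_url)

# Component comparison: parse both URLs and compare the network location.
base_host = urlparse(base_url).netloc
link_host = urlparse(link).netloc
same_host = base_host == link_host

print(prefix_ok)  # True
print(same_host)  # False
```

The prefix test accepts the link, while the parsed hosts (example.com and example.completely.different) do not match.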

Exploitation

Mechanism

An attacker who controls the contents of the initially crawled page (e.g., https://example.com) can craft an HTML anchor tag with a link such as https://example.completely.different/my_file.html. Because the string "https://example.completely.different/my_file.html" starts with the base URL string "https://example.com", the old check passes. The loader would then fetch and process content from https://example.completely.different, a completely different domain. No authentication or special network position is required beyond control of the initial page [2]. The attack is triggered simply by parsing the HTML returned from the seed URL.
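The bypass can be reproduced with a simplified stand-in for the pre-patch filtering. This sketch follows the removed lines of the fix commit's diff, but it is not the library code: the function name extract_links_prepatch and the href regular expression are assumptions made for illustration.

```python
import re
from urllib.parse import urljoin, urlparse


def extract_links_prepatch(raw_html, url, base_url=None, prevent_outside=True):
    """Simplified stand-in for the pre-patch link filtering (illustrative only)."""
    base_url = base_url if base_url is not None else url
    # Hypothetical, simplified href matcher used only in this sketch.
    links = re.findall(r'href="([^"]+)"', raw_html)

    absolute_paths = set()
    for link in links:
        if link.startswith("http"):
            absolute_paths.add(link)
        elif link.startswith("//"):
            absolute_paths.add(f"{urlparse(url).scheme}:{link}")
        else:
            absolute_paths.add(urljoin(url, link))

    results = []
    for path in sorted(absolute_paths):
        # The flawed check: a plain string-prefix comparison.
        if prevent_outside and not path.startswith(base_url):
            continue
        results.append(path)
    return results


attacker_html = (
    '<a href="https://example.completely.different/my_file.html">payload</a>'
    '<a href="/about">about</a>'
    '<a href="https://other.org/">other</a>'
)
followed = extract_links_prepatch(attacker_html, "https://example.com")
print(followed)
```

In this sketch the link to other.org is filtered out, but the link to example.completely.different survives because it begins with the base URL string.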

Impact

Successful exploitation allows an attacker to make the LangChain process issue arbitrary HTTP requests to internal or external hosts. This Server-Side Request Forgery (SSRF) can lead to information disclosure, scanning of internal networks, or interaction with otherwise inaccessible services. The attacker gains the ability to read responses from those URLs and potentially use them in downstream LLM operations.

Mitigation

The vulnerability was resolved in commit bf0b3cc0b5ade1fb95a5b1b6fa260e99064c2e22, which parses each URL and compares the netloc (network location, i.e., hostname plus port when one is given) of the base URL against the netloc of each extracted link, skipping any link whose netloc differs. The original prefix check is still applied afterward, so a link with a different scheme or outside a more specific base path is also excluded. Users should update to a LangChain version that includes this fix (see pull request #15559) [1][4].
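The patched logic can be sketched as a standalone predicate. The helper name is_allowed is hypothetical; the two checks mirror the added lines in the fix commit's diff, and the candidate URLs are taken from the unit test included in that commit.

```python
from urllib.parse import urlparse


def is_allowed(path: str, base_url: str) -> bool:
    """Sketch of the post-patch prevent_outside check (not the library function)."""
    # Step 1: the network location (host and optional port) must match exactly.
    if urlparse(base_url).netloc != urlparse(path).netloc:
        return False
    # Step 2: the prefix check still applies, covering scheme and path.
    return path.startswith(base_url)


base = "https://foobar.com"
candidates = [
    "https://foobar.comic.com",
    "https://foobar.comic:9999",
    "https://foobar.com:9999",
    "http://foobar.com:9999/",
    "https://foobar.com/OK",
    "http://foobar.com/BAD",
]
allowed = [c for c in candidates if is_allowed(c, base)]
print(allowed)  # ['https://foobar.com/OK']
```

Only the link on the same host and scheme passes, which agrees with the expected result in the commit's test_prevent_outside test.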

AI Insight generated on May 20, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.

Affected packages

Versions sourced from the GitHub Security Advisory.

Package            Affected versions   Patched versions
langchain (PyPI)   < 0.1.0             0.1.0

Affected products

2
  • ghsa-coords
    Range: < 0.1.0
  • langchain-ai/langchain-ai/langchainv5
    Range: unspecified

Patches

1
bf0b3cc0b5ad

core[patch]: Further restrict recursive URL loader (#15559)

https://github.com/langchain-ai/langchain · Eugene Yurtsev · Jan 4, 2024 · via ghsa
2 files changed · +52 -11
  • libs/core/langchain_core/utils/html.py +25 -11 modified
    @@ -67,23 +67,37 @@ def extract_sub_links(
         Returns:
             List[str]: sub links
         """
    -    base_url = base_url if base_url is not None else url
    +    base_url_to_use = base_url if base_url is not None else url
    +    parsed_base_url = urlparse(base_url_to_use)
         all_links = find_all_links(raw_html, pattern=pattern)
         absolute_paths = set()
         for link in all_links:
    +        parsed_link = urlparse(link)
             # Some may be absolute links like https://to/path
    -        if link.startswith("http"):
    -            absolute_paths.add(link)
    +        if parsed_link.scheme == "http" or parsed_link.scheme == "https":
    +            absolute_path = link
             # Some may have omitted the protocol like //to/path
             elif link.startswith("//"):
    -            absolute_paths.add(f"{urlparse(url).scheme}:{link}")
    +            absolute_path = f"{urlparse(url).scheme}:{link}"
             else:
    -            absolute_paths.add(urljoin(url, link))
    -    res = []
    +            absolute_path = urljoin(url, parsed_link.path)
    +        absolute_paths.add(absolute_path)
    +
    +    results = []
         for path in absolute_paths:
    -        if any(path.startswith(exclude) for exclude in exclude_prefixes):
    -            continue
    -        if prevent_outside and not path.startswith(base_url):
    +        if any(path.startswith(exclude_prefix) for exclude_prefix in exclude_prefixes):
                 continue
    -        res.append(path)
    -    return res
    +
    +        if prevent_outside:
    +            parsed_path = urlparse(path)
    +
    +            if parsed_base_url.netloc != parsed_path.netloc:
    +                continue
    +
    +            # Will take care of verifying rest of path after netloc
    +            # if it's more specific
    +            if not path.startswith(base_url_to_use):
    +                continue
    +
    +        results.append(path)
    +    return results
    
  • libs/core/tests/unit_tests/utils/test_html.py +27 -0 modified
    @@ -156,3 +156,30 @@ def test_extract_sub_links_exclude() -> None:
             )
         )
         assert actual == expected
    +
    +
    +def test_prevent_outside() -> None:
    +    """Test that prevent outside compares against full base URL."""
    +    html = (
    +        '<a href="https://foobar.comic.com">BAD</a>'
    +        '<a href="https://foobar.comic:9999">BAD</a>'
    +        '<a href="https://foobar.com:9999">BAD</a>'
    +        '<a href="http://foobar.com:9999/">BAD</a>'
    +        '<a href="https://foobar.com/OK">OK</a>'
    +        '<a href="http://foobar.com/BAD">BAD</a>'  # Change in scheme is not OK here
    +    )
    +
    +    expected = sorted(
    +        [
    +            "https://foobar.com/OK",
    +        ]
    +    )
    +    actual = sorted(
    +        extract_sub_links(
    +            html,
    +            "https://foobar.com/hello/bill.html",
    +            base_url="https://foobar.com",
    +            prevent_outside=True,
    +        )
    +    )
    +    assert actual == expected
    

Vulnerability mechanics

Generated on May 9, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.

References

7
