VYPR
Moderate severity · NVD Advisory · Published Mar 5, 2026 · Updated Mar 6, 2026

lxml_html_clean: <base> tag injection through default Cleaner configuration

CVE-2026-28350

Description

lxml_html_clean is a project for HTML cleaning functionalities copied from lxml.html.clean. Prior to version 0.4.4, the <base> tag passes through the default Cleaner configuration. While page_structure=True removes html, head, and title tags, there is no specific handling for <base>, allowing an attacker to inject it and hijack relative links on the page. This issue has been patched in version 0.4.4.

AI Insight

LLM-synthesized narrative grounded in this CVE's description and references.

In lxml_html_clean prior to 0.4.4, the <base> tag is not removed by the default Cleaner configuration, allowing attackers to hijack relative links on the page.

Vulnerability

In lxml_html_clean before version 0.4.4, the default Cleaner configuration does not strip <base> tags from HTML input. The page_structure=True setting removes <html>, <head>, and <title> tags, but <base> is not included in that "kill set" and passes through unchanged [1][3]. The <base> tag defines a base URL against which relative URLs in the document are resolved. Since browsers accept <base> even when it appears outside the <head> element, the Cleaner's failure to remove it creates a security gap [3].

Exploitation

An attacker who can inject arbitrary HTML into a page processed by the default Cleaner (e.g., through a comment or content field) can include a <base> tag. After cleaning, that tag remains in the output and changes the base URL for all relative links on the page [3]. The official proof of concept demonstrates that calling clean_html on input that pairs an attacker-controlled <base> tag with an "Account" link produces a result still carrying the attacker's <base> tag [3]. No special authentication or network position is required beyond the ability to submit HTML that the library cleans.
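As an illustration of how a surviving <base> tag could be spotted in cleaner output, the following is a minimal sketch using only the Python standard library. The names BaseTagFinder and find_base_hrefs, and the sample string, are hypothetical and are not part of lxml_html_clean; the sample merely imitates output in which an injected tag was left in place.

```python
from html.parser import HTMLParser


class BaseTagFinder(HTMLParser):
    """Collects the href of every <base> tag, wherever it appears."""

    def __init__(self):
        super().__init__()
        self.base_hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <BASE HREF=...>
        # is matched as well. A <base> without href is recorded as None.
        if tag == "base":
            self.base_hrefs.append(dict(attrs).get("href"))


def find_base_hrefs(html):
    """Return the href values of all <base> tags found in the markup."""
    finder = BaseTagFinder()
    finder.feed(html)
    finder.close()
    return finder.base_hrefs


# Hypothetical cleaner output that still carries an injected <base> tag.
leaked = '<div><base href="http://evil.com/"><a href="page.html">link</a></div>'
print(find_base_hrefs(leaked))
```

A check like this could run on sanitized output as a regression guard. It is not a substitute for upgrading.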

Impact

With the injected <base> tag, relative URLs, including navigation links, form actions, script src attributes, and image src attributes, are all resolved relative to the attacker's domain. This enables three primary attack vectors: phishing and redirection (stealing credentials via fake login forms), stored Cross-Site Scripting (XSS) when relative JavaScript paths are loaded from the attacker's server, and UI defacement (replacing images or stylesheets) [3].
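The effect on URL resolution can be approximated with urllib.parse.urljoin, which follows the same general relative-reference rules that browsers apply. This is a sketch only: the page URL and the relative paths are hypothetical, and the attacker base mirrors the evil.com placeholder used in the patch's tests.

```python
from urllib.parse import urljoin

# Hypothetical page address and the base URL an injected <base href> would set.
page_url = "https://app.example/settings/profile"
attacker_base = "http://evil.com/"

# Relative references of the kind found in links, forms, scripts and images,
# plus one absolute URL for comparison.
references = ["login", "/account", "static/app.js", "https://cdn.example/lib.js"]


def resolve_all(base, refs):
    """Resolve each reference against the given base URL."""
    return {ref: urljoin(base, ref) for ref in refs}


without_base = resolve_all(page_url, references)
with_base = resolve_all(attacker_base, references)

for ref in references:
    print(f"{ref!r}: {without_base[ref]} -> {with_base[ref]}")
```

Only the absolute URL resolves to the same address in both cases; every relative reference moves to the attacker's host.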

Mitigation

The issue is patched in version 0.4.4 of lxml_html_clean. In the fix, <base> tags are removed whenever page_structure=True (the default) and also when the <head> tag is explicitly listed in remove_tags or kill_tags [2]. Users should upgrade to 0.4.4 or later. The library's maintainers note that the lxml_html_clean Cleaner is not recommended for security-sensitive environments in general and suggest alternatives such as nh3 for stricter sanitization [4].
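One way to confirm that an environment has the fix is to compare the installed version against 0.4.4. The sketch below uses only the standard library. The helpers parse_release, is_patched and installed_is_patched are hypothetical, and the distribution name is taken from the PyPI package name listed in this advisory. The parser is deliberately simple and ignores pre-release suffixes, so a full version parser such as the one in the packaging project would be a better fit for production checks.

```python
from importlib.metadata import PackageNotFoundError, version

PATCHED = (0, 4, 4)  # first fixed release according to the advisory


def parse_release(text):
    """Turn '0.4.3' into (0, 4, 3), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_patched(version_text):
    """True if the numeric release is at least 0.4.4."""
    release = parse_release(version_text)
    # Pad short versions so that '0.5' compares like '0.5.0'.
    release += (0,) * (len(PATCHED) - len(release))
    return release >= PATCHED


def installed_is_patched(dist_name="lxml-html-clean"):
    """Check the installed distribution; None means it is not installed."""
    try:
        return is_patched(version(dist_name))
    except PackageNotFoundError:
        return None


print(installed_is_patched())
```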

AI Insight generated on May 18, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.

Affected packages

Versions sourced from the GitHub Security Advisory.

Package | Affected versions | Patched versions
lxml-html-clean (PyPI) | < 0.4.4 | 0.4.4

Affected products

2

Patches

1
9c5612ca33b9

Remove <base> tags to prevent URL hijacking attacks

https://github.com/fedora-python/lxml_html_clean · Lumir Balhar · Feb 25, 2026 · via ghsa
3 files changed · +59 -0
  • CHANGES.rst · +5 -0 · modified
    @@ -12,6 +12,11 @@ Bugs fixed
     * Fixed a bug where Unicode escapes in CSS were not properly decoded
       before security checks. This prevents attackers from bypassing filters
       using escape sequences.
    +* Fixed a security issue where ``<base>`` tags could be used for URL
    +  hijacking attacks. The ``<base>`` tag is now automatically removed
    +  whenever the ``<head>`` tag is removed (via ``page_structure=True``
    +  or manual configuration), as ``<base>`` must be inside ``<head>``
    +  according to HTML specifications.
     
     0.4.3 (2025-10-02)
     ==================
    
  • lxml_html_clean/clean.py · +6 -0 · modified
    @@ -422,6 +422,12 @@ def __call__(self, doc):
             if self.annoying_tags:
                 remove_tags.update(('blink', 'marquee'))
     
    +        # Remove <base> tags whenever <head> is being removed.
    +        # According to HTML spec, <base> must be in <head>, but browsers
    +        # may interpret it even when misplaced, allowing URL hijacking attacks.
    +        if 'head' in kill_tags or 'head' in remove_tags:
    +            kill_tags.add('base')
    +
             _remove = deque()
             _kill = deque()
             for el in doc.iter():
    
  • tests/test_clean.py · +48 -0 · modified
    @@ -394,6 +394,54 @@ def test_possibly_invalid_url_without_whitelist(self):
             self.assertNotIn("google.com", result)
             self.assertNotIn("example.com", result)
     
    +    def test_base_tag_removed_with_page_structure(self):
    +        # Test that <base> tags are removed when page_structure=True (default)
    +        # This prevents URL hijacking attacks where <base> redirects all relative URLs
    +
    +        test_cases = [
    +            # <base> in proper location (inside <head>)
    +            '<html><head><base href="http://evil.com/"></head><body><a href="page.html">link</a></body></html>',
    +            # <base> outside <head>
    +            '<div><base href="http://evil.com/"><a href="page.html">link</a></div>',
    +            # Multiple <base> tags
    +            '<base href="http://evil.com/"><div><base href="http://evil2.com/"></div>',
    +            # <base> with target attribute
    +            '<base target="_blank"><div>content</div>',
    +            # <base> at various positions
    +            '<html><base href="http://evil.com/"><body>test</body></html>',
    +        ]
    +
    +        for html in test_cases:
    +            with self.subTest(html=html):
    +                cleaned = clean_html(html)
    +                # Verify <base> tag is completely removed
    +                self.assertNotIn('base', cleaned.lower())
    +                self.assertNotIn('evil.com', cleaned)
    +                self.assertNotIn('evil2.com', cleaned)
    +
    +    def test_base_tag_kept_when_page_structure_false(self):
    +        # When page_structure=False and head is not removed, <base> should be kept
    +        cleaner = Cleaner(page_structure=False)
    +        html = '<html><head><base href="http://example.com/"></head><body>test</body></html>'
    +        cleaned = cleaner.clean_html(html)
    +        self.assertIn('<base href="http://example.com/">', cleaned)
    +
    +    def test_base_tag_removed_when_head_in_remove_tags(self):
    +        # Even with page_structure=False, <base> should be removed if head is manually removed
    +        cleaner = Cleaner(page_structure=False, remove_tags=['head'])
    +        html = '<html><head><base href="http://evil.com/"></head><body>test</body></html>'
    +        cleaned = cleaner.clean_html(html)
    +        self.assertNotIn('base', cleaned.lower())
    +        self.assertNotIn('evil.com', cleaned)
    +
    +    def test_base_tag_removed_when_head_in_kill_tags(self):
    +        # Even with page_structure=False, <base> should be removed if head is in kill_tags
    +        cleaner = Cleaner(page_structure=False, kill_tags=['head'])
    +        html = '<html><head><base href="http://evil.com/"></head><body>test</body></html>'
    +        cleaned = cleaner.clean_html(html)
    +        self.assertNotIn('base', cleaned.lower())
    +        self.assertNotIn('evil.com', cleaned)
    +
         def test_unicode_escape_in_style(self):
             # Test that CSS Unicode escapes are properly decoded before security checks
             # This prevents attackers from bypassing filters using escape sequences
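
The rule added in clean.py can be modelled in isolation. The sketch below is a simplified stand-in rather than the library's code: effective_tag_sets and its patched flag are hypothetical, and placing html, head and title in the remove set when page_structure=True is an assumption about code that the diff does not show. Only the final condition is taken from the patch.

```python
def effective_tag_sets(page_structure=True, remove_tags=(), kill_tags=(),
                       patched=True):
    """Simplified model of tag selection; returns (remove_set, kill_set)."""
    remove = set(remove_tags)
    kill = set(kill_tags)

    if page_structure:
        # Assumption: the structural tags named in the advisory go into
        # the remove set.
        remove.update(("head", "html", "title"))

    # Rule from the patch: drop <base> whenever <head> is being dropped.
    if patched and ("head" in kill or "head" in remove):
        kill.add("base")

    return remove, kill


# Default configuration, before and after the fix.
print("base" in effective_tag_sets(patched=False)[1])
print("base" in effective_tag_sets(patched=True)[1])
```

In this model, as in the patch's tests, page_structure=False with no manual head entry leaves <base> in place, so configurations that keep <head> still need their own handling of <base>.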
    

Vulnerability mechanics

Generated on May 9, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.

References

4
