lxml_html_clean: <base> tag injection through default Cleaner configuration
Description
lxml_html_clean is a project for HTML cleaning functionalities copied from lxml.html.clean. Prior to version 0.4.4, the tag passes through the default Cleaner configuration. While page_structure=True removes html, head, and title tags, there is no specific handling for , allowing an attacker to inject it and hijack relative links on the page. This issue has been patched in version 0.4.4.
AI Insight
LLM-synthesized narrative grounded in this CVE's description and references.
In lxml_html_clean prior to 0.4.4, the tag is not removed by the default Cleaner configuration, allowing attackers to hijack relative links on the page.
Vulnerability
In lxml_html_clean before version 0.4.4, the default Cleaner configuration does not strip ` tags from HTML input. The page_structure=True setting removes , , and tags, but is not included in that "kill set" and passes through unchanged [1][3]. The tag defines a base URL against which relative URLs in the document are resolved. Since browsers accept even when it appears outside the ` element, the Cleaner's failure to remove it creates a security gap [3].
Exploitation
An attacker who can inject arbitrary HTML into a page processed by the default Cleaner (e.g., through a comment or content field) can include a ` tag. After cleaning, that tag remains in the output and changes the base URL for all relative links on the page [3]. The official proof of concept demonstrates that clean_html('Account') produces a result carrying the attacker's ` [3]. No special authentication or network position is required beyond the ability to submit HTML that the library cleans.
Impact
With the injected `, relative URLs—including navigation links, form actions, script src attributes, and image src` attributes—are all resolved relative to the attacker's domain. This enables three primary attack vectors: phishing and redirection (stealing credentials via fake login forms), stored Cross-Site Scripting (XSS) when relative JavaScript paths are loaded from the attacker's server, and UI defacement (replacing images or stylesheets) [3].
Mitigation
The issue is patched in version 0.4.4 of lxml_html_clean. In the fix, ` tags are removed whenever page_structure=True (the default) and also when a tag is explicitly listed in remove_tags [2]. Users should upgrade to 0.4.4 or later. The library's maintainers note that lxml_html-0 Cleaner is not recommended for security-sensitive environments in general, suggesting alternative like nh3` for stricter sanitization [4].
AI Insight generated on May 18, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.
Affected packages
Versions sourced from the GitHub Security Advisory.
| Package | Affected versions | Patched versions |
|---|---|---|
lxml-html-cleanPyPI | < 0.4.4 | 0.4.4 |
Affected products
2- Range: <0.4.4
- fedora-python/lxml_html_cleanv5Range: < 0.4.4
Patches
19c5612ca33b9Remove <base> tags to prevent URL hijacking attacks
3 files changed · +59 −0
CHANGES.rst+5 −0 modified@@ -12,6 +12,11 @@ Bugs fixed * Fixed a bug where Unicode escapes in CSS were not properly decoded before security checks. This prevents attackers from bypassing filters using escape sequences. +* Fixed a security issue where ``<base>`` tags could be used for URL + hijacking attacks. The ``<base>`` tag is now automatically removed + whenever the ``<head>`` tag is removed (via ``page_structure=True`` + or manual configuration), as ``<base>`` must be inside ``<head>`` + according to HTML specifications. 0.4.3 (2025-10-02) ==================
lxml_html_clean/clean.py+6 −0 modified@@ -422,6 +422,12 @@ def __call__(self, doc): if self.annoying_tags: remove_tags.update(('blink', 'marquee')) + # Remove <base> tags whenever <head> is being removed. + # According to HTML spec, <base> must be in <head>, but browsers + # may interpret it even when misplaced, allowing URL hijacking attacks. + if 'head' in kill_tags or 'head' in remove_tags: + kill_tags.add('base') + _remove = deque() _kill = deque() for el in doc.iter():
tests/test_clean.py+48 −0 modified@@ -394,6 +394,54 @@ def test_possibly_invalid_url_without_whitelist(self): self.assertNotIn("google.com", result) self.assertNotIn("example.com", result) + def test_base_tag_removed_with_page_structure(self): + # Test that <base> tags are removed when page_structure=True (default) + # This prevents URL hijacking attacks where <base> redirects all relative URLs + + test_cases = [ + # <base> in proper location (inside <head>) + '<html><head><base href="http://evil.com/"></head><body><a href="page.html">link</a></body></html>', + # <base> outside <head> + '<div><base href="http://evil.com/"><a href="page.html">link</a></div>', + # Multiple <base> tags + '<base href="http://evil.com/"><div><base href="http://evil2.com/"></div>', + # <base> with target attribute + '<base target="_blank"><div>content</div>', + # <base> at various positions + '<html><base href="http://evil.com/"><body>test</body></html>', + ] + + for html in test_cases: + with self.subTest(html=html): + cleaned = clean_html(html) + # Verify <base> tag is completely removed + self.assertNotIn('base', cleaned.lower()) + self.assertNotIn('evil.com', cleaned) + self.assertNotIn('evil2.com', cleaned) + + def test_base_tag_kept_when_page_structure_false(self): + # When page_structure=False and head is not removed, <base> should be kept + cleaner = Cleaner(page_structure=False) + html = '<html><head><base href="http://example.com/"></head><body>test</body></html>' + cleaned = cleaner.clean_html(html) + self.assertIn('<base href="http://example.com/">', cleaned) + + def test_base_tag_removed_when_head_in_remove_tags(self): + # Even with page_structure=False, <base> should be removed if head is manually removed + cleaner = Cleaner(page_structure=False, remove_tags=['head']) + html = '<html><head><base href="http://evil.com/"></head><body>test</body></html>' + cleaned = cleaner.clean_html(html) + self.assertNotIn('base', cleaned.lower()) + self.assertNotIn('evil.com', cleaned) + + def test_base_tag_removed_when_head_in_kill_tags(self): + # Even with page_structure=False, <base> should be removed if head is in kill_tags + cleaner = Cleaner(page_structure=False, kill_tags=['head']) + html = '<html><head><base href="http://evil.com/"></head><body>test</body></html>' + cleaned = cleaner.clean_html(html) + self.assertNotIn('base', cleaned.lower()) + self.assertNotIn('evil.com', cleaned) + def test_unicode_escape_in_style(self): # Test that CSS Unicode escapes are properly decoded before security checks # This prevents attackers from bypassing filters using escape sequences
Vulnerability mechanics
Generated on May 9, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.
References
4- github.com/advisories/GHSA-xvp8-3mhv-424cghsaADVISORY
- nvd.nist.gov/vuln/detail/CVE-2026-28350ghsaADVISORY
- github.com/fedora-python/lxml_html_clean/commit/9c5612ca33b941eec4178abf8a5294b103403f34ghsax_refsource_MISCWEB
- github.com/fedora-python/lxml_html_clean/security/advisories/GHSA-xvp8-3mhv-424cghsax_refsource_CONFIRMWEB
News mentions
0No linked articles in our index yet.