lxml_html_clean: CSS @import Filter Bypass via Unicode Escapes
Description
lxml_html_clean is a project for HTML cleaning functionalities copied from lxml.html.clean. Prior to version 0.4.4, the _has_sneaky_javascript() method strips backslashes before checking for dangerous CSS keywords. This causes CSS Unicode escape sequences to bypass the @import and expression() filters, allowing external CSS loading or XSS in older browsers. This issue has been patched in version 0.4.4.
AI Insight
LLM-synthesized narrative grounded in this CVE's description and references.
lxml_html_clean prior to 0.4.4 allows CSS @import and expression() filter bypass via Unicode escape sequences due to backslash stripping in _has_sneaky_javascript(), enabling external CSS loading and potential XSS in older browsers.
The vulnerability lies in the _has_sneaky_javascript() method of lxml_html_clean, which strips backslashes before checking for dangerous CSS keywords [1]. Stripping the backslash leaves the hex digits of a CSS Unicode escape behind as literal text (for example, \69 encodes i), so @\69mport becomes @69mport and no longer matches the blacklisted keyword @import. Browsers, however, decode the escapes as the CSS specification requires and treat the same input as a valid @import rule [3]. The same mechanism bypasses the expression() filter, which is relevant only for Internet Explorer.
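The mismatch between the filter's view and the browser's view can be sketched in a few lines. This is a minimal illustration, not the library's code: `legacy_normalize` is a hypothetical stand-in for the pre-0.4.4 normalization, and the whitespace removal and plain substring check are simplifying assumptions.

```python
import re

# Hypothetical stand-in for the pre-0.4.4 normalization steps (not the library's code).
def legacy_normalize(style: str) -> str:
    style = style.replace("\\", "")     # backslash stripping, as described above
    style = re.sub(r"\s+", "", style)   # assumed whitespace removal
    return style.lower()

payload = r"@\69mport url(evil.css)"
normalized = legacy_normalize(payload)

print(normalized)                # the hex digits survive as literal text
print("@import" in normalized)   # the keyword check does not match
print(chr(0x69))                 # the character a CSS parser reads for \69
```

The filter inspects `@69mporturl(evil.css)`, while a CSS parser resolves `\69` to `i` and sees an `@import` rule.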
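To exploit the flaw, an attacker submits HTML whose `style` attribute or `<style>` element carries an escaped keyword, for example an @import written as `@\69mport` or an escaped `expression()` call (the latter only works in IE) [2][3]. No authentication is required beyond enticing a user to view the crafted content.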
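For regression testing a cleaner against this class of input, the escaped keywords can be generated rather than typed by hand. The helper below is a hypothetical sketch (`css_hex_escape` is not part of lxml_html_clean); it writes each character as a backslash followed by its hex code point, the same form used in the patch's test cases.

```python
def css_hex_escape(keyword: str) -> str:
    """Return keyword with every character written as a CSS hex escape."""
    # Each character becomes a backslash plus its code point in lowercase hex.
    return "".join(f"\\{ord(ch):x}" for ch in keyword)

print(css_hex_escape("expression"))
print(css_hex_escape("javascript"))
```

This short form is only unambiguous when the character after an escape is not itself a hex digit or a space. The CSS escape syntax described in the patch allows one whitespace character to end an escape.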
Successful exploitation allows an attacker to load arbitrary external stylesheets. This can be used for data exfiltration via attribute selectors (e.g., reading CSRF tokens), UI redressing, or phishing. In older browsers (IE), the expression() bypass enables full cross-site scripting (XSS) [3].
The issue has been patched in version 0.4.4. The fix implements proper decoding of CSS Unicode escape sequences before applying the blacklist filters [2]. Users should upgrade to the latest version. The project maintainers also note that the HTML cleaner is not considered appropriate for security-sensitive environments and recommend alternatives like nh3 [4].
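A quick way to see whether an environment still carries an affected release is to compare the installed version against 0.4.4. The sketch below uses only the standard library; the helper names are hypothetical, the distribution name is taken from the affected-packages table, and the parser assumes a plain dotted release number (it ignores pre-release and post-release suffixes).

```python
import re
from importlib.metadata import PackageNotFoundError, version

PATCHED = (0, 4, 4)  # first fixed release according to the advisory

def parse_release(text: str) -> tuple:
    """Take the first three numeric components of a version string."""
    return tuple(int(n) for n in re.findall(r"\d+", text)[:3])

def is_patched(installed: str) -> bool:
    return parse_release(installed) >= PATCHED

try:
    installed = version("lxml-html-clean")
    state = "patched" if is_patched(installed) else "affected, upgrade to 0.4.4 or later"
    print(installed, state)
except PackageNotFoundError:
    print("lxml-html-clean is not installed")
```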
AI Insight generated on May 18, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.
Affected packages
Versions sourced from the GitHub Security Advisory.
| Package | Affected versions | Patched versions |
|---|---|---|
| lxml-html-clean (PyPI) | < 0.4.4 | 0.4.4 |
Affected products
- Range: < 0.4.4
- fedora-python/lxml_html_clean (v5), Range: < 0.4.4
Patches
2ef732667ddb: Implement unicode escape decoding
3 files changed · +76 −1
CHANGES.rst (+7 −0, modified)

    @@ -6,6 +6,13 @@ lxml_html_clean changelog
     Unreleased
     ==========
     
    +Bugs fixed
    +----------
    +
    +* Fixed a bug where Unicode escapes in CSS were not properly decoded
    +  before security checks. This prevents attackers from bypassing filters
    +  using escape sequences.
    +
     0.4.3 (2025-10-02)
     ==================
lxml_html_clean/clean.py (+21 −1, modified)

    @@ -578,6 +578,26 @@ def _remove_javascript_link(self, link):
         _comments_re = re.compile(r'/\*.*?\*/', re.S)
         _find_comments = _comments_re.finditer
         _substitute_comments = _comments_re.sub
    +    _css_unicode_escape_re = re.compile(r'\\([0-9a-fA-F]{1,6})\s?')
    +
    +    def _decode_css_unicode_escapes(self, style):
    +        """
    +        Decode CSS Unicode escape sequences like \\69 or \\000069 to their
    +        actual character values. This prevents bypassing security checks
    +        using CSS escape sequences.
    +
    +        CSS escape syntax: backslash followed by 1-6 hex digits,
    +        optionally followed by a whitespace character.
    +        """
    +        def replace_escape(match):
    +            hex_value = match.group(1)
    +            try:
    +                return chr(int(hex_value, 16))
    +            except (ValueError, OverflowError):
    +                # Invalid unicode codepoint, keep original
    +                return match.group(0)
    +
    +        return self._css_unicode_escape_re.sub(replace_escape, style)
     
         def _has_sneaky_javascript(self, style):
             """
    @@ -591,7 +611,7 @@ def _has_sneaky_javascript(self, style):
             more sneaky attempts.
             """
             style = self._substitute_comments('', style)
    -        style = style.replace('\\', '')
    +        style = self._decode_css_unicode_escapes(style)
             style = _substitute_whitespace('', style)
             style = style.lower()
             if _has_javascript_scheme(style):
tests/test_clean.py (+48 −0, modified)

    @@ -393,3 +393,51 @@ def test_possibly_invalid_url_without_whitelist(self):
             self.assertEqual(len(w), 0)
             self.assertNotIn("google.com", result)
             self.assertNotIn("example.com", result)
    +
    +    def test_unicode_escape_in_style(self):
    +        # Test that CSS Unicode escapes are properly decoded before security checks
    +        # This prevents attackers from bypassing filters using escape sequences
    +        # CSS escape syntax: \HHHHHH where H is a hex digit (1-6 digits)
    +
    +        # Test inline style attributes (requires safe_attrs_only=False)
    +        cleaner = Cleaner(safe_attrs_only=False)
    +        inline_style_cases = [
    +            # \6a\61\76\61\73\63\72\69\70\74 = "javascript"
    +            ('<div style="background: url(\\6a\\61\\76\\61\\73\\63\\72\\69\\70\\74:alert(1))">test</div>', '<div>test</div>'),
    +            # \69 = 'i', so \69mport = "import"
    +            ('<div style="@\\69mport url(evil.css)">test</div>', '<div>test</div>'),
    +            # \69 with space after = 'i', space consumed as part of escape
    +            ('<div style="@\\69 mport url(evil.css)">test</div>', '<div>test</div>'),
    +            # \65\78\70\72\65\73\73\69\6f\6e = "expression"
    +            ('<div style="\\65\\78\\70\\72\\65\\73\\73\\69\\6f\\6e(alert(1))">test</div>', '<div>test</div>'),
    +        ]
    +
    +        for html, expected in inline_style_cases:
    +            with self.subTest(html=html):
    +                cleaned = cleaner.clean_html(html)
    +                self.assertEqual(expected, cleaned)
    +
    +        # Test <style> tag content (uses default clean_html)
    +        style_tag_cases = [
    +            # Unicode-escaped "javascript:" in url()
    +            '<style>url(\\6a\\61\\76\\61\\73\\63\\72\\69\\70\\74:alert(1))</style>',
    +            # Unicode-escaped "javascript:" without url()
    +            '<style>\\6a\\61\\76\\61\\73\\63\\72\\69\\70\\74:alert(1)</style>',
    +            # Unicode-escaped "expression"
    +            '<style>\\65\\78\\70\\72\\65\\73\\73\\69\\6f\\6e(alert(1))</style>',
    +            # Unicode-escaped @import with 'i'
    +            '<style>@\\69mport url(evil.css)</style>',
    +            # Unicode-escaped "data:" scheme
    +            '<style>url(\\64\\61\\74\\61:image/svg+xml;base64,PHN2ZyBvbmxvYWQ9YWxlcnQoMSk+)</style>',
    +            # Space after escape is consumed: \69 mport = "import"
    +            '<style>@\\69 mport url(evil.css)</style>',
    +            # 6-digit escape: \000069 = 'i'
    +            '<style>@\\000069mport url(evil.css)</style>',
    +            # 6-digit escape with space
    +            '<style>@\\000069 mport url(evil.css)</style>',
    +        ]
    +
    +        for html in style_tag_cases:
    +            with self.subTest(html=html):
    +                cleaned = clean_html(html)
    +                self.assertEqual('<div><style>/* deleted */</style></div>', cleaned)
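The decoding step from the patch can be tried outside the library. The sketch below copies the patch's regular expression and replacement logic into a standalone function; the `BLOCKED` keyword list and the `looks_dangerous` helper are hypothetical simplifications for illustration, and the library's real checks are broader.

```python
import re

# Same pattern as the patch: backslash, 1-6 hex digits, one optional whitespace character.
CSS_UNICODE_ESCAPE_RE = re.compile(r"\\([0-9a-fA-F]{1,6})\s?")

def decode_css_unicode_escapes(style: str) -> str:
    def replace_escape(match):
        try:
            return chr(int(match.group(1), 16))
        except (ValueError, OverflowError):
            # Invalid code point: keep the original text, as the patch does.
            return match.group(0)
    return CSS_UNICODE_ESCAPE_RE.sub(replace_escape, style)

# Hypothetical keyword list, for illustration only.
BLOCKED = ("@import", "expression(", "javascript:")

def looks_dangerous(style: str) -> bool:
    decoded = decode_css_unicode_escapes(style)
    decoded = re.sub(r"\s+", "", decoded).lower()
    return any(keyword in decoded for keyword in BLOCKED)

samples = [
    r"@\69mport url(evil.css)",
    r"@\69 mport url(evil.css)",
    r"@\000069mport url(evil.css)",
    r"\65\78\70\72\65\73\73\69\6f\6e(alert(1))",
    "color: red",
]
for sample in samples:
    print(looks_dangerous(sample), sample)
```

The four escaped samples, taken from the patch's test cases, are flagged once the escapes are decoded, while the plain `color: red` declaration is not.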
Vulnerability mechanics
Generated on May 9, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.
References
- github.com/advisories/GHSA-hw26-mmpg-fqfg (ghsa, ADVISORY)
- nvd.nist.gov/vuln/detail/CVE-2026-28348 (ghsa, ADVISORY)
- github.com/fedora-python/lxml_html_clean/commit/2ef732667ddbc74ea59847bcf24b75809aaeed3b (ghsa, x_refsource_MISC, WEB)
- github.com/fedora-python/lxml_html_clean/security/advisories/GHSA-hw26-mmpg-fqfg (ghsa, x_refsource_CONFIRM, WEB)