Inefficient Regular Expression Complexity in nltk/nltk
Description
nltk is vulnerable to Inefficient Regular Expression Complexity
AI Insight
LLM-synthesized narrative grounded in this CVE's description and references.
NLTK contains an inefficient regular expression in the Corpus Reader, leading to a ReDoS vulnerability that can cause excessive CPU consumption.
Vulnerability
NLTK (Natural Language Toolkit) prior to version 3.6.4 contains an inefficient regular expression: the KEYWORD pattern in the comparative_sents.py module, used by the Corpus Reader. The flaw is classified as a ReDoS (Regular Expression Denial of Service) because matching time grows faster than linearly with input length on crafted input. The vulnerable code path is reachable when a user processes text that includes a long run of opening parentheses ("(") [1][2][4].
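As a minimal sketch of what the KEYWORD pattern does, the snippet below compiles the pre-fix and post-fix expressions exactly as they appear in the fix commit's diff and applies both to an ordinary, well-formed string. The helper `last_group` and the sample sentence are hypothetical and only serve as illustration; they are not part of NLTK.

```python
import re

# Both patterns are copied from the fix commit's diff.
OLD_KEYWORD = re.compile(r"\((?!.*\()(.*)\)$")  # pre-fix: lookahead rescans the rest of the string
NEW_KEYWORD = re.compile(r"\(([^\(]*)\)$")      # post-fix: character class, no lookahead


def last_group(pattern, text):
    """Return the last captured group found by pattern.findall, or None."""
    matches = pattern.findall(text)
    return matches[-1] if matches else None


# Hypothetical input: the final parenthesised group sits at the end of the line.
sample = "camera (better) than phone (picture quality)"

old_result = last_group(OLD_KEYWORD, sample)
new_result = last_group(NEW_KEYWORD, sample)
print(old_result, "|", new_result)
```

On this input both patterns capture the contents of the final parenthesised group. The difference lies in how much work the old pattern does at every "(" it encounters, since its negative lookahead has to scan the remainder of the string each time.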
Exploitation
An attacker can exploit this vulnerability by supplying a crafted string containing a large number of opening parentheses to an application that uses the NLTK Corpus Reader to process untrusted text. No authentication or special privileges are required; the attacker only needs to deliver the input through any channel that the affected component processes. The result is a significant slowdown: in the regression test added with the fix, increasing the payload tenfold (from 4,000 to 40,000 parentheses) took roughly 80 times as long before the fix, where linear behavior would predict about 10 times [4].
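The slowdown can be observed with a small timing sketch modelled on the doctest added in the fix commit. The payload shape (`"( " + "(" * size`) comes from that doctest; the helper names `median_time` and `growth_ratio` are hypothetical, and the default sizes are scaled down from the doctest's 4,000 and 40,000 so that the pre-fix pattern finishes quickly. Measured ratios depend on the machine and interpreter, so no specific numbers are claimed here.

```python
import re
import time

OLD_KEYWORD = re.compile(r"\((?!.*\()(.*)\)$")  # pre-fix pattern from the diff
NEW_KEYWORD = re.compile(r"\(([^\(]*)\)$")      # post-fix pattern from the diff


def median_time(pattern, payload, repeats=5):
    """Median wall-clock time of pattern.findall(payload) over several runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        pattern.findall(payload)
        samples.append(time.perf_counter() - start)
    return sorted(samples)[len(samples) // 2]


def growth_ratio(pattern, short_size=1000, long_size=10000):
    """How many times longer the long payload takes than the short one."""
    short_payload = "( " + "(" * short_size
    long_payload = "( " + "(" * long_size
    return median_time(pattern, long_payload) / median_time(pattern, short_payload)


if __name__ == "__main__":
    # A ratio near 10 suggests linear scaling; a much larger ratio suggests super-linear scaling.
    print("old pattern ratio:", growth_ratio(OLD_KEYWORD))
    print("new pattern ratio:", growth_ratio(NEW_KEYWORD))
```

Neither pattern matches this payload, because it contains no closing parenthesis; the cost comes entirely from failed match attempts.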
Impact
Successful exploitation leads to a denial of service condition, causing high CPU consumption and potentially making the application unresponsive. This is a resource exhaustion issue affecting availability. There is no evidence of information disclosure, data corruption, or remote code execution from this vulnerability [1][2].
Mitigation
The vulnerability is fixed by commit 277711ab1dec729e626b27aab6fa35ea5efbd7e6, and the GitHub Security Advisory lists NLTK 3.6.4 as the patched version. Users should upgrade to NLTK 3.6.4 or later. There is no known workaround for earlier versions; limiting the processing of untrusted input may reduce risk but does not eliminate the vulnerability. The CVE is not listed in the CISA KEV catalog [2][3][4].
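One way to check whether an environment needs the upgrade is to compare the installed nltk version against the patched version. The sketch below assumes 3.6.4 as the threshold, taken from the Affected packages table of the GitHub Security Advisory. The helpers `parse_release`, `is_patched`, and `installed_nltk_is_patched` are hypothetical names, and the parsing is deliberately simplified: it looks only at the leading numeric components and ignores pre-release or local version suffixes.

```python
import re
from importlib import metadata

# Assumption: patched version as listed in the GitHub Security Advisory table.
PATCHED = (3, 6, 4)


def parse_release(version):
    """Extract leading numeric components, e.g. '3.6.4' -> (3, 6, 4)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def is_patched(version, patched=PATCHED):
    """True if the version's numeric release is at or above the patched release."""
    release = parse_release(version)
    width = max(len(release), len(patched))
    # Pad with zeros so that '3.7' compares as (3, 7, 0).
    padded_release = release + (0,) * (width - len(release))
    padded_patched = patched + (0,) * (width - len(patched))
    return padded_release >= padded_patched


def installed_nltk_is_patched():
    """Return True/False for the installed nltk, or None if nltk is not installed."""
    try:
        return is_patched(metadata.version("nltk"))
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("nltk patched:", installed_nltk_is_patched())
```

For production use, a full version parser (for example one that understands pre-release tags) would be a safer choice than this simplified comparison.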
AI Insight generated on May 21, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.
Affected packages
Versions sourced from the GitHub Security Advisory.
| Package | Affected versions | Patched versions |
|---|---|---|
| nltk (PyPI) | < 3.6.4 | 3.6.4 |
Affected products
- ghsa-coords (3 versions), range: < 3.6.4 and 2 more
  - pkg:pypi/nltk
  - pkg:rpm/opensuse/python-nltk&distro=openSUSE%20Tumbleweed
  - pkg:rpm/suse/python-nltk&distro=SUSE%20Package%20Hub%2015%20SP2
- (no CPE) range: < 3.6.4
- (no CPE) range: < 3.7-1.1
- (no CPE) range: < 3.7-bp152.3.3.1
- nltk/nltk/nltk v5 range: unspecified
Patches
277711ab1dec: Resolved ReDoS vulnerability in Corpus Reader (#2816)
2 files changed · +33 −1
nltk/corpus/reader/comparative_sents.py (+1 −1, modified)

    @@ -45,7 +45,7 @@
     GRAD_COMPARISON = re.compile(r"<cs-[123]>")
     NON_GRAD_COMPARISON = re.compile(r"<cs-4>")
     ENTITIES_FEATS = re.compile(r"(\d)_((?:[\.\w\s/-](?!\d_))+)")
    -KEYWORD = re.compile(r"\((?!.*\()(.*)\)$")
    +KEYWORD = re.compile(r"\(([^\(]*)\)$")


     class Comparison:

nltk/test/corpus.doctest (+32 −0, modified)

    @@ -2162,3 +2162,35 @@ access to its tuples() method
         >>> from nltk.corpus import qc
         >>> qc.tuples('test.txt')
         [('NUM:dist', 'How far is it from Denver to Aspen ?'), ('LOC:city', 'What county is Modesto , California in ?'), ...]
    +
    +Ensure that KEYWORD from `comparative_sents.py` no longer contains a ReDoS vulnerability.
    +
    +    >>> import re
    +    >>> import time
    +    >>> from nltk.corpus.reader.comparative_sents import KEYWORD
    +    >>> sizes = {
    +    ...     "short": 4000,
    +    ...     "long": 40000
    +    ... }
    +    >>> exec_times = {
    +    ...     "short": [],
    +    ...     "long": [],
    +    ... }
    +    >>> for size_name, size in sizes.items():
    +    ...     for j in range(9):
    +    ...         start_t = time.perf_counter()
    +    ...         payload = "( " + "(" * size
    +    ...         output = KEYWORD.findall(payload)
    +    ...         exec_times[size_name].append(time.perf_counter() - start_t)
    +    ...     exec_times[size_name] = sorted(exec_times[size_name])[4] # Get the mean
    +
    +Ideally, the execution time of such a regular expression is linear
    +in the length of the input. As such, we would expect exec_times["long"]
    +to be roughly 10 times as big as exec_times["short"].
    +With the ReDoS in place, it took roughly 80 times as long.
    +For now, we accept values below 30 (times as long), due to the potential
    +for variance. This ensures that the ReDoS has certainly been reduced,
    +if not removed.
    +
    +    >>> exec_times["long"] / exec_times["short"] < 30
    +    True
Vulnerability mechanics
Generated on May 9, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.
References
- github.com/advisories/GHSA-2ww3-fxvq-293j (ghsa, ADVISORY)
- nvd.nist.gov/vuln/detail/CVE-2021-3828 (ghsa, ADVISORY)
- github.com/nltk/nltk/commit/277711ab1dec729e626b27aab6fa35ea5efbd7e6 (ghsa, x_refsource_MISC, WEB)
- github.com/nltk/nltk/pull/2816 (ghsa, WEB)
- github.com/pypa/advisory-database/tree/main/vulns/nltk/PYSEC-2021-356.yaml (ghsa, WEB)
- huntr.dev/bounties/d19aed43-75bc-4a03-91a0-4d0bb516bc32 (ghsa, x_refsource_CONFIRM, WEB)