VYPR
High severity · NVD Advisory · Published Sep 27, 2021 · Updated Aug 3, 2024

Inefficient Regular Expression Complexity in nltk/nltk

CVE-2021-3828

Description

nltk is vulnerable to Inefficient Regular Expression Complexity

AI Insight

LLM-synthesized narrative grounded in this CVE's description and references.

NLTK contains an inefficient regular expression in the Corpus Reader, leading to a ReDoS vulnerability that can cause excessive CPU consumption.

Vulnerability

NLTK (Natural Language Toolkit) prior to version 3.6.4 (the patched version listed in the GitHub Security Advisory) contains an inefficient regular expression: the KEYWORD pattern in nltk/corpus/reader/comparative_sents.py, which the Corpus Reader uses. The flaw is classified as a ReDoS (Regular Expression Denial of Service) because, on crafted input, matching time grows faster than linearly with input length. The vulnerable code path is reachable when a user processes text that contains a long run of opening parentheses, "(" [1][2][4].
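The super-linear growth can be illustrated with a simplified cost model. This is a sketch under assumptions, not a measurement of the regex engine: it assumes that the `.*` inside the pre-fix pattern's negative lookahead walks over every remaining character each time a match attempt starts at a "(". The function name `lookahead_scan_cost` is hypothetical; the payload shape is the one used in the fix commit's doctest.

```python
def lookahead_scan_cost(text):
    """Simplified model (an assumption, not engine instrumentation):
    for every "(" where a match attempt can start, the `.*` in the
    lookahead `(?!.*\\()` walks over all characters that follow it."""
    n = len(text)
    return sum(n - (i + 1) for i, ch in enumerate(text) if ch == "(")


def payload(size):
    # Same payload shape as the doctest in the fix commit.
    return "( " + "(" * size


short_cost = lookahead_scan_cost(payload(4000))
long_cost = lookahead_scan_cost(payload(40000))

# A 10x longer payload costs about 100x more under this model,
# i.e. quadratic rather than linear growth.
print(long_cost / short_cost)
```

Under this model the work grows roughly with the square of the input length, which is the same order as the roughly 80-fold slowdown the commit's doctest reports for a tenfold longer input.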

Exploitation

An attacker can exploit this vulnerability by supplying a crafted input string containing a long run of opening parentheses to an application that uses the NLTK Corpus Reader to process untrusted text. No authentication or special privileges are required; the attacker only needs to deliver the malicious input through any channel that reaches the affected component. The result is a significant slowdown: in the test added with the fix, a payload ten times longer took roughly 80 times as long to process before the fix, where linear behavior would give roughly 10 times [4].
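The slowdown can be measured locally with a sketch like the following, which copies both patterns from the fix commit's diff so that it runs without NLTK installed. The helper names `median_time` and `growth_ratio` and the default sizes are illustrative choices, smaller than the doctest's 4000 and 40000 to keep the run short. Timings vary by machine, so no particular ratio is claimed here.

```python
import re
import time

# Patterns copied from the fix commit's diff.
OLD_KEYWORD = re.compile(r"\((?!.*\()(.*)\)$")  # before the fix
NEW_KEYWORD = re.compile(r"\(([^\(]*)\)$")      # after the fix


def median_time(pattern, size, repeats=5):
    """Median wall-clock time of pattern.findall on the doctest-style payload."""
    text = "( " + "(" * size
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        pattern.findall(text)
        samples.append(time.perf_counter() - start)
    return sorted(samples)[repeats // 2]


def growth_ratio(pattern, short=500, long=5000):
    """How many times longer the long payload takes than the short one."""
    return median_time(pattern, long) / median_time(pattern, short)


if __name__ == "__main__":
    # Linear behavior would give a ratio near 10 for a 10x longer payload.
    print("pre-fix ratio: ", growth_ratio(OLD_KEYWORD))
    print("post-fix ratio:", growth_ratio(NEW_KEYWORD))
```

Taking the median rather than a single sample reduces the effect of scheduling noise, which matters when the short payload finishes in well under a millisecond.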

Impact

Successful exploitation leads to a denial of service condition, causing high CPU consumption and potentially making the application unresponsive. This is a resource exhaustion issue affecting availability. There is no evidence of information disclosure, data corruption, or remote code execution from this vulnerability [1][2].

Mitigation

The vulnerability is fixed by commit 277711ab1dec729e626b27aab6fa35ea5efbd7e6 (dated Sep 24, 2021), and the GitHub Security Advisory lists NLTK 3.6.4 as the patched version. Users should upgrade to NLTK 3.6.4 or later. There is no known workaround for earlier versions; limiting the processing of untrusted input may reduce risk but does not eliminate the vulnerability. The CVE is not listed in the CISA KEV catalog [2][3][4].
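A deployment check for the upgrade can be sketched as below. It compares a version string against 3.6.4, the patched version listed in the Affected packages table sourced from the GitHub Security Advisory. The helpers `parse_version` and `is_patched` are hypothetical names, and the parser is deliberately simple: it reads only leading digits of each dotted component, so a pre-release such as "3.6.4rc1" is treated the same as "3.6.4".

```python
import re

# Patched version as listed in the Affected packages table.
PATCHED = (3, 6, 4)


def parse_version(version):
    """Turn '3.6.4' into (3, 6, 4); stops at the first non-numeric component."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_patched(version):
    """True if the version is at or above the patched release."""
    parsed = parse_version(version)
    # Pad short versions such as '3.7' to three components before comparing.
    padded = parsed + (0,) * (len(PATCHED) - len(parsed))
    return padded >= PATCHED


print(is_patched("3.6.3"))
print(is_patched("3.6.4"))
```

In a real environment the installed version string could be obtained with `importlib.metadata.version("nltk")` (Python 3.8 or later) and passed to `is_patched`.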

AI Insight generated on May 21, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.

Affected packages

Versions sourced from the GitHub Security Advisory.

Package | Affected versions | Patched versions
nltk (PyPI) | < 3.6.4 | 3.6.4

Patches

1
277711ab1dec

Resolved ReDoS vulnerability in Corpus Reader (#2816)

https://github.com/nltk/nltk · Tom Aarsen · Sep 24, 2021 · via ghsa
2 files changed · +33 -1
  • nltk/corpus/reader/comparative_sents.py +1 -1 modified
    @@ -45,7 +45,7 @@
     GRAD_COMPARISON = re.compile(r"<cs-[123]>")
     NON_GRAD_COMPARISON = re.compile(r"<cs-4>")
     ENTITIES_FEATS = re.compile(r"(\d)_((?:[\.\w\s/-](?!\d_))+)")
    -KEYWORD = re.compile(r"\((?!.*\()(.*)\)$")
    +KEYWORD = re.compile(r"\(([^\(]*)\)$")
     
     
     class Comparison:
    
  • nltk/test/corpus.doctest +32 -0 modified
    @@ -2162,3 +2162,35 @@ access to its tuples() method
         >>> from nltk.corpus import qc
         >>> qc.tuples('test.txt')
         [('NUM:dist', 'How far is it from Denver to Aspen ?'), ('LOC:city', 'What county is Modesto , California in ?'), ...]
    +
    +Ensure that KEYWORD from `comparative_sents.py` no longer contains a ReDoS vulnerability.
    +
    +    >>> import re
    +    >>> import time
    +    >>> from nltk.corpus.reader.comparative_sents import KEYWORD
    +    >>> sizes = {
    +    ...     "short": 4000,
    +    ...     "long": 40000
    +    ... }
    +    >>> exec_times = {
    +    ...     "short": [],
    +    ...     "long": [],
    +    ... }
    +    >>> for size_name, size in sizes.items():
    +    ...     for j in range(9):
    +    ...         start_t = time.perf_counter()
    +    ...         payload = "( " + "(" * size
    +    ...         output = KEYWORD.findall(payload)
    +    ...         exec_times[size_name].append(time.perf_counter() - start_t)
    +    ...     exec_times[size_name] = sorted(exec_times[size_name])[4] # Get the mean
    +
    +Ideally, the execution time of such a regular expression is linear
    +in the length of the input. As such, we would expect exec_times["long"]
    +to be roughly 10 times as big as exec_times["short"].
    +With the ReDoS in place, it took roughly 80 times as long.
    +For now, we accept values below 30 (times as long), due to the potential
    +for variance. This ensures that the ReDoS has certainly been reduced,
    +if not removed.
    +
    +    >>> exec_times["long"] / exec_times["short"] < 30
    +    True
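
The patch replaces the negative lookahead with a negated character class. A quick way to see that this keeps the extracted keyword the same on ordinary single-line input is to run both patterns side by side. The sketch below copies the two patterns from the diff above; the sample strings are invented for illustration and are not taken from the corpus.

```python
import re

# Patterns copied from the diff above.
OLD_KEYWORD = re.compile(r"\((?!.*\()(.*)\)$")  # removed line
NEW_KEYWORD = re.compile(r"\(([^\(]*)\)$")      # added line

# Hypothetical sample lines, not taken from the corpus.
samples = [
    "camera A is better than camera B (better)",
    "(first) then (second)",
    "no keyword here",
    "unbalanced ( paren",
]

# Map each sample to the (old, new) findall results.
results = {
    s: (OLD_KEYWORD.findall(s), NEW_KEYWORD.findall(s)) for s in samples
}

for text, (old, new) in results.items():
    print(repr(text), old, new)
```

Both patterns capture the contents of the last parenthesized group at the end of the line. They are not identical on every input: `.` does not match a newline by default while `[^\(]` does, so results can differ on multi-line strings.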
    

Vulnerability mechanics

Generated on May 9, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.

References

6
