VYPR
High severity · NVD Advisory · Published Jan 4, 2022 · Updated Aug 3, 2024

Inefficient Regular Expression Complexity in nltk/nltk

CVE-2021-3842

Description

nltk is vulnerable to Inefficient Regular Expression Complexity

AI Insight

LLM-synthesized narrative grounded in this CVE's description and references.

NLTK's RegexpTagger contains a ReDoS vulnerability due to an unescaped dot in a regex pattern, enabling denial of service via crafted input.

Vulnerability

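NLTK versions before 3.6.6 contain an inefficient regular expression in the RegexpTagger docstring example and in tagger definitions that reuse it (nltk/parse/malt.py, nltk/sem/glue.py, nltk/tbl/demo.py, and the doctest in nltk/tag/brill_trainer.py). Specifically, the pattern ^-?[0-9]+(.[0-9]+)?$ uses an unescaped dot (.) that matches any character except a newline, including a digit. The two [0-9]+ groups can therefore split a run of digits in many different ways, which causes excessive backtracking on crafted inputs [1][3]. The fix landed in commit 2a50a3e [4].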
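The difference between the two forms of the pattern can be checked with Python's standard `re` module alone. The sketch below is illustrative only: the helper name `is_cardinal` and the sample tokens are assumptions, not part of NLTK, while the two patterns are copied from the fix commit.

```python
import re

# Pattern before the fix: the dot is unescaped and matches any character.
VULNERABLE = re.compile(r"^-?[0-9]+(.[0-9]+)?$")
# Pattern after the fix: the dot is escaped and matches only a literal ".".
FIXED = re.compile(r"^-?[0-9]+(\.[0-9]+)?$")


def is_cardinal(pattern, token):
    """Return True if the token matches the cardinal-number pattern."""
    return pattern.match(token) is not None


# Hypothetical sample tokens: two real numbers and two that only look numeric.
for token in ["3.14", "-42", "1a2", "12:30"]:
    print(token, is_cardinal(VULNERABLE, token), is_cardinal(FIXED, token))
```

With the unescaped dot, tokens such as `1a2` and `12:30` are accepted as cardinal numbers. This is consistent with the changed doctest accuracy figures in the fix commit.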

Exploitation

An attacker can exploit this by providing a string that forces the regex engine to backtrack heavily. For example, a dash followed by a long run of zeros and one trailing non-matching character (e.g., "-" + "0" * N + "q") makes the engine try every way of splitting the zeros between the two [0-9]+ groups before the match fails [3]. For this pattern the matching time grows roughly quadratically in N rather than linearly. The attack requires no authentication and can be launched over the network if the application processes user-supplied text using the vulnerable regex.
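A rough way to observe this behaviour is to time both forms of the pattern on the input shape described above. This is a minimal sketch using only the standard library. The helper names `crafted_input` and `time_match` and the values of N are arbitrary choices, and absolute timings will differ between machines and Python versions.

```python
import re
import time

VULNERABLE = re.compile(r"^-?[0-9]+(.[0-9]+)?$")
FIXED = re.compile(r"^-?[0-9]+(\.[0-9]+)?$")


def crafted_input(n):
    """Build a dash, n zeros, and a trailing character that prevents a match."""
    return "-" + "0" * n + "q"


def time_match(pattern, text):
    """Return (match result, elapsed seconds) for one match attempt."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start


# Small, arbitrary sizes so that the demonstration finishes quickly.
for n in (500, 1000, 2000):
    text = crafted_input(n)
    _, slow = time_match(VULNERABLE, text)
    _, fast = time_match(FIXED, text)
    print(f"N={n}: vulnerable {slow:.4f}s, fixed {fast:.4f}s")
```

Neither pattern matches the crafted string. The difference lies only in how much work the engine does before it gives up.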

Impact

Successful exploitation leads to denial of service (DoS) through excessive CPU consumption, potentially making the application unresponsive. No data confidentiality or integrity is affected; the impact is limited to resource exhaustion [2].

Mitigation

The vulnerability is fixed by escaping the dot as \. in the regex pattern, as implemented in commit 2a50a3e [1]. Users should upgrade NLTK to 3.6.6 or later, which includes this commit (e.g., via pip install --upgrade nltk). If upgrading is not possible, replace the cardinal-number pattern with the escaped form wherever it has been copied into application code, and review other regex patterns for similar issues [3][4].
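Reviewing patterns by hand is error-prone, so a small scanner can help shortlist candidates. The following is a heuristic sketch, not an NLTK feature. The function `unescaped_dots` is hypothetical, it only flags dots that are neither backslash-escaped nor inside a character class, and it does not handle every corner of regex syntax (for example a literal `]` placed first in a class). Many flagged dots, such as the one in `.*able$`, are intentional and need no change.

```python
def unescaped_dots(pattern):
    """Return indexes of '.' that are neither escaped nor inside [...]."""
    hits = []
    escaped = False
    in_class = False
    for index, char in enumerate(pattern):
        if escaped:
            # The previous character was a backslash; this one is literal.
            escaped = False
        elif char == "\\":
            escaped = True
        elif in_class:
            if char == "]":
                in_class = False
        elif char == "[":
            in_class = True
        elif char == ".":
            hits.append(index)
    return hits


# Rules in the (pattern, tag) form used by the taggers in the fix commit.
rules = [
    (r"^-?[0-9]+(.[0-9]+)?$", "CD"),   # pre-fix pattern
    (r"^-?[0-9]+(\.[0-9]+)?$", "CD"),  # post-fix pattern
    (r".*able$", "JJ"),                # intentional wildcard
]
for pattern, tag in rules:
    print(tag, pattern, unescaped_dots(pattern))
```

Each flagged position still needs a human decision on whether a wildcard or a literal dot was intended.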

AI Insight generated on May 21, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.

Affected packages

Versions sourced from the GitHub Security Advisory.

Package      Affected versions  Patched versions
nltk (PyPI)  < 3.6.6            3.6.6
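To check an environment against the affected range (< 3.6.6, patched in 3.6.6), something like the sketch below could be used. It assumes a plain numeric release string. The helpers `parse_release` and `is_affected` are hypothetical, and pre-release suffixes such as `rc1` are ignored, so a dedicated version-parsing library would be more robust.

```python
import re
from importlib.metadata import PackageNotFoundError, version

PATCHED = (3, 6, 6)  # first patched release according to the advisory


def parse_release(version_string):
    """Extract the leading numeric components, e.g. '3.6.5' -> (3, 6, 5)."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"unrecognised version: {version_string!r}")
    return tuple(int(part) for part in match.group().split("."))


def is_affected(version_string):
    """Return True if the release is older than the patched release."""
    release = parse_release(version_string)
    # Pad short versions such as '3.6' with zeros before comparing.
    release += (0,) * (len(PATCHED) - len(release))
    return release < PATCHED


try:
    installed = version("nltk")
except PackageNotFoundError:
    print("nltk is not installed in this environment")
else:
    status = "affected" if is_affected(installed) else "not affected"
    print(f"nltk {installed}: {status}")
```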

Affected products

2
  • ghsa-coords
    Range: < 3.6.6
  • nltk/nltk/nltkv5
    Range: unspecified

Patches

1
2a50a3edc9d3

Resolve ReDoS opportunity by fixing incorrectly specified regex (#2906)

https://github.com/nltk/nltk · Tom Aarsen · Dec 8, 2021 · via ghsa
6 files changed · +20 -20
  • nltk/parse/malt.py · +1 -1 · modified
    @@ -32,7 +32,7 @@ def malt_regex_tagger():
                 (r"\)$", ")"),  # round brackets
                 (r"\[$", "["),
                 (r"\]$", "]"),  # square brackets
    -            (r"^-?[0-9]+(.[0-9]+)?$", "CD"),  # cardinal numbers
    +            (r"^-?[0-9]+(\.[0-9]+)?$", "CD"),  # cardinal numbers
                 (r"(The|the|A|a|An|an)$", "DT"),  # articles
                 (r"(He|he|She|she|It|it|I|me|Me|You|you)$", "PRP"),  # pronouns
                 (r"(His|his|Her|her|Its|its)$", "PRP$"),  # possessive
    
  • nltk/sem/glue.py · +1 -1 · modified
    @@ -703,7 +703,7 @@ def get_pos_tagger(self):
     
             regexp_tagger = RegexpTagger(
                 [
    -                (r"^-?[0-9]+(.[0-9]+)?$", "CD"),  # cardinal numbers
    +                (r"^-?[0-9]+(\.[0-9]+)?$", "CD"),  # cardinal numbers
                     (r"(The|the|A|a|An|an)$", "AT"),  # articles
                     (r".*able$", "JJ"),  # adjectives
                     (r".*ness$", "NN"),  # nouns formed from adjectives
    
  • nltk/tag/brill.py · +1 -1 · modified
    @@ -329,7 +329,7 @@ def print_train_stats():
                 )
                 print(
                     "TRAIN ({tokencount:7d} tokens) initial {initialerrors:5d} {initialacc:.4f} "
    -                "final: {finalerrors:5d} {finalacc:.4f} ".format(**train_stats)
    +                "final: {finalerrors:5d} {finalacc:.4f}".format(**train_stats)
                 )
                 head = "#ID | Score (train) |  #Rules     | Template"
                 print(head, "\n", "-" * len(head), sep="")
    
  • nltk/tag/brill_trainer.py · +11 -11 · modified
    @@ -91,7 +91,7 @@ def __init__(
         # Training
     
         def train(self, train_sents, max_rules=200, min_score=2, min_acc=None):
    -        """
    +        r"""
             Trains the Brill tagger on the corpus *train_sents*,
             producing at most *max_rules* transformations, each of which
             reduces the net number of errors in the corpus by at least
    @@ -111,7 +111,7 @@ def train(self, train_sents, max_rules=200, min_score=2, min_acc=None):
             >>> testing_data = [untag(s) for s in gold_data]
     
             >>> backoff = RegexpTagger([
    -        ... (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),   # cardinal numbers
    +        ... (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
             ... (r'(The|the|A|a|An|an)$', 'AT'),   # articles
             ... (r'.*able$', 'JJ'),                # adjectives
             ... (r'.*ness$', 'NN'),                # nouns formed from adjectives
    @@ -125,7 +125,7 @@ def train(self, train_sents, max_rules=200, min_score=2, min_acc=None):
             >>> baseline = backoff #see NOTE1
     
             >>> baseline.evaluate(gold_data) #doctest: +ELLIPSIS
    -        0.2450142...
    +        0.2433862...
     
             >>> # Set up templates
             >>> Template._cleartemplates() #clear any templates created in earlier tests
    @@ -137,7 +137,7 @@ def train(self, train_sents, max_rules=200, min_score=2, min_acc=None):
             >>> tagger1 = tt.train(training_data, max_rules=10)
             TBL train (fast) (seqs: 100; tokens: 2417; tpls: 2; min score: 2; min acc: None)
             Finding initial useful rules...
    -            Found 845 useful rules.
    +            Found 847 useful rules.
             <BLANKLINE>
                        B      |
                S   F   r   O  |        Score = Fixed - Broken
    @@ -150,7 +150,7 @@ def train(self, train_sents, max_rules=200, min_score=2, min_acc=None):
               85  85   0   0  | NN->, if Pos:NN@[-1] & Word:,@[0]
               69  69   0   0  | NN->. if Pos:NN@[-1] & Word:.@[0]
               51  51   0   0  | NN->IN if Pos:NN@[-1] & Word:of@[0]
    -          47  63  16 161  | NN->IN if Pos:NNS@[-1]
    +          47  63  16 162  | NN->IN if Pos:NNS@[-1]
               33  33   0   0  | NN->TO if Pos:NN@[-1] & Word:to@[0]
               26  26   0   0  | IN->. if Pos:NNS@[-1] & Word:.@[0]
               24  24   0   0  | IN->, if Pos:NNS@[-1] & Word:,@[0]
    @@ -162,11 +162,11 @@ def train(self, train_sents, max_rules=200, min_score=2, min_acc=None):
     
             >>> train_stats = tagger1.train_stats()
             >>> [train_stats[stat] for stat in ['initialerrors', 'finalerrors', 'rulescores']]
    -        [1775, 1269, [132, 85, 69, 51, 47, 33, 26, 24, 22, 17]]
    +        [1776, 1270, [132, 85, 69, 51, 47, 33, 26, 24, 22, 17]]
     
             >>> tagger1.print_template_statistics(printunused=False)
             TEMPLATE STATISTICS (TRAIN)  2 templates, 10 rules)
    -        TRAIN (   2417 tokens) initial  1775 0.2656 final:  1269 0.4750
    +        TRAIN (   2417 tokens) initial  1776 0.2652 final:  1270 0.4746
             #ID | Score (train) |  #Rules     | Template
             --------------------------------------------
             001 |   305   0.603 |   7   0.700 | Template(Pos([-1]),Word([0]))
    @@ -175,7 +175,7 @@ def train(self, train_sents, max_rules=200, min_score=2, min_acc=None):
             <BLANKLINE>
     
             >>> tagger1.evaluate(gold_data) # doctest: +ELLIPSIS
    -        0.43996...
    +        0.43833...
     
             >>> tagged, test_stats = tagger1.batch_tag_incremental(testing_data, gold_data)
     
    @@ -185,13 +185,13 @@ def train(self, train_sents, max_rules=200, min_score=2, min_acc=None):
             True
     
             >>> [test_stats[stat] for stat in ['initialerrors', 'finalerrors', 'rulescores']]
    -        [1855, 1376, [100, 85, 67, 58, 27, 36, 27, 16, 31, 32]]
    +        [1859, 1380, [100, 85, 67, 58, 27, 36, 27, 16, 31, 32]]
     
             >>> # A high-accuracy tagger
             >>> tagger2 = tt.train(training_data, max_rules=10, min_acc=0.99)
             TBL train (fast) (seqs: 100; tokens: 2417; tpls: 2; min score: 2; min acc: 0.99)
             Finding initial useful rules...
    -            Found 845 useful rules.
    +            Found 847 useful rules.
             <BLANKLINE>
                        B      |
                S   F   r   O  |        Score = Fixed - Broken
    @@ -212,7 +212,7 @@ def train(self, train_sents, max_rules=200, min_score=2, min_acc=None):
               18  18   0   0  | NN->CC if Pos:NN@[-1] & Word:and@[0]
     
             >>> tagger2.evaluate(gold_data)  # doctest: +ELLIPSIS
    -        0.44159544...
    +        0.43996743...
             >>> tagger2.rules()[2:4]
             (Rule('001', 'NN', '.', [(Pos([-1]),'NN'), (Word([0]),'.')]), Rule('001', 'NN', 'IN', [(Pos([-1]),'NN'), (Word([0]),'of')]))
     
    
  • nltk/tag/sequential.py · +4 -4 · modified
    @@ -337,7 +337,7 @@ class UnigramTagger(NgramTagger):
             >>> test_sent = brown.sents(categories='news')[0]
             >>> unigram_tagger = UnigramTagger(brown.tagged_sents(categories='news')[:500])
             >>> for tok, tag in unigram_tagger.tag(test_sent):
    -        ...     print("({}, {}), ".format(tok, tag))
    +        ...     print("({}, {}), ".format(tok, tag)) # doctest: +NORMALIZE_WHITESPACE
             (The, AT), (Fulton, NP-TL), (County, NN-TL), (Grand, JJ-TL),
             (Jury, NN-TL), (said, VBD), (Friday, NR), (an, AT),
             (investigation, NN), (of, IN), (Atlanta's, NP$), (recent, JJ),
    @@ -491,7 +491,7 @@ def context(self, tokens, index, history):
     
     @jsontags.register_tag
     class RegexpTagger(SequentialBackoffTagger):
    -    """
    +    r"""
         Regular Expression Tagger
     
         The RegexpTagger assigns tags to tokens by comparing their
    @@ -503,7 +503,7 @@ class RegexpTagger(SequentialBackoffTagger):
             >>> from nltk.tag import RegexpTagger
             >>> test_sent = brown.sents(categories='news')[0]
             >>> regexp_tagger = RegexpTagger(
    -        ...     [(r'^-?[0-9]+(.[0-9]+)?$', 'CD'),   # cardinal numbers
    +        ...     [(r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
             ...      (r'(The|the|A|a|An|an)$', 'AT'),   # articles
             ...      (r'.*able$', 'JJ'),                # adjectives
             ...      (r'.*ness$', 'NN'),                # nouns formed from adjectives
    @@ -515,7 +515,7 @@ class RegexpTagger(SequentialBackoffTagger):
             ... ])
             >>> regexp_tagger
             <Regexp Tagger: size=9>
    -        >>> regexp_tagger.tag(test_sent)
    +        >>> regexp_tagger.tag(test_sent) # doctest: +NORMALIZE_WHITESPACE
             [('The', 'AT'), ('Fulton', 'NN'), ('County', 'NN'), ('Grand', 'NN'), ('Jury', 'NN'),
             ('said', 'NN'), ('Friday', 'NN'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'NN'),
             ("Atlanta's", 'NNS'), ('recent', 'NN'), ('primary', 'NN'), ('election', 'NN'),
    
  • nltk/tbl/demo.py · +2 -2 · modified
    @@ -393,11 +393,11 @@ def _demo_plot(learning_curve_output, teststats, trainstats=None, take=None):
         plt.savefig(learning_curve_output)
     
     
    -NN_CD_TAGGER = RegexpTagger([(r"^-?[0-9]+(.[0-9]+)?$", "CD"), (r".*", "NN")])
    +NN_CD_TAGGER = RegexpTagger([(r"^-?[0-9]+(\.[0-9]+)?$", "CD"), (r".*", "NN")])
     
     REGEXP_TAGGER = RegexpTagger(
         [
    -        (r"^-?[0-9]+(.[0-9]+)?$", "CD"),  # cardinal numbers
    +        (r"^-?[0-9]+(\.[0-9]+)?$", "CD"),  # cardinal numbers
             (r"(The|the|A|a|An|an)$", "AT"),  # articles
             (r".*able$", "JJ"),  # adjectives
             (r".*ness$", "NN"),  # nouns formed from adjectives
    

Vulnerability mechanics

Generated on May 9, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.

References

6
