Inefficient Regular Expression Complexity in nltk/nltk
Description
nltk is vulnerable to Inefficient Regular Expression Complexity
AI Insight
LLM-synthesized narrative grounded in this CVE's description and references.
NLTK's RegexpTagger contains a ReDoS vulnerability due to an unescaped dot in a regex pattern, enabling denial of service via crafted input.
Vulnerability
NLTK versions before 3.6.6 contain an inefficient regular expression in the RegexpTagger patterns used by several modules and docstring examples (nltk/parse/malt.py, nltk/sem/glue.py, nltk/tag/brill_trainer.py, nltk/tag/sequential.py, and nltk/tbl/demo.py). Specifically, the cardinal-number pattern ^-?[0-9]+(.[0-9]+)?$ uses an unescaped dot (.) that matches any character, including a digit, so the two [0-9]+ groups can overlap and the engine backtracks excessively on crafted inputs [1][3]. The issue is fixed in commit 2a50a3e [4].
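The effect of the unescaped dot can be seen with Python's standard re module alone. The sketch below is illustrative and does not call NLTK; the names VULNERABLE, FIXED, and samples are hypothetical, while the two patterns are copied from the patch.

```python
import re

# Pattern as it appeared before the fix: "." matches any character.
VULNERABLE = re.compile(r"^-?[0-9]+(.[0-9]+)?$")
# Pattern after the fix: "\." matches only a literal decimal point.
FIXED = re.compile(r"^-?[0-9]+(\.[0-9]+)?$")

# Hypothetical sample tokens: two real numbers and two non-numbers.
samples = ["3.14", "-42", "1a2", "10/20"]

# Map each token to (matched by old pattern, matched by fixed pattern).
results = {
    s: (bool(VULNERABLE.match(s)), bool(FIXED.match(s))) for s in samples
}

for token, (old, new) in results.items():
    print(f"{token!r}: vulnerable={old} fixed={new}")
```

Tokens such as 1a2 and 10/20 are accepted as cardinal numbers by the old pattern but rejected by the fixed one, which is also why the doctest accuracy figures in the patch shift slightly.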
Exploitation
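An attacker can exploit this by supplying a token that forces the regex engine to backtrack heavily. For example, a dash followed by a long run of zeros and a trailing non-matching character ("-" + "0" * N + "q") can never match, yet the engine tries every way of splitting the zeros between the two [0-9]+ groups before giving up [3]. Because the group (.[0-9]+)? is applied at most once, the work grows roughly quadratically in N rather than exponentially, which is still enough to stall processing for large N. The attack requires no authentication and can be launched over the network if the application processes user-supplied text using the vulnerable regex.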
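A rough way to observe this is to time both patterns on the crafted input at increasing lengths. This is a sketch using only the standard library; crafted and time_match are hypothetical helper names, the chosen lengths are arbitrary, and absolute timings will vary by machine and Python version.

```python
import re
import time

VULNERABLE = re.compile(r"^-?[0-9]+(.[0-9]+)?$")
FIXED = re.compile(r"^-?[0-9]+(\.[0-9]+)?$")


def crafted(n: int) -> str:
    """Build the input shape from the report: dash, n zeros, then 'q'."""
    return "-" + "0" * n + "q"


def time_match(pattern: re.Pattern, text: str):
    """Return (match result, elapsed seconds) for one match attempt."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start


timings = []
for n in (500, 1000, 2000, 4000):
    text = crafted(n)
    old_result, old_seconds = time_match(VULNERABLE, text)
    new_result, new_seconds = time_match(FIXED, text)
    timings.append((n, old_result, new_result, old_seconds, new_seconds))
    print(f"N={n:5d}  vulnerable={old_seconds:.6f}s  fixed={new_seconds:.6f}s")
```

Neither pattern matches the crafted string, but the time spent by the vulnerable pattern should grow much faster than the input length, while the fixed pattern stays close to linear.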
Impact
Successful exploitation leads to denial of service (DoS) through excessive CPU consumption, potentially making the application unresponsive. No data confidentiality or integrity is affected; the impact is limited to resource exhaustion [2].
Mitigation
The vulnerability is fixed by escaping the dot as \. in the regex pattern, as implemented in commit 2a50a3e [1]. Users should update NLTK to a release that includes this commit, 3.6.6 or later (e.g., via pip install --upgrade nltk). If updating is not possible, do not copy the unescaped-dot pattern from the affected modules or docstring examples into your own RegexpTagger rules, and review all regex patterns for similar issues [3][4].
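One way to check whether an environment already has the fix is to compare the installed nltk version against 3.6.6, the patched release listed in the affected-packages table. The sketch below uses importlib.metadata from the standard library; release_tuple, is_patched, and installed_nltk_status are hypothetical helpers, and the version parsing is deliberately simplified (it ignores pre-release suffixes such as rc1).

```python
import re
from importlib import metadata

# First patched release according to the advisory's package table.
PATCHED = (3, 6, 6)


def release_tuple(version: str) -> tuple:
    """Extract leading numeric components, e.g. '3.6.6' -> (3, 6, 6)."""
    found = re.match(r"\d+(?:\.\d+)*", version)
    if found is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in found.group(0).split("."))


def is_patched(version: str) -> bool:
    """True if the version is at or above the patched release."""
    parts = release_tuple(version)
    # Pad short versions so that '3.7' compares as (3, 7, 0).
    padded = parts + (0,) * (len(PATCHED) - len(parts))
    return padded >= PATCHED


def installed_nltk_status() -> str:
    """Describe the locally installed nltk, if any."""
    try:
        version = metadata.version("nltk")
    except metadata.PackageNotFoundError:
        return "nltk is not installed"
    state = "patched" if is_patched(version) else "vulnerable, upgrade advised"
    return f"nltk {version}: {state}"


print(installed_nltk_status())
```

A proper version parser (for example the third-party packaging library) would handle pre-releases more carefully; the tuple comparison here is only meant to show the idea.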
AI Insight generated on May 21, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.
Affected packages
Versions sourced from the GitHub Security Advisory.
| Package | Affected versions | Patched versions |
|---|---|---|
| nltk (PyPI) | < 3.6.6 | 3.6.6 |
Affected products
- nltk/nltk/nltk (v5), Range: unspecified
Patches
2a50a3edc9d3: Resolve ReDoS opportunity by fixing incorrectly specified regex (#2906)
6 files changed · +20 −20
nltk/parse/malt.py (+1 −1, modified)

    @@ -32,7 +32,7 @@ def malt_regex_tagger():
                 (r"\)$", ")"), # round brackets
                 (r"\[$", "["),
                 (r"\]$", "]"), # square brackets
    -            (r"^-?[0-9]+(.[0-9]+)?$", "CD"), # cardinal numbers
    +            (r"^-?[0-9]+(\.[0-9]+)?$", "CD"), # cardinal numbers
                 (r"(The|the|A|a|An|an)$", "DT"), # articles
                 (r"(He|he|She|she|It|it|I|me|Me|You|you)$", "PRP"), # pronouns
                 (r"(His|his|Her|her|Its|its)$", "PRP$"), # possessive

nltk/sem/glue.py (+1 −1, modified)

    @@ -703,7 +703,7 @@ def get_pos_tagger(self):
             regexp_tagger = RegexpTagger(
                 [
    -                (r"^-?[0-9]+(.[0-9]+)?$", "CD"), # cardinal numbers
    +                (r"^-?[0-9]+(\.[0-9]+)?$", "CD"), # cardinal numbers
                     (r"(The|the|A|a|An|an)$", "AT"), # articles
                     (r".*able$", "JJ"), # adjectives
                     (r".*ness$", "NN"), # nouns formed from adjectives

nltk/tag/brill.py (+1 −1, modified)

    @@ -329,7 +329,7 @@ def print_train_stats():
             )
             print(
                 "TRAIN ({tokencount:7d} tokens) initial {initialerrors:5d} {initialacc:.4f} "
    -            "final: {finalerrors:5d} {finalacc:.4f} ".format(**train_stats)
    +            "final: {finalerrors:5d} {finalacc:.4f}".format(**train_stats)
             )
             head = "#ID | Score (train) | #Rules | Template"
             print(head, "\n", "-" * len(head), sep="")

nltk/tag/brill_trainer.py (+11 −11, modified)

    @@ -91,7 +91,7 @@ def __init__(
         # Training
         def train(self, train_sents, max_rules=200, min_score=2, min_acc=None):
    -        """
    +        r"""
             Trains the Brill tagger on the corpus *train_sents*,
             producing at most *max_rules* transformations, each of which
             reduces the net number of errors in the corpus by at least
    @@ -111,7 +111,7 @@ def train(self, train_sents, max_rules=200, min_score=2, min_acc=None):
             >>> testing_data = [untag(s) for s in gold_data]
             >>> backoff = RegexpTagger([
    -        ... (r'^-?[0-9]+(.[0-9]+)?$', 'CD'), # cardinal numbers
    +        ... (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'), # cardinal numbers
             ... (r'(The|the|A|a|An|an)$', 'AT'), # articles
             ... (r'.*able$', 'JJ'), # adjectives
             ... (r'.*ness$', 'NN'), # nouns formed from adjectives
    @@ -125,7 +125,7 @@ def train(self, train_sents, max_rules=200, min_score=2, min_acc=None):
             >>> baseline = backoff #see NOTE1
             >>> baseline.evaluate(gold_data) #doctest: +ELLIPSIS
    -        0.2450142...
    +        0.2433862...
             >>> # Set up templates
             >>> Template._cleartemplates() #clear any templates created in earlier tests
    @@ -137,7 +137,7 @@ def train(self, train_sents, max_rules=200, min_score=2, min_acc=None):
             >>> tagger1 = tt.train(training_data, max_rules=10)
             TBL train (fast) (seqs: 100; tokens: 2417; tpls: 2; min score: 2; min acc: None)
             Finding initial useful rules...
    -        Found 845 useful rules.
    +        Found 847 useful rules.
             <BLANKLINE>
             B |
             S F r O | Score = Fixed - Broken
    @@ -150,7 +150,7 @@ def train(self, train_sents, max_rules=200, min_score=2, min_acc=None):
             85 85 0 0 | NN->, if Pos:NN@[-1] & Word:,@[0]
             69 69 0 0 | NN->. if Pos:NN@[-1] & Word:.@[0]
             51 51 0 0 | NN->IN if Pos:NN@[-1] & Word:of@[0]
    -        47 63 16 161 | NN->IN if Pos:NNS@[-1]
    +        47 63 16 162 | NN->IN if Pos:NNS@[-1]
             33 33 0 0 | NN->TO if Pos:NN@[-1] & Word:to@[0]
             26 26 0 0 | IN->. if Pos:NNS@[-1] & Word:.@[0]
             24 24 0 0 | IN->, if Pos:NNS@[-1] & Word:,@[0]
    @@ -162,11 +162,11 @@ def train(self, train_sents, max_rules=200, min_score=2, min_acc=None):
             >>> train_stats = tagger1.train_stats()
             >>> [train_stats[stat] for stat in ['initialerrors', 'finalerrors', 'rulescores']]
    -        [1775, 1269, [132, 85, 69, 51, 47, 33, 26, 24, 22, 17]]
    +        [1776, 1270, [132, 85, 69, 51, 47, 33, 26, 24, 22, 17]]
             >>> tagger1.print_template_statistics(printunused=False)
             TEMPLATE STATISTICS (TRAIN) 2 templates, 10 rules)
    -        TRAIN ( 2417 tokens) initial 1775 0.2656 final: 1269 0.4750
    +        TRAIN ( 2417 tokens) initial 1776 0.2652 final: 1270 0.4746
             #ID | Score (train) | #Rules | Template
             --------------------------------------------
             001 | 305 0.603 | 7 0.700 | Template(Pos([-1]),Word([0]))
    @@ -175,7 +175,7 @@ def train(self, train_sents, max_rules=200, min_score=2, min_acc=None):
             <BLANKLINE>
             >>> tagger1.evaluate(gold_data) # doctest: +ELLIPSIS
    -        0.43996...
    +        0.43833...
             >>> tagged, test_stats = tagger1.batch_tag_incremental(testing_data, gold_data)
    @@ -185,13 +185,13 @@ def train(self, train_sents, max_rules=200, min_score=2, min_acc=None):
             True
             >>> [test_stats[stat] for stat in ['initialerrors', 'finalerrors', 'rulescores']]
    -        [1855, 1376, [100, 85, 67, 58, 27, 36, 27, 16, 31, 32]]
    +        [1859, 1380, [100, 85, 67, 58, 27, 36, 27, 16, 31, 32]]
             >>> # A high-accuracy tagger
             >>> tagger2 = tt.train(training_data, max_rules=10, min_acc=0.99)
             TBL train (fast) (seqs: 100; tokens: 2417; tpls: 2; min score: 2; min acc: 0.99)
             Finding initial useful rules...
    -        Found 845 useful rules.
    +        Found 847 useful rules.
             <BLANKLINE>
             B |
             S F r O | Score = Fixed - Broken
    @@ -212,7 +212,7 @@ def train(self, train_sents, max_rules=200, min_score=2, min_acc=None):
             18 18 0 0 | NN->CC if Pos:NN@[-1] & Word:and@[0]
             >>> tagger2.evaluate(gold_data) # doctest: +ELLIPSIS
    -        0.44159544...
    +        0.43996743...
             >>> tagger2.rules()[2:4]
             (Rule('001', 'NN', '.', [(Pos([-1]),'NN'), (Word([0]),'.')]), Rule('001', 'NN', 'IN', [(Pos([-1]),'NN'), (Word([0]),'of')]))

nltk/tag/sequential.py (+4 −4, modified)

    @@ -337,7 +337,7 @@ class UnigramTagger(NgramTagger):
             >>> test_sent = brown.sents(categories='news')[0]
             >>> unigram_tagger = UnigramTagger(brown.tagged_sents(categories='news')[:500])
             >>> for tok, tag in unigram_tagger.tag(test_sent):
    -        ...     print("({}, {}), ".format(tok, tag))
    +        ...     print("({}, {}), ".format(tok, tag)) # doctest: +NORMALIZE_WHITESPACE
             (The, AT), (Fulton, NP-TL), (County, NN-TL), (Grand, JJ-TL),
             (Jury, NN-TL), (said, VBD), (Friday, NR), (an, AT),
             (investigation, NN), (of, IN), (Atlanta's, NP$), (recent, JJ),
    @@ -491,7 +491,7 @@ def context(self, tokens, index, history):
     @jsontags.register_tag
     class RegexpTagger(SequentialBackoffTagger):
    -    """
    +    r"""
         Regular Expression Tagger
         The RegexpTagger assigns tags to tokens by comparing their
    @@ -503,7 +503,7 @@ class RegexpTagger(SequentialBackoffTagger):
             >>> from nltk.tag import RegexpTagger
             >>> test_sent = brown.sents(categories='news')[0]
             >>> regexp_tagger = RegexpTagger(
    -        ... [(r'^-?[0-9]+(.[0-9]+)?$', 'CD'), # cardinal numbers
    +        ... [(r'^-?[0-9]+(\.[0-9]+)?$', 'CD'), # cardinal numbers
             ... (r'(The|the|A|a|An|an)$', 'AT'), # articles
             ... (r'.*able$', 'JJ'), # adjectives
             ... (r'.*ness$', 'NN'), # nouns formed from adjectives
    @@ -515,7 +515,7 @@ class RegexpTagger(SequentialBackoffTagger):
             ... ])
             >>> regexp_tagger
             <Regexp Tagger: size=9>
    -        >>> regexp_tagger.tag(test_sent)
    +        >>> regexp_tagger.tag(test_sent) # doctest: +NORMALIZE_WHITESPACE
             [('The', 'AT'), ('Fulton', 'NN'), ('County', 'NN'), ('Grand', 'NN'), ('Jury', 'NN'),
             ('said', 'NN'), ('Friday', 'NN'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'NN'),
             ("Atlanta's", 'NNS'), ('recent', 'NN'), ('primary', 'NN'), ('election', 'NN'),

nltk/tbl/demo.py (+2 −2, modified)

    @@ -393,11 +393,11 @@ def _demo_plot(learning_curve_output, teststats, trainstats=None, take=None):
         plt.savefig(learning_curve_output)
    -NN_CD_TAGGER = RegexpTagger([(r"^-?[0-9]+(.[0-9]+)?$", "CD"), (r".*", "NN")])
    +NN_CD_TAGGER = RegexpTagger([(r"^-?[0-9]+(\.[0-9]+)?$", "CD"), (r".*", "NN")])
     REGEXP_TAGGER = RegexpTagger(
         [
    -        (r"^-?[0-9]+(.[0-9]+)?$", "CD"), # cardinal numbers
    +        (r"^-?[0-9]+(\.[0-9]+)?$", "CD"), # cardinal numbers
             (r"(The|the|A|a|An|an)$", "AT"), # articles
             (r".*able$", "JJ"), # adjectives
             (r".*ness$", "NN"), # nouns formed from adjectives

Vulnerability mechanics
Generated on May 9, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.
References
- github.com/advisories/GHSA-rqjh-jp2r-59cj (ghsa, ADVISORY)
- nvd.nist.gov/vuln/detail/CVE-2021-3842 (ghsa, ADVISORY)
- github.com/nltk/nltk/commit/2a50a3edc9d35f57ae42a921c621edc160877f4d (ghsa, x_refsource_MISC, WEB)
- github.com/nltk/nltk/pull/2906 (ghsa, WEB)
- github.com/pypa/advisory-database/tree/main/vulns/nltk/PYSEC-2022-5.yaml (ghsa, WEB)
- huntr.dev/bounties/761a761e-2be2-430a-8d92-6f74ffe9866a (ghsa, x_refsource_CONFIRM, WEB)