Regular Expression Denial of Service (ReDoS) in huggingface/transformers
Description
A Regular Expression Denial of Service (ReDoS) vulnerability was identified in the huggingface/transformers library, specifically in the file tokenization_gpt_neox_japanese.py of the GPT-NeoX-Japanese model. The vulnerability occurs in the SubWordJapaneseTokenizer class, where regular expressions process specially crafted inputs. The issue stems from a regex exhibiting exponential complexity under certain conditions, leading to excessive backtracking. This can result in high CPU usage and potential application downtime, effectively creating a Denial of Service (DoS) scenario. The affected version is v4.48.1 (latest).
AI Insight
LLM-synthesized narrative grounded in this CVE's description and references.
A ReDoS vulnerability in huggingface/transformers GPT-NeoX-Japanese tokenizer allows crafted inputs to cause high CPU usage via exponential regex backtracking.
The vulnerability is a Regular Expression Denial of Service (ReDoS) found in the huggingface/transformers library, specifically in the file tokenization_gpt_neox_japanese.py of the GPT-NeoX-Japanese model. The issue resides in the SubWordJapaneseTokenizer class within the regex pattern content_repatter6, which exhibits exponential backtracking complexity when processing specially crafted inputs [2]. This leads to excessive CPU consumption, potentially causing application downtime.
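To see why a nested quantifier can become so expensive, the sketch below counts how many ways a sub-pattern shaped like `(?:[1-9]\d*)*` can partition a run of digits into repetitions. It is a simplified illustration rather than the library's code: it models only one branch of the alternation found in the original `content_repatter6`, and `count_splits` is a hypothetical helper name. A backtracking engine may have to try each of these partitions before concluding that no currency suffix follows the digits.

```python
from functools import lru_cache


def count_splits(digits: str) -> int:
    # Count the ways a starred group whose body is "[1-9] followed by any
    # digits" can cut `digits` into consecutive repetitions.
    @lru_cache(maxsize=None)
    def ways(start: int) -> int:
        if start == len(digits):
            return 1  # consumed the whole run: one valid partition
        if digits[start] == "0":
            return 0  # a repetition of this branch cannot begin with 0
        # The current repetition may end at any later position.
        return sum(ways(end) for end in range(start + 1, len(digits) + 1))

    return ways(0)


if __name__ == "__main__":
    for n in (5, 10, 20, 30):
        print(n, count_splits("1" * n))
```

For a run of n non-zero digits the count is 2^(n-1), so each extra digit doubles the number of partitions a failing match may need to explore.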
Exploitation requires an attacker to supply crafted text to the tokenizer via any application that uses this model. No authentication or special privileges are needed: any user-provided input processed by the vulnerable tokenizer can trigger the DoS condition. The regex matches Japanese monetary amounts (digits combined with 億, 万, 千 and a currency suffix such as 円, ドル or ユーロ), and malicious inputs can cause catastrophic backtracking [2][3].
The impact is primarily availability: high CPU usage can degrade service performance or cause complete denial of service for the affected application. While no data confidentiality or integrity is compromised, the DoS can disrupt operations for users relying on the transformer model.
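A fix was provided in commit 92c5ca9 [3]. For Python 3.11 and above, the fix uses possessive quantifiers to prevent backtracking. Earlier Python versions do not support possessive quantifiers, so the fix falls back to a simpler regex that, according to the patch comments, should behave the same in most non-pathological cases. The same commit applies the identical change to the deprecated GPTSan-Japanese tokenizer and replaces a backtracking-prone pattern in the Nougat tokenizer with line-by-line processing. The CVE description names v4.48.1 as affected, while the GitHub Security Advisory lists every version before 4.50.0, so users are advised to update to 4.50.0 or later as soon as possible [2][3].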
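The version-gated approach described above can be sketched as follows. The regex fragments are copied from the patch; the module-level names `CURRENCY`, `TAX_SUFFIX` and `money_pattern` are hypothetical and chosen only for this illustration (the library assigns the compiled pattern to `self.content_repatter6` inside the tokenizer).

```python
import re
import sys

# Currency suffixes and optional tax markers, as they appear in the patch.
CURRENCY = r"(?:千円|万円|千万円|円|千ドル|万ドル|千万ドル|ドル|千ユーロ|万ユーロ|千万ユーロ|ユーロ)+"
TAX_SUFFIX = r"(?:\(税込\)|\(税抜\)|\+tax)*"

if sys.version_info >= (3, 11):
    # Possessive quantifiers (*+) never give back what they matched,
    # so a failing suffix cannot trigger repeated re-partitioning.
    money_pattern = re.compile(
        r"(?:\d,\d{3}|[\d億])*+"
        r"(?:\d,\d{3}|[\d万])*+"
        r"(?:\d,\d{3}|[\d千])*+" + CURRENCY + TAX_SUFFIX
    )
else:
    # Older interpreters reject *+, so a single un-nested repetition is used.
    money_pattern = re.compile(r"(?:\d,\d{3}|[\d億万千])*" + CURRENCY + TAX_SUFFIX)

if __name__ == "__main__":
    print(bool(money_pattern.fullmatch("1,000円")))
    print(bool(money_pattern.fullmatch("3億5千万円")))
    # A long digit run with no currency suffix is rejected without nested retries.
    print(money_pattern.match("1" * 5000))
```

Neither branch nests one unbounded repetition inside another, which is the structure that made the original pattern expensive on inputs that almost match.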
AI Insight generated on May 20, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.
Affected packages
Versions sourced from the GitHub Security Advisory.
| Package | Affected versions | Patched versions |
|---|---|---|
| transformers (PyPI) | < 4.50.0 | 4.50.0 |
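A quick way to check an environment against the range in the table is to compare the installed version with 4.50.0. The sketch below uses only the standard library; `parse_release`, `is_affected` and `installed_transformers_is_affected` are hypothetical helper names, and the parser is deliberately naive: it looks only at the first three numeric components, so a pre-release such as `4.50.0.dev0` is treated as 4.50.0.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple

PATCHED = (4, 50, 0)  # first patched release according to the advisory table


def parse_release(version_string: str) -> Tuple[int, int, int]:
    # Keep the leading digits of the first three dot-separated components.
    parts = []
    for piece in version_string.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return (parts[0], parts[1], parts[2])


def is_affected(version_string: str) -> bool:
    return parse_release(version_string) < PATCHED


def installed_transformers_is_affected() -> Optional[bool]:
    # Returns None when the package is not installed in this environment.
    try:
        return is_affected(version("transformers"))
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(installed_transformers_is_affected())
```

For precise handling of pre-releases and local version labels, a dedicated version-parsing library would be a better fit than this hand-rolled comparison.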
Affected products
- Range: = 4.48.1
- osv-coords (3 versions): pkg:apk/chainguard/nemo, pkg:apk/chainguard/tritonserver-backend-tensorrtllm-24.04, pkg:pypi/transformers
  - (no CPE) range: < 2.5.2-r2
  - (no CPE) range: < 0.9.0-r5
  - (no CPE) range: < 4.50.0
- huggingface/huggingface/transformers v5, range: unspecified
Patches
92c5ca9dd70d: Fix exploitable regexes in Nougat and GPTSan/GPTJNeoXJapanese (#36121)
3 files changed · +51 −30
src/transformers/models/deprecated/gptsan_japanese/tokenization_gptsan_japanese.py (+18 −3, modified)
@@ -18,6 +18,7 @@
 import json
 import os
 import re
+import sys
 from typing import List, Optional, Tuple, Union

 import numpy as np
@@ -407,9 +408,23 @@ def __init__(self, vocab, ids_to_tokens, emoji):
         self.content_repatter5 = re.compile(
             r"(明治|大正|昭和|平成|令和|㍾|㍽|㍼|㍻|\u32ff)\d{1,2}年(0?[1-9]|1[0-2])月(0?[1-9]|[12][0-9]|3[01])日(\d{1,2}|:|\d{1,2}時|\d{1,2}分|\(日\)|\(月\)|\(火\)|\(水\)|\(木\)|\(金\)|\(土\)|㈰|㈪|㈫|㈬|㈭|㈮|㈯)*"
         )
-        self.content_repatter6 = re.compile(
-            r"((0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)*億)*((0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)*万)*((0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)*千)*(0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)*(千円|万円|千万円|円|千ドル|万ドル|千万ドル|ドル|千ユーロ|万ユーロ|千万ユーロ|ユーロ)+(\(税込\)|\(税抜\)|\+tax)*"
-        )
+        # The original version of this regex displays catastrophic backtracking behaviour. We avoid this using
+        # possessive quantifiers in Py >= 3.11. In versions below this, we avoid the vulnerability using a slightly
+        # different regex that should generally have the same behaviour in most non-pathological cases.
+        if sys.version_info >= (3, 11):
+            self.content_repatter6 = re.compile(
+                r"(?:\d,\d{3}|[\d億])*+"
+                r"(?:\d,\d{3}|[\d万])*+"
+                r"(?:\d,\d{3}|[\d千])*+"
+                r"(?:千円|万円|千万円|円|千ドル|万ドル|千万ドル|ドル|千ユーロ|万ユーロ|千万ユーロ|ユーロ)+"
+                r"(?:\(税込\)|\(税抜\)|\+tax)*"
+            )
+        else:
+            self.content_repatter6 = re.compile(
+                r"(?:\d,\d{3}|[\d億万千])*"
+                r"(?:千円|万円|千万円|円|千ドル|万ドル|千万ドル|ドル|千ユーロ|万ユーロ|千万ユーロ|ユーロ)+"
+                r"(?:\(税込\)|\(税抜\)|\+tax)*"
+            )
         keisen = "─━│┃┄┅┆┇┈┉┊┋┌┍┎┏┐┑┒┓└┕┖┗┘┙┚┛├┝┞┟┠┡┢┣┤┥┦┧┨┩┪┫┬┭┮┯┰┱┲┳┴┵┶┷┸┹┺┻┼┽┾┿╀╁╂╃╄╅╆╇╈╉╊╋╌╍╎╏═║╒╓╔╕╖╗╘╙╚╛╜╝╞╟╠╡╢╣╤╥╦╧╨╩╪╫╬╭╮╯╰╱╲╳╴╵╶╷╸╹╺╻╼╽╾╿"
         blocks = "▀▁▂▃▄▅▆▇█▉▊▋▌▍▎▏▐░▒▓▔▕▖▗▘▙▚▛▜▝▞▟"
         self.content_trans1 = str.maketrans({k: "<BLOCK>" for k in keisen + blocks})
src/transformers/models/gpt_neox_japanese/tokenization_gpt_neox_japanese.py (+18 −3, modified)
@@ -18,6 +18,7 @@
 import json
 import os
 import re
+import sys
 from typing import Optional, Tuple

 import numpy as np
@@ -230,9 +231,23 @@ def __init__(self, vocab, ids_to_tokens, emoji):
         self.content_repatter5 = re.compile(
             r"(明治|大正|昭和|平成|令和|㍾|㍽|㍼|㍻|\u32ff)\d{1,2}年(0?[1-9]|1[0-2])月(0?[1-9]|[12][0-9]|3[01])日(\d{1,2}|:|\d{1,2}時|\d{1,2}分|\(日\)|\(月\)|\(火\)|\(水\)|\(木\)|\(金\)|\(土\)|㈰|㈪|㈫|㈬|㈭|㈮|㈯)*"
         )
-        self.content_repatter6 = re.compile(
-            r"((0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)*億)*((0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)*万)*((0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)*千)*(0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)*(千円|万円|千万円|円|千ドル|万ドル|千万ドル|ドル|千ユーロ|万ユーロ|千万ユーロ|ユーロ)+(\(税込\)|\(税抜\)|\+tax)*"
-        )
+        # The original version of this regex displays catastrophic backtracking behaviour. We avoid this using
+        # possessive quantifiers in Py >= 3.11. In versions below this, we avoid the vulnerability using a slightly
+        # different regex that should generally have the same behaviour in most non-pathological cases.
+        if sys.version_info >= (3, 11):
+            self.content_repatter6 = re.compile(
+                r"(?:\d,\d{3}|[\d億])*+"
+                r"(?:\d,\d{3}|[\d万])*+"
+                r"(?:\d,\d{3}|[\d千])*+"
+                r"(?:千円|万円|千万円|円|千ドル|万ドル|千万ドル|ドル|千ユーロ|万ユーロ|千万ユーロ|ユーロ)+"
+                r"(?:\(税込\)|\(税抜\)|\+tax)*"
+            )
+        else:
+            self.content_repatter6 = re.compile(
+                r"(?:\d,\d{3}|[\d億万千])*"
+                r"(?:千円|万円|千万円|円|千ドル|万ドル|千万ドル|ドル|千ユーロ|万ユーロ|千万ユーロ|ユーロ)+"
+                r"(?:\(税込\)|\(税抜\)|\+tax)*"
+            )
         keisen = "─━│┃┄┅┆┇┈┉┊┋┌┍┎┏┐┑┒┓└┕┖┗┘┙┚┛├┝┞┟┠┡┢┣┤┥┦┧┨┩┪┫┬┭┮┯┰┱┲┳┴┵┶┷┸┹┺┻┼┽┾┿╀╁╂╃╄╅╆╇╈╉╊╋╌╍╎╏═║╒╓╔╕╖╗╘╙╚╛╜╝╞╟╠╡╢╣╤╥╦╧╨╩╪╫╬╭╮╯╰╱╲╳╴╵╶╷╸╹╺╻╼╽╾╿"
         blocks = "▀▁▂▃▄▅▆▇█▉▊▋▌▍▎▏▐░▒▓▔▕▖▗▘▙▚▛▜▝▞▟"
         self.content_trans1 = str.maketrans({k: "<BLOCK>" for k in keisen + blocks})
src/transformers/models/nougat/tokenization_nougat_fast.py (+15 −24, modified)
@@ -113,26 +113,17 @@ def normalize_list_like_lines(generation):
         normalization adjusts the bullet point style and nesting levels based on the captured patterns.
     """
-    # This matches lines starting with - or *, not followed by - or * (lists)
-    # that are then numbered by digits \d or roman numerals (one or more)
-    # and then, optional additional numbering of this line is captured
-    # this is then fed to re.finditer.
-    pattern = r"(?:^)(-|\*)?(?!-|\*) ?((?:\d|[ixv])+ )?.+? (-|\*) (((?:\d|[ixv])+)\.(\d|[ixv]) )?.*(?:$)"
-
-    for match in reversed(list(re.finditer(pattern, generation, flags=re.I | re.M))):
-        start, stop = match.span()
-        delim = match.group(3) + " "
-        splits = match.group(0).split(delim)
+    lines = generation.split("\n")
+    output_lines = []
+    for line_no, line in enumerate(lines):
+        match = re.search(r". ([-*]) ", line)
+        if not match or line[0] not in ("-", "*"):
+            output_lines.append(line)
+            continue  # Doesn't fit the pattern we want, no changes
+        delim = match.group(1) + " "
+        splits = line.split(delim)[1:]
         replacement = ""
-
-        if match.group(1) is not None:
-            splits = splits[1:]
-            delim1 = match.group(1) + " "
-        else:
-            delim1 = ""
-            continue  # Skip false positives
-
-        pre, post = generation[:start], generation[stop:]
+        delim1 = line[0] + " "
         for i, item in enumerate(splits):
             level = 0
@@ -144,15 +135,15 @@ def normalize_list_like_lines(generation):
                 level = potential_numeral.count(".")
             replacement += (
-                ("\n" if i > 0 else "") + ("\t" * level) + (delim if i > 0 or start == 0 else delim1) + item.strip()
+                ("\n" if i > 0 else "") + ("\t" * level) + (delim if i > 0 or line_no == 0 else delim1) + item.strip()
             )
-        if post == "":
-            post = "\n"
+        if line_no == len(lines) - 1:  # If this is the last line in the generation
+            replacement += "\n"  # Add an empty line to the end of the generation
-        generation = pre + replacement + post
+        output_lines.append(replacement)
-    return generation
+    return "\n".join(output_lines)
 def find_next_punctuation(text: str, start_idx=0):
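The Nougat part of the patch replaces one large multi-line regex with a cheap per-line check. A minimal sketch of that check is shown below; the regex and the first-character test are taken from the diff, while the wrapper `is_list_like` is a hypothetical name used only here.

```python
import re


def is_list_like(line: str) -> bool:
    # The pattern has a fixed length of four characters and no unbounded
    # repetition, so scanning a line stays linear in its length.
    match = re.search(r". ([-*]) ", line)
    # Short-circuit keeps line[0] from being evaluated on an empty line.
    return bool(match) and line[0] in ("-", "*")


if __name__ == "__main__":
    for sample in ("- 1 first - 2 second", "a - b - c", "plain text", ""):
        print(repr(sample), is_list_like(sample))
```

Splitting the text on newlines first also bounds the work per match to a single line, instead of letting one pattern range over the whole generation.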
Vulnerability mechanics
Generated on May 9, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.