Regular Expression Denial of Service (ReDoS) in huggingface/transformers
Description
A Regular Expression Denial of Service (ReDoS) vulnerability was discovered in the Hugging Face Transformers library, specifically affecting the MarianTokenizer's remove_language_code() method. This vulnerability is present in version 4.52.4 and has been fixed in version 4.53.0. The issue arises from inefficient regex processing, which can be exploited by crafted input strings containing malformed language code patterns, leading to excessive CPU consumption and potential denial of service.
AI Insight
LLM-synthesized narrative grounded in this CVE's description and references.
A ReDoS vulnerability in Hugging Face Transformers' MarianTokenizer allows denial of service via crafted input, fixed in version 4.53.0.
A Regular Expression Denial of Service (ReDoS) vulnerability exists in the Hugging Face Transformers library, specifically within the MarianTokenizer.remove_language_code() method. The root cause is the use of an inefficient regular expression pattern >>.+<< to detect and remove language codes, which can exhibit catastrophic backtracking when processing crafted input strings containing malformed language code patterns [2].
An attacker can exploit this vulnerability by supplying a specially crafted string to any application that passes untrusted text to the MarianTokenizer. The tokenizer itself requires no authentication, so the attack can be triggered remotely wherever a service tokenizes user-supplied input. The malicious input causes excessive CPU consumption due to the regex engine's backtracking behavior, leading to a denial of service condition [2].
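The CPU cost described above can be observed with a small timing sketch. The function below copies the pre-patch logic from the removed lines of the fix commit into a standalone helper. The payload (repeated `>>` with no closing `<<`) and the names `remove_language_code_regex` and `time_call` are assumptions made for illustration, not taken from the advisory. Timings depend on the machine, so none are shown. If the cost grows faster than linearly, doubling `n` should more than double the elapsed time.

```python
import re
import time

# Pre-patch pattern, copied from the line removed by the fix commit.
LANGUAGE_CODE_RE = re.compile(">>.+<<")


def remove_language_code_regex(text: str):
    """Pre-patch logic: match a leading code, then strip every match."""
    match = LANGUAGE_CODE_RE.match(text)
    code = [match.group(0)] if match else []
    return code, LANGUAGE_CODE_RE.sub("", text)


def time_call(text: str) -> float:
    """Return the wall-clock seconds one call takes on `text`."""
    start = time.perf_counter()
    remove_language_code_regex(text)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical payload: many ">>" openers and no closing "<<".
    for n in (500, 1_000, 2_000):
        print(n, f"{time_call('>>' * n):.4f}s")
```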
Successful exploitation results in a denial of service: the targeted application becomes unresponsive or crashes due to CPU exhaustion. This can disrupt services relying on the Transformers library for machine learning inference or training tasks [2].
The vulnerability is present in version 4.52.4 and has been fixed in version 4.53.0. The fix replaces the vulnerable regex pattern with simple string prefix/suffix checks, eliminating the risk of catastrophic backtracking [3][4]. Users are advised to upgrade to version 4.53.0 or later to mitigate the vulnerability.
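The prefix/suffix check can be sketched as a standalone function. The body follows the added lines of the fix commit, lifted out of the `MarianTokenizer` class so it runs on its own. The sample inputs are illustrative assumptions.

```python
def remove_language_code(text: str):
    """Split a leading language code such as >>fr<< from the text."""
    code = []
    # Only a code at the very start of the text is recognised; the end
    # is the first "<<" found by a single linear scan.
    if text.startswith(">>") and (end_loc := text.find("<<")) != -1:
        code.append(text[: end_loc + 2])
        text = text[end_loc + 2 :]
    return code, text


if __name__ == "__main__":
    print(remove_language_code(">>fr<< Hello"))  # (['>>fr<<'], ' Hello')
    print(remove_language_code("Hello"))         # ([], 'Hello')
    # A payload with no closing "<<" is returned unchanged.
    code, rest = remove_language_code(">>" * 10_000)
    print(code, len(rest))                       # [] 20000
```

Both `str.startswith` and `str.find` scan the input at most once, so the work stays proportional to the input length. One behavioral difference from the regex version is visible in this sketch: only a leading code is removed, whereas the old `sub` call stripped every match in the text.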
AI Insight generated on May 19, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.
Affected packages
Versions sourced from the GitHub Security Advisory.
| Package | Affected versions | Patched versions |
|---|---|---|
| transformers (PyPI) | < 4.53.0 | 4.53.0 |
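A deployment can compare its installed version against the patched release in the table. The following is a minimal sketch using only the standard library. The helper names `parse_release`, `is_affected`, and `installed_is_affected` are hypothetical, and the parser is a simplification that ignores pre-release suffixes such as `.dev0` or `rc1`.

```python
from importlib import metadata

PATCHED = (4, 53, 0)  # first patched release per the advisory table


def parse_release(version: str) -> tuple[int, ...]:
    """Extract up to three leading numeric components, padded with zeros."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if ch not in "0123456789":
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple((parts + [0, 0, 0])[:3])


def is_affected(version: str) -> bool:
    """True when the version sorts below the patched release."""
    return parse_release(version) < PATCHED


def installed_is_affected(package: str = "transformers"):
    """Check the installed package; None when it is not installed."""
    try:
        return is_affected(metadata.version(package))
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(is_affected("4.52.4"))  # True
    print(is_affected("4.53.0"))  # False
    print(installed_is_affected())
```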
Affected products
- Range: >= 4.52.4, < 4.53.0
- huggingface/huggingface/transformers · v5 · Range: unspecified
Patches
d37f7517972f · Two ReDOS fixes (#39013)
2 files changed · +7 −8
src/transformers/models/marian/tokenization_marian.py · +5 −5 · modified

    @@ -13,7 +13,6 @@
     # limitations under the License.
     import json
     import os
    -import re
     import warnings
     from pathlib import Path
     from shutil import copyfile
    @@ -104,7 +103,6 @@ class MarianTokenizer(PreTrainedTokenizer):

         vocab_files_names = VOCAB_FILES_NAMES
         model_input_names = ["input_ids", "attention_mask"]
    -    language_code_re = re.compile(">>.+<<")  # type: re.Pattern

         def __init__(
             self,
    @@ -186,9 +184,11 @@ def _convert_token_to_id(self, token):

         def remove_language_code(self, text: str):
             """Remove language codes like >>fr<< before sentencepiece"""
    -        match = self.language_code_re.match(text)
    -        code: list = [match.group(0)] if match else []
    -        return code, self.language_code_re.sub("", text)
    +        code = []
    +        if text.startswith(">>") and (end_loc := text.find("<<")) != -1:
    +            code.append(text[: end_loc + 2])
    +            text = text[end_loc + 2 :]
    +        return code, text

         def _tokenize(self, text: str) -> list[str]:
             code, text = self.remove_language_code(text)
src/transformers/optimization_tf.py · +2 −3 · modified

    @@ -14,7 +14,6 @@
     # ==============================================================================
     """Functions and classes related to optimization (weight updates)."""

    -import re
     from typing import Callable, Optional, Union

     import tensorflow as tf
    @@ -296,12 +295,12 @@ def _do_use_weight_decay(self, param_name):

             if self._include_in_weight_decay:
                 for r in self._include_in_weight_decay:
    -                if re.search(r, param_name) is not None:
    +                if r in param_name:
                         return True

             if self._exclude_from_weight_decay:
                 for r in self._exclude_from_weight_decay:
    -                if re.search(r, param_name) is not None:
    +                if r in param_name:
                         return False
             return True
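The `optimization_tf.py` change replaces `re.search(r, param_name)` with `r in param_name`, so each entry is compared as a literal substring instead of being compiled as a regular expression. The sketch below contrasts the two checks. The helper names and the parameter names are hypothetical and chosen only to illustrate the difference.

```python
import re


def matches_regex(patterns, param_name):
    """Pre-patch check: each entry is treated as a regular expression."""
    return any(re.search(p, param_name) is not None for p in patterns)


def matches_substring(patterns, param_name):
    """Post-patch check: each entry is treated as a literal substring."""
    return any(p in param_name for p in patterns)


if __name__ == "__main__":
    # Plain names behave the same under both checks.
    name = "encoder/layer_0/LayerNorm/gamma"
    print(matches_regex(["LayerNorm"], name))      # True
    print(matches_substring(["LayerNorm"], name))  # True

    # An entry that relies on regex syntax no longer matches.
    name = "encoder/layer_0/bias"
    print(matches_regex([r"layer_\d+/bias"], name))      # True
    print(matches_substring([r"layer_\d+/bias"], name))  # False
```

Entries that are plain names keep working after the patch, while entries written with regex metacharacters would need to be rewritten as literal substrings.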
47c34fba5c30 · Just don't use RE at all
1 file changed · +5 −5
src/transformers/models/marian/tokenization_marian.py · +5 −5 · modified

    @@ -18,7 +18,6 @@
     from shutil import copyfile
     from typing import Any, Optional, Union

    -import regex as re
     import sentencepiece

     from ...tokenization_utils import PreTrainedTokenizer
    @@ -104,7 +103,6 @@ class MarianTokenizer(PreTrainedTokenizer):

         vocab_files_names = VOCAB_FILES_NAMES
         model_input_names = ["input_ids", "attention_mask"]
    -    language_code_re = re.compile(">>.++<<")  # type: re.Pattern

         def __init__(
             self,
    @@ -186,9 +184,11 @@ def _convert_token_to_id(self, token):

         def remove_language_code(self, text: str):
             """Remove language codes like >>fr<< before sentencepiece"""
    -        match = self.language_code_re.match(text)
    -        code: list = [match.group(0)] if match else []
    -        return code, self.language_code_re.sub("", text)
    +        code = []
    +        if text.startswith(">>") and (end_loc := text.find("<<")) != -1:
    +            code.append(text[: end_loc + 2])
    +            text = text[end_loc + 2 :]
    +        return code, text

         def _tokenize(self, text: str) -> list[str]:
             code, text = self.remove_language_code(text)
Vulnerability mechanics
Generated on May 9, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.
References
- github.com/advisories/GHSA-59p9-h35m-wg4g (ghsa, ADVISORY)
- nvd.nist.gov/vuln/detail/CVE-2025-6638 (ghsa, ADVISORY)
- github.com/huggingface/transformers/commit/47c34fba5c303576560cb29767efb452ff12b8be (ghsa, WEB)
- github.com/huggingface/transformers/commit/d37f7517972f67e3f2194c000ed0f87f064e5099 (ghsa, WEB)
- huntr.com/bounties/6a6c933f-9ce8-4ded-8b3b-2c1444c61f36 (ghsa, WEB)