VYPR
Moderate severity · NVD Advisory · Published Sep 12, 2025 · Updated Sep 12, 2025

Regular Expression Denial of Service (ReDoS) in huggingface/transformers

CVE-2025-6638

Description

A Regular Expression Denial of Service (ReDoS) vulnerability was discovered in the Hugging Face Transformers library, specifically affecting the MarianTokenizer's remove_language_code() method. This vulnerability is present in version 4.52.4 and has been fixed in version 4.53.0. The issue arises from inefficient regex processing, which can be exploited by crafted input strings containing malformed language code patterns, leading to excessive CPU consumption and potential denial of service.

AI Insight

LLM-synthesized narrative grounded in this CVE's description and references.

A ReDoS vulnerability in Hugging Face Transformers' MarianTokenizer allows denial of service via crafted input. It is fixed in version 4.53.0.

A Regular Expression Denial of Service (ReDoS) vulnerability exists in the Hugging Face Transformers library, specifically within the MarianTokenizer.remove_language_code() method. The root cause is the use of an inefficient regular expression pattern >>.+<< to detect and remove language codes, which can exhibit catastrophic backtracking when processing crafted input strings containing malformed language code patterns [2].
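
The pre-fix logic can be sketched as a standalone function, reconstructed from the patch diff on this page. The names old_remove_language_code, make_malformed_input, and time_old are hypothetical, and the payload shape (repeated ">>" with no closing "<<") is an assumption for illustration; the advisory does not publish the exact triggering input.

```python
import re
import time

# Pattern as shown in the removed line of the pre-4.53.0 diff.
LANGUAGE_CODE_RE = re.compile(">>.+<<")


def old_remove_language_code(text: str):
    """Pre-fix behaviour: regex match for the code, regex sub to strip it."""
    match = LANGUAGE_CODE_RE.match(text)
    code = [match.group(0)] if match else []
    return code, LANGUAGE_CODE_RE.sub("", text)


def make_malformed_input(n: int) -> str:
    """Hypothetical payload: n opening markers and no closing '<<'."""
    return ">>" * n


def time_old(n: int) -> float:
    """Wall-clock seconds for one call on a malformed input of n markers."""
    payload = make_malformed_input(n)
    start = time.perf_counter()
    old_remove_language_code(payload)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Sizes are kept small on purpose; compare how the time scales with n.
    for n in (1_000, 2_000, 4_000):
        print(n, f"{time_old(n):.4f}s")
```

On input like this, every position that starts with ">>" is a candidate match, and for each one the engine scans forward to the end of the string and then backs off looking for "<<". Under that assumption the total work should grow faster than linearly with input length, which is consistent with the CPU exhaustion described here.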

An attacker can exploit this vulnerability by providing a specially crafted string to any application that utilizes the MarianTokenizer for text processing. The attack requires no authentication and can be triggered remotely, as the tokenizer processes user-supplied input. The malicious input causes excessive CPU consumption due to the regex engine's backtracking behavior, leading to a denial of service condition [2].

Successful exploitation results in a denial of service: the targeted application becomes unresponsive or crashes from CPU exhaustion, disrupting services that rely on the Transformers library for machine learning inference or training [2].

The vulnerability is present in version 4.52.4 and has been fixed in version 4.53.0. The fix replaces the vulnerable regex pattern with simple string prefix/suffix checks, eliminating the risk of catastrophic backtracking [3][4]. Users are advised to upgrade to version 4.53.0 or later to mitigate the vulnerability.
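
The prefix/suffix check described above can be sketched as a standalone function. The body mirrors the added lines in the patch diff on this page; lifting it out of MarianTokenizer into a free function, and the sample strings, are assumptions made for illustration.

```python
def remove_language_code(text: str):
    """Strip a leading language code such as '>>fr<<' without using regex.

    Mirrors the patched method body: one startswith() check and one find(),
    each of which scans the string at most once.
    """
    code = []
    if text.startswith(">>") and (end_loc := text.find("<<")) != -1:
        code.append(text[: end_loc + 2])  # keep the '<<' in the code token
        text = text[end_loc + 2 :]        # remainder after the code
    return code, text


if __name__ == "__main__":
    print(remove_language_code(">>fr<< Bonjour"))   # code found at the start
    print(remove_language_code(">>fr Bonjour"))     # no closing '<<'
    print(remove_language_code("Bonjour >>fr<<"))   # code not at the start
```

One design consequence worth noting: this version only strips a code at the very start of the string, whereas the earlier regex sub() call would also have removed a ">>...<<" span appearing later in the text.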

AI Insight generated on May 19, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.

Affected packages

Versions sourced from the GitHub Security Advisory.

Package | Affected versions | Patched versions
transformers (PyPI) | < 4.53.0 | 4.53.0

Affected products

2
  • Range: >= 4.52.4, < 4.53.0
  • huggingface/huggingface/transformersv5
    Range: unspecified

Patches

2
d37f7517972f

Two ReDOS fixes (#39013)

2 files changed · +7 -8
  • src/transformers/models/marian/tokenization_marian.py · +5 -5 · modified
    @@ -13,7 +13,6 @@
     # limitations under the License.
     import json
     import os
    -import re
     import warnings
     from pathlib import Path
     from shutil import copyfile
    @@ -104,7 +103,6 @@ class MarianTokenizer(PreTrainedTokenizer):
     
         vocab_files_names = VOCAB_FILES_NAMES
         model_input_names = ["input_ids", "attention_mask"]
    -    language_code_re = re.compile(">>.+<<")  # type: re.Pattern
     
         def __init__(
             self,
    @@ -186,9 +184,11 @@ def _convert_token_to_id(self, token):
     
         def remove_language_code(self, text: str):
             """Remove language codes like >>fr<< before sentencepiece"""
    -        match = self.language_code_re.match(text)
    -        code: list = [match.group(0)] if match else []
    -        return code, self.language_code_re.sub("", text)
    +        code = []
    +        if text.startswith(">>") and (end_loc := text.find("<<")) != -1:
    +            code.append(text[: end_loc + 2])
    +            text = text[end_loc + 2 :]
    +        return code, text
     
         def _tokenize(self, text: str) -> list[str]:
             code, text = self.remove_language_code(text)
    
  • src/transformers/optimization_tf.py · +2 -3 · modified
    @@ -14,7 +14,6 @@
     # ==============================================================================
     """Functions and classes related to optimization (weight updates)."""
     
    -import re
     from typing import Callable, Optional, Union
     
     import tensorflow as tf
    @@ -296,12 +295,12 @@ def _do_use_weight_decay(self, param_name):
     
             if self._include_in_weight_decay:
                 for r in self._include_in_weight_decay:
    -                if re.search(r, param_name) is not None:
    +                if r in param_name:
                         return True
     
             if self._exclude_from_weight_decay:
                 for r in self._exclude_from_weight_decay:
    -                if re.search(r, param_name) is not None:
    +                if r in param_name:
                         return False
             return True
     
    
47c34fba5c30

Just don't use RE at all

1 file changed · +5 -5
  • src/transformers/models/marian/tokenization_marian.py · +5 -5 · modified
    @@ -18,7 +18,6 @@
     from shutil import copyfile
     from typing import Any, Optional, Union
     
    -import regex as re
     import sentencepiece
     
     from ...tokenization_utils import PreTrainedTokenizer
    @@ -104,7 +103,6 @@ class MarianTokenizer(PreTrainedTokenizer):
     
         vocab_files_names = VOCAB_FILES_NAMES
         model_input_names = ["input_ids", "attention_mask"]
    -    language_code_re = re.compile(">>.++<<")  # type: re.Pattern
     
         def __init__(
             self,
    @@ -186,9 +184,11 @@ def _convert_token_to_id(self, token):
     
         def remove_language_code(self, text: str):
             """Remove language codes like >>fr<< before sentencepiece"""
    -        match = self.language_code_re.match(text)
    -        code: list = [match.group(0)] if match else []
    -        return code, self.language_code_re.sub("", text)
    +        code = []
    +        if text.startswith(">>") and (end_loc := text.find("<<")) != -1:
    +            code.append(text[: end_loc + 2])
    +            text = text[end_loc + 2 :]
    +        return code, text
     
         def _tokenize(self, text: str) -> list[str]:
             code, text = self.remove_language_code(text)
    

Vulnerability mechanics

Generated on May 9, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.
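
The first patch listed on this page also changes src/transformers/optimization_tf.py, replacing re.search(r, param_name) with the substring test r in param_name. The sketch below contrasts the two checks as free functions. The function names, the parameter names, and the sample patterns are hypothetical, and any logic of the real _do_use_weight_decay method that falls outside the diff hunk is deliberately left out.

```python
import re


def use_weight_decay_regex(param_name, include=None, exclude=None):
    """Pre-fix check: each list entry is interpreted as a regular expression."""
    for r in include or []:
        if re.search(r, param_name) is not None:
            return True
    for r in exclude or []:
        if re.search(r, param_name) is not None:
            return False
    return True


def use_weight_decay_substring(param_name, include=None, exclude=None):
    """Post-fix check: each list entry is a literal substring."""
    for r in include or []:
        if r in param_name:
            return True
    for r in exclude or []:
        if r in param_name:
            return False
    return True


if __name__ == "__main__":
    literal = ["LayerNorm", "bias"]       # plain names: both checks agree
    pattern = [r"layer_\d+/bias"]         # regex syntax: only the old check matches
    for name in ("encoder/LayerNorm/gamma", "encoder/layer_3/bias"):
        print(name,
              use_weight_decay_regex(name, exclude=literal),
              use_weight_decay_substring(name, exclude=literal),
              use_weight_decay_regex(name, exclude=pattern),
              use_weight_decay_substring(name, exclude=pattern))
```

With literal entries the two checks agree. An entry that relies on regex syntax is no longer interpreted as a pattern after the change, so callers passing such entries may see different weight-decay selection.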

References

5
