Low severity · NVD Advisory · Published Jul 7, 2025 · Updated Jul 7, 2025

Improper Input Validation in huggingface/transformers

CVE-2025-3777

Description

Hugging Face Transformers versions up to 4.49.0 are affected by an improper input validation vulnerability in the image_utils.py file. The vulnerability arises from insecure URL validation using the startswith() method, which can be bypassed through URL username injection. This allows attackers to craft URLs that appear to be from YouTube but resolve to malicious domains, potentially leading to phishing attacks, malware distribution, or data exfiltration. The issue is fixed in version 4.52.1.

AI Insight

LLM-synthesized narrative grounded in this CVE's description and references.

Hugging Face Transformers ≤4.49.0 performs improper URL validation in image_utils.py: URL username injection bypasses the YouTube domain check, creating phishing and malware risks.

Vulnerability Description

CVE-2025-3777 is an improper input validation vulnerability in Hugging Face Transformers versions up to 4.49.0, located in image_utils.py. The URL check uses Python's startswith() method to verify that an image URL belongs to a trusted domain (e.g., YouTube). It can be bypassed by injecting a username into the URL, as in https://youtube.com@malicious.example.com: startswith() accepts the string because it begins with the expected YouTube prefix, but everything before the @ is parsed as userinfo, so the actual host is malicious.example.com [1][3].
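A minimal sketch of why a prefix test and the real destination can disagree, using only Python's standard urllib.parse. The prefix string below is an assumption chosen for illustration; the exact string compared in image_utils.py is not quoted in this advisory.

```python
from urllib.parse import urlsplit

# Crafted URL from the description above.
url = "https://youtube.com@malicious.example.com"

# Assumed prefix for illustration only (not taken from the Transformers source).
ASSUMED_PREFIX = "https://youtube.com"

parts = urlsplit(url)

# The text before "@" is userinfo, not the host.
print("username:", parts.username)
print("hostname:", parts.hostname)

# A prefix comparison only looks at the leading characters of the string.
print("prefix check passes:", url.startswith(ASSUMED_PREFIX))
```

The parsed hostname, not the leading characters of the string, determines which server is contacted.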

Attack Vector

An attacker can use URL username injection to craft a URL that appears to point to YouTube but resolves to a malicious domain. When a Transformers-based application loads an image from such a URL, the flawed startswith() check accepts it and the attacker's server is contacted. The attack requires no authentication or privileged network position: supplying the crafted URL to an application that loads images from external sources is enough [1][3].
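The contrast between a prefix check and a host-based check can be sketched as follows. This is an illustration under stated assumptions, not the code from Transformers or its fix: naive_check, host_check, and TRUSTED_HOSTS are hypothetical names, and the allowlist contents are invented for the example.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist for illustration; not taken from the Transformers source.
TRUSTED_HOSTS = {"youtube.com", "www.youtube.com"}


def naive_check(url: str) -> bool:
    """Prefix-style check of the kind the advisory describes (illustrative)."""
    return url.startswith(("https://youtube.com", "https://www.youtube.com"))


def host_check(url: str) -> bool:
    """Accept only https URLs with no userinfo whose parsed host is allowlisted."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # Malformed URLs (for example, a broken IPv6 literal) are rejected.
        return False
    if parts.scheme != "https":
        return False
    if parts.username is not None or parts.password is not None:
        return False
    return parts.hostname in TRUSTED_HOSTS


samples = [
    "https://www.youtube.com/watch?v=abc",          # legitimate
    "https://youtube.com@malicious.example.com",    # userinfo injection
    "https://youtube.com.malicious.example.com/x",  # look-alike subdomain
]

for u in samples:
    print(f"{u!r}: naive={naive_check(u)} host={host_check(u)}")
```

All three samples pass the prefix check, while only the first passes the host-based check, because the latter compares the parsed hostname exactly and refuses URLs that carry userinfo.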

Impact

Successful exploitation could lead to phishing attacks where users are tricked into interacting with a seemingly legitimate YouTube URL, malware distribution if the malicious domain serves infected content, or data exfiltration if the attacker's server proxies or captures sensitive data. The vulnerability affects the trust that users and applications place in URL validation within Transformers, potentially impacting any downstream service that relies on this filtering to fetch images [3].

Mitigation

The issue has been fixed in Transformers version 4.52.1. Users are strongly advised to update to this version or later. No official workaround is documented for earlier versions, so upgrading is the primary remediation. The vulnerability is not listed on CISA's Known Exploited Vulnerabilities catalog as of this writing [2][3].
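One way to confirm whether an environment already has the fixed release is to compare the installed version against 4.52.1. The sketch below uses only the standard library; parse_release and is_patched are hypothetical helper names, and the parsing is deliberately simplified (it reads leading numeric components only, so pre-release tags such as "rc1" are not distinguished from the final release).

```python
from importlib.metadata import PackageNotFoundError, version

# First fixed release according to the advisory.
FIXED = (4, 52, 1)


def parse_release(v: str) -> tuple:
    """Return the leading numeric components, e.g. "4.49.0" -> (4, 49, 0)."""
    nums = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        if not digits:
            break  # stop at suffixes such as "dev0"
        nums.append(int(digits))
    return tuple(nums)


def is_patched(v: str) -> bool:
    """True if the version string is at or above the fixed release."""
    return parse_release(v) >= FIXED


def installed_transformers_version():
    """Return the installed transformers version string, or None if absent."""
    try:
        return version("transformers")
    except PackageNotFoundError:
        return None


current = installed_transformers_version()
if current is None:
    print("transformers is not installed in this environment")
elif is_patched(current):
    print(f"transformers {current} includes the fix")
else:
    print(f"transformers {current} predates 4.52.1; upgrade recommended")
```

For production tooling, a full version parser such as the one in the third-party packaging library handles pre-release and local version tags more precisely than this simplified comparison.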

References

AI Insight generated on May 19, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.

Affected packages

Versions sourced from the GitHub Security Advisory.

Package	Affected versions	Patched versions
transformers (PyPI)	< 4.52.1	4.52.1

Affected products

Huggingface/Transformers (llm-fuzzy)
Range: <=4.49.0
huggingface/huggingface/transformers (v5)
Range: unspecified

Patches

4dda5f71b35f

Merge branch 'main' into chat-template-url

https://github.com/huggingface/transformers · Raushan Turganbay · Apr 17, 2025 · via ghsa

commit

42 files changed · +805 −112

docs/source/en/model_doc/bridgetower.md+5 −0 modified

@@ -147,6 +147,11 @@ Tips:
 [[autodoc]] BridgeTowerImageProcessor
     - preprocess
 
+## BridgeTowerImageProcessorFast
+
+[[autodoc]] BridgeTowerImageProcessorFast
+    - preprocess
+
 ## BridgeTowerProcessor
 
 [[autodoc]] BridgeTowerProcessor

docs/source/en/model_doc/efficientnet.md+5 −0 modified

@@ -43,6 +43,11 @@ The original code can be found [here](https://github.com/tensorflow/tpu/tree/mas
 [[autodoc]] EfficientNetImageProcessor
     - preprocess
 
+## EfficientNetImageProcessorFast
+
+[[autodoc]] EfficientNetImageProcessorFast
+    - preprocess
+
 ## EfficientNetModel
 
 [[autodoc]] EfficientNetModel

docs/source/ja/model_doc/bridgetower.md+5 −0 modified

@@ -144,6 +144,11 @@ BridgeTower は、ビジュアル エンコーダー、テキスト エンコー
 [[autodoc]] BridgeTowerImageProcessor
     - preprocess
 
+## BridgeTowerImageProcessorFast
+
+[[autodoc]] BridgeTowerImageProcessorFast
+    - preprocess
+
 ## BridgeTowerProcessor
 
 [[autodoc]] BridgeTowerProcessor

src/transformers/image_utils.py+1 −1 modified

@@ -66,7 +66,7 @@
         from torchvision.transforms import InterpolationMode
 
         pil_torch_interpolation_mapping = {
-            PILImageResampling.NEAREST: InterpolationMode.NEAREST,
+            PILImageResampling.NEAREST: InterpolationMode.NEAREST_EXACT,
             PILImageResampling.BOX: InterpolationMode.BOX,
             PILImageResampling.BILINEAR: InterpolationMode.BILINEAR,
             PILImageResampling.HAMMING: InterpolationMode.HAMMING,

src/transformers/models/auto/image_processing_auto.py+3 −3 modified

@@ -56,13 +56,13 @@
 else:
     IMAGE_PROCESSOR_MAPPING_NAMES = OrderedDict(
         [
-            ("align", ("EfficientNetImageProcessor",)),
+            ("align", ("EfficientNetImageProcessor", "EfficientNetImageProcessorFast")),
             ("aria", ("AriaImageProcessor",)),
             ("beit", ("BeitImageProcessor",)),
             ("bit", ("BitImageProcessor", "BitImageProcessorFast")),
             ("blip", ("BlipImageProcessor", "BlipImageProcessorFast")),
             ("blip-2", ("BlipImageProcessor", "BlipImageProcessorFast")),
-            ("bridgetower", ("BridgeTowerImageProcessor",)),
+            ("bridgetower", ("BridgeTowerImageProcessor", "BridgeTowerImageProcessorFast")),
             ("chameleon", ("ChameleonImageProcessor",)),
             ("chinese_clip", ("ChineseCLIPImageProcessor", "ChineseCLIPImageProcessorFast")),
             ("clip", ("CLIPImageProcessor", "CLIPImageProcessorFast")),
@@ -83,7 +83,7 @@
             ("donut-swin", ("DonutImageProcessor", "DonutImageProcessorFast")),
             ("dpt", ("DPTImageProcessor",)),
             ("efficientformer", ("EfficientFormerImageProcessor",)),
-            ("efficientnet", ("EfficientNetImageProcessor",)),
+            ("efficientnet", ("EfficientNetImageProcessor", "EfficientNetImageProcessorFast")),
             ("flava", ("FlavaImageProcessor", "FlavaImageProcessorFast")),
             ("focalnet", ("BitImageProcessor", "BitImageProcessorFast")),
             ("fuyu", ("FuyuImageProcessor",)),

src/transformers/models/bamba/modeling_bamba.py+2 −2 modified

@@ -783,8 +783,8 @@ def torch_forward(
             hidden_states = hidden_states.reshape(batch_size, seq_len, -1, self.head_dim).float()
             B = B.reshape(batch_size, seq_len, -1, self.ssm_state_size).float()
             C = C.reshape(batch_size, seq_len, -1, self.ssm_state_size).float()
-            B = B.repeat(1, 1, self.num_heads // self.n_groups, 1)
-            C = C.repeat(1, 1, self.num_heads // self.n_groups, 1)
+            B = B.repeat_interleave(self.num_heads // self.n_groups, dim=2, output_size=self.num_heads)
+            C = C.repeat_interleave(self.num_heads // self.n_groups, dim=2, output_size=self.num_heads)
             pad_size = (self.chunk_size - seq_len % self.chunk_size) % self.chunk_size
 
             D_residual = self.D[..., None] * pad_tensor_by_size(hidden_states, pad_size)

src/transformers/models/bamba/modular_bamba.py+2 −2 modified

@@ -580,8 +580,8 @@ def torch_forward(
             hidden_states = hidden_states.reshape(batch_size, seq_len, -1, self.head_dim).float()
             B = B.reshape(batch_size, seq_len, -1, self.ssm_state_size).float()
             C = C.reshape(batch_size, seq_len, -1, self.ssm_state_size).float()
-            B = B.repeat(1, 1, self.num_heads // self.n_groups, 1)
-            C = C.repeat(1, 1, self.num_heads // self.n_groups, 1)
+            B = B.repeat_interleave(self.num_heads // self.n_groups, dim=2, output_size=self.num_heads)
+            C = C.repeat_interleave(self.num_heads // self.n_groups, dim=2, output_size=self.num_heads)
             pad_size = (self.chunk_size - seq_len % self.chunk_size) % self.chunk_size
 
             D_residual = self.D[..., None] * pad_tensor_by_size(hidden_states, pad_size)

src/transformers/models/beit/modeling_beit.py+1 −1 modified

@@ -663,7 +663,7 @@ def __init__(self, config: BeitConfig, window_size: Optional[tuple] = None) -> N
             self.relative_position_bias = BeitRelativePositionBias(config, window_size=window_size)
 
         # stochastic depth decay rule
-        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, config.num_hidden_layers)]
+        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, config.num_hidden_layers, device="cpu")]
         self.layer = nn.ModuleList(
             [
                 BeitLayer(

src/transformers/models/bridgetower/image_processing_bridgetower_fast.py+345 −0 added

@@ -0,0 +1,345 @@
+# coding=utf-8
+# Copyright 2025 The Intel Labs Team Authors, The Microsoft Research Team Authors and HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Fast Image processor class for BridgeTower."""
+
+from typing import Dict, Iterable, Optional, Tuple, Union
+
+from ...image_processing_utils_fast import (
+    BASE_IMAGE_PROCESSOR_FAST_DOCSTRING,
+    BASE_IMAGE_PROCESSOR_FAST_DOCSTRING_PREPROCESS,
+    BaseImageProcessorFast,
+    BatchFeature,
+    DefaultFastImageProcessorKwargs,
+    ImageInput,
+    SizeDict,
+    TensorType,
+    Unpack,
+    get_max_height_width,
+    group_images_by_shape,
+    reorder_images,
+)
+from ...image_utils import OPENAI_CLIP_MEAN, OPENAI_CLIP_STD, PILImageResampling
+from ...utils import add_start_docstrings, is_torch_available, is_torchvision_available, is_torchvision_v2_available
+
+
+if is_torch_available():
+    import torch
+
+if is_torchvision_available():
+    if is_torchvision_v2_available():
+        from torchvision.transforms.v2 import functional as F
+    else:
+        from torchvision.transforms import functional as F
+
+
+def make_pixel_mask(
+    image: "torch.Tensor",
+    output_size: Tuple[int, int],
+) -> "torch.Tensor":
+    """
+    Make a pixel mask for the image, where 1 indicates a valid pixel and 0 indicates padding.
+
+    Args:
+        image (`np.ndarray`):
+            Image to make the pixel mask for.
+        output_size (`Tuple[int, int]`):
+            Output size of the mask.
+    """
+    input_height, input_width = image.shape[-2:]
+    batch_size = image.size(0)
+    mask = torch.zeros((batch_size, *output_size), dtype=torch.long)
+    mask[:input_height, :input_width] = 1
+    return mask
+
+
+def get_resize_output_image_size(
+    input_image: "torch.Tensor",
+    shorter: int = 800,
+    longer: int = 1333,
+    size_divisor: int = 32,
+) -> Tuple[int, int]:
+    input_height, input_width = input_image.shape[-2:]
+    min_size, max_size = shorter, longer
+
+    scale = min_size / min(input_height, input_width)
+
+    if input_height < input_width:
+        new_height = min_size
+        new_width = scale * input_width
+    else:
+        new_height = scale * input_height
+        new_width = min_size
+
+    if max(new_height, new_width) > max_size:
+        scale = max_size / max(new_height, new_width)
+        new_height = scale * new_height
+        new_width = scale * new_width
+
+    new_height, new_width = int(new_height + 0.5), int(new_width + 0.5)
+    new_height = new_height // size_divisor * size_divisor
+    new_width = new_width // size_divisor * size_divisor
+
+    return new_height, new_width
+
+
+class BridgeTowerFastImageProcessorKwargs(DefaultFastImageProcessorKwargs):
+    size_divisor: Optional[int]
+    do_pad: Optional[bool]
+
+
+@add_start_docstrings(
+    "Constructs a fast BridgeTower image processor.",
+    BASE_IMAGE_PROCESSOR_FAST_DOCSTRING,
+    """
+        size_divisor (`int`, *optional*, defaults to 32):
+            The size by which to make sure both the height and width can be divided. Only has an effect if `do_resize`
+            is set to `True`. Can be overridden by the `size_divisor` parameter in the `preprocess` method.
+        do_pad (`bool`, *optional*, defaults to `True`):
+            Whether to pad the image to the `(max_height, max_width)` of the images in the batch. Can be overridden by
+            the `do_pad` parameter in the `preprocess` method.
+    """,
+)
+class BridgeTowerImageProcessorFast(BaseImageProcessorFast):
+    resample = PILImageResampling.BICUBIC
+    image_mean = OPENAI_CLIP_MEAN
+    image_std = OPENAI_CLIP_STD
+    size = {"shortest_edge": 288}
+    default_to_square = False
+    crop_size = {"shortest_edge": 288}
+    do_resize = True
+    do_center_crop = True
+    do_rescale = True
+    do_normalize = True
+    do_pad = True
+    size_divisor = 32
+    valid_kwargs = BridgeTowerFastImageProcessorKwargs
+
+    def __init__(self, **kwargs: Unpack[BridgeTowerFastImageProcessorKwargs]):
+        super().__init__(**kwargs)
+
+    @add_start_docstrings(
+        BASE_IMAGE_PROCESSOR_FAST_DOCSTRING_PREPROCESS,
+        """
+            size_divisor (`int`, *optional*, defaults to 32):
+                The size by which to make sure both the height and width can be divided. Only has an effect if `do_resize`
+                is set to `True`. Can be overridden by the `size_divisor` parameter in the `preprocess` method.
+            do_pad (`bool`, *optional*, defaults to `True`):
+                Whether to pad the image to the `(max_height, max_width)` of the images in the batch. Can be overridden by
+                the `do_pad` parameter in the `preprocess` method.
+        """,
+    )
+    def preprocess(self, images: ImageInput, **kwargs: Unpack[BridgeTowerFastImageProcessorKwargs]) -> BatchFeature:
+        return super().preprocess(images, **kwargs)
+
+    def resize(
+        self,
+        image: "torch.Tensor",
+        size: SizeDict,
+        size_divisor: int = 32,
+        interpolation: "F.InterpolationMode" = None,
+        antialias: bool = True,
+        **kwargs,
+    ) -> "torch.Tensor":
+        """
+        Resize an image.
+
+        Resizes the shorter side of the image to `size["shortest_edge"]` while preserving the aspect ratio. If the
+        longer side is larger than the max size `(int(`size["shortest_edge"]` * 1333 / 800))`, the longer side is then
+        resized to the max size while preserving the aspect ratio.
+
+        Args:
+            image (`torch.Tensor`):
+                Image to resize.
+            size (`SizeDict`):
+                Dictionary in the format `{"height": int, "width": int}` specifying the size of the output image.
+            size_divisor (`int`, *optional*, defaults to 32):
+                The image is resized to a size that is a multiple of this value.
+            resample (`InterpolationMode`, *optional*, defaults to `InterpolationMode.BILINEAR`):
+                `InterpolationMode` filter to use when resizing the image e.g. `InterpolationMode.BICUBIC`.
+
+        Returns:
+            `torch.Tensor`: The resized image.
+        """
+        interpolation = interpolation if interpolation is not None else F.InterpolationMode.BILINEAR
+        if not size.shortest_edge:
+            raise ValueError(f"The `size` dictionary must contain the key `shortest_edge`. Got {size.keys()}")
+        shorter = size.shortest_edge
+        longer = int(1333 / 800 * shorter)
+        output_size = get_resize_output_image_size(
+            image,
+            shorter=shorter,
+            longer=longer,
+            size_divisor=size_divisor,
+        )
+        return F.resize(image, output_size, interpolation=interpolation, antialias=antialias)
+
+    def center_crop(
+        self,
+        image: "torch.Tensor",
+        size: Dict[str, int],
+        **kwargs,
+    ) -> "torch.Tensor":
+        """
+        Center crop an image to `(size["height"], size["width"])`. If the input size is smaller than `crop_size` along
+        any edge, the image is padded with 0's and then center cropped.
+
+        Args:
+            image (`torch.Tensor`):
+                Image to center crop.
+            size (`Dict[str, int]`):
+                Size of the output image in the form `{"height": h, "width": w}`.
+        """
+        output_size = size.shortest_edge
+        return F.center_crop(
+            image,
+            output_size=(output_size, output_size),
+            **kwargs,
+        )
+
+    def _pad_image(
+        self,
+        image: "torch.Tensor",
+        output_size: Tuple[int, int],
+        constant_values: Union[float, Iterable[float]] = 0,
+    ) -> "torch.Tensor":
+        """
+        Pad an image with zeros to the given size.
+        """
+        input_height, input_width = image.shape[-2:]
+        output_height, output_width = output_size
+
+        pad_bottom = output_height - input_height
+        pad_right = output_width - input_width
+        padding = (0, 0, pad_right, pad_bottom)
+        padded_image = F.pad(
+            image,
+            padding,
+            fill=constant_values,
+        )
+        return padded_image
+
+    def pad(
+        self,
+        images: list["torch.Tensor"],
+        constant_values: Union[float, Iterable[float]] = 0,
+        return_pixel_mask: bool = True,
+    ) -> tuple:
+        """
+        Pads a batch of images to the bottom and right of the image with zeros to the size of largest height and width
+        in the batch and optionally returns their corresponding pixel mask.
+
+        Args:
+            image (`torch.Tensor`):
+                Image to pad.
+            constant_values (`float` or `Iterable[float]`, *optional*):
+                The value to use for the padding if `mode` is `"constant"`.
+            return_pixel_mask (`bool`, *optional*, defaults to `True`):
+                Whether to return a pixel mask.
+            return_tensors (`str` or `TensorType`, *optional*):
+                The type of tensors to return. Can be one of:
+                    - Unset: Return a list of `np.ndarray`.
+                    - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
+                    - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
+                    - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
+                    - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
+        """
+        pad_size = get_max_height_width(images)
+
+        grouped_images, grouped_images_index = group_images_by_shape(images)
+        processed_images_grouped = {}
+        processed_masks_grouped = {}
+        for shape, stacked_images in grouped_images.items():
+            stacked_images = self._pad_image(
+                stacked_images,
+                pad_size,
+                constant_values=constant_values,
+            )
+            processed_images_grouped[shape] = stacked_images
+
+            if return_pixel_mask:
+                stacked_masks = make_pixel_mask(image=stacked_images, output_size=pad_size)
+                processed_masks_grouped[shape] = stacked_masks
+
+        processed_images = reorder_images(processed_images_grouped, grouped_images_index)
+
+        processed_masks = None
+        if return_pixel_mask:
+            processed_masks = reorder_images(processed_masks_grouped, grouped_images_index)
+
+        return processed_images, processed_masks
+
+    def _preprocess(
+        self,
+        images: list["torch.Tensor"],
+        do_resize: bool,
+        size: SizeDict,
+        size_divisor: Optional[int],
+        interpolation: Optional["F.InterpolationMode"],
+        do_pad: bool,
+        do_center_crop: bool,
+        crop_size: SizeDict,
+        do_rescale: bool,
+        rescale_factor: float,
+        do_normalize: bool,
+        image_mean: Optional[Union[float, list[float]]],
+        image_std: Optional[Union[float, list[float]]],
+        return_tensors: Optional[Union[str, TensorType]],
+        **kwargs,
+    ) -> BatchFeature:
+        # Group images by size for batched resizing
+        grouped_images, grouped_images_index = group_images_by_shape(images)
+        resized_images_grouped = {}
+        for shape, stacked_images in grouped_images.items():
+            if do_resize:
+                stacked_images = self.resize(
+                    image=stacked_images, size=size, size_divisor=size_divisor, interpolation=interpolation
+                )
+            resized_images_grouped[shape] = stacked_images
+        resized_images = reorder_images(resized_images_grouped, grouped_images_index)
+
+        # Group images by size for further processing
+        # Needed in case do_resize is False, or resize returns images with different sizes
+        grouped_images, grouped_images_index = group_images_by_shape(resized_images)
+        processed_images_grouped = {}
+        for shape, stacked_images in grouped_images.items():
+            if do_center_crop:
+                stacked_images = self.center_crop(stacked_images, crop_size)
+            # Fused rescale and normalize
+            stacked_images = self.rescale_and_normalize(
+                stacked_images, do_rescale, rescale_factor, do_normalize, image_mean, image_std
+            )
+            processed_images_grouped[shape] = stacked_images
+
+        processed_images = reorder_images(processed_images_grouped, grouped_images_index)
+
+        data = {}
+        if do_pad:
+            processed_images, processed_masks = self.pad(processed_images, return_pixel_mask=True)
+            processed_masks = torch.stack(processed_masks, dim=0) if return_tensors else processed_masks
+            data["pixel_mask"] = processed_masks
+
+        processed_images = torch.stack(processed_images, dim=0) if return_tensors else processed_images
+        data["pixel_values"] = processed_images
+
+        return BatchFeature(data=data, tensor_type=return_tensors)
+
+    def to_dict(self):
+        encoder_dict = super().to_dict()
+        encoder_dict.pop("_valid_processor_keys", None)
+        encoder_dict.pop("crop_size", None)
+        return encoder_dict
+
+
+__all__ = ["BridgeTowerImageProcessorFast"]

src/transformers/models/bridgetower/image_processing_bridgetower.py+3 −5 modified

@@ -28,8 +28,8 @@
     PILImageResampling,
     get_image_size,
     infer_channel_dimension_format,
-    is_batched,
     is_scaled_image,
+    make_flat_list_of_images,
     to_numpy_array,
     valid_images,
     validate_preprocess_arguments,
@@ -455,7 +455,7 @@ def preprocess(
         image_mean = image_mean if image_mean is not None else self.image_mean
         image_std = image_std if image_std is not None else self.image_std
         do_pad = do_pad if do_pad is not None else self.do_pad
-        do_center_crop if do_center_crop is not None else self.do_center_crop
+        do_center_crop = do_center_crop if do_center_crop is not None else self.do_center_crop
         # For backwards compatibility. Initial version of this processor was cropping to the "size" argument, which
         # it should default to if crop_size is undefined.
         crop_size = (
@@ -464,9 +464,7 @@ def preprocess(
 
         size = size if size is not None else self.size
         size = get_size_dict(size, default_to_square=False)
-
-        if not is_batched(images):
-            images = [images]
+        images = make_flat_list_of_images(images)
 
         if not valid_images(images):
             raise ValueError(

src/transformers/models/bridgetower/__init__.py+1 −0 modified

@@ -20,6 +20,7 @@
 if TYPE_CHECKING:
     from .configuration_bridgetower import *
     from .image_processing_bridgetower import *
+    from .image_processing_bridgetower_fast import *
     from .modeling_bridgetower import *
     from .processing_bridgetower import *
 else:

src/transformers/models/clap/modeling_clap.py+1 −1 modified

@@ -829,7 +829,7 @@ def __init__(self, config):
 
         self.num_features = int(config.patch_embeds_hidden_size * 2 ** (self.num_layers - 1))
 
-        drop_path_rate = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths))]
+        drop_path_rate = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths), device="cpu")]
 
         grid_size = self.patch_embed.grid_size
         self.input_resolutions = [(grid_size[0] // (2**i), grid_size[1] // (2**i)) for i in range(self.num_layers)]

src/transformers/models/convnext/modeling_convnext.py+2 −1 modified

@@ -225,7 +225,8 @@ def __init__(self, config):
         super().__init__()
         self.stages = nn.ModuleList()
         drop_path_rates = [
-            x.tolist() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths)).split(config.depths)
+            x.tolist()
+            for x in torch.linspace(0, config.drop_path_rate, sum(config.depths), device="cpu").split(config.depths)
         ]
         prev_chs = config.hidden_sizes[0]
         for i in range(config.num_stages):

src/transformers/models/convnextv2/modeling_convnextv2.py+2 −1 modified

@@ -245,7 +245,8 @@ def __init__(self, config):
         super().__init__()
         self.stages = nn.ModuleList()
         drop_path_rates = [
-            x.tolist() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths)).split(config.depths)
+            x.tolist()
+            for x in torch.linspace(0, config.drop_path_rate, sum(config.depths), device="cpu").split(config.depths)
         ]
         prev_chs = config.hidden_sizes[0]
         for i in range(config.num_stages):

src/transformers/models/cvt/modeling_cvt.py+3 −1 modified

@@ -449,7 +449,9 @@ def __init__(self, config, stage):
             dropout_rate=config.drop_rate[self.stage],
         )
 
-        drop_path_rates = [x.item() for x in torch.linspace(0, config.drop_path_rate[self.stage], config.depth[stage])]
+        drop_path_rates = [
+            x.item() for x in torch.linspace(0, config.drop_path_rate[self.stage], config.depth[stage], device="cpu")
+        ]
 
         self.layers = nn.Sequential(
             *[

src/transformers/models/data2vec/modeling_data2vec_vision.py+1 −1 modified

@@ -676,7 +676,7 @@ def __init__(self, config: Data2VecVisionConfig, window_size: Optional[tuple] =
             self.relative_position_bias = Data2VecVisionRelativePositionBias(config, window_size=window_size)
 
         # stochastic depth decay rule
-        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, config.num_hidden_layers)]
+        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, config.num_hidden_layers, device="cpu")]
         self.layer = nn.ModuleList(
             [
                 Data2VecVisionLayer(

src/transformers/models/donut/modeling_donut_swin.py+1 −1 modified

@@ -790,7 +790,7 @@ def __init__(self, config, grid_size):
         super().__init__()
         self.num_layers = len(config.depths)
         self.config = config
-        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths))]
+        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths), device="cpu")]
         self.layers = nn.ModuleList(
             [
                 DonutSwinStage(

src/transformers/models/efficientnet/image_processing_efficientnet_fast.py+226 −0 added

@@ -0,0 +1,226 @@
+# coding=utf-8
+# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Fast Image processor class for EfficientNet."""
+
+from functools import lru_cache
+from typing import Optional, Union
+
+from ...image_processing_utils_fast import (
+    BASE_IMAGE_PROCESSOR_FAST_DOCSTRING,
+    BASE_IMAGE_PROCESSOR_FAST_DOCSTRING_PREPROCESS,
+    BaseImageProcessorFast,
+    BatchFeature,
+    DefaultFastImageProcessorKwargs,
+)
+from ...image_transforms import group_images_by_shape, reorder_images
+from ...image_utils import (
+    IMAGENET_STANDARD_MEAN,
+    IMAGENET_STANDARD_STD,
+    ImageInput,
+    PILImageResampling,
+    SizeDict,
+)
+from ...processing_utils import Unpack
+from ...utils import (
+    TensorType,
+    add_start_docstrings,
+    is_torch_available,
+    is_torchvision_available,
+    is_torchvision_v2_available,
+)
+
+
+if is_torch_available():
+    import torch
+
+if is_torchvision_available():
+    if is_torchvision_v2_available():
+        from torchvision.transforms.v2 import functional as F
+    else:
+        from torchvision.transforms import functional as F
+
+
+class EfficientNetFastImageProcessorKwargs(DefaultFastImageProcessorKwargs):
+    rescale_offset: bool
+    include_top: bool
+
+
+@add_start_docstrings(
+    "Constructs a fast EfficientNet image processor.",
+    BASE_IMAGE_PROCESSOR_FAST_DOCSTRING,
+)
+class EfficientNetImageProcessorFast(BaseImageProcessorFast):
+    resample = PILImageResampling.NEAREST
+    image_mean = IMAGENET_STANDARD_MEAN
+    image_std = IMAGENET_STANDARD_STD
+    size = {"height": 346, "width": 346}
+    crop_size = {"height": 289, "width": 289}
+    do_resize = True
+    do_center_crop = False
+    do_rescale = True
+    rescale_factor = 1 / 255
+    rescale_offset = False
+    do_normalize = True
+    include_top = True
+    valid_kwargs = EfficientNetFastImageProcessorKwargs
+
+    def __init__(self, **kwargs: Unpack[EfficientNetFastImageProcessorKwargs]):
+        super().__init__(**kwargs)
+
+    def rescale(
+        self,
+        image: "torch.Tensor",
+        scale: float,
+        offset: Optional[bool] = True,
+        **kwargs,
+    ) -> "torch.Tensor":
+        """
+        Rescale an image by a scale factor.
+
+        If `offset` is `True`, the image has its values rescaled by `scale` and then offset by 1. If `scale` is
+        1/127.5, the image is rescaled between [-1, 1].
+            image = image * scale - 1
+
+        If `offset` is `False`, and `scale` is 1/255, the image is rescaled between [0, 1].
+            image = image * scale
+
+        Args:
+            image (`torch.Tensor`):
+                Image to rescale.
+            scale (`float`):
+                The scaling factor to rescale pixel values by.
+            offset (`bool`, *optional*):
+                Whether to scale the image in both negative and positive directions.
+
+        Returns:
+            `torch.Tensor`: The rescaled image.
+        """
+
+        rescaled_image = image * scale
+
+        if offset:
+            rescaled_image -= 1
+
+        return rescaled_image
+
+    @lru_cache(maxsize=10)
+    def _fuse_mean_std_and_rescale_factor(
+        self,
+        do_normalize: Optional[bool] = None,
+        image_mean: Optional[Union[float, list[float]]] = None,
+        image_std: Optional[Union[float, list[float]]] = None,
+        do_rescale: Optional[bool] = None,
+        rescale_factor: Optional[float] = None,
+        device: Optional["torch.device"] = None,
+        rescale_offset: Optional[bool] = False,
+    ) -> tuple:
+        if do_rescale and do_normalize and not rescale_offset:
+            # Fused rescale and normalize
+            image_mean = torch.tensor(image_mean, device=device) * (1.0 / rescale_factor)
+            image_std = torch.tensor(image_std, device=device) * (1.0 / rescale_factor)
+            do_rescale = False
+        return image_mean, image_std, do_rescale
+
+    def rescale_and_normalize(
+        self,
+        images: "torch.Tensor",
+        do_rescale: bool,
+        rescale_factor: float,
+        do_normalize: bool,
+        image_mean: Union[float, list[float]],
+        image_std: Union[float, list[float]],
+        rescale_offset: bool = False,
+    ) -> "torch.Tensor":
+        """
+        Rescale and normalize images.
+        """
+        image_mean, image_std, do_rescale = self._fuse_mean_std_and_rescale_factor(
+            do_normalize=do_normalize,
+            image_mean=image_mean,
+            image_std=image_std,
+            do_rescale=do_rescale,
+            rescale_factor=rescale_factor,
+            device=images.device,
+            rescale_offset=rescale_offset,
+        )
+        # do_rescale is already False here if rescaling was fused into image_mean/image_std above
+        if do_rescale:
+            images = self.rescale(images, rescale_factor, rescale_offset)
+        if do_normalize:
+            images = self.normalize(images.to(dtype=torch.float32), image_mean, image_std)
+
+        return images
+
+    def _preprocess(
+        self,
+        images: list["torch.Tensor"],
+        do_resize: bool,
+        size: SizeDict,
+        interpolation: Optional["F.InterpolationMode"],
+        do_center_crop: bool,
+        crop_size: SizeDict,
+        do_rescale: bool,
+        rescale_factor: float,
+        rescale_offset: bool,
+        do_normalize: bool,
+        include_top: bool,
+        image_mean: Optional[Union[float, list[float]]],
+        image_std: Optional[Union[float, list[float]]],
+        return_tensors: Optional[Union[str, TensorType]],
+        **kwargs,
+    ) -> BatchFeature:
+        # Group images by size for batched resizing
+        grouped_images, grouped_images_index = group_images_by_shape(images)
+        resized_images_grouped = {}
+        for shape, stacked_images in grouped_images.items():
+            if do_resize:
+                stacked_images = self.resize(image=stacked_images, size=size, interpolation=interpolation)
+            resized_images_grouped[shape] = stacked_images
+        resized_images = reorder_images(resized_images_grouped, grouped_images_index)
+
+        # Group images by size for further processing
+        # Needed in case do_resize is False, or resize returns images with different sizes
+        grouped_images, grouped_images_index = group_images_by_shape(resized_images)
+        processed_images_grouped = {}
+        for shape, stacked_images in grouped_images.items():
+            if do_center_crop:
+                stacked_images = self.center_crop(stacked_images, crop_size)
+            # Fused rescale and normalize
+            stacked_images = self.rescale_and_normalize(
+                stacked_images, do_rescale, rescale_factor, do_normalize, image_mean, image_std, rescale_offset
+            )
+            if include_top:
+                stacked_images = self.normalize(stacked_images, 0, image_std)
+            processed_images_grouped[shape] = stacked_images
+
+        processed_images = reorder_images(processed_images_grouped, grouped_images_index)
+        processed_images = torch.stack(processed_images, dim=0) if return_tensors else processed_images
+
+        return BatchFeature(data={"pixel_values": processed_images}, tensor_type=return_tensors)
+
+    @add_start_docstrings(
+        BASE_IMAGE_PROCESSOR_FAST_DOCSTRING_PREPROCESS,
+        """
+        rescale_offset (`bool`, *optional*, defaults to `self.rescale_offset`):
+            Whether to rescale the image between [-scale_range, scale_range] instead of [0, scale_range].
+        include_top (`bool`, *optional*, defaults to `self.include_top`):
+            Normalize the image again with the standard deviation only for image classification if set to True.
+        """,
+    )
+    def preprocess(self, images: ImageInput, **kwargs: Unpack[EfficientNetFastImageProcessorKwargs]) -> BatchFeature:
+        return super().preprocess(images, **kwargs)
+
+
+__all__ = ["EfficientNetImageProcessorFast"]
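
The rescale and fusion logic in the file above can be checked on single pixel values. The following is a minimal sketch in plain Python, not the library's implementation: the scalar helpers `rescale`, `normalize`, and `fused_rescale_normalize` are hypothetical names that only mirror the arithmetic of `rescale` and `_fuse_mean_std_and_rescale_factor` in the diff.

```python
def rescale(pixel: float, scale: float, offset: bool) -> float:
    """Scalar mirror of `rescale` in the diff: multiply, then optionally subtract 1."""
    value = pixel * scale
    if offset:
        value -= 1
    return value


def normalize(value: float, mean: float, std: float) -> float:
    """Standard normalization: subtract the mean, divide by the standard deviation."""
    return (value - mean) / std


def fused_rescale_normalize(pixel: float, scale: float, mean: float, std: float) -> float:
    """Hypothetical scalar version of the fused path (only valid without offset).

    Instead of rescaling the pixel, the statistics are multiplied by 1/scale,
    because (pixel * scale - mean) / std == (pixel - mean / scale) / (std / scale).
    """
    fused_mean = mean * (1.0 / scale)
    fused_std = std * (1.0 / scale)
    return normalize(pixel, fused_mean, fused_std)


if __name__ == "__main__":
    # With offset=True and scale=1/127.5, the range 0..255 maps onto roughly [-1, 1].
    print(rescale(0, 1 / 127.5, True), rescale(255, 1 / 127.5, True))
    # The fused and unfused paths agree up to floating-point rounding.
    unfused = normalize(rescale(128, 1 / 255, False), 0.5, 0.5)
    print(unfused, fused_rescale_normalize(128, 1 / 255, 0.5, 0.5))
```

The identity only holds when no offset is applied, which appears to be why the diff guards the fused branch with `not rescale_offset`.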

src/transformers/models/efficientnet/__init__.py+1 −0 modified

@@ -20,6 +20,7 @@
 if TYPE_CHECKING:
     from .configuration_efficientnet import *
     from .image_processing_efficientnet import *
+    from .image_processing_efficientnet_fast import *
     from .modeling_efficientnet import *
 else:
     import sys

src/transformers/models/focalnet/modeling_focalnet.py+1 −1 modified

@@ -486,7 +486,7 @@ def __init__(self, config, index, input_resolution):
         downsample = FocalNetPatchEmbeddings if (index < self.num_stages - 1) else None
 
         # stochastic depth decay rule
-        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths))]
+        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths), device="cpu")]
         drop_path = dpr[sum(config.depths[:index]) : sum(config.depths[: index + 1])]
 
         self.layers = nn.ModuleList(
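
The "stochastic depth decay rule" touched by this hunk, and by the similar hunks in other models, only needs a short list of Python floats. A plausible reading of the `device="cpu"` change is that it keeps `.item()` usable when the model is constructed under a `meta` device context, but that rationale is an assumption and is not stated in the diff. The sketch below reproduces the rule without torch. The function names are hypothetical, and the values are double-precision floats, so they will not match the float32 rounding of `torch.linspace` exactly.

```python
def drop_path_rates(drop_path_rate: float, total_depth: int) -> list[float]:
    """Linearly spaced rates from 0 to drop_path_rate, one per block."""
    if total_depth <= 1:
        # A single block gets rate 0; zero blocks yields an empty list.
        return [0.0] * total_depth
    step = drop_path_rate / (total_depth - 1)
    return [i * step for i in range(total_depth)]


def rates_for_stage(rates: list[float], depths: list[int], index: int) -> list[float]:
    """Slice out the rates belonging to stage `index`, as the FocalNet hunk does."""
    start = sum(depths[:index])
    end = sum(depths[: index + 1])
    return rates[start:end]


if __name__ == "__main__":
    depths = [2, 2, 6, 2]  # illustrative stage depths, not taken from any config
    rates = drop_path_rates(0.1, sum(depths))
    print(rates_for_stage(rates, depths, 2))
```

Because the result never depends on an accelerator, computing it on the CPU (or in plain Python, as here) does not change model behavior.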

src/transformers/models/glpn/modeling_glpn.py+1 −1 modified

@@ -331,7 +331,7 @@ def __init__(self, config):
         self.config = config
 
         # stochastic depth decay rule
-        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths))]
+        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths), device="cpu")]
 
         # patch embeddings
         embeddings = []

src/transformers/models/hiera/modeling_hiera.py+2 −2 modified

@@ -639,9 +639,9 @@ def __init__(self, config: HieraConfig) -> None:
         super().__init__()
         total_depth = sum(config.depths)
         # stochastic depth decay rule
-        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, total_depth)]
+        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, total_depth, device="cpu")]
         # query strides rule
-        cumulative_depths = torch.tensor(config.depths).cumsum(0).tolist()
+        cumulative_depths = torch.tensor(config.depths, device="cpu").cumsum(0).tolist()
         query_pool_layer = cumulative_depths[: config.num_query_pool]
         query_strides = [math.prod(config.query_stride) if i in query_pool_layer else 1 for i in range(total_depth)]

src/transformers/models/mamba2/modeling_mamba2.py+2 −2 modified

@@ -572,8 +572,8 @@ def torch_forward(self, input_states, cache_params: Optional[Mamba2Cache]=None,
             hidden_states = hidden_states.reshape(batch_size, seq_len, -1, self.head_dim).float()
             B = B.reshape(batch_size, seq_len, -1, self.ssm_state_size).float()
             C = C.reshape(batch_size, seq_len, -1, self.ssm_state_size).float()
-            B = B.repeat(1, 1, self.num_heads // self.n_groups, 1)
-            C = C.repeat(1, 1, self.num_heads // self.n_groups, 1)
+            B = B.repeat_interleave(self.num_heads // self.n_groups, dim=2, output_size=self.num_heads)
+            C = C.repeat_interleave(self.num_heads // self.n_groups, dim=2, output_size=self.num_heads)
             pad_size = (self.chunk_size - seq_len % self.chunk_size) % self.chunk_size
 
             D_residual = self.D[..., None] * pad_tensor_by_size(hidden_states, pad_size)
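
The switch from `repeat` to `repeat_interleave` changes the order in which group states are assigned to heads. The sketch below illustrates the difference with plain lists. `tile` and `interleave` are hypothetical stand-ins for the two tensor methods along a single axis, and the claim that heads of the same group should be contiguous is an assumption inferred from the fix, not something the diff states.

```python
def tile(groups: list, times: int) -> list:
    """Analogue of Tensor.repeat along one axis: the whole sequence is repeated."""
    return groups * times


def interleave(groups: list, times: int) -> list:
    """Analogue of Tensor.repeat_interleave: each element is repeated in place."""
    return [g for g in groups for _ in range(times)]


if __name__ == "__main__":
    num_heads, n_groups = 4, 2
    groups = ["g0", "g1"]
    print(tile(groups, num_heads // n_groups))        # ['g0', 'g1', 'g0', 'g1']
    print(interleave(groups, num_heads // n_groups))  # ['g0', 'g0', 'g1', 'g1']
```

The two orderings coincide when `n_groups` is 1 or equals `num_heads`, so a mismatch only shows up in genuinely grouped configurations.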

src/transformers/models/maskformer/modeling_maskformer_swin.py+1 −1 modified

@@ -692,7 +692,7 @@ def __init__(self, config, grid_size):
         super().__init__()
         self.num_layers = len(config.depths)
         self.config = config
-        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths))]
+        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths), device="cpu")]
         self.layers = nn.ModuleList(
             [
                 MaskFormerSwinStage(

src/transformers/models/mgp_str/modeling_mgp_str.py+1 −1 modified

@@ -246,7 +246,7 @@ class MgpstrEncoder(nn.Module):
     def __init__(self, config: MgpstrConfig):
         super().__init__()
         # stochastic depth decay rule
-        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, config.num_hidden_layers)]
+        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, config.num_hidden_layers, device="cpu")]
 
         self.blocks = nn.Sequential(
             *[MgpstrLayer(config=config, drop_path=dpr[i]) for i in range(config.num_hidden_layers)]

src/transformers/models/poolformer/modeling_poolformer.py+1 −1 modified

@@ -194,7 +194,7 @@ def __init__(self, config):
         super().__init__()
         self.config = config
         # stochastic depth decay rule
-        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths))]
+        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths), device="cpu")]
 
         # patch embeddings
         embeddings = []

src/transformers/models/pvt/modeling_pvt.py+1 −1 modified

@@ -369,7 +369,7 @@ def __init__(self, config: PvtConfig):
         self.config = config
 
         # stochastic depth decay rule
-        drop_path_decays = torch.linspace(0, config.drop_path_rate, sum(config.depths)).tolist()
+        drop_path_decays = torch.linspace(0, config.drop_path_rate, sum(config.depths), device="cpu").tolist()
 
         # patch embeddings
         embeddings = []

src/transformers/models/pvt_v2/modeling_pvt_v2.py+1 −1 modified

@@ -323,7 +323,7 @@ def __init__(self, config: PvtV2Config, layer_idx: int):
         )
         # Transformer block
         # stochastic depth decay rule
-        drop_path_decays = torch.linspace(0, config.drop_path_rate, sum(config.depths)).tolist()
+        drop_path_decays = torch.linspace(0, config.drop_path_rate, sum(config.depths), device="cpu").tolist()
         block_layers = []
         for block_idx in range(config.depths[layer_idx]):
             block_layers.append(

src/transformers/models/segformer/modeling_segformer.py+3 −1 modified

@@ -356,7 +356,9 @@ def __init__(self, config):
         self.config = config
 
         # stochastic depth decay rule
-        drop_path_decays = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths))]
+        drop_path_decays = [
+            x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths), device="cpu")
+        ]
 
         # patch embeddings
         embeddings = []

src/transformers/models/seggpt/modeling_seggpt.py+1 −1 modified

@@ -460,7 +460,7 @@ class SegGptEncoder(nn.Module):
     def __init__(self, config: SegGptConfig) -> None:
         super().__init__()
         self.config = config
-        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, config.num_hidden_layers)]
+        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, config.num_hidden_layers, device="cpu")]
         self.layers = nn.ModuleList([SegGptLayer(config, dpr[i]) for i in range(config.num_hidden_layers)])
         self.layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
         self.gradient_checkpointing = False

src/transformers/models/swin2sr/modeling_swin2sr.py+1 −1 modified

@@ -682,7 +682,7 @@ def __init__(self, config, grid_size):
         super().__init__()
         self.num_stages = len(config.depths)
         self.config = config
-        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths))]
+        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths), device="cpu")]
         self.stages = nn.ModuleList(
             [
                 Swin2SRStage(

src/transformers/models/swin/modeling_swin.py+1 −1 modified

@@ -823,7 +823,7 @@ def __init__(self, config, grid_size):
         super().__init__()
         self.num_layers = len(config.depths)
         self.config = config
-        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths))]
+        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths), device="cpu")]
         self.layers = nn.ModuleList(
             [
                 SwinStage(

src/transformers/models/swinv2/modeling_swinv2.py+1 −1 modified

@@ -877,7 +877,7 @@ def __init__(self, config, grid_size, pretrained_window_sizes=(0, 0, 0, 0)):
         self.config = config
         if self.config.pretrained_window_sizes is not None:
             pretrained_window_sizes = config.pretrained_window_sizes
-        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths))]
+        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths), device="cpu")]
 
         layers = []
         for i_layer in range(self.num_layers):

src/transformers/models/timesformer/modeling_timesformer.py+1 −1 modified

@@ -295,7 +295,7 @@ def __init__(self, config: TimesformerConfig, layer_index: int) -> None:
         attention_type = config.attention_type
 
         drop_path_rates = [
-            x.item() for x in torch.linspace(0, config.drop_path_rate, config.num_hidden_layers)
+            x.item() for x in torch.linspace(0, config.drop_path_rate, config.num_hidden_layers, device="cpu")
         ]  # stochastic depth decay rule
         drop_path_rate = drop_path_rates[layer_index]

src/transformers/models/vitdet/modeling_vitdet.py+1 −1 modified

@@ -535,7 +535,7 @@ def __init__(self, config: VitDetConfig) -> None:
         depth = config.num_hidden_layers
 
         # stochastic depth decay rule
-        drop_path_rate = [x.item() for x in torch.linspace(0, config.drop_path_rate, depth)]
+        drop_path_rate = [x.item() for x in torch.linspace(0, config.drop_path_rate, depth, device="cpu")]
 
         layers = []
         for i in range(depth):

src/transformers/models/zamba2/modeling_zamba2.py+2 −2 modified

@@ -860,8 +860,8 @@ def torch_forward(self, input_states, cache_params: Optional[Zamba2HybridDynamic
             hidden_states = hidden_states.reshape(batch_size, seq_len, -1, self.head_dim).float()
             B = B.reshape(batch_size, seq_len,  -1, self.ssm_state_size).float()
             C = C.reshape(batch_size, seq_len, -1, self.ssm_state_size).float()
-            B = B.repeat(1, 1, self.num_heads // self.n_groups, 1)
-            C = C.repeat(1, 1, self.num_heads // self.n_groups, 1)
+            B = B.repeat_interleave(self.num_heads // self.n_groups, dim=2, output_size=self.num_heads)
+            C = C.repeat_interleave(self.num_heads // self.n_groups, dim=2, output_size=self.num_heads)
             pad_size = (self.chunk_size - seq_len % self.chunk_size) % self.chunk_size
 
             D_residual = self.D[..., None] * pad_tensor_by_size(hidden_states, pad_size)

src/transformers/models/zamba2/modular_zamba2.py+2 −2 modified

@@ -630,8 +630,8 @@ def torch_forward(self, input_states, cache_params: Optional[Zamba2HybridDynamic
             hidden_states = hidden_states.reshape(batch_size, seq_len, -1, self.head_dim).float()
             B = B.reshape(batch_size, seq_len,  -1, self.ssm_state_size).float()
             C = C.reshape(batch_size, seq_len, -1, self.ssm_state_size).float()
-            B = B.repeat(1, 1, self.num_heads // self.n_groups, 1)
-            C = C.repeat(1, 1, self.num_heads // self.n_groups, 1)
+            B = B.repeat_interleave(self.num_heads // self.n_groups, dim=2, output_size=self.num_heads)
+            C = C.repeat_interleave(self.num_heads // self.n_groups, dim=2, output_size=self.num_heads)
             pad_size = (self.chunk_size - seq_len % self.chunk_size) % self.chunk_size
 
             D_residual = self.D[..., None] * pad_tensor_by_size(hidden_states, pad_size)

tests/models/bridgetower/test_image_processing_bridgetower.py+67 −49 modified

@@ -16,19 +16,25 @@
 import unittest
 from typing import Optional, Union
 
-import numpy as np
+import requests
 
 from transformers.testing_utils import require_torch, require_vision
-from transformers.utils import is_vision_available
+from transformers.utils import is_torch_available, is_torchvision_available, is_vision_available
 
 from ...test_image_processing_common import ImageProcessingTestMixin, prepare_image_inputs
 
 
+if is_torch_available():
+    import torch
+
 if is_vision_available():
     from PIL import Image
 
     from transformers import BridgeTowerImageProcessor
 
+    if is_torchvision_available():
+        from transformers import BridgeTowerImageProcessorFast
+
 
 class BridgeTowerImageProcessingTester:
     def __init__(
@@ -76,46 +82,7 @@ def prepare_image_processor_dict(self):
         }
 
     def get_expected_values(self, image_inputs, batched=False):
-        """
-        This function computes the expected height and width when providing images to BridgeTowerImageProcessor,
-        assuming do_resize is set to True with a scalar size and size_divisor.
-        """
-        if not batched:
-            size = self.size["shortest_edge"]
-            image = image_inputs[0]
-            if isinstance(image, Image.Image):
-                w, h = image.size
-            elif isinstance(image, np.ndarray):
-                h, w = image.shape[0], image.shape[1]
-            else:
-                h, w = image.shape[1], image.shape[2]
-            scale = size / min(w, h)
-            if h < w:
-                newh, neww = size, scale * w
-            else:
-                newh, neww = scale * h, size
-
-            max_size = int((1333 / 800) * size)
-            if max(newh, neww) > max_size:
-                scale = max_size / max(newh, neww)
-                newh = newh * scale
-                neww = neww * scale
-
-            newh, neww = int(newh + 0.5), int(neww + 0.5)
-            expected_height, expected_width = (
-                newh // self.size_divisor * self.size_divisor,
-                neww // self.size_divisor * self.size_divisor,
-            )
-
-        else:
-            expected_values = []
-            for image in image_inputs:
-                expected_height, expected_width = self.get_expected_values([image])
-                expected_values.append((expected_height, expected_width))
-            expected_height = max(expected_values, key=lambda item: item[0])[0]
-            expected_width = max(expected_values, key=lambda item: item[1])[1]
-
-        return expected_height, expected_width
+        return self.size["shortest_edge"], self.size["shortest_edge"]
 
     def expected_output_image_shape(self, images):
         height, width = self.get_expected_values(images, batched=True)
@@ -137,6 +104,7 @@ def prepare_image_inputs(self, equal_resolution=False, numpify=False, torchify=F
 @require_vision
 class BridgeTowerImageProcessingTest(ImageProcessingTestMixin, unittest.TestCase):
     image_processing_class = BridgeTowerImageProcessor if is_vision_available() else None
+    fast_image_processing_class = BridgeTowerImageProcessorFast if is_torchvision_available() else None
 
     def setUp(self):
         super().setUp()
@@ -147,10 +115,60 @@ def image_processor_dict(self):
         return self.image_processor_tester.prepare_image_processor_dict()
 
     def test_image_processor_properties(self):
-        image_processing = self.image_processing_class(**self.image_processor_dict)
-        self.assertTrue(hasattr(image_processing, "image_mean"))
-        self.assertTrue(hasattr(image_processing, "image_std"))
-        self.assertTrue(hasattr(image_processing, "do_normalize"))
-        self.assertTrue(hasattr(image_processing, "do_resize"))
-        self.assertTrue(hasattr(image_processing, "size"))
-        self.assertTrue(hasattr(image_processing, "size_divisor"))
+        for image_processing_class in self.image_processor_list:
+            image_processing = image_processing_class(**self.image_processor_dict)
+            self.assertTrue(hasattr(image_processing, "image_mean"))
+            self.assertTrue(hasattr(image_processing, "image_std"))
+            self.assertTrue(hasattr(image_processing, "do_normalize"))
+            self.assertTrue(hasattr(image_processing, "do_resize"))
+            self.assertTrue(hasattr(image_processing, "size"))
+            self.assertTrue(hasattr(image_processing, "size_divisor"))
+
+    def _assertEquivalence(self, a, b):
+        self.assertTrue(torch.allclose(a, b, atol=1e-1))
+        self.assertLessEqual(torch.mean(torch.abs(a - b)).item(), 1e-3)
+
+    @require_vision
+    @require_torch
+    def test_slow_fast_equivalence(self):
+        if not self.test_slow_image_processor or not self.test_fast_image_processor:
+            self.skipTest(reason="Skipping slow/fast equivalence test")
+
+        if self.image_processing_class is None or self.fast_image_processing_class is None:
+            self.skipTest(reason="Skipping slow/fast equivalence test as one of the image processors is not defined")
+
+        dummy_image = Image.open(
+            requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw
+        )
+        image_processor_slow = self.image_processing_class(**self.image_processor_dict)
+        image_processor_fast = self.fast_image_processing_class(**self.image_processor_dict)
+
+        encoding_slow = image_processor_slow(dummy_image, return_tensors="pt")
+        encoding_fast = image_processor_fast(dummy_image, return_tensors="pt")
+
+        self._assertEquivalence(encoding_slow.pixel_values, encoding_fast.pixel_values)
+        self._assertEquivalence(encoding_slow.pixel_mask.float(), encoding_fast.pixel_mask.float())
+
+    @require_vision
+    @require_torch
+    def test_slow_fast_equivalence_batched(self):
+        if not self.test_slow_image_processor or not self.test_fast_image_processor:
+            self.skipTest(reason="Skipping slow/fast equivalence test")
+
+        if self.image_processing_class is None or self.fast_image_processing_class is None:
+            self.skipTest(reason="Skipping slow/fast equivalence test as one of the image processors is not defined")
+
+        if hasattr(self.image_processor_tester, "do_center_crop") and self.image_processor_tester.do_center_crop:
+            self.skipTest(
+                reason="Skipping as do_center_crop is True and center_crop functions are not equivalent for fast and slow processors"
+            )
+
+        dummy_images = self.image_processor_tester.prepare_image_inputs(equal_resolution=False, torchify=True)
+        image_processor_slow = self.image_processing_class(**self.image_processor_dict)
+        image_processor_fast = self.fast_image_processing_class(**self.image_processor_dict)
+
+        encoding_slow = image_processor_slow(dummy_images, return_tensors="pt")
+        encoding_fast = image_processor_fast(dummy_images, return_tensors="pt")
+
+        self._assertEquivalence(encoding_slow.pixel_values, encoding_fast.pixel_values)
+        self._assertEquivalence(encoding_slow.pixel_mask.float(), encoding_fast.pixel_mask.float())
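
The `_assertEquivalence` helper above combines two tolerances: a loose per-element bound and a tight bound on the mean absolute difference. The following sketch shows why both are useful, using flat Python lists instead of tensors. `roughly_equivalent` is a hypothetical name, and unlike `torch.allclose` it ignores the relative tolerance term.

```python
def max_abs_diff(a: list[float], b: list[float]) -> float:
    """Largest element-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def mean_abs_diff(a: list[float], b: list[float]) -> float:
    """Average element-wise absolute difference."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def roughly_equivalent(a: list[float], b: list[float], atol: float = 1e-1, mean_tol: float = 1e-3) -> bool:
    """Pass only if no element is far off AND the outputs agree closely on average."""
    return max_abs_diff(a, b) <= atol and mean_abs_diff(a, b) <= mean_tol


if __name__ == "__main__":
    reference = [0.0] * 1000
    one_outlier = [0.05] + [0.0] * 999  # a single interpolation artifact
    uniform_bias = [0.01] * 1000        # small but systematic drift
    print(roughly_equivalent(reference, one_outlier))   # True
    print(roughly_equivalent(reference, uniform_bias))  # False
```

A few isolated deviations, such as resampling differences between PIL and torchvision, are tolerated, while a systematic offset across the whole image is rejected by the mean check.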

tests/models/efficientnet/test_image_processing_efficientnet.py+87 −19 modified

@@ -17,15 +17,26 @@
 
 import numpy as np
 
+from transformers.image_utils import PILImageResampling
 from transformers.testing_utils import require_torch, require_vision
-from transformers.utils import is_vision_available
+from transformers.utils import (
+    is_torch_available,
+    is_torchvision_available,
+    is_vision_available,
+)
 
 from ...test_image_processing_common import ImageProcessingTestMixin, prepare_image_inputs
 
 
+if is_torch_available():
+    import torch
+
 if is_vision_available():
     from transformers import EfficientNetImageProcessor
 
+    if is_torchvision_available():
+        from transformers import EfficientNetImageProcessorFast
+
 
 class EfficientNetImageProcessorTester:
     def __init__(
@@ -41,6 +52,10 @@ def __init__(
         do_normalize=True,
         image_mean=[0.5, 0.5, 0.5],
         image_std=[0.5, 0.5, 0.5],
+        do_rescale=True,
+        rescale_offset=True,
+        rescale_factor=1 / 127.5,
+        resample=PILImageResampling.BILINEAR,  # NEAREST is too different between PIL and torchvision
     ):
         size = size if size is not None else {"height": 18, "width": 18}
         self.parent = parent
@@ -54,6 +69,7 @@ def __init__(
         self.do_normalize = do_normalize
         self.image_mean = image_mean
         self.image_std = image_std
+        self.resample = resample
 
     def prepare_image_processor_dict(self):
         return {
@@ -62,6 +78,7 @@ def prepare_image_processor_dict(self):
             "do_normalize": self.do_normalize,
             "do_resize": self.do_resize,
             "size": self.size,
+            "resample": self.resample,
         }
 
     def expected_output_image_shape(self, images):
@@ -83,6 +100,7 @@ def prepare_image_inputs(self, equal_resolution=False, numpify=False, torchify=F
 @require_vision
 class EfficientNetImageProcessorTest(ImageProcessingTestMixin, unittest.TestCase):
     image_processing_class = EfficientNetImageProcessor if is_vision_available() else None
+    fast_image_processing_class = EfficientNetImageProcessorFast if is_torchvision_available() else None
 
     def setUp(self):
         super().setUp()
@@ -93,30 +111,80 @@ def image_processor_dict(self):
         return self.image_processor_tester.prepare_image_processor_dict()
 
     def test_image_processor_properties(self):
-        image_processing = self.image_processing_class(**self.image_processor_dict)
-        self.assertTrue(hasattr(image_processing, "image_mean"))
-        self.assertTrue(hasattr(image_processing, "image_std"))
-        self.assertTrue(hasattr(image_processing, "do_normalize"))
-        self.assertTrue(hasattr(image_processing, "do_resize"))
-        self.assertTrue(hasattr(image_processing, "size"))
+        for image_processing_class in self.image_processor_list:
+            image_processing = image_processing_class(**self.image_processor_dict)
+            self.assertTrue(hasattr(image_processing, "image_mean"))
+            self.assertTrue(hasattr(image_processing, "image_std"))
+            self.assertTrue(hasattr(image_processing, "do_normalize"))
+            self.assertTrue(hasattr(image_processing, "do_resize"))
+            self.assertTrue(hasattr(image_processing, "size"))
 
     def test_image_processor_from_dict_with_kwargs(self):
-        image_processor = self.image_processing_class.from_dict(self.image_processor_dict)
-        self.assertEqual(image_processor.size, {"height": 18, "width": 18})
+        for image_processing_class in self.image_processor_list:
+            image_processor = image_processing_class.from_dict(self.image_processor_dict)
+            self.assertEqual(image_processor.size, {"height": 18, "width": 18})
 
-        image_processor = self.image_processing_class.from_dict(self.image_processor_dict, size=42)
-        self.assertEqual(image_processor.size, {"height": 42, "width": 42})
+            image_processor = image_processing_class.from_dict(self.image_processor_dict, size=42)
+            self.assertEqual(image_processor.size, {"height": 42, "width": 42})
 
     def test_rescale(self):
         # EfficientNet optionally rescales between -1 and 1 instead of the usual 0 and 1
         image = np.arange(0, 256, 1, dtype=np.uint8).reshape(1, 8, 32)
 
-        image_processor = self.image_processing_class(**self.image_processor_dict)
-
-        rescaled_image = image_processor.rescale(image, scale=1 / 127.5)
-        expected_image = (image * (1 / 127.5)).astype(np.float32) - 1
-        self.assertTrue(np.allclose(rescaled_image, expected_image))
+        for image_processing_class in self.image_processor_list:
+            image_processor = image_processing_class(**self.image_processor_dict)
+            if image_processing_class == EfficientNetImageProcessorFast:
+                image = torch.from_numpy(image)
+
+                # Scale between [-1, 1] with rescale_factor 1/127.5 and rescale_offset=True
+                rescaled_image = image_processor.rescale(image, scale=1 / 127.5, offset=True)
+                expected_image = (image * (1 / 127.5)) - 1
+                self.assertTrue(torch.allclose(rescaled_image, expected_image))
+
+                # Scale between [0, 1] with rescale_factor 1/255 and rescale_offset=False
+                rescaled_image = image_processor.rescale(image, scale=1 / 255, offset=False)
+                expected_image = image / 255.0
+                self.assertTrue(torch.allclose(rescaled_image, expected_image))
+
+            else:
+                rescaled_image = image_processor.rescale(image, scale=1 / 127.5, dtype=np.float64)
+                expected_image = (image * (1 / 127.5)).astype(np.float64) - 1
+                self.assertTrue(np.allclose(rescaled_image, expected_image))
+
+                rescaled_image = image_processor.rescale(image, scale=1 / 255, offset=False, dtype=np.float64)
+                expected_image = (image / 255.0).astype(np.float64)
+                self.assertTrue(np.allclose(rescaled_image, expected_image))
+
+    @require_vision
+    @require_torch
+    def test_rescale_normalize(self):
+        if self.image_processing_class is None or self.fast_image_processing_class is None:
+            self.skipTest(reason="Skipping rescale/normalize test as one of the image processors is not defined")
+
+        image = torch.arange(0, 256, 1, dtype=torch.uint8).reshape(1, 8, 32).repeat(3, 1, 1)
+        image_mean_0 = (0.0, 0.0, 0.0)
+        image_std_0 = (1.0, 1.0, 1.0)
+        image_mean_1 = (0.5, 0.5, 0.5)
+        image_std_1 = (0.5, 0.5, 0.5)
+
+        image_processor_fast = self.fast_image_processing_class(**self.image_processor_dict)
+
+        # Rescale between [-1, 1] with rescale_factor=1/127.5 and rescale_offset=True. Then normalize
+        rescaled_normalized = image_processor_fast.rescale_and_normalize(
+            image, True, 1 / 127.5, True, image_mean_0, image_std_0, True
+        )
+        expected_image = (image * (1 / 127.5)) - 1
+        expected_image = (expected_image - torch.tensor(image_mean_0).view(3, 1, 1)) / torch.tensor(image_std_0).view(
+            3, 1, 1
+        )
+        self.assertTrue(torch.allclose(rescaled_normalized, expected_image, rtol=1e-3))
 
-        rescaled_image = image_processor.rescale(image, scale=1 / 255, offset=False)
-        expected_image = (image / 255.0).astype(np.float32)
-        self.assertTrue(np.allclose(rescaled_image, expected_image))
+        # Rescale between [0, 1] with rescale_factor=1/255 and rescale_offset=False. Then normalize
+        rescaled_normalized = image_processor_fast.rescale_and_normalize(
+            image, True, 1 / 255, True, image_mean_1, image_std_1, False
+        )
+        expected_image = image * (1 / 255.0)
+        expected_image = (expected_image - torch.tensor(image_mean_1).view(3, 1, 1)) / torch.tensor(image_std_1).view(
+            3, 1, 1
+        )
+        self.assertTrue(torch.allclose(rescaled_normalized, expected_image, rtol=1e-3))

tests/models/mamba2/test_modeling_mamba2.py+8 −0 modified

@@ -238,6 +238,14 @@ def test_mamba2_slow_vs_fast_forward(self):
         config_and_inputs = self.model_tester.prepare_config_and_inputs()
         self.model_tester.create_and_check_mamba2_slow_vs_fast_forward(*config_and_inputs)
 
+    # This test adjusts n_groups to half the original setting and effectively
+    # creates a grouped SSD configuration in the mamba2 layers
+    # See https://github.com/huggingface/transformers/pull/37533/
+    def test_mamba2_slow_vs_fast_forward_grouped(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        config_and_inputs[0].n_groups //= 2
+        self.model_tester.create_and_check_mamba2_slow_vs_fast_forward(*config_and_inputs)
+
     def test_initialization(self):
         config, _ = self.model_tester.prepare_config_and_inputs_for_common()

tests/test_image_processing_common.py (+2 −2, modified)

@@ -181,7 +181,7 @@ def test_slow_fast_equivalence(self):
         encoding_fast = image_processor_fast(dummy_image, return_tensors="pt")
         self.assertTrue(torch.allclose(encoding_slow.pixel_values, encoding_fast.pixel_values, atol=1e-1))
         self.assertLessEqual(
-            torch.mean(torch.abs(encoding_slow.pixel_values - encoding_fast.pixel_values)).item(), 1e-3
+            torch.mean(torch.abs(encoding_slow.pixel_values - encoding_fast.pixel_values)).item(), 5e-3
         )
 
     @require_vision
@@ -207,7 +207,7 @@ def test_slow_fast_equivalence_batched(self):
 
         self.assertTrue(torch.allclose(encoding_slow.pixel_values, encoding_fast.pixel_values, atol=1e-1))
         self.assertLessEqual(
-            torch.mean(torch.abs(encoding_slow.pixel_values - encoding_fast.pixel_values)).item(), 1e-3
+            torch.mean(torch.abs(encoding_slow.pixel_values - encoding_fast.pixel_values)).item(), 5e-3
         )
 
     @require_vision

tests/test_modeling_common.py (+7 −0, modified)

@@ -4528,6 +4528,13 @@ def test_generation_tester_mixin_inheritance(self):
                 ),
             )
 
+    def test_can_be_initialized_on_meta(self):
+        config, _ = self.model_tester.prepare_config_and_inputs_for_common()
+        for model_class in self.all_model_classes:
+            # If it does not raise here, the test passes
+            with torch.device("meta"):
+                _ = model_class(config)
+
     @require_torch_accelerator
     def test_can_load_with_device_context_manager(self):
         config, _ = self.model_tester.prepare_config_and_inputs_for_common()

Vulnerability mechanics

Generated on May 9, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.
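
The bypass described in this advisory can be illustrated with a minimal sketch. The function names and the allow-list below are hypothetical and are not taken from image_utils.py or from the fix commit. They only contrast a prefix check on the raw string with a comparison against the parsed hostname.

```python
from urllib.parse import urlparse

# Hypothetical allow-list for illustration only.
TRUSTED_HOSTS = {"youtube.com", "www.youtube.com"}


def is_youtube_url_naive(url: str) -> bool:
    # Stand-in for a prefix-based check: it inspects the raw string only,
    # so anything placed after the trusted prefix is ignored.
    return url.startswith(("https://www.youtube.com", "https://youtube.com"))


def is_youtube_url_strict(url: str) -> bool:
    # Parse the URL and compare the actual host. Text before "@" in the
    # authority is userinfo, not the host, so it cannot spoof the check.
    parsed = urlparse(url)
    return parsed.scheme == "https" and parsed.hostname in TRUSTED_HOSTS


crafted = "https://youtube.com@malicious.example.com/watch?v=abc"
genuine = "https://www.youtube.com/watch?v=abc"

print(is_youtube_url_naive(crafted))    # prefix matches, host is ignored
print(is_youtube_url_strict(crafted))   # host is malicious.example.com
print(is_youtube_url_strict(genuine))
```

In the crafted URL, `youtube.com` is parsed as the username and `malicious.example.com` as the host, which is why a string-prefix check accepts it while a hostname comparison rejects it.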

References

github.com/advisories/GHSA-phhr-52qp-3mj4 (ghsa, ADVISORY)
nvd.nist.gov/vuln/detail/CVE-2025-3777 (ghsa, ADVISORY)
github.com/huggingface/transformers/commit/4dda5f71b35fb70cf602187eef84bb17a50b9082 (ghsa, WEB)
huntr.com/bounties/ccba0730-9248-4853-b7ff-5c20e6364f09 (ghsa, WEB)


cvss	0.065
epss	0.000
exploit	0.000
kev	0.000
patch	-0.070
ransomware	0.000