Improper Input Validation in huggingface/transformers
Description
Hugging Face Transformers versions up to 4.49.0 are affected by an improper input validation vulnerability in the image_utils.py file. The vulnerability arises from insecure URL validation using the startswith() method, which can be bypassed through URL username injection. This allows attackers to craft URLs that appear to be from YouTube but resolve to malicious domains, potentially leading to phishing attacks, malware distribution, or data exfiltration. The issue is fixed in version 4.52.1.
AI Insight
LLM-synthesized narrative grounded in this CVE's description and references.
Hugging Face Transformers ≤4.49.0 performs improper URL validation in image_utils.py: URL username injection bypasses the YouTube domain check, creating phishing and malware risks.
Vulnerability Description
CVE-2025-3777 is an improper input validation vulnerability in Hugging Face Transformers versions up to 4.49.0, located in the image_utils.py file. The issue stems from insecure URL validation that uses Python's startswith() method to verify that an image URL belongs to a trusted domain (e.g., YouTube). The check can be bypassed by placing the trusted domain in the username (userinfo) part of the URL, as in https://youtube.com@malicious.example.com. startswith() accepts this URL because the string begins with the expected prefix https://youtube.com, but everything before the @ is parsed as a username, so the actual host is malicious.example.com [1][3].
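The bypass can be reproduced with the Python standard library alone. The sketch below is not the code from image_utils.py: `TRUSTED_PREFIX`, `TRUSTED_HOSTS`, `naive_is_trusted`, and `parsed_is_trusted` are hypothetical names chosen for illustration. It contrasts a prefix check with a comparison against the parsed hostname.

```python
from urllib.parse import urlparse

# Hypothetical prefix and allowlist, chosen for illustration only.
TRUSTED_PREFIX = "https://youtube.com"
TRUSTED_HOSTS = {"youtube.com", "www.youtube.com"}


def naive_is_trusted(url: str) -> bool:
    """Prefix check of the kind described above: looks only at the leading characters."""
    return url.startswith(TRUSTED_PREFIX)


def parsed_is_trusted(url: str) -> bool:
    """Compare the parsed hostname against an allowlist instead of the raw string."""
    parts = urlparse(url)
    return parts.scheme in {"http", "https"} and parts.hostname in TRUSTED_HOSTS


crafted = "https://youtube.com@malicious.example.com/watch?v=abc"
parts = urlparse(crafted)
print(parts.username)  # the text before "@" is treated as userinfo
print(parts.hostname)  # the host that would actually be contacted
print(naive_is_trusted(crafted), parsed_is_trusted(crafted))
```

The prefix check accepts the crafted URL, while the hostname comparison rejects it because `urlparse` reports `malicious.example.com` as the host.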
Attack Vector
An attacker can use URL username injection to craft a URL that appears to be from YouTube but resolves to a malicious domain. When Transformers processes an image from such a URL, the URL passes the validation check because of the flawed startswith() logic, and the attacker's server is contacted. The attack requires no authentication or special network position: it is triggered simply by supplying a crafted URL to a Transformers-based application that loads images from external sources [1][3].
Impact
Successful exploitation could lead to phishing attacks where users are tricked into interacting with a seemingly legitimate YouTube URL, malware distribution if the malicious domain serves infected content, or data exfiltration if the attacker's server proxies or captures sensitive data. The vulnerability affects the trust that users and applications place in URL validation within Transformers, potentially impacting any downstream service that relies on this filtering to fetch images [3].
Mitigation
The issue has been fixed in Transformers version 4.52.1. Users are strongly advised to update to this version or later. No official workaround is documented for earlier versions, so upgrading is the primary remediation. The vulnerability is not listed on CISA's Known Exploited Vulnerabilities catalog as of this writing [2][3].
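As a rough aid for checking an environment against the fixed release, the sketch below compares the installed version with 4.52.1. It is a simplified illustration, not an official tool: `release_tuple` and `is_patched` are hypothetical helpers, and the parsing ignores pre-release suffixes such as `.dev0` (a dedicated version parser such as `packaging.version.Version` would be more robust).

```python
from importlib.metadata import PackageNotFoundError, version

FIXED = (4, 52, 1)  # first patched release according to the advisory


def release_tuple(version_string: str) -> tuple:
    """Parse leading numeric components, e.g. '4.52.1' -> (4, 52, 1)."""
    numbers = []
    for part in version_string.split("."):
        digits = ""
        for ch in part:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component (e.g. 'dev0')
        numbers.append(int(digits))
    return tuple(numbers)


def is_patched(version_string: str) -> bool:
    """True if the numeric release is at or above the fixed release."""
    return release_tuple(version_string) >= FIXED


try:
    installed = version("transformers")
    status = "patched" if is_patched(installed) else "upgrade to 4.52.1 or later"
    print(installed, status)
except PackageNotFoundError:
    print("transformers is not installed in this environment")
```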
- Blaming transformers/src/transformers/image_utils.py at a7d2bbaaa8aac64f7c1ee8c1421cfe84b38359a4 · huggingface/transformers
- GitHub - huggingface/transformers: 🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.
- NVD - CVE-2025-3777
AI Insight generated on May 19, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.
Affected packages
Versions sourced from the GitHub Security Advisory.
| Package | Affected versions | Patched versions |
|---|---|---|
| transformers (PyPI) | < 4.52.1 | 4.52.1 |
Affected products
- Range: <=4.49.0
- huggingface/huggingface/transformers · v5 · Range: unspecified
Patches
4dda5f71b35f · Merge branch 'main' into chat-template-url
42 files changed · +805 −112
docs/source/en/model_doc/bridgetower.md+5 −0 modified@@ -147,6 +147,11 @@ Tips: [[autodoc]] BridgeTowerImageProcessor - preprocess +## BridgeTowerImageProcessorFast + +[[autodoc]] BridgeTowerImageProcessorFast + - preprocess + ## BridgeTowerProcessor [[autodoc]] BridgeTowerProcessor
docs/source/en/model_doc/efficientnet.md+5 −0 modified@@ -43,6 +43,11 @@ The original code can be found [here](https://github.com/tensorflow/tpu/tree/mas [[autodoc]] EfficientNetImageProcessor - preprocess +## EfficientNetImageProcessorFast + +[[autodoc]] EfficientNetImageProcessorFast + - preprocess + ## EfficientNetModel [[autodoc]] EfficientNetModel
docs/source/ja/model_doc/bridgetower.md+5 −0 modified@@ -144,6 +144,11 @@ BridgeTower は、ビジュアル エンコーダー、テキスト エンコー [[autodoc]] BridgeTowerImageProcessor - preprocess +## BridgeTowerImageProcessorFast + +[[autodoc]] BridgeTowerImageProcessorFast + - preprocess + ## BridgeTowerProcessor [[autodoc]] BridgeTowerProcessor
src/transformers/image_utils.py+1 −1 modified@@ -66,7 +66,7 @@ from torchvision.transforms import InterpolationMode pil_torch_interpolation_mapping = { - PILImageResampling.NEAREST: InterpolationMode.NEAREST, + PILImageResampling.NEAREST: InterpolationMode.NEAREST_EXACT, PILImageResampling.BOX: InterpolationMode.BOX, PILImageResampling.BILINEAR: InterpolationMode.BILINEAR, PILImageResampling.HAMMING: InterpolationMode.HAMMING,
src/transformers/models/auto/image_processing_auto.py+3 −3 modified@@ -56,13 +56,13 @@ else: IMAGE_PROCESSOR_MAPPING_NAMES = OrderedDict( [ - ("align", ("EfficientNetImageProcessor",)), + ("align", ("EfficientNetImageProcessor", "EfficientNetImageProcessorFast")), ("aria", ("AriaImageProcessor",)), ("beit", ("BeitImageProcessor",)), ("bit", ("BitImageProcessor", "BitImageProcessorFast")), ("blip", ("BlipImageProcessor", "BlipImageProcessorFast")), ("blip-2", ("BlipImageProcessor", "BlipImageProcessorFast")), - ("bridgetower", ("BridgeTowerImageProcessor",)), + ("bridgetower", ("BridgeTowerImageProcessor", "BridgeTowerImageProcessorFast")), ("chameleon", ("ChameleonImageProcessor",)), ("chinese_clip", ("ChineseCLIPImageProcessor", "ChineseCLIPImageProcessorFast")), ("clip", ("CLIPImageProcessor", "CLIPImageProcessorFast")), @@ -83,7 +83,7 @@ ("donut-swin", ("DonutImageProcessor", "DonutImageProcessorFast")), ("dpt", ("DPTImageProcessor",)), ("efficientformer", ("EfficientFormerImageProcessor",)), - ("efficientnet", ("EfficientNetImageProcessor",)), + ("efficientnet", ("EfficientNetImageProcessor", "EfficientNetImageProcessorFast")), ("flava", ("FlavaImageProcessor", "FlavaImageProcessorFast")), ("focalnet", ("BitImageProcessor", "BitImageProcessorFast")), ("fuyu", ("FuyuImageProcessor",)),
src/transformers/models/bamba/modeling_bamba.py+2 −2 modified@@ -783,8 +783,8 @@ def torch_forward( hidden_states = hidden_states.reshape(batch_size, seq_len, -1, self.head_dim).float() B = B.reshape(batch_size, seq_len, -1, self.ssm_state_size).float() C = C.reshape(batch_size, seq_len, -1, self.ssm_state_size).float() - B = B.repeat(1, 1, self.num_heads // self.n_groups, 1) - C = C.repeat(1, 1, self.num_heads // self.n_groups, 1) + B = B.repeat_interleave(self.num_heads // self.n_groups, dim=2, output_size=self.num_heads) + C = C.repeat_interleave(self.num_heads // self.n_groups, dim=2, output_size=self.num_heads) pad_size = (self.chunk_size - seq_len % self.chunk_size) % self.chunk_size D_residual = self.D[..., None] * pad_tensor_by_size(hidden_states, pad_size)
src/transformers/models/bamba/modular_bamba.py+2 −2 modified@@ -580,8 +580,8 @@ def torch_forward( hidden_states = hidden_states.reshape(batch_size, seq_len, -1, self.head_dim).float() B = B.reshape(batch_size, seq_len, -1, self.ssm_state_size).float() C = C.reshape(batch_size, seq_len, -1, self.ssm_state_size).float() - B = B.repeat(1, 1, self.num_heads // self.n_groups, 1) - C = C.repeat(1, 1, self.num_heads // self.n_groups, 1) + B = B.repeat_interleave(self.num_heads // self.n_groups, dim=2, output_size=self.num_heads) + C = C.repeat_interleave(self.num_heads // self.n_groups, dim=2, output_size=self.num_heads) pad_size = (self.chunk_size - seq_len % self.chunk_size) % self.chunk_size D_residual = self.D[..., None] * pad_tensor_by_size(hidden_states, pad_size)
src/transformers/models/beit/modeling_beit.py+1 −1 modified@@ -663,7 +663,7 @@ def __init__(self, config: BeitConfig, window_size: Optional[tuple] = None) -> N self.relative_position_bias = BeitRelativePositionBias(config, window_size=window_size) # stochastic depth decay rule - dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, config.num_hidden_layers)] + dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, config.num_hidden_layers, device="cpu")] self.layer = nn.ModuleList( [ BeitLayer(
src/transformers/models/bridgetower/image_processing_bridgetower_fast.py+345 −0 added@@ -0,0 +1,345 @@ +# coding=utf-8 +# Copyright 2025 The Intel Labs Team Authors, The Microsoft Research Team Authors and HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Fast Image processor class for BridgeTower.""" + +from typing import Dict, Iterable, Optional, Tuple, Union + +from ...image_processing_utils_fast import ( + BASE_IMAGE_PROCESSOR_FAST_DOCSTRING, + BASE_IMAGE_PROCESSOR_FAST_DOCSTRING_PREPROCESS, + BaseImageProcessorFast, + BatchFeature, + DefaultFastImageProcessorKwargs, + ImageInput, + SizeDict, + TensorType, + Unpack, + get_max_height_width, + group_images_by_shape, + reorder_images, +) +from ...image_utils import OPENAI_CLIP_MEAN, OPENAI_CLIP_STD, PILImageResampling +from ...utils import add_start_docstrings, is_torch_available, is_torchvision_available, is_torchvision_v2_available + + +if is_torch_available(): + import torch + +if is_torchvision_available(): + if is_torchvision_v2_available(): + from torchvision.transforms.v2 import functional as F + else: + from torchvision.transforms import functional as F + + +def make_pixel_mask( + image: "torch.Tensor", + output_size: Tuple[int, int], +) -> "torch.Tensor": + """ + Make a pixel mask for the image, where 1 indicates a valid pixel and 0 indicates padding. + + Args: + image (`np.ndarray`): + Image to make the pixel mask for. 
+ output_size (`Tuple[int, int]`): + Output size of the mask. + """ + input_height, input_width = image.shape[-2:] + batch_size = image.size(0) + mask = torch.zeros((batch_size, *output_size), dtype=torch.long) + mask[:input_height, :input_width] = 1 + return mask + + +def get_resize_output_image_size( + input_image: "torch.Tensor", + shorter: int = 800, + longer: int = 1333, + size_divisor: int = 32, +) -> Tuple[int, int]: + input_height, input_width = input_image.shape[-2:] + min_size, max_size = shorter, longer + + scale = min_size / min(input_height, input_width) + + if input_height < input_width: + new_height = min_size + new_width = scale * input_width + else: + new_height = scale * input_height + new_width = min_size + + if max(new_height, new_width) > max_size: + scale = max_size / max(new_height, new_width) + new_height = scale * new_height + new_width = scale * new_width + + new_height, new_width = int(new_height + 0.5), int(new_width + 0.5) + new_height = new_height // size_divisor * size_divisor + new_width = new_width // size_divisor * size_divisor + + return new_height, new_width + + +class BridgeTowerFastImageProcessorKwargs(DefaultFastImageProcessorKwargs): + size_divisor: Optional[int] + do_pad: Optional[bool] + + +@add_start_docstrings( + "Constructs a fast BridgeTower image processor.", + BASE_IMAGE_PROCESSOR_FAST_DOCSTRING, + """ + size_divisor (`int`, *optional*, defaults to 32): + The size by which to make sure both the height and width can be divided. Only has an effect if `do_resize` + is set to `True`. Can be overridden by the `size_divisor` parameter in the `preprocess` method. + do_pad (`bool`, *optional*, defaults to `True`): + Whether to pad the image to the `(max_height, max_width)` of the images in the batch. Can be overridden by + the `do_pad` parameter in the `preprocess` method. 
+ """, +) +class BridgeTowerImageProcessorFast(BaseImageProcessorFast): + resample = PILImageResampling.BICUBIC + image_mean = OPENAI_CLIP_MEAN + image_std = OPENAI_CLIP_STD + size = {"shortest_edge": 288} + default_to_square = False + crop_size = {"shortest_edge": 288} + do_resize = True + do_center_crop = True + do_rescale = True + do_normalize = True + do_pad = True + size_divisor = 32 + valid_kwargs = BridgeTowerFastImageProcessorKwargs + + def __init__(self, **kwargs: Unpack[BridgeTowerFastImageProcessorKwargs]): + super().__init__(**kwargs) + + @add_start_docstrings( + BASE_IMAGE_PROCESSOR_FAST_DOCSTRING_PREPROCESS, + """ + size_divisor (`int`, *optional*, defaults to 32): + The size by which to make sure both the height and width can be divided. Only has an effect if `do_resize` + is set to `True`. Can be overridden by the `size_divisor` parameter in the `preprocess` method. + do_pad (`bool`, *optional*, defaults to `True`): + Whether to pad the image to the `(max_height, max_width)` of the images in the batch. Can be overridden by + the `do_pad` parameter in the `preprocess` method. + """, + ) + def preprocess(self, images: ImageInput, **kwargs: Unpack[BridgeTowerFastImageProcessorKwargs]) -> BatchFeature: + return super().preprocess(images, **kwargs) + + def resize( + self, + image: "torch.Tensor", + size: SizeDict, + size_divisor: int = 32, + interpolation: "F.InterpolationMode" = None, + antialias: bool = True, + **kwargs, + ) -> "torch.Tensor": + """ + Resize an image. + + Resizes the shorter side of the image to `size["shortest_edge"]` while preserving the aspect ratio. If the + longer side is larger than the max size `(int(`size["shortest_edge"]` * 1333 / 800))`, the longer side is then + resized to the max size while preserving the aspect ratio. + + Args: + image (`torch.Tensor`): + Image to resize. + size (`SizeDict`): + Dictionary in the format `{"height": int, "width": int}` specifying the size of the output image. 
+ size_divisor (`int`, *optional*, defaults to 32): + The image is resized to a size that is a multiple of this value. + resample (`InterpolationMode`, *optional*, defaults to `InterpolationMode.BILINEAR`): + `InterpolationMode` filter to use when resizing the image e.g. `InterpolationMode.BICUBIC`. + + Returns: + `torch.Tensor`: The resized image. + """ + interpolation = interpolation if interpolation is not None else F.InterpolationMode.BILINEAR + if not size.shortest_edge: + raise ValueError(f"The `size` dictionary must contain the key `shortest_edge`. Got {size.keys()}") + shorter = size.shortest_edge + longer = int(1333 / 800 * shorter) + output_size = get_resize_output_image_size( + image, + shorter=shorter, + longer=longer, + size_divisor=size_divisor, + ) + return F.resize(image, output_size, interpolation=interpolation, antialias=antialias) + + def center_crop( + self, + image: "torch.Tensor", + size: Dict[str, int], + **kwargs, + ) -> "torch.Tensor": + """ + Center crop an image to `(size["height"], size["width"])`. If the input size is smaller than `crop_size` along + any edge, the image is padded with 0's and then center cropped. + + Args: + image (`torch.Tensor`): + Image to center crop. + size (`Dict[str, int]`): + Size of the output image in the form `{"height": h, "width": w}`. + """ + output_size = size.shortest_edge + return F.center_crop( + image, + output_size=(output_size, output_size), + **kwargs, + ) + + def _pad_image( + self, + image: "torch.Tensor", + output_size: Tuple[int, int], + constant_values: Union[float, Iterable[float]] = 0, + ) -> "torch.Tensor": + """ + Pad an image with zeros to the given size. 
+ """ + input_height, input_width = image.shape[-2:] + output_height, output_width = output_size + + pad_bottom = output_height - input_height + pad_right = output_width - input_width + padding = (0, 0, pad_right, pad_bottom) + padded_image = F.pad( + image, + padding, + fill=constant_values, + ) + return padded_image + + def pad( + self, + images: list["torch.Tensor"], + constant_values: Union[float, Iterable[float]] = 0, + return_pixel_mask: bool = True, + ) -> tuple: + """ + Pads a batch of images to the bottom and right of the image with zeros to the size of largest height and width + in the batch and optionally returns their corresponding pixel mask. + + Args: + image (`torch.Tensor`): + Image to pad. + constant_values (`float` or `Iterable[float]`, *optional*): + The value to use for the padding if `mode` is `"constant"`. + return_pixel_mask (`bool`, *optional*, defaults to `True`): + Whether to return a pixel mask. + return_tensors (`str` or `TensorType`, *optional*): + The type of tensors to return. Can be one of: + - Unset: Return a list of `np.ndarray`. + - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`. + - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`. + - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`. + - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`. 
+ """ + pad_size = get_max_height_width(images) + + grouped_images, grouped_images_index = group_images_by_shape(images) + processed_images_grouped = {} + processed_masks_grouped = {} + for shape, stacked_images in grouped_images.items(): + stacked_images = self._pad_image( + stacked_images, + pad_size, + constant_values=constant_values, + ) + processed_images_grouped[shape] = stacked_images + + if return_pixel_mask: + stacked_masks = make_pixel_mask(image=stacked_images, output_size=pad_size) + processed_masks_grouped[shape] = stacked_masks + + processed_images = reorder_images(processed_images_grouped, grouped_images_index) + + processed_masks = None + if return_pixel_mask: + processed_masks = reorder_images(processed_masks_grouped, grouped_images_index) + + return processed_images, processed_masks + + def _preprocess( + self, + images: list["torch.Tensor"], + do_resize: bool, + size: SizeDict, + size_divisor: Optional[int], + interpolation: Optional["F.InterpolationMode"], + do_pad: bool, + do_center_crop: bool, + crop_size: SizeDict, + do_rescale: bool, + rescale_factor: float, + do_normalize: bool, + image_mean: Optional[Union[float, list[float]]], + image_std: Optional[Union[float, list[float]]], + return_tensors: Optional[Union[str, TensorType]], + **kwargs, + ) -> BatchFeature: + # Group images by size for batched resizing + grouped_images, grouped_images_index = group_images_by_shape(images) + resized_images_grouped = {} + for shape, stacked_images in grouped_images.items(): + if do_resize: + stacked_images = self.resize( + image=stacked_images, size=size, size_divisor=size_divisor, interpolation=interpolation + ) + resized_images_grouped[shape] = stacked_images + resized_images = reorder_images(resized_images_grouped, grouped_images_index) + + # Group images by size for further processing + # Needed in case do_resize is False, or resize returns images with different sizes + grouped_images, grouped_images_index = group_images_by_shape(resized_images) + 
processed_images_grouped = {} + for shape, stacked_images in grouped_images.items(): + if do_center_crop: + stacked_images = self.center_crop(stacked_images, crop_size) + # Fused rescale and normalize + stacked_images = self.rescale_and_normalize( + stacked_images, do_rescale, rescale_factor, do_normalize, image_mean, image_std + ) + processed_images_grouped[shape] = stacked_images + + processed_images = reorder_images(processed_images_grouped, grouped_images_index) + + data = {} + if do_pad: + processed_images, processed_masks = self.pad(processed_images, return_pixel_mask=True) + processed_masks = torch.stack(processed_masks, dim=0) if return_tensors else processed_masks + data["pixel_mask"] = processed_masks + + processed_images = torch.stack(processed_images, dim=0) if return_tensors else processed_images + data["pixel_values"] = processed_images + + return BatchFeature(data=data, tensor_type=return_tensors) + + def to_dict(self): + encoder_dict = super().to_dict() + encoder_dict.pop("_valid_processor_keys", None) + encoder_dict.pop("crop_size", None) + return encoder_dict + + +__all__ = ["BridgeTowerImageProcessorFast"]
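The size arithmetic in `get_resize_output_image_size` above can be followed with plain integers. The sketch below repeats the same steps without torch; `resize_output_size` is a hypothetical stand-in that takes height and width directly instead of a tensor, and the input sizes are made up for illustration.

```python
def resize_output_size(height, width, shorter=800, longer=1333, size_divisor=32):
    """Same arithmetic as get_resize_output_image_size in the diff, on plain ints."""
    # Scale so the shorter side equals `shorter`.
    scale = shorter / min(height, width)
    if height < width:
        new_height, new_width = shorter, scale * width
    else:
        new_height, new_width = scale * height, shorter

    # If the longer side now exceeds `longer`, shrink both sides again.
    if max(new_height, new_width) > longer:
        scale = longer / max(new_height, new_width)
        new_height, new_width = scale * new_height, scale * new_width

    # Round to the nearest integer, then round down to a multiple of size_divisor.
    new_height, new_width = int(new_height + 0.5), int(new_width + 0.5)
    return new_height // size_divisor * size_divisor, new_width // size_divisor * size_divisor


shorter = 288                       # size["shortest_edge"] default in the diff
longer = int(1333 / 800 * shorter)  # 479, as computed in resize()

print(resize_output_size(480, 640, shorter, longer))   # (288, 384)
print(resize_output_size(400, 1000, shorter, longer))  # (192, 448)
```

In the second call the longer side would be 720 after the first scaling step, so it is clamped to 479 and then rounded down to 448, the nearest lower multiple of 32.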
src/transformers/models/bridgetower/image_processing_bridgetower.py+3 −5 modified@@ -28,8 +28,8 @@ PILImageResampling, get_image_size, infer_channel_dimension_format, - is_batched, is_scaled_image, + make_flat_list_of_images, to_numpy_array, valid_images, validate_preprocess_arguments, @@ -455,7 +455,7 @@ def preprocess( image_mean = image_mean if image_mean is not None else self.image_mean image_std = image_std if image_std is not None else self.image_std do_pad = do_pad if do_pad is not None else self.do_pad - do_center_crop if do_center_crop is not None else self.do_center_crop + do_center_crop = do_center_crop if do_center_crop is not None else self.do_center_crop # For backwards compatibility. Initial version of this processor was cropping to the "size" argument, which # it should default to if crop_size is undefined. crop_size = ( @@ -464,9 +464,7 @@ def preprocess( size = size if size is not None else self.size size = get_size_dict(size, default_to_square=False) - - if not is_batched(images): - images = [images] + images = make_flat_list_of_images(images) if not valid_images(images): raise ValueError(
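The one-line `do_center_crop` change in this hunk fixes a statement that evaluated a conditional expression and discarded the result. A minimal sketch of the difference, using hypothetical helper names `resolve_buggy` and `resolve_fixed`:

```python
def resolve_buggy(do_center_crop, default):
    # Bare conditional expression: evaluated and thrown away, nothing is assigned.
    do_center_crop if do_center_crop is not None else default
    return do_center_crop


def resolve_fixed(do_center_crop, default):
    # The result is assigned, so None falls back to the default.
    do_center_crop = do_center_crop if do_center_crop is not None else default
    return do_center_crop


print(resolve_buggy(None, True))  # None
print(resolve_fixed(None, True))  # True
```

With the original line, an omitted `do_center_crop` argument stayed `None` instead of picking up the processor's configured value.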
src/transformers/models/bridgetower/__init__.py+1 −0 modified@@ -20,6 +20,7 @@ if TYPE_CHECKING: from .configuration_bridgetower import * from .image_processing_bridgetower import * + from .image_processing_bridgetower_fast import * from .modeling_bridgetower import * from .processing_bridgetower import * else:
src/transformers/models/clap/modeling_clap.py+1 −1 modified@@ -829,7 +829,7 @@ def __init__(self, config): self.num_features = int(config.patch_embeds_hidden_size * 2 ** (self.num_layers - 1)) - drop_path_rate = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths))] + drop_path_rate = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths), device="cpu")] grid_size = self.patch_embed.grid_size self.input_resolutions = [(grid_size[0] // (2**i), grid_size[1] // (2**i)) for i in range(self.num_layers)]
src/transformers/models/convnext/modeling_convnext.py+2 −1 modified@@ -225,7 +225,8 @@ def __init__(self, config): super().__init__() self.stages = nn.ModuleList() drop_path_rates = [ - x.tolist() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths)).split(config.depths) + x.tolist() + for x in torch.linspace(0, config.drop_path_rate, sum(config.depths), device="cpu").split(config.depths) ] prev_chs = config.hidden_sizes[0] for i in range(config.num_stages):
src/transformers/models/convnextv2/modeling_convnextv2.py+2 −1 modified@@ -245,7 +245,8 @@ def __init__(self, config): super().__init__() self.stages = nn.ModuleList() drop_path_rates = [ - x.tolist() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths)).split(config.depths) + x.tolist() + for x in torch.linspace(0, config.drop_path_rate, sum(config.depths), device="cpu").split(config.depths) ] prev_chs = config.hidden_sizes[0] for i in range(config.num_stages):
src/transformers/models/cvt/modeling_cvt.py+3 −1 modified@@ -449,7 +449,9 @@ def __init__(self, config, stage): dropout_rate=config.drop_rate[self.stage], ) - drop_path_rates = [x.item() for x in torch.linspace(0, config.drop_path_rate[self.stage], config.depth[stage])] + drop_path_rates = [ + x.item() for x in torch.linspace(0, config.drop_path_rate[self.stage], config.depth[stage], device="cpu") + ] self.layers = nn.Sequential( *[
src/transformers/models/data2vec/modeling_data2vec_vision.py+1 −1 modified@@ -676,7 +676,7 @@ def __init__(self, config: Data2VecVisionConfig, window_size: Optional[tuple] = self.relative_position_bias = Data2VecVisionRelativePositionBias(config, window_size=window_size) # stochastic depth decay rule - dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, config.num_hidden_layers)] + dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, config.num_hidden_layers, device="cpu")] self.layer = nn.ModuleList( [ Data2VecVisionLayer(
src/transformers/models/donut/modeling_donut_swin.py+1 −1 modified@@ -790,7 +790,7 @@ def __init__(self, config, grid_size): super().__init__() self.num_layers = len(config.depths) self.config = config - dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths))] + dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths), device="cpu")] self.layers = nn.ModuleList( [ DonutSwinStage(
src/transformers/models/efficientnet/image_processing_efficientnet_fast.py+226 −0 added@@ -0,0 +1,226 @@ +# coding=utf-8 +# Copyright 2025 The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Fast Image processor class for EfficientNet.""" + +from functools import lru_cache +from typing import Optional, Union + +from ...image_processing_utils_fast import ( + BASE_IMAGE_PROCESSOR_FAST_DOCSTRING, + BASE_IMAGE_PROCESSOR_FAST_DOCSTRING_PREPROCESS, + BaseImageProcessorFast, + BatchFeature, + DefaultFastImageProcessorKwargs, +) +from ...image_transforms import group_images_by_shape, reorder_images +from ...image_utils import ( + IMAGENET_STANDARD_MEAN, + IMAGENET_STANDARD_STD, + ImageInput, + PILImageResampling, + SizeDict, +) +from ...processing_utils import Unpack +from ...utils import ( + TensorType, + add_start_docstrings, + is_torch_available, + is_torchvision_available, + is_torchvision_v2_available, +) + + +if is_torch_available(): + import torch + +if is_torchvision_available(): + if is_torchvision_v2_available(): + from torchvision.transforms.v2 import functional as F + else: + from torchvision.transforms import functional as F + + +class EfficientNetFastImageProcessorKwargs(DefaultFastImageProcessorKwargs): + rescale_offset: bool + include_top: bool + + +@add_start_docstrings( + "Constructs a fast EfficientNet image processor.", + BASE_IMAGE_PROCESSOR_FAST_DOCSTRING, +) +class 
EfficientNetImageProcessorFast(BaseImageProcessorFast): + resample = PILImageResampling.NEAREST + image_mean = IMAGENET_STANDARD_MEAN + image_std = IMAGENET_STANDARD_STD + size = {"height": 346, "width": 346} + crop_size = {"height": 289, "width": 289} + do_resize = True + do_center_crop = False + do_rescale = True + rescale_factor = 1 / 255 + rescale_offset = False + do_normalize = True + include_top = True + valid_kwargs = EfficientNetFastImageProcessorKwargs + + def __init__(self, **kwargs: Unpack[EfficientNetFastImageProcessorKwargs]): + super().__init__(**kwargs) + + def rescale( + self, + image: "torch.Tensor", + scale: float, + offset: Optional[bool] = True, + **kwargs, + ) -> "torch.Tensor": + """ + Rescale an image by a scale factor. + + If `offset` is `True`, the image has its values rescaled by `scale` and then offset by 1. If `scale` is + 1/127.5, the image is rescaled between [-1, 1]. + image = image * scale - 1 + + If `offset` is `False`, and `scale` is 1/255, the image is rescaled between [0, 1]. + image = image * scale + + Args: + image (`torch.Tensor`): + Image to rescale. + scale (`float`): + The scaling factor to rescale pixel values by. + offset (`bool`, *optional*): + Whether to scale the image in both negative and positive directions. + + Returns: + `torch.Tensor`: The rescaled image. 
+ """ + + rescaled_image = image * scale + + if offset: + rescaled_image -= 1 + + return rescaled_image + + @lru_cache(maxsize=10) + def _fuse_mean_std_and_rescale_factor( + self, + do_normalize: Optional[bool] = None, + image_mean: Optional[Union[float, list[float]]] = None, + image_std: Optional[Union[float, list[float]]] = None, + do_rescale: Optional[bool] = None, + rescale_factor: Optional[float] = None, + device: Optional["torch.device"] = None, + rescale_offset: Optional[bool] = False, + ) -> tuple: + if do_rescale and do_normalize and not rescale_offset: + # Fused rescale and normalize + image_mean = torch.tensor(image_mean, device=device) * (1.0 / rescale_factor) + image_std = torch.tensor(image_std, device=device) * (1.0 / rescale_factor) + do_rescale = False + return image_mean, image_std, do_rescale + + def rescale_and_normalize( + self, + images: "torch.Tensor", + do_rescale: bool, + rescale_factor: float, + do_normalize: bool, + image_mean: Union[float, list[float]], + image_std: Union[float, list[float]], + rescale_offset: bool = False, + ) -> "torch.Tensor": + """ + Rescale and normalize images. 
+ """ + image_mean, image_std, do_rescale = self._fuse_mean_std_and_rescale_factor( + do_normalize=do_normalize, + image_mean=image_mean, + image_std=image_std, + do_rescale=do_rescale, + rescale_factor=rescale_factor, + device=images.device, + rescale_offset=rescale_offset, + ) + # if/elif as we use fused rescale and normalize if both are set to True + if do_rescale: + images = self.rescale(images, rescale_factor, rescale_offset) + if do_normalize: + images = self.normalize(images.to(dtype=torch.float32), image_mean, image_std) + + return images + + def _preprocess( + self, + images: list["torch.Tensor"], + do_resize: bool, + size: SizeDict, + interpolation: Optional["F.InterpolationMode"], + do_center_crop: bool, + crop_size: SizeDict, + do_rescale: bool, + rescale_factor: float, + rescale_offset: bool, + do_normalize: bool, + include_top: bool, + image_mean: Optional[Union[float, list[float]]], + image_std: Optional[Union[float, list[float]]], + return_tensors: Optional[Union[str, TensorType]], + **kwargs, + ) -> BatchFeature: + # Group images by size for batched resizing + grouped_images, grouped_images_index = group_images_by_shape(images) + resized_images_grouped = {} + for shape, stacked_images in grouped_images.items(): + if do_resize: + stacked_images = self.resize(image=stacked_images, size=size, interpolation=interpolation) + resized_images_grouped[shape] = stacked_images + resized_images = reorder_images(resized_images_grouped, grouped_images_index) + + # Group images by size for further processing + # Needed in case do_resize is False, or resize returns images with different sizes + grouped_images, grouped_images_index = group_images_by_shape(resized_images) + processed_images_grouped = {} + for shape, stacked_images in grouped_images.items(): + if do_center_crop: + stacked_images = self.center_crop(stacked_images, crop_size) + # Fused rescale and normalize + stacked_images = self.rescale_and_normalize( + stacked_images, do_rescale, rescale_factor, 
do_normalize, image_mean, image_std, rescale_offset + ) + if include_top: + stacked_images = self.normalize(stacked_images, 0, image_std) + processed_images_grouped[shape] = stacked_images + + processed_images = reorder_images(processed_images_grouped, grouped_images_index) + processed_images = torch.stack(processed_images, dim=0) if return_tensors else processed_images + + return BatchFeature(data={"pixel_values": processed_images}, tensor_type=return_tensors) + + @add_start_docstrings( + BASE_IMAGE_PROCESSOR_FAST_DOCSTRING_PREPROCESS, + """ + rescale_offset (`bool`, *optional*, defaults to `self.rescale_offset`): + Whether to rescale the image between [-max_range/2, scale_range/2] instead of [0, scale_range]. + include_top (`bool`, *optional*, defaults to `self.include_top`): + Normalize the image again with the standard deviation only for image classification if set to True. + """, + ) + def preprocess(self, images: ImageInput, **kwargs: Unpack[EfficientNetFastImageProcessorKwargs]) -> BatchFeature: + return super().preprocess(images, **kwargs) + + +__all__ = ["EfficientNetImageProcessorFast"]
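The fused path in `_fuse_mean_std_and_rescale_factor` relies on the identity (x * r - mean) / std = (x - mean / r) / (std / r). The scalar sketch below checks that identity and the `rescale` offset behaviour without torch; the function names and the mean and std values are illustrative assumptions, not taken from the library.

```python
import math


def rescale_then_normalize(pixel, rescale_factor, mean, std):
    """Two-step path: rescale the pixel, then normalize."""
    return (pixel * rescale_factor - mean) / std


def fused_normalize(pixel, rescale_factor, mean, std):
    """Fused path: fold the rescale factor into mean and std, as in the diff."""
    fused_mean = mean * (1.0 / rescale_factor)
    fused_std = std * (1.0 / rescale_factor)
    return (pixel - fused_mean) / fused_std


def rescale(pixel, scale, offset=True):
    """Scalar version of the rescale method: optional shift by -1 after scaling."""
    value = pixel * scale
    return value - 1 if offset else value


# Illustrative values: 8-bit pixels, rescale factor 1/255, mean and std of 0.5.
for pixel in range(256):
    a = rescale_then_normalize(pixel, 1 / 255, 0.5, 0.5)
    b = fused_normalize(pixel, 1 / 255, 0.5, 0.5)
    assert math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)

# With scale = 1/127.5 and offset=True, [0, 255] maps to approximately [-1, 1].
print(rescale(0, 1 / 127.5), rescale(255, 1 / 127.5))
```

Fusing is skipped in the diff when `rescale_offset` is set, since the extra subtraction of 1 does not fold into the mean and std in the same way.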
src/transformers/models/efficientnet/__init__.py +1 −0 modified
@@ -20,6 +20,7 @@
 if TYPE_CHECKING:
     from .configuration_efficientnet import *
     from .image_processing_efficientnet import *
+    from .image_processing_efficientnet_fast import *
     from .modeling_efficientnet import *
 else:
     import sys
src/transformers/models/focalnet/modeling_focalnet.py +1 −1 modified
@@ -486,7 +486,7 @@ def __init__(self, config, index, input_resolution):
         downsample = FocalNetPatchEmbeddings if (index < self.num_stages - 1) else None

         # stochastic depth decay rule
-        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths))]
+        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths), device="cpu")]
         drop_path = dpr[sum(config.depths[:index]) : sum(config.depths[: index + 1])]

         self.layers = nn.ModuleList(
src/transformers/models/glpn/modeling_glpn.py +1 −1 modified
@@ -331,7 +331,7 @@ def __init__(self, config):
         self.config = config

         # stochastic depth decay rule
-        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths))]
+        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths), device="cpu")]

         # patch embeddings
         embeddings = []
src/transformers/models/hiera/modeling_hiera.py +2 −2 modified
@@ -639,9 +639,9 @@ def __init__(self, config: HieraConfig) -> None:
         super().__init__()
         total_depth = sum(config.depths)
         # stochastic depth decay rule
-        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, total_depth)]
+        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, total_depth, device="cpu")]
         # query strides rule
-        cumulative_depths = torch.tensor(config.depths).cumsum(0).tolist()
+        cumulative_depths = torch.tensor(config.depths, device="cpu").cumsum(0).tolist()
         query_pool_layer = cumulative_depths[: config.num_query_pool]
         query_strides = [math.prod(config.query_stride) if i in query_pool_layer else 1 for i in range(total_depth)]
src/transformers/models/mamba2/modeling_mamba2.py +2 −2 modified
@@ -572,8 +572,8 @@ def torch_forward(self, input_states, cache_params: Optional[Mamba2Cache]=None,
             hidden_states = hidden_states.reshape(batch_size, seq_len, -1, self.head_dim).float()
             B = B.reshape(batch_size, seq_len, -1, self.ssm_state_size).float()
             C = C.reshape(batch_size, seq_len, -1, self.ssm_state_size).float()
-            B = B.repeat(1, 1, self.num_heads // self.n_groups, 1)
-            C = C.repeat(1, 1, self.num_heads // self.n_groups, 1)
+            B = B.repeat_interleave(self.num_heads // self.n_groups, dim=2, output_size=self.num_heads)
+            C = C.repeat_interleave(self.num_heads // self.n_groups, dim=2, output_size=self.num_heads)
             pad_size = (self.chunk_size - seq_len % self.chunk_size) % self.chunk_size

             D_residual = self.D[..., None] * pad_tensor_by_size(hidden_states, pad_size)
src/transformers/models/maskformer/modeling_maskformer_swin.py +1 −1 modified
@@ -692,7 +692,7 @@ def __init__(self, config, grid_size):
         super().__init__()
         self.num_layers = len(config.depths)
         self.config = config
-        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths))]
+        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths), device="cpu")]
         self.layers = nn.ModuleList(
             [
                 MaskFormerSwinStage(
src/transformers/models/mgp_str/modeling_mgp_str.py +1 −1 modified
@@ -246,7 +246,7 @@ class MgpstrEncoder(nn.Module):
     def __init__(self, config: MgpstrConfig):
         super().__init__()
         # stochastic depth decay rule
-        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, config.num_hidden_layers)]
+        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, config.num_hidden_layers, device="cpu")]

         self.blocks = nn.Sequential(
             *[MgpstrLayer(config=config, drop_path=dpr[i]) for i in range(config.num_hidden_layers)]
src/transformers/models/poolformer/modeling_poolformer.py +1 −1 modified
@@ -194,7 +194,7 @@ def __init__(self, config):
         super().__init__()
         self.config = config
         # stochastic depth decay rule
-        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths))]
+        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths), device="cpu")]

         # patch embeddings
         embeddings = []
src/transformers/models/pvt/modeling_pvt.py +1 −1 modified
@@ -369,7 +369,7 @@ def __init__(self, config: PvtConfig):
         self.config = config

         # stochastic depth decay rule
-        drop_path_decays = torch.linspace(0, config.drop_path_rate, sum(config.depths)).tolist()
+        drop_path_decays = torch.linspace(0, config.drop_path_rate, sum(config.depths), device="cpu").tolist()

         # patch embeddings
         embeddings = []
src/transformers/models/pvt_v2/modeling_pvt_v2.py +1 −1 modified
@@ -323,7 +323,7 @@ def __init__(self, config: PvtV2Config, layer_idx: int):
         )
         # Transformer block
         # stochastic depth decay rule
-        drop_path_decays = torch.linspace(0, config.drop_path_rate, sum(config.depths)).tolist()
+        drop_path_decays = torch.linspace(0, config.drop_path_rate, sum(config.depths), device="cpu").tolist()
         block_layers = []
         for block_idx in range(config.depths[layer_idx]):
             block_layers.append(
src/transformers/models/segformer/modeling_segformer.py +3 −1 modified
@@ -356,7 +356,9 @@ def __init__(self, config):
         self.config = config

         # stochastic depth decay rule
-        drop_path_decays = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths))]
+        drop_path_decays = [
+            x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths), device="cpu")
+        ]

         # patch embeddings
         embeddings = []
src/transformers/models/seggpt/modeling_seggpt.py +1 −1 modified
@@ -460,7 +460,7 @@ class SegGptEncoder(nn.Module):
     def __init__(self, config: SegGptConfig) -> None:
         super().__init__()
         self.config = config
-        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, config.num_hidden_layers)]
+        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, config.num_hidden_layers, device="cpu")]
         self.layers = nn.ModuleList([SegGptLayer(config, dpr[i]) for i in range(config.num_hidden_layers)])
         self.layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
         self.gradient_checkpointing = False
src/transformers/models/swin2sr/modeling_swin2sr.py +1 −1 modified
@@ -682,7 +682,7 @@ def __init__(self, config, grid_size):
         super().__init__()
         self.num_stages = len(config.depths)
         self.config = config
-        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths))]
+        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths), device="cpu")]
         self.stages = nn.ModuleList(
             [
                 Swin2SRStage(
src/transformers/models/swin/modeling_swin.py +1 −1 modified
@@ -823,7 +823,7 @@ def __init__(self, config, grid_size):
         super().__init__()
         self.num_layers = len(config.depths)
         self.config = config
-        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths))]
+        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths), device="cpu")]
         self.layers = nn.ModuleList(
             [
                 SwinStage(
src/transformers/models/swinv2/modeling_swinv2.py +1 −1 modified
@@ -877,7 +877,7 @@ def __init__(self, config, grid_size, pretrained_window_sizes=(0, 0, 0, 0)):
         self.config = config
         if self.config.pretrained_window_sizes is not None:
             pretrained_window_sizes = config.pretrained_window_sizes
-        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths))]
+        dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths), device="cpu")]

         layers = []
         for i_layer in range(self.num_layers):
src/transformers/models/timesformer/modeling_timesformer.py +1 −1 modified
@@ -295,7 +295,7 @@ def __init__(self, config: TimesformerConfig, layer_index: int) -> None:
         attention_type = config.attention_type

         drop_path_rates = [
-            x.item() for x in torch.linspace(0, config.drop_path_rate, config.num_hidden_layers)
+            x.item() for x in torch.linspace(0, config.drop_path_rate, config.num_hidden_layers, device="cpu")
         ]  # stochastic depth decay rule
         drop_path_rate = drop_path_rates[layer_index]
src/transformers/models/vitdet/modeling_vitdet.py +1 −1 modified
@@ -535,7 +535,7 @@ def __init__(self, config: VitDetConfig) -> None:
         depth = config.num_hidden_layers

         # stochastic depth decay rule
-        drop_path_rate = [x.item() for x in torch.linspace(0, config.drop_path_rate, depth)]
+        drop_path_rate = [x.item() for x in torch.linspace(0, config.drop_path_rate, depth, device="cpu")]

         layers = []
         for i in range(depth):
src/transformers/models/zamba2/modeling_zamba2.py +2 −2 modified
@@ -860,8 +860,8 @@ def torch_forward(self, input_states, cache_params: Optional[Zamba2HybridDynamic
             hidden_states = hidden_states.reshape(batch_size, seq_len, -1, self.head_dim).float()
             B = B.reshape(batch_size, seq_len, -1, self.ssm_state_size).float()
             C = C.reshape(batch_size, seq_len, -1, self.ssm_state_size).float()
-            B = B.repeat(1, 1, self.num_heads // self.n_groups, 1)
-            C = C.repeat(1, 1, self.num_heads // self.n_groups, 1)
+            B = B.repeat_interleave(self.num_heads // self.n_groups, dim=2, output_size=self.num_heads)
+            C = C.repeat_interleave(self.num_heads // self.n_groups, dim=2, output_size=self.num_heads)
             pad_size = (self.chunk_size - seq_len % self.chunk_size) % self.chunk_size

             D_residual = self.D[..., None] * pad_tensor_by_size(hidden_states, pad_size)
src/transformers/models/zamba2/modular_zamba2.py +2 −2 modified
@@ -630,8 +630,8 @@ def torch_forward(self, input_states, cache_params: Optional[Zamba2HybridDynamic
             hidden_states = hidden_states.reshape(batch_size, seq_len, -1, self.head_dim).float()
             B = B.reshape(batch_size, seq_len, -1, self.ssm_state_size).float()
             C = C.reshape(batch_size, seq_len, -1, self.ssm_state_size).float()
-            B = B.repeat(1, 1, self.num_heads // self.n_groups, 1)
-            C = C.repeat(1, 1, self.num_heads // self.n_groups, 1)
+            B = B.repeat_interleave(self.num_heads // self.n_groups, dim=2, output_size=self.num_heads)
+            C = C.repeat_interleave(self.num_heads // self.n_groups, dim=2, output_size=self.num_heads)
             pad_size = (self.chunk_size - seq_len % self.chunk_size) % self.chunk_size

             D_residual = self.D[..., None] * pad_tensor_by_size(hidden_states, pad_size)
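The Mamba2 and Zamba2 hunks above swap `repeat` for `repeat_interleave` when expanding the per-group `B` and `C` tensors to one entry per head. The two calls order their copies differently once there is more than one group. The sketch below mimics both orderings with plain Python lists rather than tensors; the helper names `tile_groups` and `interleave_groups` are hypothetical, and the statement that head `h` should read group `h // (num_heads // n_groups)` is an assumption about the intended layout, not something the diff states.

```python
# Plain-Python stand-ins for the two expansion orders along the group axis.
# Helper names are hypothetical; they only illustrate ordering, not tensor ops.

def tile_groups(groups, times):
    """Mimics Tensor.repeat along one axis: the whole sequence is tiled."""
    return list(groups) * times


def interleave_groups(groups, times):
    """Mimics Tensor.repeat_interleave: each element is repeated in place."""
    return [g for g in groups for _ in range(times)]


n_groups, num_heads = 2, 4
groups = [f"g{i}" for i in range(n_groups)]
per_group = num_heads // n_groups

tiled = tile_groups(groups, per_group)              # g0, g1, g0, g1
interleaved = interleave_groups(groups, per_group)  # g0, g0, g1, g1

# Assumed convention: consecutive heads share a group.
expected = [f"g{h // per_group}" for h in range(num_heads)]

print(tiled, interleaved, expected)
```

With a single group the two orderings coincide, which may be why the difference only shows up in a grouped configuration such as the one exercised by the added `test_mamba2_slow_vs_fast_forward_grouped` test.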
tests/models/bridgetower/test_image_processing_bridgetower.py +67 −49 modified
@@ -16,19 +16,25 @@
 import unittest
 from typing import Optional, Union

-import numpy as np
+import requests

 from transformers.testing_utils import require_torch, require_vision
-from transformers.utils import is_vision_available
+from transformers.utils import is_torch_available, is_torchvision_available, is_vision_available

 from ...test_image_processing_common import ImageProcessingTestMixin, prepare_image_inputs


+if is_torch_available():
+    import torch
+
 if is_vision_available():
     from PIL import Image

     from transformers import BridgeTowerImageProcessor

+    if is_torchvision_available():
+        from transformers import BridgeTowerImageProcessorFast
+

 class BridgeTowerImageProcessingTester:
     def __init__(
@@ -76,46 +82,7 @@ def prepare_image_processor_dict(self):
         }

     def get_expected_values(self, image_inputs, batched=False):
-        """
-        This function computes the expected height and width when providing images to BridgeTowerImageProcessor,
-        assuming do_resize is set to True with a scalar size and size_divisor.
-        """
-        if not batched:
-            size = self.size["shortest_edge"]
-            image = image_inputs[0]
-            if isinstance(image, Image.Image):
-                w, h = image.size
-            elif isinstance(image, np.ndarray):
-                h, w = image.shape[0], image.shape[1]
-            else:
-                h, w = image.shape[1], image.shape[2]
-            scale = size / min(w, h)
-            if h < w:
-                newh, neww = size, scale * w
-            else:
-                newh, neww = scale * h, size
-
-            max_size = int((1333 / 800) * size)
-            if max(newh, neww) > max_size:
-                scale = max_size / max(newh, neww)
-                newh = newh * scale
-                neww = neww * scale
-
-            newh, neww = int(newh + 0.5), int(neww + 0.5)
-            expected_height, expected_width = (
-                newh // self.size_divisor * self.size_divisor,
-                neww // self.size_divisor * self.size_divisor,
-            )
-
-        else:
-            expected_values = []
-            for image in image_inputs:
-                expected_height, expected_width = self.get_expected_values([image])
-                expected_values.append((expected_height, expected_width))
-            expected_height = max(expected_values, key=lambda item: item[0])[0]
-            expected_width = max(expected_values, key=lambda item: item[1])[1]
-
-        return expected_height, expected_width
+        return self.size["shortest_edge"], self.size["shortest_edge"]

     def expected_output_image_shape(self, images):
         height, width = self.get_expected_values(images, batched=True)
@@ -137,6 +104,7 @@ def prepare_image_inputs(self, equal_resolution=False, numpify=False, torchify=F
 @require_vision
 class BridgeTowerImageProcessingTest(ImageProcessingTestMixin, unittest.TestCase):
     image_processing_class = BridgeTowerImageProcessor if is_vision_available() else None
+    fast_image_processing_class = BridgeTowerImageProcessorFast if is_torchvision_available() else None

     def setUp(self):
         super().setUp()
@@ -147,10 +115,60 @@ def image_processor_dict(self):
         return self.image_processor_tester.prepare_image_processor_dict()

     def test_image_processor_properties(self):
-        image_processing = self.image_processing_class(**self.image_processor_dict)
-        self.assertTrue(hasattr(image_processing, "image_mean"))
-        self.assertTrue(hasattr(image_processing, "image_std"))
-        self.assertTrue(hasattr(image_processing, "do_normalize"))
-        self.assertTrue(hasattr(image_processing, "do_resize"))
-        self.assertTrue(hasattr(image_processing, "size"))
-        self.assertTrue(hasattr(image_processing, "size_divisor"))
+        for image_processing_class in self.image_processor_list:
+            image_processing = image_processing_class(**self.image_processor_dict)
+            self.assertTrue(hasattr(image_processing, "image_mean"))
+            self.assertTrue(hasattr(image_processing, "image_std"))
+            self.assertTrue(hasattr(image_processing, "do_normalize"))
+            self.assertTrue(hasattr(image_processing, "do_resize"))
+            self.assertTrue(hasattr(image_processing, "size"))
+            self.assertTrue(hasattr(image_processing, "size_divisor"))
+
+    def _assertEquivalence(self, a, b):
+        self.assertTrue(torch.allclose(a, b, atol=1e-1))
+        self.assertLessEqual(torch.mean(torch.abs(a - b)).item(), 1e-3)
+
+    @require_vision
+    @require_torch
+    def test_slow_fast_equivalence(self):
+        if not self.test_slow_image_processor or not self.test_fast_image_processor:
+            self.skipTest(reason="Skipping slow/fast equivalence test")
+
+        if self.image_processing_class is None or self.fast_image_processing_class is None:
+            self.skipTest(reason="Skipping slow/fast equivalence test as one of the image processors is not defined")
+
+        dummy_image = Image.open(
+            requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw
+        )
+        image_processor_slow = self.image_processing_class(**self.image_processor_dict)
+        image_processor_fast = self.fast_image_processing_class(**self.image_processor_dict)
+
+        encoding_slow = image_processor_slow(dummy_image, return_tensors="pt")
+        encoding_fast = image_processor_fast(dummy_image, return_tensors="pt")
+
+        self._assertEquivalence(encoding_slow.pixel_values, encoding_fast.pixel_values)
+        self._assertEquivalence(encoding_slow.pixel_mask.float(), encoding_fast.pixel_mask.float())
+
+    @require_vision
+    @require_torch
+    def test_slow_fast_equivalence_batched(self):
+        if not self.test_slow_image_processor or not self.test_fast_image_processor:
+            self.skipTest(reason="Skipping slow/fast equivalence test")
+
+        if self.image_processing_class is None or self.fast_image_processing_class is None:
+            self.skipTest(reason="Skipping slow/fast equivalence test as one of the image processors is not defined")
+
+        if hasattr(self.image_processor_tester, "do_center_crop") and self.image_processor_tester.do_center_crop:
+            self.skipTest(
+                reason="Skipping as do_center_crop is True and center_crop functions are not equivalent for fast and slow processors"
+            )
+
+        dummy_images = self.image_processor_tester.prepare_image_inputs(equal_resolution=False, torchify=True)
+        image_processor_slow = self.image_processing_class(**self.image_processor_dict)
+        image_processor_fast = self.fast_image_processing_class(**self.image_processor_dict)
+
+        encoding_slow = image_processor_slow(dummy_images, return_tensors="pt")
+        encoding_fast = image_processor_fast(dummy_images, return_tensors="pt")
+
+        self._assertEquivalence(encoding_slow.pixel_values, encoding_fast.pixel_values)
+        self._assertEquivalence(encoding_slow.pixel_mask.float(), encoding_fast.pixel_mask.float())
tests/models/efficientnet/test_image_processing_efficientnet.py +87 −19 modified
@@ -17,15 +17,26 @@

 import numpy as np

+from transformers.image_utils import PILImageResampling
 from transformers.testing_utils import require_torch, require_vision
-from transformers.utils import is_vision_available
+from transformers.utils import (
+    is_torch_available,
+    is_torchvision_available,
+    is_vision_available,
+)

 from ...test_image_processing_common import ImageProcessingTestMixin, prepare_image_inputs


+if is_torch_available():
+    import torch
+
 if is_vision_available():
     from transformers import EfficientNetImageProcessor

+    if is_torchvision_available():
+        from transformers import EfficientNetImageProcessorFast
+

 class EfficientNetImageProcessorTester:
     def __init__(
@@ -41,6 +52,10 @@ def __init__(
         do_normalize=True,
         image_mean=[0.5, 0.5, 0.5],
         image_std=[0.5, 0.5, 0.5],
+        do_rescale=True,
+        rescale_offset=True,
+        rescale_factor=1 / 127.5,
+        resample=PILImageResampling.BILINEAR,  # NEAREST is too different between PIL and torchvision
     ):
         size = size if size is not None else {"height": 18, "width": 18}
         self.parent = parent
@@ -54,6 +69,7 @@ def __init__(
         self.do_normalize = do_normalize
         self.image_mean = image_mean
         self.image_std = image_std
+        self.resample = resample

     def prepare_image_processor_dict(self):
         return {
@@ -62,6 +78,7 @@ def prepare_image_processor_dict(self):
             "do_normalize": self.do_normalize,
             "do_resize": self.do_resize,
             "size": self.size,
+            "resample": self.resample,
         }

     def expected_output_image_shape(self, images):
@@ -83,6 +100,7 @@ def prepare_image_inputs(self, equal_resolution=False, numpify=False, torchify=F
 @require_vision
 class EfficientNetImageProcessorTest(ImageProcessingTestMixin, unittest.TestCase):
     image_processing_class = EfficientNetImageProcessor if is_vision_available() else None
+    fast_image_processing_class = EfficientNetImageProcessorFast if is_torchvision_available() else None

     def setUp(self):
         super().setUp()
@@ -93,30 +111,80 @@ def image_processor_dict(self):
         return self.image_processor_tester.prepare_image_processor_dict()

     def test_image_processor_properties(self):
-        image_processing = self.image_processing_class(**self.image_processor_dict)
-        self.assertTrue(hasattr(image_processing, "image_mean"))
-        self.assertTrue(hasattr(image_processing, "image_std"))
-        self.assertTrue(hasattr(image_processing, "do_normalize"))
-        self.assertTrue(hasattr(image_processing, "do_resize"))
-        self.assertTrue(hasattr(image_processing, "size"))
+        for image_processing_class in self.image_processor_list:
+            image_processing = image_processing_class(**self.image_processor_dict)
+            self.assertTrue(hasattr(image_processing, "image_mean"))
+            self.assertTrue(hasattr(image_processing, "image_std"))
+            self.assertTrue(hasattr(image_processing, "do_normalize"))
+            self.assertTrue(hasattr(image_processing, "do_resize"))
+            self.assertTrue(hasattr(image_processing, "size"))

     def test_image_processor_from_dict_with_kwargs(self):
-        image_processor = self.image_processing_class.from_dict(self.image_processor_dict)
-        self.assertEqual(image_processor.size, {"height": 18, "width": 18})
+        for image_processing_class in self.image_processor_list:
+            image_processor = image_processing_class.from_dict(self.image_processor_dict)
+            self.assertEqual(image_processor.size, {"height": 18, "width": 18})

-        image_processor = self.image_processing_class.from_dict(self.image_processor_dict, size=42)
-        self.assertEqual(image_processor.size, {"height": 42, "width": 42})
+            image_processor = image_processing_class.from_dict(self.image_processor_dict, size=42)
+            self.assertEqual(image_processor.size, {"height": 42, "width": 42})

     def test_rescale(self):
         # EfficientNet optionally rescales between -1 and 1 instead of the usual 0 and 1
         image = np.arange(0, 256, 1, dtype=np.uint8).reshape(1, 8, 32)

-        image_processor = self.image_processing_class(**self.image_processor_dict)
-
-        rescaled_image = image_processor.rescale(image, scale=1 / 127.5)
-        expected_image = (image * (1 / 127.5)).astype(np.float32) - 1
-        self.assertTrue(np.allclose(rescaled_image, expected_image))
+        for image_processing_class in self.image_processor_list:
+            image_processor = image_processing_class(**self.image_processor_dict)
+            if image_processing_class == EfficientNetImageProcessorFast:
+                image = torch.from_numpy(image)
+
+                # Scale between [-1, 1] with rescale_factor 1/127.5 and rescale_offset=True
+                rescaled_image = image_processor.rescale(image, scale=1 / 127.5, offset=True)
+                expected_image = (image * (1 / 127.5)) - 1
+                self.assertTrue(torch.allclose(rescaled_image, expected_image))
+
+                # Scale between [0, 1] with rescale_factor 1/255 and rescale_offset=True
+                rescaled_image = image_processor.rescale(image, scale=1 / 255, offset=False)
+                expected_image = image / 255.0
+                self.assertTrue(torch.allclose(rescaled_image, expected_image))
+
+            else:
+                rescaled_image = image_processor.rescale(image, scale=1 / 127.5, dtype=np.float64)
+                expected_image = (image * (1 / 127.5)).astype(np.float64) - 1
+                self.assertTrue(np.allclose(rescaled_image, expected_image))
+
+                rescaled_image = image_processor.rescale(image, scale=1 / 255, offset=False, dtype=np.float64)
+                expected_image = (image / 255.0).astype(np.float64)
+                self.assertTrue(np.allclose(rescaled_image, expected_image))
+
+    @require_vision
+    @require_torch
+    def test_rescale_normalize(self):
+        if self.image_processing_class is None or self.fast_image_processing_class is None:
+            self.skipTest(reason="Skipping slow/fast equivalence test as one of the image processors is not defined")
+
+        image = torch.arange(0, 256, 1, dtype=torch.uint8).reshape(1, 8, 32).repeat(3, 1, 1)
+        image_mean_0 = (0.0, 0.0, 0.0)
+        image_std_0 = (1.0, 1.0, 1.0)
+        image_mean_1 = (0.5, 0.5, 0.5)
+        image_std_1 = (0.5, 0.5, 0.5)
+
+        image_processor_fast = self.fast_image_processing_class(**self.image_processor_dict)
+
+        # Rescale between [-1, 1] with rescale_factor=1/127.5 and rescale_offset=True. Then normalize
+        rescaled_normalized = image_processor_fast.rescale_and_normalize(
+            image, True, 1 / 127.5, True, image_mean_0, image_std_0, True
+        )
+        expected_image = (image * (1 / 127.5)) - 1
+        expected_image = (expected_image - torch.tensor(image_mean_0).view(3, 1, 1)) / torch.tensor(image_std_0).view(
+            3, 1, 1
+        )
+        self.assertTrue(torch.allclose(rescaled_normalized, expected_image, rtol=1e-3))

-        rescaled_image = image_processor.rescale(image, scale=1 / 255, offset=False)
-        expected_image = (image / 255.0).astype(np.float32)
-        self.assertTrue(np.allclose(rescaled_image, expected_image))
+        # Rescale between [0, 1] with rescale_factor=1/255 and rescale_offset=False. Then normalize
+        rescaled_normalized = image_processor_fast.rescale_and_normalize(
+            image, True, 1 / 255, True, image_mean_1, image_std_1, False
+        )
+        expected_image = image * (1 / 255.0)
+        expected_image = (expected_image - torch.tensor(image_mean_1).view(3, 1, 1)) / torch.tensor(image_std_1).view(
+            3, 1, 1
+        )
+        self.assertTrue(torch.allclose(rescaled_normalized, expected_image, rtol=1e-3))
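The EfficientNet tests above check two pieces of arithmetic: rescaling with an optional offset (`pixel * scale - 1`), and rescaling followed by normalization. The scalar sketch below reproduces that arithmetic and shows one way the two steps could be folded into a single affine transform. The function names are hypothetical, and the folding shown is an assumption about what a "fused" rescale-and-normalize might compute; it is not taken from the library's `_fuse_mean_std_and_rescale_factor`.

```python
# Scalar sketch of rescale-with-offset followed by normalization.
# All function names here are illustrative, not library APIs.

def rescale(pixel, scale, offset):
    """Map a raw pixel to [0, 1] (offset=False) or [-1, 1] (offset=True)."""
    value = pixel * scale
    return value - 1 if offset else value


def rescale_then_normalize(pixel, scale, offset, mean, std):
    """Two-step reference: rescale first, then subtract mean and divide by std."""
    return (rescale(pixel, scale, offset) - mean) / std


def fused_rescale_normalize(pixel, scale, offset, mean, std):
    """One-step variant (assumed form): fold scale and offset into mean/std.

    (pixel * scale - shift) / std == (pixel - shift / scale) / (std / scale)
    """
    shift = mean + (1 if offset else 0)
    fused_mean = shift / scale
    fused_std = std / scale
    return (pixel - fused_mean) / fused_std


for pixel in (0, 64, 255):
    two_step = rescale_then_normalize(pixel, 1 / 127.5, True, 0.5, 0.5)
    one_step = fused_rescale_normalize(pixel, 1 / 127.5, True, 0.5, 0.5)
    print(pixel, two_step, one_step)
```

The two variants agree up to floating-point rounding, which is consistent with the tests comparing results under a tolerance rather than for exact equality.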
tests/models/mamba2/test_modeling_mamba2.py +8 −0 modified
@@ -238,6 +238,14 @@ def test_mamba2_slow_vs_fast_forward(self):
         config_and_inputs = self.model_tester.prepare_config_and_inputs()
         self.model_tester.create_and_check_mamba2_slow_vs_fast_forward(*config_and_inputs)

+    # This test adjusts n_groups to half the original setting and effectively
+    # creates a grouped SSD configuration in the mamba2 layers
+    # See https://github.com/huggingface/transformers/pull/37533/
+    def test_mamba2_slow_vs_fast_forward_grouped(self):
+        config_and_inputs = self.model_tester.prepare_config_and_inputs()
+        config_and_inputs[0].n_groups //= 2
+        self.model_tester.create_and_check_mamba2_slow_vs_fast_forward(*config_and_inputs)
+
     def test_initialization(self):
         config, _ = self.model_tester.prepare_config_and_inputs_for_common()
tests/test_image_processing_common.py +2 −2 modified
@@ -181,7 +181,7 @@ def test_slow_fast_equivalence(self):
         encoding_fast = image_processor_fast(dummy_image, return_tensors="pt")
         self.assertTrue(torch.allclose(encoding_slow.pixel_values, encoding_fast.pixel_values, atol=1e-1))
         self.assertLessEqual(
-            torch.mean(torch.abs(encoding_slow.pixel_values - encoding_fast.pixel_values)).item(), 1e-3
+            torch.mean(torch.abs(encoding_slow.pixel_values - encoding_fast.pixel_values)).item(), 5e-3
         )

     @require_vision
@@ -207,7 +207,7 @@ def test_slow_fast_equivalence_batched(self):

         self.assertTrue(torch.allclose(encoding_slow.pixel_values, encoding_fast.pixel_values, atol=1e-1))
         self.assertLessEqual(
-            torch.mean(torch.abs(encoding_slow.pixel_values - encoding_fast.pixel_values)).item(), 1e-3
+            torch.mean(torch.abs(encoding_slow.pixel_values - encoding_fast.pixel_values)).item(), 5e-3
         )

     @require_vision
tests/test_modeling_common.py +7 −0 modified
@@ -4528,6 +4528,13 @@ def test_generation_tester_mixin_inheritance(self):
                 ),
             )

+    def test_can_be_initialized_on_meta(self):
+        config, _ = self.model_tester.prepare_config_and_inputs_for_common()
+        for model_class in self.all_model_classes:
+            # If it does not raise here, the test passes
+            with torch.device("meta"):
+                _ = model_class(config)
+
     @require_torch_accelerator
     def test_can_load_with_device_context_manager(self):
         config, _ = self.model_tester.prepare_config_and_inputs_for_common()
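The `test_can_be_initialized_on_meta` test above appears to be the motivation for the many `device="cpu"` additions to `torch.linspace` earlier in the diff: under a `torch.device("meta")` context, tensors created without an explicit device carry no data, so calling `.item()` on them fails. The sketch below, which requires PyTorch 2.0 or later, reproduces that behaviour in isolation; `drop_path_rates` and `works_under_meta` are hypothetical helpers, not functions from the library.

```python
import torch


def drop_path_rates(rate: float, depth: int, pin_to_cpu: bool) -> list:
    """Build a stochastic-depth schedule as Python floats.

    When pin_to_cpu is True the tensor is created on CPU even inside a
    device context manager, mirroring the device="cpu" argument in the diff.
    """
    kwargs = {"device": "cpu"} if pin_to_cpu else {}
    return [x.item() for x in torch.linspace(0, rate, depth, **kwargs)]


def works_under_meta(pin_to_cpu: bool) -> bool:
    """Return True if the schedule can be built inside a meta-device context."""
    try:
        with torch.device("meta"):
            drop_path_rates(0.1, 4, pin_to_cpu)
        return True
    except RuntimeError:
        # .item() on a meta tensor raises because there is no data to read
        return False


print("pinned to cpu:", works_under_meta(True))
print("default device:", works_under_meta(False))
```

Pinning only the schedule tensor to CPU keeps the rest of the module's parameters on whatever device the context manager selects, so the model can still be instantiated without allocating real weights.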
Vulnerability mechanics
Generated on May 9, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.
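As a minimal sketch of the bypass described in this advisory, the following compares a prefix check with a check on the parsed hostname. The function names and the allow-list are hypothetical and are not taken from `image_utils.py` or from the fix commit; the sketch only illustrates why a `startswith()` test accepts a URL whose userinfo part looks like a trusted domain.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list for illustration only
TRUSTED_HOSTS = {"youtube.com", "www.youtube.com"}


def naive_is_youtube(url: str) -> bool:
    """Prefix check of the kind the advisory describes as bypassable."""
    return url.startswith(("https://youtube.com", "https://www.youtube.com"))


def strict_is_youtube(url: str) -> bool:
    """Compare the parsed hostname, which excludes any user@ prefix."""
    parts = urlsplit(url)
    return parts.scheme == "https" and parts.hostname in TRUSTED_HOSTS


# "youtube.com" here is the userinfo part; the real host follows the "@"
crafted = "https://youtube.com@malicious.example.com/watch?v=abc"
genuine = "https://www.youtube.com/watch?v=abc"

for url in (crafted, genuine):
    print(url, naive_is_youtube(url), strict_is_youtube(url), urlsplit(url).hostname)
```

The prefix check accepts both URLs, while the hostname check rejects the crafted one because its parsed host is `malicious.example.com`.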
References: 4
News mentions: 0
No linked articles in our index yet.