VYPR
Critical severity · NVD Advisory · Published Dec 19, 2023 · Updated Aug 2, 2024

Deserialization of Untrusted Data in huggingface/transformers

CVE-2023-6730

Description

Deserialization of Untrusted Data in GitHub repository huggingface/transformers prior to 4.36.

AI Insight

LLM-synthesized narrative grounded in this CVE's description and references.

A deserialization vulnerability in Hugging Face Transformers before 4.36 allows remote code execution via pickle.load on untrusted data.

Vulnerability

Overview

CVE-2023-6730 is a deserialization of untrusted data vulnerability in the Hugging Face Transformers library prior to version 4.36. The library called pickle.load on certain files without any safeguard, including the TransfoXL tokenizer's vocabulary file and cached corpus and the RAG retriever's index passages and metadata files. Because unpickling can run arbitrary code, processing a malicious pickle file results in arbitrary code execution [2][3].
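To illustrate why unpickling untrusted bytes is dangerous, here is a minimal, harmless sketch. The `Demo` class is hypothetical and not part of Transformers; it only shows that `pickle.loads` calls whatever callable the byte stream names. A real payload would name something like `os.system` instead of `int`.

```python
import pickle


class Demo:
    """Hypothetical class whose pickled form asks pickle to call a function."""

    def __reduce__(self):
        # pickle stores "call int('42')" in the byte stream instead of object state.
        # An attacker would substitute a dangerous callable and arguments here.
        return (int, ("42",))


payload = pickle.dumps(Demo())

# Loading runs the recorded call; the result is the call's return value,
# not a Demo instance.
result = pickle.loads(payload)
print(type(result).__name__, result)
```

The loader never needs the `Demo` class: the byte stream alone decides which callable runs, which is why a file from an untrusted source is enough to trigger execution.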

Exploitation

An attacker can exploit this by crafting a malicious pickle file and delivering it to a user or automated system that loads a model or index using Transformers. This can occur through a compromised model repository or direct file upload, requiring no authentication if the attacker can supply the file [4].
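One way to look at a pickle file without loading it is to list its opcodes with the standard-library `pickletools` module. The sketch below is an assumption-laden triage aid, not a security control: the `SUSPICIOUS_OPCODES` set and the `Callback` class are illustrative choices, and many legitimate pickles also contain these opcodes.

```python
import pickle
import pickletools

# Opcodes that import objects or invoke callables during unpickling.
# This selection is an illustrative assumption, not an exhaustive list.
SUSPICIOUS_OPCODES = {
    "GLOBAL", "STACK_GLOBAL", "REDUCE", "INST",
    "OBJ", "NEWOBJ", "NEWOBJ_EX", "BUILD",
}


def suspicious_opcodes(data: bytes) -> set:
    """Return the import/call opcode names found in a pickle byte string.

    The bytes are only disassembled here; nothing is unpickled.
    """
    found = {op.name for op, _arg, _pos in pickletools.genops(data)}
    return found & SUSPICIOUS_OPCODES


class Callback:
    """Hypothetical payload: asks pickle to call int('42') on load."""

    def __reduce__(self):
        return (int, ("42",))


plain = pickle.dumps({"tokens": ["a", "b"], "count": 2})
crafted = pickle.dumps(Callback())

print("plain:", sorted(suspicious_opcodes(plain)))
print("crafted:", sorted(suspicious_opcodes(crafted)))
```

A pickle made only of built-in containers and scalars yields an empty set, while one that names a callable does not. An empty result is not proof of safety for arbitrary files, so this check complements, rather than replaces, refusing to load untrusted pickles.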

Impact

Successful exploitation results in remote code execution with the privileges of the application using Transformers, potentially leading to data exfiltration, system compromise, or further lateral movement within an environment [3][4].

Mitigation

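The fix, introduced in commit 1d63b0e [2], makes each affected code path raise a ValueError before it reaches pickle.load unless the TRUST_REMOTE_CODE environment variable is explicitly set to True. Setting the variable restores the previous behavior, so it should be done only for pickle files that have already been verified. Users should upgrade to Transformers version 4.36 or later to mitigate this vulnerability [3][4].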
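The opt-in guard can be sketched in isolation as follows. This is a simplified illustration rather than the library's code: `env_flag` is a hypothetical stand-in for the `strtobool` helper imported in the patch, and `guarded_pickle_load` is an invented name.

```python
import os
import pickle


def env_flag(name: str, default: str = "False") -> bool:
    """Hypothetical stand-in for a strtobool-style helper."""
    value = os.environ.get(name, default).strip().lower()
    if value in {"y", "yes", "t", "true", "on", "1"}:
        return True
    if value in {"n", "no", "f", "false", "off", "0"}:
        return False
    raise ValueError(f"invalid truth value {value!r}")


def guarded_pickle_load(path: str):
    """Unpickle `path` only if the operator has opted in via the environment."""
    if not env_flag("TRUST_REMOTE_CODE"):
        raise ValueError(
            "pickle.load is disabled because it can execute arbitrary code; "
            "set TRUST_REMOTE_CODE=True only for files you have verified."
        )
    with open(path, "rb") as f:
        return pickle.load(f)
```

The check runs before the file is opened, so by default an untrusted file is never parsed at all. The design trades convenience for safety: legacy workflows keep working, but only after a deliberate opt-in.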

AI Insight generated on May 20, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.

Affected packages

Versions sourced from the GitHub Security Advisory.

Package | Affected versions | Patched versions
transformers (PyPI) | < 4.36.0 | 4.36.0
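A quick way to check an environment against the patched version is sketched below. The helper names `parse_release` and `is_patched` are hypothetical, and the comparison looks only at the leading numeric components, so pre-release or development suffixes (for example `.dev0`) are ignored; a dedicated version parser is preferable for those cases.

```python
import re
from importlib.metadata import PackageNotFoundError, version

FIXED_RELEASE = (4, 36, 0)  # patched version listed in the advisory


def parse_release(version_string: str) -> tuple:
    """Extract the leading numeric components, e.g. '4.35.2' -> (4, 35, 2)."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"unrecognized version: {version_string!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def is_patched(version_string: str, fixed: tuple = FIXED_RELEASE) -> bool:
    """Return True if the numeric release is at or above the fixed release."""
    release = parse_release(version_string)
    # Pad short versions such as '4.36' with zeros before comparing.
    release += (0,) * (len(fixed) - len(release))
    return release >= fixed


try:
    installed = version("transformers")
except PackageNotFoundError:
    installed = None

if installed is None:
    print("transformers is not installed in this environment")
else:
    status = "patched" if is_patched(installed) else "affected"
    print(f"transformers {installed}: {status}")
```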

Patches

1d63b0ec361e

Disallow `pickle.load` unless `TRUST_REMOTE_CODE=True` (#27776)

4 files changed · +39 -62
  • docs/source/en/model_doc/transfo-xl.md +8 -2 modified
    @@ -22,11 +22,17 @@ This model is in maintenance mode only, so we won't accept any new PRs changing
     
     We recommend switching to more recent models for improved security.
     
    -In case you would still like to use `TransfoXL` in your experiments, we recommend using the [Hub checkpoint](https://huggingface.co/transfo-xl-wt103) with a specific revision to ensure you are downloading safe files from the Hub:
    +In case you would still like to use `TransfoXL` in your experiments, we recommend using the [Hub checkpoint](https://huggingface.co/transfo-xl-wt103) with a specific revision to ensure you are downloading safe files from the Hub.
     
    -```
    +You will need to set the environment variable `TRUST_REMOTE_CODE` to `True` in order to allow the
    +usage of `pickle.load()`:
    +
    +```python
    +import os
     from transformers import TransfoXLTokenizer, TransfoXLLMHeadModel
     
    +os.environ["TRUST_REMOTE_CODE"] = "True"
    +
     checkpoint = 'transfo-xl-wt103'
     revision = '40a186da79458c9f9de846edfaea79c412137f97'
     
    
  • src/transformers/models/deprecated/transfo_xl/tokenization_transfo_xl.py +16 -0 modified
    @@ -34,6 +34,7 @@
         is_torch_available,
         logging,
         requires_backends,
    +    strtobool,
         torch_only_method,
     )
     
    @@ -212,6 +213,14 @@ def __init__(
                 vocab_dict = None
                 if pretrained_vocab_file is not None:
                     # Priority on pickle files (support PyTorch and TF)
    +                if not strtobool(os.environ.get("TRUST_REMOTE_CODE", "False")):
    +                    raise ValueError(
    +                        "This part uses `pickle.load` which is insecure and will execute arbitrary code that is "
    +                        "potentially malicious. It's recommended to never unpickle data that could have come from an "
    +                        "untrusted source, or that could have been tampered with. If you already verified the pickle "
    +                        "data and decided to use it, you can set the environment variable "
    +                        "`TRUST_REMOTE_CODE` to `True` to allow it."
    +                    )
                     with open(pretrained_vocab_file, "rb") as f:
                         vocab_dict = pickle.load(f)
     
    @@ -790,6 +799,13 @@ def get_lm_corpus(datadir, dataset):
             corpus = torch.load(fn_pickle)
         elif os.path.exists(fn):
             logger.info("Loading cached dataset from pickle...")
    +        if not strtobool(os.environ.get("TRUST_REMOTE_CODE", "False")):
    +            raise ValueError(
    +                "This part uses `pickle.load` which is insecure and will execute arbitrary code that is potentially "
    +                "malicious. It's recommended to never unpickle data that could have come from an untrusted source, or "
    +                "that could have been tampered with. If you already verified the pickle data and decided to use it, "
    +                "you can set the environment variable `TRUST_REMOTE_CODE` to `True` to allow it."
    +            )
             with open(fn, "rb") as fp:
                 corpus = pickle.load(fp)
         else:
    
  • src/transformers/models/rag/retrieval_rag.py +15 -1 modified
    @@ -23,7 +23,7 @@
     
     from ...tokenization_utils import PreTrainedTokenizer
     from ...tokenization_utils_base import BatchEncoding
    -from ...utils import cached_file, is_datasets_available, is_faiss_available, logging, requires_backends
    +from ...utils import cached_file, is_datasets_available, is_faiss_available, logging, requires_backends, strtobool
     from .configuration_rag import RagConfig
     from .tokenization_rag import RagTokenizer
     
    @@ -131,6 +131,13 @@ def _resolve_path(self, index_path, filename):
         def _load_passages(self):
             logger.info(f"Loading passages from {self.index_path}")
             passages_path = self._resolve_path(self.index_path, self.PASSAGE_FILENAME)
    +        if not strtobool(os.environ.get("TRUST_REMOTE_CODE", "False")):
    +            raise ValueError(
    +                "This part uses `pickle.load` which is insecure and will execute arbitrary code that is potentially "
    +                "malicious. It's recommended to never unpickle data that could have come from an untrusted source, or "
    +                "that could have been tampered with. If you already verified the pickle data and decided to use it, "
    +                "you can set the environment variable `TRUST_REMOTE_CODE` to `True` to allow it."
    +            )
             with open(passages_path, "rb") as passages_file:
                 passages = pickle.load(passages_file)
             return passages
    @@ -140,6 +147,13 @@ def _deserialize_index(self):
             resolved_index_path = self._resolve_path(self.index_path, self.INDEX_FILENAME + ".index.dpr")
             self.index = faiss.read_index(resolved_index_path)
             resolved_meta_path = self._resolve_path(self.index_path, self.INDEX_FILENAME + ".index_meta.dpr")
    +        if not strtobool(os.environ.get("TRUST_REMOTE_CODE", "False")):
    +            raise ValueError(
    +                "This part uses `pickle.load` which is insecure and will execute arbitrary code that is potentially "
    +                "malicious. It's recommended to never unpickle data that could have come from an untrusted source, or "
    +                "that could have been tampered with. If you already verified the pickle data and decided to use it, "
    +                "you can set the environment variable `TRUST_REMOTE_CODE` to `True` to allow it."
    +            )
             with open(resolved_meta_path, "rb") as metadata_file:
                 self.index_id_to_db_id = pickle.load(metadata_file)
             assert (
    
  • tests/models/rag/test_retrieval_rag.py +0 -59 modified
    @@ -14,7 +14,6 @@
     
     import json
     import os
    -import pickle
     import shutil
     import tempfile
     from unittest import TestCase
    @@ -174,37 +173,6 @@ def get_dummy_custom_hf_index_retriever(self, from_disk: bool):
                 )
             return retriever
     
    -    def get_dummy_legacy_index_retriever(self):
    -        dataset = Dataset.from_dict(
    -            {
    -                "id": ["0", "1"],
    -                "text": ["foo", "bar"],
    -                "title": ["Foo", "Bar"],
    -                "embeddings": [np.ones(self.retrieval_vector_size + 1), 2 * np.ones(self.retrieval_vector_size + 1)],
    -            }
    -        )
    -        dataset.add_faiss_index("embeddings", string_factory="Flat", metric_type=faiss.METRIC_INNER_PRODUCT)
    -
    -        index_file_name = os.path.join(self.tmpdirname, "hf_bert_base.hnswSQ8_correct_phi_128.c_index")
    -        dataset.save_faiss_index("embeddings", index_file_name + ".index.dpr")
    -        pickle.dump(dataset["id"], open(index_file_name + ".index_meta.dpr", "wb"))
    -
    -        passages_file_name = os.path.join(self.tmpdirname, "psgs_w100.tsv.pkl")
    -        passages = {sample["id"]: [sample["text"], sample["title"]] for sample in dataset}
    -        pickle.dump(passages, open(passages_file_name, "wb"))
    -
    -        config = RagConfig(
    -            retrieval_vector_size=self.retrieval_vector_size,
    -            question_encoder=DPRConfig().to_dict(),
    -            generator=BartConfig().to_dict(),
    -            index_name="legacy",
    -            index_path=self.tmpdirname,
    -        )
    -        retriever = RagRetriever(
    -            config, question_encoder_tokenizer=self.get_dpr_tokenizer(), generator_tokenizer=self.get_bart_tokenizer()
    -        )
    -        return retriever
    -
         def test_canonical_hf_index_retriever_retrieve(self):
             n_docs = 1
             retriever = self.get_dummy_canonical_hf_index_retriever()
    @@ -288,33 +256,6 @@ def test_custom_hf_index_retriever_save_and_from_pretrained_from_disk(self):
                 out = retriever.retrieve(hidden_states, n_docs=1)
                 self.assertTrue(out is not None)
     
    -    def test_legacy_index_retriever_retrieve(self):
    -        n_docs = 1
    -        retriever = self.get_dummy_legacy_index_retriever()
    -        hidden_states = np.array(
    -            [np.ones(self.retrieval_vector_size), -np.ones(self.retrieval_vector_size)], dtype=np.float32
    -        )
    -        retrieved_doc_embeds, doc_ids, doc_dicts = retriever.retrieve(hidden_states, n_docs=n_docs)
    -        self.assertEqual(retrieved_doc_embeds.shape, (2, n_docs, self.retrieval_vector_size))
    -        self.assertEqual(len(doc_dicts), 2)
    -        self.assertEqual(sorted(doc_dicts[0]), ["text", "title"])
    -        self.assertEqual(len(doc_dicts[0]["text"]), n_docs)
    -        self.assertEqual(doc_dicts[0]["text"][0], "bar")  # max inner product is reached with second doc
    -        self.assertEqual(doc_dicts[1]["text"][0], "foo")  # max inner product is reached with first doc
    -        self.assertListEqual(doc_ids.tolist(), [[1], [0]])
    -
    -    def test_legacy_hf_index_retriever_save_and_from_pretrained(self):
    -        retriever = self.get_dummy_legacy_index_retriever()
    -        with tempfile.TemporaryDirectory() as tmp_dirname:
    -            retriever.save_pretrained(tmp_dirname)
    -            retriever = RagRetriever.from_pretrained(tmp_dirname)
    -            self.assertIsInstance(retriever, RagRetriever)
    -            hidden_states = np.array(
    -                [np.ones(self.retrieval_vector_size), -np.ones(self.retrieval_vector_size)], dtype=np.float32
    -            )
    -            out = retriever.retrieve(hidden_states, n_docs=1)
    -            self.assertTrue(out is not None)
    -
         @require_torch
         @require_tokenizers
         @require_sentencepiece
    

Vulnerability mechanics

Generated on May 9, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.

References

5
