Deserialization of Untrusted Data in huggingface/transformers
Description
Deserialization of Untrusted Data in GitHub repository huggingface/transformers prior to 4.36.
AI Insight
LLM-synthesized narrative grounded in this CVE's description and references.
Deserialization of untrusted data in Hugging Face Transformers <4.36 allows arbitrary code execution via pickle.load.
Vulnerability
Overview
The vulnerability is a deserialization of untrusted data flaw in the Hugging Face Transformers library prior to version 4.36 [1][2]. The issue stems from calling Python's pickle.load on data that may come from an untrusted source: unpickling can invoke callables named inside the file, so loading a crafted pickle runs attacker-chosen code. Specifically, the affected code paths (the deprecated TransfoXL tokenizer and the RAG retriever's legacy index loader) unpickled files without any validation or opt-in check [3].
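
To make the mechanism concrete, here is a minimal, benign sketch of why `pickle.load` is unsafe on untrusted input. It is not taken from the Transformers codebase; the class name `Demo` is hypothetical, and `print` stands in for whatever callable an attacker would actually choose.

```python
import pickle


class Demo:
    """Hypothetical object whose pickled form requests a function call."""

    def __reduce__(self):
        # Instead of object state, the pickle records "call print(...)".
        # The unpickler performs that call while loading the data.
        return (print, ("this ran inside pickle.loads()",))


# Serializing does not run the call; it only records it in the byte stream.
payload = pickle.dumps(Demo())

# Deserializing runs print(...) and returns whatever the call returned.
restored = pickle.loads(payload)
```

The loaded value is simply the return value of the recorded call, so the caller never needs to use the result for the side effect to happen.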
Exploitation
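To exploit this, an attacker must get a specially crafted pickle file loaded by an affected code path, for example as the pretrained vocabulary file of the deprecated TransfoXL tokenizer or as the passages or index metadata file of a legacy RAG index [3]. No authentication is required if the victim loads the malicious file from a public endpoint or a shared repository. Before the fix, `_load_passages` and `_deserialize_index` in retrieval_rag.py, and the TransfoXL tokenizer's `__init__` and `get_lm_corpus`, called pickle.load on these files without restrictions [3]. The same commit removed the test helper get_dummy_legacy_index_retriever, which built legacy index fixtures with pickle.dump [3].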
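
As an illustration of what a crafted file looks like at the byte level, the sketch below lists the opcodes in a pickle that import names or call objects, using the standard-library `pickletools` module. It is a heuristic for inspection only, not a safety check and not part of the Transformers fix; the function name `suspicious_opcodes` and the opcode set are assumptions made for this example.

```python
import pickle
import pickletools

# Opcodes that import a global or call/construct an object during loading.
# Assumption: this set is illustrative, not an exhaustive or official list.
SUSPICIOUS = {
    "GLOBAL", "STACK_GLOBAL", "INST", "OBJ",
    "NEWOBJ", "NEWOBJ_EX", "REDUCE", "BUILD",
}


def suspicious_opcodes(data: bytes) -> list:
    """Return the sorted names of suspicious opcodes found in `data`.

    The stream is only disassembled, never loaded, so nothing in it runs.
    """
    found = set()
    for opcode, _arg, _pos in pickletools.genops(data):
        if opcode.name in SUSPICIOUS:
            found.add(opcode.name)
    return sorted(found)


# A pickle of plain containers and strings needs none of these opcodes.
plain = pickle.dumps({"id": ["0", "1"], "text": ["foo", "bar"]})
plain_report = suspicious_opcodes(plain)
```

Legitimate pickles of custom classes also use some of these opcodes, so a non-empty result means "needs review", not "malicious".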
Impact
Successful exploitation grants the attacker arbitrary code execution in the context of the application using Transformers. This can lead to full compromise of the system, including data theft, installation of malware, or further lateral movement.
Mitigation
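The vulnerability is patched in Transformers version 4.36.0 and later. The fix raises a ValueError before each affected pickle.load call unless the environment variable TRUST_REMOTE_CODE is explicitly set to True [3]. Setting the variable re-enables unpickling, so it should be set only for pickle files that have already been verified. Users are advised to upgrade to the latest version. The issue is also tracked as PYSEC-2023-301 [4].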
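
The opt-in gate described above can be sketched in a few lines. This is a standalone approximation of the pattern, not the library's code: `env_flag` is a hypothetical stand-in for the `strtobool` helper the patch imports from `transformers.utils`, and `load_pickle_if_trusted` is an illustrative name.

```python
import os
import pickle


def env_flag(name, default="False"):
    """Hypothetical stand-in for `strtobool`: parse an env var as a boolean."""
    value = os.environ.get(name, default).strip().lower()
    if value in {"1", "true", "yes", "y", "on", "t"}:
        return True
    if value in {"0", "false", "no", "n", "off", "f"}:
        return False
    raise ValueError(f"invalid truth value {value!r} for {name}")


def load_pickle_if_trusted(path):
    """Unpickle `path` only if the user opted in via TRUST_REMOTE_CODE."""
    if not env_flag("TRUST_REMOTE_CODE"):
        raise ValueError(
            "Refusing to call pickle.load: set TRUST_REMOTE_CODE=True only "
            "after verifying the file comes from a trusted source."
        )
    with open(path, "rb") as f:
        return pickle.load(f)
```

The gate does not make unpickling safe; it only moves the decision to the user, which is why the default is to refuse.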
- [1] GitHub: huggingface/transformers (repository)
- [2] NVD: CVE-2023-7018
- [3] Disallow `pickle.load` unless `TRUST_REMOTE_CODE=True` (#27776) · huggingface/transformers@1d63b0e
- [4] advisory-database/vulns/transformers/PYSEC-2023-301.yaml at main · pypa/advisory-database
AI Insight generated on May 20, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.
Affected packages
Versions sourced from the GitHub Security Advisory.
| Package | Affected versions | Patched versions |
|---|---|---|
| transformers (PyPI) | < 4.36.0 | 4.36.0 |
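
A quick way to compare an installed copy against the patched version in the table is sketched below. It uses only the standard library; the helper names are hypothetical, and the parser deliberately ignores pre-release and development suffixes, so treat it as a rough check rather than a full version comparison.

```python
from importlib.metadata import PackageNotFoundError, version

FIRST_PATCHED = (4, 36, 0)  # from the "Patched versions" column


def release_tuple(version_string):
    """Parse the leading digits of up to three dot-separated fields.

    Simplification: suffixes such as "rc1" or ".dev0" are ignored.
    """
    fields = []
    for piece in version_string.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        fields.append(int(digits) if digits else 0)
    fields += [0] * (3 - len(fields))
    return tuple(fields)


def is_patched(version_string):
    """True if the version is at or above the first patched release."""
    return release_tuple(version_string) >= FIRST_PATCHED


def installed_transformers_is_patched():
    """Return True/False for the installed package, or None if absent."""
    try:
        return is_patched(version("transformers"))
    except PackageNotFoundError:
        return None
```

Calling `installed_transformers_is_patched()` in the target environment returns `None` when the package is not installed, which keeps the check usable in build scripts.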
Affected products
- huggingface/huggingface/transformers (v5), Range: unspecified
Patches
1d63b0ec361e · Disallow `pickle.load` unless `TRUST_REMOTE_CODE=True` (#27776)
4 files changed · +39 −62
docs/source/en/model_doc/transfo-xl.md · +8 −2 · modified

    @@ -22,11 +22,17 @@ This model is in maintenance mode only, so we won't accept any new PRs changing

     We recommend switching to more recent models for improved security.

    -In case you would still like to use `TransfoXL` in your experiments, we recommend using the [Hub checkpoint](https://huggingface.co/transfo-xl-wt103) with a specific revision to ensure you are downloading safe files from the Hub:
    +In case you would still like to use `TransfoXL` in your experiments, we recommend using the [Hub checkpoint](https://huggingface.co/transfo-xl-wt103) with a specific revision to ensure you are downloading safe files from the Hub.

    -```
    +You will need to set the environment variable `TRUST_REMOTE_CODE` to `True` in order to allow the
    +usage of `pickle.load()`:
    +
    +```python
    +import os
     from transformers import TransfoXLTokenizer, TransfoXLLMHeadModel

    +os.environ["TRUST_REMOTE_CODE"] = "True"
    +
     checkpoint = 'transfo-xl-wt103'
     revision = '40a186da79458c9f9de846edfaea79c412137f97'
src/transformers/models/deprecated/transfo_xl/tokenization_transfo_xl.py · +16 −0 · modified

    @@ -34,6 +34,7 @@
         is_torch_available,
         logging,
         requires_backends,
    +    strtobool,
         torch_only_method,
     )

    @@ -212,6 +213,14 @@ def __init__(
                 vocab_dict = None
                 if pretrained_vocab_file is not None:
                     # Priority on pickle files (support PyTorch and TF)
    +                if not strtobool(os.environ.get("TRUST_REMOTE_CODE", "False")):
    +                    raise ValueError(
    +                        "This part uses `pickle.load` which is insecure and will execute arbitrary code that is "
    +                        "potentially malicious. It's recommended to never unpickle data that could have come from an "
    +                        "untrusted source, or that could have been tampered with. If you already verified the pickle "
    +                        "data and decided to use it, you can set the environment variable "
    +                        "`TRUST_REMOTE_CODE` to `True` to allow it."
    +                    )
                     with open(pretrained_vocab_file, "rb") as f:
                         vocab_dict = pickle.load(f)

    @@ -790,6 +799,13 @@ def get_lm_corpus(datadir, dataset):
             corpus = torch.load(fn_pickle)
         elif os.path.exists(fn):
             logger.info("Loading cached dataset from pickle...")
    +        if not strtobool(os.environ.get("TRUST_REMOTE_CODE", "False")):
    +            raise ValueError(
    +                "This part uses `pickle.load` which is insecure and will execute arbitrary code that is potentially "
    +                "malicious. It's recommended to never unpickle data that could have come from an untrusted source, or "
    +                "that could have been tampered with. If you already verified the pickle data and decided to use it, "
    +                "you can set the environment variable `TRUST_REMOTE_CODE` to `True` to allow it."
    +            )
             with open(fn, "rb") as fp:
                 corpus = pickle.load(fp)
         else:
src/transformers/models/rag/retrieval_rag.py · +15 −1 · modified

    @@ -23,7 +23,7 @@

     from ...tokenization_utils import PreTrainedTokenizer
     from ...tokenization_utils_base import BatchEncoding
    -from ...utils import cached_file, is_datasets_available, is_faiss_available, logging, requires_backends
    +from ...utils import cached_file, is_datasets_available, is_faiss_available, logging, requires_backends, strtobool
     from .configuration_rag import RagConfig
     from .tokenization_rag import RagTokenizer

    @@ -131,6 +131,13 @@ def _resolve_path(self, index_path, filename):
         def _load_passages(self):
             logger.info(f"Loading passages from {self.index_path}")
             passages_path = self._resolve_path(self.index_path, self.PASSAGE_FILENAME)
    +        if not strtobool(os.environ.get("TRUST_REMOTE_CODE", "False")):
    +            raise ValueError(
    +                "This part uses `pickle.load` which is insecure and will execute arbitrary code that is potentially "
    +                "malicious. It's recommended to never unpickle data that could have come from an untrusted source, or "
    +                "that could have been tampered with. If you already verified the pickle data and decided to use it, "
    +                "you can set the environment variable `TRUST_REMOTE_CODE` to `True` to allow it."
    +            )
             with open(passages_path, "rb") as passages_file:
                 passages = pickle.load(passages_file)
             return passages
    @@ -140,6 +147,13 @@ def _deserialize_index(self):
             resolved_index_path = self._resolve_path(self.index_path, self.INDEX_FILENAME + ".index.dpr")
             self.index = faiss.read_index(resolved_index_path)
             resolved_meta_path = self._resolve_path(self.index_path, self.INDEX_FILENAME + ".index_meta.dpr")
    +        if not strtobool(os.environ.get("TRUST_REMOTE_CODE", "False")):
    +            raise ValueError(
    +                "This part uses `pickle.load` which is insecure and will execute arbitrary code that is potentially "
    +                "malicious. It's recommended to never unpickle data that could have come from an untrusted source, or "
    +                "that could have been tampered with. If you already verified the pickle data and decided to use it, "
    +                "you can set the environment variable `TRUST_REMOTE_CODE` to `True` to allow it."
    +            )
             with open(resolved_meta_path, "rb") as metadata_file:
                 self.index_id_to_db_id = pickle.load(metadata_file)
             assert (
tests/models/rag/test_retrieval_rag.py · +0 −59 · modified

    @@ -14,7 +14,6 @@

     import json
     import os
    -import pickle
     import shutil
     import tempfile
     from unittest import TestCase
    @@ -174,37 +173,6 @@ def get_dummy_custom_hf_index_retriever(self, from_disk: bool):
                 )
             return retriever

    -    def get_dummy_legacy_index_retriever(self):
    -        dataset = Dataset.from_dict(
    -            {
    -                "id": ["0", "1"],
    -                "text": ["foo", "bar"],
    -                "title": ["Foo", "Bar"],
    -                "embeddings": [np.ones(self.retrieval_vector_size + 1), 2 * np.ones(self.retrieval_vector_size + 1)],
    -            }
    -        )
    -        dataset.add_faiss_index("embeddings", string_factory="Flat", metric_type=faiss.METRIC_INNER_PRODUCT)
    -
    -        index_file_name = os.path.join(self.tmpdirname, "hf_bert_base.hnswSQ8_correct_phi_128.c_index")
    -        dataset.save_faiss_index("embeddings", index_file_name + ".index.dpr")
    -        pickle.dump(dataset["id"], open(index_file_name + ".index_meta.dpr", "wb"))
    -
    -        passages_file_name = os.path.join(self.tmpdirname, "psgs_w100.tsv.pkl")
    -        passages = {sample["id"]: [sample["text"], sample["title"]] for sample in dataset}
    -        pickle.dump(passages, open(passages_file_name, "wb"))
    -
    -        config = RagConfig(
    -            retrieval_vector_size=self.retrieval_vector_size,
    -            question_encoder=DPRConfig().to_dict(),
    -            generator=BartConfig().to_dict(),
    -            index_name="legacy",
    -            index_path=self.tmpdirname,
    -        )
    -        retriever = RagRetriever(
    -            config, question_encoder_tokenizer=self.get_dpr_tokenizer(), generator_tokenizer=self.get_bart_tokenizer()
    -        )
    -        return retriever
    -
         def test_canonical_hf_index_retriever_retrieve(self):
             n_docs = 1
             retriever = self.get_dummy_canonical_hf_index_retriever()
    @@ -288,33 +256,6 @@ def test_custom_hf_index_retriever_save_and_from_pretrained_from_disk(self):
                 out = retriever.retrieve(hidden_states, n_docs=1)
                 self.assertTrue(out is not None)

    -    def test_legacy_index_retriever_retrieve(self):
    -        n_docs = 1
    -        retriever = self.get_dummy_legacy_index_retriever()
    -        hidden_states = np.array(
    -            [np.ones(self.retrieval_vector_size), -np.ones(self.retrieval_vector_size)], dtype=np.float32
    -        )
    -        retrieved_doc_embeds, doc_ids, doc_dicts = retriever.retrieve(hidden_states, n_docs=n_docs)
    -        self.assertEqual(retrieved_doc_embeds.shape, (2, n_docs, self.retrieval_vector_size))
    -        self.assertEqual(len(doc_dicts), 2)
    -        self.assertEqual(sorted(doc_dicts[0]), ["text", "title"])
    -        self.assertEqual(len(doc_dicts[0]["text"]), n_docs)
    -        self.assertEqual(doc_dicts[0]["text"][0], "bar")  # max inner product is reached with second doc
    -        self.assertEqual(doc_dicts[1]["text"][0], "foo")  # max inner product is reached with first doc
    -        self.assertListEqual(doc_ids.tolist(), [[1], [0]])
    -
    -    def test_legacy_hf_index_retriever_save_and_from_pretrained(self):
    -        retriever = self.get_dummy_legacy_index_retriever()
    -        with tempfile.TemporaryDirectory() as tmp_dirname:
    -            retriever.save_pretrained(tmp_dirname)
    -            retriever = RagRetriever.from_pretrained(tmp_dirname)
    -            self.assertIsInstance(retriever, RagRetriever)
    -            hidden_states = np.array(
    -                [np.ones(self.retrieval_vector_size), -np.ones(self.retrieval_vector_size)], dtype=np.float32
    -            )
    -            out = retriever.retrieve(hidden_states, n_docs=1)
    -            self.assertTrue(out is not None)
    -
         @require_torch
         @require_tokenizers
         @require_sentencepiece
Vulnerability mechanics
Generated on May 9, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.
References
- github.com/advisories/GHSA-v68g-wm8c-6x7j (ghsa, ADVISORY)
- nvd.nist.gov/vuln/detail/CVE-2023-7018 (ghsa, ADVISORY)
- github.com/huggingface/transformers/commit/1d63b0ec361e7a38f1339385e8a5a855085532ce (ghsa, WEB)
- github.com/pypa/advisory-database/tree/main/vulns/transformers/PYSEC-2023-301.yaml (ghsa, WEB)
- huntr.com/bounties/e1a3e548-e53a-48df-b708-9ee62140963c (ghsa, WEB)