CVE-2025-58438
Description
internetarchive is a Python and command-line interface to Archive.org. In versions 5.5.0 and below, there is a directory traversal (path traversal) vulnerability in the File.download() method of the internetarchive library. File.download() does not properly sanitize user-supplied filenames or validate the final download path. A maliciously crafted filename could contain path traversal sequences (e.g., ../../../../windows/system32/file.txt) or illegal characters that, when processed, cause the file to be written outside of the intended target directory. An attacker could potentially overwrite critical system files or application configuration files, leading to denial of service, privilege escalation, or remote code execution, depending on the context in which the library is used. The vulnerability is particularly critical for users on Windows systems, but all operating systems are affected. This issue is fixed in version 5.5.1.
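To make the mechanism concrete, here is a minimal sketch (not the library's code) of why joining an untrusted filename onto a destination directory is unsafe. It uses `posixpath` and `ntpath` from the standard library so the result is the same on any host; `naive_target`, the directory names, and the filenames are all hypothetical.

```python
import ntpath
import posixpath


def naive_target(destdir: str, filename: str, flavour=posixpath) -> str:
    """Join an untrusted filename onto destdir with no validation,
    which is the unsafe pattern the description refers to."""
    return flavour.normpath(flavour.join(destdir, filename))


# A relative name containing ../ climbs out of the download directory.
posix_escape = naive_target("/home/user/downloads", "../../../etc/cron.d/job")

# With Windows path rules, backslashes act as separators as well.
windows_escape = naive_target(
    r"C:\Users\user\downloads", r"..\..\..\Windows\System32\file.txt", ntpath
)

# An absolute name makes join() discard destdir entirely.
absolute_escape = naive_target("/home/user/downloads", "/etc/passwd")

print(posix_escape)     # /etc/cron.d/job
print(windows_escape)   # C:\Windows\System32\file.txt
print(absolute_escape)  # /etc/passwd
```

None of the three results lies under the intended download directory, which is the condition the fix has to detect.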
Affected packages
Versions sourced from the GitHub Security Advisory.
| Package | Affected versions | Patched versions |
|---|---|---|
| internetarchive (PyPI) | < 5.5.1 | 5.5.1 |
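A quick way to apply the table is to compare a version string against the first patched release. The following is a rough sketch under stated assumptions: `release_tuple`, `is_affected`, and `installed_is_affected` are hypothetical helper names, only the leading numeric release segment is compared, and pre-release suffixes such as `.dev1` are ignored rather than ordered properly.

```python
import re
from importlib import metadata

PATCHED = (5, 5, 1)  # first fixed release according to the table above


def release_tuple(version: str) -> tuple:
    """Extract up to three leading numeric fields, e.g. '5.6.0.dev1' -> (5, 6, 0)."""
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", version.lstrip("vV"))
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part or 0) for part in match.groups())


def is_affected(version: str) -> bool:
    """True if the release segment sorts before the patched release."""
    return release_tuple(version) < PATCHED


def installed_is_affected(dist_name: str = "internetarchive"):
    """Check the installed distribution; returns None if it is not installed."""
    try:
        return is_affected(metadata.version(dist_name))
    except metadata.PackageNotFoundError:
        return None


print(is_affected("5.5.0"), is_affected("5.5.1"))
```

For anything beyond a quick check, a real version parser (for example the `packaging` library) would handle pre-release ordering more faithfully than this simplification.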
Affected products
- Range: v0.9.1, v0.9.3, v0.9.4, …
Patches
1 file changed · +3 −0
HISTORY.rst · +3 −0 · modified
@@ -13,6 +13,9 @@ Release History
 - Added path resolution checks to block directory traversal attacks.
 - Introduced warnings when filenames are sanitized to maintain user awareness.

+**Bugfixes**
+
+- Fixed bug in JSON parsing for ia upload --file-metadata ....

 5.5.0 (2025-07-17)
 ++++++++++++++++++
cba2d459e10a Merge branch 'sanitize-filename-downloads'
7 files changed · +302 −5
HISTORY.rst · +8 −4 · modified
@@ -3,12 +3,16 @@
 Release History
 ---------------

-5.6.0 (?)
-+++++++++
+5.5.1 (2025-09-05)
+++++++++++++++++++

-**Bugfixes**
+**Security**
+
+- **Fixed a critical directory traversal vulnerability in** File.download(). All users are urged to upgrade immediately. This prevents malicious filenames from writing files outside the target directory, a risk especially critical for Windows users.
+- Added automatic filename sanitization with platform-specific rules.
+- Added path resolution checks to block directory traversal attacks.
+- Introduced warnings when filenames are sanitized to maintain user awareness.

-- Fixed bug in JSON parsing for ``ia upload --file-metadata ...``.

 5.5.0 (2025-07-17)
 ++++++++++++++++++
internetarchive/files.py · +22 −0 · modified
@@ -29,6 +29,7 @@
 import sys
 from contextlib import nullcontext, suppress
 from email.utils import parsedate_to_datetime
+from pathlib import Path
 from time import sleep
 from urllib.parse import quote

@@ -233,6 +234,14 @@ def download( # noqa: C901,PLR0911,PLR0912,PLR0915
         self.item.session.mount_http_adapter(max_retries=retries)
         file_path = file_path or self.name

+        # Critical security check: Sanitize only the filename portion of file_path to
+        # prevent invalid characters and potential directory traversal issues.
+        # We use `utils.sanitize_filepath` instead of `utils.sanitize_filename` because:
+        # - `sanitize_filepath` preserves the directory path intact (does not encode path separators),
+        # - allowing `os.makedirs` to create intermediate directories correctly,
+        # - while still sanitizing just the filename to ensure it is safe for filesystem use.
+        file_path = utils.sanitize_filepath(file_path)
+
         if destdir:
             if return_responses is not True:
                 try:
@@ -243,6 +252,19 @@ def download( # noqa: C901,PLR0911,PLR0912,PLR0915
                     raise OSError(f'{destdir} is not a directory!')
             file_path = os.path.join(destdir, file_path)

+        # Critical security check: Prevent directory traversal attacks by ensuring
+        # the download path doesn't escape the target directory using path resolution
+        # and relative path validation. This protects against malicious filenames
+        # containing ../ sequences or other path manipulation attempts.
+        try:
+            # Resolve both paths to handle symlinks and absolute paths
+            target_path = Path(file_path).resolve()
+            base_dir = Path(destdir).resolve() if destdir else Path.cwd().resolve()
+            # Ensure the target path is relative to base directory
+            target_path.relative_to(base_dir)
+        except ValueError:
+            raise ValueError(f"Download path {file_path} is outside target directory {base_dir}")
+
         parent_dir = os.path.dirname(file_path)

         # Check if we should skip...
internetarchive/utils.py · +135 −0 · modified
@@ -29,8 +29,10 @@
 import hashlib
 import os
+import platform
 import re
 import sys
+import warnings
 from collections.abc import Mapping
 from typing import Iterable
 from xml.dom.minidom import parseString
@@ -464,3 +466,136 @@ def is_valid_email(email):
     # Ensures the TLD has at least 2 characters
     pattern = r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z]{2,}$'
     return re.match(pattern, email) is not None
+
+
+def is_windows() -> bool:
+    return (
+        platform.system().lower() == "windows"
+        or sys.platform.startswith("win")
+    )
+
+
+def sanitize_filepath(filepath: str, avoid_colon: bool = False) -> str:
+    """
+    Sanitizes only the filename part of a full file path, leaving the directory path intact.
+
+    This is useful when you need to ensure the filename is safe for filesystem use
+    without modifying the directory structure. Typically used before creating files
+    or directories to prevent invalid filename characters.
+
+    Args:
+        filepath (str): The full file path to sanitize.
+        avoid_colon (bool): If True, colon ':' in the filename will be percent-encoded
+            for macOS compatibility. Defaults to False.
+
+    Returns:
+        str: The sanitized file path with the filename portion percent-encoded as needed.
+    """
+    parent_dir = os.path.dirname(filepath)
+    filename = os.path.basename(filepath)
+    sanitized_filename = sanitize_filename(filename, avoid_colon)
+    return os.path.join(parent_dir, sanitized_filename)
+
+
+def sanitize_filename(name: str, avoid_colon: bool = False) -> str:
+    """
+    Sanitizes a filename by replacing invalid characters with percent-encoded values.
+    This function is designed to be compatible with both Windows and POSIX systems.
+
+    Args:
+        name (str): The original string to sanitize.
+        avoid_colon (bool): If True, colon ':' will be percent-encoded.
+
+    Returns:
+        str: A sanitized version of the filename.
+    """
+    original = name
+    if is_windows():
+        sanitized = sanitize_filename_windows(name)
+    else:
+        sanitized = sanitize_filename_posix(name, avoid_colon)
+
+    if sanitized != original:
+        warnings.warn(
+            f"Filename sanitized: original='{original}' sanitized='{sanitized}'",
+            UserWarning,
+            stacklevel=2
+        )
+
+    return sanitized
+
+
+def unsanitize_filename(name: str) -> str:
+    """
+    Reverses percent-encoding of the form %XX back to original characters.
+    Works for filenames sanitized by sanitize_filename (Windows or POSIX).
+
+    Args:
+        name (str): Sanitized filename string with %XX encodings.
+
+    Returns:
+        str: Original filename with all %XX sequences decoded.
+    """
+    if '%' in name:
+        if re.search(r'%[0-9A-Fa-f]{2}', name):
+            warnings.warn(
+                "Filename contains percent-encoded sequences that will be decoded.",
+                UserWarning,
+                stacklevel=2
+            )
+    def decode_match(match):
+        hex_value = match.group(1)
+        return chr(int(hex_value, 16))
+
+    return re.sub(r'%([0-9A-Fa-f]{2})', decode_match, name)
+
+
+def sanitize_filename_windows(name: str) -> str:
+    r"""
+    Replaces Windows-invalid filename characters with percent-encoded values.
+    Characters replaced: < > : " / \ | ? * %
+
+    Args:
+        name (str): The original string.
+
+    Returns:
+        str: A sanitized version safe for filesystem use.
+    """
+    # Encode `%` so that it's possible to round-trip (i.e. via `unsanitize_filename`)
+    invalid_chars = r'[<>:"/\\|?*\x00-\x1F%]'
+
+    def encode(char):
+        return f'%{ord(char.group()):02X}'
+
+    # Replace invalid characters
+    name = re.sub(invalid_chars, encode, name)
+
+    # Remove trailing dots or spaces (not allowed in Windows filenames)
+    return name.rstrip(' .')
+
+
+def sanitize_filename_posix(name: str, avoid_colon: bool = False) -> str:
+    """
+    Sanitizes filenames for Linux, BSD, and Unix-like systems.
+
+    - Percent-encodes forward slash '/' (always)
+    - Optionally percent-encodes colon ':' for macOS compatibility
+
+    Args:
+        name (str): Original filename string.
+        avoid_colon (bool): If True, colon ':' will be encoded.
+
+    Returns:
+        str: Sanitized filename safe for POSIX systems.
+    """
+    # Build regex pattern dynamically
+    chars_to_encode = r'/'
+    if avoid_colon:
+        chars_to_encode += ':'
+
+    pattern = f'[{re.escape(chars_to_encode)}]'
+
+    def encode_char(match):
+        return f'%{ord(match.group()):02X}'
+
+    return re.sub(pattern, encode_char, name)
internetarchive/__version__.py · +1 −1 · modified
@@ -1 +1 @@
-__version__ = '5.6.0.dev1'
+__version__ = '5.5.1'
README.rst · +4 −0 · modified
@@ -22,6 +22,10 @@ This package installs a command-line tool named ``ia`` for using Archive.org fro
 It also installs the ``internetarchive`` Python module for programmatic access to archive.org.
 Please report all bugs and issues on `Github <https://github.com/jjjake/internetarchive/issues>`__.

+SECURITY NOTICE
+_______________
+
+**Please upgrade to v5.4.2+ immediately.** Versions <=5.4.1 contain a critical directory traversal vulnerability in the `File.download()` method. [See the changelog for details](https://github.com/jjjake/internetarchive/blob/master/HISTORY.rst). Thank you to Pengo Wray for their contributions in identifying and resolving this issue.

 Installation
 ------------
tests/test_files.py · +42 −0 · added
@@ -0,0 +1,42 @@
+import os
+import re
+from unittest.mock import patch
+
+import pytest
+import responses
+
+from tests.conftest import PROTOCOL, IaRequestsMock
+
+DOWNLOAD_URL_RE = re.compile(f'{PROTOCOL}//archive.org/download/.*')
+EXPECTED_LAST_MOD_HEADER = {"Last-Modified": "Tue, 14 Nov 2023 20:25:48 GMT"}
+
+
+def test_file_download_sanitizes_filename(tmpdir, nasa_item):
+    tmpdir.chdir()
+
+    # Mock is_windows to return True to test Windows-style sanitization
+    with patch('internetarchive.utils.is_windows', return_value=True):
+        with IaRequestsMock(assert_all_requests_are_fired=False) as rsps:
+            rsps.add(responses.GET, DOWNLOAD_URL_RE,
+                     body='test content',
+                     adding_headers=EXPECTED_LAST_MOD_HEADER)
+            # Test filename with Windows-invalid characters
+            file_obj = nasa_item.get_file('nasa_meta.xml')
+            problematic_name = 'file:with<illegal>chars.xml'
+            file_obj.download(file_path=problematic_name, destdir=str(tmpdir))
+
+            # Should create sanitized filename with percent encoding
+            expected_name = 'file%3Awith%3Cillegal%3Echars.xml'
+            expected_path = os.path.join(str(tmpdir), expected_name)
+            assert os.path.exists(expected_path)
+
+
+def test_file_download_prevents_directory_traversal(tmpdir, nasa_item):
+    tmpdir.chdir()
+    # Don't mock the request since it won't be made due to the security check
+    with IaRequestsMock(assert_all_requests_are_fired=False):
+        # Test directory traversal attempt by getting the file and calling download directly
+        file_obj = nasa_item.get_file('nasa_meta.xml')
+        malicious_path = os.path.join('..', 'nasa_meta.xml')
+        with pytest.raises(ValueError, match="outside target directory"):
+            file_obj.download(file_path=malicious_path, destdir=str(tmpdir))
tests/test_utils.py · +90 −0 · modified
@@ -1,4 +1,8 @@
 import string
+import warnings
+from unittest.mock import patch
+
+import pytest

 import internetarchive.utils
 from tests.conftest import NASA_METADATA_PATH, IaRequestsMock
@@ -95,3 +99,89 @@ def test_is_valid_metadata_key():

     for metadata_key in invalid:
         assert not internetarchive.utils.is_valid_metadata_key(metadata_key)
+
+
+def test_is_windows():
+    with patch('platform.system', return_value='Windows'), \
+         patch('sys.platform', 'win32'):
+        assert internetarchive.utils.is_windows() is True
+
+    with patch('platform.system', return_value='Linux'), \
+         patch('sys.platform', 'linux'):
+        assert internetarchive.utils.is_windows() is False
+
+def test_sanitize_filename_windows():
+    test_cases = [
+        ('file:name.txt', 'file%3Aname.txt'),
+        ('file%name.txt', 'file%25name.txt'),
+        ('con.txt', 'con.txt'), # Reserved name, but no invalid chars so unchanged
+        ('file .txt', 'file .txt'), # Internal space preserved (not trailing)
+        ('file ', 'file'), # Trailing spaces removed
+        ('file..', 'file'), # Trailing dots removed
+        ('file . ', 'file'), # Trailing space and dot removed
+    ]
+
+    for input_name, expected in test_cases:
+        result = internetarchive.utils.sanitize_filename_windows(input_name)
+        assert result == expected
+
+
+def test_sanitize_filename_posix():
+    # Test without colon encoding
+    result = internetarchive.utils.sanitize_filename_posix('file/name.txt', False)
+    assert result == 'file%2Fname.txt'
+
+    # Test with colon encoding
+    result = internetarchive.utils.sanitize_filename_posix('file:name.txt', True)
+    assert result == 'file%3Aname.txt'
+
+    # Test mixed encoding
+    result = internetarchive.utils.sanitize_filename_posix('file/:name.txt', True)
+    assert result == 'file%2F%3Aname.txt'
+
+
+def test_unsanitize_filename():
+    test_cases = [
+        ('file%3Aname.txt', 'file:name.txt'),
+        ('file%2Fname.txt', 'file/name.txt'),
+        ('file%25name.txt', 'file%name.txt'), # Percent sign
+        ('normal.txt', 'normal.txt'), # No encoding
+    ]
+
+    for input_name, expected in test_cases:
+        with warnings.catch_warnings(record=True) as w:
+            result = internetarchive.utils.unsanitize_filename(input_name)
+            assert result == expected
+            if '%' in input_name:
+                assert len(w) == 1
+                assert issubclass(w[0].category, UserWarning)
+
+
+def test_sanitize_filename():
+    # Test Windows path
+    with patch('internetarchive.utils.is_windows', return_value=True):
+        with warnings.catch_warnings(record=True) as w:
+            result = internetarchive.utils.sanitize_filename('file:name.txt')
+            assert result == 'file%3Aname.txt'
+            assert len(w) == 1
+            assert "sanitized" in str(w[0].message)
+
+    # Test POSIX path
+    with patch('internetarchive.utils.is_windows', return_value=False):
+        result = internetarchive.utils.sanitize_filename('file/name.txt', False)
+        assert result == 'file%2Fname.txt'
+
+
+def test_sanitize_filepath():
+    # Test with colon encoding
+    result = internetarchive.utils.sanitize_filepath('/path/to/file:name.txt', True)
+    assert result == '/path/to/file%3Aname.txt'
+
+    # Test without colon encoding
+    result = internetarchive.utils.sanitize_filepath('/path/to/file:name.txt', False)
+    assert result == '/path/to/file:name.txt' # Colon not encoded on POSIX by default
+
+    # Test Windows path (mocked)
+    with patch('internetarchive.utils.is_windows', return_value=True):
+        result = internetarchive.utils.sanitize_filepath('/path/to/con.txt')
+        assert result == '/path/to/con.txt' # Reserved name sanitized
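One property of the patch is worth spelling out: `sanitize_filepath` encodes only the final path component, so on its own it does not neutralize `../` sequences in the directory part; the `resolve()` / `relative_to()` check added to `files.py` is what rejects them. The sketch below is a simplified restatement of the helpers in the diff, not the library code. The `sketch_` names are hypothetical, warnings are omitted, and `posixpath` is used so the behaviour is the same on any host.

```python
import posixpath
import re

# Character class copied from sanitize_filename_windows in the diff.
WINDOWS_INVALID = r'[<>:"/\\|?*\x00-\x1F%]'


def _percent_encode(match: re.Match) -> str:
    """Replace one matched character with its %XX form."""
    return f"%{ord(match.group()):02X}"


def sketch_sanitize_filename(name: str) -> str:
    """POSIX rule from the diff: percent-encode '/' in a single component."""
    return re.sub(r"/", _percent_encode, name)


def sketch_sanitize_windows(name: str) -> str:
    """Windows rule from the diff: encode invalid characters, then strip
    trailing dots and spaces."""
    return re.sub(WINDOWS_INVALID, _percent_encode, name).rstrip(" .")


def sketch_sanitize_filepath(filepath: str) -> str:
    """Sanitize only the basename; the directory part passes through unchanged."""
    parent, filename = posixpath.split(filepath)
    return posixpath.join(parent, sketch_sanitize_filename(filename))


# The traversal prefix survives basename-only sanitization untouched.
print(sketch_sanitize_filepath("../../etc/passwd"))
# Illegal characters in the final component are percent-encoded.
print(sketch_sanitize_windows("file:with<illegal>chars.xml"))
```

Under these assumptions the first call returns its input unchanged, which is why the sanitizer and the containment check are complementary rather than redundant.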
Vulnerability mechanics
Generated by null/stub on May 9, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.
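The core of the fix shown in the `files.py` diff is a containment check: resolve both the destination directory and the candidate path, then require the candidate to sit under the directory. A minimal standalone sketch of that technique follows. `ensure_within` is a hypothetical helper name, not part of `internetarchive`, and the temporary directory stands in for a download target.

```python
import os
import tempfile
from pathlib import Path


def ensure_within(base_dir: str, candidate: str) -> Path:
    """Return the resolved candidate path, or raise ValueError if it would
    land outside base_dir once '..' segments and symlinks are resolved."""
    base = Path(base_dir).resolve()
    # If candidate is absolute, the / operator discards base, so the
    # relative_to() check below rejects it as well.
    target = (base / candidate).resolve()
    try:
        target.relative_to(base)
    except ValueError:
        raise ValueError(
            f"Download path {candidate} is outside target directory {base}"
        ) from None
    return target


with tempfile.TemporaryDirectory() as destdir:
    safe = ensure_within(destdir, os.path.join("sub", "nasa_meta.xml"))
    print("accepted:", safe.name)
    try:
        ensure_within(destdir, os.path.join("..", "nasa_meta.xml"))
    except ValueError as exc:
        print("rejected:", exc)
```

Resolving before comparing matters: a plain string-prefix test on the unresolved path would miss `..` segments and symlinked directories.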
References
- github.com/advisories/GHSA-wx3r-v6h7-frjp (ghsa, ADVISORY)
- nvd.nist.gov/vuln/detail/CVE-2025-58438 (ghsa, ADVISORY)
- github.com/jjjake/internetarchive/commit/cba2d459e10a9489fb35caeba0b03e80f5f5d7c2 (nvd, WEB)
- github.com/jjjake/internetarchive/releases/tag/v5.5.1 (nvd, WEB)
- github.com/jjjake/internetarchive/security/advisories/GHSA-wx3r-v6h7-frjp (nvd, WEB)
- lists.debian.org/debian-lts-announce/2025/09/msg00030.html (nvd, WEB)