Natural Language Toolkit (NLTK): URL-Encoded Path Traversal in nltk.data.load() Allows Arbitrary Local File Read
Description
NLTK's nltk.data.load() is vulnerable to path traversal via URL-encoded path separators, allowing arbitrary file read by bypassing the regex check.
AI Insight
LLM-synthesized narrative grounded in this CVE's description and references.
Vulnerability
nltk.data.load() in NLTK versions up to and including 3.9.4 is vulnerable to a path traversal attack when using the nltk: URL scheme [1][2]. The _UNSAFE_NO_PROTOCOL_RE regex check is performed on the raw resource string before url2pathname() decodes percent-encoded sequences. This decode-after-check flaw allows an attacker to bypass the security regex by encoding path traversal components such as %2f (/) and %2e (.) so they are invisible to the regex but become dangerous after decoding [1][2]. The affected code is in nltk/data.py, in find(), normalize_resource_url(), and the regex check itself [1][2].
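The ordering problem can be illustrated with a minimal sketch. The pattern `UNSAFE_RE` and the function `check_then_decode` are hypothetical stand-ins, not NLTK's actual regex or code, and `urllib.parse.unquote` stands in for the percent-decoding step that url2pathname() performs.

```python
import re
from urllib.parse import unquote

# Hypothetical traversal filter (NOT NLTK's real pattern): rejects a leading
# slash or a literal ".." path component.
UNSAFE_RE = re.compile(r"^/|(^|/)\.\.(/|$)")


def check_then_decode(resource: str) -> str:
    """Flawed ordering: validate the raw string, then percent-decode it."""
    if UNSAFE_RE.search(resource):
        raise ValueError(f"unsafe resource name: {resource!r}")
    # Decoding happens only after the check has already passed.
    return unquote(resource)


raw = "corpora/%2e%2e/%2e%2e/etc/passwd"
decoded = check_then_decode(raw)  # the encoded dots slip past the filter
print(decoded)
print(bool(UNSAFE_RE.search(decoded)))  # the same filter flags the decoded form
```

The raw string contains no literal `..` component, so the filter has nothing to match; the traversal only exists once the string has been decoded.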
Exploitation
An attacker can supply a crafted resource name containing URL-encoded traversal sequences (e.g., %2e%2e%2f..%2fetc%2fpasswd) to nltk.data.load() or nltk.data.find(). The regex check passes because the encoded dots and slashes do not match the literal patterns [1][2]. After the check, url2pathname() decodes the string; this example decodes to ../../etc/passwd, and with enough traversal components the resulting path reaches arbitrary files [1][2]. No authentication or special privileges are required, provided the application passes user-controlled input to load() [1][2].
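To see how a decoded resource name can leave the intended data directory, the sketch below joins it onto a base directory and normalizes the result. The base directory `/usr/share/nltk_data` is an assumed example location, `resolve` is a hypothetical helper rather than NLTK code, and `unquote` stands in for the decoding done by url2pathname(). `posixpath` is used so the result does not depend on the host operating system.

```python
import posixpath
from urllib.parse import unquote

BASE_DIR = "/usr/share/nltk_data"  # assumed data directory, for illustration only


def resolve(resource: str) -> str:
    """Decode a resource name and resolve it against BASE_DIR (hypothetical helper)."""
    decoded = unquote(resource)
    return posixpath.normpath(posixpath.join(BASE_DIR, decoded))


# A fully encoded payload decodes to an ordinary relative traversal path.
simple = unquote("%2e%2e%2f..%2fetc%2fpasswd")
print(simple)

# With enough encoded ".." components, the resolved path escapes BASE_DIR.
escaped = resolve("corpora/%2e%2e/%2e%2e/%2e%2e/%2e%2e/etc/passwd")
print(escaped)
print(escaped.startswith(BASE_DIR + "/"))
```

The number of `..` components needed depends on how deep the real data directory sits, which is why payloads typically include more than strictly necessary.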
Impact
Successful exploitation allows an attacker to read arbitrary files on the local filesystem, leading to disclosure of sensitive information such as configuration files, passwords, or application source code [1][2]. The vulnerability does not directly enable code execution, but leaked credentials may facilitate further attacks. The scope of compromise is limited to file read under the privileges of the NLTK process [1][2].
Mitigation
No official patch has been released as of the advisory publication date [1][2]. Users are advised to avoid using the nltk: URL scheme with untrusted input, or to implement additional input validation after decoding. The default configuration of NLTK does not enforce the security check (ENFORCE=False), so enabling strict validation may reduce risk [1][2]. Until a fix is available, applications should sanitize or reject any user-supplied resource names that contain percent-encoded characters [1][2].
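As an interim application-level guard, this advice could be sketched as a validator that runs before any user input reaches nltk.data.load(). The helper `is_safe_resource_name` is hypothetical and not part of NLTK. It is deliberately strict and assumes that the application accepts only bare relative resource names, so legitimate input never needs percent-encoding, a URL scheme, backslashes, or `..` components.

```python
def is_safe_resource_name(name: str) -> bool:
    """Hypothetical pre-filter for user-supplied resource names (not an NLTK API)."""
    if "%" in name:           # reject anything percent-encoded outright
        return False
    if ":" in name:           # reject URL schemes and drive letters
        return False
    if "\\" in name:          # reject Windows-style separators
        return False
    if name.startswith("/"):  # reject absolute paths
        return False
    # Reject literal traversal components in the plain path.
    return ".." not in name.split("/")


for candidate in ("corpora/brown", "corpora/%2e%2e/etc/passwd", "../etc/passwd"):
    print(candidate, is_safe_resource_name(candidate))
```

Rejecting the `%` character entirely, rather than trying to decode and re-check, avoids having to reason about multiple rounds of decoding.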
AI Insight generated on Jun 16, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.
Affected products: 2
Patches: 0
No patches discovered yet.
Vulnerability mechanics
Root cause
"The `_UNSAFE_NO_PROTOCOL_RE` regex check is performed on the raw resource string before `url2pathname()` decodes percent-encoded sequences, creating a decode-after-check / TOCTOU-style flaw."
Attack vector
An attacker supplies a resource identifier using the `nltk:` URL scheme with percent-encoded path separators (`%2f` for `/`) and traversal sequences (`%2e%2e` for `..`). Because the `_UNSAFE_NO_PROTOCOL_RE` regex checks the raw string before `url2pathname()` decodes the `%xx` sequences, the encoded variants bypass the regex and are later decoded into a real filesystem path, enabling arbitrary file read [CWE-22][ref_id=1][ref_id=2].
Affected code
The vulnerability resides in `data.py`, in the `find()` and `normalize_resource_url()` functions and the `_UNSAFE_NO_PROTOCOL_RE` regex. The regex check at lines L54–L69 operates on the raw, undecoded resource string, while the final path is constructed by `url2pathname()` at lines L650–L653, which decodes percent-encoded sequences after the security check has already passed [ref_id=1][ref_id=2].
What the fix does
The advisory states that the regex check must be performed against the decoded path, not the raw input, to close the decode-after-check window [ref_id=1][ref_id=2]. No patch diff is included in the bundle, but the recommended fix is to move the safety regex check to run after `url2pathname()` has decoded percent-encoded sequences, or to decode the input before applying the regex. Until a fix is applied, the `ENFORCE=True` configuration mode is noted as providing additional protection, though the default configuration remains vulnerable [ref_id=1][ref_id=2].
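The recommended ordering can be sketched as follows. This illustrates decode-then-check under stated assumptions and is not the project's patch: `UNSAFE_RE` and `decode_then_check` are hypothetical, and `urllib.parse.unquote` stands in for the decoding performed by `url2pathname()`.

```python
import re
from urllib.parse import unquote

# Hypothetical traversal filter (NOT NLTK's real pattern): rejects a leading
# slash or a literal ".." path component.
UNSAFE_RE = re.compile(r"^/|(^|/)\.\.(/|$)")


def decode_then_check(resource: str) -> str:
    """Percent-decode first, then validate the string that will be used as a path."""
    decoded = unquote(resource)
    if UNSAFE_RE.search(decoded):
        raise ValueError(f"unsafe resource name: {resource!r}")
    return decoded


print(decode_then_check("corpora/brown"))
for payload in ("%2fetc%2fpasswd", "corpora/..%2f..%2fetc/passwd"):
    try:
        decode_then_check(payload)
    except ValueError as exc:
        print("rejected:", exc)
```

The check has to apply to the exact string that is later used as a filesystem path. If a code path decodes more than once, the validation belongs after the final decode.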
Preconditions
- config: ENFORCE must be set to False (the default configuration; NLTK <= 3.9.4)
- input: Attacker-controlled input must reach nltk.data.load()
- network: Network access to the application or service that calls nltk.data.load()
Reproduction
The bundle includes a full Proof of Concept Python script. To reproduce, run the script directly against NLTK <= 3.9.4 with default configuration (ENFORCE=False). It tests multiple variants: an encoded absolute path (`nltk:%2fetc%2fpasswd`), encoded dot-dot traversal (`nltk:corpora/%2e%2e/%2e%2e/.../etc/passwd`), and a mixed variant with literal dots and encoded slashes (`nltk:corpora/..%2f..%2f.../etc/passwd`). The script prints `[VULN]` for each successful bypass [ref_id=1][ref_id=2].
Generated on Jun 16, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.
References: 2
News mentions: 0
No linked articles in our index yet.