UltraJSON: Malformed/Truncated UTF-8 Accepted and Silently Rewritten in ujson.dumps()
Description
Summary
ujson.dumps() (as well as ujson.dump() and ujson.encode()) has a reject_bytes=False option. When it is set, these functions may accept malformed or truncated UTF-8 byte sequences and silently rewrite them into different Unicode characters instead of rejecting them. This leads to input validation bypass and data integrity issues.
Details
The expected behavior is that, for any bytes object x, either x == ujson.loads(ujson.dumps(x, reject_bytes=False)).encode(errors="surrogatepass") holds or ujson.dumps() raises an exception. In reality, some inputs that should have been rejected are silently rewritten as other strings:
- Invalid continuation bytes are replaced with valid ones: b'\xcf\x13' -> b'\xcf\x93'
- Unterminated sequence completes the sequence: b'\xc3' -> b'\xc3\x80'
- ... or leads to reading past the end of string: b'\xf0\x90\x94' -> b"\xf0\x90\x94\x80inxcontrib'"
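The round-trip property above can be sketched as a small classifier. This is an illustrative harness, not part of UltraJSON: the names roundtrip_outcome and strict_dumps are hypothetical, and the standard-library json module stands in as a strict reference so the sketch runs without ujson installed.

```python
import json
from typing import Callable


def roundtrip_outcome(
    x: bytes,
    dumps: Callable[[bytes], str],
    loads: Callable[[str], str],
) -> str:
    """Classify one input as 'rejected', 'roundtrip', or 'rewritten'.

    Hypothetical helper: 'rewritten' marks the case the advisory describes,
    where encoding succeeds but the recovered bytes differ from the input.
    """
    try:
        encoded = dumps(x)
    except Exception:
        return "rejected"
    recovered = loads(encoded).encode("utf-8", errors="surrogatepass")
    return "roundtrip" if recovered == x else "rewritten"


def strict_dumps(x: bytes) -> str:
    """Reference encoder (an assumption for this sketch): decode with Python's
    own UTF-8 decoder, which raises UnicodeDecodeError on malformed input."""
    return json.dumps(x.decode("utf-8", errors="surrogatepass"))


# The three malformed inputs listed in the advisory.
for sample in (b"\xcf\x13", b"\xc3", b"\xf0\x90\x94"):
    print(sample, roundtrip_outcome(sample, strict_dumps, json.loads))
```

To probe an installed ujson build with the same helper, one could pass lambda b: ujson.dumps(b, reject_bytes=False) as dumps and ujson.loads as loads. Any input classified as 'rewritten' would violate the expected behavior.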
Impact
An application relying on reject_bytes=False for UTF-8 handling may experience:
- Data integrity issues
- Validation bypass, if said validation occurs before serialisation
Remediation
The missing/broken UTF-8 validation checks were added/fixed in https://github.com/ultrajson/ultrajson/commit/169eaf36b1116fece5034ee79a7a0ef3f6deedcf. We recommend upgrading to UltraJSON 5.13.0.
Workarounds
Decoding bytes to strings in Python before passing them to ujson.dumps() avoids this issue.
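A minimal sketch of this workaround follows. The wrapper name dumps_bytes_strict is hypothetical, and the standard-library json.dumps is used as the default serializer only so the example runs on its own; in an application the dumps argument would be ujson.dumps.

```python
import json
from typing import Callable


def dumps_bytes_strict(raw: bytes, dumps: Callable[[str], str] = json.dumps) -> str:
    """Decode in Python first, then serialize the resulting str.

    bytes.decode() defaults to errors="strict", so malformed or truncated
    UTF-8 raises UnicodeDecodeError before the serializer ever sees it.
    """
    text = raw.decode("utf-8")
    return dumps(text)


print(dumps_bytes_strict("h\u00e9llo".encode("utf-8")))

try:
    dumps_bytes_strict(b"\xc3")  # truncated 2-byte sequence from the advisory
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```

Note that strict decoding also rejects encoded surrogates, which the round-trip check in the Details section tolerates via errors="surrogatepass". Whether that stricter behavior is acceptable depends on the application.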
AI Insight
LLM-synthesized narrative grounded in this CVE's description and references.
Vulnerability mechanics
Root cause
"Missing UTF-8 validation in the encoder allows malformed byte sequences to be silently rewritten into different Unicode characters."
Attack vector
An attacker provides a bytes object containing malformed or truncated UTF-8 byte sequences to `ujson.dumps()` with `reject_bytes=False`. Instead of raising an error, the encoder silently rewrites the invalid bytes into different Unicode characters — for example, `b'\xcf\x13'` becomes `b'\xcf\x93'` and `b'\xc3'` becomes `b'\xc3\x80'`. This bypasses any input validation that occurs before serialization and corrupts data integrity [ref_id=1].
Affected code
The vulnerability resides in `src/ujson/lib/ultrajsonenc.c` in the `Buffer_EscapeStringValidated` function, which handles UTF-8 validation when `reject_bytes=False` is passed to `ujson.dumps()`, `ujson.dump()`, or `ujson.encode()`. The encoder lacked checks for invalid continuation bytes, unterminated sequences, overlong sequences, and codepoints above U+10FFFF. The decoder in `src/ujson/lib/ultrajsondec.c` also had minor error message inconsistencies.
What the fix does
The patch adds missing UTF-8 validation checks in `Buffer_EscapeStringValidated` [patch_id=6627426]. For 2-byte sequences, it now verifies the continuation byte matches `0b10xx_xxxx` and checks the remaining length is at least 2 bytes. For 3-byte and 4-byte sequences, similar continuation-byte and length checks are added, plus a new check that 4-byte sequences do not encode codepoints above U+10FFFF. These changes ensure that malformed byte sequences are rejected with an error instead of being silently rewritten.
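The categories of checks described above can be illustrated in Python. This is a sketch of the validation logic only, not a translation of the actual C patch: the function names are hypothetical, and the decision to let encoded surrogates through is an assumption made to match the surrogatepass round trip in the vendor description.

```python
def validate_utf8(data: bytes) -> None:
    """Raise ValueError for the malformed cases named in the advisory."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        # Determine sequence length, initial payload bits, and the smallest
        # code point that legitimately needs this length (overlong check).
        if lead < 0x80:
            length, cp, minimum = 1, lead, 0x0
        elif lead & 0xE0 == 0xC0:
            length, cp, minimum = 2, lead & 0x1F, 0x80
        elif lead & 0xF0 == 0xE0:
            length, cp, minimum = 3, lead & 0x0F, 0x800
        elif lead & 0xF8 == 0xF0:
            length, cp, minimum = 4, lead & 0x07, 0x10000
        else:
            raise ValueError(f"invalid lead byte at offset {i}")

        # Length check: the whole sequence must fit in the remaining input.
        if n - i < length:
            raise ValueError(f"unterminated sequence at offset {i}")

        # Continuation check: every trailing byte must match 0b10xx_xxxx.
        for b in data[i + 1 : i + length]:
            if b & 0xC0 != 0x80:
                raise ValueError(f"invalid continuation byte after offset {i}")
            cp = (cp << 6) | (b & 0x3F)

        if cp < minimum:
            raise ValueError(f"overlong sequence at offset {i}")
        if cp > 0x10FFFF:
            raise ValueError(f"code point above U+10FFFF at offset {i}")
        # Surrogates (U+D800..U+DFFF) are deliberately not rejected here.
        i += length


def is_valid_utf8(data: bytes) -> bool:
    """Boolean convenience wrapper around validate_utf8."""
    try:
        validate_utf8(data)
    except ValueError:
        return False
    return True


for sample in (b"\xcf\x13", b"\xc3", b"\xf0\x90\x94", b"\xcf\x93"):
    print(sample, is_valid_utf8(sample))
```

Performing the length check before reading any continuation byte is what prevents the read past the end of the string shown in the third example of the description.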
Preconditions
- config: The application must call ujson.dumps(), ujson.dump(), or ujson.encode() with reject_bytes=False
- input: The attacker must be able to supply a bytes object containing malformed UTF-8 sequences to the serialization call
Generated on Jun 19, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.