CVE-2026-3276
Description
Python's unicodedata.normalize() has a CPU-intensive vulnerability when processing crafted Unicode strings, potentially leading to denial of service.
AI Insight
LLM-synthesized narrative grounded in this CVE's description and references.
Python's unicodedata.normalize() has a CPU-intensive vulnerability when processing crafted Unicode strings, potentially leading to denial of service.
Vulnerability
The unicodedata.normalize() function in CPython is vulnerable to a denial-of-service attack. Specially crafted Unicode input, containing long runs of combining characters with alternating Canonical Combining Class values, can cause excessive CPU consumption. This issue affects all normalization forms and was introduced by an inefficient insertion sort algorithm for canonical ordering [1, 2].
Exploitation
An attacker can exploit this vulnerability by sending a specially crafted Unicode string to an application that uses unicodedata.normalize(). The vulnerability is triggered by the processing of this string, which leads to a quadratic time complexity in the unicodedata.normalize() function, consuming excessive CPU resources and potentially causing a denial of service [1, 2].
Impact
Successful exploitation of this vulnerability can lead to a denial of service (DoS) by consuming excessive CPU resources. This can render the affected application or system unresponsive, impacting its availability [2]. The scope of the impact is limited to the process running the vulnerable Python code.
Mitigation
A fix has been implemented by replacing the O(n^2) insertion sort with a hybrid approach (insertion sort for short runs and counting sort for longer runs), reducing the worst-case complexity to O(n) [1]. The specific patched version and release date are not explicitly stated in the provided references, but the pull request was merged on June 2, 2026 [1]. Users are advised to update to a patched version of CPython once available. No workarounds are mentioned in the available references.
AI Insight generated on Jun 3, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.
Affected products
1Patches
1991224b1e831gh-149079: Fix O(n^2) canonical ordering in unicodedata.normalize() (GH-149080)
3 files changed · +150 −26
Lib/test/test_unicodedata.py+28 −0 modified@@ -616,6 +616,34 @@ def test_issue10254(self): b = 'C\u0338' * 20 + '\xC7' self.assertEqual(self.db.normalize('NFC', a), b) + def test_long_combining_mark_run(self): + # gh-149079: avoid quadratic canonical ordering. + payload = "a" + ("\u0300\u0327" * 32) + nfd = "a" + ("\u0327" * 32) + ("\u0300" * 32) + nfc = "\u00e0" + ("\u0327" * 32) + ("\u0300" * 31) + + self.assertEqual(self.db.normalize("NFD", payload), nfd) + self.assertEqual(self.db.normalize("NFKD", payload), nfd) + self.assertEqual(self.db.normalize("NFC", payload), nfc) + self.assertEqual(self.db.normalize("NFKC", payload), nfc) + + def test_combining_mark_run_fast_paths(self): + # gh-149079: cover short runs and already-sorted long runs. + short_payload = "a" + ("\u0300\u0327" * 9) + "\u0300" + short_nfd = "a" + ("\u0327" * 9) + ("\u0300" * 10) + short_nfc = "\u00e0" + ("\u0327" * 9) + ("\u0300" * 9) + long_sorted = "a" + ("\u0327" * 30) + ("\u0300" * 30) + long_sorted_nfc = "\u00e0" + ("\u0327" * 30) + ("\u0300" * 29) + + self.assertEqual(self.db.normalize("NFD", short_payload), short_nfd) + self.assertEqual(self.db.normalize("NFKD", short_payload), short_nfd) + self.assertEqual(self.db.normalize("NFC", short_payload), short_nfc) + self.assertEqual(self.db.normalize("NFKC", short_payload), short_nfc) + self.assertEqual(self.db.normalize("NFD", long_sorted), long_sorted) + self.assertEqual(self.db.normalize("NFKD", long_sorted), long_sorted) + self.assertEqual(self.db.normalize("NFC", long_sorted), long_sorted_nfc) + self.assertEqual(self.db.normalize("NFKC", long_sorted), long_sorted_nfc) + def test_issue29456(self): # Fix #29456 u1176_str_a = '\u1100\u1176\u11a8'
Misc/NEWS.d/next/Security/2026-04-27-16-36-11.gh-issue-149079.vKl-LM.rst+5 −0 added@@ -0,0 +1,5 @@ +Fix a potential denial of service in :func:`unicodedata.normalize`. The +canonical ordering step of Unicode normalization used a quadratic-time insertion +sort for reordering combining characters, which could be exploited with +crafted input containing many combining characters in non-canonical order. +Replaced with a linear-time counting sort for long runs.
Modules/unicodedata.c+117 −26 modified@@ -556,19 +556,80 @@ get_decomp_record(PyObject *self, Py_UCS4 code, (*index)++; } +/* Small combining runs are usually cheaper with insertion sort. */ +#define CANONICAL_ORDERING_COUNTING_SORT_THRESHOLD 20 + +static void +canonical_ordering_sort_insertion(int kind, void *data, + Py_ssize_t start, Py_ssize_t end) +{ + for (Py_ssize_t i = start + 1; i < end; i++) { + Py_UCS4 code = PyUnicode_READ(kind, data, i); + unsigned char combining = _getrecord_ex(code)->combining; + Py_ssize_t j = i; + + while (j > start) { + Py_UCS4 previous = PyUnicode_READ(kind, data, j - 1); + if (_getrecord_ex(previous)->combining <= combining) { + break; + } + PyUnicode_WRITE(kind, data, j, previous); + j--; + } + if (j != i) { + PyUnicode_WRITE(kind, data, j, code); + } + } +} + +static void +canonical_ordering_sort_counting(int kind, void *data, + Py_ssize_t start, Py_ssize_t end, + Py_UCS4 *sortbuf) +{ + Py_ssize_t counts[256] = {0}; + Py_ssize_t run_length = end - start; + Py_ssize_t total = 0; + + for (Py_ssize_t i = start; i < end; i++) { + Py_UCS4 code = PyUnicode_READ(kind, data, i); + unsigned char combining = _getrecord_ex(code)->combining; + counts[combining]++; + } + + for (size_t i = 0; i < Py_ARRAY_LENGTH(counts); i++) { + Py_ssize_t count = counts[i]; + counts[i] = total; + total += count; + } + + /* Reuse counts[] as the next output slot for each CCC. */ + for (Py_ssize_t i = start; i < end; i++) { + Py_UCS4 code = PyUnicode_READ(kind, data, i); + unsigned char combining = _getrecord_ex(code)->combining; + sortbuf[counts[combining]++] = code; + } + for (Py_ssize_t i = 0; i < run_length; i++) { + PyUnicode_WRITE(kind, data, start + i, sortbuf[i]); + } +} + static PyObject* nfd_nfkd(PyObject *self, PyObject *input, int k) { PyObject *result; Py_UCS4 *output; Py_ssize_t i, o, osize; - int kind; - const void *data; + int input_kind, result_kind; + const void *input_data; + void *result_data; /* Longest decomposition in Unicode 3.2: U+FDFA */ Py_UCS4 stack[20]; Py_ssize_t space, isize; int index, prefix, count, stackptr; unsigned char prev, cur; + Py_UCS4 *sortbuf = NULL; + Py_ssize_t sortbuflen = 0; stackptr = 0; isize = PyUnicode_GET_LENGTH(input); @@ -588,11 +649,11 @@ nfd_nfkd(PyObject *self, PyObject *input, int k) return NULL; } i = o = 0; - kind = PyUnicode_KIND(input); - data = PyUnicode_DATA(input); + input_kind = PyUnicode_KIND(input); + input_data = PyUnicode_DATA(input); while (i < isize) { - stack[stackptr++] = PyUnicode_READ(kind, data, i++); + stack[stackptr++] = PyUnicode_READ(input_kind, input_data, i++); while(stackptr) { Py_UCS4 code = stack[--stackptr]; /* Hangul Decomposition adds three characters in @@ -660,34 +721,64 @@ nfd_nfkd(PyObject *self, PyObject *input, int k) if (!result) return NULL; - kind = PyUnicode_KIND(result); - data = PyUnicode_DATA(result); + result_kind = PyUnicode_KIND(result); + result_data = PyUnicode_DATA(result); - /* Sort canonically. */ + /* Sort each consecutive combining-character run canonically. */ i = 0; - prev = _getrecord_ex(PyUnicode_READ(kind, data, i))->combining; - for (i++; i < PyUnicode_GET_LENGTH(result); i++) { - cur = _getrecord_ex(PyUnicode_READ(kind, data, i))->combining; - if (prev == 0 || cur == 0 || prev <= cur) { - prev = cur; + while (i < o) { + Py_ssize_t run_length, run_start; + int needs_sort = 0; + + Py_UCS4 ch = PyUnicode_READ(result_kind, result_data, i); + prev = _getrecord_ex(ch)->combining; + if (prev == 0) { + i++; continue; } - /* Non-canonical order. Need to switch *i with previous. */ - o = i - 1; - while (1) { - Py_UCS4 tmp = PyUnicode_READ(kind, data, o+1); - PyUnicode_WRITE(kind, data, o+1, - PyUnicode_READ(kind, data, o)); - PyUnicode_WRITE(kind, data, o, tmp); - o--; - if (o < 0) - break; - prev = _getrecord_ex(PyUnicode_READ(kind, data, o))->combining; - if (prev == 0 || prev <= cur) + + run_start = i++; + while (i < o) { + Py_UCS4 ch = PyUnicode_READ(result_kind, result_data, i); + cur = _getrecord_ex(ch)->combining; + if (cur == 0) { break; + } + if (prev > cur) { + needs_sort = 1; + } + prev = cur; + i++; + } + if (!needs_sort) { + continue; + } + + run_length = i - run_start; + if (run_length < CANONICAL_ORDERING_COUNTING_SORT_THRESHOLD) { + canonical_ordering_sort_insertion(result_kind, result_data, + run_start, i); + continue; } - prev = _getrecord_ex(PyUnicode_READ(kind, data, i))->combining; + + if (run_length > sortbuflen) { + Py_UCS4 *new_sortbuf = PyMem_Resize(sortbuf, + Py_UCS4, + run_length); + if (new_sortbuf == NULL) { + PyErr_NoMemory(); + PyMem_Free(sortbuf); + Py_DECREF(result); + return NULL; + } + sortbuf = new_sortbuf; + sortbuflen = run_length; + } + + canonical_ordering_sort_counting(result_kind, result_data, + run_start, i, sortbuf); } + PyMem_Free(sortbuf); return result; }
Vulnerability mechanics
Root cause
"The canonical ordering step in Unicode normalization used a quadratic-time algorithm for reordering combining characters."
Attack vector
An attacker can trigger this vulnerability by calling the `unicodedata.normalize()` function with specially crafted Unicode input. This input contains long runs of combining characters with alternating Canonical Combining Class values. The excessive CPU time consumed by the sorting algorithm can lead to a denial of service.
Affected code
The vulnerability exists within the `unicodedata.normalize()` function, specifically in the canonical ordering logic implemented in `Modules/unicodedata.c`. The commit modifies the `nfd_nfkd` function to introduce a hybrid sorting approach, replacing the previous insertion sort with counting sort for longer runs of combining characters [patch_id=4683557].
What the fix does
The patch replaces the insertion sort used for canonical ordering with a hybrid approach. For short runs of combining characters (less than 20), insertion sort is still used. For longer runs, a more efficient linear-time counting sort is employed. This change reduces the worst-case time complexity from O(n^2) to O(n), preventing the denial of service vulnerability [patch_id=4683557].
Generated on Jun 3, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.
References
3News mentions
0No linked articles in our index yet.