VYPR
Medium severityNVD Advisory· Published Jun 3, 2026

CVE-2026-3276

CVE-2026-3276

Description

Python's unicodedata.normalize() has a CPU-intensive vulnerability when processing crafted Unicode strings, potentially leading to denial of service.

AI Insight

LLM-synthesized narrative grounded in this CVE's description and references.

Python's unicodedata.normalize() has a CPU-intensive vulnerability when processing crafted Unicode strings, potentially leading to denial of service.

Vulnerability

The unicodedata.normalize() function in CPython is vulnerable to a denial-of-service attack. Specially crafted Unicode input, containing long runs of combining characters with alternating Canonical Combining Class values, can cause excessive CPU consumption. This issue affects all normalization forms and was introduced by an inefficient insertion sort algorithm for canonical ordering [1, 2].

Exploitation

An attacker can exploit this vulnerability by sending a specially crafted Unicode string to an application that uses unicodedata.normalize(). The vulnerability is triggered by the processing of this string, which leads to a quadratic time complexity in the unicodedata.normalize() function, consuming excessive CPU resources and potentially causing a denial of service [1, 2].

Impact

Successful exploitation of this vulnerability can lead to a denial of service (DoS) by consuming excessive CPU resources. This can render the affected application or system unresponsive, impacting its availability [2]. The scope of the impact is limited to the process running the vulnerable Python code.

Mitigation

A fix has been implemented by replacing the O(n^2) insertion sort with a hybrid approach (insertion sort for short runs and counting sort for longer runs), reducing the worst-case complexity to O(n) [1]. The specific patched version and release date are not explicitly stated in the provided references, but the pull request was merged on June 2, 2026 [1]. Users are advised to update to a patched version of CPython once available. No workarounds are mentioned in the available references.

AI Insight generated on Jun 3, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.

Affected products

1

Patches

1
991224b1e831

gh-149079: Fix O(n^2) canonical ordering in unicodedata.normalize() (GH-149080)

https://github.com/python/cpythonSeth LarsonJun 2, 2026via body-scan
3 files changed · +150 26
  • Lib/test/test_unicodedata.py+28 0 modified
    @@ -616,6 +616,34 @@ def test_issue10254(self):
             b = 'C\u0338' * 20  + '\xC7'
             self.assertEqual(self.db.normalize('NFC', a), b)
     
    +    def test_long_combining_mark_run(self):
    +        # gh-149079: avoid quadratic canonical ordering.
    +        payload = "a" + ("\u0300\u0327" * 32)
    +        nfd = "a" + ("\u0327" * 32) + ("\u0300" * 32)
    +        nfc = "\u00e0" + ("\u0327" * 32) + ("\u0300" * 31)
    +
    +        self.assertEqual(self.db.normalize("NFD", payload), nfd)
    +        self.assertEqual(self.db.normalize("NFKD", payload), nfd)
    +        self.assertEqual(self.db.normalize("NFC", payload), nfc)
    +        self.assertEqual(self.db.normalize("NFKC", payload), nfc)
    +
    +    def test_combining_mark_run_fast_paths(self):
    +        # gh-149079: cover short runs and already-sorted long runs.
    +        short_payload = "a" + ("\u0300\u0327" * 9) + "\u0300"
    +        short_nfd = "a" + ("\u0327" * 9) + ("\u0300" * 10)
    +        short_nfc = "\u00e0" + ("\u0327" * 9) + ("\u0300" * 9)
    +        long_sorted = "a" + ("\u0327" * 30) + ("\u0300" * 30)
    +        long_sorted_nfc = "\u00e0" + ("\u0327" * 30) + ("\u0300" * 29)
    +
    +        self.assertEqual(self.db.normalize("NFD", short_payload), short_nfd)
    +        self.assertEqual(self.db.normalize("NFKD", short_payload), short_nfd)
    +        self.assertEqual(self.db.normalize("NFC", short_payload), short_nfc)
    +        self.assertEqual(self.db.normalize("NFKC", short_payload), short_nfc)
    +        self.assertEqual(self.db.normalize("NFD", long_sorted), long_sorted)
    +        self.assertEqual(self.db.normalize("NFKD", long_sorted), long_sorted)
    +        self.assertEqual(self.db.normalize("NFC", long_sorted), long_sorted_nfc)
    +        self.assertEqual(self.db.normalize("NFKC", long_sorted), long_sorted_nfc)
    +
         def test_issue29456(self):
             # Fix #29456
             u1176_str_a = '\u1100\u1176\u11a8'
    
  • Misc/NEWS.d/next/Security/2026-04-27-16-36-11.gh-issue-149079.vKl-LM.rst+5 0 added
    @@ -0,0 +1,5 @@
    +Fix a potential denial of service in :func:`unicodedata.normalize`. The
    +canonical ordering step of Unicode normalization used a quadratic-time insertion
    +sort for reordering combining characters, which could be exploited with
    +crafted input containing many combining characters in non-canonical order.
    +Replaced with a linear-time counting sort for long runs.
    
  • Modules/unicodedata.c+117 26 modified
    @@ -556,19 +556,80 @@ get_decomp_record(PyObject *self, Py_UCS4 code,
         (*index)++;
     }
     
    +/* Small combining runs are usually cheaper with insertion sort. */
    +#define CANONICAL_ORDERING_COUNTING_SORT_THRESHOLD 20
    +
    +static void
    +canonical_ordering_sort_insertion(int kind, void *data,
    +                                  Py_ssize_t start, Py_ssize_t end)
    +{
    +    for (Py_ssize_t i = start + 1; i < end; i++) {
    +        Py_UCS4 code = PyUnicode_READ(kind, data, i);
    +        unsigned char combining = _getrecord_ex(code)->combining;
    +        Py_ssize_t j = i;
    +
    +        while (j > start) {
    +            Py_UCS4 previous = PyUnicode_READ(kind, data, j - 1);
    +            if (_getrecord_ex(previous)->combining <= combining) {
    +                break;
    +            }
    +            PyUnicode_WRITE(kind, data, j, previous);
    +            j--;
    +        }
    +        if (j != i) {
    +            PyUnicode_WRITE(kind, data, j, code);
    +        }
    +    }
    +}
    +
    +static void
    +canonical_ordering_sort_counting(int kind, void *data,
    +                                 Py_ssize_t start, Py_ssize_t end,
    +                                 Py_UCS4 *sortbuf)
    +{
    +    Py_ssize_t counts[256] = {0};
    +    Py_ssize_t run_length = end - start;
    +    Py_ssize_t total = 0;
    +
    +    for (Py_ssize_t i = start; i < end; i++) {
    +        Py_UCS4 code = PyUnicode_READ(kind, data, i);
    +        unsigned char combining = _getrecord_ex(code)->combining;
    +        counts[combining]++;
    +    }
    +
    +    for (size_t i = 0; i < Py_ARRAY_LENGTH(counts); i++) {
    +        Py_ssize_t count = counts[i];
    +        counts[i] = total;
    +        total += count;
    +    }
    +
    +    /* Reuse counts[] as the next output slot for each CCC. */
    +    for (Py_ssize_t i = start; i < end; i++) {
    +        Py_UCS4 code = PyUnicode_READ(kind, data, i);
    +        unsigned char combining = _getrecord_ex(code)->combining;
    +        sortbuf[counts[combining]++] = code;
    +    }
    +    for (Py_ssize_t i = 0; i < run_length; i++) {
    +        PyUnicode_WRITE(kind, data, start + i, sortbuf[i]);
    +    }
    +}
    +
     static PyObject*
     nfd_nfkd(PyObject *self, PyObject *input, int k)
     {
         PyObject *result;
         Py_UCS4 *output;
         Py_ssize_t i, o, osize;
    -    int kind;
    -    const void *data;
    +    int input_kind, result_kind;
    +    const void *input_data;
    +    void *result_data;
         /* Longest decomposition in Unicode 3.2: U+FDFA */
         Py_UCS4 stack[20];
         Py_ssize_t space, isize;
         int index, prefix, count, stackptr;
         unsigned char prev, cur;
    +    Py_UCS4 *sortbuf = NULL;
    +    Py_ssize_t sortbuflen = 0;
     
         stackptr = 0;
         isize = PyUnicode_GET_LENGTH(input);
    @@ -588,11 +649,11 @@ nfd_nfkd(PyObject *self, PyObject *input, int k)
             return NULL;
         }
         i = o = 0;
    -    kind = PyUnicode_KIND(input);
    -    data = PyUnicode_DATA(input);
    +    input_kind = PyUnicode_KIND(input);
    +    input_data = PyUnicode_DATA(input);
     
         while (i < isize) {
    -        stack[stackptr++] = PyUnicode_READ(kind, data, i++);
    +        stack[stackptr++] = PyUnicode_READ(input_kind, input_data, i++);
             while(stackptr) {
                 Py_UCS4 code = stack[--stackptr];
                 /* Hangul Decomposition adds three characters in
    @@ -660,34 +721,64 @@ nfd_nfkd(PyObject *self, PyObject *input, int k)
         if (!result)
             return NULL;
     
    -    kind = PyUnicode_KIND(result);
    -    data = PyUnicode_DATA(result);
    +    result_kind = PyUnicode_KIND(result);
    +    result_data = PyUnicode_DATA(result);
     
    -    /* Sort canonically. */
    +    /* Sort each consecutive combining-character run canonically. */
         i = 0;
    -    prev = _getrecord_ex(PyUnicode_READ(kind, data, i))->combining;
    -    for (i++; i < PyUnicode_GET_LENGTH(result); i++) {
    -        cur = _getrecord_ex(PyUnicode_READ(kind, data, i))->combining;
    -        if (prev == 0 || cur == 0 || prev <= cur) {
    -            prev = cur;
    +    while (i < o) {
    +        Py_ssize_t run_length, run_start;
    +        int needs_sort = 0;
    +
    +        Py_UCS4 ch = PyUnicode_READ(result_kind, result_data, i);
    +        prev = _getrecord_ex(ch)->combining;
    +        if (prev == 0) {
    +            i++;
                 continue;
             }
    -        /* Non-canonical order. Need to switch *i with previous. */
    -        o = i - 1;
    -        while (1) {
    -            Py_UCS4 tmp = PyUnicode_READ(kind, data, o+1);
    -            PyUnicode_WRITE(kind, data, o+1,
    -                            PyUnicode_READ(kind, data, o));
    -            PyUnicode_WRITE(kind, data, o, tmp);
    -            o--;
    -            if (o < 0)
    -                break;
    -            prev = _getrecord_ex(PyUnicode_READ(kind, data, o))->combining;
    -            if (prev == 0 || prev <= cur)
    +
    +        run_start = i++;
    +        while (i < o) {
    +            Py_UCS4 ch = PyUnicode_READ(result_kind, result_data, i);
    +            cur = _getrecord_ex(ch)->combining;
    +            if (cur == 0) {
                     break;
    +            }
    +            if (prev > cur) {
    +                needs_sort = 1;
    +            }
    +            prev = cur;
    +            i++;
    +        }
    +        if (!needs_sort) {
    +            continue;
    +        }
    +
    +        run_length = i - run_start;
    +        if (run_length < CANONICAL_ORDERING_COUNTING_SORT_THRESHOLD) {
    +            canonical_ordering_sort_insertion(result_kind, result_data,
    +                                              run_start, i);
    +            continue;
             }
    -        prev = _getrecord_ex(PyUnicode_READ(kind, data, i))->combining;
    +
    +        if (run_length > sortbuflen) {
    +            Py_UCS4 *new_sortbuf = PyMem_Resize(sortbuf,
    +                                                Py_UCS4,
    +                                                run_length);
    +            if (new_sortbuf == NULL) {
    +                PyErr_NoMemory();
    +                PyMem_Free(sortbuf);
    +                Py_DECREF(result);
    +                return NULL;
    +            }
    +            sortbuf = new_sortbuf;
    +            sortbuflen = run_length;
    +        }
    +
    +        canonical_ordering_sort_counting(result_kind, result_data,
    +                                         run_start, i, sortbuf);
         }
    +    PyMem_Free(sortbuf);
         return result;
     }
     
    

Vulnerability mechanics

Root cause

"The canonical ordering step in Unicode normalization used a quadratic-time algorithm for reordering combining characters."

Attack vector

An attacker can trigger this vulnerability by calling the `unicodedata.normalize()` function with specially crafted Unicode input. This input contains long runs of combining characters with alternating Canonical Combining Class values. The excessive CPU time consumed by the sorting algorithm can lead to a denial of service.

Affected code

The vulnerability exists within the `unicodedata.normalize()` function, specifically in the canonical ordering logic implemented in `Modules/unicodedata.c`. The commit modifies the `nfd_nfkd` function to introduce a hybrid sorting approach, replacing the previous insertion sort with counting sort for longer runs of combining characters [patch_id=4683557].

What the fix does

The patch replaces the insertion sort used for canonical ordering with a hybrid approach. For short runs of combining characters (less than 20), insertion sort is still used. For longer runs, a more efficient linear-time counting sort is employed. This change reduces the worst-case time complexity from O(n^2) to O(n), preventing the denial of service vulnerability [patch_id=4683557].

Generated on Jun 3, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.

References

3

News mentions

0

No linked articles in our index yet.