VYPR
Research · Published May 12, 2026 · Updated May 18, 2026 · 1 source

Cloudflare's QUIC Implementation Hit by 'Death Spiral' Bug That Locked Congestion Window at Minimum, Causing Widespread Test Failures

Cloudflare disclosed a bug in its open-source QUIC implementation, quiche, in which the CUBIC congestion controller's window became permanently stuck at its minimum, causing about 60% of integration tests to fail.

Cloudflare engineers have uncovered and fixed a subtle but severe bug in their open-source QUIC implementation, quiche, that caused the CUBIC congestion controller's congestion window (cwnd) to become permanently pinned at its minimum value after a period of heavy packet loss. The bug, described in a detailed post on the Cloudflare blog, resulted in approximately 60% of integration tests failing with a 10-second timeout, as connections were unable to recover from a congestion collapse even after loss conditions ceased.

The root cause traced back to a Linux kernel change that aligned CUBIC with the app-limited exclusion described in RFC 9438 §4.2-12. When this change was ported to quiche, it created a death spiral under high early loss. In the failing scenario, after two seconds of 30% random packet loss, the congestion window dropped to its floor of 2700 bytes (two full-size packets). Once loss stopped, the window should have grown, but instead it remained flat while the congestion state oscillated between recovery and congestion avoidance approximately 999 times over 6.7 seconds. That is about one transition every 6.7 ms, or one full recovery-and-back cycle roughly every 13 ms, close to the connection's ~14 ms RTT.
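
To give a sense of how quickly sustained loss can pin the window at that floor, here is a minimal sketch of repeated multiplicative decrease. It is not quiche's code. The 1350-byte packet size is inferred from the 2700-byte, two-packet floor; the 10-packet initial window is an assumption; the 0.7 factor is the beta_cubic constant from RFC 9438; and the sketch ignores any window growth between congestion events.

```python
# Illustrative constants; only the 2700-byte floor comes from the article.
MAX_DATAGRAM_SIZE = 1350               # bytes, inferred from 2700 / 2 packets
MIN_CWND = 2 * MAX_DATAGRAM_SIZE       # the 2700-byte floor
INITIAL_CWND = 10 * MAX_DATAGRAM_SIZE  # assumed 10-packet initial window


def on_congestion_event(cwnd: int) -> int:
    """Multiplicative decrease by 0.7 (beta_cubic), clamped to the floor."""
    return max(cwnd * 7 // 10, MIN_CWND)


def events_to_floor(cwnd: int = INITIAL_CWND) -> int:
    """Count congestion events until the window is pinned at MIN_CWND."""
    events = 0
    while cwnd > MIN_CWND:
        cwnd = on_congestion_event(cwnd)
        events += 1
    return events


if __name__ == "__main__":
    print(events_to_floor())
```

Under these assumptions only a handful of congestion events are needed, so two seconds of 30% loss is far more than enough to reach the floor.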

The oscillation occurred because each ACK from the client reduced bytes_in_flight to zero, triggering the server to send another two-packet burst. This burst, under the app-limited logic, was misinterpreted as a new congestion event, causing CUBIC to re-enter recovery state repeatedly. The fix was a near-one-line change that ensured the congestion window could properly recover after heavy loss by correctly handling the app-limited condition.
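
The loop described above can be sketched as a toy state machine. This is a deliberately simplified model, not quiche's implementation: the `simulate` function, its `buggy` flag, and the one-packet-per-round-trip growth in the healthy branch are all hypothetical stand-ins, and the real fix concerns how the app-limited condition is evaluated rather than a boolean switch.

```python
MIN_CWND = 2700   # bytes, the floor reported in the article
PACKET = 1350     # bytes per full-size packet (2700 / 2)


def simulate(duration_ms: int, rtt_ms: int, buggy: bool) -> tuple[int, int]:
    """Return (final_cwnd, recovery_entries) for a loss-free period.

    One loop iteration models one round trip, starting at the window floor.
    """
    cwnd = MIN_CWND
    recovery_entries = 0
    state = "recovery"  # the connection leaves the loss phase in recovery
    for _ in range(duration_ms // rtt_ms):
        bytes_in_flight = 0             # the ACK for the previous burst arrives
        state = "congestion_avoidance"  # recovery ends once everything is acked
        bytes_in_flight += cwnd         # send the next burst (two packets at the floor)
        if buggy:
            # Toy stand-in for the faulty check: the fresh burst is treated
            # as a new congestion event, so the window never leaves the floor.
            state = "recovery"
            recovery_entries += 1
            cwnd = MIN_CWND
        else:
            # Placeholder growth of one packet per round trip (not CUBIC's curve).
            cwnd += PACKET
    return cwnd, recovery_entries


if __name__ == "__main__":
    print(simulate(6700, 14, buggy=True))
    print(simulate(6700, 14, buggy=False))
```

In the buggy branch the window ends exactly where it started, with one re-entry into recovery per round trip; in the healthy branch there are no re-entries and the window grows.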

Cloudflare noted that the bug was invisible in throughput dashboards and undetectable by static review, surfacing only through deliberate stress testing of the congestion controller's recovery path. The issue was specific to CUBIC; running the same test with the Reno congestion controller produced normal results. The fix has been merged into quiche, and Cloudflare recommends that other QUIC implementations using CUBIC review their app-limited handling.
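
A recovery-path check of this kind can be expressed as an assertion over a congestion-window trace. The sketch below is a hypothetical helper, not part of quiche's test suite; the `(timestamp_ms, cwnd_bytes)` trace format and the function name are assumptions made for illustration.

```python
def stuck_at_floor(samples, loss_end_ms, floor):
    """True if every cwnd sample taken after loss_end_ms is still at or below floor.

    samples: iterable of (timestamp_ms, cwnd_bytes) pairs (hypothetical trace format).
    Returns False when there are no samples after loss_end_ms.
    """
    after = [cwnd for t, cwnd in samples if t > loss_end_ms]
    return bool(after) and all(cwnd <= floor for cwnd in after)


if __name__ == "__main__":
    # Made-up traces: loss ends at t = 2000 ms in both.
    healthy = [(0, 13500), (2000, 2700), (3000, 5400), (4000, 10800)]
    stuck = [(0, 13500), (2000, 2700), (3000, 2700), (4000, 2700)]
    print(stuck_at_floor(healthy, 2000, 2700))
    print(stuck_at_floor(stuck, 2000, 2700))
```

A test harness could fail the run whenever this returns true, which catches a flat window even when aggregate throughput looks unremarkable.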

This discovery highlights the complexity of congestion control algorithms and the unexpected behaviors that can arise when a kernel-side change is ported to a user-space protocol implementation. As QUIC adoption grows, particularly in CDN and edge computing environments, such corner-case bugs could have significant real-world impact on connection reliability and throughput.

Synthesized by Vypr AI