CVE-2025-40303
Description
In the Linux kernel, the following vulnerability has been resolved:
btrfs: ensure no dirty metadata is written back for an fs with errors
[BUG] During development of a minor feature (making sure every btrfs_bio::end_io() is called in task context), I noticed a crash in generic/388, where metadata writes queued new work items after btrfs_stop_all_workers().
It turns out this can happen without any code modification: using RAID5 for metadata with the same workload from generic/388 is enough to trigger the use-after-free.
[CAUSE] If btrfs hits an error, the fs is marked as being in an error state and no new transaction is allowed, so metadata is effectively frozen.
But some metadata modifications were made before that error, and they are still in the btree inode page cache.
Since there will be no real transaction commit, all those dirty folios are kept as-is in the page cache, and they cannot be invalidated by the invalidate_inode_pages2() call inside close_ctree(), because they are dirty.
Finally, after btrfs_stop_all_workers(), we call iput() on the btree inode, which triggers writeback of that dirty metadata.
If the fs is using RAID56 metadata, this writeback triggers RMW and queues new work items into rmw_workers, which is already stopped, causing a warning from queue_work() and a use-after-free.
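The sequence described above can be sketched as a small userspace model. This is only an illustration of the ordering problem, not kernel code: every `toy_` name is hypothetical, the counters stand in for real folios and work items, and the relative order of the invalidation step and the worker shutdown is an assumption (the description only states that iput() runs after btrfs_stop_all_workers()).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a workqueue such as rmw_workers. */
struct toy_workqueue {
    bool stopped;          /* set once the workers have been stopped */
    int queued;            /* work items accepted while still running */
    int queued_after_stop; /* work items queued after the stop: the bug */
};

/* Hypothetical stand-in for the filesystem state relevant to the bug. */
struct toy_fs {
    bool error;            /* fs is in an error state, no new transactions */
    int dirty_tree_blocks; /* dirty metadata left in the btree inode cache */
    struct toy_workqueue rmw_workers;
};

/* Models queueing work; the real queue_work() warns on a stopped queue. */
static void toy_queue_work(struct toy_workqueue *wq)
{
    assert(wq != NULL);
    if (wq->stopped)
        wq->queued_after_stop++;
    else
        wq->queued++;
}

/* Models invalidation: dirty blocks cannot be dropped, so all survive. */
static int toy_invalidate_pages(const struct toy_fs *fs)
{
    return fs->dirty_tree_blocks;
}

/* Models iput() on the btree inode: each dirty block is written back,
 * and with RAID56 metadata every write queues RMW work. */
static void toy_iput_btree_inode(struct toy_fs *fs)
{
    while (fs->dirty_tree_blocks > 0) {
        toy_queue_work(&fs->rmw_workers);
        fs->dirty_tree_blocks--;
    }
}

/* Models the unfixed teardown; returns work queued after the stop. */
static int toy_close_ctree_unfixed(struct toy_fs *fs)
{
    (void)toy_invalidate_pages(fs);  /* dirty blocks remain */
    fs->rmw_workers.stopped = true;  /* workers are stopped */
    toy_iput_btree_inode(fs);        /* writeback happens too late */
    return fs->rmw_workers.queued_after_stop;
}
```

Calling `toy_close_ctree_unfixed()` on a `struct toy_fs` with `error` set and a nonzero `dirty_tree_blocks` returns a nonzero count, which corresponds to work landing on an already-stopped queue.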
[FIX] Add special handling to write_one_eb(): if the fs is already in an error state, immediately mark the bbio as failed instead of actually submitting it.
Then during close_ctree(), iput() will just discard all those dirty tree blocks without really writing them back, so no new jobs are queued to the already stopped-and-freed workqueues.
The extra discard in write_one_eb() also acts as an extra safety net. For example, if the transaction abort was triggered by extent or free space tree corruption, some tree blocks may have been allocated where they shouldn't be (overwriting existing tree blocks), because the extent/free space tree is already corrupted. In that case, writing them back would corrupt the fs further.
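The effect of the early exit can be sketched with a minimal userspace model. This is not the actual patch: the `model_` names and counters are hypothetical, and the real write_one_eb() works on extent buffers and a bbio rather than plain integers. The sketch only shows the decision the fix introduces, namely failing the write up front when the fs is in an error state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the filesystem state relevant to the fix. */
struct model_fs {
    bool error;            /* fs is already in an error state */
    bool workers_stopped;  /* workqueues have been stopped */
    int dirty_tree_blocks; /* dirty metadata still in the page cache */
    int queued_after_stop; /* work queued to a stopped workqueue */
    int discarded;         /* writes failed early instead of submitted */
};

/* Models the patched write path: returns true if the write was
 * submitted, false if it was failed immediately. */
static bool model_write_one_eb(struct model_fs *fs)
{
    if (fs->error) {
        /* Fail the write up front; nothing reaches the workqueues. */
        fs->discarded++;
        return false;
    }
    /* A real submission would queue RMW work for RAID56 metadata. */
    if (fs->workers_stopped)
        fs->queued_after_stop++;
    return true;
}

/* Models teardown: stop the workers, then flush the remaining dirty
 * blocks as the final iput() on the btree inode would. */
static void model_close_ctree(struct model_fs *fs)
{
    assert(fs->dirty_tree_blocks >= 0);
    fs->workers_stopped = true;
    while (fs->dirty_tree_blocks > 0) {
        model_write_one_eb(fs);
        fs->dirty_tree_blocks--;
    }
}
```

With `error` set, `model_close_ctree()` leaves `queued_after_stop` at zero and counts every dirty block in `discarded`. Checking the error state at the single write entry point covers every caller, which is also why it doubles as a guard against writing back blocks from a corrupted tree.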
AI Insight
LLM-synthesized narrative grounded in this CVE's description and references.
A use-after-free in btrfs occurs when dirty metadata left over from before a filesystem error is written back during close_ctree(), after the workqueues have already been stopped.
Root Cause
In the Linux kernel's btrfs filesystem, when an error is encountered, the filesystem is marked as being in an error state and no new transactions are allowed. However, metadata modifications made before the error remain as dirty folios in the btree inode page cache. Since no real transaction commit takes place, these folios stay dirty, and invalidate_inode_pages2() cannot invalidate them during close_ctree(). After btrfs_stop_all_workers() is called, iput() on the btree inode triggers writeback of those dirty metadata blocks. If the filesystem uses RAID56 for metadata, this writeback triggers a read-modify-write (RMW) operation that queues new work items into the already-stopped rmw_workers workqueue, leading to a warning from queue_work() and a use-after-free condition [1][2].
Attack Vector
An attacker with the ability to trigger a btrfs filesystem error (e.g., through a crafted I/O operation or by exploiting a separate vulnerability that causes a transaction abort) can cause the kernel to attempt to write back dirty metadata after the filesystem's workqueues have been shut down. This scenario is reproducible with the generic/388 test using RAID5 metadata, but any error that puts the filesystem into an error state and leaves dirty metadata in the page cache can lead to the same issue [1][2].
Impact
Successful exploitation results in a use-after-free condition, which can lead to a kernel crash (denial of service) or potentially allow an attacker to execute arbitrary code in the kernel context, depending on the memory layout and exploitation technique. Writing back dirty metadata after an error also risks further filesystem corruption: if the extent or free space tree is already corrupted, tree blocks may have been allocated on top of existing tree blocks and would overwrite them [1][2].
Mitigation
The fix adds special handling to write_one_eb(): if the filesystem is already in an error state, the bbio is immediately marked as failed instead of being submitted. This ensures that during close_ctree(), iput() discards all dirty tree blocks without actually writing them back, preventing new work items from being queued to stopped workqueues. The patch has been applied to the stable kernel tree [1][2]. Users should update to a kernel version containing this commit to mitigate the vulnerability.
AI Insight generated on May 19, 2026. Synthesized from this CVE's description and the cited reference URLs; citations are validated against the source bundle.
Affected products: 2
Patches: 4
066ee13f05fb
e2b3859067bf
54a5b5a15588
2618849f31e7
Vulnerability mechanics
Generated on May 9, 2026. Inputs: CWE entries + fix-commit diffs from this CVE's patches. Citations validated against bundle.
References: 4
News mentions: 0
No linked articles in our index yet.