Regular for-next build test by kdave · Pull Request #1157 · btrfs/linux

kdave · 2024-02-21T15:01:30Z

Keep this open, the build tests are hosted on github CI.

Add a check in remove_range_from_remap_tree() after we call btrfs_lookup_block_group(), to check if it is NULL. This shouldn't happen, but if it does we at least get an error rather than a segfault. Reported-by: Chris Mason <clm@fb.com> Link: https://lore.kernel.org/linux-btrfs/20260125125129.2245240-1-clm@meta.com/ Fixes: 979e1dc ("btrfs: handle deletions from remapped block group") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>

Add the definitions for the remap tree to print-tree.c, so that we get more useful information if a tree is dumped to dmesg. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>

The nowait flag is always false in this context, making the conditional check unnecessary. Simplify the code by directly assigning -ENOTBLK. Found by Linux Verification Center (linuxtesting.org) with SVACE. Signed-off-by: Alexey Velichayshiy <a.velichayshiy@ispras.ru> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>

btrfs_set_periodic_reclaim_ready() requires space_info->lock to be held, as enforced by lockdep_assert_held(). However, btrfs_reclaim_sweep() was calling it after do_reclaim_sweep() returns, at which point space_info->lock is no longer held. Fix this by explicitly acquiring space_info->lock before clearing the periodic reclaim ready flag in btrfs_reclaim_sweep(). Reported-by: Chris Mason <clm@meta.com> Link: https://lore.kernel.org/linux-btrfs/20260208182556.891815-1-clm@meta.com/ Fixes: 19eff93 ("btrfs: fix periodic reclaim condition") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>

We have several call sites doing the same work to calculate the size of a bio: struct bio_vec *bvec; u32 bio_size = 0; int i; bio_for_each_bvec_all(bvec, bio, i) bio_size += bvec->bv_len; We can use a common helper instead of open-coding it everywhere. This also allows us to constify the @bio_size variables used in all the call sites. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>

The member compressed_bio::compressed_len can be replaced by the bio size, as we always submit the full compressed data without any partial read/write. Furthermore we already have enough ASSERT()s making sure the bio size matches the ordered extent or the extent map. This saves 8 bytes from compressed_bio: Before: struct compressed_bio { u64 start; /* 0 8 */ unsigned int len; /* 8 4 */ unsigned int compressed_len; /* 12 4 */ u8 compress_type; /* 16 1 */ bool writeback; /* 17 1 */ /* XXX 6 bytes hole, try to pack */ struct btrfs_bio * orig_bbio; /* 24 8 */ struct btrfs_bio bbio __attribute__((__aligned__(8))); /* 32 304 */ /* XXX last struct has 1 bit hole */ /* size: 336, cachelines: 6, members: 7 */ /* sum members: 330, holes: 1, sum holes: 6 */ /* member types with bit holes: 1, total: 1 */ /* forced alignments: 1 */ /* last cacheline: 16 bytes */ } __attribute__((__aligned__(8))); After: struct compressed_bio { u64 start; /* 0 8 */ unsigned int len; /* 8 4 */ u8 compress_type; /* 12 1 */ bool writeback; /* 13 1 */ /* XXX 2 bytes hole, try to pack */ struct btrfs_bio * orig_bbio; /* 16 8 */ struct btrfs_bio bbio __attribute__((__aligned__(8))); /* 24 304 */ /* XXX last struct has 1 bit hole */ /* size: 328, cachelines: 6, members: 6 */ /* sum members: 326, holes: 1, sum holes: 2 */ /* member types with bit holes: 1, total: 1 */ /* forced alignments: 1 */ /* last cacheline: 8 bytes */ } __attribute__((__aligned__(8))); Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>

…start btrfs_zoned_reserve_data_reloc_bg() is called on each mount of a file system and allocates a new block-group, to assign it to be the dedicated relocation target, if no pre-existing usable block-group for this task is found. If for some reason the transaction is aborted, btrfs_end_transaction() will wake up the transaction kthread. But the transaction kthread is not yet initialized at the time btrfs_zoned_reserve_data_reloc_bg() is called, leading to the following NULL-pointer dereference: RSP: 0018:ffffc9000c617c98 EFLAGS: 00010046 RAX: 0000000000000000 RBX: 000000000000073c RCX: 0000000000000002 RDX: 0000000000000001 RSI: 0000000000000003 RDI: 0000000000000001 RBP: 0000000000000207 R08: ffffffff8223c71d R09: 0000000000000635 R10: ffff888108588000 R11: 0000000000000003 R12: 0000000000000003 R13: 000000000000073c R14: 0000000000000000 R15: ffff888114dd6000 FS: 00007f2993745840(0000) GS:ffff8882b508d000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000000073c CR3: 0000000121a82006 CR4: 0000000000770eb0 PKRU: 55555554 Call Trace: <TASK> try_to_wake_up (./include/linux/spinlock.h:557 kernel/sched/core.c:4106) __btrfs_end_transaction (fs/btrfs/transaction.c:1115 (discriminator 2)) btrfs_zoned_reserve_data_reloc_bg (fs/btrfs/zoned.c:2840) open_ctree (fs/btrfs/disk-io.c:3588) btrfs_get_tree.cold (fs/btrfs/super.c:982 fs/btrfs/super.c:1944 fs/btrfs/super.c:2087 fs/btrfs/super.c:2121) vfs_get_tree (fs/super.c:1752) __do_sys_fsconfig (fs/fsopen.c:231 fs/fsopen.c:295 fs/fsopen.c:473) do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1)) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:131) RIP: 0033:0x7f299392740e Move the call to btrfs_zoned_reserve_data_reloc_bg() after the transaction_kthread has been initialized to fix this problem. Fixes: 694ce5e ("btrfs: zoned: reserve data_reloc block group on mount") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>

We have recently observed a number of subvolumes with broken dentries. ls-ing the parent dir looks like: drwxrwxrwt 1 root root 16 Jan 23 16:49 . drwxr-xr-x 1 root root 24 Jan 23 16:48 .. d????????? ? ? ? ? ? broken_subvol and similarly stat-ing the file fails. In this state, deleting the subvol fails with ENOENT, but attempting to create a new file or subvol over it errors out with EEXIST and even aborts the fs. Which leaves us a bit stuck. dmesg contains a single notable error message reading: "could not do orphan cleanup -2" 2 is ENOENT and the error comes from the failure handling path of btrfs_orphan_cleanup(), with the stack leading back up to btrfs_lookup(). btrfs_lookup btrfs_lookup_dentry btrfs_orphan_cleanup // prints that message and returns -ENOENT After some detailed inspection of the internal state, it became clear that: - there are no orphan items for the subvol - the subvol is otherwise healthy looking, it is not half-deleted or anything, there is no drop progress, etc. - the subvol was created a while ago and does the meaningful first btrfs_orphan_cleanup() call that sets BTRFS_ROOT_ORPHAN_CLEANUP much later. - after btrfs_orphan_cleanup() fails, btrfs_lookup_dentry() returns -ENOENT, which results in a negative dentry for the subvolume via d_splice_alias(NULL, dentry), leading to the observed behavior. The bug can be mitigated by dropping the dentry cache, at which point we can successfully delete the subvolume if we want. i.e., btrfs_lookup() btrfs_lookup_dentry() if (!sb_rdonly(inode->vfs_inode)->vfs_inode) btrfs_orphan_cleanup(sub_root) test_and_set_bit(BTRFS_ROOT_ORPHAN_CLEANUP) btrfs_search_slot() // finds orphan item for inode N ... prints "could not do orphan cleanup -2" if (inode == ERR_PTR(-ENOENT)) inode = NULL; return d_splice_alias(NULL, dentry) // NEGATIVE DENTRY for valid subvolume btrfs_orphan_cleanup() does test_and_set_bit(BTRFS_ROOT_ORPHAN_CLEANUP) on the root when it runs, so it cannot run more than once on a given root, so something else must run concurrently. However, the obvious routes to deleting an orphan when nlinks goes to 0 should not be able to run without first doing a lookup into the subvolume, which should run btrfs_orphan_cleanup() and set the bit. The final important observation is that create_subvol() calls d_instantiate_new() but does not set BTRFS_ROOT_ORPHAN_CLEANUP, so if the dentry cache gets dropped, the next lookup into the subvolume will make a real call into btrfs_orphan_cleanup() for the first time. This opens up the possibility of concurrently deleting the inode/orphan items but most typical evict() paths will be holding a reference on the parent dentry (child dentry holds parent->d_lockref.count via dget in d_alloc(), released in __dentry_kill()) and prevent the parent from being removed from the dentry cache. The one exception is delayed iputs. Ordered extent creation calls igrab() on the inode. If the file is unlinked and closed while those refs are held, iput() in __dentry_kill() decrements i_count but does not trigger eviction (i_count > 0). The child dentry is freed and the subvol dentry's d_lockref.count drops to 0, making it evictable while the inode is still alive. Since there are two races (the race between writeback and unlink and the race between lookup and delayed iputs), and there are too many moving parts, the following three diagrams show the complete picture. (Only the second and third are races) Phase 1: Create Subvol in dentry cache without BTRFS_ROOT_ORPHAN_CLEANUP set btrfs_mksubvol() lookup_one_len() __lookup_slow() d_alloc_parallel() __d_alloc() // d_lockref.count = 1 create_subvol(dentry) // doesn't touch the bit.. d_instantiate_new(dentry, inode) // dentry in cache with d_lockref.count == 1 Phase 2: Create a delayed iput for a file in the subvol but leave the subvol in state where its dentry can be evicted (d_lockref.count == 0) T1 (task) T2 (writeback) T3 (OE workqueue) write() // dirty pages btrfs_writepages() btrfs_run_delalloc_range() cow_file_range() btrfs_alloc_ordered_extent() igrab() // i_count: 1 -> 2 btrfs_unlink_inode() btrfs_orphan_add() close() __fput() dput() finish_dput() __dentry_kill() dentry_unlink_inode() iput() // 2 -> 1 --parent->d_lockref.count // 1 -> 0; evictable finish_ordered_fn() btrfs_finish_ordered_io() btrfs_put_ordered_extent() btrfs_add_delayed_iput() Phase 3: Once the delayed iput is pending and the subvol dentry is evictable, the shrinker can free it, causing the next lookup to go through btrfs_lookup() and call btrfs_orphan_cleanup() for the first time. If the cleaner kthread processes the delayed iput concurrently, the two race: T1 (shrinker) T2 (cleaner kthread) T3 (lookup) super_cache_scan() prune_dcache_sb() __dentry_kill() // subvol dentry freed btrfs_run_delayed_iputs() iput() // i_count -> 0 evict() // sets I_FREEING btrfs_evict_inode() // truncation loop btrfs_lookup() btrfs_lookup_dentry() btrfs_orphan_cleanup() // first call (bit never set) btrfs_iget() // blocks on I_FREEING btrfs_orphan_del() // inode freed // returns -ENOENT btrfs_del_orphan_item() // -ENOENT // "could not do orphan cleanup -2" d_splice_alias(NULL, dentry) // negative dentry for valid subvol The most straightforward fix is to ensure the invariant that a dentry for a subvolume can exist if and only if that subvolume has BTRFS_ROOT_ORPHAN_CLEANUP set on its root (and is known to have no orphans or ran btrfs_orphan_cleanup()). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>

…tent_buffer() Call rcu_read_lock() before exiting the loop in try_release_subpage_extent_buffer() because there is a rcu_read_unlock() call past the loop. This has been detected by the Clang thread-safety analyzer. Fixes: ad580df ("btrfs: fix subpage deadlock in try_release_subpage_extent_buffer()") Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: David Sterba <dsterba@suse.com>

Currently we zero out all the remaining bytes of the last folio of the compressed bio, then round the bio size to fs block boundary. But that is done in two different functions, zero_last_folio() to zero the remaining bytes of the last folio, and round_up_last_block() to round up the bio to fs block boundary. There are some minor problems: - zero_last_folio() is zeroing ranges we won't submit This is mostly affecting block size < page size cases, where we can have a large folio (e.g. 64K), but the fs block size is only 4K. In that case, we may only want to submit the first 4K of the folio, the remaining range won't matter, but we still zero them all. This causes unnecessary CPU usage just to zero out some bytes we won't utilized. - compressed_bio_last_folio() is called twice in two different functions Which in theory we only need to call it once. Enhance the situation by: - Only zero out bytes up to the fs block boundary Thus this will reduce some overhead for bs < ps cases. - Move the folio_zero_range() call into round_up_last_block() So that we can reuse the same folio returned by compressed_bio_last_folio(). Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>

Fix move_existing_remaps() so that if we increment the slot because the key we encounter isn't a REMAP_BACKREF, we don't reuse the objectid and offset of the old item. Link: https://lore.kernel.org/linux-btrfs/20260125123908.2096548-1-clm@meta.com/ Reported-by: Chris Mason <clm@fb.com> Fixes: bbea42d ("btrfs: move existing remaps before relocating block group") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>

Commit 2932ba8 ("slab: Introduce kmalloc_obj() and family") introduced, among many others, the kzalloc_objs() helper, which has some benefits over kcalloc(). Namely, internal introspection of the allocated type now becomes possible, allowing for future alignment-aware choices to be made by the allocator and future hardening work that can be type sensitive. Dropping 'sizeof' comes also as a nice side-effect. Moreover, this also allows us to be in line with the recent tree-wide migration to the kmalloc_obj() and family of helpers. See commit 69050f8 ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types"). Reviewed-by: Kees Cook <kees@kernel.org> Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>

Inside btrfs_free_compr_folio() we have an ASSERT() to make sure when we free the folio it should only have one reference (held by btrfs). However there is never any guarantee that btrfs is the only holder of the folio. Memory management may have acquired that folio for whatever reasons. I do not think we should poke into the very low-level implementation of memory management code, and I didn't find any fs code really using folio_ref_count() other than during debugging output. Just remove the ASSERT() to avoid false triggering. This was triggered during btrfs/011 with MOUNT_OPTIONS='-o compress' on a fast storage with 1GiB of RAM. Reported-by: David Sterba <dsterba@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>

Commit 347b704 ("Merge patch series "fs: generic file IO error reporting"") has introduced a common framework for reporting errors to fsnotify in a standard way. One of the functions being introduced is fserror_report_shutdown() that, when combined with the experimental support for shutdown in btrfs, it means that user-space can also easily detect whenever a btrfs filesystem has been marked as shutdown. Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>

If we attempt to create several files with names that result in the same hash, we have to pack them in same dir item and that has a limit inherent to the leaf size. However if we reach that limit, we trigger a transaction abort and turns the filesystem into RO mode. This allows for a malicious user to disrupt a system, without the need to have administration privileges/capabilities. Reproducer: $ cat exploit-hash-collisions.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi # Use smallest node size to make the test faster and require less file # names that result in hash collision. mkfs.btrfs -f --nodesize 4K $DEV mount $DEV $MNT # List of names that result in the same crc32c hash for btrfs. declare -a names=( 'foobar' '%a8tYkxfGMLWRGr55QSeQc4PBNH9PCLIvR6jZnkDtUUru1t@RouaUe_L:@xGkbO3nCwvLNYeK9vhE628gss:T$yZjZ5l-Nbd6CbC$M=hqE-ujhJICXyIxBvYrIU9-TDC' 'AQci3EUB%shMsg-N%frgU:02ByLs=IPJU0OpgiWit5nexSyxZDncY6WB:=zKZuk5Zy0DD$Ua78%MelgBuMqaHGyKsJUFf9s=UW80PcJmKctb46KveLSiUtNmqrMiL9-Y0I_l5Fnam04CGIg=8@U:Z' 'CvVqJpJzueKcuA$wqwePfyu7VxuWNN3ho$p0zi2H8QFYK$7YlEqOhhb%:hHgjhIjW5vnqWHKNP4' 'ET:vk@rFU4tsvMB0$C_p=xQHaYZjvoF%-BTc%wkFW8yaDAPcCYoR%x$FH5O:' 'HwTon%v7SGSP4FE08jBwwiu5aot2CFKXHTeEAa@38fUcNGOWvE@Mz6WBeDH_VooaZ6AgsXPkVGwy9l@@ZbNXabUU9csiWrrOp0MWUdfi$EZ3w9GkIqtz7I_eOsByOkBOO' 'Ij%2VlFGXSuPvxJGf5UWy6O@1svxGha%b@=%wjkq:CIgE6u7eJOjmQY5qTtxE2Rjbis9@us' 'KBkjG5%9R8K9sOG8UTnAYjxLNAvBmvV5vz3IiZaPmKuLYO03-6asI9lJ_j4@6Xo$KZicaLWJ3Pv8XEwVeUPMwbHYWwbx0pYvNlGMO9F:ZhHAwyctnGy%_eujl%WPd4U2BI7qooOSr85J-C2V$LfY' 'NcRfDfuUQ2=zP8K3CCF5dFcpfiOm6mwenShsAb_F%n6GAGC7fT2JFFn:c35X-3aYwoq7jNX5$ZJ6hI3wnZs$7KgGi7wjulffhHNUxAT0fRRLF39vJ@NvaEMxsMO' 'Oj42AQAEzRoTxa5OuSKIr=A_lwGMy132v4g3Pdq1GvUG9874YseIFQ6QU' 'Ono7avN5GjC:_6dBJ_' 'WHmN2gnmaN-9dVDy4aWo:yNGFzz8qsJyJhWEWcud7$QzN2D9R0efIWWEdu5kwWr73NZm4=@CoCDxrrZnRITr-kGtU_cfW2:%2_am' 'WiFnuTEhAG9FEC6zopQmj-A-$LDQ0T3WULz%ox3UZAPybSV6v1Z$b4L_XBi4M4BMBtJZpz93r9xafpB77r:lbwvitWRyo$odnAUYlYMmU4RvgnNd--e=I5hiEjGLETTtaScWlQp8mYsBovZwM2k' 'XKyH=OsOAF3p%uziGF_ZVr$ivrvhVgD@1u%5RtrV-gl_vqAwHkK@x7YwlxX3qT6WKKQ%PR56NrUBU2dOAOAdzr2=5nJuKPM-T-$ZpQfCL7phxQbUcb:BZOTPaFExc-qK-gDRCDW2' 'd3uUR6OFEwZr%ns1XH_@tbxA@cCPmbBRLdyh7p6V45H$P2$F%w0RqrD3M0g8aGvWpoTFMiBdOTJXjD:JF7=h9a_43xBywYAP%r$SPZi%zDg%ql-KvkdUCtF9OLaQlxmd' 'ePTpbnit%hyNm@WELlpKzNZYOzOTf8EQ$sEfkMy1VOfIUu3coyvIr13-Y7Sv5v-Ivax2Go_GQRFMU1b3362nktT9WOJf3SpT%z8sZmM3gvYQBDgmKI%%RM-G7hyrhgYflOw%z::ZRcv5O:lDCFm' 'evqk743Y@dvZAiG5J05L_ROFV@$2%rVWJ2%3nxV72-W7$e$-SK3tuSHA2mBt$qloC5jwNx33GmQUjD%akhBPu=VJ5g$xhlZiaFtTrjeeM5x7dt4cHpX0cZkmfImndYzGmvwQG:$euFYmXn$_2rA9mKZ' 'gkgUtnihWXsZQTEkrMAWIxir09k3t7jk_IK25t1:cy1XWN0GGqC%FrySdcmU7M8MuPO_ppkLw3=Dfr0UuBAL4%GFk2$Ma10V1jDRGJje%Xx9EV2ERaWKtjpwiZwh0gCSJsj5UL7CR8RtW5opCVFKGGy8Cky' 'hNgsG_8lNRik3PvphqPm0yEH3P%%fYG:kQLY=6O-61Wa6nrV_WVGR6TLB09vHOv%g4VQRP8Gzx7VXUY1qvZyS' 'isA7JVzN12xCxVPJZ_qoLm-pTBuhjjHMvV7o=F:EaClfYNyFGlsfw-Kf%uxdqW-kwk1sPl2vhbjyHU1A6$hz' 'kiJ_fgcdZFDiOptjgH5PN9-PSyLO4fbk_:u5_2tz35lV_iXiJ6cx7pwjTtKy-XGaQ5IefmpJ4N_ZqGsqCsKuqOOBgf9LkUdffHet@Wu' 'lvwtxyhE9:%Q3UxeHiViUyNzJsy:fm38pg_b6s25JvdhOAT=1s0$pG25x=LZ2rlHTszj=gN6M4zHZYr_qrB49i=pA--@WqWLIuX7o1S_SfS@2FSiUZN' 'rC24cw3UBDZ=5qJBUMs9e$=S4Y94ni%Z8639vnrGp=0Hv4z3dNFL0fBLmQ40=EYIY:Z=SLc@QLMSt2zsss2ZXrP7j4=' 'uwGl2s-fFrf@GqS=DQqq2I0LJSsOmM%xzTjS:lzXguE3wChdMoHYtLRKPvfaPOZF2fER@j53evbKa7R%A7r4%YEkD=kicJe@SFiGtXHbKe4gCgPAYbnVn' 'UG37U6KKua2bgc:IHzRs7BnB6FD:2Mt5Cc5NdlsW%$1tyvnfz7S27FvNkroXwAW:mBZLA1@qa9WnDbHCDmQmfPMC9z-Eq6QT0jhhPpqyymaD:R02ghwYo%yx7SAaaq-:x33LYpei$5g8DMl3C' 'y2vjek0FE1PDJC0qpfnN:x8k2wCFZ9xiUF2ege=JnP98R%wxjKkdfEiLWvQzmnW' '8-HCSgH5B%K7P8_jaVtQhBXpBk:pE-$P7ts58U0J@iR9YZntMPl7j$s62yAJO@_9eanFPS54b=UTw$94C-t=HLxT8n6o9P=QnIxq-f1=Ne2dvhe6WbjEQtc' 'YPPh:IFt2mtR6XWSmjHptXL_hbSYu8bMw-JP8@PNyaFkdNFsk$M=xfL6LDKCDM-mSyGA_2MBwZ8Dr4=R1D%7-mCaaKGxb990jzaagRktDTyp' '9hD2ApKa_t_7x-a@GCG28kY:7$M@5udI1myQ$x5udtggvagmCQcq9QXWRC5hoB0o-_zHQUqZI5rMcz_kbMgvN5jr63LeYA4Cj-c6F5Ugmx6DgVf@2Jqm%MafecpgooqreJ53P-QTS' ) # Now create files with all those names in the same parent directory. # It should not fail since a 4K leaf has enough space for them. for name in "${names[@]}"; do touch $MNT/$name done # Now add one more file name that causes a crc32c hash collision. # This should fail, but it should not turn the filesystem into RO mode # (which could be exploited by malicious users) due to a transaction # abort. touch $MNT/'W6tIm-VK2@BGC@IBfcgg6j_p:pxp_QUqtWpGD5Ok_GmijKOJJt' # Check that we are able to create another file, with a name that does not cause # a crc32c hash collision. echo -n "hello world" > $MNT/baz # Unmount and mount again, verify file baz exists and with the right content. umount $MNT mount $DEV $MNT echo "File baz content: $(cat $MNT/baz)" umount $MNT When running the reproducer: $ ./exploit-hash-collisions.sh (...) touch: cannot touch '/mnt/sdi/W6tIm-VK2@BGC@IBfcgg6j_p:pxp_QUqtWpGD5Ok_GmijKOJJt': Value too large for defined data type ./exploit-hash-collisions.sh: line 57: /mnt/sdi/baz: Read-only file system cat: /mnt/sdi/baz: No such file or directory File baz content: And the transaction abort stack trace in dmesg/syslog: $ dmesg (...) [758240.509761] ------------[ cut here ]------------ [758240.510668] BTRFS: Transaction aborted (error -75) [758240.511577] WARNING: fs/btrfs/inode.c:6854 at btrfs_create_new_inode+0x805/0xb50 [btrfs], CPU#6: touch/888644 [758240.513513] Modules linked in: btrfs dm_zero (...) [758240.523221] CPU: 6 UID: 0 PID: 888644 Comm: touch Tainted: G W 6.19.0-rc8-btrfs-next-225+ #1 PREEMPT(full) [758240.524621] Tainted: [W]=WARN [758240.525037] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [758240.526331] RIP: 0010:btrfs_create_new_inode+0x80b/0xb50 [btrfs] [758240.527093] Code: 0f 82 cf (...) [758240.529211] RSP: 0018:ffffce64418fbb48 EFLAGS: 00010292 [758240.529935] RAX: 00000000ffffffd3 RBX: 0000000000000000 RCX: 00000000ffffffb5 [758240.531040] RDX: 0000000d04f33e06 RSI: 00000000ffffffb5 RDI: ffffffffc0919dd0 [758240.531920] RBP: ffffce64418fbc10 R08: 0000000000000000 R09: 00000000ffffffb5 [758240.532928] R10: 0000000000000000 R11: ffff8e52c0000000 R12: ffff8e53eee7d0f0 [758240.533818] R13: ffff8e57f70932a0 R14: ffff8e5417629568 R15: 0000000000000000 [758240.534664] FS: 00007f1959a2a740(0000) GS:ffff8e5b27cae000(0000) knlGS:0000000000000000 [758240.535821] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [758240.536644] CR2: 00007f1959b10ce0 CR3: 000000012a2cc005 CR4: 0000000000370ef0 [758240.537517] Call Trace: [758240.537828] <TASK> [758240.538099] btrfs_create_common+0xbf/0x140 [btrfs] [758240.538760] path_openat+0x111a/0x15b0 [758240.539252] do_filp_open+0xc2/0x170 [758240.539699] ? preempt_count_add+0x47/0xa0 [758240.540200] ? __virt_addr_valid+0xe4/0x1a0 [758240.540800] ? __check_object_size+0x1b3/0x230 [758240.541661] ? alloc_fd+0x118/0x180 [758240.542315] do_sys_openat2+0x70/0xd0 [758240.543012] __x64_sys_openat+0x50/0xa0 [758240.543723] do_syscall_64+0x50/0xf20 [758240.544462] entry_SYSCALL_64_after_hwframe+0x76/0x7e [758240.545397] RIP: 0033:0x7f1959abc687 [758240.546019] Code: 48 89 fa (...) [758240.548522] RSP: 002b:00007ffe16ff8690 EFLAGS: 00000202 ORIG_RAX: 0000000000000101 [758240.566278] RAX: ffffffffffffffda RBX: 00007f1959a2a740 RCX: 00007f1959abc687 [758240.567068] RDX: 0000000000000941 RSI: 00007ffe16ffa333 RDI: ffffffffffffff9c [758240.567860] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 [758240.568707] R10: 00000000000001b6 R11: 0000000000000202 R12: 0000561eec7c4b90 [758240.569712] R13: 0000561eec7c311f R14: 00007ffe16ffa333 R15: 0000000000000000 [758240.570758] </TASK> [758240.571040] ---[ end trace 0000000000000000 ]--- [758240.571681] BTRFS: error (device sdi state A) in btrfs_create_new_inode:6854: errno=-75 unknown [758240.572899] BTRFS info (device sdi state EA): forced readonly Fix this by checking for hash collision, and if the adding a new name is possible, early in btrfs_create_new_inode() before we do any tree updates, so that we don't need to abort the transaction if we can not add the new name due to the leaf size limit. A test case for fstests will be sent soon. Fixes: caae78e ("btrfs: move common inode creation code into btrfs_create_new_inode()") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>

Currently a user can trigger a transaction abort by snapshotting a previously received snapshot a bunch of times until we reach a BTRFS_UUID_KEY_RECEIVED_SUBVOL item overflow (the maximum item size we can store in a leaf). This is very likely not common in practice, but if it happens, it turns the filesystem into RO mode. The snapshot, send and set_received_subvol and subvol_setflags (used by receive) don't require CAP_SYS_ADMIN, just inode_owner_or_capable(). A malicious user could use this to turn a filesystem into RO mode and disrupt a system. Reproducer script: $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi # Use smallest node size to make the test faster. mkfs.btrfs -f --nodesize 4K $DEV mount $DEV $MNT # Create a subvolume and set it to RO so that it can be used for send. btrfs subvolume create $MNT/sv touch $MNT/sv/foo btrfs property set $MNT/sv ro true # Send and receive the subvolume into snaps/sv. mkdir $MNT/snaps btrfs send $MNT/sv | btrfs receive $MNT/snaps # Now snapshot the received subvolume, which has a received_uuid, a # lot of times to trigger the leaf overflow. total=500 for ((i = 1; i <= $total; i++)); do echo -ne "\rCreating snapshot $i/$total" btrfs subvolume snapshot -r $MNT/snaps/sv $MNT/snaps/sv_$i > /dev/null done echo umount $MNT When running the test: $ ./test.sh (...) Create subvolume '/mnt/sdi/sv' At subvol /mnt/sdi/sv At subvol sv Creating snapshot 496/500ERROR: Could not create subvolume: Value too large for defined data type Creating snapshot 497/500ERROR: Could not create subvolume: Read-only file system Creating snapshot 498/500ERROR: Could not create subvolume: Read-only file system Creating snapshot 499/500ERROR: Could not create subvolume: Read-only file system Creating snapshot 500/500ERROR: Could not create subvolume: Read-only file system And in dmesg/syslog: $ dmesg (...) [251067.627338] BTRFS warning (device sdi): insert uuid item failed -75 (0x4628b21c4ac8d898, 0x2598bee2b1515c91) type 252! [251067.629212] ------------[ cut here ]------------ [251067.630033] BTRFS: Transaction aborted (error -75) [251067.630871] WARNING: fs/btrfs/transaction.c:1907 at create_pending_snapshot.cold+0x52/0x465 [btrfs], CPU#10: btrfs/615235 [251067.632851] Modules linked in: btrfs dm_zero (...) [251067.644071] CPU: 10 UID: 0 PID: 615235 Comm: btrfs Tainted: G W 6.19.0-rc8-btrfs-next-225+ #1 PREEMPT(full) [251067.646165] Tainted: [W]=WARN [251067.646733] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [251067.648735] RIP: 0010:create_pending_snapshot.cold+0x55/0x465 [btrfs] [251067.649984] Code: f0 48 0f (...) [251067.653313] RSP: 0018:ffffce644908fae8 EFLAGS: 00010292 [251067.653987] RAX: 00000000ffffff01 RBX: ffff8e5639e63a80 RCX: 00000000ffffffd3 [251067.655042] RDX: ffff8e53faa76b00 RSI: 00000000ffffffb5 RDI: ffffffffc0919750 [251067.656077] RBP: ffffce644908fbd8 R08: 0000000000000000 R09: ffffce644908f820 [251067.657068] R10: ffff8e5adc1fffa8 R11: 0000000000000003 R12: ffff8e53c0431bd0 [251067.658050] R13: ffff8e5414593600 R14: ffff8e55efafd000 R15: 00000000ffffffb5 [251067.659019] FS: 00007f2a4944b3c0(0000) GS:ffff8e5b27dae000(0000) knlGS:0000000000000000 [251067.660115] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [251067.660943] CR2: 00007ffc5aa57898 CR3: 00000005813a2003 CR4: 0000000000370ef0 [251067.661972] Call Trace: [251067.662292] <TASK> [251067.662653] create_pending_snapshots+0x97/0xc0 [btrfs] [251067.663413] btrfs_commit_transaction+0x26e/0xc00 [btrfs] [251067.664257] ? btrfs_qgroup_convert_reserved_meta+0x35/0x390 [btrfs] [251067.665238] ? _raw_spin_unlock+0x15/0x30 [251067.665837] ? record_root_in_trans+0xa2/0xd0 [btrfs] [251067.666531] btrfs_mksubvol+0x330/0x580 [btrfs] [251067.667145] btrfs_mksnapshot+0x74/0xa0 [btrfs] [251067.667827] __btrfs_ioctl_snap_create+0x194/0x1d0 [btrfs] [251067.668595] btrfs_ioctl_snap_create_v2+0x107/0x130 [btrfs] [251067.669479] btrfs_ioctl+0x1580/0x2690 [btrfs] [251067.670093] ? count_memcg_events+0x6d/0x180 [251067.670849] ? handle_mm_fault+0x1a0/0x2a0 [251067.671652] __x64_sys_ioctl+0x92/0xe0 [251067.672406] do_syscall_64+0x50/0xf20 [251067.673129] entry_SYSCALL_64_after_hwframe+0x76/0x7e [251067.674096] RIP: 0033:0x7f2a495648db [251067.674812] Code: 00 48 89 (...) [251067.678227] RSP: 002b:00007ffc5aa57840 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [251067.679691] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f2a495648db [251067.681145] RDX: 00007ffc5aa588b0 RSI: 0000000050009417 RDI: 0000000000000004 [251067.682511] RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000 [251067.683842] R10: 000000000000000a R11: 0000000000000246 R12: 00007ffc5aa59910 [251067.685176] R13: 00007ffc5aa588b0 R14: 0000000000000004 R15: 0000000000000006 [251067.686524] </TASK> [251067.686972] ---[ end trace 0000000000000000 ]--- [251067.687890] BTRFS: error (device sdi state A) in create_pending_snapshot:1907: errno=-75 unknown [251067.689049] BTRFS info (device sdi state EA): forced readonly [251067.689054] BTRFS warning (device sdi state EA): Skipping commit of aborted transaction. [251067.690119] BTRFS: error (device sdi state EA) in cleanup_transaction:2043: errno=-75 unknown [251067.702028] BTRFS info (device sdi state EA): last unmount of filesystem 46dc3975-30a2-4a69-a18f-418b859cccda Fix this by ignoring -EOVERFLOW errors from btrfs_uuid_tree_add() in the snapshot creation code when attempting to add the BTRFS_UUID_KEY_RECEIVED_SUBVOL item. This is OK because it's not critical and we are still able to delete the snapshot, as snapshot/subvolume deletion ignores if a BTRFS_UUID_KEY_RECEIVED_SUBVOL is missing (see inode.c:btrfs_delete_subvolume()). As for send/receive, we can still do send/receive operations since it always peeks the first root ID in the existing BTRFS_UUID_KEY_RECEIVED_SUBVOL (it could peek any since all snapshots have the same content), and even if the key is missing, it falls back to searching by BTRFS_UUID_KEY_SUBVOL key. A test case for fstests will be sent soon. Fixes: dd5f961 ("Btrfs: maintain subvolume items in the UUID tree") Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>

We never return -EEXIST from btrfs_uuid_tree_add(), if the item already exists we extend it, so it's pointless to check for such return value. Furthermore, in create_pending_snapshot(), the logic is completely broken. The goal was to not error out and abort the transaction in case of -EEXIST but we left 'ret' with the -EEXIST value, so we end up setting pending->error to -EEXIST and return that error up the call chain up to btrfs_commit_transaction(), which will abort the transaction. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>

…_add() There's no point in checking if the uuid root exists in btrfs_uuid_tree_add(), since we already do it in btrfs_uuid_tree_lookup(). We can just remove the check from btrfs_uuid_tree_add() and make btrfs_uuid_tree_lookup() return -EINVAL instead of -ENOENT in case the uuid tree does not exists. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>

We're under the IS_ERR() branch so we know that 'ret', which got assigned the value of PTR_ERR(di) is always negative, so there's no point in checking if it's negative. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>

Function `btrfs_relocate_chunk()` always passes verbose=true to `btrfs_relocate_block_group()` instead of the `verbose` parameter passed into it by it's callers. While user initiated rebalancing should be logged in the Kernel's log buffer. This causes excessive log spamming from automatic rebalancing, e.g. on zoned filesystems running low on usable space. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>

If the set received ioctl fails due to an item overflow when attempting to add the BTRFS_UUID_KEY_RECEIVED_SUBVOL we have to abort the transaction since we did some metadata updates before. This means that if a user calls this ioctl with the same received UUID field for a lot of subvolumes, we will hit the overflow, trigger the transaction abort and turn the filesystem into RO mode. A malicious user could exploit this, and this ioctl does not even requires that a user has admin privileges (CAP_SYS_ADMIN), only that he/she owns the subvolume. Fix this by doing an early check for item overflow before starting a transaction. This is also race safe because we are holding the subvol_sem semaphore in exclusive (write) mode. A test case for fstests will follow soon. Fixes: dd5f961 ("Btrfs: maintain subvolume items in the UUID tree") Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>

…bvol ioctl If we failed to update the root we don't abort the transaction, which is wrong since we already used the transaction to remove an item from the uuid tree. Fixes: dd5f961 ("Btrfs: maintain subvolume items in the UUID tree") Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>

If we fail to remove an item from the uuid tree, we don't need to abort the transaction since we have not done any change before. So remove that transaction abort. Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>

It's defined twice in a row for the !CONFIG_PRINTK case, so remove one of the duplicates. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>

We have two locations using open-coded 512K size, as the async chunk size. For compression we have not only the max size a compressed extent can represent (128K), but also how large an async chunk can be (512K). Although we have a macro for the maximum compressed extent size, we do not have any macro for the async chunk size. Add such a macro and replace the two open-coded SZ_512K. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com>

…action When memory pressure causes writeback of a recently COW'd buffer, btrfs sets BTRFS_HEADER_FLAG_WRITTEN on it. Subsequent btrfs_search_slot() restarts then see the WRITTEN flag and re-COW the buffer unnecessarily, causing COW amplification that can exhaust block reservations and degrade throughput. Overwriting in place is crash-safe because the committed superblock does not reference buffers allocated in the current (uncommitted) transaction, so no on-disk tree points to this block yet. When should_cow_block() encounters a WRITTEN buffer whose generation matches the current transaction, instead of requesting a COW, re-dirty the buffer and re-register its range in the transaction's dirty_pages. Both are necessary because btrfs tracks dirty metadata through two independent mechanisms. set_extent_buffer_dirty() sets the EXTENT_BUFFER_DIRTY flag and the buffer_tree xarray PAGECACHE_TAG_DIRTY mark, which is what background writeback (btree_write_cache_pages) uses to find and write dirty buffers. The transaction's dirty_pages io tree is a separate structure used by btrfs_write_and_wait_transaction() at commit time to ensure all buffers allocated during the transaction are persisted. The dirty_pages range was originally registered in btrfs_init_new_buffer() when the block was first allocated. Normally dirty_pages is only cleared at commit time by btrfs_write_and_wait_transaction(), but if qgroups are enabled and snapshots are being created, qgroup_account_snapshot() may have already called btrfs_write_and_wait_transaction() and released the range before the final commit-time call. Keep BTRFS_HEADER_FLAG_WRITTEN set so that btrfs_free_tree_block() correctly pins the block if it is freed later. Relax the lockdep assertion in btrfs_mark_buffer_dirty() from btrfs_assert_tree_write_locked() to lockdep_assert_held() so that it accepts either a read or write lock. should_cow_block() may be called from btrfs_search_slot() when only a read lock is held (nodes above write_lock_level are read-locked). The write lock assertion previously documented the caller convention that buffer content was being modified under exclusive access, but btrfs_mark_buffer_dirty() and set_extent_buffer_dirty() themselves only perform independently synchronized operations: atomic bit ops on bflags, folio_mark_dirty() (kernel-internal folio locking), xarray mark updates (xarray spinlock), and percpu counter updates. The read lock is sufficient because it prevents lock_extent_buffer_for_io() from acquiring the write lock and racing on the dirty state. Since rw_semaphore permits concurrent readers, multiple threads can enter btrfs_mark_buffer_dirty() simultaneously for the same buffer; this is safe because test_and_set_bit(EXTENT_BUFFER_DIRTY) ensures only one thread performs the full dirty state transition. Remove the CONFIG_BTRFS_DEBUG assertion in set_extent_buffer_dirty() that checked folio_test_dirty() after marking the buffer dirty. This assertion assumed exclusive access (only one thread in set_extent_buffer_dirty() at a time), which held when the only caller was btrfs_mark_buffer_dirty() under write lock. With concurrent readers calling through should_cow_block(), a thread that loses the test_and_set_bit race sees was_dirty=true and skips the folio dirty marking, but the winning thread may not have called btrfs_meta_folio_set_dirty() yet, causing the assertion to fire. This is a benign race: the winning thread will complete the folio dirty marking, and no writeback can clear it while readers hold their locks. Hoist the EXTENT_BUFFER_WRITEBACK, BTRFS_HEADER_FLAG_RELOC, and BTRFS_ROOT_FORCE_COW checks before the WRITTEN block since they apply regardless of whether the buffer has been written back. This consolidates the exclusion logic and simplifies the WRITTEN path to only handle log trees and zoned devices. Moving the RELOC checks before the smp_mb__before_atomic() barrier is safe because both btrfs_root_id() (immutable) and BTRFS_HEADER_FLAG_RELOC (set at COW time under tree lock) are stable values not subject to concurrent modification; the barrier is only needed for BTRFS_ROOT_FORCE_COW which is set concurrently by create_pending_snapshot(). Exclude cases where in-place overwrite is not safe: - EXTENT_BUFFER_WRITEBACK: buffer is mid-I/O - Zoned devices: require sequential writes - Log trees: log blocks are immediately referenced by a committed superblock via btrfs_sync_log(), so overwriting could corrupt the committed log - BTRFS_ROOT_FORCE_COW: snapshot in progress - BTRFS_HEADER_FLAG_RELOC: block being relocated Signed-off-by: Leo Martins <loemra.dev@gmail.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: Boris Burkov <boris@bur.io>

Inhibit writeback on COW'd extent buffers for the lifetime of the transaction handle, preventing background writeback from setting BTRFS_HEADER_FLAG_WRITTEN and causing unnecessary re-COW. COW amplification occurs when background writeback flushes an extent buffer that a transaction handle is still actively modifying. When lock_extent_buffer_for_io() transitions a buffer from dirty to writeback, it sets BTRFS_HEADER_FLAG_WRITTEN, marking the block as having been persisted to disk at its current bytenr. Once WRITTEN is set, should_cow_block() must either COW the block again or overwrite it in place, both of which are unnecessary overhead when the buffer is still being modified by the same handle that allocated it. By inhibiting background writeback on actively-used buffers, WRITTEN is never set while a transaction handle holds a reference to the buffer, avoiding this overhead entirely. Add an atomic_t writeback_inhibitors counter to struct extent_buffer, which fits in an existing 6-byte hole without increasing struct size. When a buffer is COW'd in btrfs_force_cow_block(), call btrfs_inhibit_eb_writeback() to store the eb in the transaction handle's writeback_inhibited_ebs xarray (keyed by eb->start), take a reference, and increment writeback_inhibitors. The function handles dedup (same eb inhibited twice by the same handle) and replacement (different eb at the same logical address). Allocation failure is graceful: the buffer simply falls back to the pre-existing behavior where it may be written back and re-COW'd. Also inhibit writeback in should_cow_block() when COW is skipped, so that every transaction handle that reuses an already-COW'd buffer also inhibits its writeback. Without this, if handle A COWs a block and inhibits it, and handle B later reuses the same block without inhibiting, handle A's uninhibit on end_transaction leaves the buffer unprotected while handle B is still using it. This ensures all handles that access a COW'd buffer contribute to the inhibitor count, and the buffer remains protected until the last handle releases it. In lock_extent_buffer_for_io(), when writeback_inhibitors is non-zero and the writeback mode is WB_SYNC_NONE, skip the buffer. WB_SYNC_NONE is used by the VM flusher threads for background and periodic writeback, which are the only paths that cause COW amplification by opportunistically writing out dirty extent buffers mid-transaction. Skipping these is safe because the buffers remain dirty in the page cache and will be written out at transaction commit time. WB_SYNC_ALL must always proceed regardless of writeback_inhibitors. This is required for correctness in the fsync path: btrfs_sync_log() writes log tree blocks via filemap_fdatawrite_range() (WB_SYNC_ALL) while the transaction handle that inhibited those same blocks is still active. Without the WB_SYNC_ALL bypass, those inhibited log tree blocks would be silently skipped, resulting in an incomplete log on disk and corruption on replay. btrfs_write_and_wait_transaction() also uses WB_SYNC_ALL via filemap_fdatawrite_range(); for that path, inhibitors are already cleared beforehand, but the bypass ensures correctness regardless. Uninhibit in __btrfs_end_transaction() before atomic_dec(num_writers) to prevent a race where the committer proceeds while buffers are still inhibited. Also uninhibit in btrfs_commit_transaction() before writing and in cleanup_transaction() for the error path. Signed-off-by: Leo Martins <loemra.dev@gmail.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: Boris Burkov <boris@bur.io>

Add a btrfs_search_slot_restart tracepoint that fires at each restart site in btrfs_search_slot(), recording the root, tree level, and reason for the restart. This enables tracking search slot restarts which contribute to COW amplification under memory pressure. The four restart reasons are: - write_lock: insufficient write lock level, need to restart with higher lock - setup_nodes: node setup returned -EAGAIN - slot_zero: insertion at slot 0 requires higher write lock level - read_block: read_block_for_search returned -EAGAIN (block not cached or lock contention) COW counts are already tracked by the existing trace_btrfs_cow_block() tracepoint. The per-restart-site tracepoint avoids counter overhead in the critical path when tracepoints are disabled, and provides richer per-event information that bpftrace scripts can aggregate into counts, histograms, and per-root breakdowns. Signed-off-by: Leo Martins <loemra.dev@gmail.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Boris Burkov <boris@bur.io>

kdave force-pushed the for-next branch from 04c8194 to 6f435ee Compare February 22, 2024 10:28

kdave changed the title ~~Post rc5 build test~~ Regular for-next build test Feb 22, 2024

kdave force-pushed the for-next branch 6 times, most recently from 2d4aefb to c9e380a Compare February 28, 2024 14:37

kdave force-pushed the for-next branch 6 times, most recently from c56343b to 1cab137 Compare March 5, 2024 17:23

kdave force-pushed the for-next branch 2 times, most recently from 6613f3c to b30a0ce Compare March 15, 2024 01:05

fdmanana force-pushed the for-next branch from 41a7195 to 787f021 Compare March 18, 2024 11:17

kdave force-pushed the for-next branch 6 times, most recently from d205ebd to c0bd9d9 Compare March 25, 2024 17:48

kdave force-pushed the for-next branch 4 times, most recently from 15022b1 to c22750c Compare March 28, 2024 02:04

kdave force-pushed the for-next branch 3 times, most recently from 28d9855 to e18d8ce Compare April 4, 2024 19:30

maharmstone and others added 19 commits March 2, 2026 19:45

kdave force-pushed the for-next branch from 22ba7c2 to 2161a50 Compare March 2, 2026 18:46

morbidrsa and others added 5 commits March 2, 2026 19:46

kdave force-pushed the for-next branch from 2161a50 to dd2ab84 Compare March 2, 2026 18:47

adam900710 and others added 4 commits March 3, 2026 07:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regular for-next build test#1157

Regular for-next build test#1157
kdave wants to merge 10000 commits intobuildfrom
for-next

kdave commented Feb 21, 2024 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

20 participants

Conversation

kdave commented Feb 21, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

20 participants

kdave commented Feb 21, 2024 •

edited

Loading