bcachefs updates for 6.15


Merge tag 'bcachefs-2025-03-24' of git://evilpiepirate.org/bcachefs

Pull bcachefs updates from Kent Overstreet:
 "On disk format is now soft frozen: no more required/automatic are
  anticipated before taking off the experimental label.

  Major changes/features since 6.14:

   - Scrub

   - Blocksize greater than page size support

   - A number of "rebalance spinning and doing no work" issues have been
     fixed; we now check if the write allocation will succeed in
     bch2_data_update_init(), before kicking off the read.

     There's still more work to do in this area. Later we may want to
     add another bitset btree, like rebalance_work, to track "extents
     that rebalance was requested to move but couldn't", e.g. due to
     destination target having insufficient online devices.
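
     A minimal standalone sketch of the idea (this is a model, not the
     actual bch2_data_update_init() code; the types and helper here are
     made-up stand-ins):

       /*
        * Illustrative only: the point is the ordering - ask whether the
        * write-side allocation can succeed before issuing the read for
        * the move, so rebalance doesn't spin reading extents it will
        * never be able to write back out.
        */
       #include <stdbool.h>
       #include <stdio.h>

       struct move_request { unsigned target; unsigned nr_replicas; };

       /* stand-in for querying the allocator / online devices in the target */
       static bool write_alloc_would_succeed(const struct move_request *req)
       {
               return req->nr_replicas >= 1;   /* placeholder policy */
       }

       static int data_update_init(const struct move_request *req)
       {
               if (!write_alloc_would_succeed(req))
                       return -1;      /* bail before kicking off the read */
               printf("destination ok, issuing read\n");
               return 0;
       }

       int main(void)
       {
               struct move_request req = { .target = 1, .nr_replicas = 2 };
               return data_update_init(&req) ? 1 : 0;
       }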

   - We can now support scaling well into the petabyte range: latest
     bcachefs-tools will pick an appropriate bucket size at format time
     to ensure fsck can run in available memory (e.g. a server with
     256GB of ram and 100PB of storage would want 16MB buckets).
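
     (Rough sizing sketch, assuming fsck needs on the order of a few tens
     of bytes of in-memory state per bucket - that per-bucket figure is an
     assumption, not a number from this series: 100PB / 16MB buckets is
     about 6.25 billion buckets, and at ~40 bytes each that's roughly
     250GB, which is why a machine with 256GB of ram wants buckets at
     least that big.)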

  On disk format changes:

   - 1.21: cached backpointers (scalability improvement)

     Cached replicas now get backpointers, which means we no longer rely
     on incrementing bucket generation numbers to invalidate cached
     data: this lets us get rid of the bucket generation number garbage
     collection, which had to periodically rescan all extents to
     recompute bucket oldest_gen.

     Bucket generation numbers are now only used as a consistency check,
     but they're quite useful for that.
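
     A simplified, self-contained model of what this enables (the real
     code is the invalidate_one_bucket_by_bps()/invalidate_one_bp()
     changes further down in this diff; the structs here are illustrative
     stand-ins, not bcachefs types):

       #include <stdio.h>

       /* illustrative stand-ins for the on-disk structures */
       struct extent { unsigned dev; int cached; };
       struct backpointer { unsigned char bucket_gen; struct extent *extent; };

       /*
        * To invalidate a bucket's cached data, walk that bucket's
        * backpointers and drop this device's cached pointer from each
        * extent they reference - no periodic "rescan every extent to
        * recompute oldest_gen" pass needed.
        */
       static void invalidate_bucket(struct backpointer *bps, size_t nr,
                                     unsigned dev, unsigned char gen)
       {
               for (size_t i = 0; i < nr; i++) {
                       if (bps[i].bucket_gen != gen)
                               continue;       /* stale backpointer */
                       if (bps[i].extent->dev == dev && bps[i].extent->cached)
                               bps[i].extent->dev = (unsigned) -1; /* dropped */
               }
       }

       int main(void)
       {
               struct extent e = { .dev = 3, .cached = 1 };
               struct backpointer bp = { .bucket_gen = 7, .extent = &e };

               invalidate_bucket(&bp, 1, 3, 7);
               printf("extent dev after invalidate: %u\n", e.dev);
               return 0;
       }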

   - 1.22: stripe backpointers

     Stripes now have backpointers: erasure coded stripes have their own
     checksums, separate from the checksums for the extents they contain
     (and stripe checksums also cover the parity blocks). This is
     required for implementing scrub for stripes.
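
     (Mechanically - as can be seen in the bch2_extent_ptr_to_bp() change
     further down in this diff - a stripe's backpointer is keyed at the
     end of the bucket range covered by the stripe pointer, so it can't
     collide with the backpointers of the extents stored within the
     stripe.)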

   - 1.23: stripe lru (scalability improvement)

     Persistent lru for stripes, ordered by "number of empty blocks".
     This is used by the stripe creation path, which, depending on free
     space, may create a new stripe out of a partially empty existing
     stripe instead of starting a brand new one.

     This replaces an in-memory heap, and means we no longer have to
     read in the stripes btree at startup.
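
     A tiny illustrative model of the approach (the key layout here is
     made up for illustration; the real implementation is a persistent
     btree keyed so that iteration order reflects the empty-block count,
     not an array plus comparator):

       #include <stdio.h>

       /*
        * Model: a persistent LRU is just a sorted key space where the
        * sort key embeds "number of empty blocks", so the stripe
        * creation path can scan it in order instead of keeping an
        * in-memory heap of every stripe.
        */
       struct stripe_lru_key {
               unsigned        nr_empty_blocks;
               unsigned long   stripe_idx;
       };

       static int lru_key_cmp(const struct stripe_lru_key *l,
                              const struct stripe_lru_key *r)
       {
               if (l->nr_empty_blocks != r->nr_empty_blocks)
                       return l->nr_empty_blocks < r->nr_empty_blocks ? -1 : 1;
               return (l->stripe_idx > r->stripe_idx) -
                      (l->stripe_idx < r->stripe_idx);
       }

       int main(void)
       {
               struct stripe_lru_key a = { .nr_empty_blocks = 3, .stripe_idx = 10 };
               struct stripe_lru_key b = { .nr_empty_blocks = 5, .stripe_idx = 2 };

               /* b sorts after a: it has more empty blocks to reuse */
               printf("cmp(a, b) = %d\n", lru_key_cmp(&a, &b));
               return 0;
       }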

   - 1.24: casefolding

     Case insensitive directory support, courtesy of Valve.

     This is an incompatible feature; to enable it, mount with
       -o version_upgrade=incompatible
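
     Once upgraded, individual directories are flagged for casefolded
     lookups with chattr +F (FS_CASEFOLD_FL), as described in the new
     casefolding document added by this series. The same thing from C,
     using the standard VFS flag ioctls (note: filesystems generally only
     allow setting this flag on an empty directory):

       #include <fcntl.h>
       #include <linux/fs.h>   /* FS_IOC_GETFLAGS/SETFLAGS, FS_CASEFOLD_FL */
       #include <stdio.h>
       #include <sys/ioctl.h>
       #include <unistd.h>

       /* equivalent of 'chattr +F <dir>': set the casefold attribute */
       int main(int argc, char **argv)
       {
               if (argc != 2) {
                       fprintf(stderr, "usage: %s <directory>\n", argv[0]);
                       return 1;
               }

               int fd = open(argv[1], O_RDONLY | O_DIRECTORY);
               if (fd < 0) {
                       perror("open");
                       return 1;
               }

               int attr;
               if (ioctl(fd, FS_IOC_GETFLAGS, &attr)) {
                       perror("FS_IOC_GETFLAGS");
                       return 1;
               }

               attr |= FS_CASEFOLD_FL;
               if (ioctl(fd, FS_IOC_SETFLAGS, &attr)) {
                       perror("FS_IOC_SETFLAGS");
                       return 1;
               }

               close(fd);
               return 0;
       }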

   - 1.25: extent_flags

     Another incompatible feature requiring explicit opt-in to enable.

     This adds a flags entry to extents, and a flag bit that marks
     extents as poisoned.

     A poisoned extent is an extent that was unreadable due to checksum
     errors. We can't move such extents without giving them a new
     checksum, and we may have to move them (e.g. for copygc or device
     evacuate). We also don't want to delete them: in the future we'll
     have an API that lets userspace ignore checksum errors and attempt
     to deal with simple bitrot itself. Marking them as poisoned lets us
     continue to return the correct error to userspace on normal read
     calls.
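
     A minimal behavioural sketch of the idea (illustrative only: the
     flag name, bit value and error code here are assumptions, not the
     real extent_flags encoding):

       #include <errno.h>
       #include <stdio.h>

       /* illustrative only - not the real bcachefs extent_flags layout */
       #define EXTENT_FLAG_POISONED    (1u << 0)

       struct extent { unsigned flags; };

       /* normal reads keep reporting the original failure to userspace */
       static int read_extent(const struct extent *e)
       {
               return e->flags & EXTENT_FLAG_POISONED ? -EIO : 0;
       }

       /*
        * Move paths (copygc, device evacuate) may relocate a poisoned
        * extent - the data gets a fresh checksum at its new location,
        * but the poisoned bit travels with it so reads still fail.
        */
       static void move_extent(struct extent *dst, const struct extent *src)
       {
               dst->flags = src->flags;        /* poisoned bit preserved */
       }

       int main(void)
       {
               struct extent bad = { .flags = EXTENT_FLAG_POISONED }, moved;

               move_extent(&moved, &bad);
               printf("read after move: %d\n", read_extent(&moved));
               return 0;
       }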

  Other changes/features:

   - BCH_IOCTL_QUERY_COUNTERS: this is used by the new 'bcachefs fs top'
     command, which shows a live view of all internal filesystem
     counters.
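
     A hedged sketch of what a userspace caller could look like (the
     struct and ioctl number mirror the bcachefs_ioctl.h additions
     further down in this diff; how the filesystem fd is obtained and the
     exact fill-in semantics of nr/d[] are assumptions here - the new
     'bcachefs fs top' in bcachefs-tools is the reference user):

       #include <fcntl.h>
       #include <linux/types.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <sys/ioctl.h>
       #include <unistd.h>

       /* mirrors the new definitions in bcachefs_ioctl.h */
       struct bch_ioctl_query_counters {
               __u16   nr;
               __u16   flags;
               __u32   pad;
               __u64   d[];
       };
       #define BCH_IOCTL_QUERY_COUNTERS \
               _IOW(0xbc, 21, struct bch_ioctl_query_counters)

       int main(int argc, char **argv)
       {
               /* assumption: an fd that accepts bcachefs fs-level ioctls;
                * bcachefs-tools resolves this from the mount path */
               int fd = open(argc > 1 ? argv[1] : ".", O_RDONLY);
               if (fd < 0) {
                       perror("open");
                       return 1;
               }

               const unsigned nr = 64;         /* arbitrary buffer size */
               struct bch_ioctl_query_counters *q =
                       calloc(1, sizeof(*q) + nr * sizeof(q->d[0]));
               if (!q)
                       return 1;
               q->nr = nr;

               if (ioctl(fd, BCH_IOCTL_QUERY_COUNTERS, q)) {
                       perror("BCH_IOCTL_QUERY_COUNTERS");
                       return 1;
               }

               for (unsigned i = 0; i < q->nr; i++)
                       printf("counter %u: %llu\n", i, (unsigned long long) q->d[i]);

               free(q);
               close(fd);
               return 0;
       }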

   - Improved journal pipelining: we can now have 16 journal writes in
     flight concurrently, up from 4. We're logging significantly more to
     the journal than we used to with all the recent disk accounting
     changes and additions, so some users should see a performance
     increase on some workloads.

   - BCH_MEMBER_STATE_failed: previously, we would do no IO at all to
     devices marked as failed. Now we will attempt to read from them,
     but only if we have no better options.

   - New option, write_error_timeout: devices will be kicked out of the
     filesystem if all writes to them have been failing for the
     configured number of seconds.

     We now also kick devices out when notified by blk_holder_ops that
     they've gone offline.

   - Device option handling improvements: the discard option should now
     be working as expected (additionally, in -tools, all device options
     that can be set at format time can now be set at device add time,
     e.g. data_allowed, state).

   - We now try harder to read data after a checksum error: we'll do
     additional retries to a device, if necessary, after it gave us data
     with a checksum error.

   - More self healing work: the full inode <-> dirent consistency
     checks that are currently run by fsck are now also run every time
     we do a lookup, meaning we'll be able to correct errors at runtime.
     Runtime self healing will be flipped on after the new changes have
     seen more testing; currently, they just check for consistency.

   - KMSAN fixes: our KMSAN builds should be nearly clean now, which
     will put a massive dent in the syzbot dashboard"

* tag 'bcachefs-2025-03-24' of git://evilpiepirate.org/bcachefs: (180 commits)
  bcachefs: Kill unnecessary bch2_dev_usage_read()
  bcachefs: btree node write errors now print btree node
  bcachefs: Fix race in print_chain()
  bcachefs: btree_trans_restart_foreign_task()
  bcachefs: bch2_disk_accounting_mod2()
  bcachefs: zero init journal bios
  bcachefs: Eliminate padding in move_bucket_key
  bcachefs: Fix a KMSAN splat in btree_update_nodes_written()
  bcachefs: kmsan asserts
  bcachefs: Fix kmsan warnings in bch2_extent_crc_pack()
  bcachefs: Disable asm memcpys when kmsan enabled
  bcachefs: Handle backpointers with unknown data types
  bcachefs: Count BCH_DATA_parity backpointers correctly
  bcachefs: Run bch2_check_dirent_target() at lookup time
  bcachefs: Refactor bch2_check_dirent_target()
  bcachefs: Move bch2_check_dirent_target() to namei.c
  bcachefs: fs-common.c -> namei.c
  bcachefs: EIO cleanup
  bcachefs: bch2_write_prep_encoded_data() now returns errcode
  bcachefs: Simplify bch2_write_op_error()
  ...
Merge commit 4a4b30ea80 by Linus Torvalds, 2025-03-27 13:20:07 -07:00
116 changed files with 4880 additions and 3008 deletions

View File

@ -1,8 +1,13 @@
Submitting patches to bcachefs:
===============================
Submitting patches to bcachefs
==============================
Here are suggestions for submitting patches to bcachefs subsystem.
Submission checklist
--------------------
Patches must be tested before being submitted, either with the xfstests suite
[0], or the full bcachefs test suite in ktest [1], depending on what's being
[0]_, or the full bcachefs test suite in ktest [1]_, depending on what's being
touched. Note that ktest wraps xfstests and will be an easier method of running
it for most users; it includes single-command wrappers for all the mainstream
in-kernel local filesystems.
@ -26,21 +31,21 @@ considered out of date), but try not to deviate too much without reason.
Focus on writing code that reads well and is organized well; code should be
aesthetically pleasing.
CI:
===
CI
--
Instead of running your tests locally, when running the full test suite it's
preferable to let a server farm do it in parallel, and then have the results
in a nice test dashboard (which can tell you which failures are new, and
presents results in a git log view, avoiding the need for most bisecting).
That exists [2], and community members may request an account. If you work for
That exists [2]_, and community members may request an account. If you work for
a big tech company, you'll need to help out with server costs to get access -
but the CI is not restricted to running bcachefs tests: it runs any ktest test
(which generally makes it easy to wrap other tests that can run in qemu).
Other things to think about:
============================
Other things to think about
---------------------------
- How will we debug this code? Is there sufficient introspection to diagnose
when something starts acting wonky on a user machine?
@ -79,20 +84,22 @@ Other things to think about:
tested? (Automated tests exists but aren't in the CI, due to the hassle of
disk image management; coordinate to have them run.)
Mailing list, IRC:
==================
Mailing list, IRC
-----------------
Patches should hit the list [3], but much discussion and code review happens on
IRC as well [4]; many people appreciate the more conversational approach and
quicker feedback.
Patches should hit the list [3]_, but much discussion and code review happens
on IRC as well [4]_; many people appreciate the more conversational approach
and quicker feedback.
Additionally, we have a lively user community doing excellent QA work, which
exists primarily on IRC. Please make use of that resource; user feedback is
important for any nontrivial feature, and documenting it in commit messages
would be a good idea.
[0]: git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
[1]: https://evilpiepirate.org/git/ktest.git/
[2]: https://evilpiepirate.org/~testdashboard/ci/
[3]: linux-bcachefs@vger.kernel.org
[4]: irc.oftc.net#bcache, #bcachefs-dev
.. rubric:: References
.. [0] git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
.. [1] https://evilpiepirate.org/git/ktest.git/
.. [2] https://evilpiepirate.org/~testdashboard/ci/
.. [3] linux-bcachefs@vger.kernel.org
.. [4] irc.oftc.net#bcache, #bcachefs-dev

View File

@ -0,0 +1,90 @@
.. SPDX-License-Identifier: GPL-2.0
Casefolding
===========
bcachefs has support for case-insensitive file and directory
lookups using the regular `chattr +F` (`S_CASEFOLD`, `FS_CASEFOLD_FL`)
casefolding attributes.
The main usecase for casefolding is compatibility with software written
against other filesystems that rely on casefolded lookups
(eg. NTFS and Wine/Proton).
Taking advantage of file-system level casefolding can lead to great
loading time gains in many applications and games.
Casefolding support requires a kernel with `CONFIG_UNICODE` enabled.
Once a directory has been flagged for casefolding, a feature bit
is enabled on the superblock which marks the filesystem as using
casefolding.
When the feature bit for casefolding is enabled, it is no longer possible
to mount that filesystem on kernels without `CONFIG_UNICODE` enabled.
On the lookup/query side: casefolding is implemented by allocating a new
string of `BCH_NAME_MAX` length using the `utf8_casefold` function to
casefold the query string.
On the dirent side: casefolding is implemented by ensuring the `bkey`'s
hash is made from the casefolded string and storing the cached casefolded
name with the regular name in the dirent.
The structure looks like this:
* Regular: [dirent data][regular name][nul][nul]...
* Casefolded: [dirent data][reg len][cf len][regular name][casefolded name][nul][nul]...
(Do note, the number of NULs here is merely for illustration; their count can
vary per-key, and they may not even be present if the key is aligned to
`sizeof(u64)`.)
This is efficient, as it means that all file lookups requiring casefolding
have identical performance to a regular lookup:
a hash comparison and a `memcmp` of the name.
Rationale
---------
Several designs were considered for this system:
One was to introduce a dirent_v2, however that would be painful especially as
the hash system only has support for a single key type. This would also need
`BCH_NAME_MAX` to change between versions, and a new feature bit.
Another option was to store the names without the two lengths, and simply take
half of the combined length of the regular and casefolded names as each length.
This would assume that the regular length == casefolded length, but that could
potentially not be true, if the uppercase unicode glyph had a different UTF-8
encoding than the lowercase unicode glyph.
It would be possible to disregard the casefold cache for those cases, but it was
decided to simply encode the two string lengths in the key to avoid random
performance issues if this edgecase was ever hit.
The option settled on was to use a free bit in d_type to mark a dirent as having
a casefold cache, and then treat the first 4 bytes of the name block as lengths.
You can see this in the `d_cf_name_block` member of union in `bch_dirent`.
The feature bit was used to allow casefolding support to be enabled for the
majority of users, while still allowing users who have no need for the feature
to use bcachefs without it: `CONFIG_UNICODE` can increase the kernel size by a
significant amount due to the tables used, which may be the deciding factor in
whether to use bcachefs on eg. embedded platforms.
Other filesystems like ext4 and f2fs have a super-block level option for the
casefolding encoding, but bcachefs currently does not provide this. ext4 and
f2fs do not expose any encodings other than a single UTF-8 version. When future
encodings are desirable, they will be added trivially using the opts mechanism.
dentry/dcache considerations
----------------------------
Currently, in casefolded directories, bcachefs (like other filesystems) will not
cache negative dentries.
This is because currently doing so presents a problem in the following scenario:
- Lookup file "blAH" in a casefolded directory
- Creation of file "BLAH" in a casefolded directory
- Lookup file "blAH" in a casefolded directory
This would fail if negative dentries were cached.
This is slightly suboptimal, but could be fixed in future with some vfs work.

View File

@ -4,10 +4,28 @@
bcachefs Documentation
======================
Subsystem-specific development process notes
--------------------------------------------
Development notes specific to bcachefs. These are intended to supplement
:doc:`general kernel development handbook </process/index>`.
.. toctree::
:maxdepth: 2
:maxdepth: 1
:numbered:
CodingStyle
SubmittingPatches
Filesystem implementation
-------------------------
Documentation for filesystem features and their implementation details.
At this moment, only a few of these are described here.
.. toctree::
:maxdepth: 1
:numbered:
casefolding
errorcodes

View File

@ -16,7 +16,7 @@ config BCACHEFS_FS
select ZSTD_COMPRESS
select ZSTD_DECOMPRESS
select CRYPTO
select CRYPTO_SHA256
select CRYPTO_LIB_SHA256
select CRYPTO_CHACHA20
select CRYPTO_POLY1305
select KEYS

View File

@ -41,7 +41,6 @@ bcachefs-y := \
extent_update.o \
eytzinger.o \
fs.o \
fs-common.o \
fs-ioctl.o \
fs-io.o \
fs-io-buffered.o \
@ -64,9 +63,11 @@ bcachefs-y := \
migrate.o \
move.o \
movinggc.o \
namei.o \
nocow_locking.o \
opts.o \
printbuf.o \
progress.o \
quota.o \
rebalance.o \
rcu_pending.o \

View File

@ -232,7 +232,7 @@ int bch2_alloc_v3_validate(struct bch_fs *c, struct bkey_s_c k,
int ret = 0;
bkey_fsck_err_on(bch2_alloc_unpack_v3(&u, k),
c, alloc_v2_unpack_error,
c, alloc_v3_unpack_error,
"unpack error");
fsck_err:
return ret;
@ -777,14 +777,12 @@ static inline int bch2_dev_data_type_accounting_mod(struct btree_trans *trans, s
s64 delta_sectors,
s64 delta_fragmented, unsigned flags)
{
struct disk_accounting_pos acc = {
.type = BCH_DISK_ACCOUNTING_dev_data_type,
.dev_data_type.dev = ca->dev_idx,
.dev_data_type.data_type = data_type,
};
s64 d[3] = { delta_buckets, delta_sectors, delta_fragmented };
return bch2_disk_accounting_mod(trans, &acc, d, 3, flags & BTREE_TRIGGER_gc);
return bch2_disk_accounting_mod2(trans, flags & BTREE_TRIGGER_gc,
d, dev_data_type,
.dev = ca->dev_idx,
.data_type = data_type);
}
int bch2_alloc_key_to_dev_counters(struct btree_trans *trans, struct bch_dev *ca,
@ -837,7 +835,7 @@ int bch2_trigger_alloc(struct btree_trans *trans,
struct bch_dev *ca = bch2_dev_bucket_tryget(c, new.k->p);
if (!ca)
return -EIO;
return -BCH_ERR_trigger_alloc;
struct bch_alloc_v4 old_a_convert;
const struct bch_alloc_v4 *old_a = bch2_alloc_to_v4(old, &old_a_convert);
@ -871,6 +869,9 @@ int bch2_trigger_alloc(struct btree_trans *trans,
if (data_type_is_empty(new_a->data_type) &&
BCH_ALLOC_V4_NEED_INC_GEN(new_a) &&
!bch2_bucket_is_open_safe(c, new.k->p.inode, new.k->p.offset)) {
if (new_a->oldest_gen == new_a->gen &&
!bch2_bucket_sectors_total(*new_a))
new_a->oldest_gen++;
new_a->gen++;
SET_BCH_ALLOC_V4_NEED_INC_GEN(new_a, false);
alloc_data_type_set(new_a, new_a->data_type);
@ -889,26 +890,20 @@ int bch2_trigger_alloc(struct btree_trans *trans,
!new_a->io_time[READ])
new_a->io_time[READ] = bch2_current_io_time(c, READ);
u64 old_lru = alloc_lru_idx_read(*old_a);
u64 new_lru = alloc_lru_idx_read(*new_a);
if (old_lru != new_lru) {
ret = bch2_lru_change(trans, new.k->p.inode,
bucket_to_u64(new.k->p),
old_lru, new_lru);
if (ret)
goto err;
}
ret = bch2_lru_change(trans, new.k->p.inode,
bucket_to_u64(new.k->p),
alloc_lru_idx_read(*old_a),
alloc_lru_idx_read(*new_a));
if (ret)
goto err;
old_lru = alloc_lru_idx_fragmentation(*old_a, ca);
new_lru = alloc_lru_idx_fragmentation(*new_a, ca);
if (old_lru != new_lru) {
ret = bch2_lru_change(trans,
BCH_LRU_FRAGMENTATION_START,
bucket_to_u64(new.k->p),
old_lru, new_lru);
if (ret)
goto err;
}
ret = bch2_lru_change(trans,
BCH_LRU_BUCKET_FRAGMENTATION,
bucket_to_u64(new.k->p),
alloc_lru_idx_fragmentation(*old_a, ca),
alloc_lru_idx_fragmentation(*new_a, ca));
if (ret)
goto err;
if (old_a->gen != new_a->gen) {
ret = bch2_bucket_gen_update(trans, new.k->p, new_a->gen);
@ -1034,7 +1029,7 @@ fsck_err:
invalid_bucket:
bch2_fs_inconsistent(c, "reference to invalid bucket\n %s",
(bch2_bkey_val_to_text(&buf, c, new.s_c), buf.buf));
ret = -EIO;
ret = -BCH_ERR_trigger_alloc;
goto err;
}
@ -1705,7 +1700,8 @@ static int bch2_check_alloc_to_lru_ref(struct btree_trans *trans,
u64 lru_idx = alloc_lru_idx_fragmentation(*a, ca);
if (lru_idx) {
ret = bch2_lru_check_set(trans, BCH_LRU_FRAGMENTATION_START,
ret = bch2_lru_check_set(trans, BCH_LRU_BUCKET_FRAGMENTATION,
bucket_to_u64(alloc_k.k->p),
lru_idx, alloc_k, last_flushed);
if (ret)
goto err;
@ -1735,7 +1731,9 @@ static int bch2_check_alloc_to_lru_ref(struct btree_trans *trans,
a = &a_mut->v;
}
ret = bch2_lru_check_set(trans, alloc_k.k->p.inode, a->io_time[READ],
ret = bch2_lru_check_set(trans, alloc_k.k->p.inode,
bucket_to_u64(alloc_k.k->p),
a->io_time[READ],
alloc_k, last_flushed);
if (ret)
goto err;
@ -1757,7 +1755,8 @@ int bch2_check_alloc_to_lru_refs(struct bch_fs *c)
for_each_btree_key_commit(trans, iter, BTREE_ID_alloc,
POS_MIN, BTREE_ITER_prefetch, k,
NULL, NULL, BCH_TRANS_COMMIT_no_enospc,
bch2_check_alloc_to_lru_ref(trans, &iter, &last_flushed)));
bch2_check_alloc_to_lru_ref(trans, &iter, &last_flushed))) ?:
bch2_check_stripe_to_lru_refs(c);
bch2_bkey_buf_exit(&last_flushed, c);
bch_err_fn(c, ret);
@ -1805,6 +1804,19 @@ struct discard_buckets_state {
u64 discarded;
};
/*
* This is needed because discard is both a filesystem option and a device
* option, and mount options are supposed to apply to that mount and not be
* persisted, i.e. if it's set as a mount option we can't propagate it to the
* device.
*/
static inline bool discard_opt_enabled(struct bch_fs *c, struct bch_dev *ca)
{
return test_bit(BCH_FS_discard_mount_opt_set, &c->flags)
? c->opts.discard
: ca->mi.discard;
}
static int bch2_discard_one_bucket(struct btree_trans *trans,
struct bch_dev *ca,
struct btree_iter *need_discard_iter,
@ -1868,7 +1880,7 @@ static int bch2_discard_one_bucket(struct btree_trans *trans,
s->discarded++;
*discard_pos_done = iter.pos;
if (ca->mi.discard && !c->opts.nochanges) {
if (discard_opt_enabled(c, ca) && !c->opts.nochanges) {
/*
* This works without any other locks because this is the only
* thread that removes items from the need_discard tree
@ -1897,7 +1909,10 @@ commit:
if (ret)
goto out;
count_event(c, bucket_discard);
if (!fastpath)
count_event(c, bucket_discard);
else
count_event(c, bucket_discard_fast);
out:
fsck_err:
if (discard_locked)
@ -2055,16 +2070,71 @@ put_ref:
bch2_write_ref_put(c, BCH_WRITE_REF_discard_fast);
}
static int invalidate_one_bp(struct btree_trans *trans,
struct bch_dev *ca,
struct bkey_s_c_backpointer bp,
struct bkey_buf *last_flushed)
{
struct btree_iter extent_iter;
struct bkey_s_c extent_k =
bch2_backpointer_get_key(trans, bp, &extent_iter, 0, last_flushed);
int ret = bkey_err(extent_k);
if (ret)
return ret;
struct bkey_i *n =
bch2_bkey_make_mut(trans, &extent_iter, &extent_k,
BTREE_UPDATE_internal_snapshot_node);
ret = PTR_ERR_OR_ZERO(n);
if (ret)
goto err;
bch2_bkey_drop_device(bkey_i_to_s(n), ca->dev_idx);
err:
bch2_trans_iter_exit(trans, &extent_iter);
return ret;
}
static int invalidate_one_bucket_by_bps(struct btree_trans *trans,
struct bch_dev *ca,
struct bpos bucket,
u8 gen,
struct bkey_buf *last_flushed)
{
struct bpos bp_start = bucket_pos_to_bp_start(ca, bucket);
struct bpos bp_end = bucket_pos_to_bp_end(ca, bucket);
return for_each_btree_key_max_commit(trans, iter, BTREE_ID_backpointers,
bp_start, bp_end, 0, k,
NULL, NULL,
BCH_WATERMARK_btree|
BCH_TRANS_COMMIT_no_enospc, ({
if (k.k->type != KEY_TYPE_backpointer)
continue;
struct bkey_s_c_backpointer bp = bkey_s_c_to_backpointer(k);
if (bp.v->bucket_gen != gen)
continue;
/* filter out bps with gens that don't match */
invalidate_one_bp(trans, ca, bp, last_flushed);
}));
}
noinline_for_stack
static int invalidate_one_bucket(struct btree_trans *trans,
struct bch_dev *ca,
struct btree_iter *lru_iter,
struct bkey_s_c lru_k,
struct bkey_buf *last_flushed,
s64 *nr_to_invalidate)
{
struct bch_fs *c = trans->c;
struct bkey_i_alloc_v4 *a = NULL;
struct printbuf buf = PRINTBUF;
struct bpos bucket = u64_to_bucket(lru_k.k->p.offset);
unsigned cached_sectors;
struct btree_iter alloc_iter = {};
int ret = 0;
if (*nr_to_invalidate <= 0)
@ -2081,35 +2151,37 @@ static int invalidate_one_bucket(struct btree_trans *trans,
if (bch2_bucket_is_open_safe(c, bucket.inode, bucket.offset))
return 0;
a = bch2_trans_start_alloc_update(trans, bucket, BTREE_TRIGGER_bucket_invalidate);
ret = PTR_ERR_OR_ZERO(a);
struct bkey_s_c alloc_k = bch2_bkey_get_iter(trans, &alloc_iter,
BTREE_ID_alloc, bucket,
BTREE_ITER_cached);
ret = bkey_err(alloc_k);
if (ret)
goto out;
return ret;
struct bch_alloc_v4 a_convert;
const struct bch_alloc_v4 *a = bch2_alloc_to_v4(alloc_k, &a_convert);
/* We expect harmless races here due to the btree write buffer: */
if (lru_pos_time(lru_iter->pos) != alloc_lru_idx_read(a->v))
if (lru_pos_time(lru_iter->pos) != alloc_lru_idx_read(*a))
goto out;
BUG_ON(a->v.data_type != BCH_DATA_cached);
BUG_ON(a->v.dirty_sectors);
/*
* Impossible since alloc_lru_idx_read() only returns nonzero if the
* bucket is supposed to be on the cached bucket LRU (i.e.
* BCH_DATA_cached)
*
* bch2_lru_validate() also disallows lru keys with lru_pos_time() == 0
*/
BUG_ON(a->data_type != BCH_DATA_cached);
BUG_ON(a->dirty_sectors);
if (!a->v.cached_sectors)
if (!a->cached_sectors)
bch_err(c, "invalidating empty bucket, confused");
cached_sectors = a->v.cached_sectors;
unsigned cached_sectors = a->cached_sectors;
u8 gen = a->gen;
SET_BCH_ALLOC_V4_NEED_INC_GEN(&a->v, false);
a->v.gen++;
a->v.data_type = 0;
a->v.dirty_sectors = 0;
a->v.stripe_sectors = 0;
a->v.cached_sectors = 0;
a->v.io_time[READ] = bch2_current_io_time(c, READ);
a->v.io_time[WRITE] = bch2_current_io_time(c, WRITE);
ret = bch2_trans_commit(trans, NULL, NULL,
BCH_WATERMARK_btree|
BCH_TRANS_COMMIT_no_enospc);
ret = invalidate_one_bucket_by_bps(trans, ca, bucket, gen, last_flushed);
if (ret)
goto out;
@ -2117,6 +2189,7 @@ static int invalidate_one_bucket(struct btree_trans *trans,
--*nr_to_invalidate;
out:
fsck_err:
bch2_trans_iter_exit(trans, &alloc_iter);
printbuf_exit(&buf);
return ret;
}
@ -2143,6 +2216,10 @@ static void bch2_do_invalidates_work(struct work_struct *work)
struct btree_trans *trans = bch2_trans_get(c);
int ret = 0;
struct bkey_buf last_flushed;
bch2_bkey_buf_init(&last_flushed);
bkey_init(&last_flushed.k->k);
ret = bch2_btree_write_buffer_tryflush(trans);
if (ret)
goto err;
@ -2167,7 +2244,7 @@ static void bch2_do_invalidates_work(struct work_struct *work)
if (!k.k)
break;
ret = invalidate_one_bucket(trans, &iter, k, &nr_to_invalidate);
ret = invalidate_one_bucket(trans, ca, &iter, k, &last_flushed, &nr_to_invalidate);
restart_err:
if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
continue;
@ -2180,6 +2257,7 @@ restart_err:
err:
bch2_trans_put(trans);
percpu_ref_put(&ca->io_ref);
bch2_bkey_buf_exit(&last_flushed, c);
bch2_write_ref_put(c, BCH_WRITE_REF_invalidate);
}

View File

@ -131,7 +131,7 @@ static inline enum bch_data_type alloc_data_type(struct bch_alloc_v4 a,
if (a.stripe)
return data_type == BCH_DATA_parity ? data_type : BCH_DATA_stripe;
if (bch2_bucket_sectors_dirty(a))
return data_type;
return bucket_data_type(data_type);
if (a.cached_sectors)
return BCH_DATA_cached;
if (BCH_ALLOC_V4_NEED_DISCARD(&a))

View File

@ -127,14 +127,14 @@ void __bch2_open_bucket_put(struct bch_fs *c, struct open_bucket *ob)
void bch2_open_bucket_write_error(struct bch_fs *c,
struct open_buckets *obs,
unsigned dev)
unsigned dev, int err)
{
struct open_bucket *ob;
unsigned i;
open_bucket_for_each(c, obs, ob, i)
if (ob->dev == dev && ob->ec)
bch2_ec_bucket_cancel(c, ob);
bch2_ec_bucket_cancel(c, ob, err);
}
static struct open_bucket *bch2_open_bucket_alloc(struct bch_fs *c)
@ -179,23 +179,6 @@ static void open_bucket_free_unused(struct bch_fs *c, struct open_bucket *ob)
closure_wake_up(&c->freelist_wait);
}
static inline unsigned open_buckets_reserved(enum bch_watermark watermark)
{
switch (watermark) {
case BCH_WATERMARK_interior_updates:
return 0;
case BCH_WATERMARK_reclaim:
return OPEN_BUCKETS_COUNT / 6;
case BCH_WATERMARK_btree:
case BCH_WATERMARK_btree_copygc:
return OPEN_BUCKETS_COUNT / 4;
case BCH_WATERMARK_copygc:
return OPEN_BUCKETS_COUNT / 3;
default:
return OPEN_BUCKETS_COUNT / 2;
}
}
static inline bool may_alloc_bucket(struct bch_fs *c,
struct bpos bucket,
struct bucket_alloc_state *s)
@ -239,7 +222,7 @@ static struct open_bucket *__try_alloc_bucket(struct bch_fs *c, struct bch_dev *
spin_lock(&c->freelist_lock);
if (unlikely(c->open_buckets_nr_free <= open_buckets_reserved(watermark))) {
if (unlikely(c->open_buckets_nr_free <= bch2_open_buckets_reserved(watermark))) {
if (cl)
closure_wait(&c->open_buckets_wait, cl);
@ -648,7 +631,7 @@ static inline void bch2_dev_stripe_increment_inlined(struct bch_dev *ca,
struct bch_dev_usage *usage)
{
u64 *v = stripe->next_alloc + ca->dev_idx;
u64 free_space = dev_buckets_available(ca, BCH_WATERMARK_normal);
u64 free_space = __dev_buckets_available(ca, *usage, BCH_WATERMARK_normal);
u64 free_space_inv = free_space
? div64_u64(1ULL << 48, free_space)
: 1ULL << 48;
@ -728,7 +711,7 @@ int bch2_bucket_alloc_set_trans(struct btree_trans *trans,
struct bch_dev_usage usage;
struct open_bucket *ob = bch2_bucket_alloc_trans(trans, ca, watermark, data_type,
cl, flags & BCH_WRITE_ALLOC_NOWAIT, &usage);
cl, flags & BCH_WRITE_alloc_nowait, &usage);
if (!IS_ERR(ob))
bch2_dev_stripe_increment_inlined(ca, stripe, &usage);
bch2_dev_put(ca);
@ -1336,7 +1319,7 @@ retry:
if (wp->data_type != BCH_DATA_user)
have_cache = true;
if (target && !(flags & BCH_WRITE_ONLY_SPECIFIED_DEVS)) {
if (target && !(flags & BCH_WRITE_only_specified_devs)) {
ret = open_bucket_add_buckets(trans, &ptrs, wp, devs_have,
target, erasure_code,
nr_replicas, &nr_effective,
@ -1426,7 +1409,7 @@ err:
if (cl && bch2_err_matches(ret, BCH_ERR_open_buckets_empty))
ret = -BCH_ERR_bucket_alloc_blocked;
if (cl && !(flags & BCH_WRITE_ALLOC_NOWAIT) &&
if (cl && !(flags & BCH_WRITE_alloc_nowait) &&
bch2_err_matches(ret, BCH_ERR_freelist_empty))
ret = -BCH_ERR_bucket_alloc_blocked;

View File

@ -33,6 +33,23 @@ static inline struct bch_dev *ob_dev(struct bch_fs *c, struct open_bucket *ob)
return bch2_dev_have_ref(c, ob->dev);
}
static inline unsigned bch2_open_buckets_reserved(enum bch_watermark watermark)
{
switch (watermark) {
case BCH_WATERMARK_interior_updates:
return 0;
case BCH_WATERMARK_reclaim:
return OPEN_BUCKETS_COUNT / 6;
case BCH_WATERMARK_btree:
case BCH_WATERMARK_btree_copygc:
return OPEN_BUCKETS_COUNT / 4;
case BCH_WATERMARK_copygc:
return OPEN_BUCKETS_COUNT / 3;
default:
return OPEN_BUCKETS_COUNT / 2;
}
}
struct open_bucket *bch2_bucket_alloc(struct bch_fs *, struct bch_dev *,
enum bch_watermark, enum bch_data_type,
struct closure *);
@ -65,7 +82,7 @@ static inline struct open_bucket *ec_open_bucket(struct bch_fs *c,
}
void bch2_open_bucket_write_error(struct bch_fs *,
struct open_buckets *, unsigned);
struct open_buckets *, unsigned, int);
void __bch2_open_bucket_put(struct bch_fs *, struct open_bucket *);

View File

@ -90,6 +90,7 @@ struct dev_stripe_state {
x(stopped) \
x(waiting_io) \
x(waiting_work) \
x(runnable) \
x(running)
enum write_point_state {
@ -125,6 +126,7 @@ struct write_point {
enum write_point_state state;
u64 last_state_change;
u64 time[WRITE_POINT_STATE_NR];
u64 last_runtime;
} __aligned(SMP_CACHE_BYTES);
};

View File

@ -11,6 +11,7 @@
#include "checksum.h"
#include "disk_accounting.h"
#include "error.h"
#include "progress.h"
#include <linux/mm.h>
@ -49,6 +50,8 @@ void bch2_backpointer_to_text(struct printbuf *out, struct bch_fs *c, struct bke
}
bch2_btree_id_level_to_text(out, bp.v->btree_id, bp.v->level);
prt_str(out, " data_type=");
bch2_prt_data_type(out, bp.v->data_type);
prt_printf(out, " suboffset=%u len=%u gen=%u pos=",
(u32) bp.k->p.offset & ~(~0U << MAX_EXTENT_COMPRESS_RATIO_SHIFT),
bp.v->bucket_len,
@ -244,27 +247,31 @@ struct bkey_s_c bch2_backpointer_get_key(struct btree_trans *trans,
if (unlikely(bp.v->btree_id >= btree_id_nr_alive(c)))
return bkey_s_c_null;
if (likely(!bp.v->level)) {
bch2_trans_node_iter_init(trans, iter,
bp.v->btree_id,
bp.v->pos,
0, 0,
iter_flags);
struct bkey_s_c k = bch2_btree_iter_peek_slot(iter);
if (bkey_err(k)) {
bch2_trans_iter_exit(trans, iter);
return k;
}
if (k.k &&
extent_matches_bp(c, bp.v->btree_id, bp.v->level, k, bp))
return k;
bch2_trans_node_iter_init(trans, iter,
bp.v->btree_id,
bp.v->pos,
0,
bp.v->level,
iter_flags);
struct bkey_s_c k = bch2_btree_iter_peek_slot(iter);
if (bkey_err(k)) {
bch2_trans_iter_exit(trans, iter);
return k;
}
if (k.k &&
extent_matches_bp(c, bp.v->btree_id, bp.v->level, k, bp))
return k;
bch2_trans_iter_exit(trans, iter);
if (!bp.v->level) {
int ret = backpointer_target_not_found(trans, bp, k, last_flushed);
return ret ? bkey_s_c_err(ret) : bkey_s_c_null;
} else {
struct btree *b = bch2_backpointer_get_node(trans, bp, iter, last_flushed);
if (b == ERR_PTR(-BCH_ERR_backpointer_to_overwritten_btree_node))
return bkey_s_c_null;
if (IS_ERR_OR_NULL(b))
return ((struct bkey_s_c) { .k = ERR_CAST(b) });
@ -514,6 +521,22 @@ check_existing_bp:
if (!other_extent.k)
goto missing;
rcu_read_lock();
struct bch_dev *ca = bch2_dev_rcu_noerror(c, bp->k.p.inode);
if (ca) {
struct bkey_ptrs_c other_extent_ptrs = bch2_bkey_ptrs_c(other_extent);
bkey_for_each_ptr(other_extent_ptrs, ptr)
if (ptr->dev == bp->k.p.inode &&
dev_ptr_stale_rcu(ca, ptr)) {
ret = drop_dev_and_update(trans, other_bp.v->btree_id,
other_extent, bp->k.p.inode);
if (ret)
goto err;
goto out;
}
}
rcu_read_unlock();
if (bch2_extents_match(orig_k, other_extent)) {
printbuf_reset(&buf);
prt_printf(&buf, "duplicate versions of same extent, deleting smaller\n ");
@ -590,9 +613,6 @@ static int check_extent_to_backpointers(struct btree_trans *trans,
struct extent_ptr_decoded p;
bkey_for_each_ptr_decode(k.k, ptrs, p, entry) {
if (p.ptr.cached)
continue;
if (p.ptr.dev == BCH_SB_MEMBER_INVALID)
continue;
@ -600,9 +620,11 @@ static int check_extent_to_backpointers(struct btree_trans *trans,
struct bch_dev *ca = bch2_dev_rcu_noerror(c, p.ptr.dev);
bool check = ca && test_bit(PTR_BUCKET_NR(ca, &p.ptr), ca->bucket_backpointer_mismatches);
bool empty = ca && test_bit(PTR_BUCKET_NR(ca, &p.ptr), ca->bucket_backpointer_empty);
bool stale = p.ptr.cached && (!ca || dev_ptr_stale_rcu(ca, &p.ptr));
rcu_read_unlock();
if (check || empty) {
if ((check || empty) && !stale) {
struct bkey_i_backpointer bp;
bch2_extent_ptr_to_bp(c, btree, level, k, p, entry, &bp);
@ -715,71 +737,6 @@ static int bch2_get_btree_in_memory_pos(struct btree_trans *trans,
return ret;
}
struct progress_indicator_state {
unsigned long next_print;
u64 nodes_seen;
u64 nodes_total;
struct btree *last_node;
};
static inline void progress_init(struct progress_indicator_state *s,
struct bch_fs *c,
u64 btree_id_mask)
{
memset(s, 0, sizeof(*s));
s->next_print = jiffies + HZ * 10;
for (unsigned i = 0; i < BTREE_ID_NR; i++) {
if (!(btree_id_mask & BIT_ULL(i)))
continue;
struct disk_accounting_pos acc = {
.type = BCH_DISK_ACCOUNTING_btree,
.btree.id = i,
};
u64 v;
bch2_accounting_mem_read(c, disk_accounting_pos_to_bpos(&acc), &v, 1);
s->nodes_total += div64_ul(v, btree_sectors(c));
}
}
static inline bool progress_update_p(struct progress_indicator_state *s)
{
bool ret = time_after_eq(jiffies, s->next_print);
if (ret)
s->next_print = jiffies + HZ * 10;
return ret;
}
static void progress_update_iter(struct btree_trans *trans,
struct progress_indicator_state *s,
struct btree_iter *iter,
const char *msg)
{
struct bch_fs *c = trans->c;
struct btree *b = path_l(btree_iter_path(trans, iter))->b;
s->nodes_seen += b != s->last_node;
s->last_node = b;
if (progress_update_p(s)) {
struct printbuf buf = PRINTBUF;
unsigned percent = s->nodes_total
? div64_u64(s->nodes_seen * 100, s->nodes_total)
: 0;
prt_printf(&buf, "%s: %d%%, done %llu/%llu nodes, at ",
msg, percent, s->nodes_seen, s->nodes_total);
bch2_bbpos_to_text(&buf, BBPOS(iter->btree_id, iter->pos));
bch_info(c, "%s", buf.buf);
printbuf_exit(&buf);
}
}
static int bch2_check_extents_to_backpointers_pass(struct btree_trans *trans,
struct extents_to_bp_state *s)
{
@ -787,7 +744,7 @@ static int bch2_check_extents_to_backpointers_pass(struct btree_trans *trans,
struct progress_indicator_state progress;
int ret = 0;
progress_init(&progress, trans->c, BIT_ULL(BTREE_ID_extents)|BIT_ULL(BTREE_ID_reflink));
bch2_progress_init(&progress, trans->c, BIT_ULL(BTREE_ID_extents)|BIT_ULL(BTREE_ID_reflink));
for (enum btree_id btree_id = 0;
btree_id < btree_id_nr_alive(c);
@ -806,7 +763,7 @@ static int bch2_check_extents_to_backpointers_pass(struct btree_trans *trans,
BTREE_ITER_prefetch);
ret = for_each_btree_key_continue(trans, iter, 0, k, ({
progress_update_iter(trans, &progress, &iter, "extents_to_backpointers");
bch2_progress_update_iter(trans, &progress, &iter, "extents_to_backpointers");
check_extent_to_backpointers(trans, s, btree_id, level, k) ?:
bch2_trans_commit(trans, NULL, NULL, BCH_TRANS_COMMIT_no_enospc);
}));
@ -827,7 +784,7 @@ enum alloc_sector_counter {
ALLOC_SECTORS_NR
};
static enum alloc_sector_counter data_type_to_alloc_counter(enum bch_data_type t)
static int data_type_to_alloc_counter(enum bch_data_type t)
{
switch (t) {
case BCH_DATA_btree:
@ -836,9 +793,10 @@ static enum alloc_sector_counter data_type_to_alloc_counter(enum bch_data_type t
case BCH_DATA_cached:
return ALLOC_cached;
case BCH_DATA_stripe:
case BCH_DATA_parity:
return ALLOC_stripe;
default:
BUG();
return -1;
}
}
@ -889,7 +847,11 @@ static int check_bucket_backpointer_mismatch(struct btree_trans *trans, struct b
if (bp.v->bucket_gen != a->gen)
continue;
sectors[data_type_to_alloc_counter(bp.v->data_type)] += bp.v->bucket_len;
int alloc_counter = data_type_to_alloc_counter(bp.v->data_type);
if (alloc_counter < 0)
continue;
sectors[alloc_counter] += bp.v->bucket_len;
};
bch2_trans_iter_exit(trans, &iter);
if (ret)
@ -901,9 +863,8 @@ static int check_bucket_backpointer_mismatch(struct btree_trans *trans, struct b
goto err;
}
/* Cached pointers don't have backpointers: */
if (sectors[ALLOC_dirty] != a->dirty_sectors ||
sectors[ALLOC_cached] != a->cached_sectors ||
sectors[ALLOC_stripe] != a->stripe_sectors) {
if (c->sb.version_upgrade_complete >= bcachefs_metadata_version_backpointer_bucket_gen) {
ret = bch2_backpointers_maybe_flush(trans, alloc_k, last_flushed);
@ -912,6 +873,7 @@ static int check_bucket_backpointer_mismatch(struct btree_trans *trans, struct b
}
if (sectors[ALLOC_dirty] > a->dirty_sectors ||
sectors[ALLOC_cached] > a->cached_sectors ||
sectors[ALLOC_stripe] > a->stripe_sectors) {
ret = check_bucket_backpointers_to_extents(trans, ca, alloc_k.k->p) ?:
-BCH_ERR_transaction_restart_nested;
@ -919,7 +881,8 @@ static int check_bucket_backpointer_mismatch(struct btree_trans *trans, struct b
}
if (!sectors[ALLOC_dirty] &&
!sectors[ALLOC_stripe])
!sectors[ALLOC_stripe] &&
!sectors[ALLOC_cached])
__set_bit(alloc_k.k->p.offset, ca->bucket_backpointer_empty);
else
__set_bit(alloc_k.k->p.offset, ca->bucket_backpointer_mismatches);
@ -1206,11 +1169,11 @@ static int bch2_check_backpointers_to_extents_pass(struct btree_trans *trans,
bch2_bkey_buf_init(&last_flushed);
bkey_init(&last_flushed.k->k);
progress_init(&progress, trans->c, BIT_ULL(BTREE_ID_backpointers));
bch2_progress_init(&progress, trans->c, BIT_ULL(BTREE_ID_backpointers));
int ret = for_each_btree_key(trans, iter, BTREE_ID_backpointers,
POS_MIN, BTREE_ITER_prefetch, k, ({
progress_update_iter(trans, &progress, &iter, "backpointers_to_extents");
bch2_progress_update_iter(trans, &progress, &iter, "backpointers_to_extents");
check_one_backpointer(trans, start, end, k, &last_flushed);
}));

View File

@ -1,6 +1,6 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BACKPOINTERS_BACKGROUND_H
#define _BCACHEFS_BACKPOINTERS_BACKGROUND_H
#ifndef _BCACHEFS_BACKPOINTERS_H
#define _BCACHEFS_BACKPOINTERS_H
#include "btree_cache.h"
#include "btree_iter.h"
@ -123,7 +123,12 @@ static inline enum bch_data_type bch2_bkey_ptr_data_type(struct bkey_s_c k,
return BCH_DATA_btree;
case KEY_TYPE_extent:
case KEY_TYPE_reflink_v:
return p.has_ec ? BCH_DATA_stripe : BCH_DATA_user;
if (p.has_ec)
return BCH_DATA_stripe;
if (p.ptr.cached)
return BCH_DATA_cached;
else
return BCH_DATA_user;
case KEY_TYPE_stripe: {
const struct bch_extent_ptr *ptr = &entry->ptr;
struct bkey_s_c_stripe s = bkey_s_c_to_stripe(k);
@ -147,7 +152,20 @@ static inline void bch2_extent_ptr_to_bp(struct bch_fs *c,
struct bkey_i_backpointer *bp)
{
bkey_backpointer_init(&bp->k_i);
bp->k.p = POS(p.ptr.dev, ((u64) p.ptr.offset << MAX_EXTENT_COMPRESS_RATIO_SHIFT) + p.crc.offset);
bp->k.p.inode = p.ptr.dev;
if (k.k->type != KEY_TYPE_stripe)
bp->k.p.offset = ((u64) p.ptr.offset << MAX_EXTENT_COMPRESS_RATIO_SHIFT) + p.crc.offset;
else {
/*
* Put stripe backpointers where they won't collide with the
* extent backpointers within the stripe:
*/
struct bkey_s_c_stripe s = bkey_s_c_to_stripe(k);
bp->k.p.offset = ((u64) (p.ptr.offset + le16_to_cpu(s.v->sectors)) <<
MAX_EXTENT_COMPRESS_RATIO_SHIFT) - 1;
}
bp->v = (struct bch_backpointer) {
.btree_id = btree_id,
.level = level,

View File

@ -203,6 +203,7 @@
#include <linux/types.h>
#include <linux/workqueue.h>
#include <linux/zstd.h>
#include <linux/unicode.h>
#include "bcachefs_format.h"
#include "btree_journal_iter_types.h"
@ -444,6 +445,7 @@ BCH_DEBUG_PARAMS_DEBUG()
x(btree_node_sort) \
x(btree_node_read) \
x(btree_node_read_done) \
x(btree_node_write) \
x(btree_interior_update_foreground) \
x(btree_interior_update_total) \
x(btree_gc) \
@ -456,6 +458,7 @@ BCH_DEBUG_PARAMS_DEBUG()
x(blocked_journal_low_on_space) \
x(blocked_journal_low_on_pin) \
x(blocked_journal_max_in_flight) \
x(blocked_journal_max_open) \
x(blocked_key_cache_flush) \
x(blocked_allocate) \
x(blocked_allocate_open_bucket) \
@ -533,6 +536,7 @@ struct bch_dev {
*/
struct bch_member_cpu mi;
atomic64_t errors[BCH_MEMBER_ERROR_NR];
unsigned long write_errors_start;
__uuid_t uuid;
char name[BDEVNAME_SIZE];
@ -623,7 +627,8 @@ struct bch_dev {
x(topology_error) \
x(errors_fixed) \
x(errors_not_fixed) \
x(no_invalid_checks)
x(no_invalid_checks) \
x(discard_mount_opt_set) \
enum bch_fs_flags {
#define x(n) BCH_FS_##n,
@ -687,7 +692,8 @@ struct btree_trans_buf {
x(gc_gens) \
x(snapshot_delete_pagecache) \
x(sysfs) \
x(btree_write_buffer)
x(btree_write_buffer) \
x(btree_node_scrub)
enum bch_write_ref {
#define x(n) BCH_WRITE_REF_##n,
@ -696,6 +702,8 @@ enum bch_write_ref {
BCH_WRITE_REF_NR,
};
#define BCH_FS_DEFAULT_UTF8_ENCODING UNICODE_AGE(12, 1, 0)
struct bch_fs {
struct closure cl;
@ -780,6 +788,9 @@ struct bch_fs {
u64 btrees_lost_data;
} sb;
#ifdef CONFIG_UNICODE
struct unicode_map *cf_encoding;
#endif
struct bch_sb_handle disk_sb;
@ -969,7 +980,6 @@ struct bch_fs {
mempool_t compress_workspace[BCH_COMPRESSION_OPT_NR];
size_t zstd_workspace_size;
struct crypto_shash *sha256;
struct crypto_sync_skcipher *chacha20;
struct crypto_shash *poly1305;
@ -993,15 +1003,11 @@ struct bch_fs {
wait_queue_head_t copygc_running_wq;
/* STRIPES: */
GENRADIX(struct stripe) stripes;
GENRADIX(struct gc_stripe) gc_stripes;
struct hlist_head ec_stripes_new[32];
spinlock_t ec_stripes_new_lock;
ec_stripes_heap ec_stripes_heap;
struct mutex ec_stripes_heap_lock;
/* ERASURE CODING */
struct list_head ec_stripe_head_list;
struct mutex ec_stripe_head_lock;

View File

@ -686,7 +686,12 @@ struct bch_sb_field_ext {
x(inode_depth, BCH_VERSION(1, 17)) \
x(persistent_inode_cursors, BCH_VERSION(1, 18)) \
x(autofix_errors, BCH_VERSION(1, 19)) \
x(directory_size, BCH_VERSION(1, 20))
x(directory_size, BCH_VERSION(1, 20)) \
x(cached_backpointers, BCH_VERSION(1, 21)) \
x(stripe_backpointers, BCH_VERSION(1, 22)) \
x(stripe_lru, BCH_VERSION(1, 23)) \
x(casefolding, BCH_VERSION(1, 24)) \
x(extent_flags, BCH_VERSION(1, 25))
enum bcachefs_metadata_version {
bcachefs_metadata_version_min = 9,
@ -837,6 +842,7 @@ LE64_BITMASK(BCH_SB_SHARD_INUMS, struct bch_sb, flags[3], 28, 29);
LE64_BITMASK(BCH_SB_INODES_USE_KEY_CACHE,struct bch_sb, flags[3], 29, 30);
LE64_BITMASK(BCH_SB_JOURNAL_FLUSH_DELAY,struct bch_sb, flags[3], 30, 62);
LE64_BITMASK(BCH_SB_JOURNAL_FLUSH_DISABLED,struct bch_sb, flags[3], 62, 63);
/* one free bit */
LE64_BITMASK(BCH_SB_JOURNAL_RECLAIM_DELAY,struct bch_sb, flags[4], 0, 32);
LE64_BITMASK(BCH_SB_JOURNAL_TRANSACTION_NAMES,struct bch_sb, flags[4], 32, 33);
LE64_BITMASK(BCH_SB_NOCOW, struct bch_sb, flags[4], 33, 34);
@ -855,6 +861,8 @@ LE64_BITMASK(BCH_SB_VERSION_INCOMPAT, struct bch_sb, flags[5], 32, 48);
LE64_BITMASK(BCH_SB_VERSION_INCOMPAT_ALLOWED,
struct bch_sb, flags[5], 48, 64);
LE64_BITMASK(BCH_SB_SHARD_INUMS_NBITS, struct bch_sb, flags[6], 0, 4);
LE64_BITMASK(BCH_SB_WRITE_ERROR_TIMEOUT,struct bch_sb, flags[6], 4, 14);
LE64_BITMASK(BCH_SB_CSUM_ERR_RETRY_NR, struct bch_sb, flags[6], 14, 20);
static inline __u64 BCH_SB_COMPRESSION_TYPE(const struct bch_sb *sb)
{
@ -908,7 +916,8 @@ static inline void SET_BCH_SB_BACKGROUND_COMPRESSION_TYPE(struct bch_sb *sb, __u
x(journal_no_flush, 16) \
x(alloc_v2, 17) \
x(extents_across_btree_nodes, 18) \
x(incompat_version_field, 19)
x(incompat_version_field, 19) \
x(casefolding, 20)
#define BCH_SB_FEATURES_ALWAYS \
(BIT_ULL(BCH_FEATURE_new_extent_overwrite)| \
@ -922,7 +931,8 @@ static inline void SET_BCH_SB_BACKGROUND_COMPRESSION_TYPE(struct bch_sb *sb, __u
BIT_ULL(BCH_FEATURE_new_siphash)| \
BIT_ULL(BCH_FEATURE_btree_ptr_v2)| \
BIT_ULL(BCH_FEATURE_new_varint)| \
BIT_ULL(BCH_FEATURE_journal_no_flush))
BIT_ULL(BCH_FEATURE_journal_no_flush)| \
BIT_ULL(BCH_FEATURE_incompat_version_field))
enum bch_sb_feature {
#define x(f, n) BCH_FEATURE_##f,

View File

@ -87,6 +87,7 @@ struct bch_ioctl_incremental {
#define BCH_IOCTL_FSCK_OFFLINE _IOW(0xbc, 19, struct bch_ioctl_fsck_offline)
#define BCH_IOCTL_FSCK_ONLINE _IOW(0xbc, 20, struct bch_ioctl_fsck_online)
#define BCH_IOCTL_QUERY_ACCOUNTING _IOW(0xbc, 21, struct bch_ioctl_query_accounting)
#define BCH_IOCTL_QUERY_COUNTERS _IOW(0xbc, 21, struct bch_ioctl_query_counters)
/* ioctl below act on a particular file, not the filesystem as a whole: */
@ -213,6 +214,10 @@ struct bch_ioctl_data {
struct bpos end_pos;
union {
struct {
__u32 dev;
__u32 data_types;
} scrub;
struct {
__u32 dev;
__u32 pad;
@ -229,6 +234,11 @@ enum bch_data_event {
BCH_DATA_EVENT_NR = 1,
};
enum data_progress_data_type_special {
DATA_PROGRESS_DATA_TYPE_phys = 254,
DATA_PROGRESS_DATA_TYPE_done = 255,
};
struct bch_ioctl_data_progress {
__u8 data_type;
__u8 btree_id;
@ -237,11 +247,19 @@ struct bch_ioctl_data_progress {
__u64 sectors_done;
__u64 sectors_total;
__u64 sectors_error_corrected;
__u64 sectors_error_uncorrected;
} __packed __aligned(8);
enum bch_ioctl_data_event_ret {
BCH_IOCTL_DATA_EVENT_RET_done = 1,
BCH_IOCTL_DATA_EVENT_RET_device_offline = 2,
};
struct bch_ioctl_data_event {
__u8 type;
__u8 pad[7];
__u8 ret;
__u8 pad[6];
union {
struct bch_ioctl_data_progress p;
__u64 pad2[15];
@ -443,4 +461,13 @@ struct bch_ioctl_query_accounting {
struct bkey_i_accounting accounting[];
};
#define BCH_IOCTL_QUERY_COUNTERS_MOUNT (1 << 0)
struct bch_ioctl_query_counters {
__u16 nr;
__u16 flags;
__u32 pad;
__u64 d[];
};
#endif /* _BCACHEFS_IOCTL_H */

View File

@ -610,6 +610,7 @@ void bch2_fs_btree_cache_exit(struct bch_fs *c)
btree_node_write_in_flight(b));
btree_node_data_free(bc, b);
cond_resched();
}
BUG_ON(!bch2_journal_error(&c->journal) &&

View File

@ -27,6 +27,7 @@
#include "journal.h"
#include "keylist.h"
#include "move.h"
#include "progress.h"
#include "recovery_passes.h"
#include "reflink.h"
#include "recovery.h"
@ -656,7 +657,9 @@ fsck_err:
return ret;
}
static int bch2_gc_btree(struct btree_trans *trans, enum btree_id btree, bool initial)
static int bch2_gc_btree(struct btree_trans *trans,
struct progress_indicator_state *progress,
enum btree_id btree, bool initial)
{
struct bch_fs *c = trans->c;
unsigned target_depth = btree_node_type_has_triggers(__btree_node_type(0, btree)) ? 0 : 1;
@ -673,6 +676,7 @@ static int bch2_gc_btree(struct btree_trans *trans, enum btree_id btree, bool in
BTREE_ITER_prefetch);
ret = for_each_btree_key_continue(trans, iter, 0, k, ({
bch2_progress_update_iter(trans, progress, &iter, "check_allocations");
gc_pos_set(c, gc_pos_btree(btree, level, k.k->p));
bch2_gc_mark_key(trans, btree, level, &prev, &iter, k, initial);
}));
@ -717,22 +721,24 @@ static inline int btree_id_gc_phase_cmp(enum btree_id l, enum btree_id r)
static int bch2_gc_btrees(struct bch_fs *c)
{
struct btree_trans *trans = bch2_trans_get(c);
enum btree_id ids[BTREE_ID_NR];
struct printbuf buf = PRINTBUF;
unsigned i;
int ret = 0;
for (i = 0; i < BTREE_ID_NR; i++)
struct progress_indicator_state progress;
bch2_progress_init(&progress, c, ~0ULL);
enum btree_id ids[BTREE_ID_NR];
for (unsigned i = 0; i < BTREE_ID_NR; i++)
ids[i] = i;
bubble_sort(ids, BTREE_ID_NR, btree_id_gc_phase_cmp);
for (i = 0; i < btree_id_nr_alive(c) && !ret; i++) {
for (unsigned i = 0; i < btree_id_nr_alive(c) && !ret; i++) {
unsigned btree = i < BTREE_ID_NR ? ids[i] : i;
if (IS_ERR_OR_NULL(bch2_btree_id_root(c, btree)->b))
continue;
ret = bch2_gc_btree(trans, btree, true);
ret = bch2_gc_btree(trans, &progress, btree, true);
}
printbuf_exit(&buf);

View File

@ -1,6 +1,7 @@
// SPDX-License-Identifier: GPL-2.0
#include "bcachefs.h"
#include "bkey_buf.h"
#include "bkey_methods.h"
#include "bkey_sort.h"
#include "btree_cache.h"
@ -1328,6 +1329,7 @@ static void btree_node_read_work(struct work_struct *work)
bch_info(c, "retrying read");
ca = bch2_dev_get_ioref(c, rb->pick.ptr.dev, READ);
rb->have_ioref = ca != NULL;
rb->start_time = local_clock();
bio_reset(bio, NULL, REQ_OP_READ|REQ_SYNC|REQ_META);
bio->bi_iter.bi_sector = rb->pick.ptr.offset;
bio->bi_iter.bi_size = btree_buf_bytes(b);
@ -1338,21 +1340,26 @@ static void btree_node_read_work(struct work_struct *work)
} else {
bio->bi_status = BLK_STS_REMOVED;
}
bch2_account_io_completion(ca, BCH_MEMBER_ERROR_read,
rb->start_time, !bio->bi_status);
start:
printbuf_reset(&buf);
bch2_btree_pos_to_text(&buf, c, b);
bch2_dev_io_err_on(ca && bio->bi_status, ca, BCH_MEMBER_ERROR_read,
"btree read error %s for %s",
bch2_blk_status_to_str(bio->bi_status), buf.buf);
if (ca && bio->bi_status)
bch_err_dev_ratelimited(ca,
"btree read error %s for %s",
bch2_blk_status_to_str(bio->bi_status), buf.buf);
if (rb->have_ioref)
percpu_ref_put(&ca->io_ref);
rb->have_ioref = false;
bch2_mark_io_failure(&failed, &rb->pick);
bch2_mark_io_failure(&failed, &rb->pick, false);
can_retry = bch2_bkey_pick_read_device(c,
bkey_i_to_s_c(&b->key),
&failed, &rb->pick) > 0;
&failed, &rb->pick, -1) > 0;
if (!bio->bi_status &&
!bch2_btree_node_read_done(c, ca, b, can_retry, &saw_error)) {
@ -1400,12 +1407,11 @@ static void btree_node_read_endio(struct bio *bio)
struct btree_read_bio *rb =
container_of(bio, struct btree_read_bio, bio);
struct bch_fs *c = rb->c;
struct bch_dev *ca = rb->have_ioref
? bch2_dev_have_ref(c, rb->pick.ptr.dev) : NULL;
if (rb->have_ioref) {
struct bch_dev *ca = bch2_dev_have_ref(c, rb->pick.ptr.dev);
bch2_latency_acct(ca, rb->start_time, READ);
}
bch2_account_io_completion(ca, BCH_MEMBER_ERROR_read,
rb->start_time, !bio->bi_status);
queue_work(c->btree_read_complete_wq, &rb->work);
}
@ -1697,7 +1703,7 @@ void bch2_btree_node_read(struct btree_trans *trans, struct btree *b,
return;
ret = bch2_bkey_pick_read_device(c, bkey_i_to_s_c(&b->key),
NULL, &pick);
NULL, &pick, -1);
if (ret <= 0) {
struct printbuf buf = PRINTBUF;
@ -1811,6 +1817,190 @@ int bch2_btree_root_read(struct bch_fs *c, enum btree_id id,
return bch2_trans_run(c, __bch2_btree_root_read(trans, id, k, level));
}
struct btree_node_scrub {
struct bch_fs *c;
struct bch_dev *ca;
void *buf;
bool used_mempool;
unsigned written;
enum btree_id btree;
unsigned level;
struct bkey_buf key;
__le64 seq;
struct work_struct work;
struct bio bio;
};
static bool btree_node_scrub_check(struct bch_fs *c, struct btree_node *data, unsigned ptr_written,
struct printbuf *err)
{
unsigned written = 0;
if (le64_to_cpu(data->magic) != bset_magic(c)) {
prt_printf(err, "bad magic: want %llx, got %llx",
bset_magic(c), le64_to_cpu(data->magic));
return false;
}
while (written < (ptr_written ?: btree_sectors(c))) {
struct btree_node_entry *bne;
struct bset *i;
bool first = !written;
if (first) {
bne = NULL;
i = &data->keys;
} else {
bne = (void *) data + (written << 9);
i = &bne->keys;
if (!ptr_written && i->seq != data->keys.seq)
break;
}
struct nonce nonce = btree_nonce(i, written << 9);
bool good_csum_type = bch2_checksum_type_valid(c, BSET_CSUM_TYPE(i));
if (first) {
if (good_csum_type) {
struct bch_csum csum = csum_vstruct(c, BSET_CSUM_TYPE(i), nonce, data);
if (bch2_crc_cmp(data->csum, csum)) {
bch2_csum_err_msg(err, BSET_CSUM_TYPE(i), data->csum, csum);
return false;
}
}
written += vstruct_sectors(data, c->block_bits);
} else {
if (good_csum_type) {
struct bch_csum csum = csum_vstruct(c, BSET_CSUM_TYPE(i), nonce, bne);
if (bch2_crc_cmp(bne->csum, csum)) {
bch2_csum_err_msg(err, BSET_CSUM_TYPE(i), bne->csum, csum);
return false;
}
}
written += vstruct_sectors(bne, c->block_bits);
}
}
return true;
}
static void btree_node_scrub_work(struct work_struct *work)
{
struct btree_node_scrub *scrub = container_of(work, struct btree_node_scrub, work);
struct bch_fs *c = scrub->c;
struct printbuf err = PRINTBUF;
__bch2_btree_pos_to_text(&err, c, scrub->btree, scrub->level,
bkey_i_to_s_c(scrub->key.k));
prt_newline(&err);
if (!btree_node_scrub_check(c, scrub->buf, scrub->written, &err)) {
struct btree_trans *trans = bch2_trans_get(c);
struct btree_iter iter;
bch2_trans_node_iter_init(trans, &iter, scrub->btree,
scrub->key.k->k.p, 0, scrub->level - 1, 0);
struct btree *b;
int ret = lockrestart_do(trans, PTR_ERR_OR_ZERO(b = bch2_btree_iter_peek_node(&iter)));
if (ret)
goto err;
if (bkey_i_to_btree_ptr_v2(&b->key)->v.seq == scrub->seq) {
bch_err(c, "error validating btree node during scrub on %s at btree %s",
scrub->ca->name, err.buf);
ret = bch2_btree_node_rewrite(trans, &iter, b, 0);
}
err:
bch2_trans_iter_exit(trans, &iter);
bch2_trans_begin(trans);
bch2_trans_put(trans);
}
printbuf_exit(&err);
bch2_bkey_buf_exit(&scrub->key, c);
btree_bounce_free(c, c->opts.btree_node_size, scrub->used_mempool, scrub->buf);
percpu_ref_put(&scrub->ca->io_ref);
kfree(scrub);
bch2_write_ref_put(c, BCH_WRITE_REF_btree_node_scrub);
}
static void btree_node_scrub_endio(struct bio *bio)
{
struct btree_node_scrub *scrub = container_of(bio, struct btree_node_scrub, bio);
queue_work(scrub->c->btree_read_complete_wq, &scrub->work);
}
int bch2_btree_node_scrub(struct btree_trans *trans,
enum btree_id btree, unsigned level,
struct bkey_s_c k, unsigned dev)
{
if (k.k->type != KEY_TYPE_btree_ptr_v2)
return 0;
struct bch_fs *c = trans->c;
if (!bch2_write_ref_tryget(c, BCH_WRITE_REF_btree_node_scrub))
return -BCH_ERR_erofs_no_writes;
struct extent_ptr_decoded pick;
int ret = bch2_bkey_pick_read_device(c, k, NULL, &pick, dev);
if (ret <= 0)
goto err;
struct bch_dev *ca = bch2_dev_get_ioref(c, pick.ptr.dev, READ);
if (!ca) {
ret = -BCH_ERR_device_offline;
goto err;
}
bool used_mempool = false;
void *buf = btree_bounce_alloc(c, c->opts.btree_node_size, &used_mempool);
unsigned vecs = buf_pages(buf, c->opts.btree_node_size);
struct btree_node_scrub *scrub =
kzalloc(sizeof(*scrub) + sizeof(struct bio_vec) * vecs, GFP_KERNEL);
if (!scrub) {
ret = -ENOMEM;
goto err_free;
}
scrub->c = c;
scrub->ca = ca;
scrub->buf = buf;
scrub->used_mempool = used_mempool;
scrub->written = btree_ptr_sectors_written(k);
scrub->btree = btree;
scrub->level = level;
bch2_bkey_buf_init(&scrub->key);
bch2_bkey_buf_reassemble(&scrub->key, c, k);
scrub->seq = bkey_s_c_to_btree_ptr_v2(k).v->seq;
INIT_WORK(&scrub->work, btree_node_scrub_work);
bio_init(&scrub->bio, ca->disk_sb.bdev, scrub->bio.bi_inline_vecs, vecs, REQ_OP_READ);
bch2_bio_map(&scrub->bio, scrub->buf, c->opts.btree_node_size);
scrub->bio.bi_iter.bi_sector = pick.ptr.offset;
scrub->bio.bi_end_io = btree_node_scrub_endio;
submit_bio(&scrub->bio);
return 0;
err_free:
btree_bounce_free(c, c->opts.btree_node_size, used_mempool, buf);
percpu_ref_put(&ca->io_ref);
err:
bch2_write_ref_put(c, BCH_WRITE_REF_btree_node_scrub);
return ret;
}
static void bch2_btree_complete_write(struct bch_fs *c, struct btree *b,
struct btree_write *w)
{
@ -1831,7 +2021,7 @@ static void bch2_btree_complete_write(struct bch_fs *c, struct btree *b,
bch2_journal_pin_drop(&c->journal, &w->journal);
}
static void __btree_node_write_done(struct bch_fs *c, struct btree *b)
static void __btree_node_write_done(struct bch_fs *c, struct btree *b, u64 start_time)
{
struct btree_write *w = btree_prev_write(b);
unsigned long old, new;
@ -1839,6 +2029,9 @@ static void __btree_node_write_done(struct bch_fs *c, struct btree *b)
bch2_btree_complete_write(c, b, w);
if (start_time)
bch2_time_stats_update(&c->times[BCH_TIME_btree_node_write], start_time);
old = READ_ONCE(b->flags);
do {
new = old;
@ -1869,7 +2062,7 @@ static void __btree_node_write_done(struct bch_fs *c, struct btree *b)
wake_up_bit(&b->flags, BTREE_NODE_write_in_flight);
}
static void btree_node_write_done(struct bch_fs *c, struct btree *b)
static void btree_node_write_done(struct bch_fs *c, struct btree *b, u64 start_time)
{
struct btree_trans *trans = bch2_trans_get(c);
@ -1877,7 +2070,7 @@ static void btree_node_write_done(struct bch_fs *c, struct btree *b)
/* we don't need transaction context anymore after we got the lock. */
bch2_trans_put(trans);
__btree_node_write_done(c, b);
__btree_node_write_done(c, b, start_time);
six_unlock_read(&b->c.lock);
}
@ -1887,6 +2080,7 @@ static void btree_node_write_work(struct work_struct *work)
container_of(work, struct btree_write_bio, work);
struct bch_fs *c = wbio->wbio.c;
struct btree *b = wbio->wbio.bio.bi_private;
u64 start_time = wbio->start_time;
int ret = 0;
btree_bounce_free(c,
@ -1919,12 +2113,18 @@ static void btree_node_write_work(struct work_struct *work)
}
out:
bio_put(&wbio->wbio.bio);
btree_node_write_done(c, b);
btree_node_write_done(c, b, start_time);
return;
err:
set_btree_node_noevict(b);
bch2_fs_fatal_err_on(!bch2_err_matches(ret, EROFS), c,
"writing btree node: %s", bch2_err_str(ret));
if (!bch2_err_matches(ret, EROFS)) {
struct printbuf buf = PRINTBUF;
prt_printf(&buf, "writing btree node: %s\n ", bch2_err_str(ret));
bch2_btree_pos_to_text(&buf, c, b);
bch2_fs_fatal_error(c, "%s", buf.buf);
printbuf_exit(&buf);
}
goto out;
}
@ -1937,16 +2137,21 @@ static void btree_node_write_endio(struct bio *bio)
struct bch_fs *c = wbio->c;
struct btree *b = wbio->bio.bi_private;
struct bch_dev *ca = wbio->have_ioref ? bch2_dev_have_ref(c, wbio->dev) : NULL;
unsigned long flags;
if (wbio->have_ioref)
bch2_latency_acct(ca, wbio->submit_time, WRITE);
bch2_account_io_completion(ca, BCH_MEMBER_ERROR_write,
wbio->submit_time, !bio->bi_status);
if (!ca ||
bch2_dev_io_err_on(bio->bi_status, ca, BCH_MEMBER_ERROR_write,
"btree write error: %s",
bch2_blk_status_to_str(bio->bi_status)) ||
bch2_meta_write_fault("btree")) {
if (ca && bio->bi_status) {
struct printbuf buf = PRINTBUF;
prt_printf(&buf, "btree write error: %s\n ",
bch2_blk_status_to_str(bio->bi_status));
bch2_btree_pos_to_text(&buf, c, b);
bch_err_dev_ratelimited(ca, "%s", buf.buf);
printbuf_exit(&buf);
}
if (bio->bi_status) {
unsigned long flags;
spin_lock_irqsave(&c->btree_write_error_lock, flags);
bch2_dev_list_add_dev(&orig->failed, wbio->dev);
spin_unlock_irqrestore(&c->btree_write_error_lock, flags);
@ -2023,6 +2228,7 @@ void __bch2_btree_node_write(struct bch_fs *c, struct btree *b, unsigned flags)
bool validate_before_checksum = false;
enum btree_write_type type = flags & BTREE_WRITE_TYPE_MASK;
void *data;
u64 start_time = local_clock();
int ret;
if (flags & BTREE_WRITE_ALREADY_STARTED)
@ -2231,6 +2437,7 @@ do_write:
wbio->data = data;
wbio->data_bytes = bytes;
wbio->sector_offset = b->written;
wbio->start_time = start_time;
wbio->wbio.c = c;
wbio->wbio.used_mempool = used_mempool;
wbio->wbio.first_btree_write = !b->written;
@ -2258,7 +2465,7 @@ err:
b->written += sectors_to_write;
nowrite:
btree_bounce_free(c, bytes, used_mempool, data);
__btree_node_write_done(c, b);
__btree_node_write_done(c, b, 0);
}
/*


@ -52,6 +52,7 @@ struct btree_write_bio {
void *data;
unsigned data_bytes;
unsigned sector_offset;
u64 start_time;
struct bch_write_bio wbio;
};
@ -132,6 +133,9 @@ void bch2_btree_node_read(struct btree_trans *, struct btree *, bool);
int bch2_btree_root_read(struct bch_fs *, enum btree_id,
const struct bkey_i *, unsigned);
int bch2_btree_node_scrub(struct btree_trans *, enum btree_id, unsigned,
struct bkey_s_c, unsigned);
bool bch2_btree_post_write_cleanup(struct bch_fs *, struct btree *);
enum btree_write_flags {


@ -562,20 +562,6 @@ static inline struct bkey_s_c btree_path_level_peek_all(struct bch_fs *c,
bch2_btree_node_iter_peek_all(&l->iter, l->b));
}
static inline struct bkey_s_c btree_path_level_peek(struct btree_trans *trans,
struct btree_path *path,
struct btree_path_level *l,
struct bkey *u)
{
struct bkey_s_c k = __btree_iter_unpack(trans->c, l, u,
bch2_btree_node_iter_peek(&l->iter, l->b));
path->pos = k.k ? k.k->p : l->b->key.k.p;
trans->paths_sorted = false;
bch2_btree_path_verify_level(trans, path, l - path->l);
return k;
}
static inline struct bkey_s_c btree_path_level_prev(struct btree_trans *trans,
struct btree_path *path,
struct btree_path_level *l,


@ -335,13 +335,20 @@ static inline void bch2_trans_verify_not_unlocked_or_in_restart(struct btree_tra
}
__always_inline
static int btree_trans_restart_ip(struct btree_trans *trans, int err, unsigned long ip)
static int btree_trans_restart_foreign_task(struct btree_trans *trans, int err, unsigned long ip)
{
BUG_ON(err <= 0);
BUG_ON(!bch2_err_matches(-err, BCH_ERR_transaction_restart));
trans->restarted = err;
trans->last_restarted_ip = ip;
return -err;
}
__always_inline
static int btree_trans_restart_ip(struct btree_trans *trans, int err, unsigned long ip)
{
btree_trans_restart_foreign_task(trans, err, ip);
#ifdef CONFIG_BCACHEFS_DEBUG
darray_exit(&trans->last_restarted_trace);
bch2_save_backtrace(&trans->last_restarted_trace, current, 0, GFP_NOWAIT);


@ -91,10 +91,10 @@ static noinline void print_chain(struct printbuf *out, struct lock_graph *g)
struct trans_waiting_for_lock *i;
for (i = g->g; i != g->g + g->nr; i++) {
struct task_struct *task = i->trans->locking_wait.task;
struct task_struct *task = READ_ONCE(i->trans->locking_wait.task);
if (i != g->g)
prt_str(out, "<- ");
prt_printf(out, "%u ", task ?task->pid : 0);
prt_printf(out, "%u ", task ? task->pid : 0);
}
prt_newline(out);
}
@ -172,7 +172,9 @@ static int abort_lock(struct lock_graph *g, struct trans_waiting_for_lock *i)
{
if (i == g->g) {
trace_would_deadlock(g, i->trans);
return btree_trans_restart(i->trans, BCH_ERR_transaction_restart_would_deadlock);
return btree_trans_restart_foreign_task(i->trans,
BCH_ERR_transaction_restart_would_deadlock,
_THIS_IP_);
} else {
i->trans->lock_must_abort = true;
wake_up_process(i->trans->locking_wait.task);


@ -166,11 +166,17 @@ static void try_read_btree_node(struct find_btree_nodes *f, struct bch_dev *ca,
bio->bi_iter.bi_sector = offset;
bch2_bio_map(bio, bn, PAGE_SIZE);
u64 submit_time = local_clock();
submit_bio_wait(bio);
if (bch2_dev_io_err_on(bio->bi_status, ca, BCH_MEMBER_ERROR_read,
"IO error in try_read_btree_node() at %llu: %s",
offset, bch2_blk_status_to_str(bio->bi_status)))
bch2_account_io_completion(ca, BCH_MEMBER_ERROR_read, submit_time, !bio->bi_status);
if (bio->bi_status) {
bch_err_dev_ratelimited(ca,
"IO error in try_read_btree_node() at %llu: %s",
offset, bch2_blk_status_to_str(bio->bi_status));
return;
}
if (le64_to_cpu(bn->magic) != bset_magic(c))
return;
@ -264,7 +270,7 @@ static int read_btree_nodes_worker(void *p)
err:
bio_put(bio);
free_page((unsigned long) buf);
percpu_ref_get(&ca->io_ref);
percpu_ref_put(&ca->io_ref);
closure_put(w->cl);
kfree(w);
return 0;
@ -283,29 +289,28 @@ static int read_btree_nodes(struct find_btree_nodes *f)
continue;
struct find_btree_nodes_worker *w = kmalloc(sizeof(*w), GFP_KERNEL);
struct task_struct *t;
if (!w) {
percpu_ref_put(&ca->io_ref);
ret = -ENOMEM;
goto err;
}
percpu_ref_get(&ca->io_ref);
closure_get(&cl);
w->cl = &cl;
w->f = f;
w->ca = ca;
t = kthread_run(read_btree_nodes_worker, w, "read_btree_nodes/%s", ca->name);
struct task_struct *t = kthread_create(read_btree_nodes_worker, w, "read_btree_nodes/%s", ca->name);
ret = PTR_ERR_OR_ZERO(t);
if (ret) {
percpu_ref_put(&ca->io_ref);
closure_put(&cl);
f->ret = ret;
bch_err(c, "error starting kthread: %i", ret);
kfree(w);
bch_err_msg(c, ret, "starting kthread");
break;
}
closure_get(&cl);
percpu_ref_get(&ca->io_ref);
wake_up_process(t);
}
err:
closure_sync(&cl);


@ -164,6 +164,7 @@ bool bch2_btree_bset_insert_key(struct btree_trans *trans,
EBUG_ON(bpos_gt(insert->k.p, b->data->max_key));
EBUG_ON(insert->k.u64s > bch2_btree_keys_u64s_remaining(b));
EBUG_ON(!b->c.level && !bpos_eq(insert->k.p, path->pos));
kmsan_check_memory(insert, bkey_bytes(&insert->k));
k = bch2_btree_node_iter_peek_all(node_iter, b);
if (k && bkey_cmp_left_packed(b, k, &insert->k.p))
@ -336,6 +337,7 @@ static inline void btree_insert_entry_checks(struct btree_trans *trans,
BUG_ON(i->cached != path->cached);
BUG_ON(i->level != path->level);
BUG_ON(i->btree_id != path->btree_id);
BUG_ON(i->bkey_type != __btree_node_type(path->level, path->btree_id));
EBUG_ON(!i->level &&
btree_type_has_snapshots(i->btree_id) &&
!(i->flags & BTREE_UPDATE_internal_snapshot_node) &&
@ -517,69 +519,45 @@ static int run_one_trans_trigger(struct btree_trans *trans, struct btree_insert_
}
}
static int run_btree_triggers(struct btree_trans *trans, enum btree_id btree_id,
unsigned *btree_id_updates_start)
{
bool trans_trigger_run;
/*
* Running triggers will append more updates to the list of updates as
* we're walking it:
*/
do {
trans_trigger_run = false;
for (unsigned i = *btree_id_updates_start;
i < trans->nr_updates && trans->updates[i].btree_id <= btree_id;
i++) {
if (trans->updates[i].btree_id < btree_id) {
*btree_id_updates_start = i;
continue;
}
int ret = run_one_trans_trigger(trans, trans->updates + i);
if (ret < 0)
return ret;
if (ret)
trans_trigger_run = true;
}
} while (trans_trigger_run);
trans_for_each_update(trans, i)
BUG_ON(!(i->flags & BTREE_TRIGGER_norun) &&
i->btree_id == btree_id &&
btree_node_type_has_trans_triggers(i->bkey_type) &&
(!i->insert_trigger_run || !i->overwrite_trigger_run));
return 0;
}
static int bch2_trans_commit_run_triggers(struct btree_trans *trans)
{
unsigned btree_id = 0, btree_id_updates_start = 0;
int ret = 0;
unsigned sort_id_start = 0;
/*
*
* For a given btree, this algorithm runs insert triggers before
* overwrite triggers: this is so that when extents are being moved
* (e.g. by FALLOCATE_FL_INSERT_RANGE), we don't drop references before
* they are re-added.
*/
for (btree_id = 0; btree_id < BTREE_ID_NR; btree_id++) {
if (btree_id == BTREE_ID_alloc)
continue;
while (sort_id_start < trans->nr_updates) {
unsigned i, sort_id = trans->updates[sort_id_start].sort_order;
bool trans_trigger_run;
ret = run_btree_triggers(trans, btree_id, &btree_id_updates_start);
if (ret)
return ret;
/*
* For a given btree, this algorithm runs insert triggers before
* overwrite triggers: this is so that when extents are being
* moved (e.g. by FALLOCATE_FL_INSERT_RANGE), we don't drop
* references before they are re-added.
*
* Running triggers will append more updates to the list of
* updates as we're walking it:
*/
do {
trans_trigger_run = false;
for (i = sort_id_start;
i < trans->nr_updates && trans->updates[i].sort_order <= sort_id;
i++) {
if (trans->updates[i].sort_order < sort_id) {
sort_id_start = i;
continue;
}
int ret = run_one_trans_trigger(trans, trans->updates + i);
if (ret < 0)
return ret;
if (ret)
trans_trigger_run = true;
}
} while (trans_trigger_run);
sort_id_start = i;
}
btree_id_updates_start = 0;
ret = run_btree_triggers(trans, BTREE_ID_alloc, &btree_id_updates_start);
if (ret)
return ret;
#ifdef CONFIG_BCACHEFS_DEBUG
trans_for_each_update(trans, i)
BUG_ON(!(i->flags & BTREE_TRIGGER_norun) &&
@ -903,6 +881,24 @@ int bch2_trans_commit_error(struct btree_trans *trans, unsigned flags,
struct bch_fs *c = trans->c;
enum bch_watermark watermark = flags & BCH_WATERMARK_MASK;
if (bch2_err_matches(ret, BCH_ERR_journal_res_blocked)) {
/*
* XXX: this should probably be a separate BTREE_INSERT_NONBLOCK
* flag
*/
if ((flags & BCH_TRANS_COMMIT_journal_reclaim) &&
watermark < BCH_WATERMARK_reclaim) {
ret = -BCH_ERR_journal_reclaim_would_deadlock;
goto out;
}
ret = drop_locks_do(trans,
bch2_trans_journal_res_get(trans,
(flags & BCH_WATERMARK_MASK)|
JOURNAL_RES_GET_CHECK));
goto out;
}
switch (ret) {
case -BCH_ERR_btree_insert_btree_node_full:
ret = bch2_btree_split_leaf(trans, i->path, flags);
@ -914,22 +910,6 @@ int bch2_trans_commit_error(struct btree_trans *trans, unsigned flags,
ret = drop_locks_do(trans,
bch2_accounting_update_sb(trans));
break;
case -BCH_ERR_journal_res_get_blocked:
/*
* XXX: this should probably be a separate BTREE_INSERT_NONBLOCK
* flag
*/
if ((flags & BCH_TRANS_COMMIT_journal_reclaim) &&
watermark < BCH_WATERMARK_reclaim) {
ret = -BCH_ERR_journal_reclaim_would_deadlock;
break;
}
ret = drop_locks_do(trans,
bch2_trans_journal_res_get(trans,
(flags & BCH_WATERMARK_MASK)|
JOURNAL_RES_GET_CHECK));
break;
case -BCH_ERR_btree_insert_need_journal_reclaim:
bch2_trans_unlock(trans);
@ -950,7 +930,7 @@ int bch2_trans_commit_error(struct btree_trans *trans, unsigned flags,
BUG_ON(ret >= 0);
break;
}
out:
BUG_ON(bch2_err_matches(ret, BCH_ERR_transaction_restart) != !!trans->restarted);
bch2_fs_inconsistent_on(bch2_err_matches(ret, ENOSPC) &&


@ -423,6 +423,7 @@ static inline struct bpos btree_node_pos(struct btree_bkey_cached_common *b)
struct btree_insert_entry {
unsigned flags;
u8 sort_order;
u8 bkey_type;
enum btree_id btree_id:8;
u8 level:4;
@ -853,6 +854,18 @@ static inline bool btree_type_uses_write_buffer(enum btree_id btree)
return BIT_ULL(btree) & mask;
}
static inline u8 btree_trigger_order(enum btree_id btree)
{
switch (btree) {
case BTREE_ID_alloc:
return U8_MAX;
case BTREE_ID_stripes:
return U8_MAX - 1;
default:
return btree;
}
}
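
With btree_insert_entry_cmp() now keying on sort_order (see the btree_update.c hunk in this series), updates are processed in sort_order groups, so btree_trigger_order() above pushes alloc and stripes triggers to the end. A standalone sketch of the resulting ordering; the btree ids and update list below are made up, only the U8_MAX / U8_MAX - 1 mapping mirrors the helper above.

/* Illustration only: sorting a list of updates by the sort_order key. */
#include <stdio.h>
#include <stdlib.h>

enum { ID_extents = 0, ID_inodes = 1, ID_alloc = 4, ID_stripes = 6 };	/* made-up ids */

static unsigned char trigger_order(int btree)
{
	switch (btree) {
	case ID_alloc:	 return 255;	/* U8_MAX: alloc triggers run last */
	case ID_stripes: return 254;	/* U8_MAX - 1: stripes just before */
	default:	 return (unsigned char) btree;
	}
}

struct update { int btree; unsigned char sort_order; };

static int cmp(const void *l, const void *r)
{
	const struct update *a = l, *b = r;

	return (a->sort_order > b->sort_order) - (a->sort_order < b->sort_order);
}

int main(void)
{
	struct update u[] = { { ID_alloc }, { ID_extents }, { ID_stripes }, { ID_inodes } };
	const unsigned n = sizeof(u) / sizeof(u[0]);

	for (unsigned i = 0; i < n; i++)
		u[i].sort_order = trigger_order(u[i].btree);

	qsort(u, n, sizeof(u[0]), cmp);

	/* prints extents, inodes, stripes, alloc */
	for (unsigned i = 0; i < n; i++)
		printf("btree %d (sort_order %u)\n", u[i].btree, u[i].sort_order);
	return 0;
}
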
struct btree_root {
struct btree *b;


@ -17,7 +17,7 @@
static inline int btree_insert_entry_cmp(const struct btree_insert_entry *l,
const struct btree_insert_entry *r)
{
return cmp_int(l->btree_id, r->btree_id) ?:
return cmp_int(l->sort_order, r->sort_order) ?:
cmp_int(l->cached, r->cached) ?:
-cmp_int(l->level, r->level) ?:
bpos_cmp(l->k->k.p, r->k->k.p);
@ -397,6 +397,7 @@ bch2_trans_update_by_path(struct btree_trans *trans, btree_path_idx_t path_idx,
n = (struct btree_insert_entry) {
.flags = flags,
.sort_order = btree_trigger_order(path->btree_id),
.bkey_type = __btree_node_type(path->level, path->btree_id),
.btree_id = path->btree_id,
.level = path->level,
@ -511,6 +512,8 @@ static noinline int bch2_trans_update_get_key_cache(struct btree_trans *trans,
int __must_check bch2_trans_update(struct btree_trans *trans, struct btree_iter *iter,
struct bkey_i *k, enum btree_iter_update_trigger_flags flags)
{
kmsan_check_memory(k, bkey_bytes(&k->k));
btree_path_idx_t path_idx = iter->update_path ?: iter->path;
int ret;


@ -133,6 +133,8 @@ static inline int __must_check bch2_trans_update_buffered(struct btree_trans *tr
enum btree_id btree,
struct bkey_i *k)
{
kmsan_check_memory(k, bkey_bytes(&k->k));
if (unlikely(!btree_type_uses_write_buffer(btree))) {
int ret = bch2_btree_write_buffer_insert_err(trans, btree, k);
dump_stack();


@ -649,6 +649,14 @@ static int btree_update_nodes_written_trans(struct btree_trans *trans,
return 0;
}
/* If the node has been reused, we might be reading uninitialized memory - that's fine: */
static noinline __no_kmsan_checks bool btree_node_seq_matches(struct btree *b, __le64 seq)
{
struct btree_node *b_data = READ_ONCE(b->data);
return (b_data ? b_data->keys.seq : 0) == seq;
}
static void btree_update_nodes_written(struct btree_update *as)
{
struct bch_fs *c = as->c;
@ -677,17 +685,9 @@ static void btree_update_nodes_written(struct btree_update *as)
* on disk:
*/
for (i = 0; i < as->nr_old_nodes; i++) {
__le64 seq;
b = as->old_nodes[i];
bch2_trans_begin(trans);
btree_node_lock_nopath_nofail(trans, &b->c, SIX_LOCK_read);
seq = b->data ? b->data->keys.seq : 0;
six_unlock_read(&b->c.lock);
bch2_trans_unlock_long(trans);
if (seq == as->old_nodes_seq[i])
if (btree_node_seq_matches(b, as->old_nodes_seq[i]))
wait_on_bit_io(&b->flags, BTREE_NODE_write_in_flight_inner,
TASK_UNINTERRUPTIBLE);
}
@ -2126,6 +2126,31 @@ err_free_update:
goto out;
}
static int get_iter_to_node(struct btree_trans *trans, struct btree_iter *iter,
struct btree *b)
{
bch2_trans_node_iter_init(trans, iter, b->c.btree_id, b->key.k.p,
BTREE_MAX_DEPTH, b->c.level,
BTREE_ITER_intent);
int ret = bch2_btree_iter_traverse(iter);
if (ret)
goto err;
/* has node been freed? */
if (btree_iter_path(trans, iter)->l[b->c.level].b != b) {
/* node has been freed: */
BUG_ON(!btree_node_dying(b));
ret = -BCH_ERR_btree_node_dying;
goto err;
}
BUG_ON(!btree_node_hashed(b));
return 0;
err:
bch2_trans_iter_exit(trans, iter);
return ret;
}
int bch2_btree_node_rewrite(struct btree_trans *trans,
struct btree_iter *iter,
struct btree *b,
@ -2191,6 +2216,61 @@ err:
goto out;
}
static int bch2_btree_node_rewrite_key(struct btree_trans *trans,
enum btree_id btree, unsigned level,
struct bkey_i *k, unsigned flags)
{
struct btree_iter iter;
bch2_trans_node_iter_init(trans, &iter,
btree, k->k.p,
BTREE_MAX_DEPTH, level, 0);
struct btree *b = bch2_btree_iter_peek_node(&iter);
int ret = PTR_ERR_OR_ZERO(b);
if (ret)
goto out;
bool found = b && btree_ptr_hash_val(&b->key) == btree_ptr_hash_val(k);
ret = found
? bch2_btree_node_rewrite(trans, &iter, b, flags)
: -ENOENT;
out:
bch2_trans_iter_exit(trans, &iter);
return ret;
}
int bch2_btree_node_rewrite_pos(struct btree_trans *trans,
enum btree_id btree, unsigned level,
struct bpos pos, unsigned flags)
{
BUG_ON(!level);
/* Traverse one depth lower to get a pointer to the node itself: */
struct btree_iter iter;
bch2_trans_node_iter_init(trans, &iter, btree, pos, 0, level - 1, 0);
struct btree *b = bch2_btree_iter_peek_node(&iter);
int ret = PTR_ERR_OR_ZERO(b);
if (ret)
goto err;
ret = bch2_btree_node_rewrite(trans, &iter, b, flags);
err:
bch2_trans_iter_exit(trans, &iter);
return ret;
}
int bch2_btree_node_rewrite_key_get_iter(struct btree_trans *trans,
struct btree *b, unsigned flags)
{
struct btree_iter iter;
int ret = get_iter_to_node(trans, &iter, b);
if (ret)
return ret == -BCH_ERR_btree_node_dying ? 0 : ret;
ret = bch2_btree_node_rewrite(trans, &iter, b, flags);
bch2_trans_iter_exit(trans, &iter);
return ret;
}
struct async_btree_rewrite {
struct bch_fs *c;
struct work_struct work;
@ -2200,57 +2280,14 @@ struct async_btree_rewrite {
struct bkey_buf key;
};
static int async_btree_node_rewrite_trans(struct btree_trans *trans,
struct async_btree_rewrite *a)
{
struct btree_iter iter;
bch2_trans_node_iter_init(trans, &iter,
a->btree_id, a->key.k->k.p,
BTREE_MAX_DEPTH, a->level, 0);
struct btree *b = bch2_btree_iter_peek_node(&iter);
int ret = PTR_ERR_OR_ZERO(b);
if (ret)
goto out;
bool found = b && btree_ptr_hash_val(&b->key) == btree_ptr_hash_val(a->key.k);
ret = found
? bch2_btree_node_rewrite(trans, &iter, b, 0)
: -ENOENT;
#if 0
/* Tracepoint... */
if (!ret || ret == -ENOENT) {
struct bch_fs *c = trans->c;
struct printbuf buf = PRINTBUF;
if (!ret) {
prt_printf(&buf, "rewrite node:\n ");
bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(a->key.k));
} else {
prt_printf(&buf, "node to rewrite not found:\n want: ");
bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(a->key.k));
prt_printf(&buf, "\n got: ");
if (b)
bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(&b->key));
else
prt_str(&buf, "(null)");
}
bch_info(c, "%s", buf.buf);
printbuf_exit(&buf);
}
#endif
out:
bch2_trans_iter_exit(trans, &iter);
return ret;
}
static void async_btree_node_rewrite_work(struct work_struct *work)
{
struct async_btree_rewrite *a =
container_of(work, struct async_btree_rewrite, work);
struct bch_fs *c = a->c;
int ret = bch2_trans_do(c, async_btree_node_rewrite_trans(trans, a));
int ret = bch2_trans_do(c, bch2_btree_node_rewrite_key(trans,
a->btree_id, a->level, a->key.k, 0));
if (ret != -ENOENT)
bch_err_fn_ratelimited(c, ret);
@ -2494,30 +2531,15 @@ int bch2_btree_node_update_key_get_iter(struct btree_trans *trans,
unsigned commit_flags, bool skip_triggers)
{
struct btree_iter iter;
int ret;
bch2_trans_node_iter_init(trans, &iter, b->c.btree_id, b->key.k.p,
BTREE_MAX_DEPTH, b->c.level,
BTREE_ITER_intent);
ret = bch2_btree_iter_traverse(&iter);
int ret = get_iter_to_node(trans, &iter, b);
if (ret)
goto out;
/* has node been freed? */
if (btree_iter_path(trans, &iter)->l[b->c.level].b != b) {
/* node has been freed: */
BUG_ON(!btree_node_dying(b));
goto out;
}
BUG_ON(!btree_node_hashed(b));
return ret == -BCH_ERR_btree_node_dying ? 0 : ret;
bch2_bkey_drop_ptrs(bkey_i_to_s(new_key), ptr,
!bch2_bkey_has_device(bkey_i_to_s(&b->key), ptr->dev));
ret = bch2_btree_node_update_key(trans, &iter, b, new_key,
commit_flags, skip_triggers);
out:
bch2_trans_iter_exit(trans, &iter);
return ret;
}


@ -169,7 +169,14 @@ static inline int bch2_foreground_maybe_merge(struct btree_trans *trans,
int bch2_btree_node_rewrite(struct btree_trans *, struct btree_iter *,
struct btree *, unsigned);
int bch2_btree_node_rewrite_pos(struct btree_trans *,
enum btree_id, unsigned,
struct bpos, unsigned);
int bch2_btree_node_rewrite_key_get_iter(struct btree_trans *,
struct btree *, unsigned);
void bch2_btree_node_rewrite_async(struct bch_fs *, struct btree *);
int bch2_btree_node_update_key(struct btree_trans *, struct btree_iter *,
struct btree *, struct bkey_i *,
unsigned, bool);


@ -590,11 +590,9 @@ static int bch2_trigger_pointer(struct btree_trans *trans,
if (ret)
goto err;
if (!p.ptr.cached) {
ret = bch2_bucket_backpointer_mod(trans, k, &bp, insert);
if (ret)
goto err;
}
ret = bch2_bucket_backpointer_mod(trans, k, &bp, insert);
if (ret)
goto err;
}
if (flags & BTREE_TRIGGER_gc) {
@ -674,10 +672,10 @@ err:
return -BCH_ERR_ENOMEM_mark_stripe_ptr;
}
mutex_lock(&c->ec_stripes_heap_lock);
gc_stripe_lock(m);
if (!m || !m->alive) {
mutex_unlock(&c->ec_stripes_heap_lock);
gc_stripe_unlock(m);
struct printbuf buf = PRINTBUF;
bch2_bkey_val_to_text(&buf, c, k);
bch_err_ratelimited(c, "pointer to nonexistent stripe %llu\n while marking %s",
@ -693,7 +691,7 @@ err:
.type = BCH_DISK_ACCOUNTING_replicas,
};
memcpy(&acc.replicas, &m->r.e, replicas_entry_bytes(&m->r.e));
mutex_unlock(&c->ec_stripes_heap_lock);
gc_stripe_unlock(m);
acc.replicas.data_type = data_type;
int ret = bch2_disk_accounting_mod(trans, &acc, &sectors, 1, true);
@ -726,9 +724,7 @@ static int __trigger_extent(struct btree_trans *trans,
.replicas.nr_required = 1,
};
struct disk_accounting_pos acct_compression_key = {
.type = BCH_DISK_ACCOUNTING_compression,
};
unsigned cur_compression_type = 0;
u64 compression_acct[3] = { 1, 0, 0 };
bkey_for_each_ptr_decode(k.k, ptrs, p, entry) {
@ -762,13 +758,13 @@ static int __trigger_extent(struct btree_trans *trans,
acc_replicas_key.replicas.nr_required = 0;
}
if (acct_compression_key.compression.type &&
acct_compression_key.compression.type != p.crc.compression_type) {
if (cur_compression_type &&
cur_compression_type != p.crc.compression_type) {
if (flags & BTREE_TRIGGER_overwrite)
bch2_u64s_neg(compression_acct, ARRAY_SIZE(compression_acct));
ret = bch2_disk_accounting_mod(trans, &acct_compression_key, compression_acct,
ARRAY_SIZE(compression_acct), gc);
ret = bch2_disk_accounting_mod2(trans, gc, compression_acct,
compression, cur_compression_type);
if (ret)
return ret;
@ -777,7 +773,7 @@ static int __trigger_extent(struct btree_trans *trans,
compression_acct[2] = 0;
}
acct_compression_key.compression.type = p.crc.compression_type;
cur_compression_type = p.crc.compression_type;
if (p.crc.compression_type) {
compression_acct[1] += p.crc.uncompressed_size;
compression_acct[2] += p.crc.compressed_size;
@ -791,45 +787,34 @@ static int __trigger_extent(struct btree_trans *trans,
}
if (acc_replicas_key.replicas.nr_devs && !level && k.k->p.snapshot) {
struct disk_accounting_pos acc_snapshot_key = {
.type = BCH_DISK_ACCOUNTING_snapshot,
.snapshot.id = k.k->p.snapshot,
};
ret = bch2_disk_accounting_mod(trans, &acc_snapshot_key, replicas_sectors, 1, gc);
ret = bch2_disk_accounting_mod2_nr(trans, gc, replicas_sectors, 1, snapshot, k.k->p.snapshot);
if (ret)
return ret;
}
if (acct_compression_key.compression.type) {
if (cur_compression_type) {
if (flags & BTREE_TRIGGER_overwrite)
bch2_u64s_neg(compression_acct, ARRAY_SIZE(compression_acct));
ret = bch2_disk_accounting_mod(trans, &acct_compression_key, compression_acct,
ARRAY_SIZE(compression_acct), gc);
ret = bch2_disk_accounting_mod2(trans, gc, compression_acct,
compression, cur_compression_type);
if (ret)
return ret;
}
if (level) {
struct disk_accounting_pos acc_btree_key = {
.type = BCH_DISK_ACCOUNTING_btree,
.btree.id = btree_id,
};
ret = bch2_disk_accounting_mod(trans, &acc_btree_key, replicas_sectors, 1, gc);
ret = bch2_disk_accounting_mod2_nr(trans, gc, replicas_sectors, 1, btree, btree_id);
if (ret)
return ret;
} else {
bool insert = !(flags & BTREE_TRIGGER_overwrite);
struct disk_accounting_pos acc_inum_key = {
.type = BCH_DISK_ACCOUNTING_inum,
.inum.inum = k.k->p.inode,
};
s64 v[3] = {
insert ? 1 : -1,
insert ? k.k->size : -((s64) k.k->size),
*replicas_sectors,
};
ret = bch2_disk_accounting_mod(trans, &acc_inum_key, v, ARRAY_SIZE(v), gc);
ret = bch2_disk_accounting_mod2(trans, gc, v, inum, k.k->p.inode);
if (ret)
return ret;
}
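
The compression accounting in __trigger_extent() now keys off cur_compression_type: totals are accumulated while walking the extent's pointers and flushed whenever the compression type changes, plus once at the end. A standalone sketch of that flush-on-change pattern; the pointer list and flush() below are invented, only the shape of the accumulator mirrors compression_acct[].

/* Illustration only: accumulate per-compression-type totals, flushing
 * whenever the type changes, then once more at the end. */
#include <stdio.h>

struct ptr { unsigned type, uncompressed, compressed; };

static void flush(unsigned type, const long long acct[3])
{
	printf("type %u: nr=%lld uncompressed=%lld compressed=%lld\n",
	       type, acct[0], acct[1], acct[2]);
}

int main(void)
{
	struct ptr ptrs[] = { { 1, 8, 4 }, { 1, 8, 2 }, { 2, 16, 6 } };
	unsigned cur_type = 0;
	long long acct[3] = { 1, 0, 0 };

	for (unsigned i = 0; i < sizeof(ptrs) / sizeof(ptrs[0]); i++) {
		if (cur_type && cur_type != ptrs[i].type) {
			flush(cur_type, acct);
			acct[1] = acct[2] = 0;
		}
		cur_type = ptrs[i].type;
		acct[1] += ptrs[i].uncompressed;
		acct[2] += ptrs[i].compressed;
	}
	if (cur_type)
		flush(cur_type, acct);
	return 0;
}
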
@ -878,15 +863,15 @@ int bch2_trigger_extent(struct btree_trans *trans,
}
int need_rebalance_delta = 0;
s64 need_rebalance_sectors_delta = 0;
s64 need_rebalance_sectors_delta[1] = { 0 };
s64 s = bch2_bkey_sectors_need_rebalance(c, old);
need_rebalance_delta -= s != 0;
need_rebalance_sectors_delta -= s;
need_rebalance_sectors_delta[0] -= s;
s = bch2_bkey_sectors_need_rebalance(c, new.s_c);
need_rebalance_delta += s != 0;
need_rebalance_sectors_delta += s;
need_rebalance_sectors_delta[0] += s;
if ((flags & BTREE_TRIGGER_transactional) && need_rebalance_delta) {
int ret = bch2_btree_bit_mod_buffered(trans, BTREE_ID_rebalance_work,
@ -895,12 +880,9 @@ int bch2_trigger_extent(struct btree_trans *trans,
return ret;
}
if (need_rebalance_sectors_delta) {
struct disk_accounting_pos acc = {
.type = BCH_DISK_ACCOUNTING_rebalance_work,
};
int ret = bch2_disk_accounting_mod(trans, &acc, &need_rebalance_sectors_delta, 1,
flags & BTREE_TRIGGER_gc);
if (need_rebalance_sectors_delta[0]) {
int ret = bch2_disk_accounting_mod2(trans, flags & BTREE_TRIGGER_gc,
need_rebalance_sectors_delta, rebalance_work);
if (ret)
return ret;
}
@ -916,17 +898,13 @@ static int __trigger_reservation(struct btree_trans *trans,
enum btree_iter_update_trigger_flags flags)
{
if (flags & (BTREE_TRIGGER_transactional|BTREE_TRIGGER_gc)) {
s64 sectors = k.k->size;
s64 sectors[1] = { k.k->size };
if (flags & BTREE_TRIGGER_overwrite)
sectors = -sectors;
sectors[0] = -sectors[0];
struct disk_accounting_pos acc = {
.type = BCH_DISK_ACCOUNTING_persistent_reserved,
.persistent_reserved.nr_replicas = bkey_s_c_to_reservation(k).v->nr_replicas,
};
return bch2_disk_accounting_mod(trans, &acc, &sectors, 1, flags & BTREE_TRIGGER_gc);
return bch2_disk_accounting_mod2(trans, flags & BTREE_TRIGGER_gc, sectors,
persistent_reserved, bkey_s_c_to_reservation(k).v->nr_replicas);
}
return 0;


@ -39,33 +39,6 @@ static inline u64 sector_to_bucket_and_offset(const struct bch_dev *ca, sector_t
for (_b = (_buckets)->b + (_buckets)->first_bucket; \
_b < (_buckets)->b + (_buckets)->nbuckets; _b++)
/*
* Ugly hack alert:
*
* We need to cram a spinlock in a single byte, because that's what we have left
* in struct bucket, and we care about the size of these - during fsck, we need
* in memory state for every single bucket on every device.
*
* We used to do
* while (xchg(&b->lock, 1) cpu_relax();
* but, it turns out not all architectures support xchg on a single byte.
*
* So now we use bit_spin_lock(), with fun games since we can't burn a whole
* ulong for this - we just need to make sure the lock bit always ends up in the
* first byte.
*/
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define BUCKET_LOCK_BITNR 0
#else
#define BUCKET_LOCK_BITNR (BITS_PER_LONG - 1)
#endif
union ulong_byte_assert {
ulong ulong;
u8 byte;
};
static inline void bucket_unlock(struct bucket *b)
{
BUILD_BUG_ON(!((union ulong_byte_assert) { .ulong = 1UL << BUCKET_LOCK_BITNR }).byte);
@ -167,9 +140,7 @@ static inline int gen_cmp(u8 a, u8 b)
static inline int gen_after(u8 a, u8 b)
{
int r = gen_cmp(a, b);
return r > 0 ? r : 0;
return max(0, gen_cmp(a, b));
}
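
Bucket generations are 8-bit and wrap around, so "a is newer than b" has to be computed as a wrapping signed difference rather than a plain compare; the max(0, ...) form above just clamps the negative case. A standalone sketch; the gen_cmp() body below is an assumption (the patch only shows its signature in the hunk header).

/* Illustration only: wraparound-safe generation comparison. */
#include <stdio.h>
#include <stdint.h>

static int gen_cmp(uint8_t a, uint8_t b)
{
	return (int8_t) (a - b);	/* assumed definition */
}

static int gen_after(uint8_t a, uint8_t b)
{
	int r = gen_cmp(a, b);
	return r > 0 ? r : 0;		/* == max(0, gen_cmp(a, b)) */
}

int main(void)
{
	printf("%d\n", gen_after(5, 3));	/* 2: a is newer */
	printf("%d\n", gen_after(3, 5));	/* 0: a is older */
	printf("%d\n", gen_after(1, 255));	/* 2: still correct across wraparound */
	return 0;
}
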
static inline int dev_ptr_stale_rcu(struct bch_dev *ca, const struct bch_extent_ptr *ptr)


@ -7,6 +7,33 @@
#define BUCKET_JOURNAL_SEQ_BITS 16
/*
* Ugly hack alert:
*
* We need to cram a spinlock in a single byte, because that's what we have left
* in struct bucket, and we care about the size of these - during fsck, we need
* in memory state for every single bucket on every device.
*
* We used to do
* while (xchg(&b->lock, 1) cpu_relax();
* but, it turns out not all architectures support xchg on a single byte.
*
* So now we use bit_spin_lock(), with fun games since we can't burn a whole
* ulong for this - we just need to make sure the lock bit always ends up in the
* first byte.
*/
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define BUCKET_LOCK_BITNR 0
#else
#define BUCKET_LOCK_BITNR (BITS_PER_LONG - 1)
#endif
union ulong_byte_assert {
ulong ulong;
u8 byte;
};
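
The moved comment and union above are easy to sanity-check in isolation: on little endian bit 0 of a long lives in its first byte, on big endian the top bit does, so either way the lock bit overlaps the single lock byte at the start of struct bucket. A userspace restatement of the BUILD_BUG_ON() in bucket_unlock(), with BITS_PER_LONG spelled out by hand:

/* Illustration only: not part of the patch. */
#include <assert.h>
#include <stdio.h>

#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define BUCKET_LOCK_BITNR	0
#else
#define BUCKET_LOCK_BITNR	(sizeof(unsigned long) * 8 - 1)
#endif

union ulong_byte_assert {
	unsigned long	ulong;
	unsigned char	byte;
};

int main(void)
{
	union ulong_byte_assert u = { .ulong = 1UL << BUCKET_LOCK_BITNR };

	/* reading .byte reads the first byte of the word */
	assert(u.byte != 0);
	printf("lock bit %d lands in byte 0\n", (int) BUCKET_LOCK_BITNR);
	return 0;
}
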
struct bucket {
u8 lock;
u8 gen_valid:1;


@ -11,6 +11,7 @@
#include "move.h"
#include "recovery_passes.h"
#include "replicas.h"
#include "sb-counters.h"
#include "super-io.h"
#include "thread_with_file.h"
@ -312,7 +313,12 @@ static int bch2_data_thread(void *arg)
struct bch_data_ctx *ctx = container_of(arg, struct bch_data_ctx, thr);
ctx->thr.ret = bch2_data_job(ctx->c, &ctx->stats, ctx->arg);
ctx->stats.data_type = U8_MAX;
if (ctx->thr.ret == -BCH_ERR_device_offline)
ctx->stats.ret = BCH_IOCTL_DATA_EVENT_RET_device_offline;
else {
ctx->stats.ret = BCH_IOCTL_DATA_EVENT_RET_done;
ctx->stats.data_type = (int) DATA_PROGRESS_DATA_TYPE_done;
}
return 0;
}
@ -331,14 +337,30 @@ static ssize_t bch2_data_job_read(struct file *file, char __user *buf,
struct bch_data_ctx *ctx = container_of(file->private_data, struct bch_data_ctx, thr);
struct bch_fs *c = ctx->c;
struct bch_ioctl_data_event e = {
.type = BCH_DATA_EVENT_PROGRESS,
.p.data_type = ctx->stats.data_type,
.p.btree_id = ctx->stats.pos.btree,
.p.pos = ctx->stats.pos.pos,
.p.sectors_done = atomic64_read(&ctx->stats.sectors_seen),
.p.sectors_total = bch2_fs_usage_read_short(c).used,
.type = BCH_DATA_EVENT_PROGRESS,
.ret = ctx->stats.ret,
.p.data_type = ctx->stats.data_type,
.p.btree_id = ctx->stats.pos.btree,
.p.pos = ctx->stats.pos.pos,
.p.sectors_done = atomic64_read(&ctx->stats.sectors_seen),
.p.sectors_error_corrected = atomic64_read(&ctx->stats.sectors_error_corrected),
.p.sectors_error_uncorrected = atomic64_read(&ctx->stats.sectors_error_uncorrected),
};
if (ctx->arg.op == BCH_DATA_OP_scrub) {
struct bch_dev *ca = bch2_dev_tryget(c, ctx->arg.scrub.dev);
if (ca) {
struct bch_dev_usage u;
bch2_dev_usage_read_fast(ca, &u);
for (unsigned i = BCH_DATA_btree; i < ARRAY_SIZE(u.d); i++)
if (ctx->arg.scrub.data_types & BIT(i))
e.p.sectors_total += u.d[i].sectors;
bch2_dev_put(ca);
}
} else {
e.p.sectors_total = bch2_fs_usage_read_short(c).used;
}
if (len < sizeof(e))
return -EINVAL;
@ -710,6 +732,8 @@ long bch2_fs_ioctl(struct bch_fs *c, unsigned cmd, void __user *arg)
BCH_IOCTL(fsck_online, struct bch_ioctl_fsck_online);
case BCH_IOCTL_QUERY_ACCOUNTING:
return bch2_ioctl_query_accounting(c, arg);
case BCH_IOCTL_QUERY_COUNTERS:
return bch2_ioctl_query_counters(c, arg);
default:
return -ENOTTY;
}


@ -466,7 +466,7 @@ int bch2_rechecksum_bio(struct bch_fs *c, struct bio *bio,
prt_str(&buf, ")");
WARN_RATELIMIT(1, "%s", buf.buf);
printbuf_exit(&buf);
return -EIO;
return -BCH_ERR_recompute_checksum;
}
for (i = splits; i < splits + ARRAY_SIZE(splits); i++) {
@ -693,6 +693,14 @@ static int bch2_alloc_ciphers(struct bch_fs *c)
return 0;
}
#if 0
/*
* This seems to be duplicating code in cmd_remove_passphrase() in
* bcachefs-tools, but we might want to switch userspace to use this - and
* perhaps add an ioctl for calling this at runtime, so we can take the
* passphrase off of a mounted filesystem (which has come up).
*/
int bch2_disable_encryption(struct bch_fs *c)
{
struct bch_sb_field_crypt *crypt;
@ -725,6 +733,10 @@ out:
return ret;
}
/*
* For enabling encryption on an existing filesystem: not hooked up yet, but it
* should be
*/
int bch2_enable_encryption(struct bch_fs *c, bool keyed)
{
struct bch_encrypted_key key;
@ -781,6 +793,7 @@ err:
memzero_explicit(&key, sizeof(key));
return ret;
}
#endif
void bch2_fs_encryption_exit(struct bch_fs *c)
{
@ -788,8 +801,6 @@ void bch2_fs_encryption_exit(struct bch_fs *c)
crypto_free_shash(c->poly1305);
if (c->chacha20)
crypto_free_sync_skcipher(c->chacha20);
if (c->sha256)
crypto_free_shash(c->sha256);
}
int bch2_fs_encryption_init(struct bch_fs *c)
@ -798,14 +809,6 @@ int bch2_fs_encryption_init(struct bch_fs *c)
struct bch_key key;
int ret = 0;
c->sha256 = crypto_alloc_shash("sha256", 0, 0);
ret = PTR_ERR_OR_ZERO(c->sha256);
if (ret) {
c->sha256 = NULL;
bch_err(c, "error requesting sha256 module: %s", bch2_err_str(ret));
goto out;
}
crypt = bch2_sb_field_get(c->disk_sb.sb, crypt);
if (!crypt)
goto out;


@ -103,8 +103,10 @@ extern const struct bch_sb_field_ops bch_sb_field_ops_crypt;
int bch2_decrypt_sb_key(struct bch_fs *, struct bch_sb_field_crypt *,
struct bch_key *);
#if 0
int bch2_disable_encryption(struct bch_fs *);
int bch2_enable_encryption(struct bch_fs *, bool);
#endif
void bch2_fs_encryption_exit(struct bch_fs *);
int bch2_fs_encryption_init(struct bch_fs *);


@ -177,7 +177,7 @@ static int __bio_uncompress(struct bch_fs *c, struct bio *src,
size_t src_len = src->bi_iter.bi_size;
size_t dst_len = crc.uncompressed_size << 9;
void *workspace;
int ret;
int ret = 0, ret2;
enum bch_compression_opts opt = bch2_compression_type_to_opt(crc.compression_type);
mempool_t *workspace_pool = &c->compress_workspace[opt];
@ -189,7 +189,7 @@ static int __bio_uncompress(struct bch_fs *c, struct bio *src,
else
ret = -BCH_ERR_compression_workspace_not_initialized;
if (ret)
goto out;
goto err;
}
src_data = bio_map_or_bounce(c, src, READ);
@ -197,10 +197,10 @@ static int __bio_uncompress(struct bch_fs *c, struct bio *src,
switch (crc.compression_type) {
case BCH_COMPRESSION_TYPE_lz4_old:
case BCH_COMPRESSION_TYPE_lz4:
ret = LZ4_decompress_safe_partial(src_data.b, dst_data,
src_len, dst_len, dst_len);
if (ret != dst_len)
goto err;
ret2 = LZ4_decompress_safe_partial(src_data.b, dst_data,
src_len, dst_len, dst_len);
if (ret2 != dst_len)
ret = -BCH_ERR_decompress_lz4;
break;
case BCH_COMPRESSION_TYPE_gzip: {
z_stream strm = {
@ -214,45 +214,43 @@ static int __bio_uncompress(struct bch_fs *c, struct bio *src,
zlib_set_workspace(&strm, workspace);
zlib_inflateInit2(&strm, -MAX_WBITS);
ret = zlib_inflate(&strm, Z_FINISH);
ret2 = zlib_inflate(&strm, Z_FINISH);
mempool_free(workspace, workspace_pool);
if (ret != Z_STREAM_END)
goto err;
if (ret2 != Z_STREAM_END)
ret = -BCH_ERR_decompress_gzip;
break;
}
case BCH_COMPRESSION_TYPE_zstd: {
ZSTD_DCtx *ctx;
size_t real_src_len = le32_to_cpup(src_data.b);
if (real_src_len > src_len - 4)
if (real_src_len > src_len - 4) {
ret = -BCH_ERR_decompress_zstd_src_len_bad;
goto err;
}
workspace = mempool_alloc(workspace_pool, GFP_NOFS);
ctx = zstd_init_dctx(workspace, zstd_dctx_workspace_bound());
ret = zstd_decompress_dctx(ctx,
ret2 = zstd_decompress_dctx(ctx,
dst_data, dst_len,
src_data.b + 4, real_src_len);
mempool_free(workspace, workspace_pool);
if (ret != dst_len)
goto err;
if (ret2 != dst_len)
ret = -BCH_ERR_decompress_zstd;
break;
}
default:
BUG();
}
ret = 0;
err:
fsck_err:
out:
bio_unmap_or_unbounce(c, src_data);
return ret;
err:
ret = -EIO;
goto out;
}
int bch2_bio_uncompress_inplace(struct bch_write_op *op,
@ -268,27 +266,22 @@ int bch2_bio_uncompress_inplace(struct bch_write_op *op,
BUG_ON(!bio->bi_vcnt);
BUG_ON(DIV_ROUND_UP(crc->live_size, PAGE_SECTORS) > bio->bi_max_vecs);
if (crc->uncompressed_size << 9 > c->opts.encoded_extent_max ||
crc->compressed_size << 9 > c->opts.encoded_extent_max) {
struct printbuf buf = PRINTBUF;
bch2_write_op_error(&buf, op);
prt_printf(&buf, "error rewriting existing data: extent too big");
bch_err_ratelimited(c, "%s", buf.buf);
printbuf_exit(&buf);
return -EIO;
if (crc->uncompressed_size << 9 > c->opts.encoded_extent_max) {
bch2_write_op_error(op, op->pos.offset,
"extent too big to decompress (%u > %u)",
crc->uncompressed_size << 9, c->opts.encoded_extent_max);
return -BCH_ERR_decompress_exceeded_max_encoded_extent;
}
data = __bounce_alloc(c, dst_len, WRITE);
if (__bio_uncompress(c, bio, data.b, *crc)) {
if (!c->opts.no_data_io) {
struct printbuf buf = PRINTBUF;
bch2_write_op_error(&buf, op);
prt_printf(&buf, "error rewriting existing data: decompression error");
bch_err_ratelimited(c, "%s", buf.buf);
printbuf_exit(&buf);
}
ret = -EIO;
ret = __bio_uncompress(c, bio, data.b, *crc);
if (c->opts.no_data_io)
ret = 0;
if (ret) {
bch2_write_op_error(op, op->pos.offset, "%s", bch2_err_str(ret));
goto err;
}
@ -321,7 +314,7 @@ int bch2_bio_uncompress(struct bch_fs *c, struct bio *src,
if (crc.uncompressed_size << 9 > c->opts.encoded_extent_max ||
crc.compressed_size << 9 > c->opts.encoded_extent_max)
return -EIO;
return -BCH_ERR_decompress_exceeded_max_encoded_extent;
dst_data = dst_len == dst_iter.bi_size
? __bio_map_or_bounce(c, dst, dst_iter, WRITE)


@ -20,6 +20,8 @@
#include "subvolume.h"
#include "trace.h"
#include <linux/ioprio.h>
static void bkey_put_dev_refs(struct bch_fs *c, struct bkey_s_c k)
{
struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k);
@ -33,7 +35,7 @@ static bool bkey_get_dev_refs(struct bch_fs *c, struct bkey_s_c k)
struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k);
bkey_for_each_ptr(ptrs, ptr) {
if (!bch2_dev_tryget(c, ptr->dev)) {
if (unlikely(!bch2_dev_tryget(c, ptr->dev))) {
bkey_for_each_ptr(ptrs, ptr2) {
if (ptr2 == ptr)
break;
@ -91,7 +93,7 @@ static bool bkey_nocow_lock(struct bch_fs *c, struct moving_context *ctxt, struc
return true;
}
static noinline void trace_move_extent_finish2(struct data_update *u,
static noinline void trace_io_move_finish2(struct data_update *u,
struct bkey_i *new,
struct bkey_i *insert)
{
@ -111,11 +113,11 @@ static noinline void trace_move_extent_finish2(struct data_update *u,
bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(insert));
prt_newline(&buf);
trace_move_extent_finish(c, buf.buf);
trace_io_move_finish(c, buf.buf);
printbuf_exit(&buf);
}
static void trace_move_extent_fail2(struct data_update *m,
static void trace_io_move_fail2(struct data_update *m,
struct bkey_s_c new,
struct bkey_s_c wrote,
struct bkey_i *insert,
@ -126,7 +128,7 @@ static void trace_move_extent_fail2(struct data_update *m,
struct printbuf buf = PRINTBUF;
unsigned rewrites_found = 0;
if (!trace_move_extent_fail_enabled())
if (!trace_io_move_fail_enabled())
return;
prt_str(&buf, msg);
@ -166,7 +168,7 @@ static void trace_move_extent_fail2(struct data_update *m,
bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(insert));
}
trace_move_extent_fail(c, buf.buf);
trace_io_move_fail(c, buf.buf);
printbuf_exit(&buf);
}
@ -214,7 +216,7 @@ static int __bch2_data_update_index_update(struct btree_trans *trans,
new = bkey_i_to_extent(bch2_keylist_front(keys));
if (!bch2_extents_match(k, old)) {
trace_move_extent_fail2(m, k, bkey_i_to_s_c(&new->k_i),
trace_io_move_fail2(m, k, bkey_i_to_s_c(&new->k_i),
NULL, "no match:");
goto nowork;
}
@ -254,7 +256,7 @@ static int __bch2_data_update_index_update(struct btree_trans *trans,
if (m->data_opts.rewrite_ptrs &&
!rewrites_found &&
bch2_bkey_durability(c, k) >= m->op.opts.data_replicas) {
trace_move_extent_fail2(m, k, bkey_i_to_s_c(&new->k_i), insert, "no rewrites found:");
trace_io_move_fail2(m, k, bkey_i_to_s_c(&new->k_i), insert, "no rewrites found:");
goto nowork;
}
@ -271,7 +273,7 @@ restart_drop_conflicting_replicas:
}
if (!bkey_val_u64s(&new->k)) {
trace_move_extent_fail2(m, k, bkey_i_to_s_c(&new->k_i), insert, "new replicas conflicted:");
trace_io_move_fail2(m, k, bkey_i_to_s_c(&new->k_i), insert, "new replicas conflicted:");
goto nowork;
}
@ -352,7 +354,7 @@ restart_drop_extra_replicas:
printbuf_exit(&buf);
bch2_fatal_error(c);
ret = -EIO;
ret = -BCH_ERR_invalid_bkey;
goto out;
}
@ -385,9 +387,9 @@ restart_drop_extra_replicas:
if (!ret) {
bch2_btree_iter_set_pos(&iter, next_pos);
this_cpu_add(c->counters[BCH_COUNTER_move_extent_finish], new->k.size);
if (trace_move_extent_finish_enabled())
trace_move_extent_finish2(m, &new->k_i, insert);
this_cpu_add(c->counters[BCH_COUNTER_io_move_finish], new->k.size);
if (trace_io_move_finish_enabled())
trace_io_move_finish2(m, &new->k_i, insert);
}
err:
if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
@ -409,7 +411,7 @@ nowork:
&m->stats->sectors_raced);
}
count_event(c, move_extent_fail);
count_event(c, io_move_fail);
bch2_btree_iter_advance(&iter);
goto next;
@ -427,14 +429,17 @@ int bch2_data_update_index_update(struct bch_write_op *op)
return bch2_trans_run(op->c, __bch2_data_update_index_update(trans, op));
}
void bch2_data_update_read_done(struct data_update *m,
struct bch_extent_crc_unpacked crc)
void bch2_data_update_read_done(struct data_update *m)
{
m->read_done = true;
/* write bio must own pages: */
BUG_ON(!m->op.wbio.bio.bi_vcnt);
m->op.crc = crc;
m->op.wbio.bio.bi_iter.bi_size = crc.compressed_size << 9;
m->op.crc = m->rbio.pick.crc;
m->op.wbio.bio.bi_iter.bi_size = m->op.crc.compressed_size << 9;
this_cpu_add(m->op.c->counters[BCH_COUNTER_io_move_write], m->k.k->k.size);
closure_call(&m->op.cl, bch2_write, NULL, NULL);
}
@ -444,31 +449,34 @@ void bch2_data_update_exit(struct data_update *update)
struct bch_fs *c = update->op.c;
struct bkey_s_c k = bkey_i_to_s_c(update->k.k);
bch2_bio_free_pages_pool(c, &update->op.wbio.bio);
kfree(update->bvecs);
update->bvecs = NULL;
if (c->opts.nocow_enabled)
bkey_nocow_unlock(c, k);
bkey_put_dev_refs(c, k);
bch2_bkey_buf_exit(&update->k, c);
bch2_disk_reservation_put(c, &update->op.res);
bch2_bio_free_pages_pool(c, &update->op.wbio.bio);
bch2_bkey_buf_exit(&update->k, c);
}
static void bch2_update_unwritten_extent(struct btree_trans *trans,
struct data_update *update)
static int bch2_update_unwritten_extent(struct btree_trans *trans,
struct data_update *update)
{
struct bch_fs *c = update->op.c;
struct bio *bio = &update->op.wbio.bio;
struct bkey_i_extent *e;
struct write_point *wp;
struct closure cl;
struct btree_iter iter;
struct bkey_s_c k;
int ret;
int ret = 0;
closure_init_stack(&cl);
bch2_keylist_init(&update->op.insert_keys, update->op.inline_keys);
while (bio_sectors(bio)) {
unsigned sectors = bio_sectors(bio);
while (bpos_lt(update->op.pos, update->k.k->k.p)) {
unsigned sectors = update->k.k->k.p.offset -
update->op.pos.offset;
bch2_trans_begin(trans);
@ -504,7 +512,7 @@ static void bch2_update_unwritten_extent(struct btree_trans *trans,
bch_err_fn_ratelimited(c, ret);
if (ret)
return;
break;
sectors = min(sectors, wp->sectors_free);
@ -514,7 +522,6 @@ static void bch2_update_unwritten_extent(struct btree_trans *trans,
bch2_alloc_sectors_append_ptrs(c, wp, &e->k_i, sectors, false);
bch2_alloc_sectors_done(c, wp);
bio_advance(bio, sectors << 9);
update->op.pos.offset += sectors;
extent_for_each_ptr(extent_i_to_s(e), ptr)
@ -533,13 +540,16 @@ static void bch2_update_unwritten_extent(struct btree_trans *trans,
bch2_trans_unlock(trans);
closure_sync(&cl);
}
return ret;
}
void bch2_data_update_opts_to_text(struct printbuf *out, struct bch_fs *c,
struct bch_io_opts *io_opts,
struct data_update_opts *data_opts)
{
printbuf_tabstop_push(out, 20);
if (!out->nr_tabstops)
printbuf_tabstop_push(out, 20);
prt_str_indented(out, "rewrite ptrs:\t");
bch2_prt_u64_base2(out, data_opts->rewrite_ptrs);
@ -574,6 +584,17 @@ void bch2_data_update_to_text(struct printbuf *out, struct data_update *m)
bch2_bkey_val_to_text(out, m->op.c, bkey_i_to_s_c(m->k.k));
}
void bch2_data_update_inflight_to_text(struct printbuf *out, struct data_update *m)
{
bch2_bkey_val_to_text(out, m->op.c, bkey_i_to_s_c(m->k.k));
prt_newline(out);
printbuf_indent_add(out, 2);
bch2_data_update_opts_to_text(out, m->op.c, &m->op.opts, &m->data_opts);
prt_printf(out, "read_done:\t\%u\n", m->read_done);
bch2_write_op_to_text(out, &m->op);
printbuf_indent_sub(out, 2);
}
int bch2_extent_drop_ptrs(struct btree_trans *trans,
struct btree_iter *iter,
struct bkey_s_c k,
@ -617,12 +638,85 @@ int bch2_extent_drop_ptrs(struct btree_trans *trans,
bch2_trans_commit(trans, NULL, NULL, BCH_TRANS_COMMIT_no_enospc);
}
int bch2_data_update_bios_init(struct data_update *m, struct bch_fs *c,
struct bch_io_opts *io_opts)
{
struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(bkey_i_to_s_c(m->k.k));
const union bch_extent_entry *entry;
struct extent_ptr_decoded p;
/* write path might have to decompress data: */
unsigned buf_bytes = 0;
bkey_for_each_ptr_decode(&m->k.k->k, ptrs, p, entry)
buf_bytes = max_t(unsigned, buf_bytes, p.crc.uncompressed_size << 9);
unsigned nr_vecs = DIV_ROUND_UP(buf_bytes, PAGE_SIZE);
m->bvecs = kmalloc_array(nr_vecs, sizeof*(m->bvecs), GFP_KERNEL);
if (!m->bvecs)
return -ENOMEM;
bio_init(&m->rbio.bio, NULL, m->bvecs, nr_vecs, REQ_OP_READ);
bio_init(&m->op.wbio.bio, NULL, m->bvecs, nr_vecs, 0);
if (bch2_bio_alloc_pages(&m->op.wbio.bio, buf_bytes, GFP_KERNEL)) {
kfree(m->bvecs);
m->bvecs = NULL;
return -ENOMEM;
}
rbio_init(&m->rbio.bio, c, *io_opts, NULL);
m->rbio.data_update = true;
m->rbio.bio.bi_iter.bi_size = buf_bytes;
m->rbio.bio.bi_iter.bi_sector = bkey_start_offset(&m->k.k->k);
m->op.wbio.bio.bi_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0);
return 0;
}
static int can_write_extent(struct bch_fs *c, struct data_update *m)
{
if ((m->op.flags & BCH_WRITE_alloc_nowait) &&
unlikely(c->open_buckets_nr_free <= bch2_open_buckets_reserved(m->op.watermark)))
return -BCH_ERR_data_update_done_would_block;
unsigned target = m->op.flags & BCH_WRITE_only_specified_devs
? m->op.target
: 0;
struct bch_devs_mask devs = target_rw_devs(c, BCH_DATA_user, target);
darray_for_each(m->op.devs_have, i)
__clear_bit(*i, devs.d);
rcu_read_lock();
unsigned nr_replicas = 0, i;
for_each_set_bit(i, devs.d, BCH_SB_MEMBERS_MAX) {
struct bch_dev *ca = bch2_dev_rcu(c, i);
struct bch_dev_usage usage;
bch2_dev_usage_read_fast(ca, &usage);
if (!dev_buckets_free(ca, usage, m->op.watermark))
continue;
nr_replicas += ca->mi.durability;
if (nr_replicas >= m->op.nr_replicas)
break;
}
rcu_read_unlock();
if (!nr_replicas)
return -BCH_ERR_data_update_done_no_rw_devs;
if (nr_replicas < m->op.nr_replicas)
return -BCH_ERR_insufficient_devices;
return 0;
}
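
can_write_extent() above is the new pre-read check: before kicking off the read side of a data update, add up the durability of destination devices that are online and still have free buckets, and fail early if the write could not allocate anyway. A standalone sketch of just that arithmetic; the device table and return values below are invented.

/* Illustration only: durability available on writable destination devices. */
#include <stdio.h>

struct dev { unsigned durability; unsigned free_buckets; int in_target; };

static int can_write(const struct dev *devs, unsigned nr_devs, unsigned nr_replicas)
{
	unsigned avail = 0;

	for (unsigned i = 0; i < nr_devs; i++) {
		if (!devs[i].in_target || !devs[i].free_buckets)
			continue;
		avail += devs[i].durability;
		if (avail >= nr_replicas)
			return 0;			/* ok */
	}

	if (!avail)
		return -1;	/* like -BCH_ERR_data_update_done_no_rw_devs */
	return -2;		/* like -BCH_ERR_insufficient_devices */
}

int main(void)
{
	struct dev devs[] = {
		{ .durability = 1, .free_buckets = 100, .in_target = 1 },
		{ .durability = 1, .free_buckets = 0,   .in_target = 1 },	/* full */
		{ .durability = 2, .free_buckets = 50,  .in_target = 0 },	/* not in target */
	};

	printf("%d\n", can_write(devs, 3, 1));	/*  0: one replica fits */
	printf("%d\n", can_write(devs, 3, 2));	/* -2: not enough durability online */
	return 0;
}
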
int bch2_data_update_init(struct btree_trans *trans,
struct btree_iter *iter,
struct moving_context *ctxt,
struct data_update *m,
struct write_point_specifier wp,
struct bch_io_opts io_opts,
struct bch_io_opts *io_opts,
struct data_update_opts data_opts,
enum btree_id btree_id,
struct bkey_s_c k)
@ -640,16 +734,7 @@ int bch2_data_update_init(struct btree_trans *trans,
* snapshots table - just skip it, we can move it later.
*/
if (unlikely(k.k->p.snapshot && !bch2_snapshot_exists(c, k.k->p.snapshot)))
return -BCH_ERR_data_update_done;
if (!bkey_get_dev_refs(c, k))
return -BCH_ERR_data_update_done;
if (c->opts.nocow_enabled &&
!bkey_nocow_lock(c, ctxt, k)) {
bkey_put_dev_refs(c, k);
return -BCH_ERR_nocow_lock_blocked;
}
return -BCH_ERR_data_update_done_no_snapshot;
bch2_bkey_buf_init(&m->k);
bch2_bkey_buf_reassemble(&m->k, c, k);
@ -658,18 +743,18 @@ int bch2_data_update_init(struct btree_trans *trans,
m->ctxt = ctxt;
m->stats = ctxt ? ctxt->stats : NULL;
bch2_write_op_init(&m->op, c, io_opts);
bch2_write_op_init(&m->op, c, *io_opts);
m->op.pos = bkey_start_pos(k.k);
m->op.version = k.k->bversion;
m->op.target = data_opts.target;
m->op.write_point = wp;
m->op.nr_replicas = 0;
m->op.flags |= BCH_WRITE_PAGES_STABLE|
BCH_WRITE_PAGES_OWNED|
BCH_WRITE_DATA_ENCODED|
BCH_WRITE_MOVE|
m->op.flags |= BCH_WRITE_pages_stable|
BCH_WRITE_pages_owned|
BCH_WRITE_data_encoded|
BCH_WRITE_move|
m->data_opts.write_flags;
m->op.compression_opt = io_opts.background_compression;
m->op.compression_opt = io_opts->background_compression;
m->op.watermark = m->data_opts.btree_insert_flags & BCH_WATERMARK_MASK;
unsigned durability_have = 0, durability_removing = 0;
@ -707,7 +792,7 @@ int bch2_data_update_init(struct btree_trans *trans,
ptr_bit <<= 1;
}
unsigned durability_required = max(0, (int) (io_opts.data_replicas - durability_have));
unsigned durability_required = max(0, (int) (io_opts->data_replicas - durability_have));
/*
* If current extent durability is less than io_opts.data_replicas,
@ -740,28 +825,70 @@ int bch2_data_update_init(struct btree_trans *trans,
m->data_opts.rewrite_ptrs = 0;
/* if iter == NULL, it's just a promote */
if (iter)
ret = bch2_extent_drop_ptrs(trans, iter, k, &io_opts, &m->data_opts);
goto out;
ret = bch2_extent_drop_ptrs(trans, iter, k, io_opts, &m->data_opts);
if (!ret)
ret = -BCH_ERR_data_update_done_no_writes_needed;
goto out_bkey_buf_exit;
}
/*
* Check if the allocation will succeed, to avoid getting an error later
* in bch2_write() -> bch2_alloc_sectors_start() and doing a useless
* read:
*
* This guards against
* - BCH_WRITE_alloc_nowait allocations failing (promotes)
* - Destination target full
* - Device(s) in destination target offline
* - Insufficient durability available in destination target
* (i.e. trying to move a durability=2 replica to a target with a
* single durability=2 device)
*/
ret = can_write_extent(c, m);
if (ret)
goto out_bkey_buf_exit;
if (reserve_sectors) {
ret = bch2_disk_reservation_add(c, &m->op.res, reserve_sectors,
m->data_opts.extra_replicas
? 0
: BCH_DISK_RESERVATION_NOFAIL);
if (ret)
goto out;
goto out_bkey_buf_exit;
}
if (!bkey_get_dev_refs(c, k)) {
ret = -BCH_ERR_data_update_done_no_dev_refs;
goto out_put_disk_res;
}
if (c->opts.nocow_enabled &&
!bkey_nocow_lock(c, ctxt, k)) {
ret = -BCH_ERR_nocow_lock_blocked;
goto out_put_dev_refs;
}
if (bkey_extent_is_unwritten(k)) {
bch2_update_unwritten_extent(trans, m);
goto out;
ret = bch2_update_unwritten_extent(trans, m) ?:
-BCH_ERR_data_update_done_unwritten;
goto out_nocow_unlock;
}
ret = bch2_data_update_bios_init(m, c, io_opts);
if (ret)
goto out_nocow_unlock;
return 0;
out:
bch2_data_update_exit(m);
return ret ?: -BCH_ERR_data_update_done;
out_nocow_unlock:
if (c->opts.nocow_enabled)
bkey_nocow_unlock(c, k);
out_put_dev_refs:
bkey_put_dev_refs(c, k);
out_put_disk_res:
bch2_disk_reservation_put(c, &m->op.res);
out_bkey_buf_exit:
bch2_bkey_buf_exit(&m->k, c);
return ret;
}
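
bch2_data_update_init() now acquires its resources one at a time and unwinds them in reverse order through the out_nocow_unlock / out_put_dev_refs / out_put_disk_res / out_bkey_buf_exit labels, rather than funnelling every failure through bch2_data_update_exit(). A minimal sketch of that goto-unwind idiom, with dummy resources standing in for the real ones:

/* Illustration only: reverse-order unwind on the error path. */
#include <stdio.h>
#include <stdlib.h>

static int do_work(int fail_at)
{
	int ret = 0;

	void *a = fail_at == 1 ? NULL : malloc(16);	/* ~ bkey buf */
	if (!a) {
		ret = -1;
		goto out;
	}

	void *b = fail_at == 2 ? NULL : malloc(16);	/* ~ disk reservation */
	if (!b) {
		ret = -1;
		goto out_free_a;
	}

	void *c = fail_at == 3 ? NULL : malloc(16);	/* ~ dev refs / nocow locks */
	if (!c) {
		ret = -1;
		goto out_free_b;
	}

	/* the successful path would hand ownership to the caller here;
	 * for the demo we just fall through and release everything */
	free(c);
out_free_b:
	free(b);
out_free_a:
	free(a);
out:
	return ret;
}

int main(void)
{
	for (int i = 0; i <= 3; i++)
		printf("fail_at=%d -> %d\n", i, do_work(i));
	return 0;
}
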
void bch2_data_update_opts_normalize(struct bkey_s_c k, struct data_update_opts *opts)


@ -4,6 +4,7 @@
#define _BCACHEFS_DATA_UPDATE_H
#include "bkey_buf.h"
#include "io_read.h"
#include "io_write_types.h"
struct moving_context;
@ -15,6 +16,9 @@ struct data_update_opts {
u8 extra_replicas;
unsigned btree_insert_flags;
unsigned write_flags;
int read_dev;
bool scrub;
};
void bch2_data_update_opts_to_text(struct printbuf *, struct bch_fs *,
@ -22,20 +26,24 @@ void bch2_data_update_opts_to_text(struct printbuf *, struct bch_fs *,
struct data_update {
/* extent being updated: */
bool read_done;
enum btree_id btree_id;
struct bkey_buf k;
struct data_update_opts data_opts;
struct moving_context *ctxt;
struct bch_move_stats *stats;
struct bch_read_bio rbio;
struct bch_write_op op;
struct bio_vec *bvecs;
};
void bch2_data_update_to_text(struct printbuf *, struct data_update *);
void bch2_data_update_inflight_to_text(struct printbuf *, struct data_update *);
int bch2_data_update_index_update(struct bch_write_op *);
void bch2_data_update_read_done(struct data_update *,
struct bch_extent_crc_unpacked);
void bch2_data_update_read_done(struct data_update *);
int bch2_extent_drop_ptrs(struct btree_trans *,
struct btree_iter *,
@ -43,12 +51,15 @@ int bch2_extent_drop_ptrs(struct btree_trans *,
struct bch_io_opts *,
struct data_update_opts *);
int bch2_data_update_bios_init(struct data_update *, struct bch_fs *,
struct bch_io_opts *);
void bch2_data_update_exit(struct data_update *);
int bch2_data_update_init(struct btree_trans *, struct btree_iter *,
struct moving_context *,
struct data_update *,
struct write_point_specifier,
struct bch_io_opts, struct data_update_opts,
struct bch_io_opts *, struct data_update_opts,
enum btree_id, struct bkey_s_c);
void bch2_data_update_opts_normalize(struct bkey_s_c, struct data_update_opts *);


@ -7,6 +7,7 @@
*/
#include "bcachefs.h"
#include "alloc_foreground.h"
#include "bkey_methods.h"
#include "btree_cache.h"
#include "btree_io.h"
@ -190,7 +191,7 @@ void bch2_btree_node_ondisk_to_text(struct printbuf *out, struct bch_fs *c,
unsigned offset = 0;
int ret;
if (bch2_bkey_pick_read_device(c, bkey_i_to_s_c(&b->key), NULL, &pick) <= 0) {
if (bch2_bkey_pick_read_device(c, bkey_i_to_s_c(&b->key), NULL, &pick, -1) <= 0) {
prt_printf(out, "error getting device to read from: invalid device\n");
return;
}
@ -844,8 +845,11 @@ restart:
seqmutex_unlock(&c->btree_trans_lock);
}
static ssize_t bch2_btree_deadlock_read(struct file *file, char __user *buf,
size_t size, loff_t *ppos)
typedef void (*fs_to_text_fn)(struct printbuf *, struct bch_fs *);
static ssize_t bch2_simple_print(struct file *file, char __user *buf,
size_t size, loff_t *ppos,
fs_to_text_fn fn)
{
struct dump_iter *i = file->private_data;
struct bch_fs *c = i->c;
@ -856,7 +860,7 @@ static ssize_t bch2_btree_deadlock_read(struct file *file, char __user *buf,
i->ret = 0;
if (!i->iter) {
btree_deadlock_to_text(&i->buf, c);
fn(&i->buf, c);
i->iter++;
}
@ -869,6 +873,12 @@ static ssize_t bch2_btree_deadlock_read(struct file *file, char __user *buf,
return ret ?: i->ret;
}
static ssize_t bch2_btree_deadlock_read(struct file *file, char __user *buf,
size_t size, loff_t *ppos)
{
return bch2_simple_print(file, buf, size, ppos, btree_deadlock_to_text);
}
static const struct file_operations btree_deadlock_ops = {
.owner = THIS_MODULE,
.open = bch2_dump_open,
@ -876,6 +886,19 @@ static const struct file_operations btree_deadlock_ops = {
.read = bch2_btree_deadlock_read,
};
static ssize_t bch2_write_points_read(struct file *file, char __user *buf,
size_t size, loff_t *ppos)
{
return bch2_simple_print(file, buf, size, ppos, bch2_write_points_to_text);
}
static const struct file_operations write_points_ops = {
.owner = THIS_MODULE,
.open = bch2_dump_open,
.release = bch2_dump_release,
.read = bch2_write_points_read,
};
void bch2_fs_debug_exit(struct bch_fs *c)
{
if (!IS_ERR_OR_NULL(c->fs_debug_dir))
@ -927,6 +950,9 @@ void bch2_fs_debug_init(struct bch_fs *c)
debugfs_create_file("btree_deadlock", 0400, c->fs_debug_dir,
c->btree_debug, &btree_deadlock_ops);
debugfs_create_file("write_points", 0400, c->fs_debug_dir,
c->btree_debug, &write_points_ops);
c->btree_debug_dir = debugfs_create_dir("btrees", c->fs_debug_dir);
if (IS_ERR_OR_NULL(c->btree_debug_dir))
return;
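The bch2_simple_print() helper above turns a one-shot to_text function into a debugfs file with almost no boilerplate. A hypothetical sketch of wiring up one more dump, following the same pattern (names invented for illustration):

static void example_capacity_to_text(struct printbuf *out, struct bch_fs *c)
{
	/* any single-shot dump works; this just prints one counter */
	prt_printf(out, "capacity\t%llu\n", c->capacity);
}

static ssize_t bch2_example_capacity_read(struct file *file, char __user *buf,
					  size_t size, loff_t *ppos)
{
	return bch2_simple_print(file, buf, size, ppos, example_capacity_to_text);
}

static const struct file_operations example_capacity_ops = {
	.owner		= THIS_MODULE,
	.open		= bch2_dump_open,
	.release	= bch2_dump_release,
	.read		= bch2_example_capacity_read,
};

That plus a debugfs_create_file() call in bch2_fs_debug_init(), exactly as write_points is registered above, is the whole cost of a new debugfs dump.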


@ -13,6 +13,40 @@
#include <linux/dcache.h>
static int bch2_casefold(struct btree_trans *trans, const struct bch_hash_info *info,
const struct qstr *str, struct qstr *out_cf)
{
*out_cf = (struct qstr) QSTR_INIT(NULL, 0);
#ifdef CONFIG_UNICODE
unsigned char *buf = bch2_trans_kmalloc(trans, BCH_NAME_MAX + 1);
int ret = PTR_ERR_OR_ZERO(buf);
if (ret)
return ret;
ret = utf8_casefold(info->cf_encoding, str, buf, BCH_NAME_MAX + 1);
if (ret <= 0)
return ret;
*out_cf = (struct qstr) QSTR_INIT(buf, ret);
return 0;
#else
return -EOPNOTSUPP;
#endif
}
static inline int bch2_maybe_casefold(struct btree_trans *trans,
const struct bch_hash_info *info,
const struct qstr *str, struct qstr *out_cf)
{
if (likely(!info->cf_encoding)) {
*out_cf = *str;
return 0;
} else {
return bch2_casefold(trans, info, str, out_cf);
}
}
static unsigned bch2_dirent_name_bytes(struct bkey_s_c_dirent d)
{
if (bkey_val_bytes(d.k) < offsetof(struct bch_dirent, d_name))
@ -28,13 +62,38 @@ static unsigned bch2_dirent_name_bytes(struct bkey_s_c_dirent d)
#endif
return bkey_bytes -
offsetof(struct bch_dirent, d_name) -
(d.v->d_casefold
? offsetof(struct bch_dirent, d_cf_name_block.d_names)
: offsetof(struct bch_dirent, d_name)) -
trailing_nuls;
}
struct qstr bch2_dirent_get_name(struct bkey_s_c_dirent d)
{
return (struct qstr) QSTR_INIT(d.v->d_name, bch2_dirent_name_bytes(d));
if (d.v->d_casefold) {
unsigned name_len = le16_to_cpu(d.v->d_cf_name_block.d_name_len);
return (struct qstr) QSTR_INIT(&d.v->d_cf_name_block.d_names[0], name_len);
} else {
return (struct qstr) QSTR_INIT(d.v->d_name, bch2_dirent_name_bytes(d));
}
}
static struct qstr bch2_dirent_get_casefold_name(struct bkey_s_c_dirent d)
{
if (d.v->d_casefold) {
unsigned name_len = le16_to_cpu(d.v->d_cf_name_block.d_name_len);
unsigned cf_name_len = le16_to_cpu(d.v->d_cf_name_block.d_cf_name_len);
return (struct qstr) QSTR_INIT(&d.v->d_cf_name_block.d_names[name_len], cf_name_len);
} else {
return (struct qstr) QSTR_INIT(NULL, 0);
}
}
static inline struct qstr bch2_dirent_get_lookup_name(struct bkey_s_c_dirent d)
{
return d.v->d_casefold
? bch2_dirent_get_casefold_name(d)
: bch2_dirent_get_name(d);
}
static u64 bch2_dirent_hash(const struct bch_hash_info *info,
@ -57,7 +116,7 @@ static u64 dirent_hash_key(const struct bch_hash_info *info, const void *key)
static u64 dirent_hash_bkey(const struct bch_hash_info *info, struct bkey_s_c k)
{
struct bkey_s_c_dirent d = bkey_s_c_to_dirent(k);
struct qstr name = bch2_dirent_get_name(d);
struct qstr name = bch2_dirent_get_lookup_name(d);
return bch2_dirent_hash(info, &name);
}
@ -65,7 +124,7 @@ static u64 dirent_hash_bkey(const struct bch_hash_info *info, struct bkey_s_c k)
static bool dirent_cmp_key(struct bkey_s_c _l, const void *_r)
{
struct bkey_s_c_dirent l = bkey_s_c_to_dirent(_l);
const struct qstr l_name = bch2_dirent_get_name(l);
const struct qstr l_name = bch2_dirent_get_lookup_name(l);
const struct qstr *r_name = _r;
return !qstr_eq(l_name, *r_name);
@ -75,8 +134,8 @@ static bool dirent_cmp_bkey(struct bkey_s_c _l, struct bkey_s_c _r)
{
struct bkey_s_c_dirent l = bkey_s_c_to_dirent(_l);
struct bkey_s_c_dirent r = bkey_s_c_to_dirent(_r);
const struct qstr l_name = bch2_dirent_get_name(l);
const struct qstr r_name = bch2_dirent_get_name(r);
const struct qstr l_name = bch2_dirent_get_lookup_name(l);
const struct qstr r_name = bch2_dirent_get_lookup_name(r);
return !qstr_eq(l_name, r_name);
}
@ -104,17 +163,19 @@ int bch2_dirent_validate(struct bch_fs *c, struct bkey_s_c k,
struct bkey_validate_context from)
{
struct bkey_s_c_dirent d = bkey_s_c_to_dirent(k);
unsigned name_block_len = bch2_dirent_name_bytes(d);
struct qstr d_name = bch2_dirent_get_name(d);
struct qstr d_cf_name = bch2_dirent_get_casefold_name(d);
int ret = 0;
bkey_fsck_err_on(!d_name.len,
c, dirent_empty_name,
"empty name");
bkey_fsck_err_on(bkey_val_u64s(k.k) > dirent_val_u64s(d_name.len),
bkey_fsck_err_on(d_name.len + d_cf_name.len > name_block_len,
c, dirent_val_too_big,
"value too big (%zu > %u)",
bkey_val_u64s(k.k), dirent_val_u64s(d_name.len));
"dirent names exceed bkey size (%d + %d > %d)",
d_name.len, d_cf_name.len, name_block_len);
/*
* Check new keys don't exceed the max length
@ -142,6 +203,18 @@ int bch2_dirent_validate(struct bch_fs *c, struct bkey_s_c k,
le64_to_cpu(d.v->d_inum) == d.k->p.inode,
c, dirent_to_itself,
"dirent points to own directory");
if (d.v->d_casefold) {
bkey_fsck_err_on(from.from == BKEY_VALIDATE_commit &&
d_cf_name.len > BCH_NAME_MAX,
c, dirent_cf_name_too_big,
"dirent w/ cf name too big (%u > %u)",
d_cf_name.len, BCH_NAME_MAX);
bkey_fsck_err_on(d_cf_name.len != strnlen(d_cf_name.name, d_cf_name.len),
c, dirent_stray_data_after_cf_name,
"dirent has stray data after cf name's NUL");
}
fsck_err:
return ret;
}
@ -163,15 +236,14 @@ void bch2_dirent_to_text(struct printbuf *out, struct bch_fs *c, struct bkey_s_c
prt_printf(out, " type %s", bch2_d_type_str(d.v->d_type));
}
static struct bkey_i_dirent *dirent_create_key(struct btree_trans *trans,
subvol_inum dir, u8 type,
const struct qstr *name, u64 dst)
static struct bkey_i_dirent *dirent_alloc_key(struct btree_trans *trans,
subvol_inum dir,
u8 type,
int name_len, int cf_name_len,
u64 dst)
{
struct bkey_i_dirent *dirent;
unsigned u64s = BKEY_U64s + dirent_val_u64s(name->len);
if (name->len > BCH_NAME_MAX)
return ERR_PTR(-ENAMETOOLONG);
unsigned u64s = BKEY_U64s + dirent_val_u64s(name_len, cf_name_len);
BUG_ON(u64s > U8_MAX);
@ -190,14 +262,65 @@ static struct bkey_i_dirent *dirent_create_key(struct btree_trans *trans,
}
dirent->v.d_type = type;
dirent->v.d_unused = 0;
dirent->v.d_casefold = cf_name_len ? 1 : 0;
memcpy(dirent->v.d_name, name->name, name->len);
memset(dirent->v.d_name + name->len, 0,
bkey_val_bytes(&dirent->k) -
offsetof(struct bch_dirent, d_name) -
name->len);
return dirent;
}
EBUG_ON(bch2_dirent_name_bytes(dirent_i_to_s_c(dirent)) != name->len);
static void dirent_init_regular_name(struct bkey_i_dirent *dirent,
const struct qstr *name)
{
EBUG_ON(dirent->v.d_casefold);
memcpy(&dirent->v.d_name[0], name->name, name->len);
memset(&dirent->v.d_name[name->len], 0,
bkey_val_bytes(&dirent->k) -
offsetof(struct bch_dirent, d_name) -
name->len);
}
static void dirent_init_casefolded_name(struct bkey_i_dirent *dirent,
const struct qstr *name,
const struct qstr *cf_name)
{
EBUG_ON(!dirent->v.d_casefold);
EBUG_ON(!cf_name->len);
dirent->v.d_cf_name_block.d_name_len = name->len;
dirent->v.d_cf_name_block.d_cf_name_len = cf_name->len;
memcpy(&dirent->v.d_cf_name_block.d_names[0], name->name, name->len);
memcpy(&dirent->v.d_cf_name_block.d_names[name->len], cf_name->name, cf_name->len);
memset(&dirent->v.d_cf_name_block.d_names[name->len + cf_name->len], 0,
bkey_val_bytes(&dirent->k) -
offsetof(struct bch_dirent, d_cf_name_block.d_names) -
name->len + cf_name->len);
EBUG_ON(bch2_dirent_get_casefold_name(dirent_i_to_s_c(dirent)).len != cf_name->len);
}
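dirent_init_casefolded_name() above packs the user-visible name and its casefolded form back to back in d_names[], with the two lengths stored in the header: readdir returns the first half, while hashing and lookups use the second. A minimal userspace sketch of that packing (illustrative strings, not taken from the patch):

#include <stdio.h>
#include <string.h>

int main(void)
{
	const char *name = "README", *cf_name = "readme";
	int name_len = strlen(name), cf_name_len = strlen(cf_name);
	unsigned char d_names[32] = { 0 };	/* stands in for d_cf_name_block.d_names[] */

	memcpy(&d_names[0], name, name_len);			/* original name first */
	memcpy(&d_names[name_len], cf_name, cf_name_len);	/* casefolded name after it */

	printf("readdir sees: %.*s\n", name_len, &d_names[0]);
	printf("lookup uses:  %.*s\n", cf_name_len, &d_names[name_len]);
	return 0;
}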
static struct bkey_i_dirent *dirent_create_key(struct btree_trans *trans,
subvol_inum dir,
u8 type,
const struct qstr *name,
const struct qstr *cf_name,
u64 dst)
{
struct bkey_i_dirent *dirent;
if (name->len > BCH_NAME_MAX)
return ERR_PTR(-ENAMETOOLONG);
dirent = dirent_alloc_key(trans, dir, type, name->len, cf_name ? cf_name->len : 0, dst);
if (IS_ERR(dirent))
return dirent;
if (cf_name)
dirent_init_casefolded_name(dirent, name, cf_name);
else
dirent_init_regular_name(dirent, name);
EBUG_ON(bch2_dirent_get_name(dirent_i_to_s_c(dirent)).len != name->len);
return dirent;
}
@ -213,7 +336,7 @@ int bch2_dirent_create_snapshot(struct btree_trans *trans,
struct bkey_i_dirent *dirent;
int ret;
dirent = dirent_create_key(trans, dir_inum, type, name, dst_inum);
dirent = dirent_create_key(trans, dir_inum, type, name, NULL, dst_inum);
ret = PTR_ERR_OR_ZERO(dirent);
if (ret)
return ret;
@ -233,16 +356,28 @@ int bch2_dirent_create(struct btree_trans *trans, subvol_inum dir,
const struct bch_hash_info *hash_info,
u8 type, const struct qstr *name, u64 dst_inum,
u64 *dir_offset,
u64 *i_size,
enum btree_iter_update_trigger_flags flags)
{
struct bkey_i_dirent *dirent;
int ret;
dirent = dirent_create_key(trans, dir, type, name, dst_inum);
if (hash_info->cf_encoding) {
struct qstr cf_name;
ret = bch2_casefold(trans, hash_info, name, &cf_name);
if (ret)
return ret;
dirent = dirent_create_key(trans, dir, type, name, &cf_name, dst_inum);
} else {
dirent = dirent_create_key(trans, dir, type, name, NULL, dst_inum);
}
ret = PTR_ERR_OR_ZERO(dirent);
if (ret)
return ret;
*i_size += bkey_bytes(&dirent->k);
ret = bch2_hash_set(trans, bch2_dirent_hash_desc, hash_info,
dir, &dirent->k_i, flags);
*dir_offset = dirent->k.p.offset;
@ -275,12 +410,13 @@ int bch2_dirent_read_target(struct btree_trans *trans, subvol_inum dir,
}
int bch2_dirent_rename(struct btree_trans *trans,
subvol_inum src_dir, struct bch_hash_info *src_hash,
subvol_inum dst_dir, struct bch_hash_info *dst_hash,
subvol_inum src_dir, struct bch_hash_info *src_hash, u64 *src_dir_i_size,
subvol_inum dst_dir, struct bch_hash_info *dst_hash, u64 *dst_dir_i_size,
const struct qstr *src_name, subvol_inum *src_inum, u64 *src_offset,
const struct qstr *dst_name, subvol_inum *dst_inum, u64 *dst_offset,
enum bch_rename_mode mode)
{
struct qstr src_name_lookup, dst_name_lookup;
struct btree_iter src_iter = { NULL };
struct btree_iter dst_iter = { NULL };
struct bkey_s_c old_src, old_dst = bkey_s_c_null;
@ -295,8 +431,11 @@ int bch2_dirent_rename(struct btree_trans *trans,
memset(dst_inum, 0, sizeof(*dst_inum));
/* Lookup src: */
ret = bch2_maybe_casefold(trans, src_hash, src_name, &src_name_lookup);
if (ret)
goto out;
old_src = bch2_hash_lookup(trans, &src_iter, bch2_dirent_hash_desc,
src_hash, src_dir, src_name,
src_hash, src_dir, &src_name_lookup,
BTREE_ITER_intent);
ret = bkey_err(old_src);
if (ret)
@ -308,6 +447,9 @@ int bch2_dirent_rename(struct btree_trans *trans,
goto out;
/* Lookup dst: */
ret = bch2_maybe_casefold(trans, dst_hash, dst_name, &dst_name_lookup);
if (ret)
goto out;
if (mode == BCH_RENAME) {
/*
* Note that we're _not_ checking if the target already exists -
@ -315,12 +457,12 @@ int bch2_dirent_rename(struct btree_trans *trans,
* correctness:
*/
ret = bch2_hash_hole(trans, &dst_iter, bch2_dirent_hash_desc,
dst_hash, dst_dir, dst_name);
dst_hash, dst_dir, &dst_name_lookup);
if (ret)
goto out;
} else {
old_dst = bch2_hash_lookup(trans, &dst_iter, bch2_dirent_hash_desc,
dst_hash, dst_dir, dst_name,
dst_hash, dst_dir, &dst_name_lookup,
BTREE_ITER_intent);
ret = bkey_err(old_dst);
if (ret)
@ -336,7 +478,8 @@ int bch2_dirent_rename(struct btree_trans *trans,
*src_offset = dst_iter.pos.offset;
/* Create new dst key: */
new_dst = dirent_create_key(trans, dst_dir, 0, dst_name, 0);
new_dst = dirent_create_key(trans, dst_dir, 0, dst_name,
dst_hash->cf_encoding ? &dst_name_lookup : NULL, 0);
ret = PTR_ERR_OR_ZERO(new_dst);
if (ret)
goto out;
@ -346,7 +489,8 @@ int bch2_dirent_rename(struct btree_trans *trans,
/* Create new src key: */
if (mode == BCH_RENAME_EXCHANGE) {
new_src = dirent_create_key(trans, src_dir, 0, src_name, 0);
new_src = dirent_create_key(trans, src_dir, 0, src_name,
src_hash->cf_encoding ? &src_name_lookup : NULL, 0);
ret = PTR_ERR_OR_ZERO(new_src);
if (ret)
goto out;
@ -406,6 +550,14 @@ int bch2_dirent_rename(struct btree_trans *trans,
new_src->v.d_type == DT_SUBVOL)
new_src->v.d_parent_subvol = cpu_to_le32(src_dir.subvol);
if (old_dst.k)
*dst_dir_i_size -= bkey_bytes(old_dst.k);
*src_dir_i_size -= bkey_bytes(old_src.k);
if (mode == BCH_RENAME_EXCHANGE)
*src_dir_i_size += bkey_bytes(&new_src->k);
*dst_dir_i_size += bkey_bytes(&new_dst->k);
ret = bch2_trans_update(trans, &dst_iter, &new_dst->k_i, 0);
if (ret)
goto out;
@ -465,9 +617,14 @@ int bch2_dirent_lookup_trans(struct btree_trans *trans,
const struct qstr *name, subvol_inum *inum,
unsigned flags)
{
struct qstr lookup_name;
int ret = bch2_maybe_casefold(trans, hash_info, name, &lookup_name);
if (ret)
return ret;
struct bkey_s_c k = bch2_hash_lookup(trans, iter, bch2_dirent_hash_desc,
hash_info, dir, name, flags);
int ret = bkey_err(k);
hash_info, dir, &lookup_name, flags);
ret = bkey_err(k);
if (ret)
goto err;
@ -572,3 +729,54 @@ int bch2_readdir(struct bch_fs *c, subvol_inum inum, struct dir_context *ctx)
return ret < 0 ? ret : 0;
}
/* fsck */
static int lookup_first_inode(struct btree_trans *trans, u64 inode_nr,
struct bch_inode_unpacked *inode)
{
struct btree_iter iter;
struct bkey_s_c k;
int ret;
for_each_btree_key_norestart(trans, iter, BTREE_ID_inodes, POS(0, inode_nr),
BTREE_ITER_all_snapshots, k, ret) {
if (k.k->p.offset != inode_nr)
break;
if (!bkey_is_inode(k.k))
continue;
ret = bch2_inode_unpack(k, inode);
goto found;
}
ret = -BCH_ERR_ENOENT_inode;
found:
bch_err_msg(trans->c, ret, "fetching inode %llu", inode_nr);
bch2_trans_iter_exit(trans, &iter);
return ret;
}
int bch2_fsck_remove_dirent(struct btree_trans *trans, struct bpos pos)
{
struct bch_fs *c = trans->c;
struct btree_iter iter;
struct bch_inode_unpacked dir_inode;
struct bch_hash_info dir_hash_info;
int ret;
ret = lookup_first_inode(trans, pos.inode, &dir_inode);
if (ret)
goto err;
dir_hash_info = bch2_hash_info_init(c, &dir_inode);
bch2_trans_iter_init(trans, &iter, BTREE_ID_dirents, pos, BTREE_ITER_intent);
ret = bch2_btree_iter_traverse(&iter) ?:
bch2_hash_delete_at(trans, bch2_dirent_hash_desc,
&dir_hash_info, &iter,
BTREE_UPDATE_internal_snapshot_node);
bch2_trans_iter_exit(trans, &iter);
err:
bch_err_fn(c, ret);
return ret;
}


@ -25,10 +25,13 @@ struct bch_inode_info;
struct qstr bch2_dirent_get_name(struct bkey_s_c_dirent d);
static inline unsigned dirent_val_u64s(unsigned len)
static inline unsigned dirent_val_u64s(unsigned len, unsigned cf_len)
{
return DIV_ROUND_UP(offsetof(struct bch_dirent, d_name) + len,
sizeof(u64));
unsigned bytes = cf_len
? offsetof(struct bch_dirent, d_cf_name_block.d_names) + len + cf_len
: offsetof(struct bch_dirent, d_name) + len;
return DIV_ROUND_UP(bytes, sizeof(u64));
}
int bch2_dirent_read_target(struct btree_trans *, subvol_inum,
@ -47,7 +50,7 @@ int bch2_dirent_create_snapshot(struct btree_trans *, u32, u64, u32,
enum btree_iter_update_trigger_flags);
int bch2_dirent_create(struct btree_trans *, subvol_inum,
const struct bch_hash_info *, u8,
const struct qstr *, u64, u64 *,
const struct qstr *, u64, u64 *, u64 *,
enum btree_iter_update_trigger_flags);
static inline unsigned vfs_d_type(unsigned type)
@ -62,8 +65,8 @@ enum bch_rename_mode {
};
int bch2_dirent_rename(struct btree_trans *,
subvol_inum, struct bch_hash_info *,
subvol_inum, struct bch_hash_info *,
subvol_inum, struct bch_hash_info *, u64 *,
subvol_inum, struct bch_hash_info *, u64 *,
const struct qstr *, subvol_inum *, u64 *,
const struct qstr *, subvol_inum *, u64 *,
enum bch_rename_mode);
@ -79,4 +82,6 @@ int bch2_empty_dir_snapshot(struct btree_trans *, u64, u32, u32);
int bch2_empty_dir_trans(struct btree_trans *, subvol_inum);
int bch2_readdir(struct bch_fs *, subvol_inum, struct dir_context *);
int bch2_fsck_remove_dirent(struct btree_trans *, struct bpos);
#endif /* _BCACHEFS_DIRENT_H */


@ -29,9 +29,25 @@ struct bch_dirent {
* Copy of mode bits 12-15 from the target inode - so userspace can get
* the filetype without having to do a stat()
*/
__u8 d_type;
#if defined(__LITTLE_ENDIAN_BITFIELD)
__u8 d_type:5,
d_unused:2,
d_casefold:1;
#elif defined(__BIG_ENDIAN_BITFIELD)
__u8 d_casefold:1,
d_unused:2,
d_type:5;
#endif
__u8 d_name[];
union {
struct {
__u8 d_pad;
__le16 d_name_len;
__le16 d_cf_name_len;
__u8 d_names[];
} d_cf_name_block __packed;
__DECLARE_FLEX_ARRAY(__u8, d_name);
} __packed;
} __packed __aligned(8);
#define DT_SUBVOL 16


@ -85,6 +85,24 @@ static inline struct bpos disk_accounting_pos_to_bpos(struct disk_accounting_pos
int bch2_disk_accounting_mod(struct btree_trans *, struct disk_accounting_pos *,
s64 *, unsigned, bool);
#define disk_accounting_key_init(_k, _type, ...) \
do { \
memset(&(_k), 0, sizeof(_k)); \
(_k).type = BCH_DISK_ACCOUNTING_##_type; \
(_k)._type = (struct bch_acct_##_type) { __VA_ARGS__ }; \
} while (0)
#define bch2_disk_accounting_mod2_nr(_trans, _gc, _v, _nr, ...) \
({ \
struct disk_accounting_pos pos; \
disk_accounting_key_init(pos, __VA_ARGS__); \
bch2_disk_accounting_mod(trans, &pos, _v, _nr, _gc); \
})
#define bch2_disk_accounting_mod2(_trans, _gc, _v, ...) \
bch2_disk_accounting_mod2_nr(_trans, _gc, _v, ARRAY_SIZE(_v), __VA_ARGS__)
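A hypothetical use of the helpers above, crediting cached sectors to a device inside a transaction (the field names come from struct bch_acct_dev_data_type; the local variable names are assumptions, not from the patch):

	s64 v[1] = { sectors };

	int ret = bch2_disk_accounting_mod2(trans, gc, v, dev_data_type,
					    .dev	= ca->dev_idx,
					    .data_type	= BCH_DATA_cached);

The macro fills in a disk_accounting_pos of type BCH_DISK_ACCOUNTING_dev_data_type and passes ARRAY_SIZE(v) as the counter count, so multi-counter accounting types use the same call with a bigger array.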
int bch2_mod_dev_cached_sectors(struct btree_trans *, unsigned, s64, bool);
int bch2_accounting_validate(struct bch_fs *, struct bkey_s_c,


@ -113,14 +113,14 @@ enum disk_accounting_type {
BCH_DISK_ACCOUNTING_TYPE_NR,
};
struct bch_nr_inodes {
struct bch_acct_nr_inodes {
};
struct bch_persistent_reserved {
struct bch_acct_persistent_reserved {
__u8 nr_replicas;
};
struct bch_dev_data_type {
struct bch_acct_dev_data_type {
__u8 dev;
__u8 data_type;
};
@ -149,10 +149,10 @@ struct disk_accounting_pos {
struct {
__u8 type;
union {
struct bch_nr_inodes nr_inodes;
struct bch_persistent_reserved persistent_reserved;
struct bch_acct_nr_inodes nr_inodes;
struct bch_acct_persistent_reserved persistent_reserved;
struct bch_replicas_entry_v1 replicas;
struct bch_dev_data_type dev_data_type;
struct bch_acct_dev_data_type dev_data_type;
struct bch_acct_compression compression;
struct bch_acct_snapshot snapshot;
struct bch_acct_btree btree;


@ -20,6 +20,7 @@
#include "io_read.h"
#include "io_write.h"
#include "keylist.h"
#include "lru.h"
#include "recovery.h"
#include "replicas.h"
#include "super-io.h"
@ -104,6 +105,7 @@ struct ec_bio {
struct bch_dev *ca;
struct ec_stripe_buf *buf;
size_t idx;
u64 submit_time;
struct bio bio;
};
@ -298,10 +300,22 @@ static int mark_stripe_bucket(struct btree_trans *trans,
struct bpos bucket = PTR_BUCKET_POS(ca, ptr);
if (flags & BTREE_TRIGGER_transactional) {
struct extent_ptr_decoded p = {
.ptr = *ptr,
.crc = bch2_extent_crc_unpack(s.k, NULL),
};
struct bkey_i_backpointer bp;
bch2_extent_ptr_to_bp(c, BTREE_ID_stripes, 0, s.s_c, p,
(const union bch_extent_entry *) ptr, &bp);
struct bkey_i_alloc_v4 *a =
bch2_trans_start_alloc_update(trans, bucket, 0);
ret = PTR_ERR_OR_ZERO(a) ?:
__mark_stripe_bucket(trans, ca, s, ptr_idx, deleting, bucket, &a->v, flags);
ret = PTR_ERR_OR_ZERO(a) ?:
__mark_stripe_bucket(trans, ca, s, ptr_idx, deleting, bucket, &a->v, flags) ?:
bch2_bucket_backpointer_mod(trans, s.s_c, &bp,
!(flags & BTREE_TRIGGER_overwrite));
if (ret)
goto err;
}
if (flags & BTREE_TRIGGER_gc) {
@ -366,19 +380,6 @@ static int mark_stripe_buckets(struct btree_trans *trans,
return 0;
}
static inline void stripe_to_mem(struct stripe *m, const struct bch_stripe *s)
{
m->sectors = le16_to_cpu(s->sectors);
m->algorithm = s->algorithm;
m->nr_blocks = s->nr_blocks;
m->nr_redundant = s->nr_redundant;
m->disk_label = s->disk_label;
m->blocks_nonempty = 0;
for (unsigned i = 0; i < s->nr_blocks; i++)
m->blocks_nonempty += !!stripe_blockcount_get(s, i);
}
int bch2_trigger_stripe(struct btree_trans *trans,
enum btree_id btree, unsigned level,
struct bkey_s_c old, struct bkey_s _new,
@ -399,6 +400,15 @@ int bch2_trigger_stripe(struct btree_trans *trans,
(new_s->nr_blocks != old_s->nr_blocks ||
new_s->nr_redundant != old_s->nr_redundant));
if (flags & BTREE_TRIGGER_transactional) {
int ret = bch2_lru_change(trans,
BCH_LRU_STRIPE_FRAGMENTATION,
idx,
stripe_lru_pos(old_s),
stripe_lru_pos(new_s));
if (ret)
return ret;
}
if (flags & (BTREE_TRIGGER_transactional|BTREE_TRIGGER_gc)) {
/*
@ -472,38 +482,6 @@ int bch2_trigger_stripe(struct btree_trans *trans,
return ret;
}
if (flags & BTREE_TRIGGER_atomic) {
struct stripe *m = genradix_ptr(&c->stripes, idx);
if (!m) {
struct printbuf buf1 = PRINTBUF;
struct printbuf buf2 = PRINTBUF;
bch2_bkey_val_to_text(&buf1, c, old);
bch2_bkey_val_to_text(&buf2, c, new);
bch_err_ratelimited(c, "error marking nonexistent stripe %llu while marking\n"
"old %s\n"
"new %s", idx, buf1.buf, buf2.buf);
printbuf_exit(&buf2);
printbuf_exit(&buf1);
bch2_inconsistent_error(c);
return -1;
}
if (!new_s) {
bch2_stripes_heap_del(c, m, idx);
memset(m, 0, sizeof(*m));
} else {
stripe_to_mem(m, new_s);
if (!old_s)
bch2_stripes_heap_insert(c, m, idx);
else
bch2_stripes_heap_update(c, m, idx);
}
}
return 0;
}
@ -726,14 +704,15 @@ static void ec_block_endio(struct bio *bio)
struct bch_dev *ca = ec_bio->ca;
struct closure *cl = bio->bi_private;
if (bch2_dev_io_err_on(bio->bi_status, ca,
bio_data_dir(bio)
? BCH_MEMBER_ERROR_write
: BCH_MEMBER_ERROR_read,
"erasure coding %s error: %s",
bch2_account_io_completion(ca, bio_data_dir(bio),
ec_bio->submit_time, !bio->bi_status);
if (bio->bi_status) {
bch_err_dev_ratelimited(ca, "erasure coding %s error: %s",
str_write_read(bio_data_dir(bio)),
bch2_blk_status_to_str(bio->bi_status)))
bch2_blk_status_to_str(bio->bi_status));
clear_bit(ec_bio->idx, ec_bio->buf->valid);
}
int stale = dev_ptr_stale(ca, ptr);
if (stale) {
@ -796,6 +775,7 @@ static void ec_block_io(struct bch_fs *c, struct ec_stripe_buf *buf,
ec_bio->ca = ca;
ec_bio->buf = buf;
ec_bio->idx = idx;
ec_bio->submit_time = local_clock();
ec_bio->bio.bi_iter.bi_sector = ptr->offset + buf->offset + (offset >> 9);
ec_bio->bio.bi_end_io = ec_block_endio;
@ -917,26 +897,6 @@ err:
static int __ec_stripe_mem_alloc(struct bch_fs *c, size_t idx, gfp_t gfp)
{
ec_stripes_heap n, *h = &c->ec_stripes_heap;
if (idx >= h->size) {
if (!init_heap(&n, max(1024UL, roundup_pow_of_two(idx + 1)), gfp))
return -BCH_ERR_ENOMEM_ec_stripe_mem_alloc;
mutex_lock(&c->ec_stripes_heap_lock);
if (n.size > h->size) {
memcpy(n.data, h->data, h->nr * sizeof(h->data[0]));
n.nr = h->nr;
swap(*h, n);
}
mutex_unlock(&c->ec_stripes_heap_lock);
free_heap(&n);
}
if (!genradix_ptr_alloc(&c->stripes, idx, gfp))
return -BCH_ERR_ENOMEM_ec_stripe_mem_alloc;
if (c->gc_pos.phase != GC_PHASE_not_running &&
!genradix_ptr_alloc(&c->gc_stripes, idx, gfp))
return -BCH_ERR_ENOMEM_ec_stripe_mem_alloc;
@ -1009,180 +969,50 @@ static void bch2_stripe_close(struct bch_fs *c, struct ec_stripe_new *s)
s->idx = 0;
}
/* Heap of all existing stripes, ordered by blocks_nonempty */
static u64 stripe_idx_to_delete(struct bch_fs *c)
{
ec_stripes_heap *h = &c->ec_stripes_heap;
lockdep_assert_held(&c->ec_stripes_heap_lock);
if (h->nr &&
h->data[0].blocks_nonempty == 0 &&
!bch2_stripe_is_open(c, h->data[0].idx))
return h->data[0].idx;
return 0;
}
static inline void ec_stripes_heap_set_backpointer(ec_stripes_heap *h,
size_t i)
{
struct bch_fs *c = container_of(h, struct bch_fs, ec_stripes_heap);
genradix_ptr(&c->stripes, h->data[i].idx)->heap_idx = i;
}
static inline bool ec_stripes_heap_cmp(const void *l, const void *r, void __always_unused *args)
{
struct ec_stripe_heap_entry *_l = (struct ec_stripe_heap_entry *)l;
struct ec_stripe_heap_entry *_r = (struct ec_stripe_heap_entry *)r;
return ((_l->blocks_nonempty > _r->blocks_nonempty) <
(_l->blocks_nonempty < _r->blocks_nonempty));
}
static inline void ec_stripes_heap_swap(void *l, void *r, void *h)
{
struct ec_stripe_heap_entry *_l = (struct ec_stripe_heap_entry *)l;
struct ec_stripe_heap_entry *_r = (struct ec_stripe_heap_entry *)r;
ec_stripes_heap *_h = (ec_stripes_heap *)h;
size_t i = _l - _h->data;
size_t j = _r - _h->data;
swap(*_l, *_r);
ec_stripes_heap_set_backpointer(_h, i);
ec_stripes_heap_set_backpointer(_h, j);
}
static const struct min_heap_callbacks callbacks = {
.less = ec_stripes_heap_cmp,
.swp = ec_stripes_heap_swap,
};
static void heap_verify_backpointer(struct bch_fs *c, size_t idx)
{
ec_stripes_heap *h = &c->ec_stripes_heap;
struct stripe *m = genradix_ptr(&c->stripes, idx);
BUG_ON(m->heap_idx >= h->nr);
BUG_ON(h->data[m->heap_idx].idx != idx);
}
void bch2_stripes_heap_del(struct bch_fs *c,
struct stripe *m, size_t idx)
{
mutex_lock(&c->ec_stripes_heap_lock);
heap_verify_backpointer(c, idx);
min_heap_del(&c->ec_stripes_heap, m->heap_idx, &callbacks, &c->ec_stripes_heap);
mutex_unlock(&c->ec_stripes_heap_lock);
}
void bch2_stripes_heap_insert(struct bch_fs *c,
struct stripe *m, size_t idx)
{
mutex_lock(&c->ec_stripes_heap_lock);
BUG_ON(min_heap_full(&c->ec_stripes_heap));
genradix_ptr(&c->stripes, idx)->heap_idx = c->ec_stripes_heap.nr;
min_heap_push(&c->ec_stripes_heap, &((struct ec_stripe_heap_entry) {
.idx = idx,
.blocks_nonempty = m->blocks_nonempty,
}),
&callbacks,
&c->ec_stripes_heap);
heap_verify_backpointer(c, idx);
mutex_unlock(&c->ec_stripes_heap_lock);
}
void bch2_stripes_heap_update(struct bch_fs *c,
struct stripe *m, size_t idx)
{
ec_stripes_heap *h = &c->ec_stripes_heap;
bool do_deletes;
size_t i;
mutex_lock(&c->ec_stripes_heap_lock);
heap_verify_backpointer(c, idx);
h->data[m->heap_idx].blocks_nonempty = m->blocks_nonempty;
i = m->heap_idx;
min_heap_sift_up(h, i, &callbacks, &c->ec_stripes_heap);
min_heap_sift_down(h, i, &callbacks, &c->ec_stripes_heap);
heap_verify_backpointer(c, idx);
do_deletes = stripe_idx_to_delete(c) != 0;
mutex_unlock(&c->ec_stripes_heap_lock);
if (do_deletes)
bch2_do_stripe_deletes(c);
}
/* stripe deletion */
static int ec_stripe_delete(struct btree_trans *trans, u64 idx)
{
struct bch_fs *c = trans->c;
struct btree_iter iter;
struct bkey_s_c k;
struct bkey_s_c_stripe s;
int ret;
k = bch2_bkey_get_iter(trans, &iter, BTREE_ID_stripes, POS(0, idx),
BTREE_ITER_intent);
ret = bkey_err(k);
struct bkey_s_c k = bch2_bkey_get_iter(trans, &iter,
BTREE_ID_stripes, POS(0, idx),
BTREE_ITER_intent);
int ret = bkey_err(k);
if (ret)
goto err;
if (k.k->type != KEY_TYPE_stripe) {
bch2_fs_inconsistent(c, "attempting to delete nonexistent stripe %llu", idx);
ret = -EINVAL;
goto err;
}
s = bkey_s_c_to_stripe(k);
for (unsigned i = 0; i < s.v->nr_blocks; i++)
if (stripe_blockcount_get(s.v, i)) {
struct printbuf buf = PRINTBUF;
bch2_bkey_val_to_text(&buf, c, k);
bch2_fs_inconsistent(c, "attempting to delete nonempty stripe %s", buf.buf);
printbuf_exit(&buf);
ret = -EINVAL;
goto err;
}
ret = bch2_btree_delete_at(trans, &iter, 0);
/*
* We expect write buffer races here
* Important: check stripe_is_open with stripe key locked:
*/
if (k.k->type == KEY_TYPE_stripe &&
!bch2_stripe_is_open(trans->c, idx) &&
stripe_lru_pos(bkey_s_c_to_stripe(k).v) == 1)
ret = bch2_btree_delete_at(trans, &iter, 0);
err:
bch2_trans_iter_exit(trans, &iter);
return ret;
}
/*
* XXX
* can we kill this and delete stripes from the trigger?
*/
static void ec_stripe_delete_work(struct work_struct *work)
{
struct bch_fs *c =
container_of(work, struct bch_fs, ec_stripe_delete_work);
while (1) {
mutex_lock(&c->ec_stripes_heap_lock);
u64 idx = stripe_idx_to_delete(c);
mutex_unlock(&c->ec_stripes_heap_lock);
if (!idx)
break;
int ret = bch2_trans_commit_do(c, NULL, NULL, BCH_TRANS_COMMIT_no_enospc,
ec_stripe_delete(trans, idx));
bch_err_fn(c, ret);
if (ret)
break;
}
bch2_trans_run(c,
bch2_btree_write_buffer_tryflush(trans) ?:
for_each_btree_key_max_commit(trans, lru_iter, BTREE_ID_lru,
lru_pos(BCH_LRU_STRIPE_FRAGMENTATION, 1, 0),
lru_pos(BCH_LRU_STRIPE_FRAGMENTATION, 1, LRU_TIME_MAX),
0, lru_k,
NULL, NULL,
BCH_TRANS_COMMIT_no_enospc, ({
ec_stripe_delete(trans, lru_k.k->p.offset);
})));
bch2_write_ref_put(c, BCH_WRITE_REF_stripe_delete);
}
@ -1294,7 +1124,7 @@ static int ec_stripe_update_extent(struct btree_trans *trans,
bch2_fs_inconsistent(c, "%s", buf.buf);
printbuf_exit(&buf);
return -EIO;
return -BCH_ERR_erasure_coding_found_btree_node;
}
k = bch2_backpointer_get_key(trans, bp, &iter, BTREE_ITER_intent, last_flushed);
@ -1360,7 +1190,7 @@ static int ec_stripe_update_bucket(struct btree_trans *trans, struct ec_stripe_b
struct bch_dev *ca = bch2_dev_tryget(c, ptr.dev);
if (!ca)
return -EIO;
return -BCH_ERR_ENOENT_dev_not_found;
struct bpos bucket_pos = PTR_BUCKET_POS(ca, &ptr);
@ -1380,8 +1210,12 @@ static int ec_stripe_update_bucket(struct btree_trans *trans, struct ec_stripe_b
if (bp_k.k->type != KEY_TYPE_backpointer)
continue;
struct bkey_s_c_backpointer bp = bkey_s_c_to_backpointer(bp_k);
if (bp.v->btree_id == BTREE_ID_stripes)
continue;
ec_stripe_update_extent(trans, ca, bucket_pos, ptr.gen, s,
bkey_s_c_to_backpointer(bp_k), &last_flushed);
bp, &last_flushed);
}));
bch2_bkey_buf_exit(&last_flushed, c);
@ -1393,21 +1227,19 @@ static int ec_stripe_update_extents(struct bch_fs *c, struct ec_stripe_buf *s)
{
struct btree_trans *trans = bch2_trans_get(c);
struct bch_stripe *v = &bkey_i_to_stripe(&s->key)->v;
unsigned i, nr_data = v->nr_blocks - v->nr_redundant;
int ret = 0;
unsigned nr_data = v->nr_blocks - v->nr_redundant;
ret = bch2_btree_write_buffer_flush_sync(trans);
int ret = bch2_btree_write_buffer_flush_sync(trans);
if (ret)
goto err;
for (i = 0; i < nr_data; i++) {
for (unsigned i = 0; i < nr_data; i++) {
ret = ec_stripe_update_bucket(trans, s, i);
if (ret)
break;
}
err:
bch2_trans_put(trans);
return ret;
}
@ -1473,6 +1305,7 @@ static void ec_stripe_create(struct ec_stripe_new *s)
if (s->err) {
if (!bch2_err_matches(s->err, EROFS))
bch_err(c, "error creating stripe: error writing data buckets");
ret = s->err;
goto err;
}
@ -1481,6 +1314,7 @@ static void ec_stripe_create(struct ec_stripe_new *s)
if (ec_do_recov(c, &s->existing_stripe)) {
bch_err(c, "error creating stripe: error reading existing stripe");
ret = -BCH_ERR_ec_block_read;
goto err;
}
@ -1506,6 +1340,7 @@ static void ec_stripe_create(struct ec_stripe_new *s)
if (ec_nr_failed(&s->new_stripe)) {
bch_err(c, "error creating stripe: error writing redundancy buckets");
ret = -BCH_ERR_ec_block_write;
goto err;
}
@ -1527,6 +1362,8 @@ static void ec_stripe_create(struct ec_stripe_new *s)
if (ret)
goto err;
err:
trace_stripe_create(c, s->idx, ret);
bch2_disk_reservation_put(c, &s->res);
for (i = 0; i < v->nr_blocks; i++)
@ -1612,11 +1449,11 @@ static void ec_stripe_new_cancel(struct bch_fs *c, struct ec_stripe_head *h, int
ec_stripe_new_set_pending(c, h);
}
void bch2_ec_bucket_cancel(struct bch_fs *c, struct open_bucket *ob)
void bch2_ec_bucket_cancel(struct bch_fs *c, struct open_bucket *ob, int err)
{
struct ec_stripe_new *s = ob->ec;
s->err = -EIO;
s->err = err;
}
void *bch2_writepoint_ec_buf(struct bch_fs *c, struct write_point *wp)
@ -1968,39 +1805,40 @@ static int new_stripe_alloc_buckets(struct btree_trans *trans,
return 0;
}
static s64 get_existing_stripe(struct bch_fs *c,
struct ec_stripe_head *head)
static int __get_existing_stripe(struct btree_trans *trans,
struct ec_stripe_head *head,
struct ec_stripe_buf *stripe,
u64 idx)
{
ec_stripes_heap *h = &c->ec_stripes_heap;
struct stripe *m;
size_t heap_idx;
u64 stripe_idx;
s64 ret = -1;
struct bch_fs *c = trans->c;
if (may_create_new_stripe(c))
return -1;
struct btree_iter iter;
struct bkey_s_c k = bch2_bkey_get_iter(trans, &iter,
BTREE_ID_stripes, POS(0, idx), 0);
int ret = bkey_err(k);
if (ret)
goto err;
mutex_lock(&c->ec_stripes_heap_lock);
for (heap_idx = 0; heap_idx < h->nr; heap_idx++) {
/* No blocks worth reusing, stripe will just be deleted: */
if (!h->data[heap_idx].blocks_nonempty)
continue;
/* We expect write buffer races here */
if (k.k->type != KEY_TYPE_stripe)
goto out;
stripe_idx = h->data[heap_idx].idx;
struct bkey_s_c_stripe s = bkey_s_c_to_stripe(k);
if (stripe_lru_pos(s.v) <= 1)
goto out;
m = genradix_ptr(&c->stripes, stripe_idx);
if (m->disk_label == head->disk_label &&
m->algorithm == head->algo &&
m->nr_redundant == head->redundancy &&
m->sectors == head->blocksize &&
m->blocks_nonempty < m->nr_blocks - m->nr_redundant &&
bch2_try_open_stripe(c, head->s, stripe_idx)) {
ret = stripe_idx;
break;
}
if (s.v->disk_label == head->disk_label &&
s.v->algorithm == head->algo &&
s.v->nr_redundant == head->redundancy &&
le16_to_cpu(s.v->sectors) == head->blocksize &&
bch2_try_open_stripe(c, head->s, idx)) {
bkey_reassemble(&stripe->key, k);
ret = 1;
}
mutex_unlock(&c->ec_stripes_heap_lock);
out:
bch2_set_btree_iter_dontneed(&iter);
err:
bch2_trans_iter_exit(trans, &iter);
return ret;
}
@ -2052,24 +1890,33 @@ static int __bch2_ec_stripe_head_reuse(struct btree_trans *trans, struct ec_stri
struct ec_stripe_new *s)
{
struct bch_fs *c = trans->c;
s64 idx;
int ret;
/*
* If we can't allocate a new stripe, and there's no stripes with empty
* blocks for us to reuse, that means we have to wait on copygc:
*/
idx = get_existing_stripe(c, h);
if (idx < 0)
return -BCH_ERR_stripe_alloc_blocked;
if (may_create_new_stripe(c))
return -1;
ret = get_stripe_key_trans(trans, idx, &s->existing_stripe);
bch2_fs_fatal_err_on(ret && !bch2_err_matches(ret, BCH_ERR_transaction_restart), c,
"reading stripe key: %s", bch2_err_str(ret));
if (ret) {
bch2_stripe_close(c, s);
return ret;
struct btree_iter lru_iter;
struct bkey_s_c lru_k;
int ret = 0;
for_each_btree_key_max_norestart(trans, lru_iter, BTREE_ID_lru,
lru_pos(BCH_LRU_STRIPE_FRAGMENTATION, 2, 0),
lru_pos(BCH_LRU_STRIPE_FRAGMENTATION, 2, LRU_TIME_MAX),
0, lru_k, ret) {
ret = __get_existing_stripe(trans, h, &s->existing_stripe, lru_k.k->p.offset);
if (ret)
break;
}
bch2_trans_iter_exit(trans, &lru_iter);
if (!ret)
ret = -BCH_ERR_stripe_alloc_blocked;
if (ret == 1)
ret = 0;
if (ret)
return ret;
return init_new_stripe_from_existing(c, s);
}
@ -2367,46 +2214,7 @@ void bch2_fs_ec_flush(struct bch_fs *c)
int bch2_stripes_read(struct bch_fs *c)
{
int ret = bch2_trans_run(c,
for_each_btree_key(trans, iter, BTREE_ID_stripes, POS_MIN,
BTREE_ITER_prefetch, k, ({
if (k.k->type != KEY_TYPE_stripe)
continue;
ret = __ec_stripe_mem_alloc(c, k.k->p.offset, GFP_KERNEL);
if (ret)
break;
struct stripe *m = genradix_ptr(&c->stripes, k.k->p.offset);
stripe_to_mem(m, bkey_s_c_to_stripe(k).v);
bch2_stripes_heap_insert(c, m, k.k->p.offset);
0;
})));
bch_err_fn(c, ret);
return ret;
}
void bch2_stripes_heap_to_text(struct printbuf *out, struct bch_fs *c)
{
ec_stripes_heap *h = &c->ec_stripes_heap;
struct stripe *m;
size_t i;
mutex_lock(&c->ec_stripes_heap_lock);
for (i = 0; i < min_t(size_t, h->nr, 50); i++) {
m = genradix_ptr(&c->stripes, h->data[i].idx);
prt_printf(out, "%zu %u/%u+%u", h->data[i].idx,
h->data[i].blocks_nonempty,
m->nr_blocks - m->nr_redundant,
m->nr_redundant);
if (bch2_stripe_is_open(c, h->data[i].idx))
prt_str(out, " open");
prt_newline(out);
}
mutex_unlock(&c->ec_stripes_heap_lock);
return 0;
}
static void bch2_new_stripe_to_text(struct printbuf *out, struct bch_fs *c,
@ -2477,15 +2285,12 @@ void bch2_fs_ec_exit(struct bch_fs *c)
BUG_ON(!list_empty(&c->ec_stripe_new_list));
free_heap(&c->ec_stripes_heap);
genradix_free(&c->stripes);
bioset_exit(&c->ec_bioset);
}
void bch2_fs_ec_init_early(struct bch_fs *c)
{
spin_lock_init(&c->ec_stripes_new_lock);
mutex_init(&c->ec_stripes_heap_lock);
INIT_LIST_HEAD(&c->ec_stripe_head_list);
mutex_init(&c->ec_stripe_head_lock);
@ -2503,3 +2308,40 @@ int bch2_fs_ec_init(struct bch_fs *c)
return bioset_init(&c->ec_bioset, 1, offsetof(struct ec_bio, bio),
BIOSET_NEED_BVECS);
}
static int bch2_check_stripe_to_lru_ref(struct btree_trans *trans,
struct bkey_s_c k,
struct bkey_buf *last_flushed)
{
if (k.k->type != KEY_TYPE_stripe)
return 0;
struct bkey_s_c_stripe s = bkey_s_c_to_stripe(k);
u64 lru_idx = stripe_lru_pos(s.v);
if (lru_idx) {
int ret = bch2_lru_check_set(trans, BCH_LRU_STRIPE_FRAGMENTATION,
k.k->p.offset, lru_idx, k, last_flushed);
if (ret)
return ret;
}
return 0;
}
int bch2_check_stripe_to_lru_refs(struct bch_fs *c)
{
struct bkey_buf last_flushed;
bch2_bkey_buf_init(&last_flushed);
bkey_init(&last_flushed.k->k);
int ret = bch2_trans_run(c,
for_each_btree_key_commit(trans, iter, BTREE_ID_stripes,
POS_MIN, BTREE_ITER_prefetch, k,
NULL, NULL, BCH_TRANS_COMMIT_no_enospc,
bch2_check_stripe_to_lru_ref(trans, k, &last_flushed)));
bch2_bkey_buf_exit(&last_flushed, c);
bch_err_fn(c, ret);
return ret;
}


@ -92,6 +92,29 @@ static inline void stripe_csum_set(struct bch_stripe *s,
memcpy(stripe_csum(s, block, csum_idx), &csum, bch_crc_bytes[s->csum_type]);
}
#define STRIPE_LRU_POS_EMPTY 1
static inline u64 stripe_lru_pos(const struct bch_stripe *s)
{
if (!s)
return 0;
unsigned nr_data = s->nr_blocks - s->nr_redundant, blocks_empty = 0;
for (unsigned i = 0; i < nr_data; i++)
blocks_empty += !stripe_blockcount_get(s, i);
/* Will be picked up by the stripe_delete worker */
if (blocks_empty == nr_data)
return STRIPE_LRU_POS_EMPTY;
if (!blocks_empty)
return 0;
/* invert: more blocks empty = reuse first */
return LRU_TIME_MAX - blocks_empty;
}
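stripe_lru_pos() above maps each stripe to a slot in the persistent fragmentation LRU: completely empty stripes land at position 1 for the delete worker, full stripes stay off the LRU entirely, and partially empty ones are keyed so the emptiest sort first for reuse. A standalone sketch with illustrative numbers (the 48-bit LRU time width is an assumption here):

#include <stdio.h>
#include <stdint.h>

#define EX_LRU_TIME_MAX		((1ULL << 48) - 1)	/* assumed LRU time width */
#define EX_STRIPE_LRU_POS_EMPTY	1

static uint64_t example_stripe_lru_pos(unsigned nr_data, unsigned blocks_empty)
{
	if (blocks_empty == nr_data)
		return EX_STRIPE_LRU_POS_EMPTY;		/* reclaimed by stripe delete */
	if (!blocks_empty)
		return 0;				/* full: not on the LRU */
	return EX_LRU_TIME_MAX - blocks_empty;		/* emptier stripes sort first */
}

int main(void)
{
	/* a 6+2 stripe: nr_data = 6 */
	for (unsigned empty = 0; empty <= 6; empty++)
		printf("blocks_empty=%u -> lru position %llu\n", empty,
		       (unsigned long long) example_stripe_lru_pos(6, empty));
	return 0;
}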
static inline bool __bch2_ptr_matches_stripe(const struct bch_extent_ptr *stripe_ptr,
const struct bch_extent_ptr *data_ptr,
unsigned sectors)
@ -132,6 +155,20 @@ static inline bool bch2_ptr_matches_stripe_m(const struct gc_stripe *m,
m->sectors);
}
static inline void gc_stripe_unlock(struct gc_stripe *s)
{
BUILD_BUG_ON(!((union ulong_byte_assert) { .ulong = 1UL << BUCKET_LOCK_BITNR }).byte);
clear_bit_unlock(BUCKET_LOCK_BITNR, (void *) &s->lock);
wake_up_bit((void *) &s->lock, BUCKET_LOCK_BITNR);
}
static inline void gc_stripe_lock(struct gc_stripe *s)
{
wait_on_bit_lock((void *) &s->lock, BUCKET_LOCK_BITNR,
TASK_UNINTERRUPTIBLE);
}
struct bch_read_bio;
struct ec_stripe_buf {
@ -212,7 +249,7 @@ int bch2_ec_read_extent(struct btree_trans *, struct bch_read_bio *, struct bkey
void *bch2_writepoint_ec_buf(struct bch_fs *, struct write_point *);
void bch2_ec_bucket_cancel(struct bch_fs *, struct open_bucket *);
void bch2_ec_bucket_cancel(struct bch_fs *, struct open_bucket *, int);
int bch2_ec_stripe_new_alloc(struct bch_fs *, struct ec_stripe_head *);
@ -221,10 +258,6 @@ struct ec_stripe_head *bch2_ec_stripe_head_get(struct btree_trans *,
unsigned, unsigned, unsigned,
enum bch_watermark, struct closure *);
void bch2_stripes_heap_update(struct bch_fs *, struct stripe *, size_t);
void bch2_stripes_heap_del(struct bch_fs *, struct stripe *, size_t);
void bch2_stripes_heap_insert(struct bch_fs *, struct stripe *, size_t);
void bch2_do_stripe_deletes(struct bch_fs *);
void bch2_ec_do_stripe_creates(struct bch_fs *);
void bch2_ec_stripe_new_free(struct bch_fs *, struct ec_stripe_new *);
@ -261,11 +294,12 @@ void bch2_fs_ec_flush(struct bch_fs *);
int bch2_stripes_read(struct bch_fs *);
void bch2_stripes_heap_to_text(struct printbuf *, struct bch_fs *);
void bch2_new_stripes_to_text(struct printbuf *, struct bch_fs *);
void bch2_fs_ec_exit(struct bch_fs *);
void bch2_fs_ec_init_early(struct bch_fs *);
int bch2_fs_ec_init(struct bch_fs *);
int bch2_check_stripe_to_lru_refs(struct bch_fs *);
#endif /* _BCACHEFS_EC_H */


@ -20,23 +20,15 @@ struct stripe {
};
struct gc_stripe {
u8 lock;
unsigned alive:1; /* does a corresponding key exist in stripes btree? */
u16 sectors;
u8 nr_blocks;
u8 nr_redundant;
unsigned alive:1; /* does a corresponding key exist in stripes btree? */
u16 block_sectors[BCH_BKEY_PTRS_MAX];
struct bch_extent_ptr ptrs[BCH_BKEY_PTRS_MAX];
struct bch_replicas_padded r;
};
struct ec_stripe_heap_entry {
size_t idx;
unsigned blocks_nonempty;
};
typedef DEFINE_MIN_HEAP(struct ec_stripe_heap_entry, ec_stripes_heap) ec_stripes_heap;
#endif /* _BCACHEFS_EC_TYPES_H */


@ -116,9 +116,11 @@
x(ENOENT, ENOENT_snapshot_tree) \
x(ENOENT, ENOENT_dirent_doesnt_match_inode) \
x(ENOENT, ENOENT_dev_not_found) \
x(ENOENT, ENOENT_dev_bucket_not_found) \
x(ENOENT, ENOENT_dev_idx_not_found) \
x(ENOENT, ENOENT_inode_no_backpointer) \
x(ENOENT, ENOENT_no_snapshot_tree_subvol) \
x(ENOENT, btree_node_dying) \
x(ENOTEMPTY, ENOTEMPTY_dir_not_empty) \
x(ENOTEMPTY, ENOTEMPTY_subvol_not_empty) \
x(EEXIST, EEXIST_str_hash_set) \
@ -180,6 +182,12 @@
x(EINVAL, not_in_recovery) \
x(EINVAL, cannot_rewind_recovery) \
x(0, data_update_done) \
x(BCH_ERR_data_update_done, data_update_done_would_block) \
x(BCH_ERR_data_update_done, data_update_done_unwritten) \
x(BCH_ERR_data_update_done, data_update_done_no_writes_needed) \
x(BCH_ERR_data_update_done, data_update_done_no_snapshot) \
x(BCH_ERR_data_update_done, data_update_done_no_dev_refs) \
x(BCH_ERR_data_update_done, data_update_done_no_rw_devs) \
x(EINVAL, device_state_not_allowed) \
x(EINVAL, member_info_missing) \
x(EINVAL, mismatched_block_size) \
@ -200,6 +208,8 @@
x(EINVAL, no_resize_with_buckets_nouse) \
x(EINVAL, inode_unpack_error) \
x(EINVAL, varint_decode_error) \
x(EINVAL, erasure_coding_found_btree_node) \
x(EOPNOTSUPP, may_not_use_incompat_feature) \
x(EROFS, erofs_trans_commit) \
x(EROFS, erofs_no_writes) \
x(EROFS, erofs_journal_err) \
@ -210,10 +220,18 @@
x(EROFS, insufficient_devices) \
x(0, operation_blocked) \
x(BCH_ERR_operation_blocked, btree_cache_cannibalize_lock_blocked) \
x(BCH_ERR_operation_blocked, journal_res_get_blocked) \
x(BCH_ERR_operation_blocked, journal_preres_get_blocked) \
x(BCH_ERR_operation_blocked, bucket_alloc_blocked) \
x(BCH_ERR_operation_blocked, stripe_alloc_blocked) \
x(BCH_ERR_operation_blocked, journal_res_blocked) \
x(BCH_ERR_journal_res_blocked, journal_blocked) \
x(BCH_ERR_journal_res_blocked, journal_max_in_flight) \
x(BCH_ERR_journal_res_blocked, journal_max_open) \
x(BCH_ERR_journal_res_blocked, journal_full) \
x(BCH_ERR_journal_res_blocked, journal_pin_full) \
x(BCH_ERR_journal_res_blocked, journal_buf_enomem) \
x(BCH_ERR_journal_res_blocked, journal_stuck) \
x(BCH_ERR_journal_res_blocked, journal_retry_open) \
x(BCH_ERR_journal_res_blocked, journal_preres_get_blocked) \
x(BCH_ERR_journal_res_blocked, bucket_alloc_blocked) \
x(BCH_ERR_journal_res_blocked, stripe_alloc_blocked) \
x(BCH_ERR_invalid, invalid_sb) \
x(BCH_ERR_invalid_sb, invalid_sb_magic) \
x(BCH_ERR_invalid_sb, invalid_sb_version) \
@ -223,6 +241,7 @@
x(BCH_ERR_invalid_sb, invalid_sb_csum) \
x(BCH_ERR_invalid_sb, invalid_sb_block_size) \
x(BCH_ERR_invalid_sb, invalid_sb_uuid) \
x(BCH_ERR_invalid_sb, invalid_sb_offset) \
x(BCH_ERR_invalid_sb, invalid_sb_too_many_members) \
x(BCH_ERR_invalid_sb, invalid_sb_dev_idx) \
x(BCH_ERR_invalid_sb, invalid_sb_time_precision) \
@ -250,6 +269,7 @@
x(BCH_ERR_operation_blocked, nocow_lock_blocked) \
x(EIO, journal_shutdown) \
x(EIO, journal_flush_err) \
x(EIO, journal_write_err) \
x(EIO, btree_node_read_err) \
x(BCH_ERR_btree_node_read_err, btree_node_read_err_cached) \
x(EIO, sb_not_downgraded) \
@ -258,17 +278,52 @@
x(EIO, btree_node_read_validate_error) \
x(EIO, btree_need_topology_repair) \
x(EIO, bucket_ref_update) \
x(EIO, trigger_alloc) \
x(EIO, trigger_pointer) \
x(EIO, trigger_stripe_pointer) \
x(EIO, metadata_bucket_inconsistency) \
x(EIO, mark_stripe) \
x(EIO, stripe_reconstruct) \
x(EIO, key_type_error) \
x(EIO, no_device_to_read_from) \
x(EIO, extent_poisened) \
x(EIO, missing_indirect_extent) \
x(EIO, invalidate_stripe_to_dev) \
x(EIO, no_encryption_key) \
x(EIO, insufficient_journal_devices) \
x(EIO, device_offline) \
x(EIO, EIO_fault_injected) \
x(EIO, ec_block_read) \
x(EIO, ec_block_write) \
x(EIO, recompute_checksum) \
x(EIO, decompress) \
x(BCH_ERR_decompress, decompress_exceeded_max_encoded_extent) \
x(BCH_ERR_decompress, decompress_lz4) \
x(BCH_ERR_decompress, decompress_gzip) \
x(BCH_ERR_decompress, decompress_zstd_src_len_bad) \
x(BCH_ERR_decompress, decompress_zstd) \
x(EIO, data_write) \
x(BCH_ERR_data_write, data_write_io) \
x(BCH_ERR_data_write, data_write_csum) \
x(BCH_ERR_data_write, data_write_invalid_ptr) \
x(BCH_ERR_data_write, data_write_misaligned) \
x(BCH_ERR_decompress, data_read) \
x(BCH_ERR_data_read, no_device_to_read_from) \
x(BCH_ERR_data_read, data_read_io_err) \
x(BCH_ERR_data_read, data_read_csum_err) \
x(BCH_ERR_data_read, data_read_retry) \
x(BCH_ERR_data_read_retry, data_read_retry_avoid) \
x(BCH_ERR_data_read_retry_avoid,data_read_retry_device_offline) \
x(BCH_ERR_data_read_retry_avoid,data_read_retry_io_err) \
x(BCH_ERR_data_read_retry_avoid,data_read_retry_ec_reconstruct_err) \
x(BCH_ERR_data_read_retry_avoid,data_read_retry_csum_err) \
x(BCH_ERR_data_read_retry, data_read_retry_csum_err_maybe_userspace)\
x(BCH_ERR_data_read, data_read_decompress_err) \
x(BCH_ERR_data_read, data_read_decrypt_err) \
x(BCH_ERR_data_read, data_read_ptr_stale_race) \
x(BCH_ERR_data_read_retry, data_read_ptr_stale_retry) \
x(BCH_ERR_data_read, data_read_no_encryption_key) \
x(BCH_ERR_data_read, data_read_buffer_too_small) \
x(BCH_ERR_data_read, data_read_key_overwritten) \
x(BCH_ERR_btree_node_read_err, btree_node_read_err_fixable) \
x(BCH_ERR_btree_node_read_err, btree_node_read_err_want_retry) \
x(BCH_ERR_btree_node_read_err, btree_node_read_err_must_retry) \


@ -3,8 +3,8 @@
#include "btree_cache.h"
#include "btree_iter.h"
#include "error.h"
#include "fs-common.h"
#include "journal.h"
#include "namei.h"
#include "recovery_passes.h"
#include "super.h"
#include "thread_with_file.h"
@ -54,25 +54,41 @@ void bch2_io_error_work(struct work_struct *work)
{
struct bch_dev *ca = container_of(work, struct bch_dev, io_error_work);
struct bch_fs *c = ca->fs;
bool dev;
/* XXX: if it's reads or checksums that are failing, set it to failed */
down_write(&c->state_lock);
dev = bch2_dev_state_allowed(c, ca, BCH_MEMBER_STATE_ro,
BCH_FORCE_IF_DEGRADED);
if (dev
? __bch2_dev_set_state(c, ca, BCH_MEMBER_STATE_ro,
BCH_FORCE_IF_DEGRADED)
: bch2_fs_emergency_read_only(c))
unsigned long write_errors_start = READ_ONCE(ca->write_errors_start);
if (write_errors_start &&
time_after(jiffies,
write_errors_start + c->opts.write_error_timeout * HZ)) {
if (ca->mi.state >= BCH_MEMBER_STATE_ro)
goto out;
bool dev = !__bch2_dev_set_state(c, ca, BCH_MEMBER_STATE_ro,
BCH_FORCE_IF_DEGRADED);
bch_err(ca,
"too many IO errors, setting %s RO",
"writes erroring for %u seconds, setting %s ro",
c->opts.write_error_timeout,
dev ? "device" : "filesystem");
if (!dev)
bch2_fs_emergency_read_only(c);
}
out:
up_write(&c->state_lock);
}
void bch2_io_error(struct bch_dev *ca, enum bch_member_error_type type)
{
atomic64_inc(&ca->errors[type]);
//queue_work(system_long_wq, &ca->io_error_work);
if (type == BCH_MEMBER_ERROR_write && !ca->write_errors_start)
ca->write_errors_start = jiffies;
queue_work(system_long_wq, &ca->io_error_work);
}
enum ask_yn {
@ -530,35 +546,59 @@ void bch2_flush_fsck_errs(struct bch_fs *c)
mutex_unlock(&c->fsck_error_msgs_lock);
}
int bch2_inum_err_msg_trans(struct btree_trans *trans, struct printbuf *out, subvol_inum inum)
int bch2_inum_offset_err_msg_trans(struct btree_trans *trans, struct printbuf *out,
subvol_inum inum, u64 offset)
{
u32 restart_count = trans->restart_count;
int ret = 0;
/* XXX: we don't yet attempt to print paths when we don't know the subvol */
if (inum.subvol)
ret = lockrestart_do(trans, bch2_inum_to_path(trans, inum, out));
if (inum.subvol) {
ret = bch2_inum_to_path(trans, inum, out);
if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
return ret;
}
if (!inum.subvol || ret)
prt_printf(out, "inum %llu:%llu", inum.subvol, inum.inum);
prt_printf(out, " offset %llu: ", offset);
return trans_was_restarted(trans, restart_count);
}
int bch2_inum_offset_err_msg_trans(struct btree_trans *trans, struct printbuf *out,
subvol_inum inum, u64 offset)
{
int ret = bch2_inum_err_msg_trans(trans, out, inum);
prt_printf(out, " offset %llu: ", offset);
return ret;
}
void bch2_inum_err_msg(struct bch_fs *c, struct printbuf *out, subvol_inum inum)
{
bch2_trans_run(c, bch2_inum_err_msg_trans(trans, out, inum));
}
void bch2_inum_offset_err_msg(struct bch_fs *c, struct printbuf *out,
subvol_inum inum, u64 offset)
{
bch2_trans_run(c, bch2_inum_offset_err_msg_trans(trans, out, inum, offset));
bch2_trans_do(c, bch2_inum_offset_err_msg_trans(trans, out, inum, offset));
}
int bch2_inum_snap_offset_err_msg_trans(struct btree_trans *trans, struct printbuf *out,
struct bpos pos)
{
struct bch_fs *c = trans->c;
int ret = 0;
if (!bch2_snapshot_is_leaf(c, pos.snapshot))
prt_str(out, "(multiple snapshots) ");
subvol_inum inum = {
.subvol = bch2_snapshot_tree_oldest_subvol(c, pos.snapshot),
.inum = pos.inode,
};
if (inum.subvol) {
ret = bch2_inum_to_path(trans, inum, out);
if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
return ret;
}
if (!inum.subvol || ret)
prt_printf(out, "inum %llu:%u", pos.inode, pos.snapshot);
prt_printf(out, " offset %llu: ", pos.offset << 8);
return 0;
}
void bch2_inum_snap_offset_err_msg(struct bch_fs *c, struct printbuf *out,
struct bpos pos)
{
bch2_trans_do(c, bch2_inum_snap_offset_err_msg_trans(trans, out, pos));
}


@ -216,32 +216,43 @@ void bch2_io_error_work(struct work_struct *);
/* Does the error handling without logging a message */
void bch2_io_error(struct bch_dev *, enum bch_member_error_type);
#define bch2_dev_io_err_on(cond, ca, _type, ...) \
({ \
bool _ret = (cond); \
\
if (_ret) { \
bch_err_dev_ratelimited(ca, __VA_ARGS__); \
bch2_io_error(ca, _type); \
} \
_ret; \
})
#ifndef CONFIG_BCACHEFS_NO_LATENCY_ACCT
void bch2_latency_acct(struct bch_dev *, u64, int);
#else
static inline void bch2_latency_acct(struct bch_dev *ca, u64 submit_time, int rw) {}
#endif
#define bch2_dev_inum_io_err_on(cond, ca, _type, ...) \
({ \
bool _ret = (cond); \
\
if (_ret) { \
bch_err_inum_offset_ratelimited(ca, __VA_ARGS__); \
bch2_io_error(ca, _type); \
} \
_ret; \
})
static inline void bch2_account_io_success_fail(struct bch_dev *ca,
enum bch_member_error_type type,
bool success)
{
if (likely(success)) {
if (type == BCH_MEMBER_ERROR_write &&
ca->write_errors_start)
ca->write_errors_start = 0;
} else {
bch2_io_error(ca, type);
}
}
static inline void bch2_account_io_completion(struct bch_dev *ca,
enum bch_member_error_type type,
u64 submit_time, bool success)
{
if (unlikely(!ca))
return;
if (type != BCH_MEMBER_ERROR_checksum)
bch2_latency_acct(ca, submit_time, type);
bch2_account_io_success_fail(ca, type, success);
}
int bch2_inum_err_msg_trans(struct btree_trans *, struct printbuf *, subvol_inum);
int bch2_inum_offset_err_msg_trans(struct btree_trans *, struct printbuf *, subvol_inum, u64);
void bch2_inum_err_msg(struct bch_fs *, struct printbuf *, subvol_inum);
void bch2_inum_offset_err_msg(struct bch_fs *, struct printbuf *, subvol_inum, u64);
int bch2_inum_snap_offset_err_msg_trans(struct btree_trans *, struct printbuf *, struct bpos);
void bch2_inum_snap_offset_err_msg(struct bch_fs *, struct printbuf *, struct bpos);
#endif /* _BCACHEFS_ERROR_H */


@ -28,6 +28,13 @@
#include "trace.h"
#include "util.h"
static const char * const bch2_extent_flags_strs[] = {
#define x(n, v) [BCH_EXTENT_FLAG_##n] = #n,
BCH_EXTENT_FLAGS()
#undef x
NULL,
};
static unsigned bch2_crc_field_size_max[] = {
[BCH_EXTENT_ENTRY_crc32] = CRC32_SIZE_MAX,
[BCH_EXTENT_ENTRY_crc64] = CRC64_SIZE_MAX,
@ -51,7 +58,8 @@ struct bch_dev_io_failures *bch2_dev_io_failures(struct bch_io_failures *f,
}
void bch2_mark_io_failure(struct bch_io_failures *failed,
struct extent_ptr_decoded *p)
struct extent_ptr_decoded *p,
bool csum_error)
{
struct bch_dev_io_failures *f = bch2_dev_io_failures(failed, p->ptr.dev);
@ -59,53 +67,57 @@ void bch2_mark_io_failure(struct bch_io_failures *failed,
BUG_ON(failed->nr >= ARRAY_SIZE(failed->devs));
f = &failed->devs[failed->nr++];
f->dev = p->ptr.dev;
f->idx = p->idx;
f->nr_failed = 1;
f->nr_retries = 0;
} else if (p->idx != f->idx) {
f->idx = p->idx;
f->nr_failed = 1;
f->nr_retries = 0;
} else {
f->nr_failed++;
memset(f, 0, sizeof(*f));
f->dev = p->ptr.dev;
}
if (p->do_ec_reconstruct)
f->failed_ec = true;
else if (!csum_error)
f->failed_io = true;
else
f->failed_csum_nr++;
}
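bch2_mark_io_failure() now records which kind of failure occurred (EC reconstruct, plain I/O, or checksum) rather than a bare retry counter. A hypothetical retry-path caller would classify the error via the hierarchy added in errcode.h further down and feed it in roughly like this (sketch only; 'ret', 'failed' and 'pick' are assumed local names):

	if (bch2_err_matches(ret, BCH_ERR_data_read_retry_avoid))
		/* also matches _device_offline, _io_err, _ec_reconstruct_err, _csum_err */
		bch2_mark_io_failure(&failed, &pick,
				     ret == -BCH_ERR_data_read_retry_csum_err);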
static inline u64 dev_latency(struct bch_fs *c, unsigned dev)
static inline u64 dev_latency(struct bch_dev *ca)
{
struct bch_dev *ca = bch2_dev_rcu(c, dev);
return ca ? atomic64_read(&ca->cur_latency[READ]) : S64_MAX;
}
static inline int dev_failed(struct bch_dev *ca)
{
return !ca || ca->mi.state == BCH_MEMBER_STATE_failed;
}
/*
* returns true if p1 is better than p2:
*/
static inline bool ptr_better(struct bch_fs *c,
const struct extent_ptr_decoded p1,
const struct extent_ptr_decoded p2)
u64 p1_latency,
struct bch_dev *ca1,
const struct extent_ptr_decoded p2,
u64 p2_latency)
{
if (likely(!p1.idx && !p2.idx)) {
u64 l1 = dev_latency(c, p1.ptr.dev);
u64 l2 = dev_latency(c, p2.ptr.dev);
struct bch_dev *ca2 = bch2_dev_rcu(c, p2.ptr.dev);
/*
* Square the latencies, to bias more in favor of the faster
* device - we never want to stop issuing reads to the slower
* device altogether, so that we can update our latency numbers:
*/
l1 *= l1;
l2 *= l2;
int failed_delta = dev_failed(ca1) - dev_failed(ca2);
if (unlikely(failed_delta))
return failed_delta < 0;
/* Pick at random, biased in favor of the faster device: */
if (unlikely(bch2_force_reconstruct_read))
return p1.do_ec_reconstruct > p2.do_ec_reconstruct;
return bch2_get_random_u64_below(l1 + l2) > l1;
}
if (unlikely(p1.do_ec_reconstruct || p2.do_ec_reconstruct))
return p1.do_ec_reconstruct < p2.do_ec_reconstruct;
if (bch2_force_reconstruct_read)
return p1.idx > p2.idx;
int crc_retry_delta = (int) p1.crc_retry_nr - (int) p2.crc_retry_nr;
if (unlikely(crc_retry_delta))
return crc_retry_delta < 0;
return p1.idx < p2.idx;
/* Pick at random, biased in favor of the faster device: */
return bch2_get_random_u64_below(p1_latency + p2_latency) > p1_latency;
}
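Since the final pick is "random value below p1_latency + p2_latency, compared against p1_latency", squaring the latencies (done by the caller below before calling ptr_better()) skews reads toward the faster device without ever starving the slower one. A worked example with round numbers:

	/*
	 * Illustrative numbers: device A averaging 1ms reads, device B 4ms.
	 *
	 *   squared:  A = 1, B = 16
	 *   P(pick A) = 16 / (1 + 16) ~ 94%
	 *   P(pick B) =  1 / (1 + 16) ~  6%
	 *
	 * With unsquared latencies B would still get 1 / (1 + 4) = 20% of reads;
	 * either way it keeps seeing some reads, so its latency stays measured.
	 */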
/*
@ -115,64 +127,108 @@ static inline bool ptr_better(struct bch_fs *c,
*/
int bch2_bkey_pick_read_device(struct bch_fs *c, struct bkey_s_c k,
struct bch_io_failures *failed,
struct extent_ptr_decoded *pick)
struct extent_ptr_decoded *pick,
int dev)
{
struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k);
const union bch_extent_entry *entry;
struct extent_ptr_decoded p;
struct bch_dev_io_failures *f;
int ret = 0;
bool have_csum_errors = false, have_io_errors = false, have_missing_devs = false;
bool have_dirty_ptrs = false, have_pick = false;
if (k.k->type == KEY_TYPE_error)
return -BCH_ERR_key_type_error;
struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k);
if (bch2_bkey_extent_ptrs_flags(ptrs) & BIT_ULL(BCH_EXTENT_FLAG_poisoned))
return -BCH_ERR_extent_poisened;
rcu_read_lock();
const union bch_extent_entry *entry;
struct extent_ptr_decoded p;
u64 pick_latency;
bkey_for_each_ptr_decode(k.k, ptrs, p, entry) {
have_dirty_ptrs |= !p.ptr.cached;
/*
* Unwritten extent: no need to actually read, treat it as a
* hole and return 0s:
*/
if (p.ptr.unwritten) {
ret = 0;
break;
rcu_read_unlock();
return 0;
}
/*
* If there are any dirty pointers it's an error if we can't
* read:
*/
if (!ret && !p.ptr.cached)
ret = -BCH_ERR_no_device_to_read_from;
/* Are we being asked to read from a specific device? */
if (dev >= 0 && p.ptr.dev != dev)
continue;
struct bch_dev *ca = bch2_dev_rcu(c, p.ptr.dev);
if (p.ptr.cached && (!ca || dev_ptr_stale_rcu(ca, &p.ptr)))
continue;
f = failed ? bch2_dev_io_failures(failed, p.ptr.dev) : NULL;
if (f)
p.idx = f->nr_failed < f->nr_retries
? f->idx
: f->idx + 1;
struct bch_dev_io_failures *f =
unlikely(failed) ? bch2_dev_io_failures(failed, p.ptr.dev) : NULL;
if (unlikely(f)) {
p.crc_retry_nr = f->failed_csum_nr;
p.has_ec &= ~f->failed_ec;
if (!p.idx && (!ca || !bch2_dev_is_readable(ca)))
p.idx++;
if (ca && ca->mi.state != BCH_MEMBER_STATE_failed) {
have_io_errors |= f->failed_io;
have_io_errors |= f->failed_ec;
}
have_csum_errors |= !!f->failed_csum_nr;
if (!p.idx && p.has_ec && bch2_force_reconstruct_read)
p.idx++;
if (p.has_ec && (f->failed_io || f->failed_csum_nr))
p.do_ec_reconstruct = true;
else if (f->failed_io ||
f->failed_csum_nr > c->opts.checksum_err_retry_nr)
continue;
}
if (p.idx > (unsigned) p.has_ec)
continue;
have_missing_devs |= ca && !bch2_dev_is_online(ca);
if (ret > 0 && !ptr_better(c, p, *pick))
continue;
if (!ca || !bch2_dev_is_online(ca)) {
if (!p.has_ec)
continue;
p.do_ec_reconstruct = true;
}
*pick = p;
ret = 1;
if (bch2_force_reconstruct_read && p.has_ec)
p.do_ec_reconstruct = true;
u64 p_latency = dev_latency(ca);
/*
* Square the latencies, to bias more in favor of the faster
* device - we never want to stop issuing reads to the slower
* device altogether, so that we can update our latency numbers:
*/
p_latency *= p_latency;
if (!have_pick ||
ptr_better(c,
p, p_latency, ca,
*pick, pick_latency)) {
*pick = p;
pick_latency = p_latency;
have_pick = true;
}
}
rcu_read_unlock();
return ret;
if (have_pick)
return 1;
if (!have_dirty_ptrs)
return 0;
if (have_missing_devs)
return -BCH_ERR_no_device_to_read_from;
if (have_csum_errors)
return -BCH_ERR_data_read_csum_err;
if (have_io_errors)
return -BCH_ERR_data_read_io_err;
WARN_ONCE(1, "unhandled error case in %s\n", __func__);
return -EINVAL;
}
/* KEY_TYPE_btree_ptr: */
@ -536,29 +592,35 @@ static void bch2_extent_crc_pack(union bch_extent_crc *dst,
struct bch_extent_crc_unpacked src,
enum bch_extent_entry_type type)
{
#define set_common_fields(_dst, _src) \
_dst.type = 1 << type; \
_dst.csum_type = _src.csum_type, \
_dst.compression_type = _src.compression_type, \
_dst._compressed_size = _src.compressed_size - 1, \
_dst._uncompressed_size = _src.uncompressed_size - 1, \
_dst.offset = _src.offset
#define common_fields(_src) \
.type = BIT(type), \
.csum_type = _src.csum_type, \
.compression_type = _src.compression_type, \
._compressed_size = _src.compressed_size - 1, \
._uncompressed_size = _src.uncompressed_size - 1, \
.offset = _src.offset
switch (type) {
case BCH_EXTENT_ENTRY_crc32:
set_common_fields(dst->crc32, src);
dst->crc32.csum = (u32 __force) *((__le32 *) &src.csum.lo);
dst->crc32 = (struct bch_extent_crc32) {
common_fields(src),
.csum = (u32 __force) *((__le32 *) &src.csum.lo),
};
break;
case BCH_EXTENT_ENTRY_crc64:
set_common_fields(dst->crc64, src);
dst->crc64.nonce = src.nonce;
dst->crc64.csum_lo = (u64 __force) src.csum.lo;
dst->crc64.csum_hi = (u64 __force) *((__le16 *) &src.csum.hi);
dst->crc64 = (struct bch_extent_crc64) {
common_fields(src),
.nonce = src.nonce,
.csum_lo = (u64 __force) src.csum.lo,
.csum_hi = (u64 __force) *((__le16 *) &src.csum.hi),
};
break;
case BCH_EXTENT_ENTRY_crc128:
set_common_fields(dst->crc128, src);
dst->crc128.nonce = src.nonce;
dst->crc128.csum = src.csum;
dst->crc128 = (struct bch_extent_crc128) {
common_fields(src),
.nonce = src.nonce,
.csum = src.csum,
};
break;
default:
BUG();
@ -997,7 +1059,7 @@ static bool want_cached_ptr(struct bch_fs *c, struct bch_io_opts *opts,
struct bch_dev *ca = bch2_dev_rcu_noerror(c, ptr->dev);
return ca && bch2_dev_is_readable(ca) && !dev_ptr_stale_rcu(ca, ptr);
return ca && bch2_dev_is_healthy(ca) && !dev_ptr_stale_rcu(ca, ptr);
}
void bch2_extent_ptr_set_cached(struct bch_fs *c,
@ -1220,6 +1282,10 @@ void bch2_bkey_ptrs_to_text(struct printbuf *out, struct bch_fs *c,
bch2_extent_rebalance_to_text(out, c, &entry->rebalance);
break;
case BCH_EXTENT_ENTRY_flags:
prt_bitflags(out, bch2_extent_flags_strs, entry->flags.flags);
break;
default:
prt_printf(out, "(invalid extent entry %.16llx)", *((u64 *) entry));
return;
@ -1381,6 +1447,11 @@ int bch2_bkey_ptrs_validate(struct bch_fs *c, struct bkey_s_c k,
#endif
break;
}
case BCH_EXTENT_ENTRY_flags:
bkey_fsck_err_on(entry != ptrs.start,
c, extent_flags_not_at_start,
"extent flags entry not at start");
break;
}
}
@ -1447,6 +1518,28 @@ void bch2_ptr_swab(struct bkey_s k)
}
}
int bch2_bkey_extent_flags_set(struct bch_fs *c, struct bkey_i *k, u64 flags)
{
int ret = bch2_request_incompat_feature(c, bcachefs_metadata_version_extent_flags);
if (ret)
return ret;
struct bkey_ptrs ptrs = bch2_bkey_ptrs(bkey_i_to_s(k));
if (ptrs.start != ptrs.end &&
extent_entry_type(ptrs.start) == BCH_EXTENT_ENTRY_flags) {
ptrs.start->flags.flags = flags;
} else {
struct bch_extent_flags f = {
.type = BIT(BCH_EXTENT_ENTRY_flags),
.flags = flags,
};
__extent_entry_insert(k, ptrs.start, (union bch_extent_entry *) &f);
}
return 0;
}
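
As an aside, this helper composes with the bch2_bkey_extent_flags() accessor added to extents.h further down; a minimal sketch (illustrative only, not from this series) of a caller marking an extent poisoned while preserving any other flag bits:

	/*
	 * Sketch only: @new is assumed to be a writable copy of the extent key
	 * being updated inside a btree transaction.
	 */
	static int mark_extent_poisoned(struct bch_fs *c, struct bkey_i *new)
	{
		u64 flags = bch2_bkey_extent_flags(bkey_i_to_s_c(new)) |
			BIT_ULL(BCH_EXTENT_FLAG_poisoned);

		return bch2_bkey_extent_flags_set(c, new, flags);
	}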
/* Generic extent code: */
int bch2_cut_front_s(struct bpos where, struct bkey_s k)
@ -1492,8 +1585,8 @@ int bch2_cut_front_s(struct bpos where, struct bkey_s k)
entry->crc128.offset += sub;
break;
case BCH_EXTENT_ENTRY_stripe_ptr:
break;
case BCH_EXTENT_ENTRY_rebalance:
case BCH_EXTENT_ENTRY_flags:
break;
}


@ -320,8 +320,9 @@ static inline struct bkey_ptrs bch2_bkey_ptrs(struct bkey_s k)
({ \
__label__ out; \
\
(_ptr).idx = 0; \
(_ptr).has_ec = false; \
(_ptr).has_ec = false; \
(_ptr).do_ec_reconstruct = false; \
(_ptr).crc_retry_nr = 0; \
\
__bkey_extent_entry_for_each_from(_entry, _end, _entry) \
switch (__extent_entry_type(_entry)) { \
@ -401,10 +402,10 @@ out: \
struct bch_dev_io_failures *bch2_dev_io_failures(struct bch_io_failures *,
unsigned);
void bch2_mark_io_failure(struct bch_io_failures *,
struct extent_ptr_decoded *);
struct extent_ptr_decoded *, bool);
int bch2_bkey_pick_read_device(struct bch_fs *, struct bkey_s_c,
struct bch_io_failures *,
struct extent_ptr_decoded *);
struct extent_ptr_decoded *, int);
/* KEY_TYPE_btree_ptr: */
@ -753,4 +754,19 @@ static inline void bch2_key_resize(struct bkey *k, unsigned new_size)
k->size = new_size;
}
static inline u64 bch2_bkey_extent_ptrs_flags(struct bkey_ptrs_c ptrs)
{
if (ptrs.start != ptrs.end &&
extent_entry_type(ptrs.start) == BCH_EXTENT_ENTRY_flags)
return ptrs.start->flags.flags;
return 0;
}
static inline u64 bch2_bkey_extent_flags(struct bkey_s_c k)
{
return bch2_bkey_extent_ptrs_flags(bch2_bkey_ptrs_c(k));
}
int bch2_bkey_extent_flags_set(struct bch_fs *, struct bkey_i *, u64);
#endif /* _BCACHEFS_EXTENTS_H */


@ -79,8 +79,9 @@
x(crc64, 2) \
x(crc128, 3) \
x(stripe_ptr, 4) \
x(rebalance, 5)
#define BCH_EXTENT_ENTRY_MAX 6
x(rebalance, 5) \
x(flags, 6)
#define BCH_EXTENT_ENTRY_MAX 7
enum bch_extent_entry_type {
#define x(f, n) BCH_EXTENT_ENTRY_##f = n,
@ -201,6 +202,25 @@ struct bch_extent_stripe_ptr {
#endif
};
#define BCH_EXTENT_FLAGS() \
x(poisoned, 0)
enum bch_extent_flags_e {
#define x(n, v) BCH_EXTENT_FLAG_##n = v,
BCH_EXTENT_FLAGS()
#undef x
};
struct bch_extent_flags {
#if defined(__LITTLE_ENDIAN_BITFIELD)
__u64 type:7,
flags:57;
#elif defined (__BIG_ENDIAN_BITFIELD)
__u64 flags:57,
type:7;
#endif
};
/* bch_extent_rebalance: */
#include "rebalance_format.h"


@ -20,8 +20,9 @@ struct bch_extent_crc_unpacked {
};
struct extent_ptr_decoded {
unsigned idx;
bool has_ec;
bool do_ec_reconstruct;
u8 crc_retry_nr;
struct bch_extent_crc_unpacked crc;
struct bch_extent_ptr ptr;
struct bch_extent_stripe_ptr ec;
@ -31,10 +32,10 @@ struct bch_io_failures {
u8 nr;
struct bch_dev_io_failures {
u8 dev;
u8 idx;
u8 nr_failed;
u8 nr_retries;
} devs[BCH_REPLICAS_MAX];
unsigned failed_csum_nr:6,
failed_io:1,
failed_ec:1;
} devs[BCH_REPLICAS_MAX + 1];
};
#endif /* _BCACHEFS_EXTENTS_TYPES_H */


@ -148,87 +148,97 @@ static int do_cmp(const void *a, const void *b, cmp_r_func_t cmp, const void *pr
return cmp(a, b, priv);
}
static inline int eytzinger0_do_cmp(void *base, size_t n, size_t size,
static inline int eytzinger1_do_cmp(void *base1, size_t n, size_t size,
cmp_r_func_t cmp_func, const void *priv,
size_t l, size_t r)
{
return do_cmp(base + inorder_to_eytzinger0(l, n) * size,
base + inorder_to_eytzinger0(r, n) * size,
return do_cmp(base1 + inorder_to_eytzinger1(l, n) * size,
base1 + inorder_to_eytzinger1(r, n) * size,
cmp_func, priv);
}
static inline void eytzinger0_do_swap(void *base, size_t n, size_t size,
static inline void eytzinger1_do_swap(void *base1, size_t n, size_t size,
swap_r_func_t swap_func, const void *priv,
size_t l, size_t r)
{
do_swap(base + inorder_to_eytzinger0(l, n) * size,
base + inorder_to_eytzinger0(r, n) * size,
do_swap(base1 + inorder_to_eytzinger1(l, n) * size,
base1 + inorder_to_eytzinger1(r, n) * size,
size, swap_func, priv);
}
static void eytzinger1_sort_r(void *base1, size_t n, size_t size,
cmp_r_func_t cmp_func,
swap_r_func_t swap_func,
const void *priv)
{
unsigned i, j, k;
/* called from 'sort' without swap function, let's pick the default */
if (swap_func == SWAP_WRAPPER && !((struct wrapper *)priv)->swap_func)
swap_func = NULL;
if (!swap_func) {
if (is_aligned(base1, size, 8))
swap_func = SWAP_WORDS_64;
else if (is_aligned(base1, size, 4))
swap_func = SWAP_WORDS_32;
else
swap_func = SWAP_BYTES;
}
/* heapify */
for (i = n / 2; i >= 1; --i) {
/* Find the sift-down path all the way to the leaves. */
for (j = i; k = j * 2, k < n;)
j = eytzinger1_do_cmp(base1, n, size, cmp_func, priv, k, k + 1) > 0 ? k : k + 1;
/* Special case for the last leaf with no sibling. */
if (j * 2 == n)
j *= 2;
/* Backtrack to the correct location. */
while (j != i && eytzinger1_do_cmp(base1, n, size, cmp_func, priv, i, j) >= 0)
j /= 2;
/* Shift the element into its correct place. */
for (k = j; j != i;) {
j /= 2;
eytzinger1_do_swap(base1, n, size, swap_func, priv, j, k);
}
}
/* sort */
for (i = n; i > 1; --i) {
eytzinger1_do_swap(base1, n, size, swap_func, priv, 1, i);
/* Find the sift-down path all the way to the leaves. */
for (j = 1; k = j * 2, k + 1 < i;)
j = eytzinger1_do_cmp(base1, n, size, cmp_func, priv, k, k + 1) > 0 ? k : k + 1;
/* Special case for the last leaf with no sibling. */
if (j * 2 + 1 == i)
j *= 2;
/* Backtrack to the correct location. */
while (j >= 1 && eytzinger1_do_cmp(base1, n, size, cmp_func, priv, 1, j) >= 0)
j /= 2;
/* Shift the element into its correct place. */
for (k = j; j > 1;) {
j /= 2;
eytzinger1_do_swap(base1, n, size, swap_func, priv, j, k);
}
}
}
void eytzinger0_sort_r(void *base, size_t n, size_t size,
cmp_r_func_t cmp_func,
swap_r_func_t swap_func,
const void *priv)
{
int i, j, k;
void *base1 = base - size;
/* called from 'sort' without swap function, let's pick the default */
if (swap_func == SWAP_WRAPPER && !((struct wrapper *)priv)->swap_func)
swap_func = NULL;
if (!swap_func) {
if (is_aligned(base, size, 8))
swap_func = SWAP_WORDS_64;
else if (is_aligned(base, size, 4))
swap_func = SWAP_WORDS_32;
else
swap_func = SWAP_BYTES;
}
/* heapify */
for (i = n / 2 - 1; i >= 0; --i) {
/* Find the sift-down path all the way to the leaves. */
for (j = i; k = j * 2 + 1, k + 1 < n;)
j = eytzinger0_do_cmp(base, n, size, cmp_func, priv, k, k + 1) > 0 ? k : k + 1;
/* Special case for the last leaf with no sibling. */
if (j * 2 + 2 == n)
j = j * 2 + 1;
/* Backtrack to the correct location. */
while (j != i && eytzinger0_do_cmp(base, n, size, cmp_func, priv, i, j) >= 0)
j = (j - 1) / 2;
/* Shift the element into its correct place. */
for (k = j; j != i;) {
j = (j - 1) / 2;
eytzinger0_do_swap(base, n, size, swap_func, priv, j, k);
}
}
/* sort */
for (i = n - 1; i > 0; --i) {
eytzinger0_do_swap(base, n, size, swap_func, priv, 0, i);
/* Find the sift-down path all the way to the leaves. */
for (j = 0; k = j * 2 + 1, k + 1 < i;)
j = eytzinger0_do_cmp(base, n, size, cmp_func, priv, k, k + 1) > 0 ? k : k + 1;
/* Special case for the last leaf with no sibling. */
if (j * 2 + 2 == i)
j = j * 2 + 1;
/* Backtrack to the correct location. */
while (j && eytzinger0_do_cmp(base, n, size, cmp_func, priv, 0, j) >= 0)
j = (j - 1) / 2;
/* Shift the element into its correct place. */
for (k = j; j;) {
j = (j - 1) / 2;
eytzinger0_do_swap(base, n, size, swap_func, priv, j, k);
}
}
return eytzinger1_sort_r(base1, n, size, cmp_func, swap_func, priv);
}
void eytzinger0_sort(void *base, size_t n, size_t size,


@ -6,6 +6,7 @@
#include <linux/log2.h>
#ifdef EYTZINGER_DEBUG
#include <linux/bug.h>
#define EYTZINGER_BUG_ON(cond) BUG_ON(cond)
#else
#define EYTZINGER_BUG_ON(cond)
@ -56,24 +57,14 @@ static inline unsigned eytzinger1_last(unsigned size)
return rounddown_pow_of_two(size + 1) - 1;
}
/*
* eytzinger1_next() and eytzinger1_prev() have the nice properties that
*
* eytzinger1_next(0) == eytzinger1_first())
* eytzinger1_prev(0) == eytzinger1_last())
*
* eytzinger1_prev(eytzinger1_first()) == 0
* eytzinger1_next(eytzinger1_last()) == 0
*/
static inline unsigned eytzinger1_next(unsigned i, unsigned size)
{
EYTZINGER_BUG_ON(i > size);
EYTZINGER_BUG_ON(i == 0 || i > size);
if (eytzinger1_right_child(i) <= size) {
i = eytzinger1_right_child(i);
i <<= __fls(size + 1) - __fls(i);
i <<= __fls(size) - __fls(i);
i >>= i > size;
} else {
i >>= ffz(i) + 1;
@ -84,12 +75,12 @@ static inline unsigned eytzinger1_next(unsigned i, unsigned size)
static inline unsigned eytzinger1_prev(unsigned i, unsigned size)
{
EYTZINGER_BUG_ON(i > size);
EYTZINGER_BUG_ON(i == 0 || i > size);
if (eytzinger1_left_child(i) <= size) {
i = eytzinger1_left_child(i) + 1;
i <<= __fls(size + 1) - __fls(i);
i <<= __fls(size) - __fls(i);
i -= 1;
i >>= i > size;
} else {
@ -243,73 +234,63 @@ static inline unsigned inorder_to_eytzinger0(unsigned i, unsigned size)
(_i) != -1; \
(_i) = eytzinger0_next((_i), (_size)))
#define eytzinger0_for_each_prev(_i, _size) \
for (unsigned (_i) = eytzinger0_last((_size)); \
(_i) != -1; \
(_i) = eytzinger0_prev((_i), (_size)))
/* return greatest node <= @search, or -1 if not found */
static inline int eytzinger0_find_le(void *base, size_t nr, size_t size,
cmp_func_t cmp, const void *search)
{
unsigned i, n = 0;
void *base1 = base - size;
unsigned n = 1;
if (!nr)
return -1;
do {
i = n;
n = eytzinger0_child(i, cmp(base + i * size, search) <= 0);
} while (n < nr);
if (n & 1) {
/*
* @i was greater than @search, return previous node:
*
* if @i was leftmost/smallest element,
* eytzinger0_prev(eytzinger0_first())) returns -1, as expected
*/
return eytzinger0_prev(i, nr);
} else {
return i;
}
while (n <= nr)
n = eytzinger1_child(n, cmp(base1 + n * size, search) <= 0);
n >>= __ffs(n) + 1;
return n - 1;
}
/* return smallest node > @search, or -1 if not found */
static inline int eytzinger0_find_gt(void *base, size_t nr, size_t size,
cmp_func_t cmp, const void *search)
{
ssize_t idx = eytzinger0_find_le(base, nr, size, cmp, search);
void *base1 = base - size;
unsigned n = 1;
/*
* if eytitzinger0_find_le() returned -1 - no element was <= search - we
* want to return the first element; next/prev identities mean this work
* as expected
*
* similarly if find_le() returns last element, we should return -1;
* identities mean this all works out:
*/
return eytzinger0_next(idx, nr);
while (n <= nr)
n = eytzinger1_child(n, cmp(base1 + n * size, search) <= 0);
n >>= __ffs(n + 1) + 1;
return n - 1;
}
/* return smallest node >= @search, or -1 if not found */
static inline int eytzinger0_find_ge(void *base, size_t nr, size_t size,
cmp_func_t cmp, const void *search)
{
ssize_t idx = eytzinger0_find_le(base, nr, size, cmp, search);
void *base1 = base - size;
unsigned n = 1;
if (idx < nr && !cmp(base + idx * size, search))
return idx;
return eytzinger0_next(idx, nr);
while (n <= nr)
n = eytzinger1_child(n, cmp(base1 + n * size, search) < 0);
n >>= __ffs(n + 1) + 1;
return n - 1;
}
#define eytzinger0_find(base, nr, size, _cmp, search) \
({ \
void *_base = (base); \
size_t _size = (size); \
void *_base1 = (void *)(base) - _size; \
const void *_search = (search); \
size_t _nr = (nr); \
size_t _size = (size); \
size_t _i = 0; \
size_t _i = 1; \
int _res; \
\
while (_i < _nr && \
(_res = _cmp(_search, _base + _i * _size))) \
_i = eytzinger0_child(_i, _res > 0); \
_i; \
while (_i <= _nr && \
(_res = _cmp(_search, _base1 + _i * _size))) \
_i = eytzinger1_child(_i, _res > 0); \
_i - 1; \
})
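
The rewritten find helpers above all share one trick: descend the implicit 1-based eytzinger tree, letting the bits of n record the path, then backtrack with a single shift once n falls off the end. A self-contained userspace sketch of find_le (illustrative only; names and the test harness are not from this series):

	#include <assert.h>
	#include <stdio.h>
	#include <strings.h>	/* ffs() */

	/* In-order fill of a 1-based eytzinger array from a sorted array: */
	static void eytz1_build(const int *sorted, int *eytz, unsigned nr,
				unsigned i, unsigned *pos)
	{
		if (i > nr)
			return;
		eytz1_build(sorted, eytz, nr, 2 * i, pos);	/* left subtree */
		eytz[i] = sorted[(*pos)++];			/* this node */
		eytz1_build(sorted, eytz, nr, 2 * i + 1, pos);	/* right subtree */
	}

	/*
	 * Greatest element <= search: returns its 1-based eytzinger index, 0 if
	 * none (the kernel helpers return n - 1 for their 0-based convention):
	 */
	static unsigned eytz1_find_le(const int *eytz, unsigned nr, int search)
	{
		unsigned n = 1;

		while (n <= nr)
			n = 2 * n + (eytz[n] <= search); /* right child if <=, else left */

		/*
		 * The bits of n encode the descent path (0 = left, 1 = right);
		 * dropping the trailing zeroes plus one more bit backtracks to
		 * the last node where we went right, i.e. the greatest element
		 * <= search:
		 */
		return n >> ffs(n);
	}

	int main(void)
	{
		const int sorted[] = { 10, 20, 30, 40, 50, 60, 70 };
		unsigned nr = 7, pos = 0;
		int eytz[8];

		eytz1_build(sorted, eytz, nr, 1, &pos);

		assert(eytz1_find_le(eytz, nr, 5) == 0);	/* nothing <= 5 */
		assert(eytz[eytz1_find_le(eytz, nr, 35)] == 30);
		assert(eytz[eytz1_find_le(eytz, nr, 40)] == 40);
		assert(eytz[eytz1_find_le(eytz, nr, 99)] == 70);
		printf("eytzinger find_le: ok\n");
		return 0;
	}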
void eytzinger0_sort_r(void *, size_t, size_t,


@ -110,11 +110,21 @@ static int readpage_bio_extend(struct btree_trans *trans,
if (!get_more)
break;
unsigned sectors_remaining = sectors_this_extent - bio_sectors(bio);
if (sectors_remaining < PAGE_SECTORS << mapping_min_folio_order(iter->mapping))
break;
unsigned order = ilog2(rounddown_pow_of_two(sectors_remaining) / PAGE_SECTORS);
/* ensure proper alignment */
order = min(order, __ffs(folio_offset|BIT(31)));
folio = xa_load(&iter->mapping->i_pages, folio_offset);
if (folio && !xa_is_value(folio))
break;
folio = filemap_alloc_folio(readahead_gfp_mask(iter->mapping), 0);
folio = filemap_alloc_folio(readahead_gfp_mask(iter->mapping), order);
if (!folio)
break;
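
For reference, the order clamp above has to satisfy two constraints: the folio can't be larger than the readahead that's left, and an order-N folio must start at an index aligned to 1 << N. A standalone sketch of the same computation, expressed in pages rather than sectors (illustrative only, not from this series):

	#include <stdio.h>

	/* Largest usable folio order at @index given @remaining_pages (>= 1): */
	static unsigned max_folio_order(unsigned long index, unsigned long remaining_pages)
	{
		unsigned order = 0;

		/* ilog2(rounddown_pow_of_two(remaining_pages)): */
		while ((2UL << order) <= remaining_pages)
			order++;

		/*
		 * Clamp to the alignment of @index; OR-ing in a high bit caps
		 * the result and keeps ctz defined for index 0, same as the
		 * kernel's __ffs(folio_offset|BIT(31)) above:
		 */
		unsigned align = __builtin_ctzl(index | (1UL << 31));

		return order < align ? order : align;
	}

	int main(void)
	{
		printf("%u\n", max_folio_order(8, 32)); /* index 8: aligned to 8 pages, order 3 */
		printf("%u\n", max_folio_order(6, 32)); /* index 6: aligned to 2 pages, order 1 */
		printf("%u\n", max_folio_order(0, 5));  /* length limited, order 2 */
		return 0;
	}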
@ -149,12 +159,10 @@ static void bchfs_read(struct btree_trans *trans,
struct bch_fs *c = trans->c;
struct btree_iter iter;
struct bkey_buf sk;
int flags = BCH_READ_RETRY_IF_STALE|
BCH_READ_MAY_PROMOTE;
int flags = BCH_READ_retry_if_stale|
BCH_READ_may_promote;
int ret = 0;
rbio->c = c;
rbio->start_time = local_clock();
rbio->subvol = inum.subvol;
bch2_bkey_buf_init(&sk);
@ -211,14 +219,14 @@ static void bchfs_read(struct btree_trans *trans,
swap(rbio->bio.bi_iter.bi_size, bytes);
if (rbio->bio.bi_iter.bi_size == bytes)
flags |= BCH_READ_LAST_FRAGMENT;
flags |= BCH_READ_last_fragment;
bch2_bio_page_state_set(&rbio->bio, k);
bch2_read_extent(trans, rbio, iter.pos,
data_btree, k, offset_into_extent, flags);
if (flags & BCH_READ_LAST_FRAGMENT)
if (flags & BCH_READ_last_fragment)
break;
swap(rbio->bio.bi_iter.bi_size, bytes);
@ -232,7 +240,8 @@ err:
if (ret) {
struct printbuf buf = PRINTBUF;
bch2_inum_offset_err_msg_trans(trans, &buf, inum, iter.pos.offset << 9);
lockrestart_do(trans,
bch2_inum_offset_err_msg_trans(trans, &buf, inum, iter.pos.offset << 9));
prt_printf(&buf, "read error %i from btree lookup", ret);
bch_err_ratelimited(c, "%s", buf.buf);
printbuf_exit(&buf);
@ -280,12 +289,13 @@ void bch2_readahead(struct readahead_control *ractl)
struct bch_read_bio *rbio =
rbio_init(bio_alloc_bioset(NULL, n, REQ_OP_READ,
GFP_KERNEL, &c->bio_read),
opts);
c,
opts,
bch2_readpages_end_io);
readpage_iter_advance(&readpages_iter);
rbio->bio.bi_iter.bi_sector = folio_sector(folio);
rbio->bio.bi_end_io = bch2_readpages_end_io;
BUG_ON(!bio_add_folio(&rbio->bio, folio, folio_size(folio), 0));
bchfs_read(trans, rbio, inode_inum(inode),
@ -323,10 +333,10 @@ int bch2_read_single_folio(struct folio *folio, struct address_space *mapping)
bch2_inode_opts_get(&opts, c, &inode->ei_inode);
rbio = rbio_init(bio_alloc_bioset(NULL, 1, REQ_OP_READ, GFP_KERNEL, &c->bio_read),
opts);
c,
opts,
bch2_read_single_folio_end_io);
rbio->bio.bi_private = &done;
rbio->bio.bi_end_io = bch2_read_single_folio_end_io;
rbio->bio.bi_opf = REQ_OP_READ|REQ_SYNC;
rbio->bio.bi_iter.bi_sector = folio_sector(folio);
BUG_ON(!bio_add_folio(&rbio->bio, folio, folio_size(folio), 0));
@ -420,7 +430,7 @@ static void bch2_writepage_io_done(struct bch_write_op *op)
}
}
if (io->op.flags & BCH_WRITE_WROTE_DATA_INLINE) {
if (io->op.flags & BCH_WRITE_wrote_data_inline) {
bio_for_each_folio_all(fi, bio) {
struct bch_folio *s;


@ -73,6 +73,7 @@ static int bch2_direct_IO_read(struct kiocb *req, struct iov_iter *iter)
struct blk_plug plug;
loff_t offset = req->ki_pos;
bool sync = is_sync_kiocb(req);
bool split = false;
size_t shorten;
ssize_t ret;
@ -99,8 +100,6 @@ static int bch2_direct_IO_read(struct kiocb *req, struct iov_iter *iter)
GFP_KERNEL,
&c->dio_read_bioset);
bio->bi_end_io = bch2_direct_IO_read_endio;
dio = container_of(bio, struct dio_read, rbio.bio);
closure_init(&dio->cl, NULL);
@ -133,12 +132,13 @@ static int bch2_direct_IO_read(struct kiocb *req, struct iov_iter *iter)
goto start;
while (iter->count) {
split = true;
bio = bio_alloc_bioset(NULL,
bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS),
REQ_OP_READ,
GFP_KERNEL,
&c->bio_read);
bio->bi_end_io = bch2_direct_IO_read_split_endio;
start:
bio->bi_opf = REQ_OP_READ|REQ_SYNC;
bio->bi_iter.bi_sector = offset >> 9;
@ -160,7 +160,15 @@ start:
if (iter->count)
closure_get(&dio->cl);
bch2_read(c, rbio_init(bio, opts), inode_inum(inode));
struct bch_read_bio *rbio =
rbio_init(bio,
c,
opts,
split
? bch2_direct_IO_read_split_endio
: bch2_direct_IO_read_endio);
bch2_read(c, rbio, inode_inum(inode));
}
blk_finish_plug(&plug);
@ -511,8 +519,8 @@ static __always_inline long bch2_dio_write_loop(struct dio_write *dio)
dio->op.devs_need_flush = &inode->ei_devs_need_flush;
if (sync)
dio->op.flags |= BCH_WRITE_SYNC;
dio->op.flags |= BCH_WRITE_CHECK_ENOSPC;
dio->op.flags |= BCH_WRITE_sync;
dio->op.flags |= BCH_WRITE_check_enospc;
ret = bch2_quota_reservation_add(c, inode, &dio->quota_res,
bio_sectors(bio), true);


@ -5,8 +5,8 @@
#include "chardev.h"
#include "dirent.h"
#include "fs.h"
#include "fs-common.h"
#include "fs-ioctl.h"
#include "namei.h"
#include "quota.h"
#include <linux/compat.h>
@ -54,6 +54,32 @@ static int bch2_inode_flags_set(struct btree_trans *trans,
(newflags & (BCH_INODE_nodump|BCH_INODE_noatime)) != newflags)
return -EINVAL;
if ((newflags ^ oldflags) & BCH_INODE_casefolded) {
#ifdef CONFIG_UNICODE
int ret = 0;
/* Not supported on individual files. */
if (!S_ISDIR(bi->bi_mode))
return -EOPNOTSUPP;
/*
* Make sure the dir is empty, as otherwise we'd need to
* rehash everything and update the dirent keys.
*/
ret = bch2_empty_dir_trans(trans, inode_inum(inode));
if (ret < 0)
return ret;
ret = bch2_request_incompat_feature(c, bcachefs_metadata_version_casefolding);
if (ret)
return ret;
bch2_check_set_feature(c, BCH_FEATURE_casefolding);
#else
printk(KERN_ERR "Cannot use casefolding on a kernel without CONFIG_UNICODE\n");
return -EOPNOTSUPP;
#endif
}
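
From userspace this path is reached through the standard inode flags ioctl; a minimal sketch (not part of this series) of turning casefolding on for an empty directory, assuming the filesystem allows the casefolding incompat feature to be enabled:

	#include <stdio.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>

	int main(int argc, char **argv)
	{
		if (argc != 2) {
			fprintf(stderr, "usage: %s <empty directory>\n", argv[0]);
			return 1;
		}

		int fd = open(argv[1], O_RDONLY|O_DIRECTORY);
		if (fd < 0) {
			perror("open");
			return 1;
		}

		int flags;
		if (ioctl(fd, FS_IOC_GETFLAGS, &flags)) {
			perror("FS_IOC_GETFLAGS");
			return 1;
		}

		flags |= FS_CASEFOLD_FL;	/* maps to BCH_INODE_casefolded above */
		if (ioctl(fd, FS_IOC_SETFLAGS, &flags)) {
			perror("FS_IOC_SETFLAGS");
			return 1;
		}

		close(fd);
		return 0;
	}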
if (s->set_projinherit) {
bi->bi_fields_set &= ~(1 << Inode_opt_project);
bi->bi_fields_set |= ((int) s->projinherit << Inode_opt_project);
@ -218,7 +244,7 @@ static int bch2_ioc_reinherit_attrs(struct bch_fs *c,
int ret = 0;
subvol_inum inum;
kname = kmalloc(BCH_NAME_MAX + 1, GFP_KERNEL);
kname = kmalloc(BCH_NAME_MAX, GFP_KERNEL);
if (!kname)
return -ENOMEM;


@ -6,19 +6,21 @@
/* bcachefs inode flags -> vfs inode flags: */
static const __maybe_unused unsigned bch_flags_to_vfs[] = {
[__BCH_INODE_sync] = S_SYNC,
[__BCH_INODE_immutable] = S_IMMUTABLE,
[__BCH_INODE_append] = S_APPEND,
[__BCH_INODE_noatime] = S_NOATIME,
[__BCH_INODE_sync] = S_SYNC,
[__BCH_INODE_immutable] = S_IMMUTABLE,
[__BCH_INODE_append] = S_APPEND,
[__BCH_INODE_noatime] = S_NOATIME,
[__BCH_INODE_casefolded] = S_CASEFOLD,
};
/* bcachefs inode flags -> FS_IOC_GETFLAGS: */
static const __maybe_unused unsigned bch_flags_to_uflags[] = {
[__BCH_INODE_sync] = FS_SYNC_FL,
[__BCH_INODE_immutable] = FS_IMMUTABLE_FL,
[__BCH_INODE_append] = FS_APPEND_FL,
[__BCH_INODE_nodump] = FS_NODUMP_FL,
[__BCH_INODE_noatime] = FS_NOATIME_FL,
[__BCH_INODE_sync] = FS_SYNC_FL,
[__BCH_INODE_immutable] = FS_IMMUTABLE_FL,
[__BCH_INODE_append] = FS_APPEND_FL,
[__BCH_INODE_nodump] = FS_NODUMP_FL,
[__BCH_INODE_noatime] = FS_NOATIME_FL,
[__BCH_INODE_casefolded] = FS_CASEFOLD_FL,
};
/* bcachefs inode flags -> FS_IOC_FSGETXATTR: */


@ -11,7 +11,6 @@
#include "errcode.h"
#include "extents.h"
#include "fs.h"
#include "fs-common.h"
#include "fs-io.h"
#include "fs-ioctl.h"
#include "fs-io-buffered.h"
@ -22,6 +21,7 @@
#include "io_read.h"
#include "journal.h"
#include "keylist.h"
#include "namei.h"
#include "quota.h"
#include "rebalance.h"
#include "snapshot.h"
@ -641,7 +641,9 @@ static struct bch_inode_info *bch2_lookup_trans(struct btree_trans *trans,
if (ret)
return ERR_PTR(ret);
ret = bch2_dirent_read_target(trans, dir, bkey_s_c_to_dirent(k), &inum);
struct bkey_s_c_dirent d = bkey_s_c_to_dirent(k);
ret = bch2_dirent_read_target(trans, dir, d, &inum);
if (ret > 0)
ret = -ENOENT;
if (ret)
@ -651,30 +653,30 @@ static struct bch_inode_info *bch2_lookup_trans(struct btree_trans *trans,
if (inode)
goto out;
/*
* Note: if check/repair needs it, we commit before
* bch2_inode_hash_init_insert(), as after that point we can't take a
* restart - not in the top level loop with a commit_do(), like we
* usually do:
*/
struct bch_subvolume subvol;
struct bch_inode_unpacked inode_u;
ret = bch2_subvolume_get(trans, inum.subvol, true, &subvol) ?:
bch2_inode_find_by_inum_nowarn_trans(trans, inum, &inode_u) ?:
bch2_check_dirent_target(trans, &dirent_iter, d, &inode_u, false) ?:
bch2_trans_commit(trans, NULL, NULL, BCH_TRANS_COMMIT_no_enospc) ?:
PTR_ERR_OR_ZERO(inode = bch2_inode_hash_init_insert(trans, inum, &inode_u, &subvol));
/*
* don't remove it: check_inodes might find another inode that points
* back to this dirent
*/
bch2_fs_inconsistent_on(bch2_err_matches(ret, ENOENT),
c, "dirent to missing inode:\n %s",
(bch2_bkey_val_to_text(&buf, c, k), buf.buf));
(bch2_bkey_val_to_text(&buf, c, d.s_c), buf.buf));
if (ret)
goto err;
/* regular files may have hardlinks: */
if (bch2_fs_inconsistent_on(bch2_inode_should_have_single_bp(&inode_u) &&
!bkey_eq(k.k->p, POS(inode_u.bi_dir, inode_u.bi_dir_offset)),
c,
"dirent points to inode that does not point back:\n %s",
(bch2_bkey_val_to_text(&buf, c, k),
prt_printf(&buf, "\n "),
bch2_inode_unpacked_to_text(&buf, &inode_u),
buf.buf))) {
ret = -ENOENT;
goto err;
}
out:
bch2_trans_iter_exit(trans, &dirent_iter);
printbuf_exit(&buf);
@ -698,6 +700,23 @@ static struct dentry *bch2_lookup(struct inode *vdir, struct dentry *dentry,
if (IS_ERR(inode))
inode = NULL;
#ifdef CONFIG_UNICODE
if (!inode && IS_CASEFOLDED(vdir)) {
/*
* Do not cache a negative dentry in casefolded directories
* as it would need to be invalidated in the following situation:
* - Lookup file "blAH" in a casefolded directory
* - Creation of file "BLAH" in a casefolded directory
* - Lookup file "blAH" in a casefolded directory
* which would fail if we had a negative dentry.
*
* We should come back to this when VFS has a method to handle
* this edgecase.
*/
return NULL;
}
#endif
return d_splice_alias(&inode->v, dentry);
}
@ -1802,7 +1821,8 @@ static void bch2_vfs_inode_init(struct btree_trans *trans,
break;
}
mapping_set_large_folios(inode->v.i_mapping);
mapping_set_folio_min_order(inode->v.i_mapping,
get_order(trans->c->opts.block_size));
}
static void bch2_free_inode(struct inode *vinode)
@ -2008,44 +2028,6 @@ static struct bch_fs *bch2_path_to_fs(const char *path)
return c ?: ERR_PTR(-ENOENT);
}
static int bch2_remount(struct super_block *sb, int *flags,
struct bch_opts opts)
{
struct bch_fs *c = sb->s_fs_info;
int ret = 0;
opt_set(opts, read_only, (*flags & SB_RDONLY) != 0);
if (opts.read_only != c->opts.read_only) {
down_write(&c->state_lock);
if (opts.read_only) {
bch2_fs_read_only(c);
sb->s_flags |= SB_RDONLY;
} else {
ret = bch2_fs_read_write(c);
if (ret) {
bch_err(c, "error going rw: %i", ret);
up_write(&c->state_lock);
ret = -EINVAL;
goto err;
}
sb->s_flags &= ~SB_RDONLY;
}
c->opts.read_only = opts.read_only;
up_write(&c->state_lock);
}
if (opt_defined(opts, errors))
c->opts.errors = opts.errors;
err:
return bch2_err_class(ret);
}
static int bch2_show_devname(struct seq_file *seq, struct dentry *root)
{
struct bch_fs *c = root->d_sb->s_fs_info;
@ -2192,6 +2174,9 @@ static int bch2_fs_get_tree(struct fs_context *fc)
if (ret)
goto err;
if (opt_defined(opts, discard))
set_bit(BCH_FS_discard_mount_opt_set, &c->flags);
/* Some options can't be parsed until after the fs is started: */
opts = bch2_opts_empty();
ret = bch2_parse_mount_opts(c, &opts, NULL, opts_parse->parse_later.buf);
@ -2200,9 +2185,10 @@ static int bch2_fs_get_tree(struct fs_context *fc)
bch2_opts_apply(&c->opts, opts);
ret = bch2_fs_start(c);
if (ret)
goto err_stop_fs;
/*
* need to initialise sb and set c->vfs_sb _before_ starting fs,
* for blk_holder_ops
*/
sb = sget(fc->fs_type, NULL, bch2_set_super, fc->sb_flags|SB_NOSEC, c);
ret = PTR_ERR_OR_ZERO(sb);
@ -2264,6 +2250,10 @@ got_sb:
sb->s_shrink->seeks = 0;
ret = bch2_fs_start(c);
if (ret)
goto err_put_super;
vinode = bch2_vfs_inode_get(c, BCACHEFS_ROOT_SUBVOL_INUM);
ret = PTR_ERR_OR_ZERO(vinode);
bch_err_msg(c, ret, "mounting: error getting root inode");
@ -2351,8 +2341,39 @@ static int bch2_fs_reconfigure(struct fs_context *fc)
{
struct super_block *sb = fc->root->d_sb;
struct bch2_opts_parse *opts = fc->fs_private;
struct bch_fs *c = sb->s_fs_info;
int ret = 0;
return bch2_remount(sb, &fc->sb_flags, opts->opts);
opt_set(opts->opts, read_only, (fc->sb_flags & SB_RDONLY) != 0);
if (opts->opts.read_only != c->opts.read_only) {
down_write(&c->state_lock);
if (opts->opts.read_only) {
bch2_fs_read_only(c);
sb->s_flags |= SB_RDONLY;
} else {
ret = bch2_fs_read_write(c);
if (ret) {
bch_err(c, "error going rw: %i", ret);
up_write(&c->state_lock);
ret = -EINVAL;
goto err;
}
sb->s_flags &= ~SB_RDONLY;
}
c->opts.read_only = opts->opts.read_only;
up_write(&c->state_lock);
}
if (opt_defined(opts->opts, errors))
c->opts.errors = opts->opts.errors;
err:
return bch2_err_class(ret);
}
static const struct fs_context_operations bch2_context_ops = {


@ -10,10 +10,10 @@
#include "dirent.h"
#include "error.h"
#include "fs.h"
#include "fs-common.h"
#include "fsck.h"
#include "inode.h"
#include "keylist.h"
#include "namei.h"
#include "recovery_passes.h"
#include "snapshot.h"
#include "super.h"
@ -23,13 +23,6 @@
#include <linux/bsearch.h>
#include <linux/dcache.h> /* struct qstr */
static bool inode_points_to_dirent(struct bch_inode_unpacked *inode,
struct bkey_s_c_dirent d)
{
return inode->bi_dir == d.k->p.inode &&
inode->bi_dir_offset == d.k->p.offset;
}
static int dirent_points_to_inode_nowarn(struct bkey_s_c_dirent d,
struct bch_inode_unpacked *inode)
{
@ -116,29 +109,6 @@ static int subvol_lookup(struct btree_trans *trans, u32 subvol,
return ret;
}
static int lookup_first_inode(struct btree_trans *trans, u64 inode_nr,
struct bch_inode_unpacked *inode)
{
struct btree_iter iter;
struct bkey_s_c k;
int ret;
for_each_btree_key_norestart(trans, iter, BTREE_ID_inodes, POS(0, inode_nr),
BTREE_ITER_all_snapshots, k, ret) {
if (k.k->p.offset != inode_nr)
break;
if (!bkey_is_inode(k.k))
continue;
ret = bch2_inode_unpack(k, inode);
goto found;
}
ret = -BCH_ERR_ENOENT_inode;
found:
bch_err_msg(trans->c, ret, "fetching inode %llu", inode_nr);
bch2_trans_iter_exit(trans, &iter);
return ret;
}
static int lookup_inode(struct btree_trans *trans, u64 inode_nr, u32 snapshot,
struct bch_inode_unpacked *inode)
{
@ -179,32 +149,6 @@ static int lookup_dirent_in_snapshot(struct btree_trans *trans,
return 0;
}
static int __remove_dirent(struct btree_trans *trans, struct bpos pos)
{
struct bch_fs *c = trans->c;
struct btree_iter iter;
struct bch_inode_unpacked dir_inode;
struct bch_hash_info dir_hash_info;
int ret;
ret = lookup_first_inode(trans, pos.inode, &dir_inode);
if (ret)
goto err;
dir_hash_info = bch2_hash_info_init(c, &dir_inode);
bch2_trans_iter_init(trans, &iter, BTREE_ID_dirents, pos, BTREE_ITER_intent);
ret = bch2_btree_iter_traverse(&iter) ?:
bch2_hash_delete_at(trans, bch2_dirent_hash_desc,
&dir_hash_info, &iter,
BTREE_UPDATE_internal_snapshot_node);
bch2_trans_iter_exit(trans, &iter);
err:
bch_err_fn(c, ret);
return ret;
}
/*
* Find any subvolume associated with a tree of snapshots
* We can't rely on master_subvol - it might have been deleted.
@ -548,7 +492,7 @@ static int remove_backpointer(struct btree_trans *trans,
SPOS(inode->bi_dir, inode->bi_dir_offset, inode->bi_snapshot));
int ret = bkey_err(d) ?:
dirent_points_to_inode(c, d, inode) ?:
__remove_dirent(trans, d.k->p);
bch2_fsck_remove_dirent(trans, d.k->p);
bch2_trans_iter_exit(trans, &iter);
return ret;
}
@ -1985,169 +1929,6 @@ static int check_subdir_dirents_count(struct btree_trans *trans, struct inode_wa
trans_was_restarted(trans, restart_count);
}
noinline_for_stack
static int check_dirent_inode_dirent(struct btree_trans *trans,
struct btree_iter *iter,
struct bkey_s_c_dirent d,
struct bch_inode_unpacked *target)
{
struct bch_fs *c = trans->c;
struct printbuf buf = PRINTBUF;
struct btree_iter bp_iter = { NULL };
int ret = 0;
if (inode_points_to_dirent(target, d))
return 0;
if (!target->bi_dir &&
!target->bi_dir_offset) {
fsck_err_on(S_ISDIR(target->bi_mode),
trans, inode_dir_missing_backpointer,
"directory with missing backpointer\n%s",
(printbuf_reset(&buf),
bch2_bkey_val_to_text(&buf, c, d.s_c),
prt_printf(&buf, "\n"),
bch2_inode_unpacked_to_text(&buf, target),
buf.buf));
fsck_err_on(target->bi_flags & BCH_INODE_unlinked,
trans, inode_unlinked_but_has_dirent,
"inode unlinked but has dirent\n%s",
(printbuf_reset(&buf),
bch2_bkey_val_to_text(&buf, c, d.s_c),
prt_printf(&buf, "\n"),
bch2_inode_unpacked_to_text(&buf, target),
buf.buf));
target->bi_flags &= ~BCH_INODE_unlinked;
target->bi_dir = d.k->p.inode;
target->bi_dir_offset = d.k->p.offset;
return __bch2_fsck_write_inode(trans, target);
}
if (bch2_inode_should_have_single_bp(target) &&
!fsck_err(trans, inode_wrong_backpointer,
"dirent points to inode that does not point back:\n %s",
(bch2_bkey_val_to_text(&buf, c, d.s_c),
prt_printf(&buf, "\n "),
bch2_inode_unpacked_to_text(&buf, target),
buf.buf)))
goto err;
struct bkey_s_c_dirent bp_dirent = dirent_get_by_pos(trans, &bp_iter,
SPOS(target->bi_dir, target->bi_dir_offset, target->bi_snapshot));
ret = bkey_err(bp_dirent);
if (ret && !bch2_err_matches(ret, ENOENT))
goto err;
bool backpointer_exists = !ret;
ret = 0;
if (fsck_err_on(!backpointer_exists,
trans, inode_wrong_backpointer,
"inode %llu:%u has wrong backpointer:\n"
"got %llu:%llu\n"
"should be %llu:%llu",
target->bi_inum, target->bi_snapshot,
target->bi_dir,
target->bi_dir_offset,
d.k->p.inode,
d.k->p.offset)) {
target->bi_dir = d.k->p.inode;
target->bi_dir_offset = d.k->p.offset;
ret = __bch2_fsck_write_inode(trans, target);
goto out;
}
bch2_bkey_val_to_text(&buf, c, d.s_c);
prt_newline(&buf);
if (backpointer_exists)
bch2_bkey_val_to_text(&buf, c, bp_dirent.s_c);
if (fsck_err_on(backpointer_exists &&
(S_ISDIR(target->bi_mode) ||
target->bi_subvol),
trans, inode_dir_multiple_links,
"%s %llu:%u with multiple links\n%s",
S_ISDIR(target->bi_mode) ? "directory" : "subvolume",
target->bi_inum, target->bi_snapshot, buf.buf)) {
ret = __remove_dirent(trans, d.k->p);
goto out;
}
/*
* hardlinked file with nlink 0:
* We're just adjusting nlink here so check_nlinks() will pick
* it up, it ignores inodes with nlink 0
*/
if (fsck_err_on(backpointer_exists && !target->bi_nlink,
trans, inode_multiple_links_but_nlink_0,
"inode %llu:%u type %s has multiple links but i_nlink 0\n%s",
target->bi_inum, target->bi_snapshot, bch2_d_types[d.v->d_type], buf.buf)) {
target->bi_nlink++;
target->bi_flags &= ~BCH_INODE_unlinked;
ret = __bch2_fsck_write_inode(trans, target);
if (ret)
goto err;
}
out:
err:
fsck_err:
bch2_trans_iter_exit(trans, &bp_iter);
printbuf_exit(&buf);
bch_err_fn(c, ret);
return ret;
}
noinline_for_stack
static int check_dirent_target(struct btree_trans *trans,
struct btree_iter *iter,
struct bkey_s_c_dirent d,
struct bch_inode_unpacked *target)
{
struct bch_fs *c = trans->c;
struct bkey_i_dirent *n;
struct printbuf buf = PRINTBUF;
int ret = 0;
ret = check_dirent_inode_dirent(trans, iter, d, target);
if (ret)
goto err;
if (fsck_err_on(d.v->d_type != inode_d_type(target),
trans, dirent_d_type_wrong,
"incorrect d_type: got %s, should be %s:\n%s",
bch2_d_type_str(d.v->d_type),
bch2_d_type_str(inode_d_type(target)),
(printbuf_reset(&buf),
bch2_bkey_val_to_text(&buf, c, d.s_c), buf.buf))) {
n = bch2_trans_kmalloc(trans, bkey_bytes(d.k));
ret = PTR_ERR_OR_ZERO(n);
if (ret)
goto err;
bkey_reassemble(&n->k_i, d.s_c);
n->v.d_type = inode_d_type(target);
if (n->v.d_type == DT_SUBVOL) {
n->v.d_parent_subvol = cpu_to_le32(target->bi_parent_subvol);
n->v.d_child_subvol = cpu_to_le32(target->bi_subvol);
} else {
n->v.d_inum = cpu_to_le64(target->bi_inum);
}
ret = bch2_trans_update(trans, iter, &n->k_i, 0);
if (ret)
goto err;
d = dirent_i_to_s_c(n);
}
err:
fsck_err:
printbuf_exit(&buf);
bch_err_fn(c, ret);
return ret;
}
/* find a subvolume that's a descendent of @snapshot: */
static int find_snapshot_subvol(struct btree_trans *trans, u32 snapshot, u32 *subvolid)
{
@ -2247,7 +2028,7 @@ static int check_dirent_to_subvol(struct btree_trans *trans, struct btree_iter *
if (fsck_err(trans, dirent_to_missing_subvol,
"dirent points to missing subvolume\n%s",
(bch2_bkey_val_to_text(&buf, c, d.s_c), buf.buf)))
return __remove_dirent(trans, d.k->p);
return bch2_fsck_remove_dirent(trans, d.k->p);
ret = 0;
goto out;
}
@ -2291,7 +2072,7 @@ static int check_dirent_to_subvol(struct btree_trans *trans, struct btree_iter *
goto err;
}
ret = check_dirent_target(trans, iter, d, &subvol_root);
ret = bch2_check_dirent_target(trans, iter, d, &subvol_root, true);
if (ret)
goto err;
out:
@ -2378,13 +2159,13 @@ static int check_dirent(struct btree_trans *trans, struct btree_iter *iter,
(printbuf_reset(&buf),
bch2_bkey_val_to_text(&buf, c, k),
buf.buf))) {
ret = __remove_dirent(trans, d.k->p);
ret = bch2_fsck_remove_dirent(trans, d.k->p);
if (ret)
goto err;
}
darray_for_each(target->inodes, i) {
ret = check_dirent_target(trans, iter, d, &i->inode);
ret = bch2_check_dirent_target(trans, iter, d, &i->inode, true);
if (ret)
goto err;
}


@ -731,10 +731,9 @@ int bch2_trigger_inode(struct btree_trans *trans,
bkey_s_to_inode_v3(new).v->bi_journal_seq = cpu_to_le64(trans->journal_res.seq);
}
s64 nr = bkey_is_inode(new.k) - bkey_is_inode(old.k);
if ((flags & (BTREE_TRIGGER_transactional|BTREE_TRIGGER_gc)) && nr) {
struct disk_accounting_pos acc = { .type = BCH_DISK_ACCOUNTING_nr_inodes };
int ret = bch2_disk_accounting_mod(trans, &acc, &nr, 1, flags & BTREE_TRIGGER_gc);
s64 nr[1] = { bkey_is_inode(new.k) - bkey_is_inode(old.k) };
if ((flags & (BTREE_TRIGGER_transactional|BTREE_TRIGGER_gc)) && nr[0]) {
int ret = bch2_disk_accounting_mod2(trans, flags & BTREE_TRIGGER_gc, nr, nr_inodes);
if (ret)
return ret;
}
@ -868,19 +867,6 @@ void bch2_inode_init(struct bch_fs *c, struct bch_inode_unpacked *inode_u,
uid, gid, mode, rdev, parent);
}
static inline u32 bkey_generation(struct bkey_s_c k)
{
switch (k.k->type) {
case KEY_TYPE_inode:
case KEY_TYPE_inode_v2:
BUG();
case KEY_TYPE_inode_generation:
return le32_to_cpu(bkey_s_c_to_inode_generation(k).v->bi_generation);
default:
return 0;
}
}
static struct bkey_i_inode_alloc_cursor *
bch2_inode_alloc_cursor_get(struct btree_trans *trans, u64 cpu, u64 *min, u64 *max)
{
@ -1092,7 +1078,7 @@ retry:
bch2_fs_inconsistent(c,
"inode %llu:%u not found when deleting",
inum.inum, snapshot);
ret = -EIO;
ret = -BCH_ERR_ENOENT_inode;
goto err;
}
@ -1256,7 +1242,7 @@ retry:
bch2_fs_inconsistent(c,
"inode %llu:%u not found when deleting",
inum, snapshot);
ret = -EIO;
ret = -BCH_ERR_ENOENT_inode;
goto err;
}


@ -277,6 +277,7 @@ static inline bool bch2_inode_should_have_single_bp(struct bch_inode_unpacked *i
bool inode_has_bp = inode->bi_dir || inode->bi_dir_offset;
return S_ISDIR(inode->bi_mode) ||
inode->bi_subvol ||
(!inode->bi_nlink && inode_has_bp);
}


@ -137,7 +137,8 @@ enum inode_opt_id {
x(i_sectors_dirty, 6) \
x(unlinked, 7) \
x(backptr_untrusted, 8) \
x(has_child_snapshot, 9)
x(has_child_snapshot, 9) \
x(casefolded, 10)
/* bits 20+ reserved for packed fields below: */


@ -115,7 +115,8 @@ err:
bch2_increment_clock(c, sectors_allocated, WRITE);
if (should_print_err(ret)) {
struct printbuf buf = PRINTBUF;
bch2_inum_offset_err_msg_trans(trans, &buf, inum, iter->pos.offset << 9);
lockrestart_do(trans,
bch2_inum_offset_err_msg_trans(trans, &buf, inum, iter->pos.offset << 9));
prt_printf(&buf, "fallocate error: %s", bch2_err_str(ret));
bch_err_ratelimited(c, "%s", buf.buf);
printbuf_exit(&buf);

(file diff suppressed because it is too large)


@ -3,6 +3,7 @@
#define _BCACHEFS_IO_READ_H
#include "bkey_buf.h"
#include "btree_iter.h"
#include "reflink.h"
struct bch_read_bio {
@ -35,19 +36,18 @@ struct bch_read_bio {
u16 flags;
union {
struct {
u16 bounce:1,
u16 data_update:1,
promote:1,
bounce:1,
split:1,
kmalloc:1,
have_ioref:1,
narrow_crcs:1,
hole:1,
retry:2,
saw_error:1,
context:2;
};
u16 _state;
};
struct bch_devs_list devs_have;
s16 ret;
struct extent_ptr_decoded pick;
@ -65,8 +65,6 @@ struct bch_read_bio {
struct bpos data_pos;
struct bversion version;
struct promote_op *promote;
struct bch_io_opts opts;
struct work_struct work;
@ -108,23 +106,31 @@ static inline int bch2_read_indirect_extent(struct btree_trans *trans,
return 0;
}
enum bch_read_flags {
BCH_READ_RETRY_IF_STALE = 1 << 0,
BCH_READ_MAY_PROMOTE = 1 << 1,
BCH_READ_USER_MAPPED = 1 << 2,
BCH_READ_NODECODE = 1 << 3,
BCH_READ_LAST_FRAGMENT = 1 << 4,
#define BCH_READ_FLAGS() \
x(retry_if_stale) \
x(may_promote) \
x(user_mapped) \
x(last_fragment) \
x(must_bounce) \
x(must_clone) \
x(in_retry)
/* internal: */
BCH_READ_MUST_BOUNCE = 1 << 5,
BCH_READ_MUST_CLONE = 1 << 6,
BCH_READ_IN_RETRY = 1 << 7,
enum __bch_read_flags {
#define x(n) __BCH_READ_##n,
BCH_READ_FLAGS()
#undef x
};
enum bch_read_flags {
#define x(n) BCH_READ_##n = BIT(__BCH_READ_##n),
BCH_READ_FLAGS()
#undef x
};
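
For clarity, an illustration (not part of the patch) of what these x-macros expand to for the first few entries:

	/*
	 *	__BCH_READ_retry_if_stale = 0,	BCH_READ_retry_if_stale = BIT(0),
	 *	__BCH_READ_may_promote    = 1,	BCH_READ_may_promote    = BIT(1),
	 *	__BCH_READ_user_mapped    = 2,	BCH_READ_user_mapped    = BIT(2),
	 *	...
	 *
	 * keeping the flag list in one place while deriving both the bit
	 * indices and the masks from it.
	 */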
int __bch2_read_extent(struct btree_trans *, struct bch_read_bio *,
struct bvec_iter, struct bpos, enum btree_id,
struct bkey_s_c, unsigned,
struct bch_io_failures *, unsigned);
struct bch_io_failures *, unsigned, int);
static inline void bch2_read_extent(struct btree_trans *trans,
struct bch_read_bio *rbio, struct bpos read_pos,
@ -132,37 +138,55 @@ static inline void bch2_read_extent(struct btree_trans *trans,
unsigned offset_into_extent, unsigned flags)
{
__bch2_read_extent(trans, rbio, rbio->bio.bi_iter, read_pos,
data_btree, k, offset_into_extent, NULL, flags);
data_btree, k, offset_into_extent, NULL, flags, -1);
}
void __bch2_read(struct bch_fs *, struct bch_read_bio *, struct bvec_iter,
subvol_inum, struct bch_io_failures *, unsigned flags);
int __bch2_read(struct btree_trans *, struct bch_read_bio *, struct bvec_iter,
subvol_inum, struct bch_io_failures *, unsigned flags);
static inline void bch2_read(struct bch_fs *c, struct bch_read_bio *rbio,
subvol_inum inum)
{
struct bch_io_failures failed = { .nr = 0 };
BUG_ON(rbio->_state);
rbio->c = c;
rbio->start_time = local_clock();
rbio->subvol = inum.subvol;
__bch2_read(c, rbio, rbio->bio.bi_iter, inum, &failed,
BCH_READ_RETRY_IF_STALE|
BCH_READ_MAY_PROMOTE|
BCH_READ_USER_MAPPED);
bch2_trans_run(c,
__bch2_read(trans, rbio, rbio->bio.bi_iter, inum, NULL,
BCH_READ_retry_if_stale|
BCH_READ_may_promote|
BCH_READ_user_mapped));
}
static inline struct bch_read_bio *rbio_init(struct bio *bio,
struct bch_io_opts opts)
static inline struct bch_read_bio *rbio_init_fragment(struct bio *bio,
struct bch_read_bio *orig)
{
struct bch_read_bio *rbio = to_rbio(bio);
rbio->_state = 0;
rbio->promote = NULL;
rbio->opts = opts;
rbio->c = orig->c;
rbio->_state = 0;
rbio->flags = 0;
rbio->ret = 0;
rbio->split = true;
rbio->parent = orig;
rbio->opts = orig->opts;
return rbio;
}
static inline struct bch_read_bio *rbio_init(struct bio *bio,
struct bch_fs *c,
struct bch_io_opts opts,
bio_end_io_t end_io)
{
struct bch_read_bio *rbio = to_rbio(bio);
rbio->start_time = local_clock();
rbio->c = c;
rbio->_state = 0;
rbio->flags = 0;
rbio->ret = 0;
rbio->opts = opts;
rbio->bio.bi_end_io = end_io;
return rbio;
}


@ -34,6 +34,12 @@
#include <linux/random.h>
#include <linux/sched/mm.h>
#ifdef CONFIG_BCACHEFS_DEBUG
static unsigned bch2_write_corrupt_ratio;
module_param_named(write_corrupt_ratio, bch2_write_corrupt_ratio, uint, 0644);
MODULE_PARM_DESC(write_corrupt_ratio, "");
#endif
#ifndef CONFIG_BCACHEFS_NO_LATENCY_ACCT
static inline void bch2_congested_acct(struct bch_dev *ca, u64 io_latency,
@ -374,7 +380,7 @@ static int bch2_write_index_default(struct bch_write_op *op)
bch2_extent_update(trans, inum, &iter, sk.k,
&op->res,
op->new_i_size, &op->i_sectors_delta,
op->flags & BCH_WRITE_CHECK_ENOSPC);
op->flags & BCH_WRITE_check_enospc);
bch2_trans_iter_exit(trans, &iter);
if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
@ -396,29 +402,42 @@ static int bch2_write_index_default(struct bch_write_op *op)
/* Writes */
static void __bch2_write_op_error(struct printbuf *out, struct bch_write_op *op,
u64 offset)
void bch2_write_op_error(struct bch_write_op *op, u64 offset, const char *fmt, ...)
{
bch2_inum_offset_err_msg(op->c, out,
(subvol_inum) { op->subvol, op->pos.inode, },
offset << 9);
prt_printf(out, "write error%s: ",
op->flags & BCH_WRITE_MOVE ? "(internal move)" : "");
struct printbuf buf = PRINTBUF;
if (op->subvol) {
bch2_inum_offset_err_msg(op->c, &buf,
(subvol_inum) { op->subvol, op->pos.inode, },
offset << 9);
} else {
struct bpos pos = op->pos;
pos.offset = offset;
bch2_inum_snap_offset_err_msg(op->c, &buf, pos);
}
prt_str(&buf, "write error: ");
va_list args;
va_start(args, fmt);
prt_vprintf(&buf, fmt, args);
va_end(args);
if (op->flags & BCH_WRITE_move) {
struct data_update *u = container_of(op, struct data_update, op);
prt_printf(&buf, "\n from internal move ");
bch2_bkey_val_to_text(&buf, op->c, bkey_i_to_s_c(u->k.k));
}
bch_err_ratelimited(op->c, "%s", buf.buf);
printbuf_exit(&buf);
}
void bch2_write_op_error(struct printbuf *out, struct bch_write_op *op)
static void bch2_write_csum_err_msg(struct bch_write_op *op)
{
__bch2_write_op_error(out, op, op->pos.offset);
}
static void bch2_write_op_error_trans(struct btree_trans *trans, struct printbuf *out,
struct bch_write_op *op, u64 offset)
{
bch2_inum_offset_err_msg_trans(trans, out,
(subvol_inum) { op->subvol, op->pos.inode, },
offset << 9);
prt_printf(out, "write error%s: ",
op->flags & BCH_WRITE_MOVE ? "(internal move)" : "");
bch2_write_op_error(op, op->pos.offset,
"error verifying existing checksum while rewriting existing data (memory corruption?)");
}
void bch2_submit_wbio_replicas(struct bch_write_bio *wbio, struct bch_fs *c,
@ -493,7 +512,7 @@ static void bch2_write_done(struct closure *cl)
bch2_time_stats_update(&c->times[BCH_TIME_data_write], op->start_time);
bch2_disk_reservation_put(c, &op->res);
if (!(op->flags & BCH_WRITE_MOVE))
if (!(op->flags & BCH_WRITE_move))
bch2_write_ref_put(c, BCH_WRITE_REF_write);
bch2_keylist_free(&op->insert_keys, op->inline_keys);
@ -516,7 +535,7 @@ static noinline int bch2_write_drop_io_error_ptrs(struct bch_write_op *op)
test_bit(ptr->dev, op->failed.d));
if (!bch2_bkey_nr_ptrs(bkey_i_to_s_c(src)))
return -EIO;
return -BCH_ERR_data_write_io;
}
if (dst != src)
@ -539,7 +558,7 @@ static void __bch2_write_index(struct bch_write_op *op)
unsigned dev;
int ret = 0;
if (unlikely(op->flags & BCH_WRITE_IO_ERROR)) {
if (unlikely(op->flags & BCH_WRITE_io_error)) {
ret = bch2_write_drop_io_error_ptrs(op);
if (ret)
goto err;
@ -548,7 +567,7 @@ static void __bch2_write_index(struct bch_write_op *op)
if (!bch2_keylist_empty(keys)) {
u64 sectors_start = keylist_sectors(keys);
ret = !(op->flags & BCH_WRITE_MOVE)
ret = !(op->flags & BCH_WRITE_move)
? bch2_write_index_default(op)
: bch2_data_update_index_update(op);
@ -560,11 +579,8 @@ static void __bch2_write_index(struct bch_write_op *op)
if (unlikely(ret && !bch2_err_matches(ret, EROFS))) {
struct bkey_i *insert = bch2_keylist_front(&op->insert_keys);
struct printbuf buf = PRINTBUF;
__bch2_write_op_error(&buf, op, bkey_start_offset(&insert->k));
prt_printf(&buf, "btree update error: %s", bch2_err_str(ret));
bch_err_ratelimited(c, "%s", buf.buf);
printbuf_exit(&buf);
bch2_write_op_error(op, bkey_start_offset(&insert->k),
"btree update error: %s", bch2_err_str(ret));
}
if (ret)
@ -573,21 +589,29 @@ static void __bch2_write_index(struct bch_write_op *op)
out:
/* If some a bucket wasn't written, we can't erasure code it: */
for_each_set_bit(dev, op->failed.d, BCH_SB_MEMBERS_MAX)
bch2_open_bucket_write_error(c, &op->open_buckets, dev);
bch2_open_bucket_write_error(c, &op->open_buckets, dev, -BCH_ERR_data_write_io);
bch2_open_buckets_put(c, &op->open_buckets);
return;
err:
keys->top = keys->keys;
op->error = ret;
op->flags |= BCH_WRITE_SUBMITTED;
op->flags |= BCH_WRITE_submitted;
goto out;
}
static inline void __wp_update_state(struct write_point *wp, enum write_point_state state)
{
if (state != wp->state) {
struct task_struct *p = current;
u64 now = ktime_get_ns();
u64 runtime = p->se.sum_exec_runtime +
(now - p->se.exec_start);
if (state == WRITE_POINT_runnable)
wp->last_runtime = runtime;
else if (wp->state == WRITE_POINT_runnable)
wp->time[WRITE_POINT_running] += runtime - wp->last_runtime;
if (wp->last_state_change &&
time_after64(now, wp->last_state_change))
@ -601,7 +625,7 @@ static inline void wp_update_state(struct write_point *wp, bool running)
{
enum write_point_state state;
state = running ? WRITE_POINT_running :
state = running ? WRITE_POINT_runnable:
!list_empty(&wp->writes) ? WRITE_POINT_waiting_io
: WRITE_POINT_stopped;
@ -615,8 +639,8 @@ static CLOSURE_CALLBACK(bch2_write_index)
struct workqueue_struct *wq = index_update_wq(op);
unsigned long flags;
if ((op->flags & BCH_WRITE_SUBMITTED) &&
(op->flags & BCH_WRITE_MOVE))
if ((op->flags & BCH_WRITE_submitted) &&
(op->flags & BCH_WRITE_move))
bch2_bio_free_pages_pool(op->c, &op->wbio.bio);
spin_lock_irqsave(&wp->writes_lock, flags);
@ -654,11 +678,11 @@ void bch2_write_point_do_index_updates(struct work_struct *work)
if (!op)
break;
op->flags |= BCH_WRITE_IN_WORKER;
op->flags |= BCH_WRITE_in_worker;
__bch2_write_index(op);
if (!(op->flags & BCH_WRITE_SUBMITTED))
if (!(op->flags & BCH_WRITE_submitted))
__bch2_write(op);
else
bch2_write_done(&op->cl);
@ -676,13 +700,17 @@ static void bch2_write_endio(struct bio *bio)
? bch2_dev_have_ref(c, wbio->dev)
: NULL;
if (bch2_dev_inum_io_err_on(bio->bi_status, ca, BCH_MEMBER_ERROR_write,
bch2_account_io_completion(ca, BCH_MEMBER_ERROR_write,
wbio->submit_time, !bio->bi_status);
if (bio->bi_status) {
bch_err_inum_offset_ratelimited(ca,
op->pos.inode,
wbio->inode_offset << 9,
"data write error: %s",
bch2_blk_status_to_str(bio->bi_status))) {
bch2_blk_status_to_str(bio->bi_status));
set_bit(wbio->dev, op->failed.d);
op->flags |= BCH_WRITE_IO_ERROR;
op->flags |= BCH_WRITE_io_error;
}
if (wbio->nocow) {
@ -692,10 +720,8 @@ static void bch2_write_endio(struct bio *bio)
set_bit(wbio->dev, op->devs_need_flush->d);
}
if (wbio->have_ioref) {
bch2_latency_acct(ca, wbio->submit_time, WRITE);
if (wbio->have_ioref)
percpu_ref_put(&ca->io_ref);
}
if (wbio->bounce)
bch2_bio_free_pages_pool(c, bio);
@ -729,7 +755,7 @@ static void init_append_extent(struct bch_write_op *op,
bch2_extent_crc_append(&e->k_i, crc);
bch2_alloc_sectors_append_ptrs_inlined(op->c, wp, &e->k_i, crc.compressed_size,
op->flags & BCH_WRITE_CACHED);
op->flags & BCH_WRITE_cached);
bch2_keylist_push(&op->insert_keys);
}
@ -789,7 +815,6 @@ static int bch2_write_rechecksum(struct bch_fs *c,
{
struct bio *bio = &op->wbio.bio;
struct bch_extent_crc_unpacked new_crc;
int ret;
/* bch2_rechecksum_bio() can't encrypt or decrypt data: */
@ -797,10 +822,10 @@ static int bch2_write_rechecksum(struct bch_fs *c,
bch2_csum_type_is_encryption(new_csum_type))
new_csum_type = op->crc.csum_type;
ret = bch2_rechecksum_bio(c, bio, op->version, op->crc,
NULL, &new_crc,
op->crc.offset, op->crc.live_size,
new_csum_type);
int ret = bch2_rechecksum_bio(c, bio, op->version, op->crc,
NULL, &new_crc,
op->crc.offset, op->crc.live_size,
new_csum_type);
if (ret)
return ret;
@ -810,44 +835,12 @@ static int bch2_write_rechecksum(struct bch_fs *c,
return 0;
}
static int bch2_write_decrypt(struct bch_write_op *op)
{
struct bch_fs *c = op->c;
struct nonce nonce = extent_nonce(op->version, op->crc);
struct bch_csum csum;
int ret;
if (!bch2_csum_type_is_encryption(op->crc.csum_type))
return 0;
/*
* If we need to decrypt data in the write path, we'll no longer be able
* to verify the existing checksum (poly1305 mac, in this case) after
* it's decrypted - this is the last point we'll be able to reverify the
* checksum:
*/
csum = bch2_checksum_bio(c, op->crc.csum_type, nonce, &op->wbio.bio);
if (bch2_crc_cmp(op->crc.csum, csum) && !c->opts.no_data_io)
return -EIO;
ret = bch2_encrypt_bio(c, op->crc.csum_type, nonce, &op->wbio.bio);
op->crc.csum_type = 0;
op->crc.csum = (struct bch_csum) { 0, 0 };
return ret;
}
static enum prep_encoded_ret {
PREP_ENCODED_OK,
PREP_ENCODED_ERR,
PREP_ENCODED_CHECKSUM_ERR,
PREP_ENCODED_DO_WRITE,
} bch2_write_prep_encoded_data(struct bch_write_op *op, struct write_point *wp)
static noinline int bch2_write_prep_encoded_data(struct bch_write_op *op, struct write_point *wp)
{
struct bch_fs *c = op->c;
struct bio *bio = &op->wbio.bio;
if (!(op->flags & BCH_WRITE_DATA_ENCODED))
return PREP_ENCODED_OK;
struct nonce nonce = extent_nonce(op->version, op->crc);
int ret = 0;
BUG_ON(bio_sectors(bio) != op->crc.compressed_size);
@ -858,12 +851,13 @@ static enum prep_encoded_ret {
(op->crc.compression_type == bch2_compression_opt_to_type(op->compression_opt) ||
op->incompressible)) {
if (!crc_is_compressed(op->crc) &&
op->csum_type != op->crc.csum_type &&
bch2_write_rechecksum(c, op, op->csum_type) &&
!c->opts.no_data_io)
return PREP_ENCODED_CHECKSUM_ERR;
op->csum_type != op->crc.csum_type) {
ret = bch2_write_rechecksum(c, op, op->csum_type);
if (ret)
return ret;
}
return PREP_ENCODED_DO_WRITE;
return 1;
}
/*
@ -871,20 +865,23 @@ static enum prep_encoded_ret {
* is, we have to decompress it:
*/
if (crc_is_compressed(op->crc)) {
struct bch_csum csum;
if (bch2_write_decrypt(op))
return PREP_ENCODED_CHECKSUM_ERR;
/* Last point we can still verify checksum: */
csum = bch2_checksum_bio(c, op->crc.csum_type,
extent_nonce(op->version, op->crc),
bio);
struct bch_csum csum = bch2_checksum_bio(c, op->crc.csum_type, nonce, bio);
if (bch2_crc_cmp(op->crc.csum, csum) && !c->opts.no_data_io)
return PREP_ENCODED_CHECKSUM_ERR;
goto csum_err;
if (bch2_bio_uncompress_inplace(op, bio))
return PREP_ENCODED_ERR;
if (bch2_csum_type_is_encryption(op->crc.csum_type)) {
ret = bch2_encrypt_bio(c, op->crc.csum_type, nonce, bio);
if (ret)
return ret;
op->crc.csum_type = 0;
op->crc.csum = (struct bch_csum) { 0, 0 };
}
ret = bch2_bio_uncompress_inplace(op, bio);
if (ret)
return ret;
}
/*
@ -896,22 +893,34 @@ static enum prep_encoded_ret {
* If the data is checksummed and we're only writing a subset,
* rechecksum and adjust bio to point to currently live data:
*/
if ((op->crc.live_size != op->crc.uncompressed_size ||
op->crc.csum_type != op->csum_type) &&
bch2_write_rechecksum(c, op, op->csum_type) &&
!c->opts.no_data_io)
return PREP_ENCODED_CHECKSUM_ERR;
if (op->crc.live_size != op->crc.uncompressed_size ||
op->crc.csum_type != op->csum_type) {
ret = bch2_write_rechecksum(c, op, op->csum_type);
if (ret)
return ret;
}
/*
* If we want to compress the data, it has to be decrypted:
*/
if ((op->compression_opt ||
bch2_csum_type_is_encryption(op->crc.csum_type) !=
bch2_csum_type_is_encryption(op->csum_type)) &&
bch2_write_decrypt(op))
return PREP_ENCODED_CHECKSUM_ERR;
if (bch2_csum_type_is_encryption(op->crc.csum_type) &&
(op->compression_opt || op->crc.csum_type != op->csum_type)) {
struct bch_csum csum = bch2_checksum_bio(c, op->crc.csum_type, nonce, bio);
if (bch2_crc_cmp(op->crc.csum, csum) && !c->opts.no_data_io)
goto csum_err;
return PREP_ENCODED_OK;
ret = bch2_encrypt_bio(c, op->crc.csum_type, nonce, bio);
if (ret)
return ret;
op->crc.csum_type = 0;
op->crc.csum = (struct bch_csum) { 0, 0 };
}
return 0;
csum_err:
bch2_write_csum_err_msg(op);
return -BCH_ERR_data_write_csum;
}
static int bch2_write_extent(struct bch_write_op *op, struct write_point *wp,
@ -930,39 +939,44 @@ static int bch2_write_extent(struct bch_write_op *op, struct write_point *wp,
ec_buf = bch2_writepoint_ec_buf(c, wp);
switch (bch2_write_prep_encoded_data(op, wp)) {
case PREP_ENCODED_OK:
break;
case PREP_ENCODED_ERR:
ret = -EIO;
goto err;
case PREP_ENCODED_CHECKSUM_ERR:
goto csum_err;
case PREP_ENCODED_DO_WRITE:
/* XXX look for bug here */
if (ec_buf) {
dst = bch2_write_bio_alloc(c, wp, src,
&page_alloc_failed,
ec_buf);
bio_copy_data(dst, src);
bounce = true;
if (unlikely(op->flags & BCH_WRITE_data_encoded)) {
ret = bch2_write_prep_encoded_data(op, wp);
if (ret < 0)
goto err;
if (ret) {
if (ec_buf) {
dst = bch2_write_bio_alloc(c, wp, src,
&page_alloc_failed,
ec_buf);
bio_copy_data(dst, src);
bounce = true;
}
init_append_extent(op, wp, op->version, op->crc);
goto do_write;
}
init_append_extent(op, wp, op->version, op->crc);
goto do_write;
}
if (ec_buf ||
op->compression_opt ||
(op->csum_type &&
!(op->flags & BCH_WRITE_PAGES_STABLE)) ||
!(op->flags & BCH_WRITE_pages_stable)) ||
(bch2_csum_type_is_encryption(op->csum_type) &&
!(op->flags & BCH_WRITE_PAGES_OWNED))) {
!(op->flags & BCH_WRITE_pages_owned))) {
dst = bch2_write_bio_alloc(c, wp, src,
&page_alloc_failed,
ec_buf);
bounce = true;
}
#ifdef CONFIG_BCACHEFS_DEBUG
unsigned write_corrupt_ratio = READ_ONCE(bch2_write_corrupt_ratio);
if (!bounce && write_corrupt_ratio) {
dst = bch2_write_bio_alloc(c, wp, src,
&page_alloc_failed,
ec_buf);
bounce = true;
}
#endif
saved_iter = dst->bi_iter;
do {
@ -976,7 +990,7 @@ static int bch2_write_extent(struct bch_write_op *op, struct write_point *wp,
break;
BUG_ON(op->compression_opt &&
(op->flags & BCH_WRITE_DATA_ENCODED) &&
(op->flags & BCH_WRITE_data_encoded) &&
bch2_csum_type_is_encryption(op->crc.csum_type));
BUG_ON(op->compression_opt && !bounce);
@ -1014,7 +1028,7 @@ static int bch2_write_extent(struct bch_write_op *op, struct write_point *wp,
}
}
if ((op->flags & BCH_WRITE_DATA_ENCODED) &&
if ((op->flags & BCH_WRITE_data_encoded) &&
!crc_is_compressed(crc) &&
bch2_csum_type_is_encryption(op->crc.csum_type) ==
bch2_csum_type_is_encryption(op->csum_type)) {
@ -1046,7 +1060,7 @@ static int bch2_write_extent(struct bch_write_op *op, struct write_point *wp,
crc.compression_type = compression_type;
crc.nonce = nonce;
} else {
if ((op->flags & BCH_WRITE_DATA_ENCODED) &&
if ((op->flags & BCH_WRITE_data_encoded) &&
bch2_rechecksum_bio(c, src, version, op->crc,
NULL, &op->crc,
src_len >> 9,
@ -1072,6 +1086,14 @@ static int bch2_write_extent(struct bch_write_op *op, struct write_point *wp,
init_append_extent(op, wp, version, crc);
#ifdef CONFIG_BCACHEFS_DEBUG
if (write_corrupt_ratio) {
swap(dst->bi_iter.bi_size, dst_len);
bch2_maybe_corrupt_bio(dst, write_corrupt_ratio);
swap(dst->bi_iter.bi_size, dst_len);
}
#endif
if (dst != src)
bio_advance(dst, dst_len);
bio_advance(src, src_len);
@ -1104,15 +1126,8 @@ do_write:
*_dst = dst;
return more;
csum_err:
{
struct printbuf buf = PRINTBUF;
bch2_write_op_error(&buf, op);
prt_printf(&buf, "error verifying existing checksum while rewriting existing data (memory corruption?)");
bch_err_ratelimited(c, "%s", buf.buf);
printbuf_exit(&buf);
}
ret = -EIO;
bch2_write_csum_err_msg(op);
ret = -BCH_ERR_data_write_csum;
err:
if (to_wbio(dst)->bounce)
bch2_bio_free_pages_pool(c, dst);
@ -1190,39 +1205,36 @@ static void bch2_nocow_write_convert_unwritten(struct bch_write_op *op)
{
struct bch_fs *c = op->c;
struct btree_trans *trans = bch2_trans_get(c);
int ret = 0;
for_each_keylist_key(&op->insert_keys, orig) {
int ret = for_each_btree_key_max_commit(trans, iter, BTREE_ID_extents,
ret = for_each_btree_key_max_commit(trans, iter, BTREE_ID_extents,
bkey_start_pos(&orig->k), orig->k.p,
BTREE_ITER_intent, k,
NULL, NULL, BCH_TRANS_COMMIT_no_enospc, ({
bch2_nocow_write_convert_one_unwritten(trans, &iter, orig, k, op->new_i_size);
}));
if (ret && !bch2_err_matches(ret, EROFS)) {
struct bkey_i *insert = bch2_keylist_front(&op->insert_keys);
struct printbuf buf = PRINTBUF;
bch2_write_op_error_trans(trans, &buf, op, bkey_start_offset(&insert->k));
prt_printf(&buf, "btree update error: %s", bch2_err_str(ret));
bch_err_ratelimited(c, "%s", buf.buf);
printbuf_exit(&buf);
}
if (ret) {
op->error = ret;
if (ret)
break;
}
}
bch2_trans_put(trans);
if (ret && !bch2_err_matches(ret, EROFS)) {
struct bkey_i *insert = bch2_keylist_front(&op->insert_keys);
bch2_write_op_error(op, bkey_start_offset(&insert->k),
"btree update error: %s", bch2_err_str(ret));
}
if (ret)
op->error = ret;
}
static void __bch2_nocow_write_done(struct bch_write_op *op)
{
if (unlikely(op->flags & BCH_WRITE_IO_ERROR)) {
op->error = -EIO;
} else if (unlikely(op->flags & BCH_WRITE_CONVERT_UNWRITTEN))
if (unlikely(op->flags & BCH_WRITE_io_error)) {
op->error = -BCH_ERR_data_write_io;
} else if (unlikely(op->flags & BCH_WRITE_convert_unwritten))
bch2_nocow_write_convert_unwritten(op);
}
@ -1251,7 +1263,7 @@ static void bch2_nocow_write(struct bch_write_op *op)
struct bucket_to_lock *stale_at;
int stale, ret;
if (op->flags & BCH_WRITE_MOVE)
if (op->flags & BCH_WRITE_move)
return;
darray_init(&buckets);
@ -1309,7 +1321,7 @@ retry:
}), GFP_KERNEL|__GFP_NOFAIL);
if (ptr->unwritten)
op->flags |= BCH_WRITE_CONVERT_UNWRITTEN;
op->flags |= BCH_WRITE_convert_unwritten;
}
/* Unlock before taking nocow locks, doing IO: */
@ -1317,7 +1329,7 @@ retry:
bch2_trans_unlock(trans);
bch2_cut_front(op->pos, op->insert_keys.top);
if (op->flags & BCH_WRITE_CONVERT_UNWRITTEN)
if (op->flags & BCH_WRITE_convert_unwritten)
bch2_cut_back(POS(op->pos.inode, op->pos.offset + bio_sectors(bio)), op->insert_keys.top);
darray_for_each(buckets, i) {
@ -1342,7 +1354,7 @@ retry:
wbio_init(bio)->put_bio = true;
bio->bi_opf = op->wbio.bio.bi_opf;
} else {
op->flags |= BCH_WRITE_SUBMITTED;
op->flags |= BCH_WRITE_submitted;
}
op->pos.offset += bio_sectors(bio);
@ -1352,11 +1364,12 @@ retry:
bio->bi_private = &op->cl;
bio->bi_opf |= REQ_OP_WRITE;
closure_get(&op->cl);
bch2_submit_wbio_replicas(to_wbio(bio), c, BCH_DATA_user,
op->insert_keys.top, true);
bch2_keylist_push(&op->insert_keys);
if (op->flags & BCH_WRITE_SUBMITTED)
if (op->flags & BCH_WRITE_submitted)
break;
bch2_btree_iter_advance(&iter);
}
@ -1370,21 +1383,18 @@ err:
darray_exit(&buckets);
if (ret) {
struct printbuf buf = PRINTBUF;
bch2_write_op_error(&buf, op);
prt_printf(&buf, "%s(): btree lookup error: %s", __func__, bch2_err_str(ret));
bch_err_ratelimited(c, "%s", buf.buf);
printbuf_exit(&buf);
bch2_write_op_error(op, op->pos.offset,
"%s(): btree lookup error: %s", __func__, bch2_err_str(ret));
op->error = ret;
op->flags |= BCH_WRITE_SUBMITTED;
op->flags |= BCH_WRITE_submitted;
}
/* fallback to cow write path? */
if (!(op->flags & BCH_WRITE_SUBMITTED)) {
if (!(op->flags & BCH_WRITE_submitted)) {
closure_sync(&op->cl);
__bch2_nocow_write_done(op);
op->insert_keys.top = op->insert_keys.keys;
} else if (op->flags & BCH_WRITE_SYNC) {
} else if (op->flags & BCH_WRITE_sync) {
closure_sync(&op->cl);
bch2_nocow_write_done(&op->cl.work);
} else {
@ -1414,7 +1424,7 @@ err_bucket_stale:
"pointer to invalid bucket in nocow path on device %llu\n %s",
stale_at->b.inode,
(bch2_bkey_val_to_text(&buf, c, k), buf.buf))) {
ret = -EIO;
ret = -BCH_ERR_data_write_invalid_ptr;
} else {
/* We can retry this: */
ret = -BCH_ERR_transaction_restart;
@ -1436,7 +1446,7 @@ static void __bch2_write(struct bch_write_op *op)
if (unlikely(op->opts.nocow && c->opts.nocow_enabled)) {
bch2_nocow_write(op);
if (op->flags & BCH_WRITE_SUBMITTED)
if (op->flags & BCH_WRITE_submitted)
goto out_nofs_restore;
}
again:
@ -1466,7 +1476,7 @@ again:
ret = bch2_trans_run(c, lockrestart_do(trans,
bch2_alloc_sectors_start_trans(trans,
op->target,
op->opts.erasure_code && !(op->flags & BCH_WRITE_CACHED),
op->opts.erasure_code && !(op->flags & BCH_WRITE_cached),
op->write_point,
&op->devs_have,
op->nr_replicas,
@ -1489,16 +1499,12 @@ again:
bch2_alloc_sectors_done_inlined(c, wp);
err:
if (ret <= 0) {
op->flags |= BCH_WRITE_SUBMITTED;
op->flags |= BCH_WRITE_submitted;
if (unlikely(ret < 0)) {
if (!(op->flags & BCH_WRITE_ALLOC_NOWAIT)) {
struct printbuf buf = PRINTBUF;
bch2_write_op_error(&buf, op);
prt_printf(&buf, "%s(): %s", __func__, bch2_err_str(ret));
bch_err_ratelimited(c, "%s", buf.buf);
printbuf_exit(&buf);
}
if (!(op->flags & BCH_WRITE_alloc_nowait))
bch2_write_op_error(op, op->pos.offset,
"%s(): %s", __func__, bch2_err_str(ret));
op->error = ret;
break;
}
@ -1524,14 +1530,14 @@ err:
* synchronously here if we weren't able to submit all of the IO at
* once, as that signals backpressure to the caller.
*/
if ((op->flags & BCH_WRITE_SYNC) ||
(!(op->flags & BCH_WRITE_SUBMITTED) &&
!(op->flags & BCH_WRITE_IN_WORKER))) {
if ((op->flags & BCH_WRITE_sync) ||
(!(op->flags & BCH_WRITE_submitted) &&
!(op->flags & BCH_WRITE_in_worker))) {
bch2_wait_on_allocator(c, &op->cl);
__bch2_write_index(op);
if (!(op->flags & BCH_WRITE_SUBMITTED))
if (!(op->flags & BCH_WRITE_submitted))
goto again;
bch2_write_done(&op->cl);
} else {
@ -1552,8 +1558,8 @@ static void bch2_write_data_inline(struct bch_write_op *op, unsigned data_len)
memset(&op->failed, 0, sizeof(op->failed));
op->flags |= BCH_WRITE_WROTE_DATA_INLINE;
op->flags |= BCH_WRITE_SUBMITTED;
op->flags |= BCH_WRITE_wrote_data_inline;
op->flags |= BCH_WRITE_submitted;
bch2_check_set_feature(op->c, BCH_FEATURE_inline_data);
@ -1616,8 +1622,8 @@ CLOSURE_CALLBACK(bch2_write)
BUG_ON(!op->write_point.v);
BUG_ON(bkey_eq(op->pos, POS_MAX));
if (op->flags & BCH_WRITE_ONLY_SPECIFIED_DEVS)
op->flags |= BCH_WRITE_ALLOC_NOWAIT;
if (op->flags & BCH_WRITE_only_specified_devs)
op->flags |= BCH_WRITE_alloc_nowait;
op->nr_replicas_required = min_t(unsigned, op->nr_replicas_required, op->nr_replicas);
op->start_time = local_clock();
@ -1625,11 +1631,8 @@ CLOSURE_CALLBACK(bch2_write)
wbio_init(bio)->put_bio = false;
if (unlikely(bio->bi_iter.bi_size & (c->opts.block_size - 1))) {
struct printbuf buf = PRINTBUF;
bch2_write_op_error(&buf, op);
prt_printf(&buf, "misaligned write");
printbuf_exit(&buf);
op->error = -EIO;
bch2_write_op_error(op, op->pos.offset, "misaligned write");
op->error = -BCH_ERR_data_write_misaligned;
goto err;
}
@ -1638,13 +1641,14 @@ CLOSURE_CALLBACK(bch2_write)
goto err;
}
if (!(op->flags & BCH_WRITE_MOVE) &&
if (!(op->flags & BCH_WRITE_move) &&
!bch2_write_ref_tryget(c, BCH_WRITE_REF_write)) {
op->error = -BCH_ERR_erofs_no_writes;
goto err;
}
this_cpu_add(c->counters[BCH_COUNTER_io_write], bio_sectors(bio));
if (!(op->flags & BCH_WRITE_move))
this_cpu_add(c->counters[BCH_COUNTER_io_write], bio_sectors(bio));
bch2_increment_clock(c, bio_sectors(bio), WRITE);
data_len = min_t(u64, bio->bi_iter.bi_size,
@ -1675,20 +1679,26 @@ static const char * const bch2_write_flags[] = {
void bch2_write_op_to_text(struct printbuf *out, struct bch_write_op *op)
{
prt_str(out, "pos: ");
if (!out->nr_tabstops)
printbuf_tabstop_push(out, 32);
prt_printf(out, "pos:\t");
bch2_bpos_to_text(out, op->pos);
prt_newline(out);
printbuf_indent_add(out, 2);
prt_str(out, "started: ");
prt_printf(out, "started:\t");
bch2_pr_time_units(out, local_clock() - op->start_time);
prt_newline(out);
prt_str(out, "flags: ");
prt_printf(out, "flags:\t");
prt_bitflags(out, bch2_write_flags, op->flags);
prt_newline(out);
prt_printf(out, "ref: %u\n", closure_nr_remaining(&op->cl));
prt_printf(out, "nr_replicas:\t%u\n", op->nr_replicas);
prt_printf(out, "nr_replicas_required:\t%u\n", op->nr_replicas_required);
prt_printf(out, "ref:\t%u\n", closure_nr_remaining(&op->cl));
printbuf_indent_sub(out, 2);
}


@ -11,33 +11,27 @@
void bch2_bio_free_pages_pool(struct bch_fs *, struct bio *);
void bch2_bio_alloc_pages_pool(struct bch_fs *, struct bio *, size_t);
#ifndef CONFIG_BCACHEFS_NO_LATENCY_ACCT
void bch2_latency_acct(struct bch_dev *, u64, int);
#else
static inline void bch2_latency_acct(struct bch_dev *ca, u64 submit_time, int rw) {}
#endif
void bch2_submit_wbio_replicas(struct bch_write_bio *, struct bch_fs *,
enum bch_data_type, const struct bkey_i *, bool);
void bch2_write_op_error(struct printbuf *out, struct bch_write_op *op);
__printf(3, 4)
void bch2_write_op_error(struct bch_write_op *op, u64, const char *, ...);
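bch2_write_op_error() now takes the offset and a printf-style format string, with __printf(3, 4) letting the compiler type-check the format arguments. A rough illustration of how such a variadic helper is usually put together (a generic vsnprintf-based sketch, not the bcachefs implementation; report_write_error and write_op_stub are invented names):

/* Generic illustration of a __printf-annotated error helper; not the
 * bcachefs implementation. */
#include <stdarg.h>
#include <stdio.h>

struct write_op_stub {		/* hypothetical stand-in for struct bch_write_op */
	unsigned long long inode;
};

__attribute__((format(printf, 3, 4)))
static void report_write_error(struct write_op_stub *op, unsigned long long offset,
			       const char *fmt, ...)
{
	char msg[256];
	va_list args;

	va_start(args, fmt);
	vsnprintf(msg, sizeof(msg), fmt, args);
	va_end(args);

	/* The real helper builds a printbuf with the op's position and
	 * rate-limits the output; here we just prefix inode:offset. */
	fprintf(stderr, "write error at %llu:%llu: %s\n", op->inode, offset, msg);
}

int main(void)
{
	struct write_op_stub op = { .inode = 42 };

	report_write_error(&op, 4096, "%s(): %s", __func__, "misaligned write");
	return 0;
}
/* end of illustrative sketch */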
#define BCH_WRITE_FLAGS() \
x(ALLOC_NOWAIT) \
x(CACHED) \
x(DATA_ENCODED) \
x(PAGES_STABLE) \
x(PAGES_OWNED) \
x(ONLY_SPECIFIED_DEVS) \
x(WROTE_DATA_INLINE) \
x(FROM_INTERNAL) \
x(CHECK_ENOSPC) \
x(SYNC) \
x(MOVE) \
x(IN_WORKER) \
x(SUBMITTED) \
x(IO_ERROR) \
x(CONVERT_UNWRITTEN)
x(alloc_nowait) \
x(cached) \
x(data_encoded) \
x(pages_stable) \
x(pages_owned) \
x(only_specified_devs) \
x(wrote_data_inline) \
x(check_enospc) \
x(sync) \
x(move) \
x(in_worker) \
x(submitted) \
x(io_error) \
x(convert_unwritten)
enum __bch_write_flags {
#define x(f) __BCH_WRITE_##f,
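The same lower-cased list is expanded once here into enum __bch_write_flags and again into the bch2_write_flags[] name table consumed by prt_bitflags() in bch2_write_op_to_text(). A self-contained sketch of that x-macro pattern, using an invented DEMO_FLAGS() list rather than the real one:

/* Sketch of the x-macro pattern used for BCH_WRITE_FLAGS(); the flag
 * names below are examples, not the real list. */
#include <stdio.h>

#define DEMO_FLAGS()	\
	x(sync)		\
	x(move)		\
	x(submitted)

enum demo_flag_bits {
#define x(f) __DEMO_##f,
	DEMO_FLAGS()
#undef x
	__DEMO_NR,
};

/* Same list expanded a second time into bit masks: */
enum {
#define x(f) DEMO_##f = 1U << __DEMO_##f,
	DEMO_FLAGS()
#undef x
};

/* ...and a third time into a string table for pretty-printing: */
static const char * const demo_flag_names[] = {
#define x(f) #f,
	DEMO_FLAGS()
#undef x
	NULL
};

static void print_bitflags(unsigned flags)
{
	for (unsigned i = 0; i < __DEMO_NR; i++)
		if (flags & (1U << i))
			printf("%s ", demo_flag_names[i]);
	printf("\n");
}

int main(void)
{
	print_bitflags(DEMO_sync | DEMO_submitted);	/* prints "sync submitted" */
	return 0;
}
/* end of illustrative sketch */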


@ -64,7 +64,7 @@ struct bch_write_op {
struct bpos pos;
struct bversion version;
/* For BCH_WRITE_DATA_ENCODED: */
/* For BCH_WRITE_data_encoded: */
struct bch_extent_crc_unpacked crc;
struct write_point_specifier write_point;


@ -20,13 +20,6 @@
#include "journal_seq_blacklist.h"
#include "trace.h"
static const char * const bch2_journal_errors[] = {
#define x(n) #n,
JOURNAL_ERRORS()
#undef x
NULL
};
static inline bool journal_seq_unwritten(struct journal *j, u64 seq)
{
return seq > j->seq_ondisk;
@ -56,11 +49,18 @@ static void bch2_journal_buf_to_text(struct printbuf *out, struct journal *j, u6
prt_printf(out, "seq:\t%llu\n", seq);
printbuf_indent_add(out, 2);
prt_printf(out, "refcount:\t%u\n", journal_state_count(s, i));
if (!buf->write_started)
prt_printf(out, "refcount:\t%u\n", journal_state_count(s, i & JOURNAL_STATE_BUF_MASK));
prt_printf(out, "size:\t");
prt_human_readable_u64(out, vstruct_bytes(buf->data));
prt_newline(out);
struct closure *cl = &buf->io;
int r = atomic_read(&cl->remaining);
prt_printf(out, "io:\t%pS r %i\n", cl->fn, r & CLOSURE_REMAINING_MASK);
if (buf->data) {
prt_printf(out, "size:\t");
prt_human_readable_u64(out, vstruct_bytes(buf->data));
prt_newline(out);
}
prt_printf(out, "expires:\t");
prt_printf(out, "%li jiffies\n", buf->expires - jiffies);
@ -87,6 +87,9 @@ static void bch2_journal_buf_to_text(struct printbuf *out, struct journal *j, u6
static void bch2_journal_bufs_to_text(struct printbuf *out, struct journal *j)
{
lockdep_assert_held(&j->lock);
out->atomic++;
if (!out->nr_tabstops)
printbuf_tabstop_push(out, 24);
@ -95,6 +98,8 @@ static void bch2_journal_bufs_to_text(struct printbuf *out, struct journal *j)
seq++)
bch2_journal_buf_to_text(out, j, seq);
prt_printf(out, "last buf %s\n", journal_entry_is_open(j) ? "open" : "closed");
--out->atomic;
}
static inline struct journal_buf *
@ -104,10 +109,8 @@ journal_seq_to_buf(struct journal *j, u64 seq)
EBUG_ON(seq > journal_cur_seq(j));
if (journal_seq_unwritten(j, seq)) {
if (journal_seq_unwritten(j, seq))
buf = j->buf + (seq & JOURNAL_BUF_MASK);
EBUG_ON(le64_to_cpu(buf->data->seq) != seq);
}
return buf;
}
@ -139,8 +142,8 @@ journal_error_check_stuck(struct journal *j, int error, unsigned flags)
bool stuck = false;
struct printbuf buf = PRINTBUF;
if (!(error == JOURNAL_ERR_journal_full ||
error == JOURNAL_ERR_journal_pin_full) ||
if (!(error == -BCH_ERR_journal_full ||
error == -BCH_ERR_journal_pin_full) ||
nr_unwritten_journal_entries(j) ||
(flags & BCH_WATERMARK_MASK) != BCH_WATERMARK_reclaim)
return stuck;
@ -167,7 +170,7 @@ journal_error_check_stuck(struct journal *j, int error, unsigned flags)
spin_unlock(&j->lock);
bch_err(c, "Journal stuck! Hava a pre-reservation but journal full (error %s)",
bch2_journal_errors[error]);
bch2_err_str(error));
bch2_journal_debug_to_text(&buf, j);
bch_err(c, "%s", buf.buf);
@ -195,7 +198,8 @@ void bch2_journal_do_writes(struct journal *j)
if (w->write_started)
continue;
if (!journal_state_count(j->reservations, idx)) {
if (!journal_state_seq_count(j, j->reservations, seq)) {
j->seq_write_started = seq;
w->write_started = true;
closure_call(&w->io, bch2_journal_write, j->wq, NULL);
}
@ -306,7 +310,7 @@ static void __journal_entry_close(struct journal *j, unsigned closed_val, bool t
bch2_journal_space_available(j);
__bch2_journal_buf_put(j, old.idx, le64_to_cpu(buf->data->seq));
__bch2_journal_buf_put(j, le64_to_cpu(buf->data->seq));
}
void bch2_journal_halt(struct journal *j)
@ -377,29 +381,41 @@ static int journal_entry_open(struct journal *j)
BUG_ON(BCH_SB_CLEAN(c->disk_sb.sb));
if (j->blocked)
return JOURNAL_ERR_blocked;
return -BCH_ERR_journal_blocked;
if (j->cur_entry_error)
return j->cur_entry_error;
if (bch2_journal_error(j))
return JOURNAL_ERR_insufficient_devices; /* -EROFS */
int ret = bch2_journal_error(j);
if (unlikely(ret))
return ret;
if (!fifo_free(&j->pin))
return JOURNAL_ERR_journal_pin_full;
return -BCH_ERR_journal_pin_full;
if (nr_unwritten_journal_entries(j) == ARRAY_SIZE(j->buf))
return JOURNAL_ERR_max_in_flight;
return -BCH_ERR_journal_max_in_flight;
if (atomic64_read(&j->seq) - j->seq_write_started == JOURNAL_STATE_BUF_NR)
return -BCH_ERR_journal_max_open;
if (journal_cur_seq(j) >= JOURNAL_SEQ_MAX) {
bch_err(c, "cannot start: journal seq overflow");
if (bch2_fs_emergency_read_only_locked(c))
bch_err(c, "fatal error - emergency read only");
return JOURNAL_ERR_insufficient_devices; /* -EROFS */
return -BCH_ERR_journal_shutdown;
}
if (!j->free_buf && !buf->data)
return -BCH_ERR_journal_buf_enomem; /* will retry after write completion frees up a buf */
BUG_ON(!j->cur_entry_sectors);
if (!buf->data) {
swap(buf->data, j->free_buf);
swap(buf->buf_size, j->free_buf_size);
}
buf->expires =
(journal_cur_seq(j) == j->flushed_seq_ondisk
? jiffies
@ -415,7 +431,7 @@ static int journal_entry_open(struct journal *j)
u64s = clamp_t(int, u64s, 0, JOURNAL_ENTRY_CLOSED_VAL - 1);
if (u64s <= (ssize_t) j->early_journal_entries.nr)
return JOURNAL_ERR_journal_full;
return -BCH_ERR_journal_full;
if (fifo_empty(&j->pin) && j->reclaim_thread)
wake_up_process(j->reclaim_thread);
@ -464,7 +480,7 @@ static int journal_entry_open(struct journal *j)
new.idx++;
BUG_ON(journal_state_count(new, new.idx));
BUG_ON(new.idx != (journal_cur_seq(j) & JOURNAL_BUF_MASK));
BUG_ON(new.idx != (journal_cur_seq(j) & JOURNAL_STATE_BUF_MASK));
journal_state_inc(&new);
@ -514,6 +530,33 @@ static void journal_write_work(struct work_struct *work)
spin_unlock(&j->lock);
}
static void journal_buf_prealloc(struct journal *j)
{
if (j->free_buf &&
j->free_buf_size >= j->buf_size_want)
return;
unsigned buf_size = j->buf_size_want;
spin_unlock(&j->lock);
void *buf = kvmalloc(buf_size, GFP_NOFS);
spin_lock(&j->lock);
if (buf &&
(!j->free_buf ||
buf_size > j->free_buf_size)) {
swap(buf, j->free_buf);
swap(buf_size, j->free_buf_size);
}
if (unlikely(buf)) {
spin_unlock(&j->lock);
/* kvfree can sleep */
kvfree(buf);
spin_lock(&j->lock);
}
}
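journal_buf_prealloc() uses a standard trick for allocating while holding a spinlock: drop the lock, do the sleeping allocation, retake the lock, install the buffer only if it is still the best candidate, and free the loser after dropping the lock again (kvfree can sleep). A userspace sketch of that pattern, with a pthread mutex standing in for j->lock and plain malloc for kvmalloc:

/* Sketch of the "allocate without the lock, swap in under the lock"
 * pattern in journal_buf_prealloc(); not kernel code. */
#include <pthread.h>
#include <stdlib.h>

struct journal_stub {
	pthread_mutex_t lock;
	void		*free_buf;
	size_t		free_buf_size;
	size_t		buf_size_want;
};

/* Called with j->lock held; may drop and retake it. */
static void buf_prealloc(struct journal_stub *j)
{
	if (j->free_buf && j->free_buf_size >= j->buf_size_want)
		return;

	size_t buf_size = j->buf_size_want;

	pthread_mutex_unlock(&j->lock);
	void *buf = malloc(buf_size);		/* sleeping allocation, lock dropped */
	pthread_mutex_lock(&j->lock);

	if (buf && (!j->free_buf || buf_size > j->free_buf_size)) {
		/* install the new buffer, keep the old one in `buf` for freeing */
		void *tmp = j->free_buf;
		j->free_buf = buf;
		buf = tmp;

		size_t tmp_size = j->free_buf_size;
		j->free_buf_size = buf_size;
		buf_size = tmp_size;
	}

	if (buf) {
		/* free the loser outside the lock, since freeing may sleep */
		pthread_mutex_unlock(&j->lock);
		free(buf);
		pthread_mutex_lock(&j->lock);
	}
}

int main(void)
{
	struct journal_stub j = { .buf_size_want = 64 << 10 };

	pthread_mutex_init(&j.lock, NULL);

	pthread_mutex_lock(&j.lock);
	buf_prealloc(&j);
	pthread_mutex_unlock(&j.lock);

	free(j.free_buf);
	pthread_mutex_destroy(&j.lock);
	return 0;
}
/* end of illustrative sketch */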
static int __journal_res_get(struct journal *j, struct journal_res *res,
unsigned flags)
{
@ -525,25 +568,28 @@ retry:
if (journal_res_get_fast(j, res, flags))
return 0;
if (bch2_journal_error(j))
return -BCH_ERR_erofs_journal_err;
ret = bch2_journal_error(j);
if (unlikely(ret))
return ret;
if (j->blocked)
return -BCH_ERR_journal_res_get_blocked;
return -BCH_ERR_journal_blocked;
if ((flags & BCH_WATERMARK_MASK) < j->watermark) {
ret = JOURNAL_ERR_journal_full;
ret = -BCH_ERR_journal_full;
can_discard = j->can_discard;
goto out;
}
if (nr_unwritten_journal_entries(j) == ARRAY_SIZE(j->buf) && !journal_entry_is_open(j)) {
ret = JOURNAL_ERR_max_in_flight;
ret = -BCH_ERR_journal_max_in_flight;
goto out;
}
spin_lock(&j->lock);
journal_buf_prealloc(j);
/*
* Recheck after taking the lock, so we don't race with another thread
* that just did journal_entry_open() and call bch2_journal_entry_close()
@ -566,25 +612,48 @@ retry:
j->buf_size_want = max(j->buf_size_want, buf->buf_size << 1);
__journal_entry_close(j, JOURNAL_ENTRY_CLOSED_VAL, false);
ret = journal_entry_open(j) ?: JOURNAL_ERR_retry;
ret = journal_entry_open(j) ?: -BCH_ERR_journal_retry_open;
unlock:
can_discard = j->can_discard;
spin_unlock(&j->lock);
out:
if (ret == JOURNAL_ERR_retry)
goto retry;
if (!ret)
if (likely(!ret))
return 0;
if (ret == -BCH_ERR_journal_retry_open)
goto retry;
if (journal_error_check_stuck(j, ret, flags))
ret = -BCH_ERR_journal_res_get_blocked;
if (ret == JOURNAL_ERR_max_in_flight &&
track_event_change(&c->times[BCH_TIME_blocked_journal_max_in_flight], true)) {
ret = -BCH_ERR_journal_stuck;
if (ret == -BCH_ERR_journal_max_in_flight &&
track_event_change(&c->times[BCH_TIME_blocked_journal_max_in_flight], true) &&
trace_journal_entry_full_enabled()) {
struct printbuf buf = PRINTBUF;
bch2_printbuf_make_room(&buf, 4096);
spin_lock(&j->lock);
prt_printf(&buf, "seq %llu\n", journal_cur_seq(j));
bch2_journal_bufs_to_text(&buf, j);
spin_unlock(&j->lock);
trace_journal_entry_full(c, buf.buf);
printbuf_exit(&buf);
count_event(c, journal_entry_full);
}
if (ret == -BCH_ERR_journal_max_open &&
track_event_change(&c->times[BCH_TIME_blocked_journal_max_open], true) &&
trace_journal_entry_full_enabled()) {
struct printbuf buf = PRINTBUF;
bch2_printbuf_make_room(&buf, 4096);
spin_lock(&j->lock);
prt_printf(&buf, "seq %llu\n", journal_cur_seq(j));
bch2_journal_bufs_to_text(&buf, j);
spin_unlock(&j->lock);
trace_journal_entry_full(c, buf.buf);
printbuf_exit(&buf);
count_event(c, journal_entry_full);
@ -594,8 +663,8 @@ out:
* Journal is full - can't rely on reclaim from work item due to
* freezing:
*/
if ((ret == JOURNAL_ERR_journal_full ||
ret == JOURNAL_ERR_journal_pin_full) &&
if ((ret == -BCH_ERR_journal_full ||
ret == -BCH_ERR_journal_pin_full) &&
!(flags & JOURNAL_RES_GET_NONBLOCK)) {
if (can_discard) {
bch2_journal_do_discards(j);
@ -608,9 +677,7 @@ out:
}
}
return ret == JOURNAL_ERR_insufficient_devices
? -BCH_ERR_erofs_journal_err
: -BCH_ERR_journal_res_get_blocked;
return ret;
}
static unsigned max_dev_latency(struct bch_fs *c)
@ -640,7 +707,7 @@ int bch2_journal_res_get_slowpath(struct journal *j, struct journal_res *res,
int ret;
if (closure_wait_event_timeout(&j->async_wait,
(ret = __journal_res_get(j, res, flags)) != -BCH_ERR_journal_res_get_blocked ||
!bch2_err_matches(ret = __journal_res_get(j, res, flags), BCH_ERR_operation_blocked) ||
(flags & JOURNAL_RES_GET_NONBLOCK),
HZ))
return ret;
@ -654,7 +721,7 @@ int bch2_journal_res_get_slowpath(struct journal *j, struct journal_res *res,
remaining_wait = max(0, remaining_wait - HZ);
if (closure_wait_event_timeout(&j->async_wait,
(ret = __journal_res_get(j, res, flags)) != -BCH_ERR_journal_res_get_blocked ||
!bch2_err_matches(ret = __journal_res_get(j, res, flags), BCH_ERR_operation_blocked) ||
(flags & JOURNAL_RES_GET_NONBLOCK),
remaining_wait))
return ret;
@ -666,7 +733,7 @@ int bch2_journal_res_get_slowpath(struct journal *j, struct journal_res *res,
printbuf_exit(&buf);
closure_wait_event(&j->async_wait,
(ret = __journal_res_get(j, res, flags)) != -BCH_ERR_journal_res_get_blocked ||
!bch2_err_matches(ret = __journal_res_get(j, res, flags), BCH_ERR_operation_blocked) ||
(flags & JOURNAL_RES_GET_NONBLOCK));
return ret;
}
@ -687,7 +754,6 @@ void bch2_journal_entry_res_resize(struct journal *j,
goto out;
j->cur_entry_u64s = max_t(int, 0, j->cur_entry_u64s - d);
smp_mb();
state = READ_ONCE(j->reservations);
if (state.cur_entry_offset < JOURNAL_ENTRY_CLOSED_VAL &&
@ -907,7 +973,7 @@ int bch2_journal_meta(struct journal *j)
struct bch_fs *c = container_of(j, struct bch_fs, journal);
if (!bch2_write_ref_tryget(c, BCH_WRITE_REF_journal))
return -EROFS;
return -BCH_ERR_erofs_no_writes;
int ret = __bch2_journal_meta(j);
bch2_write_ref_put(c, BCH_WRITE_REF_journal);
@ -951,7 +1017,8 @@ static void __bch2_journal_block(struct journal *j)
new.cur_entry_offset = JOURNAL_ENTRY_BLOCKED_VAL;
} while (!atomic64_try_cmpxchg(&j->reservations.counter, &old.v, new.v));
journal_cur_buf(j)->data->u64s = cpu_to_le32(old.cur_entry_offset);
if (old.cur_entry_offset < JOURNAL_ENTRY_BLOCKED_VAL)
journal_cur_buf(j)->data->u64s = cpu_to_le32(old.cur_entry_offset);
}
}
@ -992,7 +1059,7 @@ static struct journal_buf *__bch2_next_write_buffer_flush_journal_buf(struct jou
*blocked = true;
}
ret = journal_state_count(s, idx) > open
ret = journal_state_count(s, idx & JOURNAL_STATE_BUF_MASK) > open
? ERR_PTR(-EAGAIN)
: buf;
break;
@ -1349,6 +1416,7 @@ int bch2_fs_journal_start(struct journal *j, u64 cur_seq)
j->replay_journal_seq_end = cur_seq;
j->last_seq_ondisk = last_seq;
j->flushed_seq_ondisk = cur_seq - 1;
j->seq_write_started = cur_seq - 1;
j->seq_ondisk = cur_seq - 1;
j->pin.front = last_seq;
j->pin.back = cur_seq;
@ -1389,8 +1457,7 @@ int bch2_fs_journal_start(struct journal *j, u64 cur_seq)
set_bit(JOURNAL_running, &j->flags);
j->last_flush_write = jiffies;
j->reservations.idx = j->reservations.unwritten_idx = journal_cur_seq(j);
j->reservations.unwritten_idx++;
j->reservations.idx = journal_cur_seq(j);
c->last_bucket_seq_cleanup = journal_cur_seq(j);
@ -1443,7 +1510,7 @@ int bch2_dev_journal_init(struct bch_dev *ca, struct bch_sb *sb)
unsigned nr_bvecs = DIV_ROUND_UP(JOURNAL_ENTRY_SIZE_MAX, PAGE_SIZE);
for (unsigned i = 0; i < ARRAY_SIZE(ja->bio); i++) {
ja->bio[i] = kmalloc(struct_size(ja->bio[i], bio.bi_inline_vecs,
ja->bio[i] = kzalloc(struct_size(ja->bio[i], bio.bi_inline_vecs,
nr_bvecs), GFP_KERNEL);
if (!ja->bio[i])
return -BCH_ERR_ENOMEM_dev_journal_init;
@ -1482,6 +1549,7 @@ void bch2_fs_journal_exit(struct journal *j)
for (unsigned i = 0; i < ARRAY_SIZE(j->buf); i++)
kvfree(j->buf[i].data);
kvfree(j->free_buf);
free_fifo(&j->pin);
}
@ -1508,13 +1576,13 @@ int bch2_fs_journal_init(struct journal *j)
if (!(init_fifo(&j->pin, JOURNAL_PIN, GFP_KERNEL)))
return -BCH_ERR_ENOMEM_journal_pin_fifo;
for (unsigned i = 0; i < ARRAY_SIZE(j->buf); i++) {
j->buf[i].buf_size = JOURNAL_ENTRY_SIZE_MIN;
j->buf[i].data = kvmalloc(j->buf[i].buf_size, GFP_KERNEL);
if (!j->buf[i].data)
return -BCH_ERR_ENOMEM_journal_buf;
j->free_buf_size = j->buf_size_want = JOURNAL_ENTRY_SIZE_MIN;
j->free_buf = kvmalloc(j->free_buf_size, GFP_KERNEL);
if (!j->free_buf)
return -BCH_ERR_ENOMEM_journal_buf;
for (unsigned i = 0; i < ARRAY_SIZE(j->buf); i++)
j->buf[i].idx = i;
}
j->pin.front = j->pin.back = 1;
@ -1564,6 +1632,7 @@ void __bch2_journal_debug_to_text(struct printbuf *out, struct journal *j)
prt_printf(out, "average write size:\t");
prt_human_readable_u64(out, nr_writes ? div64_u64(j->entry_bytes_written, nr_writes) : 0);
prt_newline(out);
prt_printf(out, "free buf:\t%u\n", j->free_buf ? j->free_buf_size : 0);
prt_printf(out, "nr direct reclaim:\t%llu\n", j->nr_direct_reclaim);
prt_printf(out, "nr background reclaim:\t%llu\n", j->nr_background_reclaim);
prt_printf(out, "reclaim kicked:\t%u\n", j->reclaim_kicked);
@ -1571,7 +1640,7 @@ void __bch2_journal_debug_to_text(struct printbuf *out, struct journal *j)
? jiffies_to_msecs(j->next_reclaim - jiffies) : 0);
prt_printf(out, "blocked:\t%u\n", j->blocked);
prt_printf(out, "current entry sectors:\t%u\n", j->cur_entry_sectors);
prt_printf(out, "current entry error:\t%s\n", bch2_journal_errors[j->cur_entry_error]);
prt_printf(out, "current entry error:\t%s\n", bch2_err_str(j->cur_entry_error));
prt_printf(out, "current entry:\t");
switch (s.cur_entry_offset) {


@ -121,11 +121,6 @@ static inline void journal_wake(struct journal *j)
closure_wake_up(&j->async_wait);
}
static inline struct journal_buf *journal_cur_buf(struct journal *j)
{
return j->buf + j->reservations.idx;
}
/* Sequence number of oldest dirty journal entry */
static inline u64 journal_last_seq(struct journal *j)
@ -143,6 +138,15 @@ static inline u64 journal_last_unwritten_seq(struct journal *j)
return j->seq_ondisk + 1;
}
static inline struct journal_buf *journal_cur_buf(struct journal *j)
{
unsigned idx = (journal_cur_seq(j) &
JOURNAL_BUF_MASK &
~JOURNAL_STATE_BUF_MASK) + j->reservations.idx;
return j->buf + idx;
}
static inline int journal_state_count(union journal_res_state s, int idx)
{
switch (idx) {
@ -154,6 +158,15 @@ static inline int journal_state_count(union journal_res_state s, int idx)
BUG();
}
static inline int journal_state_seq_count(struct journal *j,
union journal_res_state s, u64 seq)
{
if (journal_cur_seq(j) - seq < JOURNAL_STATE_BUF_NR)
return journal_state_count(s, seq & JOURNAL_STATE_BUF_MASK);
else
return 0;
}
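journal_state_seq_count() maps a sequence number onto one of JOURNAL_STATE_BUF_NR reservation slots with a power-of-two mask and returns zero for anything older than the in-flight window. A small standalone sketch of that mapping (struct journal_stub and seq_count are illustrative stand-ins; the constants mirror JOURNAL_STATE_BUF_BITS):

/* Sketch of mapping journal sequence numbers onto a small ring of
 * reservation counters, as journal_state_seq_count() does. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define STATE_BUF_BITS	2
#define STATE_BUF_NR	(1U << STATE_BUF_BITS)
#define STATE_BUF_MASK	(STATE_BUF_NR - 1)

struct journal_stub {
	uint64_t	cur_seq;
	unsigned	count[STATE_BUF_NR];	/* per-slot reservation counts */
};

static unsigned seq_count(const struct journal_stub *j, uint64_t seq)
{
	/* Only the most recent STATE_BUF_NR sequence numbers have a slot;
	 * anything older no longer holds reservations. */
	if (j->cur_seq - seq < STATE_BUF_NR)
		return j->count[seq & STATE_BUF_MASK];
	return 0;
}

int main(void)
{
	struct journal_stub j = { .cur_seq = 103 };

	j.count[103 & STATE_BUF_MASK] = 2;	/* two refs on the open entry */
	j.count[101 & STATE_BUF_MASK] = 1;

	assert(seq_count(&j, 103) == 2);
	assert(seq_count(&j, 101) == 1);
	assert(seq_count(&j, 99) == 0);		/* outside the 4-entry window */

	printf("seq 103 -> slot %llu\n", (unsigned long long)(103 & STATE_BUF_MASK));
	return 0;
}
/* end of illustrative sketch */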
static inline void journal_state_inc(union journal_res_state *s)
{
s->buf0_count += s->idx == 0;
@ -193,7 +206,7 @@ bch2_journal_add_entry_noreservation(struct journal_buf *buf, size_t u64s)
static inline struct jset_entry *
journal_res_entry(struct journal *j, struct journal_res *res)
{
return vstruct_idx(j->buf[res->idx].data, res->offset);
return vstruct_idx(j->buf[res->seq & JOURNAL_BUF_MASK].data, res->offset);
}
static inline unsigned journal_entry_init(struct jset_entry *entry, unsigned type,
@ -267,8 +280,9 @@ bool bch2_journal_entry_close(struct journal *);
void bch2_journal_do_writes(struct journal *);
void bch2_journal_buf_put_final(struct journal *, u64);
static inline void __bch2_journal_buf_put(struct journal *j, unsigned idx, u64 seq)
static inline void __bch2_journal_buf_put(struct journal *j, u64 seq)
{
unsigned idx = seq & JOURNAL_STATE_BUF_MASK;
union journal_res_state s;
s = journal_state_buf_put(j, idx);
@ -276,8 +290,9 @@ static inline void __bch2_journal_buf_put(struct journal *j, unsigned idx, u64 s
bch2_journal_buf_put_final(j, seq);
}
static inline void bch2_journal_buf_put(struct journal *j, unsigned idx, u64 seq)
static inline void bch2_journal_buf_put(struct journal *j, u64 seq)
{
unsigned idx = seq & JOURNAL_STATE_BUF_MASK;
union journal_res_state s;
s = journal_state_buf_put(j, idx);
@ -306,7 +321,7 @@ static inline void bch2_journal_res_put(struct journal *j,
BCH_JSET_ENTRY_btree_keys,
0, 0, 0);
bch2_journal_buf_put(j, res->idx, res->seq);
bch2_journal_buf_put(j, res->seq);
res->ref = 0;
}
@ -335,8 +350,10 @@ static inline int journal_res_get_fast(struct journal *j,
/*
* Check if there is still room in the current journal
* entry:
* entry, smp_rmb() guarantees that reads from reservations.counter
* occur before accessing cur_entry_u64s:
*/
smp_rmb();
if (new.cur_entry_offset + res->u64s > j->cur_entry_u64s)
return 0;
@ -361,9 +378,9 @@ static inline int journal_res_get_fast(struct journal *j,
&old.v, new.v));
res->ref = true;
res->idx = old.idx;
res->offset = old.cur_entry_offset;
res->seq = le64_to_cpu(j->buf[old.idx].data->seq);
res->seq = journal_cur_seq(j);
res->seq -= (res->seq - old.idx) & JOURNAL_STATE_BUF_MASK;
return 1;
}
@ -390,6 +407,7 @@ out:
(flags & JOURNAL_RES_GET_NONBLOCK) != 0,
NULL, _THIS_IP_);
EBUG_ON(!res->ref);
BUG_ON(!res->seq);
}
return 0;
}


@ -1041,13 +1041,19 @@ reread:
bio->bi_iter.bi_sector = offset;
bch2_bio_map(bio, buf->data, sectors_read << 9);
u64 submit_time = local_clock();
ret = submit_bio_wait(bio);
kfree(bio);
if (bch2_dev_io_err_on(ret, ca, BCH_MEMBER_ERROR_read,
"journal read error: sector %llu",
offset) ||
bch2_meta_read_fault("journal")) {
if (!ret && bch2_meta_read_fault("journal"))
ret = -BCH_ERR_EIO_fault_injected;
bch2_account_io_completion(ca, BCH_MEMBER_ERROR_read,
submit_time, !ret);
if (ret) {
bch_err_dev_ratelimited(ca,
"journal read error: sector %llu", offset);
/*
* We don't error out of the recovery process
* here, since the relevant journal entry may be
@ -1110,13 +1116,16 @@ reread:
struct bch_csum csum;
csum_good = jset_csum_good(c, j, &csum);
if (bch2_dev_io_err_on(!csum_good, ca, BCH_MEMBER_ERROR_checksum,
"%s",
(printbuf_reset(&err),
prt_str(&err, "journal "),
bch2_csum_err_msg(&err, csum_type, j->csum, csum),
err.buf)))
bch2_account_io_completion(ca, BCH_MEMBER_ERROR_checksum, 0, csum_good);
if (!csum_good) {
bch_err_dev_ratelimited(ca, "%s",
(printbuf_reset(&err),
prt_str(&err, "journal "),
bch2_csum_err_msg(&err, csum_type, j->csum, csum),
err.buf));
saw_bad = true;
}
ret = bch2_encrypt(c, JSET_CSUM_TYPE(j), journal_nonce(j),
j->encrypted_start,
@ -1515,7 +1524,7 @@ static void __journal_write_alloc(struct journal *j,
* @j: journal object
* @w: journal buf (entry to be written)
*
* Returns: 0 on success, or -EROFS on failure
* Returns: 0 on success, or -BCH_ERR_insufficient_devices on failure
*/
static int journal_write_alloc(struct journal *j, struct journal_buf *w)
{
@ -1600,18 +1609,12 @@ static void journal_buf_realloc(struct journal *j, struct journal_buf *buf)
kvfree(new_buf);
}
static inline struct journal_buf *journal_last_unwritten_buf(struct journal *j)
{
return j->buf + (journal_last_unwritten_seq(j) & JOURNAL_BUF_MASK);
}
static CLOSURE_CALLBACK(journal_write_done)
{
closure_type(w, struct journal_buf, io);
struct journal *j = container_of(w, struct journal, buf[w->idx]);
struct bch_fs *c = container_of(j, struct bch_fs, journal);
struct bch_replicas_padded replicas;
union journal_res_state old, new;
u64 seq = le64_to_cpu(w->data->seq);
int err = 0;
@ -1621,12 +1624,11 @@ static CLOSURE_CALLBACK(journal_write_done)
if (!w->devs_written.nr) {
bch_err(c, "unable to write journal to sufficient devices");
err = -EIO;
err = -BCH_ERR_journal_write_err;
} else {
bch2_devlist_to_replicas(&replicas.e, BCH_DATA_journal,
w->devs_written);
if (bch2_mark_replicas(c, &replicas.e))
err = -EIO;
err = bch2_mark_replicas(c, &replicas.e);
}
if (err)
@ -1641,7 +1643,23 @@ static CLOSURE_CALLBACK(journal_write_done)
j->err_seq = seq;
w->write_done = true;
if (!j->free_buf || j->free_buf_size < w->buf_size) {
swap(j->free_buf, w->data);
swap(j->free_buf_size, w->buf_size);
}
if (w->data) {
void *buf = w->data;
w->data = NULL;
w->buf_size = 0;
spin_unlock(&j->lock);
kvfree(buf);
spin_lock(&j->lock);
}
bool completed = false;
bool do_discards = false;
for (seq = journal_last_unwritten_seq(j);
seq <= journal_cur_seq(j);
@ -1650,11 +1668,10 @@ static CLOSURE_CALLBACK(journal_write_done)
if (!w->write_done)
break;
if (!j->err_seq && !JSET_NO_FLUSH(w->data)) {
if (!j->err_seq && !w->noflush) {
j->flushed_seq_ondisk = seq;
j->last_seq_ondisk = w->last_seq;
bch2_do_discards(c);
closure_wake_up(&c->freelist_wait);
bch2_reset_alloc_cursors(c);
}
@ -1671,16 +1688,6 @@ static CLOSURE_CALLBACK(journal_write_done)
if (j->watermark != BCH_WATERMARK_stripe)
journal_reclaim_kick(&c->journal);
old.v = atomic64_read(&j->reservations.counter);
do {
new.v = old.v;
BUG_ON(journal_state_count(new, new.unwritten_idx));
BUG_ON(new.unwritten_idx != (seq & JOURNAL_BUF_MASK));
new.unwritten_idx++;
} while (!atomic64_try_cmpxchg(&j->reservations.counter,
&old.v, new.v));
closure_wake_up(&w->wait);
completed = true;
}
@ -1695,7 +1702,7 @@ static CLOSURE_CALLBACK(journal_write_done)
}
if (journal_last_unwritten_seq(j) == journal_cur_seq(j) &&
new.cur_entry_offset < JOURNAL_ENTRY_CLOSED_VAL) {
j->reservations.cur_entry_offset < JOURNAL_ENTRY_CLOSED_VAL) {
struct journal_buf *buf = journal_cur_buf(j);
long delta = buf->expires - jiffies;
@ -1715,6 +1722,9 @@ static CLOSURE_CALLBACK(journal_write_done)
*/
bch2_journal_do_writes(j);
spin_unlock(&j->lock);
if (do_discards)
bch2_do_discards(c);
}
static void journal_write_endio(struct bio *bio)
@ -1724,13 +1734,16 @@ static void journal_write_endio(struct bio *bio)
struct journal *j = &ca->fs->journal;
struct journal_buf *w = j->buf + jbio->buf_idx;
if (bch2_dev_io_err_on(bio->bi_status, ca, BCH_MEMBER_ERROR_write,
bch2_account_io_completion(ca, BCH_MEMBER_ERROR_write,
jbio->submit_time, !bio->bi_status);
if (bio->bi_status) {
bch_err_dev_ratelimited(ca,
"error writing journal entry %llu: %s",
le64_to_cpu(w->data->seq),
bch2_blk_status_to_str(bio->bi_status)) ||
bch2_meta_write_fault("journal")) {
unsigned long flags;
bch2_blk_status_to_str(bio->bi_status));
unsigned long flags;
spin_lock_irqsave(&j->err_lock, flags);
bch2_dev_list_drop_dev(&w->devs_written, ca->dev_idx);
spin_unlock_irqrestore(&j->err_lock, flags);
@ -1759,7 +1772,11 @@ static CLOSURE_CALLBACK(journal_write_submit)
sectors);
struct journal_device *ja = &ca->journal;
struct bio *bio = &ja->bio[w->idx]->bio;
struct journal_bio *jbio = ja->bio[w->idx];
struct bio *bio = &jbio->bio;
jbio->submit_time = local_clock();
bio_reset(bio, ca->disk_sb.bdev, REQ_OP_WRITE|REQ_SYNC|REQ_META);
bio->bi_iter.bi_sector = ptr->offset;
bio->bi_end_io = journal_write_endio;
@ -1791,6 +1808,10 @@ static CLOSURE_CALLBACK(journal_write_preflush)
struct journal *j = container_of(w, struct journal, buf[w->idx]);
struct bch_fs *c = container_of(j, struct bch_fs, journal);
/*
* Wait for previous journal writes to complete; they won't necessarily
* be flushed if they're still in flight
*/
if (j->seq_ondisk + 1 != le64_to_cpu(w->data->seq)) {
spin_lock(&j->lock);
if (j->seq_ondisk + 1 != le64_to_cpu(w->data->seq)) {
@ -1984,7 +2005,7 @@ static int bch2_journal_write_pick_flush(struct journal *j, struct journal_buf *
* write anything at all.
*/
if (error && test_bit(JOURNAL_need_flush_write, &j->flags))
return -EIO;
return error;
if (error ||
w->noflush ||


@ -226,7 +226,7 @@ void bch2_journal_space_available(struct journal *j)
bch_err(c, "%s", buf.buf);
printbuf_exit(&buf);
ret = JOURNAL_ERR_insufficient_devices;
ret = -BCH_ERR_insufficient_journal_devices;
goto out;
}
@ -240,7 +240,7 @@ void bch2_journal_space_available(struct journal *j)
total = j->space[journal_space_total].total;
if (!j->space[journal_space_discarded].next_entry)
ret = JOURNAL_ERR_journal_full;
ret = -BCH_ERR_journal_full;
if ((j->space[journal_space_clean_ondisk].next_entry <
j->space[journal_space_clean_ondisk].total) &&
@ -645,7 +645,6 @@ static u64 journal_seq_to_flush(struct journal *j)
* @j: journal object
* @direct: direct or background reclaim?
* @kicked: requested to run since we last ran?
* Returns: 0 on success, or -EIO if the journal has been shutdown
*
* Background journal reclaim writes out btree nodes. It should be run
* early enough so that we never completely run out of journal buckets.
@ -685,10 +684,9 @@ static int __bch2_journal_reclaim(struct journal *j, bool direct, bool kicked)
if (kthread && kthread_should_stop())
break;
if (bch2_journal_error(j)) {
ret = -EIO;
ret = bch2_journal_error(j);
if (ret)
break;
}
bch2_journal_do_discards(j);


@ -231,15 +231,14 @@ bool bch2_blacklist_entries_gc(struct bch_fs *c)
struct journal_seq_blacklist_table *t = c->journal_seq_blacklist_table;
BUG_ON(nr != t->nr);
unsigned i;
for (src = bl->start, i = t->nr == 0 ? 0 : eytzinger0_first(t->nr);
src < bl->start + nr;
src++, i = eytzinger0_next(i, nr)) {
src = bl->start;
eytzinger0_for_each(i, nr) {
BUG_ON(t->entries[i].start != le64_to_cpu(src->start));
BUG_ON(t->entries[i].end != le64_to_cpu(src->end));
if (t->entries[i].dirty || t->entries[i].end >= c->journal.oldest_seq_found_ondisk)
*dst++ = *src;
src++;
}
unsigned new_nr = dst - bl->start;


@ -12,7 +12,11 @@
/* btree write buffer steals 8 bits for its own purposes: */
#define JOURNAL_SEQ_MAX ((1ULL << 56) - 1)
#define JOURNAL_BUF_BITS 2
#define JOURNAL_STATE_BUF_BITS 2
#define JOURNAL_STATE_BUF_NR (1U << JOURNAL_STATE_BUF_BITS)
#define JOURNAL_STATE_BUF_MASK (JOURNAL_STATE_BUF_NR - 1)
#define JOURNAL_BUF_BITS 4
#define JOURNAL_BUF_NR (1U << JOURNAL_BUF_BITS)
#define JOURNAL_BUF_MASK (JOURNAL_BUF_NR - 1)
@ -82,7 +86,6 @@ struct journal_entry_pin {
struct journal_res {
bool ref;
u8 idx;
u16 u64s;
u32 offset;
u64 seq;
@ -98,9 +101,8 @@ union journal_res_state {
};
struct {
u64 cur_entry_offset:20,
u64 cur_entry_offset:22,
idx:2,
unwritten_idx:2,
buf0_count:10,
buf1_count:10,
buf2_count:10,
@ -110,13 +112,13 @@ union journal_res_state {
/* bytes: */
#define JOURNAL_ENTRY_SIZE_MIN (64U << 10) /* 64k */
#define JOURNAL_ENTRY_SIZE_MAX (4U << 20) /* 4M */
#define JOURNAL_ENTRY_SIZE_MAX (4U << 22) /* 16M */
/*
* We stash some journal state as sentinel values in cur_entry_offset:
* note - cur_entry_offset is in units of u64s
*/
#define JOURNAL_ENTRY_OFFSET_MAX ((1U << 20) - 1)
#define JOURNAL_ENTRY_OFFSET_MAX ((1U << 22) - 1)
#define JOURNAL_ENTRY_BLOCKED_VAL (JOURNAL_ENTRY_OFFSET_MAX - 2)
#define JOURNAL_ENTRY_CLOSED_VAL (JOURNAL_ENTRY_OFFSET_MAX - 1)
@ -149,28 +151,12 @@ enum journal_flags {
#undef x
};
/* Reasons we may fail to get a journal reservation: */
#define JOURNAL_ERRORS() \
x(ok) \
x(retry) \
x(blocked) \
x(max_in_flight) \
x(journal_full) \
x(journal_pin_full) \
x(journal_stuck) \
x(insufficient_devices)
enum journal_errors {
#define x(n) JOURNAL_ERR_##n,
JOURNAL_ERRORS()
#undef x
};
typedef DARRAY(u64) darray_u64;
struct journal_bio {
struct bch_dev *ca;
unsigned buf_idx;
u64 submit_time;
struct bio bio;
};
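The new submit_time field is stamped in journal_write_submit() and passed to bch2_account_io_completion() from the endio handler, so per-device write latency is accounted whether or not the IO succeeded. A hedged sketch of that stamp-at-submit, measure-at-completion idea (io_stub and now_ns are invented names):

/* Sketch of stamping an IO at submit time and accounting latency at
 * completion, as the submit_time field above is used for. */
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

struct io_stub {
	uint64_t submit_time_ns;	/* analogous to journal_bio.submit_time */
};

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

static void io_submit(struct io_stub *io)
{
	io->submit_time_ns = now_ns();
	/* real code would queue the bio to the device here */
}

static void io_complete(struct io_stub *io, int error)
{
	uint64_t latency = now_ns() - io->submit_time_ns;

	/* latency is recorded even on error, so slow failing devices
	 * still show up in the per-device stats */
	printf("io completed, error=%d, latency=%llu ns\n",
	       error, (unsigned long long)latency);
}

int main(void)
{
	struct io_stub io;

	io_submit(&io);
	usleep(1000);		/* stand-in for the device doing the write */
	io_complete(&io, 0);
	return 0;
}
/* end of illustrative sketch */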
@ -199,7 +185,7 @@ struct journal {
* 0, or -ENOSPC if waiting on journal reclaim, or -EROFS if
* insufficient devices:
*/
enum journal_errors cur_entry_error;
int cur_entry_error;
unsigned cur_entry_offset_if_blocked;
unsigned buf_size_want;
@ -220,6 +206,8 @@ struct journal {
* other is possibly being written out.
*/
struct journal_buf buf[JOURNAL_BUF_NR];
void *free_buf;
unsigned free_buf_size;
spinlock_t lock;
@ -237,6 +225,7 @@ struct journal {
/* Sequence number of most recent journal entry (last entry in @pin) */
atomic64_t seq;
u64 seq_write_started;
/* seq, last_seq from the most recent journal entry successfully written */
u64 seq_ondisk;
u64 flushed_seq_ondisk;


@ -6,6 +6,7 @@
#include "btree_iter.h"
#include "btree_update.h"
#include "btree_write_buffer.h"
#include "ec.h"
#include "error.h"
#include "lru.h"
#include "recovery.h"
@ -59,9 +60,9 @@ int bch2_lru_set(struct btree_trans *trans, u16 lru_id, u64 dev_bucket, u64 time
return __bch2_lru_set(trans, lru_id, dev_bucket, time, KEY_TYPE_set);
}
int bch2_lru_change(struct btree_trans *trans,
u16 lru_id, u64 dev_bucket,
u64 old_time, u64 new_time)
int __bch2_lru_change(struct btree_trans *trans,
u16 lru_id, u64 dev_bucket,
u64 old_time, u64 new_time)
{
if (old_time == new_time)
return 0;
@ -78,7 +79,9 @@ static const char * const bch2_lru_types[] = {
};
int bch2_lru_check_set(struct btree_trans *trans,
u16 lru_id, u64 time,
u16 lru_id,
u64 dev_bucket,
u64 time,
struct bkey_s_c referring_k,
struct bkey_buf *last_flushed)
{
@ -87,9 +90,7 @@ int bch2_lru_check_set(struct btree_trans *trans,
struct btree_iter lru_iter;
struct bkey_s_c lru_k =
bch2_bkey_get_iter(trans, &lru_iter, BTREE_ID_lru,
lru_pos(lru_id,
bucket_to_u64(referring_k.k->p),
time), 0);
lru_pos(lru_id, dev_bucket, time), 0);
int ret = bkey_err(lru_k);
if (ret)
return ret;
@ -104,7 +105,7 @@ int bch2_lru_check_set(struct btree_trans *trans,
" %s",
bch2_lru_types[lru_type(lru_k)],
(bch2_bkey_val_to_text(&buf, c, referring_k), buf.buf))) {
ret = bch2_lru_set(trans, lru_id, bucket_to_u64(referring_k.k->p), time);
ret = bch2_lru_set(trans, lru_id, dev_bucket, time);
if (ret)
goto err;
}
@ -116,49 +117,73 @@ fsck_err:
return ret;
}
static struct bbpos lru_pos_to_bp(struct bkey_s_c lru_k)
{
enum bch_lru_type type = lru_type(lru_k);
switch (type) {
case BCH_LRU_read:
case BCH_LRU_fragmentation:
return BBPOS(BTREE_ID_alloc, u64_to_bucket(lru_k.k->p.offset));
case BCH_LRU_stripes:
return BBPOS(BTREE_ID_stripes, POS(0, lru_k.k->p.offset));
default:
BUG();
}
}
static u64 bkey_lru_type_idx(struct bch_fs *c,
enum bch_lru_type type,
struct bkey_s_c k)
{
struct bch_alloc_v4 a_convert;
const struct bch_alloc_v4 *a;
switch (type) {
case BCH_LRU_read:
a = bch2_alloc_to_v4(k, &a_convert);
return alloc_lru_idx_read(*a);
case BCH_LRU_fragmentation: {
a = bch2_alloc_to_v4(k, &a_convert);
rcu_read_lock();
struct bch_dev *ca = bch2_dev_rcu_noerror(c, k.k->p.inode);
u64 idx = ca
? alloc_lru_idx_fragmentation(*a, ca)
: 0;
rcu_read_unlock();
return idx;
}
case BCH_LRU_stripes:
return k.k->type == KEY_TYPE_stripe
? stripe_lru_pos(bkey_s_c_to_stripe(k).v)
: 0;
default:
BUG();
}
}
static int bch2_check_lru_key(struct btree_trans *trans,
struct btree_iter *lru_iter,
struct bkey_s_c lru_k,
struct bkey_buf *last_flushed)
{
struct bch_fs *c = trans->c;
struct btree_iter iter;
struct bkey_s_c k;
struct bch_alloc_v4 a_convert;
const struct bch_alloc_v4 *a;
struct printbuf buf1 = PRINTBUF;
struct printbuf buf2 = PRINTBUF;
enum bch_lru_type type = lru_type(lru_k);
struct bpos alloc_pos = u64_to_bucket(lru_k.k->p.offset);
u64 idx;
int ret;
struct bch_dev *ca = bch2_dev_bucket_tryget_noerror(c, alloc_pos);
struct bbpos bp = lru_pos_to_bp(lru_k);
if (fsck_err_on(!ca,
trans, lru_entry_to_invalid_bucket,
"lru key points to nonexistent device:bucket %llu:%llu",
alloc_pos.inode, alloc_pos.offset))
return bch2_btree_bit_mod_buffered(trans, BTREE_ID_lru, lru_iter->pos, false);
k = bch2_bkey_get_iter(trans, &iter, BTREE_ID_alloc, alloc_pos, 0);
ret = bkey_err(k);
struct btree_iter iter;
struct bkey_s_c k = bch2_bkey_get_iter(trans, &iter, bp.btree, bp.pos, 0);
int ret = bkey_err(k);
if (ret)
goto err;
a = bch2_alloc_to_v4(k, &a_convert);
enum bch_lru_type type = lru_type(lru_k);
u64 idx = bkey_lru_type_idx(c, type, k);
switch (type) {
case BCH_LRU_read:
idx = alloc_lru_idx_read(*a);
break;
case BCH_LRU_fragmentation:
idx = alloc_lru_idx_fragmentation(*a, ca);
break;
}
if (lru_k.k->type != KEY_TYPE_set ||
lru_pos_time(lru_k.k->p) != idx) {
if (lru_pos_time(lru_k.k->p) != idx) {
ret = bch2_btree_write_buffer_maybe_flush(trans, lru_k, last_flushed);
if (ret)
goto err;
@ -176,7 +201,6 @@ static int bch2_check_lru_key(struct btree_trans *trans,
err:
fsck_err:
bch2_trans_iter_exit(trans, &iter);
bch2_dev_put(ca);
printbuf_exit(&buf2);
printbuf_exit(&buf1);
return ret;


@ -28,9 +28,14 @@ static inline enum bch_lru_type lru_type(struct bkey_s_c l)
{
u16 lru_id = l.k->p.inode >> 48;
if (lru_id == BCH_LRU_FRAGMENTATION_START)
switch (lru_id) {
case BCH_LRU_BUCKET_FRAGMENTATION:
return BCH_LRU_fragmentation;
return BCH_LRU_read;
case BCH_LRU_STRIPE_FRAGMENTATION:
return BCH_LRU_stripes;
default:
return BCH_LRU_read;
}
}
int bch2_lru_validate(struct bch_fs *, struct bkey_s_c, struct bkey_validate_context);
@ -46,10 +51,19 @@ void bch2_lru_pos_to_text(struct printbuf *, struct bpos);
int bch2_lru_del(struct btree_trans *, u16, u64, u64);
int bch2_lru_set(struct btree_trans *, u16, u64, u64);
int bch2_lru_change(struct btree_trans *, u16, u64, u64, u64);
int __bch2_lru_change(struct btree_trans *, u16, u64, u64, u64);
static inline int bch2_lru_change(struct btree_trans *trans,
u16 lru_id, u64 dev_bucket,
u64 old_time, u64 new_time)
{
return old_time != new_time
? __bch2_lru_change(trans, lru_id, dev_bucket, old_time, new_time)
: 0;
}
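bch2_lru_change() is now an inline wrapper that filters out the old_time == new_time no-op before calling into the out-of-line __bch2_lru_change(). A minimal sketch of that fast-path wrapper pattern, with invented names:

/* Sketch of the inline fast-path wrapper pattern used for
 * bch2_lru_change(); names are invented. */
#include <stdio.h>

static int slow_path_calls;

/* Out-of-line slow path: does the real (expensive) update. */
static int __lru_change(unsigned lru_id, unsigned long long dev_bucket,
			unsigned long long old_time, unsigned long long new_time)
{
	(void) lru_id; (void) dev_bucket; (void) old_time; (void) new_time;

	slow_path_calls++;
	/* ...would delete the old LRU entry and insert the new one... */
	return 0;
}

/* Inline wrapper: callers pay only for a comparison when nothing changed. */
static inline int lru_change(unsigned lru_id, unsigned long long dev_bucket,
			     unsigned long long old_time, unsigned long long new_time)
{
	return old_time != new_time
		? __lru_change(lru_id, dev_bucket, old_time, new_time)
		: 0;
}

int main(void)
{
	lru_change(1, 100, 5, 5);	/* no-op: slow path never entered */
	lru_change(1, 100, 5, 9);	/* time changed: slow path runs */

	printf("slow path calls: %d\n", slow_path_calls);	/* prints 1 */
	return 0;
}
/* end of illustrative sketch */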
struct bkey_buf;
int bch2_lru_check_set(struct btree_trans *, u16, u64, struct bkey_s_c, struct bkey_buf *);
int bch2_lru_check_set(struct btree_trans *, u16, u64, u64, struct bkey_s_c, struct bkey_buf *);
int bch2_check_lrus(struct bch_fs *);


@ -9,7 +9,8 @@ struct bch_lru {
#define BCH_LRU_TYPES() \
x(read) \
x(fragmentation)
x(fragmentation) \
x(stripes)
enum bch_lru_type {
#define x(n) BCH_LRU_##n,
@ -17,7 +18,8 @@ enum bch_lru_type {
#undef x
};
#define BCH_LRU_FRAGMENTATION_START ((1U << 16) - 1)
#define BCH_LRU_BUCKET_FRAGMENTATION ((1U << 16) - 1)
#define BCH_LRU_STRIPE_FRAGMENTATION ((1U << 16) - 2)
#define LRU_TIME_BITS 48
#define LRU_TIME_MAX ((1ULL << LRU_TIME_BITS) - 1)


@ -15,6 +15,7 @@
#include "keylist.h"
#include "migrate.h"
#include "move.h"
#include "progress.h"
#include "replicas.h"
#include "super-io.h"
@ -76,7 +77,9 @@ static int bch2_dev_usrdata_drop_key(struct btree_trans *trans,
return 0;
}
static int bch2_dev_usrdata_drop(struct bch_fs *c, unsigned dev_idx, int flags)
static int bch2_dev_usrdata_drop(struct bch_fs *c,
struct progress_indicator_state *progress,
unsigned dev_idx, int flags)
{
struct btree_trans *trans = bch2_trans_get(c);
enum btree_id id;
@ -88,8 +91,10 @@ static int bch2_dev_usrdata_drop(struct bch_fs *c, unsigned dev_idx, int flags)
ret = for_each_btree_key_commit(trans, iter, id, POS_MIN,
BTREE_ITER_prefetch|BTREE_ITER_all_snapshots, k,
NULL, NULL, BCH_TRANS_COMMIT_no_enospc,
bch2_dev_usrdata_drop_key(trans, &iter, k, dev_idx, flags));
NULL, NULL, BCH_TRANS_COMMIT_no_enospc, ({
bch2_progress_update_iter(trans, progress, &iter, "dropping user data");
bch2_dev_usrdata_drop_key(trans, &iter, k, dev_idx, flags);
}));
if (ret)
break;
}
@ -99,7 +104,9 @@ static int bch2_dev_usrdata_drop(struct bch_fs *c, unsigned dev_idx, int flags)
return ret;
}
static int bch2_dev_metadata_drop(struct bch_fs *c, unsigned dev_idx, int flags)
static int bch2_dev_metadata_drop(struct bch_fs *c,
struct progress_indicator_state *progress,
unsigned dev_idx, int flags)
{
struct btree_trans *trans;
struct btree_iter iter;
@ -125,6 +132,8 @@ retry:
while (bch2_trans_begin(trans),
(b = bch2_btree_iter_peek_node(&iter)) &&
!(ret = PTR_ERR_OR_ZERO(b))) {
bch2_progress_update_iter(trans, progress, &iter, "dropping metadata");
if (!bch2_bkey_has_device_c(bkey_i_to_s_c(&b->key), dev_idx))
goto next;
@ -169,6 +178,11 @@ err:
int bch2_dev_data_drop(struct bch_fs *c, unsigned dev_idx, int flags)
{
return bch2_dev_usrdata_drop(c, dev_idx, flags) ?:
bch2_dev_metadata_drop(c, dev_idx, flags);
struct progress_indicator_state progress;
bch2_progress_init(&progress, c,
BIT_ULL(BTREE_ID_extents)|
BIT_ULL(BTREE_ID_reflink));
return bch2_dev_usrdata_drop(c, &progress, dev_idx, flags) ?:
bch2_dev_metadata_drop(c, &progress, dev_idx, flags);
}


@ -38,28 +38,28 @@ const char * const bch2_data_ops_strs[] = {
NULL
};
static void trace_move_extent2(struct bch_fs *c, struct bkey_s_c k,
static void trace_io_move2(struct bch_fs *c, struct bkey_s_c k,
struct bch_io_opts *io_opts,
struct data_update_opts *data_opts)
{
if (trace_move_extent_enabled()) {
if (trace_io_move_enabled()) {
struct printbuf buf = PRINTBUF;
bch2_bkey_val_to_text(&buf, c, k);
prt_newline(&buf);
bch2_data_update_opts_to_text(&buf, c, io_opts, data_opts);
trace_move_extent(c, buf.buf);
trace_io_move(c, buf.buf);
printbuf_exit(&buf);
}
}
static void trace_move_extent_read2(struct bch_fs *c, struct bkey_s_c k)
static void trace_io_move_read2(struct bch_fs *c, struct bkey_s_c k)
{
if (trace_move_extent_read_enabled()) {
if (trace_io_move_read_enabled()) {
struct printbuf buf = PRINTBUF;
bch2_bkey_val_to_text(&buf, c, k);
trace_move_extent_read(c, buf.buf);
trace_io_move_read(c, buf.buf);
printbuf_exit(&buf);
}
}
@ -74,11 +74,7 @@ struct moving_io {
unsigned read_sectors;
unsigned write_sectors;
struct bch_read_bio rbio;
struct data_update write;
/* Must be last since it is variable size */
struct bio_vec bi_inline_vecs[];
};
static void move_free(struct moving_io *io)
@ -88,43 +84,72 @@ static void move_free(struct moving_io *io)
if (io->b)
atomic_dec(&io->b->count);
bch2_data_update_exit(&io->write);
mutex_lock(&ctxt->lock);
list_del(&io->io_list);
wake_up(&ctxt->wait);
mutex_unlock(&ctxt->lock);
if (!io->write.data_opts.scrub) {
bch2_data_update_exit(&io->write);
} else {
bch2_bio_free_pages_pool(io->write.op.c, &io->write.op.wbio.bio);
kfree(io->write.bvecs);
}
kfree(io);
}
static void move_write_done(struct bch_write_op *op)
{
struct moving_io *io = container_of(op, struct moving_io, write.op);
struct bch_fs *c = op->c;
struct moving_context *ctxt = io->write.ctxt;
if (io->write.op.error)
ctxt->write_error = true;
if (op->error) {
if (trace_io_move_write_fail_enabled()) {
struct printbuf buf = PRINTBUF;
atomic_sub(io->write_sectors, &io->write.ctxt->write_sectors);
atomic_dec(&io->write.ctxt->write_ios);
bch2_write_op_to_text(&buf, op);
prt_printf(&buf, "ret\t%s\n", bch2_err_str(op->error));
trace_io_move_write_fail(c, buf.buf);
printbuf_exit(&buf);
}
this_cpu_inc(c->counters[BCH_COUNTER_io_move_write_fail]);
ctxt->write_error = true;
}
atomic_sub(io->write_sectors, &ctxt->write_sectors);
atomic_dec(&ctxt->write_ios);
move_free(io);
closure_put(&ctxt->cl);
}
static void move_write(struct moving_io *io)
{
if (unlikely(io->rbio.bio.bi_status || io->rbio.hole)) {
struct moving_context *ctxt = io->write.ctxt;
if (ctxt->stats) {
if (io->write.rbio.bio.bi_status)
atomic64_add(io->write.rbio.bvec_iter.bi_size >> 9,
&ctxt->stats->sectors_error_uncorrected);
else if (io->write.rbio.saw_error)
atomic64_add(io->write.rbio.bvec_iter.bi_size >> 9,
&ctxt->stats->sectors_error_corrected);
}
if (unlikely(io->write.rbio.ret ||
io->write.rbio.bio.bi_status ||
io->write.data_opts.scrub)) {
move_free(io);
return;
}
if (trace_move_extent_write_enabled()) {
if (trace_io_move_write_enabled()) {
struct bch_fs *c = io->write.op.c;
struct printbuf buf = PRINTBUF;
bch2_bkey_val_to_text(&buf, c, bkey_i_to_s_c(io->write.k.k));
trace_move_extent_write(c, buf.buf);
trace_io_move_write(c, buf.buf);
printbuf_exit(&buf);
}
@ -132,7 +157,7 @@ static void move_write(struct moving_io *io)
atomic_add(io->write_sectors, &io->write.ctxt->write_sectors);
atomic_inc(&io->write.ctxt->write_ios);
bch2_data_update_read_done(&io->write, io->rbio.pick.crc);
bch2_data_update_read_done(&io->write);
}
struct moving_io *bch2_moving_ctxt_next_pending_write(struct moving_context *ctxt)
@ -145,7 +170,7 @@ struct moving_io *bch2_moving_ctxt_next_pending_write(struct moving_context *ctx
static void move_read_endio(struct bio *bio)
{
struct moving_io *io = container_of(bio, struct moving_io, rbio.bio);
struct moving_io *io = container_of(bio, struct moving_io, write.rbio.bio);
struct moving_context *ctxt = io->write.ctxt;
atomic_sub(io->read_sectors, &ctxt->read_sectors);
@ -258,14 +283,10 @@ int bch2_move_extent(struct moving_context *ctxt,
{
struct btree_trans *trans = ctxt->trans;
struct bch_fs *c = trans->c;
struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k);
struct moving_io *io;
const union bch_extent_entry *entry;
struct extent_ptr_decoded p;
unsigned sectors = k.k->size, pages;
int ret = -ENOMEM;
trace_move_extent2(c, k, &io_opts, &data_opts);
trace_io_move2(c, k, &io_opts, &data_opts);
this_cpu_add(c->counters[BCH_COUNTER_io_move], k.k->size);
if (ctxt->stats)
ctxt->stats->pos = BBPOS(iter->btree_id, iter->pos);
@ -273,7 +294,8 @@ int bch2_move_extent(struct moving_context *ctxt,
bch2_data_update_opts_normalize(k, &data_opts);
if (!data_opts.rewrite_ptrs &&
!data_opts.extra_replicas) {
!data_opts.extra_replicas &&
!data_opts.scrub) {
if (data_opts.kill_ptrs)
return bch2_extent_drop_ptrs(trans, iter, k, &io_opts, &data_opts);
return 0;
@ -285,13 +307,7 @@ int bch2_move_extent(struct moving_context *ctxt,
*/
bch2_trans_unlock(trans);
/* write path might have to decompress data: */
bkey_for_each_ptr_decode(k.k, ptrs, p, entry)
sectors = max_t(unsigned, sectors, p.crc.uncompressed_size);
pages = DIV_ROUND_UP(sectors, PAGE_SECTORS);
io = kzalloc(sizeof(struct moving_io) +
sizeof(struct bio_vec) * pages, GFP_KERNEL);
struct moving_io *io = kzalloc(sizeof(struct moving_io), GFP_KERNEL);
if (!io)
goto err;
@ -300,31 +316,27 @@ int bch2_move_extent(struct moving_context *ctxt,
io->read_sectors = k.k->size;
io->write_sectors = k.k->size;
bio_init(&io->write.op.wbio.bio, NULL, io->bi_inline_vecs, pages, 0);
io->write.op.wbio.bio.bi_ioprio =
IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0);
if (!data_opts.scrub) {
ret = bch2_data_update_init(trans, iter, ctxt, &io->write, ctxt->wp,
&io_opts, data_opts, iter->btree_id, k);
if (ret)
goto err_free;
if (bch2_bio_alloc_pages(&io->write.op.wbio.bio, sectors << 9,
GFP_KERNEL))
goto err_free;
io->write.op.end_io = move_write_done;
} else {
bch2_bkey_buf_init(&io->write.k);
bch2_bkey_buf_reassemble(&io->write.k, c, k);
io->rbio.c = c;
io->rbio.opts = io_opts;
bio_init(&io->rbio.bio, NULL, io->bi_inline_vecs, pages, 0);
io->rbio.bio.bi_vcnt = pages;
io->rbio.bio.bi_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0);
io->rbio.bio.bi_iter.bi_size = sectors << 9;
io->write.op.c = c;
io->write.data_opts = data_opts;
io->rbio.bio.bi_opf = REQ_OP_READ;
io->rbio.bio.bi_iter.bi_sector = bkey_start_offset(k.k);
io->rbio.bio.bi_end_io = move_read_endio;
ret = bch2_data_update_bios_init(&io->write, c, &io_opts);
if (ret)
goto err_free;
}
ret = bch2_data_update_init(trans, iter, ctxt, &io->write, ctxt->wp,
io_opts, data_opts, iter->btree_id, k);
if (ret)
goto err_free_pages;
io->write.op.end_io = move_write_done;
io->write.rbio.bio.bi_end_io = move_read_endio;
io->write.rbio.bio.bi_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0);
if (ctxt->rate)
bch2_ratelimit_increment(ctxt->rate, k.k->size);
@ -339,9 +351,7 @@ int bch2_move_extent(struct moving_context *ctxt,
atomic_inc(&io->b->count);
}
this_cpu_add(c->counters[BCH_COUNTER_io_move], k.k->size);
this_cpu_add(c->counters[BCH_COUNTER_move_extent_read], k.k->size);
trace_move_extent_read2(c, k);
trace_io_move_read2(c, k);
mutex_lock(&ctxt->lock);
atomic_add(io->read_sectors, &ctxt->read_sectors);
@ -356,33 +366,33 @@ int bch2_move_extent(struct moving_context *ctxt,
* ctxt when doing wakeup
*/
closure_get(&ctxt->cl);
bch2_read_extent(trans, &io->rbio,
bkey_start_pos(k.k),
iter->btree_id, k, 0,
BCH_READ_NODECODE|
BCH_READ_LAST_FRAGMENT);
__bch2_read_extent(trans, &io->write.rbio,
io->write.rbio.bio.bi_iter,
bkey_start_pos(k.k),
iter->btree_id, k, 0,
NULL,
BCH_READ_last_fragment,
data_opts.scrub ? data_opts.read_dev : -1);
return 0;
err_free_pages:
bio_free_pages(&io->write.op.wbio.bio);
err_free:
kfree(io);
err:
if (ret == -BCH_ERR_data_update_done)
if (bch2_err_matches(ret, BCH_ERR_data_update_done))
return 0;
if (bch2_err_matches(ret, EROFS) ||
bch2_err_matches(ret, BCH_ERR_transaction_restart))
return ret;
count_event(c, move_extent_start_fail);
count_event(c, io_move_start_fail);
if (trace_move_extent_start_fail_enabled()) {
if (trace_io_move_start_fail_enabled()) {
struct printbuf buf = PRINTBUF;
bch2_bkey_val_to_text(&buf, c, k);
prt_str(&buf, ": ");
prt_str(&buf, bch2_err_str(ret));
trace_move_extent_start_fail(c, buf.buf);
trace_io_move_start_fail(c, buf.buf);
printbuf_exit(&buf);
}
return ret;
@ -551,6 +561,7 @@ static int bch2_move_data_btree(struct moving_context *ctxt,
bch2_trans_begin(trans);
bch2_trans_iter_init(trans, &iter, btree_id, start,
BTREE_ITER_prefetch|
BTREE_ITER_not_extents|
BTREE_ITER_all_snapshots);
if (ctxt->rate)
@ -581,7 +592,7 @@ static int bch2_move_data_btree(struct moving_context *ctxt,
k.k->type == KEY_TYPE_reflink_p &&
REFLINK_P_MAY_UPDATE_OPTIONS(bkey_s_c_to_reflink_p(k).v)) {
struct bkey_s_c_reflink_p p = bkey_s_c_to_reflink_p(k);
s64 offset_into_extent = iter.pos.offset - bkey_start_offset(k.k);
s64 offset_into_extent = 0;
bch2_trans_iter_exit(trans, &reflink_iter);
k = bch2_lookup_indirect_extent(trans, &reflink_iter, &offset_into_extent, p, true, 0);
@ -600,6 +611,7 @@ static int bch2_move_data_btree(struct moving_context *ctxt,
* pointer - need to fixup iter->k
*/
extent_iter = &reflink_iter;
offset_into_extent = 0;
}
if (!bkey_extent_is_direct_data(k.k))
@ -627,7 +639,7 @@ static int bch2_move_data_btree(struct moving_context *ctxt,
if (bch2_err_matches(ret2, BCH_ERR_transaction_restart))
continue;
if (ret2 == -ENOMEM) {
if (bch2_err_matches(ret2, ENOMEM)) {
/* memory allocation failure, wait for some IO to finish */
bch2_move_ctxt_wait_for_io(ctxt);
continue;
@ -689,21 +701,22 @@ int bch2_move_data(struct bch_fs *c,
bool wait_on_copygc,
move_pred_fn pred, void *arg)
{
struct moving_context ctxt;
int ret;
bch2_moving_ctxt_init(&ctxt, c, rate, stats, wp, wait_on_copygc);
ret = __bch2_move_data(&ctxt, start, end, pred, arg);
int ret = __bch2_move_data(&ctxt, start, end, pred, arg);
bch2_moving_ctxt_exit(&ctxt);
return ret;
}
int bch2_evacuate_bucket(struct moving_context *ctxt,
struct move_bucket_in_flight *bucket_in_flight,
struct bpos bucket, int gen,
struct data_update_opts _data_opts)
static int __bch2_move_data_phys(struct moving_context *ctxt,
struct move_bucket_in_flight *bucket_in_flight,
unsigned dev,
u64 bucket_start,
u64 bucket_end,
unsigned data_types,
move_pred_fn pred, void *arg)
{
struct btree_trans *trans = ctxt->trans;
struct bch_fs *c = trans->c;
@ -712,16 +725,19 @@ int bch2_evacuate_bucket(struct moving_context *ctxt,
struct btree_iter iter = {}, bp_iter = {};
struct bkey_buf sk;
struct bkey_s_c k;
struct data_update_opts data_opts;
unsigned sectors_moved = 0;
struct bkey_buf last_flushed;
int ret = 0;
struct bch_dev *ca = bch2_dev_tryget(c, bucket.inode);
struct bch_dev *ca = bch2_dev_tryget(c, dev);
if (!ca)
return 0;
trace_bucket_evacuate(c, &bucket);
bucket_end = min(bucket_end, ca->mi.nbuckets);
struct bpos bp_start = bucket_pos_to_bp_start(ca, POS(dev, bucket_start));
struct bpos bp_end = bucket_pos_to_bp_end(ca, POS(dev, bucket_end));
bch2_dev_put(ca);
ca = NULL;
bch2_bkey_buf_init(&last_flushed);
bkey_init(&last_flushed.k->k);
@ -732,8 +748,7 @@ int bch2_evacuate_bucket(struct moving_context *ctxt,
*/
bch2_trans_begin(trans);
bch2_trans_iter_init(trans, &bp_iter, BTREE_ID_backpointers,
bucket_pos_to_bp_start(ca, bucket), 0);
bch2_trans_iter_init(trans, &bp_iter, BTREE_ID_backpointers, bp_start, 0);
bch_err_msg(c, ret, "looking up alloc key");
if (ret)
@ -757,7 +772,7 @@ int bch2_evacuate_bucket(struct moving_context *ctxt,
if (ret)
goto err;
if (!k.k || bkey_gt(k.k->p, bucket_pos_to_bp_end(ca, bucket)))
if (!k.k || bkey_gt(k.k->p, bp_end))
break;
if (k.k->type != KEY_TYPE_backpointer)
@ -765,107 +780,148 @@ int bch2_evacuate_bucket(struct moving_context *ctxt,
struct bkey_s_c_backpointer bp = bkey_s_c_to_backpointer(k);
if (ctxt->stats)
ctxt->stats->offset = bp.k->p.offset >> MAX_EXTENT_COMPRESS_RATIO_SHIFT;
if (!(data_types & BIT(bp.v->data_type)))
goto next;
if (!bp.v->level && bp.v->btree_id == BTREE_ID_stripes)
goto next;
k = bch2_backpointer_get_key(trans, bp, &iter, 0, &last_flushed);
ret = bkey_err(k);
if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
continue;
if (ret)
goto err;
if (!k.k)
goto next;
if (!bp.v->level) {
k = bch2_backpointer_get_key(trans, bp, &iter, 0, &last_flushed);
ret = bkey_err(k);
if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
continue;
if (ret)
goto err;
if (!k.k)
goto next;
bch2_bkey_buf_reassemble(&sk, c, k);
k = bkey_i_to_s_c(sk.k);
ret = bch2_move_get_io_opts_one(trans, &io_opts, &iter, k);
if (ret) {
bch2_trans_iter_exit(trans, &iter);
continue;
}
data_opts = _data_opts;
data_opts.target = io_opts.background_target;
data_opts.rewrite_ptrs = 0;
unsigned sectors = bp.v->bucket_len; /* move_extent will drop locks */
unsigned i = 0;
const union bch_extent_entry *entry;
struct extent_ptr_decoded p;
bkey_for_each_ptr_decode(k.k, bch2_bkey_ptrs_c(k), p, entry) {
if (p.ptr.dev == bucket.inode) {
if (p.ptr.cached) {
bch2_trans_iter_exit(trans, &iter);
goto next;
}
data_opts.rewrite_ptrs |= 1U << i;
break;
}
i++;
}
ret = bch2_move_extent(ctxt, bucket_in_flight,
&iter, k, io_opts, data_opts);
bch2_trans_iter_exit(trans, &iter);
if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
continue;
if (ret == -ENOMEM) {
/* memory allocation failure, wait for some IO to finish */
bch2_move_ctxt_wait_for_io(ctxt);
continue;
}
if (ret)
goto err;
if (ctxt->stats)
atomic64_add(sectors, &ctxt->stats->sectors_seen);
sectors_moved += sectors;
} else {
struct btree *b;
b = bch2_backpointer_get_node(trans, bp, &iter, &last_flushed);
ret = PTR_ERR_OR_ZERO(b);
if (ret == -BCH_ERR_backpointer_to_overwritten_btree_node)
goto next;
if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
continue;
if (ret)
goto err;
if (!b)
goto next;
unsigned sectors = btree_ptr_sectors_written(bkey_i_to_s_c(&b->key));
ret = bch2_btree_node_rewrite(trans, &iter, b, 0);
bch2_trans_iter_exit(trans, &iter);
if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
continue;
if (ret)
goto err;
if (ctxt->rate)
bch2_ratelimit_increment(ctxt->rate, sectors);
if (ctxt->stats) {
atomic64_add(sectors, &ctxt->stats->sectors_seen);
atomic64_add(sectors, &ctxt->stats->sectors_moved);
}
sectors_moved += btree_sectors(c);
}
struct data_update_opts data_opts = {};
if (!pred(c, arg, k, &io_opts, &data_opts)) {
bch2_trans_iter_exit(trans, &iter);
goto next;
}
if (data_opts.scrub &&
!bch2_dev_idx_is_online(c, data_opts.read_dev)) {
bch2_trans_iter_exit(trans, &iter);
ret = -BCH_ERR_device_offline;
break;
}
bch2_bkey_buf_reassemble(&sk, c, k);
k = bkey_i_to_s_c(sk.k);
/* move_extent will drop locks */
unsigned sectors = bp.v->bucket_len;
if (!bp.v->level)
ret = bch2_move_extent(ctxt, bucket_in_flight, &iter, k, io_opts, data_opts);
else if (!data_opts.scrub)
ret = bch2_btree_node_rewrite_pos(trans, bp.v->btree_id, bp.v->level, k.k->p, 0);
else
ret = bch2_btree_node_scrub(trans, bp.v->btree_id, bp.v->level, k, data_opts.read_dev);
bch2_trans_iter_exit(trans, &iter);
if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
continue;
if (ret == -ENOMEM) {
/* memory allocation failure, wait for some IO to finish */
bch2_move_ctxt_wait_for_io(ctxt);
continue;
}
if (ret)
goto err;
if (ctxt->stats)
atomic64_add(sectors, &ctxt->stats->sectors_seen);
next:
bch2_btree_iter_advance(&bp_iter);
}
trace_evacuate_bucket(c, &bucket, sectors_moved, ca->mi.bucket_size, ret);
err:
bch2_trans_iter_exit(trans, &bp_iter);
bch2_dev_put(ca);
bch2_bkey_buf_exit(&sk, c);
bch2_bkey_buf_exit(&last_flushed, c);
return ret;
}
static int bch2_move_data_phys(struct bch_fs *c,
unsigned dev,
u64 start,
u64 end,
unsigned data_types,
struct bch_ratelimit *rate,
struct bch_move_stats *stats,
struct write_point_specifier wp,
bool wait_on_copygc,
move_pred_fn pred, void *arg)
{
struct moving_context ctxt;
bch2_trans_run(c, bch2_btree_write_buffer_flush_sync(trans));
bch2_moving_ctxt_init(&ctxt, c, rate, stats, wp, wait_on_copygc);
ctxt.stats->phys = true;
ctxt.stats->data_type = (int) DATA_PROGRESS_DATA_TYPE_phys;
int ret = __bch2_move_data_phys(&ctxt, NULL, dev, start, end, data_types, pred, arg);
bch2_moving_ctxt_exit(&ctxt);
return ret;
}
struct evacuate_bucket_arg {
struct bpos bucket;
int gen;
struct data_update_opts data_opts;
};
static bool evacuate_bucket_pred(struct bch_fs *c, void *_arg, struct bkey_s_c k,
struct bch_io_opts *io_opts,
struct data_update_opts *data_opts)
{
struct evacuate_bucket_arg *arg = _arg;
*data_opts = arg->data_opts;
unsigned i = 0;
bkey_for_each_ptr(bch2_bkey_ptrs_c(k), ptr) {
if (ptr->dev == arg->bucket.inode &&
(arg->gen < 0 || arg->gen == ptr->gen) &&
!ptr->cached)
data_opts->rewrite_ptrs |= BIT(i);
i++;
}
return data_opts->rewrite_ptrs != 0;
}
int bch2_evacuate_bucket(struct moving_context *ctxt,
struct move_bucket_in_flight *bucket_in_flight,
struct bpos bucket, int gen,
struct data_update_opts data_opts)
{
struct evacuate_bucket_arg arg = { bucket, gen, data_opts, };
return __bch2_move_data_phys(ctxt, bucket_in_flight,
bucket.inode,
bucket.offset,
bucket.offset + 1,
~0,
evacuate_bucket_pred, &arg);
}
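Bucket evacuation is now just a predicate plus a call into the generic physical-move walk above. As an illustration of the pattern (a sketch, not part of this patch; the function names are made up), any other "walk a device's backpointers and rewrite what matches" job can be expressed the same way:

/*
 * Hypothetical example: move all user data off a device by rewriting, for
 * every extent the backpointer walk finds, only the pointers that live on
 * that device.
 */
static bool user_data_off_dev_pred(struct bch_fs *c, void *_arg,
				   struct bkey_s_c k,
				   struct bch_io_opts *io_opts,
				   struct data_update_opts *data_opts)
{
	unsigned *dev = _arg;
	unsigned i = 0;

	bkey_for_each_ptr(bch2_bkey_ptrs_c(k), ptr) {
		if (ptr->dev == *dev)
			data_opts->rewrite_ptrs |= BIT(i);
		i++;
	}

	return data_opts->rewrite_ptrs != 0;
}

static int user_data_off_dev(struct moving_context *ctxt, unsigned dev)
{
	/* Walk every bucket on @dev, but only touch BCH_DATA_user extents: */
	return __bch2_move_data_phys(ctxt, NULL, dev, 0, U64_MAX,
				     BIT(BCH_DATA_user),
				     user_data_off_dev_pred, &dev);
}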
typedef bool (*move_btree_pred)(struct bch_fs *, void *,
struct btree *, struct bch_io_opts *,
struct data_update_opts *);
@ -1007,14 +1063,6 @@ static bool rereplicate_btree_pred(struct bch_fs *c, void *arg,
return rereplicate_pred(c, arg, bkey_i_to_s_c(&b->key), io_opts, data_opts);
}
static bool migrate_btree_pred(struct bch_fs *c, void *arg,
struct btree *b,
struct bch_io_opts *io_opts,
struct data_update_opts *data_opts)
{
return migrate_pred(c, arg, bkey_i_to_s_c(&b->key), io_opts, data_opts);
}
/*
* Ancient versions of bcachefs produced packed formats which could represent
* keys that the in memory format cannot represent; this checks for those
@ -1104,6 +1152,30 @@ static bool drop_extra_replicas_btree_pred(struct bch_fs *c, void *arg,
return drop_extra_replicas_pred(c, arg, bkey_i_to_s_c(&b->key), io_opts, data_opts);
}
static bool scrub_pred(struct bch_fs *c, void *_arg,
struct bkey_s_c k,
struct bch_io_opts *io_opts,
struct data_update_opts *data_opts)
{
struct bch_ioctl_data *arg = _arg;
if (k.k->type != KEY_TYPE_btree_ptr_v2) {
struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k);
const union bch_extent_entry *entry;
struct extent_ptr_decoded p;
bkey_for_each_ptr_decode(k.k, ptrs, p, entry)
if (p.ptr.dev == arg->migrate.dev) {
if (!p.crc.csum_type)
return false;
break;
}
}
data_opts->scrub = true;
data_opts->read_dev = arg->migrate.dev;
return true;
}
int bch2_data_job(struct bch_fs *c,
struct bch_move_stats *stats,
struct bch_ioctl_data op)
@ -1118,6 +1190,22 @@ int bch2_data_job(struct bch_fs *c,
bch2_move_stats_init(stats, bch2_data_ops_strs[op.op]);
switch (op.op) {
case BCH_DATA_OP_scrub:
/*
* prevent tests from spuriously failing: make sure we see all
* btree nodes that need to be repaired
*/
bch2_btree_interior_updates_flush(c);
ret = bch2_move_data_phys(c, op.scrub.dev, 0, U64_MAX,
op.scrub.data_types,
NULL,
stats,
writepoint_hashed((unsigned long) current),
false,
scrub_pred, &op) ?: ret;
break;
case BCH_DATA_OP_rereplicate:
stats->data_type = BCH_DATA_journal;
ret = bch2_journal_flush_device_pins(&c->journal, -1);
@ -1137,14 +1225,14 @@ int bch2_data_job(struct bch_fs *c,
stats->data_type = BCH_DATA_journal;
ret = bch2_journal_flush_device_pins(&c->journal, op.migrate.dev);
ret = bch2_move_btree(c, start, end,
migrate_btree_pred, &op, stats) ?: ret;
ret = bch2_move_data(c, start, end,
NULL,
stats,
writepoint_hashed((unsigned long) current),
true,
migrate_pred, &op) ?: ret;
ret = bch2_move_data_phys(c, op.migrate.dev, 0, U64_MAX,
~0,
NULL,
stats,
writepoint_hashed((unsigned long) current),
true,
migrate_pred, &op) ?: ret;
bch2_btree_interior_updates_flush(c);
ret = bch2_replicas_gc2(c) ?: ret;
break;
case BCH_DATA_OP_rewrite_old_nodes:
@ -1176,17 +1264,17 @@ void bch2_move_stats_to_text(struct printbuf *out, struct bch_move_stats *stats)
prt_newline(out);
printbuf_indent_add(out, 2);
prt_printf(out, "keys moved: %llu\n", atomic64_read(&stats->keys_moved));
prt_printf(out, "keys raced: %llu\n", atomic64_read(&stats->keys_raced));
prt_printf(out, "bytes seen: ");
prt_printf(out, "keys moved:\t%llu\n", atomic64_read(&stats->keys_moved));
prt_printf(out, "keys raced:\t%llu\n", atomic64_read(&stats->keys_raced));
prt_printf(out, "bytes seen:\t");
prt_human_readable_u64(out, atomic64_read(&stats->sectors_seen) << 9);
prt_newline(out);
prt_printf(out, "bytes moved: ");
prt_printf(out, "bytes moved:\t");
prt_human_readable_u64(out, atomic64_read(&stats->sectors_moved) << 9);
prt_newline(out);
prt_printf(out, "bytes raced: ");
prt_printf(out, "bytes raced:\t");
prt_human_readable_u64(out, atomic64_read(&stats->sectors_raced) << 9);
prt_newline(out);
@ -1195,7 +1283,8 @@ void bch2_move_stats_to_text(struct printbuf *out, struct bch_move_stats *stats)
static void bch2_moving_ctxt_to_text(struct printbuf *out, struct bch_fs *c, struct moving_context *ctxt)
{
struct moving_io *io;
if (!out->nr_tabstops)
printbuf_tabstop_push(out, 32);
bch2_move_stats_to_text(out, ctxt->stats);
printbuf_indent_add(out, 2);
@ -1215,8 +1304,9 @@ static void bch2_moving_ctxt_to_text(struct printbuf *out, struct bch_fs *c, str
printbuf_indent_add(out, 2);
mutex_lock(&ctxt->lock);
struct moving_io *io;
list_for_each_entry(io, &ctxt->ios, io_list)
bch2_write_op_to_text(out, &io->write.op);
bch2_data_update_inflight_to_text(out, &io->write);
mutex_unlock(&ctxt->lock);
printbuf_indent_sub(out, 4);


@ -3,22 +3,36 @@
#define _BCACHEFS_MOVE_TYPES_H
#include "bbpos_types.h"
#include "bcachefs_ioctl.h"
struct bch_move_stats {
enum bch_data_type data_type;
struct bbpos pos;
char name[32];
bool phys;
enum bch_ioctl_data_event_ret ret;
union {
struct {
enum bch_data_type data_type;
struct bbpos pos;
};
struct {
unsigned dev;
u64 offset;
};
};
atomic64_t keys_moved;
atomic64_t keys_raced;
atomic64_t sectors_seen;
atomic64_t sectors_moved;
atomic64_t sectors_raced;
atomic64_t sectors_error_corrected;
atomic64_t sectors_error_uncorrected;
};
struct move_bucket_key {
struct bpos bucket;
u8 gen;
unsigned gen;
};
struct move_bucket {


@ -167,8 +167,8 @@ static int bch2_copygc_get_buckets(struct moving_context *ctxt,
bch2_trans_begin(trans);
ret = for_each_btree_key_max(trans, iter, BTREE_ID_lru,
lru_pos(BCH_LRU_FRAGMENTATION_START, 0, 0),
lru_pos(BCH_LRU_FRAGMENTATION_START, U64_MAX, LRU_TIME_MAX),
lru_pos(BCH_LRU_BUCKET_FRAGMENTATION, 0, 0),
lru_pos(BCH_LRU_BUCKET_FRAGMENTATION, U64_MAX, LRU_TIME_MAX),
0, k, ({
struct move_bucket b = { .k.bucket = u64_to_bucket(k.k->p.offset) };
int ret2 = 0;
@ -317,6 +317,17 @@ void bch2_copygc_wait_to_text(struct printbuf *out, struct bch_fs *c)
prt_printf(out, "Currently calculated wait:\t");
prt_human_readable_u64(out, bch2_copygc_wait_amount(c));
prt_newline(out);
rcu_read_lock();
struct task_struct *t = rcu_dereference(c->copygc_thread);
if (t)
get_task_struct(t);
rcu_read_unlock();
if (t) {
bch2_prt_task_backtrace(out, t, 0, GFP_KERNEL);
put_task_struct(t);
}
}
static int bch2_copygc_thread(void *arg)


@ -4,8 +4,8 @@
#include "acl.h"
#include "btree_update.h"
#include "dirent.h"
#include "fs-common.h"
#include "inode.h"
#include "namei.h"
#include "subvolume.h"
#include "xattr.h"
@ -47,6 +47,10 @@ int bch2_create_trans(struct btree_trans *trans,
if (ret)
goto err;
/* Inherit casefold state from parent. */
if (S_ISDIR(mode))
new_inode->bi_flags |= dir_u->bi_flags & BCH_INODE_casefolded;
if (!(flags & BCH_CREATE_SNAPSHOT)) {
/* Normal create path - allocate a new inode: */
bch2_inode_init_late(new_inode, now, uid, gid, mode, rdev, dir_u);
@ -153,16 +157,14 @@ int bch2_create_trans(struct btree_trans *trans,
dir_u->bi_nlink++;
dir_u->bi_mtime = dir_u->bi_ctime = now;
ret = bch2_inode_write(trans, &dir_iter, dir_u);
if (ret)
goto err;
ret = bch2_dirent_create(trans, dir, &dir_hash,
dir_type,
name,
dir_target,
&dir_offset,
STR_HASH_must_create|BTREE_ITER_with_updates);
ret = bch2_dirent_create(trans, dir, &dir_hash,
dir_type,
name,
dir_target,
&dir_offset,
&dir_u->bi_size,
STR_HASH_must_create|BTREE_ITER_with_updates) ?:
bch2_inode_write(trans, &dir_iter, dir_u);
if (ret)
goto err;
@ -225,7 +227,9 @@ int bch2_link_trans(struct btree_trans *trans,
ret = bch2_dirent_create(trans, dir, &dir_hash,
mode_to_type(inode_u->bi_mode),
name, inum.inum, &dir_offset,
name, inum.inum,
&dir_offset,
&dir_u->bi_size,
STR_HASH_must_create);
if (ret)
goto err;
@ -417,8 +421,8 @@ int bch2_rename_trans(struct btree_trans *trans,
}
ret = bch2_dirent_rename(trans,
src_dir, &src_hash,
dst_dir, &dst_hash,
src_dir, &src_hash, &src_dir_u->bi_size,
dst_dir, &dst_hash, &dst_dir_u->bi_size,
src_name, &src_inum, &src_offset,
dst_name, &dst_inum, &dst_offset,
mode);
@ -560,6 +564,8 @@ err:
return ret;
}
/* inum_to_path */
static inline void prt_bytes_reversed(struct printbuf *out, const void *b, unsigned n)
{
bch2_printbuf_make_room(out, n);
@ -650,3 +656,179 @@ disconnected:
prt_str_reversed(path, "(disconnected)");
goto out;
}
/* fsck */
static int bch2_check_dirent_inode_dirent(struct btree_trans *trans,
struct bkey_s_c_dirent d,
struct bch_inode_unpacked *target,
bool in_fsck)
{
struct bch_fs *c = trans->c;
struct printbuf buf = PRINTBUF;
struct btree_iter bp_iter = { NULL };
int ret = 0;
if (inode_points_to_dirent(target, d))
return 0;
if (!target->bi_dir &&
!target->bi_dir_offset) {
fsck_err_on(S_ISDIR(target->bi_mode),
trans, inode_dir_missing_backpointer,
"directory with missing backpointer\n%s",
(printbuf_reset(&buf),
bch2_bkey_val_to_text(&buf, c, d.s_c),
prt_printf(&buf, "\n"),
bch2_inode_unpacked_to_text(&buf, target),
buf.buf));
fsck_err_on(target->bi_flags & BCH_INODE_unlinked,
trans, inode_unlinked_but_has_dirent,
"inode unlinked but has dirent\n%s",
(printbuf_reset(&buf),
bch2_bkey_val_to_text(&buf, c, d.s_c),
prt_printf(&buf, "\n"),
bch2_inode_unpacked_to_text(&buf, target),
buf.buf));
target->bi_flags &= ~BCH_INODE_unlinked;
target->bi_dir = d.k->p.inode;
target->bi_dir_offset = d.k->p.offset;
return __bch2_fsck_write_inode(trans, target);
}
if (bch2_inode_should_have_single_bp(target) &&
!fsck_err(trans, inode_wrong_backpointer,
"dirent points to inode that does not point back:\n %s",
(bch2_bkey_val_to_text(&buf, c, d.s_c),
prt_printf(&buf, "\n "),
bch2_inode_unpacked_to_text(&buf, target),
buf.buf)))
goto err;
struct bkey_s_c_dirent bp_dirent =
bch2_bkey_get_iter_typed(trans, &bp_iter, BTREE_ID_dirents,
SPOS(target->bi_dir, target->bi_dir_offset, target->bi_snapshot),
0, dirent);
ret = bkey_err(bp_dirent);
if (ret && !bch2_err_matches(ret, ENOENT))
goto err;
bool backpointer_exists = !ret;
ret = 0;
if (!backpointer_exists) {
if (fsck_err(trans, inode_wrong_backpointer,
"inode %llu:%u has wrong backpointer:\n"
"got %llu:%llu\n"
"should be %llu:%llu",
target->bi_inum, target->bi_snapshot,
target->bi_dir,
target->bi_dir_offset,
d.k->p.inode,
d.k->p.offset)) {
target->bi_dir = d.k->p.inode;
target->bi_dir_offset = d.k->p.offset;
ret = __bch2_fsck_write_inode(trans, target);
}
} else {
bch2_bkey_val_to_text(&buf, c, d.s_c);
prt_newline(&buf);
bch2_bkey_val_to_text(&buf, c, bp_dirent.s_c);
if (S_ISDIR(target->bi_mode) || target->bi_subvol) {
/*
* XXX: verify connectivity of the other dirent
* up to the root before removing this one
*
* Additionally, bch2_lookup would need to cope with the
* dirent it found being removed - or should we remove
* the other one, even though the inode points to it?
*/
if (in_fsck) {
if (fsck_err(trans, inode_dir_multiple_links,
"%s %llu:%u with multiple links\n%s",
S_ISDIR(target->bi_mode) ? "directory" : "subvolume",
target->bi_inum, target->bi_snapshot, buf.buf))
ret = bch2_fsck_remove_dirent(trans, d.k->p);
} else {
bch2_fs_inconsistent(c,
"%s %llu:%u with multiple links\n%s",
S_ISDIR(target->bi_mode) ? "directory" : "subvolume",
target->bi_inum, target->bi_snapshot, buf.buf);
}
goto out;
} else {
/*
* hardlinked file with nlink 0:
* We're just adjusting nlink here so check_nlinks() will pick
* it up; it ignores inodes with nlink 0
*/
if (fsck_err_on(!target->bi_nlink,
trans, inode_multiple_links_but_nlink_0,
"inode %llu:%u type %s has multiple links but i_nlink 0\n%s",
target->bi_inum, target->bi_snapshot, bch2_d_types[d.v->d_type], buf.buf)) {
target->bi_nlink++;
target->bi_flags &= ~BCH_INODE_unlinked;
ret = __bch2_fsck_write_inode(trans, target);
if (ret)
goto err;
}
}
}
out:
err:
fsck_err:
bch2_trans_iter_exit(trans, &bp_iter);
printbuf_exit(&buf);
bch_err_fn(c, ret);
return ret;
}
int __bch2_check_dirent_target(struct btree_trans *trans,
struct btree_iter *dirent_iter,
struct bkey_s_c_dirent d,
struct bch_inode_unpacked *target,
bool in_fsck)
{
struct bch_fs *c = trans->c;
struct printbuf buf = PRINTBUF;
int ret = 0;
ret = bch2_check_dirent_inode_dirent(trans, d, target, in_fsck);
if (ret)
goto err;
if (fsck_err_on(d.v->d_type != inode_d_type(target),
trans, dirent_d_type_wrong,
"incorrect d_type: got %s, should be %s:\n%s",
bch2_d_type_str(d.v->d_type),
bch2_d_type_str(inode_d_type(target)),
(printbuf_reset(&buf),
bch2_bkey_val_to_text(&buf, c, d.s_c), buf.buf))) {
struct bkey_i_dirent *n = bch2_trans_kmalloc(trans, bkey_bytes(d.k));
ret = PTR_ERR_OR_ZERO(n);
if (ret)
goto err;
bkey_reassemble(&n->k_i, d.s_c);
n->v.d_type = inode_d_type(target);
if (n->v.d_type == DT_SUBVOL) {
n->v.d_parent_subvol = cpu_to_le32(target->bi_parent_subvol);
n->v.d_child_subvol = cpu_to_le32(target->bi_subvol);
} else {
n->v.d_inum = cpu_to_le64(target->bi_inum);
}
ret = bch2_trans_update(trans, dirent_iter, &n->k_i, 0);
if (ret)
goto err;
}
err:
fsck_err:
printbuf_exit(&buf);
bch_err_fn(c, ret);
return ret;
}


@ -1,6 +1,6 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_FS_COMMON_H
#define _BCACHEFS_FS_COMMON_H
#ifndef _BCACHEFS_NAMEI_H
#define _BCACHEFS_NAMEI_H
#include "dirent.h"
@ -44,4 +44,29 @@ bool bch2_reinherit_attrs(struct bch_inode_unpacked *,
int bch2_inum_to_path(struct btree_trans *, subvol_inum, struct printbuf *);
#endif /* _BCACHEFS_FS_COMMON_H */
int __bch2_check_dirent_target(struct btree_trans *,
struct btree_iter *,
struct bkey_s_c_dirent,
struct bch_inode_unpacked *, bool);
static inline bool inode_points_to_dirent(struct bch_inode_unpacked *inode,
struct bkey_s_c_dirent d)
{
return inode->bi_dir == d.k->p.inode &&
inode->bi_dir_offset == d.k->p.offset;
}
static inline int bch2_check_dirent_target(struct btree_trans *trans,
struct btree_iter *dirent_iter,
struct bkey_s_c_dirent d,
struct bch_inode_unpacked *target,
bool in_fsck)
{
if (likely(inode_points_to_dirent(target, d) &&
d.v->d_type == inode_d_type(target)))
return 0;
return __bch2_check_dirent_target(trans, dirent_iter, d, target, in_fsck);
}
#endif /* _BCACHEFS_NAMEI_H */


@ -163,16 +163,6 @@ const char * const bch2_d_types[BCH_DT_MAX] = {
[DT_SUBVOL] = "subvol",
};
u64 BCH2_NO_SB_OPT(const struct bch_sb *sb)
{
BUG();
}
void SET_BCH2_NO_SB_OPT(struct bch_sb *sb, u64 v)
{
BUG();
}
void bch2_opts_apply(struct bch_opts *dst, struct bch_opts src)
{
#define x(_name, ...) \
@ -223,6 +213,21 @@ void bch2_opt_set_by_id(struct bch_opts *opts, enum bch_opt_id id, u64 v)
}
}
/* dummy option, for options that aren't stored in the superblock */
typedef u64 (*sb_opt_get_fn)(const struct bch_sb *);
typedef void (*sb_opt_set_fn)(struct bch_sb *, u64);
typedef u64 (*member_opt_get_fn)(const struct bch_member *);
typedef void (*member_opt_set_fn)(struct bch_member *, u64);
__maybe_unused static const sb_opt_get_fn BCH2_NO_SB_OPT = NULL;
__maybe_unused static const sb_opt_set_fn SET_BCH2_NO_SB_OPT = NULL;
__maybe_unused static const member_opt_get_fn BCH2_NO_MEMBER_OPT = NULL;
__maybe_unused static const member_opt_set_fn SET_BCH2_NO_MEMBER_OPT = NULL;
#define type_compatible_or_null(_p, _type) \
__builtin_choose_expr( \
__builtin_types_compatible_p(typeof(_p), typeof(_type)), _p, NULL)
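The type_compatible_or_null() macro above is what lets a single x() column in the option table feed four differently-typed callbacks: at compile time the expression keeps _p only when its type matches _type, and silently degrades to NULL otherwise. A standalone demonstration of the same trick (illustration only, with made-up names; not part of the patch):

#include <stdio.h>

typedef int (*int_fn)(int);
typedef int (*float_fn)(float);

static int square(int x) { return x * x; }

/* Stand-ins for BCH2_NO_SB_OPT & co: typed NULLs used only for their type. */
static const int_fn   NO_INT_FN   = NULL;
static const float_fn NO_FLOAT_FN = NULL;

#define compatible_or_null(_p, _type) \
	__builtin_choose_expr( \
		__builtin_types_compatible_p(typeof(_p), typeof(_type)), _p, NULL)

int main(void)
{
	/* square matches int(int), so this slot gets the real function... */
	int_fn   as_int   = compatible_or_null(square, *NO_INT_FN);
	/* ...while the incompatible slot quietly becomes NULL: */
	float_fn as_float = compatible_or_null(square, *NO_FLOAT_FN);

	printf("as_int(7) = %d, as_float is %s\n",
	       as_int ? as_int(7) : -1, as_float ? "set" : "NULL");
	return 0;
}

In bch2_opt_table below this means, for example, that BCH_SB_BLOCK_SIZE lands in .get_sb with .get_member = NULL, while a per-member field like BCH_MEMBER_DURABILITY lands in .get_member with .get_sb = NULL.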
const struct bch_option bch2_opt_table[] = {
#define OPT_BOOL() .type = BCH_OPT_BOOL, .min = 0, .max = 2
#define OPT_UINT(_min, _max) .type = BCH_OPT_UINT, \
@ -239,15 +244,15 @@ const struct bch_option bch2_opt_table[] = {
#define x(_name, _bits, _flags, _type, _sb_opt, _default, _hint, _help) \
[Opt_##_name] = { \
.attr = { \
.name = #_name, \
.mode = (_flags) & OPT_RUNTIME ? 0644 : 0444, \
}, \
.flags = _flags, \
.hint = _hint, \
.help = _help, \
.get_sb = _sb_opt, \
.set_sb = SET_##_sb_opt, \
.attr.name = #_name, \
.attr.mode = (_flags) & OPT_RUNTIME ? 0644 : 0444, \
.flags = _flags, \
.hint = _hint, \
.help = _help, \
.get_sb = type_compatible_or_null(_sb_opt, *BCH2_NO_SB_OPT), \
.set_sb = type_compatible_or_null(SET_##_sb_opt,*SET_BCH2_NO_SB_OPT), \
.get_member = type_compatible_or_null(_sb_opt, *BCH2_NO_MEMBER_OPT), \
.set_member = type_compatible_or_null(SET_##_sb_opt,*SET_BCH2_NO_MEMBER_OPT),\
_type \
},
@ -475,11 +480,18 @@ void bch2_opts_to_text(struct printbuf *out,
}
}
int bch2_opt_check_may_set(struct bch_fs *c, int id, u64 v)
int bch2_opt_check_may_set(struct bch_fs *c, struct bch_dev *ca, int id, u64 v)
{
lockdep_assert_held(&c->state_lock);
int ret = 0;
switch (id) {
case Opt_state:
if (ca)
return __bch2_dev_set_state(c, ca, v, BCH_FORCE_IF_DEGRADED);
break;
case Opt_compression:
case Opt_background_compression:
ret = bch2_check_set_has_compressed_data(c, v);
@ -495,12 +507,8 @@ int bch2_opt_check_may_set(struct bch_fs *c, int id, u64 v)
int bch2_opts_check_may_set(struct bch_fs *c)
{
unsigned i;
int ret;
for (i = 0; i < bch2_opts_nr; i++) {
ret = bch2_opt_check_may_set(c, i,
bch2_opt_get_by_id(&c->opts, i));
for (unsigned i = 0; i < bch2_opts_nr; i++) {
int ret = bch2_opt_check_may_set(c, NULL, i, bch2_opt_get_by_id(&c->opts, i));
if (ret)
return ret;
}
@ -619,12 +627,25 @@ out:
return ret;
}
u64 bch2_opt_from_sb(struct bch_sb *sb, enum bch_opt_id id)
u64 bch2_opt_from_sb(struct bch_sb *sb, enum bch_opt_id id, int dev_idx)
{
const struct bch_option *opt = bch2_opt_table + id;
u64 v;
v = opt->get_sb(sb);
if (dev_idx < 0) {
v = opt->get_sb(sb);
} else {
if (WARN(!bch2_member_exists(sb, dev_idx),
"tried to set device option %s on nonexistent device %i",
opt->attr.name, dev_idx))
return 0;
struct bch_member m = bch2_sb_member_get(sb, dev_idx);
v = opt->get_member(&m);
}
if (opt->flags & OPT_SB_FIELD_ONE_BIAS)
--v;
if (opt->flags & OPT_SB_FIELD_ILOG2)
v = 1ULL << v;
@ -641,35 +662,19 @@ u64 bch2_opt_from_sb(struct bch_sb *sb, enum bch_opt_id id)
*/
int bch2_opts_from_sb(struct bch_opts *opts, struct bch_sb *sb)
{
unsigned id;
for (id = 0; id < bch2_opts_nr; id++) {
for (unsigned id = 0; id < bch2_opts_nr; id++) {
const struct bch_option *opt = bch2_opt_table + id;
if (opt->get_sb == BCH2_NO_SB_OPT)
continue;
bch2_opt_set_by_id(opts, id, bch2_opt_from_sb(sb, id));
if (opt->get_sb)
bch2_opt_set_by_id(opts, id, bch2_opt_from_sb(sb, id, -1));
}
return 0;
}
struct bch_dev_sb_opt_set {
void (*set_sb)(struct bch_member *, u64);
};
static const struct bch_dev_sb_opt_set bch2_dev_sb_opt_setters [] = {
#define x(n, set) [Opt_##n] = { .set_sb = SET_##set },
BCH_DEV_OPT_SETTERS()
#undef x
};
void __bch2_opt_set_sb(struct bch_sb *sb, int dev_idx,
const struct bch_option *opt, u64 v)
{
enum bch_opt_id id = opt - bch2_opt_table;
if (opt->flags & OPT_SB_FIELD_SECTORS)
v >>= 9;
@ -679,24 +684,16 @@ void __bch2_opt_set_sb(struct bch_sb *sb, int dev_idx,
if (opt->flags & OPT_SB_FIELD_ONE_BIAS)
v++;
if (opt->flags & OPT_FS) {
if (opt->set_sb != SET_BCH2_NO_SB_OPT)
opt->set_sb(sb, v);
}
if ((opt->flags & OPT_FS) && opt->set_sb && dev_idx < 0)
opt->set_sb(sb, v);
if ((opt->flags & OPT_DEVICE) && dev_idx >= 0) {
if ((opt->flags & OPT_DEVICE) && opt->set_member && dev_idx >= 0) {
if (WARN(!bch2_member_exists(sb, dev_idx),
"tried to set device option %s on nonexistent device %i",
opt->attr.name, dev_idx))
return;
struct bch_member *m = bch2_members_v2_get_mut(sb, dev_idx);
const struct bch_dev_sb_opt_set *set = bch2_dev_sb_opt_setters + id;
if (set->set_sb)
set->set_sb(m, v);
else
pr_err("option %s cannot be set via opt_set_sb()", opt->attr.name);
opt->set_member(bch2_members_v2_get_mut(sb, dev_idx), v);
}
}
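A worked example of the transforms above (a sketch, not part of the patch; real callers go through the option-setting paths that hold the appropriate locks). durability is declared with OPT_SB_FIELD_ONE_BIAS, so the member field stores the value plus one, presumably so a raw 0 can still mean "unset", and bch2_opt_from_sb() undoes the bias on the way back out:

static void __maybe_unused example_set_durability(struct bch_sb *sb, int dev_idx)
{
	const struct bch_option *opt = bch2_opt_table + Opt_durability;

	/* ONE_BIAS: stores 2 in BCH_MEMBER_DURABILITY for this member... */
	__bch2_opt_set_sb(sb, dev_idx, opt, 1);

	/* ...and reading it back subtracts the bias again, returning 1: */
	u64 v = bch2_opt_from_sb(sb, Opt_durability, dev_idx);
	(void) v;
}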


@ -50,10 +50,6 @@ static inline const char *bch2_d_type_str(unsigned d_type)
* apply the options from that struct that are defined.
*/
/* dummy option, for options that aren't stored in the superblock */
u64 BCH2_NO_SB_OPT(const struct bch_sb *);
void SET_BCH2_NO_SB_OPT(struct bch_sb *, u64);
/* When can be set: */
enum opt_flags {
OPT_FS = BIT(0), /* Filesystem option */
@ -132,19 +128,24 @@ enum fsck_err_opts {
OPT_FS|OPT_FORMAT| \
OPT_HUMAN_READABLE|OPT_MUST_BE_POW_2|OPT_SB_FIELD_SECTORS, \
OPT_UINT(512, 1U << 16), \
BCH_SB_BLOCK_SIZE, 8, \
BCH_SB_BLOCK_SIZE, 4 << 10, \
"size", NULL) \
x(btree_node_size, u32, \
OPT_FS|OPT_FORMAT| \
OPT_HUMAN_READABLE|OPT_MUST_BE_POW_2|OPT_SB_FIELD_SECTORS, \
OPT_UINT(512, 1U << 20), \
BCH_SB_BTREE_NODE_SIZE, 512, \
BCH_SB_BTREE_NODE_SIZE, 256 << 10, \
"size", "Btree node size, default 256k") \
x(errors, u8, \
OPT_FS|OPT_FORMAT|OPT_MOUNT|OPT_RUNTIME, \
OPT_STR(bch2_error_actions), \
BCH_SB_ERROR_ACTION, BCH_ON_ERROR_fix_safe, \
NULL, "Action to take on filesystem error") \
x(write_error_timeout, u16, \
OPT_FS|OPT_FORMAT|OPT_MOUNT|OPT_RUNTIME, \
OPT_UINT(1, 300), \
BCH_SB_WRITE_ERROR_TIMEOUT, 30, \
NULL, "Number of consecutive write errors allowed before kicking out a device")\
x(metadata_replicas, u8, \
OPT_FS|OPT_FORMAT|OPT_MOUNT|OPT_RUNTIME, \
OPT_UINT(1, BCH_REPLICAS_MAX), \
@ -181,6 +182,11 @@ enum fsck_err_opts {
OPT_STR(__bch2_csum_opts), \
BCH_SB_DATA_CSUM_TYPE, BCH_CSUM_OPT_crc32c, \
NULL, NULL) \
x(checksum_err_retry_nr, u8, \
OPT_FS|OPT_FORMAT|OPT_MOUNT|OPT_RUNTIME, \
OPT_UINT(0, 32), \
BCH_SB_CSUM_ERR_RETRY_NR, 3, \
NULL, NULL) \
x(compression, u8, \
OPT_FS|OPT_INODE|OPT_FORMAT|OPT_MOUNT|OPT_RUNTIME, \
OPT_FN(bch2_opt_compression), \
@ -197,7 +203,7 @@ enum fsck_err_opts {
BCH_SB_STR_HASH_TYPE, BCH_STR_HASH_OPT_siphash, \
NULL, "Hash function for directory entries and xattrs")\
x(metadata_target, u16, \
OPT_FS|OPT_INODE|OPT_FORMAT|OPT_MOUNT|OPT_RUNTIME, \
OPT_FS|OPT_FORMAT|OPT_MOUNT|OPT_RUNTIME, \
OPT_FN(bch2_opt_target), \
BCH_SB_METADATA_TARGET, 0, \
"(target)", "Device or label for metadata writes") \
@ -308,11 +314,6 @@ enum fsck_err_opts {
OPT_BOOL(), \
BCH2_NO_SB_OPT, false, \
NULL, "Don't kick drives out when splitbrain detected")\
x(discard, u8, \
OPT_FS|OPT_MOUNT|OPT_DEVICE, \
OPT_BOOL(), \
BCH2_NO_SB_OPT, true, \
NULL, "Enable discard/TRIM support") \
x(verbose, u8, \
OPT_FS|OPT_MOUNT|OPT_RUNTIME, \
OPT_BOOL(), \
@ -493,27 +494,32 @@ enum fsck_err_opts {
BCH2_NO_SB_OPT, false, \
NULL, "Skip submit_bio() for data reads and writes, " \
"for performance testing purposes") \
x(fs_size, u64, \
OPT_DEVICE, \
x(state, u64, \
OPT_DEVICE|OPT_RUNTIME, \
OPT_STR(bch2_member_states), \
BCH_MEMBER_STATE, BCH_MEMBER_STATE_rw, \
"state", "rw,ro,failed,spare") \
x(bucket_size, u32, \
OPT_DEVICE|OPT_HUMAN_READABLE|OPT_SB_FIELD_SECTORS, \
OPT_UINT(0, S64_MAX), \
BCH2_NO_SB_OPT, 0, \
"size", "Size of filesystem on device") \
x(bucket, u32, \
OPT_DEVICE, \
OPT_UINT(0, S64_MAX), \
BCH2_NO_SB_OPT, 0, \
BCH_MEMBER_BUCKET_SIZE, 0, \
"size", "Specifies the bucket size; must be greater than the btree node size")\
x(durability, u8, \
OPT_DEVICE|OPT_SB_FIELD_ONE_BIAS, \
OPT_DEVICE|OPT_RUNTIME|OPT_SB_FIELD_ONE_BIAS, \
OPT_UINT(0, BCH_REPLICAS_MAX), \
BCH2_NO_SB_OPT, 1, \
BCH_MEMBER_DURABILITY, 1, \
"n", "Data written to this device will be considered\n"\
"to have already been replicated n times") \
x(data_allowed, u8, \
OPT_DEVICE, \
OPT_BITFIELD(__bch2_data_types), \
BCH2_NO_SB_OPT, BIT(BCH_DATA_journal)|BIT(BCH_DATA_btree)|BIT(BCH_DATA_user),\
BCH_MEMBER_DATA_ALLOWED, BIT(BCH_DATA_journal)|BIT(BCH_DATA_btree)|BIT(BCH_DATA_user),\
"types", "Allowed data types for this device: journal, btree, and/or user")\
x(discard, u8, \
OPT_MOUNT|OPT_DEVICE|OPT_RUNTIME, \
OPT_BOOL(), \
BCH_MEMBER_DISCARD, true, \
NULL, "Enable discard/TRIM support") \
x(btree_node_prefetch, u8, \
OPT_FS|OPT_MOUNT|OPT_RUNTIME, \
OPT_BOOL(), \
@ -521,11 +527,6 @@ enum fsck_err_opts {
NULL, "BTREE_ITER_prefetch casuse btree nodes to be\n"\
" prefetched sequentially")
#define BCH_DEV_OPT_SETTERS() \
x(discard, BCH_MEMBER_DISCARD) \
x(durability, BCH_MEMBER_DURABILITY) \
x(data_allowed, BCH_MEMBER_DATA_ALLOWED)
struct bch_opts {
#define x(_name, _bits, ...) unsigned _name##_defined:1;
BCH_OPTS()
@ -582,8 +583,6 @@ struct printbuf;
struct bch_option {
struct attribute attr;
u64 (*get_sb)(const struct bch_sb *);
void (*set_sb)(struct bch_sb *, u64);
enum opt_type type;
enum opt_flags flags;
u64 min, max;
@ -595,6 +594,12 @@ struct bch_option {
const char *hint;
const char *help;
u64 (*get_sb)(const struct bch_sb *);
void (*set_sb)(struct bch_sb *, u64);
u64 (*get_member)(const struct bch_member *);
void (*set_member)(struct bch_member *, u64);
};
extern const struct bch_option bch2_opt_table[];
@ -603,7 +608,7 @@ bool bch2_opt_defined_by_id(const struct bch_opts *, enum bch_opt_id);
u64 bch2_opt_get_by_id(const struct bch_opts *, enum bch_opt_id);
void bch2_opt_set_by_id(struct bch_opts *, enum bch_opt_id, u64);
u64 bch2_opt_from_sb(struct bch_sb *, enum bch_opt_id);
u64 bch2_opt_from_sb(struct bch_sb *, enum bch_opt_id, int);
int bch2_opts_from_sb(struct bch_opts *, struct bch_sb *);
void __bch2_opt_set_sb(struct bch_sb *, int, const struct bch_option *, u64);
@ -625,7 +630,7 @@ void bch2_opts_to_text(struct printbuf *,
struct bch_fs *, struct bch_sb *,
unsigned, unsigned, unsigned);
int bch2_opt_check_may_set(struct bch_fs *, int, u64);
int bch2_opt_check_may_set(struct bch_fs *, struct bch_dev *, int, u64);
int bch2_opts_check_may_set(struct bch_fs *);
int bch2_parse_one_mount_opt(struct bch_fs *, struct bch_opts *,
struct printbuf *, const char *, const char *);

fs/bcachefs/progress.c (new file, 63 lines)

@ -0,0 +1,63 @@
// SPDX-License-Identifier: GPL-2.0
#include "bcachefs.h"
#include "bbpos.h"
#include "disk_accounting.h"
#include "progress.h"
void bch2_progress_init(struct progress_indicator_state *s,
struct bch_fs *c,
u64 btree_id_mask)
{
memset(s, 0, sizeof(*s));
s->next_print = jiffies + HZ * 10;
for (unsigned i = 0; i < BTREE_ID_NR; i++) {
if (!(btree_id_mask & BIT_ULL(i)))
continue;
struct disk_accounting_pos acc = {
.type = BCH_DISK_ACCOUNTING_btree,
.btree.id = i,
};
u64 v;
bch2_accounting_mem_read(c, disk_accounting_pos_to_bpos(&acc), &v, 1);
s->nodes_total += div64_ul(v, btree_sectors(c));
}
}
static inline bool progress_update_p(struct progress_indicator_state *s)
{
bool ret = time_after_eq(jiffies, s->next_print);
if (ret)
s->next_print = jiffies + HZ * 10;
return ret;
}
void bch2_progress_update_iter(struct btree_trans *trans,
struct progress_indicator_state *s,
struct btree_iter *iter,
const char *msg)
{
struct bch_fs *c = trans->c;
struct btree *b = path_l(btree_iter_path(trans, iter))->b;
s->nodes_seen += b != s->last_node;
s->last_node = b;
if (progress_update_p(s)) {
struct printbuf buf = PRINTBUF;
unsigned percent = s->nodes_total
? div64_u64(s->nodes_seen * 100, s->nodes_total)
: 0;
prt_printf(&buf, "%s: %d%%, done %llu/%llu nodes, at ",
msg, percent, s->nodes_seen, s->nodes_total);
bch2_bbpos_to_text(&buf, BBPOS(iter->btree_id, iter->pos));
bch_info(c, "%s", buf.buf);
printbuf_exit(&buf);
}
}

fs/bcachefs/progress.h (new file, 29 lines)

@ -0,0 +1,29 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_PROGRESS_H
#define _BCACHEFS_PROGRESS_H
/*
* Lame progress indicators
*
* We don't like to use these because they print to the dmesg console, which is
* spammy - we much prefer to be wired up to a userspace program (e.g. via
* thread_with_file) and have it print the progress indicator.
*
* But some code is old and doesn't support that, or runs in a context where
* that's not yet practical (mount).
*/
struct progress_indicator_state {
unsigned long next_print;
u64 nodes_seen;
u64 nodes_total;
struct btree *last_node;
};
void bch2_progress_init(struct progress_indicator_state *, struct bch_fs *, u64);
void bch2_progress_update_iter(struct btree_trans *,
struct progress_indicator_state *,
struct btree_iter *,
const char *);
#endif /* _BCACHEFS_PROGRESS_H */
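A minimal usage sketch for the two helpers above (hypothetical caller, not part of the patch): count the nodes of the btrees you intend to walk up front, then poke the indicator from inside the iteration so it can log every ten seconds:

static int example_walk_extents(struct bch_fs *c)
{
	struct progress_indicator_state progress;

	bch2_progress_init(&progress, c, BIT_ULL(BTREE_ID_extents));

	return bch2_trans_run(c,
		for_each_btree_key(trans, iter, BTREE_ID_extents, POS_MIN,
				   BTREE_ITER_prefetch, k, ({
			bch2_progress_update_iter(trans, &progress, &iter,
						  "example_walk_extents");
			0;
		})));
}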


@ -26,9 +26,8 @@
/* bch_extent_rebalance: */
static const struct bch_extent_rebalance *bch2_bkey_rebalance_opts(struct bkey_s_c k)
static const struct bch_extent_rebalance *bch2_bkey_ptrs_rebalance_opts(struct bkey_ptrs_c ptrs)
{
struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k);
const union bch_extent_entry *entry;
bkey_extent_entry_for_each(ptrs, entry)
@ -38,6 +37,11 @@ static const struct bch_extent_rebalance *bch2_bkey_rebalance_opts(struct bkey_s
return NULL;
}
static const struct bch_extent_rebalance *bch2_bkey_rebalance_opts(struct bkey_s_c k)
{
return bch2_bkey_ptrs_rebalance_opts(bch2_bkey_ptrs_c(k));
}
static inline unsigned bch2_bkey_ptrs_need_compress(struct bch_fs *c,
struct bch_io_opts *opts,
struct bkey_s_c k,
@ -97,11 +101,12 @@ static unsigned bch2_bkey_ptrs_need_rebalance(struct bch_fs *c,
u64 bch2_bkey_sectors_need_rebalance(struct bch_fs *c, struct bkey_s_c k)
{
const struct bch_extent_rebalance *opts = bch2_bkey_rebalance_opts(k);
struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k);
const struct bch_extent_rebalance *opts = bch2_bkey_ptrs_rebalance_opts(ptrs);
if (!opts)
return 0;
struct bkey_ptrs_c ptrs = bch2_bkey_ptrs_c(k);
const union bch_extent_entry *entry;
struct extent_ptr_decoded p;
u64 sectors = 0;
@ -341,7 +346,7 @@ static struct bkey_s_c next_rebalance_extent(struct btree_trans *trans,
memset(data_opts, 0, sizeof(*data_opts));
data_opts->rewrite_ptrs = bch2_bkey_ptrs_need_rebalance(c, io_opts, k);
data_opts->target = io_opts->background_target;
data_opts->write_flags |= BCH_WRITE_ONLY_SPECIFIED_DEVS;
data_opts->write_flags |= BCH_WRITE_only_specified_devs;
if (!data_opts->rewrite_ptrs) {
/*
@ -449,7 +454,7 @@ static bool rebalance_pred(struct bch_fs *c, void *arg,
{
data_opts->rewrite_ptrs = bch2_bkey_ptrs_need_rebalance(c, io_opts, k);
data_opts->target = io_opts->background_target;
data_opts->write_flags |= BCH_WRITE_ONLY_SPECIFIED_DEVS;
data_opts->write_flags |= BCH_WRITE_only_specified_devs;
return data_opts->rewrite_ptrs != 0;
}
@ -590,8 +595,19 @@ static int bch2_rebalance_thread(void *arg)
void bch2_rebalance_status_to_text(struct printbuf *out, struct bch_fs *c)
{
printbuf_tabstop_push(out, 32);
struct bch_fs_rebalance *r = &c->rebalance;
/* print pending work */
struct disk_accounting_pos acc = { .type = BCH_DISK_ACCOUNTING_rebalance_work, };
u64 v;
bch2_accounting_mem_read(c, disk_accounting_pos_to_bpos(&acc), &v, 1);
prt_printf(out, "pending work:\t");
prt_human_readable_u64(out, v);
prt_printf(out, "\n\n");
prt_str(out, bch2_rebalance_state_strs[r->state]);
prt_newline(out);
printbuf_indent_add(out, 2);
@ -600,15 +616,15 @@ void bch2_rebalance_status_to_text(struct printbuf *out, struct bch_fs *c)
case BCH_REBALANCE_waiting: {
u64 now = atomic64_read(&c->io_clock[WRITE].now);
prt_str(out, "io wait duration: ");
prt_printf(out, "io wait duration:\t");
bch2_prt_human_readable_s64(out, (r->wait_iotime_end - r->wait_iotime_start) << 9);
prt_newline(out);
prt_str(out, "io wait remaining: ");
prt_printf(out, "io wait remaining:\t");
bch2_prt_human_readable_s64(out, (r->wait_iotime_end - now) << 9);
prt_newline(out);
prt_str(out, "duration waited: ");
prt_printf(out, "duration waited:\t");
bch2_pr_time_units(out, ktime_get_real_ns() - r->wait_wallclock_start);
prt_newline(out);
break;
@ -621,6 +637,18 @@ void bch2_rebalance_status_to_text(struct printbuf *out, struct bch_fs *c)
break;
}
prt_newline(out);
rcu_read_lock();
struct task_struct *t = rcu_dereference(c->rebalance.thread);
if (t)
get_task_struct(t);
rcu_read_unlock();
if (t) {
bch2_prt_task_backtrace(out, t, 0, GFP_KERNEL);
put_task_struct(t);
}
printbuf_indent_sub(out, 2);
}


@ -13,12 +13,12 @@
#include "disk_accounting.h"
#include "errcode.h"
#include "error.h"
#include "fs-common.h"
#include "journal_io.h"
#include "journal_reclaim.h"
#include "journal_seq_blacklist.h"
#include "logged_ops.h"
#include "move.h"
#include "namei.h"
#include "quota.h"
#include "rebalance.h"
#include "recovery.h"
@ -899,7 +899,7 @@ use_clean:
* journal sequence numbers:
*/
if (!c->sb.clean)
journal_seq += 8;
journal_seq += JOURNAL_BUF_NR * 4;
if (blacklist_seq != journal_seq) {
ret = bch2_journal_log_msg(c, "blacklisting entries %llu-%llu",


@ -24,7 +24,7 @@
x(check_topology, 4, 0) \
x(accounting_read, 39, PASS_ALWAYS) \
x(alloc_read, 0, PASS_ALWAYS) \
x(stripes_read, 1, PASS_ALWAYS) \
x(stripes_read, 1, 0) \
x(initialize_subvolumes, 2, 0) \
x(snapshots_read, 3, PASS_ALWAYS) \
x(check_allocations, 5, PASS_FSCK) \


@ -185,12 +185,21 @@ static int bch2_indirect_extent_missing_error(struct btree_trans *trans,
BUG_ON(missing_start < refd_start);
BUG_ON(missing_end > refd_end);
if (fsck_err(trans, reflink_p_to_missing_reflink_v,
"pointer to missing indirect extent\n"
" %s\n"
" missing range %llu-%llu",
(bch2_bkey_val_to_text(&buf, c, p.s_c), buf.buf),
missing_start, missing_end)) {
struct bpos missing_pos = bkey_start_pos(p.k);
missing_pos.offset += missing_start - live_start;
prt_printf(&buf, "pointer to missing indirect extent in ");
ret = bch2_inum_snap_offset_err_msg_trans(trans, &buf, missing_pos);
if (ret)
goto err;
prt_printf(&buf, "-%llu\n ", (missing_pos.offset + (missing_end - missing_start)) << 9);
bch2_bkey_val_to_text(&buf, c, p.s_c);
prt_printf(&buf, "\n missing reflink btree range %llu-%llu",
missing_start, missing_end);
if (fsck_err(trans, reflink_p_to_missing_reflink_v, "%s", buf.buf)) {
struct bkey_i_reflink_p *new = bch2_bkey_make_mut_noupdate_typed(trans, p.s_c, reflink_p);
ret = PTR_ERR_OR_ZERO(new);
if (ret)
@ -597,7 +606,7 @@ s64 bch2_remap_range(struct bch_fs *c,
u64 dst_done = 0;
u32 dst_snapshot, src_snapshot;
bool reflink_p_may_update_opts_field =
bch2_request_incompat_feature(c, bcachefs_metadata_version_reflink_p_may_update_opts);
!bch2_request_incompat_feature(c, bcachefs_metadata_version_reflink_p_may_update_opts);
int ret = 0, ret2 = 0;
if (!bch2_write_ref_tryget(c, BCH_WRITE_REF_reflink))


@ -5,7 +5,13 @@
/* BCH_SB_FIELD_counters */
static const char * const bch2_counter_names[] = {
static const u8 counters_to_stable_map[] = {
#define x(n, id, ...) [BCH_COUNTER_##n] = BCH_COUNTER_STABLE_##n,
BCH_PERSISTENT_COUNTERS()
#undef x
};
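The map above decouples the in-memory counter indices (which follow the order of the x() list, and are reordered in this series) from the stable on-disk slots (the explicit numbers in the list). For example, io_read_promote is declared with stable id 30, so no matter where it sits in the list its superblock value lives in d[30]. A sketch of a direct read, as it would look inside this file (illustration only; the real helpers additionally bounds-check the slot against bch2_sb_counter_nr_entries()):

static inline u64 __maybe_unused
example_read_io_read_promote(struct bch_sb_field_counters *ctrs)
{
	return le64_to_cpu(ctrs->d[counters_to_stable_map[BCH_COUNTER_io_read_promote]]);
}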
const char * const bch2_counter_names[] = {
#define x(t, n, ...) (#t),
BCH_PERSISTENT_COUNTERS()
#undef x
@ -18,13 +24,13 @@ static size_t bch2_sb_counter_nr_entries(struct bch_sb_field_counters *ctrs)
return 0;
return (__le64 *) vstruct_end(&ctrs->field) - &ctrs->d[0];
};
}
static int bch2_sb_counters_validate(struct bch_sb *sb, struct bch_sb_field *f,
enum bch_validate_flags flags, struct printbuf *err)
{
return 0;
};
}
static void bch2_sb_counters_to_text(struct printbuf *out, struct bch_sb *sb,
struct bch_sb_field *f)
@ -32,50 +38,56 @@ static void bch2_sb_counters_to_text(struct printbuf *out, struct bch_sb *sb,
struct bch_sb_field_counters *ctrs = field_to_type(f, counters);
unsigned int nr = bch2_sb_counter_nr_entries(ctrs);
for (unsigned i = 0; i < nr; i++)
prt_printf(out, "%s \t%llu\n",
i < BCH_COUNTER_NR ? bch2_counter_names[i] : "(unknown)",
le64_to_cpu(ctrs->d[i]));
};
for (unsigned i = 0; i < BCH_COUNTER_NR; i++) {
unsigned stable = counters_to_stable_map[i];
if (stable < nr)
prt_printf(out, "%s \t%llu\n",
bch2_counter_names[i],
le64_to_cpu(ctrs->d[stable]));
}
}
int bch2_sb_counters_to_cpu(struct bch_fs *c)
{
struct bch_sb_field_counters *ctrs = bch2_sb_field_get(c->disk_sb.sb, counters);
unsigned int i;
unsigned int nr = bch2_sb_counter_nr_entries(ctrs);
u64 val = 0;
for (i = 0; i < BCH_COUNTER_NR; i++)
for (unsigned i = 0; i < BCH_COUNTER_NR; i++)
c->counters_on_mount[i] = 0;
for (i = 0; i < min_t(unsigned int, nr, BCH_COUNTER_NR); i++) {
val = le64_to_cpu(ctrs->d[i]);
percpu_u64_set(&c->counters[i], val);
c->counters_on_mount[i] = val;
for (unsigned i = 0; i < BCH_COUNTER_NR; i++) {
unsigned stable = counters_to_stable_map[i];
if (stable < nr) {
u64 v = le64_to_cpu(ctrs->d[stable]);
percpu_u64_set(&c->counters[i], v);
c->counters_on_mount[i] = v;
}
}
return 0;
};
}
int bch2_sb_counters_from_cpu(struct bch_fs *c)
{
struct bch_sb_field_counters *ctrs = bch2_sb_field_get(c->disk_sb.sb, counters);
struct bch_sb_field_counters *ret;
unsigned int i;
unsigned int nr = bch2_sb_counter_nr_entries(ctrs);
if (nr < BCH_COUNTER_NR) {
ret = bch2_sb_field_resize(&c->disk_sb, counters,
sizeof(*ctrs) / sizeof(u64) + BCH_COUNTER_NR);
sizeof(*ctrs) / sizeof(u64) + BCH_COUNTER_NR);
if (ret) {
ctrs = ret;
nr = bch2_sb_counter_nr_entries(ctrs);
}
}
for (unsigned i = 0; i < BCH_COUNTER_NR; i++) {
unsigned stable = counters_to_stable_map[i];
if (stable < nr)
ctrs->d[stable] = cpu_to_le64(percpu_u64_get(&c->counters[i]));
}
for (i = 0; i < min_t(unsigned int, nr, BCH_COUNTER_NR); i++)
ctrs->d[i] = cpu_to_le64(percpu_u64_get(&c->counters[i]));
return 0;
}
@ -97,3 +109,39 @@ const struct bch_sb_field_ops bch_sb_field_ops_counters = {
.validate = bch2_sb_counters_validate,
.to_text = bch2_sb_counters_to_text,
};
#ifndef NO_BCACHEFS_CHARDEV
long bch2_ioctl_query_counters(struct bch_fs *c,
struct bch_ioctl_query_counters __user *user_arg)
{
struct bch_ioctl_query_counters arg;
int ret = copy_from_user_errcode(&arg, user_arg, sizeof(arg));
if (ret)
return ret;
if ((arg.flags & ~BCH_IOCTL_QUERY_COUNTERS_MOUNT) ||
arg.pad)
return -EINVAL;
arg.nr = min(arg.nr, BCH_COUNTER_NR);
ret = put_user(arg.nr, &user_arg->nr);
if (ret)
return ret;
for (unsigned i = 0; i < BCH_COUNTER_NR; i++) {
unsigned stable = counters_to_stable_map[i];
if (stable < arg.nr) {
u64 v = !(arg.flags & BCH_IOCTL_QUERY_COUNTERS_MOUNT)
? percpu_u64_get(&c->counters[i])
: c->counters_on_mount[i];
ret = put_user(v, &user_arg->d[stable]);
if (ret)
return ret;
}
}
return 0;
}
#endif
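For completeness, a userspace sketch of driving this ioctl. Everything here beyond the field names visible in the handler above is an assumption: the struct layout and request number come from bcachefs_ioctl.h, d[] is assumed to be a trailing array sized by the caller, and how fs_fd is obtained is outside this diff:

#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include "bcachefs_ioctl.h"	/* assumed: defines the struct + BCH_IOCTL_QUERY_COUNTERS */

static int print_counters(int fs_fd, unsigned nr)
{
	/* calloc also zeroes .pad, which the handler insists on: */
	struct bch_ioctl_query_counters *arg =
		calloc(1, sizeof(*arg) + nr * sizeof(arg->d[0]));
	if (!arg)
		return -1;

	arg->nr    = nr;
	arg->flags = 0;	/* or BCH_IOCTL_QUERY_COUNTERS_MOUNT for the values as of mount time */

	int ret = ioctl(fs_fd, BCH_IOCTL_QUERY_COUNTERS, arg);
	if (!ret)
		for (unsigned i = 0; i < arg->nr; i++)
			printf("counter slot %u: %llu\n", i,
			       (unsigned long long) arg->d[i]);

	free(arg);
	return ret;
}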


@ -11,6 +11,10 @@ int bch2_sb_counters_from_cpu(struct bch_fs *);
void bch2_fs_counters_exit(struct bch_fs *);
int bch2_fs_counters_init(struct bch_fs *);
extern const char * const bch2_counter_names[];
extern const struct bch_sb_field_ops bch_sb_field_ops_counters;
long bch2_ioctl_query_counters(struct bch_fs *,
struct bch_ioctl_query_counters __user *);
#endif // _BCACHEFS_SB_COUNTERS_H


@ -9,10 +9,24 @@ enum counters_flags {
#define BCH_PERSISTENT_COUNTERS() \
x(io_read, 0, TYPE_SECTORS) \
x(io_read_inline, 80, TYPE_SECTORS) \
x(io_read_hole, 81, TYPE_SECTORS) \
x(io_read_promote, 30, TYPE_COUNTER) \
x(io_read_bounce, 31, TYPE_COUNTER) \
x(io_read_split, 33, TYPE_COUNTER) \
x(io_read_reuse_race, 34, TYPE_COUNTER) \
x(io_read_retry, 32, TYPE_COUNTER) \
x(io_write, 1, TYPE_SECTORS) \
x(io_move, 2, TYPE_SECTORS) \
x(io_move_read, 35, TYPE_SECTORS) \
x(io_move_write, 36, TYPE_SECTORS) \
x(io_move_finish, 37, TYPE_SECTORS) \
x(io_move_fail, 38, TYPE_COUNTER) \
x(io_move_write_fail, 82, TYPE_COUNTER) \
x(io_move_start_fail, 39, TYPE_COUNTER) \
x(bucket_invalidate, 3, TYPE_COUNTER) \
x(bucket_discard, 4, TYPE_COUNTER) \
x(bucket_discard_fast, 79, TYPE_COUNTER) \
x(bucket_alloc, 5, TYPE_COUNTER) \
x(bucket_alloc_fail, 6, TYPE_COUNTER) \
x(btree_cache_scan, 7, TYPE_COUNTER) \
@ -38,16 +52,6 @@ enum counters_flags {
x(journal_reclaim_finish, 27, TYPE_COUNTER) \
x(journal_reclaim_start, 28, TYPE_COUNTER) \
x(journal_write, 29, TYPE_COUNTER) \
x(read_promote, 30, TYPE_COUNTER) \
x(read_bounce, 31, TYPE_COUNTER) \
x(read_split, 33, TYPE_COUNTER) \
x(read_retry, 32, TYPE_COUNTER) \
x(read_reuse_race, 34, TYPE_COUNTER) \
x(move_extent_read, 35, TYPE_SECTORS) \
x(move_extent_write, 36, TYPE_SECTORS) \
x(move_extent_finish, 37, TYPE_SECTORS) \
x(move_extent_fail, 38, TYPE_COUNTER) \
x(move_extent_start_fail, 39, TYPE_COUNTER) \
x(copygc, 40, TYPE_COUNTER) \
x(copygc_wait, 41, TYPE_COUNTER) \
x(gc_gens_end, 42, TYPE_COUNTER) \
@ -95,6 +99,13 @@ enum bch_persistent_counters {
BCH_COUNTER_NR
};
enum bch_persistent_counters_stable {
#define x(t, n, ...) BCH_COUNTER_STABLE_##t = n,
BCH_PERSISTENT_COUNTERS()
#undef x
BCH_COUNTER_STABLE_NR
};
struct bch_sb_field_counters {
struct bch_sb_field field;
__le64 d[];


@ -90,7 +90,13 @@
BIT_ULL(BCH_RECOVERY_PASS_check_allocations), \
BCH_FSCK_ERR_accounting_mismatch, \
BCH_FSCK_ERR_accounting_key_replicas_nr_devs_0, \
BCH_FSCK_ERR_accounting_key_junk_at_end)
BCH_FSCK_ERR_accounting_key_junk_at_end) \
x(cached_backpointers, \
BIT_ULL(BCH_RECOVERY_PASS_check_extents_to_backpointers),\
BCH_FSCK_ERR_ptr_to_missing_backpointer) \
x(stripe_backpointers, \
BIT_ULL(BCH_RECOVERY_PASS_check_extents_to_backpointers),\
BCH_FSCK_ERR_ptr_to_missing_backpointer)
#define DOWNGRADE_TABLE() \
x(bucket_stripe_sectors, \


@ -179,6 +179,7 @@ enum bch_fsck_flags {
x(ptr_crc_redundant, 160, 0) \
x(ptr_crc_nonce_mismatch, 162, 0) \
x(ptr_stripe_redundant, 163, 0) \
x(extent_flags_not_at_start, 306, 0) \
x(reservation_key_nr_replicas_invalid, 164, 0) \
x(reflink_v_refcount_wrong, 165, FSCK_AUTOFIX) \
x(reflink_v_pos_bad, 292, 0) \
@ -314,7 +315,9 @@ enum bch_fsck_flags {
x(compression_opt_not_marked_in_sb, 295, FSCK_AUTOFIX) \
x(compression_type_not_marked_in_sb, 296, FSCK_AUTOFIX) \
x(directory_size_mismatch, 303, FSCK_AUTOFIX) \
x(MAX, 304, 0)
x(dirent_cf_name_too_big, 304, 0) \
x(dirent_stray_data_after_cf_name, 305, 0) \
x(MAX, 307, 0)
enum bch_sb_error_id {
#define x(t, n, ...) BCH_FSCK_ERR_##t = n,


@ -23,7 +23,19 @@ static inline bool bch2_dev_is_online(struct bch_dev *ca)
return !percpu_ref_is_zero(&ca->io_ref);
}
static inline bool bch2_dev_is_readable(struct bch_dev *ca)
static inline struct bch_dev *bch2_dev_rcu(struct bch_fs *, unsigned);
static inline bool bch2_dev_idx_is_online(struct bch_fs *c, unsigned dev)
{
rcu_read_lock();
struct bch_dev *ca = bch2_dev_rcu(c, dev);
bool ret = ca && bch2_dev_is_online(ca);
rcu_read_unlock();
return ret;
}
static inline bool bch2_dev_is_healthy(struct bch_dev *ca)
{
return bch2_dev_is_online(ca) &&
ca->mi.state != BCH_MEMBER_STATE_failed;
@ -271,6 +283,8 @@ static inline struct bch_dev *bch2_dev_iterate(struct bch_fs *c, struct bch_dev
static inline struct bch_dev *bch2_dev_get_ioref(struct bch_fs *c, unsigned dev, int rw)
{
might_sleep();
rcu_read_lock();
struct bch_dev *ca = bch2_dev_rcu(c, dev);
if (ca && !percpu_ref_tryget(&ca->io_ref))

Some files were not shown because too many files have changed in this diff.