linux-stable-mirror

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ synced 2025-04-19 20:58:31 +09:00

Author	SHA1	Message	Date
Anuj Gupta	677e332e48	block: ensure correct integrity capability propagation in stacked devices queue_limits_stack_integrity() incorrectly sets BLK_INTEGRITY_DEVICE_CAPABLE for a DM device even when none of its underlying devices support integrity. This happens because the flag is inherited unconditionally. Ensure that integrity capabilities are correctly propagated only when the underlying devices actually support integrity. Reported-by: M Nikhil <nikh1092@linux.ibm.com> Link: https://lore.kernel.org/linux-block/f6130475-3ccd-45d2-abde-3ccceada0f0a@linux.ibm.com/ Fixes: c6e56cf6b2e7 ("block: move integrity information into queue_limits") Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250305063033.1813-2-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-06 08:01:37 -07:00
Ming Lei	6cc477c368	blk-throttle: carry over directly Now ->carryover_bytes[] and ->carryover_ios[] only covers limit/config update. Actually the carryover bytes/ios can be carried to ->bytes_disp[] and ->io_disp[] directly, since the carryover is one-shot thing and only valid in current slice. Then we can remove the two fields and simplify code much. Type of ->bytes_disp[] and ->io_disp[] has to change as signed because the two fields may become negative when updating limits or config, but both are big enough for holding bytes/ios dispatched in single slice Cc: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250305043123.3938491-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-05 16:24:40 -07:00
Ming Lei	a9fc8868b3	blk-throttle: don't take carryover for prioritized processing of metadata Commit 29390bb5661d ("blk-throttle: support prioritized processing of metadata") takes bytes/ios carryover for prioritized processing of metadata. Turns out we can support it by charging it directly without trimming slice, and the result is same with carryover. Cc: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250305043123.3938491-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-05 16:24:40 -07:00
Ming Lei	483a393e7e	blk-throttle: remove last_bytes_disp and last_ios_disp The two fields are not used any more, so remove them. Cc: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250305043123.3938491-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-05 16:24:40 -07:00
Yu Kuai	29cb955934	blk-throttle: fix lower bps rate by throtl_trim_slice() The bio submission time may be a few jiffies more than the expected waiting time, due to 'extra_bytes' can't be divided in tg_within_bps_limit(), and also due to timer wakeup delay. In this case, adjust slice_start to jiffies will discard the extra wait time, causing lower rate than expected. Current in-tree code already covers deviation by rounddown(), but turns out it is not enough, because jiffies - slice_start can be a multiple of throtl_slice. For example, assume bps_limit is 1000bytes, 1 jiffes is 10ms, and slice is 20ms(2 jiffies), expected rate is 1000 / 1000 * 20 = 20 bytes per slice. If user issues two 21 bytes IO, then wait time will be 30ms for the first IO: bytes_allowed = 20, extra_bytes = 1; jiffy_wait = 1 + 2 = 3 jiffies and consider extra 1 jiffies by timer, throtl_trim_slice() will be called at: jiffies = 40ms slice_start = 0ms, slice_end= 40ms bytes_disp = 21 In this case, before the patch, real rate in the first two slices is 10.5 bytes per slice, and slice will be updated to: jiffies = 40ms slice_start = 40ms, slice_end = 60ms, bytes_disp = 0; Hence the second IO will have to wait another 30ms; With the patch, the real rate in the first slice is 20 bytes per slice, which is the same as expected, and slice will be updated: jiffies=40ms, slice_start = 20ms, slice_end = 60ms, bytes_disp = 1; And now, there is still 19 bytes allowed in the second slice, and the second IO will only have to wait 10ms; This problem will cause blktests throtl/001 failure in case of CONFIG_HZ_100=y, fix it by preserving one extra finished slice in throtl_trim_slice(). Fixes: e43473b7f223 ("blkio: Core implementation of throttle policy") Reported-by: Ming Lei <ming.lei@redhat.com> Closes: https://lore.kernel.org/linux-block/20250222092823.210318-3-yukuai1@huaweicloud.com/ Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250227120645.812815-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-05 16:24:30 -07:00
Olivier Gayot	e06472bab2	block: fix conversion of GPT partition name to 7-bit The utf16_le_to_7bit function claims to, naively, convert a UTF-16 string to a 7-bit ASCII string. By naively, we mean that it: * drops the first byte of every character in the original UTF-16 string * checks if all characters are printable, and otherwise replaces them by exclamation mark "!". This means that theoretically, all characters outside the 7-bit ASCII range should be replaced by another character. Examples: * lower-case alpha (ɒ) 0x0252 becomes 0x52 (R) * ligature OE (œ) 0x0153 becomes 0x53 (S) * hangul letter pieup (ㅂ) 0x3142 becomes 0x42 (B) * upper-case gamma (Ɣ) 0x0194 becomes 0x94 (not printable) so gets replaced by "!" The result of this conversion for the GPT partition name is passed to user-space as PARTNAME via udev, which is confusing and feels questionable. However, there is a flaw in the conversion function itself. By dropping one byte of each character and using isprint() to check if the remaining byte corresponds to a printable character, we do not actually guarantee that the resulting character is 7-bit ASCII. This happens because we pass 8-bit characters to isprint(), which in the kernel returns 1 for many values > 0x7f - as defined in ctype.c. This results in many values which should be replaced by "!" to be kept as-is, despite not being valid 7-bit ASCII. Examples: * e with acute accent (é) 0x00E9 becomes 0xE9 - kept as-is because isprint(0xE9) returns 1. * euro sign (€) 0x20AC becomes 0xAC - kept as-is because isprint(0xAC) returns 1. This way has broken pyudev utility[1], fixes it by using a mask of 7 bits instead of 8 bits before calling isprint. Link: https://github.com/pyudev/pyudev/issues/490#issuecomment-2685794648 [1] Link: https://lore.kernel.org/linux-block/4cac90c2-e414-4ebb-ae62-2a4589d9dc6e@canonical.com/ Cc: Mulhern <amulhern@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: stable@vger.kernel.org Signed-off-by: Olivier Gayot <olivier.gayot@canonical.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250305022154.3903128-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-05 07:40:24 -07:00
Christoph Hellwig	105ca2a2c2	block: split struct bio_integrity_payload Many of the fields in struct bio_integrity_payload are only needed for the default integrity buffer in the block layer, and the variable sized array at the end of the structure makes it very hard to embed into caller allocated structures. Reduce struct bio_integrity_payload to the minimal structure needed in common code and create two separate containing structures for the automatically generated payload and the caller allocated payload. The latter is a simple wrapper for struct bio_integrity_payload and the bvecs, while the former contains the additional fields moved out of struct bio_integrity_payload. Always use a dedicated mempool for automatic integrity metadata instead of depending on bio_set that is submitter controlled and thus often doesn't have the mempool initialized and stop using mempools for the submitter buffers as they aren't in the NOIO I/O submission path where we need to guarantee forward progress. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250225154449.422989-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-03 11:17:52 -07:00
Christoph Hellwig	e51679112c	block: move the block layer auto-integrity code into a new file The code that automatically creates a integrity payload and generates and verifies the checksums for bios that don't have submitter-provided integrity payload currently sits right in the middle of the block integrity metadata infrastructure. Split it into a separate file to make the different layers clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250225154449.422989-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-03 11:17:52 -07:00
Christoph Hellwig	5fd0268a88	block: mark bounce buffering as incompatible with integrity None of the few drivers still using the legacy block layer bounce buffering support integrity metadata. Explicitly mark the features as incompatible and stop creating the slab and mempool for integrity buffers for the bounce bio_set. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250225154449.422989-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-03-03 11:17:52 -07:00
Linus Torvalds	276f98efb6	block-6.14-20250228 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmfBwygQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpuyVD/9kem557zNDkps/+2k8Q86FGZ/XmD+GPu1H l30qlar1XubeC/AE/bxgyI8G6rWY9li3PPn0tu/LeTgTVW5noIZCyvrtxl8g6yKV Gptm3H5AJypMU9cDz1/KTYTgrEypDJ22092/V1cuoeJxUS3srIEx6rlBp1wXzoG6 WdEIBhk9hM3hwXghyEarJeacHFe6xzd9lJM9ZODXBMkKtee85zXDLSAEPJsnjCcH t2tU/EAa6O0MLuYorG4Lkfs0ggDP+UDRdwh2MbANZXZdUCG2SwBS3pKDYtn684A1 gSsPnJGVZjLTog9jzaGkw64ebZ8tdLU4szjzroAJYkIbz9kO3QxT+H4TfW5UMoip TVPdNDqvypqs8ENKUvv3XuGsKuOfYjpBEiU2oGUUuioHJnWlh6CPnt8V8t3YKnbP xreqnIOjRJni1/OOZOMcWfRLlIRMG2dGFwhskWBWY8dmt4eHoge3RQzPZtAFelcG eM+Gkczz+GAXAnFHt5JQIPnfmcVmXqkbX12uoxUyuoa4AFaDLT+7nVtu3Gj5/beJ bcvk8q6ww8oXGVvJ0sYwic9tOX4XoxHsdr8u80Wd0uvHUB6uU/HTAxQxUO3uMSD5 0pk9l/zGDjDcEcuOUiIAUldl2M1eIyoBIOK3svMq6TKiC13j7+xkGI1uSA9cKws6 /+OsNMd9JQ== =tcA2 -----END PGP SIGNATURE----- Merge tag 'block-6.14-20250228' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - Fix plugging for native zone writes - Fix segment limit settings for != 4K page size archs - Fix for slab names overflowing * tag 'block-6.14-20250228' of git://git.kernel.dk/linux: block: fix 'kmem_cache of name 'bio-108' already exists' block: Remove zone write plugs when handling native zone append writes block: make segment size limit workable for > 4K PAGE_SIZE	2025-02-28 09:43:46 -08:00
Ming Lei	b654f7a51f	block: fix 'kmem_cache of name 'bio-108' already exists' Device mapper bioset often has big bio_slab size, which can be more than 1000, then 8byte can't hold the slab name any more, cause the kmem_cache allocation warning of 'kmem_cache of name 'bio-108' already exists'. Fix the warning by extending bio_slab->name to 12 bytes, but fix output of /proc/slabinfo Reported-by: Guangwu Zhang <guazhang@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250228132656.2838008-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-02-28 07:06:42 -07:00
Damien Le Moal	a6aa36e957	block: Remove zone write plugs when handling native zone append writes For devices that natively support zone append operations, REQ_OP_ZONE_APPEND BIOs are not processed through zone write plugging and are immediately issued to the zoned device. This means that there is no write pointer offset tracking done for these operations and that a zone write plug is not necessary. However, when receiving a zone append BIO, we may already have a zone write plug for the target zone if that zone was previously partially written using regular write operations. In such case, since the write pointer offset of the zone write plug is not incremented by the amount of sectors appended to the zone, 2 issues arise: 1) we risk leaving the plug in the disk hash table if the zone is fully written using zone append or regular write operations, because the write pointer offset will never reach the "zone full" state. 2) Regular write operations that are issued after zone append operations will always be failed by blk_zone_wplug_prepare_bio() as the write pointer alignment check will fail, even if the user correctly accounted for the zone append operations and issued the regular writes with a correct sector. Avoid these issues by immediately removing the zone write plug of zones that are the target of zone append operations when blk_zone_plug_bio() is called. The new function blk_zone_wplug_handle_native_zone_append() implements this for devices that natively support zone append. The removal of the zone write plug using disk_remove_zone_wplug() requires aborting all plugged regular write using disk_zone_wplug_abort() as otherwise the plugged write BIOs would never be executed (with the plug removed, the completion path will never see again the zone write plug as disk_get_zone_wplug() will return NULL). Rate-limited warnings are added to blk_zone_wplug_handle_native_zone_append() and to disk_zone_wplug_abort() to signal this. Since blk_zone_wplug_handle_native_zone_append() is called in the hot path for operations that will not be plugged, disk_get_zone_wplug() is optimized under the assumption that a user issuing zone append operations is not at the same time issuing regular writes and that there are no hashed zone write plugs. The struct gendisk atomic counter nr_zone_wplugs is added to check this, with this counter incremented in disk_insert_zone_wplug() and decremented in disk_remove_zone_wplug(). To be consistent with this fix, we do not need to fill the zone write plug hash table with zone write plugs for zones that are partially written for a device that supports native zone append operations. So modify blk_revalidate_seq_zone() to return early to avoid allocating and inserting a zone write plug for partially written sequential zones if the device natively supports zone append. Reported-by: Jorgen Hansen <Jorgen.Hansen@wdc.com> Fixes: 9b1ce7f0c6f8 ("block: Implement zone append emulation") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Jorgen Hansen <Jorgen.Hansen@wdc.com> Link: https://lore.kernel.org/r/20250214041434.82564-1-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-02-25 19:45:21 -07:00
Tang Yizhou	8ac17e6ae1	blk-wbt: Cleanup a comment in wb_timer_fn The original comment contains a grammatical error. Rewrite it into a more easily understandable sentence. Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com> Link: https://lore.kernel.org/r/20250213100611.209997-3-yizhou.tang@shopee.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-02-25 08:43:52 -07:00
Tang Yizhou	5d01d2df85	blk-wbt: Fix some comments wbt_wait() no longer uses a spinlock as a parameter. Update the function comments accordingly. RWB_UNKNOWN_BUMP is used when we gradually adjust scale_steps toward the center state, which is a value of 0. Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250213100611.209997-2-yizhou.tang@shopee.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-02-25 08:43:52 -07:00
Ming Lei	889c57066c	block: make segment size limit workable for > 4K PAGE_SIZE Using PAGE_SIZE as a minimum expected DMA segment size in consideration of devices which have a max DMA segment size of < 64k when used on 64k PAGE_SIZE systems leads to devices not being able to probe such as eMMC and Exynos UFS controller [0] [1] you can end up with a probe failure as follows: WARNING: CPU: 2 PID: 397 at block/blk-settings.c:339 blk_validate_limits+0x364/0x3c0 Ensure we use min(max_seg_size, seg_boundary_mask + 1) as the new min segment size when max segment size is < PAGE_SIZE for 16k and 64k base page size systems. If anyone need to backport this patch, the following commits are depended: commit 6aeb4f836480 ("block: remove bio_add_pc_page") commit 02ee5d69e3ba ("block: remove blk_rq_bio_prep") commit b7175e24d6ac ("block: add a dma mapping iterator") Link: https://lore.kernel.org/linux-block/20230612203314.17820-1-bvanassche@acm.org/ # [0] Link: https://lore.kernel.org/linux-block/1d55e942-5150-de4c-3a02-c3d066f87028@acm.org/ # [1] Cc: Yi Zhang <yi.zhang@redhat.com> Cc: John Garry <john.g.garry@oracle.com> Cc: Keith Busch <kbusch@kernel.org> Tested-by: Paul Bunyan <pbunyan@redhat.com> Reviewed-by: Daniel Gomez <da.gomez@kernel.org> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250225022141.2154581-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-02-25 08:41:32 -07:00
Luis Chamberlain	425fbcd62d	bdev: use bdev_io_min() for statx block size You can use lsblk to query for a block device block device block size: lsblk -o MIN-IO /dev/nvme0n1 MIN-IO 4096 The min-io is the minimum IO the block device prefers for optimal performance. In turn we map this to the block device block size. The current block size exposed even for block devices with an LBA format of 16k is 4k. Likewise devices which support 4k LBA format but have a larger Indirection Unit of 16k have an exposed block size of 4k. This incurs read-modify-writes on direct IO against devices with a min-io larger than the page size. To fix this, use the block device min io, which is the minimal optimal IO the device prefers. With this we now get: lsblk -o MIN-IO /dev/nvme0n1 MIN-IO 16384 And so userspace gets the appropriate information it needs for optimal performance. This is verified with blkalgn against mkfs against a device with LBA format of 4k but an NPWG of 16k (min io size) mkfs.xfs -f -b size=16k /dev/nvme3n1 blkalgn -d nvme3n1 --ops Write Block size : count distribution 0 -> 1 : 0 \| \| 2 -> 3 : 0 \| \| 4 -> 7 : 0 \| \| 8 -> 15 : 0 \| \| 16 -> 31 : 0 \| \| 32 -> 63 : 0 \| \| 64 -> 127 : 0 \| \| 128 -> 255 : 0 \| \| 256 -> 511 : 0 \| \| 512 -> 1023 : 0 \| \| 1024 -> 2047 : 0 \| \| 2048 -> 4095 : 0 \| \| 4096 -> 8191 : 0 \| \| 8192 -> 16383 : 0 \| \| 16384 -> 32767 : 66 \|***************************************\| 32768 -> 65535 : 0 \| \| 65536 -> 131071 : 0 \| \| 131072 -> 262143 : 2 \| \| Block size: 14 - 66 Block size: 17 - 2 Algn size : count distribution 0 -> 1 : 0 \| \| 2 -> 3 : 0 \| \| 4 -> 7 : 0 \| \| 8 -> 15 : 0 \| \| 16 -> 31 : 0 \| \| 32 -> 63 : 0 \| \| 64 -> 127 : 0 \| \| 128 -> 255 : 0 \| \| 256 -> 511 : 0 \| \| 512 -> 1023 : 0 \| \| 1024 -> 2047 : 0 \| \| 2048 -> 4095 : 0 \| \| 4096 -> 8191 : 0 \| \| 8192 -> 16383 : 0 \| \| 16384 -> 32767 : 66 \|***************************************\| 32768 -> 65535 : 0 \| \| 65536 -> 131071 : 0 \| \| 131072 -> 262143 : 2 \| \| Algn size: 14 - 66 Algn size: 17 - 2 Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20250221223823.1680616-9-mcgrof@kernel.org Reviewed-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-24 11:44:44 +01:00
Luis Chamberlain	47dd675323	block/bdev: lift block size restrictions to 64k We now can support blocksizes larger than PAGE_SIZE, so in theory we should be able to lift the restriction up to the max supported page cache order. However bound ourselves to what we can currently validate and test. Through blktests and fstest we can validate up to 64k today. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20250221223823.1680616-8-mcgrof@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-24 11:44:44 +01:00
Hannes Reinecke	3c20917120	block/bdev: enable large folio support for large logical block sizes Call mapping_set_folio_min_order() when modifying the logical block size to ensure folios are allocated with the correct size. Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Signed-off-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20250221223823.1680616-7-mcgrof@kernel.org Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-24 11:44:44 +01:00
Thorsten Blum	8985c42987	block: Remove commented out code Remove commented out code. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250219205328.28462-2-thorsten.blum@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-02-21 17:12:21 -07:00
Linus Torvalds	8a61cb6e15	block-6.14-20250221 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAme4rUUQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpj9+EADLPOFPa9hT1PGBbpnj74vBayoTO/M+w2Gp +k2b8if3eGlY43WO2k+ytceWbA901iyvLPRqt1M1Ez8+BrNBg4NKcLv7q4O9NA3i nDPLggugSc5sdRbLRimxiwHkkpSOBenkdb7R9XGmMXTCfSbRKl0kK01ivpgkbiG4 pbyPWYcoMyHaECBfPhazrJig4+rugXOYbkYoOM4NHsLqlTNfmowcMRPu+6czXt7q ITHW2RTWK3ue8q+c3nwGPDk2ZKM8X/49rA/6bvD3voLNs+jQ8KFg2KULENf0Xaq6 1ZGrhLcr45iEHP0/+RORMzx27PqbTCSGIOTMZtwZNqh5+V+ybrGJq/F/T5rkrA3F QqHld/WSSKWJ10RVAyjDP7NQ5vNZTwwGAEVagjyIFEfk7G7RTY2kIpSZiUgrZ9oD 4CkOKUGmVkUsKQW6gb0JQObtYyXXoNtmg8wQU2WwhISjFDkoYWw53LHwH/LnxLyi Vg182amVBmERk4I5nTUiIML/7TzS69srb0Q7yaQS3eTwzLorDaB+3tPAxQmCTkGq KeBfuBtbP3LTOy2Oek4YbKl8CA2KDYtK7FbCE6PECUbdTjpNrcAAgA/ZcgoKV8s7 EHWZFx7dZyFS6LFNWzT9VhTtgSZS92JIsZwgnjSJPV2UazyxmqHFChzVVGWJbaB3 agkMor3nVg== =L+Lr -----END PGP SIGNATURE----- Merge tag 'block-6.14-20250221' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - NVMe pull request via Keith: - FC controller state check fixes (Daniel) - PCI Endpoint fixes (Damien) - TCP connection failure fixe (Caleb) - TCP handling C2HTermReq PDU (Maurizio) - RDMA queue state check (Ruozhu) - Apple controller fixes (Hector) - Target crash on disbaled namespace (Hannes) - MD pull request via Yu: - Fix queue limits error handling for raid0, raid1 and raid10 - Fix for a NULL pointer deref in request data mapping - Code cleanup for request merging * tag 'block-6.14-20250221' of git://git.kernel.dk/linux: nvme: only allow entering LIVE from CONNECTING state nvme-fc: rely on state transitions to handle connectivity loss apple-nvme: Support coprocessors left idle apple-nvme: Release power domains when probe fails nvmet: Use enum definitions instead of hardcoded values nvme: Cleanup the definition of the controller config register fields nvme/ioctl: add missing space in err message nvme-tcp: fix connect failure on receiving partial ICResp PDU nvme: tcp: Fix compilation warning with W=1 nvmet: pci-epf: Avoid RCU stalls under heavy workload nvmet: pci-epf: Do not uselessly write the CSTS register nvmet: pci-epf: Correctly initialize CSTS when enabling the controller nvmet-rdma: recheck queue state is LIVE in state lock in recv done nvmet: Fix crash when a namespace is disabled nvme-tcp: add basic support for the C2HTermReq PDU nvme-pci: quirk Acer FA100 for non-uniqueue identifiers block: fix NULL pointer dereferenced within __blk_rq_map_sg block/merge: remove unnecessary min() with UINT_MAX md/raid*: Fix the set_queue_limits implementations	2025-02-21 09:36:28 -08:00
Nam Cao	cab0e0a056	blk_iocost: Switch to use hrtimer_setup() hrtimer_setup() takes the callback function pointer as argument and initializes the timer completely. Replace hrtimer_init() and the open coded initialization of hrtimer::function with the new setup mechanism. Patch was created by using Coccinelle. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/196d487c925411923a2d59d4bf5e366b9dac2747.1738746821.git.namcao@linutronix.de	2025-02-18 10:32:34 +01:00
Nam Cao	2414f15910	block, bfq: Switch to use hrtimer_setup() hrtimer_setup() takes the callback function pointer as argument and initializes the timer completely. Replace hrtimer_init() and the open coded initialization of hrtimer::function with the new setup mechanism. Patch was created by using Coccinelle. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/d0d57e1dab46b617856dfb93c721d221cc31ab0b.1738746821.git.namcao@linutronix.de	2025-02-18 10:32:33 +01:00
Ming Lei	dd8b0582e2	block: fix NULL pointer dereferenced within __blk_rq_map_sg The block layer internal flush request may not have bio attached, so the request iterator has to be initialized from valid req->bio, otherwise NULL pointer dereferenced is triggered. Cc: Christoph Hellwig <hch@lst.de> Reported-and-tested-by: Cheyenne Wills <cheyenne.wills@gmail.com> Fixes: b7175e24d6ac ("block: add a dma mapping iterator") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250217031626.461977-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-02-17 09:04:07 -07:00
Caleb Sander Mateos	43c70b1040	block/merge: remove unnecessary min() with UINT_MAX In bvec_split_segs(), max_bytes is an unsigned, so it must be less than or equal to UINT_MAX. Remove the unnecessary min(). Prior to commit 67927d220150 ("block/merge: count bytes instead of sectors"), the min() was with UINT_MAX >> 9, so it did have an effect. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250214193637.234702-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-02-14 15:40:17 -07:00
Linus Torvalds	1b8c8cdad1	block-6.14-20250214 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmevfDEQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgptaHEACEqo12wWNcYklms/oy9DxsVEFM7d5waYRR NZy1+i3wbUAGfYl0marBh484kDr7Uyko4YJa0O0LyMKdW7wZOEk36MRUU2+7FeSp 4bFiFlSGyds9kIqjem4dR0ACCL/NW+PS5T79Xh1PnWDEByMH7wbRtPWVT6JJl4r6 PVdi4FB1aV6+C2DayjKbFqR0kDbFnl8INaGw8mg5PpI32A9mCQtl6XU2G/Pw8WVZ 3UJR+DWzfK/lSeVvPiZgOvLHWzi1UB0rKKuWjzbIq7dTtMy241Tox0YRnLsPiNxR ncRHftgEIjgkHjpCT4qQZ/joQfLop6MSkRixWUaORjTRqHHTqhLpj5SzjNlfn0Cb qhb/jf4VoBYD/04NEwvBzNmwyX6xohD07boM2SlnpiPNzBo0pcHzD4YuYzmsUCO4 gE2DeI9NAtDLMB5987Heb2zbvNtWgSM4g9t5zZuKtBEfNPnQwzYKFWeOIbSxmcbN Y5FW+sLXmXLT+li17BeJFzOXp882Lp4oZtSdX1ibTkmdj4P/IcNYuB3Z/VYvF1NO ZY2mBFRdUrii5oBh7iVSkwGIJM/TUwBgjoPlG84F7CoaxK6wQDHovFhkLHUVd7mx JfzTDfbsC/7R934IgLcLDR8uCaLmbMnJNYJqdvGQdR2NVy4azM52zopkHX6ereby DqicWc+Ekg== =zguJ -----END PGP SIGNATURE----- Merge tag 'block-6.14-20250214' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - Fix for request rejection for batch addition - Fix a few issues for bogus mac partition tables * tag 'block-6.14-20250214' of git://git.kernel.dk/linux: partitions: mac: fix handling of bogus partition table block: cleanup and fix batch completion adding conditions	2025-02-14 11:40:59 -08:00
Jann Horn	80e648042e	partitions: mac: fix handling of bogus partition table Fix several issues in partition probing: - The bailout for a bad partoffset must use put_dev_sector(), since the preceding read_part_sector() succeeded. - If the partition table claims a silly sector size like 0xfff bytes (which results in partition table entries straddling sector boundaries), bail out instead of accessing out-of-bounds memory. - We must not assume that the partition table contains proper NUL termination - use strnlen() and strncmp() instead of strlen() and strcmp(). Cc: stable@vger.kernel.org Signed-off-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/r/20250214-partition-mac-v1-1-c1c626dffbd5@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-02-14 08:38:28 -07:00
Muchun Song	a052bfa636	block: refactor rq_qos_wait() When rq_qos_wait() is first introduced, it is easy to understand. But with some bug fixes applied, it is not easy for newcomers to understand the whole logic under those fixes. In this patch, rq_qos_wait() is refactored and more comments are added for better understanding. There are 3 points for the improvement: 1) Use waitqueue_active() instead of wq_has_sleeper() to eliminate unnecessary memory barrier in wq_has_sleeper() which is supposed to be used in waker side. In this case, we do need the barrier. So use the cheaper one to locklessly test for waiters on the queue. 2) Remove acquire_inflight_cb() logic for the first waiter out of the while loop to make the code clear. 3) Add more comments to explain how to sync with different waiters and the waker. Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250208090416.38642-2-songmuchun@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-02-11 13:04:11 -07:00
Muchun Song	36d03cb327	block: introduce init_wait_func() There is already a macro DEFINE_WAIT_FUNC() to declare a wait_queue_entry with a specified waking function. But there is not a counterpart for initializing one wait_queue_entry with a specified waking function. So introducing init_wait_func() for this, which also could be used in iocost and rq-qos. Using default_wake_function() in rq_qos_wait() to wake up waiters, which could remove ->task field from rq_qos_wait_data. Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250208090416.38642-1-songmuchun@bytedance.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-02-11 13:04:11 -07:00
Eric Biggers	1ebd4a3c09	blk-crypto: add ioctls to create and prepare hardware-wrapped keys Until this point, the kernel can use hardware-wrapped keys to do encryption if userspace provides one -- specifically a key in ephemerally-wrapped form. However, no generic way has been provided for userspace to get such a key in the first place. Getting such a key is a two-step process. First, the key needs to be imported from a raw key or generated by the hardware, producing a key in long-term wrapped form. This happens once in the whole lifetime of the key. Second, the long-term wrapped key needs to be converted into ephemerally-wrapped form. This happens each time the key is "unlocked". In Android, these operations are supported in a generic way through KeyMint, a userspace abstraction layer. However, that method is Android-specific and can't be used on other Linux systems, may rely on proprietary libraries, and also misleads people into supporting KeyMint features like rollback resistance that make sense for other KeyMint keys but don't make sense for hardware-wrapped inline encryption keys. Therefore, this patch provides a generic kernel interface for these operations by introducing new block device ioctls: - BLKCRYPTOIMPORTKEY: convert a raw key to long-term wrapped form. - BLKCRYPTOGENERATEKEY: have the hardware generate a new key, then return it in long-term wrapped form. - BLKCRYPTOPREPAREKEY: convert a key from long-term wrapped form to ephemerally-wrapped form. These ioctls are implemented using new operations in blk_crypto_ll_ops. Signed-off-by: Eric Biggers <ebiggers@google.com> Tested-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> # sm8650 Link: https://lore.kernel.org/r/20250204060041.409950-4-ebiggers@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-02-10 09:54:19 -07:00
Eric Biggers	e35fde43e2	blk-crypto: show supported key types in sysfs Add sysfs files that indicate which type(s) of keys are supported by the inline encryption hardware associated with a particular request queue: /sys/block/$disk/queue/crypto/hw_wrapped_keys /sys/block/$disk/queue/crypto/raw_keys Userspace can use the presence or absence of these files to decide what encyption settings to use. Don't use a single key_type file, as devices might support both key types at the same time. Signed-off-by: Eric Biggers <ebiggers@google.com> Tested-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> # sm8650 Link: https://lore.kernel.org/r/20250204060041.409950-3-ebiggers@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-02-10 09:54:19 -07:00
Eric Biggers	ebc4176551	blk-crypto: add basic hardware-wrapped key support To prevent keys from being compromised if an attacker acquires read access to kernel memory, some inline encryption hardware can accept keys which are wrapped by a per-boot hardware-internal key. This avoids needing to keep the raw keys in kernel memory, without limiting the number of keys that can be used. Such hardware also supports deriving a "software secret" for cryptographic tasks that can't be handled by inline encryption; this is needed for fscrypt to work properly. To support this hardware, allow struct blk_crypto_key to represent a hardware-wrapped key as an alternative to a raw key, and make drivers set flags in struct blk_crypto_profile to indicate which types of keys they support. Also add the ->derive_sw_secret() low-level operation, which drivers supporting wrapped keys must implement. For more information, see the detailed documentation which this patch adds to Documentation/block/inline-encryption.rst. Signed-off-by: Eric Biggers <ebiggers@google.com> Tested-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> # sm8650 Link: https://lore.kernel.org/r/20250204060041.409950-2-ebiggers@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-02-10 09:54:19 -07:00
Eric Biggers	f6c3f6fb32	lib/crc64: rename CRC64-Rocksoft to CRC64-NVME This CRC64 variant comes from the NVME NVM Command Set Specification (https://nvmexpress.org/wp-content/uploads/NVM-Express-NVM-Command-Set-Specification-1.0e-2024.07.29-Ratified.pdf). The "Rocksoft Model CRC Algorithm", published in 1993 and available at https://www.zlib.net/crc_v3.txt, is a generalized CRC algorithm that can calculate any variant of CRC, given a list of parameters such as polynomial, bit order, etc. It is not a CRC variant. The NVME NVM Command Set Specification has a table that gives the "Rocksoft Model Parameters" for the CRC variant it uses. When support for this CRC variant was added to Linux, this table seems to have been misinterpreted as naming the CRC variant the "Rocksoft" CRC. In fact, the table names the CRC variant as the "NVM Express 64b CRC". Most implementations of this CRC variant outside Linux have been calling it CRC64-NVME. Therefore, update Linux to match. While at it, remove the superfluous "update" from the function name, so crc64_rocksoft_update() is now just crc64_nvme(), matching most of the other CRC library functions. Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Acked-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20250130035130.180676-4-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>	2025-02-08 20:06:24 -08:00
Eric Biggers	feb541bfac	lib/crc64-rocksoft: stop wrapping the crypto API Following what was done for the CRC32 and CRC-T10DIF library functions, get rid of the pointless use of the crypto API and make crc64_rocksoft_update() call into the library directly. This is faster and simpler. Remove crc64_rocksoft() (the version of the function that did not take a 'crc' argument) since it is unused. Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Acked-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20250130035130.180676-2-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>	2025-02-08 20:06:19 -08:00
Linus Torvalds	9755ffd989	block-6.14-20250131 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmec72kQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpm0pD/sENbY3LKqN8JipH8fWjLmdcimB7STB8MpB LcuWiiuhDQuU1Yjr7c1h0Nkb6qwrb+gP+NQJrhlyUO7JVlvzFivJqsRscibNj+MF hX/vUM2zKo0La2YxRv7aygTTCs8JfRmLKkTtfo0H4geljbX8/yMKSEJA+7x4j60X TeJjmAHEKBGm2cNA7QMGqhzoUfjW1WQyn9x6/YOlLsYLYB8Cd7LFJqxM1tiDzUay ys/QJ833qctdqBK5iLIh5voqZA58N3koidw536NLNgyvJn77CSAmC5jF1Vr0kFS6 Zb+dE9qRXDsND83dq3qvpSG5gtE42DygEG0g7s49zztd1rPOWE45RCxQtcZqHSvu N9hN+/4SNX+02AfrgZFEk8+yGbtIlsb0mHRxqIKh8261tb1RTYVmtI4tqC3GVMjj Qo41rDLDhaPOoKayo/KZO55n9U9M8BIkoF73hIIr5l9wHf5iOwi+EjzULlYhGigc xypMvwWFLu3Of9YCJo38fTxGO6fL4eOJihoqqxNfeRACyQqbVeLWAT4wgUr2G8la MIp2l5Vb7MmFXsES39fARLoN/8jjbgaXdxnSWhZlo4SPQaAlVA5ScXC4tKI3Gte3 O7fPVAX76PrZ+y6x4enfooI3mrpkeK676BxsokOX0m8WrVyVmIc4cfSaayc1gC5m Uy298X93IA== =molg -----END PGP SIGNATURE----- Merge tag 'block-6.14-20250131' of git://git.kernel.dk/linux Pull more block updates from Jens Axboe: - MD pull request via Song: - Fix a md-cluster regression introduced - More sysfs race fixes - Mark anything inside queue freezing as not being able to do IO for memory allocations - Fix for a regression introduced in loop in this merge window - Fix for a regression in queue mapping setups introduced in this merge window - Fix for the block dio fops attempting an iov_iter revert upton getting -EIOCBQUEUED on the read side. This one is going to stable as well * tag 'block-6.14-20250131' of git://git.kernel.dk/linux: block: force noio scope in blk_mq_freeze_queue block: fix nr_hw_queue update racing with disk addition/removal block: get rid of request queue ->sysfs_dir_lock loop: don't clear LO_FLAGS_PARTSCAN on LOOP_SET_STATUS{,64} md/md-bitmap: Synchronize bitmap_get_stats() with bitmap lifetime blk-mq: create correct map for fallback case block: don't revert iter for -EIOCBQUEUED	2025-01-31 11:49:30 -08:00
Christoph Hellwig	1e1a9cecfa	block: force noio scope in blk_mq_freeze_queue When block drivers or the core block code perform allocations with a frozen queue, this could try to recurse into the block device to reclaim memory and deadlock. Thus all allocations done by a process that froze a queue need to be done without __GFP_IO and __GFP_FS. Instead of tying to track all of them down, force a noio scope as part of freezing the queue. Note that nvme is a bit of a mess here due to the non-owner freezes, and they will be addressed separately. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250131120352.1315351-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-01-31 07:20:08 -07:00
Nilay Shroff	14ef49657f	block: fix nr_hw_queue update racing with disk addition/removal The nr_hw_queue update could potentially race with disk addtion/removal while registering/unregistering hctx sysfs files. The __blk_mq_update_ nr_hw_queues() runs with q->tag_list_lock held and so to avoid it racing with disk addition/removal we should acquire q->tag_list_lock while registering/unregistering hctx sysfs files. With this patch, blk_mq_sysfs_register() (called during disk addition) and blk_mq_sysfs_unregister() (called during disk removal) now runs with q->tag_list_lock held so that it avoids racing with __blk_mq_update _nr_hw_queues(). Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20250128143436.874357-3-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-01-29 07:16:47 -07:00
Nilay Shroff	fe66286086	block: get rid of request queue ->sysfs_dir_lock The request queue uses ->sysfs_dir_lock for protecting the addition/ deletion of kobject entries under sysfs while we register/unregister blk-mq. However kobject addition/deletion is already protected with kernfs/sysfs internal synchronization primitives. So use of q->sysfs_ dir_lock seems redundant. Moreover, q->sysfs_dir_lock is also used at few other callsites along with q->sysfs_lock for protecting the addition/deletion of kojects. One such example is when we register with sysfs a set of independent access ranges for a disk. Here as well we could get rid off q->sysfs_ dir_lock and only use q->sysfs_lock. The only variable which q->sysfs_dir_lock appears to protect is q-> mq_sysfs_init_done which is set/unset while registering/unregistering blk-mq with sysfs. But use of q->mq_sysfs_init_done could be easily replaced using queue registered bit QUEUE_FLAG_REGISTERED. So with this patch we remove q->sysfs_dir_lock from each callsite and replace q->mq_sysfs_init_done using QUEUE_FLAG_REGISTERED. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20250128143436.874357-2-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-01-29 07:16:47 -07:00
Linus Torvalds	2ab002c755	Driver core and debugfs updates Here is the big set of driver core and debugfs updates for 6.14-rc1. It's coming late in the merge cycle as there are a number of merge conflicts with your tree now, and I wanted to make sure they were working properly. To resolve them, look in linux-next, and I will send the "fixup" patch as a response to the pull request. Included in here is a bunch of driver core, PCI, OF, and platform rust bindings (all acked by the different subsystem maintainers), hence the merge conflict with the rust tree, and some driver core api updates to mark things as const, which will also require some fixups due to new stuff coming in through other trees in this merge window. There are also a bunch of debugfs updates from Al, and there is at least one user that does have a regression with these, but Al is working on tracking down the fix for it. In my use (and everyone else's linux-next use), it does not seem like a big issue at the moment. Here's a short list of the things in here: - driver core bindings for PCI, platform, OF, and some i/o functions. We are almost at the "write a real driver in rust" stage now, depending on what you want to do. - misc device rust bindings and a sample driver to show how to use them - debugfs cleanups in the fs as well as the users of the fs api for places where drivers got it wrong or were unnecessarily doing things in complex ways. - driver core const work, making more of the api take const * for different parameters to make the rust bindings easier overall. - other small fixes and updates All of these have been in linux-next with all of the aforementioned merge conflicts, and the one debugfs issue, which looks to be resolved "soon". Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZ5koPA8cZ3JlZ0Brcm9h aC5jb20ACgkQMUfUDdst+ymFHACfT5acDKf2Bov2Lc/5u3vBW/R6ChsAnj+LmgVI hcDSPodj4szR40RRnzBd =u5Ey -----END PGP SIGNATURE----- Merge tag 'driver-core-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core and debugfs updates from Greg KH: "Here is the big set of driver core and debugfs updates for 6.14-rc1. Included in here is a bunch of driver core, PCI, OF, and platform rust bindings (all acked by the different subsystem maintainers), hence the merge conflict with the rust tree, and some driver core api updates to mark things as const, which will also require some fixups due to new stuff coming in through other trees in this merge window. There are also a bunch of debugfs updates from Al, and there is at least one user that does have a regression with these, but Al is working on tracking down the fix for it. In my use (and everyone else's linux-next use), it does not seem like a big issue at the moment. Here's a short list of the things in here: - driver core rust bindings for PCI, platform, OF, and some i/o functions. We are almost at the "write a real driver in rust" stage now, depending on what you want to do. - misc device rust bindings and a sample driver to show how to use them - debugfs cleanups in the fs as well as the users of the fs api for places where drivers got it wrong or were unnecessarily doing things in complex ways. - driver core const work, making more of the api take const * for different parameters to make the rust bindings easier overall. - other small fixes and updates All of these have been in linux-next with all of the aforementioned merge conflicts, and the one debugfs issue, which looks to be resolved "soon"" * tag 'driver-core-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (95 commits) rust: device: Use as_char_ptr() to avoid explicit cast rust: device: Replace CString with CStr in property_present() devcoredump: Constify 'struct bin_attribute' devcoredump: Define 'struct bin_attribute' through macro rust: device: Add property_present() saner replacement for debugfs_rename() orangefs-debugfs: don't mess with ->d_name octeontx2: don't mess with ->d_parent or ->d_parent->d_name arm_scmi: don't mess with ->d_parent->d_name slub: don't mess with ->d_name sof-client-ipc-flood-test: don't mess with ->d_name qat: don't mess with ->d_name xhci: don't mess with ->d_iname mtu3: don't mess wiht ->d_iname greybus/camera - stop messing with ->d_iname mediatek: stop messing with ->d_iname netdevsim: don't embed file_operations into your structs b43legacy: make use of debugfs_get_aux() b43: stop embedding struct file_operations into their objects carl9170: stop embedding file_operations into their objects ...	2025-01-28 12:25:12 -08:00
Daniel Wagner	a9ae6fe1c3	blk-mq: create correct map for fallback case The fallback code in blk_mq_map_hw_queues is original from blk_mq_pci_map_queues and was added to handle the case where pci_irq_get_affinity will return NULL for !SMP configuration. blk_mq_map_hw_queues replaces besides blk_mq_pci_map_queues also blk_mq_virtio_map_queues which used to use blk_mq_map_queues for the fallback. It's possible to use blk_mq_map_queues for both cases though. blk_mq_map_queues creates the same map as blk_mq_clear_mq_map for !SMP that is CPU 0 will be mapped to hctx 0. The WARN_ON_ONCE has to be dropped for virtio as the fallback is also taken for certain configuration on default. Though there is still a WARN_ON_ONCE check in lib/group_cpus.c: WARN_ON(nr_present + nr_others < numgrps); which will trigger if the caller tries to create more hardware queues than CPUs. It tests the same as the WARN_ON_ONCE in blk_mq_pci_map_queues did. Fixes: a5665c3d150c ("virtio: blk/scsi: replace blk_mq_virtio_map_queues with blk_mq_map_hw_queues") Reported-by: Steven Rostedt <rostedt@goodmis.org> Closes: https://lore.kernel.org/all/20250122093020.6e8a4e5b@gandalf.local.home/ Signed-off-by: Daniel Wagner <wagi@kernel.org> Link: https://lore.kernel.org/r/20250123-fix-blk_mq_map_hw_queues-v1-1-08dbd01f2c39@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-01-23 06:34:32 -07:00
Jens Axboe	b13ee668e8	block: don't revert iter for -EIOCBQUEUED blkdev_read_iter() has a few odd checks, like gating the position and count adjustment on whether or not the result is bigger-than-or-equal to zero (where bigger than makes more sense), and not checking the return value of blkdev_direct_IO() before doing an iov_iter_revert(). The latter can lead to attempting to revert with a negative value, which when passed to iov_iter_revert() as an unsigned value will lead to throwing a WARN_ON() because unroll is bigger than MAX_RW_COUNT. Be sane and don't revert for -EIOCBQUEUED, like what is done in other spots. Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-01-23 06:18:41 -07:00
Linus Torvalds	a312e1706c	for-6.14/io_uring-20250119 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmeNDEUQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpl5hD/4t7kWWNQDeQG9CiA3QStMJ5Yow2AgYtK8f sJBr5/6PGEsbTreX//Kh8DtPZPRGcjG9elCo58QxWaPZ2mg3fTOR3/QYLMlaGXU2 hSht58lj32utpuzMjMo9bG3aesi03bLf+buaq7V1FaMlcTV8rXqK1s/HGtphDBRo 8tNLEk3JDJDs3vlWbNp/5Hqh9+Ro6DU8df1zWWH4Vbu8RXaGIPyJyjKvvcbfuuCf k7Ay45XNAmTZg+rSNGv1H3Yn1LNzPMVFLWBfzRahPCzlKy2+mJMWz1PWu9naaUK+ WTM+kgiBLF24k59G/9xuxC5bYtsTjTbr4GsEE5ZvFBnhKPzLzzaJj7iQHRj83vtv tqxNmAbA3wJoNk48Zr8+cYbfDX9Q9Pl32wIaS/LxRgF9MT4lem6pyKY7Skd12oK3 rnQ8moGtnOBxp3QUU6BZ7IX3ipb+Bgw7FhZbtVYJdlqKeKyi1QO0MuITwGXpMwk/ EWDDTsspIf+QaTu+fmO8byJavugKljW8t7hM1JpvlfOLl+rsh6/+AYz42fCvcaA0 Tu4bpUk8SuwALvZfU2R6bLkorGG6MFuGI8g3eixOcGir3YAcHBMfdg6ItpZi5qVt ToM87BMaezOZZvSwX1JBaQ0AR5HBQYmHaiLWgPsORf3PjJ0kz+u21SK9D+yJkUtU rT6+HvoVXA== =ufpE -----END PGP SIGNATURE----- Merge tag 'for-6.14/io_uring-20250119' of git://git.kernel.dk/linux Pull io_uring updates from Jens Axboe: "Not a lot in terms of features this time around, mostly just cleanups and code consolidation: - Support for PI meta data read/write via io_uring, with NVMe and SCSI covered - Cleanup the per-op structure caching, making it consistent across various command types - Consolidate the various user mapped features into a concept called regions, making the various users of that consistent - Various cleanups and fixes" * tag 'for-6.14/io_uring-20250119' of git://git.kernel.dk/linux: (56 commits) io_uring/fdinfo: fix io_uring_show_fdinfo() misuse of ->d_iname io_uring: reuse io_should_terminate_tw() for cmds io_uring: Factor out a function to parse restrictions io_uring/rsrc: require cloned buffers to share accounting contexts io_uring: simplify the SQPOLL thread check when cancelling requests io_uring: expose read/write attribute capability io_uring/rw: don't gate retry on completion context io_uring/rw: handle -EAGAIN retry at IO completion time io_uring/rw: use io_rw_recycle() from cleanup path io_uring/rsrc: simplify the bvec iter count calculation io_uring: ensure io_queue_deferred() is out-of-line io_uring/rw: always clear ->bytes_done on io_async_rw setup io_uring/rw: use NULL for rw->free_iovec assigment io_uring/rw: don't mask in f_iocb_flags io_uring/msg_ring: Drop custom destructor io_uring: Move old async data allocation helper to header io_uring/rw: Allocate async data through helper io_uring/net: Allocate msghdr async data through helper io_uring/uring_cmd: Allocate async data through generic helper io_uring/poll: Allocate apoll with generic alloc_cache helper ...	2025-01-20 20:27:33 -08:00
Linus Torvalds	1cbfb828e0	for-6.14/block-20250118 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmeL6hoQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgppw2EADQV8nDgLRggZR+il4U03yKHXcQEdAX1GrB Erowx+dasIJuh6kp3n6qRe9QD/pRqt1DKyLvXoWF8Qfuwq85j7oDnDDYxutNYT27 hDgrLJriJ3VeKYtTu+andHWt8P29b5h57UayInDOUJurEPA6rXyFZ5YVIti8n21K uDOrQXiACG3qRWS2+p2f3UNhX0MkFNFdN/lxi13WMIJtRWF5bXAP+JOgIWCID4Ze QuSY6rQD4dp4Q6M2erpX6tn0YZb7Hvw3rPjsd91n6jvYfTUVLH375zg8jCBpi6Wi Syufbb8xcTtriVPTDRNu0ekjebkc8wD8ax/h86g0z9v3Ua4DlNmsx9eXrtv6r5nu YXqDODOad6stI0+owFquW2vas0gHmfNSfyfGdlk2g24PMtP5Yx0V6FIEvwIeqnje ghgxQvBuKUsdhqakByfNnc+XvXi3+RUJek8kvMeUSUQWT1IyMQqPOOk0yp9WdyWD bY1f2ECP5BR1b37zYOyawewsI5xTupHUswn5a4r4qtGn3O15rGDkX98Nab5aLCnR rW/DvX7+wT6gW9EwrRHiwjwfNDZbsJ9Ggu3lMhtUl5GUWdk58yTiVgKaHJLnlX9/ CKFKfyyIR1Vl8+gYIpemyFhhcoN+dCSf06ISkrg0jeS0/tYwydaAaCBPL5J4kxZA h3Rtbh+Pgg== =EXYs -----END PGP SIGNATURE----- Merge tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - NVMe pull requests via Keith: - Target support for PCI-Endpoint transport (Damien) - TCP IO queue spreading fixes (Sagi, Chaitanya) - Target handling for "limited retry" flags (Guixen) - Poll type fix (Yongsoo) - Xarray storage error handling (Keisuke) - Host memory buffer free size fix on error (Francis) - MD pull requests via Song: - Reintroduce md-linear (Yu Kuai) - md-bitmap refactor and fix (Yu Kuai) - Replace kmap_atomic with kmap_local_page (David Reaver) - Quite a few queue freeze and debugfs deadlock fixes Ming introduced lockdep support for this in the 6.13 kernel, and it has (unsurprisingly) uncovered quite a few issues - Use const attributes for IO schedulers - Remove bio ioprio wrappers - Fixes for stacked device atomic write support - Refactor queue affinity helpers, in preparation for better supporting isolated CPUs - Cleanups of loop O_DIRECT handling - Cleanup of BLK_MQ_F_* flags - Add rotational support for null_blk - Various fixes and cleanups * tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux: (106 commits) block: Don't trim an atomic write block: Add common atomic writes enable flag md/md-linear: Fix a NULL vs IS_ERR() bug in linear_add() block: limit disk max sectors to (LLONG_MAX >> 9) block: Change blk_stack_atomic_writes_limits() unit_min check block: Ensure start sector is aligned for stacking atomic writes blk-mq: Move more error handling into blk_mq_submit_bio() block: Reorder the request allocation code in blk_mq_submit_bio() nvme: fix bogus kzalloc() return check in nvme_init_effects_log() md/md-bitmap: move bitmap_{start, end}write to md upper layer md/raid5: implement pers->bitmap_sector() md: add a new callback pers->bitmap_sector() md/md-bitmap: remove the last parameter for bimtap_ops->endwrite() md/md-bitmap: factor behind write counters out from bitmap_{start/end}write() md: Replace deprecated kmap_atomic() with kmap_local_page() md: reintroduce md-linear partitions: ldm: remove the initial kernel-doc notation blk-cgroup: rwstat: fix kernel-doc warnings in header file blk-cgroup: fix kernel-doc warnings in header file nbd: fix partial sending ...	2025-01-20 19:38:46 -08:00
John Garry	554b22864c	block: Don't trim an atomic write This is disallowed. This check will now be relevant since the device mapper personalities will start to support atomic writes, and they use this function. Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20250116170301.474130-3-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-01-17 13:13:55 -07:00
John Garry	6a7e17b220	block: Add common atomic writes enable flag Currently only stacked devices need to explicitly enable atomic writes by setting BLK_FEAT_ATOMIC_WRITES_STACKED flag. This does not work well for device mapper stacking devices, as there many sets of limits are stacked and what is the 'bottom' and 'top' device can swapped. This means that BLK_FEAT_ATOMIC_WRITES_STACKED needs to be set for many queue limits, which is messy. Generalize enabling atomic writes enabling by ensuring that all devices must explicitly set a flag - that includes NVMe, SCSI sd, and md raid. Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20250116170301.474130-2-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-01-17 13:13:54 -07:00
Ming Lei	3d9a9e9a77	block: limit disk max sectors to (LLONG_MAX >> 9) Kernel `loff_t` is defined as `long long int`, so we can't support disk which size is > LLONG_MAX. There are many virtual block drivers, and hardware may report bad capacity too, so limit max sectors to (LLONG_MAX >> 9) for avoiding potential trouble. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250115092648.1104452-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-01-15 15:46:56 -07:00
John Garry	5d1f7ee7f0	block: Change blk_stack_atomic_writes_limits() unit_min check The current check in blk_stack_atomic_writes_limits() for a bottom device supporting atomic writes is to verify that limit atomic_write_unit_min is non-zero. This would cause a problem for device mapper queue limits calculation. This is because it uses a temporary queue_limits structure to stack the limits, before finally commiting the limits update. The value of atomic_write_unit_min for the temporary queue_limits structure is never evaluated and so cannot be used, so use limit atomic_write_hw_unit_min. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20250109114000.2299896-3-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-01-15 09:47:43 -07:00
John Garry	6564862d64	block: Ensure start sector is aligned for stacking atomic writes For stacking atomic writes, ensure that the start sector is aligned with the device atomic write unit min and any boundary. Otherwise, we may permit misaligned atomic writes. Rework bdev_can_atomic_write() into a common helper to resuse the alignment check. There also use atomic_write_hw_unit_min, which is more proper (than atomic_write_unit_min). Fixes: d7f36dc446e89 ("block: Support atomic writes limits for stacked devices") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20250109114000.2299896-2-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-01-15 09:47:43 -07:00
Bart Van Assche	659381520a	blk-mq: Move more error handling into blk_mq_submit_bio() The error handling code in blk_mq_get_new_requests() cannot be understood without knowing that this function is only called by blk_mq_submit_bio(). Hence move the code for handling blk_mq_get_new_requests() failures into blk_mq_submit_bio(). Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20241218212246.1073149-3-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-01-14 10:13:25 -07:00
Bart Van Assche	44e4138159	block: Reorder the request allocation code in blk_mq_submit_bio() Help the CPU branch predictor in case of a cache hit by handling the cache hit scenario first. Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20241218212246.1073149-2-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-01-14 10:13:25 -07:00
Randy Dunlap	e494e45161	partitions: ldm: remove the initial kernel-doc notation Remove the file's first comment describing what the file is. This comment is not in kernel-doc format so it causes a kernel-doc warning. ldm.h:13: warning: expecting prototype for ldm(). Prototype was for _FS_PT_LDM_H_() instead Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Richard Russon (FlatCap) <ldm@flatcap.org> Cc: linux-ntfs-dev@lists.sourceforge.net Cc: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20250111062758.910458-1-rdunlap@infradead.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-01-13 07:47:19 -07:00

1 2 3 4 5 ...

7730 Commits