mirror of
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/
synced 2025-04-19 20:58:31 +09:00
73 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
![]() |
e3924279e5 |
futex: Use a hashmask instead of hashsize
The global hash uses futex_hashsize to save the amount of the hash buckets that have been allocated during system boot. On each futex_hash() invocation this number is substracted by one to get the mask. This can be optimized by saving directly the mask avoiding the substraction on each futex_hash() invocation. Rename futex_hashsize to futex_hashmask and save the mask of the allocated hash map. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/all/20250226091057.bX8vObR4@linutronix.de |
||
![]() |
fa76887bb7 |
Fix a dangling pointer bug in the futex code used by the
uring code, which isn't causing problems at the moment due to uring ABI limitations leaving it essentially unused in current usages, but is a good idea to fix nevertheless. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmenHrkRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1jnLQ//T+vNYeyQ5Nc3CuqsZfv5h77ijCLzazSh qu5LXyGHHIlLLPEzh53wRQQbGBQ6A2HdbVVphn8k/0v4eT1Ez5yN7AiTYuPkEP73 m6MWQAWcGQ7M7vR7cvWIsIB1wS5PD2g3UdvS8x+OECZk4lnSx4Xh/TfbRIURwhe2 SS6jgRGhaodsp8N2o8c/BgrvvHY9aedJQhx4iAh3PiuPomygr9kfIAaQstQNKx61 w4NQBQhK93LD9duESc+ONDlRhzSvbdJfRby1hbHzvcnCGe5S2aZzOfY31CPJbOt6 UvbfeStEGEHkfqbZOXEtwVPZ80+U2hWvD67wSXFB0pTc68zkuGN3/Ko88GCyZx5+ mxDRYWLoExknEUuk/Mc+hOzu1uaCjpXxA8qRr7SW3ewH1QOGr+ZISQgSffRdujbH 2E2cBh9/HOeVZ/7nAvfkSU+yyfvBwZBP/Q0PN5ODpk3S7ZfCC7h57oClWx4WUuTX 0H9N2IvPG0hqmqljKkt/5Xc4Qgvh6RA+pmxK0uUngViuw+v81Ea7/m+kbetQRO07 OPOH/UT4nlmwoCwch+nKr/MRmZADpXEZyeRKS0kBJQLRMkN9VT1+e2Zf3Yir2Ji4 hveqiJKiIgCPPxz3w+N/XcSgOTQUN1PmOLjEXB+gRNRctsvGZOtuY2HZIydQAMbT EjJBwkEWIQo= =qhN4 -----END PGP SIGNATURE----- Merge tag 'locking-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fix from Ingo Molnar: "Fix a dangling pointer bug in the futex code used by the uring code. It isn't causing problems at the moment due to uring ABI limitations leaving it essentially unused in current usages, but is a good idea to fix nevertheless" * tag 'locking-urgent-2025-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: futex: Pass in task to futex_queue() |
||
![]() |
c159dfbdd4 |
Mainly individually changelogged singleton patches. The patch series in
this pull are: - "lib min_heap: Improve min_heap safety, testing, and documentation" from Kuan-Wei Chiu provides various tightenings to the min_heap library code. - "xarray: extract __xa_cmpxchg_raw" from Tamir Duberstein preforms some cleanup and Rust preparation in the xarray library code. - "Update reference to include/asm-<arch>" from Geert Uytterhoeven fixes pathnames in some code comments. - "Converge on using secs_to_jiffies()" from Easwar Hariharan uses the new secs_to_jiffies() in various places where that is appropriate. - "ocfs2, dlmfs: convert to the new mount API" from Eric Sandeen switches two filesystems to the new mount API. - "Convert ocfs2 to use folios" from Matthew Wilcox does that. - "Remove get_task_comm() and print task comm directly" from Yafang Shao removes now-unneeded calls to get_task_comm() in various places. - "squashfs: reduce memory usage and update docs" from Phillip Lougher implements some memory savings in squashfs and performs some maintainability work. - "lib: clarify comparison function requirements" from Kuan-Wei Chiu tightens the sort code's behaviour and adds some maintenance work. - "nilfs2: protect busy buffer heads from being force-cleared" from Ryusuke Konishi fixes an issues in nlifs when the fs is presented with a corrupted image. - "nilfs2: fix kernel-doc comments for function return values" from Ryusuke Konishi fixes some nilfs kerneldoc. - "nilfs2: fix issues with rename operations" from Ryusuke Konishi addresses some nilfs BUG_ONs which syzbot was able to trigger. - "minmax.h: Cleanups and minor optimisations" from David Laight does some maintenance work on the min/max library code. - "Fixes and cleanups to xarray" from Kemeng Shi does maintenance work on the xarray library code. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ5SP5QAKCRDdBJ7gKXxA jqN7AQChvwXGG43n4d5SDiA/rH7ddvowQcDqhC9cAMJ1ReR7qwEA8/LIWDE4PdMX mJnaZ1/ibpEpearrChCViApQtcyEGQI= =ti4E -----END PGP SIGNATURE----- Merge tag 'mm-nonmm-stable-2025-01-24-23-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: "Mainly individually changelogged singleton patches. The patch series in this pull are: - "lib min_heap: Improve min_heap safety, testing, and documentation" from Kuan-Wei Chiu provides various tightenings to the min_heap library code - "xarray: extract __xa_cmpxchg_raw" from Tamir Duberstein preforms some cleanup and Rust preparation in the xarray library code - "Update reference to include/asm-<arch>" from Geert Uytterhoeven fixes pathnames in some code comments - "Converge on using secs_to_jiffies()" from Easwar Hariharan uses the new secs_to_jiffies() in various places where that is appropriate - "ocfs2, dlmfs: convert to the new mount API" from Eric Sandeen switches two filesystems to the new mount API - "Convert ocfs2 to use folios" from Matthew Wilcox does that - "Remove get_task_comm() and print task comm directly" from Yafang Shao removes now-unneeded calls to get_task_comm() in various places - "squashfs: reduce memory usage and update docs" from Phillip Lougher implements some memory savings in squashfs and performs some maintainability work - "lib: clarify comparison function requirements" from Kuan-Wei Chiu tightens the sort code's behaviour and adds some maintenance work - "nilfs2: protect busy buffer heads from being force-cleared" from Ryusuke Konishi fixes an issues in nlifs when the fs is presented with a corrupted image - "nilfs2: fix kernel-doc comments for function return values" from Ryusuke Konishi fixes some nilfs kerneldoc - "nilfs2: fix issues with rename operations" from Ryusuke Konishi addresses some nilfs BUG_ONs which syzbot was able to trigger - "minmax.h: Cleanups and minor optimisations" from David Laight does some maintenance work on the min/max library code - "Fixes and cleanups to xarray" from Kemeng Shi does maintenance work on the xarray library code" * tag 'mm-nonmm-stable-2025-01-24-23-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (131 commits) ocfs2: use str_yes_no() and str_no_yes() helper functions include/linux/lz4.h: add some missing macros Xarray: use xa_mark_t in xas_squash_marks() to keep code consistent Xarray: remove repeat check in xas_squash_marks() Xarray: distinguish large entries correctly in xas_split_alloc() Xarray: move forward index correctly in xas_pause() Xarray: do not return sibling entries from xas_find_marked() ipc/util.c: complete the kernel-doc function descriptions gcov: clang: use correct function param names latencytop: use correct kernel-doc format for func params minmax.h: remove some #defines that are only expanded once minmax.h: simplify the variants of clamp() minmax.h: move all the clamp() definitions after the min/max() ones minmax.h: use BUILD_BUG_ON_MSG() for the lo < hi test in clamp() minmax.h: reduce the #define expansion of min(), max() and clamp() minmax.h: update some comments minmax.h: add whitespace around operators and after commas nilfs2: do not update mtime of renamed directory that is not moved nilfs2: handle errors that nilfs_prepare_chunk() may return CREDITS: fix spelling mistake ... |
||
![]() |
5e0e02f0d7 |
futex: Pass in task to futex_queue()
futex_queue() -> __futex_queue() uses 'current' as the task to store in the struct futex_q->task field. This is fine for synchronous usage of the futex infrastructure, but it's not always correct when used by io_uring where the task doing the initial futex_queue() might not be available later on. This doesn't lead to any issues currently, as the io_uring side doesn't support PI futexes, but it does leave a potentially dangling pointer which is never a good idea. Have futex_queue() take a task_struct argument, and have the regular callers pass in 'current' for that. Meanwhile io_uring can just pass in NULL, as the task should never be used off that path. In theory req->tctx->task could be used here, but there's no point populating it with a task field that will never be used anyway. Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/22484a23-542c-4003-b721-400688a0d055@kernel.dk |
||
![]() |
b6dcdb06c0 |
kernel: remove get_task_comm() and print task comm directly
Patch series "Remove get_task_comm() and print task comm directly", v2. Since task->comm is guaranteed to be NUL-terminated, we can print it directly without the need to copy it into a separate buffer. This simplifies the code and avoids unnecessary operations. This patch (of 5): Since task->comm is guaranteed to be NUL-terminated, we can print it directly without the need to copy it into a separate buffer. This simplifies the code and avoids unnecessary operations. Link: https://lkml.kernel.org/r/20241219023452.69907-1-laoar.shao@gmail.com Link: https://lkml.kernel.org/r/20241219023452.69907-2-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Cc: Serge Hallyn <serge@hallyn.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Darren Hart <dvhart@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: "André Almeida" <andrealmeid@igalia.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Kalle Valo <kvalo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Danilo Krummrich <dakr@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Airlie <airlied@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: James Morris <jmorris@namei.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Karol Herbst <kherbst@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: Lyude Paul <lyude@redhat.com> Cc: Oded Gabbay <ogabbay@kernel.org> Cc: Paul Moore <paul@paul-moore.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Simona Vetter <simona@ffwll.ch> Cc: Tvrtko Ursulin <tursulin@ursulin.net> Cc: Vineet Gupta <vgupta@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
abfdccd6af |
sched/wake_q: Add helper to call wake_up_q after unlock with preemption disabled
A common pattern seen when wake_qs are used to defer a wakeup until after a lock is released is something like: preempt_disable(); raw_spin_unlock(lock); wake_up_q(wake_q); preempt_enable(); So create some raw_spin_unlock*_wake() helper functions to clean this up. Applies on top of the fix I submitted here: https://lore.kernel.org/lkml/20241212222138.2400498-1-jstultz@google.com/ NOTE: I recognise the unlock()/unlock_irq()/unlock_irqrestore() variants creates its own duplication, which we could use a macro to generate the similar functions, but I often dislike how those generation macros making finding the actual implementation harder, so I left the three functions as is. If folks would prefer otherwise, let me know and I'll switch it. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20241217040803.243420-1-jstultz@google.com |
||
![]() |
32913f3482 |
futex: fix user access on powerpc
The powerpc user access code is special, and unlike other architectures distinguishes between user access for reading and writing. And commit 43a43faf5376 ("futex: improve user space accesses") messed that up. It went undetected elsewhere, but caused ppc32 to fail early during boot, because the user access had been started with user_read_access_begin(), but then finished off with just a plain "user_access_end()". Note that the address-masking user access helpers don't even have that read-vs-write distinction, so if powerpc ever wants to do address masking tricks, we'll have to do some extra work for it. [ Make sure to also do it for the EFAULT case, as pointed out by Christophe Leroy ] Reported-by: Andreas Schwab <schwab@linux-m68k.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Link: https://lore.kernel.org/all/87bjxl6b0i.fsf@igel.home/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
![]() |
43a43faf53 |
futex: improve user space accesses
Josh Poimboeuf reports that he got a "will-it-scale.per_process_ops 1.9% improvement" report for his patch that changed __get_user() to use pointer masking instead of the explicit speculation barrier. However, that patch doesn't actually work in the general case, because some (very bad) architecture-specific code actually depends on __get_user() also working on kernel addresses. A profile showed that the offending __get_user() was the futex code, which really should be fixed up to not use that horrid legacy case. Rewrite futex_get_value_locked() to use the modern user acccess helpers, and inline it so that the compiler not only avoids the function call for a few instructions, but can do CSE on the address masking. It also turns out the x86 futex functions have unnecessary barriers in other places, so let's fix those up too. Link: https://lore.kernel.org/all/20241115230653.hfvzyf3aqqntgp63@jpoimboe/ Reported-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
![]() |
5c00ff742b |
- The series "zram: optimal post-processing target selection" from
Sergey Senozhatsky improves zram's post-processing selection algorithm. This leads to improved memory savings. - Wei Yang has gone to town on the mapletree code, contributing several series which clean up the implementation: - "refine mas_mab_cp()" - "Reduce the space to be cleared for maple_big_node" - "maple_tree: simplify mas_push_node()" - "Following cleanup after introduce mas_wr_store_type()" - "refine storing null" - The series "selftests/mm: hugetlb_fault_after_madv improvements" from David Hildenbrand fixes this selftest for s390. - The series "introduce pte_offset_map_{ro|rw}_nolock()" from Qi Zheng implements some rationaizations and cleanups in the page mapping code. - The series "mm: optimize shadow entries removal" from Shakeel Butt optimizes the file truncation code by speeding up the handling of shadow entries. - The series "Remove PageKsm()" from Matthew Wilcox completes the migration of this flag over to being a folio-based flag. - The series "Unify hugetlb into arch_get_unmapped_area functions" from Oscar Salvador implements a bunch of consolidations and cleanups in the hugetlb code. - The series "Do not shatter hugezeropage on wp-fault" from Dev Jain takes away the wp-fault time practice of turning a huge zero page into small pages. Instead we replace the whole thing with a THP. More consistent cleaner and potentiall saves a large number of pagefaults. - The series "percpu: Add a test case and fix for clang" from Andy Shevchenko enhances and fixes the kernel's built in percpu test code. - The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett optimizes mremap() by avoiding doing things which we didn't need to do. - The series "Improve the tmpfs large folio read performance" from Baolin Wang teaches tmpfs to copy data into userspace at the folio size rather than as individual pages. A 20% speedup was observed. - The series "mm/damon/vaddr: Fix issue in damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON splitting. - The series "memcg-v1: fully deprecate charge moving" from Shakeel Butt removes the long-deprecated memcgv2 charge moving feature. - The series "fix error handling in mmap_region() and refactor" from Lorenzo Stoakes cleanup up some of the mmap() error handling and addresses some potential performance issues. - The series "x86/module: use large ROX pages for text allocations" from Mike Rapoport teaches x86 to use large pages for read-only-execute module text. - The series "page allocation tag compression" from Suren Baghdasaryan is followon maintenance work for the new page allocation profiling feature. - The series "page->index removals in mm" from Matthew Wilcox remove most references to page->index in mm/. A slow march towards shrinking struct page. - The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs interface tests" from Andrew Paniakin performs maintenance work for DAMON's self testing code. - The series "mm: zswap swap-out of large folios" from Kanchana Sridhar improves zswap's batching of compression and decompression. It is a step along the way towards using Intel IAA hardware acceleration for this zswap operation. - The series "kasan: migrate the last module test to kunit" from Sabyrzhan Tasbolatov completes the migration of the KASAN built-in tests over to the KUnit framework. - The series "implement lightweight guard pages" from Lorenzo Stoakes permits userapace to place fault-generating guard pages within a single VMA, rather than requiring that multiple VMAs be created for this. Improved efficiencies for userspace memory allocators are expected. - The series "memcg: tracepoint for flushing stats" from JP Kobryn uses tracepoints to provide increased visibility into memcg stats flushing activity. - The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky fixes a zram buglet which potentially affected performance. - The series "mm: add more kernel parameters to control mTHP" from Maíra Canal enhances our ability to control/configuremultisize THP from the kernel boot command line. - The series "kasan: few improvements on kunit tests" from Sabyrzhan Tasbolatov has a couple of fixups for the KASAN KUnit tests. - The series "mm/list_lru: Split list_lru lock into per-cgroup scope" from Kairui Song optimizes list_lru memory utilization when lockdep is enabled. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZzwFqgAKCRDdBJ7gKXxA jkeuAQCkl+BmeYHE6uG0hi3pRxkupseR6DEOAYIiTv0/l8/GggD/Z3jmEeqnZaNq xyyenpibWgUoShU2wZ/Ha8FE5WDINwg= =JfWR -----END PGP SIGNATURE----- Merge tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - The series "zram: optimal post-processing target selection" from Sergey Senozhatsky improves zram's post-processing selection algorithm. This leads to improved memory savings. - Wei Yang has gone to town on the mapletree code, contributing several series which clean up the implementation: - "refine mas_mab_cp()" - "Reduce the space to be cleared for maple_big_node" - "maple_tree: simplify mas_push_node()" - "Following cleanup after introduce mas_wr_store_type()" - "refine storing null" - The series "selftests/mm: hugetlb_fault_after_madv improvements" from David Hildenbrand fixes this selftest for s390. - The series "introduce pte_offset_map_{ro|rw}_nolock()" from Qi Zheng implements some rationaizations and cleanups in the page mapping code. - The series "mm: optimize shadow entries removal" from Shakeel Butt optimizes the file truncation code by speeding up the handling of shadow entries. - The series "Remove PageKsm()" from Matthew Wilcox completes the migration of this flag over to being a folio-based flag. - The series "Unify hugetlb into arch_get_unmapped_area functions" from Oscar Salvador implements a bunch of consolidations and cleanups in the hugetlb code. - The series "Do not shatter hugezeropage on wp-fault" from Dev Jain takes away the wp-fault time practice of turning a huge zero page into small pages. Instead we replace the whole thing with a THP. More consistent cleaner and potentiall saves a large number of pagefaults. - The series "percpu: Add a test case and fix for clang" from Andy Shevchenko enhances and fixes the kernel's built in percpu test code. - The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett optimizes mremap() by avoiding doing things which we didn't need to do. - The series "Improve the tmpfs large folio read performance" from Baolin Wang teaches tmpfs to copy data into userspace at the folio size rather than as individual pages. A 20% speedup was observed. - The series "mm/damon/vaddr: Fix issue in damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON splitting. - The series "memcg-v1: fully deprecate charge moving" from Shakeel Butt removes the long-deprecated memcgv2 charge moving feature. - The series "fix error handling in mmap_region() and refactor" from Lorenzo Stoakes cleanup up some of the mmap() error handling and addresses some potential performance issues. - The series "x86/module: use large ROX pages for text allocations" from Mike Rapoport teaches x86 to use large pages for read-only-execute module text. - The series "page allocation tag compression" from Suren Baghdasaryan is followon maintenance work for the new page allocation profiling feature. - The series "page->index removals in mm" from Matthew Wilcox remove most references to page->index in mm/. A slow march towards shrinking struct page. - The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs interface tests" from Andrew Paniakin performs maintenance work for DAMON's self testing code. - The series "mm: zswap swap-out of large folios" from Kanchana Sridhar improves zswap's batching of compression and decompression. It is a step along the way towards using Intel IAA hardware acceleration for this zswap operation. - The series "kasan: migrate the last module test to kunit" from Sabyrzhan Tasbolatov completes the migration of the KASAN built-in tests over to the KUnit framework. - The series "implement lightweight guard pages" from Lorenzo Stoakes permits userapace to place fault-generating guard pages within a single VMA, rather than requiring that multiple VMAs be created for this. Improved efficiencies for userspace memory allocators are expected. - The series "memcg: tracepoint for flushing stats" from JP Kobryn uses tracepoints to provide increased visibility into memcg stats flushing activity. - The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky fixes a zram buglet which potentially affected performance. - The series "mm: add more kernel parameters to control mTHP" from Maíra Canal enhances our ability to control/configuremultisize THP from the kernel boot command line. - The series "kasan: few improvements on kunit tests" from Sabyrzhan Tasbolatov has a couple of fixups for the KASAN KUnit tests. - The series "mm/list_lru: Split list_lru lock into per-cgroup scope" from Kairui Song optimizes list_lru memory utilization when lockdep is enabled. * tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (215 commits) cma: enforce non-zero pageblock_order during cma_init_reserved_mem() mm/kfence: add a new kunit test test_use_after_free_read_nofault() zram: fix NULL pointer in comp_algorithm_show() memcg/hugetlb: add hugeTLB counters to memcg vmstat: call fold_vm_zone_numa_events() before show per zone NUMA event mm: mmap_lock: check trace_mmap_lock_$type_enabled() instead of regcount zram: ZRAM_DEF_COMP should depend on ZRAM MAINTAINERS/MEMORY MANAGEMENT: add document files for mm Docs/mm/damon: recommend academic papers to read and/or cite mm: define general function pXd_init() kmemleak: iommu/iova: fix transient kmemleak false positive mm/list_lru: simplify the list_lru walk callback function mm/list_lru: split the lock to per-cgroup scope mm/list_lru: simplify reparenting and initial allocation mm/list_lru: code clean up for reparenting mm/list_lru: don't export list_lru_add mm/list_lru: don't pass unnecessary key parameters kasan: add kunit tests for kmalloc_track_caller, kmalloc_node_track_caller kasan: change kasan_atomics kunit test as KUNIT_CASE_SLOW kasan: use EXPORT_SYMBOL_IF_KUNIT to export symbols ... |
||
![]() |
bf9aa14fc5 |
A rather large update for timekeeping and timers:
- The final step to get rid of auto-rearming posix-timers posix-timers are currently auto-rearmed by the kernel when the signal of the timer is ignored so that the timer signal can be delivered once the corresponding signal is unignored. This requires to throttle the timer to prevent a DoS by small intervals and keeps the system pointlessly out of low power states for no value. This is a long standing non-trivial problem due to the lock order of posix-timer lock and the sighand lock along with life time issues as the timer and the sigqueue have different life time rules. Cure this by: * Embedding the sigqueue into the timer struct to have the same life time rules. Aside of that this also avoids the lookup of the timer in the signal delivery and rearm path as it's just a always valid container_of() now. * Queuing ignored timer signals onto a seperate ignored list. * Moving queued timer signals onto the ignored list when the signal is switched to SIG_IGN before it could be delivered. * Walking the ignored list when SIG_IGN is lifted and requeue the signals to the actual signal lists. This allows the signal delivery code to rearm the timer. This also required to consolidate the signal delivery rules so they are consistent across all situations. With that all self test scenarios finally succeed. - Core infrastructure for VFS multigrain timestamping This is required to allow the kernel to use coarse grained time stamps by default and switch to fine grained time stamps when inode attributes are actively observed via getattr(). These changes have been provided to the VFS tree as well, so that the VFS specific infrastructure could be built on top. - Cleanup and consolidation of the sleep() infrastructure * Move all sleep and timeout functions into one file * Rework udelay() and ndelay() into proper documented inline functions and replace the hardcoded magic numbers by proper defines. * Rework the fsleep() implementation to take the reality of the timer wheel granularity on different HZ values into account. Right now the boundaries are hard coded time ranges which fail to provide the requested accuracy on different HZ settings. * Update documentation for all sleep/timeout related functions and fix up stale documentation links all over the place * Fixup a few usage sites - Rework of timekeeping and adjtimex(2) to prepare for multiple PTP clocks A system can have multiple PTP clocks which are participating in seperate and independent PTP clock domains. So far the kernel only considers the PTP clock which is based on CLOCK TAI relevant as that's the clock which drives the timekeeping adjustments via the various user space daemons through adjtimex(2). The non TAI based clock domains are accessible via the file descriptor based posix clocks, but their usability is very limited. They can't be accessed fast as they always go all the way out to the hardware and they cannot be utilized in the kernel itself. As Time Sensitive Networking (TSN) gains traction it is required to provide fast user and kernel space access to these clocks. The approach taken is to utilize the timekeeping and adjtimex(2) infrastructure to provide this access in a similar way how the kernel provides access to clock MONOTONIC, REALTIME etc. Instead of creating a duplicated infrastructure this rework converts timekeeping and adjtimex(2) into generic functionality which operates on pointers to data structures instead of using static variables. This allows to provide time accessors and adjtimex(2) functionality for the independent PTP clocks in a subsequent step. - Consolidate hrtimer initialization hrtimers are set up by initializing the data structure and then seperately setting the callback function for historical reasons. That's an extra unnecessary step and makes Rust support less straight forward than it should be. Provide a new set of hrtimer_setup*() functions and convert the core code and a few usage sites of the less frequently used interfaces over. The bulk of the htimer_init() to hrtimer_setup() conversion is already prepared and scheduled for the next merge window. - Drivers: * Ensure that the global timekeeping clocksource is utilizing the cluster 0 timer on MIPS multi-cluster systems. Otherwise CPUs on different clusters use their cluster specific clocksource which is not guaranteed to be synchronized with other clusters. * Mostly boring cleanups, fixes, improvements and code movement -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmc7kPITHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoZKkD/9OUL6fOJrDUmOYBa4QVeMyfTef4EaL tvwIMM/29XQFeiq3xxCIn+EMnHjXn2lvIhYGQ7GKsbKYwvJ7ZBDpQb+UMhZ2nKI9 6D6BP6WomZohKeH2fZbJQAdqOi3KRYdvQdIsVZUexkqiaVPphRvOH9wOr45gHtZM EyMRSotPlQTDqcrbUejDMEO94GyjDCYXRsyATLxjmTzL/N4xD4NRIiotjM2vL/a9 8MuCgIhrKUEyYlFoOxxeokBsF3kk3/ez2jlG9b/N8VLH3SYIc2zgL58FBgWxlmgG bY71nVG3nUgEjxBd2dcXAVVqvb+5widk8p6O7xxOAQKTLMcJ4H0tQDkMnzBtUzvB DGAJDHAmAr0g+ja9O35Pkhunkh4HYFIbq0Il4d1HMKObhJV0JumcKuQVxrXycdm3 UZfq3seqHsZJQbPgCAhlFU0/2WWScocbee9bNebGT33KVwSp5FoVv89C/6Vjb+vV Gusc3thqrQuMAZW5zV8g4UcBAA/xH4PB0I+vHib+9XPZ4UQ7/6xKl2jE0kd5hX7n AAUeZvFNFqIsY+B6vz+Jx/yzyM7u5cuXq87pof5EHVFzv56lyTp4ToGcOGYRgKH5 JXeYV1OxGziSDrd5vbf9CzdWMzqMvTefXrHbWrjkjhNOe8E1A8O88RZ5uRKZhmSw hZZ4hdM9+3T7cg== =2VC6 -----END PGP SIGNATURE----- Merge tag 'timers-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: "A rather large update for timekeeping and timers: - The final step to get rid of auto-rearming posix-timers posix-timers are currently auto-rearmed by the kernel when the signal of the timer is ignored so that the timer signal can be delivered once the corresponding signal is unignored. This requires to throttle the timer to prevent a DoS by small intervals and keeps the system pointlessly out of low power states for no value. This is a long standing non-trivial problem due to the lock order of posix-timer lock and the sighand lock along with life time issues as the timer and the sigqueue have different life time rules. Cure this by: - Embedding the sigqueue into the timer struct to have the same life time rules. Aside of that this also avoids the lookup of the timer in the signal delivery and rearm path as it's just a always valid container_of() now. - Queuing ignored timer signals onto a seperate ignored list. - Moving queued timer signals onto the ignored list when the signal is switched to SIG_IGN before it could be delivered. - Walking the ignored list when SIG_IGN is lifted and requeue the signals to the actual signal lists. This allows the signal delivery code to rearm the timer. This also required to consolidate the signal delivery rules so they are consistent across all situations. With that all self test scenarios finally succeed. - Core infrastructure for VFS multigrain timestamping This is required to allow the kernel to use coarse grained time stamps by default and switch to fine grained time stamps when inode attributes are actively observed via getattr(). These changes have been provided to the VFS tree as well, so that the VFS specific infrastructure could be built on top. - Cleanup and consolidation of the sleep() infrastructure - Move all sleep and timeout functions into one file - Rework udelay() and ndelay() into proper documented inline functions and replace the hardcoded magic numbers by proper defines. - Rework the fsleep() implementation to take the reality of the timer wheel granularity on different HZ values into account. Right now the boundaries are hard coded time ranges which fail to provide the requested accuracy on different HZ settings. - Update documentation for all sleep/timeout related functions and fix up stale documentation links all over the place - Fixup a few usage sites - Rework of timekeeping and adjtimex(2) to prepare for multiple PTP clocks A system can have multiple PTP clocks which are participating in seperate and independent PTP clock domains. So far the kernel only considers the PTP clock which is based on CLOCK TAI relevant as that's the clock which drives the timekeeping adjustments via the various user space daemons through adjtimex(2). The non TAI based clock domains are accessible via the file descriptor based posix clocks, but their usability is very limited. They can't be accessed fast as they always go all the way out to the hardware and they cannot be utilized in the kernel itself. As Time Sensitive Networking (TSN) gains traction it is required to provide fast user and kernel space access to these clocks. The approach taken is to utilize the timekeeping and adjtimex(2) infrastructure to provide this access in a similar way how the kernel provides access to clock MONOTONIC, REALTIME etc. Instead of creating a duplicated infrastructure this rework converts timekeeping and adjtimex(2) into generic functionality which operates on pointers to data structures instead of using static variables. This allows to provide time accessors and adjtimex(2) functionality for the independent PTP clocks in a subsequent step. - Consolidate hrtimer initialization hrtimers are set up by initializing the data structure and then seperately setting the callback function for historical reasons. That's an extra unnecessary step and makes Rust support less straight forward than it should be. Provide a new set of hrtimer_setup*() functions and convert the core code and a few usage sites of the less frequently used interfaces over. The bulk of the htimer_init() to hrtimer_setup() conversion is already prepared and scheduled for the next merge window. - Drivers: - Ensure that the global timekeeping clocksource is utilizing the cluster 0 timer on MIPS multi-cluster systems. Otherwise CPUs on different clusters use their cluster specific clocksource which is not guaranteed to be synchronized with other clusters. - Mostly boring cleanups, fixes, improvements and code movement" * tag 'timers-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (140 commits) posix-timers: Fix spurious warning on double enqueue versus do_exit() clocksource/drivers/arm_arch_timer: Use of_property_present() for non-boolean properties clocksource/drivers/gpx: Remove redundant casts clocksource/drivers/timer-ti-dm: Fix child node refcount handling dt-bindings: timer: actions,owl-timer: convert to YAML clocksource/drivers/ralink: Add Ralink System Tick Counter driver clocksource/drivers/mips-gic-timer: Always use cluster 0 counter as clocksource clocksource/drivers/timer-ti-dm: Don't fail probe if int not found clocksource/drivers:sp804: Make user selectable clocksource/drivers/dw_apb: Remove unused dw_apb_clockevent functions hrtimers: Delete hrtimer_init_on_stack() alarmtimer: Switch to use hrtimer_setup() and hrtimer_setup_on_stack() io_uring: Switch to use hrtimer_setup_on_stack() sched/idle: Switch to use hrtimer_setup_on_stack() hrtimers: Delete hrtimer_init_sleeper_on_stack() wait: Switch to use hrtimer_setup_sleeper_on_stack() timers: Switch to use hrtimer_setup_sleeper_on_stack() net: pktgen: Switch to use hrtimer_setup_sleeper_on_stack() futex: Switch to use hrtimer_setup_sleeper_on_stack() fs/aio: Switch to use hrtimer_setup_sleeper_on_stack() ... |
||
![]() |
3f020399e4 |
Scheduler changes for v6.13:
- Core facilities: - Add the "Lazy preemption" model (CONFIG_PREEMPT_LAZY=y), which optimizes fair-class preemption by delaying preemption requests to the tick boundary, while working as full preemption for RR/FIFO/DEADLINE classes. (Peter Zijlstra) - x86: Enable Lazy preemption (Peter Zijlstra) - riscv: Enable Lazy preemption (Jisheng Zhang) - Initialize idle tasks only once (Thomas Gleixner) - sched/ext: Remove sched_fork() hack (Thomas Gleixner) - Fair scheduler: - Optimize the PLACE_LAG when se->vlag is zero (Huang Shijie) - Idle loop: Optimize the generic idle loop by removing unnecessary memory barrier (Zhongqiu Han) - RSEQ: - Improve cache locality of RSEQ concurrency IDs for intermittent workloads (Mathieu Desnoyers) - Waitqueues: - Make wake_up_{bit,var} less fragile (Neil Brown) - PSI: - Pass enqueue/dequeue flags to psi callbacks directly (Johannes Weiner) - Preparatory patches for proxy execution: - core: Add move_queued_task_locked helper (Connor O'Brien) - core: Consolidate pick_*_task to task_is_pushable helper (Connor O'Brien) - core: Split out __schedule() deactivate task logic into a helper (John Stultz) - core: Split scheduler and execution contexts (Peter Zijlstra) - locking/mutex: Make mutex::wait_lock irq safe (Juri Lelli) - locking/mutex: Expose __mutex_owner() (Juri Lelli) - locking/mutex: Remove wakeups from under mutex::wait_lock (Peter Zijlstra) - Misc fixes and cleanups: - core: Remove unused __HAVE_THREAD_FUNCTIONS hook support (David Disseldorp) - core: Update the comment for TIF_NEED_RESCHED_LAZY (Sebastian Andrzej Siewior) - wait: Remove unused bit_wait_io_timeout (Dr. David Alan Gilbert) - fair: remove the DOUBLE_TICK feature (Huang Shijie) - fair: fix the comment for PREEMPT_SHORT (Huang Shijie) - uclamp: Fix unnused variable warning (Christian Loehle) - rt: No PREEMPT_RT=y for all{yes,mod}config Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmc7fnQRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1hZTBAAozVdWA2m51aNa67HvAZta/olmrIagVbW inwbTgqa8b+UfeWEuKOfrZr5khjEh6pLgR3dBTib1uH6xxYj/Okds+qbPWSBPVLh yzavlm/zJZM1U1XtxE3eyVfqWik4GrY7DoIMDQQr+YH7rNXonJeJkll38OI2E5MC q3Q01qyMo8RJJX8qkf3f8ObOoP/51NsVniTw0Zb2fzEhXz8FjezLlxk6cMfgSkJG lg9gfIwUZ7Xg5neRo4kJcc3Ht31KYOhWSiupBJzRD1hss/N/AybvMcTX/Cm8d07w HIAdDDAn84o46miFo/a0V/hsJZ72idWbqxVJUCtaezrpOUiFkG+uInRvG/ynr0lF 5dEI9f+6PUw8Nc7L72IyHkobjPqS2IefSaxYYCBKmxMX2qrenfTor/pKiWzzhBIl rX3MZSuUJ8NjV4rNGD/qXRM1IsMJrsDwxDyv+sRec3XdH33x286ds6aAUEPDQ6N7 96VS0sOKcNUJN8776ErNjlIxRl8HTlpkaO3nZlQIfXgTlXUpRvOuKbEWqP+606lo oANgJTKgUhgJPWZnvmdRxDjSiOp93QcImjus9i1tN81FGiEDleONsJUxu2Di1E5+ s1nCiytjq+cdvzCqFyiOZUh+g6kSZ4yXxNgLg2UvbXzX1zOeUQT3WtyKUhMPXhU8 esh1TgbUbpE= =Zcqj -----END PGP SIGNATURE----- Merge tag 'sched-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Core facilities: - Add the "Lazy preemption" model (CONFIG_PREEMPT_LAZY=y), which optimizes fair-class preemption by delaying preemption requests to the tick boundary, while working as full preemption for RR/FIFO/DEADLINE classes. (Peter Zijlstra) - x86: Enable Lazy preemption (Peter Zijlstra) - riscv: Enable Lazy preemption (Jisheng Zhang) - Initialize idle tasks only once (Thomas Gleixner) - sched/ext: Remove sched_fork() hack (Thomas Gleixner) Fair scheduler: - Optimize the PLACE_LAG when se->vlag is zero (Huang Shijie) Idle loop: - Optimize the generic idle loop by removing unnecessary memory barrier (Zhongqiu Han) RSEQ: - Improve cache locality of RSEQ concurrency IDs for intermittent workloads (Mathieu Desnoyers) Waitqueues: - Make wake_up_{bit,var} less fragile (Neil Brown) PSI: - Pass enqueue/dequeue flags to psi callbacks directly (Johannes Weiner) Preparatory patches for proxy execution: - Add move_queued_task_locked helper (Connor O'Brien) - Consolidate pick_*_task to task_is_pushable helper (Connor O'Brien) - Split out __schedule() deactivate task logic into a helper (John Stultz) - Split scheduler and execution contexts (Peter Zijlstra) - Make mutex::wait_lock irq safe (Juri Lelli) - Expose __mutex_owner() (Juri Lelli) - Remove wakeups from under mutex::wait_lock (Peter Zijlstra) Misc fixes and cleanups: - Remove unused __HAVE_THREAD_FUNCTIONS hook support (David Disseldorp) - Update the comment for TIF_NEED_RESCHED_LAZY (Sebastian Andrzej Siewior) - Remove unused bit_wait_io_timeout (Dr. David Alan Gilbert) - remove the DOUBLE_TICK feature (Huang Shijie) - fix the comment for PREEMPT_SHORT (Huang Shijie) - Fix unnused variable warning (Christian Loehle) - No PREEMPT_RT=y for all{yes,mod}config" * tag 'sched-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits) sched, x86: Update the comment for TIF_NEED_RESCHED_LAZY. sched: No PREEMPT_RT=y for all{yes,mod}config riscv: add PREEMPT_LAZY support sched, x86: Enable Lazy preemption sched: Enable PREEMPT_DYNAMIC for PREEMPT_RT sched: Add Lazy preemption model sched: Add TIF_NEED_RESCHED_LAZY infrastructure sched/ext: Remove sched_fork() hack sched: Initialize idle tasks only once sched: psi: pass enqueue/dequeue flags to psi callbacks directly sched/uclamp: Fix unnused variable warning sched: Split scheduler and execution contexts sched: Split out __schedule() deactivate task logic into a helper sched: Consolidate pick_*_task to task_is_pushable helper sched: Add move_queued_task_locked helper locking/mutex: Expose __mutex_owner() locking/mutex: Make mutex::wait_lock irq safe locking/mutex: Remove wakeups from under mutex::wait_lock sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads sched: idle: Optimize the generic idle loop by removing needless memory barrier ... |
||
![]() |
7d3e93eca3 |
mm: use page_pgoff() in more places
There are several places which currently open-code page_pgoff(), convert them to call it. Link: https://lkml.kernel.org/r/20241005200121.3231142-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
9788c1f0ff |
futex: Switch to use hrtimer_setup_sleeper_on_stack()
hrtimer_setup_sleeper_on_stack() replaces hrtimer_init_sleeper_on_stack() to keep the naming convention consistent. Convert the usage site over to it. The conversion was done with Coccinelle. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/d92116a17313dee283ebc959869bea80fbf94cdb.1730386209.git.namcao@linutronix.de |
||
![]() |
87347f1480 |
futex: Use atomic64_try_cmpxchg_relaxed() in get_inode_sequence_number()
Optimize get_inode_sequence_number() to use simpler and faster: !atomic64_try_cmpxchg_relaxed(*ptr, &old, new) instead of: atomic64_cmpxchg relaxed(*ptr, old, new) != old The x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg. The generated code improves from: 3da: 31 c0 xor %eax,%eax 3dc: f0 48 0f b1 8a 38 01 lock cmpxchg %rcx,0x138(%rdx) 3e3: 00 00 3e5: 48 85 c0 test %rax,%rax 3e8: 48 0f 44 c1 cmove %rcx,%rax to: 3da: 31 c0 xor %eax,%eax 3dc: f0 48 0f b1 8a 38 01 lock cmpxchg %rcx,0x138(%rdx) 3e3: 00 00 3e5: 48 0f 44 c1 cmove %rcx,%rax Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: André Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/all/20241010071023.21913-2-ubizjak@gmail.com |
||
![]() |
19298f4869 |
futex: Use atomic64_inc_return() in get_inode_sequence_number()
Use atomic64_inc_return(&ref) instead of atomic64_add_return(1, &ref) to use optimized implementation and ease register pressure around the primitive for targets that implement optimized variant. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: André Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/all/20241010071023.21913-1-ubizjak@gmail.com |
||
![]() |
894d1b3db4 |
locking/mutex: Remove wakeups from under mutex::wait_lock
In preparation to nest mutex::wait_lock under rq::lock we need to remove wakeups from under it. Do this by utilizing wake_qs to defer the wakeup until after the lock is dropped. [Heavily changed after 55f036ca7e74 ("locking: WW mutex cleanup") and 08295b3b5bee ("locking: Implement an algorithm choice for Wound-Wait mutexes")] [jstultz: rebased to mainline, added extra wake_up_q & init to avoid hangs, similar to Connor's rework of this patch] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Metin Kaya <metin.kaya@arm.com> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Metin Kaya <metin.kaya@arm.com> Link: https://lore.kernel.org/r/20241009235352.1614323-2-jstultz@google.com |
||
![]() |
6ce2082fd3 |
fault-inject: improve build for CONFIG_FAULT_INJECTION=n
The fault-inject.h users across the kernel need to add a lot of #ifdef CONFIG_FAULT_INJECTION to cater for shortcomings in the header. Make fault-inject.h self-contained for CONFIG_FAULT_INJECTION=n, and add stubs for DECLARE_FAULT_ATTR(), setup_fault_attr(), should_fail_ex(), and should_fail() to allow removal of conditional compilation. [akpm@linux-foundation.org: repair fallout from no longer including debugfs.h into fault-inject.h] [akpm@linux-foundation.org: fix drivers/misc/xilinx_tmr_inject.c] [akpm@linux-foundation.org: Add debugfs.h inclusion to more files, per Stephen] Link: https://lkml.kernel.org/r/20240813121237.2382534-1-jani.nikula@intel.com Fixes: 6ff1cb355e62 ("[PATCH] fault-injection capabilities infrastructure") Signed-off-by: Jani Nikula <jani.nikula@intel.com> Cc: Akinobu Mita <akinobu.mita@gmail.com> Cc: Abhinav Kumar <quic_abhinavk@quicinc.com> Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Rob Clark <robdclark@gmail.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
b3e90f375b |
printk: Change type of CONFIG_BASE_SMALL to bool
CONFIG_BASE_SMALL is currently a type int but is only used as a boolean. So, change its type to bool and adapt all usages: CONFIG_BASE_SMALL == 0 becomes !IS_ENABLED(CONFIG_BASE_SMALL) and CONFIG_BASE_SMALL != 0 becomes IS_ENABLED(CONFIG_BASE_SMALL). Reviewed-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Yoann Congal <yoann.congal@smile.fr> Link: https://lore.kernel.org/r/20240505080343.1471198-3-yoann.congal@smile.fr Signed-off-by: Petr Mladek <pmladek@suse.com> |
||
![]() |
e626cb02ee |
futex: Prevent the reuse of stale pi_state
Jiri Slaby reported a futex state inconsistency resulting in -EINVAL during a lock operation for a PI futex. It requires that the a lock process is interrupted by a timeout or signal: T1 Owns the futex in user space. T2 Tries to acquire the futex in kernel (futex_lock_pi()). Allocates a pi_state and attaches itself to it. T2 Times out and removes its rt_waiter from the rt_mutex. Drops the rtmutex lock and tries to acquire the hash bucket lock to remove the futex_q. The lock is contended and T2 schedules out. T1 Unlocks the futex (futex_unlock_pi()). Finds a futex_q but no rt_waiter. Unlocks the futex (do_uncontended) and makes it available to user space. T3 Acquires the futex in user space. T4 Tries to acquire the futex in kernel (futex_lock_pi()). Finds the existing futex_q of T2 and tries to attach itself to the existing pi_state. This (attach_to_pi_state()) fails with -EINVAL because uval contains the TID of T3 but pi_state points to T1. It's incorrect to unlock the futex and make it available for user space to acquire as long as there is still an existing state attached to it in the kernel. T1 cannot hand over the futex to T2 because T2 already gave up and started to clean up and is blocked on the hash bucket lock, so T2's futex_q with the pi_state pointing to T1 is still queued. T2 observes the futex_q, but ignores it as there is no waiter on the corresponding rt_mutex and takes the uncontended path which allows the subsequent caller of futex_lock_pi() (T4) to observe that stale state. To prevent this the unlock path must dequeue all futex_q entries which point to the same pi_state when there is no waiter on the rt mutex. This requires obviously to make the dequeue conditional in the locking path to prevent a double dequeue. With that it's guaranteed that user space cannot observe an uncontended futex which has kernel state attached. Fixes: fbeb558b0dd0d ("futex/pi: Fix recursive rt_mutex waiter state") Reported-by: Jiri Slaby <jirislaby@kernel.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jiri Slaby <jirislaby@kernel.org> Link: https://lore.kernel.org/r/20240118115451.0TkD_ZhB@linutronix.de Closes: https://lore.kernel.org/all/4611bcf2-44d0-4c34-9b84-17406f881003@kernel.org |
||
![]() |
8b7787a543 |
plist: Split out plist_types.h
Trimming down sched.h dependencies: we don't want to include more than the base types. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> |
||
![]() |
c9bd1568d5 |
futex: Fix hardcoded flags
Xi reported that commit 5694289ce183 ("futex: Flag conversion") broke glibc's robust futex tests. This was narrowed down to the change of FLAGS_SHARED from 0x01 to 0x10, at which point Florian noted that handle_futex_death() has a hardcoded flags argument of 1. Change this to: FLAGS_SIZE_32 | FLAGS_SHARED, matching how futex_to_flags() unconditionally sets FLAGS_SIZE_32 for all legacy futex ops. Reported-by: Xi Ruoyao <xry111@xry111.site> Reported-by: Florian Weimer <fweimer@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20231114201402.GA25315@noisy.programming.kicks-ass.net Fixes: 5694289ce183 ("futex: Flag conversion") Cc: <stable@vger.kernel.org> |
||
![]() |
4de520f1fc |
io_uring-futex-2023-10-30
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmVAUXUQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpuGsEADEs0/4uXb8kLUF/y0B0bY9jmwiw5id14g5 TkAH9lbceV0Yv0E1tPeWYIz7Y7s83UOduFVZo4hRH8EysH3IYFZCI/ny3v2nJ1av lN7F7YegVOu6qx77e/CwLo7on14awHkSo8pUdCOm6tYLunLg42miRf+xTpSAL0Mg ONnt0WxWDOgdNvTaGwBPaVE78FAWK8nc2ACzonQGfzCl2VXOsSy9JaJJMv8eyXOf VVZCNcSvHh/zVznlC1YPoZh/bgS2UUJmIGL/XMQnM5qzbK1IPpzlN0cu8rje3s9b TUKBKqr6xhC9nyAS1qAjgZ98RfjVnzcbMX+aWEb/Z0y9XFJVSSQQdW+f9A/0KLZm jAejHJpNuqwEdB9MplHTXdeSDTkJH3YNbXvtwA6cc/KpZ1FVQXlhSJPp/mbOa7qe IIeg6SYt84uZ2HxflTtm+I1uVE9QMcsesy3FIK4kxhA8jSximQw+hPZ3xrv4AHLd cTkRAzfXPUFsJJQCgpv289QXobV/vsFhCFTHFxv63H+EGpJ7e1EaW6Eq0pAHG0Ai 8kk5Ns29jzTVer1W3sMMeDaZ7S8hGRAyRC+Zb/0QxtGsmvxikB0qY1GpdRGPFueQ gOawhLZdhkigIsq0U1UGMpHKY0G1Sl9wvHuH2qzUKeWk+vFRv5RwR6zQuVJr2Jo/ j3HgyYDs7Q== =Z0L0 -----END PGP SIGNATURE----- Merge tag 'io_uring-futex-2023-10-30' of git://git.kernel.dk/linux Pull io_uring futex support from Jens Axboe: "This adds support for using futexes through io_uring - first futex wake and wait, and then the vectored variant of waiting, futex waitv. For both wait/wake/waitv, we support the bitset variant, as the 'normal' variants can be easily implemented on top of that. PI and requeue are not supported through io_uring, just the above mentioned parts. This may change in the future, but in the spirit of keeping this small (and based on what people have been asking for), this is what we currently have. Wake support is pretty straight forward, most of the thought has gone into the wait side to avoid needing to offload wait operations to a blocking context. Instead, we rely on the usual callbacks to retry and post a completion event, when appropriate. As far as I can recall, the first request for futex support with io_uring came from Andres Freund, working on postgres. His aio rework of postgres was one of the early adopters of io_uring, and futex support was a natural extension for that. This is relevant from both a usability point of view, as well as for effiency and performance. In Andres's words, for the former: Futex wait support in io_uring makes it a lot easier to avoid deadlocks in concurrent programs that have their own buffer pool: Obviously pages in the application buffer pool have to be locked during IO. If the initiator of IO A needs to wait for a held lock B, the holder of lock B might wait for the IO A to complete. The ability to wait for a lock and IO completions at the same time provides an efficient way to avoid such deadlocks and in terms of effiency, even without unlocking the full potential yet, Andres says: Futex wake support in io_uring is useful because it allows for more efficient directed wakeups. For some "locks" postgres has queues implemented in userspace, with wakeup logic that cannot easily be implemented with FUTEX_WAKE_BITSET on a single "futex word" (imagine waiting for journal flushes to have completed up to a certain point). Thus a "lock release" sometimes need to wake up many processes in a row. A quick-and-dirty conversion to doing these wakeups via io_uring lead to a 3% throughput increase, with 12% fewer context switches, albeit in a fairly extreme workload" * tag 'io_uring-futex-2023-10-30' of git://git.kernel.dk/linux: io_uring: add support for vectored futex waits futex: make the vectored futex operations available futex: make futex_parse_waitv() available as a helper futex: add wake_data to struct futex_q io_uring: add support for futex wake and wait futex: abstract out a __futex_wake_mark() helper futex: factor out the futex wake handling futex: move FUTEX2_VALID_MASK to futex.h |
||
![]() |
c73801ae4f |
futex: Don't include process MM in futex key on no-MMU
On no-MMU, all futexes are treated as private because there is no need to map a virtual address to physical to match the futex across processes. This doesn't quite work though, because private futexes include the current process's mm_struct as part of their key. This makes it impossible for one process to wake up a shared futex being waited on in another process. Fix this bug by excluding the mm_struct from the key. With a single address space, the futex address is already a unique key. Fixes: 784bdf3bb694 ("futex: Assume all mappings are private on !MMU systems") Signed-off-by: Ben Wolsieffer <ben.wolsieffer@hefring.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Darren Hart <dvhart@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: André Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/r/20231019204548.1236437-2-ben.wolsieffer@hefring.com |
||
![]() |
01a99a750a |
futex/requeue: Remove unnecessary ‘NULL’ initialization from futex_proxy_trylock_atomic()
'top_waiter' is assigned unconditionally before first use, so it does not need an initialization. [ mingo: Created legible changelog. ] Signed-off-by: Li zeming <zeming@nfschina.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230725195047.3106-1-zeming@nfschina.com |
||
![]() |
e9a56c9325 |
futex: make the vectored futex operations available
Rename unqueue_multiple() as futex_unqueue_multiple(), and make both that and futex_wait_multiple_setup() available for external users. This is in preparation for wiring up vectored waits in io_uring. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
![]() |
5177c0cb30 |
futex: make futex_parse_waitv() available as a helper
To make it more generically useful, augment it with allowing the caller to pass in the wake handler and wake data. Convert the futex_waitv() syscall, passing in the default handlers. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
![]() |
8af1692616 |
futex: add wake_data to struct futex_q
With handling multiple futex_q for waitv, we cannot easily go from the futex_q to data related to that request or queue. Add a wake_data argument that belongs to the wake handler assigned. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
![]() |
e52c43403c |
futex: abstract out a __futex_wake_mark() helper
Move the unqueue and lock_ptr clear into a helper that futex_wake_mark() calls. Add it to the public functions as well, in preparation for using it outside the core futex code. Suggested-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
![]() |
12a4be50af |
futex: factor out the futex wake handling
In preparation for having another waker that isn't futex_wake_mark(), add a wake handler in futex_q. No extra data is associated with the handler outside of struct futex_q itself. futex_wake_mark() is defined as the standard wakeup helper, now set through futex_q_init like other defaults. Normal sync futex waiting relies on wake_q holding tasks that should be woken up. This is what futex_wake_mark() does, it'll unqueue the futex and add the associated task to the wake queue. For async usage of futex waiting, rather than having tasks sleeping on the futex, we'll need to deal with a futex wake differently. For the planned io_uring case, that means posting a completion event for the task in question. Having a definable wake handler can help support that use case. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
![]() |
3b07815954 |
futex: move FUTEX2_VALID_MASK to futex.h
We need this for validating the futex2 flags outside of the normal futex syscalls. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
![]() |
0f4b5f9722 |
futex: Add sys_futex_requeue()
Finish off the 'simple' futex2 syscall group by adding sys_futex_requeue(). Unlike sys_futex_{wait,wake}() its arguments are too numerous to fit into a regular syscall. As such, use struct futex_waitv to pass the 'source' and 'destination' futexes to the syscall. This syscall implements what was previously known as FUTEX_CMP_REQUEUE and uses {val, uaddr, flags} for source and {uaddr, flags} for destination. This design explicitly allows requeueing between different types of futex by having a different flags word per uaddr. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/r/20230921105248.511860556@noisy.programming.kicks-ass.net |
||
![]() |
27b88f3519 |
futex: Add flags2 argument to futex_requeue()
In order to support mixed size requeue, add a second flags argument to the internal futex_requeue() function. No functional change intended. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230921105248.396780136@noisy.programming.kicks-ass.net |
||
![]() |
3b63a55f49 |
futex: Propagate flags into get_futex_key()
Instead of only passing FLAGS_SHARED as a boolean, pass down flags as a whole. No functional change intended. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230921105248.282857501@noisy.programming.kicks-ass.net |
||
![]() |
cb8c4312af |
futex: Add sys_futex_wait()
To complement sys_futex_waitv()/wake(), add sys_futex_wait(). This syscall implements what was previously known as FUTEX_WAIT_BITSET except it uses 'unsigned long' for the value and bitmask arguments, takes timespec and clockid_t arguments for the absolute timeout and uses FUTEX2 flags. The 'unsigned long' allows FUTEX2_SIZE_U64 on 64bit platforms. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/r/20230921105248.164324363@noisy.programming.kicks-ass.net |
||
![]() |
43adf84495 |
futex: FLAGS_STRICT
The current semantics for futex_wake() are a bit loose, specifically asking for 0 futexes to be woken actually gets you 1. Adding a !nr check to sys_futex_wake() makes that it would return 0 for unaligned futex words, because that check comes in the shared futex_wake() function. Adding the !nr check there, would affect the legacy sys_futex() semantics. Hence frob a flag :-( Suggested-by: André Almeida <andrealmeid@igalia.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230921105248.048643656@noisy.programming.kicks-ass.net |
||
![]() |
9f6c532f59 |
futex: Add sys_futex_wake()
To complement sys_futex_waitv() add sys_futex_wake(). This syscall implements what was previously known as FUTEX_WAKE_BITSET except it uses 'unsigned long' for the bitmask and takes FUTEX2 flags. The 'unsigned long' allows FUTEX2_SIZE_U64 on 64bit platforms. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/r/20230921105247.936205525@noisy.programming.kicks-ass.net |
||
![]() |
698eb82638 |
futex: Validate futex value against futex size
Ensure the futex value fits in the given futex size. Since this adds a constraint to an existing syscall, it might possibly change behaviour. Currently the value would be truncated to a u32 and any high bits would get silently lost. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230921105247.828934099@noisy.programming.kicks-ass.net |
||
![]() |
5694289ce1 |
futex: Flag conversion
Futex has 3 sets of flags: - legacy futex op bits - futex2 flags - internal flags Add a few helpers to convert from the API flags into the internal flags. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: André Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/r/20230921105247.722140574@noisy.programming.kicks-ass.net |
||
![]() |
d6d08d2479 |
futex: Extend the FUTEX2 flags
Add the definition for the missing but always intended extra sizes, and add a NUMA flag for the planned numa extention. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: André Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/r/20230921105247.617057368@noisy.programming.kicks-ass.net |
||
![]() |
4923954bbc |
futex: Clarify FUTEX2 flags
sys_futex_waitv() is part of the futex2 series (the first and only so far) of syscalls and has a flags field per futex (as opposed to flags being encoded in the futex op). This new flags field has a new namespace, which unfortunately isn't super explicit. Notably it currently takes FUTEX_32 and FUTEX_PRIVATE_FLAG. Introduce the FUTEX2 namespace to clarify this Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: André Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/r/20230921105247.507327749@noisy.programming.kicks-ass.net |
||
![]() |
fbeb558b0d |
futex/pi: Fix recursive rt_mutex waiter state
Some new assertions pointed out that the existing code has nested rt_mutex wait state in the futex code. Specifically, the futex_lock_pi() cancel case uses spin_lock() while there still is a rt_waiter enqueued for this task, resulting in a state where there are two waiters for the same task (and task_struct::pi_blocked_on gets scrambled). The reason to take hb->lock at this point is to avoid the wake_futex_pi() EAGAIN case. This happens when futex_top_waiter() and rt_mutex_top_waiter() state becomes inconsistent. The current rules are such that this inconsistency will not be observed. Notably the case that needs to be avoided is where futex_lock_pi() and futex_unlock_pi() interleave such that unlock will fail to observe a new waiter. *However* the case at hand is where a waiter is leaving, in this case the race means a waiter that is going away is not observed -- which is harmless, provided this race is explicitly handled. This is a somewhat dangerous proposition because the converse race is not observing a new waiter, which must absolutely not happen. But since the race is valid this cannot be asserted. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lkml.kernel.org/r/20230915151943.GD6743@noisy.programming.kicks-ass.net |
||
![]() |
d14f9e930b |
locking/rtmutex: Use rt_mutex specific scheduler helpers
Have rt_mutex use the rt_mutex specific scheduler helpers to avoid recursion vs rtlock on the PI state. [[ peterz: adapted to new names ]] Reported-by: Crystal Wood <swood@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230908162254.999499-6-bigeasy@linutronix.de |
||
![]() |
6f23fc47c1 |
Linux 6.6-rc2
-----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmUHckgeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGAx4IAKFIY9buHYswh0xs me3S0aZziLs5IHXkV//bCHiFGjZhglEpzQMpRdNfIoojVgE3CFHvG41e5a6FQPol os0wyZkAvzdGnv997MMgP6/q7aRV/Eom0djKLdtvuKztVoOu5ynAmOKhq//Iy3Nf HEr83Xac5C37HyluUsd6uU7lm8cJW2jwdDPvJRd8i/1CU3Aq6cvYVzAN9ZICOpv0 pBhJEaBz7E2jx+ej9cZ1FT+WEQtY2Kw27QkZ27Ka/EAjKPtSxF0477cl26kmACMp bWNnf0qRDGNqDhP0/6/hmI1Ni4s/1t+eIovJl8AvwamPmfEv4kmeMl1M06aE4Z0V NSVQZ4Y= =Gffk -----END PGP SIGNATURE----- Merge tag 'v6.6-rc2' into locking/core, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
![]() |
e35a6cf1cc |
futex: Use a folio instead of a page
The futex code already handles compound pages correctly, but using a folio tells the compiler that there is already a reference to the head page and it doesn't need to call compound_head() again. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230821142207.2537124-1-willy@infradead.org |
||
![]() |
3fade62b62 |
mm/mm_init.c: remove obsolete macro HASH_SMALL
HASH_SMALL only works when parameter numentries is 0. But the sole caller futex_init() never calls alloc_large_system_hash() with numentries set to 0. So HASH_SMALL is obsolete and remove it. Link: https://lkml.kernel.org/r/20230625021323.849147-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: André Almeida <andrealmeid@igalia.com> Cc: Darren Hart <dvhart@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
![]() |
95d248d16f |
- Prevent the leaking of a debug timer in futex_waitv()
- A preempt-RT mutex locking fix, adding the proper acquire semantics -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmOxn1cACgkQEsHwGGHe VUp+AQ/9GilVlaWqloC+sDK/7qTLfmQjR/XwQPii2pMBUO1oRFT7mG+dG7lKNpRL WCLTVIrzG528ORJytvMGuqx+CqenuTGPFI9rtizxdtRlNZyiS33hvlaGBPY7XYY7 G2qugGKuZ7WVKZpq6szPwnrMuWjdsd+XOO+NE3h/5kI2T7ZukmlXl9/EtVii7+y3 LGVjMhN+R0XCMCPSSPtFx5eUGalemPnGSBcX/Qb4zG1MphSPURFKHEQmWHZzgrkg 6fmqx773u2mquxTYeEy0FDLjOYFEHlzDFgRtIDBaziHIxjzO/5r09m8RDF2mXetW W/pf/Iduh2TZWwJDngmUi0JxqOzz0l8T/6vSCuArYWkjb+0zDOPHuCYwif2iHAQo pMn+O5nAtLMZhdk+yjTs4abQ11q9owLh0nGps4A2SKgJIH0mL2BICwRotzuV1sxf CGnoOxIxZuZGq8xMoHp6sJ8e4RnJHBAt7PKrCsLTkPAjKjgmBpqshVav//kyCFeY 2R4rgRY/a42Z86uNSysVnxxkop4vRgTZ3l7TijzyeIbycI7v9WMADk0TKTDgsMbL nEnlZPWXJGYNXH71ZPV5TPgidHfiO68sVKerXWRrWRLyGHhY2TC67Lz29vTAlbMK NdOM6pO0DkaKSLxEJ6qso819ePnRA/2O5B7xU/Ll7vONjCpHlFw= =9LHF -----END PGP SIGNATURE----- Merge tag 'locking_urgent_for_v6.2_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Borislav Petkov: - Prevent the leaking of a debug timer in futex_waitv() - A preempt-RT mutex locking fix, adding the proper acquire semantics * tag 'locking_urgent_for_v6.2_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: futex: Fix futex_waitv() hrtimer debug object leak on kcalloc error rtmutex: Add acquire semantics for rtmutex lock acquisition slow path |
||
![]() |
94cd8fa09f |
futex: Fix futex_waitv() hrtimer debug object leak on kcalloc error
In a scenario where kcalloc() fails to allocate memory, the futex_waitv system call immediately returns -ENOMEM without invoking destroy_hrtimer_on_stack(). When CONFIG_DEBUG_OBJECTS_TIMERS=y, this results in leaking a timer debug object. Fixes: bf69bad38cf6 ("futex: Implement sys_futex_waitv()") Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Cc: stable@vger.kernel.org Cc: stable@vger.kernel.org # v5.16+ Link: https://lore.kernel.org/r/20221214222008.200393-1-mathieu.desnoyers@efficios.com |
||
![]() |
90d7588967 |
futex: Resend potentially swallowed owner death notification
Commit ca16d5bee598 ("futex: Prevent robust futex exit race") addressed two cases when tasks waiting on a robust non-PI futex remained blocked despite the futex not being owned anymore: * if the owner died after writing zero to the futex word, but before waking up a waiter * if a task waiting on the futex was woken up, but died before updating the futex word (effectively swallowing the notification without acting on it) In the second case, the task could be woken up either by the previous owner (after the futex word was reset to zero) or by the kernel (after the OWNER_DIED bit was set and the TID part of the futex word was reset to zero) if the previous owner died without the resetting the futex. Because the referenced commit wakes up a potential waiter only if the whole futex word is zero, the latter subcase remains unaddressed. Fix this by looking only at the TID part of the futex when deciding whether a wake up is needed. Fixes: ca16d5bee598 ("futex: Prevent robust futex exit race") Signed-off-by: Alexey Izbyshev <izbyshev@ispras.ru> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20221111215439.248185-1-izbyshev@ispras.ru |
||
![]() |
f5d39b0208 |
freezer,sched: Rewrite core freezer logic
Rewrite the core freezer to behave better wrt thawing and be simpler in general. By replacing PF_FROZEN with TASK_FROZEN, a special block state, it is ensured frozen tasks stay frozen until thawed and don't randomly wake up early, as is currently possible. As such, it does away with PF_FROZEN and PF_FREEZER_SKIP, freeing up two PF_flags (yay!). Specifically; the current scheme works a little like: freezer_do_not_count(); schedule(); freezer_count(); And either the task is blocked, or it lands in try_to_freezer() through freezer_count(). Now, when it is blocked, the freezer considers it frozen and continues. However, on thawing, once pm_freezing is cleared, freezer_count() stops working, and any random/spurious wakeup will let a task run before its time. That is, thawing tries to thaw things in explicit order; kernel threads and workqueues before doing bringing SMP back before userspace etc.. However due to the above mentioned races it is entirely possible for userspace tasks to thaw (by accident) before SMP is back. This can be a fatal problem in asymmetric ISA architectures (eg ARMv9) where the userspace task requires a special CPU to run. As said; replace this with a special task state TASK_FROZEN and add the following state transitions: TASK_FREEZABLE -> TASK_FROZEN __TASK_STOPPED -> TASK_FROZEN __TASK_TRACED -> TASK_FROZEN The new TASK_FREEZABLE can be set on any state part of TASK_NORMAL (IOW. TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE) -- any such state is already required to deal with spurious wakeups and the freezer causes one such when thawing the task (since the original state is lost). The special __TASK_{STOPPED,TRACED} states *can* be restored since their canonical state is in ->jobctl. With this, frozen tasks need an explicit TASK_FROZEN wakeup and are free of undue (early / spurious) wakeups. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/r/20220822114649.055452969@infradead.org |
||
![]() |
2518f226c6 |
drm for 5.19-rc1
dma-buf: - add dma_resv_replace_fences - add dma_resv_get_singleton - make dma_excl_fence private core: - EDID parser refactorings - switch drivers to drm_mode_copy/duplicate - DRM managed mutex initialization display-helper: - put HDMI, SCDC, HDCP, DSC and DP into new module gem: - rework fence handling ttm: - rework bulk move handling - add common debugfs for resource managers - convert to kvcalloc format helpers: - support monochrome formats - RGB888, RGB565 to XRGB8888 conversions fbdev: - cfb/sys_imageblit fixes - pagelist corruption fix - create offb platform device - deferred io improvements sysfb: - Kconfig rework - support for VESA mode selection bridge: - conversions to devm_drm_of_get_bridge - conversions to panel_bridge - analogix_dp - autosuspend support - it66121 - audio support - tc358767 - DSI to DPI support - icn6211 - PLL/I2C fixes, DT property - adv7611 - enable DRM_BRIDGE_OP_HPD - anx7625 - fill ELD if no monitor - dw_hdmi - add audio support - lontium LT9211 support, i.MXMP LDB - it6505: Kconfig fix, DPCD set power fix - adv7511 - CEC support for ADV7535 panel: - ltk035c5444t, B133UAN01, NV3052C panel support - DataImage FG040346DSSWBG04 support - st7735r - DT bindings fix - ssd130x - fixes i915: - DG2 laptop PCI-IDs ("motherboard down") - Initial RPL-P PCI IDs - compute engine ABI - DG2 Tile4 support - DG2 CCS clear color compression support - DG2 render/media compression formats support - ATS-M platform info - RPL-S PCI IDs added - Bump ADL-P DMC version to v2.16 - Support static DRRS - Support multiple eDP/LVDS native mode refresh rates - DP HDR support for HSW+ - Lots of display refactoring + fixes - GuC hwconfig support and query - sysfs support for multi-tile - fdinfo per-client gpu utilisation - add geometry subslices query - fix prime mmap with LMEM - fix vm open count and remove vma refcounts - contiguous allocation fixes - steered register write support - small PCI BAR enablement - GuC error capture support - sunset igpu legacy mmap support for newer devices - GuC version 70.1.1 support amdgpu: - Initial SoC21 support - SMU 13.x enablement - SMU 13.0.4 support - ttm_eu cleanups - USB-C, GPUVM updates - TMZ fixes for RV - RAS support for VCN - PM sysfs code cleanup - DC FP rework - extend CG/PG flags to 64-bit - SI dpm lockdep fix - runtime PM fixes amdkfd: - RAS/SVM fixes - TLB flush fixes - CRIU GWS support - ignore bogus MEC signals more efficiently msm: - Fourcc modifier for tiled but not compressed layouts - Support for userspace allocated IOVA (GPU virtual address) - DPU: DSC (Display Stream Compression) support - DP: eDP support - DP: conversion to use drm_bridge and drm_bridge_connector - Merge DPU1 and MDP5 MDSS driver - DPU: writeback support nouveau: - make some structures static - make some variables static - switch to drm_gem_plane_helper_prepare_fb radeon: - misc fixes/cleanups mxsfb: - rework crtc mode setting - LCDIF CRC support etnaviv: - fencing improvements - fix address space collisions - cleanup MMU reference handling gma500: - GEM/GTT improvements - connector handling fixes komeda: - switch to plane reset helper mediatek: - MIPI DSI improvements omapdrm: - GEM improvements qxl: - aarch64 support vc4: - add a CL submission tracepoint - HDMI YUV support - HDMI/clock improvements - drop is_hdmi caching virtio: - remove restriction of non-zero blob types vmwgfx: - support for cursormob and cursorbypass 4 - fence improvements tidss: - reset DISPC on startup solomon: - SPI support - DT improvements sun4i: - allwinner D1 support - drop is_hdmi caching imx: - use swap() instead of open-coding - use devm_platform_ioremap_resource - remove redunant initializations ast: - Displayport support rockchip: - Refactor IOMMU initialisation - make some structures static - replace drm_detect_hdmi_monitor with drm_display_info.is_hdmi - support swapped YUV formats, - clock improvements - rk3568 support - VOP2 support mediatek: - MT8186 support tegra: - debugabillity improvements -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmKNxkAACgkQDHTzWXnE hr4hghAAqSXeMEw1w34miyM28hcOpXqkDfT1VVooqFxBT8MBqamzpZvCH94qsZwm 3DRXlhQ4pk8wzUcWJpGprdNakxNQPpFVs2UuxYxOyrxYpdkbOwqsEcM3d8VXD9Cy E36z+dr85A8Te/J0Yg/FLoZMHulTlidqEZeOz6SMaNUohtrmH/oPWR+cPIy4/Zpp yysfbBSKTwblJFDf4+nIpks/VvJYAUO3i6KClT/Rh79Yg6de582AU0YaNcEArTi6 JqdiYIoYLx609Ecy5NVme6wR/ai46afFLMYt3ZIP4OfHRINk+YL01BYMo2JE2M8l xjOH0Iwb7evzWqLK/ESwqp3P7nyppmLlfbZOFHWUfNJsjq2H3ePaAGhzOlYx1c70 XENzY4IvpYYdR0pJuh1gw1cNZfM9JDAynGJ5jvsATLGBGQbpFsy3w/PMZT17q8an DpBwqQmShUdCJ2m+6zznC3VsxJpbvWKNE1I93NxAWZXmFYxoHCzRihahUxKcNDrQ ZLH7RSlk9SE/ZtNSLkU15YnKtoW+ThFIssUpVio6U/fZot1+efZkmkXplSuFvj6R i7s14hMWQjSJzpJg1DXfhDMycEOujNiQppCG2EaDlVxvUtCqYBd3EHOI7KQON//+ iVtmEEnWh5rcCM+WsxLGf3Y7sVP3vfo1LOCxshb1XVfDmeMksoI= =BYQA -----END PGP SIGNATURE----- Merge tag 'drm-next-2022-05-25' of git://anongit.freedesktop.org/drm/drm Pull drm updates from Dave Airlie: "Intel have enabled DG2 on certain SKUs for laptops, AMD has started some new GPU support, msm has user allocated VA controls dma-buf: - add dma_resv_replace_fences - add dma_resv_get_singleton - make dma_excl_fence private core: - EDID parser refactorings - switch drivers to drm_mode_copy/duplicate - DRM managed mutex initialization display-helper: - put HDMI, SCDC, HDCP, DSC and DP into new module gem: - rework fence handling ttm: - rework bulk move handling - add common debugfs for resource managers - convert to kvcalloc format helpers: - support monochrome formats - RGB888, RGB565 to XRGB8888 conversions fbdev: - cfb/sys_imageblit fixes - pagelist corruption fix - create offb platform device - deferred io improvements sysfb: - Kconfig rework - support for VESA mode selection bridge: - conversions to devm_drm_of_get_bridge - conversions to panel_bridge - analogix_dp - autosuspend support - it66121 - audio support - tc358767 - DSI to DPI support - icn6211 - PLL/I2C fixes, DT property - adv7611 - enable DRM_BRIDGE_OP_HPD - anx7625 - fill ELD if no monitor - dw_hdmi - add audio support - lontium LT9211 support, i.MXMP LDB - it6505: Kconfig fix, DPCD set power fix - adv7511 - CEC support for ADV7535 panel: - ltk035c5444t, B133UAN01, NV3052C panel support - DataImage FG040346DSSWBG04 support - st7735r - DT bindings fix - ssd130x - fixes i915: - DG2 laptop PCI-IDs ("motherboard down") - Initial RPL-P PCI IDs - compute engine ABI - DG2 Tile4 support - DG2 CCS clear color compression support - DG2 render/media compression formats support - ATS-M platform info - RPL-S PCI IDs added - Bump ADL-P DMC version to v2.16 - Support static DRRS - Support multiple eDP/LVDS native mode refresh rates - DP HDR support for HSW+ - Lots of display refactoring + fixes - GuC hwconfig support and query - sysfs support for multi-tile - fdinfo per-client gpu utilisation - add geometry subslices query - fix prime mmap with LMEM - fix vm open count and remove vma refcounts - contiguous allocation fixes - steered register write support - small PCI BAR enablement - GuC error capture support - sunset igpu legacy mmap support for newer devices - GuC version 70.1.1 support amdgpu: - Initial SoC21 support - SMU 13.x enablement - SMU 13.0.4 support - ttm_eu cleanups - USB-C, GPUVM updates - TMZ fixes for RV - RAS support for VCN - PM sysfs code cleanup - DC FP rework - extend CG/PG flags to 64-bit - SI dpm lockdep fix - runtime PM fixes amdkfd: - RAS/SVM fixes - TLB flush fixes - CRIU GWS support - ignore bogus MEC signals more efficiently msm: - Fourcc modifier for tiled but not compressed layouts - Support for userspace allocated IOVA (GPU virtual address) - DPU: DSC (Display Stream Compression) support - DP: eDP support - DP: conversion to use drm_bridge and drm_bridge_connector - Merge DPU1 and MDP5 MDSS driver - DPU: writeback support nouveau: - make some structures static - make some variables static - switch to drm_gem_plane_helper_prepare_fb radeon: - misc fixes/cleanups mxsfb: - rework crtc mode setting - LCDIF CRC support etnaviv: - fencing improvements - fix address space collisions - cleanup MMU reference handling gma500: - GEM/GTT improvements - connector handling fixes komeda: - switch to plane reset helper mediatek: - MIPI DSI improvements omapdrm: - GEM improvements qxl: - aarch64 support vc4: - add a CL submission tracepoint - HDMI YUV support - HDMI/clock improvements - drop is_hdmi caching virtio: - remove restriction of non-zero blob types vmwgfx: - support for cursormob and cursorbypass 4 - fence improvements tidss: - reset DISPC on startup solomon: - SPI support - DT improvements sun4i: - allwinner D1 support - drop is_hdmi caching imx: - use swap() instead of open-coding - use devm_platform_ioremap_resource - remove redunant initializations ast: - Displayport support rockchip: - Refactor IOMMU initialisation - make some structures static - replace drm_detect_hdmi_monitor with drm_display_info.is_hdmi - support swapped YUV formats, - clock improvements - rk3568 support - VOP2 support mediatek: - MT8186 support tegra: - debugabillity improvements" * tag 'drm-next-2022-05-25' of git://anongit.freedesktop.org/drm/drm: (1740 commits) drm/i915/dsi: fix VBT send packet port selection for ICL+ drm/i915/uc: Fix undefined behavior due to shift overflowing the constant drm/i915/reg: fix undefined behavior due to shift overflowing the constant drm/i915/gt: Fix use of static in macro mismatch drm/i915/audio: fix audio code enable/disable pipe logging drm/i915: Fix CFI violation with show_dynamic_id() drm/i915: Fix 'mixing different enum types' warnings in intel_display_power.c drm/i915/gt: Fix build error without CONFIG_PM drm/msm/dpu: handle pm_runtime_get_sync() errors in bind path drm/msm/dpu: add DRM_MODE_ROTATE_180 back to supported rotations drm/msm: don't free the IRQ if it was not requested drm/msm/dpu: limit writeback modes according to max_linewidth drm/amd: Don't reset dGPUs if the system is going to s2idle drm/amdgpu: Unmap legacy queue when MES is enabled drm: msm: fix possible memory leak in mdp5_crtc_cursor_set() drm/msm: Fix fb plane offset calculation drm/msm/a6xx: Fix refcount leak in a6xx_gpu_init drm/msm/dsi: don't powerup at modeset time for parade-ps8640 drm/rockchip: Change register space names in vop2 dt-bindings: display: rockchip: make reg-names mandatory for VOP2 ... |