.. SPDX-License-Identifier: GPL-2.0 OR GFDL-1.2-no-invariants-or-later

=================
EDAC/RAS features
=================

Copyright (c) 2024-2025 HiSilicon Limited.

:Author:   Shiju Jose <shiju.jose@huawei.com>
:License:  The GNU Free Documentation License, Version 1.2 without
           Invariant Sections, Front-Cover Texts nor Back-Cover Texts.
           (dual licensed under the GPL v2)

- Written for: 6.15

Introduction
------------

The EDAC/RAS high-level design plugs in the following feature components:

1. Scrub control

2. Error Check Scrub (ECS) control

3. ACPI RAS2 features

4. Post Package Repair (PPR) control

5. Memory Sparing Repair control

The high-level design is illustrated in the following diagram::

        +-----------------------------------------------+
        |   Userspace - Rasdaemon                       |
        | +-------------+                               |
        | | RAS CXL mem |     +---------------+         |
        | |error handler|---->|               |         |
        | +-------------+     | RAS dynamic   |         |
        | +-------------+     | scrub, memory |         |
        | | RAS memory  |---->| repair control|         |
        | |error handler|     +----|----------+         |
        | +-------------+          |                    |
        +--------------------------|--------------------+
                                   |
                                   |
   +-------------------------------|------------------------------+
   |     Kernel EDAC extension for | controlling RAS Features     |
   |+------------------------------|----------------------------+ |
   || EDAC Core          Sysfs EDAC| Bus                        | |
   ||   +--------------------------|---------------------------+| |
   ||   |/sys/bus/edac/devices/<dev>/scrubX/ |   | EDAC device || |
   ||   |/sys/bus/edac/devices/<dev>/ecsX/   |<->| EDAC MC     || |
   ||   |/sys/bus/edac/devices/<dev>/repairX |   | EDAC sysfs  || |
   ||   +--------------------------|---------------------------+| |
   ||                          EDAC|Bus                         | |
   ||                              |                            | |
   ||  +----------+ Get feature    |       Get feature          | |
   ||  |          | desc +---------|------+ desc +----------+   | |
   ||  |EDAC scrub|<-----| EDAC device    |      |          |   | |
   ||  +----------+      | driver- RAS    |----->| EDAC mem |   | |
   ||  +----------+      | feature control|      | repair   |   | |
   ||  |          |<-----|                |      +----------+   | |
   ||  |EDAC ECS  |      +---------|------+                     | |
   ||  +----------+    Register RAS|features                    | |
   ||        ______________________|_____________               | |
   |+---------|---------------|------------------|--------------+ |
   |  +-------|----+  +-------|-------+     +----|----------+     |
   |  |            |  | CXL mem driver|     | Client driver |     |
   |  | ACPI RAS2  |  | scrub, ECS,   |     | memory repair |     |
   |  | driver     |  | sparing, PPR  |     | features      |     |
   |  +-----|------+  +-------|-------+     +------|--------+     |
   |        |                 |                    |              |
   +--------|-----------------|--------------------|--------------+
            |                 |                    |
   +--------|-----------------|--------------------|--------------+
   |    +---|-----------------|--------------------|-------+      |
   |    |                                                  |      |
   |    |             Platform HW and Firmware             |      |
   |    +--------------------------------------------------+      |
   +--------------------------------------------------------------+

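The sysfs layout shown in the diagram can be explored from userspace. The
following is a minimal sketch, not part of rasdaemon or the kernel, that lists
the feature instances exposed by each EDAC device. It assumes that feature
directories are named as a feature name followed by an instance number (for
example ``scrub0`` or ``ecs0``); the helper name ``list_ras_features`` and the
matching rule are illustrative assumptions.

```python
import re
from pathlib import Path

# Assumption: feature directories are named "<feature><instance>", for
# example "scrub0" or "ecs1". Entries without a trailing number (such as
# "power" or "uevent") are skipped.
FEATURE_DIR = re.compile(r"^([a-z_]+?)(\d+)$")


def list_ras_features(root="/sys/bus/edac/devices"):
    """Return {device: [(feature, instance), ...]} for devices with features."""
    found = {}
    root = Path(root)
    if not root.is_dir():
        # No EDAC bus on this system (or a wrong path): nothing to report.
        return found
    for dev in sorted(root.iterdir()):
        if not dev.is_dir():
            continue
        features = []
        for entry in sorted(dev.iterdir()):
            match = FEATURE_DIR.match(entry.name)
            if match and entry.is_dir():
                features.append((match.group(1), int(match.group(2))))
        if features:
            found[dev.name] = features
    return found


if __name__ == "__main__":
    for dev, features in list_ras_features().items():
        for feature, instance in features:
            print(f"{dev}: {feature}{instance}")
```

The ``root`` parameter exists so that the function can be pointed at a copy of
the tree for testing; on a live system the default path is used.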
1. EDAC feature components - Create feature-specific descriptors, for
   example the scrub, ECS and memory repair components in the above diagram.

2. EDAC device driver for controlling RAS features - Gets the feature's
   attribute descriptors from the EDAC RAS feature component, registers the
   device's RAS features with the EDAC bus and exposes the feature control
   attributes via sysfs, for example
   /sys/bus/edac/devices/<dev-name>/<feature>X/

3. RAS dynamic feature controller - Userspace sample modules in rasdaemon for
   dynamic scrub/repair control, which issue scrubbing/repair when an
   excessive number of corrected memory errors is reported in a short span
   of time.

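The trigger condition used by a dynamic feature controller, an excessive
number of corrected errors in a short span of time, can be sketched as a
sliding-window counter. This is an illustration only and not rasdaemon's
actual implementation; the class name ``CorrectedErrorWindow`` and the
``threshold`` and ``window_seconds`` parameters are hypothetical policy knobs.

```python
from collections import deque


class CorrectedErrorWindow:
    """Flag a burst of corrected errors inside a sliding time window."""

    def __init__(self, threshold, window_seconds):
        # Hypothetical policy knobs: act once more than `threshold`
        # errors fall within the last `window_seconds` seconds.
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._stamps = deque()

    def record(self, timestamp):
        """Record one corrected error; return True if action is warranted."""
        self._stamps.append(timestamp)
        # Drop events that have fallen out of the window.
        while timestamp - self._stamps[0] > self.window_seconds:
            self._stamps.popleft()
        return len(self._stamps) > self.threshold


if __name__ == "__main__":
    window = CorrectedErrorWindow(threshold=3, window_seconds=10.0)
    for t in (0.0, 1.0, 2.0, 3.0):
        if window.record(t):
            print(f"t={t}: request scrub/repair via the sysfs control")
```

A controller built this way would, on a ``True`` result, write to the relevant
attribute under /sys/bus/edac/devices/<dev-name>/<feature>X/; which attribute
to write depends on the feature and is covered by the feature-specific
documentation.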
RAS features
------------
1. Memory Scrub

Memory scrub features are documented in `Documentation/edac/scrub.rst`.

2. Memory Repair

Memory repair features are documented in `Documentation/edac/memory_repair.rst`.