Shiju Jose 699ea5219c EDAC: Add a memory repair control feature
Add a generic EDAC memory repair control driver to manage memory repairs in
the system, such as CXL Post Package Repair (PPR) and other soft and hard PPR
features.

For example, a CXL device with DRAM components that support PPR features may
implement PPR maintenance operations. DRAM components may support two types of
PPR:

 - hard PPR, for a permanent row repair, and
 - soft PPR, for a temporary row repair.

Soft PPR is much faster than hard PPR, but the repair is lost with a power
cycle.

When a CXL device detects an error in memory, it may report the need for
a repair maintenance operation by using an event record in which the
"maintenance needed" flag is set. The event record contains the device physical
address (DPA) and other optional attributes of the memory to repair.

The kernel will report the corresponding CXL general media or DRAM trace event
to userspace, and userspace tools (e.g. rasdaemon) will initiate a repair
operation in response to the device request via the sysfs repair control.

A device with memory repair features registers with the EDAC device driver,
which retrieves a memory repair descriptor from the EDAC memory repair driver
and exposes the sysfs repair control attributes to userspace in

  /sys/bus/edac/devices/<dev-name>/mem_repairX/.

The common memory repair control interface abstracts the control of arbitrary
memory repair functionality into a standardized set of functions.  The sysfs
memory repair attribute nodes are only available if the client driver has
implemented the corresponding attribute callback function and provided
operations to the EDAC device driver during registration.
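
Because an attribute node exists only when the client driver supplied the
matching callback, a userspace tool should probably check which nodes are
present rather than assume a fixed set. A minimal Python sketch of that check
follows. The function name list_mem_repair_attrs and its overridable root
parameter are illustrative assumptions, not part of the kernel interface; only
the directory layout quoted above is taken from the text.

```python
import re
from pathlib import Path

# Location quoted in the text above; pass another root to test against a fake tree.
EDAC_DEVICES = Path("/sys/bus/edac/devices")
MEM_REPAIR_DIR = re.compile(r"^mem_repair\d+$")


def list_mem_repair_attrs(root=EDAC_DEVICES):
    """Map '<dev-name>/mem_repairX' to the sorted attribute names found there."""
    result = {}
    root = Path(root)
    if not root.is_dir():
        # No EDAC bus (or no such directory): nothing to report.
        return result
    for dev in sorted(root.iterdir()):
        if not dev.is_dir():  # is_dir() follows the sysfs symlinks
            continue
        for feat in sorted(dev.iterdir()):
            if feat.is_dir() and MEM_REPAIR_DIR.match(feat.name):
                # Only regular files are treated as attribute nodes.
                attrs = sorted(p.name for p in feat.iterdir() if p.is_file())
                result[f"{dev.name}/{feat.name}"] = attrs
    return result
```

Calling list_mem_repair_attrs() with no argument inspects the live system; an
empty result simply means no device has registered a memory repair feature.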

  [ bp: Massage, fixup edac_dev_register() retvals, merge
    write_overflow fix to mem_repair_create_desc() ]

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250212143654.1893-5-shiju.jose@huawei.com
2025-02-26 11:13:23 +01:00


.. SPDX-License-Identifier: GPL-2.0 OR GFDL-1.2-no-invariants-or-later

=================
EDAC/RAS features
=================

Copyright (c) 2024-2025 HiSilicon Limited.

:Author:   Shiju Jose <shiju.jose@huawei.com>
:License:  The GNU Free Documentation License, Version 1.2 without
           Invariant Sections, Front-Cover Texts nor Back-Cover Texts.
           (dual licensed under the GPL v2)

- Written for: 6.15

Introduction
------------

EDAC/RAS components plugging and high-level design:

1. Scrub control

2. Error Check Scrub (ECS) control

3. ACPI RAS2 features

4. Post Package Repair (PPR) control

5. Memory Sparing Repair control

High level design is illustrated in the following diagram::

        +-----------------------------------------------+
        |   Userspace - Rasdaemon                       |
        | +-------------+                               |
        | | RAS CXL mem |     +---------------+         |
        | |error handler|---->|               |         |
        | +-------------+     | RAS dynamic   |         |
        | +-------------+     | scrub, memory |         |
        | | RAS memory  |---->| repair control|         |
        | |error handler|     +----|----------+         |
        | +-------------+          |                    |
        +--------------------------|--------------------+
                                   |
                                   |
   +-------------------------------|------------------------------+
   |     Kernel EDAC extension for | controlling RAS Features     |
   |+------------------------------|----------------------------+ |
   ||   EDAC Core        Sysfs EDAC| Bus                        | |
   ||   +--------------------------|---------------------------+| |
   ||   |/sys/bus/edac/devices/<dev>/scrubX/ |   | EDAC device || |
   ||   |/sys/bus/edac/devices/<dev>/ecsX/   |<->| EDAC MC     || |
   ||   |/sys/bus/edac/devices/<dev>/repairX |   | EDAC sysfs  || |
   ||   +---------------------------|--------------------------+| |
   ||                           EDAC|Bus                        | |
   ||                               |                           | |
   ||   +----------+ Get feature    |      Get feature          | |
   ||   |          | desc +---------|------+ desc +----------+  | |
   ||   |EDAC scrub|<-----| EDAC device    |      |          |  | |
   ||   +----------+      | driver- RAS    |----->| EDAC mem |  | |
   ||   +----------+      | feature control|      | repair   |  | |
   ||   |          |<-----|                |      +----------+  | |
   ||   |EDAC ECS  |      +---------|------+                    | |
   ||   +----------+    Register RAS|features                   | |
   ||         ______________________|_____________              | |
   |+---------|---------------|------------------|--------------+ |
   |  +-------|----+  +-------|-------+     +----|----------+     |
   |  |            |  | CXL mem driver|     | Client driver |     |
   |  | ACPI RAS2  |  | scrub, ECS,   |     | memory repair |     |
   |  | driver     |  | sparing, PPR  |     | features      |     |
   |  +-----|------+  +-------|-------+     +------|--------+     |
   |        |                 |                    |              |
   +--------|-----------------|--------------------|--------------+
            |                 |                    |
   +--------|-----------------|--------------------|--------------+
   |    +---|-----------------|--------------------|-------+      |
   |    |                                                  |      |
   |    |            Platform HW and Firmware              |      |
   |    +--------------------------------------------------+      |
   +--------------------------------------------------------------+


1. EDAC features components - Create feature-specific descriptors, for
   example scrub, ECS, and memory repair in the above diagram.

2. EDAC device driver for controlling RAS features - Gets each feature's
   attribute descriptors from the EDAC RAS feature components, registers the
   device's RAS features with the EDAC bus, and exposes the feature control
   attributes via sysfs, for example
   /sys/bus/edac/devices/<dev-name>/<feature>X/

3. RAS dynamic feature controller - Userspace sample modules in rasdaemon for
   dynamic scrub/repair control, which issue scrubbing or repair when an
   excessive number of corrected memory errors is reported in a short span
   of time.
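
The trigger condition in item 3 (too many corrected errors in a short span of
time) can be sketched as a sliding-window counter. This is a minimal
illustration only: the class name CorrectedErrorWindow, the threshold and the
window length are hypothetical, timestamps are assumed to arrive in
non-decreasing order, and rasdaemon's actual policy is not described in this
document.

```python
from collections import deque


class CorrectedErrorWindow:
    """Sliding-window counter for corrected memory errors (illustrative)."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold            # errors tolerated per window
        self.window_seconds = window_seconds  # length of the window
        self._timestamps = deque()

    def record(self, timestamp):
        """Record one corrected error; return True if action is warranted."""
        self._timestamps.append(timestamp)
        # Discard events that have aged out of the window.
        cutoff = timestamp - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        if len(self._timestamps) > self.threshold:
            # Start counting afresh so one burst triggers only once.
            self._timestamps.clear()
            return True
        return False


# Hypothetical policy: more than 3 corrected errors within 10 seconds.
window = CorrectedErrorWindow(threshold=3, window_seconds=10.0)
print([window.record(t) for t in [0.0, 1.0, 2.0, 2.5, 30.0]])
# prints [False, False, False, True, False]
```

A controller built this way would keep one counter per device and, when
record() returns True, write to that device's scrub or repair control
attributes under the sysfs path shown in item 2.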

RAS features
------------
1. Memory Scrub

Memory scrub features are documented in `Documentation/edac/scrub.rst`.

2. Memory Repair

Memory repair features are documented in `Documentation/edac/memory_repair.rst`.