2067 Commits

Author SHA1 Message Date
Linus Torvalds
ae8371a46e - Add infrastructure support to EDAC in order to be able to register memory
scrubbing RAS functionality with the kernel and expose sysfs nodes to
   control such scrubbing functionality. The main use case is CXL devices which
   provide different scrubbers for their built-in memories so that tools like
   rasdaemon can configure and control memory scrubbing and other, more
   advanced RAS functionality. (Shiju Jose and Jonathan Cameron)
 
 - Add support to ie31200_edac for client SoCs like Raptor Lake-S which have
   multiple memory controllers and out-of-band ECC capability. (Qiuxu Zhuo)
 
 - The usual round of cleanups, simplifications and fixlets
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmfitgYACgkQEsHwGGHe
 VUp/Tg/7BJeE9QunMlB2EcCbYM+3eelp+Sg899S/6iNdC66sevFPoVXTpv9qz7Q6
 +ZD+V5vKIuKmGlV9dtn8nK5o8VvA2EvZsYSp6kk85qC+GNoqlc9E50I1yB3+otl8
 /3qD7PH0Ww5a4csjg+ioTRTphXp5DaK5J1m+Gze4h9n2ADs/aDb6vWr2AobomYOT
 h8pIb5PBdX9ehjWqUP/d+G+/ZN7244+FtMt1p3/xhBMjRJcwUxeAkw1u59EC5Hpb
 poP60Sl4pjr6uUI6QXrGEvLqvX3kq+fqveRosX1L+SlgAXesGXSg/tbdY5T78zGS
 aTebwmej00tvqQIYfsPpFKqk4W2wxUfnG6a2K0U3fYINQqSjI8kPrq9kMLpPejAG
 Lb0rZmHwLTPMM+G0BZVc4QSClhO9GXnD1wIsH8YGcqEkjDo0wkDL3KVm8aFhcx7b
 BDHn7b9Zx9zIvPhlcRupsUUNiqrNAV3R+zfUWzH9JF/GeCrT148vs8cZX16QGk18
 bnA5SY/mv8EExbeaEltKRagdToqxW6WEq5vv6KuLpco4kfCWJYWUHe5QaAMzZUBZ
 hXW0vvzUbkBLZo2NsdbXJk0+iT/lmjjOaqHZGcuwtepLIsJoRySkabbsMe6TTTtY
 O4abNt5yJkgcCYJwMwKmS9RogROO9yZjhK+4QfGex2lYgjlMTds=
 =OpTE
 -----END PGP SIGNATURE-----

Merge tag 'edac_updates_for_v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras

Pull EDAC updates from Borislav Petkov:

 - Add infrastructure support to EDAC in order to be able to register
   memory scrubbing RAS functionality with the kernel and expose sysfs
   nodes to control such scrubbing functionality.

   The main use case is CXL devices which provide different scrubbers
   for their built-in memories so that tools like rasdaemon can
   configure and control memory scrubbing and other, more advanced RAS
   functionality (Shiju Jose and Jonathan Cameron)

 - Add support to ie31200_edac for client SoCs like Raptor Lake-S which
   have multiple memory controllers and out-of-band ECC capability
   (Qiuxu Zhuo)

 - The usual round of cleanups, simplifications and fixlets

* tag 'edac_updates_for_v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: (25 commits)
  MAINTAINERS: Add a secondary maintainer for bluefield_edac
  EDAC/ie31200: Switch Raptor Lake-S to interrupt mode
  EDAC/ie31200: Add Intel Raptor Lake-S SoCs support
  EDAC/ie31200: Break up ie31200_probe1()
  EDAC/ie31200: Fold the two channel loops into one loop
  EDAC/ie31200: Make struct dimm_data contain decoded information
  EDAC/ie31200: Make the memory controller resources configurable
  EDAC/ie31200: Simplify the pci_device_id table
  EDAC/ie31200: Fix the 3rd parameter name of *populate_dimm_info()
  EDAC/ie31200: Fix the error path order of ie31200_init()
  EDAC/ie31200: Fix the DIMM size mask for several SoCs
  EDAC/ie31200: Fix the size of EDAC_MC_LAYER_CHIP_SELECT layer
  EDAC/device: Fix dev_set_name() format string
  EDAC/pnd2: Make read-only const array intlv static
  EDAC/igen6: Constify struct res_config
  EDAC/amd64: Simplify return statement in dct_ecc_enabled()
  EDAC: Update memory repair control interface for memory sparing feature
  EDAC: Add a memory repair control feature
  EDAC: Use string choice helper functions
  EDAC: Add a Error Check Scrub control feature
  ...
2025-03-25 14:00:26 -07:00
Borislav Petkov (AMD)
298ffd5375 Merge remote-tracking branches 'ras/edac-cxl', 'ras/edac-drivers' and 'ras/edac-misc' into edac-updates
* ras/edac-cxl:
  EDAC/device: Fix dev_set_name() format string
  EDAC: Update memory repair control interface for memory sparing feature
  EDAC: Add a memory repair control feature
  EDAC: Add a Error Check Scrub control feature
  EDAC: Add scrub control feature
  EDAC: Add support for EDAC device features control

* ras/edac-drivers:
  EDAC/ie31200: Switch Raptor Lake-S to interrupt mode
  EDAC/ie31200: Add Intel Raptor Lake-S SoCs support
  EDAC/ie31200: Break up ie31200_probe1()
  EDAC/ie31200: Fold the two channel loops into one loop
  EDAC/ie31200: Make struct dimm_data contain decoded information
  EDAC/ie31200: Make the memory controller resources configurable
  EDAC/ie31200: Simplify the pci_device_id table
  EDAC/ie31200: Fix the 3rd parameter name of *populate_dimm_info()
  EDAC/ie31200: Fix the error path order of ie31200_init()
  EDAC/ie31200: Fix the DIMM size mask for several SoCs
  EDAC/ie31200: Fix the size of EDAC_MC_LAYER_CHIP_SELECT layer
  EDAC/{skx_common,i10nm}: Fix some missing error reports on Emerald Rapids
  EDAC/igen6: Fix the flood of invalid error reports
  EDAC/ie31200: work around false positive build warning

* ras/edac-misc:
  MAINTAINERS: Add a secondary maintainer for bluefield_edac
  EDAC/pnd2: Make read-only const array intlv static
  EDAC/igen6: Constify struct res_config
  EDAC/amd64: Simplify return statement in dct_ecc_enabled()
  EDAC: Use string choice helper functions

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2025-03-25 14:53:27 +01:00
Qiuxu Zhuo
a5db1b296b EDAC/ie31200: Switch Raptor Lake-S to interrupt mode
Raptor Lake-S SoCs notify correctable memory errors via CMCI (Corrected
Machine Check Interrupt). Switch Raptor Lake-S EDAC support from polling
to interrupt mode by registering the callback to the MCE decode notifier
chain.

Note that as Raptor Lake-S SoCs may not recover from uncorrectable memory
errors, the system will hang as soon as this type of error occurs, and the
registered callback on the MCE decode chain will not be executed. This is
the expected behavior.

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Gary Wang <gary.c.wang@intel.com>
Link: https://lore.kernel.org/r/20250310011411.31685-12-qiuxu.zhuo@intel.com
2025-03-10 10:47:40 -07:00
Qiuxu Zhuo
d0742284ec EDAC/ie31200: Add Intel Raptor Lake-S SoCs support
The Intel Raptor Lake-S SoC contains two memory controllers with DDR5
memory type and out-of-band ECC capability. The resource definitions of
the memory controller are different from previous generations. One notable
difference is that the PCI ERRSTS register is deprecated and is not used
to indicate the presence of errors or to clear the MMIO-mapped ECC error
log regsiters.

Extend the ie31200_edac driver to support multiple memory controllers,
add a resource configuration table and use an MSR register to clear the
ECC error log registers to provide EDAC support for Raptor Lake-S SoCs.

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Gary Wang <gary.c.wang@intel.com>
Link: https://lore.kernel.org/r/20250310011411.31685-11-qiuxu.zhuo@intel.com
2025-03-10 10:47:14 -07:00
Qiuxu Zhuo
498550e1fa EDAC/ie31200: Break up ie31200_probe1()
Split ie31200_probe1() into two helper functions to easily extend support
for multiple memory controllers.

No functional changes intended.

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Gary Wang <gary.c.wang@intel.com>
Link: https://lore.kernel.org/r/20250310011411.31685-10-qiuxu.zhuo@intel.com
2025-03-10 10:46:48 -07:00
Qiuxu Zhuo
a217961b83 EDAC/ie31200: Fold the two channel loops into one loop
Fold the two channel loops to simplify the code and improve readability.
Also, delete the comments related to the DRB register, as this register
is not used here.

 No functional changes intended.

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Gary Wang <gary.c.wang@intel.com>
Link: https://lore.kernel.org/r/20250310011411.31685-9-qiuxu.zhuo@intel.com
2025-03-10 10:46:21 -07:00
Qiuxu Zhuo
afdbc36555 EDAC/ie31200: Make struct dimm_data contain decoded information
The current dimm_data structure contains encoded DIMM information,
which needs to be decoded for a given SoC when it is used. Make it
contain decoded information when it's initialized so that the places
where it is used do not need to decode it again, thereby simplifying
the code.

No functional changes intended.

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Gary Wang <gary.c.wang@intel.com>
Link: https://lore.kernel.org/r/20250310011411.31685-8-qiuxu.zhuo@intel.com
2025-03-10 10:45:56 -07:00
Qiuxu Zhuo
2a52cce648 EDAC/ie31200: Make the memory controller resources configurable
The resources such as MMIO, register offset, register mask, memory DIMM
information, ECC error log location, etc., of the memory controller, and
the number of memory controllers can be device-ID-specific. It requires
adding numerous 'if (device_id == new_id)' special handling cases to the
code to support a new SoC.

Make these kinds of resources configurable and separate them from the code
to facilitate the addition of new SoC support.

No functional changes intended.

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Gary Wang <gary.c.wang@intel.com>
Link: https://lore.kernel.org/r/20250310011411.31685-7-qiuxu.zhuo@intel.com
2025-03-10 10:45:07 -07:00
Qiuxu Zhuo
312e67a03d EDAC/ie31200: Simplify the pci_device_id table
Use PCI_VDEVICE() to simplify the pci_device_id table.

No functional changes intended.

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Gary Wang <gary.c.wang@intel.com>
Link: https://lore.kernel.org/r/20250310011411.31685-6-qiuxu.zhuo@intel.com
2025-03-10 10:44:40 -07:00
Qiuxu Zhuo
44eae52089 EDAC/ie31200: Fix the 3rd parameter name of *populate_dimm_info()
The 3rd parameter of *populate_dimm_info() pertains to the DIMM index
within a channel, not the channel index. Fix the parameter name to dimm
to reflect its actual purpose.

No functional changes intended.

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Gary Wang <gary.c.wang@intel.com>
Link: https://lore.kernel.org/r/20250310011411.31685-5-qiuxu.zhuo@intel.com
2025-03-10 10:44:12 -07:00
Qiuxu Zhuo
231e341036 EDAC/ie31200: Fix the error path order of ie31200_init()
The error path order of ie31200_init() is incorrect, fix it.

Fixes: 709ed1bcef12 ("EDAC/ie31200: Fallback if host bridge device is already initialized")
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Gary Wang <gary.c.wang@intel.com>
Link: https://lore.kernel.org/r/20250310011411.31685-4-qiuxu.zhuo@intel.com
2025-03-10 10:43:39 -07:00
Qiuxu Zhuo
3427befbbc EDAC/ie31200: Fix the DIMM size mask for several SoCs
The DIMM size mask for {Sky, Kaby, Coffee} Lake is not bits{7:0},
but bits{5:0}. Fix it.

Fixes: 953dee9bbd24 ("EDAC, ie31200_edac: Add Skylake support")
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Gary Wang <gary.c.wang@intel.com>
Link: https://lore.kernel.org/r/20250310011411.31685-3-qiuxu.zhuo@intel.com
2025-03-10 10:43:05 -07:00
Qiuxu Zhuo
d59d844e31 EDAC/ie31200: Fix the size of EDAC_MC_LAYER_CHIP_SELECT layer
The EDAC_MC_LAYER_CHIP_SELECT layer pertains to the rank, not the DIMM.
Fix its size to reflect the number of ranks instead of the number of DIMMs.
Also delete the unused macros IE31200_{DIMMS,RANKS}.

Fixes: 7ee40b897d18 ("ie31200_edac: Introduce the driver")
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Gary Wang <gary.c.wang@intel.com>
Link: https://lore.kernel.org/r/20250310011411.31685-2-qiuxu.zhuo@intel.com
2025-03-10 10:41:32 -07:00
Arnd Bergmann
49472722d9 EDAC/device: Fix dev_set_name() format string
Passing a variable string as the format to dev_set_name() causes a W=1 warning:

  drivers/edac/edac_device.c:736:9: error: format not a string literal and no format arguments [-Werror=format-security]
    736 |         ret = dev_set_name(&ctx->dev, name);
        |         ^~~

Use a literal "%s" instead so the name can be the argument.

Fixes: db99ea5f2c03 ("EDAC: Add support for EDAC device features control")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250304143603.995820-1-arnd@kernel.org
2025-03-05 23:35:01 +01:00
Colin Ian King
136899ffc4 EDAC/pnd2: Make read-only const array intlv static
Don't populate the const read-only array intlv on the stack at run time,
instead make it static. This also shrinks the object size:

  $ size pnd2_edac.o.*

     text    data     bss     dec     hex filename
    15632     264    1384   17280    4380 pnd2_edac.o.new
    15644     264    1384   17292    438c pnd2_edac.o.old

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://lore.kernel.org/r/20240919170427.497429-1-colin.i.king@gmail.com
2025-03-03 16:39:26 +01:00
Christophe JAILLET
ac2fbe0948 EDAC/igen6: Constify struct res_config
The res_config structs are not modified in this driver.

Constifying these structures moves some data to a read-only section, so
increase overall security, especially when the structure holds some function
pointers.

On a x86_64, with allmodconfig, as an example:

  Before:
  ======
     text	   data	    bss	    dec	    hex	filename
    36777	   2479	   4304	  43560	   aa28	drivers/edac/igen6_edac.o

  After:
  =====
     text	   data	    bss	    dec	    hex	filename
    37297	   1959	   4304	  43560	   aa28	drivers/edac/igen6_edac.o

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://lore.kernel.org/r/a06153870951a64b438e76adf97d440e02c1a1fc.1738355198.git.christophe.jaillet@wanadoo.fr
2025-03-03 16:33:03 +01:00
Thorsten Blum
12378e1c3f EDAC/amd64: Simplify return statement in dct_ecc_enabled()
Simplify the return statement to improve the code's readability.

No functional changes.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Link: https://lore.kernel.org/r/20250201130953.1377-2-thorsten.blum@linux.dev
2025-02-28 13:21:43 +01:00
Shiju Jose
81e42fc1d3 EDAC: Update memory repair control interface for memory sparing feature
Update memory repair control interface for memory sparing feature.

CXL memory devices can support soft and hard memory sparing at cacheline,
row, bank and rank granularities. Memory sparing is defined as a repair
function that replaces a portion of memory with a portion of functional
memory at that same granularity.

When a CXL device detects an error in memory, it will report to the host
that there's need for a repair maintenance operation by using an event
record where the "maintenance needed" flag is set.

The event records contain the device physical address (DPA) and other
attributes of the memory to repair such as bank group, bank, rank, row,
column, channel etc.

The kernel will report the corresponding CXL general media or DRAM trace
event to userspace, and userspace tools (e.g. rasdaemon) will initiate
a repair operation in response to the device request via the sysfs
repair control.

  [ bp: Massage. ]

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250212143654.1893-15-shiju.jose@huawei.com
2025-02-26 11:14:40 +01:00
Shiju Jose
699ea5219c EDAC: Add a memory repair control feature
Add a generic EDAC memory repair control driver to manage memory repairs in
the system, such as CXL Post Package Repair (PPR) and other soft and hard PPR
features.

For example, a CXL device with DRAM components that support PPR features may
implement PPR maintenance operations. DRAM components may support two types of
PPR:

 - hard PPR, for a permanent row repair, and
 - soft PPR,  for a temporary row repair.

Soft PPR is much faster than hard PPR, but the repair is lost with a power
cycle.

When a CXL device detects an error in a memory, it may report the need for
a repair maintenance operation by using an event record where the "maintenance
needed" flag is set. The event records contain the device physical
address (DPA) and other optional attributes of the memory to repair.

The kernel will report the corresponding CXL general media or DRAM trace event
to userspace, and userspace tools (e.g. rasdaemon) will initiate a repair
operation in response to the device request via the sysfs repair control.

Device with memory repair features registers with EDAC device driver, which
retrieves a memory repair descriptor from EDAC memory repair driver and exposes
the sysfs repair control attributes to userspace in

  /sys/bus/edac/devices/<dev-name>/mem_repairX/.

The common memory repair control interface abstracts the control of arbitrary
memory repair functionality into a standardized set of functions.  The sysfs
memory repair attribute nodes are only available if the client driver has
implemented the corresponding attribute callback function and provided
operations to the EDAC device driver during registration.

  [ bp: Massage, fixup edac_dev_register() retvals, merge
    write_overflow fix to mem_repair_create_desc() ]

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250212143654.1893-5-shiju.jose@huawei.com
2025-02-26 11:13:23 +01:00
Thorsten Blum
d09055122b EDAC: Use string choice helper functions
Remove hard-coded strings by using the str_enabled_disabled(), str_yes_no(),
str_write_read(), and str_plural() helper functions.

Add a space in "All DIMMs support ECC: yes/no" to improve readability.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://lore.kernel.org/r/20250223212429.3466-2-thorsten.blum@linux.dev
2025-02-25 22:19:55 +01:00
Shiju Jose
bcbd069b11 EDAC: Add a Error Check Scrub control feature
Add an Error Check Scrub (ECS) control to manage a memory device's ECS
feature.

The ECS is a feature defined in JEDEC DDR5 SDRAM Specification (JESD79-5) and
allows the DRAM to internally read, correct single-bit errors, and write back
corrected data bits to the DRAM array while providing transparency to error
counts.

The DDR5 device contains a number of memory media Field Replaceable Units
(FRU) per device. The DDR5 ECS feature and thus the ECS control driver
supports configuring the ECS parameters per FRU.

Memory devices support the ECS feature register with the EDAC device driver,
which retrieves the ECS descriptor from the EDAC ECS driver.  This driver
exposes sysfs ECS control attributes to userspace via

  /sys/bus/edac/devices/<dev-name>/ecs_fruX/.

The common sysfs ECS control interface abstracts the control of an arbitrary
ECS functionality to a common set of functions.

Support for the ECS feature is added separately because the control attributes
of the DDR5 ECS feature differ from those of the scrub feature.

The sysfs ECS attribute nodes are only present if the client driver has
implemented the corresponding attribute callback function and passed the
necessary operations to the EDAC RAS feature driver during registration.

  [ bp: Massage, fixup edac_dev_register() retvals. ]

Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Tested-by: Fan Ni <fan.ni@samsung.com>
Link: https://lore.kernel.org/r/20250212143654.1893-4-shiju.jose@huawei.com
2025-02-25 15:42:32 +01:00
Shiju Jose
f90b738166 EDAC: Add scrub control feature
Add a scrub control to manage memory scrubbers in the system.

Devices with a scrub feature register with the EDAC device driver which
retrieves the scrub descriptor from the scrub driver and exposes the
control attributes for a instance to userspace at

  /sys/bus/edac/devices/<dev-name>/scrubX/.

The common sysfs scrub control interface abstracts the control of
arbitrary scrubbing functionality into a common set of functions. The
attribute nodes are only present if the client driver has implemented
the corresponding attribute callback function and passed the operations
to the device driver during registration.

  [ bp: Massage commit message, docs and code, simplify text a bit.
    Integrate fixup for: https://lore.kernel.org/r/202502251009.0sGkolEJ-lkp@intel.com
    Reported-by: kernel test robot <lkp@intel.com>
    Reported-by: Dan Carpenter <dan.carpenter@linaro.org> ]

Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Daniel Ferguson <danielf@os.amperecomputing.com>
Tested-by: Fan Ni <fan.ni@samsung.com>
Link: https://lore.kernel.org/r/20250212143654.1893-3-shiju.jose@huawei.com
2025-02-25 15:39:09 +01:00
Shiju Jose
db99ea5f2c EDAC: Add support for EDAC device features control
Add generic EDAC device feature controls supporting the registration of RAS
features available in the system. The driver exposes control attributes for
these features to userspace in

  /sys/bus/edac/devices/<dev-name>/<ras-feature>

  [ bp: Touch-up documentation, simplify, make edac_dev_type static,
    fixup edac_dev_register() retvals. ]

Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Tested-by: Daniel Ferguson <danielf@os.amperecomputing.com>
Tested-by: Fan Ni <fan.ni@samsung.com>
Link: https://lore.kernel.org/r/20250212143654.1893-2-shiju.jose@huawei.com
2025-02-25 15:33:27 +01:00
Qiuxu Zhuo
d9207cf776 EDAC/{skx_common,i10nm}: Fix some missing error reports on Emerald Rapids
When doing error injection to some memory DIMMs on certain Intel Emerald
Rapids servers, the i10nm_edac missed error reports for some memory DIMMs.

Certain BIOS configurations may hide some memory controllers, and the
i10nm_edac doesn't enumerate these hidden memory controllers. However, the
ADXL decodes memory errors using memory controller physical indices even
if there are hidden memory controllers. Therefore, the memory controller
physical indices reported by the ADXL may mismatch the logical indices
enumerated by the i10nm_edac, resulting in missed error reports for some
memory DIMMs.

Fix this issue by creating a mapping table from memory controller physical
indices (used by the ADXL) to logical indices (used by the i10nm_edac) and
using it to convert the physical indices to the logical indices during the
error handling process.

Fixes: c545f5e41225 ("EDAC/i10nm: Skip the absent memory controllers")
Reported-by: Kevin Chang <kevin1.chang@intel.com>
Tested-by: Kevin Chang <kevin1.chang@intel.com>
Reported-by: Thomas Chen <Thomas.Chen@intel.com>
Tested-by: Thomas Chen <Thomas.Chen@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250214002728.6287-1-qiuxu.zhuo@intel.com
2025-02-20 17:02:33 -08:00
Qiuxu Zhuo
267e5b1d26 EDAC/igen6: Fix the flood of invalid error reports
The ECC_ERROR_LOG register of certain SoCs may contain the invalid value
~0, which results in a flood of invalid error reports in polling mode.

Fix the flood of invalid error reports by skipping the invalid ECC error
log value ~0.

Fixes: e14232afa944 ("EDAC/igen6: Add polling support")
Reported-by: Ramses <ramses@well-founded.dev>
Closes: https://lore.kernel.org/all/OISL8Rv--F-9@well-founded.dev/
Tested-by: Ramses <ramses@well-founded.dev>
Reported-by: John <therealgraysky@proton.me>
Closes: https://lore.kernel.org/all/p5YcxOE6M3Ncxpn2-Ia_wCt61EM4LwIiN3LroQvT_-G2jMrFDSOW5k2A9D8UUzD2toGpQBN1eI0sL5dSKnkO8iteZegLoQEj-DwQaMhGx4A=@proton.me/
Tested-by: John <therealgraysky@proton.me>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250212083354.31919-1-qiuxu.zhuo@intel.com
2025-02-20 17:00:38 -08:00
Arnd Bergmann
c29dfd661f EDAC/ie31200: work around false positive build warning
gcc-14 produces a bogus warning in some configurations:

drivers/edac/ie31200_edac.c: In function 'ie31200_probe1.isra':
drivers/edac/ie31200_edac.c:412:26: error: 'dimm_info' is used uninitialized [-Werror=uninitialized]
  412 |         struct dimm_data dimm_info[IE31200_CHANNELS][IE31200_DIMMS_PER_CHANNEL];
      |                          ^~~~~~~~~
drivers/edac/ie31200_edac.c:412:26: note: 'dimm_info' declared here
  412 |         struct dimm_data dimm_info[IE31200_CHANNELS][IE31200_DIMMS_PER_CHANNEL];
      |                          ^~~~~~~~~

I don't see any way the unintialized access could really happen here,
but I can see why the compiler gets confused by the two loops.

Instead, rework the two nested loops to only read the addr_decode
registers and then keep only one instance of the dimm info structure.

[Tony: Qiuxu pointed out that the "populate DIMM info" comment was left
behind in the refactor and suggested moving it. I deleted the comment
as unnecessry in front os a call to populate_dimm_info(). That seems
pretty self-describing.]

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/all/20250122065031.1321015-1-arnd@kernel.org
2025-02-20 16:56:41 -08:00
Komal Bajaj
c158647c10 EDAC/qcom: Correct interrupt enable register configuration
The previous implementation incorrectly configured the cmn_interrupt_2_enable
register for interrupt handling. Using cmn_interrupt_2_enable to configure
Tag, Data RAM ECC interrupts would lead to issues like double handling of the
interrupts (EL1 and EL3) as cmn_interrupt_2_enable is meant to be configured
for interrupts which needs to be handled by EL3.

EL1 LLCC EDAC driver needs to use cmn_interrupt_0_enable register to configure
Tag, Data RAM ECC interrupts instead of cmn_interrupt_2_enable.

Fixes: 27450653f1db ("drivers: edac: Add EDAC driver support for QCOM SoCs")
Signed-off-by: Komal Bajaj <quic_kbajaj@quicinc.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/20241119064608.12326-1-quic_kbajaj@quicinc.com
2025-02-14 20:36:11 +01:00
Linus Torvalds
b9d8a295ed - The first part of a restructuring of AMD's representation of a northbridge
which is legacy now, and the creation of the new AMD node concept which
   represents the Zen architecture of having a collection of I/O devices within
   an SoC. Those nodes comprise the so-called data fabric on Zen. This has
   at least one practical advantage of not having to add a PCI ID each time
   a new data fabric PCI device releases. Eventually, the lot more uniform
   provider of data fabric functionality amd_node.c will be used by all the
   drivers which need it
 
 - Smaller cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmePuPIACgkQEsHwGGHe
 VUpU6Q//S9j9+YC9EpredFoJ5W0BfERR5XOum7YjlLxq2mVTStrf9Q1ecrwmS4Q6
 4mAydIDfhqNlouUjMBgNNFJcvm8lat+/pjY78oT8ZdjumslMbMxo81VmQ3fX+6fE
 izMrL81DG4j8zeleUyz5ecJEK/KPw1s3SkY736511PeJSalOU4hLYmU819imfAk/
 5c9os2GNhszIROE1YUYZQ3zXne1t2PNXKvctzVrJYjyKpIDgFNzTj6gXhePzXBNO
 iFdApqSgKdnnsD6VsfxYVnOKP+cSIl27Tbge6dm7DHQbSs00aVL64JPcX8/hWtp6
 ExrwBYiFk6yafwsNUu7/PmqbZNKYxDgvXFq8jSOFfioh6Km/QZYs8y1/qXN3qmSU
 78Ah5jyO+U+++FsSa2o9eRpU2l84UIQqvp84PeSLylzh7iLFyFCWsMfreNeIsF9v
 Jsost58JQOCufRK3qfMiDO88QUZRKyCfFymDAVcvPoBwp5nK9R1ohlbxgXrCPsE7
 Bd7J6jrlpcoRyYc8vhshkrnK2Sk6pP77OZOh5AZ9AybnALH0afUNLzk6sBtaObkZ
 xIJcSIBkKz3P4zWFKsXmqGYHWp1IsKsYRsNjCt5FExWOF+uKKKBjynHmlKeS0l/b
 J6bwDUPVW/gfkBqDV8bILultj9Gm8L5Z8SwvD1ww69OYN+c7oVk=
 =ZAjD
 -----END PGP SIGNATURE-----

Merge tag 'x86_misc_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull misc x86 updates from Borislav Petkov:

 - The first part of a restructuring of AMD's representation of a
   northbridge which is legacy now, and the creation of the new AMD node
   concept which represents the Zen architecture of having a collection
   of I/O devices within an SoC. Those nodes comprise the so-called data
   fabric on Zen.

   This has at least one practical advantage of not having to add a PCI
   ID each time a new data fabric PCI device releases. Eventually, the
   lot more uniform provider of data fabric functionality amd_node.c
   will be used by all the drivers which need it

 - Smaller cleanups

* tag 'x86_misc_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/amd_node: Use defines for SMN register offsets
  x86/amd_node: Remove dependency on AMD_NB
  x86/amd_node: Update __amd_smn_rw() error paths
  x86/amd_nb: Move SMN access code to a new amd_node driver
  x86/amd_nb, hwmon: (k10temp): Simplify amd_pci_dev_to_node_id()
  x86/amd_nb: Simplify function 3 search
  x86/amd_nb: Use topology info to get AMD node count
  x86/amd_nb: Simplify root device search
  x86/amd_nb: Simplify function 4 search
  x86: Start moving AMD node functionality out of AMD_NB
  x86/amd_nb: Clean up early_is_amd_nb()
  x86/amd_nb: Restrict init function to AMD-based systems
  x86/mtrr: Rename mtrr_overwrite_state() to guest_force_mtrr_state()
2025-01-21 09:38:52 -08:00
Linus Torvalds
48795f90cb - Remove the less generic CPU matching infra around struct x86_cpu_desc and
use the generic struct x86_cpu_id thing
 
 - Remove magic naked numbers for CPUID functions and use proper defines of the
   prefix CPUID_LEAF_*. Consolidate some of the crazy use around the tree
 
 - Smaller cleanups and improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmePjeIACgkQEsHwGGHe
 VUqRBA//TinKFcWagaQB3lsnoBRwqyg6JJZIBNMF9sBMDD9HnvEZ/JduC+3+g1rx
 iztuCmRSgQsi/QvRaEFNuDMOgk6gACyXxi7Uf6eXsQkSlsZFViaqbXsy9kqslRbl
 7QP1NS1sfdSd42JPp2UZT/lg9kluuVnn5b40zZIwy2AAzwrNFfZAS4Yg7Qe4XQDF
 xBcHi8MAF+LTm5Tv0hLmx2UcfZLhi7hXy8mTAIFS0Liww+Y5qaam33xw9KxNU5lZ
 tVepzY5my43pRs4MB1CvaQCiZ84GxvAVqz3JYsg5YhVp45xh7P2WtjBeeOqLljaW
 MkWnDLOmlaD4Y0kL4QA3ReyBVux54RbDGKC0E/t5fwYlk3dQ7gYwSEvh5358R+0z
 kwxw3NdnNngoLRXAX45EonSxj36jb6KCBHAGqXSfL73OOt30RWCqknEnixcOp/BP
 chNxCiIx7qko+rAYOD62QkguEEPFdb8roeayhIKtiKL5zUwQAr+jt/pKVx2htWLi
 xxqSaVoCFu4edWpsEJnanqhS0Es0v7YiBU3jDC37rZJ+dtzf0C2ewD7Nb1g+wUTn
 NzDkmt58hQW4jBxoxHBIclLfhEETISTEGAAObTa5I5r8IDb7Dv+ZnSv7RfjoR9fL
 RWMz1bJ1Scem+Fx7fc/IRJFSElC41giSwFlhThHdAzI1m95zJN8=
 =9Hdg
 -----END PGP SIGNATURE-----

Merge tag 'x86_cpu_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 cpuid updates from Borislav Petkov:

 - Remove the less generic CPU matching infra around struct x86_cpu_desc
   and use the generic struct x86_cpu_id thing

 - Remove magic naked numbers for CPUID functions and use proper defines
   of the prefix CPUID_LEAF_*. Consolidate some of the crazy use around
   the tree

 - Smaller cleanups and improvements

* tag 'x86_cpu_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cpu: Make all all CPUID leaf names consistent
  x86/fpu: Remove unnecessary CPUID level check
  x86/fpu: Move CPUID leaf definitions to common code
  x86/tsc: Remove CPUID "frequency" leaf magic numbers.
  x86/tsc: Move away from TSC leaf magic numbers
  x86/cpu: Move TSC CPUID leaf definition
  x86/cpu: Refresh DCA leaf reading code
  x86/cpu: Remove unnecessary MwAIT leaf checks
  x86/cpu: Use MWAIT leaf definition
  x86/cpu: Move MWAIT leaf definition to common header
  x86/cpu: Remove 'x86_cpu_desc' infrastructure
  x86/cpu: Move AMD erratum 1386 table over to 'x86_cpu_id'
  x86/cpu: Replace PEBS use of 'x86_cpu_desc' use with 'x86_cpu_id'
  x86/cpu: Expose only stepping min/max interface
  x86/cpu: Introduce new microcode matching helper
  x86/cpufeature: Document cpu_feature_enabled() as the default to use
  x86/paravirt: Remove the WBINVD callback
  x86/cpufeatures: Free up unused feature bits
2025-01-21 09:30:59 -08:00
Linus Torvalds
0763dd8928 - Remove the EDAC PowerPC Cell driver due to the removal of the IBM Cell
blades support
 
 - Add a new EDAC driver for Loongson SoCs which reports single-bit correctable
   errors
 
 - Extend the SKX and i10NM EDAC drivers to support UV systems which can have
   more than 8 nodes
 
 - Add Intel Clearwater Forest server support to i10nm_edac
 
 - Minor fix
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmeLi5QACgkQEsHwGGHe
 VUpZEhAAsbvNhKhq84FUCoQP+OWOdQBv0s3WNGherWPAxC3bGY7xDKcjjdf6ebYG
 4Rk0+nTEK5kefA6PsyiWtxWQZiYUzERDrNpdjWq5RxaO6WtREDiczgJm4NaCCTBr
 F59eHImjW/ajBsU8FAcWnVZLo7KqZvtF19vQFL1TXAKJO6Zpb0ybLIts9BVSwwyV
 c1xYyjMFh1p1H8n7MOsF11u2QUpCcc/SMDesCGWSVAJ2QnB7Ox5NfUaI97lQSiow
 gnS8vTWTIM6e6rZk2HtkSage0Wt7UHCkIsza5DEdW3xQG10eZUE6o33kerxNMezd
 lMWNzavR26CYhkO6/McvhsClOHwcAZZVd4PUTXnNvlNTSV+EEbEbD5JWryQvmvkV
 gazOlPHwd0pj1MkAZBUTdCnR6/DCpqsu68sGMjAiPvR7pb3sBLyRF/DGg9p5Rz0a
 s3Q6SuTAg/ZQWNqgRNcnjh2SZbzqZ+GGD4blDTv1pNjWemTDYpxj73Wl5JxKTdtr
 6n7/ariQTaiMTpQ8ZhDkDP5eHWx5fyTY7P5MzPEuOmNzW/gLz0V5goSqXUUYhQjm
 YAKy5PDS1QwTBKzEOQ8dE3RcOhZtd1X2Vj04CcDgHrjhZBDderPkGs42R3B74cF0
 JB4k5QJFJ97Cxe9xYIFHju7Es6+z+j8kLzWGTydfHGMZfHMHCHM=
 =nwgH
 -----END PGP SIGNATURE-----

Merge tag 'edac_updates_for_v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras

Pull EDAC updates from Borislav Petkov:

 - Remove the EDAC PowerPC Cell driver due to the removal of the IBM
   Cell blades support

 - Add a new EDAC driver for Loongson SoCs which reports single-bit
   correctable errors

 - Extend the SKX and i10NM EDAC drivers to support UV systems which can
   have more than 8 nodes

 - Add Intel Clearwater Forest server support to i10nm_edac

 - Minor fix

* tag 'edac_updates_for_v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
  EDAC/cell: Remove powerpc Cell driver
  EDAC: Add an EDAC driver for the Loongson memory controller
  EDAC: Fix typos in comments
  EDAC/{i10nm,skx,skx_common}: Support UV systems
  EDAC/i10nm: Add Intel Clearwater Forest server support
2025-01-21 08:21:12 -08:00
Borislav Petkov (AMD)
368736db4d Merge remote-tracking branches 'ras/edac-drivers' and 'ras/edac-misc' into edac-updates
* ras/edac-drivers:
  EDAC/cell: Remove powerpc Cell driver
  EDAC: Add an EDAC driver for the Loongson memory controller
  EDAC/{i10nm,skx,skx_common}: Support UV systems
  EDAC/i10nm: Add Intel Clearwater Forest server support

* ras/edac-misc:
  EDAC: Fix typos in comments

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2025-01-17 19:36:27 +01:00
Michael Ellerman
6696037a56 EDAC/cell: Remove powerpc Cell driver
This driver can no longer be built since support for IBM Cell Blades was
removed, in particular PPC_CELL_COMMON.

Remove the driver.

  [ bp: Remove EDAC_CELL from Cell's defconfig too. ]

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241218105523.416573-23-mpe@ellerman.id.au
2025-01-16 17:07:50 +01:00
Mario Limonciello
d6caeafaa3 x86/amd_nb: Move SMN access code to a new amd_node driver
SMN access was bolted into amd_nb mostly as convenience.  This has
limitations though that require incurring tech debt to keep it working.

Move SMN access to the newly introduced AMD Node driver.

Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> # pdx86
Acked-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com> # PMF, PMC
Link: https://lore.kernel.org/r/20241206161210.163701-11-yazen.ghannam@amd.com
2025-01-08 10:59:44 +01:00
Zhao Qunqin
558aff7a63 EDAC: Add an EDAC driver for the Loongson memory controller
Add ECC support for Loongson SoC DDR controller. This driver reports single
bit errors (CE) only.

Only ACPI firmware is supported.

  [ bp: Document what last_ce_count is for. ]

Signed-off-by: Zhao Qunqin <zhaoqunqin@loongson.cn>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Huacai Chen <chenhuacai@loongson.cn>
Link: https://lore.kernel.org/r/20241219124846.1876-1-zhaoqunqin@loongson.cn
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2025-01-04 12:02:04 +01:00
Dave Hansen
85b08180df x86/cpu: Expose only stepping min/max interface
The x86_match_cpu() infrastructure can match CPU steppings. Since
there are only 16 possible steppings, the matching infrastructure goes
all out and stores the stepping match as a bitmap. That means it can
match any possible steppings in a single list entry. Fun.

But it exposes this bitmap to each of the X86_MATCH_*() helpers when
none of them really need a bitmap. It makes up for this by exporting a
helper (X86_STEPPINGS()) which converts a contiguous stepping range
into the bitmap which every single user leverages.

Instead of a bitmap, have the main helper for this sort of thing
(X86_MATCH_VFM_STEPS()) just take a stepping range. This ends up
actually being even more compact than before.

Leave the helper in place (renamed to __X86_STEPPINGS()) to make it
more clear what is going on instead of just having a random GENMASK()
in the middle of an already complicated macro.

One oddity that I hit was this macro:

       X86_MATCH_VFM_STEPS(vfm, X86_STEPPING_MIN, max_stepping, issues)

It *could* have been converted over to take a min/max stepping value
for each entry. But that would have been a bit too verbose and would
prevent the one oddball in the list (INTEL_COMETLAKE_L stepping 0)
from sticking out.

Instead, just have it take a *maximum* stepping and imply that the match
is from 0=>max_stepping. This is functional for all the cases now and
also retains the nice property of having INTEL_COMETLAKE_L stepping 0
stick out like a sore thumb.

skx_cpuids[] is goofy. It uses the stepping match but encodes all
possible steppings. Just use a normal, non-stepping match helper.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20241213185129.65527B2A%40davehans-spike.ostc.intel.com
2024-12-17 16:14:49 -08:00
Yan Zhen
586e62fe38 EDAC: Fix typos in comments
Fix the following typos:

'Alocate' ==> 'Allocate',
'specifed' ==> 'specified',
'Technlogy' ==> 'Technology',
'Brnach' ==> 'Branch',
'branchs' ==> 'branches'.

Signed-off-by: Yan Zhen <yanzhen@vivo.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240930074023.618110-1-yanzhen@vivo.com
2024-12-15 22:17:34 +01:00
Kyle Meyer
584e09743d EDAC/{i10nm,skx,skx_common}: Support UV systems
The 3-bit source IDs in PCI configuration space registers, used to map
devices to sockets, are limited to 8 unique IDs, and each ID is local to
a UPI/QPI domain.

Source IDs cannot be used to map devices to sockets on UV systems
because they can exceed 8 sockets and have multiple UPI/QPI domains with
identical, repeating source IDs.

Use NUMA information to get package IDs instead of source IDs on UV
systems, and use package/source IDs to name IMC information structures.

Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://lore.kernel.org/all/20241213012549.43099-1-kyle.meyer@hpe.com/
2024-12-13 11:10:31 -08:00
Borislav Petkov (AMD)
747367340c EDAC/amd64: Simplify ECC check on unified memory controllers
The intent of the check is to see whether at least one UMC has ECC
enabled. So do that instead of tracking which ones are enabled in masks
which are too small in size anyway and lead to not loading the driver on
Zen4 machines with UMCs enabled over UMC8.

Fixes: e2be5955a886 ("EDAC/amd64: Add support for AMD Family 19h Models 10h-1Fh and A0h-AFh")
Reported-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Avadhut Naik <avadhut.naik@amd.com>
Reviewed-by: Avadhut Naik <avadhut.naik@amd.com>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/20241210212054.3895697-1-avadhut.naik@amd.com
2024-12-11 21:47:33 +01:00
Qiuxu Zhuo
2e55bb9b71 EDAC/i10nm: Add Intel Clearwater Forest server support
Clearwater Forest is the successor to Sierra Forest. Add Clearwater
Forest CPU model ID for EDAC support.

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Yi Lai <yi1.lai@intel.com>
Link: https://lore.kernel.org/r/20241203022038.72873-1-qiuxu.zhuo@intel.com
2024-12-09 11:18:23 -08:00
Linus Torvalds
e70140ba0d Get rid of 'remove_new' relic from platform driver struct
The continual trickle of small conversion patches is grating on me, and
is really not helping.  Just get rid of the 'remove_new' member
function, which is just an alias for the plain 'remove', and had a
comment to that effect:

  /*
   * .remove_new() is a relic from a prototype conversion of .remove().
   * New drivers are supposed to implement .remove(). Once all drivers are
   * converted to not use .remove_new any more, it will be dropped.
   */

This was just a tree-wide 'sed' script that replaced '.remove_new' with
'.remove', with some care taken to turn a subsequent tab into two tabs
to make things line up.

I did do some minimal manual whitespace adjustment for places that used
spaces to line things up.

Then I just removed the old (sic) .remove_new member function, and this
is the end result.  No more unnecessary conversion noise.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-12-01 15:12:43 -08:00
Linus Torvalds
42d9e8b7cc powerpc updates for 6.13
- Rework kfence support for the HPT MMU to work on systems with >= 16TB of RAM.
 
  - Remove the powerpc "maple" platform, used by the "Yellow Dog Powerstation".
 
  - Add support for DYNAMIC_FTRACE_WITH_CALL_OPS,
    DYNAMIC_FTRACE_WITH_DIRECT_CALLS & BPF Trampolines.
 
  - Add support for running KVM nested guests on Power11.
 
  - Other small features, cleanups and fixes.
 
 Thanks to: Amit Machhiwal, Arnd Bergmann, Christophe Leroy, Costa Shulyupin,
 David Hunter, David Wang, Disha Goel, Gautam Menghani, Geert Uytterhoeven,
 Hari Bathini, Julia Lawall, Kajol Jain, Keith Packard, Lukas Bulwahn, Madhavan
 Srinivasan, Markus Elfring, Michal Suchanek, Ming Lei, Mukesh Kumar Chaurasiya,
 Nathan Chancellor, Naveen N Rao, Nicholas Piggin, Nysal Jan K.A, Paulo Miguel
 Almeida, Pavithra Prakash, Ritesh Harjani (IBM), Rob Herring (Arm), Sachin P
 Bappalige, Shen Lichuan, Simon Horman, Sourabh Jain, Thomas Weißschuh, Thorsten
 Blum, Thorsten Leemhuis, Venkat Rao Bagalkote, Zhang Zekun,
 zhang jiao.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRjvi15rv0TSTaE+SIF0oADX8seIQUCZ0Fi5AAKCRAF0oADX8se
 IeI0AQCAkNWRYzGNzPM6aMwDpq5qdeZzvp0rZxuNsRSnIKJlxAD+PAOxOietgjbQ
 Lxt3oizg+UcH/304Y/iyT8IrwI4n+gE=
 =xNtu
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-6.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:

 - Rework kfence support for the HPT MMU to work on systems with >= 16TB
   of RAM.

 - Remove the powerpc "maple" platform, used by the "Yellow Dog
   Powerstation".

 - Add support for DYNAMIC_FTRACE_WITH_CALL_OPS,
   DYNAMIC_FTRACE_WITH_DIRECT_CALLS & BPF Trampolines.

 - Add support for running KVM nested guests on Power11.

 - Other small features, cleanups and fixes.

Thanks to Amit Machhiwal, Arnd Bergmann, Christophe Leroy, Costa
Shulyupin, David Hunter, David Wang, Disha Goel, Gautam Menghani, Geert
Uytterhoeven, Hari Bathini, Julia Lawall, Kajol Jain, Keith Packard,
Lukas Bulwahn, Madhavan Srinivasan, Markus Elfring, Michal Suchanek,
Ming Lei, Mukesh Kumar Chaurasiya, Nathan Chancellor, Naveen N Rao,
Nicholas Piggin, Nysal Jan K.A, Paulo Miguel Almeida, Pavithra Prakash,
Ritesh Harjani (IBM), Rob Herring (Arm), Sachin P Bappalige, Shen
Lichuan, Simon Horman, Sourabh Jain, Thomas Weißschuh, Thorsten Blum,
Thorsten Leemhuis, Venkat Rao Bagalkote, Zhang Zekun, and zhang jiao.

* tag 'powerpc-6.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (89 commits)
  EDAC/powerpc: Remove PPC_MAPLE drivers
  powerpc/perf: Add per-task/process monitoring to vpa_pmu driver
  powerpc/kvm: Add vpa latency counters to kvm_vcpu_arch
  docs: ABI: sysfs-bus-event_source-devices-vpa-pmu: Document sysfs event format entries for vpa_pmu
  powerpc/perf: Add perf interface to expose vpa counters
  MAINTAINERS: powerpc: Mark Maddy as "M"
  powerpc/Makefile: Allow overriding CPP
  powerpc-km82xx.c: replace of_node_put() with __free
  ps3: Correct some typos in comments
  powerpc/kexec: Fix return of uninitialized variable
  macintosh: Use common error handling code in via_pmu_led_init()
  powerpc/powermac: Use of_property_match_string() in pmac_has_backlight_type()
  powerpc: remove dead config options for MPC85xx platform support
  powerpc/xive: Use cpumask_intersects()
  selftests/powerpc: Remove the path after initialization.
  powerpc/xmon: symbol lookup length fixed
  powerpc/ep8248e: Use %pa to format resource_size_t
  powerpc/ps3: Reorganize kerneldoc parameter names
  KVM: PPC: Book3S HV: Fix kmv -> kvm typo
  powerpc/sstep: make emulate_vsx_load and emulate_vsx_store static
  ...
2024-11-23 10:44:31 -08:00
Linus Torvalds
c1f2ffe207 - Log and handle twp new AMD-specific MCA registers: SYND1 and SYND2 and
report the Field Replaceable Unit text info reported through them
 
 - Add support for handling variable-sized SMCA BERT records
 
 - Add the capability for reporting vendor-specific RAS error info without
   adding vendor-specific fields to struct mce
 
 - Cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmc7OlEACgkQEsHwGGHe
 VUpXihAAgVdZExo/1Rmbh6s/259BH38GP6fL+ePaT1SlUzNi770TY2b7I4OYlms4
 xa9t8LAIVMrrIMIg6w6q8JN4YHAQoVdcbRBvHQYB1a24xtoyxaEJxLKQNLA1soUQ
 Jc9asWMHBuXnLfR/4S8Y2vWrzByOSwxqDBzQCu0Ryqvbg7vdRicNt+Hk9oHHIAYy
 cquZpoDGL3W6BA8sXONbEW/6rcQ33JsEQ+Ub4qr1q2g+kNwXrrFuXZlojmz2MxIs
 xgqeYKyrxK6heX0l8dSiipCATA+sOXXWWzbZtdPjFtDGzwIlV3p4yXN3fucrmHm1
 4Fg1gW5a1V82Qosn0FbGiZPojsahhOE2k1bz+yEMDM3Sg2qeRWcK+V3jiS5zKzPd
 WWqUbRtcaxayoEsAXnWrxrp3vxhlUUf1Ivtgk8mlMjhHPLijV5iranrRj+XHEikR
 H0D3Vm0T1LHCPf9AUsbmo0GAfAOeO9DTAB9LJdKv+OJ4ESVgSPJW/9NKWLXKq41p
 hhs7seJTYNw8sp67cL23TnkSp3S+9kd2U7Od3T1kubtd4fVxVnlowu8Fc6kjqd8v
 n+GbdLxhX7GbOgnT0z2OG5Xmc1pNW1JtRbuxSK59NFNia7r6ZkR7BE/OCtL82Rfm
 u7i76z1O0lV91y93GMCyP9DYn8K1ceU7gVCveY6mx/AHgzc87d8=
 =djpG
 -----END PGP SIGNATURE-----

Merge tag 'ras_core_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull RAS updates from Borislav Petkov:

 - Log and handle twp new AMD-specific MCA registers: SYND1 and SYND2
   and report the Field Replaceable Unit text info reported through them

 - Add support for handling variable-sized SMCA BERT records

 - Add the capability for reporting vendor-specific RAS error info
   without adding vendor-specific fields to struct mce

 - Cleanups

* tag 'ras_core_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  EDAC/mce_amd: Add support for FRU text in MCA
  x86/mce/apei: Handle variable SMCA BERT record size
  x86/MCE/AMD: Add support for new MCA_SYND{1,2} registers
  tracing: Add __print_dynamic_array() helper
  x86/mce: Add wrapper for struct mce to export vendor specific info
  x86/mce/intel: Use MCG_BANKCNT_MASK instead of 0xff
  x86/mce/mcelog: Use xchg() to get and clear the flags
2024-11-19 12:04:51 -08:00
Linus Torvalds
77286b868f - Add support for Bluefield-2 SOCs to bluefield_edac
- Add support for Intel Panther Lake-H to igen6_edac
 
 - Add polling support to igen6_edac as some Intel M100 chips have trouble with
   error interrupts
 
 - Add Kaby Lake-S support to ie31200_edac
 
 - Fix memory source detection in the SKX common module which is used by
   a couple of Intel EDAC drivers
 
 - Add support for the NXP i.MX9 memory controller to fsl_edac
 
 - The usual fixes and cleanups all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmc7KAcACgkQEsHwGGHe
 VUpzxQ/6Ahr49jXu58M69UQSW3DdzEU+5NNxmUrZdRrdW/oCJXGpuRdmdFWzvWTj
 HtfCS7GmaSIUPjLaNisyKdCaZxWysBqyLe0Vaexw5nuyybF5TzdYWETqFef1ij9z
 Wqq1j5LPrz+9BiqFqkpbgzo6Y6Ubsv2RKuZu+1GkMT2zRrgEJuJgHi6RlJ8vqj//
 7FePl3CFQ3HDdTom0/L/gsMqSObj7HEq9cbalIjIYw/GRVkZol21vDwKrUkM7rpF
 tfrN1qq3NuJyqM7Du2jw2VtXDomrQ/ZkABNXCbtbczf8trLYUHR5QqIQjxy2ZFts
 jMKIbdCNAfgiqai6bpmm4QHWAIAV3L5DX7OuPmbpQeAzSmOqSEqNbnLbvA1e472f
 5upQH4OLOsHgbnnFTQJ7vcU5jHf41DSauMCFp60h2hyn5RIiVY5ASxRfQ3xdh/+a
 hp2N+hB/y46AjXAidsGhAuUw8nt44MN2x1gtiUfbtMIx6gTewtuu0SbwOb85JW16
 glhD8vxRGTUWoQit+Nh3u/P/rLSGkUJK87mfPr6O/95lleYy5hOizK2jGDbDWkA+
 zOnNXnSWKK/WM+B9qnJnU1sCC7vT3j7cTaDXB1XS2MtcJbArkNC0FOd6xD81PoGh
 MhfWBAKpirXQEomFqpVziDa2wlaUnZrv7/4GGmaBRO401O9iaE4=
 =C3dY
 -----END PGP SIGNATURE-----

Merge tag 'edac_updates_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras

Pull EDAC updates from Borislav Petkov:

 - Add support for Bluefield-2 SOCs to bluefield_edac

 - Add support for Intel Panther Lake-H to igen6_edac

 - Add polling support to igen6_edac as some Intel M100 chips have
   trouble with error interrupts

 - Add Kaby Lake-S support to ie31200_edac

 - Fix memory source detection in the SKX common module which is used by
   a couple of Intel EDAC drivers

 - Add support for the NXP i.MX9 memory controller to fsl_edac

 - The usual fixes and cleanups all over the place

* tag 'edac_updates_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
  EDAC/igen6: Add polling support
  EDAC/igen6: Initialize edac_op_state according to the configuration data
  EDAC/igen6: Avoid segmentation fault on module unload
  EDAC/ie31200: Add Kaby Lake-S dual-core host bridge ID
  MAINTAINERS: Change FSL DDR EDAC maintainership
  EDAC/{skx_common,i10nm}: Fix incorrect far-memory error source indicator
  EDAC/skx_common: Differentiate memory error sources
  EDAC/fsl_ddr: Add support for i.MX9 DDR controller
  dt-bindings: memory: fsl: Add compatible string nxp,imx9-memory-controller
  EDAC/fsl_ddr: Fix bad bit shift operations
  EDAC/fsl_ddr: Move global variables into struct fsl_mc_pdata
  EDAC/fsl_ddr: Pass down fsl_mc_pdata in ddr_in32() and ddr_out32()
  RAS/AMD/ATL: Add debug prints for DF register reads
  EDAC/bluefield: Use Arm SMC for EMI access on BlueField-2
  EDAC/bluefield: Fix potential integer overflow
  EDAC/igen6: Add Intel Panther Lake-H SoCs support
2024-11-19 12:00:10 -08:00
Michael Ellerman
3c592ce799 EDAC/powerpc: Remove PPC_MAPLE drivers
These two drivers are only buildable for the powerpc "maple" platform
(CONFIG_PPC_MAPLE), which has now been removed, see
commit 62f8f307c80e ("powerpc/64: Remove maple platform").

Remove the drivers.

Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://patch.msgid.link/20241112084134.411964-1-mpe@ellerman.id.au
2024-11-19 16:41:16 +11:00
Borislav Petkov (AMD)
1b38da0115 Merge branch 'edac-misc' into edac-updates
* edac-misc:
  MAINTAINERS: Change FSL DDR EDAC maintainership
  RAS/AMD/ATL: Add debug prints for DF register reads
  EDAC/bluefield: Use Arm SMC for EMI access on BlueField-2
  EDAC/bluefield: Fix potential integer overflow
  EDAC/igen6: Add Intel Panther Lake-H SoCs support

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-11-18 11:33:23 +01:00
Orange Kao
e14232afa9 EDAC/igen6: Add polling support
Some PCs with Intel N100 (with PCI device 8086:461c, DID_ADL_N_SKU4)
experienced issues with error interrupts not working, even with the
following configuration in the BIOS.

    In-Band ECC Support: Enabled
    In-Band ECC Operation Mode: 2 (make all requests protected and
                                   ignore range checks)
    IBECC Error Injection Control: Inject Correctable Error on insertion
                                   counter
    Error Injection Insertion Count: 251658240 (0xf000000)

Add polling mode support for these machines to ensure that memory error
events are handled.

Signed-off-by: Orange Kao <orange@aiven.io>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://lore.kernel.org/all/20241106114024.941659-3-orange@aiven.io
2024-11-08 13:36:55 -08:00
Qiuxu Zhuo
1d512b1aa5 EDAC/igen6: Initialize edac_op_state according to the configuration data
Currently, igen6_edac sets edac_op_state to EDAC_OPSTATE_NMI, while the
driver also supports memory errors reported from Machine Check. Initialize
edac_op_state to the correct value according to the configuration data
that the driver probed.

Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/all/20241106114024.941659-2-orange@aiven.io
2024-11-08 13:35:21 -08:00
Orange Kao
fefaae9039 EDAC/igen6: Avoid segmentation fault on module unload
The segmentation fault happens because:

During modprobe:
1. In igen6_probe(), igen6_pvt will be allocated with kzalloc()
2. In igen6_register_mci(), mci->pvt_info will point to
   &igen6_pvt->imc[mc]

During rmmod:
1. In mci_release() in edac_mc.c, it will kfree(mci->pvt_info)
2. In igen6_remove(), it will kfree(igen6_pvt);

Fix this issue by setting mci->pvt_info to NULL to avoid the double
kfree.

Fixes: 10590a9d4f23 ("EDAC/igen6: Add EDAC driver for Intel client SoCs using IBECC")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219360
Signed-off-by: Orange Kao <orange@aiven.io>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20241104124237.124109-2-orange@aiven.io
2024-11-04 12:09:45 -08:00
James Ye
f12c946ee7 EDAC/ie31200: Add Kaby Lake-S dual-core host bridge ID
Add device ID for dual-core Kaby Lake-S processors e.g. i3-7100.

Signed-off-by: James Ye <jye836@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Jason Baron <jbaron@akamai.com>
Link: https://lore.kernel.org/r/20240824120622.46226-1-jye836@gmail.com
2024-11-04 17:40:22 +01:00
Yazen Ghannam
612c2addff EDAC/mce_amd: Add support for FRU text in MCA
A new "FRU Text in MCA" feature is defined where the Field Replaceable
Unit (FRU) Text for a device is represented by a string in the new
MCA_SYND1 and MCA_SYND2 registers. This feature is supported per MCA
bank, and it is advertised by the McaFruTextInMca bit (MCA_CONFIG[9]).

The FRU Text is populated dynamically for each individual error state
(MCA_STATUS, MCA_ADDR, et al.). Handle the case where an MCA bank covers
multiple devices, for example, a Unified Memory Controller (UMC) bank
that manages two DIMMs.

  [ Yazen: Add Avadhut as co-developer for wrapper changes. ]
  [ bp: Do not expose MCA_CONFIG to userspace yet. ]

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Co-developed-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241022194158.110073-6-avadhut.naik@amd.com
2024-10-31 10:53:04 +01:00