mirror of
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/
synced 2025-04-19 20:58:31 +09:00
drm/doc: Document device wedged event
Add documentation for device wedged event in a new "Device wedging" chapter.
This describes basic definitions, prerequisites and consumer expectations
along with an example.

v8: Improve introduction (Christian, Rodrigo)
v9: Add prerequisites section (Christian)
v10: Clarify mmap cleanup and consumer prerequisites (Christian, Aravind)
v11: Reference wedged event in device reset chapter (André)
v12: Refine consumer expectations and terminologies (Xaver, Pekka)

Signed-off-by: Raag Jadav <raag.jadav@intel.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: André Almeida <andrealmeid@igalia.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250204070528.1919158-3-raag.jadav@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
parent b7cf9f4ac1
commit a97bc11b20
@@ -371,9 +371,119 @@ Reporting causes of resets

Apart from propagating the reset through the stack so apps can recover, it's
really useful for driver developers to learn more about what caused the reset in
the first place. DRM devices should make use of devcoredump to store relevant
information about the reset, so this information can be added to user bug
reports.
the first place. For this, drivers can make use of devcoredump to store relevant
information about the reset and send a device wedged event with the ``none``
recovery method (as explained in the "Device Wedging" chapter) to notify
userspace, so this information can be collected and added to user bug reports.

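As a rough illustration of how a consumer might collect that information, the
sketch below copies any pending devcoredump data out of sysfs. The function
name ``save_coredumps`` and the ``SYSFS`` override are assumptions made for
this example and are not part of any DRM tooling; the
``/sys/class/devcoredump/devcd*/data`` path is the one exposed by the
devcoredump framework::

    #!/bin/sh
    # save_coredumps: hypothetical helper for illustration only.
    # $1: directory to store the dumps in
    # SYSFS may be overridden for dry runs against a fake tree; defaults to /sys.
    save_coredumps() {
        sysfs=${SYSFS:-/sys}
        mkdir -p "$1" || return 1
        for data in "$sysfs"/class/devcoredump/devcd*/data; do
            # Skip when the glob did not match or the file is not readable
            [ -r "$data" ] || continue
            name=$(basename "$(dirname "$data")")
            cat "$data" > "$1/$name.bin" || return 1
        done
    }

The sketch only reads the dumps and leaves it to the kernel to discard them
after its timeout, so other collectors can still pick them up.
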
Device Wedging
==============

Drivers can optionally make use of the device wedged event (implemented as
drm_dev_wedged_event() in the DRM subsystem), which notifies userspace of the
'wedged' (hung/unusable) state of the DRM device through a uevent. This is
especially useful in cases where the device is no longer operating as expected
and has become unrecoverable from driver context. The purpose of this
implementation is to provide drivers a generic way to recover the device with
the help of userspace intervention, without taking any drastic measures in the
driver (like resetting or re-enumerating the full bus on which the underlying
physical device is sitting).

A 'wedged' device is a device that the driver has declared dead after
exhausting all possible attempts to recover it from driver context. The uevent
is the notification sent to userspace, along with a hint about what could be
attempted from userspace to recover the device and bring it back to a usable
state. Different drivers may have different ideas of a 'wedged' device
depending on the hardware implementation of the underlying physical device,
hence the vendor-agnostic nature of the event. It is up to each driver to
decide when it sees the need for device recovery and which of the available
methods it wants to use.

Driver prerequisites
--------------------

Before opting for recovery, the driver needs to make sure that the 'wedged'
device doesn't harm the system as a whole by taking care of the prerequisites.
Necessary actions must include disabling DMA to system memory as well as any
communication channels with other devices. Further, the driver must ensure
that all dma_fences are signalled and any device state that the core kernel
might depend on is cleaned up. All existing mmaps should be invalidated and
page faults should be redirected to a dummy page. Once the event is sent, the
device must be kept in the 'wedged' state until recovery is performed. New
accesses to the device (IOCTLs) should be rejected, preferably with an error
code that reflects the type of failure the device has encountered. This
signifies the reason for wedging, which can be reported to the application if
needed.

Recovery
--------

The current implementation defines three recovery methods, of which drivers
can use any one, multiple or none. The method(s) of choice will be sent in the
uevent environment as ``WEDGED=<method1>[,..,<methodN>]``, ordered from fewest
to most side effects. If the driver is unsure about recovery or the method is
unknown (like soft/hard system reboot, firmware flashing, physical device
replacement or any other procedure which can't be attempted on the fly),
``WEDGED=unknown`` will be sent instead.

Userspace consumers can parse this event and attempt recovery as per the
following expectations.

    =============== ========================================
    Recovery method Consumer expectations
    =============== ========================================
    none            optional telemetry collection
    rebind          unbind + bind driver
    bus-reset       unbind + bus reset/re-enumeration + bind
    unknown         consumer policy
    =============== ========================================

The only exception to this is ``WEDGED=none``, which signifies that the device
was temporarily 'wedged' at some point but was recovered from driver context
using device-specific methods like reset. No explicit recovery is expected from
the consumer in this case, but it can still take additional steps like gathering
telemetry information (devcoredump, syslog). This is useful because the first
hang is usually the most critical one and can result in further hangs or
complete wedging.

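The parsing step described above can be sketched as follows. This is a minimal
example, assuming a consumer that knows how to handle ``none``, ``rebind`` and
``bus-reset``; the function name ``pick_method`` is hypothetical. Because the
list is ordered from fewest to most side effects, taking the first supported
entry yields the least intrusive option::

    #!/bin/sh
    # pick_method: hypothetical helper, not part of any DRM or udev tooling.
    # $1: value of WEDGED from the uevent environment, e.g. "rebind,bus-reset"
    # Prints the first listed method this consumer supports; fails if none is.
    # The body runs in a subshell so IFS and globbing changes do not leak out.
    pick_method() (
        IFS=,
        set -f                  # no pathname expansion while splitting
        for method in $1; do
            case $method in
            none|rebind|bus-reset)
                printf '%s\n' "$method"
                exit 0
                ;;
            esac
        done
        exit 1
    )

For ``WEDGED=unknown`` the helper prints nothing and returns a non-zero
status, leaving the decision to consumer policy.
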
Consumer prerequisites
----------------------

It is the responsibility of the consumer to make sure that the device or its
resources are not in use by any process before attempting recovery. With IOCTLs
erroring out, all device memory should be unmapped and file descriptors should
be closed to prevent leaks or undefined behaviour. The idea here is to clear the
device of all user context beforehand and set the stage for a clean recovery.

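One possible way to check for open file descriptors is to scan ``/proc``. The
sketch below is only an illustration: the function name ``device_in_use`` is
hypothetical, and it assumes a Linux ``/proc`` filesystem and enough privilege
(typically root) to read other processes' descriptor tables::

    #!/bin/sh
    # device_in_use: hypothetical helper for illustration only.
    # $1: path of a device node, e.g. /dev/dri/card0
    # Returns 0 if any process visible in /proc holds it open, 1 otherwise.
    device_in_use() {
        target=$(readlink -f "$1") || return 1
        for fd in /proc/[0-9]*/fd/*; do
            # Each entry is a symlink to the file the descriptor refers to
            if [ "$(readlink "$fd" 2>/dev/null)" = "$target" ]; then
                return 0
            fi
        done
        return 1
    }

A consumer would run this for every node of the device, render nodes
included. Note that it does not catch mappings that outlive a closed
descriptor; those would have to be looked up in ``/proc/<pid>/maps``.
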
Example
-------

Udev rule::

    SUBSYSTEM=="drm", ENV{WEDGED}=="rebind", DEVPATH=="*/drm/card[0-9]", \
    RUN+="/path/to/rebind.sh $env{DEVPATH}"

Recovery script::

    #!/bin/sh
    # $1: DEVPATH of the DRM card, passed in by the udev rule

    DEVPATH=$(readlink -f "/sys/$1/device")
    DEVICE=$(basename "$DEVPATH")
    DRIVER=$(readlink -f "$DEVPATH/driver")

    # Unbind the driver from the device, then bind it again
    printf '%s' "$DEVICE" > "$DRIVER/unbind"
    printf '%s' "$DEVICE" > "$DRIVER/bind"

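For the ``bus-reset`` method, a comparable script could look like the sketch
below. It assumes the DRM device sits on PCI and uses the PCI sysfs ``remove``
and ``rescan`` files to re-enumerate it; other buses need their own mechanism.
The function name ``bus_reset_recover`` and the ``SYSFS`` override are
hypothetical and only serve this example::

    #!/bin/sh
    # bus_reset_recover: hypothetical sketch, assuming a PCI device.
    # $1: DEVPATH of the DRM card, as passed in by the udev rule
    # SYSFS may be overridden for dry runs against a fake tree; defaults to /sys.
    bus_reset_recover() {
        sysfs=${SYSFS:-/sys}
        devpath=$(readlink -f "$sysfs/$1/device") || return 1
        device=$(basename "$devpath")
        driver=$(readlink -f "$devpath/driver") || return 1

        # Detach the driver, then drop the device from the bus
        printf '%s' "$device" > "$driver/unbind" || return 1
        printf '1' > "$devpath/remove" || return 1
        # Rescan so the device is discovered again and drivers can probe it
        printf '1' > "$sysfs/bus/pci/rescan"
    }

After the rescan the driver is normally bound again by automatic probing; a
consumer that has disabled autoprobe would need an explicit bind step.
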
Customization
-------------

Although basic recovery is possible with a simple script, consumers can define
custom policies around recovery. For example, if the driver supports multiple
recovery methods, consumers can opt for the suitable one depending on scenarios
like repeat offences or vendor-specific failures. Consumers can also choose to
keep the device available for debugging or telemetry collection and base their
recovery decision on the findings. This is especially useful when the driver is
unsure about recovery or the method is unknown.

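As one example of such a policy, the sketch below escalates through the
advertised methods on repeat offences: the first attempt uses the first listed
method, the next attempt the second, and so on, staying at the last one. The
function name ``choose_recovery`` and the idea of passing in an attempt counter
are assumptions made for illustration; how the counter is stored is left to
the consumer::

    #!/bin/sh
    # choose_recovery: hypothetical policy helper for illustration only.
    # $1: WEDGED value from the uevent, e.g. "rebind,bus-reset"
    # $2: number of recoveries already attempted for this device
    # Prints the method to try next; fails if the list is empty.
    # The body runs in a subshell so IFS and globbing changes do not leak out.
    choose_recovery() (
        attempts=$2
        IFS=,
        set -f                  # no pathname expansion while splitting
        chosen=
        for method in $1; do
            chosen=$method
            [ "$attempts" -le 0 ] && break
            attempts=$((attempts - 1))
        done
        [ -n "$chosen" ] && printf '%s\n' "$chosen"
    )

With ``WEDGED=rebind,bus-reset`` this yields ``rebind`` for the first attempt
and ``bus-reset`` for every later one.
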
.. _drm_driver_ioctl: