mirror of
				https://github.com/torvalds/linux.git
				synced 2025-11-04 10:40:15 +02:00 
			
		
		
		
	
				The current implementation of drm_sched_start uses a hardcoded
-ECANCELED to dispose of a job when the parent/hw fence is NULL.
This results in drm_sched_job_done being called with -ECANCELED for
each job with a NULL parent in the pending list, making it difficult
to distinguish between recovery methods, whether a queue reset or a
full GPU reset was used.
To improve this, we first try a soft recovery for timeout jobs and
use the error code -ENODATA. If soft recovery fails, we proceed with
a queue reset, where the error code remains -ENODATA for the job.
Finally, for a full GPU reset, we use error codes -ECANCELED or
-ETIME. This patch adds an error code parameter to drm_sched_start,
allowing us to differentiate between queue reset and GPU reset
failures. This enables user mode and test applications to validate
the expected correctness of the requested operation. After a
successful queue reset, the only way to continue normal operation is
to call drm_sched_job_done with the specific error code -ENODATA.
v1: Initial implementation by Jesse utilized amdgpu_device_lock_reset_domain
    and amdgpu_device_unlock_reset_domain to allow user mode to track
    the queue reset status and distinguish between queue reset and
    GPU reset.
v2: Christian suggested using the error codes -ENODATA for queue reset
    and -ECANCELED or -ETIME for GPU reset, returned to
    amdgpu_cs_wait_ioctl.
v3: To meet the requirements, we introduce a new function
    drm_sched_start_ex with an additional parameter to set
    dma_fence_set_error, allowing us to handle the specific error
    codes appropriately and dispose of bad jobs with the selected
    error code depending on whether it was a queue reset or GPU reset.
v4: Alex suggested using a new name, drm_sched_start_with_recovery_error,
    which more accurately describes the function's purpose.
    Additionally, it was recommended to add documentation details
    about the new method.
v5: Fixed declaration of new function drm_sched_start_with_recovery_error.(Alex)
v6 (chk): rebase on upstream changes, cleanup the commit message,
          drop the new function again and update all callers,
          apply the errno also to scheduler fences with hw fences
v7 (chk): rebased
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240826122541.85663-1-christian.koenig@amd.com
		
	
					 | 
			||
|---|---|---|
| .. | ||
| Kconfig | ||
| lima_bcast.c | ||
| lima_bcast.h | ||
| lima_ctx.c | ||
| lima_ctx.h | ||
| lima_devfreq.c | ||
| lima_devfreq.h | ||
| lima_device.c | ||
| lima_device.h | ||
| lima_dlbu.c | ||
| lima_dlbu.h | ||
| lima_drv.c | ||
| lima_drv.h | ||
| lima_dump.h | ||
| lima_gem.c | ||
| lima_gem.h | ||
| lima_gp.c | ||
| lima_gp.h | ||
| lima_l2_cache.c | ||
| lima_l2_cache.h | ||
| lima_mmu.c | ||
| lima_mmu.h | ||
| lima_pmu.c | ||
| lima_pmu.h | ||
| lima_pp.c | ||
| lima_pp.h | ||
| lima_regs.h | ||
| lima_sched.c | ||
| lima_sched.h | ||
| lima_trace.c | ||
| lima_trace.h | ||
| lima_vm.c | ||
| lima_vm.h | ||
| Makefile | ||