The CI systems are pointing out list corruptions, so we still need to
fix something here.
Keep the asserts, but revert the lock changes for now.
Fixes: 59e4405e9e ("drm/amdgpu: revert to old status lock handling v3")
Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
It should return an error code if userq VA validation fails.
Fixes: 9e46b8bb05 ("drm/amdgpu: validate userq buffer virtual address and size")
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
It turned out that protecting the status of each bo_va with a
spinlock was just hiding problems instead of solving them.
Revert the whole approach, add a separate stats_lock and lockdep
assertions that the correct reservation lock is held all over the place.
This not only allows for better checks if a state transition is properly
protected by a lock, but also switching back to using list macros to
iterate over the state of lists protected by the dma_resv lock of the
root PD.
v2: re-add missing check
v3: split into two patches
Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
In S0i3, GFX state is retained, so it's preferrable to
preempt queues rather than unmapping them as the overhead
is lower.
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Tested-by: David Perry <david.perry@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
That was actually complete nonsense and not validating the BOs
at all. The code just cleared all VM areas were it couldn't grab the
lock for a BO.
Try to fix this. Only compile tested at the moment.
v2: fix fence slot reservation as well as pointed out by Sunil.
also validate PDs, PTs, per VM BOs and update PDEs
v3: grab the status_lock while working with the done list.
v4: rename functions, add some comments, fix waiting for updates to
complete.
v4: rename amdgpu_vm_lock_done_list(), add some more comments
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This patch modifies the user queue management to use preempt/restore
operations instead of full map/unmap for queue eviction scenarios where
applicable. The changes include:
1. Introduces new helper functions:
- amdgpu_userqueue_preempt_helper()
- amdgpu_userqueue_restore_helper()
2. Updates queue state management to track PREEMPTED state
3. Modifies eviction handling to use preempt instead of unmap:
- amdgpu_userq_evict_all() now uses preempt_helper
- amdgpu_userq_restore_all() now uses restore_helper
The preempt/restore approach provides better performance during queue
eviction by avoiding the overhead of full queue teardown and setup.
Full map/unmap operations are still used for initial setup/teardown
and system suspend scenarios.
v2: rename amdgpu_userqueue_restore_helper/amdgpu_userqueue_preempt_helper to
amdgpu_userq_restore_helper/amdgpu_userq_preempt_helper for consistency. (Alex)
v3: amdgpu_userq_stop_sched_for_enforce_isolation() and
amdgpu_userq_start_sched_for_enforce_isolation() should use preempt and restore (Alex)
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
It needs to validate the userq object virtual address to
determine whether it is residented in a valid vm mapping.
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Before destroying the userq buffer object, it requires validating
the userq HW unmap status and ensuring the userq is unmapped from
hardware. If the user HW unmap failed, then it needs to reset the
queue for reusing.
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This is no invocation for amdgpu_userq_active().
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This will help on validating the userq input args, and
rejecting for the invalid userq request at the IOCTLs
first place.
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Fix the format of debugfs print in the mqd. Need to
add a colon so parser can parse it properly.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
If the doorbell is invalid, be sure to set the r to an error
state so the function returns an error.
Reviewed-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The duplicate judgment of wptr_va could be removed to simplify the logic
Signed-off-by: Qiang Liu <liuqiang@kylinos.cn>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
If kasprintf() fails, some mutex still need to be released to avoid locking
issue, as already done in all other error handling path.
Fixes: c03ea34cbf ("drm/amdgpu: add support of debugfs for mqd information")
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/all/366557fa7ca8173fd78c58336986ca56953369b9.1752087753.git.christophe.jaillet@wanadoo.fr/
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Fix undefined reference to amdgpu_mqd_info_fops during
debugfs_create_file if DEBUG_FS=n
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Link: https://lore.kernel.org/r/20250708101551.68033-1-sunil.khatri@amd.com
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Add debugfs support for mqd for each queue of the client.
The address exposed to debugfs could be used to dump
the mqd.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://lore.kernel.org/r/20250704075548.1549849-5-sunil.khatri@amd.com
Signed-off-by: Christian König <christian.koenig@amd.com>
A circular locking dependency was detected between the global
`adev->userq_mutex` and per-file `userq_mgr->userq_mutex` when
creating user queues. The issue occurs because:
1. `amdgpu_userq_suspend()` and `amdgpu_userq_resume` take `adev->userq_mutex` first, then
`userq_mgr->userq_mutex`
2. While `amdgpu_userq_create()` takes them in reverse order
This patch resolves the issue by:
1. Moving the `adev->userq_mutex` lock earlier in `amdgpu_userq_create()`
to cover the `amdgpu_userq_ensure_ev_fence()` call
2. Releasing it after we're done with both queue creation and the
scheduling halt check
v2: remove unused adev->userq_mutex lock (Prike)
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This resolves a deadlock between user queue management and GPU reset
paths by enforcing consistent lock ordering.
The deadlock occurred when:
1. Process exit path (amdgpu_userq_mgr_fini) would:
- Take uqm->userq_mutex
- Then try to take adev->userq_mutex for list operations
2. GPU reset path (amdgpu_userq_pre_reset) would:
- Take adev->userq_mutex first (for list traversal)
- Then take uqm->userq_mutex
The solution establishes a strict top-down locking order:
1. Always take adev->userq_mutex before any uqm->userq_mutex
2. Maintain this order consistently across all code paths
Changes made:
- Reordered locking in amdgpu_userq_mgr_fini() to take device lock first
- Kept existing proper order in amdgpu_userq_pre_reset()
- Simplified the fini flow by removing redundant operations
This prevents circular dependencies while maintaining thread safety
during both normal operation and GPU reset scenarios.
Fixes: 4ce60dbada ("drm/amdgpu: store userq_managers in a list in adev")
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Arvind Yadav <Arvind.Yadav@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
update the functions in amdgpu_userqueues.c from
DRM_DBG_DRIVER to drm_dbg_driver so multi gpu instance
can be logged in.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
change the DRM_ERROR and drm_err to drm_file_err
to add process name and pid to the logging.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
drm_file will be used in usermode queues code to
enable better process information in logging and hence
add drm_file part of the userq_mgr struct.
update the drm_file pointer in userq_mgr for each
amdgpu_driver_open_kms.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The "ticket" pointer points to in the middle of the &exec struct so it
can't be NULL. Remove the check.
Reviewed-by: Christian König <christian.koenig@amd.com>
Acked-by: Shashank Sharma <shashank.sharma@amd.com>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
DRM_AMDGPU_NAVI3X_USERQ config support is not required for
usermode queue.
v2: rebase.
Cc: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Arvind Yadav <Arvind.Yadav@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
log context:seqno of the fence during timeout rather
than logging fence pointer.
Reviewed-by: Arvind Yadav <Arvind.Yadav@amd.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
s/userqueue/userq/
1. remove the mix of amdgpu_userqueue and amdgpu_userq
2. to be consistent with other amdgpu_userq_fence.c
3. it's shorter
Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-22 08:51:46 -04:00
Renamed from drivers/gpu/drm/amd/amdgpu/amdgpu_userqueue.c (Browse further)