Commit graph

16448 commits

Author SHA1 Message Date
Alex Deucher
4f2cd64510 drm/amdgpu: fix SPDX header on cyan_skillfish_reg_init.c
This should be MIT.  The driver in general is MIT and
the license text at the top of the file is MIT so fix
it.

Fixes: e8529dbc75 ("drm/amdgpu: add ip offset support for cyan skillfish")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4654
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 102c4f7c554ac5a5ecf0023fa0612beb58e3b0bd)
2025-10-28 11:03:04 -04:00
Alex Deucher
f3b37ebf2c drm/amdgpu: fix SPDX headers on amdgpu_cper.c/h
These should be MIT.  The driver in general is MIT and
the license text at the top of the files is MIT so fix
it.

Fixes: 92d5d2a09d ("drm/amdgpu: Introduce funcs for populating CPER")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4654
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit abd3f876404cafb107cb34bacb74706bfee11cbe)
2025-10-28 11:02:36 -04:00
Mario Limonciello
ba10f8d92a drm/amd: Check that VPE has reached DPM0 in idle handler
[Why]
Newer VPE microcode has functionality that will decrease DPM level
only when a workload has run for 2 or more seconds.  If VPE is turned
off before this DPM decrease and the PMFW doesn't reset it when
power gating VPE, the SOC can get stuck with a higher DPM level.

This can happen from amdgpu's ring buffer test because it's a short
quick workload for VPE and VPE is turned off after 1s.

[How]
In idle handler besides checking fences are drained check PMFW version
to determine if it will reset DPM when power gating VPE.  If PMFW will
not do this, then check VPE DPM level. If it is not DPM0 reschedule
delayed work again until it is.

v2: squash in return fix (Alex)

Cc: Peyton.Lee@amd.com
Reported-by: Sultan Alsawaf <sultan@kerneltoast.com>
Reviewed-by: Sultan Alsawaf <sultan@kerneltoast.com>
Tested-by: Sultan Alsawaf <sultan@kerneltoast.com>
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4615
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 3ac635367eb589bee8edcc722f812a89970e14b7)
Cc: stable@vger.kernel.org
2025-10-28 10:58:34 -04:00
Jonathan Kim
277bb0f83e drm/amdgpu: enable suspend/resume all for gfx 12
Suspend/resume all gangs has been available for GFX12 for a while now
so enable it.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:28 -04:00
Jonathan Kim
0ef930e1fa drm/amdgpu: fix hung reset queue array memory allocation
By design the MES will return an array result that is twice the number
of hung doorbells it can report.

i.e. if up k reported doorbells are supported, then the
second half of the array, also of length k, holds the HQD information
(type/queue/pipe) where queue 1 corresponds to index 0 and k,
queue 2 corresponds to index 1 and k + 1 etc ...

The driver will use the HDQ info to target queue/pipe reset for
hardware scheduled user compute queues.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:28 -04:00
Jonathan Kim
8745ca5efb drm/amdgpu: fix initialization of doorbell array for detect and hang
Initialized doorbells should be set to invalid rather than 0 to prevent
driver from over counting hung doorbells since it checks against the
invalid value to begin with.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:28 -04:00
Jonathan Kim
d0de79f66a drm/amdgpu: fix gfx12 mes packet status return check
GFX12 MES uses low 32 bits of status return for success (1 or 0)
and high bits for debug information if low bits are 0.

GFX11 MES doesn't do this so checking full 64-bit status return
for 1 or 0 is still valid.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
2025-10-13 14:14:16 -04:00
Jesse.Zhang
883f309add drm/amdgpu: Fix NULL pointer dereference in VRAM logic for APU devices
Previously, APU platforms (and other scenarios with uninitialized VRAM managers)
triggered a NULL pointer dereference in `ttm_resource_manager_usage()`. The root
cause is not that the `struct ttm_resource_manager *man` pointer itself is NULL,
but that `man->bdev` (the backing device pointer within the manager) remains
uninitialized (NULL) on APUs—since APUs lack dedicated VRAM and do not fully
set up VRAM manager structures. When `ttm_resource_manager_usage()` attempts to
acquire `man->bdev->lru_lock`, it dereferences the NULL `man->bdev`, leading to
a kernel OOPS.

1. **amdgpu_cs.c**: Extend the existing bandwidth control check in
   `amdgpu_cs_get_threshold_for_moves()` to include a check for
   `ttm_resource_manager_used()`. If the manager is not used (uninitialized
   `bdev`), return 0 for migration thresholds immediately—skipping VRAM-specific
   logic that would trigger the NULL dereference.

2. **amdgpu_kms.c**: Update the `AMDGPU_INFO_VRAM_USAGE` ioctl and memory info
   reporting to use a conditional: if the manager is used, return the real VRAM
   usage; otherwise, return 0. This avoids accessing `man->bdev` when it is
   NULL.

3. **amdgpu_virt.c**: Modify the vf2pf (virtual function to physical function)
   data write path. Use `ttm_resource_manager_used()` to check validity: if the
   manager is usable, calculate `fb_usage` from VRAM usage; otherwise, set
   `fb_usage` to 0 (APUs have no discrete framebuffer to report).

This approach is more robust than APU-specific checks because it:
- Works for all scenarios where the VRAM manager is uninitialized (not just APUs),
- Aligns with TTM's design by using its native helper function,
- Preserves correct behavior for discrete GPUs (which have fully initialized
  `man->bdev` and pass the `ttm_resource_manager_used()` check).

v4: use ttm_resource_manager_used(&adev->mman.vram_mgr.manager) instead of checking the adev->gmc.is_app_apu flag (Christian)

Reviewed-by: Christian König <christian.koenig@amd.com>
Suggested-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:15 -04:00
Christian König
33cc891b56 drm/amdgpu: hide VRAM sysfs attributes on GPUs without VRAM
Otherwise accessing them can cause a crash.

Signed-off-by: Christian König <christian.koenig@amd.com>
Tested-by: Mangesh Gadre <Mangesh.Gadre@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:15 -04:00
Sathishkumar S
74de0eaa00 drm/amdgpu: fix bit shift logic
BIT_ULL(n) sets nth bit, remove explicit shift and set the position

Fixes: a7a411e246 ("drm/amdgpu: fix shift-out-of-bounds in amdgpu_debugfs_jpeg_sched_mask_set")
Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:15 -04:00
Gui-Dong Han
6df8e84aa6 drm/amdgpu: use atomic functions with memory barriers for vm fault info
The atomic variable vm_fault_info_updated is used to synchronize access to
adev->gmc.vm_fault_info between the interrupt handler and
get_vm_fault_info().

The default atomic functions like atomic_set() and atomic_read() do not
provide memory barriers. This allows for CPU instruction reordering,
meaning the memory accesses to vm_fault_info and the vm_fault_info_updated
flag are not guaranteed to occur in the intended order. This creates a
race condition that can lead to inconsistent or stale data being used.

The previous implementation, which used an explicit mb(), was incomplete
and inefficient. It failed to account for all potential CPU reorderings,
such as the access of vm_fault_info being reordered before the atomic_read
of the flag. This approach is also more verbose and less performant than
using the proper atomic functions with acquire/release semantics.

Fix this by switching to atomic_set_release() and atomic_read_acquire().
These functions provide the necessary acquire and release semantics,
which act as memory barriers to ensure the correct order of operations.
It is also more efficient and idiomatic than using explicit full memory
barriers.

Fixes: b97dfa27ef ("drm/amdgpu: save vm fault information for amdkfd")
Cc: stable@vger.kernel.org
Signed-off-by: Gui-Dong Han <hanguidong02@gmail.com>
Signed-off-by: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:15 -04:00
Alex Deucher
ff780f4f80 drm/amdgpu: set an error on all fences from a bad context
When we backup ring contents to reemit after a queue reset,
we don't backup ring contents from the bad context.  When
we signal the fences, we should set an error on those
fences as well.

v2: misc cleanups
v3: add locking for fence error, fix comment (Christian)
v4: fix wrap around, locking (Christian)

Fixes: 77cc0da39c ("drm/amdgpu: track ring state associated with a fence")
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:15 -04:00
Alex Deucher
1f22fcb88b drm/amdgpu: handle wrap around in reemit handling
Compare the sequence numbers directly.

Fixes: 77cc0da39c ("drm/amdgpu: track ring state associated with a fence")
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:15 -04:00
Alex Deucher
357d90be2c drm/amdgpu: fix handling of harvesting for ip_discovery firmware
Chips which use the IP discovery firmware loaded by the driver
reported incorrect harvesting information in the ip discovery
table in sysfs because the driver only uses the ip discovery
firmware for populating sysfs and not for direct parsing for the
driver itself as such, the fields that are used to print the
harvesting info in sysfs report incorrect data for some IPs.  Populate
the relevant fields for this case as well.

Fixes: 514678da56 ("drm/amdgpu/discovery: fix fw based ip discovery")
Acked-by: Tom St Denis <tom.stdenis@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:15 -04:00
Christian König
8f74c70be5 drm/amdgpu: block CE CS if not explicitely allowed by module option
The Constant Engine found on gfx6-gfx10 HW has been a notorious source of
problems.

RADV never used it in the first place, radeonsi only used it for a few
releases around 2017 for gfx6-gfx9 before dropping support for it as
well.

While investigating another problem I just recently found that submitting
to the CE seems to be completely broken on gfx9 for quite a while.

Since nobody complained about that problem it most likely means that
nobody is using any of the affected radeonsi versions on current Linux
kernels any more.

So to potentially phase out the support for the CE and eliminate another
source of problems block submitting CE IBs unless it is enabled again
using a debug flag.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:14 -04:00
Christian König
5d55ed19d4 drm/amdgpu: remove two invalid BUG_ON()s
Those can be triggered trivially by userspace.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:14 -04:00
Timur Kristóf
7bdd91abf0 drm/amd: Disable ASPM on SI
Enabling ASPM causes randoms hangs on Tahiti and Oland on Zen4.
It's unclear if this is a platform-specific or GPU-specific issue.
Disable ASPM on SI for the time being.

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-13 14:14:14 -04:00
Linus Torvalds
284fc30e66 drm next fixes for 6.18-rc1
amdgpu:
 - DC DCE6 fixes
 - GPU reset fixes
 - Secure diplay messaging cleanup
 - MES fix
 - GPUVM locking fixes
 - PMFW messaging cleanup
 - PCI US/DS switch handling fix
 - VCN queue reset fix
 - DC FPU handling fix
 - DCN 3.5 fix
 - DC mirroring fix
 
 amdkfd:
 - Fix kfd process ref leak
 - mmap write lock handling fix
 - Fix comments in IOCTL
 
 xe:
 - Fix build with clang 16
 - Fix handling of invalid configfs syntax usage and spell out the
   expected syntax in the documentation
 - Do not try late bind firmware when running as VF since it
   shouldn't handle firmware loading
 - Fix idle assertion for local BOs
 - Fix uninitialized variable for late binding
 - Do not require perfmon_capable to expose free memory at page
   granularity. Handle it like other drm drivers do
 - Fix lock handling on suspend error path
 - Fix I2C controller resume after S3
 
 v3d:
 - fix fence locking
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmjpaS4ACgkQDHTzWXnE
 hr5U7xAAgBZ6kqMGN6EKKloDlpIgpQKF3pXGMo7NfwTTJ3x6jTGx41g74ohKCFn2
 Zms0RbHTjGB+u495p3dadOSwbvSW2vTYWUCMnxAWye8s7A4LA7SCTXFY84gEE8ki
 8THR7AU1L78MWILas2wGmDzoy2HNGLITLbcZO1B5Cdq8SxxxKsXJXXZahLjBB67C
 4a23OlANL16g5NCKEQ6owuRBf5hlyc6A+vQuBcfBbKg2VL83Iv/P6nt3Lv5uMwv6
 Qss3lnUrsNBrBMxx3QP8IGG3/EwHU/YcJYIYPbce6qZrxuG8Vw1hTPdJqr8kq9mv
 8wg8JC9VQtPrTuAiNCgC9QrvaDW1vMc/5abYQMJBRvexnXf4VjriSH0WM7YXFQcB
 JvChn5Vo8Sr+1B7oQUM5SOMXHxGfxsL3ypub4a3Die6G6ymhp6cDNwBq0vbU/LXy
 s1fddlNM/PrUACKcFaEGVZMDyz75QTMWae5OmcuUizdzHChUzlDhq1MbHrh0D2jP
 mNuORIHnrO+oZOyxpnQr5YBfZk+xFrnvSLI/hJP982m9FKJ4PVjLg+4YlRNjW4wO
 k4LjxyMutybAFEH0nEWrdGxry+y6EM/OrXpToXFquNq3NdI133NvDRiH+f4SI+8M
 Y6cZdwVSiO5fd7Zup0xKBi+jUi66cfkKS+3MXGfMDX7y1orMMqQ=
 =tSAX
 -----END PGP SIGNATURE-----

Merge tag 'drm-next-2025-10-11-1' of https://gitlab.freedesktop.org/drm/kernel

Pull more drm fixes from Dave Airlie:
 "Just the follow up fixes for rc1 from the next branch, amdgpu and xe
  mostly with a single v3d fix in there.

  amdgpu:
   - DC DCE6 fixes
   - GPU reset fixes
   - Secure diplay messaging cleanup
   - MES fix
   - GPUVM locking fixes
   - PMFW messaging cleanup
   - PCI US/DS switch handling fix
   - VCN queue reset fix
   - DC FPU handling fix
   - DCN 3.5 fix
   - DC mirroring fix

  amdkfd:
   - Fix kfd process ref leak
   - mmap write lock handling fix
   - Fix comments in IOCTL

  xe:
   - Fix build with clang 16
   - Fix handling of invalid configfs syntax usage and spell out the
     expected syntax in the documentation
   - Do not try late bind firmware when running as VF since it shouldn't
     handle firmware loading
   - Fix idle assertion for local BOs
   - Fix uninitialized variable for late binding
   - Do not require perfmon_capable to expose free memory at page
     granularity. Handle it like other drm drivers do
   - Fix lock handling on suspend error path
   - Fix I2C controller resume after S3

  v3d:
   - fix fence locking"

* tag 'drm-next-2025-10-11-1' of https://gitlab.freedesktop.org/drm/kernel: (34 commits)
  drm/amd/display: Incorrect Mirror Cositing
  drm/amd/display: Enable Dynamic DTBCLK Switch
  drm/amdgpu: Report individual reset error
  drm/amdgpu: partially revert "revert to old status lock handling v3"
  drm/amd/display: Fix unsafe uses of kernel mode FPU
  drm/amd/pm: Disable VCN queue reset on SMU v13.0.6 due to regression
  drm/amdgpu: Fix general protection fault in amdgpu_vm_bo_reset_state_machine
  drm/amdgpu: Check swus/ds for switch state save
  drm/amdkfd: Fix two comments in kfd_ioctl.h
  drm/amd/pm: Avoid interface mismatch messaging
  drm/amdgpu: Merge amdgpu_vm_set_pasid into amdgpu_vm_init
  drm/amd/amdgpu: Fix the mes version that support inv_tlbs
  drm/amd: Check whether secure display TA loaded successfully
  drm/amdkfd: Fix mmap write lock not release
  drm/amdkfd: Fix kfd process ref leaking when userptr unmapping
  drm/amdgpu: Fix for GPU reset being blocked by KIQ I/O.
  drm/amd/display: Disable scaling on DCE6 for now
  drm/amd/display: Properly disable scaling on DCE6
  drm/amd/display: Properly clear SCL_*_FILTER_CONTROL on DCE6
  drm/amd/display: Add missing DCE6 SCL_HORZ_FILTER_INIT* SRIs
  ...
2025-10-10 14:02:14 -07:00
Lijo Lazar
2e97663760 drm/amdgpu: Report individual reset error
If reinitialization of one of the GPUs fails after reset, it logs
failure on all subsequent GPUs eventhough they have resumed
successfully.

A sample log where only device at 0000:95:00.0 had a failure -

	amdgpu 0000:15:00.0: amdgpu: GPU reset(19) succeeded!
	amdgpu 0000:65:00.0: amdgpu: GPU reset(19) succeeded!
	amdgpu 0000:75:00.0: amdgpu: GPU reset(19) succeeded!
	amdgpu 0000:85:00.0: amdgpu: GPU reset(19) succeeded!
	amdgpu 0000:95:00.0: amdgpu: GPU reset(19) failed
	amdgpu 0000:e5:00.0: amdgpu: GPU reset(19) failed
	amdgpu 0000:f5:00.0: amdgpu: GPU reset(19) failed
	amdgpu 0000:05:00.0: amdgpu: GPU reset(19) failed
	amdgpu 0000:15:00.0: amdgpu: GPU reset end with ret = -5

To avoid confusion, report the error for each device
separately and return the first error as the overall result.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07 14:09:19 -04:00
Christian König
a107aeb6a2 drm/amdgpu: partially revert "revert to old status lock handling v3"
The CI systems are pointing out list corruptions, so we still need to
fix something here.

Keep the asserts, but revert the lock changes for now.

Fixes: 59e4405e9e ("drm/amdgpu: revert to old status lock handling v3")
Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07 14:09:19 -04:00
Jesse.Zhang
8d557eab3a drm/amdgpu: Fix general protection fault in amdgpu_vm_bo_reset_state_machine
After GPU reset with VRAM loss, a general protection fault occurs
during user queue restoration when accessing vm_bo->vm after
spinlock release in amdgpu_vm_bo_reset_state_machine.

The root cause is that vm_bo points to the last entry from the
list_for_each_entry loop, but this becomes invalid after the
spinlock is released. Accessing vm_bo->vm at this point leads
to memory corruption.

Crash log shows:
[  326.981811] Oops: general protection fault, probably for non-canonical address 0x4156415741e58ac8: 0000 [#1] SMP NOPTI
[  326.981820] CPU: 13 UID: 0 PID: 1035 Comm: kworker/13:3 Tainted: G            E       6.16.0+ #25 PREEMPT(voluntary)
[  326.981826] Tainted: [E]=UNSIGNED_MODULE
[  326.981827] Hardware name: Gigabyte Technology Co., Ltd. X870E AORUS PRO ICE/X870E AORUS PRO ICE, BIOS F3i 12/19/2024
[  326.981831] Workqueue: events amdgpu_userq_restore_worker [amdgpu]
[  326.981999] RIP: 0010:amdgpu_vm_assert_locked+0x16/0x70 [amdgpu]
[  326.982094] Code: 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 85 ff 74 45 48 8b 87 80 03 00 00 48 85 c0 74 40 <48> 8b b8 80 01 00 00 48 85 ff 74 3b 8b 05 0c b7 0e f0 85 c0 75 05
[  326.982098] RSP: 0018:ffffaa91c2a6bc20 EFLAGS: 00010206
[  326.982100] RAX: 4156415741e58948 RBX: ffff9e8f013e8330 RCX: 0000000000000000
[  326.982102] RDX: 0000000000000005 RSI: 000000001d254e88 RDI: ffffffffc144814a
[  326.982104] RBP: ffffaa91c2a6bc68 R08: 0000004c21a25674 R09: 0000000000000001
[  326.982106] R10: 0000000000000001 R11: dccaf3f2f82863fc R12: ffff9e8f013e8000
[  326.982108] R13: ffff9e8f013e8000 R14: 0000000000000000 R15: ffff9e8f09980000
[  326.982110] FS:  0000000000000000(0000) GS:ffff9e9e79995000(0000) knlGS:0000000000000000
[  326.982112] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  326.982114] CR2: 000055ed6c9caa80 CR3: 0000000797060000 CR4: 0000000000750ef0
[  326.982116] PKRU: 55555554

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07 14:09:19 -04:00
Lijo Lazar
9b608fe948 drm/amdgpu: Check swus/ds for switch state save
For saving switch state, check if the GPU is having SWUS/DS
architecture. Otherwise, skip saving.

Reported-by: Roman Elshin <roman.elshin@gmail.com>
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4602
Fixes: 1dd2fa0e00 ("drm/amdgpu: Save and restore switch state")
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07 14:09:19 -04:00
Jesse.Zhang
b809ca91a5 drm/amdgpu: Merge amdgpu_vm_set_pasid into amdgpu_vm_init
As KFD no longer uses a separate PASID, the global amdgpu_vm_set_pasid()function is no longer necessary.
Merge its functionality directly intoamdgpu_vm_init() to simplify code flow and eliminate redundant locking.

v2: remove superflous check
  adjust amdgpu_vm_fin and remove amdgpu_vm_set_pasid (Chritian)

v3: drop amdgpu_vm_assert_locked (Chritian)

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4614
Fixes: 59e4405e9e ("drm/amdgpu: revert to old status lock handling v3")
Reviewed-by: Christian König <christian.koenig@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07 14:09:07 -04:00
Shaoyun Liu
8dbac5cf8b drm/amd/amdgpu: Fix the mes version that support inv_tlbs
MES version 0x83 is not stable to use the inv_tlbs API. Defer it to 0x84 vertsion.

Fixes: 85442bac84 ("drm/amd/amdgpu: Fix the mes version that support inv_tlbs")
Signed-off-by: Shaoyun Liu <shaoyun.liu@amd.com>
Reviewed-by: Michael Chen <michael.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07 14:09:06 -04:00
Mario Limonciello
c760bcda83 drm/amd: Check whether secure display TA loaded successfully
[Why]
Not all renoir hardware supports secure display.  If the TA is present
but the feature isn't supported it will fail to load or send commands.
This shows ERR messages to the user that make it seems like there is
a problem.

[How]
Check the resp_status of the context to see if there was an error
before trying to send any secure display commands.

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/1415
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07 14:09:06 -04:00
Philip Yang
58e6fc2fb9 drm/amdkfd: Fix kfd process ref leaking when userptr unmapping
kfd_lookup_process_by_pid hold the kfd process reference to ensure it
doesn't get destroyed while sending the segfault event to user space.

Calling kfd_lookup_process_by_pid as function parameter leaks the kfd
process refcount and miss the NULL pointer check if app process is
already destroyed.

Fixes: 2d274bf709 ("amd/amdkfd: Trigger segfault for early userptr unmmapping")
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07 14:09:06 -04:00
Heng Zhou
0c67342885 drm/amdgpu: Fix for GPU reset being blocked by KIQ I/O.
There is some probability that reset workqueue is blocked by KIQ I/O for 10+ seconds after gpu hangs.
So we need to add a in_reset check during each KIQ register poll.

Signed-off-by: Heng Zhou <Heng.Zhou@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-07 14:09:06 -04:00
Linus Torvalds
58809f614e drm next for 6.18-rc1
cross-subsystem:
 - i2c-hid: Make elan touch controllers power on after panel is enabled
 - dt bindings for STM32MP25 SoC
 - pci vgaarb: use screen_info helpers
 - rust pin-init updates
 - add MEI driver for late binding firmware update/load
 
 uapi:
 - add ioctl for reassigning GEM handles
 - provide boot_display attribute on boot-up devices
 
 core:
 - document DRM_MODE_PAGE_FLIP_EVENT
 - add vendor specific recovery method to drm device wedged uevent
 
 gem:
 - Simplify gpuvm locking
 
 ttm:
 - add interface to populate buffers
 
 sched:
 - Fix race condition in trace code
 
 atomic:
 - Reallow no-op async page flips
 
 display:
 - dp: Fix command length
 
 video:
 - Improve pixel-format handling for struct screen_info
 
 rust:
 - drop Opaque<> from ioctl args
 - Alloc:
 - BorrowedPage type and AsPageIter traits
 - Implement Vmalloc::to_page() and VmallocPageIter
 - DMA/Scatterlist:
 - Add dma::DataDirection and type alias for dma_addr_t
 - Abstraction for struct scatterlist and sg_table
 - DRM:
 - simplify use of generics
 - add DriverFile type alias
 - drop Object::SIZE
 - Rust:
 - pin-init tree merge
 - Various methods for AsBytes and FromBytes traits
 
 gpuvm:
 - Support madvice in Xe driver
 
 gpusvm:
 - fix hmm_pfn_to_map_order usage in gpusvm
 
 bridge:
 - Improve and fix ref counting on bridge management
 - cdns-dsi: Various improvements to mode setting
 - Support Solomon SSD2825 plus DT bindings
 - Support Waveshare DSI2DPI plus DT bindings
 - Support Content Protection property
 - display-connector: Improve DP display detection
 - Add support for Radxa Ra620 plus DT bindings
 - adv7511: Provide SPD and HDMI infoframes
 - it6505: Replace crypto_shash with sha()
 - synopsys: Add support for DW DPTX Controller plus DT bindings
 - adv7511: Write full Audio infoframe
 - ite6263: Support vendor-specific infoframes
 - simple: Add support for Realtek RTD2171 DP-to-HDMI plus DT bindings
 
 panel:
 - panel-edp: Support mt8189 Chromebooks; Support BOE NV140WUM-N64;
   Support SHP LQ134Z1; Fixes
 - panel-simple: Support Olimex LCD-OLinuXino-5CTS plus DT bindings
 - Support Samsung AMS561RA01
 - Support Hydis HV101HD1 plus DT bindings
 - ilitek-ili9881c: Refactor mode setting; Add support for Bestar
   BSD1218-A101KL68 LCD plus DT bindings
 - lvds: Add support for Ampire AMP19201200B5TZQW-T03 to DT bindings
 - edp: Add support for additonal mt8189 Chromebook panels
 - lvds: Add DT bindings for EDT ETML0700Z8DHA
 
 amdgpu:
 - add CRIU support for gem objects
 - RAS updates
 - VCN SRAM load fixes
 - EDID read fixes
 - eDP ALPM support
 - Documentation updates
 - Rework PTE flag generation
 - DCE6 fixes
 - VCN devcoredump cleanup
 - MMHUB client id fixes
 - VCN 5.0.1 RAS support
 - SMU 13.0.x updates
 - Expanded PCIe DPC support
 - Expanded VCN reset support
 - VPE per queue reset support
 - give kernel jobs unique id for tracing
 - pre-populate exported buffers
 - cyan skillfish updates
 - make vbios build number available in sysfs
 - userq updates
 - HDCP updates
 - support MMIO remap page as ttm pool
 - JPEG parser updates
 - DCE6 DC updates
 - use devm for i2c buses
 - GPUVM locking updates
 - Drop non-DC DCE11 code
 - improve fallback handling for pixel encoding
 
 amdkfd:
 - SVM/page migration fixes
 - debugfs fixes
 - add CRIO support for gem objects
 - SVM updates
 
 radeon:
 - use dev_warn_once in CS parsers
 
 xe:
 - add madvise interface
 - add DRM_IOCTL_XE_VM_QUERY_MEMORY_RANGE_ATTRS to query VMA count
   and memory attributes
 - drop L# bank mask reporting from media GT3 on Xe3+.
 - add SLPC power_profile sysfs interface
 - add configs attribs to add post/mid context-switch commands
 - handle firmware reported hardware errors notifying userspace with
   device wedged uevent
 - use same dir structure across sysfs/debugfs
 - cleanup and future proof vram region init
 - add G-states and PCI link states to debugfs
 - Add SRIOV support for CCS surfaces on Xe2+
 - Enable SRIOV PF mode by default on supported platforms
 - move flush to common code
 - extended core workarounds for Xe2/3
 - use DRM scheduler for delayed GT TLB invalidations
 - configs improvements and allow VF device enablement
 - prep work to expose mmio regions to userspace
 - VF migration support added
 - prepare GPU SVM for THP migration
 - start fixing XE_PAGE_SIZE vs PAGE_SIZE
 - add PSMI support for hw validation
 - resize VF bars to max possible size according to number of VFs
 - Ensure GT is in C0 during resume
 - pre-populate exported buffers
 - replace xe_hmm with gpusvm
 - add more SVM GT stats to debugfs
 - improve fake pci and WA kunnit handle for new platform testing
 - Test GuC to GuC comms to add debugging
 - use attribute groups to simplify sysfs registration
 - add Late Binding firmware code to interact with MEI
 
 i915:
 - apply multiple JSL/EHL/Gen7/Gen6 workarounds properly
 - protect against overflow in active_engine()
 - Use try_cmpxchg64() in __active_lookup()
 - include GuC registers in error state
 - get rid of dev->struct_mutex
 - iopoll: generalize read_poll_timout
 - lots more display refactoring
 - Reject HBR3 in any eDP Panel
 - Prune modes for YUV420
 - Display Wa fix, additions, and updates
 - DP: Fix 2.7 Gbps link training on g4x
 - DP: Adjust the idle pattern handling
 - DP: Shuffle the link training code a bit
 - Don't set/read the DSI C clock divider on GLK
 - Enable_psr kernel parameter changes
 - Type-C enabled/disconnected dp-alt sink
 - Wildcat Lake enabling
 - DP HDR updates
 - DRAM detection
 - wait PSR idle on dsb commit
 - Remove FBC modulo 4 restriction for ADL-P+
 - panic: refactor framebuffer allocation
 
 habanalabs:
 - debug/visibility improvements
 - vmalloc-backed coherent mmap support
 - HLDIO infrastructure
 
 nova-core:
 - various register!() macro improvements
 - minor vbios/firmware fixes/refactoring
 - advance firmware boot stages; process Booter and patch signatures
 - process GSP and GSP bootloader
 - Add r570.144 firmware bindings and update to it
 - Move GSP boot code to own module
 - Use new pin-init features to store driver's private data in a single
  allocation
 - Update ARef import from sync::aref
 
 nova-drm:
 - Update ARef import from sync::aref
 
 tyr:
 - initial driver skeleton for a rust driver for ARM Mali GPUs
 - capable of powering up, query metadata and provide it to userspace.
 
 msm:
 - GPU and Core:
 - in DT bindings describe clocks per GPU type
 - GMU bandwidth voting for x1-85
 - a623/a663 speedbins
 - cleanup some remaining no-iommu leftovers after VM_BIND conversion
 - fix GEM obj 32b size truncation
 - add missing VM_BIND param validation
 - IFPC for x1-85 and a750
 - register xml and gen_header.py sync from mesa
 - Display:
 - add missing bindings for display on SC8180X
 - added DisplayPort MST bindings
 - conversion from round_rate() to determine_rate()
 
 amdxdna:
 - add IOCTL_AMDXDNA_GET_ARRAY
 - support user space allocated buffers
 - streamline PM interfaces
 - Refactoring wrt. hardware contexts
 - improve error reporting
 
 nouveau:
 - use GSP firmware by default
 - improve error reporting
 - Pre-populate exported buffers
 
 ast:
 - Clean up detection of DRAM config
 
 exynos:
 - add DSIM bridge driver support for Exynos7870
 - Document Exynos7870 DSIM compatible in dt-binding
 
 panthor:
 - Print task/pid on errors
 - Add support for Mali G710, G510, G310, Gx15, Gx20, Gx25
 - Improve cache flushing
 - Fail VM bind if BO has offset
 
 renesas:
 - convert to RUNTIME_PM_OPS
 
 rcar-du:
 - Make number of lanes configurable
 - Use RUNTIME_PM_OPS
 - Add support for DSI commands
 
 rocket:
 - Add driver for Rockchip NPU plus DT bindings
 - Use kfree() and sizeof() correctly
 - Test DMA status
 
 rockchip:
 - dsi2: Add support for RK3576 plus DT bindings
 - Add support for RK3588 DPTX output
 
 tidss:
 - Use crtc_ fields for programming display mode
 - Remove other drivers from aperture
 
 pixpaper:
 - Add support for Mayqueen Pixpaper plus DT bindings
 
 v3d:
 - Support querying nubmer of GPU resets for KHR_robustness
 
 stm:
 - Clean up logging
 - ltdc: Add support support for STM32MP257F-EV1 plus DT bindings
 
 sitronix:
 - st7571-i2c: Add support for inverted displays and 2-bit grayscale
 
 tidss:
 - Convert to kernel's FIELD_ macros
 
 vesadrm:
 - Support 8-bit palette mode
 
 imagination:
 - Improve power management
 - Add support for TH1520 GPU
 - Support Risc-V architectures
 
 v3d:
 - Improve job management and locking
 
 vkms:
 - Support variants of ARGB8888, ARGB16161616, RGB565, RGB888 and P01x
 - Spport YUV with 16-bit components
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmjcpjkACgkQDHTzWXnE
 hr7Q7g/5AcxXqLUx7wvmDga9TpzIjDD+C+MOt568RpFQ9cYprI+/86ma7ELCpuNe
 dVgeobxQb/jyhf4acdBU+t5aZz+j8VPhPtIPrPY2kOVDuL1NfeQNS8VmGNpFhR+0
 6hqVrtfvbYdLBrAHrU/V/RwZlBJvI/D/I2QGuvZZwWzCBgYd4u4bGuRyBCvGDxOD
 CTPaEqYyzjvpVuzu7AGQk655WkZQnyPmiezIl2lit1meEMMMv80HePkyWHclZo7Q
 hMqsEasSp5w5Q5EpYqVr1z5IdBAV1O53oor9W573J3kEoB4o1zEsTPfLO4N1dgXo
 bfvc24uW3zyChWY2hWyRKvOzvAoClnjfY6whv9NRP0Qi4UjzhLlNOpmhm9cst/J+
 uj2Nn8UJtyvFJbTmDvoocpgdhq2mkGKdIVhVQ6tG7PjihFmyQRF7PJZjb+0Vee7L
 53F0c4d6HiBI4DHa+lH6fgQUBspIvSfmcnR0ACg29NByib+JEoPSPb4ET+uZ8lLd
 IbQvNiCdnUduYDCKfo5ea/FesP8AXy1KfSa+z7oEEFYHbbkc7PSztUagEyZdS/yS
 FnnYqmo/DidmyM4nxDQUII+UDqjng7fo+l4BzIhL12pR693KzCf0mexMr6SA24ny
 gasN97923OTle1J9xrPrKavkx6WjswZCvOaG7ZbnJB47ydJVu5w=
 =ZVKY
 -----END PGP SIGNATURE-----

Merge tag 'drm-next-2025-10-01' of https://gitlab.freedesktop.org/drm/kernel

Pull drm updates from Dave Airlie:
 "cross-subsystem:
   - i2c-hid: Make elan touch controllers power on after panel is
     enabled
   - dt bindings for STM32MP25 SoC
   - pci vgaarb: use screen_info helpers
   - rust pin-init updates
   - add MEI driver for late binding firmware update/load

  uapi:
   - add ioctl for reassigning GEM handles
   - provide boot_display attribute on boot-up devices

  core:
   - document DRM_MODE_PAGE_FLIP_EVENT
   - add vendor specific recovery method to drm device wedged uevent

  gem:
   - Simplify gpuvm locking

  ttm:
   - add interface to populate buffers

  sched:
   - Fix race condition in trace code

  atomic:
   - Reallow no-op async page flips

  display:
   - dp: Fix command length

  video:
   - Improve pixel-format handling for struct screen_info

  rust:
   - drop Opaque<> from ioctl args
   - Alloc:
       - BorrowedPage type and AsPageIter traits
       - Implement Vmalloc::to_page() and VmallocPageIter
   - DMA/Scatterlist:
       - Add dma::DataDirection and type alias for dma_addr_t
       - Abstraction for struct scatterlist and sg_table
   - DRM:
       - simplify use of generics
       - add DriverFile type alias
       - drop Object::SIZE
   - Rust:
       - pin-init tree merge
       - Various methods for AsBytes and FromBytes traits

  gpuvm:
   - Support madvice in Xe driver

  gpusvm:
   - fix hmm_pfn_to_map_order usage in gpusvm

  bridge:
   - Improve and fix ref counting on bridge management
   - cdns-dsi: Various improvements to mode setting
   - Support Solomon SSD2825 plus DT bindings
   - Support Waveshare DSI2DPI plus DT bindings
   - Support Content Protection property
   - display-connector: Improve DP display detection
   - Add support for Radxa Ra620 plus DT bindings
   - adv7511: Provide SPD and HDMI infoframes
   - it6505: Replace crypto_shash with sha()
   - synopsys: Add support for DW DPTX Controller plus DT bindings
   - adv7511: Write full Audio infoframe
   - ite6263: Support vendor-specific infoframes
   - simple: Add support for Realtek RTD2171 DP-to-HDMI plus DT bindings

  panel:
   - panel-edp: Support mt8189 Chromebooks; Support BOE NV140WUM-N64;
     Support SHP LQ134Z1; Fixes
   - panel-simple: Support Olimex LCD-OLinuXino-5CTS plus DT bindings
   - Support Samsung AMS561RA01
   - Support Hydis HV101HD1 plus DT bindings
   - ilitek-ili9881c: Refactor mode setting; Add support for Bestar
     BSD1218-A101KL68 LCD plus DT bindings
   - lvds: Add support for Ampire AMP19201200B5TZQW-T03 to DT bindings
   - edp: Add support for additonal mt8189 Chromebook panels
   - lvds: Add DT bindings for EDT ETML0700Z8DHA

  amdgpu:
   - add CRIU support for gem objects
   - RAS updates
   - VCN SRAM load fixes
   - EDID read fixes
   - eDP ALPM support
   - Documentation updates
   - Rework PTE flag generation
   - DCE6 fixes
   - VCN devcoredump cleanup
   - MMHUB client id fixes
   - VCN 5.0.1 RAS support
   - SMU 13.0.x updates
   - Expanded PCIe DPC support
   - Expanded VCN reset support
   - VPE per queue reset support
   - give kernel jobs unique id for tracing
   - pre-populate exported buffers
   - cyan skillfish updates
   - make vbios build number available in sysfs
   - userq updates
   - HDCP updates
   - support MMIO remap page as ttm pool
   - JPEG parser updates
   - DCE6 DC updates
   - use devm for i2c buses
   - GPUVM locking updates
   - Drop non-DC DCE11 code
   - improve fallback handling for pixel encoding

  amdkfd:
   - SVM/page migration fixes
   - debugfs fixes
   - add CRIO support for gem objects
   - SVM updates

  radeon:
   - use dev_warn_once in CS parsers

  xe:
   - add madvise interface
   - add DRM_IOCTL_XE_VM_QUERY_MEMORY_RANGE_ATTRS to query VMA count
     and memory attributes
   - drop L# bank mask reporting from media GT3 on Xe3+.
   - add SLPC power_profile sysfs interface
   - add configs attribs to add post/mid context-switch commands
   - handle firmware reported hardware errors notifying userspace with
     device wedged uevent
   - use same dir structure across sysfs/debugfs
   - cleanup and future proof vram region init
   - add G-states and PCI link states to debugfs
   - Add SRIOV support for CCS surfaces on Xe2+
   - Enable SRIOV PF mode by default on supported platforms
   - move flush to common code
   - extended core workarounds for Xe2/3
   - use DRM scheduler for delayed GT TLB invalidations
   - configs improvements and allow VF device enablement
   - prep work to expose mmio regions to userspace
   - VF migration support added
   - prepare GPU SVM for THP migration
   - start fixing XE_PAGE_SIZE vs PAGE_SIZE
   - add PSMI support for hw validation
   - resize VF bars to max possible size according to number of VFs
   - Ensure GT is in C0 during resume
   - pre-populate exported buffers
   - replace xe_hmm with gpusvm
   - add more SVM GT stats to debugfs
   - improve fake pci and WA kunnit handle for new platform testing
   - Test GuC to GuC comms to add debugging
   - use attribute groups to simplify sysfs registration
   - add Late Binding firmware code to interact with MEI

  i915:
   - apply multiple JSL/EHL/Gen7/Gen6 workarounds properly
   - protect against overflow in active_engine()
   - Use try_cmpxchg64() in __active_lookup()
   - include GuC registers in error state
   - get rid of dev->struct_mutex
   - iopoll: generalize read_poll_timout
   - lots more display refactoring
   - Reject HBR3 in any eDP Panel
   - Prune modes for YUV420
   - Display Wa fix, additions, and updates
   - DP: Fix 2.7 Gbps link training on g4x
   - DP: Adjust the idle pattern handling
   - DP: Shuffle the link training code a bit
   - Don't set/read the DSI C clock divider on GLK
   - Enable_psr kernel parameter changes
   - Type-C enabled/disconnected dp-alt sink
   - Wildcat Lake enabling
   - DP HDR updates
   - DRAM detection
   - wait PSR idle on dsb commit
   - Remove FBC modulo 4 restriction for ADL-P+
   - panic: refactor framebuffer allocation

  habanalabs:
   - debug/visibility improvements
   - vmalloc-backed coherent mmap support
   - HLDIO infrastructure

  nova-core:
   - various register!() macro improvements
   - minor vbios/firmware fixes/refactoring
   - advance firmware boot stages; process Booter and patch signatures
   - process GSP and GSP bootloader
   - Add r570.144 firmware bindings and update to it
   - Move GSP boot code to own module
   - Use new pin-init features to store driver's private data in a
     single allocation
   - Update ARef import from sync::aref

  nova-drm:
   - Update ARef import from sync::aref

  tyr:
   - initial driver skeleton for a rust driver for ARM Mali GPUs
   - capable of powering up, query metadata and provide it to userspace.

  msm:
   - GPU and Core:
      - in DT bindings describe clocks per GPU type
      - GMU bandwidth voting for x1-85
      - a623/a663 speedbins
      - cleanup some remaining no-iommu leftovers after VM_BIND conversion
      - fix GEM obj 32b size truncation
      - add missing VM_BIND param validation
      - IFPC for x1-85 and a750
      - register xml and gen_header.py sync from mesa
   - Display:
      - add missing bindings for display on SC8180X
      - added DisplayPort MST bindings
      - conversion from round_rate() to determine_rate()

  amdxdna:
   - add IOCTL_AMDXDNA_GET_ARRAY
   - support user space allocated buffers
   - streamline PM interfaces
   - Refactoring wrt. hardware contexts
   - improve error reporting

  nouveau:
   - use GSP firmware by default
   - improve error reporting
   - Pre-populate exported buffers

  ast:
   - Clean up detection of DRAM config

  exynos:
   - add DSIM bridge driver support for Exynos7870
   - Document Exynos7870 DSIM compatible in dt-binding

  panthor:
   - Print task/pid on errors
   - Add support for Mali G710, G510, G310, Gx15, Gx20, Gx25
   - Improve cache flushing
   - Fail VM bind if BO has offset

  renesas:
   - convert to RUNTIME_PM_OPS

  rcar-du:
   - Make number of lanes configurable
   - Use RUNTIME_PM_OPS
   - Add support for DSI commands

  rocket:
   - Add driver for Rockchip NPU plus DT bindings
   - Use kfree() and sizeof() correctly
   - Test DMA status

  rockchip:
   - dsi2: Add support for RK3576 plus DT bindings
   - Add support for RK3588 DPTX output

  tidss:
   - Use crtc_ fields for programming display mode
   - Remove other drivers from aperture

  pixpaper:
   - Add support for Mayqueen Pixpaper plus DT bindings

  v3d:
   - Support querying nubmer of GPU resets for KHR_robustness

  stm:
   - Clean up logging
   - ltdc: Add support support for STM32MP257F-EV1 plus DT bindings

  sitronix:
   - st7571-i2c: Add support for inverted displays and 2-bit grayscale

  tidss:
   - Convert to kernel's FIELD_ macros

  vesadrm:
   - Support 8-bit palette mode

  imagination:
   - Improve power management
   - Add support for TH1520 GPU
   - Support Risc-V architectures

  v3d:
   - Improve job management and locking

  vkms:
   - Support variants of ARGB8888, ARGB16161616, RGB565, RGB888 and P01x
   - Spport YUV with 16-bit components"

* tag 'drm-next-2025-10-01' of https://gitlab.freedesktop.org/drm/kernel: (1455 commits)
  drm/amd: Add name to modes from amdgpu_connector_add_common_modes()
  drm/amd: Drop some common modes from amdgpu_connector_add_common_modes()
  drm/amdgpu: update MODULE_PARM_DESC for freesync_video
  drm/amd: Use dynamic array size declaration for amdgpu_connector_add_common_modes()
  drm/amd/display: Share dce100_validate_global with DCE6-8
  drm/amd/display: Share dce100_validate_bandwidth with DCE6-8
  drm/amdgpu: Fix fence signaling race condition in userqueue
  amd/amdkfd: enhance kfd process check in switch partition
  amd/amdkfd: resolve a race in amdgpu_amdkfd_device_fini_sw
  drm/amd/display: Reject modes with too high pixel clock on DCE6-10
  drm/amd: Drop unnecessary check in amdgpu_connector_add_common_modes()
  drm/amd/display: Only enable common modes for eDP and LVDS
  drm/amdgpu: remove the redeclaration of variable i
  drm/amdgpu/userq: assign an error code for invalid userq va
  drm/amdgpu: revert "rework reserved VMID handling" v2
  drm/amdgpu: remove leftover from enforcing isolation by VMID
  drm/amdgpu: Add fallback to pipe reset if KCQ ring reset fails
  accel/habanalabs: add Infineon version check
  accel/habanalabs/gaudi2: read preboot status after recovering from dirty state
  accel/habanalabs: add HL_GET_P_STATE passthrough type
  ...
2025-10-02 12:47:25 -07:00
Rafael J. Wysocki
f58f86df6a Merge branches 'pm-core', 'pm-runtime' and 'pm-sleep'
Merge changes related to system sleep and runtime PM framework for
6.18-rc1:

 - Annotate loops walking device links in the power management core
   code as _srcu and add macros for walking device links to reduce the
   likelihood of coding mistakes related to them (Rafael Wysocki)

 - Document time units for *_time functions in the runtime PM API (Brian
   Norris)

 - Clear power.must_resume in noirq suspend error path to avoid resuming
   a dependant device under a suspended parent or supplier (Rafael
   Wysocki)

 - Fix GFP mask handling during hybrid suspend and make the amdgpu
   driver handle hybrid suspend correctly (Mario Limonciello, Rafael
   Wysocki)

 - Fix GFP mask handling after aborted hibernation in platform mode and
   combine exit paths in power_down() to avoid code duplication (Rafael
   Wysocki)

 - Use vmalloc_array() and vcalloc() in the hibernation core to avoid
   open-coded size computations (Qianfeng Rong)

 - Fix typo in hibernation core code comment (Li Jun)

 - Call pm_wakeup_clear() in the same place where other functions that do
   bookkeeping prior to suspend_prepare() are called (Samuel Wu)

* pm-core:
  PM: core: Add two macros for walking device links
  PM: core: Annotate loops walking device links as _srcu

* pm-runtime:
  PM: runtime: Documentation: ABI: Document time units for *_time

* pm-sleep:
  PM: hibernate: Combine return paths in power_down()
  PM: hibernate: Restrict GFP mask in power_down()
  PM: hibernate: Fix pm_hibernation_mode_is_suspend() build breakage
  drm/amd: Fix hybrid sleep
  PM: hibernate: Add pm_hibernation_mode_is_suspend()
  PM: hibernate: Fix hybrid-sleep
  PM: sleep: core: Clear power.must_resume in noirq suspend error path
  PM: sleep: Make pm_wakeup_clear() call more clear
  PM: hibernate: Fix typo in memory bitmaps description comment
  PM: hibernate: Use vmalloc_array() and vcalloc() to improve code
2025-09-29 12:54:01 +02:00
Mario Limonciello
df2ba57094 drm/amd: Add name to modes from amdgpu_connector_add_common_modes()
[Why]
When DC adds common modes it adds modes with a string to match what
they are. Non-DC doesn't. This can be inconsistent when turning on/off
DC support.

[How]
Add a name member to common_modes[] and copy it into the drm display
mode.

Cc: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Link: https://lore.kernel.org/r/20250924161624.1975819-6-mario.limonciello@amd.com
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-25 15:54:22 -04:00
Mario Limonciello
6d622755bc drm/amd: Drop some common modes from amdgpu_connector_add_common_modes()
[Why]
DC and non-DC codepaths have different sets of common modes that are
added for eDP and LVDS cases. This can cause different behaviors for
turning on DC on hardware that can support both.

[How]
Drop extra modes from amdgpu_connector_add_common_modes() not present
in amdgpu_dm_connector_add_common_modes().

Cc: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250924161624.1975819-5-mario.limonciello@amd.com
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-25 15:54:12 -04:00
Alex Deucher
dbf2341569 drm/amdgpu: update MODULE_PARM_DESC for freesync_video
To better describe what it does.

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3756
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-25 15:54:07 -04:00
Mario Limonciello
123a1750c5 drm/amd: Use dynamic array size declaration for amdgpu_connector_add_common_modes()
[Why]
Adding or removing a mode from common_modes[] can be fragile if a user
forgot to update the for loop boundaries.

[How]
Use ARRAY_SIZE() to detect size of the array and use that instead.

Cc: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Link: https://lore.kernel.org/r/20250924161624.1975819-4-mario.limonciello@amd.com
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-25 15:53:55 -04:00
Jesse.Zhang
b8ae2640f9 drm/amdgpu: Fix fence signaling race condition in userqueue
This commit fixes a potential race condition in the userqueue fence
signaling mechanism by replacing dma_fence_is_signaled_locked() with
dma_fence_is_signaled().

The issue occurred because:
1. dma_fence_is_signaled_locked() should only be used when holding
   the fence's individual lock, not just the fence list lock
2. Using the locked variant without the proper fence lock could lead
   to double-signaling scenarios:
   - Hardware completion signals the fence
   - Software path also tries to signal the same fence

By using dma_fence_is_signaled() instead, we properly handle the
locking hierarchy and avoid the race condition while still maintaining
the necessary synchronization through the fence_list_lock.

v2: drop the comment (Christian)

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-25 15:53:23 -04:00
Mario Limonciello
210844d2c0 drm/amd: Drop unnecessary check in amdgpu_connector_add_common_modes()
[Why]
amdgpu_connector_add_common_modes() has a check for the width and height
of common modes being too small, but the array of common_modes[] has fixed
values.  The check is dead code.

[How]
Drop unnecessary check.

Cc: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250924161624.1975819-3-mario.limonciello@amd.com
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-25 15:50:58 -04:00
Sunil Khatri
4e3b45d7b6 drm/amdgpu: remove the redeclaration of variable i
Variable "i" has been redeclared as integer later in the function
which is wrong and not serving any purpose.

Fixes: 899fbde146 ("drm/amdgpu: replace get_user_pages with HMM mirror helpers")
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-25 15:46:35 -04:00
Prike Liang
883bd89d00 drm/amdgpu/userq: assign an error code for invalid userq va
It should return an error code if userq VA validation fails.

Fixes: 9e46b8bb05 ("drm/amdgpu: validate userq buffer virtual address and size")
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-25 15:40:18 -04:00
Christian König
90e09ea4cf drm/amdgpu: revert "rework reserved VMID handling" v2
This reverts commit e44a0fe630.

Initially we used VMID reservation to enforce isolation between
processes. That has now been replaced by proper fence handling.

Both OpenGL, RADV and ROCm developers requested a way to reserve a VMID
for SPM, so restore that approach by reverting back to only allowing a
single process to use the reserved VMID.

Only compile tested for now.

v2: use -ENOENT instead of -EINVAL if VMID is not available

Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-25 15:39:00 -04:00
Christian König
66f3883dbc drm/amdgpu: remove leftover from enforcing isolation by VMID
Initially we enforced isolation by reserving a VMID, but that practice
was now removed.

Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-25 15:38:54 -04:00
Jesse.Zhang
7469567d88 drm/amdgpu: Add fallback to pipe reset if KCQ ring reset fails
Add a fallback mechanism to attempt pipe reset when KCQ reset
fails to recover the ring. After performing the KCQ reset and
queue remapping, test the ring functionality. If the ring test
fails, initiate a pipe reset as an additional recovery step.

v2: fix the typo (Lijo)
v3: try pipeline reset when kiq mapping fails (Lijo)

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-25 15:38:48 -04:00
Mario Limonciello (AMD)
0a6e9e098f drm/amd: Fix hybrid sleep
[Why]
commit 530694f54d ("drm/amdgpu: do not resume device in thaw for
normal hibernation") optimized the flow for systems that are going
into S4 where the power would be turned off.  Basically the thaw()
callback wouldn't resume the device if the hibernation image was
successfully created since the system would be powered off.

This however isn't the correct flow for a system entering into
s0i3 after the hibernation image is created.  Some of the amdgpu
callbacks have different behavior depending upon the intended
state of the suspend.

[How]
Use pm_hibernation_mode_is_suspend() as an input to decide whether
to run resume during thaw() callback.

Reported-by: Ionut Nechita <ionut_n2001@yahoo.com>
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4573
Tested-by: Ionut Nechita <ionut_n2001@yahoo.com>
Fixes: 530694f54d ("drm/amdgpu: do not resume device in thaw for normal hibernation")
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Tested-by: Kenneth Crudup <kenny@panix.com>
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
Cc: 6.17+ <stable@vger.kernel.org> # 6.17+: 495c8d3503: PM: hibernate: Add pm_hibernation_mode_is_suspend()
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-09-25 21:37:38 +02:00
Jesse.Zhang
5886090032 drm/amdgpu: Move VCN reset mask setup to late_init for VCN 5.0.1
This patch moves the initialization of the VCN supported_reset mask from
sw_init to a new late_init function for VCN 5.0.1. The change ensures
that all necessary hardware and firmware initialization is complete
before determining the supported reset types.

Key changes:
- Added vcn_v5_0_1_late_init() function to handle late initialization
- Moved supported_reset mask setup from sw_init to late_init
- Added check for per-queue reset support via amdgpu_dpm_reset_vcn_is_supported()
- Updated ip_funcs to use the new late_init function

This change helps ensure proper reset behavior by waiting until all
dependencies are initialized before determining available reset types.

Reviewed-by: Sonny Jiang <sonny.jiang@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Ruili Ji <ruiliji2@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-23 10:41:37 -04:00
Jesse.Zhang
dc704458dd drm/amdgpu: Add ring reset support for VCN v5.0.1
Implement the ring reset callback for VCN v5.0.1 to properly handle
hardware recovery when encountering GPU hangs. The new functionality:

1. Adds vcn_v5_0_1_ring_reset() function that:
   - Prepares for reset using amdgpu_ring_reset_helper_begin()
   - Performs VCN instance reset via amdgpu_dpm_reset_vcn()
   - Re-initializes hardware through vcn_v5_0_1_hw_init_inst()
   - Restarts DPG mode with vcn_v5_0_1_start_dpg_mode()
   - Completes reset with amdgpu_ring_reset_helper_end()

2. Hooks the reset function into the unified ring functions via:
   - Adding .reset = vcn_v5_0_1_ring_reset to vcn_v5_0_1_unified_ring_vm_funcs

3. Maintains existing behavior for SR-IOV VF cases by checking RRMT status

This provides proper hardware recovery capabilities for VCN 5.0.1 IP block
during fault conditions, matching functionality available in other VCN versions.

v2: Remove the RRMT_ENABLED cap setting in the reset function
    and replace adev->vcn.inst[ring->me].indirect_sram with vinst->indirect_sram (Lijo)

Reviewed-by: Sonny Jiang <sonny.jiang@amd.com>
Suggested-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Ruili Ji <ruiliji2@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-23 10:41:27 -04:00
Jesse.Zhang
eb6910cdaa drm/amdgpu: Refactor VCN v5.0.1 HW init into separate instance function
Split the per-instance initialization code from vcn_v5_0_1_hw_init()
into a new vcn_v5_0_1_hw_init_inst() function. This improves code
organization by:

1. Separating the instance-specific initialization logic
2. Making the main init function more readable
3. Following the pattern used in queue reset

The SR-IOV specific initialization remains in the main function since
it has different requirements.

Reviewed-by: Sonny Jiang <sonny.jiang@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Ruili Ji <ruiliji2@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-23 10:41:11 -04:00
Rahul Kumar
86a54e45fd drm/amdgpu: Use kmalloc_array() instead of kmalloc()
Documentation/process/deprecated.rst recommends against the use of
kmalloc with dynamic size calculations due to the risk of overflow and
smaller allocation being made than the caller was expecting.

Replace kmalloc() with kmalloc_array() in amdgpu_amdkfd_gfx_v10.c,
amdgpu_amdkfd_gfx_v10_3.c, amdgpu_amdkfd_gfx_v11.c and
amdgpu_amdkfd_gfx_v12.c to make the intended allocation size clearer
and avoid potential overflow issues.

Suggested-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Rahul Kumar <rk0006818@gmail.com>
Signed-off-by: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-23 10:35:54 -04:00
Sonny Jiang
854b9ab637 drm/amdgpu: Update amdgpu_vcn5_fw_shared for vcn_5_0_1
Align vcn5_fw_shared structure with FW

Signed-off-by: Sonny Jiang <sonny.jiang@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-23 10:22:51 -04:00
Mario Limonciello
1fb710793c drm/amdgpu: Enable MES lr_compute_wa by default
The MES set resources packet has an optional bit 'lr_compute_wa'
which can be used for preventing MES hangs on long compute jobs.

Set this bit by default.

Co-developed-by: Yifan Zhang <yifan1.zhang@amd.com>
Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-23 10:22:38 -04:00
Sunil Khatri
c5b3cc417b drm/amdgpu: use hmm_pfns instead of array of pages
we dont need to allocate local array of pages to hold
the pages returned by the hmm, instead we could use
the hmm_range structure itself to get to hmm_pfn
and get the required pages directly.

This avoids call to alloc/free quite a lot.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Acked-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-23 10:22:31 -04:00
Lijo Lazar
b29c22b8da drm/amdgpu: Fix vbios build number parsing logic
It's not necessary that the build string and atom header section has a
difference of 32 bytes. Use the remaining bytes in the section as copy
limit.

Fixes: d6fa802661 ("drm/amdgpu: Add vbios build number interface")
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-09-23 10:21:09 -04:00
Dave Airlie
342f141ba9 amd-drm-next-6.18-2025-09-19:
amdgpu:
 - Fence drv clean up fix
 - DPC fixes
 - Misc display fixes
 - Support the MMIO remap page as a ttm pool
 - JPEG parser updates
 - UserQ updates
 - VCN ctx handling fixes
 - Documentation updates
 - Misc cleanups
 - SMU 13.0.x updates
 - SI DPM updates
 - GC 11.x cleaner shader updates
 - DMCUB updates
 - DML fixes
 - Improve fallback handling for pixel encoding
 - VCN reset improvements
 - DCE6 DC updates
 - DSC fixes
 - Use devm for i2c buses
 - GPUVM locking updates
 - GPUVM documentation improvements
 - Drop non-DC DCE11 code
 - S0ix fixes
 - Backlight fix
 - SR-IOV fixes
 
 amdkfd:
 - SVM updates
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQgO5Idg2tXNTSZAr293/aFa7yZ2AUCaM2u5AAKCRC93/aFa7yZ
 2PcOAQC0US+uIJxkp1gmJxlpWsceD8GhonYp456gSx743XiMfAD/W8d4VKKicWMe
 DDzSumgBgjEG7YpuzZJoaExNeS/LWAg=
 =oKqL
 -----END PGP SIGNATURE-----

Merge tag 'amd-drm-next-6.18-2025-09-19' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

amd-drm-next-6.18-2025-09-19:

amdgpu:
- Fence drv clean up fix
- DPC fixes
- Misc display fixes
- Support the MMIO remap page as a ttm pool
- JPEG parser updates
- UserQ updates
- VCN ctx handling fixes
- Documentation updates
- Misc cleanups
- SMU 13.0.x updates
- SI DPM updates
- GC 11.x cleaner shader updates
- DMCUB updates
- DML fixes
- Improve fallback handling for pixel encoding
- VCN reset improvements
- DCE6 DC updates
- DSC fixes
- Use devm for i2c buses
- GPUVM locking updates
- GPUVM documentation improvements
- Drop non-DC DCE11 code
- S0ix fixes
- Backlight fix
- SR-IOV fixes

amdkfd:
- SVM updates

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250919193354.2989255-1-alexander.deucher@amd.com
2025-09-22 08:45:51 +10:00