3
0
Fork 0
forked from mirrors/linux
Commit graph

1350 commits

Author SHA1 Message Date
Alex Deucher
c99a2e7ae2 drm/amdkfd: drop IOMMUv2 support
Now that we use the dGPU path for all APUs, drop the
IOMMUv2 support.

v2: drop the now unused queue manager functions for gfx7/8 APUs

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Tested-by: Mike Lothian <mike@fireburn.co.uk>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-11 14:47:25 -04:00
Emily Deng
f734b2133c drm/amdgpu/irq: Move irq resume to the beginning
Need to move irq resume to the beginning of reset sriov, or if
one interrupt occurs before irq resume, then the irq won't work anymore.

Signed-off-by: Emily Deng <Emily.Deng@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-09 09:46:04 -04:00
Lijo Lazar
7957ec80ef drm/amdgpu: Add FRU sysfs nodes only if needed
Create sysfs nodes for FRU data only if FRU data is available. Move the
logic to FRU specific file.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-09 09:45:55 -04:00
Mario Limonciello
70e64c4d52 drm/amd: Disable S/G for APUs when 64GB or more host memory
Users report a white flickering screen on multiple systems that
is tied to having 64GB or more memory.  When S/G is enabled pages
will get pinned to both VRAM carve out and system RAM leading to
this.

Until it can be fixed properly, disable S/G when 64GB of memory or
more is detected.  This will force pages to be pinned into VRAM.
This should fix white screen flickers but if VRAM pressure is
encountered may lead to black screens.  It's a trade-off for now.

Fixes: 81d0bcf990 ("drm/amdgpu: make display pinning more flexible (v2)")
Cc: Hamza Mahfooz <Hamza.Mahfooz@amd.com>
Cc: Roman Li <roman.li@amd.com>
Cc: <stable@vger.kernel.org> # 6.1.y: bf0207e172 ("drm/amdgpu: add S/G display parameter")
Cc: <stable@vger.kernel.org> # 6.4.y
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2735
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2354
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-07 16:32:54 -04:00
Srinivasan Shanmugam
b8920e1e0d drm/amdgpu: Fix ENOSYS means 'invalid syscall nr' in amdgpu_device.c
ENOSYS should be used for nonexistent syscalls only, replace ENOSYS with
EOPNOTSUPP for reset handlers that are not implemented for respective ASIC.

WARNING: ENOSYS means 'invalid syscall nr' and nothing else
+       if (r == -ENOSYS)

WARNING: ENOSYS means 'invalid syscall nr' and nothing else
+       if (r == -ENOSYS)

And other following style fixes in amdgpu_device.c:

WARNING: Symbolic permissions 'S_IRUGO' are not preferred. Consider using octal permissions '0444'.
WARNING: Block comments should align the * on each line
WARNING: Missing a blank line after declarations
WARNING: braces {} are not necessary for single statement blocks

Cc: Lijo Lazar <lijo.lazar@amd.com>
Cc: Kent Russell <kent.russell@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-07-27 14:59:29 -04:00
Horace Chen
83f24a8f05 drm/amdgpu: set sw state to gfxoff after SR-IOV reset
[Why]
Current SR-IOV will not set GC to off state, while it is a real
GC hard reset. Whthout GFX off flag, driver may do gfxhub invalidation
before firmware load and gfxhub gart enable. This operation may cause
CP to become busy because GC is not in the right state for invalidation.

[How]
Add a function for SR-IOV to clean up some sw state before recover. Set
adev->gfx.is_poweron to false to prevent gfxhub invalidation before gfx
firmware autoload complete.

Signed-off-by: Horace Chen <horace.chen@amd.com>
Reviewed-by: HaiJun Chang <HaiJun.Chang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-07-25 13:35:23 -04:00
Victor Lu
8ed49dd1d3 drm/amdgpu: Add RLCG interface driver implementation for gfx v9.4.3 (v3)
Add RLCG interface support for gfx v9.4.3 and multiple XCCs.
Do not enable it yet.

v2: Fix amdgpu_rlcg_reg_access_ctrl init, add support for multiple XCCs
    in amdgpu_mm_wreg_mmio_rlc

v3: Use GET_INST() when indexing amdgpu_rlcg_reg_access_ctrl

Signed-off-by: Victor Lu <victorchengchi.lu@amd.com>
Reviewed-by: Zhigang Luo <zhigang.luo@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-07-18 11:16:41 -04:00
Shashank Sharma
43c064db65 drm/amdgpu: create a new file for doorbell manager
This patch:
- creates a new file for doorbell management.
- moves doorbell code from amdgpu_device.c to this file.

V2:
 - remove doc from function declaration (Christian)
 - remove 'device' from function names to make it consistent (Alex)
 - add SPDX license identifier (Luben)

V3:
 - change license to MIT license(Christian)

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Shashank Sharma <shashank.sharma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-07-18 11:12:08 -04:00
Mario Limonciello
5d1eb4c4c8 drm/amd: Move helper for dynamic speed switch check out of smu13
This helper is used for checking if the connected host supports
the feature, it can be moved into generic code to be used by other
smu implementations as well.

Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-07-12 11:12:10 -04:00
Arnd Bergmann
822130b5e8 drm/amdgpu: avoid integer overflow warning in amdgpu_device_resize_fb_bar()
On 32-bit architectures comparing a resource against a value larger than
U32_MAX can cause a warning:

drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:1344:18: error: result of comparison of constant 4294967296 with expression of type 'resource_size_t' (aka 'unsigned int') is always false [-Werror,-Wtautological-constant-out-of-range-compare]
                    res->start > 0x100000000ull)
                    ~~~~~~~~~~ ^ ~~~~~~~~~~~~~~

As gcc does not warn about this in dead code, add an IS_ENABLED() check at
the start of the function. This will always return success but not actually resize
the BAR on 32-bit architectures without high memory, which is exactly what
we want here, as the driver can fall back to bank switching the VRAM
access.

Fixes: 31b8adab32 ("drm/amdgpu: require a root bus window above 4GB for BAR resize")
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-07-12 10:57:41 -04:00
Mario Limonciello
521289d2a2 drm/amd: Use attribute groups for PSP flashing attributes
Individually creating attributes can be racy, instead make attributes
using attribute groups and control their visibility with an is_visible
callback to only show when using appropriate products.

v2: squash in fix for PSP 13.0.10

Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-07-07 13:51:46 -04:00
Alex Deucher
50a7c8765c drm/amdgpu: enable mcbp by default on gfx9
It's required for high priority queues.

Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2535
Reviewed-and-tested-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-30 13:12:15 -04:00
Alex Deucher
02ff519e99 drm/amdgpu: make mcbp a per device setting
So we can selectively enable it on certain devices.  No
intended functional change.

Reviewed-and-tested-by: Jiadong Zhu <Jiadong.Zhu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-30 13:12:14 -04:00
Lijo Lazar
184d838482 drm/amdgpu: Add vbios attribute only if supported
Not all devices carry VBIOS version information. Add the device
attribute only if supported.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Le Ma <le.ma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-23 15:35:26 -04:00
Srinivasan Shanmugam
2e9fee9b8e drm/amdgpu: Fix up kdoc in amdgpu_device.c
Fix these warnings by deleting the deviant arguments.

gcc with W=1
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:799: warning: Excess function parameter 'pcie_index' description in 'amdgpu_device_indirect_wreg'
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:799: warning: Excess function parameter 'pcie_data' description in 'amdgpu_device_indirect_wreg'
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:870: warning: Excess function parameter 'pcie_index' description in 'amdgpu_device_indirect_wreg64'
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:870: warning: Excess function parameter 'pcie_data' description in 'amdgpu_device_indirect_wreg64'

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 12:33:55 -04:00
Shiwu Zhang
9535a86a40 drm/amdgpu: bypass bios dependent operations
Since bios reading does not work currently so just bypass all operations
related to bios

v2: hardcode the vram info for APP_APU case (hawking)
v3: correct the vram_width with channel number * channel size (lijo)

Signed-off-by: Shiwu Zhang <shiwu.zhang@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 11:02:12 -04:00
Tong Liu01
04e8595819 drm/amdgpu: fix incorrect pcie_gen_mask in passthrough case
[why]
Passthrough case is treated as root bus and pcie_gen_mask is set as
default value that does not support GEN 3 and GEN 4 for PCIe link
speed. So PCIe link speed will be downgraded at smu hw init in
passthrough condition

[how]
Move get pci info after detect virtualization and check if it is
passthrough case when set pcie_gen_mask

Signed-off-by: Tong Liu01 <Tong.Liu01@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 10:40:00 -04:00
Philip Yang
84b4dd3f84 drm/amdkfd: Refactor migrate init to support partition switch
Rename smv_migrate_init to a better name kgd2kfd_init_zone_device
because it setup zone devive pgmap for page migration and keep it in
kfd_migrate.c to access static functions svm_migrate_pgmap_ops. Call it
only once in amdgpu_device_ip_init after adev ip blocks are initialized,
but before amdgpu_amdkfd_device_init initialize kfd nodes which enable
SVM support based on pgmap.

svm_range_set_max_pages is called by kgd2kfd_device_init everytime after
switching compute partition mode.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 10:36:58 -04:00
James Zhu
d425c6f48b drm/amdgpu: add partition scheduler list update
Add partition scheduler list update in late init
and xcp partition mode switch.

Signed-off-by: James Zhu <James.Zhu@amd.com>
Acked-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:59:35 -04:00
James Zhu
2c1c7ba457 drm/amdgpu: support partition drm devices
Support partition drm devices on GC_HWIP IP_VERSION(9, 4, 3).

This is a temporary solution and will be superceded.

Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: James Zhu <James.Zhu@amd.com>
Reviewed-and-tested-by: Philip Yang<Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:59:20 -04:00
Gavin Wan
b4520bfd80 drm/amdgpu: Checked if the pointer NULL before use it.
For SRIOV on some parts, the host driver does not post VBIOS. So the guest
cannot get bios information. Therefore, adev->virt.fw_reserve.p_pf2vf
and adev->mode_info.atom_context are NULL.

Signed-off-by: Gavin Wan <Gavin.Wan@amd.com>
Reviewed-by: Zhigang Luo <Zhigang.Luo@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:58:09 -04:00
Lijo Lazar
a0ba127960 drm/amdgpu: Fix unmapping of aperture
When aperture size is zero, there is no mapping done.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:57:46 -04:00
Le Ma
0c451baf3b drm/amdgpu: change the print level to warn for ip block disabled
Avoid to mislead users as it's not a real error.

Signed-off-by: Le Ma <le.ma@amd.com>
Reviewed-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Amber Lin <Amber.Lin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:54:12 -04:00
Le Ma
0c552ed387 drm/amdgpu: add indirect r/w interface for smn address greater than 32bits
On multiple AIDs platform, bit[34:32] in SMD address is leveraged to access
nonAID0 register smn address and new PCI_INDEX_HI register is introduced
to access the higher bits.

v2: rebase on latest register accessors (Alex)

Signed-off-by: Le Ma <le.ma@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:45:29 -04:00
Le Ma
0ee20b8696 drm/amdgpu: assign the doorbell index in 1st page to sdma page queue
Previously for vega10, the sdma_doorbell_range is only enough for sdma
gfx queue, thus the index on second doorbell page is allocated for sdma
page queue. From vega20, the sdma_doorbell_range on 1st page is enlarged.
Therefore, just leverage these index instead of allocation on 2nd page.

v2: change "(x << 1) + 2" to "(x + 1) << 1" for readability and add comments.

Signed-off-by: Le Ma <le.ma@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:43:29 -04:00
Lijo Lazar
5db392a045 drm/amdgpu: Use new atomfirmware init for GC 9.4.3
Use the new atomfirmware initialization logic for GC 9.4.3 based ASICs
also. ASIC init logic doesn't consider boot clocks during init.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:43:22 -04:00
James Zhu
81283fee15 drm/amdgpu/: add more macro to support offset variant
Add more macro to support offset variant and
simplify macro SOC15_WAIT_ON_RREG.

Signed-off-by: James Zhu <James.Zhu@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:40:46 -04:00
Le Ma
98a54e88e8 drm/amdgpu: add sysfs node for compute partition mode
Add current/available compute partitin mode sysfs node.

v2: make the sysfs node as IP independent one in amdgpu_gfx.c

Signed-off-by: Le Ma <le.ma@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:40:25 -04:00
Lin.Cao
4994d1f0a7 drm/amdgpu: Fix vram recover doesn't work after whole GPU reset (v2)
v1: Vmbo->shadow is used to back vram bo up when vram lost. So that we
should set shadow as vmbo->shadow to recover vmbo->bo
v2: Modify if(vmbo->shadow) shadow = vmbo->shadow as if(!vmbo->shadow)
continue;

Fixes: e18aaea733 ("drm/amdgpu: move shadow_list to amdgpu_bo_vm")
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Lin.Cao <lincao12@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:35:30 -04:00
Yifan Zhang
0e768043bf drm/amdgpu: set gfx9 onwards APU atomics support to be true
APUs w/ gfx9 onwards doesn't reply on PCIe atomics, rather
it is internal path w/ native atomic support. Set have_atomics_support
to true.

Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: Lang Yu <lang.yu@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:29:48 -04:00
lyndonli
59e9fff198 drm/amdgpu: Use the default reset when loading or reloading the driver
Below call trace and errors are observed when reloading
amdgpu driver with the module parameter reset_method=3.

It should do a default reset when loading or reloading the
driver, regardless of the module parameter reset_method.

v2: add comments inside and modify commit messages.

[  +2.180243] [drm] psp gfx command ID_LOAD_TOC(0x20) failed
and response status is (0x0)
[  +0.000011] [drm:psp_hw_start [amdgpu]] *ERROR* Failed to load toc
[  +0.000890] [drm:psp_hw_start [amdgpu]] *ERROR* PSP tmr init failed!
[  +0.020683] [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to
clear memory with ring turned off.
[  +0.000003] RIP: 0010:amdgpu_bo_release_notify+0x1ef/0x210 [amdgpu]
[  +0.000004] Call Trace:
[  +0.000003]  <TASK>
[  +0.000008]  ttm_bo_release+0x2c4/0x330 [amdttm]
[  +0.000026]  amdttm_bo_put+0x3c/0x70 [amdttm]
[  +0.000020]  amdgpu_bo_free_kernel+0xe6/0x140 [amdgpu]
[  +0.000728]  psp_v11_0_ring_destroy+0x34/0x60 [amdgpu]
[  +0.000826]  psp_hw_init+0xe7/0x2f0 [amdgpu]
[  +0.000813]  amdgpu_device_fw_loading+0x1ad/0x2d0 [amdgpu]
[  +0.000731]  amdgpu_device_init.cold+0x108e/0x2002 [amdgpu]
[  +0.001071]  ? do_pci_enable_device+0xe1/0x110
[  +0.000011]  amdgpu_driver_load_kms+0x1a/0x160 [amdgpu]
[  +0.000729]  amdgpu_pci_probe+0x179/0x3a0 [amdgpu]

Signed-off-by: lyndonli <Lyndon.Li@amd.com>
Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:21:42 -04:00
Srinivasan Shanmugam
47fc644f80 drm/amd/amdgpu: Fix style errors in amdgpu_drv.c & amdgpu_device.c
Fix following checkpatch style errors in amdgpu_drv.c &
amdgpu_device.c

ERROR: exactly one space required after that #ifdef
ERROR: spaces required around that '+=' (ctx:WxV)
ERROR: space required before the open brace '{'
ERROR: spaces required around that '||' (ctx:VxE)
ERROR: space prohibited before that close parenthesis ')'
ERROR: space required before the open parenthesis '('
ERROR: space required before the open brace '{'
ERROR: code indent should use tabs where possible

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-04-21 08:49:37 -04:00
Chong Li
38eecbe086 drm/amdgpu: release gpu full access after "amdgpu_device_ip_late_init"
[WHY]
 Function "amdgpu_irq_update()" called by "amdgpu_device_ip_late_init()" is an atomic context.
 We shouldn't access registers through KIQ since "msleep()" may be called in "amdgpu_kiq_rreg()".

[HOW]
 Move function "amdgpu_virt_release_full_gpu()" after function "amdgpu_device_ip_late_init()",
 to ensure that registers be accessed through RLCG instead of KIQ.

Call Trace:
  <TASK>
  show_stack+0x52/0x69
  dump_stack_lvl+0x49/0x6d
  dump_stack+0x10/0x18
  __schedule_bug.cold+0x4f/0x6b
  __schedule+0x473/0x5d0
  ? __wake_up_klogd.part.0+0x40/0x70
  ? vprintk_emit+0xbe/0x1f0
  schedule+0x68/0x110
  schedule_timeout+0x87/0x160
  ? timer_migration_handler+0xa0/0xa0
  msleep+0x2d/0x50
  amdgpu_kiq_rreg+0x18d/0x1f0 [amdgpu]
  amdgpu_device_rreg.part.0+0x59/0xd0 [amdgpu]
  amdgpu_device_rreg+0x3a/0x50 [amdgpu]
  amdgpu_sriov_rreg+0x3c/0xb0 [amdgpu]
  gfx_v10_0_set_gfx_eop_interrupt_state.constprop.0+0x16c/0x190 [amdgpu]
  gfx_v10_0_set_eop_interrupt_state+0xa5/0xb0 [amdgpu]
  amdgpu_irq_update+0x53/0x80 [amdgpu]
  amdgpu_irq_get+0x7c/0xb0 [amdgpu]
  amdgpu_fence_driver_hw_init+0x58/0x90 [amdgpu]
  amdgpu_device_init.cold+0x16b7/0x2022 [amdgpu]

Signed-off-by: Chong Li <chongli2@amd.com>
Reviewed-by: JingWen.Chen2@amd.com
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-04-18 16:28:50 -04:00
Aaron Liu
f22067419e drm/amdgpu: skip kfd-iommu suspend/resume for S0ix
GFX is in gfxoff mode during s0ix so we shouldn't need to
actually execute kfd_iommu_suspend/kfd_iommu_resume operation.

Signed-off-by: Aaron Liu <aaron.liu@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-04-14 13:37:20 -04:00
Shashank Sharma
0512e9ffeb drm/amdgpu: rename num_doorbells
Rename doorbell.num_doorbells to doorbell.num_kernel_doorbells to
make it more readable.

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Acked-by: Christian Koenig <christian.koenig@amd.com>
Signed-off-by: Shashank Sharma <shashank.sharma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-04-13 00:19:42 -04:00
Sreekant Somasekharan
00fa40353b drm/amdkfd: Check PCIe atomics support on GFX11 to set CP_HQD_HQ_STATUS0[29]
CP_HQD_HQ_STATUS0[29] bit will be used by CPFW to acknowledge whether
PCIe atomics are supported. The default value of this bit is set
to 0. Driver will check whether PCIe atomics are supported and set the
bit to 1 if supported. This will force CPFW to use real atomic ops.
If the bit is not set, CPFW will default to read/modify/write using the
firmware itself.

This is applicable only to GFX11 RS64 CP with MEC FW >= 509. If MEC
FW < 509 and for all GFX11 F32 CP, PCIe atomics needs to be supported
else it will skip the device.

This commit also involves moving amdgpu_amdkfd_device_probe() function
call after per-IP early_init loop in amdgpu_device_ip_early_init()
function so as to check for RS64 enabled device.

Signed-off-by: Sreekant Somasekharan <sreekant.somasekharan@amd.com>
Reviewed-by: Graham Sider <Graham.Sider@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-04-11 18:03:45 -04:00
Srinivasan Shanmugam
11f25c844e drm/amd/amdgpu: Drop the hang limit parameter
The driver doesn't resubmit jobs on hangs any more, hence drop
the hang limit parameter - amdgpu_job_hang_limit, wherever it is used.

Suggested-by: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Mario Limonciello <mario.limonciello@amd.com>
Cc: Kent Russell <kent.russell@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-04-11 18:03:43 -04:00
Yifan Zha
d2cdc01451 drm/amdgpu: Add JPEG IP block to SRIOV reinit
[Why]
Reset(mode1) failed as JPRG IP did not reinit under sriov.

[How]
Add JPEG IP block to sriov reinit function.

Signed-off-by: Yifan Zha <Yifan.Zha@amd.com>
Reviewed-by: Horace Chen <Horace.Chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-31 11:18:53 -04:00
YiPeng Chai
28606c4e58 drm/amdgpu: resume ras for gfx v11_0_3 during reset on SRIOV
Gfx v11_0_3 supports ras on SRIOV, so need to resume ras
during reset.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-22 00:58:50 -04:00
YiPeng Chai
ec64350d01 drm/amdgpu: reinit mes ip block during reset on SRIOV
Reinit mes ip block during reset on SRIOV.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-22 00:58:44 -04:00
Kai-Heng Feng
3ad5dcfe00 drm/amdgpu/nv: Apply ASPM quirk on Intel ADL + AMD Navi
S2idle resume freeze can be observed on Intel ADL + AMD WX5500. This is
caused by commit 0064b0ce85 ("drm/amd/pm: enable ASPM by default").

The root cause is still not clear for now.

So extend and apply the ASPM quirk from commit e02fe3bc7a
("drm/amdgpu: vi: disable ASPM on Intel Alder Lake based systems"), to
workaround the issue on Navi cards too.

Fixes: 0064b0ce85 ("drm/amd/pm: enable ASPM by default")
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2458
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-22 00:58:08 -04:00
Lee Jones
80bd2de1db drm/amd/amdgpu/amdgpu_device: Provide missing kerneldoc entry for 'reset_context'
Fixes the following W=1 kernel build warning(s):

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:5152:
   warning: Function parameter or member 'reset_context' not described in 'amdgpu_device_gpu_recover'

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: David Airlie <airlied@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: amd-gfx@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Signed-off-by: Lee Jones <lee@kernel.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-22 00:47:59 -04:00
Hawking Zhang
dabc114e4b drm/amdgpu: Move to common helper to query soc rev_id
Replace soc15, nv, soc21 get_rev_id callback with common
helper so we don't need to duplicate code when introduce
new asics.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Likun Gao <Likun.Gao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-15 18:45:27 -04:00
Hawking Zhang
65ba96e91b drm/amdgpu: Move to common indirect reg access helper
Replace soc15, nv, soc21 specific callbacks with common
one. so we don't need to duplicate code when introduce
new asics.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Likun Gao <Likun.Gao@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-15 18:45:27 -04:00
Guchun Chen
c69d51395a drm/amdgpu: move poll enabled/disable into non DC path
Some amd asics having reliable hotplug support don't call
drm_kms_helper_poll_init in driver init sequence. However,
due to the unified suspend/resume path for all asics, because
the output_poll_work->func is not set for these asics, a warning
arrives when suspending.

[   90.656049]  <TASK>
[   90.656050]  ? console_unlock+0x4d/0x100
[   90.656053]  ? __irq_work_queue_local+0x27/0x60
[   90.656056]  ? irq_work_queue+0x2b/0x50
[   90.656057]  ? __wake_up_klogd+0x40/0x60
[   90.656059]  __cancel_work_timer+0xed/0x180
[   90.656061]  drm_kms_helper_poll_disable.cold+0x1f/0x2c [drm_kms_helper]
[   90.656072]  amdgpu_device_suspend+0x81/0x170 [amdgpu]
[   90.656180]  amdgpu_pmops_runtime_suspend+0xb5/0x1b0 [amdgpu]
[   90.656269]  pci_pm_runtime_suspend+0x61/0x1b0

drm_kms_helper_poll_enable/disable is valid when poll_init is called in
amdgpu code, which is only used in non DC path. So move such codes into
non-DC path code to get rid of such warnings.

v1: introduce use_kms_poll flag in amdgpu as the poll stuff check
v2: use dc_enabled as the flag to simply code
v3: move code into non DC path instead of relying on any flag

Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/2411
Fixes: a4e771729a ("drm/probe_helper: sort out poll_running vs poll_enabled")
Reported-by: Bert Karwatzki <spasswolf@web.de>
Suggested-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Suggested-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-13 17:27:48 -04:00
Guchun Chen
6ab68650a1 drm/amdgpu: use drm_device pointer directly rather than convert again
The convert from adev is redundant.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-13 17:27:48 -04:00
Guchun Chen
53e9d836ea drm/amdgpu: drop pm_sysfs_en flag from amdgpu_device structure
pm_sysfs_en is overlapped with pm.sysfs_initialized, so drop it
for simplifying code(no functional change).

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-13 17:27:48 -04:00
Bjorn Helgaas
58265640fb drm/amdgpu: Drop redundant pci_enable_pcie_error_reporting()
pci_enable_pcie_error_reporting() enables the device to send ERR_*
Messages.  Since f26e58bf6f ("PCI/AER: Enable error reporting when AER is
native"), the PCI core does this for all devices during enumeration, so the
driver doesn't need to do it itself.

Remove the redundant pci_enable_pcie_error_reporting() call from the
driver.

Note that this only controls ERR_* Messages from the device.  An ERR_*
Message may cause the Root Port to generate an interrupt, depending on the
AER Root Error Command register managed by the AER service driver.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-08 14:05:57 -05:00
Orlando Chamberlain
d37a3929ca drm/amdgpu: register a vga_switcheroo client for MacBooks with apple-gmux
Commit 3840c5bcc2 ("drm/amdgpu: disentangle runtime pm and
vga_switcheroo") made amdgpu only register a vga_switcheroo client for
GPU's with PX, however AMD GPUs in dual gpu Apple Macbooks do need to
register, but don't have PX. Instead of AMD's PX, they use apple-gmux.

Use apple_gmux_detect() to identify these gpus, and
pci_is_thunderbolt_attached() to ensure eGPUs connected to Dual GPU
Macbooks don't register with vga_switcheroo.

Fixes: 3840c5bcc2 ("drm/amdgpu: disentangle runtime pm and vga_switcheroo")
Link: https://lore.kernel.org/amd-gfx/20230210044826.9834-10-orlandoch.dev@gmail.com/
Signed-off-by: Orlando Chamberlain <orlandoch.dev@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-07 14:22:41 -05:00
Jack Xiao
dc907c9db8 drm/amd/amdgpu: fix warning during suspend
Freeing memory was warned during suspend.
Move the self test out of suspend.

Link: https://bugzilla.redhat.com/show_bug.cgi?id=2151825
Cc: jfalempe@redhat.com
Signed-off-by: Jack Xiao <Jack.Xiao@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Reviewed-and-tested-by: Evan Quan <evan.quan@amd.com>
Tested-by: Jocelyn Falempe <jfalempe@redhat.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-02-15 22:19:43 -05:00
Kent Russell
2c496a6cf4 drm/amdgpu: Fix incorrect filenames in sysfs comments
This looks like a standard copy/paste mistake. Replace the incorrect
serial_number references with product_name and product_model

Signed-off-by: Kent Russell <kent.russell@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-02-08 17:35:36 -05:00
Vitaly Prosyak
39934d3ed5 Revert "drm/amdgpu: TA unload messages are not actually sent to psp when amdgpu is uninstalled"
This reverts commit fac53471d0.
The following change: move the drm_dev_unplug call after
amdgpu_driver_unload_kms in amdgpu_pci_remove. The reason is
the following: amdgpu_pci_remove calls drm_dev_unregister
and it should be called first to ensure userspace can't access the
device instance anymore. If we call drm_dev_unplug after
amdgpu_driver_unload_kms then we observe IGT PCI software unplug
test failure (kernel hung) for all ASICs. This is how this
regression was found.

After this revert, the following commands do work not, but it would
be fixed in the next commit:
 - sudo modprobe -r amdgpu
 - sudo modprobe amdgpu

Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Reviewed-by Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-01-31 13:59:21 -05:00
Dave Airlie
155c6b16ee Merge tag 'amd-drm-next-6.3-2023-01-27' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-6.3-2023-01-27:

amdgpu:
- GC11 fixes
- SMU13 fixes
- Freesync fixes
- DP MST fixes
- DP MST code rework and cleanup
- AV1 fixes for VCN4
- DCN 3.2.x fixes
- PSR fixes
- DML optimizations
- DC link code rework

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230127225917.2419162-1-alexander.deucher@amd.com
2023-01-30 15:37:57 +10:00
Dave Airlie
7dd1be30f0 Merge tag 'amd-drm-next-6.3-2023-01-20' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-6.3-2023-01-20:

amdgpu:
- Secure display fixes
- Fix scaling
- Misc code cleanups
- Display BW alloc logic updates
- DCN 3.2 fixes
- Fix power reporting on certain firmwares for CZN/RN
- SR-IOV fixes
- Link training cleanup and code rework
- HDCP fixes
- Reserved VMID fix
- Documentation updates
- Colorspace fixes
- RAS updates
- GC11.0 fixes
- VCN instance harvesting fixes
- DCN 3.1.4/5 workarounds for S/G displays
- Add PCIe info to the INFO IOCTL

amdkfd:
- XNACK fix

UAPI:
- Add PCIe gen/lanes info to the amdgpu INFO IOCTL
  Nesa ultimately plans to use this to make decisions about buffer placement optimizations
  Mesa MR: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20790

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230120234523.7610-1-alexander.deucher@amd.com
2023-01-25 12:07:53 +10:00
Tim Huang
e11c775030 drm/amdgpu: skip psp suspend for IMU enabled ASICs mode2 reset
The psp suspend & resume should be skipped to avoid destroy
the TMR and reload FWs again for IMU enabled APU ASICs.

Signed-off-by: Tim Huang <tim.huang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-01-24 12:27:37 -05:00
Daniel Vetter
b8f55f24bc drm-misc-next for $kernel-version:
UAPI Changes:
 
 Cross-subsystem Changes:
 
 Core Changes:
 
  * Cleanup unneeded include statements wrt <linux/fb.h>, <drm/drm_fb_helper.h>
    and <drm/drm_crtc_helper.h>
 
  * Remove unused helper DRM_DEBUG_KMS_RATELIMITED()
 
  * fbdev: Remove obsolete aperture field from struct fb_device, plus
    driver cleanups; Remove unused flag FBINFO_MISC_FIRMWARE
 
  * MIPI-DSI: Fix brightness, plus rsp. driver updates
 
  * scheduler: Deprecate drm_sched_resubmit_jobs()
 
  * ttm: Fix MIPS build; Remove ttm_bo_wait(); Documentation fixes
 
 Driver Changes:
 
  * Remove obsolete drivers for userspace modesetting i810, mga, r128,
    savage, sis, tdfx, via
 
  * bridge: Support CDNS DSI J721E, plus DT bindings; lt9611: Various
    fixes and improvements; sil902x: Various fixes; Fixes
 
  * nouveau: Removed support for legacy ioctls; Replace zero-size array;
    Cleanups
 
  * panel: Fixes
 
  * radeon: Use new DRM logging helpers
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEchf7rIzpz2NEoWjlaA3BHVMLeiMFAmPJAq4ACgkQaA3BHVML
 eiNEIgf+I0R9KmX890K4usKG9LfPH/nIv+4Am6x4/4lv0PzN2vYGhoyPJG8cyNvs
 KFms+lTUJBBgHeTG8S8NU1qKWUlA78eYQz8S4dbaocchsAiPTHq4f5J45zbQWMGI
 P56iNAflaO2ETtb3CsH0P0TPsW2TpZC3dvZUYpAEQDli66Bn2BCPCYspt4scOhZX
 S9usD28sB6L9AnALcCUMLqF4DUsW4FC8Zz46hKVUFlQpN5dcC1b0x0gyclyWy0wh
 yi1fkqzBB3N44JOIFFwan/KxQttgvrc9Shllkqss525AhE+v3afkK2i9ZXgdckuU
 kLC09pn6yuxubYgS0vJEU1bsqiMs+Q==
 =WjQb
 -----END PGP SIGNATURE-----

Merge tag 'drm-misc-next-2023-01-19' of git://anongit.freedesktop.org/drm/drm-misc into drm-next

drm-misc-next for $kernel-version:

UAPI Changes:

Cross-subsystem Changes:

Core Changes:

 * Cleanup unneeded include statements wrt <linux/fb.h>, <drm/drm_fb_helper.h>
   and <drm/drm_crtc_helper.h>

 * Remove unused helper DRM_DEBUG_KMS_RATELIMITED()

 * fbdev: Remove obsolete aperture field from struct fb_device, plus
   driver cleanups; Remove unused flag FBINFO_MISC_FIRMWARE

 * MIPI-DSI: Fix brightness, plus rsp. driver updates

 * scheduler: Deprecate drm_sched_resubmit_jobs()

 * ttm: Fix MIPS build; Remove ttm_bo_wait(); Documentation fixes

Driver Changes:

 * Remove obsolete drivers for userspace modesetting i810, mga, r128,
   savage, sis, tdfx, via

 * bridge: Support CDNS DSI J721E, plus DT bindings; lt9611: Various
   fixes and improvements; sil902x: Various fixes; Fixes

 * nouveau: Removed support for legacy ioctls; Replace zero-size array;
   Cleanups

 * panel: Fixes

 * radeon: Use new DRM logging helpers

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
From: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://patchwork.freedesktop.org/patch/msgid/Y8kDk5YX7Yz3eRhM@linux-uq9g
2023-01-24 17:36:29 +01:00
Thomas Zimmermann
973ad6273c drm/amdgpu: Remove unnecessary include statements for drm_crtc_helper.h
Several source files include drm_crtc_helper.h without needing it or
only to get its transitive include statements; leading to unnecessary
compile-time dependencies.

Directly include required headers and drop drm_crtc_helper.h where
possible.

v2:
	* keep includes sorted in amdgpu_device.c (Sam)

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Reviewed-by: Sam Ravnborg <sam@ravnborg.org>
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230116131235.18917-4-tzimmermann@suse.de
2023-01-18 09:25:30 +01:00
Thomas Zimmermann
53a17b6b75 drm/amdgpu: Fix coding style
Align a closing brace and remove trailing whitespaces. No functional
changes.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-01-17 16:11:52 -05:00
Mario Limonciello
ced6950276 drm/amd: Evaluate early init for all IP blocks even if one fails
If early init fails for a single IP block, then no further IP blocks
are evaluated.  This means that if a user was missing more than one
firmware binary they would have to keep adding binaries and re-probing
until they discovered the ones missing.

To make this easier, run early init for each IP block and report a single
failure if not all passed.

Reviewed-by: Aaron Liu <aaron.liu@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-01-17 16:11:51 -05:00
Dave Airlie
1f1c24dee2 Merge tag 'amd-drm-next-6.3-2023-01-13' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-6.3-2023-01-13:

amdgpu:
- Fix possible segfault in failure case
- Rework FW requests to happen in early_init for all IPs so
  that we don't lose the sbios console if FW is missing
- PSR fixes
- Misc cleanups
- Unload fix
- SMU13 fixes

amdkfd:
- Fix for cleared VRAM BOs
- Fix cleanup if GPUVM creation fails
- Memory accounting fix
- Use resource_size rather than open codeing it
- GC11 mGPU fix

radeon:
- Fix memory leak on shutdown

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230113225911.7776-1-alexander.deucher@amd.com
2023-01-16 15:04:13 +10:00
Dave Airlie
45be204806 Merge tag 'amd-drm-next-6.3-2023-01-06' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-6.3-2023-01-06:

amdgpu:
- secure display support for multiple displays
- DML optimizations
- DCN 3.2 updates
- PSR updates
- DP 2.1 updates
- SR-IOV RAS updates
- VCN RAS support
- SMU 13.x updates
- Switch 1 element arrays to flexible arrays
- Add RAS support for DF 4.3
- Stack size improvements
- S0ix rework
- Soft reset fix
- Allow 0 as a vram limit on APUs
- Display fixes
- Misc code cleanups
- Documentation fixes
- Handle profiling modes for SMU13.x

amdkfd:
- Error handling fixes
- PASID fixes

radeon:
- Switch 1 element arrays to flexible arrays

drm:
- Add DP adaptive sync DPCD definitions

UAPI:
- Add new INFO queries for peak and min sclk/mclk for profile modes on newer chips
  Proposed mesa patch: https://gitlab.freedesktop.org/mesa/drm/-/merge_requests/278

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20230106222037.7870-1-alexander.deucher@amd.com
2023-01-16 14:00:12 +10:00
Mario Limonciello
b31d306378 drm/amd: Use amdgpu_ucode_* helpers for GPU info bin
The `amdgpu_ucode_request` helper will ensure that the return code for
missing firmware is -ENODEV so that early_init can fail.

The `amdgpu_ucode_release` helper is for symmetry on unloading.

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-01-10 14:32:58 -05:00
Mario Limonciello
b7cdb41e7d drm/amd: Delay removal of the firmware framebuffer
Removing the firmware framebuffer from the driver means that even
if the driver doesn't support the IP blocks in a GPU it will no
longer be functional after the driver fails to initialize.

This change will ensure that unsupported IP blocks at least cause
the driver to work with the EFI framebuffer.

Cc: stable@vger.kernel.org
Suggested-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-01-09 16:26:48 -05:00
Daniel Vetter
03a0a10408 drm-misc-next for v6.3:
UAPI Changes:
 
  * connector: Support analog-TV mode property
 
  * media: Add MEDIA_BUS_FMT_RGB565_1X24_CPADHI,
    MEDIA_BUS_FMT_RGB666_1X18 and MEDIA_BUS_FMT_RGB666_1X24_CPADHI
 
 Cross-subsystem Changes:
 
  * dma-buf: Documentation fixes
 
  * i2c: Introduce i2c_client_get_device_id() helper
 
 Core Changes:
 
  * Improve support for analog TV output
 
  * bridge: Remove unused drm_bridge_chain functions
 
  * debugfs: Add per-device helpers and convert various DRM drivers
 
  * dp-mst: Various fixes
 
  * fbdev emulation: Always pick 32 bpp as default
 
  * KUnit: Add tests for managed helpers; Various cleanups
 
  * panel-orientation: Add quirks for Lenovo Yoga Tab 3 X90F and DynaBook K50
 
  * TTM: Open-code ttm_bo_wait() and remove the helper
 
 Driver Changes:
 
  * Fix preferred depth and bpp values throughout DRM drivers
 
  * Remove #CONFIG_PM guards throughout DRM drivers
 
  * ast: Various fixes
 
  * bridge: Implement i2c's probe_new in various drivers; Fixes; ite-it6505:
    Locking fixes, Cache EDID data; ite-it66121: Support IT6610 chip,
    Cleanups; lontium-tl9611: Fix HDMI on DragonBoard 845c; parade-ps8640:
    Use atomic bridge functions
 
  * gud: Convert to DRM shadow-plane helpers; Perform flushing synchronously
    during atomic update
 
  * ili9486: Support 16-bit pixel data
 
  * imx: Split off IPUv3 driver; Various fixes
 
  * mipi-dbi: Convert to DRM shadow-plane helpers plus rsp driver changes;i
    Support separate I/O-voltage supply
 
  * mxsfb: Depend on ARCH_MXS or ARCH_MXC
 
  * omapdrm: Various fixes
 
  * panel: Use ktime_get_boottime() to measure power-down delay in various
    drivers; Fix auto-suspend delay in various drivers; orisetech-ota5601a:
    Add support
 
  * sprd: Cleanups
 
  * sun4i: Convert to new TV-mode property
 
  * tidss: Various fixes
 
  * v3d: Various fixes
 
  * vc4: Convert to new TV-mode property; Support Kunit tests; Cleanups;
    dpi: Support RGB565 and RGB666 formats; dsi: Convert DSI driver to
    bridge
 
  * virtio: Improve tracing
 
  * vkms: Support small cursors in IGT tests; Various fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEchf7rIzpz2NEoWjlaA3BHVMLeiMFAmO0B0QACgkQaA3BHVML
 eiMKtAf9EXt5yaEonR4gsGNVD70VkebW+jPEWbEMg5hMFDHE9sjBja8T7bOxeeL1
 BVofJiAuEZW9176eqeWvShuwOuiE7lf4WQLMvXfFmtHF/Nac9HUtEvOcvc1vEDUB
 y6VFlVHe8mSp+Iy0WPLyZCtT4d7v2eM+VYm0HgFa74dSTzQTLGPiUI/XVb/YDaA7
 FHKB5NvsMX9S1XvjxCsq0jA5bo8SSzh5CVKerdAZkBJMCSmKiY09o2p5C7vw3EAz
 WUXL5CXqd9bwgvZa/9ZClQtpJaikNGoOaMSEl67rbWdVTk0LIpqx3nTY15py5LXd
 SG4Wtfor3Vf1REa9TrR2uCIOmh+1gQ==
 =0/PC
 -----END PGP SIGNATURE-----

Merge tag 'drm-misc-next-2023-01-03' of git://anongit.freedesktop.org/drm/drm-misc into drm-next

drm-misc-next for v6.3:

UAPI Changes:

 * connector: Support analog-TV mode property
 * media: Add MEDIA_BUS_FMT_RGB565_1X24_CPADHI,
   MEDIA_BUS_FMT_RGB666_1X18 and MEDIA_BUS_FMT_RGB666_1X24_CPADHI

Cross-subsystem Changes:

 * dma-buf: Documentation fixes
 * i2c: Introduce i2c_client_get_device_id() helper

Core Changes:

 * Improve support for analog TV output
 * bridge: Remove unused drm_bridge_chain functions
 * debugfs: Add per-device helpers and convert various DRM drivers
 * dp-mst: Various fixes
 * fbdev emulation: Always pick 32 bpp as default
 * KUnit: Add tests for managed helpers; Various cleanups
 * panel-orientation: Add quirks for Lenovo Yoga Tab 3 X90F and DynaBook K50
 * TTM: Open-code ttm_bo_wait() and remove the helper

Driver Changes:

 * Fix preferred depth and bpp values throughout DRM drivers
 * Remove #CONFIG_PM guards throughout DRM drivers
 * ast: Various fixes
 * bridge: Implement i2c's probe_new in various drivers; Fixes; ite-it6505:
   Locking fixes, Cache EDID data; ite-it66121: Support IT6610 chip,
   Cleanups; lontium-tl9611: Fix HDMI on DragonBoard 845c; parade-ps8640:
   Use atomic bridge functions
 * gud: Convert to DRM shadow-plane helpers; Perform flushing synchronously
   during atomic update
 * ili9486: Support 16-bit pixel data
 * imx: Split off IPUv3 driver; Various fixes
 * mipi-dbi: Convert to DRM shadow-plane helpers plus rsp driver changes;i
   Support separate I/O-voltage supply
 * mxsfb: Depend on ARCH_MXS or ARCH_MXC
 * omapdrm: Various fixes
 * panel: Use ktime_get_boottime() to measure power-down delay in various
   drivers; Fix auto-suspend delay in various drivers; orisetech-ota5601a:
   Add support
 * sprd: Cleanups
 * sun4i: Convert to new TV-mode property
 * tidss: Various fixes
 * v3d: Various fixes
 * vc4: Convert to new TV-mode property; Support Kunit tests; Cleanups;
   dpi: Support RGB565 and RGB666 formats; dsi: Convert DSI driver to
   bridge
 * virtio: Improve tracing
 * vkms: Support small cursors in IGT tests; Various fixes

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
From: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://patchwork.freedesktop.org/patch/msgid/Y7QIwlfElAYWxRcR@linux-uq9g
2023-01-04 14:59:25 +01:00
Christian König
7ccfd79fdd drm/amdgpu: rename vram_scratch into mem_scratch
Rename vram_scratch into mem_scratch and allow allocating it into GTT as
well.

The only problem with that is that we won't have a default page for the
system aperture any more.

Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-01-03 16:50:03 -05:00
Christian König
58ab2c08d7 drm/amdgpu: use VRAM|GTT for a bunch of kernel allocations
Technically all of those can use GTT as well, no need to force things
into VRAM.

Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-01-03 16:49:54 -05:00
Likun Gao
360cd08196 drm/amdgpu: adjust the sequence to check soft reset
1.Drop soft reset check when do should recover gpu check.
  (As it will skip gpu reset operation if some ip is hang but
   not support soft reset)
2.Check soft reset status before do soft reset when pre asic reset.
  a. If check soft reset return true, it means: some ip is hang and
     it also support soft reset, will try soft reset first.
  b. If check soft reset return false, it means:
       I.  All the ip are not hang, will skip gpu reset.
       II. Some ip is hang but not support soft reset, will skip soft
           reset and retry with full reset later.

Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-01-03 16:49:26 -05:00
Alex Deucher
afa6646b1c drm/amdgpu: skip MES for S0ix as well since it's part of GFX
It's also part of gfxoff.

Cc: stable@vger.kernel.org # 6.0, 6.1
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-12-20 13:08:12 -05:00
Alex Deucher
5620a1889e drm/amdgpu: skip MES for S0ix as well since it's part of GFX
It's also part of gfxoff.

Cc: stable@vger.kernel.org # 6.0, 6.1
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-12-20 12:47:07 -05:00
Alex Deucher
5804463a65 Revert "drm/amdgpu: disallow gfxoff until GC IP blocks complete s2idle resume"
This reverts commit f543d28687.

This is no longer needed since we no longer touch SDMA 5.x for s0i3.

Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-12-20 12:46:58 -05:00
Alex Deucher
2a7798ea73 drm/amdgpu: for S0ix, skip SDMA 5.x+ suspend/resume
SDMA 5.x is part of the GFX block so it's controlled via
GFXOFF.  Skip suspend as it should be handled the same
as GFX.

v2: drop SDMA 4.x.  That requires special handling.

Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Acked-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-12-20 12:46:55 -05:00
Alex Deucher
47198eb721 drm/amdgpu: don't mess with SDMA clock or powergating in S0ix
It's handled by GFXOFF for SDMA 5.x and SMU saves the state on
SDMA 4.x.

Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-12-20 12:46:52 -05:00
Shikang Fan
47ea20762b drm/amdgpu: Add an extra evict_resource call during device_suspend.
- evict_resource is taking too long causing sriov full access mode timeout.
  So, add an extra evict_resource in the beginning as an early evict.

Signed-off-by: Shikang Fan <shikang.fan@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-12-13 17:10:32 -05:00
Christian König
9bff18d134 drm/ttm: use per BO cleanup workers
Instead of a single worker going over the list of delete BOs in regular
intervals use a per BO worker which blocks for the resv object and
locking of the BO.

This not only simplifies the handling massively, but also results in
much better response time when cleaning up buffers.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20221125102137.1801-3-christian.koenig@amd.com
2022-12-06 10:53:20 +01:00
Christian König
cd3a8a5962 drm/ttm: remove ttm_bo_(un)lock_delayed_workqueue
Those functions never worked correctly since it is still perfectly
possible that a buffer object is released and the background worker
restarted even after calling them.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20221125102137.1801-2-christian.koenig@amd.com
2022-12-06 10:28:12 +01:00
Liang He
dfd0287bd3 drm/amdgpu: Fix potential double free and null pointer dereference
In amdgpu_get_xgmi_hive(), we should not call kfree() after
kobject_put() as the PUT will call kfree().

In amdgpu_device_ip_init(), we need to check the returned *hive*
which can be NULL before we dereference it.

Signed-off-by: Liang He <windhl@126.com>
Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-29 11:03:37 -05:00
Shikang Fan
3c22c1ead6 drm/amdgpu: fix for suspend/resume kiq fence fallback under sriov
- in device_resume, sriov configure interrupt should be in full access,
  so release_full_gpu should be done after kfd_resume.
- remove the previous workaround solution for sriov.

Fixes: ec4927d463 ("drm/amdgpu: fix for suspend/resume sequence under sriov")
Signed-off-by: Shikang Fan <shikang.fan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-23 10:31:31 -05:00
Yang Yingliang
b85e285e3d drm/amdgpu: fix pci device refcount leak
As comment of pci_get_domain_bus_and_slot() says, it returns
a pci device with refcount increment, when finish using it,
the caller must decrement the reference count by calling
pci_dev_put().

So before returning from amdgpu_device_resume|suspend_display_audio(),
pci_dev_put() is called to avoid refcount leak.

Fixes: 3f12acc8d6 ("drm/amdgpu: put the audio codec into suspend state before gpu reset V3")
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-23 09:47:13 -05:00
Dave Airlie
fc58764bbf Merge tag 'amd-drm-next-6.2-2022-11-18' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-6.2-2022-11-18:

amdgpu:
- SR-IOV fixes
- Clean up DC checks
- DCN 3.2.x fixes
- DCN 3.1.x fixes
- Don't enable degamma on asics which don't support it
- IP discovery fixes
- BACO fixes
- Fix vbios allocation handling when vkms is enabled
- Drop buggy tdr advanced mode GPU reset handling
- Fix the build when DCN is not set in kconfig
- MST DSC fixes
- Userptr fixes
- FRU and RAS EEPROM fixes
- VCN 4.x RAS support
- Aldrebaran CU occupancy reporting fix
- PSP ring cleanup

amdkfd:
- Memory limit fix
- Enable cooperative launch on gfx 10.3

amd-drm-next-6.2-2022-11-11:

amdgpu:
- SMU 13.x updates
- GPUVM TLB race fix
- DCN 3.1.4 updates
- DCN 3.2.x updates
- PSR fixes
- Kerneldoc fix
- Vega10 fan fix
- GPUVM locking fixes in error pathes
- BACO fix for Beige Goby
- EEPROM I2C address cleanup
- GFXOFF fix
- Fix DC memory leak in error pathes
- Flexible array updates
- Mtype fix for GPUVM PTEs
- Move Kconfig into amdgpu directory
- SR-IOV updates
- Fix possible memory leak in CS IOCTL error path

amdkfd:
- Fix possible memory overrun
- CRIU fixes

radeon:
- ACPI ref count fix
- HDA audio notifier support
- Move Kconfig into radeon directory

UAPI:
- Add new GEM_CREATE flags to help to transition more KFD functionality to the DRM UAPI.
  These are used internally in the driver to align location based memory coherency
  requirements from memory allocated in the KFD with how we manage GPUVM PTEs.  They
  are currently blocked in the GEM_CREATE IOCTL as we don't have a user right now.
  They are just used internally in the kernel driver for now for existing KFD memory
  allocations. So a change to the UAPI header, but no functional change in the UAPI.

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20221118170807.6505-1-alexander.deucher@amd.com
Signed-off-by: Dave Airlie <airlied@redhat.com>
2022-11-22 13:41:11 +10:00
YiPeng Chai
1a11a65d53 drm/amdgpu: Enable mode-1 reset for RAS recovery in fatal error mode
The patch is enabling mode-1 reset for RAS recovery in fatal error mode.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-17 18:07:52 -05:00
Dave Airlie
4e291f2f58 drm-misc-next for 6.2:
UAPI Changes:
 
 Cross-subsystem Changes:
 
 Core Changes:
 - atomic-helper: Add begin_fb_access and end_fb_access hooks
 - fb-helper: Rework to move fb emulation into helpers
 - scheduler: rework entity flush, kill and fini
 - ttm: Optimize pool allocations
 
 Driver Changes:
 - amdgpu: scheduler rework
 - hdlcd: Switch to DRM-managed resources
 - ingenic: Fix registration error path
 - lcdif: FIFO threshold tuning
 - meson: Fix return type of cvbs' mode_valid
 - ofdrm: multiple fixes (kconfig, types, endianness)
 - sun4i: A100 and D1 support
 - panel:
   - New Panel: Jadard JD9365DA-H3
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRcEzekXsqa64kGDp7j7w1vZxhRxQUCY2y3vQAKCRDj7w1vZxhR
 xQu+AP9CzbNI2s12aNS8DskEZggo0lqUyRiEBaRJ2jrWZdGr0gEA1+Lc06HaKmGC
 2WBD4nw2I7ch0NUN6VFjQ9ATofevPwA=
 =AY3a
 -----END PGP SIGNATURE-----

Merge tag 'drm-misc-next-2022-11-10-1' of git://anongit.freedesktop.org/drm/drm-misc into drm-next

drm-misc-next for 6.2:

UAPI Changes:

Cross-subsystem Changes:

Core Changes:
- atomic-helper: Add begin_fb_access and end_fb_access hooks
- fb-helper: Rework to move fb emulation into helpers
- scheduler: rework entity flush, kill and fini
- ttm: Optimize pool allocations

Driver Changes:
- amdgpu: scheduler rework
- hdlcd: Switch to DRM-managed resources
- ingenic: Fix registration error path
- lcdif: FIFO threshold tuning
- meson: Fix return type of cvbs' mode_valid
- ofdrm: multiple fixes (kconfig, types, endianness)
- sun4i: A100 and D1 support
- panel:
  - New Panel: Jadard JD9365DA-H3

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Maxime Ripard <maxime@cerno.tech>
Link: https://patchwork.freedesktop.org/patch/msgid/20221110083612.g63eaocoaa554soh@houat
2022-11-16 07:17:32 +10:00
Christian König
0788a47e7c drm/amdgpu: stop resubmittting jobs in amdgpu_pci_resume
The state of VRAM is unreliable due to a PCI event like AER, link reset
or DPC.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-15 15:25:45 -05:00
Christian König
6868a2c465 drm/amdgpu: stop resubmitting jobs for GPU reset v2
Re-submitting IBs by the kernel has many problems because pre-
requisite state is not automatically re-created as well. In
other words neither binary semaphores nor things like ring
buffer pointers are in the state they should be when the
hardware starts to work on the IBs again.

Additional to that even after more than 5 years of
developing this feature it is still not stable and we have
massively problems getting the reference counts right.

As discussed with user space developers this behavior is not
helpful in the first place. For graphics and multimedia
workloads it makes much more sense to either completely
re-create the context or at least re-submitting the IBs
from userspace.

For compute use cases re-submitting is also not very
helpful since userspace must rely on the accuracy of
the result.

Because of this we stop this practice and instead just
properly note that the fence submission was canceled. The
only use case we keep the re-submission for now is SRIOV
and function level resets.

v2: as suggested by Sshaoyun stop resubmitting jobs even for SRIOV

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-15 15:25:37 -05:00
Christian König
06a2d7cc3f drm/amdgpu: revert "implement tdr advanced mode"
This reverts commit e6c6338f39.

This feature basically re-submits one job after another to
figure out which one was the one causing a hang.

This is obviously incompatible with gang-submit which requires
that multiple jobs run at the same time. It's also absolutely
not helpful to crash the hardware multiple times if a clean
recovery is desired.

For testing and debugging environments we should rather disable
recovery alltogether to be able to inspect the state with a hw
debugger.

Additional to that the sw implementation is clearly buggy and causes
reference count issues for the hardware fence.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-15 15:25:22 -05:00
YiPeng Chai
d293470e10 drm/amdgpu: Fixed the problem that ras error can't be queried after gpu recovery is completed
Amdgpu_ras_set_error_query_ready is called at the start of
amdgpu_device_gpu_recover to disable query ras error, but the
code behind only enables query ras error in full reset path,
but not in soft reset path, emergency restart path and skip
the hardware reset path.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-15 13:35:16 -05:00
Alex Deucher
220c8cc855 drm/amdgpu: there is no vbios fb on devices with no display hw (v2)
If we enable virtual display functionality on parts with
no display hardware we can end up trying to check for and
reserve the vbios FB area on devices where it doesn't exist.
Check if display hardware is actually present on the hardware
before trying to reserve the memory.

v2: move the check into common code

Acked-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-15 13:35:16 -05:00
Alex Deucher
d09ef24303 drm/amdgpu: clarify DC checks
There are several places where we don't want to check
if a particular asic could support DC, but rather, if
DC is enabled.  Set a flag if DC is enabled and check
for that rather than if a device supports DC or not.

Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-15 11:51:45 -05:00
Alex Deucher
25263da376 drm/amdgpu: rework SR-IOV virtual display handling
virtual display is enabled unconditionally in SR-IOV, but
without specifying the virtual_display module, the number
of crtcs defaults to 0.  Set a single display by default
for SR-IOV if the virtual_display parameter is not set.
Only enable virtual display by default on SR-IOV on asics
which actually have display hardware.

Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-15 11:51:32 -05:00
Thomas Zimmermann
45b64fd9f7 drm/fb-helper: Remove unnecessary include statements
Remove include statements for <drm/drm_fb_helper.h> where it is not
required (i.e., most of them). In a few places include other header
files that are required by the source code.

v3:
	* fix amdgpu include statements
	* fix rockchip include statements

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20221103151446.2638-23-tzimmermann@suse.de
2022-11-05 17:12:04 +01:00
Victor Zhao
ec4927d463 drm/amdgpu: fix for suspend/resume sequence under sriov
- clear kiq ring after suspend/resume under sriov to aviod kiq ring
test failure
- update irq after resume to fix kiq interrput loss

Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-04 16:05:53 -04:00
Yiqing Yao
8a1fbb4a5e drm/amdgpu: Disable MCBP from soc21 for SRIOV
[why]
Start from soc21, CP does not support MCBP, so disable it.

[how]
Used amgpu_mcbp flag alone instead of checking if is in SRIOV to
enable/disable MCBP.
Only set flag to enable on asic_type prior to soc21 in SRIOV.

Signed-off-by: Yiqing Yao <yiqing.yao@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-04 16:05:53 -04:00
Mario Limonciello
7863c15526 drm/amd: Fail the suspend if resources can't be evicted
If a system does not have swap and memory is under 100% usage,
amdgpu will fail to evict resources.  Currently the suspend
carries on proceeding to reset the GPU:

```
[drm] evicting device resources failed
[drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <vcn_v3_0> failed -12
[drm] free PSP TMR buffer
[TTM] Failed allocating page table
[drm] evicting device resources failed
amdgpu 0000:03:00.0: amdgpu: MODE1 reset
amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset
```

At this point if the suspend actually succeeded I think that amdgpu
would have recovered because the GPU would have power cut off and
restored.  However the kernel fails to continue the suspend from the
memory pressure and amdgpu fails to run the "resume" from the aborted
suspend.

```
ACPI: PM: Preparing to enter system sleep state S3
SLUB: Unable to allocate memory on node -1, gfp=0xdc0(GFP_KERNEL|__GFP_ZERO)
  cache: Acpi-State, object size: 80, buffer size: 80, default order: 0, min order: 0
  node 0: slabs: 22, objs: 1122, free: 0
ACPI Error: AE_NO_MEMORY, Could not update object reference count (20210730/utdelete-651)

[drm:psp_hw_start [amdgpu]] *ERROR* PSP load kdb failed!
[drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
[drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -62
amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-62).
PM: dpm_run_callback(): pci_pm_resume+0x0/0x100 returns -62
amdgpu 0000:03:00.0: PM: failed to resume async: error -62
```

To avoid this series of unfortunate events, fail amdgpu's suspend
when the memory eviction fails.  This will let the system gracefully
recover and the user can try suspend again when the memory pressure
is relieved.

Reported-by: post@davidak.de
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2223
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-27 15:12:09 -04:00
wangjianli
12024b1761 amd/amdgpu: fix repeated words in comments
Delete the redundant word 'the'.

Signed-off-by: wangjianli <wangjianli@cdjrlc.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-24 14:38:46 -04:00
Prike Liang
f543d28687 drm/amdgpu: disallow gfxoff until GC IP blocks complete s2idle resume
In the S2idle suspend/resume phase the gfxoff is keeping functional so
some IP blocks will be likely to reinitialize at gfxoff entry and that
will result in failing to program GC registers.Therefore, let disallow
gfxoff until AMDGPU IPs reinitialized completely.

Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-24 14:34:48 -04:00
YuBiao Wang
693073a04d drm/amdgpu: skip mes self test for gc 11.0.3 in recover
Temporary disable mes self teset for gc 11.0.3 during gpu_recovery.

Signed-off-by: YuBiao Wang <YuBiao.Wang@amd.com>
Acked-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-24 14:34:47 -04:00
Evan Quan
3059cd8c5f drm/amd/pm: disable cstate feature for gpu reset scenario
Suggested by PMFW team and same as what did for gfxoff feature.
This can address some Mode1Reset failures observed on SMU13.0.0.

Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 6.0.x
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-18 22:12:20 -04:00
Victor Zhao
a340847b02 Revert "drm/amdgpu: let mode2 reset fallback to default when failure"
This reverts commit dac6b80818.

This commit reverted the AMDGPU_SKIP_MODE2_RESET as it conflicts with
the original design of reset handler. Will redesign it.

Fixes: dac6b80818 ("drm/amdgpu: let mode2 reset fallback to default when failure")
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-18 22:08:33 -04:00
Evan Quan
b31d6ada83 drm/amd/pm: disable cstate feature for gpu reset scenario
Suggested by PMFW team and same as what did for gfxoff feature.
This can address some Mode1Reset failures observed on SMU13.0.0.

Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 6.0.x
2022-10-17 17:41:21 -04:00
Victor Zhao
b98a1648d6 Revert "drm/amdgpu: let mode2 reset fallback to default when failure"
This reverts commit dac6b80818.

This commit reverted the AMDGPU_SKIP_MODE2_RESET as it conflicts with
the original design of reset handler. Will redesign it.

Fixes: dac6b80818 ("drm/amdgpu: let mode2 reset fallback to default when failure")
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-17 17:41:20 -04:00
Bokun Zhang
d7274ec723 drm/amdgpu: Add amdgpu suspend-resume code path under SRIOV
- Under SRIOV, we need to send REQ_GPU_FINI to the hypervisor
  during the suspend time. Furthermore, we cannot request a
  mode 1 reset under SRIOV as VF. Therefore, we will skip it
  as it is called in suspend_noirq() function.

- In the resume code path, we need to send REQ_GPU_INIT to the
  hypervisor and also resume PSP IP block under SRIOV.

Signed-off-by: Bokun Zhang <Bokun.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-29 09:41:46 -04:00
Lijo Lazar
bb66ecbf12 drm/amdgpu: Use simplified API for p2p dist calc
Use the simpified API that calculates distance between two devices.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-29 09:41:42 -04:00
Lijo Lazar
d0fa84f174 drm/amdgpu: Disable verbose for p2p dist calc
Disable verbose while getting p2p distance. With verbose, it shows
warning if ACS redirect is set between the devices. Adds noise
to dmesg logs when a few GPU devices are on the same platform.

Example log:

amdgpu 0000:34:00.0: ACS redirect is set between the client and provider (0000:31:00.0)
amdgpu 0000:34:00.0: to disable ACS redirect for this path, add the kernel parameter:
	pci=disable_acs_redir=0000:30:00.0;0000:2e:00.0;0000:33:00.0;0000:2e:10.0

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-29 09:41:42 -04:00
Christian König
68ce8b2422 drm/amdgpu: add gang submit backend v2
Allows submitting jobs as gang which needs to run on multiple
engines at the same time.

Basic idea is that we have a global gang submit fence representing when the
gang leader is finally pushed to run on the hardware last.

Jobs submitted as gang are never re-submitted in case of a GPU reset since this
won't work and will just deadlock the hardware immediately again.

v2: fix logic inversion, improve documentation, fix rcu

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-20 12:40:32 -04:00
YiPeng Chai
83d29a5f8a drm/amdgpu: Fixed psp fence and memory issues when removing amdgpu device
V3:
Fixed psp fence and memory issues for the asic
using smu v13_0_2 when removing amdgpu device.

[Why]:
1. psp_suspend->psp_free_shared_bufs->
       psp_ta_free_shared_buf->
           amdgpu_bo_free_kernel->
             ...->amdgpu_bo_release_notify->
                    amdgpu_fill_buffer
   psp will free vram memory used by psp when psp_suspend
   is called. But for the asic using smu v13_0_2, because
   psp_suspend is called before adev->shutdown is set to
   true when removing the first hive device, amdgpu fill_buffer
   will be called, which will cause fence issues when evicting
   all vram resources in amdgpu vram mgr_fini.
2. Since psp_hw_fini is not called after calling psp_suspend
   and psp_suspend only calls psp_ring_stop, the psp ring memory
   will not be released when amdgpu device is removed.

[How]:
1. Set shutdown to true before calling amdgpu_device_gpu_recover,
   then amdgpu_fill_buffer will not be called when psp_suspend is
   called.
2. Free psp ring memory in psp_sw_fini.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-19 15:17:47 -04:00
YiPeng Chai
f5c7e77970 drm/amdgpu: Adjust removal control flow for smu v13_0_2
Adjust removal control flow for smu v13_0_2:
   During amdgpu uninstallation, when removing the first
device, the kernel needs to first send a mode1reset message
to all gpu devices. Otherwise, smu initialization will fail
the next time amdgpu is installed.

V2:
1. Update commit comments.
2. Remove the global variable amdgpu_device_remove_cnt
   and add a variable to the structure amdgpu_hive_info.
3. Use hive to detect the first removed device instead of
   a global variable.

V3:
 1. Update commit comments.
 2. Split a patch into multiple patches.
 3. The current patch does:
    a. Add a work mode of AMDGPU_RESET_FOR_DEVICE_REMOVE into
       the existing gpu recover path, which make all devices
       in hive list only have HW reset but no resume (except
       the base IP).
    b. Call AMDGPU_RESET_FOR_DEVICE_REMOVE and
       AMDGPU_NEED_FULL_RESET mode of amdgpu_device_gpu_recover
       in amdgpu_pci_remove when removing the first device in
       hive list.
    c. When removing the first device, the IP blocks keyword
       function call sequence is as follows:
.suspend->mode1reset->.resume(basic ip)->.hw_fini->.early_fini->.sw_fini.
   ^                           |
   |-<----------<---------<----|
	The first three sequences are because of a call to
        amdgpu_device_gpu_recover. The three sequences will be
        executed in a loop until all devices in the hive list
        are iterated.
        The sequences starting from .hw_fini only apply to the
        first device. Since .suspend has been called before,
        except the resumed phase1 basic ip blocks, all other ip
        blocks .hw_fini of current device will do nothing.
     d. When removing other devices, the calling sequences is the
        same as legacy:
	   .hw_fini -> .early_fini -> .sw_fini.
	Since .suspend has been called when removing the first device,
        except the resumed phase1 basic ip blocks, all of other ip
        blocks .hw_fini of current device will do nothing.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-19 15:17:20 -04:00
Alex Deucher
c1c39032a0 drm/amdgpu: make sure to init common IP before gmc
Move common IP init before GMC init so that HDP gets
remapped before GMC init which uses it.

This fixes the Unsupported Request error reported through
AER during driver load. The error happens as a write happens
to the remap offset before real remapping is done.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=216373

The error was unnoticed before and got visible because of the commit
referenced below. This doesn't fix anything in the commit below, rather
fixes the issue in amdgpu exposed by the commit. The reference is only
to associate this commit with below one so that both go together.

Fixes: 8795e182b0 ("PCI/portdrv: Don't disable AER reporting in get_port_device_capability()")

Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-14 12:38:52 -04:00
Vignesh Chander
2efc30f016 drm/amdgpu: Fix hive reference count leak
both get_xgmi_hive and put_xgmi_hive can be skipped since the
reset domain is not necessary for VF

Signed-off-by: Vignesh Chander <Vignesh.Chander@amd.com>
Reviewed-by: Shaoyun Liu <Shaoyun.Liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-13 14:32:57 -04:00
shaoyunl
46c676600c drm/amdgpu: Use per device reset_domain for XGMI on sriov configuration
For SRIOV configuration, host driver control the reset method(either FLR or
heavier chain reset). The host will notify the guest individually with FLR
message if individual GPU within the hive need to be reset. So for guest
side, no need to use hive->reset_domain to replace the original per
device reset_domain

Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-13 12:54:23 -04:00
YiPeng Chai
fac53471d0 drm/amdgpu: TA unload messages are not actually sent to psp when amdgpu is uninstalled
V1:
  The psp_cmd_submit_buf function is called by psp_hw_fini to send
TA unload messages to psp to terminate ras, asd and tmr. But when
amdgpu is uninstalled, drm_dev_unplug is called earlier than
psp_hw_fini in amdgpu_pci_remove, the calling order as follows:
static void amdgpu_pci_remove(struct pci_dev *pdev) {
	drm_dev_unplug
	......
	amdgpu_driver_unload_kms->amdgpu_device_fini_hw->...
		->.hw_fini->psp_hw_fini->...
		->psp_ta_unload->psp_cmd_submit_buf
	......
}
The program will return when calling drm_dev_enter in psp_cmd_submit_buf.

So the call to drm_dev_enter in psp_cmd_submit_buf should be
removed, so that the TA unload messages can be sent to the psp
when amdgpu is uninstalled.

V2:
1. Restore psp_cmd_submit_buf to its original code.
2. Move drm_dev_unplug call after amdgpu_driver_unload_kms in
   amdgpu_pci_remove.
3. Since amdgpu_device_fini_hw is called by amdgpu_driver_unload_kms,
   remove the unplug check to release device mmio resource in
   amdgpu_device_fini_hw before calling drm_dev_unplug.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-07 22:21:00 -04:00
Alex Sierra
ab23c5b9c7 drm/amdgpu: ensure no PCIe peer access for CPU XGMI iolinks
[Why] Devices with CPU XGMI iolink do not support PCIe peer access.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-08-29 17:59:30 -04:00
Chengming Gui
d3ef9d57f2 drm/amd/amdgpu: avoid soft reset check when gpu recovery disabled
Avoid soft reset, even ip hang check (ring/ib test) when gpu recovery
disabled.

v2: add missing "}"

Signed-off-by: Chengming Gui <Jack.Gui@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-08-25 13:35:18 -04:00
shaoyunl
947f63f17e drm/amdgpu: Remove the additional kfd pre reset call for sriov
The additional call is caused by merge conflict

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-08-22 16:47:09 -04:00
YiPeng Chai
9dfa4860ef drm/amdgpu: fix hive reference leak when adding xgmi device
Only amdgpu_get_xgmi_hive but no amdgpu_put_xgmi_hive
which will leak the hive reference.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-08-22 16:47:09 -04:00
André Almeida
0ad7347a64 drm/amd: Add detailed GFXOFF stats to debugfs
Add debugfs interface to log GFXOFF statistics:

- Read amdgpu_gfxoff_count to get the total GFXOFF entry count at the
  time of query since system power-up

- Write 1 to amdgpu_gfxoff_residency to start logging, and 0 to stop.
  Read it to get average GFXOFF residency % multiplied by 100
  during the last logging interval.

Both features are designed to be keep the values persistent between
suspends.

Signed-off-by: André Almeida <andrealmeid@igalia.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-08-16 18:17:31 -04:00
Victor Zhao
72fadb1367 drm/amdgpu: revert context to stop engine before mode2 reset
For some hang caused by slow tests, engine cannot be stopped which
may cause resume failure after reset. In this case, force halt
engine by reverting context addresses

Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Acked-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-08-16 18:14:31 -04:00
Victor Zhao
dac6b80818 drm/amdgpu: let mode2 reset fallback to default when failure
- introduce AMDGPU_SKIP_MODE2_RESET flag
- let mode2 reset fallback to default reset method if failed

v2: move this part out from the asic specific part

Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Acked-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-08-16 18:14:31 -04:00
Lijo Lazar
0a83bb35d8 drm/amdgpu: Avoid another list of reset devices
A list of devices to be reset is already created in
amdgpu_device_gpu_recover function. Creating another list with the
same nodes is incorrect and not supported in list_head. Instead, pass
the device list as part of reset context.

Fixes: 9e08564727 (drm/amdgpu: Refactor mode2 reset logic for v13.0.2)
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-08-10 15:07:14 -04:00
Jack Xiao
ed67f7292b drm/amdgpu: move mes self test after drm sched re-started
mes self test rely on vm mapping, move it after
drm sched re-started so that vm mapping can work
during gpu reset.

Signed-off-by: Jack Xiao <Jack.Xiao@amd.com>
Acked-and-tested-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-07-28 16:28:54 -04:00
Evan Quan
0da0def770 drm/amdgpu: drop non-necessary call trace dump
This extra call trace dump comes out in every gpu reset.
And it gives people a wrong impression that something
went wrong. Although actually there was not.

Signed-off-by: Evan Quan <evan.quan@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-07-28 16:28:54 -04:00
Andrey Grodzovsky
f6a3f66063 drm/amdgpu: Get rid of amdgpu_job->external_hw_fence
This is a follow-up cleanup to [1]. See bellow refcount balancing
for calling amdgpu_job_submit_direct after this cleanup as far
as I calculated.

amdgpu_fence_emit
	dma_fence_init 1
	dma_fence_get(fence) 2
	rcu_assign_pointer(*ptr, dma_fence_get(fence) 3

---> amdgpu_job_submit_direct completes before fence signaled
			amdgpu_sa_bo_free
				(*sa_bo)->fence = dma_fence_get(fence) 4

			amdgpu_job_free
				dma_fence_put 3

			amdgpu_vcn_enc_get_destroy_msg
				*fence = dma_fence_get(f) 4
				dma_fence_put(f); 3

			amdgpu_vcn_enc_ring_test_ib
				dma_fence_put(fence) 2

			amdgpu_fence_process
				dma_fence_put 1

			amdgpu_sa_bo_remove_locked
				dma_fence_put 0

---> amdgpu_job_submit_direct completes after fence signaled
			amdgpu_fence_process
				dma_fence_put 2

			amdgpu_job_free
				dma_fence_put 1

			amdgpu_vcn_enc_get_destroy_msg
				*fence = dma_fence_get(f) 2
				dma_fence_put(f); 1

			amdgpu_vcn_enc_ring_test_ib
				dma_fence_put(fence) 0

[1] - https://patchwork.kernel.org/project/dri-devel/cover/20220624180955.485440-1-andrey.grodzovsky@amd.com/

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-07-18 16:37:25 -04:00
Likun Gao
f1549c09c5 drm/amdgpu: support reset flag set for gpu reset
Move reset_context out of gpu recover function to make it configurable
for different reset purpose.
For the reset way of call gpu_recovery sysfs, force to use full reset
method. Otherwise, try soft reset by default if the related ASIC
supportted, if soft reset failed, will use full reset.

Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-07-13 11:25:17 -04:00
Alex Deucher
6e9c65f71e drm/amdgpu: fix documentation warning
Fixes this issue:
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:5094: warning: expecting prototype for amdgpu_device_gpu_recover_imp(). Prototype was for amdgpu_device_gpu_recover() instead

Fixes: cf72704414 ("drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover")
Reviewed-by: Kent Russell <kent.russell@amd.com>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-29 22:03:51 -04:00
Kent Russell
d193b12b2f drm/amdgpu: Fix typos in amdgpu_stop_pending_resets
Change amdggpu to amdgpu and pedning to pending

Signed-off-by: Kent Russell <kent.russell@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-29 17:10:21 -04:00
Andrey Grodzovsky
9ae55f030d drm/amdgpu: Follow up change to previous drm scheduler change.
Align refcount behaviour for amdgpu_job embedded HW fence with
classic pointer style HW fences by increasing refcount each
time emit is called so amdgpu code doesn't need to make workarounds
using amdgpu_job.job_run_counter to keep the HW fence refcount balanced.

Also since in the previous patch we resumed setting s_fence->parent to NULL
in drm_sched_stop switch to directly checking if job->hw_fence is
signaled to short circuit reset if already signed.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Tested-by: Yiqing Yao <yiqing.yao@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-28 11:24:41 -04:00
Andrey Grodzovsky
9e225fb9e6 drm/amdgpu: Prevent race between late signaled fences and GPU reset.
Problem:
After we start handling timed out jobs we assume there fences won't be
signaled but we cannot be sure and sometimes they fire late. We need
to prevent concurrent accesses to fence array from
amdgpu_fence_driver_clear_job_fences during GPU reset and amdgpu_fence_process
from a late EOP interrupt.

Fix:
Before accessing fence array in GPU disable EOP interrupt and flush
all pending interrupt handlers for amdgpu device's interrupt line.

v2: Switch from irq_get/put to full enable/disable_irq for amdgpu

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-28 11:24:24 -04:00
Alex Deucher
163d4cd26a drm/amdgpu: fix adev variable used in amdgpu_device_gpu_recover()
Use the correct adev variable for the drm_fb_helper in
amdgpu_device_gpu_recover().  Noticed by inspection.

Fixes: 087451f372 ("drm/amdgpu: use generic fb helpers instead of setting up AMD own's.")
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-21 18:17:24 -04:00
Yifan Zhang
68ad7f90c7 drm/amdgpu: remove redundant enable_mes and enable_mes_kiq
enable_mes and enable_mes_kiq are set in both device init and
MES IP init. Leave the ones in MES IP init, since it is
a more accurate way to judge from GC IP version.

Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Jack Xiao <Jack.Xiao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-14 21:38:41 -04:00
Andrey Grodzovsky
247c7b0dac drm/amdgpu: Stop any pending reset if another in progress.
We skip rest requests if another one is already in progress.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-10 15:26:18 -04:00
Andrey Grodzovsky
cf72704414 drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover
We removed the wrapper that was queueing the recover function
into reset domain queue who was using this name.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-10 15:26:12 -04:00
Andrey Grodzovsky
b5fd0cf3ea drm/amdgpu: Add work_struct for GPU reset from kfd.
We need to have a work_struct to cancel this reset if another
already in progress.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-10 15:26:07 -04:00
Andrey Grodzovsky
ab9a0b1f36 drm/amdgpu: Cache result of last reset at reset domain level.
Will be read by executors of async reset like debugfs.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-10 15:25:34 -04:00
Ramesh Errabolu
08a2fd23c6 drm/amdgpu: Add peer-to-peer support among PCIe connected AMD GPUs
Add support for peer-to-peer communication among AMD GPUs over PCIe
bus. Support REQUIRES enablement of config HSA_AMD_P2P.

Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-08 11:40:12 -04:00
Somalapuram Amaranath
3d8785f6c0 drm/amdgpu: adding device coredump support
Added device coredump information:
- Kernel version
- Module
- Time
- VRAM status
- Guilty process name and PID
- GPU register dumps
v1 -> v2: Variable name change
v1 -> v2: NULL check
v1 -> v2: Code alignment
v1 -> v2: Adding dummy amdgpu_devcoredump_free
v1 -> v2: memset reset_task_info to zero
v2 -> v3: add CONFIG_DEV_COREDUMP for variables
v2 -> v3: remove NULL check on amdgpu_devcoredump_read

Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
Reviewed-by: Shashank Sharma <Shashank.sharma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-06 14:41:19 -04:00
Somalapuram Amaranath
651d7ee63f drm/amdgpu: save the reset dump register value for devcoredump
Allocate memory for register value and use the same values for devcoredump.
v1 -> v2: Change krealloc_array() to kmalloc_array()
v2 -> v3: Fix alignment

Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
Reviewed-by: Shashank Sharma <Shashank.sharma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-06 14:41:12 -04:00
Alex Deucher
b5a0168e14 drm/amdgpu: fix up comment in amdgpu_device_asic_has_dc_support()
LVDS support was implemented in DC a while ago.  Just DAC
support is left to do.

Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-03 16:45:01 -04:00
Alex Deucher
1d6c363330 drm/amdgpu: simplify the logic in amdgpu_device_parse_gpu_info_fw()
Drop all of the extra cases in the default case.

Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-03 16:45:00 -04:00
Mitchell Augustin
f74e78ca90 amdgpu: amdgpu_device.c: Removed trailing whitespace
Removed trailing whitespace from end of line in amdgpu_device.c

Signed-off-by: Mitchell Augustin <kernel@mitchellaugustin.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-03 16:43:36 -04:00
Alex Deucher
b8b64595d6 drm/amdgpu: simplify amdgpu_device_asic_has_dc_support()
Drop extra cases in the default case.

Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-03 16:43:36 -04:00
Sunil Khatri
4d33e7040d drm/amdgpu: move amdgpu_gmc_tmz_set after ip_version populated
To enable TMZ feature based on IP version needs adev->ip_version
populated but its empty. Move amdgpu_gmc_tmz_set to a place where
ip_version is populated.

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-26 14:56:32 -04:00
Stanley.Yang
950d64250f drm/amdgpu: support ras on SRIOV
support umc/gfx/sdma ras on guest side

Changed from V1:
    move sriov judgment in amdgpu_ras_interrupt_fatal_error_handler

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-26 14:56:32 -04:00
Likun Gao
8424f2ccb3 drm/amdgpu/psp: Add vbflash sysfs interface support
Add sysfs interface to copy VBIOS.

v2: squash in fix for proper vmalloc API (Alex)

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-10 17:53:10 -04:00
Yiqing Yao
98f5618846 drm/amdgpu: flush delete wq after wait fence
[why]
lru_list not empty warning in sw fini during repeated device bind unbind.
There should be a amdgpu_fence_wait_empty() before the flush_delayed_work()
call as Christian suggested.

[how]
Move to do flush_delayed_work for ttm bo delayed delete wq after fence_driver_hw_fini.

Tested by: Yiqing Yao <yiqing.yao@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Yiqing Yao <yiqing.yao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-06 16:56:41 -04:00
Mukul Joshi
c004d44e10 drm/amdgpu: Enable KFD with MES enabled
Enable KFD initialization with MES enabled.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Acked-by: Oak Zeng <Oak.Zeng@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-04 10:43:52 -04:00
Jack Xiao
9c12f5cd06 drm/amdgpu: skip kfd routines when mes enabled
For kfd hasn't supported mes, skip kfd routines.

Signed-off-by: Jack Xiao <Jack.Xiao@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-04 10:43:52 -04:00
Jack Xiao
928fe236c0 drm/amdgpu: add mes_kiq module parameter v2
mes_kiq parameter is used to enable mes kiq pipe.
This module parameter is unneccessary or enabled by default
in final version.

v2: reword commit message.

Signed-off-by: Jack Xiao <Jack.Xiao@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-04 10:43:49 -04:00
Jack Xiao
de33a32968 drm/amdgpu: use the whole doorbell space for mes
Use the whole doorbell space for mes. Each queue in one process occupies
one doorbell slot to ring the queue submitting.

Signed-off-by: Jack Xiao <Jack.Xiao@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-04 10:04:01 -04:00
Hawking Zhang
85d1bcc6e0 drm/amdgpu: switch to atomfirmware_asic_init
Some initial settings now are not available from
the atom data table. The assumption that !ps[0]
|| !ps[1] in amdgpu_atom_asic_init is not valid.
In addition, driver needs to strictly follow
atomfirmware structure (asic_init_parameters) to
initialize parameters used to execute asic_init
function, otherwise, the execution of asic_init
would fail.

This shall be applicable to all soc15 adapters,but
let make the transition on soc21 first.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-28 17:48:00 -04:00
Alex Deucher
e24d0e91b3 drm/amdgpu/discovery: move all table parsing into amdgpu_discovery.c
This data has no dependencies, so encapsulate it all within
amdgpu_discovery.c.

Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-28 17:47:43 -04:00
Bokun Zhang
e15c9d06e9 drm/amd/amdgpu: Update PF2VF header
- In the latest version of the header, there is a variable name change.
  This should not cause any backward compatibility since the variable is
  at the same offset in the struct.

Signed-off-by: Bokun Zhang <Bokun.Zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-21 16:00:14 -04:00
Evan Quan
25faeddcf3 drm/amdgpu: expand cg_flags from u32 to u64
With this, we can support more CG flags.

Signed-off-by: Evan Quan <evan.quan@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-08 17:24:24 -04:00
Alex Deucher
b818a5d374 drm/amdgpu/gmc: use PCI BARs for APUs in passthrough
If the GPU is passed through to a guest VM, use the PCI
BAR for CPU FB access rather than the physical address of
carve out.  The physical address is not valid in a guest.

v2: Fix HDP handing as suggested by Michel

Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Michel Dänzer <mdaenzer@redhat.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-25 12:40:24 -04:00
Philip Yang
436afdfa35 drm/amdgpu: Move reset domain init before calling RREG32
amdgpu_detect_virtualization reads register, amdgpu_device_rreg access
adev->reset_domain->sem if kernel defined CONFIG_LOCKDEP, below is the
random boot hang backtrace on Vega10. It may get random NULL pointer
access backtrace if amdgpu_sriov_runtime is true too.

Move amdgpu_reset_create_reset_domain before calling to RREG32.

 BUG: kernel NULL pointer dereference, address:
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] PREEMPT SMP NOPTI
 Workqueue: events work_for_cpu_fn
 RIP: 0010:down_read_trylock+0x13/0xf0
 Call Trace:
  <TASK>
  amdgpu_device_skip_hw_access+0x38/0x80 [amdgpu]
  amdgpu_device_rreg+0x1b/0x170 [amdgpu]
  amdgpu_detect_virtualization+0x73/0x100 [amdgpu]
  amdgpu_device_init.cold.60+0xbe/0x16b1 [amdgpu]
  ? pci_bus_read_config_word+0x43/0x70
  amdgpu_driver_load_kms+0x15/0x120 [amdgpu]
  amdgpu_pci_probe+0x1a1/0x3a0 [amdgpu]

Fixes: d0fb18b535 ("drm/amdgpu: Move reset sem into reset_domain")
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-15 14:40:49 -04:00
Alex Deucher
85ac2021fe drm/amdgpu: only check for _PR3 on dGPUs
We don't support runtime pm on APUs.  They support more
dynamic power savings using clock and powergating.

Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Tested-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-15 14:34:41 -04:00
Alex Deucher
1b537e6410 drm/amdgpu: remove unused gpu_info firmwares
These were leftover from bring up and are no longer
necessary.  The information is available via
the IP discovery table.

Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-02 18:40:07 -05:00
Yifan Zhang
957b0787ee drm/amdgpu: move amdgpu_gmc_noretry_set after ip_versions populated
otherwise adev->ip_versions is still empty when amdgpu_gmc_noretry_set
is called.

Reviewed-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-02 18:40:06 -05:00
Dave Airlie
38a15ad948 Merge tag 'amd-drm-next-5.18-2022-02-25' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-5.18-2022-02-25:

amdgpu:
- Raven2 suspend/resume fix
- SDMA 5.2.6 updates
- VCN 3.1.2 updates
- SMU 13.0.5 updates
- DCN 3.1.5 updates
- Virtual display fixes
- SMU code cleanup
- Harvest fixes
- Expose benchmark tests via debugfs
- Drop no longer relevant gart aperture tests
- More RAS restructuring
- W=1 fixes
- PSR rework
- DP/VGA adapter fixes
- DP MST fixes
- GPUVM eviction fix
- GPU reset debugfs register dumping support
- Misc display fixes
- SR-IOV fix
- Aldebaran mGPU fix
- Add module parameter to disable XGMI for testing

amdkfd:
- IH ring overflow logging fixes
- CRIU fixes
- Misc fixes

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220225183535.5907-1-alexander.deucher@amd.com
2022-03-01 16:19:02 +10:00
Andrey Grodzovsky
2656fd230d drm/amdgpu: Exclude PCI reset method for now.
According to my investigation of the state of PCI
reset recently it's not working. The reason is
due to the fact the kernel PCI code rejects SBR
when there are more then one PF under same bridge
which we always have (at least AUDIO PF but usually
more) and that because SBR will reset all the PFS
and devices under the same bridge as you and you
cannot assume they support SBR.
Once we anble FLR support we can reenable this option as
FLR is doable on single PF and doens't have this
restriction.

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-24 17:25:20 -05:00
Dave Airlie
54f43c17d6 drm-misc-next for v5.18:
UAPI Changes:
 
 Cross-subsystem Changes:
 - Split out panel-lvds and lvds dt bindings .
 - Put yes/no on/off disabled/enabled strings in linux/string_helpers.h
   and use it in drivers and tomoyo.
 - Clarify dma_fence_chain and dma_fence_array should never include eachother.
 - Flatten chains in syncobj's.
 - Don't double add in fbdev/defio when page is already enlisted.
 - Don't sort deferred-I/O pages by default in fbdev.
 
 Core Changes:
 - Fix missing pm_runtime_put_sync in bridge.
 - Set modifier support to only linear fb modifier if drivers don't
   advertise support.
 - As a result, we remove allow_fb_modifiers.
 - Add missing clear for EDID Deep Color Modes in drm_reset_display_info.
 - Assorted documentation updates.
 - Warn once in drm_clflush if there is no arch support.
 - Add missing select for dp helper in drm_panel_edp.
 - Assorted small fixes.
 - Improve fb-helper's clipping handling.
 - Don't dump shmem mmaps in a core dump.
 - Add accounting to ttm resource manager, and use it in amdgpu.
 - Allow querying the detected eDP panel through debugfs.
 - Add helpers for xrgb8888 to 8 and 1 bits gray.
 - Improve drm's buddy allocator.
 - Add selftests for the buddy allocator.
 
 Driver Changes:
 - Add support for nomodeset to a lot of drm drivers.
 - Use drm_module_*_driver in a lot of drm drivers.
 - Assorted small fixes to bridge/lt9611, v3d, vc4, vmwgfx, mxsfb, nouveau,
   bridge/dw-hdmi, panfrost, lima, ingenic, sprd, bridge/anx7625, ti-sn65dsi86.
 - Add bridge/it6505.
 - Create DP and DVI-I connectors in ast.
 - Assorted nouveau backlight fixes.
 - Rework amdgpu reset handling.
 - Add dt bindings for ingenic,jz4780-dw-hdmi.
 - Support reading edid through aux channel in ingenic.
 - Add a drm driver for Solomon SSD130x OLED displays.
 - Add simple support for sharp LQ140M1JW46.
 - Add more panels to nt35560.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEuXvWqAysSYEJGuVH/lWMcqZwE8MFAmIWLSEACgkQ/lWMcqZw
 E8OP7hAAjix94EX5fhFa7OAdqUbFtsiKhK/4zNtV9FWpFiEsDBz+dlbfDQWIx5an
 FIiiiQtSfWjpDv6pcMhoNf80w+dDbc/Cuauz6nNGO7Pkaerh2D/EPG74FD7f7nE3
 EIScVs1heYtzM9usKrFKupNYgIdgZxmeovClWuE0OTjLOes2PGvvxXK6iJqNqJMX
 VlDO5SR7GRqsDUDV6vmwl63uKL77xJXAahAXIx+BQ/1xrtEhlu6NwsgHIsmPmMSN
 YluX34zc1xD/6/uUqvEdp7u46/5/He1c5Q/ia1WV3wRxsO/eMZ+axXqCZP3XGZdt
 rMdGNtj1MWKkudYiowStWkCVSG/0fXJCFIAhvRmeZy+YqPdVlqZ2W7g4H1l9iJoo
 UVfT9cHrKoxHsukvIEckC5Ov9v1yr39Bd4wUuqaUTUSxY8VID5vjY63TsXl9Zke1
 SluTFe9qybbnRNz/hYRvwIS1eT8HvUauAfAhypGTLI5DYHTD7PawcfMJkNzCtJm4
 Ta4SC3rTpkpN+7oc8SoNgqRHQ8U9KL5oksP0wVa8vwHsMptSd3X4pUljc6TcfjLv
 GEo41D5AuJz3HRVcn9yqPbLoPE2FFB7bfwIMH77yNnoos4Izy/LGhKpN0YdImmI5
 W5XVFB0jltGSIhkzLe1mFpLrdJwdUTSUVeCK4H5PhZZQEHLkVtg=
 =HuwD
 -----END PGP SIGNATURE-----

Merge tag 'drm-misc-next-2022-02-23' of git://anongit.freedesktop.org/drm/drm-misc into drm-next

drm-misc-next for v5.18:

UAPI Changes:

Cross-subsystem Changes:
- Split out panel-lvds and lvds dt bindings .
- Put yes/no on/off disabled/enabled strings in linux/string_helpers.h
  and use it in drivers and tomoyo.
- Clarify dma_fence_chain and dma_fence_array should never include eachother.
- Flatten chains in syncobj's.
- Don't double add in fbdev/defio when page is already enlisted.
- Don't sort deferred-I/O pages by default in fbdev.

Core Changes:
- Fix missing pm_runtime_put_sync in bridge.
- Set modifier support to only linear fb modifier if drivers don't
  advertise support.
- As a result, we remove allow_fb_modifiers.
- Add missing clear for EDID Deep Color Modes in drm_reset_display_info.
- Assorted documentation updates.
- Warn once in drm_clflush if there is no arch support.
- Add missing select for dp helper in drm_panel_edp.
- Assorted small fixes.
- Improve fb-helper's clipping handling.
- Don't dump shmem mmaps in a core dump.
- Add accounting to ttm resource manager, and use it in amdgpu.
- Allow querying the detected eDP panel through debugfs.
- Add helpers for xrgb8888 to 8 and 1 bits gray.
- Improve drm's buddy allocator.
- Add selftests for the buddy allocator.

Driver Changes:
- Add support for nomodeset to a lot of drm drivers.
- Use drm_module_*_driver in a lot of drm drivers.
- Assorted small fixes to bridge/lt9611, v3d, vc4, vmwgfx, mxsfb, nouveau,
  bridge/dw-hdmi, panfrost, lima, ingenic, sprd, bridge/anx7625, ti-sn65dsi86.
- Add bridge/it6505.
- Create DP and DVI-I connectors in ast.
- Assorted nouveau backlight fixes.
- Rework amdgpu reset handling.
- Add dt bindings for ingenic,jz4780-dw-hdmi.
- Support reading edid through aux channel in ingenic.
- Add a drm driver for Solomon SSD130x OLED displays.
- Add simple support for sharp LQ140M1JW46.
- Add more panels to nt35560.

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/686ec871-e77f-c230-22e5-9e3bb80f064a@linux.intel.com
2022-02-25 05:50:18 +10:00
Somalapuram Amaranath
15fd09a05a drm/amdgpu: add reset register dump trace on GPU
Dump the list of register values to trace event on GPU reset.

Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-23 14:26:36 -05:00
Alex Deucher
b784f42cf7 drm/amdgpu: drop testing module parameter
This test is not particularly useful now that GTT and GART
are decoupled in the driver.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-23 14:02:51 -05:00
Alex Deucher
0b1a63487b drm/amdgpu: drop benchmark module parameter
Now that we expose the benchmarks via debugfs, there is no
longer a need for the module parameter.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-23 14:02:51 -05:00
Alex Deucher
f113cc32e3 drm/amdgpu: add a benchmark mutex
To avoid multiple runs in parallel to avoid mixing results.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-23 14:02:50 -05:00
Jiawei Gu
8ab62eda17 drm/sched: Add device pointer to drm_gpu_scheduler
Add device pointer so scheduler's printing can use
DRM_DEV_ERROR() instead, which makes life easier under multiple GPU
scenario.

v2: amend all calls of drm_sched_init()
v3: fill dev pointer for all drm_sched_init() calls

Signed-off-by: Jiawei Gu <Jiawei.Gu@amd.com>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220221095705.5290-1-Jiawei.Gu@amd.com
2022-02-23 10:04:14 +01:00
Mario Limonciello
0ab5d711ec drm/amd: Refactor amdgpu_aspm to be evaluated per device
Evaluating `pcie_aspm_enabled` as part of driver probe has the implication
that if one PCIe bridge with an AMD GPU connected doesn't support ASPM
then none of them do.  This is an invalid assumption as the PCIe core will
configure ASPM for individual PCIe bridges.

Create a new helper function that can be called by individual dGPUs to
react to the `amdgpu_aspm` module parameter without having negative results
for other dGPUs on the PCIe bus.

Suggested-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-17 15:59:05 -05:00
yipechai
867e24ca49 drm/amdgpu: define amdgpu_ras_late_init to call all ras blocks' .ras_late_init
Define amdgpu_ras_late_init to call all ras blocks' .ras_late_init.

Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-17 15:59:05 -05:00
Alex Deucher
dfcc3e8c24 drm/amdgpu: make cyan skillfish support code more consistent
Since this is an existing asic, adjust the code to follow
the same logic as previously so the driver state is consistent.

No functional change intended.

Acked-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-16 17:30:02 -05:00
Surbhi Kakarya
7258fa31ea drm/amdgpu: Handle the GPU recovery failure in SRIOV environment.
This patch handles the GPU recovery failure in sriov environment by
retrying the reset if the first reset fails. To determine the condition
of retry, a new macro AMDGPU_RETRY_SRIOV_RESET is added which returns
true if failure is due to ETIMEDOUT, EINVAL or EBUSY, otherwise return
false.A new macro AMDGPU_MAX_RETRY_LIMIT is used to limit the retry to 2.

It also handles the return status in Post Asic Reset by updating the return
code with asic_reset_res and eventually return the return code in
amdgpu_job_timedout().

Signed-off-by: Surbhi Kakarya <surbhi.kakarya@amd.com>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-14 15:08:41 -05:00
Rajneesh Bhardwaj
7157934699 drm/amdgpu: Fix a kerneldoc warning
Add missing parameters to fix a kerneldoc warning

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-14 15:08:40 -05:00
Andrey Grodzovsky
c7703ce38c drm/amdgpu: Fix htmldoc warning
Update function name.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220211205500.601391-1-andrey.grodzovsky@amd.com
2022-02-11 16:05:08 -05:00
Andrey Grodzovsky
3675c2f26f drm/amdgpu: Revert 'drm/amdgpu: annotate a false positive recursive locking'
Since we have a single instance of reset semaphore which we
lock only once even for XGMI hive we don't need the nested
locking hint anymore.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74120.html
2022-02-09 12:19:14 -05:00
Andrey Grodzovsky
e923be9934 drm/amdgpu: Rework amdgpu_device_lock_adev
This functions needs to be split into 2 parts where
one is called only once for locking single instance of
reset_domain's sem and reset flag and the other part
which handles MP1 states should still be called for
each device in XGMI hive.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74118.html
2022-02-09 12:18:39 -05:00
Andrey Grodzovsky
89a7a87093 drm/amdgpu: Move in_gpu_reset into reset_domain
We should have a single instance per entrire reset domain.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Suggested-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74116.html
2022-02-09 12:17:57 -05:00
Andrey Grodzovsky
d0fb18b535 drm/amdgpu: Move reset sem into reset_domain
We want single instance of reset sem across all
reset clients because in case of XGMI we should stop
access cross device MMIO because any of them could be
in a reset in the moment.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74117.html
2022-02-09 12:17:32 -05:00
Andrey Grodzovsky
cfbb6b0047 drm/amdgpu: Rework reset domain to be refcounted.
The reset domain contains register access semaphor
now and so needs to be present as long as each device
in a hive needs it and so it cannot be binded to XGMI
hive life cycle.
Adress this by making reset domain refcounted and pointed
by each member of the hive and the hive itself.

v4:

Fix crash on boot witrh XGMI hive by adding type to reset_domain.
XGMI will only create a new reset_domain if prevoius was of single
device type meaning it's first boot. Otherwsie it will take a
refocunt to exsiting reset_domain from the amdgou device.

Add a wrapper around reset_domain->refcount get/put
and a wrapper around send to reset wq (Lijo)

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74121.html
2022-02-09 12:17:09 -05:00
Andrey Grodzovsky
f287a3c5b0 drm/amdgpu: Drop concurrent GPU reset protection for device
Since now all GPU resets are serialzied there is no need for this.

This patch also reverts 'drm/amdgpu: race issue when jobs on 2 ring timeout'

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74119.html
2022-02-09 12:16:53 -05:00
Andrey Grodzovsky
681260df4d drm/amdgpu: Drop hive->in_reset
Since we serialize all resets no need to protect from concurrent
resets.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74115.html
2022-02-09 12:16:29 -05:00
Andrey Grodzovsky
54f329cc7a drm/amdgpu: Serialize non TDR gpu recovery with TDRs
Use reset domain wq also for non TDR gpu recovery trigers
such as sysfs and RAS. We must serialize all possible
GPU recoveries to gurantee no concurrency there.
For TDR call the original recovery function directly since
it's already executed from within the wq. For others just
use a wrapper to qeueue work and wait on it to finish.

v2: Rename to amdgpu_recover_work_struct

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74113.html
2022-02-09 12:15:23 -05:00
Andrey Grodzovsky
5fd8518d18 drm/amdgpu: Move scheduler init to after XGMI is ready
Before we initialize schedulers we must know which reset
domain are we in - for single device there iis a single
domain per device and so single wq per device. For XGMI
the reset domain spans the entire XGMI hive and so the
reset wq is per hive.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74112.html
2022-02-09 12:15:04 -05:00
Andrey Grodzovsky
a4c63cafa5 drm/amdgpu: Introduce reset domain
Defined a reset_domain struct such that
all the entities that go through reset
together will be serialized one against
another. Do it for both single device and
XGMI hive cases.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Suggested-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Suggested-by: Christian König <ckoenig.leichtzumerken@gmail.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74111.html
2022-02-09 12:14:32 -05:00
Changcheng Deng
6a77bce58c drm/amdgpu: remove duplicate include in 'amdgpu_device.c'
'linux/pci.h' included in 'amdgpu_device.c' is duplicated.

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Changcheng Deng <deng.changcheng@zte.com.cn>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-02 18:26:31 -05:00
Alex Deucher
8cda7a4f96 drm/amdgpu/UAPI: add new CTX OP to get/set stable pstates
Add a new CTX ioctl operation to set stable pstates for profiling.
When creating traces for tools like RGP or using SPM or doing
performance profiling, it's required to enable a special
stable profiling power state on the GPU.  These profiling
states set fixed clocks and disable certain other power
features like powergating which may impact the results.

Historically, these profiling pstates were enabled via sysfs,
but this adds an interface to enable it via the CTX ioctl
from the application.  Since the power state is global
only one application can set it at a time, so if multiple
applications try and use it only the first will get it,
the ioctl will return -EBUSY for others.  The sysfs interface
will override whatever has been set by this interface.

Mesa MR: https://gitlab.freedesktop.org/mesa/drm/-/merge_requests/207

v2: don't default r = 0;
v3: rebase on Evan's PM cleanup

Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-27 15:50:08 -05:00
Tianci.Yin
7270e8957e drm/amdgpu: Fix an error message in rmmod
[why]
In rmmod procedure, kfd sends cp a dequeue request, but the
request does not get response, then an error message "cp
queue pipe 4 queue 0 preemption failed" printed.

[how]
Performing kfd suspending after disabling gfxoff can fix it.

Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Tianci.Yin <tianci.yin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-27 15:48:30 -05:00
Alex Deucher
901e2be20d drm/amdgpu: move PX checking into amdgpu_device_ip_early_init
We need to set the APU flag from IP discovery before
we evaluate this code.

Acked-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-25 18:00:36 -05:00
Hawking Zhang
1b2dc99e2d drm/amdgpu: switch to amdgpu_sriov_rreg/wreg
Instead of ip specific implementation for rlcg
indirect register access

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Zhou, Peng Ju <PengJu.Zhou@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-25 18:00:33 -05:00
Xiaojian Du
86700a4026 drm/amdgpu: modify a pair of functions for the pcie port wreg/rreg
This patch will modify a pair of functions for pcie port wreg/rreg.
AMD GPU have had an independent NBIO block from SOC15 arch.
If the dirver wants to read/write the address space of the pcie devices,
it has to go through the NBIO block.
This patch will move the pcie port wreg/rreg functions to
"amdgpu_device.c", so that to reuse the functions on the
future GPU ASICs.

Signed-off-by: Xiaojian Du <Xiaojian.Du@amd.com>
Reviewed-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-19 22:32:11 -05:00
Jingwen Chen
9a458402fb drm/amd/amdgpu: fixing read wrong pf2vf data in SRIOV
[Why]
This fixes 892deb4826 ("drm/amdgpu: Separate vf2pf work item init from virt data exchange").
we should read pf2vf data based at mman.fw_vram_usage_va after gmc
sw_init. commit 892deb4826 breaks this logic.

[How]
calling amdgpu_virt_exchange_data in amdgpu_virt_init_data_exchange to
set the right base in the right sequence.

v2:
call amdgpu_virt_init_data_exchange after gmc sw_init to make data
exchange workqueue run

v3:
clean up the code logic

v4:
add some comment and make the code more readable

Fixes: 892deb4826 ("drm/amdgpu: Separate vf2pf work item init from virt data exchange")
Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
Reviewed-by: Horace Chen <horace.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-18 18:00:58 -05:00
Jingwen Chen
22c16d251a drm/amd/amdgpu: fixing read wrong pf2vf data in SRIOV
[Why]
This fixes 892deb4826 ("drm/amdgpu: Separate vf2pf work item init from virt data exchange").
we should read pf2vf data based at mman.fw_vram_usage_va after gmc
sw_init. commit 892deb4826 breaks this logic.

[How]
calling amdgpu_virt_exchange_data in amdgpu_virt_init_data_exchange to
set the right base in the right sequence.

v2:
call amdgpu_virt_init_data_exchange after gmc sw_init to make data
exchange workqueue run

v3:
clean up the code logic

v4:
add some comment and make the code more readable

Fixes: 892deb4826 ("drm/amdgpu: Separate vf2pf work item init from virt data exchange")
Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
Reviewed-by: Horace Chen <horace.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-18 17:41:03 -05:00
Alex Deucher
0ffb1fd158 drm/amdgpu: invert the logic in amdgpu_device_should_recover_gpu()
Rather than opting into GPU recovery support, default to on, and
opt out if it's not working on a particular GPU.  This avoids the
need to add new asics to this list since this is a core feature.

Reviewed-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-14 18:06:44 -05:00
CHANDAN VURDIGERE NATARAJ
4175c32be5 drm/amdgpu: Enable recovery on yellow carp
Add yellow carp to devices which support recovery

Signed-off-by: CHANDAN VURDIGERE NATARAJ <chandan.vurdigerenataraj@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-14 18:06:44 -05:00
Alex Deucher
b3523c4573 drm/amdgpu: invert the logic in amdgpu_device_should_recover_gpu()
Rather than opting into GPU recovery support, default to on, and
opt out if it's not working on a particular GPU.  This avoids the
need to add new asics to this list since this is a core feature.

Reviewed-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-14 17:52:50 -05:00
CHANDAN VURDIGERE NATARAJ
31425abeda drm/amdgpu: Enable recovery on yellow carp
Add yellow carp to devices which support recovery

Signed-off-by: CHANDAN VURDIGERE NATARAJ <chandan.vurdigerenataraj@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-14 17:52:50 -05:00
Yang Li
ab3b9de65b drm/amdgpu: clean up some inconsistent indenting
Eliminate the follow smatch warnings:
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:3504 amdgpu_device_init()
warn: inconsistent indenting
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1716
amdgpu_ras_error_status_query() warn: if statement not indented
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1058 amdgpu_ras_error_inject()
warn: inconsistent indenting

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-14 17:52:10 -05:00
yipechai
5e67bba301 drm/amdgpu: Modify mmhub block to fit for the unified ras block data and ops
1.Modify mmhub block to fit for the unified ras block data and ops.
2.Change amdgpu_mmhub_ras_funcs to amdgpu_mmhub_ras, and the corresponding variable name remove _funcs suffix.
3.Remove the const flag of mmhub ras variable so that mmhub ras block can be able to be inserted into amdgpu device ras block link list.
4.Invoke amdgpu_ras_register_ras_block function to register mmhub ras block into amdgpu device ras block link list. 5.Remove the redundant code about mmhub in amdgpu_ras.c after using the unified ras block.
5.Remove the redundant code about mmhub in amdgpu_ras.c after using the unified ras block.
6.Fill unified ras block .name .block .ras_late_init and .ras_fini for all of mmhub versions. If .ras_late_init and .ras_fini had been defined by the selected mmhub version, the defined functions will take effect; if not defined, default fill them with amdgpu_mmhub_ras_late_init and amdgpu_mmhub_ras_fini.

Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-14 17:51:59 -05:00
yipechai
6492e1b07c drm/amdgpu: Unify ras block interface for each ras block
1. Define unified ops interface for each block.
2. Add ras_block_match function pointer in ops interface, each ras block can customize specail match function to identify itself.
3. Add amdgpu_ras_block_match_default new function. If a ras block doesn't define .ras_block_match, default execute amdgpu_ras_block_match_default to identify this ras block.
4. Define unified basic ras block data for each ras block.
5. Create dedicated amdgpu device ras block link list to manage all of the ras blocks.
6. Add amdgpu_ras_register_ras_block new function interface for each ras block to register itself to ras controlling block.

Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-14 17:51:59 -05:00
Evan Quan
bc143d8b83 drm/amd/pm: do not expose implementation details to other blocks out of power
Those implementation details(whether swsmu supported, some ppt_funcs supported,
accessing internal statistics ...)should be kept internally. It's not a good
practice and even error prone to expose implementation details.

Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-14 17:51:14 -05:00
Prike Liang
4eaf21b752 drm/amdgpu: not return error on the init_apu_flags
In some APU project we needn't always assign flags to identify each other,
so we may not need return an error.

Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-11 15:44:27 -05:00
Tom St Denis
4cc9f86f85 drm/amd/amdgpu: Add pcie indirect support to amdgpu_mm_wreg_mmio_rlc()
The function amdgpu_mm_wreg_mmio_rlc() is used by debugfs to write to
MMIO registers.  It didn't support registers beyond the BAR mapped MMIO
space.  This adds pcie indirect write support.

Signed-off-by: Tom St Denis <tom.stdenis@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-11 15:44:26 -05:00
Nirmoy Das
575e55ee4f drm/amdgpu: recover gart table at resume
Get rid off pin/unpin of gart BO at resume/suspend and
instead pin only once and try to recover gart content
at resume time. This is much more stable in case there
is OOM situation at 2nd call to amdgpu_device_evict_resources()
while evicting GART table.

v3: remove gart recovery from other places
v2: pin gart at amdgpu_gart_table_vram_alloc()

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-11 15:44:26 -05:00
Nirmoy Das
1dd8b1b987 drm/amdgpu: do not pass ttm_resource_manager to gtt_mgr
Do not allow exported amdgpu_gtt_mgr_*() to accept
any ttm_resource_manager pointer. Also there is no need
to force other module to call a ttm function just to
eventually call gtt_mgr functions.

v4: remove unused adev.
v3: upcast mgr from ttm resopurce manager instead of
getting it from adev.
v2: pass adev's gtt_mgr instead of adev.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-11 15:44:26 -05:00
Leslie Shi
62d5f9f711 drm/amdgpu: Unmap MMIO mappings when device is not unplugged
Patch: 3efb17ae7e92 ("drm/amdgpu: Call amdgpu_device_unmap_mmio() if device
is unplugged to prevent crash in GPU initialization failure") makes call to
amdgpu_device_unmap_mmio() conditioned on device unplugged. This patch unmaps
MMIO mappings even when device is not unplugged.

v2: Add condition of drm_dev_enter() to deleted unmaps in patch
"drm/amdgpu: Unmap all MMIO mappings"

Signed-off-by: Leslie Shi <Yuliang.Shi@amd.com>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-11 15:44:26 -05:00