Record the debug mode status in RAS.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Enable RAS feature by default for aqua vanjaram on apu
platform.
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
typo fix.
Fixes: 5b1270beb3 ("drm/amdgpu: add ras_err_info to identify RAS error source")
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Candice Li <candice.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Fix delete nodes that it has been freed.
Fixes: 5b1270beb3 ("drm/amdgpu: add ras_err_info to identify RAS error source")
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Set VCN/JPEG RAS masks to enable software RAS for
VCN and JPEG.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Make the code architecture more simple.
v2: reuse ras_reset_error_count in ras_reset_error_status.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
If one of the devices in the hive detects a
fatal error, need to send ras recovery reset
message to PMFW of all devices in the hive.
For that add a flag in hive to indicate that
it's undergoing ras recovery
Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
introduced "ras_err_info" to better identify a RAS ERROR source.
NOTE:
For legacy chips, keep the original RAS error print format.
v1:
RAS errors may come from different dies during a RAS error query,
therefore, need a new data structure to identify the source of RAS ERROR.
v2:
- use new data structure 'amdgpu_smuio_mcm_config_info' instead of
ras_err_id (in v1 patch)
- refine ras error dump function name
- refine ras error dump log format
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Expose ras table version & schema info to sysfs
v2: Updated schema to get poison support info
from ras context, removed asic specific checks
Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
It should first check block ras obj whether be set, it should
return 0 directly if block ras obj or hw_ops is not set.
If block doesn't support RAS just return 0 is fine.
Changed from V1:
return 0 directly if block ras obj or hw ops is not set
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This patch fixes a memory leak in the amdgpu_ras_feature_enable() function.
The leak occurs when the function sends a command to the firmware to enable
or disable a RAS feature for a GFX block. If the command fails, the kfree()
function is not called to free the info memory.
Fixes: 9f051d6ff1 ("drm/amdgpu: Free ras cmd input buffer properly")
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Cong Liu <liucong2@kylinos.cn>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Use an inline function for version check. Gives more flexibility to
handle any format changes.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
So driver doesn't generate incorrect message until
the new format is settled down for aqua_vanjaram
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Do not access the pointer for ras input cmd buffer
if it is even not allocated.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Stanley Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Disable gfx ras command is needed in some use cases
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Mode1 reset needs to recover mp1 in fatal error case
for mp0 v13_0_10.
v2:
Define a macro to wrap psp function calls.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
RAS global isr will only be invoked by hardware
interrupt. Don't need to query ras capability in isr
In addition, amdgpu_ras_interrupt_fatal_error_handler
ensures the isr won't be called from guest linux
side by accident. The RAS cap check in isr that
introduced to fix sriov crash is not needed any more
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Treat SDMA/VCN/JPEG as RAS capable IP blocks in poison mode.
Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Register RAS fatal error interrupt and add handler.
v2: only register NBIO RAS for dGPU platform.
change nbio_v7_9_set_ras_controller_irq_state and nbio_v7_9_set_ras_err_event_athub_irq_state
to dummy functions.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
For non-GFX IP blocks, set up ras obj if ras feature
is allowed. For GFX IP blocks, force issue ras
enable_feature command to firmware and only set up ras
obj if ras feature is allowed
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
For GFX IP that only supports poison consumption, GFX
RAS won't be marked as enabled. i.e., hardware doesn't
support gfx sram ecc. But driver still needs to issue
firmware to enable poison consumption mode for GFX IP.
In such case, check poison mode and treat GFX IP as
RAS capable IP block.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Some IP blocks only support partial ras feature and don't
have ras counter and/or ras error status register at all.
Driver should not create err_count sysfs node for those
IP blocks.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Only disable RAS by default for aqua vanjaram on APU platform.
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The VBIOS part number is read both in amdgpu_atom_parse() as well
as in atom_get_vbios_pn() and stored twice in the `struct atom_context`
structure. Remove the first unnecessary read and move the `pr_info`
line from that read into the second.
v2: squash in unused variable removal
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Disable RAS feature by default for aqua vanjaram on APU platform.
Changed from V1:
Splite Disable RAS by default on APU platform into a
separated patch.
Changed from V2:
Avoid to modify global variable amdgpu_ras_enable.
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Enable RAS for aqua vanjaram.
Changed from V1:
Split the change in amdgpu_ras_asic_supported into a
separated patch.
Changed from V2:
Avoid to modify global variable amdgpu_ras_enable.
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The address parameter of GFX RAS injection isn't related to XGMI node
number, keep it unchanged.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Candice Li <candice.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Do not compare injection address with mc_vram_size
if mc_vram_size is zero.
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Using "is_app_apu" to identify device in the native
APU mode or carveout mode.
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The fixed commit listed in the Fixes tag below, introduced a bug in
amdgpu_ras.c::amdgpu_reserve_page_direct(), in that when introducing the new
amdgpu_umc_fill_error_record() and internally in that new function the physical
address (argument "uint64_t retired_page"--wrong name) is right-shifted by
AMDGPU_GPU_PAGE_SHIFT. Thus, in amdgpu_reserve_page_direct() when we pass
"address" to that new function, we should NOT right-shift it, since this
results, erroneously, in the page address to be 0 for first
2^(2*AMDGPU_GPU_PAGE_SHIFT) memory addresses.
This commit fixes this bug.
Cc: Tao Zhou <tao.zhou1@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Alex Deucher <Alexander.Deucher@amd.com>
Fixes: 400013b268 ("drm/amdgpu: add umc_fill_error_record to make code more simple")
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Link: https://lore.kernel.org/r/20230610113536.10621-1-luben.tuikov@amd.com
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Report the number of records stored in the RAS EEPROM table in debugfs.
This can be used by user-space to calculate the capacity of the RAS EEPROM
table since "bad_page_cnt_threshold" is also reported in the same place in
debugfs.
See commit 7fb6407145 ("drm/amdgpu: Add bad_page_cnt_threshold to debugfs").
ras_num_recs can already be inferred by dumping the RAS EEPROM table, also in
the same debugfs location, see commit reference c65b0805e7 (drm/amdgpu:
RAS EEPROM table is now in debugfs, 2021-04-08). This commit makes it an
integer value easily shown in a single file.
Cc: Alex Deucher <Alexander.Deucher@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Tao Zhou <tao.zhou1@amd.com>
Cc: Stanley Yang <Stanley.Yang@amd.com>
Cc: John Clements <john.clements@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Link: https://lore.kernel.org/r/20230603051043.211548-1-luben.tuikov@amd.com
Acked-by: Alex Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add ras info to EEPROM table, app can analyse device ECC
status without GPU driver through EEPROM table ras info.
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The mask is only needed to be set when RAS block instance number is
more than 1 and invalid bits should be also masked out.
We only check valid bits for GFX and SDMA block for now, and will
add check for other RAS blocks in the future.
v2: move the check under injection operation since the mask is only
used by RAS error inject.
v3: add valid bits handling for SDMA.
v4: print message if the mask is adjusted.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
So GFX RAS injection could use default function if it doesn't define its
own injection interface.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
User can specify injected instances by the mask. For backward
compatibility, the mask value is incorporated into sub block index
without interface change of RAS TA.
User uses logical mask and driver should convert it to physical value
before sending it to RAS TA.
v2: update parameter name.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
It turns out STATUS_VALID_FLAG needs to be checked
ahead of any other fields. ADDRESS_VALID_FLAG and
ERR_INFO_VALID_FLAG only manages ADDRESS and ERR_INFO
field respectively. driver should continue poll
ERR CNT field even ERR_INFO_VALD_FLAG is not set.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Persistent edc harvesting is supported in APP APU
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add common helper to reset ras error status. It
applies to IP blocks that follow the new ras error
logging register design, and need to write 0 to
reset the error status. For IP blocks that don't
support the new design, please still implement ip
specific helper.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add common helper to query ras error status and
log error information, including memory block id
and erorr count. The helpers are applicable to IP
blocks that follow the new ras error logging design.
For IP blocks that don't support the new design,
please still implement ip specific helper to query
ras error.
v2: optimize struct amdgpu_ras_err_status_reg_entry
and the implementaion in helper (Lijo/Tao)
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Some ASICs support fatal error event but do not
support pcie_bif ras.
Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
XGMI RAS should be according to the gmc xgmi physical nodes number,
XGMI RAS should not be enabled if xgmi num_physical_nodes is zero.
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
GPU will stop working once fatal error is detected.
it will inform driver to do reset to recover from
the fatal error.
v2: squash in logic fix (Srinivasan)
v3: squash in logic fix (Dan)
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Candice Li <candice.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
amdgpu_ras_register_ras_block should always be invoked
by ras_sw_init, where driver needs to check ras caps
at ip level, instead of asic level.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Stanley Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
pcie_bif ras blocks needs to be initialized as early
as possible to handle fatal error detected in hw_init
phase. also align the pcie_bif ras sw_init with other
ras blocks
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Stanley Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Ignore ras umc bad page threshold by default, GPU initialization won't
be stopped in this mode.
v2: refine the description of bad_page_threshold.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
If a UMC bad page is reserved but not freed by an application, the
application may trigger uncorrectable error repeatly by accessing the page.
v2: add specific function to do the check.
v3: remove duplicate pages, calculate new added bad page number.
v4: reuse save_bad_pages to calculate new added bad page number.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why]:
Amdgpu ras uses amdgpu_ras_is_supported to check whether
the ras block supports the ras function. amdgpu_ras_is_supported
uses .ras_enabled to determine whether the ras function of the
block is enabled.
But for special asic with mem ecc enabled but sram ecc not
enabled, some ras blocks support poison mode but their ras function
is not enabled on .ras_enabled, these ras blocks will run abnormally.
[How]:
If the ras block is not supported on .ras_enabled but the asic
supports poison mode and the ras block has ras configuration, it
can be considered that the ras block supports ras function.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why]:
For special asic with mem ecc enabled but sram ecc
not enabled, some ras blocks can register their ras
configuration to ras list, but these ras blocks are not
enabled on .ras_enabled, so it can not get ras block
object using amdgpu_ras_get_ras_block.
[How]:
Remove ras block support check. Even if the ras block
checked is not in the ras list, it will return a null
pointer and will have no effect.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Perform gpu reset after gfx finishes processing
ras poison consumption on gfx_v11_0_3.
V2:
Move gfx poison consumption handler from hw_ops to ip
function level.
V3:
Adjust the calling position of amdgpu_gfx_poison_consumation_handler.
V4:
Since gfx v11_0_3 does not have .hw_ops instance, the .hw_ops null
pointer check in amdgpu_ras_interrupt_poison_consumption_handler
needs to be adjusted.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
amdgpu_ras_block_late_init will be invoked in IP
specific ras_late_init call as a common helper for
all the IP blocks.
However, when amdgpu_ras_block_late_init call
amdgpu_ras_query_error_count to query ras error
counters, amdgpu_ras_query_error_count queries
all the IP blocks that support ras query interface.
This results to wrong error counters cached in
software copies when there are ras errors detected
at time zero or warm reset procedure. i.e., in
sdma_ras_late_init phase, it counts on sdma/mmhub
errors, while, in mmhub_ras_late_init phase, it
still counts on sdma/mmhub errors.
The change updates amdgpu_ras_query_error_count
interface to allow query specific ip error counter.
It introduces a new input parameter: query_info. if
query_info is NULL, it means query all the IP blocks,
otherwise, only query the ip block specified by
query_info.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
1. no need to query poison mode on SRIOV guest side, host can handle it.
2. define the function to simplify code.
v2: rename amdgpu_ras_poison_mode_query to amdgpu_ras_query_poison_mode.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Support VCN/JPEG RAS in both bare metal and SRIOV environment.
v2: update commit description.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Injection on guest is not allowed.
v2: return directly in SRIOV environment.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Replace the open-code with sysfs_emit() to simplify the code.
Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Set support flag for VCN/JPEG 4.0.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The patch is enabling mode-1 reset for RAS recovery in fatal error mode.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Make the code simpler.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Make the code more readable.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
RAS error address translation algorithm is common
across dGPU and A + A platform as along as the SOC
integrates the same generation of UMC IP.
UMC RAS is managed by x86 MCA on A + A platform,
umc_ras in GPU driver is not initialized at all on
A + A platform. In such case, any umc_ras callback
implemented for dGPU config shouldn't be invoked
from A + A specific callback.
The change moves convert_error_address out of dGPU
umc_ras structure and makes it share between A + A
and dGPU config.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Stanley Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This reverts commit dac6b80818.
This commit reverted the AMDGPU_SKIP_MODE2_RESET as it conflicts with
the original design of reset handler. Will redesign it.
Fixes: dac6b80818 ("drm/amdgpu: let mode2 reset fallback to default when failure")
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Fix some issues found by checkpatch script.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Make the code reusable and remove redundant code.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Use the convert interface to simplify code.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
For the asic using smu v13_0_2, there is the following
warning when uninstalling amdgpu:
amdgpu: ras disable gfx failed poison:1 ret:-22.
[Why]:
For the asic using smu v13_0_2, the psp .suspend and
mode1reset is called before amdgpu_ras_pre_fini during
amdgpu uninstall, it has disabled all ras features and
reset the psp. Since the psp is reset, calling
amdgpu_ras_disable_all_features in amdgpu_ras_pre_fini
to disable ras features will fail.
[How]:
If all ras features are disabled, amdgpu_ras_disable_all_features
will not be called to disable all ras features again.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
No need to reset error status since only umc ras supported on psp v13_0_0.
Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
- introduce AMDGPU_SKIP_MODE2_RESET flag
- let mode2 reset fallback to default reset method if failed
v2: move this part out from the asic specific part
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Acked-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Move reset_context out of gpu recover function to make it configurable
for different reset purpose.
For the reset way of call gpu_recovery sysfs, force to use full reset
method. Otherwise, try soft reset by default if the related ASIC
supportted, if soft reset failed, will use full reset.
Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
It should not init whole ras bad page framework on sriov guest side
due to it is handled on host side.
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
GFX is the only IP block that RAS TA needs to program
the hardware when receiving enable_feature command.
Changed from V1:
remove amdgpu_ras_need_send_ras_feature inline function,
use GFX RAS block check directly.
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
We removed the wrapper that was queueing the recover function
into reset domain queue who was using this name.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Save the extra usless work schedule.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Adjust the sequence for ras late init and separate ras reset error status
from query status.
v2: squash in fix from Candice
Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
support umc/gfx/sdma ras on guest side
Changed from V1:
move sriov judgment in amdgpu_ras_interrupt_fatal_error_handler
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Qeury ras status before ras poison consumption handling, add more
comment and log.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-and-tested-by: Mohammad Zafar Ziya <Mohammadzafar.ziya@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Enable RAS IH if poison consumption handler is implemented.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Mohammad Zafar Ziya <Mohammadzafar.ziya@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The fatal error handler is independent from general ras interrupt
handler since there is no related IH ring.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add support for general RAS poison consumption handler.
v2: remove callback function for poison consumption.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Prepare for the implementation of poison consumption handler.
v2: separate umc handler from poison creation.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add vcn and jpeg ras support options
V2: vcn and jpeg ras flag enabled for aldebaran asic only
V3: vcn and jpeg ras flag disabled for error counter query
Generic poison query interface added
VCN and JPEG ras enabled based on IP version check
V4: vcn and jpeg ras flag moved under ecc flag for dGPU
Signed-off-by: Mohammad Zafar Ziya <Mohammadzafar.ziya@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
It should notice SMU to update bad channel info when detected
uncorrectable error in UMC block
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
1. Define amdgpu_ras_block_late_fini_default in amdgpu_ras.c as
.ras_fini common function, which is called when
.ras_fini of ras block isn't initialized.
2. Remove the code of using amdgpu_ras_block_late_fini to
initialize .ras_fini in ras blocks.
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
centrally calls the .ras_fini function of all ras blocks.
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Fixed warning reported by kernel test robot:
1.warning: variable 'ras_obj' is used uninitialized
whenever '||' condition is true.
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Turn previously global function into a static function to avoid the
following Clang warning:
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2459:5: warning: no previous prototype
for function 'amdgpu_ras_block_late_init_default' [-Wmissing-prototypes]
int amdgpu_ras_block_late_init_default(struct amdgpu_device *adev,
^
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2459:1: note: declare 'static' if the
function is not intended to be used outside of this translation unit
int amdgpu_ras_block_late_init_default(struct amdgpu_device *adev,
^
static
Signed-off-by: Maíra Canal <maira.canal@usp.br>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Clang build fails with
amdgpu_ras.c:2416:7: error: variable 'ras_obj' is used uninitialized
whenever 'if' condition is true
if (adev->in_suspend || amdgpu_in_reset(adev)) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
amdgpu_ras.c:2453:6: note: uninitialized use occurs here
if (ras_obj->ras_cb)
^~~~~~~
There is a logic error in the error handler's labels.
ex/ The sysfs: is the last goto label in the normal code but
is the middle of error handler. Rework the error handler.
cleanup: is the first error, so it's handler should be last.
interrupt: is the second error, it's handler is next. interrupt:
handles the failure of amdgpu_ras_interrupt_add_hander() by
calling amdgpu_ras_interrupt_remove_handler(). This is wrong,
remove() assumes the interrupt has been setup, not torn down by
add(). Change the goto label to cleanup.
sysfs is the last error, it's handler should be first. sysfs:
handles the failure of amdgpu_ras_sysfs_create() by calling
amdgpu_ras_sysfs_remove(). But when the create() fails there
is nothing added so there is nothing to remove. This error
handler is not needed. Remove the error handler and change
goto label to interrupt.
Fixes: b293e891b0 ("drm/amdgpu: add helper function to do common ras_late_init/fini (v3)")
Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
1. Define amdgpu_ras_block_late_init_default in amdgpu_ras.c as
.ras_late_init common function, which is called when
.ras_late_init of ras block isn't initialized.
2. Remove the code of using amdgpu_ras_block_late_init to
initialize .ras_late_init in ras blocks.
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Define amdgpu_ras_late_init to call all ras blocks' .ras_late_init.
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
1. Merge amdgpu_ras_late_init to
amdgpu_ras_block_late_init.
2. Remove amdgpu_ras_late_init since no ras block
calls amdgpu_ras_late_init.
3. Merge amdgpu_ras_late_fini to
amdgpu_ras_block_late_fini.
4. Remove amdgpu_ras_late_fini since no ras block
calls amdgpu_ras_late_fini.
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
In order to reduce redundant struct conversion, modify
operating sysfs and interrupt function interface parameters.
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Optimize amdgpu_nbio_ras_late_init/amdgpu_nbio_ras_fini function code.
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
1. Define amdgpu_ras_block_late_init to create sysfs nodes
and interrupt handles.
2. Define amdgpu_ras_block_late_fini to remove sysfs nodes
and interrupt handles.
3. Replace ras block variable members in struct
amdgpu_ras_block_object with struct ras_common_if, which
can make it easy to associate each ras block instance
with each ras block functional interface.
4. Add .ras_cb to struct amdgpu_ras_block_object.
5. Change each ras block to fit for the changement of struct
amdgpu_ras_block_object.
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
amd-drm-next-5.18-2022-02-11-1:
amdgpu:
- Clean up of power management code
- Enable freesync video mode by default
- Clean up of RAS code
- Improve VRAM access for debug using SDMA
- Coding style cleanups
- SR-IOV fixes
- More display FP reorg
- TLB flush fixes for Arcuturus, Vega20
- Misc display fixes
- Rework special register access methods for SR-IOV
- DP2 fixes
- DP tunneling fixes
- DSC fixes
- More IP discovery cleanups
- Misc RAS fixes
- Enable both SMU i2c buses where applicable
- s2idle improvements
- DPCS header cleanup
- Add new CAP firmware support for SR-IOV
amdkfd:
- Misc cleanups
- SVM fixes
- CRIU support
- Clean up MQD manager
UAPI:
- Add interface to amdgpu CTX ioctl to request a stable power state for profiling
https://gitlab.freedesktop.org/mesa/drm/-/merge_requests/207
- Add amdkfd support for CRIU
https://github.com/checkpoint-restore/criu/pull/1709
- Remove old unused amdkfd debugger interface
Was only implemented for Kaveri and was only ever used by an old HSA tool that was never open sourced
radeon:
- Fix error handling in radeon_driver_open_kms
- UVD suspend fix
- Misc fixes
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220211220706.5803-1-alexander.deucher@amd.com
The commit d5e8ff5f7b ("drm/amdgpu: Fixed the defect of soft lock caused by infinite loop")
had fixed this defect.
Revert workaround
commit a2170b4af6 ("drm/amdgpu: Add judgement to avoid infinite loop").
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
MESA polls for errors every 2-3 seconds. Printing with dev_info() causes
the dmesg log to fill up with the same message, e.g,
[18028.206676] amdgpu 0000:0b:00.0: amdgpu: df doesn't config ras function.
Make it dev_dbg_once(), as it isn't something correctible during boot or
thereafter, so printing just once is sufficient. Also sanitize the message.
Cc: Alex Deucher <Alexander.Deucher@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Tao Zhou <tao.zhou1@amd.com>
Cc: yipechai <YiPeng.Chai@amd.com>
Fixes: 8b0fb0e967 ("drm/amdgpu: Modify gfx block to fit for the unified ras block data and ops")
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
1. The infinite loop causing soft lock occurs on multiple amdgpu cards
supporting ras feature.
2. This a workaround patch to fix 6492e1b07c.
It is valid for multiple amdgpu cards of the same type.
3. The root cause is that each GPU card device has a separate .ras_list
link header, but the instance and linked list node of each ras block
are unique. When each device is initialized, each ras instance will
repeatedly add link node to the device every time. In this way, only
the .ras_list of the last initialized device is completely correct.
the .ras_list->prev and .ras_list->next of the device initialzied
before can still point to the correct ras instance, but the prev
pointer and next pointer of the pointed ras instance both point to
the last initialized device's .ras_ list instead of the beginning
.ras_ list. When using list_for_each_entry_safe searches for
non-existent Ras nodes on devices other than the last device, the
last ras instance next pointer cannot always be equal to the
beginning .ras_list, so that the loop cannot be terminated, the
program enters a infinite loop.
BTW: Since the data and initialization process of each card are the same,
the link list between ras instances will not be destroyed every time
the device is initialized.
4. The soft locked logs are as follows:
[ 262.165690] CPU: 93 PID: 758 Comm: kworker/93:1 Tainted: G OE 5.13.0-27-generic #29~20.04.1-Ubuntu
[ 262.165695] Hardware name: Supermicro AS -4124GS-TNR/H12DSG-O-CPU, BIOS T20200717143848 07/17/2020
[ 262.165698] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
[ 262.165980] RIP: 0010:amdgpu_ras_get_ras_block+0x86/0xd0 [amdgpu]
[ 262.166239] Code: 68 d8 4c 8d 71 d8 48 39 c3 74 54 49 8b 45 38 48 85 c0 74 32 44 89 fa 44 89 e6 4c 89 ef e8 82 e4 9b dc 85 c0 74 3c 49 8b 46 28 <49> 8d 56 28 4d 89 f5 48 83 e8 28 48 39 d3 74 25 49 89 c6 49 8b 45
[ 262.166243] RSP: 0018:ffffac908fa87d80 EFLAGS: 00000202
[ 262.166247] RAX: ffffffffc1394248 RBX: ffff91e4ab8d6e20 RCX: ffffffffc1394248
[ 262.166249] RDX: ffff91e4aa356e20 RSI: 000000000000000e RDI: ffff91e4ab8c0000
[ 262.166252] RBP: ffffac908fa87da8 R08: 0000000000000007 R09: 0000000000000001
[ 262.166254] R10: ffff91e4930b64ec R11: 0000000000000000 R12: 000000000000000e
[ 262.166256] R13: ffff91e4aa356df8 R14: ffffffffc1394320 R15: 0000000000000003
[ 262.166258] FS: 0000000000000000(0000) GS:ffff92238fb40000(0000) knlGS:0000000000000000
[ 262.166261] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 262.166264] CR2: 00000001004865d0 CR3: 000000406d796000 CR4: 0000000000350ee0
[ 262.166267] Call Trace:
[ 262.166272] amdgpu_ras_do_recovery+0x130/0x290 [amdgpu]
[ 262.166529] ? psi_task_switch+0xd2/0x250
[ 262.166537] ? __switch_to+0x11d/0x460
[ 262.166542] ? __switch_to_asm+0x36/0x70
[ 262.166549] process_one_work+0x220/0x3c0
[ 262.166556] worker_thread+0x4d/0x3f0
[ 262.166560] ? process_one_work+0x3c0/0x3c0
[ 262.166563] kthread+0x12b/0x150
[ 262.166568] ? set_kthread_struct+0x40/0x40
[ 262.166571] ret_from_fork+0x22/0x30
Fixes: 6492e1b07c ("drm/amdgpu: Unify ras block interface for each ras block")
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Create common amdgpu_umc_fill_error_record function for all versions
of UMC and clean up related codes.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This reverts commit df4f0041c6.
Xgmi ras initialization had been moved from .late_init to early_init,
the defect of repeated calling amdgpu_ras_register_ras_block had been
fixed, so revert this patch.
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Fix the code style warnings in amdgpu_ras:
1. ERROR: space required before the open parenthesis '('.
2. WARNING: line length of xxx exceeds 100 columns.
3. ERROR: "foo* bar" should be "foo *bar".
4. WARNING: unnecessary whitespace before a quoted newline.
5. WARNING: space prohibited before semicolon.
6. WARNING: suspect code indent for conditional statements.
7. WARNING: braces {} are not necessary for single statement blocks.
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
drivers fixes:
- i915 fixes for ttm backend + one pm wakelock fix
- amdgpu fixes, fairly big pile of small things all over. Note this
doesn't yet containe the fixed version of the otg sync patch that
blew up
- small driver fixes: meson, sun4i, vga16fb probe fix
drm core fixes:
- cma-buf heap locking
- ttm compilation
- self refresh helper state check
- wrong error message in atomic helpers
- mipi-dbi buffer mapping
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEb4nG6jLu8Y5XI+PfTA9ye/CYqnEFAmHhrxUACgkQTA9ye/CY
qnEowxAAgBPwGEobRGMbR3Me98vEKvcWqSxBe/k1VC4LhO5DrvbG5iW9cuxCJZM2
wlGlGAtU7C7pcCP5Xp1UlMqZ5a0rSVhqMPPkMKO9+7033ofSlAQatnMI1EENH6Hn
BkhXwTyuOBSN6zqskg8FKqzF+VPTt5ZV2U5qJzQweP/wFtZPAKI4tWE4oKiHactH
fJHnAi7T6ytF6a7J21BsSEluk4z7BjmcmFF0tW6iuq7Y6TXDFXFq9QFDR041b2rI
GYDUXl2mebp/L+2M3sPYuMiIiyJ8enh7crNIdmi+EstmzRADa7RMjnY3j2tJg/7M
pqnuJZAVcpkCurb7NMr3ycmrxnhfUsZfbuXvm+k5yJYfQCaGNiKy1ObsFWH9zBDz
XMuxcE+csSaX/7rjoyXrL2ZTRPXnVwJNJ8x1CuKn3giLxMSqnPnDMjyHmNLB8qa1
R0wbPQbdx5+jWgs/ngUGFNo4vFBnNmqQP4G3LaWJ/Ku5cSrEM+Jt9GJOw5jjVgun
gaOlKlUpMBlKmXOwkvhRW2AwHRcL7lrBuIw0oFOThdMzkSNlKSNBMpAkgvD/9C7g
IDtJgA7a3MqzLjhPLmhUB3rFyXz5dg5rNZhoH8z4DFiJVpTKYYA5/UoU/RTXZGqW
yFdrBkFmNJeKTZZAe+e/4OP/dm1vKVEKK6Ko/M9CrELzBKSussg=
=h74X
-----END PGP SIGNATURE-----
Merge tag 'drm-next-2022-01-14' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Daniel Vetter:
"drivers fixes:
- i915 fixes for ttm backend + one pm wakelock fix
- amdgpu fixes, fairly big pile of small things all over. Note this
doesn't yet containe the fixed version of the otg sync patch that
blew up
- small driver fixes: meson, sun4i, vga16fb probe fix
drm core fixes:
- cma-buf heap locking
- ttm compilation
- self refresh helper state check
- wrong error message in atomic helpers
- mipi-dbi buffer mapping"
* tag 'drm-next-2022-01-14' of git://anongit.freedesktop.org/drm/drm: (49 commits)
drm/mipi-dbi: Fix source-buffer address in mipi_dbi_buf_copy
drm: fix error found in some cases after the patch d1af5cd86997
drm/ttm: fix compilation on ARCH=um
dma-buf: cma_heap: Fix mutex locking section
video: vga16fb: Only probe for EGA and VGA 16 color graphic cards
drm/amdkfd: Fix ASIC name typos
drm/amdkfd: Fix DQM asserts on Hawaii
drm/amdgpu: Use correct VIEWPORT_DIMENSION for DCN2
drm/amd/pm: only send GmiPwrDnControl msg on master die (v3)
drm/amdgpu: use spin_lock_irqsave to avoid deadlock by local interrupt
drm/amdgpu: not return error on the init_apu_flags
drm/amdkfd: Use prange->update_list head for remove_list
drm/amdkfd: Use prange->list head for insert_list
drm/amdkfd: make SPDX License expression more sound
drm/amdkfd: Check for null pointer after calling kmemdup
drm/amd/display: invalid parameter check in dmub_hpd_callback
Revert "drm/amdgpu: Don't inherit GEM object VMAs in child process"
drm/amd/display: reset dcn31 SMU mailbox on failures
drm/amdkfd: use default_groups in kobj_type
drm/amdgpu: use default_groups in kobj_type
...
Use ARRAY_SIZE to get array length.
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Eliminate the following coccicheck warning:
./drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2725:16-17: Unneeded semicolon
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
No longer insert ras blocks into ras_list if it already exists in ras_list.
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
1. Move xgmi special error inject function from amdgpu_ras.c to xgmi block.
2. Support to use psp_ras_trigger_error as default error inject function in amdgpu_ras.c. If .ras_error_inject isn't defined in ras block, default error inject function will take effect.
v2: squash in warning fix (Alex)
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
1.Modify sdma block to fit for the unified ras block data and ops.
2.Change amdgpu_sdma_ras_funcs to amdgpu_sdma_ras, and the corresponding variable name remove _funcs suffix.
3.Remove the const flag of sdma ras variable so that sdma ras block can be able to be inserted into amdgpu device ras block link list.
4.Invoke amdgpu_ras_register_ras_block function to register sdma ras block into amdgpu device ras block link list.
5.Remove the redundant code about sdma in amdgpu_ras.c after using the unified ras block.
6.Fill unified ras block .name .block .ras_late_init and .ras_fini for all of sdma versions. If .ras_late_init and .ras_fini had been defined by the selected sdma version, the defined functions will take effect; if not defined, default fill them with amdgpu_sdma_ras_late_init and amdgpu_sdma_ras_fini.
v2: squash in warning fix (Alex)
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
1.Modify umc block to fit for the unified ras block data and ops.
2.Change amdgpu_umc_ras_funcs to amdgpu_umc_ras, and the corresponding variable name remove _funcs suffix.
3.Remove the const flag of umc ras variable so that umc ras block can be able to be inserted into amdgpu device ras block link list.
4.Invoke amdgpu_ras_register_ras_block function to register umc ras block into amdgpu device ras block link list.
5.Remove the redundant code about umc in amdgpu_ras.c after using the unified ras block.
6.Fill unified ras block .name .block .ras_late_init and .ras_fini for all of umc versions. If .ras_late_init and .ras_fini had been defined by the selected umc version, the defined functions will take effect; if not defined, default fill them with amdgpu_umc_ras_late_init and amdgpu_umc_ras_fini.
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
1.Modify nbio block to fit for the unified ras block data and ops.
2.Change amdgpu_nbio_ras_funcs to amdgpu_nbio_ras, and the corresponding variable name remove _funcs suffix.
3.Remove the const flag of mmhub ras variable so that nbio ras block can be able to be inserted into amdgpu device ras block link list.
4.Invoke amdgpu_ras_register_ras_block function to register nbio ras block into amdgpu device ras block link list.
5.Remove the redundant code about nbio in amdgpu_ras.c after using the unified ras block.
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
1.Modify mmhub block to fit for the unified ras block data and ops.
2.Change amdgpu_mmhub_ras_funcs to amdgpu_mmhub_ras, and the corresponding variable name remove _funcs suffix.
3.Remove the const flag of mmhub ras variable so that mmhub ras block can be able to be inserted into amdgpu device ras block link list.
4.Invoke amdgpu_ras_register_ras_block function to register mmhub ras block into amdgpu device ras block link list. 5.Remove the redundant code about mmhub in amdgpu_ras.c after using the unified ras block.
5.Remove the redundant code about mmhub in amdgpu_ras.c after using the unified ras block.
6.Fill unified ras block .name .block .ras_late_init and .ras_fini for all of mmhub versions. If .ras_late_init and .ras_fini had been defined by the selected mmhub version, the defined functions will take effect; if not defined, default fill them with amdgpu_mmhub_ras_late_init and amdgpu_mmhub_ras_fini.
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
1.Modify hdp block to fit for the unified ras block data and ops.
2.Change amdgpu_hdp_ras_funcs to amdgpu_hdp_ras, and the corresponding variable name remove _funcs suffix.
3.Remove the const flag of hdp ras variable so that hdp ras block can be able to be inserted into amdgpu device ras block link list.
4.Invoke amdgpu_ras_register_ras_block function to register hdp ras block into amdgpu device ras block link list.
5.Remove the redundant code about hdp in amdgpu_ras.c after using the unified ras block.
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
1.Modify gmc block to fit for the unified ras block data and ops.
2.Change amdgpu_xgmi_ras_funcs to amdgpu_xgmi_ras, and the corresponding variable name remove _funcs suffix.
3.Remove the const flag of gmc ras variable so that gmc ras block can be able to be inserted into amdgpu device ras block link list.
4.Invoke amdgpu_ras_register_ras_block function to register gmc ras block into amdgpu device ras block link list.
5.Remove the redundant code about gmc in amdgpu_ras.c after using the unified ras block.
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
1.Modify gfx block to fit for the unified ras block data and ops.
2.Change amdgpu_gfx_ras_funcs to amdgpu_gfx_ras, and the corresponding variable name remove _funcs suffix.
3.Remove the const flag of gfx ras variable so that gfx ras block can be able to be inserted into amdgpu device ras block link list.
4.Invoke amdgpu_ras_register_ras_block function to register gfx ras block into amdgpu device ras block link list.
5.Remove the redundant code about gfx in amdgpu_ras.c after using the unified ras block.
6.Fill unified ras block .name .block .ras_late_init and .ras_fini for all of gfx versions. If .ras_late_init and .ras_fini had been defined by the selected gfx version, the defined functions will take effect; if not defined, default fill with amdgpu_gfx_ras_late_init and amdgpu_gfx_ras_fini.
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Modify the compilation failed problem when other ras blocks' .h include amdgpu_ras.h.
v2: squash in forward declaration warning fix (Alex)
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
1. Define unified ops interface for each block.
2. Add ras_block_match function pointer in ops interface, each ras block can customize specail match function to identify itself.
3. Add amdgpu_ras_block_match_default new function. If a ras block doesn't define .ras_block_match, default execute amdgpu_ras_block_match_default to identify this ras block.
4. Define unified basic ras block data for each ras block.
5. Create dedicated amdgpu device ras block link list to manage all of the ras blocks.
6. Add amdgpu_ras_register_ras_block new function interface for each ras block to register itself to ras controlling block.
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Those implementation details(whether swsmu supported, some ppt_funcs supported,
accessing internal statistics ...)should be kept internally. It's not a good
practice and even error prone to expose implementation details.
Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Do not allow exported amdgpu_vram_mgr_*() to accept
any ttm_resource_manager pointer. Also there is no need
to force other module to call a ttm function just to
eventually call vram_mgr functions.
v2: pass adev's vram_mgr instead of adev
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Memory of err_data should be cleaned before usage
when there're multiple entry in ras ih.
Otherwise garbage data from last loop will be used.
Signed-off-by: Jiawei Gu <Jiawei.Gu@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
core:
- add privacy screen support
- move nomodeset option into drm subsystem
- clean up nomodeset handling in drivers
- make drm_irq.c legacy
- fix stack_depot name conflicts
- remove DMA_BUF_SET_NAME ioctl restrictions
- sysfs: send hotplug event
- replace several DRM_* logging macros with drm_*
- move hashtable to legacy code
- add error return from gem_create_object
- cma-helper: improve interfaces, drop CONFIG_DRM_KMS_CMA_HELPER
- kernel.h related include cleanups
- support XRGB2101010 source buffers
ttm:
- don't include drm hashtable
- stop pruning fences after wait
- documentation updates
dma-buf:
- add dma_resv selftest
- add debugfs helpers
- remove dma_resv_get_excl_unlocked
- documentation
- make fences mandatory in dma_resv_add_excl_fence
dp:
- add link training delay helpers
gem:
- link shmem/cma helpers into separate modules
- use dma_resv iteratior
- import dma-buf namespace into gem helper modules
scheduler:
- fence grab fix
- lockdep fixes
bridge:
- switch to managed MIPI DSI helpers
- register and attach during probe fixes
- convert to YAML in several places.
panel:
- add bunch of new panesl
simpledrm:
- support FB_DAMAGE_CLIPS
- support virtual screen sizes
- add Apple M1 support
amdgpu:
- enable seamless boot for DCN 3.01
- runtime PM fixes
- use drm_kms_helper_connector_hotplug_event
- get all fences at once
- use generic drm fb helpers
- PSR/DPCD/LTTPR/DSC/PM/RAS/OLED/SRIOV fixes
- add smart trace buffer (STB) for supported GPUs
- display debugfs entries
- new SMU debug option
- Documentation update
amdkfd:
- IP discovery enumeration refactor
- interface between driver fixes
- SVM fixes
- kfd uapi header to define some sysfs bitfields.
i915:
- support VESA panel backlights
- enable ADL-P by default
- add eDP privacy screen support
- add Raptor Lake S (RPL-S) support
- DG2 page table support
- lots of GuC/HuC fw refactoring
- refactored i915->gt interfaces
- CD clock squashing support
- enable 10-bit gamma support
- update ADL-P DMC fw to v2.14
- enable runtime PM autosuspend by default
- ADL-P DSI support
- per-lane DP drive settings for ICL+
- add support for pipe C/D DMC firmware
- Atomic gamma LUT updates
- remove CCS FB stride restrictions on ADL-P
- VRR platform support for display 11
- add support for display audio codec keepalive
- lots of display refactoring
- fix runtime PM handling during PXP suspend
- improved eviction performance with async TTM moves
- async VMA unbinding improvements
- VMA locking refactoring
- improved error capture robustness
- use per device iommu checks
- drop bits stealing from i915_sw_fence function ptr
- remove dma_resv_prune
- add IC cache invalidation on DG2
nouveau:
- crc fixes
- validate LUTs in atomic check
- set HDMI AVI RGB quant to full
tegra:
- buffer objects reworks for dma-buf compat
- NVDEC driver uAPI support
- power management improvements
etnaviv:
- IOMMU enabled system support
- fix > 4GB command buffer mapping
- close a DoS vector
- fix spurious GPU resets
ast:
- fix i2c initialization
rcar-du:
- DSI output support
exynos:
- replace legacy gpio interface
- implement generic GEM object mmap
msm:
- dpu plane state cleanup in prep for multirect
- dpu debugfs cleanups
- dp support for sc7280
- a506 support
- removal of struct_mutex
- remove old eDP sub-driver
anx7625:
- support MIPI DSI input
- support HDMI audio
- fix reading EDID
lvds:
- fix bridge DT bindings
megachips:
- probe both bridges before registering
dw-hdmi:
- allow interlace on bridge
ps8640:
- enable runtime PM
- support aux-bus
tx358768:
- enable reference clock
- add pulse mode support
ti-sn65dsi86:
- use regmap bulk write
- add PWM support
etnaviv:
- get all fences at once
gma500:
- gem object cleanups
kmb:
- enable fb console
radeon:
- use dma_resv_wait_timeout
rockchip:
- add DSP hold timeout
- suspend/resume fixes
- PLL clock fixes
- implement mmap in GEM object functions
- use generic fbdev emulation
sun4i:
- use CMA helpers without vmap support
vc4:
- fix HDMI-CEC hang with display is off
- power on HDMI controller while disabling
- support 4K@60Hz modes
- support 10-bit YUV 4:2:0 output
vmwgfx:
- fix leak on probe errors
- fail probing on broken hosts
- new placement for MOB page tables
- hide internal BOs from userspace
- implement GEM support
- implement GL 4.3 support
virtio:
- overflow fixes
xen:
- implement mmap as GEM object function
omapdrm:
- fix scatterlist export
- support virtual planes
mediatek:
- MT8192 support
- CMDQ refinement
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmHX1vMACgkQDHTzWXnE
hr6rAw/9ES5RO5N3Ku9foFk1CI9bqy1Kh663KLkkEc+rDdhKpiZbBnAsrKkZ9sGu
fNuHmWNN5nWXtDSOqHWuslt3F7Gh+qEBQtlkqC9mZsBm3bWB0aJK6E4QaJxfeSaK
ta6AmyGx8DaV+C69i86dnemQurYSDVjROd7LDPKnCU0Fye/JxiXSXQmXksKMFVxd
x5vmO9yfeDSg3EF+u1yB6nJNUYZBV0vhrAfjPqxPCRBXuQc7akuaglE/SFwlGnEk
vn0GjVHEQcRTqYKrHr64xvQxIoKXcJP0pkDUyT7KYCsyj8GJkvxkb7/ls5pp5DvL
SwyNg3J3vwUVP6w6GEvzf3ffG720qqUZvCbvLmE+A/t2DhGILiAm+HXSo43PTOW8
uagT7Gxma8dy8EovjSxioS9HPX8Gcu+S+XYavgOsevOZ7oeEt4f4TLW7LXsw9d6y
75FrMhiUpreab5hAh8Le0swuLYZHjdnJRdjSTqZJ/T6VdTdVftLT6IfwvSDx5CHy
cWuufgcAjd7xVTXFquHWYXWLTQkiSMGf1M02jx9IWolTd4Cm41LNBhqMEDHZLHJD
7ngGgoaREVDQ+MqjG90yfIwJFIpJPI3YOaHLi/Kznga+iDzFY6cyOQWW2vX7ZdY5
7+LJWsgGT8Feb7/bzD5hX1mYqJLxh1pWUIaqIKMl+7LJL7gTVU8=
=MByd
-----END PGP SIGNATURE-----
Merge tag 'drm-next-2022-01-07' of git://anongit.freedesktop.org/drm/drm
Pull drm updates from Dave Airlie:
"Highlights are support for privacy screens found in new laptops, a
bunch of nomodeset refactoring, and i915 enables ADL-P systems by
default, while starting to add RPL-S support.
vmwgfx adds GEM and support for OpenGL 4.3 features in userspace.
Lots of internal refactorings around dma reservations, and lots of
driver refactoring as well.
Summary:
core:
- add privacy screen support
- move nomodeset option into drm subsystem
- clean up nomodeset handling in drivers
- make drm_irq.c legacy
- fix stack_depot name conflicts
- remove DMA_BUF_SET_NAME ioctl restrictions
- sysfs: send hotplug event
- replace several DRM_* logging macros with drm_*
- move hashtable to legacy code
- add error return from gem_create_object
- cma-helper: improve interfaces, drop CONFIG_DRM_KMS_CMA_HELPER
- kernel.h related include cleanups
- support XRGB2101010 source buffers
ttm:
- don't include drm hashtable
- stop pruning fences after wait
- documentation updates
dma-buf:
- add dma_resv selftest
- add debugfs helpers
- remove dma_resv_get_excl_unlocked
- documentation
- make fences mandatory in dma_resv_add_excl_fence
dp:
- add link training delay helpers
gem:
- link shmem/cma helpers into separate modules
- use dma_resv iteratior
- import dma-buf namespace into gem helper modules
scheduler:
- fence grab fix
- lockdep fixes
bridge:
- switch to managed MIPI DSI helpers
- register and attach during probe fixes
- convert to YAML in several places.
panel:
- add bunch of new panesl
simpledrm:
- support FB_DAMAGE_CLIPS
- support virtual screen sizes
- add Apple M1 support
amdgpu:
- enable seamless boot for DCN 3.01
- runtime PM fixes
- use drm_kms_helper_connector_hotplug_event
- get all fences at once
- use generic drm fb helpers
- PSR/DPCD/LTTPR/DSC/PM/RAS/OLED/SRIOV fixes
- add smart trace buffer (STB) for supported GPUs
- display debugfs entries
- new SMU debug option
- Documentation update
amdkfd:
- IP discovery enumeration refactor
- interface between driver fixes
- SVM fixes
- kfd uapi header to define some sysfs bitfields.
i915:
- support VESA panel backlights
- enable ADL-P by default
- add eDP privacy screen support
- add Raptor Lake S (RPL-S) support
- DG2 page table support
- lots of GuC/HuC fw refactoring
- refactored i915->gt interfaces
- CD clock squashing support
- enable 10-bit gamma support
- update ADL-P DMC fw to v2.14
- enable runtime PM autosuspend by default
- ADL-P DSI support
- per-lane DP drive settings for ICL+
- add support for pipe C/D DMC firmware
- Atomic gamma LUT updates
- remove CCS FB stride restrictions on ADL-P
- VRR platform support for display 11
- add support for display audio codec keepalive
- lots of display refactoring
- fix runtime PM handling during PXP suspend
- improved eviction performance with async TTM moves
- async VMA unbinding improvements
- VMA locking refactoring
- improved error capture robustness
- use per device iommu checks
- drop bits stealing from i915_sw_fence function ptr
- remove dma_resv_prune
- add IC cache invalidation on DG2
nouveau:
- crc fixes
- validate LUTs in atomic check
- set HDMI AVI RGB quant to full
tegra:
- buffer objects reworks for dma-buf compat
- NVDEC driver uAPI support
- power management improvements
etnaviv:
- IOMMU enabled system support
- fix > 4GB command buffer mapping
- close a DoS vector
- fix spurious GPU resets
ast:
- fix i2c initialization
rcar-du:
- DSI output support
exynos:
- replace legacy gpio interface
- implement generic GEM object mmap
msm:
- dpu plane state cleanup in prep for multirect
- dpu debugfs cleanups
- dp support for sc7280
- a506 support
- removal of struct_mutex
- remove old eDP sub-driver
anx7625:
- support MIPI DSI input
- support HDMI audio
- fix reading EDID
lvds:
- fix bridge DT bindings
megachips:
- probe both bridges before registering
dw-hdmi:
- allow interlace on bridge
ps8640:
- enable runtime PM
- support aux-bus
tx358768:
- enable reference clock
- add pulse mode support
ti-sn65dsi86:
- use regmap bulk write
- add PWM support
etnaviv:
- get all fences at once
gma500:
- gem object cleanups
kmb:
- enable fb console
radeon:
- use dma_resv_wait_timeout
rockchip:
- add DSP hold timeout
- suspend/resume fixes
- PLL clock fixes
- implement mmap in GEM object functions
- use generic fbdev emulation
sun4i:
- use CMA helpers without vmap support
vc4:
- fix HDMI-CEC hang with display is off
- power on HDMI controller while disabling
- support 4K@60Hz modes
- support 10-bit YUV 4:2:0 output
vmwgfx:
- fix leak on probe errors
- fail probing on broken hosts
- new placement for MOB page tables
- hide internal BOs from userspace
- implement GEM support
- implement GL 4.3 support
virtio:
- overflow fixes
xen:
- implement mmap as GEM object function
omapdrm:
- fix scatterlist export
- support virtual planes
mediatek:
- MT8192 support
- CMDQ refinement"
* tag 'drm-next-2022-01-07' of git://anongit.freedesktop.org/drm/drm: (1241 commits)
drm/amdgpu: no DC support for headless chips
drm/amd/display: fix dereference before NULL check
drm/amdgpu: always reset the asic in suspend (v2)
drm/amdgpu: put SMU into proper state on runpm suspending for BOCO capable platform
drm/amd/display: Fix the uninitialized variable in enable_stream_features()
drm/amdgpu: fix runpm documentation
amdgpu/pm: Make sysfs pm attributes as read-only for VFs
drm/amdgpu: save error count in RAS poison handler
drm/amdgpu: drop redundant semicolon
drm/amd/display: get and restore link res map
drm/amd/display: support dynamic HPO DP link encoder allocation
drm/amd/display: access hpo dp link encoder only through link resource
drm/amd/display: populate link res in both detection and validation
drm/amd/display: define link res and make it accessible to all link interfaces
drm/amd/display: 3.2.167
drm/amd/display: [FW Promotion] Release 0.0.98
drm/amd/display: Undo ODM combine
drm/amd/display: Add reg defs for DCN303
drm/amd/display: Changed pipe split policy to allow for multi-display pipe split
drm/amd/display: Set optimize_pwr_state for DCN31
...
AMD systems currently lay out MCA bank types such that the type of bank
number "i" is either the same across all CPUs or is Reserved/Read-as-Zero.
For example:
Bank # | CPUx | CPUy
0 LS LS
1 RAZ UMC
2 CS CS
3 SMU RAZ
Future AMD systems will lay out MCA bank types such that the type of
bank number "i" may be different across CPUs.
For example:
Bank # | CPUx | CPUy
0 LS LS
1 RAZ UMC
2 CS NBIO
3 SMU RAZ
Change the structures that cache MCA bank types to be per-CPU and update
smca_get_bank_type() to handle this change.
Move some SMCA-specific structures to amd.c from mce.h, since they no
longer need to be global.
Break out the "count" for bank types from struct smca_hwid, since this
should provide a per-CPU count rather than a system-wide count.
Apply the "const" qualifier to the struct smca_hwid_mcatypes array. The
values in this array should not change at runtime.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211216162905.4132657-3-yazen.ghannam@amd.com
This commit fixes the compile-time warning below:
warning: no previous prototype for ‘amdgpu_ras_mca_query_error_status’
[-Wmissing-prototypes]
Changes since v1:
- As suggested by Alexander Deucher:
1. Make function static instead of adding prototype.
Signed-off-by: Isabella Basso <isabbasso@riseup.net>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This fixes various warnings relating to erroneous docstring syntax, of
which some are listed below:
warning: Function parameter or member 'adev' not described in
'amdgpu_atomfirmware_ras_rom_addr'
...
warning: expecting prototype for amdgpu_atpx_validate_functions().
Prototype was for amdgpu_atpx_validate() instead
...
warning: Excess function parameter 'mem' description in 'amdgpu_preempt_mgr_new'
...
warning: Cannot understand * @kfd_get_cu_occupancy - Collect number of
waves in-flight on this device
...
warning: This comment starts with '/**', but isn't a kernel-doc
comment. Refer Documentation/doc-guide/kernel-doc.rst
Signed-off-by: Isabella Basso <isabbasso@riseup.net>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The RAS poison mode is enabled by default on the platform.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
skip get ecc info for aldebarn through check ip version
do not affect other asic type
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
this is a workaround due to get ecc info failed during gpu recovery
[ 700.236122] amdgpu 0000:09:00.0: amdgpu: Failed to export SMU ecc table!
[ 700.236128] amdgpu 0000:09:00.0: amdgpu: GPU reset begin!
[ 704.331171] amdgpu: qcm fence wait loop timeout expired
[ 704.331194] amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
[ 704.332445] amdgpu 0000:09:00.0: amdgpu: GPU reset begin!
[ 704.332448] amdgpu 0000:09:00.0: amdgpu: Bailing on TDR for s_job:ffffffffffffffff, as another already in progress
[ 704.332456] amdgpu: Pasid 0x8000 destroy queue 0 failed, ret -62
[ 710.360924] amdgpu 0000:09:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000013 SMN_C2PMSG_82:0x00000007
[ 710.360964] amdgpu 0000:09:00.0: amdgpu: Failed to disable smu features.
[ 710.361002] amdgpu 0000:09:00.0: amdgpu: Fail to disable dpm features!
[ 710.361014] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <smu> failed -62
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
v2:
still need call ras_disable_all_featrures to handle
ras initilization failure case.
Function amdgpu_device_fini_hw is called before amdgpu_device_fini_sw,
so ras ta will unload before send ras disable command, ras dsiable operation
must before hw fini.
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
if smu support ECCTABLE, driver can message smu to get ecc_table
then query umc error info from ECCTABLE
v2:
optimize source code makes logical more reasonable
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Fix race condition failure during UMC UE injection.
Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
During mode2 reset, the GPU is temporarily removed from the
mgpu_info list. As a result, page retirement fails because it
cannot find the GPU in the GPU list.
To fix this, create our own list of GPUs that support MCE notifier
based page retirement and use that list to check if the UMC error
occurred on a GPU that supports MCE notifier based page retirement.
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
On Aldebaran, GPU driver will handle bad page retirement
for GPU memory even though UMC is host managed. As a result,
register a bad page retirement handler on the mce notifier
chain to retire bad pages on Aldebaran.
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
clear error count when persistant harvesting is not enabled
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
In ras poison mode, umc uncorrectable error will be ignored until
the corrupted data consumed by another ras module (such as gfx, sdma).
v2: update the debug message and replace dev_warn with dev_info.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add RAS poison supported flag and tell PSP RAS TA about the info.
v2: rename poison mode to poison supported, we can also disable poison
mode even we support it.
print value of poison supported if ras feature enablement fails.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>