We have a number of crashes, either reported by QA or sent from their machines to our crash system via Sentry. The repros are opaque to the team: the crash comes up in different places in the game that have little clearly connecting them on the content side.
Each crash report offers a somewhat different perspective, but none shows a clear underlying cause beyond a memory access fault. In most of them the NVIDIA GPU dump decoding works as expected, with output roughly as follows:
Decoding Aftermath GPU Crash:
  Device Info:
    Status : PageFault
    Adapter Reset: False
    Engine Reset : True
  Page Fault Info:
    GPU VA : 0x00003ff000000000
    Type : AddressTranslationError
    Access : Read
    Engine : Graphics
    Client : GraphicsProcessingCluster
    Resource: <no info>
  Marker Data:
    No marker info.
  Active Shaders:
    1 total.
    [0]:
      ! Internal
      Type = Compute
      Hash = 3553972226
      ! Failed to get binary hash (2)
... snip...
{
  "Page fault info": {
    "Access Type": "Read",
    "Client": "Graphics Processing Cluster",
    "Engine": "Graphics",
    "Fault Type": "Failed to translate the virtual address.",
    "GPU virtual address": 70300024700928
  }
},
{
  "Shader infos": {
    "Info": {
      "Shader hash": "N/A",
      "Shader name": "compute_02",
      "Shader size": 33536,
      "Shader type": "Compute"
    }
  }
},
... snip ...
{
  "Device info": {
    "Adapter reset occurred": false,
    "Device state": "Error_DMA_PageFault",
    "Engine reset occurred": true
  }
},
... snip...
{
  "Active Warps": [
    {
      "GPU PC Address": "compute_02 [Content removed]
      "Shader mapping": null,
      "Warp count": 3
    }
  ]
},
{
  "Faulted Warps": [
    {
      "Fault Description": "A shader instruction caused an MMU fault when accessing memory.\nThis can be caused by shader bugs and binding setup issues, or possibly by a shader compiler bug or shader microcode corruption.",
      "Fault Name": "MMU Fault Error",
      "Shader GPU PC Address": "compute_02 [Content removed]
      "Shader mapping": null
    }
  ]
},
The "Internal" flag is what has my curiosity. A compute shader named compute_02, running on the graphics engine and marked internal, shows up consistently in these dumps. The Error_DMA_PageFault device state also has me wondering whether this is an upload issue with a resource.
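For reference, the two decoders print the fault address in different bases: hex in the text output, decimal in the JSON. Below is a minimal Python sketch that checks they refer to the same address, using only the two values from the dump above. The helper name format_va is made up for illustration and is not part of any NVIDIA or Unreal tooling.

```python
# Fault address as printed by the text decoder (hex string)
# and by the JSON decoder (decimal integer), copied from the dump.
text_va = int("0x00003ff000000000", 16)
json_va = 70300024700928


def format_va(va: int) -> str:
    """Format a GPU virtual address as zero-padded 64-bit hex (hypothetical helper)."""
    return f"0x{va:016x}"


# True when both decoders describe the same faulting address.
same_address = text_va == json_va

# True when the low 36 bits of the address are all zero.
low_36_bits_clear = (json_va & ((1 << 36) - 1)) == 0

print(format_va(json_va))   # 0x00003ff000000000
print(same_address)         # True
print(low_36_bits_clear)    # True
```

So both reports point at the same address, and it is a very round value (2^46 - 2^36). Whether that roundness means anything is only a guess on my part.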
Across different crashes, the breadcrumbs suggest that different parts of the frame are in flight on the GPU. In many, we’re near the beginning of the Base Pass. In others, HZB is active along with a few of the stages that follow it.
I’m attaching a couple of logs and NVIDIA dumps. We’re currently transitioning between 5.5 and 5.6, so there may be some variation there.
Is there any insight to be shared here?