Hard-to-Reproduce GPU Hangs from MMU Faults

We have a number of crashes either reported by QA or sent from their machines to our crash system via Sentry. The repros are opaque to the team, as the crashes come up in different places in the game with little clearly connecting them on the content side.

Each crash report offers a slightly different perspective but no clear underlying cause beyond a memory access fault. Most have NVIDIA GPU dump decoding working as desired, with info approximately as follows:

Decoding Aftermath GPU Crash:
 
	Device Info:
		Status       : PageFault
		Adapter Reset: False
		Engine Reset : True
 
	 Page Fault Info:
		GPU VA  : 0x00003ff000000000
		Type    : AddressTranslationError
		Access  : Read
		Engine  : Graphics
		Client  : GraphicsProcessingCluster
	Resource: <no info>
 
	Marker Data:
		No marker info.
 
	Active Shaders:
		1 total.
		[0]:
			! Internal
			Type = Compute
			Hash = 3553972226
			! Failed to get binary hash (2)
 
... snip...
{
    "Page fault info": {
      "Access Type": "Read",
      "Client": "Graphics Processing Cluster",
      "Engine": "Graphics",
      "Fault Type": "Failed to translate the virtual address.",
      "GPU virtual address": 70300024700928
    }
  },
  {
    "Shader infos": {
      "Info": {
        "Shader hash": "N/A",
        "Shader name": "compute_02",
        "Shader size": 33536,
        "Shader type": "Compute"
      }
    }
  },
... snip ...
{
    "Device info": {
      "Adapter reset occurred": false,
      "Device state": "Error_DMA_PageFault",
      "Engine reset occurred": true
    }
  },
... snip...
{
    "Active Warps": [
      {
        "GPU PC Address": "compute_02 [Content removed]
        "Shader mapping": null,
        "Warp count": 3
      }
    ]
  },
  {
    "Faulted Warps": [
      {
        "Fault Description": "A shader instruction caused an MMU fault when accessing memory.\nThis can be caused by shader bugs and binding setup issues, or possibly by a shader compiler bug or shader microcode corruption.",
        "Fault Name": "MMU Fault Error",
        "Shader GPU PC Address": "compute_02 [Content removed]
        "Shader mapping": null
      }
    ]
  },

The internal part is what has my curiosity. A compute shader on the graphics pipe named compute_02 is consistently of note, marked internal for its shader type. Error_DMA_PageFault also has me wondering if this is an upload issue with a resource.

Across different crashes, the breadcrumbs suggest different parts of the frame are in flight on the GPU. In many, we’re near the beginning of the Base Pass; in others, HZB is active along with a few stages following it.

I’m attaching a couple of logs and NVIDIA dumps. We’re currently transitioning from 5.5 to 5.6, so there may be some variation there.

Is there any insight to be shared here?

Steps to Reproduce
We don’t currently have a repro. We have around a dozen one-time hits automatically reported from QA testing, with crash reports going to our central Sentry server.

Hi there,

As a first step, you will probably want to get shader debugging set up with Aftermath so you can see exactly which shader is crashing and where.

To do this, set the following CVar in your DefaultEngine.ini under the rendering section:

r.GPUCrashDebugging.Aftermath.DumpShaderDebugInfo=1

Also in Config/Windows/WindowsEngine.ini add:

[ShaderCompiler]
r.Shaders.Symbols=1

Note that you can probably get away with only outputting the actual shader symbols on the machine used to debug the crash dump.

Note that this also requires 5.6 (there are issues preventing it from working in 5.5).

With these settings, GPU crashes will output an additional *.nvdbg file along with the *.nv-gpudmp crash dump. This *.nvdbg file provides extra debug information necessary for associating shader source code.

After opening a new crash dump in NVIDIA Nsight, set your shader search paths in Tools -> Options as follows (replacing the project path with your own):

[Image Removed]

You can then check the Crash Info tab to see exactly where the crash occurs in shader code:

[Image Removed]

Let me know the results of this, and maybe send through the *.nvdbg, *.nv-gpudmp, and associated DXIL shader and PDB symbol files listed when you click the “Show Symbol Files” button. That way I can open the nv-gpudmp with source code available on my end as well.

Regards,

Lance Chaney

Thanks for getting back to me so quickly Lance!

This is a great approach, but I’m left with the challenge that I don’t have a repro. Unfortunately, this bug only shows up from QA about 10 times per month. I’m likely going to have to get this into CI and collect the symbols as artifacts.

I’ll start reading through the code, but it’d be good to know whether enabling these flags also turns off any optimization flags.

Thanks,

-Bert

That’s probably a good idea. FYI, there is a separate CVar for turning off shader code optimization (r.Shaders.Optimize=0). The CVars I mentioned before should not automatically disable optimization.
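If you do want unoptimized shader code for debugging, that CVar can sit alongside the other shader compiler settings, e.g. (assuming the same [ShaderCompiler] section used above picks it up; ConsoleVariables.ini under [Startup] is an alternative placement):

[ShaderCompiler]
r.Shaders.Optimize=0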

Some clarification on the Aftermath CVar placement. The Aftermath CVar (r.GPUCrashDebugging.Aftermath.DumpShaderDebugInfo=1) should be placed in either:

File [<ENGINE_PATH>/Config/ConsoleVariables.ini], section [Startup]

or

File [<PROJECT_PATH>/Config/DefaultEngine.ini], section [ConsoleVariables]

Placing this CVar anywhere else will probably not work (e.g. placing it under the [/Script/Engine.RendererSettings] section in DefaultEngine.ini will not work).
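Concretely, the project-level option looks like this:

; <PROJECT_PATH>/Config/DefaultEngine.ini
[ConsoleVariables]
r.GPUCrashDebugging.Aftermath.DumpShaderDebugInfo=1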

Since the crash is so infrequent, it’s probably also a good idea to verify that you can get proper shader code association from GPU crashes in your QA builds.

To purposely trigger a GPU crash for testing, you can use the following code snippet.

float alpha = 1;
float result = 1;
// If LightVector.x is not positive for a given thread, this loop never
// terminates, which hangs the GPU and forces a crash dump.
while (alpha > 0.0)
{
    alpha -= LightVector.x; // Some input to the shader; a runtime value keeps the loop from being compiled out
    result *= (1.0 - 0.1 * alpha);
}

// …
float Masking = result; // Use the result somehow; this is used in TranslucentLightInjectionShaders.usf

You can do something like this in any global shader, but this is the one I have been using. It can be tricky to get these kinds of shaders to compile, so I always use the same example. This one modifies TranslucentLightInjectionShaders.usf. Place the snippet right before the lines:

// 0: no contribution, 1:full contribution
float Masking = 1.0f;

Then comment out or replace the existing Masking assignment. This will trigger a GPU crash and produce a crash dump you can use for testing shader code association.
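Put together, the edited section of TranslucentLightInjectionShaders.usf would look roughly like this (a sketch of the steps above, with the surrounding code omitted):

// 0: no contribution, 1:full contribution
float alpha = 1;
float result = 1;
while (alpha > 0.0)
{
    alpha -= LightVector.x;
    result *= (1.0 - 0.1 * alpha);
}
float Masking = result; // replaces: float Masking = 1.0f;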

I’ve confirmed that the crashing machine does not need shader symbols output (r.Shaders.Symbols=1). Shader symbols can be associated later, as long as the *.nvdbg file is output (r.GPUCrashDebugging.Aftermath.DumpShaderDebugInfo=1).

For CI builds generating shader symbol artifacts, Unreal has CVars for outputting a pre-zipped version of the symbol files for easier handling. Add these settings to your WindowsEngine.ini instead of r.Shaders.Symbols=1:

[ShaderCompiler]
r.Shaders.GenerateSymbols=1
r.Shaders.WriteSymbols=1
r.Shaders.WriteSymbols.Zip=1

or

[ShaderCompiler_BuildMachine]
r.Shaders.GenerateSymbols=1
r.Shaders.WriteSymbols=1
r.Shaders.WriteSymbols.Zip=1

See shader symbol docs here.

Let me know if you have any more questions, or once you get a new crash with the shader debugging information. As I said before, sending the debug files (the *.nvdbg, *.nv-gpudmp, and associated DXIL shader and PDB symbol files listed when you click the “Show Symbol Files” button), in addition to the crash log showing the active shader breadcrumbs, would be very helpful.

Regards,

Lance Chaney

Thanks Lance, this is great info. I appreciate the rigor; this is exactly what I needed and was looking through. I had just run into -buildmachine differences and spotted that config option. The sample to help validate is great too.

It’s likely going to take me a couple of days to sift through and validate that it’s all working in CI. I’ll let you know when I have something more to report.

Thanks,

-Bert

I have symbols generating correctly and can confirm that the sample you provided reproduces a GPU crash, validating that the whole setup is working.

However, I don’t have *.nvdbg files generating alongside the *.nv-gpudmp file from the crash. I did place the Aftermath CVar in [<ENGINE_PATH>/Config/ConsoleVariables.ini], section [Startup].

I started tracing through the code to make sure it’s enabled, validated that the flag is handed in correctly at creation time, and saw this in the Aftermath header:

// NOTE: shader debug information is only supported for DX12 applications using
// shaders compiled as DXIL. This flag has no effect on DX11 applications.
GFSDK_Aftermath_FeatureFlags_GenerateShaderDebugInfo = 0x00000008,

I realized that while we’re running the DX12 backend, we are using SM5 shaders, for backwards compatibility with DX11. I’m inferring, but will check, that we’re only generating one set of shaders for both, which I assume is coming from FXC rather than DXC, and thus there is no DXIL.

FYI, I realize I never answered some of your initial questions regarding what could cause this type of error. These kinds of page fault errors simply indicate that the GPU tried to access an invalid memory address. This can be caused by a number of things, such as an unbound resource, or a resource that was destroyed before or during access. It can sometimes be caused by incorrect resource state transition logic as well.

Unfortunately, SM5 shader line mappings are not supported by Aftermath, as noted in the Aftermath readme (Engine\Source\ThirdParty\NVIDIA\NVaftermath\Readme.md):

“Shader line mappings are only supported for DXIL shaders, i.e., Shader Model 6 or above. Line mappings for shaders in DXBC shader byte code format are unsupported.”

I wouldn’t expect this to necessarily stop the nvdbg file from being output, though. At least in my test case, even with only SM5 shaders enabled, I still get the nvdbg file in the crash dump directory; the shader line association just doesn’t work (since there is no DXIL file). To clarify: do you have both SM5 and SM6 support enabled, or only SM5? Unfortunately, in the SM5-only case Aftermath is not able to perform shader code association, so you would need to upgrade to SM6 to get this working. If both are supported, you will need to be running the build with SM6 enabled to get shader code association.
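For reference, targeting SM6 with the D3D12 RHI is normally a DefaultEngine.ini change along these lines (the -/+ prefixes remove and add array entries; verify the section and values against your engine version):

[/Script/WindowsTargetPlatform.WindowsTargetSettings]
-D3D12TargetedShaderFormats=PCD3D_SM5
+D3D12TargetedShaderFormats=PCD3D_SM6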

There are a few gotchas that could prevent Aftermath from outputting the nvdbg file. One is having something attached that intercepts D3D API calls, such as Microsoft PIX. If you run the build with -AttachPix as a launch argument, you will not get an nvdbg output. There is a section in the Aftermath readme covering these known limitations and incompatibilities. Here is the excerpt; I’ve highlighted the potentially relevant sections:

# Limitations and Known Issues

* Nsight Aftermath covers only GPU crashes. CPU crashes in the NVIDIA graphics
  driver, the D3D runtime, the Vulkan loader, or the application cannot be
  captured.
* Nsight Aftermath is only fully supported on Turing or later GPUs.

## D3D

* Nsight Aftermath is only fully supported for D3D12 devices. Only basic support
  with a reduced feature set (no API resource tracking and no shader address
  mapping) is available for D3D11 devices.
* Nsight Aftermath is fully supported on Windows 10 and newer, with limited support on
  Windows 7.
* Nsight Aftermath event markers and resource tracking is incompatible with the
  D3D debug layer and tools using D3D API interception, such as Microsoft PIX
  or Nsight Graphics.
* Shader line mappings are only supported for DXIL shaders, i.e., Shader Model 6 or
  above. Line mappings for shaders in DXBC shader byte code format are unsupported.
* Shader line mappings are not yet supported for shaders compiled with the DirectX
  Shader Compiler’s `-Zs` option for generating “slim PDBs”.

## Vulkan

* Shader line mappings are not yet supported for SPIR-V shaders compiled with the
  NonSemantic.Shader.DebugInfo.100 extended instruction set, i.e., shaders compiled
  with the `-gVS` option of `glslangValidator` or the `-fspv-debug=vulkan-with-source`
  option of the DirectX Shader Compiler.

Regards,

Lance Chaney

Thanks Lance!

I’m well versed in GPU crashes for sure; it’s just been a few years since I’ve been debugging them in UE. I’m coming in late on a project, without the folks who originally set things up. I was mostly wondering whether, even without the shader reflection, we were getting any clear signals there.

At this point I do have it all working with SM6. Thank you for all the pointers getting there! The next gotcha is that it seems to be bloating our builds. I need to validate, but per the docs it does seem like there is extra debug data embedded in the shaders that is causing it. This is coming back to me from the last time I was in PC D3D12 land. I think I’ll be able to get builds to QA that at least catch this crash in a controlled environment, even if they don’t match what we ship and are just for debugging purposes.

I’ll try to circle back and check on the SM5 version outputting the nvdbg file with the old project settings once I’ve figured out what, if anything, we can do about the bloated packages. I know I didn’t have any of the debug/intercept layers on from the command line, but that isn’t to say something else might not have enabled one.

Thanks again!

-Bert

Can you verify which files in your cooked data are bloating out? Using an app like TreeSize or WinDirStat is a good way to check this. I would not have expected generating shader symbols to add much bloat to the build, other than the zipped DXIL + PDB files of course. Unreal passes -Qstrip_debug to the DXC compiler when generating shader code, and extracts the PDB information manually for serialization when outputting shader symbols, so there shouldn’t be much additional debug data in the actual packaged game build. It looks like debug info and reflection data are also stripped out of FXC-compiled shader code, so the SM5 case should behave the same.
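To illustrate the idea outside of Unreal, a rough standalone DXC invocation along the same lines would be (the file and entry point names here are hypothetical):

dxc -T cs_6_0 -E MainCS MyShader.hlsl -Zi -Qstrip_debug -Fd MyShader.pdb -Fo MyShader.dxil

-Zi generates the debug info, -Qstrip_debug keeps it out of the shader binary, and -Fd writes it to a separate PDB that can be associated later, which is why the packaged build itself should not grow much.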

Regards,

Lance Chaney

Hi Lance,

It took a little time. I ended up doing three builds/cooks for comparison.

1.) Original with SM5 -> StagedBuild/Windows ~42.1 GB

2.) Swap to SM6 -> StagedBuild/Windows ~47.6 GB

3.) Enable Symbols w/ SM6 -> StagedBuild/Windows ~199 GB

CI failed on us for both the Win64 and WinGDK builds: Win64 had a massive pak file and WinGDK had a massive final GDK package.

When I cook locally to test I’ve been running this:

.\RunUAT.bat BuildCookRun -project="Belfry" -platform=Win64 -build -cook -AdditionalCookerOptions="-nodev" -stage -iterate -config=Test -buildmachine

Sample cook lines I got from the CI team look like these (these are for TEST builds):

BuildCookRun -project=Belfry -p4 -buildmachine -DDC=TeamCityAgentDDC -platform=Win64 -cook -skipstage -AdditionalCookerOptions="-nodev -verbosecookerwarnings -nower -cookprocesscount=8" -NoCompileEditor -unrealexe=UnrealEditor-Cmd.exe
 
D:\Stoic\work\91fcf94ed121fdd\Engine\Binaries\Win64\UnrealEditor-Cmd.exe "D:\Stoic\work\91fcf94ed121fdd\Belfry\Belfry.uproject" -run=Cook  -TargetPlatform=Windows  -buildmachine -ddc=TeamCityAgentDDC -unversioned -nodev -verbosecookerwarnings -nower -cookprocesscount=8 -fileopenlog -abslog="D:\Stoic\work\91fcf94ed121fdd\Engine\Programs\AutomationTool\Saved\Cook-2025.11.17-21.03.59.txt" -stdout -CrashForUAT -unattended -NoLogTimes -buildmachine

Using Beyond Compare to do a folder/file diff, the growth does appear to be isolated to shader changes under the content tree. The relevant .uasset files look comparable in size, with small diffs. The .uexp files are what have gotten significantly larger; typical examples I’m looking at went from 2.16 MB to 40.08 MB.

Are we just missing something in the packaging step? I can circle back with the CI team and see what they’re doing for packaging.

Thanks again,

-Bert