GPU crash - MMU fault detected during a GPU memory Read

We have recently started experiencing a frequent GPU crash in our game builds, and could use some help identifying likely causes and possible resolutions. We are currently running a round of internal testing to try to isolate the problem, and will update with further details if this yields results.

We are using a forked version of the 5.3.2 release, with minimal changes made to it. The crash occurs in game builds created for Windows, currently using the Development configuration.

This crash seems to reliably affect our target machines using ADA 6000 GPUs, but we aren’t able to reproduce it on workstations using RTX 4090 GPUs.

Nsight description of the crash:

  • MMU fault detected during a GPU memory Read of a destroyed unnamed resource or other resource(s) at address 0x000000XXXXXXXXXX.
  • There are no debug names found for the resources in the Page Fault Resource History list.

We are able to somewhat reliably reproduce by:

  • Running the build with automated level switching in a cycle until it crashes
  • Loading directly into a specific level and looping the sequence until it crashes

Levels consist of a Nanite & Lumen environment with several skeletal meshes performing looping animations driven by Sequencer. Some of these emit Niagara particles from their skeletal mesh.

We have determined the following from testing:

  • Crash persists even after removing all Nanite environment meshes and Niagara effects from the level
  • Not likely related to GPU skinning as this is not enabled in our project

The development team suspects it’s related to:

  • A caching issue loading data from previous level
  • Data being corrupted by the load in-between scenes

Any insight you can potentially provide would be valued. I will try and share more information on the issues when I receive it from the team.

Thanks,

Thor

Steps to Reproduce
Still working on isolating repro for this, trying to identify specific source of issue.

Hello,

When dealing with GPU crashes, our general recommendations are available in the docs here. Later versions of that page are also worth looking at, because they add information about why these crashes happen.

The most useful debug data usually comes from reproducing the crash with -gpucrashdebugging enabled (or r.GPUCrashDebugging=1), which will enable GPU breadcrumbs, NVIDIA Aftermath and DRED. You’ll also want to test with the latest GPU drivers to rule out a driver issue that has already been fixed.
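As a rough illustration of combining this with the looping repro described above, here is a minimal Python sketch that relaunches a packaged build with -gpucrashdebugging until a run exits with a non-zero code. The function names and the executable path are hypothetical, and treating a non-zero exit code as a crash is an assumption that may not hold for every build.

```python
import subprocess


def build_command(exe, extra_args=()):
    """Return the argument list for launching the build with GPU crash debugging.

    Only -gpucrashdebugging comes from the advice above; extra_args is whatever
    your project already passes on the command line.
    """
    return [str(exe), "-gpucrashdebugging", *extra_args]


def run_until_crash(exe, max_runs=20, extra_args=()):
    """Launch the build repeatedly and return the 1-based index of the first
    run that exits with a non-zero code, or None if every run exits cleanly.

    Assumption: a GPU crash surfaces as a non-zero process exit code.
    """
    for run in range(1, max_runs + 1):
        result = subprocess.run(build_command(exe, extra_args))
        if result.returncode != 0:
            return run
    return None


# Example call (hypothetical path, replace with your packaged executable):
# run_until_crash(r"C:\Builds\MyGame\MyGame.exe", max_runs=50)
```

Stopping at the first failing run keeps the logs and crash dumps from that run easy to find before a later launch overwrites them.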

On NVIDIA GPUs it can be helpful to get Aftermath crash dumps, as outlined here, which you can then open in Nsight to sometimes determine which shader and code were executing. GPU breadcrumbs and the command lists output by DRED at the time of a crash can also provide hints as to what is going on.

If you can provide the logs containing the breadcrumbs, that can help us look for known issues with similar GPU breadcrumbs. And of course, providing a repro of the issue in a template project would be ideal. We have similar GPUs and may be able to repro the issue with the same driver.
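If the full logs are large, a small filter can help pull out the relevant part before sharing. The sketch below is only an illustration: the function name is hypothetical, and the keywords are an assumption about what the crash section of the log mentions, since the exact log format is not specified here.

```python
def extract_matching_lines(log_text, keywords=("breadcrumb", "aftermath", "dred"), context=2):
    """Return log lines that mention any keyword (case-insensitive), together
    with `context` lines before and after each match, in original order.

    The default keywords are guesses; short ones such as "dred" can also match
    unrelated words, so review the result before sharing it.
    """
    lines = log_text.splitlines()
    keep = set()
    for i, line in enumerate(lines):
        lowered = line.lower()
        if any(keyword in lowered for keyword in keywords):
            start = max(0, i - context)
            end = min(len(lines), i + context + 1)
            keep.update(range(start, end))
    return [lines[i] for i in sorted(keep)]


# Example call (hypothetical file name):
# with open("MyGame.log", encoding="utf-8", errors="replace") as f:
#     print("\n".join(extract_matching_lines(f.read())))
```

Keeping a few lines of context around each match preserves the surrounding log entries, which often show what was running just before the fault.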