GPU Crash in UE5.6 related to AsyncCompute & Nanite

I encountered widespread GPU crashes, which I am currently avoiding by setting r.RDG.AsyncCompute=0.

According to Aftermath analysis, the direct cause of the errors is Nanite. Nanite-related shaders triggered page faults and misaligned address errors, specifically in the LoadNaniteMaterial, Raster, and Cluster-related code sections.

Our project is based on UE5.6 with some minor modifications, but the problematic part (Nanite) is completely unmodified.

I have provided several crash reports, including Aftermath (with shader symbols) and logs (with breadcrumbs).

Steps to Reproduce
Use a large scene with some Blueprints and splines, roughly the size of the city demo.

Hello,

Thank you for reaching out.

We have not been able to reproduce this crash.

Can you please send us a minimal test project that demonstrates this, and include screenshots of the issue?

The guide for test projects:

[Content removed]

If you cannot provide a test project, can you please provide more detailed reproduction steps, including any settings or configurations needed?

Hello,

We are handing this to another team for further investigation and consideration.

Hi,

Thanks for reaching out. I started looking at the dumps and noticed two crashes in the same spot inside NaniteDebugViews.usf. Could you let me know if you are opening the debug views when the crash happens? I suspect that we are reading an incorrect address from the DebugViewData ByteAddressBuffer, but I need to do some more digging to find out how this could be related to disabling async compute.


Hi, thanks for the reply.

The crash we encountered is intermittent and is triggered when multiple team members work on a large map. It occurs across multiple computers, typically after running for more than half an hour. Since we cannot pinpoint which part of the map causes the issue, and our scene is still being modified, we are unable to provide a smaller test case. However, here is what we know about the crash:

  1. It is more likely to occur when switching to debug view.
  2. It is more likely to occur when switching to full-screen mode.
  3. It is more likely to occur when using Ctrl+Z.

Additionally, NVIDIA driver versions between 570.00 and 580.97 have potential bugs, which we suspect may also affect UE's stability. We are currently more stable on version 581.08, but we are unsure whether to re-enable async compute, as a frequent crash would disrupt the production workflow. We may gradually enable it on some machines and observe whether the crash recurs.

We would also appreciate any suggestions you have.
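For a staged rollout like the one described above, a small helper can flag machines whose driver falls in the suspect range before async compute is re-enabled on them. This is only a sketch: `DriverVersion` and `IsInSuspectRange` are hypothetical names, the range is assumed to be inclusive at both ends, and how the driver version is actually obtained on each machine is left out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical representation of a driver version such as 581.08.
struct DriverVersion
{
    uint32_t Major;
    uint32_t Minor;
};

// Orders versions by major number first, then by minor number.
constexpr bool operator<(DriverVersion A, DriverVersion B)
{
    return A.Major != B.Major ? A.Major < B.Major : A.Minor < B.Minor;
}

// Returns true if the version lies in the range reported as problematic
// in this thread (570.00 to 580.97, assumed inclusive on both ends).
constexpr bool IsInSuspectRange(DriverVersion Version)
{
    constexpr DriverVersion Lowest{570, 0};
    constexpr DriverVersion Highest{580, 97};
    return !(Version < Lowest) && !(Highest < Version);
}
```

With this check, a machine on 581.08 would be a candidate for re-enabling async compute, while one on 580.97 or earlier in the range would not.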

Okay, thanks for the info. Do let me know if your stability improves with the newer drivers. Can you reproduce the crash again and generate an Aftermath crash dump with shader PDBs? That way, it will be much easier to pinpoint the source of the crash.

Thanks for the reply.

For stability, we are keeping async compute off for now and will switch it back on when we merge 5.7 at the end of this year. We are currently on driver 581.08 with r.RDG.AsyncCompute=0, and are stable.

For the shader symbols, please check the attachments on the original question above. I placed the DXIL files (with symbols) alongside the Aftermath gpudmp files, and I have verified that the symbols resolve when the search path options point at these DXIL files.

Also, I will try it on my machine when I get the chance to focus on this issue.

Hi,

Based on our previous discussions, the crashing passes are almost all graphics passes. This is because pixel shaders typically execute in 2x2 tiles (quads).

Consequently, when the resolution is not a multiple of 2, the quads along the right and bottom edges extend past the render target, and the shader might attempt to read from an invalid or out-of-bounds memory address for those pixels, leading to access violations and crashes.

In contrast, the Visibility pass is a compute shader and operates differently. It does not perform off-screen writes in the same manner, which makes it less susceptible to this specific issue.
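To make the 2x2 argument concrete, the following sketch counts how many pixel positions are touched by quads but lie outside the render target, and shows one common way to keep an index in range by clamping. It is an illustration of the reasoning above only, not the code from the submitted fix; `QuadsToCover`, `CountOutOfBoundsLanes`, and `ClampCoord` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Number of 2x2 quads needed to cover one dimension of the target
// (rounds up, so an odd size needs one extra, partially covered quad).
constexpr uint32_t QuadsToCover(uint32_t Pixels)
{
    return (Pixels + 1u) / 2u;
}

// Pixel positions covered by quads that fall outside a Width x Height target.
// These are the lanes that could form an out-of-bounds address if the shader
// derives a buffer offset directly from the pixel coordinate.
constexpr uint32_t CountOutOfBoundsLanes(uint32_t Width, uint32_t Height)
{
    const uint32_t CoveredWidth = QuadsToCover(Width) * 2u;
    const uint32_t CoveredHeight = QuadsToCover(Height) * 2u;
    return CoveredWidth * CoveredHeight - Width * Height;
}

// Clamps a coordinate into [0, Size - 1] so a derived address stays valid.
inline uint32_t ClampCoord(uint32_t Coord, uint32_t Size)
{
    assert(Size > 0u);
    return Coord < Size ? Coord : Size - 1u;
}
```

For an even resolution such as 1920x1080 the count is zero, while 1921x1081 leaves 3003 positions outside the target (one extra column, one extra row, and the shared corner).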

Do you have any updates on the fix (shelf) we provided? I’m looking forward to your feedback on your tests.

Thanks. We have merged that fix into our engine. We have tested it for several days on a couple of machines and have not seen any new hangs so far. We will keep testing for a few more days, including on packaged cooked builds.

Thank you for your feedback. The fixes have been submitted to UE5 Main.

Hi, what is the CL to fix this? Thanks.

https://github.com/EpicGames/UnrealEngine/commit/ab1083e4d04765e4daf32c68f807d10b7cec23ad

https://github.com/EpicGames/UnrealEngine/commit/e23bf8c26bb2ae9dc5240fa1c001a71b35ac34ae