5.5 GPU Crash (device hung, NodeAndClusterCull): Recommended action?

This question was created in reference to: [5.5 GPU Crash: DXGI_ERROR_DEVICE_HUNG, [Content removed]]

We have just updated a project to 5.5 and are getting frequent reports of GPU hangs that align with those described in the post referenced above. So far these all occur in the editor while loading levels, on PCs with 3090s and the latest drivers. We have r.SceneCulling set to 1 (the default value). The referenced question states that the problem may be fixed in 5.6, but upgrading to that engine version is not yet an option for us. After reviewing the details in that question, it is not immediately clear to us what our next steps should be to try to prevent the hangs. Could you suggest a course of action?

Hello,

We don’t have a reliable workaround for this GPU hang, and we’re still working with NVIDIA to determine the cause and a fix.

Things to try for users experiencing this include:

  • Disable async compute for Lumen with r.Lumen.AsyncCompute=0, or turn off async compute entirely with r.RDG.AsyncCompute=0, while the level is loading, then turn it back on after the level has finished loading.
  • Set r.Nanite.InstanceHierarchyArgsMaxWorkGroups to a low value like 512 before and during level loading, then increase it to 64k or the default after loading is complete (see the sketch after this list for one way to automate these toggles). Alternatively, make the Editor viewport small while the level is loading, which has a similar effect, and enlarge it again after loading has completed.
  • Temporarily disable areas visible in the scene that have large numbers of local shadow-casting lights so they’re not casting shadows while loading.
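For reference, here is a minimal sketch of one way to automate those toggles from project code, using the console manager and the map-load delegates; treat the restore values in the second lambda as assumptions and check the defaults your project actually uses:

// Sketch only: clamp the suggested CVars while a map is loading, then restore them.
#include "CoreMinimal.h"
#include "HAL/IConsoleManager.h"
#include "UObject/UObjectGlobals.h"

static void SetCVarInt(const TCHAR* Name, int32 Value)
{
    if (IConsoleVariable* CVar = IConsoleManager::Get().FindConsoleVariable(Name))
    {
        CVar->Set(Value, ECVF_SetByCode);
    }
}

// Call once at startup, e.g. from your module's StartupModule().
void RegisterGpuHangLoadWorkarounds()
{
    // Before a map starts loading: take Lumen off async compute and clamp Nanite culling work.
    FCoreUObjectDelegates::PreLoadMap.AddLambda([](const FString& /*MapName*/)
    {
        SetCVarInt(TEXT("r.Lumen.AsyncCompute"), 0);
        SetCVarInt(TEXT("r.Nanite.InstanceHierarchyArgsMaxWorkGroups"), 512);
    });

    // After the map has finished loading: restore the values used during normal play.
    FCoreUObjectDelegates::PostLoadMapWithWorld.AddLambda([](UWorld* /*LoadedWorld*/)
    {
        SetCVarInt(TEXT("r.Lumen.AsyncCompute"), 1);                             // assumed default
        SetCVarInt(TEXT("r.Nanite.InstanceHierarchyArgsMaxWorkGroups"), 65536);  // the "64k" value above
    });
}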

Please let us know if any of these suggestions work for your use case.

Thank you for the updates. We do have a fairly consistent repro, but so far the workarounds that prevent the crash in our case do not work in all cases.

Also, as a reminder, Epic is on holiday break from 6/30 to 7/11, returning on 7/14. Confidential issues will not be answered during that time, and responses to non-confidential issues may be slow.

Thank you for the detailed info! In the cases we have repros for, the Aftermath GPU crash dumps indicate a page fault in InstanceCull, which happens right before NodeAndClusterCull. If you have logs for the crash cases that were run with -d3ddebug enabled, they might include this:

Op: 43, BeginEvent [NoOcclusionPass]
Op: 44, ResourceBarrier
Op: 45, Dispatch
Op: 46, Dispatch
Op: 47, ResourceBarrier
Op: 48, Dispatch
Op: 49, ResourceBarrier
Op: 50, ResourceBarrier - LAST COMPLETED
Op: 51, ExecuteIndirect <-- InstanceCull - group work where the PageFault happens
Op: 52, WriteBufferImmediate
Op: 53, BeginEvent [NodeAndClusterCull] <-- NodeAndClusterCull hasn’t begun yet

Are you seeing similar output in your crash logs with -d3ddebug enabled? If not, it could very well be that there are two separate issues in this area: one from too much work and one from page faults. I also noticed that when I reproduced the crash locally I got a PageFault error with the newer drivers (576.28+), but wasn’t always getting that error with the older drivers I tested (551 - 560.94) on a 4070 Super.
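(For anyone else trying to capture this output: -d3ddebug is just a command-line switch for the editor, e.g. UnrealEditor.exe YourProject.uproject -d3ddebug, where the project path is a placeholder for your own.)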

Removing WPO and reducing Nanite fidelity on the trees saved us from crashing.

This makes sense, because we are also seeing that the crash usually takes a large amount of group work and culling to reproduce on the 30 and 40 series GPUs, so reducing that work should make it less likely to crash. My local repro case uses the Hillside sample.

We’re still working with NVIDIA to determine the root cause.

Thanks Alex,

I can now provide an update on where we are:

  • We tried temporarily setting r.Nanite.InstanceHierarchyArgsMaxWorkGroups to 512 as suggested, but this did not seem to be enough to avoid the GPU hangs.
  • We tried temporarily disabling VSMs while loading levels, and this seemingly prevented the hangs from happening (one way to automate this is sketched below).
  • We did not try temporarily disabling Lumen async compute.

So far we have only seen this issue when launching the editor, so the workaround is only active in that case and we are OK with it for the time being; however, we cannot rule out that the problem may manifest in packaged builds.
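For anyone wanting to try the same thing, here is a minimal sketch of one way such an editor-launch-only toggle could be wired up. It assumes VSM is disabled via r.Shadow.Virtual.Enable and re-enabled from the PostLoadMapWithWorld delegate, which is just one possible implementation rather than exactly what we shipped:

// Sketch only: disable VSM before the first map load and re-enable it once that load completes.
#include "CoreMinimal.h"
#include "HAL/IConsoleManager.h"
#include "UObject/UObjectGlobals.h"

// Call once during startup, before the initial map begins loading.
void DisableVsmUntilFirstMapLoaded()
{
    IConsoleVariable* VsmCVar = IConsoleManager::Get().FindConsoleVariable(TEXT("r.Shadow.Virtual.Enable"));
    if (!VsmCVar)
    {
        return;
    }

    VsmCVar->Set(0, ECVF_SetByCode);

    // Re-enable VSM after the first map has finished loading, then unhook the delegate.
    static FDelegateHandle Handle;
    Handle = FCoreUObjectDelegates::PostLoadMapWithWorld.AddLambda([VsmCVar](UWorld* /*LoadedWorld*/)
    {
        VsmCVar->Set(1, ECVF_SetByCode);
        FCoreUObjectDelegates::PostLoadMapWithWorld.Remove(Handle);
    });
}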

We seem to be able to get a somewhat consistent repro when the workaround is disabled, so if you wanted us to try some code changes to gather more information about the problem that might contribute to its future resolution, we may be able to find the time to do so.

Hello Alex,

We used to have a consistent repro for this issue that is not related to loading, but we have implemented a workaround. I can’t send you the project without approval, but I can describe the repro and then the workaround, which does not involve any CVars, to better help you investigate the issue.

Repro:

  • We had a huge procedural assembly consisting of Nanite-level Megascans trees. The poly count of those trees is around 4 million. The leaves of those trees are modeled with masked materials, so they create a lot of Nanite overdraw. Additionally, the leaves and branches of those trees had WPO.
  • We had a few buildings where gameplay was happening. One of these buildings had a hole in the roof through which sunlight was entering, so all of the trees’ shadows were being cast inside it.
  • We were using software Lumen and VSM because we experienced quality issues with MegaLights.
  • The most important thing is that we were animating the sun directional light with Sequencer, so all VSM pages were invalidated every frame. The issue did not reproduce if we teleported directly to the crashing area without triggering the Sequencer animation of the sun directional light.
  • For debugging purposes, I increased TdrDelay to 60 instead of the default value of 2, and I noticed that there is no page fault, just very long frames that last around 6 seconds.

Workaround:

  • Removing WPO and reducing Nanite fidelity on the trees saved us from crashing in that area, even when the directional light was animating. Some people are still experiencing the crash in different areas, so we will track down other problematic content and apply the same workaround.

Conclusion:

  • Nanite’s NodeAndClusterCull pass is hitching for multiple seconds in ShadowDepth when updating too many VSM pages at a time while there is an elevated amount of Nanite overdraw, likely caused by masked materials and WPO. In turn, these long hitches are triggering the TDR mechanism in Windows.
  • Issues on level loading, when a huge number of World Partition cells are loaded, may be caused by the elevated number of VSM pages that have to be rendered at once (I did experience that once with the Valley of the Ancient project). They might also be caused by Nanite overdraw and WPO, but I theorize that the sheer volume of Nanite+VSM updates might be enough to create a frame that takes over 2 seconds to render.

Hope that helped,

Jean-Michel Gilbert