GPU hang; probably another NodeAndClusterCull issue.

I see that this issue has been reported in a few other threads, but I wanted to add that our team is in the process of upgrading from Unreal 5.4 to 5.5, and some of our QA testers are seeing this crash at a pretty high rate while loading into a particular level.

I haven’t been able to repro this myself (on a 4070), but our QA team can hit it pretty reliably on their 30XX cards. Please let me know if there are any settings you’d like us to toggle to gather info that might help you diagnose the root cause of the issue.

We have been able to mitigate the issue by disabling Nanite and Lumen async compute with:

```
r.Lumen.AsyncCompute=0
r.Nanite.Streaming.AsyncCompute=0
```
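In case it helps anyone else, below is a rough C++ sketch of forcing those two cvars early in startup from a game module; the module and class names are placeholders, and setting the same values in an ini (or on the command line) works just as well. This is only an illustration, not exactly what we ship:

```cpp
// Rough sketch (placeholder module/class names): force the async-compute
// mitigation cvars at startup. Equivalent to setting them in an ini file.
#include "Modules/ModuleManager.h"
#include "HAL/IConsoleManager.h"

class FMyGameModule : public FDefaultGameModuleImpl
{
public:
	virtual void StartupModule() override
	{
		// Disable Lumen async compute (one half of the mitigation above).
		if (IConsoleVariable* LumenAsync =
				IConsoleManager::Get().FindConsoleVariable(TEXT("r.Lumen.AsyncCompute")))
		{
			LumenAsync->Set(0);
		}

		// Disable Nanite streaming async compute (the other half).
		if (IConsoleVariable* NaniteAsync =
				IConsoleManager::Get().FindConsoleVariable(TEXT("r.Nanite.Streaming.AsyncCompute")))
		{
			NaniteAsync->Set(0);
		}
	}
};

IMPLEMENT_PRIMARY_GAME_MODULE(FMyGameModule, MyGame, "MyGame");
```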

Here are the breadcrumbs from the crash:

```
[2025.08.05-23.01.38:268][421]LogD3D12RHI: Error: [D3DDebug] ID3D12Device::RemoveDevice: Device removal has been triggered for the following reason (DXGI_ERROR_DEVICE_HUNG: The Device took an unreasonable amount of time to execute its commands, or the hardware crashed/hung. As a result, the TDR (Timeout Detection and Recovery) mechanism has been triggered. The current Device Context was executing commands when the hang occurred. The application may want to respawn and fallback to less aggressive use of the display hardware).
[2025.08.05-23.01.38:269][421]LogD3D12RHI: Error: GPU crash detected:
    Device 0 Removed: DXGI_ERROR_DEVICE_HUNG
[2025.08.05-23.01.38:270][421]LogRHI: Error: Active GPU breadcrumbs:
Device 0, Pipeline Graphics: (In: 0x80b60e5b, Out: 0x80b60e55)
(ID: 0x80b60d95) [ Active] Frame 51419
(ID: 0x80b60ebf) [ Active] FRDGBuilder::Execute
(ID: 0x80b60de5) [ Active] Scene
(ID: 0x80b60e51) [ Finished] ShadowDepths
(ID: 0x80b60e52) [ Finished] FVirtualShadowMapArray::BuildPageAllocation
(ID: 0x80b60e53) [ Finished] InitializePhysicalPages
(ID: 0x80b60e54) [ Active] ShadowDepths
(ID: 0x80b60e55) [ Active] BuildRenderingCommandsDeferred(Culling=%s)
(ID: 0x80b60e56) [ Active] RenderVirtualShadowMaps(Nanite)
(ID: 0x80b60e59) [ Active] Nanite::DrawGeometry
(ID: 0x80b60e5a) [ Active] NoOcclusionPass
(ID: 0x80b60e5b) [ Active] NodeAndClusterCull
(ID: 0x80b60e5c) [Not Started] RenderVirtualShadowMaps(Non-Nanite)
```


Hello,

This does look like the same issue. Have you tested disabling just part of Lumen on the async queue during level load, for example r.Lumen.DiffuseIndirect.AsyncCompute=0? Judging by the breadcrumbs, this looks like what we see when rendering the first frame after loading a level, and in recent tests, temporarily disabling select GPU passes on the async queue beforehand can work around the issue.
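To illustrate the "temporarily" part, here is a rough sketch that scopes the cvar to map loads using the PreLoadMap / PostLoadMapWithWorld delegates. The hook points are just one option (and you may need to delay the restore a frame or two past the load, depending on how your loads are driven), so treat it as a starting point rather than a confirmed fix:

```cpp
// Rough sketch only: keep Lumen diffuse-indirect off the async queue for the
// duration of a map load, then restore it. Hook points and restore timing are
// assumptions; adjust to wherever your level loads are actually triggered.
#include "UObject/UObjectGlobals.h"
#include "HAL/IConsoleManager.h"

static void SetLumenDiffuseIndirectAsyncCompute(int32 Value)
{
	if (IConsoleVariable* CVar =
			IConsoleManager::Get().FindConsoleVariable(TEXT("r.Lumen.DiffuseIndirect.AsyncCompute")))
	{
		CVar->Set(Value);
	}
}

// Call once during startup (e.g. from your game module or game instance).
static void RegisterLevelLoadAsyncComputeWorkaround()
{
	// Take the pass off the async queue before the map starts loading.
	FCoreUObjectDelegates::PreLoadMap.AddLambda([](const FString& /*MapName*/)
	{
		SetLumenDiffuseIndirectAsyncCompute(0);
	});

	// Restore it once the load has finished (optionally defer a frame or two).
	FCoreUObjectDelegates::PostLoadMapWithWorld.AddLambda([](UWorld* /*LoadedWorld*/)
	{
		SetLumenDiffuseIndirectAsyncCompute(1);
	});
}
```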

I tried disabling the AsyncComputeTransientAliasing cvar that was mentioned in another thread, but we were still seeing the crash with it disabled. I haven’t done many tests on limiting the perf impact, since we’re likely to follow our 5.5 update with an update to 5.6 shortly after; I’ll wait until we’re on 5.6 before trying to minimize the perf impact of any crash mitigations.

Our QA team can repro this pretty reliably, though, so please holler if there are any tests we can run to help you track down the root cause.

r.RDG.AsyncComputeTransientAliasing=0 may fix the issue if you run into it in 5.6, but it hasn’t prevented the hang for most users in 5.5. We’ll reach out if we have a reliable workaround or fix to test. As it stands, it appears there may be a driver-related issue with local memory, or potentially a memory overwrite affecting local memory, that causes the ViewIndex in GetNaniteView() in the InstanceCull compute shader to go out of bounds, which leads to a DMA fault.