GPU crash in Nanite::InitContext or DistanceFieldStreaming

Hello,

I have some updates from my colleague, who is currently improving the crash minidump generation process. To test the crash handling, he rewrites random bytes in the game's mapped memory. Interestingly, about 1 out of 10 of these crashes is reported as a GPU crash. The GPU report looks like the other crashes from the crash hunt sessions - no DRED, no Aftermath, no active shaders or resources (even though all of those features are enabled). During the testing, he also got a lot of crashes coming from the Nvidia driver. However, subsequent sessions without any artificial memory overwrites produced the same Nvidia driver crashes a few seconds into each run. This state persisted until he cleared the PSO cache; then the game stabilized back to normal. This behaviour leads us to the conclusion that a memory stomp can cause GPU crashes and overall instability in subsequent runs.

“In the past we had a UE side PSO cache but that’s been disabled for years - I’m assuming you’re not using that feature and are referring to the default driver PSO cache.”

I cannot give you an exact answer here, because PSO caching was a task for another colleague [mention removed]. But I believe he just cherry-picked PSO-handling improvements from newer engine versions and improved the precaching on the game side.

Regards,

Tomas Ruzicka.

[Attachment Removed]

Hello.

Unfortunately, I don’t have anything new. A few weeks back, the crashes started being completely random again, so the originally reported crashes now seem irrelevant.

In our crash reporting system (Backtrace), I can see that the breadcrumbs didn’t mark any specific pass in most of the cases (see the first row - Frame N { FRDGBuilder::Execute }). The pass itself is reported as [Active] in the log, but the nested passes are [Not Started].

The BasePass is crashing too (25 crashes), but those reports are currently fragmented, because each BasePass crash report carries a unique material instance name.

The following reports are from the last three days, gathered during our nightly crash hunt sessions (50+ PCs):

[Image Removed]

For example, the NightVision passes are our custom compute passes. They are relatively texture-fetch heavy, but not heavy enough (0.2 ms at full HD on a mid-range GPU) to cause a GPU crash.

Increasing the TDR timeout in the registry didn’t help either.
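
(For reference, by the TDR timeout we mean the standard Windows timeout detection and recovery values under the GraphicsDrivers registry key; the numbers below are only an illustration of the kind of increase we tried, not our exact settings.)

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers
  TdrDelay = 60 (REG_DWORD, seconds)
  TdrDdiDelay = 60 (REG_DWORD, seconds)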

In the meantime, we are also trying to resolve the CPU part of the render crashes, discussed in a separate thread: [Content removed]

So far, this has had no significant impact on the GPU crashes either.

Regards,

Tomas Ruzicka.

[Attachment Removed]

Hello.

Thanks for the reply.

I’ve found several of the mentioned CPU crashes in Nanite::FStreamingManager::AddParentNewRequestsRecursive() in our Backtrace, and I’ve integrated your proposed CL today.

Yes, we are using the r.RDG.Events=3 and r.ShowMaterialDrawEvents=1 cvars for the nightly crash hunt sessions to get more verbose reports.
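
For completeness, a minimal sketch of how these could be pinned in DefaultEngine.ini for such runs (this is only an illustration of the settings; any of the usual ways of setting cvars works the same):

[SystemSettings]
r.RDG.Events=3
r.ShowMaterialDrawEvents=1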

Today, I’ve also tried uncommenting the GFSDK_Aftermath_FeatureFlags_EnableShaderErrorReporting flag in RHICoreNvidiaAftermath.cpp when the “-nvaftermathall” cmd arg is used, to get potentially more valuable Aftermath reports. So far, we are getting one 100% reproducible GPU crash (“Misaligned Address Error”) across various materials in the main view Nanite BasePass, but only when a scene capture is active. Unfortunately, the CL provided above didn’t fix this particular GPU crash.

We will try to run the nightly crash hunts with fewer and fewer render features in the upcoming days to see if it helps in some way.

Regards,

Tomas Ruzicka.

[Attachment Removed]

Hello.

Thanks for the reply.

We fixed the LDS overflow in GlobalDistanceField.ush two weeks ago with a similar fix. It resolved that particular crash, but it didn’t influence the rest of the GPU crashes.

The continuing GPU crash hunt investigation has shown us the following behaviour:

  • reference state (Nanite, Lumen, VSM, no HW RT, no async compute) - 100% GPU crashes
  • same as above + FPS limited to 60 - 80% GPU crashes
  • same as above + no scene captures and no custom render passes - 65% GPU crashes
  • same as above + no postprocessing at all - 45% GPU crashes

The opposite tests - minimal drawing, only the opaque geo with Nanite, with and without simple lighting (dir light, VSM) - surprisingly didn’t crash at all.

Enabling postprocessing in the minimal draw test reintroduced the GPU crashes.

We haven’t finished the rest of the bisection tests of the minimal draw yet (including full lighting, no VSM, clouds, vol fog and transparents). However, the results above suggest the following behaviour: a busier render with more passes while Nanite is enabled leads to more GPU crashes.

Completely disabling Nanite in the “reference state” above didn’t introduce any GPU crashes.

Regards,

Tomas Ruzicka.

[Attachment Removed]

Hello.

The enabled GFSDK_Aftermath_FeatureFlags_EnableShaderErrorReporting flag has uncovered another LDS index overflow, this time in LumenScreenProbeFiltering.usf - ScreenProbeCompositeTracesWithScatterCS(), at a block containing:

InterlockedAdd(SharedAccumulators[ThreadIndex][0], QuantizedLighting.x);
InterlockedAdd(SharedAccumulators[ThreadIndex][1], QuantizedLighting.y);
InterlockedAdd(SharedAccumulators[ThreadIndex][2], QuantizedLighting.z);

We have locally fixed this crash by wrapping all LDS accesses in that scope in:

if (all(QuantizedGatherTexelCoord < ScreenProbeGatherOctahedronResolution))

{ … }
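
For clarity, here is a minimal, self-contained sketch of the guard pattern we applied. The identifiers mirror the ones above, but the thread-group size, array size and placeholder values are illustrative only; this is not the actual Lumen shader code:

// Illustrative sketch of the LDS guard, not the real LumenScreenProbeFiltering.usf code.
// In the real shader the accumulator size and indexing depend on the gather resolution,
// so threads outside the valid texel range must not touch the shared memory at all.
#define MAX_GATHER_TEXELS 64                        // placeholder size

uint2 ScreenProbeGatherOctahedronResolution;        // uniform referenced by the guard

groupshared uint SharedAccumulators[MAX_GATHER_TEXELS][3];

[numthreads(8, 8, 1)]
void GuardedScatterSketchCS(uint ThreadIndex : SV_GroupIndex,
                            uint3 GroupThreadId : SV_GroupThreadID)
{
    // Placeholders standing in for the real per-thread computation.
    uint2 QuantizedGatherTexelCoord = GroupThreadId.xy;
    uint3 QuantizedLighting = uint3(1, 1, 1);

    // The local fix: only threads whose texel coordinate lies inside the octahedron
    // resolution perform the shared-memory accumulation, so the LDS index can no
    // longer run out of bounds.
    if (all(QuantizedGatherTexelCoord < ScreenProbeGatherOctahedronResolution))
    {
        InterlockedAdd(SharedAccumulators[ThreadIndex][0], QuantizedLighting.x);
        InterlockedAdd(SharedAccumulators[ThreadIndex][1], QuantizedLighting.y);
        InterlockedAdd(SharedAccumulators[ThreadIndex][2], QuantizedLighting.z);
    }
}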

Repro steps are unfortunately unknown.

Only one PC reported this, while running the build with:

  • r.RDG.ParallelExecute 0
  • r.RDG.Debug.FlushGPU 1

Specs:

  • AMD Ryzen 9 9900X 12-Core Processor
  • NVIDIA GeForce RTX 3070 Ti, driver version: 581.02

Regarding our GPU crash hunts, we ran the following “simple rendering” versions:

  • A - Draw everything, but without VSM.
  • B - Draw only opaque geo and do simple lighting (dir light, fog, atmosphere, VSM), no custom render passes, no scene captures, no postprocessing. FPS limit set to 60.
  • C - Same as B, but with Lumen enabled. No FPS limit.
  • D - Same as C, but with clouds, vol fog, god rays and SSS.
  • E - Same as D, but with transparent geo.

Surprisingly, all of them were crashing a lot.

Most of the GPU crashes are still reporting empty “{ FRDGBuilder::Execute }” breadcrumbs. However, Aftermath markers seem to be working better and half of the GPU crashes contain the executing Nanite::DrawGeometry marker.

So far, we have only one stable “simple rendering” version - drawing only the Nanite and static geo without anything else. We will do a bisection between this stable version and the “B” version above in the following days.

Regards,

Tomas Ruzicka.

[Attachment Removed]

Hello.

Thanks for the proposed CL.

I’m attaching some nv-gpudmps. However, they will probably be useless, as the shader debug info callback (GFSDK_Aftermath_FeatureFlags_GenerateShaderDebugInfo) is never called (even when it is enabled) in these types of crashes.

The crashing PCs have nothing special. Our top crashing PCs are:

  • Intel(R) Core™ i5-14600KF, NVIDIA GeForce RTX 3070 Ti, 581.29
  • AMD Ryzen 9 7900X, NVIDIA GeForce RTX 3070 Ti, 581.57
  • AMD Ryzen 9 9900X, NVIDIA GeForce RTX 3070 Ti, 581.02
  • AMD Ryzen 9 9900X, NVIDIA GeForce RTX 3070, 581.02

We observed that when the game build crashed on one machine within 15 minutes, over and over, the exact same build never crashed on another machine.

“We will do a bisection between this stable version and the “B” version above in the following days.”

The last bisection showed us that reducing the render features also reduces the GPU crash rate while Nanite is active.

  • Drawing opaque geometry in the base pass without anything else (features were disabled via show flags) didn’t crash on any machine.
  • Enabling simple lighting (directional lights, sky lighting, fog, no shadows) introduced a small number of random GPU crashes.
  • Enabling Lumen increased the GPU crash rate.
  • The same goes for volumetric fog, transparents and postprocessing.
  • With postprocessing enabled, we reach our current 100% GPU crash rate.

Then, we retested the r.Nanite=0 scenario (with everything else enabled and with the fallback meshes) and, again, we got zero GPU crashes.

Then, we did another test where we replaced all Nanite materials with the default gray material. Here, we also got zero GPU crashes.

In the next test, where we dramatically simplified our opaque materials to just sampling albedo, RMA (roughness, metalness, AO) and normal textures, the game also didn’t crash as often (several GPU crashes over one night run).

We also tried running the Matrix demo with a flythrough camera, and we didn’t receive any GPU crashes there either.

Is it possible that our materials exercise some uncommon code path in Nanite that could result in random GPU memory overwrites - for example, by using too many UVs or something similar?

Our colleague opened a separate ticket about the unaligned memory access GPU crash in Nanite, as we still don’t know whether the two are connected.

[Content removed]

Regards,

Tomas Ruzicka.

[Attachment Removed]

Hello.

Thank you both for the answers.

I’ve looked into our cherry-picks, and we have already merged all of your proposed CLs:

  • 45662288 - missing ViewRect initialization (two days ago).
  • 40842942, 41040155, 41066128 - Nanite material fixes (several months ago).

Unfortunately, none of them changed the crash rate.

We can probably close this thread, as my colleague has opened another, more specific one:

[Content removed]

so that the GPU crashes aren’t discussed in two places.

Thank you anyway for your help.

Regards,

Tomas Ruzicka.

[Attachment Removed]

Hello Alex, thank you for your prompt response,

This is the full distribution of the crashes, but beware - it is quite misleading, because it is not normalized against the HW we have at the company. I would say it does not tell us much, because it is really more a graph of the HW we have in the company. The same goes for drivers - it is more a graph of the drivers people have installed.

[Image Removed]

I checked SecondsSinceStart: there is a single occurrence at 191 seconds, and then occurrences only start again above 470 seconds. The majority of people crash within 500-700 seconds into the game. There is then a smaller group of users concentrated around 1100 seconds into the game. We run the repro project here with a maximum time limit of one hour; after this timeout we restart it, as we consider the session a successful result (a session which is not suffering from the issue).

[Attachment Removed]

Hello Alex,

I have been further iterating on the repro project locally. The amount of custom HLSL we have there was suspicious, so I removed all of the custom HLSL (the Shaders folder) from the project, and the GPU crashing continues. I also removed all custom HLSL nodes I was able to find in the materials.

I also did not measure any clear impact of slomo 2 (running the flythrough twice as fast).

[Attachment Removed]

Hello Alex,

after running some tests over the past two days, my working theory is a race condition involving occlusion queries and HZB. Our content and materials seem to trigger it more easily. This would explain why various CVars influence the crash rate but rarely eliminate the issue entirely. It also aligns with why adjusting materials or the overall scene load changes the frequency. Running with -d3ddebug significantly reduces the number of crashes as well. Most of the breadcrumbs in the repro project’s GPU crashes are also in BuildHZB.

In the repro project’s control test (no changes to the repro project), we recorded a 36% GPU crash rate in the last run.

I then removed all rendering CVars from DefaultEngine.ini except:

[/Script/Engine.RendererSettings]
# Renderer
r.RHICmd.ParallelTranslate.CombineSingleAndParallel=1
r.NumBufferedOcclusionQueries=2
 
# Lighting (required to avoid missing-permutation startup crash)
r.AllowStaticLighting=False
r.SupportStationarySkylight=False

This resulted in the same 36% GPU crash rate, which means the removed CVars did not have a significant impact on it.

Removing r.NumBufferedOcclusionQueries=2 yields a 0% crash rate.

However, running the reverse test - removing only that CVar while keeping all of our other CVars - produces an 8% crash rate, still including BuildHZB breadcrumbs.

It looks like r.NumBufferedOcclusionQueries=2 greatly increases the likelihood of hitting some race condition, but the underlying issue still exists without it; it’s just much harder to reproduce.

I also tested the Matrix demo with our CVars for several hours without triggering GPU crashes, suggesting the issue also depends on the game’s specific performance profile.

Note that we introduced r.NumBufferedOcclusionQueries=2 in UE 5.5 to regain lost performance on some hardware after switching to parallel rendering ( [Content removed] ).

One more note, unrelated to the GPU crashes: our repro project is hitting hundreds of assertions nightly related to CheckCompilingPSOs(). It might also serve as a good repro case for an issue that is the same as or similar to UE-288175. I saw that our suggested fix from [Content removed] was submitted to ue5-main (5.8), but I haven’t confirmed whether it also resolves the assertion triggered by this repro project.

[Attachment Removed]

Hello Alex,

I’ve further reduced the repro project to make it easier to pinpoint the issue.

I’ve uploaded ReproProject2.zip and ReproPackaged2.zip into the same google drive folder as before. They are based on the same 5.7 repro project, but all game C++, HLSL code, and unnecessary .ini files have been removed. I also removed all rendering CVars from DefaultEngine.ini except for r.NumBufferedOcclusionQueries=2. The project still reproduces the GPU crashes.

Did you have any luck reproducing the crashes on your end?

Thank you,

Ondrej

[Attachment Removed]

That’s good to hear, Alex! Yes, the breadcrumbs look the same. Your machine seems to be more on the “unlucky” side if it took 15 hours, but that’s still possible since we also have both rare and frequent crashers. Our top crashers crash like every 10 minutes. In our production setup, we use the same CVars as in Repro 1, which disable async compute, so we don’t see the async compute call stack. Repro 2 aims to stay as close to the UE defaults as possible, so it keeps async compute at its default value. That is not a problem as the issue reproduces on both.

[Attachment Removed]

In case it’s useful, I attached the specs of one of our most crash-prone PCs - the one we “nationalized” for GPU crash investigation from the developer who was topping the crash hunt charts :slight_smile:

[Attachment Removed]

UDN says the attachment is no longer available; here is a second try:

[Attachment Removed]

Hello Alex,

additional info from our test today: packaging the project with bGenerateNaniteFallbackMeshes enabled and running it with r.Nanite=0 gives us the following results.

r.Nanite=1 - 136 GPU crashes

r.Nanite=0 - 1 GPU crash (a page fault in ShadowDepths, so probably unrelated).

Note that our fallbacks are configured to a low resolution, as we use them only for collision in the actual game. It is not clear whether disabling Nanite actually gets rid of the issue or just hides the race condition, but it certainly got rid of all GPU crashes.

[Attachment Removed]

Hello Alex,

while the application does not crash with -d3ddebug, I was able to catch an interesting error on the repro project once, without any GPU crash following it. Note that async compute is disabled in our main project, so I’m not sure whether this error is relevant there.

01014702 342.34570313 [24988] D3D12 ERROR: ID3D12CommandQueue1::ExecuteCommandLists: Simultaneous-access or Buffer Resource (0x00000261747927D0:'Nanite.StreamingManager.ClusterPageData') is still referenced by tilemapping GPU operations in-flight on another Command Queue (0x0000026153CDBB00:'Compute Queue (GPU 0)'). It is not safe to start write|transition_barrier GPU operations now on this Command Queue (0x0000026153B46680:'3D Queue (GPU 0)'). This can result in race conditions and application instability. [ EXECUTION ERROR #1047: OBJECT_ACCESSED_WHILE_STILL_IN_USE]

Some other issues were reported too, but these don’t sound as critical:

01093343    365.85870361    [24988] D3D12 ERROR: ID3D12CommandQueue1::ExecuteCommandLists: Placed resources, reserved resources, or committed resources with D3D12_HEAP_FLAG_CREATE_NOT_ZEROED flag with either render target or depth stencil flags must be initialized with a Discard/Clear/Copy operations before other operations are supported. Resource (0x00000262A8BA1D80:'<VARIOUS RESOURCES>'), Subresource (0) is not initialized but is used in Function (ID3D12CommandList::DrawIndexedInstanced) on Command List (0x000002612DBB5FB0:'FD3D12CommandList (GPU 0)'). [ EXECUTION ERROR #1422: RENDER_TARGET_OR_DEPTH_STENCIL_RESOUCE_NOT_INITIALIZED]    
 
01014696    342.34548950    [24988] D3D12 WARNING: ID3D12Fence1::SetEventOnCompletion: Fence values can never be less than zero, so waiting for a fence value of zero will always be satisfied [ EXECUTION WARNING #1424: FENCE_ZERO_WAIT]    
 
01014697    342.34555054    [24988] D3D12 WARNING: ID3D12CommandList::ResourceBarrier: Called on the same subresource(s) of Resource(0x000002610E43C0C0:'<VARIOUS RESOURCES>') in separate Barrier Descs which is inefficient and likely unintentional. Desc[3] and Desc[5] on (subresource : 4294967295). [ RESOURCE_MANIPULATION WARNING #1008: RESOURCE_BARRIER_DUPLICATE_SUBRESOURCE_TRANSITIONS]

[Attachment Removed]

Hello Alex,

I have also been digging more into the Aftermath crash on startup of repro project 1 when -nvaftermathall is used. I believe it is also important, as it can hinder the ability to debug GPU crashes. Here is the full callstack in case you would like to create a proper bug report out of it.

Assertion failed: IsInRenderingThread() [File:D:\build\++UE5\Sync\Engine\Source\Runtime\RHI\Private\PipelineStateCache.cpp] [Line: 1871] 
 
GZWClient_Win64_DebugGame!FDebug::CheckVerifyFailedImpl2()
GZWClient_Win64_DebugGame!TSharedPipelineStateCache<FRHIComputeShader * __ptr64,FComputePipelineState * __ptr64>::ConsolidateThreadedCaches()
GZWClient_Win64_DebugGame!TSharedPipelineStateCache<FRHIComputeShader * __ptr64,FComputePipelineState * __ptr64>::FlushResources()
GZWClient_Win64_DebugGame!TSharedPipelineStateCache<FRHIComputeShader * __ptr64,FComputePipelineState * __ptr64>::GetResources()
GZWClient_Win64_DebugGame!PipelineStateCache::GetPipelineStates()
GZWClient_Win64_DebugGame!UE::RHICore::Nvidia::Aftermath::D3D12::CreateShaderAssociations()
GZWClient_Win64_DebugGame!UE::RHICore::Nvidia::Aftermath::AreMarkersEnabled()
GFSDK_Aftermath_Lib_x64
nvwgf2umx
nvwgf2umx
nvwgf2umx
nvwgf2umx
kernel32
ntdll

[Attachment Removed]

The version that produced the build is unmodified 5.7.0 from the Epic Launcher. Therefore I think the error still exists, or it is a different one.

Perhaps ReproProject2.zip can help you, as it has all CVars removed, meaning almost all of them are kept at UE defaults. The aim of this version 2 repro project is to further minimize possible causes on the project side while still providing a GPU-crashing repro.

[Attachment Removed]

I would add that understanding this was not straightforward, because our PSO count stabilizes after a few minutes of gameplay, so the issue cannot be identified as a continuously rising PSO count. Drivers can continue creating additional shader variants internally long after that point. There is no API that provides visibility into the shader heap, so this behavior cannot be observed using in-engine stats.

It is also important to note that drivers may choose to generate significantly more shader variants for optimization when the game is GPU-bound. Because this happens only in GPU-bound scenarios, changing CVars or ShowFlags can strongly affect the crash rate. For example, r.NumBufferedOcclusionQueries=1 shifts the bottleneck to the CPU for some users, preventing them from hitting the shader heap limit. Similarly, CPU-side optimizations, such as UE 5.5 parallel rendering, can introduce more GPU-bound scenarios and therefore more shader heap limit hits. Likewise, older or slower GPUs are more likely to become GPU-bound, which increases the likelihood of crashes.

[Attachment Removed]

Hello,

Given the likelihood this will be solved by a future driver update, is there anything else to discuss on this ticket, or should we close it?

[Attachment Removed]