GPU crash in Nanite::InitContext or DistanceFieldStreaming

Hello,

We are sorry for the late reply to this post [Content removed]

In the meantime, we have upgraded to UE 5.5.4 and I have started investigating the GPU crashes again. Many of the GPU crashes appear to be resolved; however, some still persist.

They are hard to reproduce locally. We run overnight crash-hunt tests that simulate a player's progression through the game to gather various crashes. So far, we are getting about two GPU crashes per nightly run across roughly 50 PCs.

The game is started with the following render arguments:

[Image Removed]

and one variant of the following extra arguments is appended:

[Image Removed]

We have disabled render features such as HW ray tracing, mesh shaders, async compute, and MegaLights. This feature cutoff resolved a lot of other GPU crashes in the past.

Also, we have merged several Aftermath improvements and fixes from 5.6, along with custom improvements of our own:

  • the ability to symbolize and pair the dumped shaders with the NVIDIA crash dumps
  • shader name reporting added to the Aftermath “Active Shaders” section
  • fixed Aftermath resource name reporting

Currently, the most frequent crash has the following Breadcrumbs:

  • RenderScene -> Scene -> Nanite::VisBuffer -> Nanite::InitContext -> RasterClear

followed by:

  • UpdateGlobalDistanceField -> Update MostlyStatic -> CullToClipmaps, GlobalDistanceField.HasPendingStreamingReadback
  • Scene -> PrePass DDM_AllOpaque (Forced by Nanite) -> DepthPassParallel -> ParallelDraw

Despite all of the debugging features above being enabled, only a few GPU reports contain all of the crash data. Many reports are missing the last DRED ops or the page-fault data, or Aftermath is not invoked at all (even after prolonging the timeout to one minute).

Do you have any clues on how to resolve this, please? Is this simply driver-dependent, with the driver sometimes deciding to gather the info and sometimes not?

When we are lucky and get a more complete GPU report, it doesn’t make much sense. One report (attached to this post) says, specifically in the Breadcrumbs, that the crash happened in the UAV clear pass within the Nanite::InitContext pass. However, the code and the shader are so simple that we don’t see any possible mistake there.

[Image Removed]

We searched GitHub for possible Nanite fixes; there are a lot of changes in NaniteCullRaster.cpp, but none of them mention this kind of issue.

But when we look at the active shaders reported by Aftermath, they point to completely different passes, which have already finished (based on the Breadcrumbs):

[Image Removed]

We don’t know which reported information is correct and which is only approximate. We never get reports with an active shader for the RasterClear pass, but we have several reports mentioning the DistanceFieldStreaming shaders.

It seems the UDN doesn’t mention any related problems, and the GitHub changes in DistanceFieldStreaming.cpp are not very frequent. We are not sure whether this change might resolve it somehow: https://github.com/EpicGames/UnrealEngine/commit/9a57280ffdb410fffbbf9cd81eef0146540f40a6.

Do you have any ideas or suggestions on what might resolve this kind of crash, please?

Thank you.

Best regards,

Tomas Ruzicka.

[Attachment Removed]

Another story is the D3D debug layer and GBV. When the crash happens, we get an unsettling message:

Error: [D3DDebug] Kernel memory failure. There might be a memory leak.

sometimes followed by:

Error: [D3DDebug] ID3D12CommandQueue::ExecuteCommandLists: Command lists must be successfully closed before execution.

but nothing else.

When we tested the debug layer and GBV with the “gpudebugcrash pagefault” command, which evicts a test texture, the debug layer only very rarely reported the expected error:

Error: [D3DDebug] ID3D12Device::Evict: CORRUPTION: An ID3D12Heap object (0x0000022B1D4067D0:‘TransientResourceAllocator Backing Heap’) is referenced by GPU operations in-flight on Command Queue (0x0000022A63250EB0:‘3D Queue (GPU 0)’). It is not safe to Evict objects that may have GPU operations pending. This can result in application instability.

But in all other cases when we used the “gpudebugcrash pagefault” command, it took several frames to crash; the Breadcrumbs and Aftermath reported completely different active passes, different active shaders, and different last tracked resources accessed at the particular virtual address. The debug layer reported a lot of errors similar to this:

Error: [D3DDebug] ID3D12CommandQueue::ExecuteCommandLists: A Heap (0x000002039025E7E0:‘TransientResourceAllocator Backing Heap’) referenced in a command list using ClearUnorderedAccessViewUint is non-resident when the command list is being executed.

but not the ID3D12Device::Evict corruption message.

Do you know whether the debug layer and GBV are expected to work this way, or whether they need some other CVar tweaks, please?

Thank you.

-

How to symbolize the NV dumps in the attached zip:

  1. In Nsight, go to Tools -> Options -> Search Paths and set the “current_symbols” relative path for the first three inputs (source, binary, debug info). For the nvdbg input, set the “.” self-directory path.
  2. Extract the zip and open the specific nv-gpudmp file.
  3. Copy the corresponding three shader files (dxil, pdb, source) from the particular directory in symbols into the current_symbols directory and reload the symbols via the button on the right side of the Nsight window.

Best regards,

Tomas Ruzicka.

[Attachment Removed]

Despite all of the debugging features above being enabled, only a few GPU reports contain all of the crash data. Many reports are missing the last DRED ops or the page-fault data, or Aftermath is not invoked at all (even after prolonging the timeout to one minute).

Do you have any clues on how to resolve this, please? Is this simply driver-dependent, with the driver sometimes deciding to gather the info and sometimes not?

There is a known issue with PSO management that can cause GPU crashes that output no useful crash data, or only random data. The details and the fix can be found in this tech note:

https://dev.epicgames.com/community/learning/knowledge-base/DBOL/tech-note-fix-for-pso-management-issue-on-nvidia-hardware-in-unreal-engine-5-5

Looking at the Aftermath dump provided, which indicates a crash in DistanceFieldStreaming.dxil, I was unable to find any known issues in this area. If you can provide a way to reproduce the crash, we can take a closer look.

RenderScene -> Scene -> Nanite::VisBuffer -> Nanite::InitContext -> RasterClear

We don’t have any UE 5.5 crashes with these breadcrumbs, just 5.4, but with no known repro, fix or workaround.

UpdateGlobalDistanceField -> Update MostlyStatic -> CullToClipmaps, GlobalDistanceField.HasPendingStreamingReadback

I found some UE 5.4 crashes with these breadcrumbs, but with no known repro, fix or workaround.

Scene -> PrePass DDM_AllOpaque (Forced by Nanite) -> DepthPassParallel -> ParallelDraw

I was unable to find any crashes with similar breadcrumbs, known issues or fixes.

[Attachment Removed]

Another story is the D3D debug layer and GBV. When the crash happens, we are getting an unsettling message: Error: [D3DDebug] Kernel memory failure. There might be a memory leak.

How are you reproducing the crash when this happens? Does this error only occur when using the gpudebugcrash pagefault command? Intentionally making the GPU crash isn’t always successful and results can vary between driver and engine versions.

[Attachment Removed]

Is it possible that memory stomps may corrupt the PSO cache or some internal driver state?

I’m not sure how you could corrupt the PSO cache; we just tell the driver to compile a PSO and it reads it from its cache. If the driver PSO cache somehow got corrupted, then maybe it’s possible, but there is likely some validation, and you’d need to check with the IHVs to be sure. Without a repro it’s just speculation, though. In the past we had a UE-side PSO cache, but that has been disabled for years - I’m assuming you’re not using that feature and are referring to the default driver PSO cache.

[Attachment Removed]

Hello,

Just wanted to check whether there has been any progress toward getting a consistent repro for any of the GPU crashes mentioned in this ticket?

I re-checked for any status updates on known issues in these areas and haven’t found anything new.

[Attachment Removed]

I think I see a SceneCapture crash in the screenshot above. If you’re rendering Nanite into it, you may want to consider applying the CLs that fix the issue mentioned here:

[Content removed]

Do you enable r.RDG.events=3 and r.ShowMaterialDrawEvents=1 in your distributed builds that generated the crash reports in Backtrace? I see you’ve increased the timeout for generating reports, so that shouldn’t be an issue, but we only activate those when investigating a GPU crash locally and don’t leave them on in a distributed build, so it’s possible there’s an issue we’re not aware of when running with those settings on a non-developer-spec machine.

[Attachment Removed]

Hello.

We have found and cherry-picked additional Nanite and scene capture CLs, specifically:

None of them helped to reduce the GPU crashes.

We have hack-fixed the misaligned address accesses of ByteAddressBuffers in NaniteDataDecode.ush and NaniteAttributeDecode.ush, which were reported by the GFSDK_Aftermath_FeatureFlags_EnableShaderErrorReporting flag we enabled.
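For illustration, the general idea is forcing the byte offsets to be 4-byte aligned before the loads; the helper below is only a simplified, hypothetical sketch of that idea, not our actual change:

uint LoadDwordAligned(ByteAddressBuffer Buffer, uint ByteOffset)
{
	// Hypothetical sketch only: ByteAddressBuffer loads expect 4-byte-aligned offsets,
	// so masking off the low bits avoids the misaligned-access reports from Aftermath.
	return Buffer.Load(ByteOffset & ~3u);
}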

So far, this flag has discovered only one new error - an LDS out-of-range access in GlobalDistanceField.ush, BuildGridTilesCS():

#define MAX_CULLED_TILES 511u
groupshared uint SharedCulledBoundsList[MAX_CULLED_TILES];
...
for (uint IndexInCulledBoundsList = 0; IndexInCulledBoundsList < SharedNumCulledBounds; ++IndexInCulledBoundsList)
{
	uint UpdateBoundsIndex = SharedCulledBoundsList[IndexInCulledBoundsList];

where IndexInCulledBoundsList was greater than or equal to MAX_CULLED_TILES while indexing into SharedCulledBoundsList.

Later on, further investigation showed that only one option removed the GPU crashes completely: disabling Nanite entirely via the “r.Nanite” CVar.

Keeping Nanite enabled but disabling individual features didn’t reduce the GPU crash count:

  • r.Nanite.ComputeRasterization 0
  • r.Nanite.ProgrammableRaster 0
  • r.Nanite.AllowMaskedMaterials 0
  • r.Nanite.AsyncRasterization 0
  • r.Nanite.Streaming.Async 0
  • r.Nanite.SoftwareVRS 0
  • r.Nanite.MaterialVisibility.Async 0

Globally disabling the scene captures or disabling our custom render passes didn’t reduce the GPU crash count either.

Regards,

Tomas Ruzicka.

[Attachment Removed]

It looks like we added some code to avoid exceeding MAX_CULLED_TILES in UE 5.6 in CL#40136927, but the changelist is so large that it’s buried. The HLSL now looks like:

        const uint NumCulledBounds = min(SharedNumCulledBounds, MAX_CULLED_TILES);
        for (uint IndexInCulledBoundsList = 0; IndexInCulledBoundsList < NumCulledBounds; ++IndexInCulledBoundsList)
        {
            uint UpdateBoundsIndex = SharedCulledBoundsList[IndexInCulledBoundsList];

Did fixing the out-of-bounds access there address the crashes you were seeing?

Also, we have confirmed that memory corruption can result in bad shaders being written to disk in the PSO cache. NVIDIA is aware of the issue, and we’ll likely need to await a driver update for that one.

[Attachment Removed]

It’s possible the Lumen LDS access issue you found was caused by non-power-of-two TracingOctahedronResolution values, which can cause GPU crashes. This was fixed in UE 5.7:

CL#44141989 Lumen - fix visual artifacts and out of memory reads/writes when using non power of 2 TracingOctahedronResolution. Now TracingOctahedronResolution is clamped to valid input values.

* ScreenProbeGenerateRaysCS can generate rays with MipSize=0 for non power of two values

* ScreenProbeCompositeTracesWithScatterCS then uses X / MipSize for groupshared memory indexing (QuantizedGatherTexelCoord), which results in out of bounds groupshared memory reads and writes

* This can happen either by setting it directly from a CVar or by tweaking LumenFinalGatherQuality in PPV
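For reference, a minimal sketch of the kind of clamp the CL describes - the helper name and the range used here are illustrative only, not the engine code:

        uint ClampTracingOctahedronResolution(uint Requested)
        {
            // Illustrative bounds only; the point is to keep the value non-zero and bounded.
            uint Clamped = clamp(Requested, 4u, 32u);
            // Round down to the nearest power of two so derived MipSize values never collapse to 0.
            return 1u << firstbithigh(Clamped);
        }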

Most of the GPU crashes are still reporting empty “{ FRDGBuilder::Execute }” breadcrumbs. However, the Aftermath markers seem to be working better, and half of the GPU crashes contain the executing Nanite::DrawGeometry marker.

Can you attach Aftermath dumps with symbols from these for us to look at? And do you have any additional details about the hardware and drivers from these cases?

[Attachment Removed]

While looking at our history around the fix for misaligned data with odd viewport sizes (CL#45641169), I found there was an additional CL that goes with it which doesn’t appear to be in your list.

CL#45662288 Missing ViewRect initialization in 45641146.

Is it possible that our materials may utilize some uncommon code path in Nanite, which may result in random GPU memory overwrites? For example by using too many UVs or something similar?

Nothing immediately comes to mind except possibly virtual textures or something to do with streaming. I’ll reach out to colleagues for suggestions here as well.

[Attachment Removed]

When it comes to materials, the only other thing that comes to mind is that there were a number of fixes related to material sections. Under some circumstances, the materials could be invalid in such a way that a material slot ended up being uninitialized, which could lead to hangs.

The relevant hardening changes are in Main, but they were merged over in a big CL. The individual CLs are: 40842942, 41040155, and 41066128.

[Attachment Removed]

Hello,

We have prepared a reproduction project for the GPU crashes we are experiencing. You can request Google Drive access to it from our confidential data sharing ticket: [Content removed]

ReproProject.zip (101 GB) – This is a heavily stripped-down version of our game. It contains almost no code, a set of HLSL shaders referenced from the material editor via Custom nodes, a large number of art assets, and our main map World, reduced to a small subset of content (mostly a black scene with selected static meshes). To build this minimal project, we removed nearly all code, plugins, and blueprint logic, which resulted in numerous missing-asset errors. To allow the project to package successfully, many of these errors are silenced via log verbosity settings in DefaultEngine.ini.

Our main project runs on UE 5.5, but for the reproduction we upgraded this minimal version to UE 5.7, which we believe is more suitable.

The project builds successfully using Visual Studio 2022 with the 14.44 toolchain and can be packaged to match the output in ReproPackaged.zip.

ReproPackaged.zip (12 GB) – This contains the packaged DebugGame build of the project above. To run the game in a similar way to how we reproduce the GPU crashes, use RunGZWClient.bat. It includes a critical CVar that disables GC verification; without this, the project crashes during GC after the UE 5.7 upgrade. This issue is unrelated to the GPU crashes and was not addressed in the minimal repro.

You will also notice a number of errors related to missing shaders or shader maps. These stem from the conversion from our full project to the minimal repro. Our main project does not exhibit these issues, yet it experiences what appears to be the same GPU crashes. We also suspect an engine issue in UE 5.7, as enabling -nvaftermathall now produces a startup crash in CreateShaderAssociations(). The project must be run without this flag.

In terms of reproducing the crash: typically, letting the game run for one to two hours is sufficient; no input is required, as there is an automated flythrough prepared. Crash frequency varies significantly by machine. Some systems never crash; others fail in a very consistent time window. A notable pattern is that most affected users crash roughly 10 minutes after launch, rarely earlier, sometimes later.

We can reproduce the issue on most of our internal hardware, which includes NVIDIA GPUs such as the 3070, 4070, 5070, 4080, 5080, and 2070. We have limited AMD hardware, so we cannot reliably confirm behavior there. Reported NVIDIA driver versions range from 566.14 to 581.80 (most users are on the latest). CPU configurations include a mix of AMD and Intel processors.

If you encounter difficulty reproducing the problem, running the project on multiple machines overnight is helpful. In a single night, we collect over 1,400 GPU crash events across 62 PCs.

ReproCrashes.zip (400 KB) – This archive contains a small selection of logs and crash contexts collected during our most recent overnight run of the repro project. Identifying a consistent pattern in the breadcrumbs is challenging; so far the data appears largely random. We are working on exporting our full crash dataset to provide a larger archive with significantly more samples, but hopefully you will be able to reproduce the crashes on your side in the meantime.

Thank you,

Ondrej

[Attachment Removed]

Hi, thank you for the repro case! Things are a bit slower this week due to the US Thanksgiving holiday, but we are looking into this.

Of the GPUs the issue repros on (3070, 4070, 5070, 4080, 5080, and 2070), is there one particular GPU the crash repros more often or more quickly on?

[Attachment Removed]

Thanks for providing the updated project; I’m in the process of acquiring and setting up the latest project files you uploaded. I’ll post an update here when I’m set up and able to reproduce the crash. You’re probably right that adjusting the occlusion query buffer size has some effect on a race condition of sorts occurring. We don’t have any known issues related to increasing the number of buffered occlusion queries.

[Attachment Removed]

Hi, I was able to reproduce the crash in the packaged project after 15 hours. The breadcrumbs look similar, and there isn’t an Aftermath dump, so I’d say the repro is similar enough.

[2025.12.02-17.11.10:690][938]LogRHI: Error: Active GPU breadcrumbs:
	Device 0, Pipeline Graphics: (In: 0xb0212010, Out: 0xb021200e)
		(ID: 0xb0211fde) [     Active]	Frame 3925936
		(ID: 0xb0211fe1) [     Active]		SceneRender - ViewFamilies
		(ID: 0xb021208b) [     Active]			RenderGraphExecute - /ViewFamilies
		(ID: 0xb0211fea) [     Active]				Scene
		(ID: 0xb021200e) [     Active]					HZB
		(ID: 0xb021200f) [     Active]						BuildHZB(ViewId=0)
		(ID: 0xb0212010) [     Active]							BuildHZB
		(ID: 0xb0212011) [Not Started]					ComputeLightGrid
		(ID: 0xb0212012) [Not Started]						CullLights 25x13x32 NumLights 0 NumCaptures 0
		(ID: 0xb0212013) [Not Started]					LightFunctionAtlasGeneration
		(ID: 0xb0212014) [Not Started]					CompositionBeforeBasePass
		(ID: 0xb0212015) [Not Started]						DeferredDecals BeforeBasePass
		(ID: 0xb0212092) [Not Started]							ParallelDraw (Index: 0, Num: 2)
		(ID: 0xb0212093) [Not Started]							ParallelDraw (Index: 1, Num: 2)
		(ID: 0xb0212017) [Not Started]							Decals (Relevant: 58, Total: 224)
		(ID: 0xb0212019) [Not Started]					BasePass
		(ID: 0xb0212094) [Not Started]						ParallelDraw (Index: 0, Num: 3)
		(ID: 0xb0212095) [Not Started]						ParallelDraw (Index: 1, Num: 3)
		(ID: 0xb0212096) [Not Started]						ParallelDraw (Index: 2, Num: 3)
		(ID: 0xb021201b) [Not Started]						NaniteBasePass
		(ID: 0xb021201c) [Not Started]							Nanite::BasePass
		(ID: 0xb021201d) [Not Started]								Nanite::ShadeBinning
	Device 0, Pipeline AsyncCompute: (In: 0xb021200a, Out: 0xb0212009)
		(ID: 0xb0211fde) [     Active]	Frame 3925936
		(ID: 0xb0211fe1) [     Active]		SceneRender - ViewFamilies
		(ID: 0xb021208b) [     Active]			RenderGraphExecute - /ViewFamilies
		(ID: 0xb0211fea) [     Active]				Scene
		(ID: 0xb0211fee) [   Finished]					FXSystemPreRender
		(ID: 0xb0212009) [     Active]					PrepareImageBasedVRS
		(ID: 0xb021200a) [     Active]						ContrastAdaptiveShading

[Attachment Removed]

Thanks for the details on the Aftermath crash; that was supposed to have been fixed in UE 5.6:

CL#40357307 Fixed pipeline state cache cleanup on non-rendering threads/tasks - Fixes rare assert on GPU crash handling

Can you verify you have that change in the version that produces this crash?

I’ve managed to reproduce the crash again, within 7 hours this time, and have started diffing the rendering CVars the project uses against the defaults for 5.6/5.7. I started by looking at the PSO-related CVars, because the symptoms of a long-running game that crashes with no GPU dump are very similar to the PSO management crashes, and the on-disk cache size was 4 GB for the game client. But nothing conclusive to report yet.

[Attachment Removed]

Hi,

Putting an update in here for anyone else following along. After further investigation, this GPU hang does appear to be similar to what was happening in the case where PSOs were not released in UE 5.5, the driver was running out of memory, and the GPU was crashing. The largest similarity here, as in that case, was the absence of an Aftermath dump file; the other symptoms included random breadcrumbs in passes that are simple and shouldn’t crash, the GPU taking a long time (30+ minutes to hours) before crashing, and the local PSO cache size becoming large.

Currently, there isn’t a known workaround for this issue, but it appears to be rare, and we’re hopeful a future driver update will address the underlying issue.

[Attachment Removed]

Hello,

Thanks for the reply.

We have had the mentioned PSO fix merged for several months.

Recently, we’ve also merged this change: https://github.com/EpicGames/UnrealEngine/commit/7c1b04f2fe7851f1fa8bec78a5d087c5abf56f9d “Include uniform buffer name in the FRHIUniformBufferLayoutInitializer hash”.

Over time, our nightly GPU crash hunt sessions have discovered several memory stomps (with ASan). Some of them were in our code (already fixed) and some are in the engine render code. We still haven’t covered all of them, because the repro steps and root causes are unclear. However, the server ASan builds are super stale compared to the client builds.

Is it possible that memory stomps may corrupt the PSO cache or some internal driver state? That is, could a corrupted PSO be cached and reused over and over, resulting in GPU crashes until the next PSO cache clear?

We see that the previously very frequent Distance Field GPU crashes are no longer happening for now. However, the game has started to crash elsewhere. Also, people (company PCs) who crashed very often two weeks ago are stable now, and vice versa.

Regards,

Tomas Ruzicka.

[Attachment Removed]

Both D3DDebug messages, “Kernel memory failure” and “Command lists must be successfully closed before execution”, are from our nightly GPU crash hunt sessions, where we don’t have any repro steps.

The D3DDebug error messages from the “gpudebugcrash pagefault” command are different.

We have enabled -d3ddebug and -d3d12gpuvalidation for half of the nightly GPU crash hunt sessions, in the hope that this will help us identify the sources of the GPU crashes and find at least some repro steps.

So far, we have identified one validation error on the “ConservativeScaledShadingRateTexture” texture, which was missing the D3D12_RESOURCE_STATE_SHADING_RATE_SOURCE flag (ETextureCreateFlags::Foveation), and we fixed it locally right away. However, nothing else showed up in the log until the GPU crash happened, accompanied by one of the two validation errors mentioned above.

Regards,

Tomas Ruzicka.

[Attachment Removed]