GPU crash in RayTracingScene

It’s worth mentioning that we never get Aftermath dumps any more, regardless of whether the device hangs or is removed. Our soak tests cannot reproduce the issue alone; the PC needs to be in active use, which led me to believe the GPU scheduler may play a role.

We are not certain of this, but I speculated that the device is removed via a TDR because it has hung in the RT scene build. However, increasing the TDR delay only increases the length of the hang, so it’s not simply a long build causing a spurious TDR.
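For context, the TDR delay adjusted above is controlled by Microsoft-documented registry values under GraphicsDrivers. A sketch of the keys involved (the 60-second values here are illustrative; both values are in seconds, apply machine-wide, and require a reboot to take effect):

```reg
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
; Seconds a GPU job may run before the OS triggers a TDR (default 2)
"TdrDelay"=dword:0000003c
; Seconds the OS waits on the driver thread before declaring it hung (default 5)
"TdrDdiDelay"=dword:0000003c
```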

I also integrated the patch you shared above and we’ve reproduced the issue since integrating it.

[Attachment Removed]

It’s worth mentioning we never get aftermath dumps any more regardless of whether the device hangs or it’s removed.

Are you not getting any Aftermath dumps for any GPU hangs or just this RayTracing one?

Normally we recommend users verify their Aftermath setup with these steps:

  1. Enable r.DumpShaderDebugInfo=1 and r.GPUCrashDebugging.Aftermath.DumpShaderDebugInfo=1 in your local ConsoleVariables.ini
  2. Start the Editor or game with **-nvaftermathall -gpucrashdebugging**. After shaders finish compiling, check that <project>/Saved/ShaderDebugInfo/PCD3D_SM6/Global exists and contains shader debug symbols for each of the shader permutations.
    1. If you don’t see this folder, or the permutation folders don’t contain DXIL files, make sure ShaderVersion.ush is writeable, run r.InvalidateCachedShaders 1 (to update the version hash), and then run the console command RecompileShaders Global. When it’s done, you should see the shader symbols in <project>/Saved/ShaderDebugInfo/PCD3D_SM6.
    2. Don’t include -attachPIX when running the Editor/game, because the debug layer can cause Aftermath to fail to initialize (LogNvidiaAftermath: Aftermath enabled but failed to initialize (bad0000a)).
  3. Do a test crash dump by running the console command GPUDebugCrash hang. It should output a .nvdbg and a .nv-gpudmp file in <project>/Saved/Crashes.
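Putting the steps above together, a minimal sketch of the local ConsoleVariables.ini used for verification (the timeout values shown are illustrative):

```ini
; Local ConsoleVariables.ini sketch for verifying the Aftermath setup.
; Launch with: -nvaftermathall -gpucrashdebugging
r.DumpShaderDebugInfo=1
r.GPUCrashDebugging.Aftermath.DumpShaderDebugInfo=1
; Generous timeouts so the dump has time to be written and processed
r.GPUCrashDebugging.Aftermath.DumpStartWaitTime=30
r.GPUCrashDebugging.Aftermath.DumpProcessWaitTime=100
```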

Open the .nv-gpudmp file in Nsight (version 2025.1.0 or higher). After you update the Search Paths in Nsight, it should find the DXIL and be able to show you the names of the shaders and the lines that crashed in the source/DXIL.

If Aftermath only fails to output a .nv-gpudmp file for the RayTracing crash, it could be related to an issue we saw in the past where a driver bug prevented shaders from reporting their hashes, so we couldn’t associate them. Even in that case, you would still be able to put a breakpoint in our callbacks (Callback_GpuCrashDump and Callback_ShaderDebugInfoLookup) and see it get hit when the crash occurred. I don’t think this is the issue, because you’re using newer driver versions, but you should be able to verify the callbacks are working by using the test GPUDebugCrash hang console command with sufficiently high Aftermath timeout settings:

r.GPUCrashDebugging.Aftermath.DumpStartWaitTime=30
r.GPUCrashDebugging.Aftermath.DumpProcessWaitTime=100

Also, please let me know which driver versions you have been testing recently. Our latest recommended driver version is now 581.42.

[Attachment Removed]

We currently have everyone on the 581.08 game ready driver, I will have the drivers updated to 581.42.

We did already set the cvars -

    r.GPUCrashDebugging.Aftermath.DumpStartWaitTime=30
    r.GPUCrashDebugging.Aftermath.DumpProcessWaitTime=100

I believe we have tried the other stuff in the past but I will re-confirm and get back to you on that one.

[Attachment Removed]

Just to let you know, we’ve moved the team up to driver 581.42 - we’ll keep you posted on whether we keep seeing the crash. We’re also going to take a look at disabling DLSS and Reflex, to see if that will help us get Aftermath crashes.

[Attachment Removed]

Thanks for the update - by any chance did you get the .dxil output to work?

One thing to check: in your <project name>\Saved\ShaderDebugInfo\PCD3D_SM6\Global\GPUDebugCrashUtilsCS\0 folder you should see a CompileDXC.bat file, and inside that .bat file it should include the -Fo <shadername>.dxil option, which is how you tell the compiler to output a .dxil file.
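For reference, the dxc invocation inside CompileDXC.bat looks something like the following sketch (the target profile, entry point, and file names here are illustrative, not copied from a generated file; the important part is the -Fo flag):

```bat
rem Illustrative sketch of a CompileDXC.bat invocation.
rem -Fo is what writes the compiled DXIL object to disk;
rem -Zi/-Qembed_debug keep debug info for symbol resolution.
dxc.exe -T cs_6_6 -E MainCS ^
    -Zi -Qembed_debug ^
    -Fo GPUDebugCrashUtilsCS.dxil ^
    GPUDebugCrashUtilsCS.hlsl
```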

When you open the Aftermath dump in Nsight, you should be able to point the symbol folder to <project name>\Saved\ShaderDebugInfo\PCD3D_SM6\Global\GPUDebugCrashUtilsCS and it should load very quickly. Alternatively, you can copy the files out of the <project name>\Saved\ShaderDebugInfo\PCD3D_SM6\Global\GPUDebugCrashUtilsCS\0 folder into the folder containing your .nvdbg and .nv-gpudmp files, and Nsight should be able to find them without you setting the symbol path at all. This is usually what I do when I send a crash dump with symbols in a .zip file for someone else to open and inspect.

[Attachment Removed]

Hi,

I just got an automated message saying that this ticket had been closed. Given that I replied 3 days ago, I don’t know why this has happened, as we certainly haven’t resolved this yet! [mention removed] could you help us out and make sure it’s not been closed? Sorry about this.

[Attachment Removed]

No worries, your reply automatically opened the ticket.

[Attachment Removed]

Hello,

We’ve encountered an issue that can occur in random passes that might be driver [Content removed]. One similarity is that in your case you are also not getting an Aftermath crash dump when the GPU hang happens. There isn’t a workaround for the issue unfortunately, but there will hopefully be a driver update that addresses it.

Another possibly related GPU hang is with persistent SBTs and streaming. If you have a way to repro the crash you can try turning off persistent SBTs with r.RayTracing.PersistentSBT=0 to see if that has an effect on the crash rate.

[Attachment Removed]

We actually do see crashes in BuildHZB, and we’ve had a couple in other random passes. Until recently I was treating these as separate but much less common crashes; however, with RT off the crash rate is the same, just reliably in BuildHZB instead of RayTracingScene. We get no Aftermath dumps with those either.

We definitely did experience the random crashing mentioned in the 5.5 tech note you linked in that thread, and we did integrate the patch. If I remember correctly, the patch reduced but didn’t eliminate the random crashing.

This is all speculation and completely anecdotal, but I had also started to suspect the shader heap, because the cadence of the crash reports we’re seeing seems to spike and settle as development progresses. Susceptible machines seem to go through phases of intense crashing and then settle into a more stable, but not entirely crash-free, state. We recently set up a Sentry server, which has helped illustrate that pattern better. It’s possible, though, that this is just a reflection of the intensity of testing being done on the affected machines.

[Attachment Removed]

Hi,

I recommend you reach out to your friendly NVidia rep about the possibility of increasing the heap the driver uses for your game, which has been a fairly reliable way of diagnosing this issue. There is a trade-off in that you get less memory for your game, but that hasn’t been a problem in the few cases where I’ve seen this issue occur.

[Attachment Removed]

Thanks Alex, we’ve passed this onto our Nvidia contact and we’ll see what they say.

[Attachment Removed]

Hello!

I just wanted to post a reminder that Epic will be on holiday break starting next week (Dec 22, 2025) and ending Jan 5, 2026 and there will be no responses from Epic on public or private tickets, though you may receive replies from the community on public tickets.

We wish you happy holidays!

[Attachment Removed]

Happy new year!

We’re still waiting to hear back from our Nvidia contact (obviously they’ve been off for Christmas too). Hopefully they’ll get back to us soon.

[Attachment Removed]

Hey Alex, I wanted to run a thought past you. We currently have a PSO cache that likely contains a large number of stale PSOs. I speculated that by preloading a PSO bundle full of PSOs we don’t actually use anymore, we’re putting unnecessary pressure on the shader heap and exacerbating the problem. I thought it might be worth clearing and regenerating our PSO cache in the hope of reducing the crash rate.

Does that line of thinking make sense, or is there maybe something I’m missing about the PSO caching system that would mitigate a risk like that?

[Attachment Removed]

Hello,

Apologies for the delay.

I thought it might be worth clearing and regenerating our pso cache in the hope of reducing the crash rate

Unfortunately, I tried this on the repro case - specifically, I cleared the PSO file cache between runs, and the GPU crash would still occur within hours. If there were corrupted PSOs in the on-disk PSO cache then it could help, but I’m not aware of a way PSOs on disk could be corrupted by accident. I’ve also reached out to a colleague who is more familiar with the PSO system for additional thoughts in this area.

[Attachment Removed]

Ahh sorry, I should’ve been more clear - I’m referring to the bundled PSOs we collect offline, which are then loaded and compiled into the driver cache before we start the game, i.e. the process documented here: https://dev.epicgames.com/documentation/en-us/unreal-engine/manually-creating-bundled-pso-caches-in-unreal-engine.

Wiping the driver cache will cause these PSOs to regenerate, but as our cache is stale, I believe that will mean putting pressure on the shader heap with shaders we don’t actually run.

[Attachment Removed]

Hi,

My bad, I missed the word “bundle” on my initial read. Clearing the lists periodically is required, which is one of the reasons we are moving away from manually built caches that need to be recollected every time. To confirm, though - you are using a combination of a bundled PSO cache with PSO precaching? That is a valid strategy; I just wanted to confirm you are using PSO precaching as well.

I don’t think the driver will attempt to make shader optimizations for stale PSOs in a bundle that are never used, but I’m not certain - I’ve reached out to our NVidia contacts about it and I recommend you do the same.

UPDATE

I’ve confirmed that, in general, the driver maintains a shader heap containing the shaders the application uses plus optimized versions of them, and that shaders from unused PSOs do not take up space in the heap. This may be specific to a game’s driver profile, though, so you may still want to check with your Nvidia contact.

[Attachment Removed]

Hi Alex,

We’ve managed to get a preview driver from Nvidia, and from initial testing it looks like the crash bug may have been fixed (and indeed be related to what you’re mentioning above). We’re going to roll the driver out to more testers on more machines, but it’s looking good so far.

Thanks a lot for your help on this!

[Attachment Removed]

That’s excellent news, glad to hear those crashes are going away!

[Attachment Removed]

We’re currently trying to repro with the “render in main renderer” flag disabled - we’ll keep you posted. Apologies for the slow updates; it’s been a busy few weeks!

We’ve been running for a week or so now with async compute fully disabled (by forcing GSupportsEfficientAsyncCompute to false in D3D12Adapter.cpp for Nvidia cards), and we’re still seeing crashes in the RayTracingScene update, but this time with nothing at all running on the async queue:

<Breadcrumbs>{{Frame 265873},A,{{{HZB},F},{{ComputeLightGrid},F},{{LightFunctionAtlasGeneration},F},{{CustomDepth},F},{{SingleLayerWaterDepthPrepass},F},{{LumenSceneUpdate: %u card captures %.3fM texels},F},{{CompositionBeforeBasePass},F},{{RayTracingDynamicGeometry},F},{{RayTracingScene},A},{{LumenSceneLighting%s},N},{{BasePass},N,{{{NaniteBasePass},N}}},{{ShadowDepths},N,{{{BuildRenderingCommandsDeferred(Culling=%s)},N}}},{{Nanite::Readback},N},{{LightCompositionTasks_PreLighting},N},{{RenderDeferredLighting},N,{{{LumenScreenProbeGather},N},{{LumenReflections},N},{{InitTranslucencyLightingVolumeTextures},N},{{MegaLights},N},{{InjectTranslucencyLightingVolumeMegaLights},N},{{FilterTranslucentVolume %dx%dx%d Cascades:%d},N},{{SubsurfaceScattering},N}}},{{ComputeVolumetricFog},N},{{SingleLayerWater},N,{{{LumenReflections},N}}},{{ExponentialHeightFog},N},{{PostRenderOpsFX},N,{{{FXSystemPostRenderOpaque},N}}},{{LumenReflections},N},{{RenderTranslucency},N},{{Distortion},N},{{RenderVelocities},N},{{PostProcessing},N,{{{DOF(Alpha=%s)},N,{{{TAA},N}}},{{DLSS},N},{{MotionBlur},N}}}}}</Breadcrumbs>

(Apologies for the formatting; for some reason 5.6 has made the formatting of breadcrumbs in CrashContext.runtime-xml worse than it used to be.)

To me this is pointing again towards the raytracing update being the problem, not the scene capture component, but we will test with “render in main renderer” disabled just so that we can rule it out.

We’ve still not been able to get any Aftermath crash dumps out of any of these crashes, so any new developments on that front would be a great help!

[Attachment Removed]