We are currently using UE5.6.1, and we are seeing infrequent, hard-to-reproduce GPU crashes inside RayTracingScene. The only connecting factor is that they all occur inside RayTracingScene on the graphics pipeline; otherwise they happen at various points throughout the game. Sometimes we see one after a 40-minute play session, sometimes shortly after loading up the game.
We are even seeing these crashes while the game is paused - nothing about the scene is changing, nothing in the RayTracingScene should have changed from frame to frame, and yet after several minutes of not crashing, we see a crash when nothing has changed.
We are also not currently able to get Aftermath GPU dumps - the log always reports “Timed out while waiting for Aftermath to start the GPU crash dump”, and fails to output a dump.
So far we are only seeing this crash on one particular tester’s machine, which has a GeForce RTX 3060 on driver version 581.08. (He was also seeing the crashes on older drivers; this is just his current driver version.) However, we are a small team with a limited pool of hardware to test on, so there is no guarantee that this is the only configuration which reproduces the crash.
We’re slightly stumped on how to track this crash down. So far we have tried:
made sure that we are keeping RT scene complexity under control - we’re now consistently staying below the 400MB RT scene budget
simplified the amount of dynamic RT geometry (reduced LODs on skeletal meshes, replaced spline meshes with baked down static meshes)
tried running with GPU validation enabled, to see if anything is flagged up as problematic - this has not shown anything specifically wrong with our RT scene
tried increasing the r.GPUCrashDebugging.Aftermath.DumpStartWaitTime and r.GPUCrashDebugging.Aftermath.DumpProcessWaitTime values to 10 and 30 seconds, in the hope we can get an Aftermath dump to debug
tried turning on r.D3D12.RayTracing.GPUValidation to find any issues, but this immediately results in a GPU crash every time we turn it on
tried checking NSight captures to see if it throws up any errors about our RT scene - so far we’ve not spotted anything problematic
Any assistance in tracking this down would be greatly appreciated.
Can you provide more information on what that is? Does it appear as active in the breadcrumbs of every GPU crash dump?
> Tried turning on r.D3D12.RayTracing.GPUValidation to find any issues, but this immediately results in a GPU crash every time we turn it on.
Can you provide more information about this including the breadcrumbs and log? If it produced an aftermath crash dump that could also be useful.
In 5.5 we were getting a crash in EndBreadcrumbGPU, but the fix should be in 5.6:
CL 38191579 D3D12 RT Validation: Move breadcrumbs out of recursive cmd list.
Actually, it sounds like it’s a known issue that r.D3D12.RayTracing.GPUValidation is broken in 5.6 since we decoupled FRHIRayTracingScene and FRHIRayTracingShaderBindingTable. That validation needs both SBT and TLAS to work, but SBT is no longer accessible in RHIBuildAccelerationStructures.
We need to move the validation logic to a new RHI function that is called from high level and takes both FRHIRayTracingScene and FRHIRayTracingShaderBindingTable as parameters so it can properly validate InstanceContributionToHitGroupIndex etc. Or potentially even do it in RHI-independent code so it works on all platforms. I’ll check on the status of this work to see if we have any changelists available.
I just wanted to update with more of our findings since investigating the issue further.
Our breadcrumbs always have RayTracingScene on the graphics queue, but the async queue is always different. I did suspect it was possibly a sync issue between the graphics and async queues, so we tested with r.Lumen.AsyncCompute=0 to prevent syncing between the queues, but unfortunately the issue still manifested.
We do use scene captures at some points in the game, and we suspected the resulting 2 acceleration structure builds might be the cause, but we’ve confirmed there are no scene captures running when almost all of the crashes happen.
Most of our crashes are when the game is paused, so I’ve been looking at what is being built behind the pause screen. Generally we’re only seeing BLASes for Static Meshes with WPO being built behind the pause screen, but in the area where we got our most recent crash there are none being built; the only acceleration structure being built in that area is the main scene TLAS. Both the decal and far field layers are empty, so there is only 1 layer with actual geometry in its TLAS.
> Can you provide more information on what that is? Does it appear as active in the breadcrumbs of every GPU crash dump?
Despite what it looks like, this isn’t actually a custom render pass - this is a Scene Capture 2D component with the “Render in Main Renderer” flag set to true, rendering a SceneDepth capture. It’s set up in FScene::UpdateSceneCaptureContents, if you search for SceneCapturePass_SceneDepth you should find the spot where it’s added to the queue.
As Andrew said, it is not always present in our breadcrumbs - the async queue changes between crashes, but we always see a crash in RayTracingScene specifically.
If there’s anything more we can provide to help with debugging, please let us know! And if there are any patches to fix r.D3D12.RayTracing.GPUValidation, we’d be happy to apply them and see if we get any more useful information.
The crash issue with r.D3D12.RayTracing.GPUValidation hasn’t been addressed yet, but one alternative is to turn on Nvidia RT Validation as specified in https://developer.nvidia.com/blog/ray-tracing-validation-at-the-driver-level/. It should be a minimal change to implement and is something we also want to add in a future release.
> I did suspect it was possibly a sync issue between the graphics and async queues so we tested with r.Lumen.AsyncCompute=0 to prevent syncing between the queues but unfortunately the issue still manifested.
To prevent async entirely you’ll need to use r.RDG.AsyncCompute=0
I’ve attached a patch that shows how to enable Nvidia RT Validation in UE 5.6. It hasn’t been tested in some time, but hopefully it will reveal useful information for you. Also, you may need to fix the paths in the .patch or just copy the changes out - it’s just two files and the changes are simple code additions.
I have applied the patch and am going to attempt to reproduce with the validation enabled.
In the meantime, I wanted to update on the progress of the investigation. I wanted to rule out the possibility of an overflow in the buffer during compaction, so I had QA test the build with compaction disabled, and they confirmed the build still crashed.
We haven’t yet tried with async compute completely disabled, as that is quite a drastic step and will probably reduce our GPU performance.
I tried enabling full DRED on the crashing machines, but unfortunately it didn’t provide any more information.
I collated 10 dumps to better understand any patterns in the crash. I’ve shared the breadcrumbs here. In all but one, the async pipe’s last breadcrumb was in SceneCapturePass_SceneDepth or the adjacent passes FXSystemPreRender and RayTracingDynamicGeometryUpdate. RayTracingDynamicGeometryUpdate is just making the resource available on the async queue, and I believe I’ve ruled that out with r.Lumen.AsyncCompute=0.
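For anyone collating dumps in the same way, here is a minimal sketch of tallying the last active breadcrumb per queue. It assumes the breadcrumbs have been transcribed by hand into one dictionary per dump; that format, the function name and the sample rows are hypothetical and are not our actual dump data.

```python
from collections import Counter


def tally_last_breadcrumbs(dumps):
    """Count, per queue, how often each pass was the last active breadcrumb.

    `dumps` is a list of dicts mapping a queue name to the name of the last
    active breadcrumb on that queue (a hypothetical, hand-transcribed format).
    """
    per_queue = {}
    for dump in dumps:
        for queue, last_pass in dump.items():
            per_queue.setdefault(queue, Counter())[last_pass] += 1
    return per_queue


if __name__ == "__main__":
    # Placeholder rows for illustration only; these are NOT the real dumps.
    sample = [
        {"graphics": "RayTracingScene", "async": "SceneCapturePass_SceneDepth"},
        {"graphics": "RayTracingScene", "async": "FXSystemPreRender"},
        {"graphics": "RayTracingScene", "async": "SceneCapturePass_SceneDepth"},
    ]
    for queue, counts in tally_last_breadcrumbs(sample).items():
        # Most frequent last breadcrumb first, per queue.
        print(queue, counts.most_common())
```

Sorting by frequency makes it easy to see whether one async pass dominates or whether the async queue is effectively random while the graphics queue stays fixed.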
We’ve seen this crash on 3 machines: an RTX 3060, an RTX 3070 Ti and a 4070 Ti Super, all of which are running driver 581.08.
Thanks for providing the extra breadcrumbs. How are you verifying there are no scene captures running when almost all of the crashes happen? Seeing SceneCapturePass_SceneDepth as the most common active async compute queue item is odd if there are no scene captures.
Apologies, I should have been more specific. I meant to say there are no scene capture ray tracing scenes, i.e. only a single set of 3 TLASes is being built. We have a scene capture that’s set up to run in the main pass and is basically running all the time, but it doesn’t use ray tracing, so there’s no TLAS build for that one.
We have other scene captures at other points in the game where we do use ray tracing, and those have their own corresponding TLAS builds. It’s those scene captures I was referring to; the main pass scene capture that uses the custom depth stencil does run at all times.
Have there been any fixes since 5.6.1 to Aftermath crash dump writing? We’re still consistently seeing that Aftermath writing has failed due to a timeout, even though we’ve significantly increased the timer values:
We’ve been running with r.rdg.asynccompute disabled, and so far we’ve not managed to reproduce the crash, although it’s still too early to tell as it often takes several hours to manifest.
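As a rough guide to how long "too early to tell" lasts, here is a minimal sketch that assumes crashes arrive independently at a constant average rate (a Poisson process). That is an assumption, not something we have measured. Under it, the chance of seeing no crash in t hours is exp(-t / mean), so a crash-free run only becomes strong evidence after roughly three times the typical time to crash. The function name and the 3-hour figure below are hypothetical.

```python
import math


def soak_hours_needed(mean_hours_to_crash, confidence=0.95):
    """Crash-free hours needed before 'no crash' is meaningful evidence.

    Assumes a memoryless (exponential) time to crash, so that
    P(no crash in t hours) = exp(-t / mean_hours_to_crash).
    Solving exp(-t / mean) = 1 - confidence for t gives the result.
    """
    if mean_hours_to_crash <= 0:
        raise ValueError("mean_hours_to_crash must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return -mean_hours_to_crash * math.log(1.0 - confidence)


if __name__ == "__main__":
    # Hypothetical: the crash typically takes about 3 hours to show up.
    print(round(soak_hours_needed(3.0), 2))
```

With a hypothetical 3-hour mean time to crash, this gives about 9 hours of crash-free running for 95% confidence, so a single clean afternoon session would not be enough to call the change a fix.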
I’m assuming you don’t have either of these CVars enabled on PC, which could lead to that in 5.6:
D3D12.PSOPrecache.KeepLowLevel
D3D12.PSO.KeepUsedPSOsInLowLevelCache
I’ve also had issues where Aftermath failed to initialize when a D3D debug layer was in use - for example when running with -attachPix, but it doesn’t sound like you’re doing that.
You may want to look at the history of the following files to see if there’s anything related to Aftermath to cherry pick.
Another thought is to try reproducing the crash with the NVidia Aftermath Crash Monitor active and see if it is able to capture something, but I haven’t had good luck yet with that myself. Sometimes when Aftermath fails to write it’s because the issue is in the driver or with an internal shader or buffer. Have you reached out to NVidia about it? Earlier this year we were running into a driver bug where it wouldn’t report raytracing shader hashes.
Do you have bindless enabled? We have some reports of the ray tracing scene build crashing when bindless is enabled.
I wanted to update: we’ve managed to reproduce the crash again, this time with r.rdg.asynccompute disabled. This crash has the same breadcrumb signature as a lot of the others: the graphics queue is building the ray tracing scene and the async compute queue is running the custom depth pass. I believe this is because the above flag only affects graphics work scheduled on the render graph, and the scene capture doesn’t use the render graph, so disabling async compute on the render graph hasn’t really changed the relevant passes, although it has made the async compute queue less noisy.
I intend to focus more on the custom depth pass in future diagnostics as none of the other changes I’ve made so far have affected the character of the crash at all.
The SceneCapture custom depth on async compute sounds like a good place to focus. I did some local testing but wasn’t able to replicate the setup you’re describing where the scene capture depth pass runs on async compute with r.rdg.asynccompute=0. If you are able to reproduce the issue reliably with a scene capture rendering in the main view with the necessary settings that would help us investigate further.
Apologies for the delay. Regarding the scene capture GPU crash, if you’re rendering Nanite into it you may want to consider applying the CLs for fixing the issue mentioned here:
Regarding the missing Aftermath crash dump - it sounds more like an Aftermath issue than an engine issue. However, it’s possible that if the crash is happening during an internal BVH build, the Aftermath GPU dump won’t help because it will only show up as “Internal Shader” and that’s it. If you can reproduce the crash on console, where there are better debugging tools, you may have more luck.
Alternatively, disabling different r.RayTracing.Geometry.* CVars may help narrow down the geometry that is involved in the crash, e.g. r.RayTracing.Geometry.StaticMeshes, r.RayTracing.Geometry.HierarchicalInstancedStaticMesh, r.RayTracing.Geometry.SkeletalMeshes etc.
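One way to organise that narrowing-down is to disable one geometry type per test run. The sketch below only generates the console commands for each run; the one-at-a-time strategy, the helper name and the keep_enabled option are illustrative assumptions, and the CVar list contains only the names mentioned above.

```python
# CVar names as mentioned in this thread; extend the list as needed.
GEOMETRY_CVARS = [
    "r.RayTracing.Geometry.StaticMeshes",
    "r.RayTracing.Geometry.HierarchicalInstancedStaticMesh",
    "r.RayTracing.Geometry.SkeletalMeshes",
]


def one_at_a_time_runs(cvars, keep_enabled=()):
    """Return one (disabled_cvar, console_commands) pair per test run.

    Each run sets exactly one CVar to 0 and all others to 1. CVars listed
    in `keep_enabled` are never chosen as the disabled one (hypothetical
    option for geometry types the game cannot run without).
    """
    runs = []
    for disabled in cvars:
        if disabled in keep_enabled:
            continue
        commands = [f"{name} {0 if name == disabled else 1}" for name in cvars]
        runs.append((disabled, commands))
    return runs


if __name__ == "__main__":
    for disabled, commands in one_at_a_time_runs(GEOMETRY_CVARS):
        print(f"# Run with {disabled} disabled")
        for command in commands:
            print(command)
```

Each printed line is in the usual "name value" console form. If the crash stops reproducing in exactly one run, the geometry type disabled in that run is the first candidate to look at, bearing in mind how long a run needs to last before a clean result means anything.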
We’ve tried disabling the scene capture component to rule it out as a cause of the crash, and we’ve still managed to reproduce the crash in RayTracingScene. Here are the breadcrumbs from that crash:
And we’ve also been actively trying to reduce the number of spline meshes present in the game (mostly for optimization reasons, but also to eliminate them as a source of the crash). We’ve previously managed to reproduce the crash with r.RayTracing.Geometry.SplineMeshes set to 0. And we can’t really turn off skeletal mesh RT geometry, because we need that to shadow skeletal meshes with Megalights.
Thanks for the info. If it only crashes with skeletal meshes, that could be an area to focus on to see what about those meshes might be causing it. However, I spoke to my colleagues, and one thing that was pointed out is that the logs indicate:
Error: Result failed
at D:\Townfall\Engine\Source\Runtime\D3D12RHI\Private\D3D12Viewport.cpp:554
with error DXGI_ERROR_DEVICE_REMOVED with Reason: DXGI_ERROR_DEVICE_REMOVED
This could be caused by a runtime error, and because it happened during Present() there could be other causes including:
OS version
whether it’s a virtual machine or not
DLSS
anything that hooks the swapchain (e.g. OBS, Rivatuner, Discord etc.)
When the error isn’t an actual GPU hang or crash then Aftermath has nothing to do and that could be why you don’t get a crash dump.
If you’re able to repro the crash with -d3ddebug you might get more relevant error data.