GPU Crash With Ray Tracing Nanite Mesh

Hi,

I have two questions: one about ray tracing with Nanite meshes, and one about the ray tracing diagnostic tools.

About Ray Tracing with Nanite mesh:

I enabled Ray Traced Shadows and MegaLights in my project, which caused shadowing artifacts on Nanite fallback meshes.

I then set r.RayTracing.Nanite.Mode to 1 to resolve this issue.

The shadow artifacts went away, but the change caused random GPU crashes.

I used r.D3D12.RayTracing.GPUValidation=1 to identify the problematic ray tracing geometry. The results pointed to different Nanite meshes at random. I suspect the problem is related to ray tracing against streamed-out Nanite meshes, so I disabled r.RayTracing.Nanite.Mode, and the crashes disappeared.

Is there a way to enable r.RayTracing.Nanite.Mode without GPU crashes?

About the diagnostic tool - r.D3D12.RayTracing.GPUValidation

  • The BreadCrumb system doesn’t work with the RT geometry validation. The RT Geometry Validation triggers a GPU crash when it fails, but the BreadCrumb’s CPU-side data doesn’t match the GPU readback buffer, so the output information is incorrect. I had to add more detailed logging on the CPU side and read the validation shader to identify the problematic mesh.
  • The RT Scene Info validation always fails on my side. I’m not sure why yet; I’ll investigate further.

Steps to Reproduce
UE version 5.5.4

MegaLights enabled

Ray Traced Shadow enabled

r.RayTracing.Nanite.Mode=1

Hi,

There is a fix for Nanite Ray Tracing in 5.6. If possible, please try UE5.6 and see if it helps. UE5.6 will be released later today.

The fix is in UE5 Main (commit linked below). However, I’m not sure it is the only related fix, because there are many changes and refactors in UE5.6. It would be better to upgrade the engine and test again.

https://github.com/EpicGames/UnrealEngine/commit/85466cb592c4ca30c67a499a144a650f12d8d9fd

Hi,

Thanks for your reply. Here is the update.

I merged the fix, but it didn’t work. I still had random GPU crashes when setting r.RayTracing.Nanite.Mode=1.

When I enabled r.D3D12.RayTracing.GPUValidation, the validation also failed.

We can’t switch to 5.6 now since we have a build to deliver in several months.

Regarding the RT Scene Info validation failing with r.D3D12.RayTracing.GPUValidation=1, I fixed the issue by setting the NumHitGroups to the SBT’s total geometry segments count times RAY_TRACING_NUM_SHADER_SLOTS.
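The hit-group count described above can be sketched as follows. This is a minimal, engine-independent illustration rather than the engine's code: `ComputeNumHitGroups`, `SegmentsPerGeometry`, and `NumShaderSlots` are hypothetical names, and the slot count is passed in as a parameter standing in for RAY_TRACING_NUM_SHADER_SLOTS instead of assuming its value.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper: total hit groups =
//   (sum of geometry segments referenced by the SBT) * (shader slots per segment).
// NumShaderSlots stands in for the engine's RAY_TRACING_NUM_SHADER_SLOTS define.
inline std::uint64_t ComputeNumHitGroups(
    const std::vector<std::uint32_t>& SegmentsPerGeometry,
    std::uint32_t NumShaderSlots)
{
    // Accumulate in 64 bits so large scenes cannot overflow the sum.
    const std::uint64_t TotalSegments = std::accumulate(
        SegmentsPerGeometry.begin(), SegmentsPerGeometry.end(), std::uint64_t{0});
    return TotalSegments * NumShaderSlots;
}
```

For example, three geometries with 3, 1, and 4 segments and two shader slots per segment give (3 + 1 + 4) * 2 hit groups.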

Could you confirm that the r.D3D12.RayTracing.GPUValidation still works, or is it outdated?

Hi,

I think the validation still works, although I haven’t seen the engineers use it much. You could also run with -gpucrashdebugging, which may give more information in the logs and in Aftermath.

I have several options and would like you to give them a try:

  1. Disable r.RayTracing.Nanite.Mode, set the Fallback Triangle Percent to 100, and see if it helps with the shadow artifacts. (If you could provide sample content, our TAs may be able to take a look and help improve the shadow quality.)
  2. Please create a sample project that reproduces the crash in UE5.5 and send it to us. We will then continue to investigate the issue.
  3. Could you try setting the r.Lumen.HardwareRayTracing.LightingMode to 0 and see if the crash is still present?
  4. Could you try disabling the VSM and see if the crash is still present?

Thank you.

Hi,

Here’s the update.

The crash was caused by Nanite meshes. The fix at https://github.com/EpicGames/UnrealEngine/commit/8400a0ca22b3cc29183eb82a48f594693340ff65 is for dynamic RT geometries. I tried it, and it didn’t work.

The RT geometry validation works with -gpucrashdebugging, and the DRED output is correct. However, the BreadCrumb system output is incorrect; it is for the regular graphics pipeline command buffer, not for the validation command buffer.

The GPU validation readback buffer output is:

[Image Removed]

The validation shader code is:

[Image Removed]

The validation failed because a vertex index exceeded the maximum vertex count, and the problematic mesh is a Nanite mesh. In fact, the validation failed on random Nanite meshes.

I haven’t tried Aftermath yet, but I think the invalid RT geometry data is what causes the GPU crash: there is a mismatch between the primitive info and the vertex info of the Nanite streamed-out geometry.

You can repro the crash in any level filled with Nanite meshes by setting r.RayTracing.Nanite.Mode=1 and r.D3D12.RayTracing.GPUValidation=1.

Hi,

Here’s the update.

For r.D3D12.RayTracing.GPUValidation:

  • The maximum number of shader binding table slots, including both dynamic and static ones, needs to be passed to the Ray Tracing Scene Info validation.
  • For Nanite streamed-out mesh data, an index buffer offset needs to be added to the geometry build param validation, since all streamed-out mesh data resides in a single shared index buffer. I checked 5.6, and I’m not sure whether this has been fixed somewhere else. [Image Removed]
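The second point can be illustrated with a CPU-side sketch of the bounds check. It is a simplified model under stated assumptions, not the engine's validation shader: `FGeometryRange`, `ValidateIndexRange`, and all field names are hypothetical. It shows why a segment stored at an offset inside a shared index buffer can fail validation when the check reads from the wrong starting index.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of one geometry segment stored in a shared index buffer.
struct FGeometryRange
{
    std::size_t IndexBufferOffset; // first index of this segment within the shared buffer
    std::size_t NumIndices;        // number of indices belonging to this segment
    std::uint32_t MaxVertices;     // number of vertices the segment may reference
};

// Returns true when the segment lies inside the shared buffer and every
// index it contains is below MaxVertices.
inline bool ValidateIndexRange(
    const std::vector<std::uint32_t>& SharedIndices,
    const FGeometryRange& Range)
{
    // Reject ranges that run past the end of the buffer (written to avoid overflow).
    if (Range.IndexBufferOffset > SharedIndices.size() ||
        Range.NumIndices > SharedIndices.size() - Range.IndexBufferOffset)
    {
        return false;
    }
    for (std::size_t i = 0; i < Range.NumIndices; ++i)
    {
        // Read relative to the segment's own offset, not from index 0.
        if (SharedIndices[Range.IndexBufferOffset + i] >= Range.MaxVertices)
        {
            return false;
        }
    }
    return true;
}
```

In this model, a three-vertex segment checked at the offset of a neighbouring six-vertex segment reads that neighbour's indices and reports a vertex index above the maximum, which resembles the failure seen on random Nanite meshes.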

For the random GPU crash after setting r.RayTracing.Nanite.Mode=1, unfortunately, it’s not caused by invalid Ray Tracing Geometry data. After I fixed the RT Geometry Validation, all the geometry data passed it.

I suspect it may be related to accessing an invalid buffer, NaniteRayTracing.AuxiliaryDataBuffer, but I haven’t been able to repro the crash recently, so I have no proof.

Hi,

Thank you for the reply.

I have fixed r.D3D12.RayTracing.GPUValidation locally, and all the geometries passed the validation. Unfortunately, this is not the reason for the GPU crash.

I also enabled the D3D12 RayTracingValidationLayer, but it reported no errors, only several performance warnings for the BLAS building.

I enabled shader debug info generation and collected more data.

The GPU crashed while reading data from NaniteRayTracing_RayTracingDataBuffer, which is the buffer NaniteRayTracing.AuxiliaryDataBuffer. The buffer’s state looks valid, so the crash may be caused by a read past the end of the buffer.

I wanted to check the register values, but I don’t have NV NSight Pro, so I have attached a package containing the crash dump and the related debug info files. Let me know if you can check the register values in the GPU dump and confirm the cause of the crash.

[Image Removed]

I’m currently investigating Nanite.FRayTracingManager. I found some data race conditions between Nanite.FRayTracingManager.Update and the ongoing GPUScene upload task. However, so far I haven’t been able to repro the crash directly by manipulating the code.

Hi,

We have upgraded the engine to 5.6, and I enabled r.RayTracing.Shadows and r.RayTracing.Nanite.Mode.

After I fixed a buffer allocation issue and a data race condition in Nanite::FRayTracingManager, everything has worked fine so far. No GPU crash has reproduced.

I noticed there are a lot of changes in the ray tracing code, and I’m not sure which part fixed it.

Anyway, thanks for your help!

BTW, I found another fix for GPU crash with ray tracing. https://github.com/EpicGames/UnrealEngine/commit/8400a0ca22b3cc29183eb82a48f594693340ff65

If it still crashes, there are some additional steps to help identify the issue:

  1. Unshelve the CL#43215065 (these are the changes in UE5.6 about Aftermath shader association and general improvements)
  2. Modify your local engine’s ConsoleVariables.ini to set r.DumpShaderDebugInfo=1, r.Shaders.Symbols=1, and r.Shaders.ExtraData=1
  3. Build your game and start it with -nvaftermathall -nomaterialshaderddc -gpucrashdebugging
  4. Verify you now have a <project name>\Saved\ShaderDebugInfo\PCD3D_SM6\Global\GPUDebugCrashUtilsCS\0 folder with .usf, .pdb, .dxil and other files in there.
  5. Run the game until it crashes.
  6. You should now have a new crash folder in <project name>\Saved\Crashes and some .nvdbg files in the new crash folder OR in <project name>\Saved\Logs - if the .nvdbg files are in \Saved\Logs move them to the new crash folder where the .nv-gpudmp file is.
  7. Once you have the .nv-gpudmp file, make sure you have installed the latest NVIDIA Nsight, which contains the latest Aftermath dump viewer
  8. Opening the .nv-gpudmp file in the crash folder should launch NVIDIA Nsight
  9. Once it opens, go to Tools > Options > Search Paths, set Shader Source to <engine root>/Engine/Shaders, and set Search sub-directories to Yes
  10. Add the path to your project’s Saved/ShaderDebugInfo to Shader Binaries and Separate Shader Debug Information, and set Search sub-directories to Yes
  11. After Nsight finishes scanning/loading symbols, select the Crash Info tab, then select one of the entries under Active Warps
  12. In the Shader Source panel change Language from IL to Source and change the File from the sentinel to the actual .dxil file which is also the name of the shader and should be GPUDebugCrashUtils.dxil from the test run.

Once you have all the information, please send it to us and we will try to identify the issue. Thank you.

Thank you for the information.

One of our engineers is aware of a similar issue, though not a GPU crash. He will continue to investigate after Unreal Fest. I will get back to you if there is any update.

Thank you for your update. I asked our engineers whether r.D3D12.RayTracing.GPUValidation still works in UE5.5/5.6. They said:

It was probably broken when we decoupled FRHIRayTracingScene and FRHIRayTracingShaderBindingTable. That validation needs both the SBT and the TLAS to work, but the SBT is no longer accessible in RHIBuildAccelerationStructures. We need to move the validation logic to a new RHI function that is called from a high level and takes both FRHIRayTracingScene and FRHIRayTracingShaderBindingTable as parameters, so it can properly validate InstanceContributionToHitGroupIndex etc. Alternatively, it could be done in RHI-independent code so that it works on all platforms. In any case, we will probably refactor it to work on all platforms in the future.
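To make concrete why the validation needs both the scene and the shader binding table, here is a minimal sketch of the kind of check the engineers describe. It is not engine code: `FInstanceInfo`, `ValidateHitGroupIndices`, and their fields are hypothetical, and it assumes each geometry segment occupies `NumShaderSlots` consecutive hit-group records starting at the instance's InstanceContributionToHitGroupIndex.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-instance data, taken from the scene (TLAS) side.
struct FInstanceInfo
{
    std::uint32_t InstanceContributionToHitGroupIndex; // first hit-group record for the instance
    std::uint32_t NumGeometrySegments;                 // segments in the instance's geometry
};

// NumHitGroups comes from the shader binding table side, which is why the
// check cannot be done with the scene alone.
inline bool ValidateHitGroupIndices(
    const std::vector<FInstanceInfo>& Instances,
    std::uint32_t NumShaderSlots,
    std::uint64_t NumHitGroups)
{
    for (const FInstanceInfo& Instance : Instances)
    {
        // One past the last record this instance can address (64-bit to avoid overflow).
        const std::uint64_t End =
            std::uint64_t{Instance.InstanceContributionToHitGroupIndex} +
            std::uint64_t{Instance.NumGeometrySegments} * NumShaderSlots;
        if (End > NumHitGroups)
        {
            return false; // the instance would index past the end of the hit-group table
        }
    }
    return true;
}
```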

I also tried to repro the crash in the CitySample and Valley of the Ancient projects with r.RayTracing.Nanite.Mode=1, but I couldn’t repro it on my side either. I will keep an eye on this issue and give you an update if I find any clue.

Really sorry, I don’t have the Nsight Pro either. I will find another colleague to follow up on the issue.

Hi,

I can see the register values, but I don’t understand how to debug them (sorry). I have uploaded all the values and a screenshot of debugging the dump file. I hope it helps.

BTW, there is a fix for reading visbuffer pixels out of bounds (OOB): https://github.com/EpicGames/UnrealEngine/commit/a880ddd15e590001c998b05c6acff8c57d0f902f

Please give it a try as well.

Great news, thank you for your feedback.

Would you mind pointing out what you fixed for the buffer allocation issue and the data race condition in Nanite::FRayTracingManager? If they are bugs, I will raise them with our engineers so they can be fixed in a following version. Thank you.