NodeAndClusterCull GPU crash due to TDR in packaged game

Hello,

We’ve been experiencing TDR crashes recently and, despite our best efforts, have been unable to pinpoint what is causing them.

Our only certainty is that the crash always occurs during the NodeAndClusterCull step.

As a band-aid, we increased TDR delays to avoid crashing and to allow the affected team to continue their work, but this still results in 20-30s hangs, which is obviously not acceptable in a final product.

I tried playing with the following Nanite parameters, which we previously had to crank up in order to solve visual artifacts (the values shown were used without issues before the crashes started to occur):

r.Nanite.MaxNodes=8388608

r.Nanite.MaxVisiblePatches=8388608

r.Nanite.MaxVisibleClusters=16777216

r.Nanite.MaxCandidatePatches=8388608

r.Nanite.MaxCandidateClusters=67108864

Setting those back to defaults reduces the length of the hang (and as such dodges some TDR crashes), but most of the time still leads to hangs a bit longer than 2s, which are not ideal and would end up triggering the TDR anyway.

Our next step would be to progressively remove content from our level to sniff out which objects or materials may cause the hang, but before committing to this goose chase, we would like to know:

  • Is this an ongoing issue in the engine? When scouring through UDN, we came across this issue that shared the same GPU breadcrumbs as ours, but we are unsure whether it is relevant to our issue since the causes seem to differ.

  • Is there any way we could extend/configure logs in such a way that the offending objects/materials could be directly identified? (Bear in mind we use the launcher version of the engine, so we tend to avoid solutions that require engine modifications.)
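As a minimal sketch of what "back to defaults" can look like in practice: commenting out the overrides lets the engine fall back to its built-in values. This assumes the overrides live in the [/Script/Engine.RendererSettings] section of DefaultEngine.ini, as in the configuration listed under Steps to Reproduce. The default values themselves are not listed here because they are not given in this thread.

```ini
[/Script/Engine.RendererSettings]
; Lines starting with ';' are ignored, so the engine defaults apply.
; Re-enable the overrides one at a time to see which one affects hang length.
;r.Nanite.MaxNodes=8388608
;r.Nanite.MaxVisiblePatches=8388608
;r.Nanite.MaxVisibleClusters=16777216
;r.Nanite.MaxCandidatePatches=8388608
;r.Nanite.MaxCandidateClusters=67108864
```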

Thanks in advance,

Joaquim

Steps to Reproduce

The crash seems a bit random and no clear reproduction steps have been determined.

We know that some areas of our levels are more prone to the crash than others, but even then we have trouble reproducing it 100% of the time.

In case it helps, here are the rendering parameters used in our current packaged builds:

[/Script/Engine.RendererSettings]

r.DBuffer=False

r.Streaming.PoolSize=8000

r.VirtualTextures=True

r.VT.AnisotropicFiltering=True

bEnableVirtualTextureOpacityMask=True

r.Water.FallbackDepth=200000

r.Water.SingleLayerWater.SupportCloudShadow=True

r.DynamicGlobalIlluminationMethod=1

r.ReflectionMethod=1

r.CustomDepth=3

r.CustomDepthTemporalAAJitter=True

r.Nanite.Streaming.StreamingPoolSize=1024

r.Nanite.MaxNodes=8388608

r.Nanite.MaxVisiblePatches=8388608

r.Nanite.MaxVisibleClusters=16777216

r.Nanite.MaxCandidatePatches=8388608

r.Nanite.MaxCandidateClusters=67108864

r.Nanite.Tessellation=1

r.Nanite.AllowTessellation=1

r.Nanite.AllowSplineMeshes=1

r.Lumen.HardwareRayTracing=1

r.Lumen.HardwareRayTracing.LightingMode=0

r.Lumen.Reflections.RadianceCache=1

r.Lumen.Reflections.MaxRoughnessToTraceForFoliage=0.2

r.GenerateMeshDistanceFields=0

r.Lumen.TraceMeshSDFs=0

r.LumenScene.FarField=1

r.LumenScene.FarField.MaxtraceDistance=200000

r.LumenScene.GPUDrivenUpdate=1

r.RayTracing=True

r.RayTracing.Shadows=True

r.RayTracing.Skylight=True

r.RayTracing.UseTextureLod=True

r.RayTracing.Nanite.Mode=0

r.RayTracing.Culling.Radius=100000

r.RayTracing.Shadows.EnableTwoSidedGeometry=0

r.DistanceFields.SupportEvenIfHardwareRayTracingSupported=0

r.DistanceFields.DefaultVoxelDensity=0.200000

r.PathTracing=False

r.AllowStaticLighting=False

r.ForwardShading=False

r.NormalMapsForStaticLighting=False

r.DefaultFeature.AmbientOcclusion=False

r.DefaultFeature.AmbientOcclusionStaticFraction=False

r.DefaultFeature.AutoExposure.ExtendDefaultLuminanceRange=True

r.DefaultFeature.AutoExposure=False

r.DefaultFeature.Bloom=False

r.DefaultFeature.MotionBlur=False

r.DefaultFeature.LightUnits=2

r.VertexFoggingForOpaque=False

r.Mobile.EnableNoPrecomputedLightingCSMShader=0

r.MobileHDR=False

r.Mobile.EnableStaticAndCSMShadowReceivers=False

r.Mobile.AllowDistanceFieldShadows=False

r.Mobile.AllowMovableDirectionalLights=False

r.AntiAliasingMethod=4

r.TSR.AsyncCompute=2

r.ScreenPercentage=61

r.SkinCache.CompileShaders=True

r.SkinCache.DefaultBehavior=1

SkeletalMesh.UseExperimentalChunking=1

r.GPUSkin.Support16BitBoneIndex=True

r.GPUSkin.UnlimitedBoneInfluences=True

r.HairStrands.Strands=False

Hello!

This could be related to the other NodeAndClusterCull crash. In the Editor case it looks like a TDR in the VSM Nanite pass, not the main Nanite VisBuffer. It seemed to occur when there are enough local shadow-casting point lights that the thread group worker count is high (over 1024 seems to be a reliable repro in local testing), and it usually occurs on level load in Editor, when there may be additional shader compilation going on and heavy load from rendering the first frame. If you have some idea about the conditions in your game when the crash repros, based on the logs, that might help us narrow things down. You could try integrating CL#40753822 (3fb0499), which has seemed to help in the Editor repro case and is an optimization for NaniteInstanceCulling, but the underlying cause is still not known.

The other potential place to look is at your real-time reflection capture settings.

(ID: 0x805693f9) [ Active] Frame 11398
(ID: 0x805693fd) [ Active] GPUSkinCache
(ID: 0x80569580) [ Active] FRDGBuilder::Execute
(ID: 0x8056943a) [ Active] Scene
(ID: 0x80569463) [ Active] CaptureConvolveSkyEnvMap
(ID: 0x80569464) [ Active] ConvolutionMip6
Etc.

It looks like there’s some work in the Async queue breadcrumbs related to that, and it has been another source of potential TDRs, so in 5.6 we added a warning in CL#39287895 (8b7f4aa) to suggest users lower their sky light resolution to less than 512 or r.SkyLight.RealTimeReflectionCapture.TimeSlice.SkyCloudCubeFacePerFrame to less than 6. Not sure if this is related, but it is worth looking into if you are using those features.
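As a rough sketch of the second suggestion, the time-slice cvar could be lowered in the project config. The value 3 is only an illustrative choice below 6, and the section name mirrors the one used in the configuration posted above. The sky light resolution is a property of the Sky Light itself, so it is not shown here.

```ini
[/Script/Engine.RendererSettings]
; Illustrative value only: anything below 6 follows the suggestion above.
r.SkyLight.RealTimeReflectionCapture.TimeSlice.SkyCloudCubeFacePerFrame=3
```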

Hi,

After investigating a bit more, we realized that Nanite Tessellation was the cause of the issue. Disabling it allows us to play our level without the mentioned hangs/TDR crashes.

For a bit of context, we use Nanite tessellation on our terrain and ballast materials to simulate small rocks on the ground with displacement.

This results in tessellated geometry taking up quite a bit of screen space, but overall performance seems alright, especially since the introduction of the Fade parameters.

The only issue is these occasional hangs, which I would guess are caused by tessellation suddenly and briefly generating too much Nanite data for whatever reason, since lowering the r.Nanite.Max* parameters seems to reduce the length of the hangs.

I don’t really have any idea how to investigate further on my own, but I’m currently bundling up a workspace to send to our contact at Epic for investigation of other issues.

I will see if it is also possible to send it here or directly to you so you can use it as a repro case for this issue, if still needed by then.
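For anyone wanting to try the same test, a minimal sketch of the change against the configuration posted above. The thread does not say which of the two tessellation settings was changed, so this assumes the r.Nanite.Tessellation toggle and leaves r.Nanite.AllowTessellation untouched.

```ini
[/Script/Engine.RendererSettings]
; Assumption: turn tessellation off via the toggle already present in the posted config.
r.Nanite.Tessellation=0
; Left as posted; not changed in this sketch.
r.Nanite.AllowTessellation=1
```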

Thanks for the follow-up information. We’ve seen GPU timeouts from NaniteSplit, but the breadcrumbs include NodeAndClusterCull, so it looks similar. I created this issue, which should be public soon: Unreal Engine Issues and Bug Tracker (UE-277458). We changed the way NaniteSplit works in 5.6 and the repro case we have no longer crashes there. However, it’s not clear which CL fixed the underlying issue, though both of these are candidates:

CL#37563978 Converted Nanite patch split to multi-pass.

CL#37869798 Fixed issue where tessellation has missing triangles when getting too close.

Hi,

Just a little update, to keep the issue open.

We are trying to test a build of our project on the 5.6 preview, but ran into a packaging issue that prevents us from launching our repro case in a packaged build.

The error is about an invalid EditorState.

I will keep you informed of the results of our tests once we manage to fix that.

Thanks for the update!