UE5.5.4 Invalid data in Nanite ClusterPageData

Hello Epic Games Support!

We have noticed GPU validation crashes on unaligned memory reads and MMU faults when accessing memory in Nanite compute shaders. The bug is not tied to a single GPU model or driver version. It happens with different frequency on different PCs, but there is nothing specific about their configurations.

An unaligned-address crash for one of the BasePass shaders is shown below. The only value that could hold an unaligned address is PositionOffset, which means the Cluster data is invalid.

[Image Removed]

My question is regarding the changes to the FPackedCluster data structure in Engine/Source/Runtime/Engine/Public/Rendering/NaniteResources.h, which we integrated as part of the engine upgrade to UE 5.5 (https://github.com/EpicGames/UnrealEngine/commit/44f9aec3a74110380ad66f06c9cf402b4af32acb).

Does this change require all Nanite meshes to be manually re-imported in order to generate updated FPackedCluster / NaniteResourcesPtr.Get().StreamablePages data?

I didn’t find any version tracking in FPackedCluster, and from my observation NaniteResourcesPtr.Get().StreamablePages is only updated on mesh import. If asset data is handled automatically as part of the version upgrade, I would be glad to learn how; if not, I would like a confirmation from the Epic team before re-importing all Nanite assets in a big project.

Thank you,

Oleksii

Steps to Reproduce
Cannot reproduce outside of the project.

Hello,

I’m passing this to my colleague who is more familiar with the changes here, but I do want to mention that it would help to have the following:

  • A full log with breadcrumbs
  • An Aftermath crash dump
  • Does it repro on NVIDIA, AMD, and Intel GPUs?
  • What is the most recent driver version you have seen the crash on?
  • What changes to the engine or renderer have you made that we should be aware of? For example, this post indicates you may have changes such as “Customized Unreal’s graphic resources allocator to significantly reduce hitches caused by resource allocation”.

It’s not necessarily related, but right before we created the 5.5 branch we were seeing GPU crashes and visual issues where Nanite ClusterPageData was getting corrupted because the NVIDIA driver was skipping UAV barriers at the beginning of command lists. This should be fixed in driver versions 572.16 and higher, but when troubleshooting this error these are the CVars that we used to work around the issue before we knew the root cause (a small code sketch for applying them follows the list):

  1. Turning off Nanite async streaming: r.Nanite.Streaming.AsyncCompute=0
  2. Turning off parallel translation: r.RHICmd.ParallelTranslate.Enable=0
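
If it is easier to test across machines, the same CVars can also be forced from code at startup. A rough sketch using IConsoleManager (setting them via config or the console works just as well):

#include "HAL/IConsoleManager.h"

// Sketch: force the workaround CVars from code, e.g. from a game module's
// StartupModule, after the renderer CVars have been registered.
static void ApplyNaniteCrashWorkarounds()
{
    if (IConsoleVariable* AsyncStreaming =
        IConsoleManager::Get().FindConsoleVariable(TEXT("r.Nanite.Streaming.AsyncCompute")))
    {
        AsyncStreaming->Set(0, ECVF_SetByCode);
    }

    if (IConsoleVariable* ParallelTranslate =
        IConsoleManager::Get().FindConsoleVariable(TEXT("r.RHICmd.ParallelTranslate.Enable")))
    {
        ParallelTranslate->Set(0, ECVF_SetByCode);
    }
}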

As with any memory corruption issue, there are many ways it can occur, but I thought I’d mention this one since it was in the same time frame.

Hi [mention removed]​

Thank you for your answer!

We use drivers more recent than 572.16. One of the crashes was reproduced on 581.42, and there is no indication that the issue is tied to a specific driver version. We will try applying these workaround CVars anyway.

I can confirm with 100% certainty that the issue happens on NVIDIA GPUs, as we have detailed Aftermath reports for every GPU crash.

I also think the issue affects AMD GPUs, as we observe a similar rate of GPU crashes there, but our crash reporting setup doesn’t work well enough on AMD, so I can’t confirm it. I don’t have statistics for Intel GPUs.

The repro rate is high when GPU load is high; for example, on low settings or when we use manually simplified materials, the crash rate is much lower.

We get ~0 GPU crashes with r.Nanite=0, which is expected, as Nanite ClusterPageData is not used in this case.

We will try to reproduce the issue on a vanilla UE version to rule out the impact of our engine modifications on the GPU crash rate.

I will provide logs and dumps in my next answer.

In this post I want to discuss some details of the crash data (attachments stored in a private [Content removed]

1) Please carefully review GPUCrashNaniteMem.7z (attached in the private ticket), which contains the crashes that happen due to corrupted ClusterPageData.

In crash (1) there is a Failed to translate the virtual address (MMU Fault) error in one shader and a Misaligned Address Error in another shader, suggesting that memory was corrupted for these objects. In crash (2) there is a shader with a Misaligned Address Error, and in crash (3), the only one that is symbolized, there is a Misaligned Address Error in GetBoneInfluenceHeader() for a material that we know never uses skinning but has Cluster.bSkinning==TRUE, which is another indicator of corrupted Nanite ClusterPageData.

Generally, the MMU or unaligned-access errors are distributed across a shader’s source code and don’t happen in one place. Usually it’s some BitStreamReader_Read_RO call reading various attributes in NaniteShading_ClusterPageData.

2) We have a collection of GPU crashes from yesterday’s debug build. In GPUCrashUnsorted.7z (attached in the private ticket) you can find NVIDIA GPU crashes with GPU dumps and all collected logs. This represents our general GPU crash sample.

3) Usually our breadcrumbs contain only FRDGBuilder::Execute, and we want to check whether this matches the logs you typically observe. Is this normal and typical for UE5 titles, or should breadcrumb logs usually be more verbose, meaning our breadcrumbs setup is incorrect?

[Image Removed]

Regards,

Oleksii

[mention removed]​

This workaround helped us reduce the crash rate from 128 GPU crashes to 5, and those 5 crashes are not Nanite-related:

  1. Turning off Nanite async streaming: r.Nanite.Streaming.AsyncCompute=0
  2. Turning off parallel translation: r.RHICmd.ParallelTranslate.Enable=0

The NVIDIA driver versions for crashing users are 581.42, 581.02, 581.80, and 581.57, so these drivers should already include the UAV-barrier fixes from NVIDIA that you mentioned.

Do you have more ideas so we can narrow down the problem even further?

>Does this change require all Nanite meshes to be manually re-imported in order to generate updated FPackedCluster / NaniteResourcesPtr.Get().StreamablePages data?

A reimport is definitely not required, as this is only a change to the DDC data. The DDC automatically gets rebuilt because of the versioning change to DevGuids.NANITE_DERIVEDDATA_VER.
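
Roughly speaking (a simplified illustration, not the literal engine code path), the cache key for the built Nanite data embeds that version GUID, so changing the GUID makes every existing entry miss and the data is rebuilt from the imported mesh on demand:

// Simplified sketch of the DDC-versioning pattern, not the actual engine code.
// BuildNaniteDerivedDataKey and MeshDataHash are hypothetical names; the real
// key construction includes more inputs (build settings, LOD info, etc.).
FString BuildNaniteDerivedDataKey(const FGuid& NaniteDerivedDataVer, const FString& MeshDataHash)
{
    return FString::Printf(TEXT("NANITE_%s_%s"),
        *NaniteDerivedDataVer.ToString(EGuidFormats::Digits),
        *MeshDataHash);
}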

The fact that disabling async and parallel translate fixes the problem is definitely suspiciously similar to the driver bug with the missing async barrier, but I don’t think we have seen a repro of that since the driver fix from NVIDIA.

Another thing to try would be disabling reserved resources (r.Nanite.Streaming.ReservedResources 0), which AFAIR is enabled by default on PC in that version.

Not really consistent with the Async/ParallelTranslate findings, but just going by where this is crashing, it could also be related to an issue that was fixed here: 37604262

The problem was that if the viewport happened to have an odd size, then because of helper lanes for quad shading mode, a pixel outside the viewport could end up getting fetched and decoded.

As that pixel could just be some uninitialized garbage, it could reference into parts of the VisibleClusters array that had not been written, causing crashes looking very similar to this.
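
To illustrate the mechanism (a simplified host-side sketch, not the actual engine or shader code): with 2x2 quad-based shading the shaded region is effectively padded to an even size, so an odd viewport leaves a row/column of helper pixels outside the real viewport, and those pixels must not be allowed to index VisibleClusters:

#include <cstdint>

// Simplified illustration of the odd-viewport hazard; the names below are
// illustrative only. A 1921x1080 viewport padded to 2x2 quads covers 1922
// columns, so pixels with X == 1921 hold uninitialized data and must be
// skipped before any cluster/vertex data is decoded for them.
struct IntPoint { int32_t X, Y; };

inline IntPoint PaddedQuadExtent(IntPoint ViewSize)
{
    // Round each dimension up to a multiple of 2 (the quad size).
    return { (ViewSize.X + 1) & ~1, (ViewSize.Y + 1) & ~1 };
}

inline bool IsValidShadingPixel(IntPoint Pixel, IntPoint ViewSize)
{
    // The guard the fixes rely on: skip helper pixels outside the viewport.
    return Pixel.X < ViewSize.X && Pixel.Y < ViewSize.Y;
}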

I later fixed some similar issues in depth export and some debug shaders in these two CLs: 45641169, 45662288.

Hi [mention removed]​

Thank you for telling us about DevGuids.NANITE_DERIVEDDATA_VER; this will help us with debugging and validation.

We have tried running our game on UE 5.6, which contains the mentioned CLs with fixes, but it didn’t reduce the high GPU crash rate we are observing. We had already cherry-picked 2 of these CLs, and I will cherry-pick the 3rd one as an overall stability improvement.

We will try to set r.Nanite.Streaming.ReservedResources 0.

If it would help, our full CVars dump for one of the crashing builds is attached here in the private ticket as cvars.csv [Content removed]

In particular, we have async compute disabled: r.D3D12.AllowAsyncCompute=0.

Hi [mention removed]​

We have tested r.Nanite.Streaming.ReservedResources 0, and it doesn’t affect the number of GPU crashes; the crash count was the same as in the control group.

One important note is that the unaligned-access crashes typically occur more frequently when a Scene Capture component is active. It’s unclear whether the higher GPU load is affecting the repro rate or whether something happens randomly in the Scene Captures themselves.

Are the crashes you are seeing always in BasePass or some other screenspace pass?

A vertex that is being decoded during BasePass should already have been used for rasterization during the visbuffer pass. So if the source of the problem were the actual resident page data, I would expect you to hit that alignment error already when the vertex is decoded during rasterization.

The fact that it gets hit in a BasePass shader suggests the problem is likely that it is trying to decode an invalid pixel that was not actually rasterized. This is what we were seeing in the odd-sized viewport bugs mentioned above, and AFAIR the actual crashes looked very similar.

I wonder if there is any pattern in the pixel coordinates of the threads where this happens.

Please refer to the image below. As you can see, most of them are in the BasePass. Some breadcrumbs that don’t make sense, like CopyBackBuffer, are also actually BasePass (the reporting is just messed up), but a small portion (1-2%) happens in VisBuffer.

I completely agree that completely corrupted ClusterPageData would crash in VisBuffer. My understanding is that if the buffer is only partially corrupted, we would crash in BasePass, where we make many more data accesses, increasing the failure probability. I may be wrong about this.

[Image Removed]

We can reproduce the issue in the 1920x1080 game client, so while odd-sized viewports could account for part of the GPU crashes we get from live, in our studio environment there are sources of this issue other than odd-sized viewports.

Unfortunately, Nsight crash reports show only the line of code with no extra data, and we haven’t collected any data about the pixel positions of invalid clusters in any other way.

> I completely agree that completely corrupted ClusterPageData would crash in VisBuffer. My understanding is that if the buffer is only partially corrupted, we would crash in BasePass, where we make many more data accesses, increasing the failure probability. I may be wrong about this.

Yes, shading also fetches vertex attributes, for instance, but at least with regard to your initial screenshot inside the position decode, those same vertices *should* get decoded the same way during rasterization.

One thing that could maybe help narrow it down further would be r.Nanite.ProgrammableRaster 0.

It is not something you can ship, as it effectively disables WPO and Masked materials, but it also skips a bunch of passes in the middle of the Nanite pipeline: it skips the raster binning phase and makes it so that only the fixed-function rasterizers get executed. If that makes the problem go away, it could be a big hint as to where to look for the root of the issue.

Hi [mention removed]​

The number of GPU crashes reporting unaligned memory reads and MMU faults in Nanite cluster data has dropped from dozens to ~5 per nightly run. Unfortunately, we didn’t catch the exact moment and commit that helped us, but we think it was the odd viewport size issue, which can happen for SceneCaptures, and we use many SceneCaptures for different game features. Most likely, some check for odd viewport sizes helped us.

We will make sure our SceneCapture viewport sizes are always even and let you know if we still experience this issue.
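
For example, this is a sketch of the helper we plan to use to keep SceneCapture render targets at even dimensions (our own code, not an engine API; rounding down to avoid growing the target):

#include "Engine/TextureRenderTarget2D.h"

// Sketch: clamp a SceneCapture render target to even dimensions. This is our
// own helper, not an engine API; rounding up instead would also work.
static void ClampRenderTargetToEvenSize(UTextureRenderTarget2D* Target)
{
    if (!Target)
    {
        return;
    }

    const int32 EvenX = FMath::Max(2, Target->SizeX & ~1);
    const int32 EvenY = FMath::Max(2, Target->SizeY & ~1);

    if (EvenX != Target->SizeX || EvenY != Target->SizeY)
    {
        // ResizeTarget reallocates the underlying resource with the new size.
        Target->ResizeTarget(EvenX, EvenY);
    }
}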

That makes sense. I think the only crashes I have seen internally in the deferred pass decode logic have been when it somehow ended up decoding invalid pixels that were not written by the rasterizer in the current frame.

In our latest check, we collected 0 unaligned memory reads and MMU faults over a two-night run, so we assume the issue is fixed. The project still has GPU crashes, but 99% of them have no shader associated with the crash, and the few that do point to random distance field code, so it has to be a different issue.

Thank you for the assistance [mention removed]​ [mention removed]​ !

Hello Alex,

I’m a colleague of Oleksii. We have finished porting the game to vanilla UE 5.5 (source build, no custom changes). Along with that, we disabled rendering for most of our custom features, including the terrain system, foliage system, and several post-processes.

However, the crash rate remains the same or even higher. Overnight, we received about 350 GPU crashes.

One notable difference is that I no longer see meaningless breadcrumb scopes showing only FRDGBuilder::Execute with no active entries below it. Around 75% of the reported crash breadcrumbs now point to BuildHZB, while the remaining 25% are scattered (randomly?) across various passes, each with a low occurrence count. It is unclear whether this shift in reported breadcrumbs indicates an actual error there or is just a result of the workload changing significantly without our custom rendering features. Aftermath did not generate any data. I will see if I can reintroduce our Aftermath support improvements to capture additional data and NaniteClusterPage detections.

Are there any known issues that could occur around BuildHZB? I have attached example breadcrumbs for reference.

(missing attachment)

Hello Alex,

One important detail is that the ClusterPageData misaligned-access detection triggers when GFSDK_Aftermath_FeatureFlags_EnableShaderErrorReporting is enabled. We have this flag active in UE 5.5, and it has already helped surface several issues in the past. It is now included in UE 5.7 by default, which is great news for anyone tracking GPU crashes.
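
For anyone on 5.5 who wants the same signal, this is roughly the change we made in our Aftermath initialization (a sketch against the Aftermath SDK; the exact spot and flag set in the engine’s D3D12 RHI code may differ):

#include <d3d12.h>
#include "GFSDK_Aftermath.h"

// Sketch: enable shader error reporting when initializing Aftermath for the
// D3D12 device. The surrounding flag choices are illustrative; the key part
// is OR-ing in GFSDK_Aftermath_FeatureFlags_EnableShaderErrorReporting,
// which requires a recent Aftermath SDK that defines it.
GFSDK_Aftermath_Result InitAftermathWithShaderErrorReporting(ID3D12Device* Device)
{
    const uint32_t Flags =
        GFSDK_Aftermath_FeatureFlags_EnableMarkers |
        GFSDK_Aftermath_FeatureFlags_EnableResourceTracking |
        GFSDK_Aftermath_FeatureFlags_EnableShaderErrorReporting;

    return GFSDK_Aftermath_DX12_Initialize(GFSDK_Aftermath_Version_API, Flags, Device);
}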

We ran the nightly automated hunt with our game + vanilla UE 5.5 + our Aftermath improvements with Aftermath shader error reporting. Here are the results:

[Image Removed]

GPU crashes with entries in the GPU.Aftermath.Fault column come from Aftermath shader error reporting. The top issue is the distance field crash, which is due to the fix that is missing in vanilla UE 5.5 (the one we discussed earlier in our other GPU crash thread: https://github.com/EpicGames/UnrealEngine/commit/d058828e5fed327c7154767bebd167beffd33dbc).

The remaining shader error reports are Misaligned Address Errors, which confirms this still occurs in vanilla UE 5.5. We also ran the hunt on the 5.5 Matrix demo with Aftermath shader error reporting but did not experience the issue there.

Another crash present there is the BuildHZB one mentioned above, but so far we have no useful data on it.

Please note that the aggregation is scattered; we aggregate only based on the last active breadcrumb node. Because multiple nodes can be involved, the high-cardinality reports are the most reliable to focus on.

I’m now working on reducing the project to isolate the cluster page issue and produce a reproducible test case we can share.