Render code crashes caused by ShaderMap mismatch in UniformExpressionCache

Hello,

We have quite a high crash rate in the rendering code in production, as well as many GPU crashes. The CPU crashes in rendering are all over the place, suggesting a memory stomp. We conducted an extensive test of a build with Address Sanitizer and fixed all of the reported cases. We also set up automation so that the game can play itself on every available PC. We are currently using the DebugGame configuration with ensures and checks compiled in to report every potential issue.

We were able to work around some issues, for example:

1) We’ve set r.pso.EnableAsyncCacheConsolidation=0, as we often hit the “Consolidate was hit while Get/SetPSO was running” crash in TSharedPipelineStateCache<>::FScopeVerifyIncrement/Decrement.

2) Cherry-picked CL 39289814, which includes a uniform buffer name in the FRHIUniformBufferLayoutInitializer hash.

3) On top of the previous one, we noticed that the Name is always “Material”, effectively making this change worthless for material’s UBs. So we passed the name of the Material asset to FUniformExpressionSet::CreateBufferStruct(), so it can be used for the line:

UniformBufferLayoutInitializer = FRHIUniformBufferLayoutInitializer(LayoutName);

This was done to improve the situation with UniformBuffer-related checks, Fatal logs, and memory stomps (which, in the end, helped only partially).
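To illustrate why a constant name adds nothing to the hash, here is a minimal standalone sketch. It is not engine code: FLayoutSketch, HashCombine, and ComputeLayoutHash are hypothetical stand-ins, and the real FRHIUniformBufferLayoutInitializer hashing may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a uniform buffer layout description (not the engine type).
struct FLayoutSketch
{
    std::string Name;                           // "Material" for every material UB, or the asset name
    std::uint32_t ConstantBufferSize = 0;
    std::vector<std::uint32_t> ResourceOffsets;
};

// Simple hash mixing step in the style of boost::hash_combine.
inline void HashCombine(std::size_t& Seed, std::size_t Value)
{
    Seed ^= Value + 0x9e3779b9u + (Seed << 6) + (Seed >> 2);
}

// Hashes the layout contents; the name only participates when bIncludeName is true.
inline std::size_t ComputeLayoutHash(const FLayoutSketch& Layout, bool bIncludeName)
{
    std::size_t Hash = std::hash<std::uint32_t>{}(Layout.ConstantBufferSize);
    for (std::uint32_t Offset : Layout.ResourceOffsets)
    {
        HashCombine(Hash, std::hash<std::uint32_t>{}(Offset));
    }
    if (bIncludeName)
    {
        HashCombine(Hash, std::hash<std::string>{}(Layout.Name));
    }
    return Hash;
}
```

In this sketch, two structurally identical layouts that are both named "Material" still hash to the same value even with the name included. Only a per-asset name can separate them.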

I’ll put the relevant callstacks in a text file and attach it to the ticket to remain under the character limit.

We hoped those fixes might fix the biggest issue we have now. It’s related to the ShaderMaps in FMaterial and UniformExpressionCache in FMaterialRenderProxy. See Issue #1 in the attachment.

In the UE5 code this is a checkf. We changed it to an ensureMsgf plus a log that also includes the contents hash, so that execution can continue and we receive more information, since we couldn’t figure out how this could happen.
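As a rough illustration of what a non-fatal check with a contents hash can tell apart, here is a standalone sketch. All names (FShaderMapSnapshot, ECacheMismatch, ClassifyMismatch, EnsureCacheUpToDate) are hypothetical and the classification is an assumption, not the engine's actual check.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical snapshot of a shader map: where it lives and a hash of its contents.
struct FShaderMapSnapshot
{
    const void* Address = nullptr;
    std::uint64_t ContentHash = 0;
};

enum class ECacheMismatch
{
    None,                 // same object, same contents
    SameContentNewObject, // a different allocation holding identical contents
    RecycledAddress,      // same address but different contents (memory may have been reused)
    Different             // unrelated shader maps
};

inline ECacheMismatch ClassifyMismatch(const FShaderMapSnapshot& Cached, const FShaderMapSnapshot& Current)
{
    const bool bSameAddress = Cached.Address == Current.Address;
    const bool bSameContent = Cached.ContentHash == Current.ContentHash;
    if (bSameAddress)
    {
        return bSameContent ? ECacheMismatch::None : ECacheMismatch::RecycledAddress;
    }
    return bSameContent ? ECacheMismatch::SameContentNewObject : ECacheMismatch::Different;
}

// Non-fatal check: logs the mismatch and returns false instead of aborting.
inline bool EnsureCacheUpToDate(const FShaderMapSnapshot& Cached, const FShaderMapSnapshot& Current)
{
    if (ClassifyMismatch(Cached, Current) == ECacheMismatch::None)
    {
        return true;
    }
    std::fprintf(stderr, "UniformExpressionCache mismatch: cached=%p (hash %llu) current=%p (hash %llu)\n",
        const_cast<void*>(Cached.Address), static_cast<unsigned long long>(Cached.ContentHash),
        const_cast<void*>(Current.Address), static_cast<unsigned long long>(Current.ContentHash));
    return false;
}
```

The caller can branch on the returned bool and skip the work that depends on the cache, which is what makes it possible to observe follow-up failures instead of stopping at the first one.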

Supposedly, when UniformExpressionCache is invalidated (for example, by a material parameter change), it is queued to be re-cached. Typically, it’s added to the DeferredUniformExpressionCacheRequests set and updated in FMaterialRenderProxy::UpdateDeferredCachedUniformExpressions() during FScene::Update(), before any other code uses it or compares it with RenderingThreadShaderMap.
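The expected flow can be sketched in standalone C++ as follows. This is a simplified model under stated assumptions, not the engine implementation: FShaderMapStub, FProxySketch, FDeferredCacheQueue, and IsCacheUsable are hypothetical names.

```cpp
#include <cstddef>
#include <unordered_set>

struct FShaderMapStub { int Id = 0; };

// Hypothetical, simplified render proxy.
struct FProxySketch
{
    const FShaderMapStub* RenderingThreadShaderMap = nullptr; // what the material currently renders with
    const FShaderMapStub* CachedShaderMap = nullptr;          // what the cache was built against
    bool bUpToDate = false;
};

// Models a deferred request set that is flushed once per scene update.
class FDeferredCacheQueue
{
public:
    void Invalidate(FProxySketch& Proxy)
    {
        Proxy.bUpToDate = false;
        Pending.insert(&Proxy);
    }

    // Must run before any consumer reads the cache.
    void UpdateDeferred()
    {
        for (FProxySketch* Proxy : Pending)
        {
            Proxy->CachedShaderMap = Proxy->RenderingThreadShaderMap;
            Proxy->bUpToDate = true;
        }
        Pending.clear();
    }

    std::size_t NumPending() const { return Pending.size(); }

private:
    std::unordered_set<FProxySketch*> Pending;
};

// The condition the "should be up to date" assertion effectively expects.
inline bool IsCacheUsable(const FProxySketch& Proxy)
{
    return Proxy.bUpToDate && Proxy.CachedShaderMap == Proxy.RenderingThreadShaderMap;
}
```

In this model the failure appears whenever RenderingThreadShaderMap changes without a matching Invalidate() call: the cache still claims to be up to date but points at the old shader map.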

We tried to remove the parallelization factor from the equation by setting r.DeferUniformExpressionCaching=0, r.UniformExpressionCacheAsyncUpdates=0. Unfortunately, it did not help.

Another very similar one comes from the VT destroy callback, triggered much more rarely. See Issue #2 in the attachment.

Those issues have a low reproducibility rate and occur randomly and more frequently on specific machines. It doesn’t correlate with CPU generation, core count, etc., suggesting a data race. Additionally, it only happens to Nanite materials, so our best guess is that RDG tasks within Nanite::FRenderer::FDispatchContext::DispatchHW cause the problem, while UniformExpressionCache is not up-to-date.

Since we are able to continue after this assertion arises, we can see other subsequent issues being reported, for example, Issues #3, 4, 5 in the attachment.

Those assertions always have at least one preceding “UniformExpressionCache should be up to date”, but often more than one. Sometimes the material name reported from FMaterialShader::SetParameters() will not be found in the assert in FMaterialShader::GetShaderBindings(), but all the materials are always used only with Nanite across thousands of occurrences.

Have you ever encountered such issues? Do you have any suggestions on what we can try? Any help is much appreciated.

Kind regards,

Denis

Hi,

thanks for posting the call stacks. I’ve started investigating these issues, which might be threading related.

Can you provide any more information about when these asserts occur? For example:

> does this only happen in cooked builds or also in the editor?

> is there a specific plugin enabled? A similar issue has been reported when the FastGeo Streaming plugin is enabled in this [Content removed]. Changing r.MaterialQualityLevel at runtime with FastGeo enabled would trigger the assert.

This UDN [Content removed] may also be related to this issue and contains a hotfix in the second post, which prevents the application from crashing by doing a check inside FNaniteMeshProcessor::TryAddMeshBatch(). Can you try this fix and see if it helps?

Another similar-looking case, with a potential fix that consists of adding an if/else condition around the check at FMaterialRenderProxy::EvaluateUniformExpressions, can be found [Content removed].

To determine if the issue is caused by a data race, you can force the engine to run with only one thread by using the -onethread and -forcerhibypass command line arguments (as mentioned here).

Please let me know what you find out and we can continue debugging from there.

Thanks,

Sam

Hi Sam,

Thank you for your answer.

> does this only happen in cooked builds or also in the editor?

  • This has never happened in the editor, for two reasons. First, the editor has a slightly different code path and can recreate the uniform buffer in case of a ShaderMap mismatch. Second, we usually have automation running for a few hours in the game client (debug/test) to catch these crashes often enough, which we don’t do in the editor.

>plugin

We don’t have the FastGeo plugin. We use another 3rd-party plugin for some of our geometry, but the materials with the ShaderMap mismatch are not used by that plugin. They are regular materials used on Nanite meshes or light materials (the 3rd-party plugin doesn’t produce lights or Nanite meshes).

> -onethread and -forcerhibypass

We tried adding this and the issue is still reproducible.

[mention removed] Could you please share the hotfix with the check in FNaniteMeshProcessor::TryAddMeshBatch() here? Unfortunately, the [Content removed] post is not loading for me.

The workaround in FMaterialRenderProxy::EvaluateUniformExpressions is better than nothing, but we sometimes catch a use-after-free in ASan on the UniformExpressionCache ShaderMap, and I guess this would not save us from a use-after-free on the UniformBuffer.

Thank you again for your answer. As soon as we have more info I will post it here.

Best regards,

Oleksii

Hi,

thanks for the extra details and trying some of the suggested workarounds. The fact that the issues still occur when forcing single-threaded execution makes a data race less likely.

The hotfix I mentioned to prevent crashes inside FMaterialShader::GetShaderBindings (triggering the "UniformExpressionCache should be up to date" check) consists of doing a check inside FNaniteMeshProcessor::TryAddMeshBatch() with the following code:

// Skip the mesh batch when the proxy has no cached uniform expression shader map for this feature level.
if (!MaterialRenderProxy.UniformExpressionCache[FeatureLevel].CachedUniformExpressionShaderMap)
{
    UE_LOG(LogRenderer, Warning, TEXT("Skipped adding MeshBatch for MaterialRenderProxy %s"), *MaterialRenderProxy.GetMaterialName());

    return false;
}

While this fix seemed to work in previous engine versions, the FNaniteMeshProcessor::TryAddMeshBatch() method is unfortunately no longer available in UE5.5 (it was removed in this commit), but perhaps it would also work when adding it inside FBasePassMeshProcessor::TryAddMeshBatch(). Hopefully that helps somewhat, but please let me know if you find out more.

Best regards,

Sam

Hi Sam,

Small update. I was able to reproduce one of these issues locally. It was caused by CacheResourceShadersForRendering not being called after a MaterialQualityLevel update for a material instance loaded by our custom system for shader runtime pre-caching. After completely disabling that system we still have a large number of other crashes with the same callstack, but I don’t have a repro for the other issues.

In that particular case, the issue is that GetObjectsOfClass for UMaterialInstance::StaticClass() doesn’t return the material instance used by our shader pre-caching system in the list of material instances. As a result, the FMaterial that was rendered was for High quality, while UniformExpressionCache had the ShaderMap from the Medium quality FMaterial of this RenderProxy.
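A minimal standalone model of this failure mode, assuming the re-cache step only visits the instances an enumeration returns. EQualitySketch, FInstanceSketch, RecacheEnumerated, and FindStaleInstances are hypothetical names, not engine API.

```cpp
#include <string>
#include <vector>

enum class EQualitySketch { Low, Medium, High };

// Hypothetical material instance that remembers which quality level its cache was built for.
struct FInstanceSketch
{
    std::string Name;
    EQualitySketch CachedQuality = EQualitySketch::Medium;
};

// Re-caches only the instances the enumeration returned; anything missing keeps its old level.
inline void RecacheEnumerated(const std::vector<FInstanceSketch*>& Enumerated, EQualitySketch NewQuality)
{
    for (FInstanceSketch* Instance : Enumerated)
    {
        Instance->CachedQuality = NewQuality;
    }
}

// Returns the names of live instances whose cache does not match the active quality level.
inline std::vector<std::string> FindStaleInstances(const std::vector<FInstanceSketch*>& AllLive, EQualitySketch ActiveQuality)
{
    std::vector<std::string> Stale;
    for (const FInstanceSketch* Instance : AllLive)
    {
        if (Instance->CachedQuality != ActiveQuality)
        {
            Stale.push_back(Instance->Name);
        }
    }
    return Stale;
}
```

An instance that is alive but absent from the enumerated list stays at Medium after a switch to High, which matches the High FMaterial versus Medium ShaderMap combination described above.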

Regarding the workaround: we will try this one. However, we previously tried a few similar ones in other places in the code where we would otherwise crash, and it usually just crashes in a different place where UniformExpressionCache is accessed, such as OnVirtualTextureDestroyedCB.

Best regards,

Oleksii

Hi,

thanks for the update. It’s good to hear that you were able to clear up the cause of one of the crashes.

>> It was caused by CacheResourceShadersForRendering not being called after a MaterialQualityLevel update for a material instance loaded by our custom system for shader runtime pre-caching. After completely disabling that system we still have a large number of other crashes with the same callstack

Is it possible to provide the callstack for the other crashes?

Thanks,

Sam

Hi Sam!

Sorry for the confusion, but I didn’t mean that we get new error call stacks. Please check the call stacks in Issue #2 and Issue #3 in the original post above. That code is called even if we skip the draw call for an invalid UniformExpressionCache. We have noticed that most of these issues happen when Material Quality Level and DetailMode are changed at runtime.

From my understanding, changing the material quality just helps to detect an invalid UniformExpressionCache, because the pointer address then differs. The issue itself is not connected to Material Quality Level, but to missing UniformExpressionCache updates.

Best regards,

Oleksii

Hi,

thanks for clarifying that. Just to rule one thing out, can you please check in your project settings under Engine > Rendering > Materials if the option “Game Discards Unused Material Quality Levels” is disabled? It should be disabled (the default) to allow material quality level changes at runtime. If enabled, attempting to switch to an unloaded quality level will force a potentially unstable recompilation or cause an error.

Thanks,

Sam

Hi Sam,

Thank you, I have checked that “Game Discards Unused Material Quality Levels” is disabled in our project.

Best regards,

Oleksii

Thanks for confirming. Since it is quite difficult to debug these crashes remotely, would it be possible to provide a minimal project where the crash can be reproduced? Also, do you see a difference in the number of failed assertions related to issue #2 when applying the suggested workaround in [Content removed] (second to last reply)?

Thanks,

Sam

Hi Sam!

Unfortunately, I can’t make such a project. This crash has a quite low repro rate, and we only catch it in an automated test running on 50+ PCs for 12 hours.

We haven’t caught any assertions since disabling graphics quality changes in our automated test.

Since we have connected the issue to graphics quality changes, we are now focusing on finding a correct fix; limiting Material Quality Level and DetailMode changes is already a workaround that works for us.

I will try to make a better repro based on my understanding of the nature of this issue and will share it with you if I succeed.

Best Regards,

Oleksii

No problem and thanks for the update. It’s great to hear that you have been able to narrow down the cause of the asserts and found a workaround.

>> I will try to make a better repro based on my understanding of the nature of this issue and will share it with you if I succeed.

Thanks! That would be helpful for Epic to debug the issue and come up with a fix.

Best regards,

Sam