Hello,
We have quite a high crash rate in the rendering code in production, as well as many GPU crashes. The CPU crashes in rendering are all over the place, suggesting a memory stomp. We conducted an extensive test of a build with Address Sanitizer and fixed all of the reported cases. We also set up automation so that the game can play itself on every available PC. We are currently using the DebugGame configuration with ensures and checks compiled in to report every potential issue.
We were able to work around some issues, for example:
1) We set r.pso.EnableAsyncCacheConsolidation=0, because we often hit the “Consolidate was hit while Get/SetPSO was running” crash in TSharedPipelineStateCache<>::FScopeVerifyIncrement/Decrement.
2) We cherry-picked CL 39289814, which includes the uniform buffer name in the FRHIUniformBufferLayoutInitializer hash.
3) On top of the previous one, we noticed that the name is always “Material”, which makes this change ineffective for material UBs. So we pass the name of the Material asset to FUniformExpressionSet::CreateBufferStruct(), so that it can be used in the line:
UniformBufferLayoutInitializer = FRHIUniformBufferLayoutInitializer(LayoutName);
We did this to improve the situation with UniformBuffer-related checks, Fatal logs, and memory stomps. In the end, it helped only partially.
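To illustrate why items 2 and 3 matter, here is a minimal standalone sketch (not engine code) of a layout hash with and without the name mixed in. LayoutDesc, Fnv1a, and HashLayout are hypothetical names made up for this example, and the hash function is an arbitrary choice, not what the engine uses.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a uniform buffer layout description; not an engine type.
struct LayoutDesc {
    std::string Name;                    // e.g. "Material" or the material asset name
    std::vector<uint32_t> MemberOffsets; // simplified member/resource table
};

// FNV-1a over raw bytes, continuing from a running hash value.
inline uint64_t Fnv1a(const void* Data, size_t Size,
                      uint64_t Hash = 14695981039346656037ull) {
    const unsigned char* Bytes = static_cast<const unsigned char*>(Data);
    for (size_t i = 0; i < Size; ++i) {
        Hash ^= Bytes[i];
        Hash *= 1099511628211ull;
    }
    return Hash;
}

// Hashes the member table and, optionally, the layout name on top of it.
inline uint64_t HashLayout(const LayoutDesc& Layout, bool bIncludeName) {
    uint64_t Hash = Fnv1a(Layout.MemberOffsets.data(),
                          Layout.MemberOffsets.size() * sizeof(uint32_t));
    if (bIncludeName) {
        Hash = Fnv1a(Layout.Name.data(), Layout.Name.size(), Hash);
    }
    return Hash;
}
```

In this sketch, two layouts with identical members always hash the same when the name is left out, and they also hash the same when both are named “Material”. Only a per-asset name separates them, which is the effect we were after with item 3.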
I’ll put the relevant callstacks in a text file and attach it to the ticket to remain under the character limit.
We hoped those fixes would resolve the biggest issue we have now. It’s related to the ShaderMaps in FMaterial and the UniformExpressionCache in FMaterialRenderProxy. See Issue #1 in the attachment.
In UE5 code, this is a checkf. We changed it to an ensureMsgf plus a log that also includes the contents hash, so that execution can continue and we receive more information, since we couldn’t figure out how this could happen.
As we understand it, when the UniformExpressionCache is invalidated (for example, by a material parameter change), it is queued to be re-cached. Typically, it’s added to the DeferredUniformExpressionCacheRequests set and updated in FMaterialRenderProxy::UpdateDeferredCachedUniformExpressions(), during FScene::Update(), before any other code uses it or compares it with RenderingThreadShaderMap.
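The flow we expect can be sketched as follows. This is a simplified standalone model, not engine code: Proxy, DeferredCacheQueue, and EnsureUpToDate are hypothetical names, and the version counters are an assumption standing in for the real shader map comparison.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <unordered_set>

// Hypothetical stand-in for a render proxy with a cached uniform expression set.
struct Proxy {
    uint64_t ShaderMapVersion = 0; // what the renderer will bind against
    uint64_t CachedVersion = 0;    // what the cache was last built for
    bool IsUpToDate() const { return CachedVersion == ShaderMapVersion; }
};

class DeferredCacheQueue {
public:
    // A parameter or shader map change makes the cache stale and queues the proxy.
    void Invalidate(Proxy& P, uint64_t NewVersion) {
        P.ShaderMapVersion = NewVersion;
        Pending.insert(&P);
    }

    // Meant to run once per frame, before anything reads the caches.
    // Returns how many proxies were refreshed.
    size_t UpdateDeferred() {
        const size_t Count = Pending.size();
        for (Proxy* P : Pending) {
            P->CachedVersion = P->ShaderMapVersion;
        }
        Pending.clear();
        return Count;
    }

private:
    std::unordered_set<Proxy*> Pending;
};

// Non-fatal check: logs and returns false instead of aborting,
// similar in spirit to replacing a checkf with an ensure plus a log.
inline bool EnsureUpToDate(const Proxy& P, const char* MaterialName) {
    if (!P.IsUpToDate()) {
        std::fprintf(stderr,
                     "UniformExpressionCache should be up to date: %s (cached %llu, expected %llu)\n",
                     MaterialName,
                     static_cast<unsigned long long>(P.CachedVersion),
                     static_cast<unsigned long long>(P.ShaderMapVersion));
        return false;
    }
    return true;
}
```

In this model the check can only fail if something reads the proxy between Invalidate() and UpdateDeferred(), or if an invalidation happens after the update has already run for the frame. That is the kind of ordering problem we suspect, but have not been able to prove.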
We tried to take parallelization out of the equation by setting r.DeferUniformExpressionCaching=0 and r.UniformExpressionCacheAsyncUpdates=0. Unfortunately, it did not help.
Another, very similar issue comes from the VT destroy callback and is triggered much more rarely. See Issue #2 in the attachment.
These issues have a low reproducibility rate: they occur randomly, though more frequently on specific machines. The frequency doesn’t correlate with CPU generation, core count, etc., suggesting a data race. Additionally, it only happens with Nanite materials, so our best guess is that RDG tasks within Nanite::FRenderer::FDispatchContext::DispatchHW run while the UniformExpressionCache is not up to date, and that this causes the problem.
Since we are able to continue after this assertion fires, we can see other subsequent issues being reported, for example, Issues #3, #4, and #5 in the attachment.
Those assertions are always preceded by at least one “UniformExpressionCache should be up to date”, and often by more than one. Sometimes the material name reported from FMaterialShader::SetParameters() is not found in the assert in FMaterialShader::GetShaderBindings(), but across thousands of occurrences all of the materials involved are used only with Nanite.
Have you ever encountered such issues? Do you have any suggestions on what we can try? Any help is much appreciated.
Kind regards,
Denis