CPU Crash in UniformBuffer caused by incorrect UniformExpressionCache.CachedUniformExpressionShaderMap

Hello Unreal Engine Support Team!

We have found a bug that we reproduce in gameplay automation testing when we change quality settings (sg.EffectsQuality with different r.MaterialQualityLevel values). We isolated the problem to r.MaterialQualityLevel changes, which normally trigger a CachedUniformExpressionShaderMap/UniformBufferExpression update.

For some materials this update is not triggered. In the Debug version this appears as Assertion failed: UniformExpressionCache.CachedUniformExpressionShaderMap == Material.GetRenderingThreadShaderMap(), and in the Test/Shipping version it crashes with a call stack containing FD3D12CommandContext::SetResourcesFromTables or

FRHITexture::GetFlags

GetD3D12TextureFromRHITexture

FD3D12CommandContext::RetrieveTexture

FD3D12ResourceBinder::SetTexture

EnumerateUniformBufferResources.

The nature of the issue is as follows: some materials have 4 UniformBufferExpressions (one per quality level), and if the cached UniformBufferExpression doesn’t match the UniformBufferExpression for the last updated r.MaterialQualityLevel (GCachedScalabilityCVars.MaterialQualityLevel), the game will assert in Debug or crash in the Test/Shipping version.
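The mismatch described above can be modeled outside the engine. The following is a minimal standalone sketch, not engine code: ShaderMap, Material, UniformExpressionCache, IsCacheValid and CheckedDraw are hypothetical stand-ins that only borrow the names from the assert message, and the assumption is that each quality level selects a different shader map while the cache keeps a pointer to whichever one was active at the last update.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration only; these are not the engine's types.
struct ShaderMap { int QualityLevel; };

struct Material {
    // One shader map (and so one set of uniform expressions) per quality level.
    ShaderMap ShaderMaps[4] = {{0}, {1}, {2}, {3}};
    // Models GCachedScalabilityCVars.MaterialQualityLevel.
    int ActiveQualityLevel = 0;

    const ShaderMap* GetRenderingThreadShaderMap() const {
        return &ShaderMaps[ActiveQualityLevel];
    }
};

struct UniformExpressionCache {
    const ShaderMap* CachedUniformExpressionShaderMap = nullptr;

    // Models the update that a quality change is expected to trigger.
    void Update(const Material& M) {
        CachedUniformExpressionShaderMap = M.GetRenderingThreadShaderMap();
    }
};

// True when the cache was built for the shader map the material currently uses.
bool IsCacheValid(const UniformExpressionCache& Cache, const Material& M) {
    return Cache.CachedUniformExpressionShaderMap == M.GetRenderingThreadShaderMap();
}

// Mirrors the shape of the Debug assert; a build without the assert would go on
// to bind resources laid out for a different shader map.
void CheckedDraw(const UniformExpressionCache& Cache, const Material& M) {
    assert(IsCacheValid(Cache, M));
}
```

In this model, changing ActiveQualityLevel without calling Update leaves the cache pointing at the previous level's shader map, which is the state the assert reports.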

Based on the material names in the asserts, it seems that the issue happens even with environment BPs or actors.

I have tried putting ensureAlways in code that covers the typical entry points of material initialization, such as UMaterialInstance::PostInitProperties(), and then checking that we can reach those materials via GetObjectsOfClass(UMaterialInstance::StaticClass()), as this function is used in AllMaterialsCacheResourceShadersForRendering. So far this has not led to a fix.

We see similar crash call stacks from live, so we want to fix this issue.

Maybe the UE team has some advice on how to fix or narrow down this issue.

Thank you.

[Attachment Removed]

Steps to Reproduce
I don’t have a repro project, but I will try to make one.

The issue reproduces in automated gameplay testing with no repro steps other than automatic quality settings changes via UGameUserSettings::SetOverallScalabilityLevel. It takes around 1 hour to collect the first crashes.

[Attachment Removed]

Hello,

Is it possible this is the same issue that was reported here? [Content removed]

If so, another developer posted a potential fix for it here [Content removed], and we are tracking the implementation of the fix at https://issues.unrealengine.com/issue/UE-355606 but the internal ticket hasn’t been resolved yet.

I’ve reached out to the assigned dev for an update, but you may want to try the proposed workaround in the meantime (copied below):

The issue is because in ScalabilityCVarsSinkCallback(), when it creates the FGlobalComponentRecreateRenderStateContext, the constructor of it calls UpdateAllPrimitiveSceneInfosForScenes(). This launches a task on RenderThread which calls FPrimitiveSceneInfo::CacheMeshDrawCommands. This can then overlap with the assignment of GCachedScalabilityCVars which causes the issue.

We were able to fix this by adding another FlushRenderingCommands() between creating the FGlobalComponentRecreateRenderStateContext and assigning GCachedScalabilityCVars here to wait for the UpdateAllPrimitiveSceneInfosForScenes RT work to finish.
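The ordering that this workaround relies on can be sketched as a standalone model. This is not engine code: RenderThreadModel, GCachedMaterialQualityLevel and ApplyQualityChange are hypothetical names, the queued lambda stands in for the render-thread caching work, and Flush() plays the role of FlushRenderingCommands(). It only shows why waiting before the assignment removes the overlap; it makes no claim that this resolves the crash in a given engine build.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical single-worker command queue standing in for the render thread.
class RenderThreadModel {
public:
    RenderThreadModel() : Worker([this] { Run(); }) {}
    ~RenderThreadModel() {
        { std::lock_guard<std::mutex> Lock(Mutex); bQuit = true; }
        WorkAvailable.notify_all();
        Worker.join();
    }
    void Enqueue(std::function<void()> Command) {
        { std::lock_guard<std::mutex> Lock(Mutex); Commands.push(std::move(Command)); }
        WorkAvailable.notify_all();
    }
    // Blocks until every queued command has finished executing.
    void Flush() {
        std::unique_lock<std::mutex> Lock(Mutex);
        Idle.wait(Lock, [this] { return Commands.empty() && !bBusy; });
    }
private:
    void Run() {
        std::unique_lock<std::mutex> Lock(Mutex);
        for (;;) {
            WorkAvailable.wait(Lock, [this] { return bQuit || !Commands.empty(); });
            if (Commands.empty()) return;  // quitting and nothing left to run
            std::function<void()> Command = std::move(Commands.front());
            Commands.pop();
            bBusy = true;
            Lock.unlock();
            Command();                     // run the command without holding the lock
            Lock.lock();
            bBusy = false;
            Idle.notify_all();
        }
    }
    std::mutex Mutex;
    std::condition_variable WorkAvailable, Idle;
    std::queue<std::function<void()>> Commands;
    bool bBusy = false, bQuit = false;
    std::thread Worker;  // declared last so the members above exist before it starts
};

// Models GCachedScalabilityCVars.MaterialQualityLevel.
std::atomic<int> GCachedMaterialQualityLevel{0};

// Returns the quality level that the queued render command observed.
int ApplyQualityChange(RenderThreadModel& RT, int NewLevel, bool bFlushBeforeAssign) {
    assert(NewLevel >= 0 && NewLevel <= 3);
    std::atomic<int> Observed{-1};
    RT.Enqueue([&Observed] {
        std::this_thread::sleep_for(std::chrono::milliseconds(20));  // simulated caching work
        Observed = GCachedMaterialQualityLevel.load();
    });
    if (bFlushBeforeAssign) RT.Flush();  // the extra flush from the workaround
    GCachedMaterialQualityLevel = NewLevel;
    RT.Flush();                          // Observed must outlive the queued command
    return Observed;
}
```

With bFlushBeforeAssign set to true, the queued command always sees the level that was current when it was enqueued. With it set to false, the result depends on thread timing, which is the overlap the workaround is meant to close.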

[Attachment Removed]

Hi [mention removed]​

Thank you for pointing out this potential fix, but unfortunately adding FlushRenderingCommands() between creating the FGlobalComponentRecreateRenderStateContext and assigning GCachedScalabilityCVars didn’t fix the issue.

I don’t have a repro project to share at the moment.

[Attachment Removed]

Hi,

Apologies for the delay. Looking at the differences here, I see BuildNaniteMaterialBins in your callstack and am wondering if there may be a race condition here due to the way those can be handled on parallel threads. I’m reaching out to a colleague more familiar with what the RHI is doing here to see if there’s a better way to ensure we’re ready to change the scalability.

Can you remind me - does your engine have Nanite customizations that might affect how Nanite CPU-related work is scheduled?

[Attachment Removed]

Hi [mention removed]!

Our engine version is 5.5.4, with many cherry-picked changes from UE5.7 and UE-Main, but none that directly affect the UE scheduling code. We have more active threads than other titles due to third-party library usage.

[Attachment Removed]

Hello,

Just wanted to provide an update: we don’t have a fix for this yet, but we have had other reports of crashes related to changing scalability levels, including this one:

[Content removed]

There is likely an underlying issue and we hope to have some suggestions soon if we can repro in latest.

[Attachment Removed]

Hi [mention removed]!

To fix this issue and to reduce the shader count, we decided to remove the Quality Level node from our materials, as it wasn’t needed in most cases. But if you find a fix, we would be interested in cherry-picking it so that we could use the Quality Level node in the future if needed.

[Attachment Removed]

Hello,

Apologies for the delay, I’m passing this issue to a colleague who is looking into the MaterialQualityLevel crashes in case he has additional thoughts or requests. If we identify a root cause of the issue, we’ll update https://issues.unrealengine.com/issue/UE-355606 with the fix CL.

[Attachment Removed]

Hi Oleksii,

Apologies for the delay, did you manage to get anywhere with creating a repro? We also find this happens non-deterministically, but I’m poking at it at the moment anyway.

Thanks,

Jon

[Attachment Removed]

Hi [mention removed]​

We didn’t have a repro case for this particular issue, but it reproduced in our gameplay testing, which runs the game as a player would but switches quality levels every 5 or 10 seconds. We have a world with a large number of objects and unique material instances, and the majority of them had a quality switch node. In some sessions it didn’t repro in 2-3 hours, in others it reproduced within 15 minutes, so this repro is not reliable.

We removed the Quality Level node completely, and now don’t have this issue.

Best regards,

[Attachment Removed]

Hi Oleksii,

Apologies for another delay in response, we have multiple fires on our end that I’m tied up with. OK, so I’m going to test this with FN’s multiple quality switches and see if I can create a repro somehow; bear with me this week, please.

Do you skip cooking any of the material quality levels in the config for particular platforms? Are any of them crashing more often than others? Just wondering if somehow it’s attempting to use a quality level that doesn’t exist (and hence the texture MIP is skipped).

Thanks,

Jon

[Attachment Removed]

Hi [mention removed]!

No problem. The delay is completely understandable.

We only support Windows PC at the moment, and the fix was manual removal of the Quality Level node from materials by our tech artists.

One issue that had the same behavior, and that I was able to fix, was in our material warmup feature. It loaded materials that GC didn’t track, so they didn’t receive a quality change callback and were rendered with the old uniform buffer.

Later, I checked whether GC tracks all the other materials, and the check passed, so I think the engine issue is a different problem that is not connected to GC tracking.

Best regards,

Oleksii

[Attachment Removed]

Hi [mention removed]​

Yes, it’s connected to the first issue, and we haven’t tried the suggested fix yet. I will try it and post an update. Thank you.

[Attachment Removed]