PSO Precaching Hitches

Hi,

We’re seeing hitches when creating PSOs that the engine thinks should already be precached. This happens specifically on NVIDIA GPUs, and we found that setting r.PSOPrecache.KeepInMemoryUntilUsed solves the issue. However, unless we disable the limits set by r.PSOPrecache.KeepInMemoryGraphicsMaxNum and r.PSOPrecache.KeepInMemoryComputeMaxNum, which seems inadvisable given the comments, hitches can still happen: PSOs bumped from InMemoryPSOIndices in TryAddNewState() are marked as precached but actually are not.

We’ve tried changing the code in ProcessDelayedCleanup() to remove the PSO’s entry from PrecachedPSOInitializerData when ShouldKeepPrecachedPSOsInMemory() is true and EPSOPrecacheStateMask::UsedForRendering is not set, so that the engine will at least try to cache it again later (if you load another map, for example). This seems to work after a quick test, but it requires a component to attempt to trigger the precache again, which isn’t guaranteed to happen before the material is used.

We were wondering if Epic has any advice here, or has been looking into this problem more since adding the ShouldKeepPrecachedPSOsInMemory() path.

Thanks,

Lucas

Steps to Reproduce
Run a packaged build with PSO Precaching enabled on a system with an NVIDIA GPU

Hi Lucas,

Thanks for bringing this up. Do you have more details about how you trigger this particular hitch? What cvars related to PSO bundling and precaching are you using for your packaged build? If you can create a small repro project that demonstrates your issue, that would help a lot. Let me know if this is at all possible.

Hi Tim,

We are basically using the default settings (r.PSOPrecache.Components=1, r.PSOPrecache.ProxyCreationWhenPSOReady=1, r.PSOPrecache.ProxyCreationDelayStrategy=0), except for r.PSOPrecache.KeepInMemoryUntilUsed. Unfortunately, it’s difficult to provide a self-contained repro, since the issue really only manifests with our content and the large number of PSOs in a real project.

Yeah, it’s a hitch as reported by CheckAndUpdateHitchCountStat (and visible in Unreal Insights) where RHICreateComputePipelineState, called from FCompilePipelineStateTask, takes a very long time, over 100ms. This is not a precache, though: it’s the JIT path via FRHICommandListBase::AddDispatchPrerequisite, so it causes a long RenderThread pause. This shouldn’t happen, since r.PSOPrecache.ProxyCreationDelayStrategy should prevent the object from being drawn until the PSO is ready, and in fact the PSO state is EPSOPrecacheResult::Complete, as seen via the marker this code emits to Insights.
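
To make that concrete, here is roughly what the measurement amounts to (an illustrative sketch only, not engine code; the exact RHICreateComputePipelineState signature varies by engine version, and ComputeShader here stands in for whatever arguments it takes):

	// Illustrative timing sketch, not engine code. With a warm driver cache
	// this call should be near-instant; seeing >100ms here while the precache
	// state is EPSOPrecacheResult::Complete is what tells us the driver never
	// actually cached the PSO.
	const double StartSeconds = FPlatformTime::Seconds();
	FComputePipelineStateRHIRef PipelineState = RHICreateComputePipelineState(ComputeShader);
	const double ElapsedMs = (FPlatformTime::Seconds() - StartSeconds) * 1000.0;
	if (ElapsedMs > 100.0)
	{
		UE_LOG(LogRHI, Warning, TEXT("'Precached' compute PSO took %.1fms to create"), ElapsedMs);
	}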

I think Epic has already reproduced this, since it seems to be the reason the r.PSOPrecache.KeepInMemoryUntilUsed cvar was added. Epic’s comment on it is this:

	TEXT("If enabled and if the underlying GPU vendor is NVIDIA, precached PSOs will be kept in memory instead of being deleted immediately after creation, and will only be deleted once they are actually used for rendering.\n")
	TEXT("This can speed up the re-creation of precached PSOs for NVIDIA drivers and avoid small hitches, at the cost of memory.\n")
	TEXT("It's recommended to set r.PSOPrecache.KeepInMemoryGraphicsMaxNum and r.PSOPrecache.KeepInMemoryComputeMaxNum to a non-zero value to ensure the number of in-memory PSOs is bounded."),

which seems to be the exact situation we are hitting: a PSO is precached via the engine’s PSO precaching system but not used for rendering immediately, so the NVIDIA driver does not actually cache it. This violates the assumption the PSO precaching code makes, that simply calling CreatePipelineState and then freeing the PSO is enough to warm it up and cause the driver to cache it. As a result, a future CreatePipelineState call, which the engine expects to be quick since the status is set to EPSOPrecacheResult::Complete, actually takes a long time and causes a bad hitch.
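
Spelled out as a simplified sketch (ours, not verbatim engine code), that assumption is:

	// Simplified sketch of the warm-up assumption (not verbatim engine code).
	// Precache: create the PSO once so the driver compiles it and, supposedly,
	// caches the compiled blob, then free the engine-side object right away.
	FComputePipelineStateRHIRef Precached = RHICreateComputePipelineState(ComputeShader);
	Precached.SafeRelease(); // state is now recorded as EPSOPrecacheResult::Complete

	// First real use, possibly much later. The engine expects a fast
	// driver-cache hit, but the NVIDIA driver appears to drop the blob when a
	// PSO is destroyed before ever being used, so this call recompiles from
	// scratch on the render thread and hitches.
	FComputePipelineStateRHIRef Used = RHICreateComputePipelineState(ComputeShader);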

Setting r.PSOPrecache.KeepInMemoryUntilUsed does solve this issue, as Epic’s comments indicate, so we are definitely hitting this same driver behavior. The problem is that unless we store an unlimited number of not-yet-used PSOs by setting the limits to 0, which the code advises against (and which also means we store many PSOs in memory forever), not-yet-used PSOs are eventually dropped by the code in TPrecachePipelineCacheBase::TryAddNewState when there is no more room in InMemoryPSOIndices:

	// Evict the oldest PSO if we're at maximum capacity.
	if (InMemoryPSOIndices.Num() == MaxInMemoryPSOs)
	{
		uint32 PSOIndex = InMemoryPSOIndices.First();
		InMemoryPSOIndices.PopFirst();

		// Enqueue the corresponding PSO for cleanup.
		PrecachedPSOsToCleanup.Add(PrecachedPSOInitializers[PSOIndex]);
	}

However, this causes another issue: these PSOs’ states are left at EPSOPrecacheResult::Complete, so the engine will never retry caching them if they are requested again, which means the same hitch will occur if the PSO is later used.

Practically, this can occur if, for example, an object in one map precaches a PSO but is never drawn (the object is culled, or is in an area of the map the player doesn’t visit). Later that PSO is evicted due to MaxInMemoryPSOs. The player then travels to another map which uses the same material. The component calls PrecachePSOs(), but nothing is actually precached: the state inside the precaching system persists for the entire process, and the PSO is marked EPSOPrecacheResult::Complete, so the engine thinks it was precached already. However, it is not, and a hitch occurs.
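
In other words, the second map’s request short-circuits on the persisted state, roughly like this (a sketch; CheckPipelineStateInCache is the precache state query as it appears in our engine version, and Initializer stands in for the real initializer):

	// Sketch of why the second map's PrecachePSOs() becomes a no-op.
	EPSOPrecacheResult Result = PipelineStateCache::CheckPipelineStateInCache(Initializer);
	if (Result == EPSOPrecacheResult::Complete)
	{
		// Nothing is re-requested: the per-initializer state persists for the
		// whole process. But this PSO was evicted from InMemoryPSOIndices
		// before it was ever used, so the next real create call compiles from
		// scratch and hitches.
	}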

One possible fix was to have TPrecachePipelineCacheBase::ProcessDelayedCleanup() do something like this:

	if (ShouldKeepPrecachedPSOsInMemory())
	{
		DEC_DWORD_STAT(STAT_InMemoryPrecachedPSOCount);

		// New code: if we are freeing a PSO that was never used for rendering,
		// mark it uncached so we will try again later.
		if (!EnumHasAnyFlags(FindResult->ReadPSOPrecacheState(), EPSOPrecacheStateMask::UsedForRendering))
		{
			PrecachedPSOInitializerData.Remove(InitializerHash);
		}
		// End new code
	}

However, this then also requires additional handling in FMaterialPSORequestManager::MarkCompilationComplete to deal with FMaterialPSOPrecache requests that now contain stale PSOs.
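
Conceptually that handling would be a guard like the following (a hypothetical sketch with our own names, including the RequestPrecache helper; this is not the real FMaterialPSORequestManager code):

	// Hypothetical sketch (our names, not the real engine code): when a
	// material PSO request completes, check whether the precache entry still
	// exists. If ProcessDelayedCleanup() removed it (evicted unused),
	// re-request the precache instead of reporting the PSO as ready.
	if (!PrecachedPSOInitializerData.Contains(InitializerHash))
	{
		RequestPrecache(Initializer); // hypothetical helper to restart the precache
	}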

We were wondering if Epic had looked into this, since this edge case seems to be a bug in the original code. Primarily, we were wondering:

  • Does this fix seem correct? With this change, a second component calling PrecachePSOs() will at least restart the precache process for PSOs discarded (and therefore uncached) due to the MaxInMemoryPSOs limit, since their state is reset to EPSOPrecacheResult::Unknown. This prevents the hitch, because the new component will wait to create its scene proxy while the PSO precaching system makes another precache attempt.
  • How can we handle a transition from EPSOPrecacheResult::Complete back to another state like EPSOPrecacheResult::Unknown? Currently this does not happen in the PSO precaching code, so there is still a path where it hitches: existing components would have already created their scene proxies while the PSO state was EPSOPrecacheResult::Complete. The code treats Complete as a terminal state for the rest of the process lifetime, but with this driver behavior it is not, and PSOs can become “uncached” if they are not used soon enough (see the sketch after this list).
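
For reference, here is the missing transition as a sketch (our pseudocode; FPrecacheEntry and SetPSOPrecacheState are hypothetical names, not engine API):

	// Conceptual sketch of the missing transition (our pseudocode, not engine
	// API). Today the state only moves forward: Unknown -> Active -> Complete,
	// and Complete is terminal for the process lifetime.
	void DemoteEvictedPrecacheState(FPrecacheEntry& Entry)
	{
		// Needed: Complete -> Unknown for entries evicted before first use,
		// so a later PrecachePSOs() retries instead of trusting Complete.
		if (!EnumHasAnyFlags(Entry.ReadPSOPrecacheState(), EPSOPrecacheStateMask::UsedForRendering))
		{
			Entry.SetPSOPrecacheState(EPSOPrecacheResult::Unknown); // hypothetical setter
		}
		// Unsolved part: components that created their scene proxies while the
		// state was Complete never re-check it, so this reset alone cannot
		// cover them.
	}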

Let me know if I can provide any more information. I see CL 43263054 improves this slightly by making the MaxInMemoryPSOs handling smarter, so the limit is less likely to be reached; unfortunately, the problem can still occur if it is.

Hi,

Can you please provide some traces showing the error (with and without keeping the PSOs in memory)? Does it always happen on compute shaders, or also on graphics PSOs?

Would make it easier to understand the problem on our side as well.

Kind regards,

Kenzo

Hi Lucas,

There has been some internal discussion, and we are going to investigate some options to try to keep PSOs from all active components at runtime. However, this will take time to explore, and we expect to have some results in the coming months. In the meantime, would it be acceptable for you to configure your r.PSOPrecache.KeepInMemoryGraphicsMaxNum and r.PSOPrecache.KeepInMemoryComputeMaxNum to a usable level as a workaround? I know this is not an ideal solution, but this is currently our best option. Let me know what you think.

We definitely do not recommend lowering the limits for those cvars. Do you have CL 43263054 integrated into your build? It introduces a new option (r.PSOPrecache.KeepInMemoryUntilUsed=2), which is soon going to be the new default for how precached PSOs are stored at runtime. I recommend you use that one as well.

Okay, that sounds good. Then I would say to try those suggestions and see if they help you reduce the number of hitches. If you are still having trouble, feel free to reach out again, and we can take another look. Let me know if anything is still unclear at this point.

So far it seems to be compute PSOs used for Nanite Lumen cards, but I don’t see why the issue would be specific to those; it’s just the repro case we’ve seen. Unfortunately, I’m not sure I can share the entire trace since it has a lot of info in it, but here’s a screenshot of the hitching part. In both cases the same code runs, but when all the PSOs are kept in memory, RHICreateComputePipelineState returns quickly and there is no hitch.

You can see from the PSOPrecache: Precached marker that the engine thinks the PSO should be precached:

[Image Removed]

We have not, but it definitely seems like a good idea. I agree this path seems better since it avoids unnecessary PSOs consuming the limited slots.

> There has been some internal discussion, and we are going to investigate some options to try to keep PSOs from all active components at runtime

That sounds great and would solve the issues with our local change for resetting the cache state. If there were also a path to mark PSOs that are still unused but evicted (by hitting the r.PSOPrecache.KeepInMemory*MaxNum limits while not being used by any currently loaded component) as uncached again, I think that would address all the edge cases, since the system could assume the precache would be triggered again the next time a component using them was created.

Thanks,

Lucas