During our ‘Hitches Hunt (Epic tm)’ on our road to shipping, we found out that PSO creation could be extremely costly on some PC graphics cards. Using Unreal Insights, we would see 100-400ms on some PSO creations, even though their shaders had already been compiled and the PSOs only needed to be re-created.
Epic implemented a PSO cache system to side-step the issue (enabled with ‘r.PSOPrecache.KeepInMemoryUntilUsed’), but this unfortunately wasn’t enough for us. Note: we are on 5.5 but forward-integrated some of the 5.6 PSO improvements.
The caching system keeps PSOs alive until they are used once, at which point they are handled just like before. This meant that PSOs could be dropped during traversal and need a slow recreation when heading back, while never-used PSOs would still be kept loaded. We had a similar issue when pausing our world rendering during a menu screen. Additionally, it was causing some RHIThread hitches when a whole lot of PSOs were freed in the same frame (we have a lot of PSOs).
My solution to this was to change the PSO caching system to instead keep alive all ‘active’ + ‘X last used’ PSOs. This allows controlling the desired memory overhead of keeping them alive, keeping the most relevant PSOs in memory and dropping the unused ones. This can be achieved by tracking the last time a PSO was accessed in ‘FPipelineState’, similar to how it’s already done when ‘PSO_TRACK_CACHE_STATS’ is enabled.
I think a vanilla implementation of this would be a good thing, until the PSO creation time is solved at the driver level.
This is indeed a pretty big problem with NVIDIA drivers right now. Upcoming drivers have improved it, but I can still see 40+ msec hitches - not the 200+ msec anymore, but still too much to have at runtime, so keeping the compiled PSOs in memory is still needed, and I have a feeling it will be for quite some time until they have completely fixed this problem. On AMD and Intel we don’t run into this problem.
The r.PSOPrecache.KeepInMemoryUntilUsed was indeed a stopgap solution added for Fortnite at some point when this issue got introduced in the driver. What you suggest makes sense, but I guess the current code assumes used PSOs are kept alive by the TSharedPipelineStateCache, which keeps PSOs used during the last 60 seconds alive (CVarPSOEvictionTime). Ideally I would like to make some time to keep all required PSOs from all loaded components alive in the PSO precache caches. This would work a lot better, wouldn’t require manual tweaking of r.PSOPrecache.KeepInMemoryGraphicsMaxNum and r.PSOPrecache.KeepInMemoryComputeMaxNum, and should reduce the memory footprint to what’s actually needed as well.
Can you easily share a GitHub pull request with the changes, so we can see if we can integrate this in the short term until we have time to implement the proper solution?
I can create a GitHub account and see about sharing it (after I’m given access to UE with it).
The code could be improved a bit, since we worked under the constraint of changing as little as possible to facilitate integrations, but it’s a good baseline.
Thank you - I don’t know the flow for generating pull requests, but I have received a few in the past already. Perhaps [mention removed] knows more about how to do this.
And here we can see how it behaves with the cache set to 8 unused PSOs (we assigned a low number to test the cleanup in the vehicle sample, which has few unused PSOs).
Sorry for the late reply, but Epic was on Winter Break.
I am currently working on refcounting the precached PSOs in the PSOPrecacheCache, keeping all the PSOs alive for the active materials. If a material is destroyed, then all the precached PSOs for that material get their refcount reduced and are potentially released. This is working fine at first sight and gives promising numbers in CitySample and FN, but it would of course be good to see how this behaves in other games as well. We will get this code in for 5.8 and ideally can make this the default, then try to remove the current in-memory solution.
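The material-based refcounting described above could look roughly like the following. This is a sketch of my own, not the actual 5.8 implementation; all class and member names are hypothetical:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Per-material refcounting of precached PSOs: a PSO stays alive while at
// least one active material still references it.
class PrecachedPsoRefTracker
{
public:
    // A material starts using this precached PSO: bump the refcount.
    void AddRef(uint64_t PsoId) { ++RefCounts[PsoId]; }

    // A material is destroyed: drop the refcount, and release the PSO
    // when it hits zero. Returns true if the PSO was released.
    bool Release(uint64_t PsoId)
    {
        auto It = RefCounts.find(PsoId);
        if (It == RefCounts.end())
        {
            return false;
        }
        if (--It->second == 0)
        {
            RefCounts.erase(It); // real code would free the driver object here
            return true;
        }
        return false;
    }

    bool IsAlive(uint64_t PsoId) const { return RefCounts.count(PsoId) != 0; }

private:
    std::unordered_map<uint64_t, int32_t> RefCounts;
};
```

Note that PSOs can be shared across materials, so a destruction only releases a PSO once the last referencing material is gone.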
I am not keeping global graphics and compute PSOs in memory, and I also don’t keep Slate PSOs in memory. This is because we have a LOT of these and the actual usage percentage is pretty low. For compute I see a pretty good in-memory vs. actual usage ratio (a 1.2 multiplier for in-memory precached PSOs vs. actually used compute PSOs which have been precached). For graphics the multiplier is around 2-2.5, but that’s probably because we have a lot of inactive components which can be rendered at any point, like particles and such (and also a lot more possible graphics PSO permutations).
Does this solution make sense or do you still think your changes would work better for your game?
I hope this solution will work out of the box for all/most licensees and fix the runtime driver cache creation hitches.
That would definitely be an improvement, but I do not think this is the best approach for open-world games, given the cost of creating PSOs on NVIDIA (even ones already in the driver cache); we’re trying to avoid the creation cost on materials that could be destroyed then reloaded.
When we are travelling across biomes and then coming back to a visited area, the PSOs should still be loaded if possible, even if the refcount dropped to 0 at some point. In fact, on systems with a higher amount of RAM, we increase the PSO pool size by a good amount to keep most of the game’s used PSOs in memory without ever reloading them a second time. This allows us to (1.) set a pool size that’s respected and (2.) drop never-used PSOs automatically (since they are never accessed, they will be the first to be kicked out).
There’s a meeting being planned with you guys next week, we could discuss this further then.
With the code I am working on right now, this should work normally. If you go back to a previously visited area, then these components will precache again, and if those PSOs are not in the PSO precache cache yet, they will be loaded from the driver cache again asynchronously - it’s refcount based, so if the refcount is 0 then it will async create the PSO again. So ideally all the runtime material-based PSOs should be found in the PSO precache cache if everything is set up and precached correctly.
I guess this should work for an open world game right? Unless I am not understanding something correctly and then we can discuss this next week.
That’s a good point; our initial work was done before fixing a lot of the missing-precaching issues.
Still, having a large amount of PSO creation during streaming will steal CPU resources from other async tasks that could use them, when each PSO can take 20-400ms to create (I think it’s better with the latest NVIDIA drivers, but still far from negligible). Maybe the system you’re working on could be hybrid? When you are about to drop a PSO with refcount=0, timestamp it and move it to a pool cache instead, and evict the oldest entries if there isn’t enough room.
Hmm - interesting point. Using a keep-alive cache next to the refcounted cache could be an option. But the idea is of course that the runtime cache keeps all PSOs used in the last 60 seconds alive (the default value of the cvar). It might be easier to extend this timer and also put a max amount on the runtime-cached PSOs. Not sure which is better. The precache PSO cache would then only be used to cache all possibly required PSOs from the active components, instead of also trying to keep a certain amount of extra PSOs alive. I will have a think about this.
On average, creating a PSO which is in the driver cache ‘only’ takes around 10 msec or less I think - which is still too slow to do on the foreground rendering threads, but not super bad for the background tasks. The problem is that there are PSOs which can take 100+ msec due to contention with actually-new compiling PSOs inside the driver, which can then cause bad hitches (this has been improved in the latest preview driver which they shared with me, btw). So the average time is still ‘okay’ for warming up the cache. But of course, not creating them again when there is enough memory available will always be better!
> It might be easier to extend this timer and also put a max amount on the runtime cached PSOs
The problem with this is that we keep all the never-used PSOs alive for nothing (the ones we created just in case, but which end up not being a needed combination). A pool with timestamps allows giving priority to the ones we know are needed. In our case, I timestamp when the PSO is created, then again every time it is requested.
I just realized that ‘RefCount + TimeStamped Pool’ in your situation would keep alive unneeded PSOs too, if you don’t have something in place to detect which ones were requested at least once.
The goal of course is that PSO precaching only precaches PSOs which could potentially be used in one of the upcoming frames, depending on what’s visible and rendered. In 5.8 I made a lot of changes and optimizations to reduce the set of unused and unrequired precached PSOs. There are still too many PSOs, but that’s mostly coming from shadow passes (because culling is tighter on object size) and particles, because those are precached by default at the system level (it’s possible to precache them at the component level, and this reduces the set of PSOs to precache quite a bit). In FN we now need around 2K in-memory precached graphics PSOs, and of these 1K are actively being used at runtime - for compute we keep around 1.1K in memory and use 800 of these. So the overhead isn’t that bad.