[5.6] FCompilePipelineStateTask leak leads to GPipelinePrecompileTasksInFlight never decrementing

Hi,

We are using this guide: https://dev.epicgames.com/documentation/en-us/unreal-engine/manually-creating-bundled-pso-caches-in-unreal-engine

After upgrading from 5.4 to 5.6, our shader warmup system and screen still worked (because the bundled cache only contained compute PSOs, from Niagara), but it became stuck once we started feeding the SPC (stable pipeline cache) back in during the cook, which added graphics PSOs.

Looking into it, it seems that PipelineStateCache.cpp's InternalCreateGraphicsPipelineState creates an FCompilePipelineStateTask, which increments GPipelinePrecompileTasksInFlight in its constructor. However, the task is never cleaned up, so its destructor is never called and GPipelinePrecompileTasksInFlight is never decremented to match the constructor's increment. As a result, FShaderPipelineCacheTask::ReadyForPrecompile always returns false once more than 10 of these tasks have leaked.

I could have missed it, but I don't think these tasks are cleaned up in any way: the CachedState simply stays in GGraphicsPipelineCache, and the completion states that were initialized through FPipelineStateAsync::SetPrecompileTask are never cleaned up either.

An easy fix, though likely not the best one, would be to add a cleanup step to InternalCreateGraphicsPipelineState's lambda that calls ThreadPoolTask.Reset().
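
To make the lifetime issue concrete, here's a small standalone illustration (simplified, with borrowed names; this is not the actual PipelineStateCache.cpp code, just the same pattern): the counter is tied to the task object's lifetime, so if the owning lambda never destroys the task, the increment is never paid back, and resetting the owning pointer at the end of the lambda is what restores the balance.

```cpp
// Standalone illustration (not engine code; names borrowed from PipelineStateCache.cpp
// for readability). The in-flight counter is bound to the task object's lifetime.
#include <atomic>
#include <cassert>
#include <memory>

static std::atomic<int> GPipelinePrecompileTasksInFlight{0};

struct FCompilePipelineStateTaskLike
{
	FCompilePipelineStateTaskLike()  { ++GPipelinePrecompileTasksInFlight; } // ctor increments the in-flight count
	~FCompilePipelineStateTaskLike() { --GPipelinePrecompileTasksInFlight; } // dtor is the only thing that decrements it
	void DoTask() { /* pretend to compile a PSO */ }
};

int main()
{
	// Thread pool path as it behaves for us in 5.6: the lambda owns the task and runs it,
	// but never destroys it, so the count stays incremented long after the work is done.
	auto Stuck = [Task = std::make_unique<FCompilePipelineStateTaskLike>()]
	{
		Task->DoTask();
		// no Task.reset() here
	};
	Stuck();
	assert(GPipelinePrecompileTasksInFlight == 1); // still counted as "in flight"

	// Suggested workaround: reset the owning pointer at the end of the lambda so the
	// destructor runs and pays the increment back, like the task graph path does.
	auto Fixed = [Task = std::make_unique<FCompilePipelineStateTaskLike>()]() mutable
	{
		Task->DoTask();
		Task.reset(); // dtor runs here -> count goes back down
	};
	Fixed();
	assert(GPipelinePrecompileTasksInFlight == 1); // only the first, leaked task remains counted
}
```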

Thanks,

JB.

Steps to Reproduce

Hi Jean-Baptiste,

That’s concerning to hear. We have not run into an issue like this yet, but verifying if we have a bug here would be good. I know it might be tough to do, but would it be possible for you to create a small repro project or give me some repro steps? That way, it will be much more likely that we can find a fix for you.

Cheers,

Tim

We had the same problem while trying to upgrade to 5.6, and adding ThreadPoolTask.Reset() in the FPSOPrecacheAsyncTask lambdas, as suggested, fixed it for us. We have a loading screen (using DefaultGameMoviePlayer) at the very start that waits on FShaderPipelineCache::NumPrecompilesRemaining() reaching 0, and it never completed.

Like Jean-Baptiste said, FCompilePipelineStateTasks that are wrapped in an FPSOPrecacheAsyncTask and launched on the PSO thread pool are never destroyed, so GPipelinePrecompileTasksInFlight is never decremented. FCompilePipelineStateTasks launched via the task graph are fine, because they are destroyed as soon as DoTask() finishes. Adding ThreadPoolTask.Reset() at the end of the lambda makes the thread pool path behave the same as the task graph path.
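
For reference, the shape of the change we made is roughly this (a sketch rather than the verbatim engine code; I'm assuming here that ThreadPoolTask is the TUniquePtr owning the FCompilePipelineStateTask, captured by the FPSOPrecacheAsyncTask lambda):

```cpp
// Sketch only, not verbatim engine code: the lambda handed to the PSO precompile
// thread pool, where ThreadPoolTask owns the FCompilePipelineStateTask.
[ThreadPoolTask = MoveTemp(ThreadPoolTask)]() mutable
{
	// ... existing body that runs the wrapped FCompilePipelineStateTask ...

	// Added: destroy the task once it has run, so its destructor decrements
	// GPipelinePrecompileTasksInFlight, matching the task graph path where the
	// task object is destroyed right after DoTask() finishes.
	ThreadPoolTask.Reset();
};
```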

Okay, it sounds like this might be an issue we have not caught yet. We will investigate this in the coming days and look for a fix. Do you have an easy way to reproduce this leaking behavior? I imagine it is easier to trigger it on a larger project, but for debugging purposes, it would be much easier to isolate the issue to a small project.

Hi,

Sorry, I can't really spare the time for a sample; I fixed it on our side with what I suggested in the OP (the explicit Reset call).

Since the issue is specifically in InternalCreateGraphicsPipelineState, you need to follow the bundled PSO cache guide (link in the OP); otherwise you'll only have compute PSOs (from Niagara) in the bundled cache.

It's probably easily doable with CitySample: log the PSO records of a flyby session to get enough usages, generate the SPC from the log, feed the SPC into the next cook, and you'll likely end up with a few thousand graphics PSOs, which should hopefully repro the issue.

It's likely not hard to repro; we don't do anything fancy other than following this guide. We do have a "shader warmup screen", but it just calls `FShaderPipelineCache::SetBatchMode(FShaderPipelineCache::BatchMode::Fast);` and waits for `FShaderPipelineCache::NumPrecompilesRemaining()` to reach 0.
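
For what it's worth, the whole screen boils down to something like this (a simplified sketch of our warmup logic, not drop-in code; the movie player wiring and timeouts are omitted):

```cpp
// Simplified sketch of our shader warmup screen. Assumes RenderCore's
// ShaderPipelineCache.h; movie player wiring and timeouts are omitted.
#include "ShaderPipelineCache.h"

void StartShaderWarmup()
{
	// Precompile the bundled PSO cache as fast as possible while the screen is up.
	FShaderPipelineCache::SetBatchMode(FShaderPipelineCache::BatchMode::Fast);
}

bool IsShaderWarmupDone()
{
	// Keep the screen up until everything in the bundled cache has been precompiled.
	// With the leak described above, this never reaches 0 once graphics PSO tasks
	// start going through the thread pool path.
	return FShaderPipelineCache::NumPrecompilesRemaining() == 0;
}
```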

But yeah, without that screen you might not notice that GPipelinePrecompileTasksInFlight is stuck and preventing further batches of PSOs from being processed; you might need a debugger to check.

For reference, our bundled cache typically has approximately 18k graphics PSOs and 4k compute PSOs.

Sorry for not being more helpful here; I wanted to at least report the issue instead of just fixing it on our side.

Sure, that makes sense. Thanks for the extra info. I will investigate this some more using CitySample and get back to you with any updates. However, investigating this issue will take some time due to the complexity of reproducing it on a content example such as CitySample.