Runtime generation of PSO leads in some cases to crashes

Hey, we are currently facing some issues with the runtime PSO generation.

  1. We have one case where we run into an OOM crash when rendering a 16k image while PSOs are generated simultaneously
  2. And we have one case in which we run into a GameThread timeout while RTPSOs are generated

In both cases, once the PSOs have been generated the crashes no longer occur, and the application runs well without the additional overhead of generating them at runtime.

We are still working on precaching them in our build process, but we are not there yet.

Are there settings that could help us mitigate those crashes in the meantime?

Best Regards,

Dominikus

Hi, thanks for reaching out.

>> We have one case where we run into an OOM-Crash when rendering a 16k image while PSOs are generated simultaneously

There is a known issue with Nvidia hardware in UE5.5, where created PSOs are not released, leading to an overallocation of heap memory. This was fixed in UE5.6, but it should be possible to manually apply a patch for UE5.5. This may help prevent an OOM crash.

There are also some CVars available which help reduce memory pressure during PSO generation as described here:

r.PSOPrecache.KeepInMemoryUntilUsed (default set to 0)

r.PSOPrecache.KeepInMemoryGraphicsMaxNum (default set to 2000)

r.PSOPrecache.KeepInMemoryComputeMaxNum (default set to 200)
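
In case it helps with testing, here is a minimal sketch of how these three CVars could be written out as an ini fragment. The CVar names and default values are the ones listed above. The `[SystemSettings]` section (for example in the project's DefaultEngine.ini) and the helper name `render_ini_section` are assumptions for illustration, and the values are starting points to tune rather than recommendations.

```python
# CVar names and defaults are taken from the list above.
# The section name passed in below is an assumption about where the project sets CVars.
PSO_PRECACHE_CVARS = {
    "r.PSOPrecache.KeepInMemoryUntilUsed": 0,
    "r.PSOPrecache.KeepInMemoryGraphicsMaxNum": 2000,
    "r.PSOPrecache.KeepInMemoryComputeMaxNum": 200,
}


def render_ini_section(section, cvars):
    """Return an ini fragment with one 'name=value' line per CVar."""
    lines = [f"[{section}]"]
    lines += [f"{name}={value}" for name, value in cvars.items()]
    return "\n".join(lines) + "\n"


print(render_ini_section("SystemSettings", PSO_PRECACHE_CVARS), end="")
```

Lowering the two MaxNum values in the dictionary and re-rendering is a quick way to produce variants for an A/B memory test.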

These UDN cases may also be of interest:

[Content removed]

[Content removed]

>> And we have one case in which we run into a GameThread timeout while RTPSOs are generated

You can try setting r.ShaderPipelineCache.MaxPrecompileTime to a non-zero value to switch PSO precompilation to background processing. If that doesn’t work, would it be possible to provide more details, such as a trace from Unreal Insights or a crash log?

Thanks,

Sam

Hi Sam,

I’m from Dominikus’ team. To make our reproducer actually reproduce the issue, I need to clear any already-compiled RTPSOs.

Starting with -clearPSOdrivercache only enforces recreation of compute and graphics PSOs. Setting r.ShaderPipelineCache.UserCacheUnusedElementRetainDays 0 didn’t work either.

Are there some files on disk (like the PSO driver cache files) or am I missing a specific CVar to do that?

Being able to quickly clear RTPSOs too would help testing out the CVars you mentioned in certain circumstances.

Cheers,

Simon

Hi,

Using -clearPSOdrivercache should do the job, but you can also manually clear the PSO cache by deleting the DX cache in C:\Users\[YourUsername]\AppData\LocalLow\NVIDIA\PerDriverVersion\DXCache (for NVIDIA drivers). I have checked the documentation, but could not find a specific CVar that would clear the RTPSOs.
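
If the cache needs to be cleared repeatedly between test runs, the manual step could be scripted roughly as below. This is only a sketch: the directory is the NVIDIA path mentioned above, resolved for the current user, and the function names are made up for illustration. It assumes the application is closed (the driver may otherwise hold files open) and it has not been verified that this also discards compiled RTPSOs.

```python
import shutil
from pathlib import Path


def default_nvidia_dx_cache() -> Path:
    """NVIDIA DX cache location from the reply above, for the current user."""
    return (Path.home() / "AppData" / "LocalLow" / "NVIDIA"
            / "PerDriverVersion" / "DXCache")


def clear_cache_dir(cache_dir: Path) -> int:
    """Delete everything inside cache_dir but keep the directory itself.

    Returns the number of top-level entries removed (0 if the directory
    does not exist).
    """
    if not cache_dir.is_dir():
        return 0
    removed = 0
    for entry in cache_dir.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)   # remove a whole subdirectory tree
        else:
            entry.unlink()         # remove a single file or symlink
        removed += 1
    return removed
```

Calling `clear_cache_dir(default_nvidia_dx_cache())` before each reproducer run would then start from a clean cache.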

Have you had a chance yet to test if the patch I mentioned improves stability (in terms of OOM crashes)?

Thanks,

Sam

Hey Sam,

thanks for your answers!

We don’t build the engine from source, so we are not able to apply the patch.

As Simon wrote we are currently testing the CVars you mentioned.

Once we get some results I’ll let you know.

Thanks,

Dominikus

Thanks for the update, interested to know how it goes.

Best regards,

Sam

Hi Sam,

thanks for the help so far!

I did some tests with a reproducer provoking a GameThread timeout while compiling RTPSOs, setting the CVar r.ShaderPipelineCache.MaxPrecompileTime to 200. I reckon it’s in milliseconds. Sadly I didn’t see any improvement to the issue since the thread still times out. I’m not sure if there are errors in the setup or if there is a different reason why it’s not working as expected in this case.

I browsed the engine source code and expected at least to see a log line informing me about the switch to background generation.

For completeness’ sake I’ll upload a log and lay out what I tried to achieve, with some context for easier understanding:

The scenario I wanted to recreate is one where some instances have no cached PSOs and not all possible PSOs are packaged either. So I deleted all cache files in the directory you mentioned to start with a clean slate.

This is a build without any packaged PSOs.

In this particular case, just prior to the start of the PSO generation, the application loads a level with some lights and a PPV, setting the scene up for a shot using Lumen.

Maybe you can see something of interest in the attached logs.

The tests regarding the memory issues are still ongoing.

Cheers,

Simon

Hi,

thanks for including the error log, that’s definitely helpful. The value of r.ShaderPipelineCache.MaxPrecompileTime is given in seconds and is the time before switching to background compilation. You could try a value of 5 seconds. The recommendation is to use fast compilation while displaying a loading screen and switch to background compilation (which takes more time, but shouldn’t ) in menus.

It’s also worth testing whether increasing the timeout interval helps avoid timeouts caused by the render thread taking too long. This can be done by setting g.TimeoutForBlockOnRenderFence (the number of milliseconds the game thread should wait before failing when waiting on a render thread fence) to 9999999 (the default value is 120000). If that works, you can experiment with lower values.
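
For quick experiments without changing config files, the two values suggested above could be passed at launch. The sketch below only builds the argument string. It assumes the packaged build accepts the usual `-ExecCmds="cmd1,cmd2"` launch argument, and since those commands run after engine startup they may be applied later than an ini setting would be. The helper name is hypothetical.

```python
def exec_cmds_arg(cvars):
    """Build an -ExecCmds launch argument from a dict of CVar name -> value."""
    joined = ",".join(f"{name} {value}" for name, value in cvars.items())
    return f'-ExecCmds="{joined}"'


# Values from the reply above: 5 seconds before switching to background
# compilation, and a very large render fence timeout in milliseconds.
SUGGESTED = {
    "r.ShaderPipelineCache.MaxPrecompileTime": 5,
    "g.TimeoutForBlockOnRenderFence": 9999999,
}

print(exec_cmds_arg(SUGGESTED))
```

Once the timeout value is confirmed to help, it is probably cleaner to move it into the project's config and lower it step by step.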

I noticed from the logs that the application is run in a server environment with Nvidia T4 GPUs. If each instance has the exact same GPU hardware and driver installed, one optimization may be to generate the PSOs just once and copy the PSO cache to the other instances (instead of each instance compiling the PSOs from scratch, which is time-consuming and prone to timeouts).

Hopefully this is helpful, please let me know how it goes.

Sam

Thanks, looking forward to your results.

>> We’re in the process of extending our pipeline to precompile and package the PSO which would achieve a similar result, right?

That should work if both the GPU architecture and driver version are identical between the machine used to precompile the PSOs and the machine running the application (if not, it will trigger a recompilation).
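
A small check like the following could guard against shipping a cache to a mismatched machine. It is a sketch under assumptions: it compares the GPU name and driver version as reported by `nvidia-smi --query-gpu=name,driver_version --format=csv,noheader`, using the GPU name as a stand-in for the architecture, and the function names are made up for illustration.

```python
import subprocess


def parse_fingerprint(csv_line):
    """Split one 'name, driver_version' line into a (name, driver) tuple."""
    name, driver = csv_line.rsplit(",", 1)
    return name.strip(), driver.strip()


def query_local_fingerprint():
    """Ask nvidia-smi for the first GPU's name and driver version."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_fingerprint(out.splitlines()[0])


def cache_is_portable(build_line, target_line):
    """True if both machines report the same GPU name and driver version."""
    return parse_fingerprint(build_line) == parse_fingerprint(target_line)
```

For example, `cache_is_portable("Tesla T4, 535.104.05", "Tesla T4, 535.104.05")` returns True, while a differing driver version returns False and would indicate that a recompilation is to be expected.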

Thanks,

Sam

Hi Sam,

I haven’t had much time to continue the tests, but increasing the timeout with g.TimeoutForBlockOnRenderFence alleviated most of the stability issues we’re facing.

I’m still struggling to understand, though, why r.ShaderPipelineCache.MaxPrecompileTime didn’t have any effect in preventing the timeout. Are there any CVars directly affecting the MaxPrecompileTime? Maybe I’m missing something very basic.

Cheers,

Simon

Hi,

>> increasing the timeout with g.TimeoutForBlockOnRenderFence alleviated most of the stability issues we’re facing.

Glad to hear that helped.

>> Are there any CVars directly affecting the MaxPrecompileTime?

I dug a bit deeper into this and found that there are actually two additional CVars that need to be set when r.ShaderPipelineCache.MaxPrecompileTime is greater than 0. In that case, the number of PSOs precompiled per frame will change, and will be affected by the following two CVars:

  • r.ShaderPipelineCache.BackgroundBatchSize: the number of PSOs to compile in a single batch per frame in Background mode. Defaults to a maximum of 1 per frame; due to asynchronous file IO it is less in practice.
  • r.ShaderPipelineCache.BackgroundBatchTime: the target time in milliseconds to spend precompiling each frame when in the background. When greater than 0, the engine determines the number of PSOs to precompile based on the single-frame latency after entering Background mode. You can try setting this to 10.
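
To get a feel for why the batch size matters, here is a back-of-the-envelope estimate of the best-case time to work through a PSO backlog in Background mode. It is plain arithmetic, not engine behaviour: it assumes exactly BackgroundBatchSize PSOs finish every frame, which per the description above is an upper bound, and the PSO count and frame rate are made-up example inputs.

```python
import math


def min_background_drain_seconds(pso_count, batch_size=1, fps=60.0):
    """Lower bound on the time to compile pso_count PSOs at batch_size per frame."""
    frames = math.ceil(pso_count / batch_size)  # frames needed at best
    return frames / fps


# Hypothetical backlog of 5000 PSOs at 60 frames per second.
print(round(min_background_drain_seconds(5000, batch_size=1), 1))   # 83.3
print(round(min_background_drain_seconds(5000, batch_size=10), 1))  # 8.3
```

With the default batch size of 1, even the best case for this example backlog is well over a minute, so raising the batch size or setting a batch time budget shortens the window in which PSOs are still missing.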

You can find a bit more info on these CVars in this blog.

Hope that helps,

Sam

Hi Sam,

I dug around in the engine code some more and found my growing suspicion confirmed: the mentioned CVars do not affect the asynchronous on-demand compilation of RTPSOs.

Nonetheless, in the meantime we rolled out the packaging of precompiled PSOs and the increased game thread timeout value which greatly reduced instabilities in our product.

From my side this ticket can be closed. Thanks for your help.

Cheers,

Simon

Hi Sam,

thanks a lot for the comprehensive answer. I’ll test this and let you know what I find.

>> I noticed from the logs that the application is run in a server environment with Nvidia T4 GPUs. If each instance has the exact same GPU hardware and driver installed, one optimization may be to generate the PSOs just once and copying the PSO cache to other instances (instead of each instance compiling the PSOs from scratch, which is time consuming and prone to timeouts).

We’re in the process of extending our pipeline to precompile and package the PSO which would achieve a similar result, right?

Cheers,

Simon

Hi,

>> we rolled out the packaging of precompiled PSOs and the increased game thread timeout value which greatly reduced instabilities in our product.

Great! In that case I will close this ticket, but feel free to open a new one when you encounter more issues.

Best regards,

Sam