Render Thread Advice

I’m having a really hard time with our render thread. The GPU and main thread could easily give us a lower frame time, but the render thread is consistently lagging behind. In Unreal Insights, a lot of the concepts are abstract and it’s difficult to cleanly understand what needs to be optimized, so I was hoping someone could review our stat profile and give some better insight into what kinds of things are taking a lot of time on our render thread. From what I can see there’s a lot of task waiting, but I have no idea which task is being waited on because there are many tasks, varying from occlusion culling to HISM Gather Dynamic Elements. Our draw calls don’t seem unreasonable; a lot of the world is instanced. We aren’t using Nanite or VSMs, so everything here is just traditional LODs and shadow maps.

On our previous title, which shipped on 4.27, I was definitely able to get better render thread performance out of similar scene complexity, so I’m hoping there’s either been a regression here or I’m missing something.

[Image Removed]

Thanks,

Brenden

Hello,

Thank you for reaching out.

I’ve been assigned this issue, and we will be looking into your Render Thread performance.

Hello,

The Render Thread workload is spread across multiple worker threads. The parallelization of the workload has been significantly improved with UE 5.5 and UE 5.6.

Based on the occlusion cull waits in the trace, can you please let us know what the GPU frame time was here? Occlusion visibility needs to wait on results from the previous frame, which usually leads to inflated view visibility times on the frame that waits.

Does your scene composition primarily consist of objects with Static mobility or other mobility types? Draw command caching can help reduce Render Thread times.
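As a quick sanity check (a sketch, not a prescribed workflow), you can A/B the cached path from the console while watching the draw call stats:

```ini
; Toggle cached mesh draw commands while running `stat scenerendering`.
; If the draw call counts and render thread time don't change between 0 and 1,
; little or nothing is being cached for this scene.
r.MeshDrawCommands.UseCachedCommands=1
```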

For more information, please see this documentation:

https://dev.epicgames.com/documentation/en-us/unreal-engine/mesh-drawing-pipeline-in-unreal-engine

Please let us know if this helps.

Yes, I’m aware improvements were made in 5.5 and 5.6. Perhaps if you look at the trace you could advise on specific changes that would be worth our time to backport? We have cherry-picked quite a few things from 5.5 and 5.6 already, but mostly relating to streaming performance via async creation of physics colliders and navmesh convex rasterization.

I attached the stat profile in my previous submission, which contains all the GPU (and occlusion) timings. Were you able to open and look at it? I can tell you that the occlusion timings on the card are extremely minimal and the render thread is definitely not waiting on the GPU here. It looks like it’s waiting on foreground tasks.

Almost everything in the scene is Static mobility. It seems HISMs always go through the dynamic path though, probably because of clustered LODs and occlusion tests? I have tried r.MeshDrawCommands.UseCachedCommands 0 and performance is basically identical, so it would seem that basically zero draw commands are being cached.

Thanks,

Brenden

Some questions that can help give us more context on the trace:

* What streaming system are you using for your world?

* Are you batching your HISMs in some way that results in multiple HISM components for the same mesh?

* Based on the number of GDME calls, have you tried tuning “r.Visibility.DynamicMeshElements.NumMainViewTasks” to better balance the workloads across the threads? (See the example snippet after this list.)

* How many cores does the PC have for the trace you attached? Are you modifying the number of worker threads for the task graph in any way?
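To give an idea (the value here is only an example; the right number depends on your core count and how much GDME work the scene generates), the tuning could look like this in ConsoleVariables.ini:

```ini
; Example value only: spread Gather Dynamic Mesh Elements (GDME) work across
; more parallel tasks. The CVar defaults to 4; raising it can help on
; high-core-count machines at the cost of extra task scheduling overhead.
r.Visibility.DynamicMeshElements.NumMainViewTasks=8
```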

- World Partition.

- World Partition breaks up HISMs by level, so there are probably a few HISM components per mesh due to multiple levels being loaded around the player.

- I have not tried this, but I will look into it, thanks.

- A 64-core Threadripper. I ran this on my 16-core 7950X as well and the results were similar.

r.Visibility.DynamicMeshElements.NumMainViewTasks had virtually no effect. It defaults to 4. Setting it to 8 may have dropped things *ever* so slightly, but there was no noticeable change.

One change that did make a difference was disabling HZB occlusion with r.HZBOcclusion 0. At first performance got way worse, but when we also set “r.AllowSubPrimitiveQueries 0”, performance improved to actually be a couple of ms better than when HZB was enabled.
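For clarity, the exact combination we ended up with was:

```ini
; Disable HZB occlusion and fall back to hardware occlusion queries, but
; without HISM sub-primitive (per-node) queries. With both set, our render
; thread was a couple of ms faster than with HZB enabled.
r.HZBOcclusion=0
r.AllowSubPrimitiveQueries=0
```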

We’re still render thread bound though; lowering the GPU resolution scale makes no difference to the framerate, and the main thread is still issuing waits.

Given the number of HISM components being considered for render, you might want to consider tuning your World Partition cell sizes to lower those HISM counts. Based on the trace, these mostly look like Instanced Foliage actors.

If not World Partition cell size, consider setting up cull distances on the foliage types and potentially modifying foliage.CullDistanceScale and grass.CullDistanceScale as well.
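As a starting point (the values below are placeholders for illustration, not recommendations), the global scales could be set like this:

```ini
; Placeholder values only: scale foliage/grass cull distances globally.
; Both CVars default to 1.0; values below 1.0 cull instances closer to the
; camera, reducing how many HISM nodes visibility has to consider.
foliage.CullDistanceScale=0.8
grass.CullDistanceScale=0.8
```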

HISM can schedule sub-primitive queries to do its node-based culling, so that is what you are disabling by setting r.AllowSubPrimitiveQueries to 0. However, using the suggestion above can result in fewer queries, since the culling efficiency could change.

There are additional culling CVars you can try modifying for your use case (see the example snippet after this list):

* r.Visibility.OcclusionCull.MaxQueriesPerTask
* r.Visibility.FrustumCull.UseOctree
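A sketch of how these might be set (the values are placeholders, and the behavior described in the comments is inferred from the CVar names, so please verify against your engine version):

```ini
; Placeholder values only. Per its name, MaxQueriesPerTask controls how many
; occlusion queries are batched into each task, so smaller batches spread the
; work across more workers; UseOctree switches frustum culling to an octree
; traversal, which can change how the culling work is distributed.
r.Visibility.OcclusionCull.MaxQueriesPerTask=32
r.Visibility.FrustumCull.UseOctree=1
```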