Occasional frame drops on nDisplay render nodes

Erasio · October 10, 2022, 1:37pm

Hey there!

We are running into odd problems getting stable performance from our scene. Everything is optimized to run below 20ms per frame, and we only need a stable 40ms per frame, yet we get occasional frame drops that we can’t explain with the profiling tools. For testing I’ve run only the inner frustum with frozen nodes, and separately only the environment without the inner frustum.

Sometimes we get exceptionally high game time for no evident reason (there’s no logic running at all). And the moment it drops we see a frame drop.

Other times it happens randomly.

Again with no apparent reason. Neither the live profiling tools, the GPU analyzer nor the frontend could show us what exactly caused these issues.

Lumen is disabled, both GI and reflections. The environment is part landscape, part Nanite, with exponential height fog, minor lighting (sky and directional lights) and some translucent fog cards. Nothing that would obviously cause performance issues.

The most concerning part is the impossibility of debugging. As far as I can see the level is optimized to a degree where rendering it twice should be stable.

I can verify that the network stream for Live Link data (camera position) is entirely stable, arriving at regular intervals with <0.1ms variation. All connections are wired, all nodes have synchronized graphics cards, which are also synchronized with the Live Link input data, and all wires have been double-checked for faults and latency spikes. The issue happens with network derived data caches as well as with local caches.

We have also tried delaying the external location data so that it arrives at different moments within a rendered frame, to rule out that it arrives in between frames because Unreal takes a bit more or less time per frame.

So I am confident in saying there are no hardware defects or synchronization errors, and I am quite confident there are no configuration errors either.

Anyone got some insights or experience with this? Specifically about how to debug problems like these?

Erasio · October 12, 2022, 2:14pm

For context, here’s the structure of a frame drop. Instead of hitting our intended 40ms, this frame took 44ms to render and therefore missed the sync window. I can definitely rule out raw performance as the problem, as all threads spend most of their time waiting.
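To put numbers on why a 44ms frame hurts so much against a 40ms target, here is a minimal sketch. It assumes a late frame is held until the next sync boundary, which is my understanding of synchronized output in general and not something confirmed for this exact nDisplay setup. The helper names are made up for illustration and are not part of Unreal.

```cpp
#include <cstdint>

// Hypothetical helper, not an Unreal or nDisplay API.
// Number of whole sync intervals a frame occupies, ASSUMING a frame that
// misses its sync boundary is held until the next one.
constexpr std::int64_t SyncIntervalsUsed(std::int64_t FrameTimeUs, std::int64_t SyncIntervalUs)
{
    // Ceiling division; a frame always occupies at least one interval.
    return FrameTimeUs <= 0 ? 1 : (FrameTimeUs + SyncIntervalUs - 1) / SyncIntervalUs;
}

// Number of sync windows missed by a frame (0 means the frame was on time).
constexpr std::int64_t MissedSyncWindows(std::int64_t FrameTimeUs, std::int64_t SyncIntervalUs)
{
    return SyncIntervalsUsed(FrameTimeUs, SyncIntervalUs) - 1;
}
```

Under that assumption, a 44ms frame against a 40ms interval occupies two intervals, so being only 4ms late costs a full extra 40ms on screen, while a 20ms frame and a 40ms frame both fit into one interval.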

However, upon closer inspection it seems that the difference in duration is related to background workers querying for tasks… of which there are none.

Here’s a normal frame. The render thread records a duration of ~4ms for the frame.

Here’s the frame from above (44ms), focused on the render thread. It is ~8ms long, and most of that time is spent idling. But instead of doing basically nothing, all 26 worker threads (both foreground and background) take about 4.6ms to query for jobs… and we have basically no jobs.

There are 71 microseconds of Chaos updates (we have no physics active), 62 microseconds of “mesh draw command pass setup” and about 40 microseconds of audio updates (we have no audio).
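As a quick sanity check on those figures, here is a small sketch that adds them up and compares them with the ~4.6ms the workers spend querying for jobs. It assumes the three durations are totals for that window and are the only work recorded in it. The helper is hypothetical and only there for the arithmetic.

```cpp
#include <cstdint>

// Figures quoted above, in microseconds.
constexpr std::int64_t ChaosUpdateUs   = 71;
constexpr std::int64_t MeshDrawSetupUs = 62;
constexpr std::int64_t AudioUpdateUs   = 40;
constexpr std::int64_t WorkerWindowUs  = 4600; // ~4.6ms spent querying for jobs

// Sum of the recorded work in that window.
constexpr std::int64_t RecordedWorkUs = ChaosUpdateUs + MeshDrawSetupUs + AudioUpdateUs;

// Hypothetical helper: share of the window that is recorded work, in permille
// (tenths of a percent), truncated. Integer math keeps the result exact.
constexpr std::int64_t WorkSharePermille(std::int64_t WorkUs, std::int64_t WindowUs)
{
    return WindowUs <= 0 ? 0 : (WorkUs * 1000) / WindowUs;
}
```

That comes to 173 microseconds of recorded work in a roughly 4600 microsecond window, which is under 4%. The rest is workers looking for jobs.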

To me this looks as if the worker thread sync is causing these frame drops. After checking the settings and skimming over the ini files I didn’t find a way to reduce the number of workers to test this theory.

Any help with reducing workers, or on whether that’s even the right direction, would be much appreciated!

Erasio · October 21, 2022, 2:52pm

Early tests with Unreal 5.1 seem much better, though neither Switchboard nor nDisplay seems stable or usable in Preview 2 (as can be expected from a preview).

For the time being I’ve compiled a custom version of Unreal and reduced the maximum number of threads to 8 instead of running 26+. This does mean we underutilize the CPU, but spending less time on synchronizing threads seems to dramatically reduce the size of the spikes in render time.

Just to leave it here. I’ve used the answer in this thread:

Changing this line: https://github.com/EpicGames/UnrealEngine/blob/release/Engine/Source/Runtime/Core/Private/Async/TaskGraph.cpp#L1827

It doesn’t need to be hex. You can punch in a regular number.
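For anyone wondering what the change amounts to conceptually, here is a rough standalone sketch of capping a worker count. This is not the engine code from TaskGraph.cpp. The names are made up, it uses only the C++ standard library, and the cap of 8 is simply the value I settled on.

```cpp
#include <thread>

// Hypothetical cap, mirroring the value of 8 mentioned above.
constexpr unsigned MaxWorkerThreads = 8;

// Returns how many workers to spawn: the detected count limited to the cap,
// and never fewer than one.
constexpr unsigned ClampWorkerCount(unsigned Detected, unsigned Cap)
{
    return (Detected == 0 || Cap == 0) ? 1u : (Detected < Cap ? Detected : Cap);
}

// Worker count for the current machine under the hypothetical cap.
inline unsigned WorkerCountForThisMachine()
{
    // hardware_concurrency() may return 0 if the value cannot be determined.
    return ClampWorkerCount(std::thread::hardware_concurrency(), MaxWorkerThreads);
}
```

With a cap of 8, a machine that would otherwise get 26 workers ends up with 8, and a machine with only 4 hardware threads is left at 4.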

When compiling, remember to build the Multi-User server as well. Otherwise Switchboard won’t be able to launch Multi-User.