nDisplay: low FPS / events

Hi Basile. What is your method to verify that Wireshark receives the packets in due time? I'm asking in the sense of clock alignment. Also, does the issue always happen with the same node, or do different nodes exhibit the same intermittent delay?

P.S. Yes, please share the files if possible: the utraces and the pcaps.

[Attachment Removed]

Oh, and UDP messaging is not supported for this.

[Attachment Removed]

I am uploading the files archive split into four 1 GB parts. I could not find a way to add 4 files to a single post, …

My analysis focused on the start of frame 166.

[Attachment Removed]

Basile,

Do all nodes have the same hardware/software setup?

Would it make sense to try a variety of combinations by changing the primary/secondary nodes?

  • #1 primary + #2 secondary
  • #3 primary + #2 secondary
  • etc

I would also try having a totally different HW setup (even laptops could work).

[Attachment Removed]

Hey all,

Let me join the party.

A few notes regarding GetObjectsData. This is a synchronization step that allows the secondary (slave) nodes to get the corresponding data from the primary (master) node every frame. It synchronizes all the custom objects; if you don’t use any of them (implementations of the IDisplayClusterClusterSyncObject interface), then you don’t introduce any additional traffic there. Yes, it’s called 3 times per frame: at the pre-tick, during-tick, and post-tick synchronization steps. From a secondary node’s perspective, the synchronization process looks like this:

1. A secondary node sends a GetObjectsData (pre-tick, for example) request

2. The primary receives the request

2.1. If the pre-tick sync data is already available (the primary node has already prepared and cached it), the primary sends the response with the data immediately. The overall time for this GetObjectsData step is minimal.

2.2. If the pre-tick sync data is not ready yet, the request waits until the primary node prepares it. Obviously, the longer the wait, the longer the GetObjectsData block looks in the traces.

Usually, these synchronization steps pass really fast.
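For those who do use custom sync objects, here is a minimal sketch of what one looks like. It is based on the IDisplayClusterClusterSyncObject interface as of UE 5; the exact method signatures and the registration call may differ between engine versions, so treat it as illustrative rather than copy-paste ready:

// Illustrative only; verify the interface against your engine version.
class FMySyncObject : public IDisplayClusterClusterSyncObject
{
public:
	virtual bool IsActive() const override { return true; }
	virtual FString GetSyncId() const override { return TEXT("MySyncObject"); }
	virtual bool IsDirty() const override { return bIsDirty; }
	virtual void ClearDirty() override { bIsDirty = false; }

	// The primary serializes this every frame; secondaries deserialize it.
	virtual FString SerializeToString() const override { return Payload; }
	virtual bool DeserializeFromString(const FString& Str) override { Payload = Str; return true; }

private:
	bool bIsDirty = true;
	FString Payload;
};

// Registration binds the object to one of the three per-frame sync steps:
// IDisplayCluster::Get().GetClusterMgr()->RegisterSyncObject(&MySyncObject, EDisplayClusterSyncGroup::PreTick);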

A few notes about WaitForFrameStart/WaitForFrameEnd. These are barrier-based synchronization steps. That means every node sends the sync request, but the nodes get a response ONLY when all other nodes have sent it as well (they all meet at the same point).

1. A node (including the primary) sends WaitForFrameStart (let’s say it’s FrameStart; FrameEnd works exactly the same way)

2. The primary’s TCP session receives the request from a node and enters a barrier synchronization

2.1. If not all nodes have sent such a request yet, the barrier object blocks the calling thread until the last node arrives

2.2. If all nodes have sent their WaitForFrameStart requests, the barrier is released, responses are sent to every caller, and they are all unblocked
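In plain C++ terms, the barrier on the primary behaves roughly like the following sketch (illustrative only, not the actual nDisplay implementation; FFrameBarrier and its members are made-up names):

#include <condition_variable>
#include <mutex>

// Each node's TCP session thread calls Arrive(); nobody gets a response
// until all NumNodes threads have arrived at the barrier.
class FFrameBarrier
{
public:
	explicit FFrameBarrier(int InNumNodes) : NumNodes(InNumNodes) {}

	void Arrive()
	{
		std::unique_lock<std::mutex> Lock(Mutex);
		if (++ArrivedCount == NumNodes)
		{
			ArrivedCount = 0;     // reset for the next frame
			++Generation;
			Cond.notify_all();    // release everyone: all responses go out now
		}
		else
		{
			const int MyGeneration = Generation;
			// Block this node's session thread until the last node arrives
			Cond.wait(Lock, [&] { return Generation != MyGeneration; });
		}
	}

private:
	std::mutex Mutex;
	std::condition_variable Cond;
	int NumNodes;
	int ArrivedCount = 0;
	int Generation = 0;
};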

So you see, there are many sources that could introduce delays at any of the synchronization steps. But even though that’s possible, in your case it rather looks like a real issue, because 90+ ms is too much.

The first thing that comes to mind is exactly what Alejandro mentioned earlier. We have seen such issues in the past, and all those cases were caused by cables, routers, other non-nDisplay traffic, quotas, etc. It looks like, say, step 2.1 of GetObjectsData above, but for some reason it takes too long between Node-Master-Data-Sent and Node-Slave-Data-Received. This is what we could theoretically prove or disprove by investigating the pcap files. Unfortunately, the files you attached were captured with the ‘none’ sync policy, so I can’t find the problem there. Yes, I see Pipe_4 (master) runs faster than the others, and Pipe_5 is a bit slower. Pipe_1 seems to be the slowest, so all those relatively small delays are caused by the slowest node. It would be perfect to catch such a 90+ ms delay in a pcap while running an ‘empty’ scene with the ‘Nvidia’ sync policy.
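As a side note, one way to hunt for such gaps in a pcap (assuming the “Calculate conversation timestamps” option is enabled in Wireshark’s TCP protocol preferences, which provides the tcp.time_delta field) is a display filter like:

tcp.time_delta > 0.09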

There is one thing that looks really weird. Look at this “5 ms over 5 ms” sine-like pattern. Sometimes such a pattern can be caused by other network traffic, cooling issues, CPU/GPU throttling, etc.

[Can’t attach the image here for some reason]

If I find anything else, I will let you know.

[Attachment Removed]

For some reason, I can’t attach any images. Here is a gdrive link to a jpeg.

[Attachment Removed]

Hi all, and welcome aboard, Andrey.

Sorry for the slight delay in answering (getting back from headquarters was not a smooth trip because of flight delays and the storm hitting Normandy, …).

Thanks for the detailed explanations about the synchronisation.

On the last onsite day, I experimented with some Windows registry settings:

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Multimedia\SystemProfile]
"NetworkThrottlingIndex"=dword:ffffffff
"SystemResponsiveness"=dword:00000000
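For reference, the same settings as an importable .reg file (standard regedit export format; note that a reboot, or at least a sign-out, is typically needed before they take effect):

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Multimedia\SystemProfile]
"NetworkThrottlingIndex"=dword:ffffffff
"SystemResponsiveness"=dword:00000000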

I do not have the details on which one helps the most (or whether it’s both), but applying them visibly changed the behavior into something that looks more like what we expect.

That would mean the OS was imposing delays through network throttling or CPU starvation.

Here is a graph from a free run:

[Image Removed] The first spike zone happened as I had to touch the KVM. Then we have the giant spike when the simple simulation test is started. After that, the load is quite high but stable.

I also have another run with NVIDIA synchronization enabled:

[Image Removed]

Besides the frame losses happening after the simulation starts, it finally looks like it should!

However, I am still not convinced all is good. If I look in detail at the last 50 ms spike:

This is the trace from IG1 (master):

[Image Removed] We see the frame presentations occurring at 1m55.1478, 1m55.1767, and 1m55.2146.

That more or less means 2 frames are presented with a 33 ms delay.

On IG2, we see this:

[Image Removed] From the presentation, I gather this frame takes 50 ms…

IG3 follows the same pattern as IG2.

[Image Removed]

My understanding of frame lock and swap lock is that the presentations are synchronized by the hardware, and that presentation/swap blocks until all IGs are ready to swap. It does not make sense to me that the swaps are not aligned.

Is there anything I am missing on that front?

I will start uploading the traces for this run. Please tell us what they mean to you…

Some extra questions now:

What was the goal of testing with another computer? (Identifying a hardware issue?)

Are there any system counters captured by the Unreal trace system (overall CPU usage, for instance)?

[Attachment Removed]

[Attachment Removed]

[Attachment Removed]

[Attachment Removed]

Wow, yeah, this looks much better now.

I believe NetworkThrottlingIndex is the key here, since your CPUs have sufficient computational power. This is probably a Windows 11 specific issue (as if it’s disabled by default there), and it’s worth pointing out, since there isn’t much feedback from Win11 users yet.

Yes, buffer swaps are synchronized at the hardware level by the NVIDIA sync boards. If any node is not ready to swap, the others have to wait. That’s why all the spikes appear in the same place for every node. One thing I have to mention here: by default, nDisplay uses the old synchronization approach called “NVIDIA Swap Barriers”. There is a new approach available, called “NVIDIA Present Barrier”. You can activate it by setting the following console variable:

nDisplay.sync.nvidia.UsePresentBarrierPolicy=1

This new guy is responsible for the same thing; however, it might work better on Windows 11. We haven’t tested it a lot, and it’s still rather an experimental feature.
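If this CVar is read-only (like the diagnostic one quoted later in this thread), it has to be set at startup rather than from the in-game console. Assuming a standard UE setup, one common place for that is Engine/Config/ConsoleVariables.ini:

[Startup]
nDisplay.sync.nvidia.UsePresentBarrierPolicy=1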

Regarding those spikes you’ve mentioned: I see they still occurred during test initialization/start. You see, once everything is loaded, it runs smoothly. The last spike was likely caused at the very end of the test start/initialization; I mean, it could still be part of the data loading/deploying/initialization.

https://drive.google.com/file/d/1yk51_emAIQB_MZ08Gbf3Z-OUQ2k0Fh6P/view?usp=sharing

What was the goal of testing with another computer?

It’s a reliable way to isolate any local issues, either software or hardware.

Are there any system counters captured by the Unreal trace system (overall CPU usage, for instance)?

I don’t think so, but I’m not 100% sure. For a complete picture, you could capture Windows performance data (https://learn.microsoft.com/en-us/windows-hardware/test/wpt/windows-performance-analyzer) alongside the UE traces.
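For example, a basic system-wide capture with the Windows Performance Recorder CLI (assuming WPR is available, e.g. via the Windows ADK; GeneralProfile is one of its built-in profiles) would look like this, with the test running in between the two commands:

wpr -start GeneralProfile -filemode
wpr -stop nDisplayRun.etl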

[Attachment Removed]

It is also my guess that network throttling was the cause here, as I expect/hope that the CPUs in place are largely powerful enough to handle the situation. We still have much to add to the application before it becomes usable.

I am not on site anymore, but we will give the present barrier a try to see if there is any change.

Also, while I agree that the spike I mentioned directly follows the massive spike occurring when the simulation starts, it still happens more than one minute after application start. The synchronisation should be in place at that point.

Looking into the console variable you mentioned, I noticed these comments:

// Sometimes we see a weird pattern of NvAPI_D3D1x_Present() handling in the utraces. Those patterns
// visualize something that we would not expect to see if synchronization works properly.
// NvAPI_D3D1x_Present() may return in between of the vblanks, or it may return asynchronously to other nodes.
// This cvar enables additional barrier synchronization step right after NvAPI_D3D1x_Present() call.
// This should help keeping RHI threads aligned.
static TAutoConsoleVariable<bool> CVarNvidiaSyncPostPresentAlignment(
	TEXT("nDisplay.sync.nvidia.PostPresentAlignment"),
	false,
	TEXT("Sync nodes on a network barrier after frame presentation\n"),
	ECVF_ReadOnly | ECVF_RenderThreadSafe
);

Can you elaborate on this variable / comment?

We definitely have plans to smooth out this huge frame loss and to simplify the synchronization task, but I would still have expected the frame loss to be identical on all nodes (one large 50 ms frame, or 2 consecutive 33 ms frames, not different behaviors). Also, the target system will have 6 channels with blended displays, which means it must be scalable and reliable.

Thanks for the WPA pointer. I used it a decade ago, and it seems it has not evolved. Your Unreal Insights tool or NVIDIA Nsight Systems are much better references for human brains…

[Attachment Removed]

Also, while I agree that the spike I mentioned directly follows the massive spike occurring when the simulation starts, it still happens more than one minute after application start. The synchronisation should be in place at that point.

Perhaps I’m missing something. In the most recent traces, I see the following:

0:00 - 0:55 - Engine initialization. At approx. 0:55, the simulation starts.

0:55 - 1:25 - It’s kind of idle. The frames finish super fast. Like an empty or trivial scene.

At 1:25, it looks like some custom BP timer event occurs, and some widgets appear (tiger canvas, volumetric cloud, …)

1:25 - 1:48 - Even with some new things, it still keeps running smoothly. Looks idle from the time budget perspective.

1:48 - 1:55 - It looks like something is being loaded asynchronously. All those spikes appear during this “heavy” period.

1:55 - 4:48 - Runs smoothly till the end.

The “heavy” period takes roughly 7 seconds and processes only 180 frames (~40 ms per frame on average; the overbudget is in the spikes). The last spike occurs right after some new water/wave/Niagara actor/effect appears. It’s kind of expected to have spikes during this heavy period, I would say. It’s not related to the synchronization; it’s rather a budget issue. You just need to optimize the way data is loaded and how it appears in the scene, so that frames are produced smoothly.

Or am I not getting something right?

Re CVarNvidiaSyncPostPresentAlignment.

It’s one of those diagnostic CVars that we used in the past to investigate some issues. It should not be used, as we use non-blocking presentation in DX12.

[Attachment Removed]

Hi Andrey,

Let me try to clarify, using the frame number as the reference, since the application time differs from one IG to the other. If I mention a time, I will be referring to IG1 (master).

From frame 0 to 3023, the application is starting; then the point of view is static over a very simplified scene (terrain skin). Besides initialization, this is fine.

Frame 3023 is a massive frame. It corresponds to the simulation starting. The simulation is CIGI-based (network driven) for our test reference. Entities are created.

  • We shall improve our management here. Acknowledged and understood.

From frame 3034 onward, the application is updating positions for the entities, potentially the weather, and all actors tick accordingly.

  • Our job is to improve performance for our actors / simulation.
  • Synchronization must ensure displays are presented correctly and consistently.

I am coming from a world (OpenGL) where glSwap is the moment the image is presented on screen, and where frame lock ensures all IGs swap together. So I expect all nDisplay NvAPI_D3D1x_Present calls to be released at the very same time, and the frame durations to be identical on all IGs.

Yet, looking into the traces:

IG2 and IG1 report one 50 ms frame (2 frames lost, targeting 60 fps), while IG3 shows 2 consecutive 33 ms frames (1 frame lost, twice).

[Image Removed]

Looking at the RHI threads, IG1 (2 consecutive misses):

[Image Removed] IG2 (2 lost frames):

[Image Removed] IG3 (2 lost frames):

[Image Removed]

There may be things I do not know about DirectX swap chains and presentation behavior, but I do not understand why the IGs are not all reporting the same traces.

[Attachment Removed]

Basile, make sure you have PSO precaching enabled.

https://dev.epicgames.com/documentation/en-us/unreal-engine/pso-precaching-for-unreal-engine

The only perfect solution for PSOs is to load ALL the content once on a target machine to compile all shaders for the target GPU and driver version.

Many games do this on first launch, traversing the content to fill up the PSO cache and prevent stuttering.
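As a starting point, a minimal way to force precaching on (assuming UE 5.1+, where the r.PSOPrecaching console variable exists; check the linked documentation for your engine version and for the full set of related settings) is via Engine/Config/ConsoleVariables.ini:

[Startup]
r.PSOPrecaching=1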

[Attachment Removed]

Ah, I see what you mean. I think the following should explain it.

For the first N frames, we use an additional Ethernet barrier synchronization step before presenting them. This allows NvAPI_D3D1x_Present() to be called almost simultaneously on every node, which makes the low-level synchronization easier and smoother (remember, it’s a black box for us with no control at all, so we just create a pleasant environment for it). The reason is exactly the same one you have: to pass smoothly through the data loading/initialization phase that usually occurs in the first couple of seconds after start. Then, after N frames, we stop the pre-synchronization for optimization purposes. And this is likely the explanation for why those calls have different durations: they start at different times, but stop simultaneously. If you want to play around, you can experiment with the following CVar:

nDisplay.sync.nvidia.PrePresentAlignmentLimit=100000

So N is big enough to keep the pre-sync active when your test starts. I would expect those calls to then have the same duration. But it’s not a fix; it’s just a way to test my theory. I still think the synchronization works properly, even though those calls look so different.
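To make the mechanism concrete, here is a rough, hypothetical sketch of the logic described above (the names and call sequence are illustrative, not the actual nDisplay source):

// Hypothetical sketch; not the actual nDisplay source.
void PresentFrameOnRHIThread(uint32 FrameIndex)
{
	// For the first N frames, meet all nodes on an Ethernet barrier so that
	// NvAPI_D3D1x_Present() starts almost simultaneously on every node.
	if (FrameIndex < PrePresentAlignmentLimit)   // the N controlled by the CVar
	{
		ClusterBarrier.Wait();   // returns only once every node has arrived
	}

	// Hand over to the NVIDIA swap barrier (a black box to the engine).
	NvAPI_D3D1x_Present(Device, SwapChain, SyncInterval, Flags);
}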

[Attachment Removed]

Hi,

The Wireshark reception time is based on the reported timestamp.

[Attachment Removed]

[Attachment Removed]

[Attachment Removed]

[Attachment Removed]