NDisplay: low FPS / events

Steps to Reproduce

Hi there, we are currently working with nDisplay on 3 computers (SRV_CS_0, SRV_CS_1, SRV_CS_2) synchronized with NVIDIA swap barriers, with SRV_CS_0 as the nDisplay master.

We are experiencing FPS drops and are trying to figure out why.

As the Insights graph shows for the nDisplay cluster nodes, WaitForFrameStart/WaitForFrameEnd sometimes take a large amount of time, e.g. up to 93.6 ms (see picture), which throttles the FPS of the whole cluster from time to time, even with an empty or low-complexity scene.

  • What is the logic behind the WaitForFrameStart/WaitForFrameEnd nDisplay events? Is it known behaviour that they take around 3 ms for a simple graphic scene?

Moreover, in the second picture the function GetObjectsData gathers information for the current cluster. We also experienced FPS loss with some settings (e.g. r.velocity.EnableVertexDeformation, particles, ...), which seems like it could be related to determinism.

  • Is there somewhere in the documentation a list of all the features/events/attributes in nDisplay targeted at sync/determinism between cluster nodes?

We know about nDisplay replication; it is set to off.

Last point: when we drop below 30 fps, nDisplay sticks at 30 fps, even if the scene could later run at 60 fps. How can we disable this behaviour?

thanks, stephane

[Attachment Removed]

Steps to reproduce:

Load asynchronous data, navigate the scene, profile with Insights.

[Attachment Removed]

Hi support,

Can you please advise what is synchronized by nDisplay and how to troubleshoot the latency introduced during execution?

Even with little load on the system, we are observing multi-millisecond synchronization waits, and none of the systems reports unexpected activity during those events.

The network is isolated and under no pressure (less than 10% usage). Also, to our knowledge, we are not synchronizing extensive data of our own between the nodes.

Can you point us to extra logs / tests / verifications? This is preventing proper usage in a large-scale system.

Thanks,

[Attachment Removed]

Hello,

Both the NVIDIA and the Ethernet sync policies force VSync on. Enabling VSync is required to prevent tearing and to align frame presentation with the display refresh rate.

Your screenshot shows a frame time of around 16.6 ms, which corresponds to a 60 Hz display refresh rate.

Thus everything works as expected and there is no problem or loss.
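As a rough illustration of the 16.6 ms figure, here is a small arithmetic sketch. It assumes plain VSync, where a frame is presented on the first refresh boundary after its work completes; how nDisplay's sync policies behave in detail is not confirmed in this thread. The function name and the sample inputs are hypothetical.

```python
import math


def presented_frame_time_ms(work_ms: float, refresh_hz: float = 60.0) -> float:
    """Frame time once presentation is aligned to the display refresh."""
    interval_ms = 1000.0 / refresh_hz
    # A frame can only be shown on a refresh boundary, so round the work
    # time up to a whole number of refresh intervals (at least one).
    # The tiny epsilon keeps an exact multiple from rounding up a step.
    intervals = max(1, math.ceil(work_ms / interval_ms - 1e-9))
    return intervals * interval_ms


# Hypothetical inputs: a light frame and one that just misses a refresh.
light = presented_frame_time_ms(3.0)
missed = presented_frame_time_ms(17.0)
```

Under this assumption a 3 ms frame is still presented every 16.67 ms (60 fps), and a frame that takes slightly longer than one interval is presented every 33.33 ms (30 fps).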

The first screenshot with the GarbageCollection call could be related to the content. What is the nature of the content in your scene that makes GC take so much time to clean up? Do you have some sort of dynamically created objects left in the world? Could you reproduce it on a single machine to focus on the GC issue? You could set a breakpoint and look in the debugger at what sort of objects are being cleaned up.

A quick and dirty fix could be to call GC more often, or not at all if you have enough RAM between level switches.

I hope that helps!

vitalii

[Attachment Removed]

Hi Vitalii, thanks for the feedback!

We are focusing mainly on the WaitForFrameEnd events emitted in the cluster. For testing purposes we are using an empty scene, and we are trying to figure out why the slave events (WaitForFrameEnd) aren't returned earlier.

Our target scenario is multiple computers (x6) synchronized with nDisplay/NVIDIA, but we are running into trouble with performance/return times because GetObjectsData() and GetEventsData() are returned lazily to the master.

Would there be a way to bypass the wait in these functions, knowing they are empty? Moreover, why are there multiple calls to GetObjectsData() in the same frame per slave?

As I understand it, multiple GetObjectsData() calls on the same slave for the same frame would impact performance (network / event / send / receive) with unnecessary overhead.

The idea is that the more computers we have, the less overhead we can afford to accumulate if WaitForFrameEnd events are to resolve quickly.

To that end, could you tell us what is synchronized between cluster nodes with GetObjectsData(), and whether it would be feasible to achieve a full-cycle WaitForFrameEnd (master + slaves) under 0.5 ms in this case (x6 nDisplay nodes) and thus avoid multiple calls to GetObjectsData()?

Attached are pictures of an example with 3 computers/cluster nodes, showing the WaitForFrameEnd vs GetObjectsData() calls.

thx ! [Image Removed]

[Attachment Removed]

Hello Stephane,

why is there multiple calls to GetObjectsData() in the same frame per slave ?

Those are related to the sync points in the cluster that keep its lockstep execution aligned. These functions are absolutely needed for engine consistency across the nodes and have minimal performance implications (much faster than 0.5 ms) unless huge data is passed through or one node waits for another to finish its work (game thread or render thread).

You can see the raw performance by switching to sync policy “none”, which disables VSync (VSync is forced for the Ethernet/NVIDIA sync policies).

Let me know if that helps.

Vitalii

[Attachment Removed]

Hi Vitalii,

Stéphane is presently attending I/ITSEC in Orlando and will not be able to answer this before getting back to the system next week. I am pretty confident that we tried setting the policy to “none” to rule out the hardware setup as the faulty component, with the same end result. There are some undetermined frames which, for no reason we can explain, expose this unexpected and significant delay with the sync method.

In the captures, the traces are showing the same frame on all 3 nodes from our platform. As far as we can see, nothing explains the delay. Are you able to pinpoint what the applications are waiting for ? As Stéphane said, we are not replicating any content so we would expect the call to be nearly invisible in the traces…

Thanks,

Basile

[Attachment Removed]

Hello Basile,

Cheap network switches often cause latency issues as well. Please share traces captured with the “none” sync policy.

Thanks!

vitalii

[Attachment Removed]

Hi Vitalii,

The traces were generated using a CISCO C1000-48T-4X-L.

Also, the synchronisation network cards are operating on a dedicated VLAN. It would be quite unfortunate if the latency came from that part.

Stéphane will get his hands back on the system next week to generate traces after modifying the sync policy.

Thanks,

Basile

[Attachment Removed]

Basile, could you please share traces with sync policy none?

Thanks

[Attachment Removed]

Hi [mention removed], here are the traces with sync policy none:

https://we.tl/t-CcxDtPAw18

best,

stéphane

[Attachment Removed]

These hitches could be a networking issue; traces tend to look like this in such cases. One way to diagnose further is with a network capture (e.g. a .pcap with Wireshark).

[Attachment Removed]

Hello Stephane,

From the traces, I see that you have very decent performance, around 180-200 fps, and then at some point, hitches begin to occur.

What could be affecting the network around 40 seconds after the start, causing it to take 7 seconds to return to normal?

[Image Removed]

[Attachment Removed]

Hi [mention removed], yes, with the sync policy set to None, the FPS isn't restricted (no VSync).

For the (red) zone you marked in your image, if we take the highest peak (frame 7536),

we have multiple bottlenecks in this area (see pictures below); all of the computers are making lengthy calls to:

WaitForFrameStart => duration 15 ms, GetEventsData => duration 18.5 ms

As for behaviour, the scene is static: the camera is at a fixed point and doesn't move, and no elements are loading or unloading during that time. The computers have a dedicated switch (CISCO C1000-48T-4X-L).

This (red) area is unexplainable, since no interaction is happening. What would cause WaitForFrameStart to take such an amount of time? Moreover, what would you recommend as a network setup (hardware)? Is this happening on the software/code/nDisplay side? Thx Vitalii

[Image Removed] [Image Removed] [Image Removed]

[Attachment Removed]

Hi Stephane, nothing is happening here; we just wait on the socket for the data.

My recommendation would be to try to switch to a decent unmanaged switch for the test (not cheap, not Cisco high-end grade).

We often see network-related issues caused by high-frequency update traffic.

[Attachment Removed]

Hi [mention removed], we are going to run a series of tests with a different, unmanaged switch.

We will get back to you as soon as we have the benchmark results, which should be in a few weeks.

[Attachment Removed]

Ethernet cables can also cause issues that look like this in traces.

[Attachment Removed]

Hi all,

First of all, let me wish you all the best for the new year. I travelled this week to our headquarters to run some analysis on the system. I ended up recompiling the engine with extra tracing in the logs. I also generated Wireshark captures of all packets on our system.

For reference, this validation platform is using 3 nodes (1 as master, 2 & 3 as slaves).

The conclusion I draw from today’s analysis is that the network synchronization TCP socket is *not* promptly processing packets that have already been received (as per Wireshark).

Some snapshots to highlight my analysis :

Master

[Image Removed]

Slave 1

[Image Removed]

Slave 2

[Image Removed]

I focused my analysis on the WaitForFrameStart delay seen on both the master and slave 2, both of which are being delayed by slave 1.

The application is capped at 70 fps and we are *not* using NVIDIA synchronization in this test. Given that fps cap, I consider the delay introduced by the max tick rate to be expected.

The sequences in the pcap files are as follows:

Master (IP 192.168.50.1)

[Image Removed]

Slave 1 (IP 192.168.50.2)

[Image Removed]

Slave 2 (IP 192.168.50.3)

[Image Removed]

Now the trace for the corresponding nDisplay section in Insights:

[Image Removed]

My understanding of the sequence is as follows:

Both slaves send a time data request.

Master sends the time data to BOTH after some time, in order to maintain the 70 fps, then waits.

Slave 1 manages to receive the packet and sends the WaitForFrameStart request, then waits.

Slave 2 takes multiple milliseconds to receive it and finally sends the WaitForFrameStart request, then waits.

Master sends the WaitForFrameStart responses to both slaves, then continues.

In that sequence, the delay in WaitForFrameStart is unexpected to me and seems to be due to the (blocking) socket receive call being unexpectedly delayed.
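To make the effect of one late node concrete, here is a minimal model of a barrier like the one described above. It is a sketch under assumptions, not nDisplay's implementation: the node names and timings are hypothetical, and the only rule modelled is that the master releases the barrier once every node has checked in.

```python
from typing import Dict


def barrier_release_us(arrival_us: Dict[str, int]) -> int:
    """Time at which the barrier opens: when the last node has arrived."""
    return max(arrival_us.values())


def wait_per_node_us(arrival_us: Dict[str, int]) -> Dict[str, int]:
    """How long each node sits in the barrier before it is released."""
    release = barrier_release_us(arrival_us)
    return {node: release - arrived for node, arrived in arrival_us.items()}


# Hypothetical arrival times in microseconds: one node's request is seen
# about 8 ms later than the others.
waits = wait_per_node_us({"node_a": 0, "node_b": 200, "node_c": 8200})
```

In this model the late node itself waits 0 µs, while every node that arrived on time absorbs the full delay, which is consistent with the delay showing up in the traces of the nodes that were not late.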

Can you confirm my understanding of the problem ?

Do you have any recommendations for further tests / configuration to solve the issue? The machines are very powerful and the load for this test is minimal.

All data are available for sharing if interested

Thanks in advance,

Basile

[Attachment Removed]

Is the packet received later than expected in Wireshark, or just in the socket? If the latency is visible in Wireshark, then I would probably try changing the Ethernet cables; I have seen an issue like this before and it turned out to be a degraded Ethernet cable. If the delay is between Wireshark and the Windows socket, then this is new territory. The only other case where I saw something like this was when a plugin was incorrectly polling the socket in a busy loop and the lower-level OS/driver stack didn’t take it well, which introduced general networking delays (this was 5 years ago, though).

[Attachment Removed]

Hi Alejandro,

As I tried to explain, the packet is seen within Wireshark in due time. However, the recv call on the Winsock socket is not released for another 8 ms or so. Something similar to this:

https://stackoverflow.com/questions/15588961/windows-tcp-socket-recv-delay (in our case we have a blocking receive call and not polling, but still).

For completeness, I already tried different cables before narrowing it down to this. Whether Ethernet CAT6, CAT7 or even brand-new CAT8, the results were similar. We also tried sockperf and other network analysis tools, and they report correct latency.

I do indeed suspect something is badly managed at the TCP layer within the driver / OS, but since this has so far only been observed within Unreal, I was hoping for pointers like:

  • OS settings to double check
  • Simplified application for unit tests
  • Lower level monitoring tools references
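As one possible shape for such a simplified test application, here is a sketch of a blocking TCP ping-pong that times each round trip. It is an assumption-laden stand-in, not nDisplay's protocol: it runs both ends on the loopback interface in one process, disables Nagle's algorithm with TCP_NODELAY on both sockets, and uses a hypothetical 32-byte payload. To test the real network path, the echo side would have to run on another node.

```python
import socket
import threading
import time


def _echo_until_closed(server_sock: socket.socket) -> None:
    """Accept one client and echo back everything it sends."""
    conn, _ = server_sock.accept()
    with conn:
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        while True:
            data = conn.recv(64)  # blocking receive
            if not data:
                break
            conn.sendall(data)


def measure_round_trips(host: str = "127.0.0.1", count: int = 200,
                        payload: bytes = b"x" * 32) -> list:
    """Return the round-trip time in milliseconds of each ping-pong."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    echo = threading.Thread(target=_echo_until_closed, args=(server,),
                            daemon=True)
    echo.start()

    samples_ms = []
    with socket.create_connection((host, port)) as client:
        client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(count):
            start = time.perf_counter()
            client.sendall(payload)
            received = 0
            while received < len(payload):  # TCP may deliver in pieces
                chunk = client.recv(len(payload) - received)
                if not chunk:
                    raise ConnectionError("echo side closed the connection")
                received += len(chunk)
            samples_ms.append((time.perf_counter() - start) * 1000.0)

    echo.join(timeout=1.0)
    server.close()
    return samples_ms
```

Calling `measure_round_trips()` and comparing the maximum sample against the median gives a quick indication of whether occasional multi-millisecond recv stalls also appear outside the engine.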

Also, is there a possibility to use UDP-based nDisplay sync instead of TCP? Since the network is isolated and very lightly loaded, packet loss is not a concern. That could possibly also work around the problem…

Thanks,

[Attachment Removed]