200ms hitch every few seconds in DisplayClusterClusterSyncClient

Hi there,

When running two machines in multi-process, multi-GPU configurations, with any sync mode (None, NVIDIA, etc.) set in Switchboard, there is a 200 ms hitch every few seconds in the DisplayCluster sync client, usually in one of these functions:

  • FDisplayClusterClusterSyncClient::GetObjectsData
  • FDisplayClusterClusterSyncClient::GetEventsData
  • FDisplayClusterClusterSyncClient::WaitForFrameStart

Or any function with:

    TRACE_CPUPROFILER_EVENT_SCOPE(CLN_CS::WaitForFrameStart);
    Response = SendRecvPacket(Request);

With a single render machine running multi-process, multi-GPU, everything runs fine; adding the second render node seems to cause the problem.

The render machines have two networks: a 1 Gb/s control network and a 10 Gb/s project network.

However, I have tried an isolation test with only the 10 Gb/s network connected, and the issue is the same.

I have tried launching with different sync options set in Switchboard (None, NVIDIA, Ethernet, etc.). It does not seem to make a difference.

Are there any thoughts on what could be going wrong, or things I should eliminate?

Attached are the 2 render node insights traces _00 and _01 and the editor trace.

I have also attached a copy of the project that reproduces the issues.

And DxDiag for the 2 render nodes.

Some Machine Stats:

Nvidia: Driver Version: 573.24 Studio

GPUS: RTX 6000 Ada

in Multi Process Mode,

Sync Cards: Quadro Sync 2 taking in a genlock signal, with a Cat 7 Ethernet cable connecting the 2 sync cards (I have also tried changing the Ethernet cable)

Latest Quadro Firmware

Network NIC: Intel X710 10 Gb/s

Render Node 1:

3 x GPUs, but only 2 are used.

Render Node 2:

1 x GPU and is used

Unreal Version is 5.6

Windows 11

Threadripper Pro 7975WX 32 Core

256 GB RAM

ASUS WRX90E-SAGE SE Motherboard.

Thanks,

Hi Keegan. I’m not sure what’s going on, but it does seem that the TCP comms between Node_0 (primary) and Node_1 stall for a little over 200 ms every few seconds (sometimes every second). Have you looked at a Wireshark capture? The traces indicate that the client thinks it sent the data, while the server thinks it hasn’t received it during that 200 ms of limbo, and it would be good to know which one is actually right. You could also look at the exchange of TCP ACK messages. P.S. nDisplay disables the Nagle algorithm on those sockets to minimize delays.
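One way to check the link independently of nDisplay is a small round-trip probe between the two nodes. The following is only a minimal sketch, not anything nDisplay itself runs: it assumes an echo service is listening on the other node, and the function names (`measure_round_trips`, `find_stalls`) and the 100 ms threshold are hypothetical choices for illustration. It disables Nagle on its own socket to mirror the behaviour mentioned above.

```python
import socket
import time


def measure_round_trips(host, port, count=1000, payload=b"x" * 64, timeout=2.0):
    """Send small requests to an echo server and return each round-trip time in ms.

    Assumption: something on host:port echoes back exactly what it receives.
    """
    rtts_ms = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # Disable Nagle so each small packet is sent immediately.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(count):
            start = time.perf_counter()
            sock.sendall(payload)
            received = 0
            # Read until the whole payload has been echoed back.
            while received < len(payload):
                chunk = sock.recv(len(payload) - received)
                if not chunk:
                    raise ConnectionError("echo server closed the connection")
                received += len(chunk)
            rtts_ms.append((time.perf_counter() - start) * 1000.0)
    return rtts_ms


def find_stalls(rtts_ms, threshold_ms=100.0):
    """Return (index, rtt_ms) for every round trip slower than threshold_ms."""
    return [(i, rtt) for i, rtt in enumerate(rtts_ms) if rtt > threshold_ms]
```

If the probe shows round trips of roughly 200 ms scattered among sub-millisecond ones, the stall is happening below the engine, in the TCP stack or on the wire, rather than in the cluster sync code.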

From a bit of digging, you may also want to look into your NIC settings, or test using a different NIC. There are some settings that could cause delays for CPU or power efficiency purposes, such as “Interrupt Moderation”.

Hi,

Good to know about Nagle, I was wondering if it was a cause.

I’ll try a Wireshark capture, and limit it only to using a single NIC on the machines, and I may even put in a really basic switch. I’ll also attempt turning off Interrupt Moderation and report back.

Thanks

Hello Keegan,

I would suggest following:

  1. Try with the default nDisplay template project
  2. No MPMGPU: one PC, one UE instance
  3. Keep only two nodes for the test
  4. It would be interesting to see the test with a non-embedded network card

And a few additional questions:

  • What kind of switch do you have? We saw similar behavior on cheaper hardware, which is easy to throttle
  • Disable MT in BIOS settings

Thanks

vitalii

Hi Alejandro and Vitalii,

I should have access to the stage in a couple of days, so I can follow up with some test results.

Hi there,

Just replying with my findings thus far.

Tried without MPMGPU as a basic test: no difference.

Turned off Interrupt Moderation and power efficient ethernet.

From Wireshark, I managed to determine that:

  1. Node 0 -> Node 1: sends a message
  2. Node 1 -> Node 0: ACKs
  3. Node 0: never receives the ACK
  4. Node 0: waits 200 ms, then resends the TCP packet
  5. Node 1 -> Node 0: responds with a duplicate ACK
  6. Node 0: receives the ACK
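The same pattern can be spotted in Wireshark with the `tcp.analysis.retransmission` and `tcp.analysis.duplicate_ack` display filters. For an exported capture, the check can be sketched roughly as below. This is a minimal illustration under assumptions: the `Segment` record and `find_retransmissions` helper are hypothetical, and the sample capture is made up to match the sequence described here, not taken from the real trace.

```python
from collections import namedtuple

# Hypothetical minimal record for one captured TCP segment.
Segment = namedtuple("Segment", "time_s src dst seq length")


def find_retransmissions(segments):
    """Return (src, dst, seq, gap_ms) for each data segment sent more than once.

    gap_ms is the time between the previous and the repeated transmission.
    """
    last_seen = {}
    retransmissions = []
    for seg in sorted(segments, key=lambda s: s.time_s):
        if seg.length == 0:
            continue  # pure ACKs carry no data, so they are skipped here
        key = (seg.src, seg.dst, seg.seq, seg.length)
        if key in last_seen:
            gap_ms = (seg.time_s - last_seen[key]) * 1000.0
            retransmissions.append((seg.src, seg.dst, seg.seq, round(gap_ms, 1)))
        last_seen[key] = seg.time_s
    return retransmissions


# Illustrative, made-up capture that follows the sequence described above.
capture = [
    Segment(0.0000, "node0", "node1", 1000, 48),  # Node 0 sends a message
    Segment(0.0002, "node1", "node0", 500, 0),    # Node 1 ACKs (never arrives)
    Segment(0.2000, "node0", "node1", 1000, 48),  # Node 0 resends after ~200 ms
    Segment(0.2002, "node1", "node0", 500, 0),    # Node 1 sends a duplicate ACK
]
print(find_retransmissions(capture))
```

A resend gap that clusters around 200 ms points at a retransmission timer firing after a lost packet or ACK, rather than at the application being slow to respond.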

I also tried swapping to a 1 Gb/s network with a dumb switch.

There I instead had 30 ms hitches in the same functions every 3-4 minutes, versus the 200 ms hitches every second. There were also stretches of 3 or 4 minutes without any issue.

Not sure if that is expected; given that it needs to sync with the other machine, it might just be bad timing.

It is a TP-Link TL-SX1008 switch.

I’m next going to investigate the switch.

Keegan, I am not entirely sure it is a networking issue. The trace has long texture transfers.

Could you please gather traces from the following setups:

  • 2 PC nodes and disable MPMGPU
  • 1 PC node and enable MPMGPU

Thanks!

vitalii

Hi there,

Just following up with some success thus far.

Replacing the Ethernet cables on the render nodes has resolved the 200 ms hitch issue, which was caused by missed TCP acknowledgments and the 200 ms wait before each retransmission.

So far we have seen no hitching over 20 minutes, whereas before it was every 1-2 seconds.

I’m a little surprised that the damage in the cable was enough to cause intermittent packet loss.
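As a rough back-of-envelope check, even a very small loss rate is enough to produce a hitch every second or two when every frame depends on several TCP exchanges. The sketch below assumes independent packet losses, and the frame rate and packets-per-frame figures are illustrative assumptions, not values measured in this thread; the helper name is hypothetical.

```python
def loss_rate_for_interval(seconds_between_hitches, frames_per_second, packets_per_frame):
    """Loss rate that yields one lost packet per given interval, on average.

    Assumes losses are independent and every lost packet causes one hitch.
    """
    packets_per_interval = seconds_between_hitches * frames_per_second * packets_per_frame
    return 1.0 / packets_per_interval


# Assumed figures: 60 fps, 8 sync packets per frame, one hitch every 1.5 s.
rate = loss_rate_for_interval(1.5, 60, 8)
print(f"{rate:.2%}")  # roughly 0.14%
```

Under those assumptions, a link that drops about one packet in 700 would already be enough, which is a level of loss that a marginal cable could plausibly cause without the link ever going down.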

I will also do the additional tracing you suggested for the long texture transfers, as we may have other problems.