Hey all,
Let me join the party.
A few notes regarding GetObjectsData. This is a synchronization step that allows the secondary (slave) nodes to get the corresponding data from the primary (master) node every frame. It synchronizes all the custom objects. If you don’t use any of them (implementations of the IDisplayClusterClusterSyncObject interface), then you don’t introduce any additional traffic there. Yes, it’s called 3 times per frame: for the pre-tick, during-tick and post-tick synchronization steps. From a secondary node’s perspective, the synchronization process looks like this:
1. A secondary node sends a GetObjectsData request (pre-tick, for example)
2. The primary receives the request
2.1. If the pre-tick sync data is already available (the primary node has already prepared and cached it), the primary sends the response with the data immediately. The overall time for this GetObjectsData step will be minimal.
2.2. If the pre-tick sync data is not ready yet, the request waits until the primary node prepares it. Obviously, the longer the wait, the longer the GetObjectsData block looks in the traces.
Usually, these synchronization steps pass really fast.
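To illustrate cases 2.1 and 2.2 above, here is a minimal Python sketch. It is not nDisplay code: the SyncDataCache class, its methods and the demo helper are hypothetical names, and a thread stands in for the primary node preparing the data. It only shows why the time spent in GetObjectsData depends on whether the data was already cached when the request arrived.

```python
import threading
import time


class SyncDataCache:
    """Hypothetical stand-in for the primary node's cache of one sync step."""

    def __init__(self):
        self._ready = threading.Event()
        self._data = None

    def publish(self, data):
        # Called once the primary has prepared the data for this step.
        self._data = data
        self._ready.set()

    def get_objects_data(self):
        # Called on behalf of a secondary node.
        # Case 2.1: data already cached, wait() returns immediately.
        # Case 2.2: data not ready, wait() blocks until publish() is called.
        start = time.perf_counter()
        self._ready.wait()
        waited = time.perf_counter() - start
        return self._data, waited


def demo(prepare_delay_s):
    """Request the data while the 'primary' still needs prepare_delay_s seconds."""
    cache = SyncDataCache()
    timer = threading.Timer(prepare_delay_s, cache.publish, args=({"obj": 1},))
    timer.start()
    data, waited = cache.get_objects_data()
    timer.join()
    return data, waited
```

Calling demo(0.05) returns the data together with a wait of roughly 50 ms, which is the kind of long GetObjectsData block you would see in a trace. If publish() has already been called before the request arrives, the wait is close to zero.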
A few notes about WaitForFrameStart/WaitForFrameEnd. These are barrier-based synchronization steps. That means every node sends the sync request, but each node gets a response ONLY when all other nodes have sent it as well (they have all met at the same point).
1. A node (including the primary) sends WaitForFrameStart (let’s say it’s FrameStart; FrameEnd works exactly the same way)
2. The primary’s TCP session receives the request from a node, and calls a barrier synchronization
2.1. If not all nodes have sent such a request yet, then the barrier object blocks the calling thread until the last node arrives
2.2. If all nodes have sent their WaitForFrameStart requests, then release the barrier, send responses to every caller, and therefore unblock them
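The barrier behavior described in the steps above can be sketched with Python’s threading.Barrier. Again, this is only an illustration under assumptions: run_frame_barrier and the node names are hypothetical, threads stand in for cluster nodes, and no networking is involved. It shows that the fastest node spends the longest time inside the wait, because everyone is released only when the slowest node arrives.

```python
import threading
import time


def run_frame_barrier(node_delays_s):
    """Simulate one WaitForFrameStart step.

    node_delays_s maps a (hypothetical) node name to the time it needs
    before it reaches the sync point. Returns how long each node was blocked.
    """
    barrier = threading.Barrier(len(node_delays_s))
    waited = {}
    lock = threading.Lock()

    def node(name, delay):
        time.sleep(delay)            # node-local work before the sync point
        t0 = time.perf_counter()
        barrier.wait()               # blocks until every node has called wait()
        elapsed = time.perf_counter() - t0
        with lock:
            waited[name] = elapsed

    threads = [
        threading.Thread(target=node, args=item)
        for item in node_delays_s.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return waited
```

For example, run_frame_barrier({"node_fast": 0.0, "node_slow": 0.1}) reports a wait of about 100 ms for node_fast and almost none for node_slow, so a long WaitForFrameStart block on one node usually points at a different, slower node.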
So you see, there are a lot of sources that could introduce delays at any of the synchronization steps. Some delay is possible, but in your case it looks like a real issue, because 90+ ms is too much.
The first thing that comes to mind is exactly what Alejandro mentioned earlier. We have seen such issues in the past, and all of those cases were caused by cables, routers, other non-nDisplay traffic, quotas, etc. It looks like, say, step 2.1 for GetObjectsData above (the data is ready and sent immediately), but for some reason too much time passes between Node-Master-Data-Sent and Node-Slave-Data-Received. This is something we could theoretically prove or disprove by investigating the pcap files. Unfortunately, the files you attached were captured with the ‘none’ sync policy, so I can’t find the problem in them. Yes, I see that Pipe_4 (master) runs faster than the others and Pipe_5 is a bit slower. Pipe_1 seems to be the slowest, so all those relatively small delays are caused by the slowest node. It would be perfect to catch such a 90+ ms delay in a pcap while running an ‘empty’ scene with the ‘Nvidia’ sync policy.
There is one thing that really looks weird. Look at this “5ms over 5ms” sine-like pattern. Sometimes such a pattern can be caused by other network traffic, cooling issues, CPU/GPU throttling, etc.
[Can’t attach the image here for some reason]
If I find anything else, I will let you know.