We’re seeing that cooks targeting our new on-prem ZenShared server are flip-flopping a lot between deactivating and activating due to performance criteria. Our threshold was initially set to DeactivateAt=10, but I’ve tested values as high as 30. That does seem to improve things, but I feel like an on-prem machine talking to an on-prem server on a stable network should never average over 10ms for any length of time, and the tests I’ve run suggest it doesn’t. I ran a continuous ping test on the side while a build was running: 99% of the pings were <1ms, with some outliers as high as 5ms, but none were over 10ms and certainly none over 30ms.
I was attempting to look at what the engine thinks is going on, but without diving into the debugger, it seems like ZenShared isn’t represented in the stats blurb at the end of the cook. I see ZenRemoteGetHits, ZenRemoteGetTotal, and ZenRemoteGetHitPct listed, but for latency I only see LocalLatency, CloudLatency, and SharedLatency (which was 0.0000 and which I assume refers to the legacy FileSystem shared DDC, since that wasn’t active in this test). I did look through the code and see where these are calculated, but I’m wondering if it’s possible the cook itself is starving this and generating falsely high values at times (pure speculation on my part).
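To illustrate the kind of skew I’m imagining, here’s a minimal standalone sketch (plain C++, nothing engine-specific): if each request is timestamped when it is queued but only serviced once a starved worker gets to it, the measured “latency” absorbs the queueing delay even though the server answers in about a millisecond.

```cpp
// Minimal sketch (not engine code) of the speculation above: latency measured from the
// moment a request is *queued*, rather than from when it actually hits the wire, gets
// inflated by a starved worker even though the server itself answers in ~1 ms.
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

int main()
{
    using Clock = std::chrono::steady_clock;

    // Ten requests are "issued" (timestamped) at once...
    std::vector<Clock::time_point> IssuedAt(10, Clock::now());

    // ...but a single saturated worker services them one at a time.
    for (size_t i = 0; i < IssuedAt.size(); ++i)
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(1)); // ~1 ms server response
        const double Ms =
            std::chrono::duration<double, std::milli>(Clock::now() - IssuedAt[i]).count();
        std::printf("request %zu measured latency: %.2f ms\n", i, Ms);
    }
    // The last samples report ~10 ms: queueing delay, not network latency, and enough
    // to trip a DeactivateAt=10 style threshold.
    return 0;
}
```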
The log statements in question (which I see 5-10 times throughout the cook, depending on the threshold used) are:
LogDerivedDataCache: Display: ZenShared: Performance does not meet minimum criteria. It will be deactivated until performance measurements improve. If this is consistent, consider disabling this cache store through environment variables or other configuration.
LogDerivedDataCache: Display: ZenShared: Performance has improved and meets minimum performance criteria. It will be reactivated now.
Before I go digging super deep into this, have you all seen anything like this, or do you have any potential changes in mind for how this is calculated? Thanks in advance for any insight you can provide.
Yes, I have seen this before. Specifically, what I found in the past was that the latency or “time to first byte” metrics coming out of curl would be skewed when we had a lot of requests in progress and hit our designated connection limit, forcing HTTP/1.1 requests (which lack multiplexing) to be queued until a connection freed up. What seemed to be happening was that curl would include this queued time in the timers we query to determine average latency. This was discussed with the curl developers here:
For libcurl 8.4.0 we introduced a patch to address this issue (stored as curl_starttransfer_time_workaround.patch under Engine/Source/ThirdParty/libcurl/patches/curl-8.4.0). When moving to later libcurl versions, it seemed as if the libcurl developers had addressed the issue themselves, and carrying the patch forward was challenging due to changes in the library, so it was dropped. However, we have also seen evidence of this pattern of connecting and disconnecting to zenserver due to a false perception of high latency when connections are queued.
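To make the timer behavior concrete, here is a rough sketch against plain libcurl (not our engine wrapper; the host, port, and endpoint below are invented). CURLINFO_STARTTRANSFER_TIME_T is measured from the start of the transfer, so time spent queued waiting for a free connection is folded into it; subtracting CURLINFO_PRETRANSFER_TIME_T gets closer to the real time to first byte, and newer libcurl exposes the queued time directly via CURLINFO_QUEUE_TIME_T (added around 8.6.0, if I remember correctly).

```cpp
#include <curl/curl.h>
#include <cstdio>

int main()
{
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL* Handle = curl_easy_init();
    if (Handle)
    {
        curl_easy_setopt(Handle, CURLOPT_URL, "http://zen.example:8558/health"); // hypothetical
        if (curl_easy_perform(Handle) == CURLE_OK)
        {
            curl_off_t StartTransferUs = 0;
            curl_off_t PreTransferUs = 0;
            curl_easy_getinfo(Handle, CURLINFO_STARTTRANSFER_TIME_T, &StartTransferUs);
            curl_easy_getinfo(Handle, CURLINFO_PRETRANSFER_TIME_T, &PreTransferUs);

            // Raw time to first byte vs. the same value with connect/queue time removed.
            std::printf("TTFB as reported:             %.3f ms\n", StartTransferUs / 1000.0);
            std::printf("TTFB minus pre-transfer time: %.3f ms\n",
                        (StartTransferUs - PreTransferUs) / 1000.0);

#if LIBCURL_VERSION_NUM >= 0x080600 // CURLINFO_QUEUE_TIME_T only exists in newer drops
            curl_off_t QueueUs = 0;
            curl_easy_getinfo(Handle, CURLINFO_QUEUE_TIME_T, &QueueUs);
            std::printf("time spent queued:            %.3f ms\n", QueueUs / 1000.0);
#endif
        }
        curl_easy_cleanup(Handle);
    }
    curl_global_cleanup();
    return 0;
}
```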
I think we will have to consider either changing the way we compute average latency within zenserver’s DDC client, or restoring the timing patch in newer libcurl drops. [mention removed] for visibility on this topic.
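As a purely hypothetical illustration of the first option (these names are not the engine’s API), the health check could rate the store on a windowed median and skip samples recorded while the connection pool was saturated, so a burst of queued requests can’t trip the DeactivateAt threshold on its own:

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

class FLatencyWindow // hypothetical helper, not existing engine code
{
public:
    explicit FLatencyWindow(size_t InMaxSamples) : MaxSamples(InMaxSamples) {}

    void AddSample(double LatencyMs, int InFlightRequests, int ConnectionLimit)
    {
        // Samples taken while requests were queuing behind the connection limit are
        // likely to contain wait time rather than network latency; drop them.
        if (InFlightRequests >= ConnectionLimit)
        {
            return;
        }
        Samples.push_back(LatencyMs);
        if (Samples.size() > MaxSamples)
        {
            Samples.pop_front();
        }
    }

    double MedianMs() const
    {
        if (Samples.empty())
        {
            return 0.0;
        }
        std::vector<double> Sorted(Samples.begin(), Samples.end());
        std::nth_element(Sorted.begin(), Sorted.begin() + Sorted.size() / 2, Sorted.end());
        return Sorted[Sorted.size() / 2];
    }

    // A median is insensitive to a handful of queued outliers, unlike a mean.
    bool ShouldDeactivate(double DeactivateAtMs) const
    {
        return !Samples.empty() && MedianMs() > DeactivateAtMs;
    }

private:
    size_t MaxSamples;
    std::deque<double> Samples;
};
```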
Thanks for following up on this. I don’t think the current impact to us is very severe, and we’ve gone ahead with rolling our shared zen server out to the team. It seems like it only disables for a small window before checking again. While this likely causes some missed cache pulls, it isn’t detrimental, so we’ll watch for any updates on this.