We’re seeing that cooks targeting our new on-prem ZenShared server are flip-flopping a lot between deactivating and activating due to performance criteria. Our threshold was initially set to DeactivateAt=10, but I’ve tested values as high as 30. That does seem to improve things, but I feel like an on-prem machine talking to an on-prem server on a stable network should never average over 10ms for any length of time, and the tests I’ve run suggest it doesn’t. I ran a continuous ping test on the side while a build was running: 99% of the pings were <1ms, with some outliers as high as 5ms, but none were over 10ms and certainly none over 30ms.
I was attempting to look at what the engine thinks is going on, but without diving into the debugger, it seems like ZenShared isn’t represented in the stats blurb at the end of the cook. I see ZenRemoteGetHits, ZenRemoteGetTotal, and ZenRemoteGetHitPct listed, but for latency I only see LocalLatency, CloudLatency, and SharedLatency (which was 0.0000 and which I assume refers to the legacy FileSystem shared DDC, since that wasn’t active in this test). I did look through the code and see where these are calculated, but I’m wondering if it’s possible the cook itself is starving this and generating falsely high values at times (pure speculation on my part).
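To illustrate the kind of skew I’m imagining, here’s a minimal standalone sketch (plain C++, nothing engine-specific): if each request is timestamped when it is queued but only serviced once a starved worker gets to it, the measured “latency” absorbs the queueing delay even though the server answers in about a millisecond.

```cpp
// Minimal sketch (not engine code) of the speculation above: latency measured from the
// moment a request is *queued*, rather than from when it actually hits the wire, gets
// inflated by a starved worker even though the server itself answers in ~1 ms.
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

int main()
{
    using Clock = std::chrono::steady_clock;

    // Ten requests are "issued" (timestamped) at once...
    std::vector<Clock::time_point> IssuedAt(10, Clock::now());

    // ...but a single saturated worker services them one at a time.
    for (size_t i = 0; i < IssuedAt.size(); ++i)
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(1)); // ~1 ms server response
        const double Ms =
            std::chrono::duration<double, std::milli>(Clock::now() - IssuedAt[i]).count();
        std::printf("request %zu measured latency: %.2f ms\n", i, Ms);
    }
    // The last samples report ~10 ms: queueing delay, not network latency, and enough
    // to trip a DeactivateAt=10 style threshold.
    return 0;
}
```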
The log statements in question (which I see 5-10 times throughout the cook, depending on the threshold used) are:
LogDerivedDataCache: Display: ZenShared: Performance does not meet minimum criteria. It will be deactivated until performance measurements improve. If this is consistent, consider disabling this cache store through environment variables or other configuration.
LogDerivedDataCache: Display: ZenShared: Performance has improved and meets minimum performance criteria. It will be reactivated now.
Before I go digging super deep into this, have you all seen anything like this, or do you have any potential changes in mind for how this is calculated? Thanks in advance for any insight you can provide.
Yes, I have seen this before. Specifically, what I found in the past was that the latency or “time to first byte” metrics coming out of curl would be skewed when we had a lot of requests in progress and hit our designated connection limit, forcing HTTP/1.1 requests (which lack multiplexing) to be queued until a connection freed up. What seemed to be happening was that curl would include this queued time in the timers we query to determine average latency. This was discussed with the curl developers here:
For libcurl 8.4.0 we introduced a patch to address this issue (stored as curl_starttransfer_time_workaround.patch under Engine/Source/ThirdParty/libcurl/patches/curl-8.4.0). When moving to later libcurl versions, it seemed as if the libcurl developers had addressed the issue themselves, and carrying the patch forward was challenging due to changes in the library, so it was dropped. However, we have also seen evidence of this pattern of connecting and disconnecting to zenserver due to a false perception of high latency when connections are queued.
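To make the timer behavior concrete, here is a rough sketch against plain libcurl (not our engine wrapper; the host, port, and endpoint below are invented). CURLINFO_STARTTRANSFER_TIME_T is measured from the start of the transfer, so time spent queued waiting for a free connection is folded into it; subtracting CURLINFO_PRETRANSFER_TIME_T gets closer to the real time to first byte, and newer libcurl exposes the queued time directly via CURLINFO_QUEUE_TIME_T (added around 8.6.0, if I remember correctly).

```cpp
#include <curl/curl.h>
#include <cstdio>

int main()
{
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL* Handle = curl_easy_init();
    if (Handle)
    {
        curl_easy_setopt(Handle, CURLOPT_URL, "http://zen.example:8558/health"); // hypothetical
        if (curl_easy_perform(Handle) == CURLE_OK)
        {
            curl_off_t StartTransferUs = 0;
            curl_off_t PreTransferUs = 0;
            curl_easy_getinfo(Handle, CURLINFO_STARTTRANSFER_TIME_T, &StartTransferUs);
            curl_easy_getinfo(Handle, CURLINFO_PRETRANSFER_TIME_T, &PreTransferUs);

            // Raw time to first byte vs. the same value with connect/queue time removed.
            std::printf("TTFB as reported:             %.3f ms\n", StartTransferUs / 1000.0);
            std::printf("TTFB minus pre-transfer time: %.3f ms\n",
                        (StartTransferUs - PreTransferUs) / 1000.0);

#if LIBCURL_VERSION_NUM >= 0x080600 // CURLINFO_QUEUE_TIME_T only exists in newer drops
            curl_off_t QueueUs = 0;
            curl_easy_getinfo(Handle, CURLINFO_QUEUE_TIME_T, &QueueUs);
            std::printf("time spent queued:            %.3f ms\n", QueueUs / 1000.0);
#endif
        }
        curl_easy_cleanup(Handle);
    }
    curl_global_cleanup();
    return 0;
}
```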
I think we will have to consider either changing the way we compute average latency within zenserver’s DDC client, or restoring the timing patch in newer libcurl drops. [mention removed] for visibility on this topic.
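As a purely hypothetical illustration of the first option (these names are not the engine’s API), the health check could rate the store on a windowed median and skip samples recorded while the connection pool was saturated, so a burst of queued requests can’t trip the DeactivateAt threshold on its own:

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

class FLatencyWindow // hypothetical helper, not existing engine code
{
public:
    explicit FLatencyWindow(size_t InMaxSamples) : MaxSamples(InMaxSamples) {}

    void AddSample(double LatencyMs, int InFlightRequests, int ConnectionLimit)
    {
        // Samples taken while requests were queuing behind the connection limit are
        // likely to contain wait time rather than network latency; drop them.
        if (InFlightRequests >= ConnectionLimit)
        {
            return;
        }
        Samples.push_back(LatencyMs);
        if (Samples.size() > MaxSamples)
        {
            Samples.pop_front();
        }
    }

    double MedianMs() const
    {
        if (Samples.empty())
        {
            return 0.0;
        }
        std::vector<double> Sorted(Samples.begin(), Samples.end());
        std::nth_element(Sorted.begin(), Sorted.begin() + Sorted.size() / 2, Sorted.end());
        return Sorted[Sorted.size() / 2];
    }

    // A median is insensitive to a handful of queued outliers, unlike a mean.
    bool ShouldDeactivate(double DeactivateAtMs) const
    {
        return !Samples.empty() && MedianMs() > DeactivateAtMs;
    }

private:
    size_t MaxSamples;
    std::deque<double> Samples;
};
```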
Thanks for following up on this. I don’t think the current impact to us is very severe, and we’ve gone ahead with rolling our shared zen server out to the team. It seems like it only disables for a small window before checking again. While this likely causes some missed cache pulls, it isn’t detrimental, so we’ll watch for any updates on this.