UE 5.6 - cook freezes after multiple HTTP timeouts and waits forever on DDC under UNiagaraSystem::WaitForCompilationComplete()

Cooking without Cloud DDC completes without any deadlock.

It seems that pushing new data to Cloud DDC can sometimes trigger HTTP timeout errors, and when they happen, the cook can randomly freeze while UNiagaraSystem::BeginCacheForCookedPlatformData() needs to process new DDC data.

HTTP GET, PUT, and POST requests all report 0 bytes received and time out at a little over 3 seconds:

[2025.06.19-18.04.21:837][ 0]LogDerivedDataCache: Display: HTTP: POST https://server.to.cloudddc.com/api/v1/refs/namespace/fshaderjobcacheshaders/078fced047dfa6320b9d110b87df5aed66232eb9/finalize/d22b389797c8f2afc16faf183fe5d1ad6429348b: Operation timed out after 3181 milliseconds with 0 bytes received (3.181 seconds 0.000|0.000|0.000|0.000) Content type ‘*/*’ of size 0

[2025.06.19-18.04.21:849][ 0]LogDerivedDataCache: Display: HTTP: GET https://server.to.cloudddc.com/api/v1/refs/namespace/fshaderjobcacheshaders/2fe8dbf4eb530e1769a5bc33442e290200c13943: Operation timed out after 3181 milliseconds with 0 bytes received (3.181 seconds 0.000|0.000|0.000|0.000) Content type ‘*/*’ of size 0

[2025.06.19-18.04.21:915][ 0]LogDerivedDataCache: Display: HTTP: PUT https://server.to.cloudddc.com/api/v1/compressed-blobs/namespace/e38d5e21733f081aec661365ce3073d2ff11d1b7: Operation timed out after 3169 milliseconds with 0 bytes received (3.169 seconds 0.000|0.000|0.000|0.000) Content type ‘*/*’ of size 0

[2025.06.19-18.04.21:939][ 0]LogDerivedDataCache: Display: HTTP: GET https://server.to.cloudddc.com/api/v1/refs/namespace/fshaderjobcacheshaders/0f3bbfabd8414ab47122dee1c3b1112e4196a452: Operation timed out after 3190 milliseconds with 0 bytes received (3.191 seconds 0.000|0.000|0.000|0.000) Content type ‘*/*’ of size 0

[2025.06.19-18.04.21:941][ 0]LogDerivedDataCache: Display: HTTP: GET https://server.to.cloudddc.com/api/v1/refs/namespace/fshaderjobcacheshaders/9067562de0b9e308e28d3fa8eca071c4dd2a36a4: Operation timed out after 3190 milliseconds with 0 bytes received (3.190 seconds 0.000|0.000|0.000|0.000) Content type ‘*/*’ of size 0

[2025.06.19-18.04.21:957][ 0]LogDerivedDataCache: Display: HTTP: GET https://server.to.cloudddc.com/api/v1/refs/namespace/fshaderjobcacheshaders/77c2b67e0f8176dcff6cfcd120e4804393c47039: Operation timed out after 3176 milliseconds with 0 bytes received (3.177 seconds 0.000|0.000|0.000|0.000) Content type ‘*/*’ of size 0

[2025.06.19-18.04.21:959][ 0]LogDerivedDataCache: Display: HTTP: GET https://server.to.cloudddc.com/api/v1/refs/namespace/fshaderjobcacheshaders/a40ac9abbaf4db18766267e4af1021c87644f62c: Operation timed out after 3175 milliseconds with 0 bytes received (3.176 seconds 0.000|0.000|0.000|0.000) Content type ‘*/*’ of size 0

[2025.06.19-18.04.21:961][ 0]LogDerivedDataCache: Display: HTTP: PUT https://server.to.cloudddc.com/api/v1/refs/namespace/fshaderjobcacheshaders/f6de3802e4426947d3446c58a999a6b8e3f07434: Operation timed out after 3168 milliseconds with 0 bytes received (3.168 seconds 0.000|0.000|0.000|0.000) Content type ‘*/*’ of size 0

[2025.06.19-18.04.22:108][ 0]LogDerivedDataCache: Display: HTTP: GET https://server.to.cloudddc.com/api/v1/refs/namespace/fshaderjobcacheshaders/364a2da58fe755bedd7643■■■1472152635fff9a: Operation timed out after 3168 milliseconds with 0 bytes received (3.169 seconds 0.000|0.000|0.000|0.000) Content type ‘*/*’ of size 0

[2025.06.19-18.04.22:126][ 0]LogDerivedDataCache: Display: HTTP: POST https://server.to.cloudddc.com/api/v1/refs/namespace/fshaderjobcacheshaders/f4412aaa6fd8a87e14d1743eda5c5ce40f4e6911/finalize/9db41c6543ee2d45d788913883a9f11f52583e7d: Operation timed out after 3163 milliseconds with 0 bytes received (3.164 seconds 0.000|0.000|0.000|0.000) Content type ‘*/*’ of size 0

This did not happen with UE 5.5.

The Cloud DDC is on version 1.2.0, as seen at https://github.com/orgs/EpicGames/packages/container/package/unreal-cloud-ddc

Looking at the Parallel Stacks, the cook seems to be waiting for a single DDC request to finish, but nothing else is running on any other thread. That looks like a race condition.

Any help is appreciated.

Thanks,

  • JLP

Steps to Reproduce
* Launch a cook using Zen Local as the DDC together with Cloud DDC

* While cooking, HTTP timeouts like the following appear at some point:

[2025.06.19-18.04.21:837][ 0]LogDerivedDataCache: Display: HTTP: POST https://server.to.cloudddc.com/api/v1/refs/namespace/fshaderjobcacheshaders/078fced047dfa6320b9d110b87df5aed66232eb9/finalize/d22b389797c8f2afc16faf183fe5d1ad6429348b: Operation timed out after 3181 milliseconds with 0 bytes received (3.181 seconds 0.000|0.000|0.000|0.000) Content type ‘*/*’ of size 0

* UNiagaraSystem::BeginCacheForCookedPlatformData() is called, some of these HTTP timeouts appear, and the cook ultimately gets stuck in FNiagaraActiveCompilationDefault::QueryCompileComplete() and FNiagaraAsyncCompileTask::CheckDDCResult()

Hi Jean-Luc, have you seen the timeouts affect other data types too, even if they don’t lead to a hang? The data type is in the URL in the log right after /namespace/. The examples that you gave are for shaders, which could have been requested in association with the UNiagaraSystem.

Timeouts in Cloud DDC are usually due to the limits set in FHttpCacheStore::GetDefaultClientParams().

  1. DnsCacheTimeout and ConnectTimeout impact the initial connection. Based on the log that you shared, the ConnectTimeout is likely what you are hitting. We don’t usually see connection timeouts at Epic. I can direct you to a different engineer to share ideas on how to diagnose connection issues if these are the majority of your timeouts.
  2. LowSpeedLimit and LowSpeedTime impact how we time out while waiting for the response. The default is to time out if the average speed is less than 1024 bytes/s over a period of 10 seconds. This is typically how we time out at Epic. If this is what you're seeing, then aside from network issues, this normally happens for us when there are many requests in flight at once. We have work scheduled to limit the number of concurrent requests to minimize these timeouts. (A sketch of how these parameters map onto libcurl options follows this list.)
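
To make the two timeout mechanisms concrete, here is a minimal sketch of how client parameters like these typically map onto libcurl options. The FClientParams struct, its default values, and ApplyClientParams are illustrative stand-ins, not the actual FHttpCacheStore::GetDefaultClientParams() implementation; only the libcurl option names are real.

```cpp
// Illustrative only: shows how connect and low-speed limits map onto libcurl.
// The struct and the default values are assumptions for this sketch, not the
// real FHttpCacheStore::GetDefaultClientParams() definition.
#include <curl/curl.h>

struct FClientParams
{
    long ConnectTimeoutMs = 3000;   // budget for establishing the connection (ms); assumed value
    long DnsCacheTimeoutS = 300;    // how long resolved hosts stay cached (s); assumed value
    long LowSpeedLimit    = 1024;   // bytes/s threshold for a stalled transfer
    long LowSpeedTime     = 10;     // seconds below the threshold before timing out
};

static void ApplyClientParams(CURL* Handle, const FClientParams& Params)
{
    // Fail the request if the TCP/TLS connection is not established in time.
    curl_easy_setopt(Handle, CURLOPT_CONNECTTIMEOUT_MS, Params.ConnectTimeoutMs);
    // Keep DNS results cached to avoid repeated lookups.
    curl_easy_setopt(Handle, CURLOPT_DNS_CACHE_TIMEOUT, Params.DnsCacheTimeoutS);
    // Abort if the average transfer speed stays below LowSpeedLimit bytes/s
    // for LowSpeedTime consecutive seconds.
    curl_easy_setopt(Handle, CURLOPT_LOW_SPEED_LIMIT, Params.LowSpeedLimit);
    curl_easy_setopt(Handle, CURLOPT_LOW_SPEED_TIME, Params.LowSpeedTime);
}
```

Raising these values is the kind of change asked about later in the thread; the authoritative defaults live in FHttpCacheStore::GetDefaultClientParams().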

> Actually looking more into this stack, it isn’t stuck waiting on the network response.

The stack that you shared yesterday is waiting to get an IHttpRequest from the queue, which is basically a pool of requests. This is a get request, which means the queue size is 128. If there are no available requests in the queue, it means there are 128 incomplete requests. Aside from the small window of time between taking a request from the queue and sending the request to the server, the requests are going to be waiting for libcurl to say they are finished, whether successfully or due to a timeout. You’re correct that this stack is entirely about the local client, but it can only be here because we are waiting for the server. The connection pool thread is the one that fires completion callbacks for the request/response, and it would be good to confirm that thread isn’t blocked on something.
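
To illustrate why a caller can sit in a local wait even though the root cause is on the server side, here is a minimal sketch of a bounded request pool like the one described above; the class name and the capacity are stand-ins, not the actual FHttpRequestQueue code.

```cpp
// Sketch of a bounded request pool: callers block when every slot is in
// flight and are released only when a request completes. Hypothetical names.
#include <condition_variable>
#include <cstdint>
#include <mutex>

class FSketchRequestPool
{
public:
    explicit FSketchRequestPool(int32_t InCapacity) : Available(InCapacity) {}

    // Blocks until a slot is free. With a 128-slot pool, waiting here means
    // there are 128 incomplete requests outstanding.
    void AcquireSlot()
    {
        std::unique_lock<std::mutex> Lock(Mutex);
        SlotFreed.wait(Lock, [this] { return Available > 0; });
        --Available;
    }

    // Called when libcurl reports a request as finished, whether it succeeded
    // or timed out; this is what unblocks waiting callers.
    void ReleaseSlot()
    {
        {
            std::lock_guard<std::mutex> Lock(Mutex);
            ++Available;
        }
        SlotFreed.notify_one();
    }

private:
    std::mutex Mutex;
    std::condition_variable SlotFreed;
    int32_t Available;
};
```

The key property is that ReleaseSlot is driven by the completion callbacks on the connection pool thread, so if that thread ever stops pumping, waiters never wake up, which matches the behavior discussed later in the thread.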

> Is it guaranteed that multiple wide requests won’t have the same FRequestOwnerShared?

The caller can put as many requests into an owner as they want to. Waiting on the owner will wait on each request that is currently active in the owner.

> Well I guess it depends on if the object is really destroyed in End.

The requests are reference-counted. The thread calling FRequestOwner::Wait holds a reference (LocalRequest) and that reference is created under a write lock. With FQueueRequest, when we call Wait in the scenario that you described, it’ll call OnComplete.Wait(). If OnComplete.Notify() has already been called, as in your example, it’ll return immediately.
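
A small sketch of the Wait/Notify contract described above, assuming manual-reset event semantics; the class is a stand-in for illustration, not UE::FManualResetEvent itself.

```cpp
// Stand-in manual-reset event illustrating the contract above: if Notify()
// has already been called, Wait() returns immediately. Hypothetical type.
#include <atomic>
#include <condition_variable>
#include <mutex>

class FSketchManualResetEvent
{
public:
    void Notify()
    {
        {
            std::lock_guard<std::mutex> Lock(Mutex);
            bSignaled.store(true, std::memory_order_release);
        }
        Woken.notify_all();
    }

    void Wait()
    {
        // Fast path: already signaled, return immediately.
        if (bSignaled.load(std::memory_order_acquire))
        {
            return;
        }
        // Slow path: re-check under the lock so a Notify() racing with this
        // Wait() cannot be missed between the check and the sleep.
        std::unique_lock<std::mutex> Lock(Mutex);
        Woken.wait(Lock, [this] { return bSignaled.load(std::memory_order_acquire); });
    }

private:
    std::atomic<bool> bSignaled{false};
    std::mutex Mutex;
    std::condition_variable Woken;
};
```

The engine version parks on ParkingLot rather than a condition variable, as the stack later in the thread shows, and the CanWait predicate passed to ParkingLot::WaitUntil plays the role of the re-check in this sketch.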

> The FQueueRequest I don’t believe is ever legitimately waited on.

It is waited on in exactly the scenario that you shared a stack for. All of this can be async but the stack is coming through the pre-5.0 GetSynchronous API which waits on the owner. That’ll sit in the wait loop, with many requests beginning and ending, until the overall cache get operation is complete. If the request queue is saturated at any point in that, it’ll wait on the FQueueRequest exactly like this.

Sam, do you have a reproduction that can be shared? If not, would you be willing to try a version of curl newer than 8.12.1 with your repro to see if the problem still occurs?

So far I’m seeing it with fshaderjobcacheshaders, compressed-blobs (before the namespace), legacyniagarascriptderiveddata, materialshadermap.

When these appear, it does not mean the cook will freeze and hang immediately; they may keep appearing for a while before a freeze happens.

I have turned on Verbose logging for LogDerivedDataCache, and I did notice that I have quite high latency:

[2025.06.20-17.44.03:511][ 0]LogDerivedDataCache: Verbose: HTTP: PUT https://server.to.cloudddc.com/api/v1/refs/namespace/legacycard/f7327d22c8d83e84410205dfcc5fc41698d6cd54 -> OK (200) (sent 42 bytes, received 12 bytes, 2.722 seconds 2.655|0.000|0.000|2.722)

[2025.06.20-17.44.03:516][ 0]LogDerivedDataCache: Verbose: HTTP: PUT https://server.to.cloudddc.com/api/v1/refs/namespace/legacycard/a8fbc31aa21f48700b065babdf1b1dfd15173e72 -> OK (200) (sent 42 bytes, received 12 bytes, 1.825 seconds 1.765|0.000|0.000|1.825)

[2025.06.20-17.44.03:516][ 0]LogDerivedDataCache: Verbose: HTTP: GET https://server.to.cloudddc.com/api/v1/refs/namespace/materialshadermap/5daf50871105b8d44199ad1a20268a45d87d351e -> OK (200) (received 965 bytes, 2.694 seconds 2.664|0.000|0.000|2.694)

[2025.06.20-17.44.03:522][ 0]LogDerivedDataCache: Verbose: HTTP: GET https://server.to.cloudddc.com/api/v1/refs/namespace/materialshadermap/5daf50871105b8d44199ad1a20268a45d87d351e -> OK (200) (received 965 bytes, 0.413 seconds 0.383|0.000|0.000|0.413)

[2025.06.20-17.44.03:529][ 0]LogDerivedDataCache: Verbose: HTTP: PUT https://server.to.cloudddc.com/api/v1/compressed-blobs/namespace/6f5195be46a27a1912ceb520f001e07da905a828 -> OK (200) (sent 449 bytes, received 57 bytes, 2.497 seconds 2.423|0.000|0.000|2.497)

[2025.06.20-17.44.03:535][ 0]LogDerivedDataCache: Verbose: HTTP: PUT https://server.to.cloudddc.com/api/v1/refs/namespace/legacycard/30f56f6138b003afea41c0076a2673753a2e5de0 -> OK (200) (sent 42 bytes, received 12 bytes, 1.156 seconds 1.099|0.000|0.000|1.156)

[2025.06.20-17.44.03:542][ 0]LogDerivedDataCache: Verbose: HTTP: PUT https://server.to.cloudddc.com/api/v1/refs/namespace/legacycard/f63efbf470427d3b4024f574fe52b6260f02b09a -> OK (200) (sent 42 bytes, received 54 bytes, 1.169 seconds 1.092|0.000|0.000|1.169)

[2025.06.20-17.44.03:546][ 0]LogDerivedDataCache: Verbose: HTTP: GET https://server.to.cloudddc.com/api/v1/refs/namespace/materialshadermap/5daf50871105b8d44199ad1a20268a45d87d351e -> OK (200) (received 965 bytes, 1.140 seconds 1.110|0.000|0.000|1.140)

[2025.06.20-17.44.03:548][ 0]LogDerivedDataCache: Verbose: HTTP: PUT https://server.to.cloudddc.com/api/v1/refs/namespace/legacydist/f8cc311e7866bc3e976a0dfae40db95caa1c3c89 -> OK (200) (sent 44 bytes, received 12 bytes, 1.152 seconds 1.093|0.000|0.000|1.152)

Can we raise the timeout to something much higher, like 30 seconds? I don't know where to change this.

So the cases that do error are hitting the ConnectTimeout, but neither of these explains a full-on deadlock, as the timeouts should kick in. Luke was able to hit this, and the stuck thread is here:

```
UnrealEditor-Core.dll!UE::HAL::Private::FMicrosoftPlatformManualResetEvent::WaitUntil(UE::FMonotonicTimePoint WaitTime) Line 28 C++
UnrealEditor-Core.dll!UE::ParkingLot::Private::WaitUntil(const void * Address, const TFunctionRef<bool __cdecl(void)> & CanWait, const TFunctionRef<void __cdecl(void)> & BeforeWait, UE::FMonotonicTimePoint WaitTime) Line 548 C++
[Inline Frame] UnrealEditor-DerivedDataCache.dll!UE::ParkingLot::WaitUntil(const void *) Line 81 C++
UnrealEditor-DerivedDataCache.dll!UE::FManualResetEvent::WaitUntil(UE::FMonotonicTimePoint WaitTime) Line 71 C++
[Inline Frame] UnrealEditor-DerivedDataCache.dll!UE::FManualResetEvent::Wait() Line 35 C++
UnrealEditor-DerivedDataCache.dll!UE::DerivedData::FHttpRequestQueue::FQueueRequest::Wait() Line 61 C++
UnrealEditor-DerivedDataCache.dll!UE::DerivedData::Private::FRequestOwnerShared::Wait() Line 230 C++
UnrealEditor-DerivedDataCache.dll!UE::DerivedData::Private::FLegacyFetchOrBuildTask::BeginGet() Line 495 C++
[Inline Frame] UnrealEditor-DerivedDataCache.dll!UE::DerivedData::Private::FLegacyFetchOrBuildTask::StartAsync() Line 407 C++
[Inline Frame] UnrealEditor-DerivedDataCache.dll!UE::DerivedData::Private::FLegacyFetchOrBuildTask::ExecuteSync() Line 434 C++
UnrealEditor-DerivedDataCache.dll!UE::DerivedData::Private::FDerivedDataCache::GetSynchronousByKey<TArray<unsigned char,TSizedDefaultAllocator<32>>>(const wchar_t * CacheKey, TArray<unsigned char,TSizedDefaultAllocator<32>> & OutData, TStringView<wchar_t> DebugContext) Line 804 C++
UnrealEditor-DerivedDataCache.dll!UE::DerivedData::Private::FDerivedDataCache::GetSynchronous(const wchar_t * CacheKey, TArray<unsigned char,TSizedDefaultAllocator<32>> & OutData, TStringView<wchar_t> DebugContext) Line 812 C++
UnrealEditor-NiagaraShader.dll!FNiagaraShaderMap::LoadFromDerivedDataCache(const FNiagaraShaderScript * Script, const FNiagaraShaderMapId & ShaderMapId, EShaderPlatform Platform, TRefCountPtr & InOutShaderMap) Line 545 C++
```

We've seen similar behavior against other similar engine tech (with which I'm sure you're familiar) where these large servers failed to enable PMTU black hole detection, which causes the server to start failing to send large payload packets (and retrying indefinitely) due to misconfigured firewalls on the route. That said, I'm assuming the LowSpeedLimit/LowSpeedTime check is being made on another thread that can signal the stuck thread.

I also see similar params set in FZenCacheStore::Initialize and FZenStoreHttpClient::InitCommon with

ReadinessClientParams.LowSpeedLimit = 1;

ReadinessClientParams.LowSpeedTime = 5; // 5 second idle time limit for the initial readiness check

or

ClientParams.LowSpeedLimit = 1;

ClientParams.LowSpeedTime = 25;

respectively.

but I don't know where these settings are used (does Cloud DDC use Zen?). I can't immediately find where LowSpeedLimit/LowSpeedTime is used. Is it on the Zen side or the cloud service side?

It could still be a memory stomp or race in the client, but it does feel like these timeouts aren't functioning correctly (outside of the black hole case; I don't know if a black hole scenario would actually return control to the service to handle these timeouts if they aren't client-side, since the behavior is OS- and implementation-dependent).

Actually, looking more into this stack, it isn't stuck waiting on the network response. It's stuck trying to queue up the network request to be sent. So this is entirely a local client issue.

The only suspicious thing that stands out to me is that there is a gap between the writes to FRequestOwnerShared::Requests and FRequestOwnerShared::Wait, which grabs Requests.Last. Is it guaranteed that multiple wide requests won't have the same FRequestOwnerShared? Because as I read this, it isn't guaranteed that the caller that made the request waits on that same request afterward; it could grab the short-lived queue request after the OnComplete notify has finished but before it was erased from the Requests list.

This doesn't seem particularly safe to me. Calls on one thread to FHttpRequestQueue::CreateRequestAsync will quickly add and remove an entry from FRequestOwnerShared::Requests; if that can run while FRequestOwnerShared::Wait runs on another thread for the same shared owner, that's bad news. Specifically, CreateRequestAsync doesn't keep the owner write lock for the entire period between FRequestOwnerShared::Begin and FRequestOwnerShared::End, and that isn't safe with a wide .Last call (a simplified sketch of the interleaving follows).
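
A minimal, compilable sketch of the suspected interleaving, using hypothetical simplified types; it only illustrates why sampling Requests.Last outside the window covered by the write lock is dangerous, and it is not the actual FRequestOwnerShared or FHttpRequestQueue code.

```cpp
// Hypothetical simplification of the suspected race. Thread A (the create
// path) briefly adds and removes a short-lived queue request; Thread B (the
// waiting caller) samples the last element of Requests and waits on it.
#include <algorithm>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <vector>

struct FSketchRequest
{
    void Wait() {}  // stand-in for FQueueRequest::Wait
};

struct FSketchOwner
{
    std::shared_mutex Lock;
    std::vector<std::shared_ptr<FSketchRequest>> Requests;

    // Thread A: the write lock is taken separately in Begin and End, so the
    // short-lived queue request is visible to other threads in between.
    void Begin(std::shared_ptr<FSketchRequest> Request)
    {
        std::unique_lock<std::shared_mutex> Guard(Lock);
        Requests.push_back(std::move(Request));
    }
    void End(const std::shared_ptr<FSketchRequest>& Request)
    {
        std::unique_lock<std::shared_mutex> Guard(Lock);
        auto It = std::find(Requests.begin(), Requests.end(), Request);
        if (It != Requests.end())
        {
            Requests.erase(It);
        }
    }

    // Thread B: samples the last request under the lock, then waits on it
    // after dropping the lock. If the sampled request is A's transient queue
    // request, B ends up waiting on something that was never meant to be
    // waited on by this caller.
    void Wait()
    {
        std::shared_ptr<FSketchRequest> Local;
        {
            std::unique_lock<std::shared_mutex> Guard(Lock);
            if (!Requests.empty())
            {
                Local = Requests.back();
            }
        }
        if (Local)
        {
            Local->Wait();
        }
    }
};
```

Holding the write lock across the whole Begin/End window in CreateRequestAsync, as proposed below, removes the gap in which Wait can observe the transient entry.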

I added a thread access detector to validate this assumption and we were able to hit this race, so I think it's fairly likely that this is it. (Edit: Well, I guess it depends on whether the object is really destroyed in End. I thought you could get Begin (thread 1) -> .Last (thread 2, passing the null check) -> End (thread 1) -> OnComplete notify (thread 1) -> .Wait (thread 2, before GC or dangling, whichever), which is a very tight race. In theory that pointer can still point to garbage even if it is cleaned up immediately.)

We will likely move ahead with adding an accessor for that RW lock and holding the write lock for the entirety of CreateRequestAsync.

Indeed, this part of the stack should be impossible if that weren't happening. I don't believe the FQueueRequest is ever legitimately waited on.

UnrealEditor-DerivedDataCache.dll!UE::DerivedData::FHttpRequestQueue::FQueueRequest::Wait() Line 61 C++

UnrealEditor-DerivedDataCache.dll!UE::DerivedData::Private::FRequestOwnerShared::Wait() Line 230 C++

I see your case where TryCreateRequest fails to find an instance in the RequestPool and we can legitimately wait on this, and you're right that the ref count should keep the instance alive in my scenario so it wouldn't be dangling. But the waiting thread could get past the atomic State check in FManualResetEvent::WaitUntil before OnComplete.Notify() fires; the atomic alone wouldn't save that check, and it can get well into ParkingLot::WaitUntil before Notify is called. Does ParkingLot::WaitUntil and ParkingLot::WakeAll being called at nearly the same time still guarantee that the WaitUntil will pass? I also notice we don't have the change for WaitUntil to park threads based on how the ParkingLot was invoked in the parking lot code, but I assume that won't affect this case since it's WaitUntil being called initially.

I'm also struggling to find where TryCreateRequest/TryGiveRequestToQueue is ever re-attempted except when a new FHttpRequestQueue::CreateRequestAsync comes in. If the last one comes in when the queue is full, and then the queue processes through its HTTP requests but a new request never comes in (say, if further requests are waiting on this one to complete), how does that last FQueueRequest ever complete? (Edit: Never mind, I see this is done in the OnDestroyRequest callback.)

Well, I hit this again, but I don't think it is either case above. RequestPoolCount is 0 in the curl client. Even more suspiciously, curl_multi_info_read in the FCurlHttpConnectionPool::ThreadLoop pump is reporting that there are no in-flight messages. So the HTTP requests that were in the queue supposedly finished but never got to CompleteRequest->…->FCurlHttpClient::DeleteRequest. I wonder whether, when an HTTP request errors or times out, it actually returns CURLMSG_DONE, since that is the only case that calls CompleteRequest.
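
For context, here is a minimal sketch of the kind of multi-interface pump being described; the loop is a simplified stand-in for FCurlHttpConnectionPool::ThreadLoop rather than the engine code, but the libcurl calls are the standard API. In libcurl, a transfer that errors or times out is still delivered through curl_multi_info_read as CURLMSG_DONE, with the failure carried in data.result (for example CURLE_OPERATION_TIMEDOUT).

```cpp
// Simplified stand-in for a connection pool pump thread; not engine code.
#include <atomic>
#include <curl/curl.h>

static void PumpMulti(CURLM* Multi, const std::atomic<bool>& bKeepRunning)
{
    while (bKeepRunning.load())
    {
        int StillRunning = 0;
        // Drive all transfers forward. This is where the reported hang sits
        // (stuck inside curl_multi_perform, down in curl_splay).
        curl_multi_perform(Multi, &StillRunning);

        // Drain completion messages. Timeouts and errors are still reported
        // as CURLMSG_DONE; the failure is carried in Message->data.result.
        int MsgsInQueue = 0;
        while (CURLMsg* Message = curl_multi_info_read(Multi, &MsgsInQueue))
        {
            if (Message->msg == CURLMSG_DONE)
            {
                CURL* Easy = Message->easy_handle;
                const CURLcode Result = Message->data.result;  // e.g. CURLE_OPERATION_TIMEDOUT
                curl_multi_remove_handle(Multi, Easy);
                (void)Result;
                // CompleteRequest(Easy, Result);  // hand the result back to the client (hypothetical)
            }
        }

        // Wait for socket activity or a timeout before pumping again.
        curl_multi_poll(Multi, nullptr, 0, 100 /*timeout ms*/, nullptr);
    }
}
```

So if requests disappeared from the multi handle without ever producing a CURLMSG_DONE, or the pump itself stopped (as found next in the thread), the client-side completion path would never run.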

Another possibility is that it hit FHttpOperationReceiver::ShouldRetry and got a 429 Too Many Requests response, which will bypass the reset in OnComplete, but only after curl_multi_remove_handle is called. I need to see where that retry actually happens. (Edit: I see it is handled for async requests in FAsyncHttpOperationReceiver::OnComplete. Sync requests look like they never retry and just languish, but it doesn't look like we ever make a non-async request except against localhost and for the security token.)

So the HttpConnectionPool thread is actually stuck in curl_multi_perform (specifically in a for loop in curl_splay, way down the curl stack), so it is never pumping; I was previously looking at a different connection pool's pump. It appears to be a thread-safety issue with the curl tree, as this is looping through the curl tree infinitely.

This appears to be the same as https://github.com/curl/curl/issues/1360. It does look like a threading stomp and seems to always happen after a timeout case. The only thing special about that case is that you would be re-adding the same request to curl using curl_multi_add_handle, but in theory all the operations shouldn't be in contention with the connection pool since they are being called by it. I've confirmed that it doesn't deadlock on the very next pump of curl_multi_perform; it ends up pumping a couple dozen more times after the timeout case before it deadlocks. I have seen some other threads where this was a problem in older versions of curl, just from bugs. I do notice libcurl was updated by six versions in this release, and it's a fairly new release from this year, so it's also possible they introduced it. I've managed to get the timeout and queue size just right to reproduce this consistently. I'll see if I can get it against the old 8.4.0 version.

[mention removed] for visibility. I note that curl 8.12.1 is quite recent and has already had some hotfixes. Given how new it is and that the latest hotfix was within the last few months, I'm now more suspicious that this is an issue in the library. Do we know if there are any other dependencies in 5.6 on this latest version of curl? I assume the engine should still work fine against 8.4.0.

Edit: My 100% repro case has gone away with 8.4.0. If no other dependencies exist, we may roll back to 8.4.0 for now to verify it across the team, as several people have hit this at fairly frequent intervals. It is possibly related to this similar-sounding cause, https://github.com/curl/curl/issues/17139, also found in 8.12.1 (though the behavior is different), or to a prior regression found here: https://github.com/curl/curl/issues/15639 (though that one was supposedly fixed before 8.12.0). It could also be downstream of what was fixed in curl/curl PR #16588, "multi: start the loop over when handles are removed", which looks like it could access bogus memory.

Thanks for the updates, Sam! It is good to know that the issue doesn’t repro with 8.4.0. I don’t know if we have a dependency on anything in the newer versions. We’ll review the Cloud DDC client for possible threading issues too, though it already has a dedicated thread for curl calls and has otherwise been very stable.

Hi, this is a request for testing, if you're still holding off on taking 8.12.1 because of this problem.

We recently updated to 8.15.0 and encountered a similar hang, and we confirmed by debugging a repro case on the farm that it was caused by corrupt pointers in curl's Curl_multi->timetree data structure. Since the curl developers report that this is only expected under unsynchronized multithreaded access, we have written some instrumentation into our copy of the curl library to detect multithreaded access and give a warning with callstacks when it occurs. We haven't hit a repro case yet in local testing, and we are about to deploy it to our farm for more testing over the next few weeks.
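
For anyone curious, here is a minimal sketch of what that kind of thread-access detector can look like: record which thread is inside code that manipulates the multi handle and warn if a second thread enters concurrently. The class and names are illustrative, not the actual instrumentation patch.

```cpp
// Illustrative detector for unsynchronized concurrent access to a structure
// that is expected to be used from one thread at a time (e.g. a Curl_multi).
// Hypothetical names; the real patch would also capture both callstacks.
#include <atomic>
#include <cstdio>
#include <thread>

class FSketchAccessDetector
{
public:
    // Call at the start of any guarded function.
    void Enter()
    {
        const std::thread::id Self = std::this_thread::get_id();
        std::thread::id Expected{};  // default-constructed id means "nobody inside"
        if (!Owner.compare_exchange_strong(Expected, Self) && Expected != Self)
        {
            // Two threads are inside the guarded code at the same time.
            std::fprintf(stderr, "Concurrent access to guarded structure detected\n");
        }
    }

    // Call on every exit path of the guarded function.
    void Exit()
    {
        std::thread::id Self = std::this_thread::get_id();
        // Only release ownership if we actually hold it (a warned-about
        // second thread should not clear the first thread's ownership).
        Owner.compare_exchange_strong(Self, std::thread::id{});
    }

private:
    std::atomic<std::thread::id> Owner{std::thread::id{}};
};
```

Wrapping the entry and exit of each function that touches the multi handle's internal trees with Enter/Exit turns a silent splay-tree corruption into a loud warning at the moment the overlap happens.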

If you can still reproduce the problem and you're willing to try out the instrumentation for us, I can send you the GitHub changelist of our instrumentation in 8.12.1 later this week, and you can cherry-pick it into your build and try it out in your repro case.