We have had issues in the past where a build waits for one or two agents to finish and continues only when the UBA agent Windows service is restarted.
This seems to have become more frequent on UE5.7.
When it happens, I always seem to find the following in the lease:
“UbaAgent> : UbaProxyClient (af554772-549a-40fa-a330-f4869ff6af45) - RecvSocket - WSAPoll returned timeout for socket 1984 and connection 31925ce4-8a60-44ee-96ae-c667e8829d60 after 10.0s (Connecting)”
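To check how often this message shows up across lease logs, something like the following could be used. It is only a rough sketch: the regular expression is inferred from the single line quoted above, so other UBA versions may word the message differently, and `find_wsapoll_timeouts` is a made-up helper name, not part of UBA or Horde.

```python
import re

# Pattern inferred from ONE sample line; treat the exact wording as an assumption.
WSAPOLL_TIMEOUT = re.compile(
    r"UbaProxyClient \((?P<client>[0-9a-f-]{36})\) - RecvSocket - "
    r"WSAPoll returned timeout for socket (?P<socket>\d+) "
    r"and connection (?P<connection>[0-9a-f-]{36}) "
    r"after (?P<seconds>\d+(?:\.\d+)?)s \((?P<state>\w+)\)"
)


def find_wsapoll_timeouts(lines):
    """Return one dict per line that contains the timeout message."""
    hits = []
    for line in lines:
        match = WSAPOLL_TIMEOUT.search(line)
        if match:
            hit = match.groupdict()
            hit["socket"] = int(hit["socket"])
            hit["seconds"] = float(hit["seconds"])
            hits.append(hit)
    return hits


# The line quoted in this thread, used as sample input.
sample = (
    "UbaAgent> : UbaProxyClient (af554772-549a-40fa-a330-f4869ff6af45) - "
    "RecvSocket - WSAPoll returned timeout for socket 1984 and connection "
    "31925ce4-8a60-44ee-96ae-c667e8829d60 after 10.0s (Connecting)"
)

for hit in find_wsapoll_timeouts([sample]):
    print(hit["connection"], hit["seconds"], hit["state"])
```

Grouping the hits by connection id or by state (here `Connecting`) might help tell a one-off timeout apart from a connection that never gets established.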
How does the .uba trace look? Does that helper have a few actions running on it?
We used to see these issues before, but I haven’t seen it in a long time since I changed how casdb garbage collection works. Before that, garbage collection could stall things for a long time (deleting files on Windows is stupid-expensive), and my theory is that there must be a timeout issue of some sort that could cause this scenario where processes hang for a long time. Since we never see it anymore, it is hard for me to figure it out.
The latest version of UBA on main should be backward compatible and has a bunch of improvements. I would really appreciate it if you tested those binaries to see if the issue goes away before I try to dive deeper into this.
Can you tell me a little bit more about your setup? Number of machines/zones, latency, network stability, what kind of machines the builds are running on, etc.
Hmm, based on your trace it seems like the compile process itself is stuck on something. I can see that the helpers with stuck processes are pinging the server, so I don’t think it is a network issue. The most valuable thing here would be to attach to the cl.exe process and look at the callstack(s) to see what it is stuck on. If it looks like it is stuck on a message, then the next process to attach to is UbaAgent.exe, and if that one has a thread that is stuck waiting for a message, then the next is the dotnet process on the build machine (with native debugging).
I don’t think it is related to the proxy, but you can easily test that by either removing the zone environment var on the machines or making sure the builder machine has the same zone. Did you enable zones because you will have people using the helpers from outside the LAN? (There is no point using zones if all machines are always on the same LAN.)
Hmm, it doesn’t look like that in the trace above. It is stuck while using 14.9 GB out of 197 GB of committable memory (and 55 GB of RAM). Are the VMs doing anything else at the same time?
The code uses GlobalMemoryStatusEx to figure out how much memory is used (and to decide when it is close to OOM). Maybe it does not work well with VMs?
Also, measuring memory usage from outside a VM can be a bit weird, since Windows uses all the unused RAM it has access to for disk cache when possible.
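To compare what GlobalMemoryStatusEx reports inside the VM with what the trace shows, a small script along these lines could be run on a stuck helper. This is a sketch only: it reads the same Win32 call through `ctypes`, but how UBA actually turns those numbers into a "close to OOM" decision is not shown in this thread, so `used_fraction` and `query_memory` are illustrative helpers, not UBA's logic.

```python
import ctypes
import sys


class MEMORYSTATUSEX(ctypes.Structure):
    """Mirror of the Win32 MEMORYSTATUSEX struct (64 bytes)."""
    _fields_ = [
        ("dwLength", ctypes.c_uint32),
        ("dwMemoryLoad", ctypes.c_uint32),       # percent of physical memory in use
        ("ullTotalPhys", ctypes.c_uint64),
        ("ullAvailPhys", ctypes.c_uint64),
        ("ullTotalPageFile", ctypes.c_uint64),   # commit limit
        ("ullAvailPageFile", ctypes.c_uint64),   # commit still available
        ("ullTotalVirtual", ctypes.c_uint64),
        ("ullAvailVirtual", ctypes.c_uint64),
        ("ullAvailExtendedVirtual", ctypes.c_uint64),
    ]


def used_fraction(total, available):
    """Fraction of `total` that is in use; 0.0 if total is not positive."""
    if total <= 0:
        return 0.0
    return (total - available) / total


def query_memory():
    """Return commit/physical usage on Windows, or None on other platforms."""
    if sys.platform != "win32":
        return None
    status = MEMORYSTATUSEX()
    status.dwLength = ctypes.sizeof(MEMORYSTATUSEX)
    if not ctypes.windll.kernel32.GlobalMemoryStatusEx(ctypes.byref(status)):
        raise ctypes.WinError()
    return {
        "commit_used": used_fraction(status.ullTotalPageFile, status.ullAvailPageFile),
        "phys_used": used_fraction(status.ullTotalPhys, status.ullAvailPhys),
    }


if __name__ == "__main__":
    print(query_memory())
```

With the numbers from the trace, 14.9 GB of 197 GB committable works out to about 7.6% commit usage, which is nowhere near an out-of-memory condition. If the script reports something very different inside the VM, that would point at the measurement rather than at real memory pressure.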
Attached is a zip with a .uba trace showing the same issue, plus the lease JSONs.
This is the behaviour of the CPU on an agent during the ‘freeze’ and when I restarted the Horde Agent service. Strangely CPU seems to remain at 50% all the time until it is restarted.
[Image Removed]
Could ‘RecvSocket - WSAPoll returned timeout for socket’ be an issue with our network, or is this part of something you used to see?
Also, I’ll see if we can switch 5.7 UBA with the latest from main.
The build process runs under Jenkins as a Windows service. I guess this might be important, seeing that UbaVisualizer cannot find the session on channel ‘Default’.
The build process at the moment does not set a zone, but the agents seem to assign and use a proxy.
I tried replacing Engine\Binaries\Win64\UnrealBuildAccelerator\x64 with the files taken from your main branch on Perforce and ran a build, and a couple of agents got stuck on the PS5 compilation.
I tried aborting the build and the agents returned to an Idle state.
Compile Module.MetasoundEngine.cpp [ Wall Time 91.19s / CPU Time 67.80s / Mem 1.84 GB ]
Trace written to file D:/WS/sandbox5_main/Engine/Programs/AutomationTool/Saved/Logs/Trace.uba with size 14.0mb
Total time in Unreal Build Accelerator executor: 59773.79 seconds
Result: Failed (OtherCompilationError)
Total execution time: 59797.18 seconds
Took 59,798.01s to run dotnet.exe, ExitCode=6
UnrealBuildTool failed. See log for more details. (D:\WS\sandbox5_main\Engine\Programs\AutomationTool\Saved\Logs\UBA-sandbox5-PS5-Test.txt)
AutomationTool executed for 16h 36m 42s
AutomationTool exiting with ExitCode=6 (6)