A few days ago we set the UBA_ZONE environment variable on our dedicated Horde agents (Windows on-premises VMs) as suggested here [Content removed]
Since then we have had a couple of agents that were ‘frozen’ and did not finish the lease. For the first, we interrupted the job under Jenkins after two days (over a weekend). For the other, I restarted the HordeAgent service after a couple of hours and the job then completed.
I’m not sure if this is connected with the zone usage or not.
Has this issue been seen? Is there any method to set a timeout to avoid an agent being stuck for hours or even days?
I have attached the lease text file from the Horde server for the agent that was stuck for a couple of hours.
Hmmm, we have not seen any hangs like this afaik… the entire job was hung, right, not only the agent? If that was the case we would definitely have seen this on our farm.
Unfortunately there are no timeouts for something taking too long. The main reason is that if something like this happens, we want to make sure we catch it, and on our farm some actions (PGO/LTCG linking) take over two hours. I guess remote actions should probably not take more than 30-40 minutes, so we could add an optional timeout.
If you get a repro on this, do you think it would be possible to attach a debugger and see what is going on? Check both the client and the host.
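Since there is no built-in timeout today, one possible stopgap is a wall-clock limit applied from outside, for example in the script that Jenkins uses to launch the build. The sketch below is only an illustration of that idea in Python: it is not a UBA or Horde feature, `run_with_timeout` is a hypothetical helper, and `DEMO_CMD` is a stand-in for the real build command. Any limit would need to sit above the longest legitimate action (over two hours for PGO/LTCG linking on the farm described above).

```python
import subprocess
import sys

# Stand-in for the real build command (assumption for illustration only).
DEMO_CMD = [sys.executable, "-c", "import time; time.sleep(5)"]


def run_with_timeout(cmd, timeout_s):
    """Run cmd and wait at most timeout_s seconds.

    Returns (finished, returncode). On timeout, subprocess.run kills the
    direct child process and this returns (False, None). Grandchild
    processes are not terminated by this call.
    """
    try:
        result = subprocess.run(cmd, timeout=timeout_s)
        return True, result.returncode
    except subprocess.TimeoutExpired:
        return False, None
```

Calling `run_with_timeout(DEMO_CMD, 0.5)` reports a timeout instead of waiting indefinitely. Note that killing a hung job this way also destroys the state needed for a callstack, so it is better suited to unattended weekend runs than to diagnosing the hang.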
You can also try the latest binaries from GitHub. They should be backward compatible, and I fixed a couple of timeout-related things just a week ago, even though they probably won’t fix your specific problem since we have not seen problems like that in a long time.
Hmm, unfortunately I think I will need callstacks from when it is hung, ideally on both the agent that is acting as the proxy and the host, but especially the proxy. I have no idea what this could be. I have not seen this at Epic, and we have hundreds of helpers being proxies all the time, probably rotating tens of millions of actions every day.
Yes, the whole job was waiting, and when I remotely restarted the HordeAgent service the job finished shortly after.
The agents are Windows Server 2022 Standard 21H2 VMs under Proxmox. I do not have details on the latency and bandwidth, but they are in the same server room. I have graphs of Ethernet traffic, and the maximum I have seen is about 600 Mbit/s. Before adding UBA_ZONE the graphs ranged from 0 to 200 Mbit/s and never actually reached 200.
At the moment the zone is only defined on the dedicated agents. We are trying 4 workstation agents that do not have a zone defined, and the machines that are requesting the compilation do not have anything defined in BuildConfiguration.xml.
Regarding the latest binaries from GitHub, are those the UBA binaries on ue5-main? I do not see any binaries there, but I suppose I could find them in the Perforce repository. Of the two cases we had, one was UE5.7 preview @46459148 and the other was UE5.6.1.