Hello,
My team and I are experiencing an issue with Horde where certain build jobs hang, with steps never reaching a completed state. Even when we manually cancel the build/step, we get the ‘Cancelled’ tag but Horde still reports the build as ‘running’.
[Image Removed]
[Image Removed]
I’ve done some digging into the cause of the issue and it seems to coincide with the Horde server restarting to apply Windows updates. At roughly the same time as the above steps were started, the server logs show that the server shut down and restarted:
[23:03:47 inf] Shutdown/SIGTERM signal received
[23:03:47 inf] Application is shutting down...
...
[23:04:10 inf] Server version: 5.5.0-37571337
[23:04:10 inf] App directory: C:\Program Files\Epic Games\Horde\Server
[23:04:10 inf] Data directory: D:\Horde\Server
[23:04:10 inf] Server config: D:\Horde\Server\server.json
...
[23:04:11 inf] [4952] 14 May 23:04:11.526 * Ready to accept connections
[23:04:11 inf] Win32 service starting...
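For anyone wanting to check the same correlation in their own logs, here is a minimal sketch in Python that pulls restart windows out of server log lines in the format shown above and tests whether a step start time falls near one. The function names, the 5-minute margin, and the example step time in the usage section are hypothetical and not part of Horde; the log lines only carry a time of day, so this sketch ignores dates and will not handle a restart that spans midnight.

```python
import re
from datetime import datetime, timedelta

# Matches the "[HH:MM:SS lvl] message" prefix seen in the server log excerpt.
LINE_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2}) \w+\] (.*)$")

# Trimmed copy of the log excerpt from this post, used as sample input.
SAMPLE_LOG = [
    "[23:03:47 inf] Shutdown/SIGTERM signal received",
    "[23:03:47 inf] Application is shutting down...",
    "...",
    "[23:04:10 inf] Server version: 5.5.0-37571337",
    "[23:04:11 inf] [4952] 14 May 23:04:11.526 * Ready to accept connections",
    "[23:04:11 inf] Win32 service starting...",
]


def find_restart_windows(lines):
    """Return (shutdown_time, service_start_time) pairs found in the lines."""
    windows = []
    shutdown_at = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip elided ("...") or unrecognised lines
        stamp = datetime.strptime(match.group(1), "%H:%M:%S")
        message = match.group(2)
        if "Shutdown/SIGTERM signal received" in message:
            shutdown_at = stamp
        elif "Win32 service starting" in message and shutdown_at is not None:
            windows.append((shutdown_at, stamp))
            shutdown_at = None
    return windows


def started_near_restart(step_start, windows, margin=timedelta(minutes=5)):
    """True if step_start ("HH:MM:SS") is within `margin` of any restart window."""
    started = datetime.strptime(step_start, "%H:%M:%S")
    return any(down - margin <= started <= up + margin for down, up in windows)


if __name__ == "__main__":
    found = find_restart_windows(SAMPLE_LOG)
    # "23:03:30" is a made-up step start time, for illustration only.
    print(started_near_restart("23:03:30", found))
```

Feeding it the full server log instead of `SAMPLE_LOG`, together with the start times of the stuck steps, should show whether every stuck step lines up with a restart.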
Once the server shuts down and the agents lose connection, each agent appears to cancel its current lease because of the lost connection. This is an excerpt from the lease log:
Exception on log tailing task (682510ec6e68dc646d7277c1): Status(StatusCode="Unavailable", Detail="Error starting gRPC call. HttpRequestException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. (horde.1010games.com:443) SocketException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.", DebugException="System.Net.Http.HttpRequestException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. (horde.1010games.com:443)")
[00:12:13] Failed to run process: The operation was canceled.
[00:12:13] Lease was cancelled (session terminating)
Because of this, the agent seems to think it has cancelled the lease, and it can be seen picking up leases for other build jobs. I’m guessing that the server, being shut down or restarting at the time, never receives the message cancelling the lease and so thinks that the step is still running?
To help prevent this from happening in the future, we have designated a specific window during the night in which the agents/server can apply updates and no scheduled builds are triggered, so that the agents/server can restart freely without affecting any builds.
The issue we have now, though, is that quite a few of these builds are stuck in a ‘running’ state. This is a particular problem for builds that are limited to running one at a time, as we are having to edit the schedules to work around the stuck builds.
My initial thought for removing these ‘running’ builds is to delete their entries from MongoDB, along with the associated lease entries, although I’m not sure whether this would actually get rid of the problem builds?
If anyone has any further ideas on how we can navigate this problem, the help would be greatly appreciated!
Cheers,
Ben