HORDE - Builds left perpetually running even after cancelling

Hello,

My team and I are experiencing an issue with Horde where certain build jobs hang, with steps never reaching a completed state. Even after manually cancelling the build/step we see the ‘Cancelled’ tag, but the build remains in a ‘running’ state according to Horde.

[Image Removed]

[Image Removed]

I’ve done some digging into the cause of the issue, and it seems to coincide with the Horde server restarting to apply Windows updates. At roughly the same time as the above steps were started, the server logs show that the server shut down and restarted:

[23:03:47 inf] Shutdown/SIGTERM signal received
[23:03:47 inf] Application is shutting down...
...
[23:04:10 inf] Server version: 5.5.0-37571337
[23:04:10 inf] App directory: C:\Program Files\Epic Games\Horde\Server
[23:04:10 inf] Data directory: D:\Horde\Server
[23:04:10 inf] Server config: D:\Horde\Server\server.json
...
[23:04:11 inf] [4952] 14 May 23:04:11.526 * Ready to accept connections
[23:04:11 inf] Win32 service starting...

It seems that once the server shuts down and the agents lose connection, each agent tries to cancel its current lease because of the lost connection. This is an excerpt from the lease log:

Exception on log tailing task (682510ec6e68dc646d7277c1): Status(StatusCode="Unavailable", Detail="Error starting gRPC call. HttpRequestException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. (horde.1010games.com:443) SocketException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.", DebugException="System.Net.Http.HttpRequestException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. (horde.1010games.com:443)")
[00:12:13] Failed to run process: The operation was canceled.
[00:12:13] Lease was cancelled (session terminating)

Because of this, the agent seems to believe it has cancelled the lease, and it can then be seen picking up leases for other build jobs. I’m guessing that the server (being shut down/restarting at the time) never receives the message cancelling the lease, and so it thinks that step is still running?

To help prevent this from happening in the future, we have designated a specific window during the night when the agents/server will apply updates and no scheduled builds will be triggered, so that if any updates need doing the agents/server can restart freely without affecting any builds.

The issue we have now, though, is that quite a few of these builds are stuck in a ‘running’ state. This is particularly a problem for builds that are limited to run one at a time, as we are having to edit the schedules to work around the stuck builds.

My initial thought for removing these ‘running’ builds was to delete their entries from MongoDB, along with the associated lease entries, although I’m not sure whether this would actually be effective in getting rid of these problem builds?

If anyone has any further ideas on how we can navigate this problem, then the help is greatly appreciated!

Cheers,

Ben


Hey there Ben,

This is a great question, and can occur in other circumstances as well. Here is a public [Content removed] that is quite similar, and I’d give the same advice thus far. In short:

  • /api/v1/debug/repair-job/{jobId}
  • /api/v1/jobs/{jobId}/batches/{batchId}/steps/{stepId} seems promising if you forcibly set the step to be completed
  • /api/v1/jobs/{jobId} - should be able to give back pertinent information on job state

Now, repair-job will require you to set EnableDebugEndpoints.
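As a starting point, inspecting the job document could look something like this in Python (a minimal sketch only, assuming bearer-token auth; the server URL and token are placeholders you'd substitute):

```python
import json
import urllib.request

HORDE_URL = "https://horde.example.com"  # placeholder; substitute your server
TOKEN = "YOUR_TOKEN"                     # placeholder bearer token

def job_url(job_id: str) -> str:
    """Path of the endpoint that returns the job document (state of each batch/step)."""
    return f"{HORDE_URL}/api/v1/jobs/{job_id}"

def get_job(job_id: str) -> dict:
    """GET the job state so you can see which batches/steps are reported as running."""
    req = urllib.request.Request(
        job_url(job_id),
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The same pattern applies to the other two endpoints above; only the path and HTTP verb change.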

I will see if there’s an appetite within the team to create some frontend Admin capability to better support automation in this space.

Let me know if the above helps.

Kind regards,

Julian

Hey Ben,

Sounds good - please let me know how you fare. And yes, the batches endpoint most likely requires permissions for the logged-in user.

Kind regards,

Julian

Hey Ben,

That sounds reasonable. One quick note about EnableDebugEndpoints - it will require a server restart. Regarding why you’re seeing a 403, it could be from that, but I’d be curious why an admin account failed. If you want to debug that, I always find it best to attach a debugger to the server and step through the controller auth calls. Also relevant is DebugEndpointAttribute, which is fundamentally what controls the top-level auth for this particular controller.

Kind regards,

Julian

Hey Julian,

Thank you for the quick response!

I haven’t had the chance to look any further into this issue today, but will give you an update on any progress I make when I’m back on Monday!

Thanks again,

Ben

Hey Julian,

Just a quick update on this, I’ve been trying a few of your suggestions above and have had varying degrees of success so far.

First off, I tried the following API to set the step to complete: /api/v1/jobs/{jobId}/batches/{batchId}/steps/{stepId}; however, I did not get very far with this as I was hitting a 403 error (it’s possible I don’t have the correct permissions for this?).

I have found some success using a lease API: /api/v1/leases/{leaseId} and posting with the following payload.

{ "aborted": true }

This seems to completely cancel the lease that is still running, and by extension the whole build job is set to a finished state.
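Roughly, the call looks like this in Python (a minimal sketch; the server URL and token are placeholders, and the HTTP verb is the one I’ve been using - adjust if your server expects otherwise):

```python
import json
import urllib.request

HORDE_URL = "https://horde.example.com"  # placeholder; substitute your server
TOKEN = "YOUR_TOKEN"                     # placeholder bearer token

def abort_lease_request(lease_id: str) -> urllib.request.Request:
    """Build the request that sends { "aborted": true } to a stuck lease."""
    return urllib.request.Request(
        f"{HORDE_URL}/api/v1/leases/{lease_id}",
        data=json.dumps({"aborted": True}).encode(),
        headers={"Authorization": f"Bearer {TOKEN}",
                 "Content-Type": "application/json"},
        method="POST",  # change the verb here if your server build expects PUT
    )

def abort_lease(lease_id: str) -> None:
    """Fire the abort request; raises on a non-2xx response."""
    urllib.request.urlopen(abort_lease_request(lease_id)).close()
```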

I am intrigued by the /api/v1/debug/repair-job/{jobId} call, but am yet to give it a go. I’ve still got a few builds left over in a ‘running’ state, so will give this call a try during quieter office hours when I have a chance to restart the server!

Thanks,

Ben

Hey Julian,

Yesterday I gave the /api/v1/debug/repair-job/{jobId} call a go, and despite having EnableDebugEndpoints enabled we were experiencing 403 errors with this as well (even from admin accounts).

I think our plan from here is to move forward with just aborting the troubled leases using the method I mentioned above; I’m currently writing an automation script to handle this.
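For anyone else hitting this, the script is shaping up roughly like so (a sketch only - the 'state'/'startTime'/'leaseId' field names are my assumptions about the job document shape, so verify them against your own /api/v1/jobs responses; URL and token are placeholders):

```python
import json
import urllib.request

HORDE_URL = "https://horde.example.com"  # placeholder; substitute your server
TOKEN = "YOUR_TOKEN"                     # placeholder bearer token

def find_stuck_lease_ids(jobs: list[dict], cutoff: str) -> list[str]:
    """Collect lease ids from batches still reported as 'Running' that started
    before the cutoff timestamp (ISO-8601 strings compare lexicographically).
    Field names here are assumptions about the job document shape."""
    lease_ids = []
    for job in jobs:
        for batch in job.get("batches", []):
            if batch.get("state") == "Running" and batch.get("startTime", "") < cutoff:
                if "leaseId" in batch:
                    lease_ids.append(batch["leaseId"])
    return lease_ids

def abort_lease(lease_id: str) -> None:
    """Send the { "aborted": true } payload that clears a stuck lease."""
    req = urllib.request.Request(
        f"{HORDE_URL}/api/v1/leases/{lease_id}",
        data=json.dumps({"aborted": True}).encode(),
        headers={"Authorization": f"Bearer {TOKEN}",
                 "Content-Type": "application/json"},
        method="POST",  # adjust the verb if your server expects PUT
    )
    urllib.request.urlopen(req).close()
```

The idea being we fetch recent jobs, filter out anything still ‘Running’ well past when it should have finished, and abort each associated lease.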

Again, thank you for your help and quick responses,

Ben