Build steps on Horde seem to randomly get cancelled

Hello,

We’re encountering an issue with a few of our build jobs on Horde, where build steps seem to be randomly cancelled by Horde. This issue occurs intermittently on various builds/build steps and cannot be reproduced reliably (so unfortunately no solid repro steps). For reference, we are using modified versions of the BuildAndTestProject.xml and BuildEditorAndTools.xml from the engine, but the general flow remains the same.

When opening the info about the cancelled step the build log always seems to be empty, but the following message can be observed at the top of the page:

Summary: Job created by scheduler. This step was canceled by Horde.

We’re really unsure why this is happening, as there don’t seem to be any logs associated with each build that show any sort of issue or clue as to why the step was cancelled, so any help on this matter would be greatly appreciated.

Thanks,

Ben


Hey there Ben,

Just checked internally and I can see that we have encountered some similar issues. Those, however, seem to be related to an infrastructure LivenessProbe check.

Do you have any additional details around your infrastructure setup? I know you’ve mentioned the logs on the agent being empty, but anything of note on the server logs at that time? Assuming Windows, anything of interest in the event viewer?

Julian

Hey Ben,

Good stuff on tracking that down. I’ll keep this ticket open until you’ve confirmed on your end. Best of luck!

Kind regards,

Julian

Hi Julian, thank you for your quick response!

We have done more digging into this issue this morning and we think we may have found the cause of the problem. We have noticed that one of our in-office build agents is silently crashing at various points through the day. I’ve gone through our past builds and found that all cancelled steps seem to be coming from this one agent.

The server logs also back up this theory: the agent appears to crash and lose its connection, so Horde terminates its session and the step gets cancelled.

[22:55:08 inf] Attempting to create session for agent WAR-HAGENT01
[22:55:08 inf] Terminating session 67ec27cc8a572fd7bcd0c705 for WAR-HAGENT01
[22:55:08 inf] Terminated session 67ec27cc8a572fd7bcd0c705
[22:55:08 inf] Removing lease 67ec5dab8a572fd7bcd0f320 (type.googleapis.com/ExecuteJobTask)
[22:55:08 inf] Lease 67ec5dab8a572fd7bcd0f320 complete, outcome Cancelled
[22:55:08 inf] Failing batch 67ec461e8a572fd7bcd0db88:ba9c with error Cancelled
[22:55:08 inf] Failed lease 67ec5dab8a572fd7bcd0f320, job 67ec461e8a572fd7bcd0db88, batch ba9c with error Cancelled
[22:55:08 inf] Session 67ec60bc8a572fd7bcd0fbcb started
[22:55:08 inf] Updating step reference ccae for job 67ec461e8a572fd7bcd0db88, batch ba9c, with outcome Failure
[22:55:13 inf] Assigning job to waiter
[22:55:13 inf] Assigned lease to agent
[22:55:13 inf] Lease 67ec60c18a572fd7bcd0fbd5 started (Type=Job, JobId=67ec461e8a572fd7bcd0db88, BatchId=d8ba, LogId=67ec60c18a572fd7bcd0fbd6)
[22:55:15 err] Exception in call to /Horde.LogRpc/UpdateLogTail System.IO.IOException: The request stream was aborted. ---> Microsoft.AspNetCore.Connections.ConnectionAbortedException: The HTTP/2 connection faulted. ---> Microsoft.AspNetCore.Connections.ConnectionResetException: An existing connection was forcibly closed by the remote host. ---> System.Net.Sockets.SocketException (10054): An existing connection was forcibly closed by the remote host.
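
For anyone wanting to check the same correlation on their own server logs, here is a rough Python sketch that counts cancelled leases per agent. It is not a Horde tool: the function names are made up for illustration, the patterns are taken only from the log lines quoted above (other Horde versions may word them differently), and it assumes a cancelled lease belongs to whichever agent had its session terminated at the same timestamp.

```python
import re
from collections import Counter

# Patterns derived from the log excerpt above; treat them as assumptions.
ENTRY = re.compile(r"\[(\d{2}:\d{2}:\d{2}) (\w+)\] ")
TERMINATING = re.compile(r"Terminating session (\w+) for (\S+)")
CANCELLED = re.compile(r"Lease (\w+) complete, outcome Cancelled")


def split_entries(text):
    """Split raw log text into (timestamp, message) pairs.

    Works whether entries are one per line or run together on a single line.
    """
    parts = ENTRY.split(text)
    # parts looks like: [prefix, time, level, message, time, level, message, ...]
    return [(parts[i], parts[i + 2].strip()) for i in range(1, len(parts), 3)]


def cancelled_leases_by_agent(text):
    """Count cancelled leases that follow a session termination.

    Heuristic (an assumption, not Horde behaviour we have confirmed): a lease
    cancelled at the same timestamp as the most recent session termination is
    attributed to that session's agent.
    """
    counts = Counter()
    last_agent, last_time = None, None
    for timestamp, message in split_entries(text):
        match = TERMINATING.search(message)
        if match:
            last_agent, last_time = match.group(2), timestamp
            continue
        if CANCELLED.search(message) and last_agent and timestamp == last_time:
            counts[last_agent] += 1
    return counts


# Three entries copied from the excerpt above, run together on one line.
SAMPLE = (
    "[22:55:08 inf] Terminating session 67ec27cc8a572fd7bcd0c705 for WAR-HAGENT01 "
    "[22:55:08 inf] Terminated session 67ec27cc8a572fd7bcd0c705 "
    "[22:55:08 inf] Lease 67ec5dab8a572fd7bcd0f320 complete, outcome Cancelled"
)

if __name__ == "__main__":
    for agent, count in cancelled_leases_by_agent(SAMPLE).most_common():
        print(agent, count)
```

If one agent dominates the counts across several days of logs, that points at the machine rather than at the build graph.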

We’re currently working on resetting the agent in question, and once we have done that we can see whether the issue rights itself.

Thanks,

Ben

Hey Julian,

Just a quick update on this: we have had builds running on a new agent (a replacement for the other one) for all of today, and we haven’t experienced any cancellations or crashes so far.

Leaving it over the weekend will be a better indicator, but it’s looking like the issue may be sorted.

Thank you again for your help!

Ben