Horde - No job notifications on lease execution errors

Hi,

We recently had a build issue that highlighted a problem with Horde’s job notifications. One of the tests run during the job would use up all of the memory available on the worker. This sometimes caused the agent to crash, after which Horde would mark the lease as failed, with an outcome of either execution error or cancelled, and terminate the job. When this happens, Horde doesn’t send the usual job completion notifications, which can delay noticing and fixing the problem because the Slack notifications channel appears healthy. This is especially problematic since these kinds of job failures tend to happen only when there are serious issues.

Is there any way for us to ensure that job notifications get sent regardless of how the job completes? Or could another layer of notification be added for when something outside of the job goes wrong and terminates it?

Some misc observations:

  • We do get job notifications when someone cancels a job.
  • When jobs fail in this way, the job summary will not indicate that there was any issue with the job. For example: “Job created Friday, November 14th at 10:18 AM PST by scheduler and completed Friday, November 14th at 11:02 AM PST.”
  • From the 10 or so jobs that failed in this way, I noticed that one of them did manage to send an issue notification in the triage channel for the job, but even that was for a warning rather than anything about the lease failing to execute correctly.
  • In the past I posted about a ‘lease incomplete’ issue [Content removed] where our server was unable to save log blobs. The most severe version of that (3 retries into a cancelled lease) would also fail to send any job notifications.
  • Not sure if this is actually related, but a somewhat similar scenario is when Horde automatically cancels a preflight job because a newer one was started against the same shelved CL. Whatever the solution is, it might need to distinguish between the different ways a lease can end up in the cancelled state.
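
To illustrate that last point, here is a minimal sketch of what I mean, assuming each cancellation carried a reason. None of these names come from Horde; the enum, the reasons, and the mapping are all hypothetical, and whether superseded preflights should stay quiet is my assumption.

```python
from enum import Enum, auto


class CancelReason(Enum):
    """Hypothetical reasons a lease can end up cancelled (not Horde's real model)."""
    USER_REQUEST = auto()       # someone cancelled the job by hand
    SUPERSEDED = auto()         # e.g. a newer preflight against the same CL
    AGENT_LOST = auto()         # agent crashed or its session timed out
    RETRIES_EXHAUSTED = auto()  # e.g. repeated 'lease incomplete' retries


# Reasons that point at an infrastructure problem rather than an intentional stop.
INFRASTRUCTURE_REASONS = {CancelReason.AGENT_LOST, CancelReason.RETRIES_EXHAUSTED}


def notification_kind(reason: CancelReason) -> str:
    """Map a cancellation reason to the kind of notification to send."""
    if reason in INFRASTRUCTURE_REASONS:
        return "failure"
    if reason is CancelReason.USER_REQUEST:
        return "cancelled"
    # Assumption: superseded jobs are expected and do not need a message.
    return "none"
```

With something like this, an agent crash would surface as a failure in the channel, while an automatically superseded preflight would not add noise.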

Thanks for the help!

Steps to Reproduce

  1. Start a job that normally has some notifications set up. For example, our incremental builds send messages to a Slack channel when they complete.
  2. While the job is running, stop the agent. This should eventually lead to the lease being cancelled and the job failing.
  3. Notifications do not get sent for the job.

Hey Jeremy,

Thanks for the report. I can definitely see how this happens in the code. I don’t have a quick workaround to offer you at the moment, but we need to make sure the same code path for updating a job is also triggered when leases are cancelled due to timed-out agent sessions.
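
As a rough illustration of the intent only (this is not Horde's actual code, and every class and method name below is hypothetical), the idea is that both the normal lease completion handler and the session timeout handler funnel into one job-update routine that always sends the notification:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class LeaseOutcome(Enum):
    SUCCESS = "success"
    FAILED = "failed"
    CANCELLED = "cancelled"


@dataclass
class Job:
    name: str
    state: str = "running"
    summary: str = ""


@dataclass
class JobUpdater:
    """Hypothetical updater with a single exit point for finished jobs."""
    notify: Callable[[str], None]

    def finish_job(self, job: Job, outcome: LeaseOutcome, detail: str = "") -> None:
        # Shared path: update the job record, then always notify.
        job.state = "completed" if outcome is LeaseOutcome.SUCCESS else "failed"
        job.summary = f"{job.name}: {outcome.value}" + (f" ({detail})" if detail else "")
        self.notify(job.summary)

    def on_lease_completed(self, job: Job, outcome: LeaseOutcome) -> None:
        # Normal case: the agent reported the lease result itself.
        self.finish_job(job, outcome)

    def on_agent_session_timeout(self, job: Job) -> None:
        # Agent disappeared: reuse the same path instead of a silent cancel.
        self.finish_job(job, LeaseOutcome.CANCELLED, "agent session timed out")
```

For example, wiring `notify` to a list's `append` and calling `on_agent_session_timeout` leaves the job marked as failed and records one message, which is the behaviour missing today.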

To track this work, I’ve created an internal issue, UE-355536.