UnrealBuildTool race condition in ImmediateActionQueue

We found some intermittent failures in UBT where it would appear to just stop suddenly after completing an action and end with exit code 6.

We tracked it down to what seems to be a race condition in ImmediateActionQueue.TryStartOneAction(). Between the first part of the method where the action is attempted to be started, and the end where it checks for a stall, the number of active actions may decrease if an action completes. This can falsely indicate a stall, if we are waiting on actions to unblock the next action, but they complete during execution of this method and the number of active actions ends up at 0 by the time we do the stall check.

We have a repro case here with some extra debugging to make the sequence of events clearer. The actual behavioral changes are to add a sleep to TryStartOneAction(), and to force using the ParallelExecutor as it creates a single ImmediateActionQueue runner of Automated type. https://github.com/EpicGames/UnrealEngine/pull/13918

The race condition itself can occur even with other executors, but in cases like UGA where there’s a manual runner as well, the stall check will always pass on the manual runner, so it doesnt really matter. For us, in our FASTBuild executor we end up with some actions we pass off to a ParallelExecutor, which is why we saw this to begin with.

Ideally for us, we could get a fix in for the bug itself (even if it doesn’t manifest in any default Engine usage), so we can mitigate it on our end without incurring more divergence, until we’re ready to fully swap over to UBA. I have a proposed fix here, which is to just pull the stall check up into action lock scope https://github.com/EpicGames/UnrealEngine/pull/13917

Steps to Reproduce

https://github.com/EpicGames/UnrealEngine/pull/13918

I’ll take a look this week, at a quick cursory glance it seems fine to me. We can’t modify the UE5.7 branch anymore though, and I don’t think I’ll be able to get this into the UE5.7 release either since we’re past the commit cutoff so it’ll most likely be committed to ue5-main for the next release.

I’ll most likely take your change as is, if not I’ll let you know either here or in the comments on the PR.

Hey Joe. Thanks for taking a look!

No worries about not making it into 5.7. I realized when I said “without more divergence” I should have said non-temporary divergence. If we believe something approximately like this is coming later we’re happy to apply it internally now.