UAT AutomationExceptions and Horde Log Event Matchers

When UAT encounters an error, and the output doesn’t match an existing domain-specific pattern (such as a cook error), then Horde will not identify any error log event to surface in the Horde UI.

For example, we encountered an error during the stage step of a BuildCookRun task. It threw this exception: https://github.com/EpicGames/UnrealEngine/blob/release/Engine/Source/Programs/AutomationTool/Scripts/CopyBuildToStagingDirectory.Automation.cs#L2104

Horde showed no errors, because none of the existing implementations of `ILogEventMatcher` matched the specific message of that thrown exception. It doesn’t appear that there’s any prefix or other indicator wrapping the message that would allow a matcher to clearly identify it as the error message.

It appears that before 5.3, this would have been picked up by `Log.WriteException()` using the previous log methods, which would have prepended the log level of `error:`, resulting in it being matched by `GenericEventManager`.

https://github.com/EpicGames/UnrealEngine/commit/7e26c34d2be354342655426997a786d00bc9ac06#diff-2a08bf9fd58e3a327773c010a985ba8ed224df1da6aaa7f74d60860ebd7352cbL482

edit: this direct link to the diff line appears not to work reliably, so I’ll upload of the screenshot of the specific lines I’m referring to.

What is the intended way for generic UAT errors to get matched and surfaced in Horde? I’m happy to try something out, test it, and throw up a PR. However, there’s a lot of different ways to go about this, so I’m looking for some opinionated guidance.

Steps to Reproduce
Run a build graph through Horde that uses the `Command` task, and have that task in UAT throw an exception.

Hey there,

Thanks for bringing this up to us. This does indeed seem like an issue we need to get a fix for. Let me chat with the team to see what we can do here to reintroduce the proper matching for this, or see if there’s something going on here configuration wise.

Kind regards,

Julian

Hey again,

Just digging into the code a bit more, something does seem a tad off, as I’d expect the following code to be invoked at the top of the UAT process:

Automation.cs ln:190

`catch (AutomationException Ex)
{
// Output the message in the desired format
if (Ex.OutputFormat == AutomationExceptionOutputFormat.Silent)
{
Logger.LogDebug(“{Arg0}”, ExceptionUtils.FormatExceptionDetails(Ex));
}
else if (Ex.OutputFormat == AutomationExceptionOutputFormat.Minimal)
{
Logger.LogInformation(Ex, “{Message}”, Ex.ToString().Replace(“\n”, “\n “));
Logger.LogDebug(Ex, “{Details}”, ExceptionUtils.FormatExceptionDetails(Ex));
}
else if (Ex.OutputFormat == AutomationExceptionOutputFormat.MinimalError)
{
Logger.LogError(Ex, “{Message}”, Ex.ToString().Replace(”\n”, "\n "));
Logger.LogDebug(Ex, “{Details}”, ExceptionUtils.FormatExceptionDetails(Ex));
}
else
{
Log.WriteException(Ex, LogUtils.FinalLogFileName);
}

// Take the exit code from the exception
return Ex.ErrorCode;
}
catch (Exception Ex)
{
// Use a default exit code
Log.WriteException(Ex, LogUtils.FinalLogFileName);
return ExitCode.Error_Unknown;
}`With the call site referenced (CopyBuildToStagingDirectory.Automation.cs, AutomationException.cs):

`// From CopyBuildToStagingDirectory.Automation.cs
throw new AutomationException(Message.ToString());

// From AutomationException.cs
public class AutomationException : System.Exception
{
// …
public ExitCode ErrorCode = ExitCode.Error_Unknown;
public AutomationExceptionOutputFormat OutputFormat = AutomationExceptionOutputFormat.Error;
// …
}`So the Automation class should still be trampolining to the WriteException API, which then will have a structured log event via the LogError within the WriteException API.

Do you have the command line handy? I’m curious if somehow this is being executed outside of the expected context where we catch and bubble up to WriteException?

Kind regards,

Julian

The UAT command line (via the `Command` task in the graph) was

-p4 -NoCompile BuildCookRun -Project=Projects/Blah/Blah.uproject -Cook -FastCook -Stage -NoSubmit -UTF8Output -Package -Archive -Pak -Compressed -NoCodeSign -ClientConfig=Test -CrashReporter -Platform=Win64 -ArchiveDirectory=T:\HordeData\StandardCi\Sync/LocalBuilds/Blah/Targets/Win64-Standalone -Target=BlahGame

I do think it was going through the default path you pasted. The issue seemed to be in the format of the error log. In `WriteException()`, the `Logger.LogError()` line gave us something like this:

The following files are set to be staged, but contain restricted folder names ("XboxOne"): Blah/Plugins/UIX/Content/ButtonPrompts/XboxOne/T_UI_ButtonPrompt_XB1_A.uasset Blah/Plugins/UIX/Content/ButtonPrompts/XboxOne/T_UI_ButtonPrompt_XB1_A.uexp ...

It’s just the exception message itself (which happens to be multi-line in this case). There’s no error level or class name prefixed, so there’s nothing really to identify that line as either an error, or as the message of an exception.

Pre-5.3, it looks like `WriteException()` would use `TraceError()` which would prefix `ERROR: ` on that line, but the structured log code path doesnt have that.

Hey there,

Yeah makes sense - I guess I’m just surprised we aren’t seeing excessive failures with this internally (rather - no output matching for this entire class of problem - UAT command invocations).

I’d expect Logger.LogError to be implicitly caught in matching - there are other APIs that are widely used throughout the codebase (~1000 occurrences). Our structured logs are consumed via the JobDriver’s LogParser.

Let me ask around internally to see if we are doing anything special.

Julian

Hey there,

Out of curiosity, do you have UE_LOG_JSON_TO_STDOUT set on your Horde Server & execution environment? You should be able to see this in the Agent log.

For posterity, the call structure is a bit complicated here. In short:

  • The HordeAgent will initiate JobDriver (sub process via dotnet.exe)
  • JobDriver will invoke the corresponding steps that are a part of a batch
    • As an example:

Executing C:\WINDOWS\system32\cmd.exe /C ""C:\HordeAgent_main\Demo-Inc\Sync\Engine\Build\BatchFiles\RunUAT.bat" BuildGraph -Script=Engine/Build/Graph/Examples/BuildEditorAndTools.xml -Target="Stage for UGS" -SingleNode="Compile Tools Win64" -set:UProjectPath=Sandbox/DevTools/AnalyticsSandbox/AnalyticsSandbox.uproject -set:EditorTarget=AnalyticsSandboxEditor -set:PreferredAgent=IncrementalWin64;CompileWin64;Win64 -set:EditorCompileArgs="-UBA -ActionStallReportTime=300 -UBALinkRemote -UBAPrintSummary -noXGE -UBAHorde=https://horde.devtools.epicgames.com" -set:PreflightChange=1234567890-NoCompile"+ When an emitted event goes to the LogParser.WriteData, this message will be forwarded all the way to JobStepLogger, then ServerLogger
- ServerLogger will then attempt to write this out via RpcCreateLogEventRequest && WriteEventsAsync
- This is how we get a LogError emitted to the Horde server

I have tested both a Logger.LogError & Log.WriteException and have verified that our Horde server will properly receive the events.

Kind regards,

Julian

Ah, that was it. It seems that we had that variable unset due to some historical problem with some divergence that was added, and never unwound. I can confirm that with it set, we see the errors bubble up correctly. I’ll look into unwinding the divergence that led to it getting unset in the first place.

Thanks for looking into it Julian, and apologies for the wild goose chase.

Ahh that will do it! And worries at all - it helps me dust off some of the knowledge on my lesser visited parts of the engine.

Don’t hesitate to reach out should you need anymore help.

Kind regards,

Julian