Intermittent UBA CAS errors

When a Horde agent is assigned a build task and then reaches out to another Horde agent for UBA, sometimes I get errors about receiving files from the UBA worker. Example:

UbaStorageServer - Expecting to be able to decompress to 8387636 bytes but got 0 (2d0f1f58aaa9b37deed12ebc61f7f11a83e3d701 -> C:\HordeAgent\Sandbox\Demo-Inc-Full\Sync\Engine\Intermediate\Build\Win64\x64\CrashReportClient\Shipping\CoreUObject\Module.CoreUObject.6.cpp.obj)
UbaSessionServer - Failed to copy cas from 2d0f1f58aaa9b37deed12ebc61f7f11a83e3d701 to C:\HordeAgent\Sandbox\Demo-Inc-Full\Sync\Engine\Intermediate\Build\Win64\x64\CrashReportClient\Shipping\CoreUObject\Module.CoreUObject.6.cpp.obj (Module.CoreUObject.6.cpp (Compile [x64]))

That’s what I see in the Horde step log.

If I go to the agent running the task, I can get a log with a little more information:

UbaStorageServer - Expecting to be able to decompress to 8387636 bytes but got 0 (2d0f1f58aaa9b37deed12ebc61f7f11a83e3d701 -> C:\HordeAgent\Sandbox\Demo-Inc-Full\Sync\Engine\Intermediate\Build\Win64\x64\CrashReportClient\Shipping\CoreUObject\Module.CoreUObject.6.cpp.obj)
UbaSessionServer - Failed to copy cas from 2d0f1f58aaa9b37deed12ebc61f7f11a83e3d701 to C:\HordeAgent\Sandbox\Demo-Inc-Full\Sync\Engine\Intermediate\Build\Win64\x64\CrashReportClient\Shipping\CoreUObject\Module.CoreUObject.6.cpp.obj (Module.CoreUObject.6.cpp (Compile [x64]))
UbaSessionServer - Client REDACTED11 returned process 1399 to queue (Failed to send output files to host)
** For CrashReportClientEditor-Win64-Shipping **
[1075/1171] (Wall: 45.70s CPU: 34.61s) Compile [x64] Module.CoreUObject.13.cpp [RemoteExecutor: REDACTED11]
[Worker0] UbaSessionClient - Server failed to receive file E:\HordeAgent\Sandbox\Saved\Uba\sessions\250519_115009\output\850a643fa5f119eedec779c097a17508 (C:\HordeAgent\Sandbox\Demo-Inc-Full\Sync\Engine\Intermediate\Build\Win64\x64\CrashReportClient\Shipping\CoreUObject\Module.CoreUObject.6.cpp.obj)
[Worker0] UbaSessionClient - Failed to send output files to host

I confirmed that Worker0 is, in fact, REDACTED11.

I also looked for logs on REDACTED11, but found nothing useful in

E:\HordeAgent\Sandbox\Saved\Uba\sessions\250519_115009\log

The CAS ID was not present anywhere in E:\HordeAgent\Sandbox\Saved\Uba.

The obj file does eventually show up at C:\HordeAgent\Sandbox\Demo-Inc-Full\Sync\Engine\Intermediate\Build\Win64\x64\CrashReportClient\Shipping\CoreUObject\Module.CoreUObject.6.cpp.obj

but I think that’s because the local machine compiles it right at the end:

[1169/1171] (Wall: 14.53s CPU: 14.50s) Compile [x64] Module.CoreUObject.6.cpp

The error can appear with many different cpp files across many build steps (Linux server, tools, editor, etc.).
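Since the error moves around between files and steps, it may help to tally which output files and which remote clients show up in the failures before digging deeper. Below is a minimal Python sketch that does this, assuming the agent log lines have the same shape as the ones quoted above. The regular expressions, the `summarize` helper, and the `SAMPLE` text are illustrative and written against those quoted lines only; they are not part of UBA or Horde.

```python
import re
from collections import Counter

# Patterns derived only from the log lines quoted in this post (assumption:
# other failures are logged in the same shape).
CAS_FAIL = re.compile(r"Failed to copy cas from ([0-9a-f]+) to (.+?) \(")
RETURNED = re.compile(r"Client (\S+) returned process \d+ to queue \(([^)]*)\)")

def summarize(log_text):
    """Count failed output files and the clients whose processes were returned."""
    files = Counter(m.group(2) for m in CAS_FAIL.finditer(log_text))
    clients = Counter(m.group(1) for m in RETURNED.finditer(log_text))
    return files, clients

# Two lines copied from the agent log above, used as sample input.
SAMPLE = "\n".join([
    r"UbaSessionServer - Failed to copy cas from 2d0f1f58aaa9b37deed12ebc61f7f11a83e3d701 to C:\HordeAgent\Sandbox\Demo-Inc-Full\Sync\Engine\Intermediate\Build\Win64\x64\CrashReportClient\Shipping\CoreUObject\Module.CoreUObject.6.cpp.obj (Module.CoreUObject.6.cpp (Compile [x64]))",
    r"UbaSessionServer - Client REDACTED11 returned process 1399 to queue (Failed to send output files to host)",
])

if __name__ == "__main__":
    files, clients = summarize(SAMPLE)
    for path, count in files.most_common():
        print(count, path)
    for client, count in clients.most_common():
        print(count, client)
```

Running this over the logs from several failed steps would show whether the failures cluster on one compute node or are spread across all of them.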

It seems like the task does finish by virtue of retries, but it still gets marked as an error and fails the whole graph.

My setup:

  • Local dev machine running the Horde solution under the debugger
  • REDACTED08 as the only machine in the pool that can be assigned tasks
  • REDACTED11, REDACTED12, and REDACTED13 are pure compute nodes with the Horde Agent installed, but not configured to be in a pool that will be assigned Horde tasks

Two questions:

  • Any hints as to what could be causing this? This seems like it would require a pretty deep and challenging dive without any direction.
  • Why is this causing the whole graph to fail? If we retry the compute task, should it really mark the build step as having errored? Is there a way to have this error not be fatal, since the step does actually complete?


Hey there,

We have seen this on our farm sporadically, with long file paths being a suspected issue (it doesn’t seem to be the case with the above…).
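For anyone wanting to rule out the long-path theory on their own logs, here is a minimal Python sketch that compares path lengths against the classic Windows MAX_PATH limit of 260 characters. Whether UBA is actually sensitive to that limit is only a suspicion, not something confirmed here, and the `exceeds_max_path` helper is hypothetical; the two paths are copied from the logs in the original post.

```python
# Classic Windows MAX_PATH limit: 260 characters including the terminating NUL,
# so the longest usable path is 259 characters.
MAX_PATH = 260

def exceeds_max_path(path, limit=MAX_PATH):
    """Return True if the path does not fit within the classic limit."""
    return len(path) >= limit

# Paths copied verbatim from the log lines in the original post.
OBJ_PATH = r"C:\HordeAgent\Sandbox\Demo-Inc-Full\Sync\Engine\Intermediate\Build\Win64\x64\CrashReportClient\Shipping\CoreUObject\Module.CoreUObject.6.cpp.obj"
OUTPUT_PATH = r"E:\HordeAgent\Sandbox\Saved\Uba\sessions\250519_115009\output\850a643fa5f119eedec779c097a17508"

if __name__ == "__main__":
    for path in (OBJ_PATH, OUTPUT_PATH):
        print(len(path), exceeds_max_path(path), path)
```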

Can you first confirm what version you’re on (5.5.X?). It may also be helpful to rebuild with debug and increase verbosity in case we catch anything else; getting full logs is always ideal. Attaching the UBA trace would help too, as it may give some insights.

Regarding the log failures, this is indeed an issue we are aware of, and we are hoping to bring back IgnorePatterns.txt to help with it. I can also check whether there’s anything to be done here, since the error is benign when the retry succeeds.

Kind regards,

Julian

Hey there,

Yes, I would absolutely suggest grabbing the latest binaries and running tests (you should be able to update these ahead of the rest of the engine without too much friction). A significant number of bug fixes have gone in since 5.5.0, some of which specifically relate to this issue, albeit for somewhat different suspected root causes.

Kind regards,

Julian

Thanks for the debugging resource. I’ll poke around. We are using 5.5.0. We are actually in the middle of upgrading to 5.5.4, so maybe I get lucky and that solves it…?