Staging step fails with "Failed sending oplog request to Zen ..." when using BuildCookRun -cook -stage together; we have a workaround, but is this a bug or a misconfiguration?

Starting with 5.7, our Staging step intermittently fails with this error:

```
Reading oplog from Zen...
Failed sending oplog request to Zen at [::1]:8558 for oplog ABC.1a23b45c.Windows: Error while copying content to a stream..
(see D:\Horde\ABC\Sync\Unreal\Engine\Programs\AutomationTool\Saved\Logs\BuildCookRun\Log.txt for full exception trace)

AutomationException: Failed sending oplog request to Zen at [::1]:8558 for oplog ABC.1a23b45c.Windows: Error while copying content to a stream..
   at AutomationScripts.Project.ReadZenCookedFilesFromZenServer(ProjectParams Params, DeploymentContext SC, Boolean bAutoLaunch, String PackageStoreFileArgName, String PackageStoreFileArgValue) in D:\Horde\ABC\Sync\Unreal\Engine\Source\Programs\AutomationTool\Scripts\CopyBuildToStagingDirectory.Automation.cs:line 1001
   ...
```

It occurs roughly once every 10 runs, across different machines and target platforms.

Our Cook and Staging steps are combined into a single BuildCookRun command (`-cook -stage -package`), which looks like this:

```xml
<Property Name="PROJECT" Value="ABC"/>
<Property Name="Platform" Value="Win64"/>
<Property Name="Config" Value="Test"/>
<Property Name="ArchiveDirectory" Value="$(RootDir)\$(PROJECT)\Publish\Package"/>
<!-- Configuration: Root=(Type=Hierarchical, Inner=<Local>, Inner=<Remote>, Inner=<Cloud>) -->
<Property Name="DDCGraph" Value="DDCGraph_Horde"/>

<Command Name="BuildCookRun" Arguments="-project=$(PROJECT) -target=$(PROJECT) -platform=$(Platform) -configuration=$(Config) -archivedirectory=$(ArchiveDirectory) -NoCodeSign -skipbuild -cook -CookIncremental -ddc=$(DDCGraph) -stage -pak -iostore -package -archive -crashreporter -prereqs"/>
```

Debugging showed three scenarios in the Staging step's “Reading oplog from Zen…” phase (`ReadZenCookedFilesFromZenServer()`):

  1. `IsZenServerRunning(SocketHostNameAndPort)` = true, and the Zen GET request succeeds.
  2. `IsZenServerRunning(SocketHostNameAndPort)` = false: its GET to `health/ready` fails with “No connection could be made because the target machine actively refused it.”, so `RunUnrealPak(…, “ZenAutoLaunch”)` is called and the step then continues successfully.
    1. This is the most common scenario for successful runs.
  3. `IsZenServerRunning(SocketHostNameAndPort)` = true, but the Zen GET request fails with “Error while copying content to a stream..”.

In the Zen logs, all three scenarios start during the Cook step and look similar going into the Staging step, regardless of outcome.

The temporary fix that has worked for us so far is to split the Cook out into its own node/step (`<Cook …/>`).

Is this a bug or a misconfiguration on our part? I initially split out the steps thinking Zen might need some time after cooking, but the most common scenario is still #2 above, where the health GET fails.

[Attachment Removed]

This seems like a bug. To help me understand how it might be happening, can you tell me whether you are running with the LimitProcessLifetime config value (described here) set to true or false?

[Attachment Removed]

Ok, could you please test with LimitProcessLifetime set to false in your INIs? I want to know whether this alleviates the problem for you. It should not be required, but I want to know whether it is necessary to reproduce this bug.

Edit: I had initially inverted my intent. The thing to try here is setting LimitProcessLifetime=False and reporting back whether that resolves the issue.
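A minimal sketch of that override, for anyone trying it. It assumes the key is read from the `[Zen.AutoLaunch]` section (the section other posters in this thread quote for their Zen launch settings) and that a project-level INI such as Config/DefaultEngine.ini overrides the engine default; both are assumptions, so check where `LimitProcessLifetime` is declared in your engine's BaseEngine.ini before relying on this.

```ini
; Sketch only. Assumptions: LimitProcessLifetime lives in [Zen.AutoLaunch],
; and this project-level INI overrides the engine default from BaseEngine.ini.
[Zen.AutoLaunch]
; False = do not limit the lifetime of the auto-launched Zen server
LimitProcessLifetime=False
```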

[Attachment Removed]

Hi,

I’m sorry, I wrote the inverse of what I meant in my response. The thing I’d like to establish is whether this problem goes away when people set LimitProcessLifetime=False in their INIs. Apologies for the misdirect.

[Attachment Removed]

Ok. Thanks for reporting back. I apologize for writing the inverse of my intent the first time, but I’m glad this has been confirmed as a solution. I’ll see if there are options to make LimitProcessLifetime=True more reliable.

[Attachment Removed]

We have this in the Zen.AutoLaunch section of our Project/Config/DefaultEditor.ini; we haven’t specified LimitProcessLifetime:

```ini
[Zen.AutoLaunch]
AllowRemoteNetworkService=true
ExtraArgs=--http asio --gc-cache-duration-seconds 1209600 --gc-interval-seconds 21600 --gc-low-diskspace-threshold 2147483648 --quiet
```

[Attachment Removed]

Hello,

We are also experiencing this issue on 5.7.2. It is also intermittent in our case; we ran into it twice yesterday with both Test and Shipping build configurations on one specific target platform. The CI pipeline where this issue occurred had multi-process cook enabled with 16 cores (it is not globally enabled; we enable it only on this CI pipeline via `-CoreProcessCount=N`). So far we have only seen this issue on this pipeline, where it has reproduced on different machines.

We build, cook, stage, and package (i.e. `-build -cook -stage -package`) using BuildCookRun in UnrealAutomationTool. We are not using BuildGraph. In our configuration files, LimitProcessLifetime is set to true.

Hopefully this helps, feel free to reply back if you need more information. Thanks!

[Attachment Removed]

Hey,

We have just updated our project to 5.7.4 and are now also seeing these same intermittent build failures.

We haven’t configured ZenAutoLaunch in our DefaultEngine.ini, so we’re using the default settings from the 5.7.4 BaseEngine.ini.

We are also running BuildCookRun with `-build -cook -stage -package`.

We are not using multi-process cook.

[Attachment Removed]

Hey,

We’ve been running with LimitProcessLifetime=False for a week now, and have not experienced any build failures related to this since!

[Attachment Removed]