Queueing a job at certain changelists takes >10 mins since upgrading to Horde 5.5

Hi all,

We’re currently looking at upgrading to Horde 5.5 and are seeing some strange behavior, with some jobs taking up to 14 minutes to queue from the Horde web UI:

[Image Removed]

The issue seems to be related to Perforce calls, and we don’t have the same issue on our Horde 5.4 system.

I’ve also noticed that queuing a build at more recent changelists on Horde 5.5 doesn’t have this issue (it’s instant). For example, a build at CL 975679 will take ~14 minutes to queue on 5.5, whereas at CL 1000172 it’s instant.

For reference, CL 975679 is 30 days old and 1000172 is 6 hours old.

From some tracing, it seems that Horde does many p4 changes calls, such as `p4 changes -m20 -ssubmitted //zzt-horde-p4bridge-p4server-horde-server-worker-deployment-7fc44b674f-74ppv-HordeUser-++Project+stream/…[Content removed] and only on Horde 5.5.

Have there been some changes around this area in Horde 5.5 which could cause this behaviour?

Thanks

Sean

Hi Sean,

That’s not good seeing such delays.

1) How many changes are there in this stream between CL 975679 and 1000172?

2) Is this reproducible if running the same job again?

3) How quickly did you try to run the job after upgrading?

4) With the tracing you mentioned, do you see which trace spans this is executing inside? Would help pinpoint things on our side.

As you noticed, Horde is querying these older CLs in the stream, but those should get indexed and saved. It’s possible the background indexing of CLs in the Perforce stream is lagging behind after an upgrade. But once it’s queried, it should be more or less instant for any follow-up builds.

Also, Epic is closing for a two-week summer break, starting next week. Apologies for any delays in response.

Hi Carl,

Thank you for the quick response!

  1. There’s 14,000 changes between those two CLs.
  2. It’s reproducible yes. It happens every time when I queue at that CL, even if it’s been successfully queued previously.
  3. The server has been running for over two weeks now, with some restarts inbetween for debugging. So some time after that. I saw when I restarted it was running “PerforceServiceCache.UpdateClusterInternalAsync”. I restarted it yesterday and it looks like that has been running for 18 hours and is still running now it seems, according to open telemetry.

[Image Removed]4. For the tracing when a job is being queued, we can see most of the time going to “PerforceService.CommitSource.FindAsync” -> “PerforceConnection.GetChangesAsync”.

[Image Removed]Is this helpful? We can probably provide the full trace as a JSON file if not, but almost all of the time seems to go to those two methods.

Thanks

Hi Carl, I can provide some more information on this now which might help?

Checking Mongo, I saw that our 5.5 dev server had 1.3k items in CommitsV3 vs 49.5k on our prod server.

Also that the MinChange for the dev server was CL 1008487, which is from the day the server was started. Max change was up to date at the latest CL (1010330) Is this the expected behavior or should some previous changes be cached too?

The server seems to run the PerforceServiceCache.UpdateClusterInternalAsync call for about 18 hours or so and then stop. Then later on (not sure when) it will start again and repeat.

If we queue a build between the min and max change, the queuing is instant.

If we queue a build before the MinChange, and the UpdateClusterInternalAsync is still running, we see the queuing take ~14 mins as it does multiple p4 changes calls, each taking ~20secs.

If we queue a build before the MinChange, and UpdateClusterInternalAsync is finished, we get a broken pipe error, interestingly, similar to this but not always from this function:

Failed: write: Broken pipe: Broken pipe (Generic=Fault) at EpicGames.Perforce.PerforceResponse.EnsureSuccess() in /app/Source/Programs/Shared/EpicGames.Perforce/PerforceResponse.cs:line 75 at HordeServer.Perforce.PerforceService.CheckShelfAsync(StreamConfig streamConfig, Int32 changeNumber, CancellationToken cancellationToken) in /app/Source/Programs/Horde/Plugins/Build/HordeServer.Build/Perforce/PerforceService.cs:line 822 at HordeServer.Jobs.JobsController.CreateJobAsync(CreateJobRequest create, CancellationToken cancellationToken) in /app/Source/Programs/Horde/Plugins/Build/HordeServer.Build/Jobs/JobsController.cs:line 138 at lambda_method878(Closure, Object) at Microsoft.AspNetCore.Mvc.Infrastructure.ActionMethodExecutor.AwaitableObjectResultExecutor.Execute(ActionContext actionContext, IActionResultTypeMapper mapper, ObjectMethodExecutor executor, Object controller, Object[] arguments) at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.<InvokeActionMethodAsync>g__Logged|12_1(ControllerActionInvoker invoker) at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.<InvokeNextActionFilterAsync>g__Awaited|10_0(ControllerActionInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)Tried to track down the p4 logs for this call from our p4 server but couldn’t actually find the call taking place in order to track it down, so perhaps it didn’t get that far.

Hopefully this helps narrow down the problem though?

Hi Sean,

I’m just checking over p4 history now and can see that there were changes to the UpdateClusterCommitsAsync function which UpdateClusterInternalAsync calls into between the 5.4 and 5.5 release.

So it’s entirely possible that an issue or quirk has been introduced here.

I’ll have someone on the team diagnose this further and see if the code change has had a knock on effect.

I’ll let you know as soon as we have more information.

Hi Sean,

I suspect this issue has been fixed after the initial 5.5 release.

There are two changes that went in post 5.5 to “PerforceServiceCache.cs => FindAsync” that sound very related to the issue you are seeing.

https://github.com/EpicGames/UnrealEngine/commit/03a15ce47e4d2d05926fadf5329dad529438db72

"Horde: Fix cached commit source decrementing number of remaining results to return even if commits are filtered out of the resulting list by commit tags.

Prevents issues with new jobs not being able to identify code changelist to run at."

https://github.com/EpicGames/UnrealEngine/commit/8a15611c7b2ccdfe0d461f2e21729b7384ad89ed

“Horde: Decrement the number of remaining results to return when falling back to slow path for querying commits.”

Can you confirm how you installed the latest version of Horde?

Was it through the 5.5 docker image (https://github.com/orgs/EpicGames/packages/container/package/horde\-server) as that was last updated 9 months ago.

The “latest” version which was generated 6 months ago will include the fixes I mention above and could be worth trying out.

Cheers,