Horde - How to improve job sync/setup time?

Hi,

When starting a new job on Horde, it consistently takes the agent 5-10 minutes to sync and get ready to run the job. This is a significant portion of our overall job runtime, and is especially egregious when we’re running simpler scripts that take less than a minute to complete.

Looking at the logs, the agent is going through these steps:

  1. Update the have table
  2. Clean the workspace (what exactly is this clean doing?)
  3. Remove files for the sync
  4. Add files for the sync

It appears that these steps always take a minimum baseline amount of time to complete, even if the worker just ran a job on the exact same changelist. I tried changing our workspaces to useHaveTable=false, and while that did reduce the sync time somewhat, it also flipped the workspaces to read-only, which we can’t have because some of our jobs need to submit to Perforce.

My current understanding is that this is because Horde is designed to work with ephemeral workers, so the managed workspace needs to first determine the actual workspace state before any syncing can occur. However, our setup uses permanent VM agents with dedicated workspaces, so updating the have table and cleaning the workspace before each job should be entirely unnecessary. Are there any config changes we can make to save this time and essentially just sync P4 normally, or would we need to rewrite the PerforceExecutor to not use a managed workspace?

Thanks,

Jeremy

Steps to Reproduce

  1. Start a job in Horde
  2. Observe how long it takes the worker to get ready to run the job

Hi Jeremy,

Is this for an agent building the same stream, or is it switching streams?

Horde has a custom way of syncing Perforce streams called ManagedWorkspaceMaterializer, which tries to optimize for stream-switching scenarios by caching files (moving them to a cache dir) when syncing/switching. This carries some overhead, but 5-10 minutes seems a bit too much. Could you describe, or even post, anonymized bits of the lease log for the agent where it logs these timings?

There’s also a new PerforceMaterializer which just uses vanilla Perforce syncing. The code is available and we’re evaluating it internally for our incremental workspaces (no stream switching, just building head). If you build Horde from source, it’s possible to try it today, but it’s still very experimental (no support for Horde-defined view filtering; it must use the stream view).

Example:

"workspaceTypes": {
   "Incremental": {
      "base": "Win64Base",
      "identifier": "FooMainInc",
      "stream": "//Foo/Main-Incremental",
      "incremental": true,
      "method": "name=perforce"
   },
   ...
}

The PerforceMaterializer requires you to get the latest version of Horde from Perforce/GitHub and build it yourself. It’s still quite experimental and we’re evaluating it internally, so perhaps too “hot”.

Streams can be forced to share the directory they sync to if the “identifier” field in the workspace type config is identical. If you open an agent in the web UI, you can check the identifier of each workspace that agent is handling.
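For example, a rough sketch (names made up) of two streams sharing a sync directory by giving their workspace types the same identifier. In the //Foo/Main stream config:

"workspaceTypes": {
   "Incremental": {
      "identifier": "FooShared",
      "stream": "//Foo/Main",
      "incremental": true
   }
}

and in the //Foo/Dev stream config:

"workspaceTypes": {
   "Incremental": {
      "identifier": "FooShared",
      "stream": "//Foo/Dev",
      "incremental": true
   }
}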

1) Are you using a single Perforce server?

2) What does your workspace types config look like?

3) Have you experimented with the “incremental” flag on the workspace type? That stops the cleanup of untracked files.

2) Can you compare this sync log to a previous job on that same agent? In particular, look at the client names “//Horde+BUILD-WORKER31+depot-main”.

3) Just to confirm, the same agent that built a previous job synced the same stream/workspace, only at a different CL?

Yes, the PerforceMaterializer will overwrite and not re-use what the ManagedWorkspaceMaterializer does. But you should be able to separate them with a unique “identifier” in the workspace type config. That does mean files are stored twice, though. We have also seen some issues with the PerforceMaterializer and conforms internally that we will be addressing (hence very experimental).
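That could look roughly like this (hypothetical name and identifier, mirroring the earlier example), so it syncs into its own directory instead of on top of the ManagedWorkspace one:

"IncrementalPM": {
   "base": "Win64Base",
   "identifier": "FooMainIncPM",
   "stream": "//Foo/Main-Incremental",
   "incremental": true,
   "method": "name=perforce"
}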

Normal Perforce syncing will always be faster, so once PerforceMaterializer matures that’s likely going to be your best bet.

But given that it’s still under development, it’s puzzling why some steps take so long with the ManagedWorkspace option.

  1. What type of disks do agents use for syncing? (SSD, NVMe, network-attached/cloud-based, and what specs?)
  2. How many files do you have in your stream? What’s the total size in gigabytes?
  3. Do you have any anti-virus, Windows Defender or corporate monitoring software on these agents?
  4. Can you check what filter drivers are active? (Run “fltmc” from the command line as admin)

We’ve seen certain filter drivers installed by PS4/PS5 SDKs significantly slow down file operations.

Oh, you were able to test it, great. I just submitted a fix on Monday for the unshelving to mimic what ManagedWorkspace is doing (p4 print) to avoid this exclusive/locking issue. We also had an issue with conforms (the resetting of workspaces on the agent) that was addressed. So we run the PerforceMaterializer on a few streams internally now.

But it’s still puzzling to see your disk I/O being so slow. Are you able to exclude the workspace directories from Windows Defender? Or disable it completely for testing purposes?

Hearing that even the PerforceMaterializer takes 18 hours suggests something is going on outside Horde’s domain, possibly at the machine/OS, network, or P4 server level.

Have you had P4 syncs performing as expected on this OS/hardware combo prior to using it from Horde?

Using the PerforceMaterializer, we had a full sync of one virtual stream take around 1h 50m for 6.8 million files, 1.4 TB of data. This is on quite mediocre disks (AWS EBS gp3, 16,000 IOPS, 1,000 MB/s throughput), so I suspect we might be limited by that.
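As a rough sanity check on those numbers: 1.4 TB over about 110 minutes averages out to roughly 210 MB/s, well under the volume’s 1,000 MB/s cap, so per-file overhead and IOPS seem more likely to be the limit than raw throughput.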

Sorry, can you clarify what you mean by switching streams? Is it when an agent has multiple workspaces synced that point at different P4 streams, and it switches which one is actively being used based on the needs of the current job? Or when a single Horde stream config defines a workspace, but then changes the definition of that workspace to point at a different P4 stream without changing the workspace identifier? Or when multiple Horde stream configs each define workspaces using the same identifier? Something else I haven’t thought of?

For the most part, we are working in a single Horde stream, and our workers typically have two workspaces assigned: the main P4 workspace, and a setup workspace on the same depot but with a much more restricted view (we’re using this to run our automation tool so the initial Horde job Setup Build step can run without needing an entire workspace). We do have a few uncommonly used secondary Horde streams, but they use unique workspaces separate from the main ones and don’t run jobs very often.

Here’s a sync lease log from an agent running our main incremental build. Aside from some UBA compute work, the previous job on the same worker was the same incremental build job, with about 240 CLs between the two runs.

Syncing to //depot/main at CL ######
Reverting changes...
Reverting changes... (0.1s)
Updating have table...
Finding files to clean...
Hang detected (Perforce connection (2)): sync -k -q //Horde+BUILD-WORKER31+depot-main/...@###### (60s)
Updating have table... (66.2s)
Finding files to clean... (193.3s)
Fetching metadata...
Hang detected (Perforce connection (2)): fstat -Ol -Op -Os -Rh -T code,depotFile,clientFile,headType,haveRev,fileSize,digest //Horde+BUILD-WORKER31+depot-main/...@###### (60s)
Fetching metadata... (112.9s)
Saving metadata to D:\Horde\depot-main\Contents.dat...
Saving metadata to D:\Horde\depot-main\Contents.dat... (4.9s)
Gathering files to remove...
Gathering files to remove... (23.2s)
Moving 10176 files to cache...
Moving 10176 files to cache... (16.9s)
Deleting 2298 directories...
Deleting 2298 directories... (2.0s)
Gathering files to add...
Gathering files to add... (50.5s)
Syncing 12634 files using 4 threads...
Syncing 12634 files using 4 threads... 14.5% (952.6mb/6,556.5mb)
Syncing 12634 files using 4 threads... 30.4% (1,991.2mb/6,556.5mb)
Syncing 12634 files using 4 threads... 67.8% (4,446.2mb/6,556.5mb)
Syncing 12634 files using 4 threads... 87.0% (5,706.2mb/6,556.5mb)
Syncing 12634 files using 4 threads... 99.0% (6,490.6mb/6,556.5mb)
Syncing 12634 files using 4 threads... 100.0% (6,556.5mb/6,556.5mb)
Completed in 452.1s

I don’t see the PerforceMaterializer in our 5.6.1 depot; what CL was it first added to the Epic P4 depot in? I’ll have to try bringing that into our own depot to test it out; it does sound like exactly what I was asking for.

Sorry, the numbering for these responses got a bit messed up.

  1. I think we have a single base server with a couple of proxy servers for splitting the load. Each worker is configured to always connect to the same proxy.
  2. These are our two main workspace configurations. We have some similarly configured workspaces for some other streams like milestone releases or development streams, but they are used much less frequently and none of the workspaces share an identifier.
"Main":
{
	"identifier": "depot-main",
	"incremental": true,
	"method": "name=managedWorkspace&preferNativeClient=true",
	"stream": "//depot/main_build"
},
"HordeSetup":
{
	"identifier": "depot-main-horde-setup",
	"incremental": true,
	"method": "name=managedWorkspace&preferNativeClient=true",
	"stream": "//depot/main_horde_setup"
}
  1. We’re already using incremental workspaces
  2. Can confirm that the client name is Horde+BUILD-WORKER31+depot-main for each sync
  3. Yes, same agent, same stream/workspace, same job, just a different CL. I checked again and found a similar case on the same worker today: two instances of the same job ran directly back to back, with no UBA compute work in between, on the same stream and workspace, about an hour apart and with fewer than 100 changelists between the syncs. Both jobs took a bit over 7 minutes to sync, so about the same amount of time as the log I posted in my previous response. The logs look pretty much identical to what I posted earlier, just with slightly different timings and numbers of files affected.

I’m trying to bring in the PerforceMaterializer from Epic’s P4, and I noticed in my testing that the agent is completely wiping the existing workspace and re-syncing the entire depot. Do you know if there’s a good way to prevent that? I had thought it might be because it was using a new workspace prefixed with Horde+PM instead of just Horde, but the full re-sync is happening even after I changed that to match the existing workspace name. Unfortunately, due to how long the full re-sync takes, I haven’t yet been able to test whether the PerforceMaterializer’s incremental syncing is faster than the managed workspace, but if it is, we will definitely need a way to avoid the full re-sync on all of our workers.

Thanks for the info; I’ll definitely continue testing the PerforceMaterializer. What are the conform issues that have been happening?

Got some help from our IT to answer the agent setup questions:

  1. The disks on our workers are running on Pure Storage over 32Gb Fibre Channel
  2. The stream is roughly 4.5 million files and close to 1 TB (about 900 GB). Could you provide a frame of reference for how large Epic’s streams are and how quickly agents are able to sync them?
  3. The agents are running Windows Defender by default
  4. fltmc results:
C:\Windows\System32>fltmc
 
Filter Name                     Num Instances    Altitude    Frame
------------------------------  -------------  ------------  -----
bindflt                                 1       409800         0
UCPD                                    9       385250.5       0
SIE-FileMon                             0       380852.5       0
WdFilter                                9       328010         0
storqosflt                              0       244000         0
wcifs                                   0       189900         0
gameflt                                 3       189850         0
CldFlt                                  0       180451         0
bfs                                    11       150000         0
FileCrypt                               0       141100         0
luafv                                   1       135000         0
UnionFS                                 0       130850         0
npsvctrig                               1        46000         0
Wof                                     7        40700         0
FileInfo                                9        40500         0

I was also finally able to test out the PerforceMaterializer a bit, and when running test jobs against it I am seeing sync times of 1-2 minutes, so definitely a promising improvement. However, the first job using the PerforceMaterializer took over 18 hours to do its initial sync, so we would need to figure out a way to avoid that cost on the rest of our workers.

Just another test note for the PerforceMaterializer: it’s unable to unshelve files when the depot has them set to exclusive checkout. Previously the managed workspace would just write the shelf contents to disk without actually unshelving them, but I guess since this works with a more standard P4 workspace, it is actually unshelving the files.