Horde: slow transfer speed of large artifact files

We are seeing a very large discrepancy in transfer speed when agents fetch build artifacts: a small set of large files downloads far more slowly than a much larger set of small files:

Reading block "Sparkle Compile Editor Win64":"Sparkle Compile Editor Win64 Binaries" from temp storage (artifact: 680fd1ee161c65c96924f017 'sparkle-compile-editor-win64' (step-output), ns: horde-artifacts, ref: step-output/sparkle-main/65075/sparkle-compile-editor-win64/680fd1ee161c65c96924f017, local: D:\h\sparkle\Sync\Engine\Saved\BuildGraph\Sparkle Compile Editor Win64\Manifest-Sparkle Compile Editor Win64 Binaries.xml, blockdir: block-Sparkle Compile Editor Win64 Binaries) Using 16 read tasks, 16 decode tasks, 16 write tasks Written 263 files (204.5mb, 45.6mb/s) Written 307 files (280.4mb, 11.9mb/s) Written 348 files (359.4mb, 15.8mb/s) Written 588 files (686.9mb, 75.9mb/s) Written 1479 files (1324.2mb, 145.6mb/s) Elapsed: 24s, bundle.packet_cache.hits: 2,258, bundle.packet_cache.misses: 47, bundle.bundle_cache.hits: 47, bundle.bundle_cache.misses: 1, backend.http.wall_time_secs: 0, backend.http.num_bytes: 0, bundle.packet_reader.num_bytes_read: 0, bundle.packet_reader.num_encoded_bytes_read: 0, bundle.packet_reader.num_decoded_bytes_read: 0 Download took 24.6s

Reading block "Sparkle Compile Win64":"Compiled Binaries" from temp storage (artifact: 680fd1ac161c65c96924edfe 'sparkle-compile-win64' (step-output), ns: horde-artifacts, ref: step-output/sparkle-main/65075/sparkle-compile-win64/680fd1ac161c65c96924edfe, local: D:\h\sparkle\Sync\Engine\Saved\BuildGraph\Sparkle Compile Win64\Manifest-Compiled Binaries.xml, blockdir: block-Compiled Binaries) Using 16 read tasks, 16 decode tasks, 16 write tasks Written 2 files (7.6mb, 0.7mb/s) Written 2 files (12.2mb, 0.9mb/s) Written 2 files (17.0mb, 1.0mb/s) Written 2 files (22.1mb, 1.0mb/s) Written 2 files (26.0mb, 0.8mb/s) Written 2 files (32.1mb, 1.3mb/s) Written 2 files (37.4mb, 1.0mb/s) Written 2 files (41.5mb, 0.7mb/s) Written 2 files (47.3mb, 1.1mb/s) Written 2 files (52.6mb, 1.0mb/s) Written 2 files (58.4mb, 1.2mb/s) Written 2 files (64.0mb, 1.1mb/s) Written 2 files (68.5mb, 0.9mb/s) Written 3 files (90.2mb, 2.3mb/s) Written 3 files (94.6mb, 0.9mb/s) Written 3 files (100.9mb, 1.4mb/s) Written 3 files (105.5mb, 1.0mb/s) [... snip ...] Written 3 files (541.1mb, 12.4mb/s) Written 3 files (878.2mb, 80.3mb/s) Written 3 files (1263.4mb, 81.9mb/s) Written 3 files (1540.1mb, 50.7mb/s) Written 3 files (1754.0mb, 44.0mb/s) Written 4 files (1998.7mb, 40.1mb/s) Written 4 files (2368.3mb, 79.5mb/s) Written 4 files (2520.4mb, 27.8mb/s) Written 15 files (2746.8mb, 49.6mb/s) Written 22 files (2890.6mb, 56.8mb/s) Elapsed: 347s, bundle.packet_cache.hits: 436, bundle.packet_cache.misses: 71, bundle.bundle_cache.hits: 71, bundle.bundle_cache.misses: 2, backend.http.wall_time_secs: 0, backend.http.num_bytes: 0, bundle.packet_reader.num_bytes_read: 0, bundle.packet_reader.num_encoded_bytes_read: 0, bundle.packet_reader.num_decoded_bytes_read: 0

I’m currently trying to pinpoint the underlying cause for this, as it is a relatively major bottleneck.

AFAIK there haven’t been any further changes to the DirectoryNode storage implementation in ue5-main to address this.


Hey Yang

Thanks for reporting this. Can you quickly confirm your:

  • Horde server version
  • Virus scanning software (and any exclusions applied for the Horde server)

Kind regards,

Julian

Hey there Yang, thanks for confirming the details.

Let us know how it goes. I spoke with one of our devs internally and this was some of their feedback:

  • We had to play with the read, decode and write task counts in DirectoryNode.Extract.cs quite a bit to fine-tune this (see lines 250 to 252).
  • Looking at the log output, they are using the default setup:
    • “Using 16 read tasks, 16 decode tasks, 16 write tasks”
  • The logic that defines the number of write tasks may be of particular interest, since it adapts based on the total amount of data being extracted:
    • int numWriteTasks = options.NumWriteTasks ?? Math.Min(1 + (int)(directoryNode.Length / (16 * 1024 * 1024)), 16);
    • But looking at the logs, it is still ending up at the cap of 16, which matches the default behaviour.
    • We would advise tweaking these three values, perhaps scaling more aggressively based on the number of files involved (see the sketch after this list), although we did hit OOM issues ourselves, so they will need to be careful how far they push things.
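
To make that concrete, here is a minimal sketch (not the shipped Horde code) of how the size-based formula quoted above could be combined with a more aggressive, file-count-based scale. GetNumWriteTasks, the fileCount parameter and the caps are illustrative assumptions, not values taken from DirectoryNode.Extract.cs:

    using System;

    // Minimal sketch, not the shipped Horde code.
    static class WriteTaskScaling
    {
        public static int GetNumWriteTasks(long totalLength, int fileCount, int? overrideValue)
        {
            // Size-based default quoted above: one write task per 16 MiB, capped at 16.
            int bySize = Math.Min(1 + (int)(totalLength / (16 * 1024 * 1024)), 16);

            // Hypothetical tweak: allow more writers when many files are involved,
            // with a hard cap to limit memory use.
            int byFiles = Math.Clamp(fileCount / 64, 1, 32);

            return overrideValue ?? Math.Max(bySize, byFiles);
        }
    }

Any values like these would need to be validated against the OOM behaviour mentioned above before pushing the caps further.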

Julian

Took more finagling than expected, but I’m seeing an initial 3x to 10x speed-up after refactoring the way extraction target files are opened for writing.

Will let this stew for a couple of days to collect more data before I polish it up into a PR.
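
In the meantime, here is a minimal sketch of the general direction, under the assumption that the cost comes from reopening each target file and serializing its chunk writes through a single stream; the class and method names are hypothetical, and the actual change is in the PR linked below:

    using System;
    using System.IO;
    using System.Threading;
    using System.Threading.Tasks;
    using Microsoft.Win32.SafeHandles;

    // Sketch only: open each extraction target once, pre-size it, and let write
    // tasks place decoded chunks at absolute offsets instead of funnelling them
    // through one sequentially written stream.
    static class PositionalChunkWriter
    {
        public static SafeFileHandle OpenTarget(string path, long fileLength)
        {
            // One handle per extracted file, created up front at its final size.
            return File.OpenHandle(path, FileMode.Create, FileAccess.Write,
                FileShare.None, FileOptions.Asynchronous, preallocationSize: fileLength);
        }

        public static ValueTask WriteChunkAsync(SafeFileHandle handle,
            ReadOnlyMemory<byte> chunk, long offset, CancellationToken ct)
        {
            // RandomAccess allows concurrent writes to the same handle, so chunks
            // decoded from different blobs can be written as soon as they are ready.
            return RandomAccess.WriteAsync(handle, chunk, offset, ct);
        }
    }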

Reading block "Sparkle Compile Win64":"Compiled Binaries" from temp storage (artifact: 6811319e3617c52a5b91acac 'sparkle-compile-win64' (step-output), ns: horde-artifacts, ref: step-output/sparkle-main/65160/sparkle-compile-win64/6811319e3617c52a5b91acac, local: D:\h\sparkle\Sync\Engine\Saved\BuildGraph\Sparkle Compile Win64\Manifest-Compiled Binaries.xml, blockdir: block-Compiled Binaries) Using 16 read tasks, 16 decode tasks, 16 write tasks Written 3 files (150.6mb, 39.8mb/s) Written 4 files (352.2mb, 0.5mb/s) Written 4 files (801.5mb, 111.1mb/s) Written 4 files (1583.5mb, 147.5mb/s) Written 20 files (2391.9mb, 165.6mb/s) Written 20 files (2876.8mb, 98.4mb/s) Written 22 files (2891.5mb, 98.2mb/s) Elapsed: 30s, bundle.packet_cache.hits: 424, bundle.packet_cache.misses: 69, bundle.bundle_cache.hits: 69, bundle.bundle_cache.misses: 2, backend.http.wall_time_secs: 0, backend.http.num_bytes: 0, bundle.packet_reader.num_bytes_read: 0, bundle.packet_reader.num_encoded_bytes_read: 0, bundle.packet_reader.num_decoded_bytes_read: 0

PR: https://github.com/EpicGames/UnrealEngine/pull/13190

We’re seeing a 10x to 50x improvement in throughput.

We are running a lightly modified Horde server and agents based on 5.5.4-CL40574608 with this patch: https://github.com/EpicGames/UnrealEngine/pull/13089

The server host is not running any scanning software. IIRC the agents are running SentinelOne, but I don’t see the same slowdown when other apps perform similarly large I/O writes.

I think I have it narrowed down to BatchOutputWriter having high internal contention when the same file is split across many chunk sets from many different blobs. I’ll report back if I find any promising optimization results.
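
For reference, here is a hypothetical probe (not BatchOutputWriter itself) that counts contended lock acquisitions when 16 tasks push chunks for the same file through a single lock-guarded writer; the chunk size and iteration counts are made up for illustration:

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    class ContentionProbe
    {
        static readonly object _gate = new object();
        static long _bytesWritten;

        static async Task Main()
        {
            long contendedBefore = Monitor.LockContentionCount;

            Task[] writers = new Task[16]; // mirrors the "16 write tasks" from the logs
            for (int i = 0; i < writers.Length; i++)
            {
                writers[i] = Task.Run(() =>
                {
                    byte[] chunk = new byte[1 << 20]; // 1 MiB stand-in for a decoded chunk
                    for (int n = 0; n < 256; n++)
                    {
                        lock (_gate)
                        {
                            // Stand-in for the serialized write into the shared target file.
                            _bytesWritten += chunk.Length;
                        }
                    }
                });
            }
            await Task.WhenAll(writers);

            Console.WriteLine($"Contended lock acquisitions: {Monitor.LockContentionCount - contendedBefore}");
            Console.WriteLine($"Total written: {_bytesWritten >> 20} MiB");
        }
    }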