I am evaluating Horde on a machine with a relatively small 1TB NVMe drive that has been enough for TeamCity for a few years now. Because of the volume of intermediate artifacts that Horde generates and exchanges with the server, it puts far more pressure on the storage available on this machine. That would be fine if retention were working as expected, but despite having every artifact type set to "keepCount": 2 or 3, Horde does not appear to be culling data according to those rules. I am not using keepDays at all, because a particularly busy day could blow past the available storage; keepCount seems like the better way to keep a hard lid on the storage requirements.
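For reference, the retention side of my setup looks roughly like the sketch below. This is simplified from memory rather than pasted from my actual config, so treat the exact nesting and property names around the artifact types as approximate; the keepCount values are what I actually use.

```json
{
	"artifactTypes": [
		{
			// Illustrative entry; my other artifact types follow the same pattern
			"type": "step-output",
			"keepCount": 2
			// no "keepDays" on purpose, so only the newest builds should survive
		}
	]
}
```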
Here is where the storage is at right now. I’ve expanded one entry from each size category to see what’s inside, but clearly there are far, far more than 2 being “kept” in this situation. Am I missing something, or is this a bug?
- This is sorted by name, so the latest changelist is up top.
- This is also less than half of the total number of folders here
- In total there are 118 folders in here; most of them are around 2.5GB
- Some of the newer ones are in the 3GB range
- The newest is 22.4GB
- Weirdly, a bunch at the bottom of the list (the oldest from a changelist perspective) are still huge
[Image Removed]
Given that keepCount is a property of the artifact types, my expectation is that there would only be the top 2 entries under step-output/development, i.e. we would just be keeping the last 2 builds, something like this:
[Image Removed]
Instead, we have over 100 more of these huge data chunks than expected eating up space.
I am on Horde built from 5.6. How can I fix this apparently completely busted keepCount behavior?
Hey there,
Just as a quick response to this - we have some practical debugging documentation regarding storage & retention here. There can be some tricky aspects to this, and it’s certainly something we are trying to keep an eye on.
A couple of key considerations/suggestions:
- Make sure your garbage collection is on
- Capture the server log snippet for when GC is running - we want to see if it’s attempting to do anything here
Example of GC config:
"namespaces": [ { { "id": "horde-artifacts", "gcFrequencyHrs": 0.1, "gcDelayHrs": 6, }, //... }
We should also be able to track down some important snippets of the Expiration process by looking for the following within the log:
- “Checking for expired artifacts…”
- This will give us the top of the expiration pass (which occurs hourly, within ExpireArtifactsAsync) - I’m curious to see what’s going on here
- “Expiring artifacts from orphaned stream…”
- This will indicate that we are actually attempting to expire artifacts from the particular stream
Once we have some of the log context around this, we can dig in a bit more.
Kind regards,
Julian
Hey there,
Let me have a look internally as I could swear there was a known issue for this.
Julian
Hey there,
Just circling back regarding this:
While we get a proper fix in for this (where presumably we would also properly remove the MongoDB entry), here is a workaround you can apply in the meantime:
MemoryMappedFileCache.cs:
```csharp
public void Delete(FileReference file)
{
	lock (_lockObject)
	{
		MappedFile? mappedFile;
		if (_pathToMappedFile.TryGetValue(file, out mappedFile))
		{
			mappedFile.DeleteOnDispose();
			_pathToMappedFile.Remove(file);
			_mappedFiles.Remove(mappedFile.ListNode);
			mappedFile.Release();
		}
		else
		{
			// [JGAMBLE_DIVERGENCE-START] - swallow exception
			// The file is not tracked by the cache; if it is already gone from
			// disk, carry on rather than failing the whole GC pass.
			try
			{
				FileReference.Delete(file);
			}
			catch (Exception)
			{
				// Swallow exception for the time being
			}
			// [JGAMBLE_DIVERGENCE-END] - swallow exception
		}
	}
}
```

This should at the very least get you past the GC loop bailing out, so it can continue on.
Let me know if this helps in the interim!
Kind regards,
Julian
Thanks. I am using default settings; neither gcFrequencyHrs nor gcDelayHrs exists in the configs out of the box.
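Presumably I would need to add them explicitly, along the lines of your example. Exactly where this belongs (globals.json vs. the server settings file) is a guess on my part, so treat the placement below as an assumption.

```json
{
	"plugins": {
		"storage": {
			// Placement under "plugins"/"storage" is my assumption; values copied from your snippet
			"namespaces": [
				{
					"id": "horde-artifacts",
					"gcFrequencyHrs": 0.1,
					"gcDelayHrs": 6
				}
				//...
			]
		}
	}
}
```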
Looking at the logs (my bad, I should have done this sooner), I can see that there are exceptions where it is bombing out.
For example:

```
[00:05:08 inf] Running garbage collection for namespace horde-artifacts...
[00:05:08 inf] Garbage collection queue for namespace horde-artifacts (storage:horde-artifacts:check) has 50482 entries
[00:05:08 dbg] Deleting horde-artifacts blob 685d8e33776f929b00db7eb9, key: step-output/development/20215/game-compile-editor-win64/685d8da29a201d7801a40bf9/fa5a3d7662534dbbbbd90008391640f8_2.blob (0 imports)
[00:05:08 err] Exception while running garbage collection: Could not find a part of the path 'D:\Epic\Horde\Server\Storage\artifacts\step-output\development\20215\game-compile-editor-win64\685d8da29a201d7801a40bf9\fa5a3d7662534dbbbbd90008391640f8_2.blob'.
System.IO.DirectoryNotFoundException: Could not find a part of the path 'D:\Epic\Horde\Server\Storage\artifacts\step-output\development\20215\game-compile-editor-win64\685d8da29a201d7801a40bf9\fa5a3d7662534dbbbbd90008391640f8_2.blob'.
   at EpicGames.Core.MemoryMappedFileCache.Delete(FileReference file) in D:\p4\7thcurse\dev\Engine\Source\Programs\Shared\EpicGames.Core\MemoryMappedFileCache.cs:line 235
   at EpicGames.Horde.Storage.ObjectStores.PrefixedObjectStore.DeleteAsync(ObjectKey locator, CancellationToken cancellationToken) in D:\p4\7thcurse\dev\Engine\Source\Programs\Shared\EpicGames.Horde\Storage\ObjectStores\PrefixedObjectStore.cs:line 38
   at HordeServer.Storage.StorageService.CheckReachabilityAsync(NamespaceInfo namespaceInfo, SortedSetEntry`1 entry, ObjectId lastImportBlobInfoId, GcSweepState state, StorageConfig storageConfig, AsyncEvent queueChangeEvent, CancellationToken cancellationToken) in D:\p4\7thcurse\dev\Engine\Source\Programs\Horde\Plugins\Storage\HordeServer.Storage\Storage\StorageService.cs:line 1695
   at EpicGames.Core.AsyncPipelineExtensions.ProcessItemsAsync[T](ChannelReader`1 reader, Func`3 taskFunc, CancellationToken cancellationToken) in D:\p4\7thcurse\dev\Engine\Source\Programs\Shared\EpicGames.Core\AsyncPipeline.cs:line 198
   at EpicGames.Core.AsyncPipeline.RunGuardedAsync(Func`2 taskFunc) in D:\p4\7thcurse\dev\Engine\Source\Programs\Shared\EpicGames.Core\AsyncPipeline.cs:line 96
   at EpicGames.Core.AsyncPipeline.WaitForCompletionAsync() in D:\p4\7thcurse\dev\Engine\Source\Programs\Shared\EpicGames.Core\AsyncPipeline.cs:line 120
   at HordeServer.Storage.StorageService.TickGcForNamespaceAsync(NamespaceInfo namespaceInfo, ObjectId lastImportBlobInfoId, DateTime utcNow, CancellationToken cancellationToken) in D:\p4\7thcurse\dev\Engine\Source\Programs\Horde\Plugins\Storage\HordeServer.Storage\Storage\StorageService.cs:line 1603
   at HordeServer.Storage.StorageService.TickGcForNamespaceAsync(NamespaceInfo namespaceInfo, ObjectId lastImportBlobInfoId, DateTime utcNow, CancellationToken cancellationToken) in D:\p4\7thcurse\dev\Engine\Source\Programs\Horde\Plugins\Storage\HordeServer.Storage\Storage\StorageService.cs:line 1607
   at HordeServer.Storage.StorageService.TickGcAsync(CancellationToken cancellationToken) in D:\p4\7thcurse\dev\Engine\Source\Programs\Horde\Plugins\Storage\HordeServer.Storage\Storage\StorageService.cs:line 1520
```
The reason for this is probably that when Horde initially filled up the drive, I had to manually nuke some of these artifacts just to get Horde running again.
My guess is that this caused the exception that Horde is now unable to get past: the files on disk no longer match what the database expects to exist. But code that deletes files should be able to handle the files already being gone, rather than getting stuck in a failure loop of broken garbage collection. The failure to garbage-collect one blob is clearly blocking attempts at others that it could successfully clean up.
Any suggestions for how to repair this state? I’d be fine with a “wipe the artifact slate clean” solution.