Horde GC keeps too many artefacts

Hi,

We are having problems where Horde doesn’t seem to expire artefacts as intended. We have had our Horde server running on Docker on Linux for about a year now, and we have run out of disk space multiple times in the last few weeks. We have had problems with it before, but when we cleaned up some disk space it seemed to start working again.

I noticed that it’s the directory data/Storage/artifacts/step-output/ that grows more than it should. I made a script that checked how much of our disk space was used by files older than 3 days in there, and it reported 1.2 TB of our 1.4 TB drive.
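Roughly, the check was something along these lines (a minimal sketch rather than the exact script; the path and the three-day threshold are just what I used):

import time
from pathlib import Path

ROOT = Path("data/Storage/artifacts/step-output")  # the directory that keeps growing
CUTOFF = time.time() - 3 * 24 * 3600               # anything older than 3 days

total = 0
for path in ROOT.rglob("*"):
    if path.is_file() and path.stat().st_mtime < CUTOFF:
        total += path.stat().st_size

print(f"{total / 1e12:.2f} TB in files older than 3 days")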

I thought it was this issue: [Content removed] and I have applied the fix from there, but it still grows.

I have set up Grafana to monitor the space left, and I notice that usage does drop at regular intervals, so it seems like it’s not completely broken.

I searched the logs for exceptions, and found entries like these:

[11:19:24 err] An unhandled exception has occurred while executing the request.
MongoDB.Driver.MongoCommandException: Command findAndModify failed: Plan executor error during findAndModify :: caused by :: E11000 duplicate key error collection: Horde.Storage.Blobs index: ns_1_blob_1 dup key: { ns: "horde-artifacts", blob: "step-output/ride-release-1.2/16771/compile-editor/69b93833b818c8b595bc28d1/7d2bbd6509e343b5ad2e4f2a52263a9b_1" }.
  at MongoDB.Driver.Core.WireProtocol.CommandUsingCommandMessageWireProtocol`1.ProcessResponse(ConnectionId connectionId, CommandMessage responseMessage)
  at MongoDB.Driver.Core.WireProtocol.CommandUsingCommandMessageWireProtocol`1.ExecuteAsync(IConnection connection, CancellationToken cancellationToken)
  at MongoDB.Driver.Core.Servers.Server.ServerChannel.ExecuteProtocolAsync[TResult](IWireProtocol`1 protocol, ICoreSession session, CancellationToken cancellationToken)
  at MongoDB.Driver.Core.Operations.CommandOperationBase`1.ExecuteProtocolAsync(IChannelSource channelSource, ICoreSessionHandle session, ReadPreference readPreference, CancellationToken cancellationToken)
  at MongoDB.Driver.Core.Operations.WriteCommandOperation`1.ExecuteAsync(IWriteBinding binding, CancellationToken cancellationToken)
  at MongoDB.Driver.Core.Operations.FindAndModifyOperationBase`1.ExecuteAttemptAsync(RetryableWriteContext context, Int32 attempt, Nullable`1 transactionNumber, CancellationToken cancellationToken)
  at MongoDB.Driver.Core.Operations.RetryableWriteOperationExecutor.ExecuteAsync[TResult](IRetryableWriteOperation`1 operation, RetryableWriteContext context, CancellationToken cancellationToken)
...
[11:19:28 dbg] HTTP PUT /api/v1/storage/horde-artifacts/blobs/step-output/ride-release-1.2/16771/compile-editor/69b93833b818c8b595bc28d1/7d2bbd6509e343b5ad2e4f2a52263a9b_1 responded 500 in 4681.9101 ms

Searching for “Garbage collection queue for namespace horde-artifacts” gave these as the latest entries:

[12:01:45 inf] Garbage collection queue for namespace horde-artifacts (storage:horde-artifacts:check) has 0 entries
  Line 14885: [12:11:45 inf] Garbage collection queue for namespace horde-artifacts (storage:horde-artifacts:check) has 0 entries
  Line 17544: [12:21:45 inf] Garbage collection queue for namespace horde-artifacts (storage:horde-artifacts:check) has 8 entries
  Line 22293: [12:31:45 inf] Garbage collection queue for namespace horde-artifacts (storage:horde-artifacts:check) has 8 entries
  Line 26945: [12:41:45 inf] Garbage collection queue for namespace horde-artifacts (storage:horde-artifacts:check) has 32 entries

It feels like when I cleaned up storage in the growing data/Storage/artifacts/step-output/ folder I must have managed to desync the DB somehow. Any help getting the database back in order would be appreciated (yes, I have tried turning it off and on again ;))

Cheers,

Markus

[Attachment Removed]

Hi Markus,

I think there are probably two things going on here.

1. Retention possibly isn’t configured specifically for step-output

Horde creates default artifact type entries for step-output, step-saved, etc., but those defaults don’t have KeepCount or KeepDays set. When both are null, the expiration ticker just skips that type entirely: nothing gets expired, refs never get deleted, and blobs never make it into the GC queue. That lines up with your GC queue only showing 0–32 entries.

You can confirm by searching your logs for “No expiration policy set for”; if that shows up for step-output, that’s the issue.

Fix: add explicit retention to your stream config:

"ArtifactTypes": [
  {
    "Type": "step-output",
    "KeepCount": 5,
    "KeepDays": 7
  }
]
 

Also double-check that RetentionOptions.EnableArtifactExpiration isn’t set to false on your streams.

2. Manually deleting files desynced the DB

When you cleaned up files from data/Storage/artifacts/step-output/, the MongoDB records (Storage.Blobs, Storage.Refs) were left behind. The GC still sees those blobs as referenced, so it won’t touch them. And when new jobs try to upload to the same blob paths, they hit the existing DB records and get the duplicate key 500s you’re seeing.

If you haven’t manually deleted any files recently and the disk just filled up on its own, fixing #1 should be sufficient and you can skip the MongoDB cleanup. The 500 errors in that case would be a separate race condition that’s already handled in newer builds.

To fix this you’d need to clean up the orphaned records in MongoDB: remove the BlobInfo entries in Storage.Blobs (where ns = “horde-artifacts”) for paths that no longer exist on disk, along with any corresponding Storage.Refs entries. Then set Reset: true on the gc-state singleton doc so the ingestion scanner reprocesses everything.
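Very roughly, that cleanup could look like the sketch below. This is only an illustration, not a tool we ship: it assumes the database and collection names from your error message (Horde, Storage.Blobs), the ns/blob fields from the ns_1_blob_1 index, and that blobs for this namespace live on disk under data/Storage/artifacts/<blob path>. Please verify those assumptions against your own data and run it in dry-run mode first.

from pathlib import Path
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # hypothetical connection string
blobs = client["Horde"]["Storage.Blobs"]            # names taken from the duplicate-key error
artifact_root = Path("data/Storage/artifacts")      # assumed on-disk root for this namespace

DRY_RUN = True  # flip to False only once the printed count looks sane

orphans = []
for doc in blobs.find({"ns": "horde-artifacts"}, {"blob": 1}):
    p = artifact_root / doc["blob"]
    # Depending on how files are named on disk (with or without a .blob extension),
    # check both spellings before declaring the record orphaned.
    if not p.exists() and not p.with_name(p.name + ".blob").exists():
        orphans.append(doc["_id"])

print(f"{len(orphans)} blob records in ns=horde-artifacts have no file on disk")
if not DRY_RUN and orphans:
    blobs.delete_many({"_id": {"$in": orphans}})
    # The matching Storage.Refs entries would need the same treatment; the exact
    # ref-to-blob linkage isn't shown here, so that part is left out of this sketch.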

Restarting alone won’t help since all the GC state and orphaned records persist in MongoDB.

I would also recommend investing in more storage: 1.4 TB isn’t much of a buffer, and while it’s not the root cause, giving yourself a bit more headroom would be good.

Please let me know if this helps, - Josh

[Attachment Removed]

Hi Josh,

Thanks for looking into this:

We have

"artifactTypes": [
         {
             "type": "ugs-pcb",
             "keepCount": 10
         },
         {
             "type": "step-output",
             "keepDays": 1
         }
]

in all our streams, to try to combat the storage shortage.

And searching for “No expiration policy set for” gave me:

Line 32345: [10:52:01 inf] No expiration policy set for pw-main-horde step-saved; keeping all artifacts.
	Line 32346: [10:52:01 inf] No expiration policy set for pw-main-horde step-trace; keeping all artifacts.
	Line 32347: [10:52:01 inf] No expiration policy set for pw-main-horde step-testdata; keeping all artifacts.
	Line 32392: [10:52:01 inf] No expiration policy set for ride-main step-saved; keeping all artifacts.
	Line 32393: [10:52:01 inf] No expiration policy set for ride-main step-trace; keeping all artifacts.
	Line 32394: [10:52:01 inf] No expiration policy set for ride-main step-testdata; keeping all artifacts.

And that seems kind of expected.

I also checked for EnableArtifactExpiration, and I didn’t find anything in our settings files.

2: How can I actually clean up the DB? It feels like I would need to create quite a complex script, where I query the DB for { “ns”: “horde-artifacts” }, check whether each file still exists on disk, and delete the record if it doesn’t. Or could I just nuke the entire content of “horde-artifacts” on disk and delete everything matching { “ns”: “horde-artifacts” } in Storage.Blobs and Storage.Refs?

3: You mention that the 500 issues are already fixed in a newer build. How does Horde handle running a newer version of Horde than the engine version? That is, is it safe for me to update Horde without updating our engine?

Cheers,

Markus

[Attachment Removed]

After I wiped everything but the last 24h, it seems to keep itself in check and doesn’t grow anymore. I checked all folders in Storage/artifacts/step-output and they all seem to be less than 24h old now.

I did your debugging steps and noticed that 2 and 3 seem to be OK. What’s interesting is that 1 returns no results, but I see entries like this: Deleting horde-artifacts blob 69b5bf229680314c5ab1385a, key: step-output/ride-main/16738/age-win64-symstore/69b5bf22b818c8b595b5344e/ffe77100a2354f22bcb88d849af98f4a_1.blob (0 imports)

And when searching for “Finished expiring artifacts in”, I noticed that most said 0s, and some said 1s or 3s (we are a small team, so we don’t produce that much).

So it seems like the server has started to behave itself now (it seems to do this about 2-4 times a year). I noticed that it’s cleaning up the blobs, but not all files, as there are still some lingering old directories of 68 kB-1.4 MB. Could it be that these files/folders accumulated over time and caused the leak of disk space?

There was one question you seemed to miss: is it safe to use newer versions of Horde with older engine versions with regard to all features? Or is that a “it depends on the engine version, please check the release notes”? I was thinking that upgrading might break telemetry or UBA, as they are more closely tied to the engine.

[Attachment Removed]

Markus,

The keepDays: 1 config for step-output is correct and will be applied. The “No expiration policy” messages for step-saved/step-trace/step-testdata are expected since you haven’t configured those.

To find where the pipeline is stuck, check these logs:

1. Search for “Expiring” + “step-output” to confirm artifacts are being expired

2. Search for “Finished expiring artifacts in” to see how long each hourly cycle takes (a huge backlog = long runtime)

3. Search for “Deleting horde-artifacts blob” to confirm blobs are actually being removed from disk

If #1 and #2 look fine but #3 is absent, the GC ingestion cursor is likely behind. Force a full re-scan by running this in MongoDB:

db.SingletonsV2.updateOne(
  { _id: "gc-state" },
  { $set: { Reset: true } }
)

This auto-clears on the next tick; there’s no need to unset it manually. You can confirm it worked by searching logs for “Resetting scan for new blobs”.
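If it’s easier, a small sketch like this can pull all four of those markers out of the server log in one pass (the log path is just a placeholder for wherever your Docker setup writes the Horde server log):

from pathlib import Path

LOG = Path("/path/to/horde-server.log")  # placeholder; point at your actual log file

CHECKS = [
    ("Expiring", "step-output"),           # 1: step-output artifacts being expired
    ("Finished expiring artifacts in",),   # 2: duration of each expiration cycle
    ("Deleting horde-artifacts blob",),    # 3: blobs actually removed
    ("Resetting scan for new blobs",),     # confirms the gc-state reset took effect
]

counts = [0] * len(CHECKS)
with LOG.open(errors="replace") as f:
    for line in f:
        for i, terms in enumerate(CHECKS):
            if all(t in line for t in terms):
                counts[i] += 1

for terms, n in zip(CHECKS, counts):
    print(f"{n:6d}  {' + '.join(terms)}")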

DB cleanup from manual file deletion: You probably don’t need to manually clean MongoDB. The GC handles missing files gracefully: it removes the DB record and moves on. The gc-state reset above will accelerate this.

The 500 duplicate key errors: These are a race condition during concurrent uploads and shouldn’t be the root cause. Horde builds from 5.5+ handle this better (though I am assuming you are already on that?). This might also be due to bad state.

Please let me know if this helps, -Josh

[Attachment Removed]

OK, great. And your instincts are correct: technically it is possible, though in practice it’s not recommended due to hard-to-track-down corner cases… you are also correct in identifying UBA coordination as a source of issues in this setup.

[Attachment Removed]