Hello,
We’ve recently noticed that our Cloud DDC instance is failing to receive incoming data (PUT requests) and is returning error code 500 to users. Looking at the backend, I see a lot of the following (see attached call stack).
Sure enough, the attached NVMe drive is out of space. I was under the impression this was a cache that would clean itself. We are using an i4i.xlarge, so it has 850 GB of space, which should be plenty.
Any ideas as to why Cloud DDC is not cleaning the cache and, as a result, is failing requests?
Hey William
It’s correct that it’s supposed to be self-cleaning. The cleanup doesn’t run all the time, so if you have a very large amount of data coming in, Cloud DDC can run out of space before the cleanup triggers. In practice this shouldn’t really be an issue, but it’s worth checking disk usage over time to see whether it resolves itself or the drive ends up permanently out of space.
If it’s the latter, then you are likely running into an issue we have fixed and are hoping to release in 1.4.0 in the coming few days: our temp storage is not being cleaned up, and if a pod is restarted while a request is being buffered, that temp file sticks around forever, eventually filling up the entire machine. This can easily be resolved manually by remoting into the machine and deleting everything in /tmp. 1.4.0 adds a separate GC for temp files, where any file older than a few days is deleted.
Hi Joakim,
It doesn’t seem to be the tmp folder (deleting it had no effect); the entirety of the usage is in the NVMe ‘Blobs’ folder. This issue has been reported over the last week or so and does not seem to be resolving itself. Would this mean that either the cleanup is not occurring, or 850 GB is not enough for our use case? Is there any alternative action we can take to resolve this?
Looking at the worker pod, it does seem to be doing some cleanup based on the logs (but I’m not sure whether this is the cache cleanup or the S3 cleanup):
{"Timestamp":"2025-11-25T09:55:39.6730481+00:00","Level":"Information","MessageTemplate":"Ran blob cleanup {BlobCleanup}. Deleted {CountBlobRecords}","Properties":{"BlobCleanup":"Jupiter.Implementation.OrphanBlobCleanupRefs","CountBlobRecords":6318,"SourceContext":"Jupiter.Implementation.BlobCleanupService"}}
OrphanBlobCleanupRefs is the S3 cleanup, not the local cleanup. For the local cleanup you would be looking for FilesystemStore as the {BlobCleanup} value in that message.
You could also search for "Filesystem cleanup not running. Disksize used: '{UsedDiskSize}'. Namespace: '{Namespace}'. Trigger size was {TriggerSize}". That message is logged when it attempts to do a cleanup but decides not to, and it explains why, in case the cleanup is not running.
In general, the values that matter here are:
MaxSizeBytes
TriggerThresholdPercentage = 0.95
TargetThresholdPercentage = 0.85
MaxSizeBytes controls how large the filesystem cache is allowed to grow, and the trigger and target thresholds control at which point a GC triggers (95% by default) and how far it cleans down to (85% by default).
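As a rough sketch (the exact section name and values here are assumptions and may differ in your chart and Cloud DDC version), tuning those in the Helm values could look something like this:

config:
  Filesystem:
    # Assumed example: cap the local cache below the 850 GB drive to leave headroom
    MaxSizeBytes: 700000000000
    # Start cleaning once usage reaches 95% of MaxSizeBytes (default)
    TriggerThresholdPercentage: 0.95
    # Clean down to 85% of MaxSizeBytes (default)
    TargetThresholdPercentage: 0.85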
You should also verify that you are setting GC.RunFilesystemCleanup = true. If you are using the Helm charts, we should be setting that for you when the chart is configured to use local storage.
It looks like the filesystem cleanup never occurs. Deleting the worker pod shows that it starts and completes the ‘OrphanBlobCleanupRefs’ process, but the filesystem one never starts. I hadn’t explicitly added ‘RunFilesystemCleanup’ in the Helm values, so I’ve added it now to the worker’s config:
worker:
  enabled: true
  config:
    GC:
      CleanOldRefRecords: true
      CleanOldBlobs: true
      RunFilesystemCleanup: true
However, pushing this update and restarting the worker deployment doesn’t seem to be reflecting the change.
You do not want that in the worker section, as the worker typically doesn’t have a filesystem attached to it, and even if it does, that is not all of the filesystems holding the cache. Instead, you want to pass that to the normal API pods so that each of them cleans its own filesystem.
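As a sketch (the exact layout depends on your chart version), moving the flag up to the top-level config section, which the API pods pick up, would look roughly like this:

# Top-level config applies to the normal api pods
config:
  GC:
    # Each api pod cleans its own local NVMe cache
    RunFilesystemCleanup: true
worker:
  enabled: true
  config:
    GC:
      CleanOldRefRecords: true
      CleanOldBlobs: true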
Yeah, that seemed to be it. I’ve moved it to the global config section and now the normal pods are draining the cache. Thanks for the help!