Zenserver on Linux can run out of memory when viewing oplogs in the dashboard.

We are hosting a zenserver instance on Linux. The binaries were compiled at version 5.6.2 from the zenserver GitHub. We have recently begun populating cooked snapshots there and found that if someone browses to the oplogs list for a project with many oplogs, it begins soaking up memory while populating the list. That list has no paging available that I can see, so it tries to show all available oplogs. In our case, with around 320 oplogs (I imagine it varies depending on the oplog sizes), the process would use up all 64 GB of memory on our VM and destabilize the VM to the point of requiring a restart. It should also be noted that navigating away from the page after it was fully populated does not seem to free the memory (perhaps it will at a GC?).

I have not attempted the same steps for the Windows server yet so I am unsure if it has the same behavior.


That’s interesting and seems like a problem we should address. I’ve asked [mention removed] to own this. I expect the first few things he will ask (and I’m also curious about) are:

  1. Is this when viewing the list of projects, or the list of oplogs?
  2. What is the storage strategy that you’ve got in place that results in this many distinct oplogs on one zenserver?
  3. Does this issue still happen with newer zenserver releases? (e.g. we are up to the 5.6.17 release on the zenserver GitHub at this point)

1) This is happening when viewing the list of oplogs. We only have 2-3 projects, so I suppose it may also happen there, but I just haven’t added enough projects to evaluate that.

2) We have a very aggressive strategy for how often we cook and store snapshots of cooked content. We trigger a cook every 30 minutes if there are any changes, and we currently cook for 3 platforms, so it works out to somewhere around 140 total snapshots per day. We were previously storing these in a file-based object store of our own making with a retention policy of 7 days, and we are looking to move them into zen storage and potentially a Cloud DDC server so we can have some more regional flexibility.
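For scale, the rough math behind that estimate works out as follows (a worst-case sketch that assumes every 30-minute window actually has changes):

```python
# Worst-case snapshot volume from the cadence described above:
# one cook per platform every 30 minutes, across 3 platforms.
COOKS_PER_HOUR = 2
PLATFORMS = 3
RETENTION_DAYS = 7

snapshots_per_day = COOKS_PER_HOUR * 24 * PLATFORMS       # 144 upper bound, ~140 in practice
snapshots_retained = snapshots_per_day * RETENTION_DAYS   # ~1000 oplogs under the old 7-day policy
print(snapshots_per_day, snapshots_retained)
```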

3) I am on 5.6.2 currently but can update to 5.6.17, give it a try, and report back.

Will look into this. Just a heads-up that we don’t store this many oplogs on a single zenserver instance in our own setup, so we haven’t spent much time optimizing for this scenario.

I’m trying to understand your setup here: are you creating a new oplog for each cook and using that as the “general storage” for oplogs?

Or are you uploading oplogs to a zenserver used as generic storage?

If you just keep adding oplogs to a zenserver without restarting it, it will currently just eat more and more memory; the web interface is just forcing that to happen faster.

Once an oplog has been loaded into memory due to a request (either by UE or the web interface), it will stay in memory until the zenserver is restarted or the entire oplog is GC’d, which might take days.

Even though using zenserver as a general storage option can be done, it is, as you have noticed, not currently implemented to handle this many oplogs in one instance.

How are you using the stored oplogs?

The oplog is coming from a local zen storage instance on the workstation using the normal process of cooking with zen storage enabled. We then push the snapshot/oplog to a shared server using the ZenExportSnapshot BuildGraph task. We also store the JSON for that oplog on a file share.

That oplog is then used by build agents to pull down cooked output (usually by way of the ZenImportOplog BuildGraph task) for packaging and other tasks. It is also used by developers wanting to pull down something precooked to stream to a console, run locally, or do anything else a developer might do with cooked output locally.
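To make that concrete, here is a rough sketch of how one of our agents kicks off the publish step. The graph script path, target name, and -set: property are placeholders for our own BuildGraph setup (the ZenExportSnapshot task lives inside that graph and takes its own arguments, which are not shown here):

```python
import subprocess
from pathlib import Path

# Rough sketch only: the graph script, target name, and -set: property below
# are placeholders for our own BuildGraph setup, not documented task parameters.
ENGINE_ROOT = Path("/opt/UnrealEngine")

def publish_snapshot(platform: str) -> None:
    """Run the BuildGraph target that cooks and pushes a snapshot for one platform."""
    subprocess.run(
        [
            str(ENGINE_ROOT / "Engine/Build/BatchFiles/RunUAT.sh"),
            "BuildGraph",
            "-Script=Build/CookAndPublishSnapshot.xml",  # placeholder graph script
            "-Target=Publish Cooked Snapshot",           # placeholder target name
            f"-set:Platform={platform}",                 # placeholder graph property
        ],
        check=True,
    )

if __name__ == "__main__":
    publish_snapshot("Windows")  # one of the three target platforms (placeholder)
```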

Unless I am misinterpreting things, we are more or less doing what is mapped out in the documentation here: https://dev.epicgames.com/documentation/en-us/unreal-engine/cooked-data-snapshots-with-zen-storage-server-for-unreal-engine.

Thanks for the info; I was actually not aware that it was documented as a proper option with no caveats.

Unfortunately we can’t provide a timely fix for this problem, but it is on the priority list to look into. There is no quick fix for this, sadly.

In the meantime, do you know if a Cloud DDC server instance would be a more appropriate location for storing snapshots longer term in this way?

Also, am I correct in thinking that a cleanup process with a pretty short retention time for the snapshots on the server could also help with this? Maybe some combination of shorter retention and increased memory on the server VM would work around it well enough for us in the short term.
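For reference, the kind of retention sweep I have in mind looks roughly like this. The list_snapshots and delete_snapshot hooks are hypothetical stand-ins for however the snapshots end up being enumerated and removed (zenserver’s API, the zen CLI, or our own snapshot index), not real zenserver calls:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention sweep. The two callables are stand-ins for whatever
# mechanism actually enumerates and removes snapshots; they are not zenserver APIs.
RETENTION = timedelta(days=2)  # shorter than our previous 7-day policy

def sweep(list_snapshots, delete_snapshot) -> int:
    """Remove every snapshot older than RETENTION.

    list_snapshots() -> iterable of (snapshot_id, created_at: aware datetime)
    delete_snapshot(snapshot_id) -> None
    """
    cutoff = datetime.now(timezone.utc) - RETENTION
    removed = 0
    for snapshot_id, created_at in list_snapshots():
        if created_at < cutoff:
            delete_snapshot(snapshot_id)
            removed += 1
    return removed
```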

I have made some changes that should help with the memory consumption when listing oplogs.

Inspecting an oplog will still use quite a bit of memory, and that data will remain in memory for a while.

This is expected to be included in the next zenserver release (5.7.1), which should be available in the next week or so.

We just released zenserver 5.7.3, which includes several fixes to reduce memory usage when storing many oplogs. Could you please give that a try?

https://github.com/EpicGames/zen/releases/tag/v5.7.3

More memory and a shorter retention time would help, yes, but that is just a temporary solution and not very scalable.

The other proper option is to use Cloud DDC, which I’m personally not familiar with setting up for third parties.

Finally, it is possible to export to the file system (or a file share), but the documentation and helper tools for that are limited, as they exist mostly for testing purposes.

Thank you, I’ll get our server updated to 5.7.3 in the next day or two. I updated it to 5.7.1 shortly after it was released, and memory usage has been much improved.