[Bug report] UBA agent can crash when there is no free memory

anonymous-edc · April 2, 2025, 10:40pm

Unfortunately I can’t turn swap on in my environment (GKE Autopilot), so I can’t test whether it works there. I’m almost certain it does though.

I think the root cause of this is the way that MSVC relies on having a pagefile. Even on Windows with a large pagefile, you’ll still get MSVC pagefile errors when compiling large C++ files, so this issue looks like an artifact of MSVC assuming a pagefile is present. Notably, using Clang as the compiler for Win64 (under Wine/Linux with no swapfile) does not exhibit the same out-of-memory issues, likely because Clang is written to also run on Linux, where having no swapfile means the OOM killer is in effect at all times.

My next strategies here for making things work are:

1. As a quick hack, modify the memory-shim so that all mmap’d regions are backed by a file on disk. I’ve no idea whether the Linux kernel will do the right thing and page those regions out to the file when under memory pressure (effectively making the file-backed mmap’d region act like a swap file just for that region). I’ll likely also need to statically link mimalloc into memory-shim so that malloc/free can be routed to a file-backed mmap region. I do worry about the performance implications though - especially since with this model the Linux kernel will treat persisting the data to the file as important, when in reality we only want the file as last-resort storage if the memory limit is being hit.
2. Use the userfaultfd API of the Linux kernel. This is available in constrained environments like GKE Autopilot, and lets user space handle faulting in pages of anonymous mmap’d regions. The API is not that well documented, but I think I’ve wrapped my head around what it needs. It does allow passing the userfaultfd file descriptor back to a central process for faulting in pages (i.e. a central process can track the overall resident memory for all downstream child processes). The downside of this API is that it requires the target process to cooperate in releasing memory from the anonymous mmap - the only way to evict a page with userfaultfd is for the process that owns the mmap’d region to call madvise(MADV_DONTNEED) - you can’t do it from the central process. So there’s a bit of awkward IPC needed when the central process decides that a page needs to be evicted to disk to allow another page to be brought into memory.
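
To make the eviction difference between the two strategies concrete, here is a minimal Python sketch (Linux only, Python 3.8+). It is not the memory-shim code, and the function names are made up for illustration. It shows that calling madvise(MADV_DONTNEED) on a private anonymous mapping throws the contents away (the next read sees zero-filled pages), whereas on a shared file-backed mapping the contents are repopulated from the backing file. That is why, with the userfaultfd approach, something has to copy the page out to disk before the owning process drops it.

```python
import mmap
import tempfile

PAGE = mmap.PAGESIZE


def evict_anonymous(payload: bytes) -> bytes:
    """Write payload into a private anonymous mapping, drop the pages, read back."""
    region = mmap.mmap(-1, PAGE, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        region[:len(payload)] = payload
        # On Linux this discards the anonymous pages; the next access faults
        # in fresh zero-filled pages, so the payload is gone.
        region.madvise(mmap.MADV_DONTNEED)
        return region[:len(payload)]
    finally:
        region.close()


def evict_file_backed(payload: bytes) -> bytes:
    """Same steps, but the mapping is shared and backed by a temporary file."""
    with tempfile.TemporaryFile() as backing:
        backing.truncate(PAGE)  # the file must be at least as large as the mapping
        region = mmap.mmap(backing.fileno(), PAGE, flags=mmap.MAP_SHARED)
        try:
            region[:len(payload)] = payload
            # The pages are dropped from this process, but the next access
            # repopulates them from the backing file, so the payload survives.
            region.madvise(mmap.MADV_DONTNEED)
            return region[:len(payload)]
        finally:
            region.close()


if __name__ == "__main__":
    print("anonymous  :", evict_anonymous(b"obj"))
    print("file-backed:", evict_file_backed(b"obj"))
```

This only demonstrates the semantics of the eviction call. It says nothing about whether the kernel will write file-backed pages out by itself under a cgroup memory limit, which is the open question in strategy 1.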

Of course, I think the long-term solution here would be for UBA to use the mimalloc API to back memory allocations with mmap’d files on disk whenever a compiler process goes above its “share” of the memory. This would probably eliminate the “please increase your pagefile” errors on Windows as well.
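
As a rough illustration of that “share” policy only (it does not use mimalloc or any UBA code; the class name, the budget parameter, and the spill directory are all hypothetical), a toy allocator could hand out anonymous mappings until a per-process budget is used up, and file-backed mappings after that:

```python
import mmap
import tempfile

PAGE = mmap.PAGESIZE


class SpillingAllocator:
    """Hypothetical policy sketch: anonymous memory up to a budget, files beyond it."""

    def __init__(self, budget_bytes, spill_dir=None):
        self.budget_bytes = budget_bytes  # this process's assumed "share" of memory
        self.spill_dir = spill_dir        # where backing files go (None = default tmp dir)
        self.anonymous_bytes = 0

    def allocate(self, size):
        """Return (region, kind), where kind is "anonymous" or "file"."""
        if size <= 0:
            raise ValueError("size must be positive")
        size = -(-size // PAGE) * PAGE  # round up to a whole number of pages
        if self.anonymous_bytes + size <= self.budget_bytes:
            self.anonymous_bytes += size
            region = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
            return region, "anonymous"
        # Over budget: back the region with an unlinked temporary file instead.
        # The mapping stays valid after the file object is closed.
        with tempfile.TemporaryFile(dir=self.spill_dir) as backing:
            backing.truncate(size)
            region = mmap.mmap(backing.fileno(), size, flags=mmap.MAP_SHARED)
        return region, "file"


if __name__ == "__main__":
    allocator = SpillingAllocator(budget_bytes=2 * PAGE)
    for _ in range(3):
        region, kind = allocator.allocate(PAGE)
        print(kind)
        region.close()
```

A real implementation would also have to return budget when regions are freed and decide which existing allocations to migrate, neither of which this sketch attempts.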