Lumen & Nanite on macOS

philipturner · September 9, 2022, 6:19pm

I’m trying to keep my posts on this thread to a minimum, but an update is overdue. I got Nanite “running” on macOS, for a loose definition of running. In my case, running sometimes means freezing my entire Mac until I reboot it. The iGPU seems to hit some kind of infinite loop.

Then there is the crash. This is new territory: nobody has described a workaround for it. I’m currently investigating, although help from someone who knows the UE5 code base would speed this up considerably.

I replicated the source code that @gladhu had made public. The shaders once had a hack that enabled Nanite through 32-bit texture atomics. @gladhu made a hack around the hack, because Metal supports 32-bit atomics only on buffers, not on textures. UE5NanitePort replaced each atomic read-modify-write with a regular read followed by a write. This is inherently thread-unsafe: two threads can read the same old value, and whichever writes last overwrites the other’s result. That may explain the graphical glitches involving incorrect depth and occlusion.
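To make the race concrete, here is a minimal sketch that replays one bad interleaving of two threads doing "pixel = max(pixel, candidate)" with a plain read + write. It is not code from UE5NanitePort; the function names are hypothetical, and plain Python integers stand in for GPU memory.

```python
def racy_max_interleaved(initial, candidate_a, candidate_b):
    """Replay one interleaving in which both threads read before either writes."""
    pixel = initial
    seen_by_a = pixel  # thread A reads the old value
    seen_by_b = pixel  # thread B reads the same old value
    if candidate_a > seen_by_a:
        pixel = candidate_a  # thread A writes its result
    if candidate_b > seen_by_b:
        pixel = candidate_b  # thread B writes using its stale read
    return pixel


def atomic_max_reference(initial, candidate_a, candidate_b):
    """What a true atomic max leaves behind, whatever the thread order."""
    return max(initial, candidate_a, candidate_b)


if __name__ == "__main__":
    print("read + write:", racy_max_interleaved(0, 9, 5))
    print("atomic max:  ", atomic_max_reference(0, 9, 5))
```

With this interleaving the larger value written by thread A is lost, which is the kind of lost update that could show up as wrong depth or occlusion.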

Since then, Epic removed the 32-bit texture atomic workaround, so Nanite now runs only on DX12/Vulkan devices with 64-bit atomics. I just thought of an entirely different way to run Nanite without needing 64-bit atomics or texture atomics. It’s thread-safe by nature, unlike the previous lock-based workaround. It should run not only on macOS (Apple + Intel), but also on DX11. I pitched the idea to @SupportiveEntity in a PM because it’s excruciatingly long.

Nanite through 32-bit atomics

I’m planning to implement and explain it in the AtomicsWorkaround directory, so it might be worth checking that periodically. In short, you have to think about the problem in terms of information transfer. 64-bit atomics are required because depth must stay synchronized with color: a single atomic operation has to update both at once. In rasterization pipelines, this is called z-buffering. Nanite performs rasterization through a compute shader.
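As a rough illustration of why one 64-bit atomic max keeps depth and color together, here is a minimal sketch. The bit layout (depth in the upper 32 bits, color in the lower 32) and the "larger depth wins" convention are assumptions made for this example, not details taken from the UE5 source, and Python's max stands in for the hardware atomic.

```python
MASK_32 = 0xFFFFFFFF


def pack_depth_color(depth, color):
    """Put depth in the high bits so that comparing words compares depth first."""
    return ((depth & MASK_32) << 32) | (color & MASK_32)


def unpack_depth_color(word):
    """Recover the (depth, color) pair from a packed 64-bit word."""
    return word >> 32, word & MASK_32


def atomic_max_64(current, candidate):
    """Stand-in for a 64-bit atomic max: depth and color are replaced together."""
    return max(current, candidate)


if __name__ == "__main__":
    pixel = 0
    pixel = atomic_max_64(pixel, pack_depth_color(7, 0x11111111))
    pixel = atomic_max_64(pixel, pack_depth_color(5, 0xAAAAAAAA))
    print(unpack_depth_color(pixel))
```

Because depth occupies the most significant bits, the winning word always carries the color that belongs to the winning depth.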

However, the depth data is only 24 bits. A 64-bit atomic max therefore really transfers only 56 bits of information: 24 bits of depth plus 32 bits of color. So what if you separated the 24 bits of depth, broke the remaining 32 bits of color into 8-bit chunks, and rearranged them like so:

24 bits of depth + 8 bits color data = 32-bit word
24 bits of depth + 8 bits color data = 32-bit word
24 bits of depth + 8 bits color data = 32-bit word
24 bits of depth + 8 bits color data = 32-bit word

The depth is duplicated four times, but it works! You get four 32-bit words, each of which can be modified atomically. The actual implementation is more complex: it splits color into two chunks of 16 bits, and it uses locks, though differently than the previous lock-based workaround. For brevity, I’m not describing that here.
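The simple four-word layout can be sketched as follows. This covers only the 24 + 8 arrangement shown above, not the more complex 16-bit variant with locks. All names are hypothetical, "larger depth wins" is an assumption, and Python's max stands in for four independent 32-bit atomic max operations.

```python
DEPTH_BITS = 24
CHUNK_BITS = 8
CHUNK_MASK = (1 << CHUNK_BITS) - 1


def split_fragment(depth, color):
    """Return four 32-bit words, each holding the 24-bit depth plus one color byte."""
    assert 0 <= depth < (1 << DEPTH_BITS)
    assert 0 <= color < (1 << 32)
    return [
        (depth << CHUNK_BITS) | ((color >> (CHUNK_BITS * i)) & CHUNK_MASK)
        for i in range(4)
    ]


def atomic_max_words(pixel, words):
    """Apply a 32-bit max to each of the pixel's four words independently."""
    for i, word in enumerate(words):
        pixel[i] = max(pixel[i], word)


def resolve(pixel):
    """Reassemble (depth, color) once every write has finished."""
    depth = pixel[0] >> CHUNK_BITS
    color = 0
    for i, word in enumerate(pixel):
        color |= (word & CHUNK_MASK) << (CHUNK_BITS * i)
    return depth, color


if __name__ == "__main__":
    pixel = [0, 0, 0, 0]
    atomic_max_words(pixel, split_fragment(100, 0x11223344))
    atomic_max_words(pixel, split_fragment(200, 0xAABBCCDD))
    atomic_max_words(pixel, split_fragment(150, 0xFFFFFFFF))
    print(resolve(pixel))
```

One limitation of this naive version: if two fragments have exactly the same depth, each word keeps the larger color byte on its own, so the resolved color can mix bytes from both fragments. That may be one reason the real implementation needs to be more involved.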