This is mostly a necro of this post
[Content removed]
We see that Nanite passes have high register use, and as such occupancy is at best 50% on mid-tier GPUs, e.g. an RTX 4060 8GB.
We see this even in the Visbuffer: /EngineMaterials/WorldGridMaterial.WorldGridMaterial dispatch, which as I understand it is the fast path, where the material has no WPO, pixel-programmable behaviour, or masking.
From what Nsight tells me, and from my understanding, Nanite is a complex hierarchy of function calls, with lots of transforms and flags to keep in memory as it resolves where every cluster, then every triangle, is. Each step down the stack takes more memory, and so the GPU can launch fewer warps. Couple that with the need to read and write UAVs and textures, and we can’t hide latency well.
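To make the register/occupancy relationship above concrete, here is a rough back-of-envelope sketch. The per-SM limits (64K registers, 48 resident warps, 256-register allocation granularity per warp) are my assumptions for an Ada-class part like the RTX 4060, not values read out of Nsight, and the model ignores every other occupancy limiter (shared memory, per-partition scheduling, and so on), so treat the numbers as an upper bound only.

```python
# Back-of-envelope theoretical occupancy from per-thread register count.
# All hardware limits below are ASSUMED values for an Ada-class SM;
# check them against the vendor's occupancy tables before relying on them.
REGS_PER_SM = 65536        # assumed 32-bit registers per SM
MAX_WARPS_PER_SM = 48      # assumed max resident warps per SM
THREADS_PER_WARP = 32
REG_ALLOC_UNIT = 256       # assumed per-warp register allocation granularity


def warps_limited_by_registers(regs_per_thread: int) -> int:
    """Resident warps per SM when registers are the only limiter."""
    regs_per_warp = regs_per_thread * THREADS_PER_WARP
    # Round up to the allocation granularity (ceiling division).
    regs_per_warp = -(-regs_per_warp // REG_ALLOC_UNIT) * REG_ALLOC_UNIT
    return min(MAX_WARPS_PER_SM, REGS_PER_SM // regs_per_warp)


def occupancy(regs_per_thread: int) -> float:
    """Fraction of the maximum resident warps that fit."""
    return warps_limited_by_registers(regs_per_thread) / MAX_WARPS_PER_SM


if __name__ == "__main__":
    for regs in (40, 64, 80, 128, 168, 255):
        warps = warps_limited_by_registers(regs)
        print(f"{regs:3d} regs/thread: {warps:2d} warps, {occupancy(regs):.0%} occupancy")
```

Under these assumptions, a shader needs to stay at or below roughly 40 registers per thread to reach full occupancy, while anything around 80 lands near the 50% we are seeing.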
We try to keep on top of our material complexity. We’ve hunted Masked/WPO with Nanite to extinction, but we believe that if even the WorldGridMaterial bin is overflowing, we don’t stand much chance of gaining anything back.
We could start to reverse engineer some DXIL assembly and hope to find something in there that we could potentially do without, but that’s a last resort: it assumes we can do a better job than Epic has already done, and any modifications we make diverge us from mainline and make merging updates harder.
Other hotspot areas for register use for us are:
- Nanite::EmitDepthTargets -> ‘Emit Scene Depth/Resolve/Velocity’
- Nanite Basepass - ShadeGbuffer - LandscapeMaterials.
  - We do have a complex landscape material; we do turn off tessellation for low-end, but still.
- Nanite Basepass - ShadeGbuffer - Animated Opaque Skinned Nanite foliage (With Voxels)
- VSM - Nanite - [Fixed Function (Voxel | CastShadow | Skinned)]
  - This is a big cost for us; occupancy is less than 20%.
Flamegraph for VSM - [Fixed Function (Voxel | CastShadow | Skinned)]. Shows time/samples, not register use.
Questions:
- Any cheat codes? Compile flags, cvars, etc. to play with here?
- Is there anything in the pipeline on this in future updates to Unreal?
- Does Epic feel there is any excess left in, say, NaniteVertexFactory.ush etc. that could be pruned for low-end, trading some quality to free up some registers?
- Regarding Nanite landscape, is there anything to the idea of reducing the maxpixelsperedge, or basically the non-tessellated Nanite dicing rate, on just landscape for low-end, to get bigger, rougher triangles and then spend less time and fewer VGPRs doing FetchTransformedNaniteVerts?
- Skinned & voxel Nanite foliage is exciting; it’s unlocked a lot of potential for us, but the performance, especially with VSM, is worrisome. The main view pass, not so much. But a moving directional sun light sees a lot of foliage in its top-down view of a forest. So specifically the pass VSM - Fixed Function (Voxel | CastShadow | Skinned) gives us our worst occupancy, and is perhaps the single largest and longest call in our frame. We’ve pushed the VSM LOD bias as far as it can go. So my question: how done is Nanite foliage? Can we expect any DX12 shader optimisations to come down here at this level?

