Nanite SW Rasterize - High Register use, DX12

This is mostly a necro of this post

[Content removed]

We see that Nanite passes have high register use, and as such occupancy is at best 50% on mid-tier GPUs, e.g. an RTX 4060 8GB.

We see this even in the Visbuffer: /EngineMaterials/WorldGridMaterial.WorldGridMaterial dispatch, which as I understand it is the fast path, where the material has no WPO, pixel-programmable features, or masking.

From what Nsight tells me, and from my own understanding: Nanite is a complex hierarchy of function calls, with lots of transforms and flags to keep live as it resolves where every cluster, and then every triangle, is. Each step down the stack takes more register space, so the GPU can launch fewer warps. Couple that with the need to read and write UAVs and textures, and we can’t hide latency well.
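To make the register/occupancy relationship concrete, here is a back-of-the-envelope sketch. The per-SM limits (64K registers, 48 resident warps, registers rounded up to a multiple of 8 per thread) are my assumptions for an Ada-class part like the RTX 4060, and the function name is made up for illustration. It only models the register limit and ignores shared memory, thread group size, and warp allocation granularity, so treat the numbers as an upper bound rather than what Nsight will report.

```python
import math

# Assumed per-SM limits for an Ada-class GPU (e.g. RTX 4060).
# These are assumptions for illustration, not values read from the driver.
REGISTERS_PER_SM = 65536
MAX_WARPS_PER_SM = 48
THREADS_PER_WARP = 32
REGISTER_ALLOC_GRANULARITY = 8  # assumed per-thread rounding of register counts


def theoretical_occupancy(registers_per_thread: int) -> float:
    """Register-limited occupancy as a fraction of max resident warps per SM.

    Hypothetical helper: ignores shared memory, thread group size and
    any other limiter, so the result is an upper bound.
    """
    if registers_per_thread <= 0:
        raise ValueError("registers_per_thread must be positive")
    rounded = (
        math.ceil(registers_per_thread / REGISTER_ALLOC_GRANULARITY)
        * REGISTER_ALLOC_GRANULARITY
    )
    warps_that_fit = REGISTERS_PER_SM // (rounded * THREADS_PER_WARP)
    return min(warps_that_fit, MAX_WARPS_PER_SM) / MAX_WARPS_PER_SM


if __name__ == "__main__":
    for regs in (40, 80, 128, 168, 208):
        print(f"{regs:>3} regs/thread -> {theoretical_occupancy(regs):.0%} occupancy")
```

Under these assumptions, a shader at roughly 80 registers per thread lands near the 50% we see, and anything around 200 or more drops below 20%.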

We try to keep on top of our material complexity. We’ve hunted Masked/WPO materials with Nanite to extinction, but we believe that if even the WorldGridMaterial bin is overflowing, we don’t stand much chance of gaining anything back.

We could start to reverse engineer some DXIL assembly and hope to find something in there that we could do without, but that’s a last resort: it assumes we can do a better job than Epic has already done, and any modifications we make diverge us from mainline and make merging updates harder.

Other hotspot areas for register use for us are:

  • Nanite::EmitDepthTargets -> ‘Emit Scene Depth/Resolve/Velocity’
  • Nanite Basepass - ShadeGbuffer - LandscapeMaterials.
    • We do have a complex landscape material, and we do turn off tessellation for low-end, but still.
  • Nanite Basepass - ShadeGbuffer - Animated Opaque Skinned Nanite foliage (With Voxels)
  • VSM - Nanite - [Fixed Function (Voxel | CastShadow | Skinned)]
    • This is a big cost for us; occupancy is less than 20%.

Flamegraph for VSM - [Fixed Function (Voxel | CastShadow | Skinned)]. Shows time/samples, not register use.

Questions:

  • Any cheat codes? Compile flags, cvars, etc. to play with here?

  • Is there anything in the pipeline on this with future updates to Unreal?

  • Does Epic feel there is any excess left in, say, NaniteVertexFactory.ush etc. that could be pruned for low-end? Could we trade some quality to free up some registers?

  • Regarding Nanite landscape, is there anything to the idea of raising MaxPixelsPerEdge, or basically reducing the non-tessellated Nanite dicing rate, on just landscape for low-end, to get bigger, rougher triangles and so spend less time and fewer VGPRs doing FetchTransformedNaniteVerts?

  • Skinned & voxel Nanite foliage is exciting; it’s unlocked a lot of potential for us, but the performance, especially with VSM, is worrisome. The main view pass, not so much. But a moving directional sun light sees a lot of foliage in its top-down view of a forest. So specifically the pass VSM - Fixed Function (Voxel | CastShadow | Skinned) gives us our worst occupancy, and is perhaps the single largest and longest call in our frame. We’ve pushed VSM LOD bias as far as it can go. So my question: how done is Nanite foliage? Can we expect any DX12 shader optimisations to come down here at this level?

Hi,

>We see that Nanite passes have high register use, and as such occupancy is at best 50% on mid-tier GPUs, e.g. an RTX 4060 8GB.

In isolation, higher occupancy is usually better as it allows for better latency hiding, but beyond a certain point there are usually diminishing returns, and it can start conflicting with other performance goals like minimizing recomputation, minimizing memory traffic, etc. I think it is usually better to focus on the throughput counters, as they tell you how well the hardware is being utilized and how far a given shader is from theoretical peak performance with respect to the various bottlenecks.

From the utilization graph, it seems the SM Throughput metric (purple) is fairly healthy, and that would put a hard limit on how much more could be gained from higher occupancy, even in the best case. If you have concrete suggestions for how to improve performance on this HW, they are obviously very welcome :slight_smile:

>Any cheat codes? Compile flags, cvars, etc. to play with here?

On the Nanite side, the single most impactful thing for performance is the target triangle size (r.Nanite.MaxPixelsPerEdge). By default it targets 1px triangle edges. In many cases 2 or even higher is perfectly acceptable and that can significantly cut down on culling and rasterization cost.
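As a rough way to reason about why this setting matters so much, the sketch below estimates how the triangle count needed to cover a fixed screen area changes with the target edge length. It rests on a purely geometric assumption (triangle area grows with the square of its edge length) and ignores cluster granularity, hierarchy limits, and fixed per-pass overhead, so it is an intuition aid rather than a cost model. The function name is hypothetical.

```python
def relative_triangle_count(max_pixels_per_edge: float, baseline: float = 1.0) -> float:
    """Estimated triangle count relative to the baseline edge-length target.

    Assumes triangle area scales with edge length squared, so covering the
    same screen area needs about 1 / edge^2 as many triangles.
    """
    if max_pixels_per_edge <= 0 or baseline <= 0:
        raise ValueError("edge lengths must be positive")
    return (baseline / max_pixels_per_edge) ** 2


if __name__ == "__main__":
    for edge in (1.0, 2.0, 4.0):
        ratio = relative_triangle_count(edge)
        print(f"MaxPixelsPerEdge={edge:g}: ~{ratio:.1%} of the default triangle count")
```

Under that assumption, going from 1 to 2 pixels per edge would leave about a quarter of the triangles to cull and rasterize. Actual savings have to be measured in a capture.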

>Regarding Nanite landscape, is there anything to the idea of raising MaxPixelsPerEdge, or basically reducing the non-tessellated Nanite dicing rate, on just landscape for low-end, to get bigger, rougher triangles and so spend less time and fewer VGPRs doing FetchTransformedNaniteVerts?

I don’t think we have any current plans to do something like that, but that seems like a modification that could be fairly manageable. The instance could be tagged with a flag, and the culling code could use that to switch between landscape and non-landscape pixels per edge.
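The shape of that change can be sketched in a few lines. This is a logic model only, written in Python for brevity: the flag bit, the settings struct, and the function are all hypothetical names, not existing engine identifiers, and a real change would live in the Nanite culling shaders and the instance data that feeds them.

```python
from dataclasses import dataclass

# Hypothetical per-instance flag bit; not an actual engine flag.
INSTANCE_FLAG_LANDSCAPE = 1 << 0


@dataclass(frozen=True)
class CullingSettings:
    """Hypothetical settings passed to the culling pass."""
    max_pixels_per_edge: float            # used for regular instances
    landscape_max_pixels_per_edge: float  # coarser target for landscape on low-end


def pixels_per_edge_for_instance(instance_flags: int, settings: CullingSettings) -> float:
    """Pick the edge-length target based on whether the instance is tagged as landscape."""
    if instance_flags & INSTANCE_FLAG_LANDSCAPE:
        return settings.landscape_max_pixels_per_edge
    return settings.max_pixels_per_edge


if __name__ == "__main__":
    low_end = CullingSettings(max_pixels_per_edge=1.0, landscape_max_pixels_per_edge=4.0)
    print(pixels_per_edge_for_instance(INSTANCE_FLAG_LANDSCAPE, low_end))
    print(pixels_per_edge_for_instance(0, low_end))
```

Keeping the landscape value as a separate setting would let a low-end profile coarsen only the landscape while leaving other Nanite geometry at the default target.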

>Skinned & voxel Nanite foliage is exciting; it’s unlocked a lot of potential for us, but the performance, especially with VSM, is worrisome. The main view pass, not so much. But a moving directional sun light sees a lot of foliage in its top-down view of a forest. So specifically the pass VSM - Fixed Function (Voxel | CastShadow | Skinned) gives us our worst occupancy, and is perhaps the single largest and longest call in our frame. We’ve pushed VSM LOD bias as far as it can go. So my question: how done is Nanite foliage? Can we expect any DX12 shader optimisations to come down here at this level?

Ah, one thing to watch out for here is that pushing VSM LOD bias too far can actually have negative performance consequences for voxels. The problem is that each cluster is handled by a single thread group. For large triangles, we can fall back to HW rasterization, but we don’t have a similar fallback for voxels. So as voxel clusters get larger in screen space, we can start getting into bad load balancing scenarios. As the voxel/triangle split is baked into the geometry, we can’t easily fall back to triangles either. This is something we would want to address.

>Does Epic feel there is any excess left in, say, NaniteVertexFactory.ush etc. that could be pruned for low-end? Could we trade some quality to free up some registers?

I think it is generally hard to keep complexity under control, but we try. We have recently started looking into performance on some lower-end NVIDIA hardware, so perhaps something will come out of that.

Priorities are less clear than usual atm, so I won’t speculate about what might happen when.