High cost of GPUSceneUpdate due to Niagara meshes

Hey!

Since we upgraded from UE 5.0.3 to UE 5.3.2, we have noticed a high cost of GPUSceneUpdate on the GPU, reaching multiple milliseconds on our recommended-spec setup with an RTX 2070 Super. We saw this correlating with heavier Niagara GPU simulation loads.

After adding some GPU scopes to see where the cost actually goes, I found that the “GPU Writer Delegates” part (“ImmediateWrites” on the trace screenshot) blows up into individual dispatches per emitter.

I was able to greatly reduce the cost of GPUSceneUpdate by changing the NIAGARA_ENABLE_GPU_SCENE_MESHES define to 0 at the top of NiagaraMeshVertexFactory.h (and also making the whitespace change in NiagaraMeshVertexFactory.ush that the comment there suggests).
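For reference, the local engine edit was just this (a sketch; the default value and surrounding code may differ between engine versions):

```cpp
// NiagaraMeshVertexFactory.h (local engine edit, sketch only):
// disable GPU Scene support for Niagara mesh particles.
#define NIAGARA_ENABLE_GPU_SCENE_MESHES 0 // was 1 in our tree
```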

Is this behavior of individual dispatches expected? Is there something we could do on our side to make it behave better? Are there any future improvements I could cherry-pick for this? What are the drawbacks of disabling NIAGARA_ENABLE_GPU_SCENE_MESHES?

Cheers,

Gábor

Hi,

Can you take a capture in PIX to see whether the dispatches are overlapping? I’ve reported an issue like that to an IHV in the past (I don’t recall if it was AMD or NV), where we found that dispatches which should have been overlapping were running in serial, each with a small amount of work. I would also disable the individual markers to avoid them causing a false positive.

The downside to disabling GPU Scene for Niagara is that various rendering features will not work: ray tracing, Virtual Shadow Maps, etc. If you don’t need them, then that’s fine.

Thanks,

Stu

Hey,

Are there plans to improve this behavior in UE in the future, e.g. by batching many small dispatches into fewer, larger ones, or in some other way? Or to discuss this further with the IHVs to resolve the issue?

Thanks,

Gábor

Hi,

Sorry I missed your response; the question went into the pending state and your recent reply kicked it back into my queue.

On the Niagara simulation side of things, we currently don’t have plans to batch the dispatches. It’s complicated to do so given the permutations of options; perhaps there’s a path forward when no data interfaces are used, but that is extremely limiting. We are somewhat reliant on the driver / GPU doing the correct thing and overlapping the dispatches as we asked it to. We could be bumping up against how many in-flight dispatches are possible given GPR usage, etc.

This might also be true of the GPU Scene uploads on those GPUs, i.e. hitting the maximum number of in-flight dispatches.

For me, the first stop would be checking in with the IHVs to see where the bottleneck is. If you have a project that reproduces the issue, I could ask them about it again.

Thanks,

Stu

Hello,

Gábor asked me to create a sample project in 5.6 where the high GPU Scene Update cost can be reproduced, along with a high Compute GPU Sim cost. I created a pseudo setup that roughly uses the same Niagara features and options we use in our game.

In our project, we often use emitters that have many low-poly meshes, and we stack multiple mesh renderers into one emitter. Using the visibility parameter, we randomly choose between them per particle. We also use multiple meshes in one mesh renderer module together with the mesh ID parameter, but unfortunately this cannot be reproduced in Unreal 5.6 because the engine crashes when trying to add another entry to the static mesh array in the mesh renderer module.

On a machine with an RTX 2070, I performed a Trace capture and a PIX capture in the editor, and I was able to reproduce a significant GPU Scene Update cost. I also noticed that the more Mesh Renderers are active, the higher the GPU Scene Update cost. I placed 60 Niagara System actors using the same system, which contains 5 emitters; only one of them has Mesh Renderers, but I added 6 of them to that emitter. In total, that’s 300 active GPU emitters at the same time.

As for the Niagara GPU simulation, I used only GPU emitters in this test and added modules specifically to force a high number of Data Interfaces. Unreal shows that I’m using 7 DIs.

In this test, I was also able to get a high Compute GPU Sim cost, although it’s relatively low compared to what I’ve seen in practice: 300 emitters took about 1.5 ms.

Interestingly, this value shows up when looking at the trace in Insights, but the PIX capture tells a different story: in PIX it’s much lower.

To investigate further, I did a test build in our game where I recreated exactly the same setup and got an average of 6 ms for GPU Sim.

To make sure, I repeated the test on the vanilla 5.3.2 version available on the Epic Games Launcher. Using exactly the same test case, I was able to reproduce the increased Compute GPU Sim cost. I also noticed a significant difference in GPU Scene cost between 5.3.2 and 5.6.

To summarize, I have a sample project where you can observe a high GPU Scene Update cost by using many Mesh Renderers in one emitter; it is attached below.

The Compute GPU Sim and GPU Scene Update costs differ between 5.3.2 and 5.6, and the difference is very noticeable on NVIDIA RTX 2000-series cards.

Hi,

We did some further investigation, and it seems the profile markers generated from RDG_EVENT_NAME prevent the small dispatches from running in parallel.

We observed a significant perf boost, and nicely overlapping dispatches in PIX, when disabling the markers with the `ToggleDrawEvents` console command, both for GPUSceneUpdate and NiagaraGPUSimulation.

Seems like we were chasing a false positive!

Cheers,

Gábor

I’m back from vacation, thanks for the update.

That matches something I came across previously when looking into a similar issue on a different IHV’s GPU; it was extremely sensitive to any additional work in between the dispatches.

I think we do agree internally that it would still be beneficial to batch these small dispatches, so the test project will still be helpful.

Thanks,

Stu

Thanks. We’ve found breadcrumbs to be pretty useful for narrowing down crashes, at least internally; I’ve not found them useful at the dispatch level in the wild, however.

I just want to verify: are you on the latest drivers and still seeing this issue?

We have been talking about adding a verbosity level, so that if you need to dig in you can enable it via a setting (ini/CVar).
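Something along these lines is what I have in mind (the CVar name here is hypothetical, just sketching the idea):

```cpp
// Hypothetical verbosity CVar (not an existing UE CVar; name invented for
// illustration): 0 keeps only top-level event scopes, 1 re-enables the
// per-dispatch events when you need to dig in.
static TAutoConsoleVariable<int32> CVarRDGEventVerbosity(
	TEXT("r.RDG.EventVerbosity"),
	0,
	TEXT("0: top-level event scopes only, 1: include per-dispatch events."),
	ECVF_RenderThreadSafe);
```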

Thanks,

Stu

I got some more information: unfortunately, WBI (WriteBufferImmediate) will sometimes act as a barrier no matter the driver version.

The ‘solution’ is to prevent the breadcrumbs / events, and we have the RDG_EVENT_SCOPE_FINAL macro, which does this by default. When chasing a hang, we can change r.RDG.Events to include the lower-level events.

If you have time, I would love to know whether this works as expected. I.e., using `RDG_EVENT_SCOPE_FINAL(GraphBuilder, "GPU Writer Delegates")` should hopefully fix it.
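Roughly like this, in the pass that runs the writer delegates (a sketch; the loop and variable names below are illustrative, not the exact engine source):

```cpp
// Mark this scope as "final": child passes added below it no longer emit
// their own events / breadcrumbs by default. Raise r.RDG.Events when you
// need the lower-level events back, e.g. while chasing a hang.
RDG_EVENT_SCOPE_FINAL(GraphBuilder, "GPU Writer Delegates");

for (const FWriterDelegate& Writer : PendingWriters) // illustrative names
{
	Writer.Execute(GraphBuilder); // each one adds a small per-emitter dispatch
}
```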

Thanks,

Stu

Thanks; if you do manage to test it, please let me know how it works out.

Thanks,

Stu

Hey,

Here is how it looks in a PIX capture analyzed on an RTX 2070. I haven’t dug into captures on AMD or consoles yet. We do regularly look at Insights traces on consoles, however, and we have never seen this scope blow up like this on PS5 or XSX.

This behavior seems similar to NiagaraGPUSimulation, which also consists of many small dispatches. We noticed that work also being disproportionately more expensive, especially on older-generation NVIDIA cards (RTX 2000 series and earlier in our experience, though of course our testing is not exhaustive).

Cheers,

Gábor

Hey Stu,

In the meantime I noticed that this was not the full story: RDG events are actually enabled in all desktop builds (at least in UE 5.3) and are used for GPU breadcrumbs even in Shipping builds.

As a workaround, I replaced the RDG_EVENT_NAME macros with {} in FNiagaraGpuComputeDispatch::DispatchStage, and removed the SCOPED_DRAW_EVENTF from the “Update Free IDs” part. This seems to have drastically reduced the Niagara GPU sim cost in our playtest traces (Test config builds where `WITH_PROFILEGPU` is 0).
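Roughly, the edits looked like this (an illustrative sketch; exact pass names and call sites vary by engine version):

```cpp
// In FNiagaraGpuComputeDispatch::DispatchStage, sketch only.
// Before: every tiny stage dispatch got its own named RDG event (and
// breadcrumb):
//   GraphBuilder.AddPass(RDG_EVENT_NAME("NiagaraGPUSimulation ..."), ...);
// After: pass a default-constructed FRDGEventName so no marker is recorded:
GraphBuilder.AddPass(
	{}, // was RDG_EVENT_NAME(...)
	PassParameters,
	ERDGPassFlags::Compute,
	[PassParameters](FRHICommandList& RHICmdList)
	{
		// dispatch work unchanged
	});

// And in the free-ID update path, the scoped draw event was simply removed:
// SCOPED_DRAW_EVENTF(RHICmdList, NiagaraUpdateFreeIDs, TEXT("Update Free IDs")); // removed
```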

Hope this helps, cheers!

Gábor

Hey!

We did the testing on the RTX 2070 with driver 577.00. The latest is now 580.88, but it was released after we ran our tests.

Adding an opt-in CVar to enable these verbose events sounds like a good solution.

Cheers,

Gábor

Hey,

RDG_EVENT_SCOPE_FINAL indeed looks like what we need here. Sorry, I don’t have the capacity to verify it in the next couple of days.

Cheers,

Gábor