Editor crash in FNiagaraSystemGpuComputeProxy::RemoveFromRenderThread when unloading a world

Hi, we’re seeing an editor crash inside the lambda in FNiagaraSystemGpuComputeProxy::RemoveFromRenderThread when unloading certain maps in the editor.

I’ve been able to narrow down the cause quite a bit and have a reliable reproduction on my side, but haven’t quite isolated it well enough yet to reproduce it in a clean sample project.

Let’s start with the symptom:

The crash occurs on the render thread inside the lambda in FNiagaraSystemGpuComputeProxy::RemoveFromRenderThread. The FNiagaraGpuComputeDispatchInterface is already destroyed and the vTable has been cleared, so the virtual function call to RemoveGpuComputeProxy results in a pure virtual function call.

How does it get into this state? I managed to confirm that during world destruction, it first calls FNiagaraGpuComputeDispatch::OnDestroy and AFTER that it calls FNiagaraSystemGpuComputeProxy::RemoveFromRenderThread again with the already destroyed FNiagaraGpuComputeDispatchInterface and queues a render command with it. I’ve attached a screenshot of this callstack in the .zip file.

So there are 2 callstacks:

  1. where the FNiagaraGpuComputeDispatchInterface gets destroyed
  2. where a NiagaraSystem is stopped AFTER the destruction of FNiagaraGpuComputeDispatchInterface and it queues a render command with the destroyed FNiagaraGpuComputeDispatchInterface

I’ve also added a screenshot of the function UEditorEngine::EditorDestroyWorld where these 2 callstacks converge and the relative order of operations is easier to see.

One final important piece of the puzzle: UNiagaraCullProxyComponent seems to be involved. It appears to be what causes Niagara effects to be stopped only after the FNiagaraGpuComputeDispatchInterface has been destroyed. I can reproduce the crash with a small test level by placing a specific Niagara system in it, moving the camera very far away from that system, and then unloading the world. Without moving the camera far away first, it does not crash. I didn’t look too deeply into where the UNiagaraCullProxyComponents get created or how this mechanism works, but it seems to be involved in this late cleanup of Niagara effects.

I hope this is enough information to figure it out! Thank you!

Hi Christoph,

Thanks for the report. I just checked our crash reports and I see a very low number of these (~11 over a large time period), so it has likely gone under the radar due to its low repro count.

Since you have a test project / way to repro the crash, could you add something into FNiagaraSystemGpuComputeProxy::RemoveFromRenderThread where we skip the render command if ComputeDispatchInterface->IsPendingKill() is true? If that works I’ll implement that here. I can also try to set up a test case when I have a few spare cycles, or if you have a project that you could share, that would be great too.

Thanks,

Stu

Thanks for the confirmation, much appreciated.

Thanks,

Stu

Hi Daniel,

I’m back from vacation. Thanks for the detailed write-up. I don’t imagine you have a repro I could work with at all?

I semi-wondered about the timing of this after the fact; some of the cleanup order of worlds isn’t exactly straightforward.

One of the things I would likely test is to not use SystemInstanceAllocation->GetComputeDispatchInterface() but instead ask the UWorld for the FNiagaraGpuComputeDispatchInterface. That way, if the FScene has been removed, it would return null and we can skip the removal.
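To make the idea concrete, here is a minimal standalone sketch of "look the dispatcher up through the world at removal time and skip if it is gone". It is not engine code: WorldModel, DispatchModel, ProxyModel and their member functions are hypothetical stand-ins for UWorld, FNiagaraGpuComputeDispatchInterface and FNiagaraSystemGpuComputeProxy, and the sketch assumes the world object itself is still valid when the proxy is removed.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for FNiagaraGpuComputeDispatchInterface.
struct DispatchModel {
    std::vector<const void*> Proxies;

    void AddProxy(const void* Proxy) {
        assert(Proxy != nullptr);
        Proxies.push_back(Proxy);
    }

    void RemoveProxy(const void* Proxy) {
        Proxies.erase(std::remove(Proxies.begin(), Proxies.end(), Proxy), Proxies.end());
    }
};

// Hypothetical stand-in for UWorld: the dispatcher lives and dies with the scene.
struct WorldModel {
    std::unique_ptr<DispatchModel> Dispatch = std::make_unique<DispatchModel>();

    // Returns nullptr once the scene (and with it the dispatcher) has been torn down.
    DispatchModel* FindDispatch() const { return Dispatch.get(); }

    void DestroyScene() { Dispatch.reset(); }
};

// Hypothetical stand-in for FNiagaraSystemGpuComputeProxy.
struct ProxyModel {
    // Resolves the dispatcher at removal time instead of trusting a cached pointer.
    // Returns true if a removal was issued, false if it was skipped.
    bool RemoveFromWorld(const WorldModel& World) const {
        DispatchModel* Dispatch = World.FindDispatch();
        if (Dispatch == nullptr) {
            return false; // dispatcher already gone, nothing left to unregister from
        }
        Dispatch->RemoveProxy(this);
        return true;
    }
};
```

The point of the design is that the proxy never holds a pointer that can outlive the dispatcher; the cost is one extra lookup per removal.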

Thanks,

Stu

Thanks so much, I will find time this week to investigate.

Thanks,

Stu

Wanted to let you know that this project was great and I got a repro no problem.

I have two solutions that both fix the problem, one is making the references weak (so to speak) and the other is making a TSharedPtr.

I’m currently leaning towards the TSharedPtr approach over a custom weak reference. I’ve been running various bits of testing and it all comes up good. I’m moving on to cooked testing inside Fortnite next; if that comes up good I’ll likely get it submitted this week and can pass on a CL. I think it should be fairly easy to backport to 5.6 as well.
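For readers comparing the two options, here is a small standalone sketch of how each one changes what a queued render command does. It is not the actual engine change: std::shared_ptr and std::weak_ptr are used as stand-ins for TSharedPtr and a weak reference, and DispatchModel, RenderQueue and the Enqueue functions are hypothetical names.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical stand-in for the GPU compute dispatcher.
struct DispatchModel {
    int RegisteredProxies = 0;

    void RemoveProxy() {
        assert(RegisteredProxies > 0);
        --RegisteredProxies;
    }
};

// Hypothetical stand-in for the render command queue.
using RenderQueue = std::deque<std::function<void()>>;

// Runs and discards every queued command, like the render thread catching up.
void Flush(RenderQueue& Queue) {
    while (!Queue.empty()) {
        auto Command = std::move(Queue.front());
        Queue.pop_front();
        Command();
    }
}

// Option 1: shared ownership. The queued command co-owns the dispatcher,
// so the dispatcher cannot be destroyed before the command has run.
void EnqueueRemoveShared(RenderQueue& Queue, std::shared_ptr<DispatchModel> Dispatch) {
    Queue.push_back([Dispatch = std::move(Dispatch)] { Dispatch->RemoveProxy(); });
}

// Option 2: weak reference. The queued command checks whether the dispatcher
// still exists and skips the call if it does not. Skipped counts those cases.
void EnqueueRemoveWeak(RenderQueue& Queue, std::weak_ptr<DispatchModel> Dispatch, int& Skipped) {
    Queue.push_back([Dispatch = std::move(Dispatch), &Skipped] {
        if (auto Pinned = Dispatch.lock()) {
            Pinned->RemoveProxy();
        } else {
            ++Skipped;
        }
    });
}
```

With shared ownership the removal always runs against a live object and destruction is simply delayed; with the weak variant destruction happens on schedule and late removals become no-ops.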

Thanks,

Stu

My change is in UE5 Main now: CL 45116304.

It certainly fixes this case, and from what I can tell so far hasn’t introduced anything else :)

Thanks,

Stu

Hi!

Colleague of Christoph here :)

I just tried this out, and it seems to fix the issue nicely.

I slapped this right above the ENQUEUE_RENDER_COMMAND in FNiagaraSystemGpuComputeProxy::RemoveFromRenderThread:

	if (ComputeDispatchInterface->IsPendingKill())
		return;

Thank you for your assistance!

Ciao, Daniel!

Sorry, I’m afraid I was overly enthusiastic about the results.

While this did indeed fix the issues we were having with our editor crashing, we’ve noticed the same crash also appears in our shipping builds (and only shipping - not test, or any other config).

I’m currently trying to see if I can figure out how it can still get into that state, or if I can come up with at least a workaround.

Ciao, Daniel!

I spent some more time debugging this.

  • The crash happens when we are changing levels.
  • UEngine::LoadMap -> UWorld::CleanupWorld -> UWorld::CleanupWorldInternal -> FFXSystemInterface::Destroy
    • This marks the FXSystemInterface as pending kill, and queues the render command to destroy it
  • Shortly thereafter, the render thread picks up the command, and actually destroys the FXSystemInterface.
  • The main thread then reaches TrimMemory() inside of UEngine::LoadMap
  • While collecting garbage, during TrimMemory, a NiagaraComponent is found and destroyed, which goes into FNiagaraSystemInstanceController::Deactivate -> FNiagaraSystemInstance::Complete -> FNiagaraSystemGpuComputeProxy::RemoveFromRenderThread
    • Checking ComputeDispatchInterface->IsPendingKill() doesn’t help at this point: this is a shipping build, and the memory of the destroyed interface has already been overwritten with other data, so IsPendingKill() usually returns false.
  • The render command is now queued, and picked up almost immediately by the render thread, and then crashes on the destroyed interface.
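The sequence above can be replayed in a small standalone model. This is only an illustration of the ordering, under the assumption that the two threads interleave exactly as listed: the dispatcher is represented by an integer handle and liveness is tracked in a side table, so the model never touches freed memory the way the real crash does. Timeline and ReplayLoadMap are hypothetical names.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of the game thread / render thread interaction.
struct Timeline {
    std::set<int> LiveDispatchers;                  // handles that are still valid
    std::deque<std::function<void()>> RenderQueue;  // pending render commands
    std::vector<std::string> Log;                   // what the render thread did

    // Game thread: models FFXSystemInterface::Destroy queueing the destroy.
    void QueueDestroy(int Handle) {
        RenderQueue.push_back([this, Handle] {
            LiveDispatchers.erase(Handle);
            Log.push_back("destroy dispatcher");
        });
    }

    // Game thread: models RemoveFromRenderThread using a cached dispatcher handle.
    void QueueRemoveProxy(int Handle) {
        RenderQueue.push_back([this, Handle] {
            const bool bAlive = LiveDispatchers.count(Handle) != 0;
            Log.push_back(bAlive ? "remove proxy" : "remove proxy on destroyed dispatcher");
        });
    }

    // Render thread: runs everything queued so far.
    void RunRenderThread() {
        while (!RenderQueue.empty()) {
            auto Command = std::move(RenderQueue.front());
            RenderQueue.pop_front();
            Command();
        }
    }
};

// Replays the order of operations described in the list above.
std::vector<std::string> ReplayLoadMap() {
    Timeline T;
    T.LiveDispatchers.insert(1);
    T.QueueDestroy(1);      // CleanupWorld queues the destroy
    T.RunRenderThread();    // render thread destroys the dispatcher
    T.QueueRemoveProxy(1);  // TrimMemory: GC destroys the component
    T.RunRenderThread();    // this is where the real engine crashes
    assert(T.RenderQueue.empty());
    return T.Log;
}
```

The second logged entry is the interesting one: by the time the removal command runs, the handle it was queued with no longer refers to a live dispatcher.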

I played around a bit and just tried something simple/stupid:

  • In FFXSystemInterface::Destroy, instead of immediately queueing a render command, I just add the FXSystem* to a static global array.
  • I added an FFXSystemInterface::NukeSystems() function that iterates over the array and queues all the systems to be destroyed on the render thread.
  • At the end of UEngine::TrimMemory, I explicitly call FFXSystemInterface::NukeSystems() to queue the destroys.

It’s ugly, but allows us to travel between maps just fine for now.
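Roughly, the workaround has the following shape. This is a simplified standalone sketch rather than the actual patch: FxSystemModel, GDeferredDestroys, DeferDestroy and FlushDeferredDestroys are hypothetical names standing in for FFXSystemInterface, the static array, the changed Destroy path and NukeSystems(), and the sketch assumes everything is called from the game thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for FFXSystemInterface.
struct FxSystemModel {
    bool bPendingKill = false;    // set when destruction is requested
    bool bDestroyQueued = false;  // set when the render-thread destroy is issued
};

// Systems whose render-thread destruction has been postponed.
// Plain global with no locking: game thread only in this sketch.
static std::vector<FxSystemModel*> GDeferredDestroys;

// Called where Destroy used to queue the render command immediately.
void DeferDestroy(FxSystemModel* System) {
    assert(System != nullptr);
    System->bPendingKill = true;
    GDeferredDestroys.push_back(System);
}

// Called at the end of the memory trim, after garbage collection has had the
// chance to tear down any remaining components. Returns the number flushed.
std::size_t FlushDeferredDestroys() {
    std::vector<FxSystemModel*> Pending;
    Pending.swap(GDeferredDestroys);
    for (FxSystemModel* System : Pending) {
        // The real code would enqueue the render command that destroys the system.
        System->bDestroyQueued = true;
    }
    return Pending.size();
}
```

Swapping the array into a local before iterating keeps the flush safe if a destroy is deferred again while the pending ones are being processed.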

I had to remove the IsPendingKill() check from FNiagaraSystemGpuComputeProxy::RemoveFromRenderThread for this to work, otherwise ComputeDispatchInterface->RemoveGpuComputeProxy(this) would not get called on the interface that is already pending kill, and then later while destroying the interface it would try to access the stale compute proxy and die there.

I don’t know what a proper fix would look like - splitting the destruction into two parts so far apart seems wrong. Maybe reference counting the compute interface so it only really gets destroyed once the last compute proxy has given it up?

Ciao, Daniel!

Hi, sorry for the late reply. The last couple weeks have been very busy for us.

This morning I took the time to make a sample project. I’ve simplified the blueprint so it’s basically just the Niagara effect, and exported it to a clean project. I’ve tested it with a vanilla UE 5.6.1 installation, and it still seems to be reproducible!

Reproduction steps are:

  • open the project in UE 5.6.1
  • open the “CrashDemo” level
  • maybe move the camera a bit. Flying from one side of the terrain to the other seems to be enough.
  • open the “CrashDemo” level a second time (so it unloads the first world)

With these steps I’ve always been able to reproduce it.

Hope it helps! Thanks for looking into it!

Thanks a lot! I’m happy to hear reproduction was a success! :)

Our test suite caught one issue over the weekend; this is fixed in 45154086.

Thanks,

Stu

Thanks a lot!