We’ve determined the reason for the crash and implemented a local fix, but since it relates to a core Niagara system (the scalability manager) we wanted to let other people know about it. We can reproduce it fairly easily in our project, but I’m unable to provide a sample project with a repro case.
When the ProcessSignificance method iterates through the components and creates CullProxies for them, this can result in components unregistering themselves from the scalability manager. This means the ManagedComponents and State arrays can be modified during iteration, and the indices used inside the loop are no longer valid.
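To show the hazard in isolation, here is a toy, plain-C++ illustration (not engine code; std::vector simply stands in for the engine’s internal arrays): an index that was valid when the loop started can end up referring to a different element, or to nothing at all, once the container shrinks mid-iteration.

```cpp
#include <cstdio>
#include <vector>

int main()
{
    // Stand-in for ManagedComponents; pretend each int identifies a registered component.
    std::vector<int> Components = {10, 20, 30, 40};
    const size_t InitialCount = Components.size();

    for (size_t Index = 0; Index < InitialCount; ++Index)
    {
        if (Index == 1)
        {
            // Side effect analogous to a component unregistering itself mid-loop.
            Components.erase(Components.begin());
        }

        if (Index < Components.size())
        {
            // After the erase, this no longer reads the element the index originally referred to.
            std::printf("Index %zu -> %d\n", Index, Components[Index]);
        }
        else
        {
            std::printf("Index %zu is now past the end of the array\n", Index);
        }
    }
    return 0;
}
```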
I’ve attached a file that shows a more complete callstack of the exact moment the arrays are modified during iteration. This seems to happen when components are completed on the same frame that they are culled (it’s not clear whether the completed component is the same one being culled; there’s some indication it can be a different one). Everything about this process seems normal, so the components being completed doesn’t necessarily look like a mistake. However, because the loop doesn’t account for this possibility, it causes a crash. It seems easier to reproduce with short-lived, rapidly spawned systems and aggressive culling; in our case this happens with bullet impact effects.
Our local fix involves several steps:
- A copy of the ManagedComponents array is created before iterating, and that copy is used during the loop instead of the original.
- Before processing each component, we check whether the component pointer is still valid and whether the component is still registered with the scalability manager. Components that fail either check no longer need to be processed, so they are skipped.
- After checking that the component is valid, we reacquire its current index in case the index from the SignificanceIndices array is out of date. This is done by taking the component pointer from the copy of the ManagedComponents array and finding the index of that element in the real ManagedComponents array via the IndexOfByKey method. This corrected index is then used for the rest of the loop body.
This local fix is slightly more expensive due to copying the array and finding the index again, but there doesn’t seem to be another way to handle cases where the internal arrays are modified during iteration.
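For reference, here is a rough sketch of what that workaround can look like. This is not the engine’s actual code: IndexOfByKey and INDEX_NONE are standard TArray conventions, while the IsRegistered check is a placeholder for however the manager tracks registration.

```cpp
// Sketch only: iterate a snapshot so side effects can't invalidate the loop,
// and re-resolve each component's live index before using it.
TArray<UNiagaraComponent*> ComponentsCopy = ManagedComponents;

for (UNiagaraComponent* Component : ComponentsCopy)
{
    // A previous iteration's side effects (e.g. cull-proxy creation) may have
    // destroyed or unregistered this component; skip it if so.
    // "IsRegistered" is a placeholder, not an actual engine function.
    if (!IsValid(Component) || !IsRegistered(Component))
    {
        continue;
    }

    // The index captured in SignificanceIndices may be stale, so look the
    // component up again in the live ManagedComponents array.
    const int32 LiveIndex = ManagedComponents.IndexOfByKey(Component);
    if (LiveIndex == INDEX_NONE)
    {
        continue;
    }

    // ... rest of the per-component significance/culling logic, using LiveIndex ...
}
```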
Hi Justin,
Sorry about the delay. I’m looking into this now and should get back to you soon.
Best regards,
Vitor
Hi Justin,
Sorry about the extended delay. I’ve been trying to reproduce the issue you described, but so far without success. For this to be actionable by Epic devs, it would be best to find a way to reproduce the problem consistently.
I have attached the current version of my test project. Would it be possible for you to take a look at it and see if there are any steps I missed when trying to repro? I am spawning one system per tick at 120 fps. Each system spawns a burst of particles that lasts about 0.25 seconds, so I expect to have about 30 live systems at any given time. The scalability settings in my Effect Type limit the number of instances to 30 and use age for relevance, so systems are dying at about the same time the scalability manager would try to cull them. I’ve also tried several different variations on those numbers.
When inspecting the code and testing with breakpoints, I also could not find an obvious situation where FNiagaraScalabilityManager::UnregisterAt() would be called while iterating on the Context.SignificanceIndices array inside FNiagaraScalabilityManager::ProcessSignificance(). Since you seem to be comfortable making changes to the engine source code, maybe you could set a flag while that loop is iterating, break when a component is unregistered while that flag is set, and send over the callstack from that point?
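In case it helps, here is roughly what I mean (a hedged sketch: bIsProcessingSignificance is a debug-only member you would add yourself, and the parameter lists are abbreviated rather than the real signatures):

```cpp
// Temporary instrumentation, not a fix: flag the iteration and assert if an
// unregister happens while the flag is set, so the offending callstack is captured.
void FNiagaraScalabilityManager::ProcessSignificance(/* ... */)
{
    bIsProcessingSignificance = true;   // hypothetical debug-only member
    // ... existing loop over Context.SignificanceIndices ...
    bIsProcessingSignificance = false;
}

void FNiagaraScalabilityManager::UnregisterAt(int32 IndexToRemove)
{
    // Breaks into the debugger (or logs a callstack) when hit mid-iteration.
    ensureMsgf(!bIsProcessingSignificance,
        TEXT("Scalability manager modified while ProcessSignificance is iterating"));
    // ... existing unregistration logic ...
}
```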
Best regards,
Vitor
Hi Justin,
I almost forgot: I’ve got the public bug tracker link for you, UE-294564. The engine devs are going to take a look at this, and they may chime in here asking for more information. In the meantime, if you have any more information about this issue that could help them repro the crash (so they can properly test whether their solution attempts actually work), you’re welcome to post it here (or in a new thread if this one is closed by that time).
Best regards,
Vitor
Hi Vitor, thanks for testing this. It may take me some time to look through the example project you provided since I’m currently on vacation, but I’ll try to see if anything jumps out at me as soon as I’m able.
Regarding your last paragraph, that’s exactly what I did. The file attached to this ticket has the callstack that shows a component getting unregistered while iterating through the indices. More specifically:
1. During ProcessSignificance, a component is detected that needs to be culled, so CreateCullProxy is called.
2. When the NiagaraCullProxyComponent is registered, it is specifically told to unpause via SetPaused.
3. This unpause call gets propagated through a few layers: the UNiagaraComponent, FNiagaraSystemInstanceController, FNiagaraSystemInstance, and eventually the FNiagaraSystemSimulation.
4. As part of FNiagaraSystemSimulation::UnpauseInstance, there is a call to WaitForInstancesTickComplete. This blocks the game thread until all concurrent tick operations finish.
5. One of those tick operations results in a component completing, because that tick was apparently the last tick of its lifetime. The system instance immediately invokes FNiagaraSystemInstance::HandleCompletion and begins cleaning itself up.
6. As part of that completion process, the UNiagaraComponent associated with the FNiagaraSystemInstance is reclaimed by the pooling system. This means the component is deactivated, so it unregisters itself from the scalability manager. This is the moment the array is modified during iteration.
Out of all these steps, 2, 4, and 5 seem to be the crux of the problem. They lead to a cascade of behavior with a lot of side effects, one of which can modify the array. There are a couple of potential changes that could avoid these side effects:
- Maybe the newly-created NiagaraCullProxyComponent doesn’t need to be explicitly unpaused.
- Maybe the FNiagaraSystemSimulation doesn’t need to wait for all concurrent ticks to finish when something is unpaused.
- Maybe the completion callbacks don’t need to be invoked immediately when waiting for all ticks to finish and can instead be scheduled for later.
Those are all possible ways to resolve this issue permanently, but they probably require deeper changes to the system. For now, we’ve gone with the simplest solution we could find, which was to handle this at the topmost level by making the iteration stable even when the array is changed during iteration.
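As an illustration of the “schedule it for later” idea, one shape a deferral could take is queuing unregistration requests while the loop is running and flushing them afterwards. All names here are hypothetical; this is a sketch of the concept, not something we’ve implemented:

```cpp
// Sketch of deferring the mutation instead of stabilizing the iteration.
void FNiagaraScalabilityManager::Unregister(UNiagaraComponent* Component)
{
    if (bIsProcessingSignificance)
    {
        // Don't touch ManagedComponents/State mid-loop; remember the request.
        PendingUnregisters.Add(Component);
        return;
    }
    UnregisterInternal(Component);
}

void FNiagaraScalabilityManager::FlushPendingUnregisters()
{
    // Called once ProcessSignificance has finished iterating.
    for (UNiagaraComponent* Component : PendingUnregisters)
    {
        UnregisterInternal(Component);
    }
    PendingUnregisters.Reset();
}
```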
Hi Justin,
Thank you for the thorough report and discussion of the issue. And sorry that I missed that other callstack; it does indeed show the exact moment a Niagara component is unregistered from the scalability manager while that iteration is in progress.
Unfortunately, though, I am still not able to repro the crash on my end. I have prepared an internal bug report with the information you provided so the engine devs can take a look, but without a repro their ability to tackle this and test possible solutions might be limited. When you are back in the office, if you have some time to check my test project or create a repro of your own, that could prove very helpful.
By the way, are you using vanilla UE 5.4? And are you able to tell whether this crash also happens in more recent engine versions?
Best regards,
Vitor
PS: Here’s a second version of my test project.