UE5.5.4 - Crash in FNiagaraRibbonGpuBuffer::Allocate and FWindowsCriticalSection::Lock

We are experiencing multiple crashes, and I'm starting to believe we're dealing with memory corruption, but I am out of ideas for finding the root cause. All ideas are welcome.

The crashes started when we upgraded from 5.4.4 to 5.5.4, initially with this callstack on background workers:

Unhandled Exception: EXCEPTION_ACCESS_VIOLATION reading address 0x0000000000000028

[ 00 ] FD3D12DynamicRHI::RHILockBuffer (D3D12Buffer.cpp:686)
[ 01 ] FRHICommandListBase::LockBuffer (RHICommandList.h:729)
[ 02 ] FNiagaraRendererRibbons::InitializeVertexBuffersResources (NiagaraRendererRibbons.cpp:2514)
[ 03 ] FNiagaraRendererRibbons::GetDynamicMeshElements (NiagaraRendererRibbons.cpp:963)
[ 04 ] FNiagaraSystemRenderData::GetDynamicMeshElements (NiagaraSystemRenderData.cpp:235)
[ 05 ] FNiagaraSceneProxy::GetDynamicMeshElements (NiagaraComponent.cpp)
[ 06 ] FProjectedShadowInfo::GatherDynamicMeshElementsArray
[ 07 ] FProjectedShadowInfo::GatherDynamicMeshElements
[ 08 ] UE::Trace::FChannel::operator|
[ 09 ] TaskTrace::FTaskTimingEventScope::{ctor}
[ 10 ] UE::Tasks::Private::FTaskBase::TryExecuteTask
[…]

What I found out is that before locking the buffers, when calling VertexBuffers.InitializeOrUpdateBuffers, some of the buffers being allocated would come back null. I added some naive custom logging code after RibbonLookupTableBuffer.Allocate(…), and I only ever got null buffers when bIsUsingGPUInit is false.

So in FNiagaraRibbonVertexBuffers::InitializeOrUpdateBuffers, hitting RibbonLookupTableBuffer.Buffer == nullptr now outputs logs instead of crashing.

I initially thought we were simply running out of memory, so I kept an eye on memory usage in the Windows Task Manager performance tab in case it was untracked memory, but there always seemed to be more than 20 GB of available RAM.
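
To get numbers from the failure site itself rather than eyeballing Task Manager, something like this could log the OS counters right where the allocation fails (a minimal sketch; FPlatformMemory::GetStats() is the engine API, the log category and wording are mine):

#include "HAL/PlatformMemory.h"

// Log OS-level memory counters at the failure site, to rule out
// (or confirm) genuine memory pressure vs. untracked allocations.
const FPlatformMemoryStats MemStats = FPlatformMemory::GetStats();
UE_LOG(LogTemp, Warning, TEXT("AvailablePhysical=%.2f GB UsedPhysical=%.2f GB UsedVirtual=%.2f GB"),
    MemStats.AvailablePhysical / (1024.0 * 1024.0 * 1024.0),
    MemStats.UsedPhysical / (1024.0 * 1024.0 * 1024.0),
    MemStats.UsedVirtual / (1024.0 * 1024.0 * 1024.0));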

My current (admittedly ugly) fix is to just skip the RHICmdList.LockBuffer / Memcpy / UnlockBuffer sequence when the buffer is null, though I'm very much aware that this only ignores a symptom of something else.
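
For reference, the guard is essentially this (a sketch of our change, not engine code; the function, LookupTableData / LookupTableBytes, and the log category are stand-ins for what's actually in FNiagaraRendererRibbons::InitializeVertexBuffersResources):

// Sketch of the workaround; parameter names are stand-ins for the real code.
static void UploadRibbonLookupTable(FRHICommandListBase& RHICmdList, FRHIBuffer* LookupBuffer,
                                    const void* LookupTableData, uint32 LookupTableBytes, bool bIsUsingGPUInit)
{
    if (LookupBuffer == nullptr)
    {
        // A successful Allocate() should never leave a null buffer, but we hit
        // this in the wild, so log and skip the upload instead of crashing.
        UE_LOG(LogTemp, Warning, TEXT("RibbonLookupTableBuffer is null (bIsUsingGPUInit=%d), skipping upload"), bIsUsingGPUInit ? 1 : 0);
        return;
    }

    void* Dest = RHICmdList.LockBuffer(LookupBuffer, 0, LookupTableBytes, RLM_WriteOnly);
    FMemory::Memcpy(Dest, LookupTableData, LookupTableBytes);
    RHICmdList.UnlockBuffer(LookupBuffer);
}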

Anyway, after adding the nullptr check, those crashes went away.

That being said, one theory is that there's some memory corruption / stomping in a buffer shared with the GPU.

After that “fix”, I started to notice a new callstack being reported as our top crash:

Unhandled Exception: EXCEPTION_ACCESS_VIOLATION writing address 0x0000000000000024

[ 00 ] RtlpWaitOnCriticalSection (ntdll.dll)
[ 01 ] RtlpEnterCriticalSectionContended (ntdll.dll)
[ 02 ] RtlEnterCriticalSection (ntdll.dll)
[ 03 ] EnterCriticalSection(Windows::CRITICAL_SECTION *) (MinimalWindowsApi.h:238)
[ 04 ] FWindowsCriticalSection::Lock() (WindowsCriticalSection.h:44)
[ 05 ] FScopeLock::{ctor}(FWindowsCriticalSection *) (ScopeLock.h:39)
[ 06 ] FD3D12BaseShaderResource::RemoveRenameListener(FD3D12ShaderResourceRenameListener *) (D3D12Resources.h:893)
[ 07 ] FD3D12View::~FD3D12View() (D3D12View.cpp:242)
[ 08 ] FD3D12ShaderResourceView_RHI::`scalar deleting destructor'(unsigned int) (MyGame-Win64-Test.exe)
[ 09 ] FRHIResource::DeleteResources(TArray<FRHIResource *,TSizedDefaultAllocator<32> > const &) (RHIResources.cpp:71)
[ 10 ] FRHICommandListExecutor::FSubmitState::Submit(FRHICommandListExecutor::FSubmitState::FSubmitArgs const &) (RHICommandList.cpp:1056)
[…]

Reading about RtlEnterCriticalSection, I don't see how it could crash unless the critical section isn't properly initialized, but FWindowsCriticalSection initializes it in its constructor (FScopeLock then only locks and unlocks it), so that reinforced my assumption that we're getting memory corruption.
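
For reference, here's the behavior in simplified form (a paraphrase of WindowsCriticalSection.h / ScopeLock.h with the spin-count setup and other details trimmed, not the verbatim engine code):

class FWindowsCriticalSection_Simplified
{
public:
    FWindowsCriticalSection_Simplified()  { InitializeCriticalSection(&CS); } // always initialized at construction
    ~FWindowsCriticalSection_Simplified() { DeleteCriticalSection(&CS); }
    void Lock()                           { EnterCriticalSection(&CS); }      // frames [02]-[04] of the crash
    void Unlock()                         { LeaveCriticalSection(&CS); }
private:
    CRITICAL_SECTION CS;
};

// FScopeLock just calls Lock() in its constructor and Unlock() in its destructor.
// So if EnterCriticalSection faults writing address 0x24, the critical section
// itself must be sitting in freed or stomped memory (0x24 looks like a member
// offset off a stale object pointer), i.e. corruption rather than a missing Init.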

The crashes are happening randomly, on all kinds of GPUs and CPUs from different vendors, and we couldn't conclude that it crashes more on low-end or high-end hardware.

The repro rate is around 1 crash every 100 games, with no known repro steps.

I do have a few questions:

1 - In FNiagaraRibbonVertexBuffers::InitializeOrUpdateBuffers there's no check for nullptr; is it expected that this should fail with a fatal OOM error instead? And if that's the case, what am I missing with my naive check for the buffer being null?

2 - I did run with -stompmalloc, but it is so slow that I can barely test anything, and since we don't have clear repro steps it's just impossible to play until it crashes. Is there a way to activate the stomp malloc only for this specific allocator?

3 - As a rule of thumb, when allocating the buffers, does a MaxAllocationCount == 9586980 seem ridiculously high? I wasn't sure if maybe we have some very badly set up VFX.

4 - Should I just monitor stat RHI (and stat RHICMDLIST), or will there be untracked memory I'd be missing when trying to find out whether we really have too much memory pressure? Are there max-size variables for that allocator (RHI memory) that I could try to tune?

We did implement one of the suggested fixes in this thread: [Content removed] and it really helped on AMD GPUs, but we still got the FScopeLock crash, so I suspect there's something else.

Based on [Content removed] I will try to disable parallel GDME for both Niagara & Cascade.

Those two crashes might be absolutely unrelated to each other; it's really just an assumption that this is memory corruption, since I can't find another explanation.

Hi,

The Niagara callstack looks exactly like the issue I fixed with 40656300 “Fix race with Post Init Views particle systems” that was posted in the question you linked.

1 - We should never get nullptr back. OOM is considered fatal inside the RHI.

2 - You might want to open a separate question for someone from the core team to answer. There are a few options, but personally I use ASan when stompmalloc won't work.

3 - There are two different counts: the max that can fit in a buffer, and the estimated max particle count. We should be using the estimated max particle count rather than the max buffer count; I need to go verify the code.

4 - Can you open a separate question for the RHI team to answer? I would assume stat RHI has some level of data.

Thanks,

Stu

Thanks Stuart!

Merging 40656300 “Fix race with Post Init Views particle systems” in our next nightly build.

I’m out for a few days but will definitely open separate questions for the RHI team.

Great, hopefully that fixes it. The accumulation stride issue was also fixed in 44042760 yesterday.

I just looked over the code, and the max count is only used to ensure we don't blow the buffer limit. So nothing is wrong there; it's just letting you know the max count is 9 million particles.
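
For what it's worth, that figure lines up exactly with a 2 GB buffer limit divided by a 224-byte element stride, i.e. it reads as "how many elements fit in the largest legal buffer" (the stride here is inferred from the number itself, not read from the code):

// 2^31 / 224 == 9586980 exactly (integer division), matching the reported MaxAllocationCount.
// The 224-byte stride is an inference from that number, not verified against the code.
constexpr uint64 MaxBufferBytes = 1ull << 31; // 2 GB max buffer size
constexpr uint64 AssumedStride  = 224;        // hypothetical bytes per element
static_assert(MaxBufferBytes / AssumedStride == 9586980, "consistent with MaxAllocationCount above");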

Cheers,

Stu

Sadly, I got back from vacation and QA reported that we still have a lot of crashes with the same callstack, even after merging 40656300 “Fix race with Post Init Views particle systems”.

One more CL sprang to mind, 40656300: the PostInitViews call to the FX system was incorrectly moved to overlap with some task work.

Thanks,

Stu

Ahh sorry, vacation brain.

The fact that it's coming from the shadow views makes me think something must be overlapping there, perhaps incorrectly, although that race fix is the only one I really remember making. Let me dig through some CLs.

I completely missed CL 38395120 before; it's certainly worth checking, especially as shadow views can be running multiple projections at once.

Thanks,

Stu

Are you able to run with ASAN on your target platform at all? Just wondering if running a soak / replay, if you have that available, could perhaps get the issue to show up.

Thanks,

Stu

My next step would be to disable parallel GDME for the shadow pass only (you could likely do this just for Niagara) to see if that fixes the issue.
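
I don't remember the exact CVar names and they tend to move between versions, so rather than me guessing, you can enumerate the candidates in your branch at runtime (ForEachConsoleObjectThatContains is the IConsoleManager API; the search strings are just guesses):

#include "HAL/IConsoleManager.h"

// Dump every console object whose name contains a given substring, to find
// the switches controlling parallel GetDynamicMeshElements in your branch.
static void DumpCVarsContaining(const TCHAR* Substring)
{
    IConsoleManager::Get().ForEachConsoleObjectThatContains(
        FConsoleObjectVisitor::CreateLambda([](const TCHAR* Name, IConsoleObject* /*Obj*/)
        {
            UE_LOG(LogTemp, Display, TEXT("  %s"), Name);
        }),
        Substring);
}

// e.g. DumpCVarsContaining(TEXT("GDME")); DumpCVarsContaining(TEXT("ParallelGather"));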

I spent some time digging around because I'm pretty sure we had issues with the shadow pass overlapping some parts incorrectly, but I can't narrow it down to a different CL, or tell whether I'm misremembering.

I don't feel this should be bad-data related; it's more likely a race / bad overlapping.

Thanks,

Stu

Yes, that’s the one we merged and we still have the issue.

I am under the impression that we have another race condition somewhere else, but I'm unable to narrow it down.

And to be clear, it's for this callstack:

Unhandled Exception: EXCEPTION_ACCESS_VIOLATION reading address 0x0000000000000028

[ 00 ] FD3D12DynamicRHI::RHILockBuffer (D3D12Buffer.cpp:686)
[ 01 ] FRHICommandListBase::LockBuffer (RHICommandList.h:729)
[ 02 ] FNiagaraRendererRibbons::InitializeVertexBuffersResources (NiagaraRendererRibbons.cpp:2514)
[ 03 ] FNiagaraRendererRibbons::GetDynamicMeshElements (NiagaraRendererRibbons.cpp:963)
[ 04 ] FNiagaraSystemRenderData::GetDynamicMeshElements (NiagaraSystemRenderData.cpp:235)
[ 05 ] FNiagaraSceneProxy::GetDynamicMeshElements (NiagaraComponent.cpp)
[ 06 ] FProjectedShadowInfo::GatherDynamicMeshElementsArray
[ 07 ] FProjectedShadowInfo::GatherDynamicMeshElements
[ 08 ] UE::Trace::FChannel::operator|
[ 09 ] TaskTrace::FTaskTimingEventScope::{ctor}
[ 10 ] UE::Tasks::Private::FTaskBase::TryExecuteTask

I’ll make another thread for the other one.

Thanks! We already merged 38395120 a few weeks ago.

I also cherry-picked CL 44313032, but we still get the same crash.

I will definitely try. I believe people had issues running with ASan on 5.5, but I'll cherry-pick fixes from 5.6 if needed.

Any ideas for settings to turn on and off to debug are welcome.

At this point I’m working on building a gym that I’ll let run to repro it more easily.

One other question: I know we had some issues with assets resaving when we upgraded from 5.2 to 5.3 to 5.4 and lately to 5.5, and some assets might not have been touched in a while.

So I'm wondering if there's something I could check to make sure it's not an asset that has bad data.
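
If a brute-force pass is acceptable, my current idea is a commandlet-style loop that force-loads every NiagaraSystem so that PostLoad and any validation warnings show up in the log (a sketch; the asset registry calls are the standard API, the function itself is my own scaffolding):

#include "Modules/ModuleManager.h"
#include "AssetRegistry/AssetRegistryModule.h"
#include "NiagaraSystem.h"

// Force-load every UNiagaraSystem so PostLoad runs and any bad or stale data
// gets a chance to warn in the log, instead of waiting for a runtime crash.
void LoadAllNiagaraSystems()
{
    FAssetRegistryModule& AssetRegistry = FModuleManager::LoadModuleChecked<FAssetRegistryModule>("AssetRegistry");
    TArray<FAssetData> Assets;
    AssetRegistry.Get().GetAssetsByClass(UNiagaraSystem::StaticClass()->GetClassPathName(), Assets);

    for (const FAssetData& Asset : Assets)
    {
        if (UObject* Loaded = Asset.GetAsset()) // triggers load + PostLoad
        {
            UE_LOG(LogTemp, Display, TEXT("Loaded %s"), *Loaded->GetPathName());
        }
    }
}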