We are experiencing multiple crashes, and I am starting to believe that we have memory corruption, but I am out of ideas for finding the root cause. All ideas are welcome.
The crashes started when we upgraded from 5.4.4 to 5.5.4. The first one showed up with this callstack in background workers:
Unhandled Exception: EXCEPTION_ACCESS_VIOLATION reading address 0x0000000000000028
[ 00 ] FD3D12DynamicRHI::RHILockBuffer (D3D12Buffer.cpp:686)
[ 01 ] FRHICommandListBase::LockBuffer (RHICommandList.h:729)
[ 02 ] FNiagaraRendererRibbons::InitializeVertexBuffersResources (NiagaraRendererRibbons.cpp:2514)
[ 03 ] FNiagaraRendererRibbons::GetDynamicMeshElements (NiagaraRendererRibbons.cpp:963)
[ 04 ] FNiagaraSystemRenderData::GetDynamicMeshElements (NiagaraSystemRenderData.cpp:235)
[ 05 ] FNiagaraSceneProxy::GetDynamicMeshElements (NiagaraComponent.cpp)
[ 06 ] FProjectedShadowInfo::GatherDynamicMeshElementsArray
[ 07 ] FProjectedShadowInfo::GatherDynamicMeshElements
[ 08 ] UE::Trace::FChannel::operator|
[ 09 ] TaskTrace::FTaskTimingEventScope::{ctor}
[ 10 ] UE::Tasks::Private::FTaskBase::TryExecuteTask
[…]
What I found is that before the buffers are locked, in the call to VertexBuffers.InitializeOrUpdateBuffers, some of the buffers being allocated come back null. I added some naive logging after RibbonLookupTableBuffer.Allocate(…), and I only get null buffers when bIsUsingGPUInit is false.
In FNiagaraRibbonVertexBuffers::InitializeOrUpdateBuffers, the case RibbonLookupTableBuffer.Buffer == nullptr now writes a log entry instead of crashing.
I initially thought that we were simply running out of memory, so I kept an eye on memory usage in the Performance tab of Windows Task Manager in case it was untracked memory. It looked like I always had more than 20 GB of RAM available.
My current, ugly fix is to skip the RHICmdList.LockBuffer / Memcpy / UnlockBuffer sequence when the buffer is null, but I am well aware that this only hides the symptom of something else.
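For reference, the shape of that workaround is roughly the sketch below. It is a standalone illustration and not the engine code: FakeBuffer, FakeLock, FakeUnlock and UploadIfAllocated are hypothetical stand-ins for the RHI buffer and for RHICmdList.LockBuffer / UnlockBuffer, so it only shows the guard itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

// Hypothetical stand-in for an RHI buffer; this is NOT the engine's buffer type.
struct FakeBuffer {
    std::vector<unsigned char> Storage;
};

// Hypothetical stand-in for RHICmdList.LockBuffer: returns writable memory of the requested size.
void* FakeLock(FakeBuffer* Buffer, std::size_t Size) {
    Buffer->Storage.resize(Size);
    return Buffer->Storage.data();
}

// Hypothetical stand-in for RHICmdList.UnlockBuffer: nothing to do in this sketch.
void FakeUnlock(FakeBuffer*) {}

// Runs the lock / memcpy / unlock sequence only when the buffer was actually allocated.
// Returns true when the upload ran, false when it was skipped and logged.
bool UploadIfAllocated(FakeBuffer* Buffer, const void* Src, std::size_t Size, const char* DebugName) {
    if (Buffer == nullptr) {
        // Log instead of crashing, so the skipped uploads stay visible.
        std::fprintf(stderr, "Skipping upload: %s is null (%zu bytes requested)\n", DebugName, Size);
        return false;
    }
    void* Dest = FakeLock(Buffer, Size);
    std::memcpy(Dest, Src, Size);
    FakeUnlock(Buffer);
    return true;
}
```

The guard only avoids dereferencing the null buffer; it does nothing about whatever caused the allocation to come back null in the first place.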
Anyway, after adding the nullptr check, those crashes went away.
That said, one of my theories is that there is some memory corruption / stomping in a buffer shared with the GPU.
After that “fix”, I started to notice a new callstack reported as our top crash:
Unhandled Exception: EXCEPTION_ACCESS_VIOLATION writing address 0x0000000000000024
[ 00 ] RtlpWaitOnCriticalSection ( ntdll.dll )
[ 01 ] RtlpEnterCriticalSectionContended ( ntdll.dll )
[ 02 ] RtlEnterCriticalSection ( ntdll.dll )
[ 03 ] EnterCriticalSection(Windows::CRITICAL_SECTION *) ( MinimalWindowsApi.h:238 )
[ 04 ] FWindowsCriticalSection::Lock() ( WindowsCriticalSection.h:44 )
[ 05 ] FScopeLock::{ctor}(FWindowsCriticalSection *) ( ScopeLock.h:39 )
[ 06 ] FD3D12BaseShaderResource::RemoveRenameListener(FD3D12ShaderResourceRenameListener *) ( D3D12Resources.h:893 )
[ 07 ] FD3D12View::~FD3D12View() ( D3D12View.cpp:242 )
[ 08 ] FD3D12ShaderResourceView_RHI::`scalar deleting destructor'(unsigned int) ( MyGame-Win64-Test.exe )
[ 09 ] FRHIResource::DeleteResources(TArray<FRHIResource *,TSizedDefaultAllocator<32> > const &) ( RHIResources.cpp:71 )
[ 10 ] FRHICommandListExecutor::FSubmitState::Submit(FRHICommandListExecutor::FSubmitState::FSubmitArgs const &) ( RHICommandList.cpp:1056 )
[…]
From what I have read about RtlEnterCriticalSection, I don’t see how it could crash unless the critical section is not properly initialized. The critical section that FScopeLock locks should already have been initialized by its own constructor, which is what led me to assume that we are getting memory corruption.
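One way to sanity-check the faulting address 0x24 is to compare it against member offsets. The sketch below is an assumption on my side, not something I have confirmed in a debugger: it mirrors the x64 layout of RTL_CRITICAL_SECTION_DEBUG as declared in winnt.h (the Mirror* types and FaultAddress are hypothetical names, and only the members up to ContentionCount are reproduced). With that layout, ContentionCount sits at offset 0x24, so a null DebugInfo pointer in the critical section would produce a write to exactly that address. That would be consistent with a critical section that was zeroed, already deleted, or stomped, but it does not prove which.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of LIST_ENTRY on x64: two 8-byte pointers.
struct MirrorListEntry {
    std::uint64_t Flink;
    std::uint64_t Blink;
};

// Hypothetical mirror of the leading members of RTL_CRITICAL_SECTION_DEBUG on x64.
// Pointers are modelled as 64-bit integers so the layout is the same on any 64-bit host.
struct MirrorCriticalSectionDebug {
    std::uint16_t Type;
    std::uint16_t CreatorBackTraceIndex;
    std::uint64_t CriticalSection;      // pointer back to the owning critical section
    MirrorListEntry ProcessLocksList;
    std::uint32_t EntryCount;
    std::uint32_t ContentionCount;
};

// Address touched when a member at Offset is accessed through the base address Base.
constexpr std::uint64_t FaultAddress(std::uint64_t Base, std::size_t Offset) {
    return Base + Offset;
}
```

Evaluating FaultAddress(0, offsetof(MirrorCriticalSectionDebug, ContentionCount)) gives the address a null debug-info pointer would fault on under this assumed layout.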