I'm backporting the runtime race detector from the UE5 main branch to our UE5.2. There were several issues during that backport, as it is not fully compatible and pulls in a chain of dependencies, so I went with workarounds to make it work in the end and to avoid grabbing too many (risky for us) updates from the main branch. Could you please look at these workarounds and tell me whether instrumentation could still work as intended?
We have no per-module inline functionality, and I didn't want to integrate it since we don't want to take on more memory-management risk at our current production stage. It's worth mentioning that PerModuleInline.inl from Instrumentation is still included through EntryPoints.cpp all the time to make the race detector work.
For the race detector's purposes I wrote my own very simple recursive mutex based on the std library instead of taking FRecursiveMutex from the main branch, as that pulls in many other dependencies which we don't want for now.
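To be concrete, it is just a thin wrapper over the std library, roughly like this (a simplified sketch, not the exact code; names are illustrative):

```cpp
// Simplified sketch of the workaround mutex: a minimal recursive mutex built on
// std::recursive_mutex, standing in for FRecursiveMutex from the main branch.
#include <mutex>

class FSimpleRecursiveMutex
{
public:
	void Lock()    { Mutex.lock(); }
	bool TryLock() { return Mutex.try_lock(); }
	void Unlock()  { Mutex.unlock(); }

private:
	std::recursive_mutex Mutex;
};
```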
LowLevelTasks::Private::FOversubscriptionAllowedScope AllowOversubscription would have required me to backport many dependencies, so I commented those lines out instead.
In Runtime/Core/Public/Atomic.h there was an infinite loop in T Load(const volatile T* Element). For clang I added a clang-specific call, taking LoadRelaxed above as an example (an __atomic_load call, but with __ATOMIC_SEQ_CST). I wonder whether that could break instrumentation functionality somehow? Could that infinite loop be related to the function not being inlined for some reason?
For the 4th point, here is an example of what forced me to make the mentioned change. At the very beginning of the game it went into an infinite loop:
[Inline Frame] FWindowsPlatformAtomics::AtomicRead(const volatile int * Src)
[Inline Frame] UE::Core::Private::Atomic::Load(const volatile int * Element)
__Thunk__Instrument_AtomicLoad_int32(int * Atomic, UE::Instrumentation::FAtomicMemoryOrder MemoryOrder)
FWindowsPlatformStackWalk::InitStackWalkingInternal(void * Process, bool bForceReinitOnProcessMismatch)
FWindowsPlatformStackWalk::InitStackWalking()
PreInit()
It is probably because FWindowsPlatformAtomics::AtomicRead called std::atomic, which in turn contained (by means of the custom clang) additional instructions redirecting the call to instrumentation.
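For reference, the change from the 4th point ends up roughly like this (a simplified sketch; the actual UE5.2 template and surrounding #ifdefs differ):

```cpp
// Simplified sketch of the clang-specific Load, modeled on the LoadRelaxed overload
// above but using __ATOMIC_SEQ_CST. Illustrative only, not the exact engine code.
template <typename T>
inline T Load(const volatile T* Element)
{
	T Result;
	// Calling the clang builtin directly avoids bouncing back through the instrumented
	// FPlatformAtomics/std::atomic path, which is what produced the recursion above.
	__atomic_load((volatile T*)Element, &Result, __ATOMIC_SEQ_CST);
	return Result;
}
```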
As long as all memory is instrumented, you have no cycles in the atomics that create infinite loops, and you have no cycles in your recursive mutex, everything should work.
But cycles are very easy to introduce if things don't get inlined properly. Those cycles will either stack overflow, like you already saw, or deadlock in other cases (i.e. a mutex). That's why parts of the code use custom spinlocks with a special prefix on functions to disable instrumentation.
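The idea behind those uninstrumented spinlocks is roughly this (an illustrative sketch only; the actual attribute/prefix used by the engine's instrumentation is different):

```cpp
// Sketch: a spinlock whose methods are excluded from instrumentation so the race
// detector's own synchronization never re-enters the detector. The attribute below is
// a stand-in (TSan-style no_sanitize); the engine uses its own prefix for this.
#include <atomic>

#if defined(__clang__)
	#define NO_RACE_INSTRUMENTATION __attribute__((no_sanitize("thread")))
#else
	#define NO_RACE_INSTRUMENTATION
#endif

class FUninstrumentedSpinLock
{
public:
	NO_RACE_INSTRUMENTATION void Lock()
	{
		// Raw atomic spin; nothing in here may call back into instrumented code.
		while (bLocked.exchange(true, std::memory_order_acquire))
		{
			while (bLocked.load(std::memory_order_relaxed)) {}
		}
	}

	NO_RACE_INSTRUMENTATION void Unlock()
	{
		bLocked.store(false, std::memory_order_release);
	}

private:
	std::atomic<bool> bLocked { false };
};
```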
The PerModuleInline stuff is mostly used for the editor, where we have DLLs and compile non-monolithic. If you are just trying to test your game runtime in monolithic, you can probably make everything work without it.
I'm happy that this feature is interesting enough for you to go through the effort of backporting it. Let me know if it's useful to you.
Also, I've fixed a metric ton of race conditions since 5.2, so if you come across a UE race you might want to look at all my changes; it might already be fixed in a subsequent release.
It’s definitely not a bug we’re seeing on UE5/Main atm.
The only way I can see two different Pipe Launch calls interacting on the same task is if a piped task was launched with another piped task as a prerequisite… In your case, I would expect the piped task to never be touched until the launch has finished… the fact that something touches its memory before it's even published is very fishy. It might be that you're missing Pipe fixes? I fixed a lot of issues with Pipe in the past, in particular 32857964.
Small note… the high verbosity of your callstacks related to std::atomic seems to indicate you're running your build in Debug? Proper inlining is crucial to avoid instrumenting things that are part of the race detector logic. Did you try with a fully optimized build, either Development or Test? Those should work much better.
Also note that there is probably one critical fix I did in ParallelFor that removed a bunch of races: 25068129. I don't think it applies here though, but it's worth mentioning.
It's unclear to me exactly what property or variable those are racing on, as you might have local changes on your side. There were probably tons of races in 5.2 that have been fixed since then. But even in Main, we still have plenty of benign races in rendering code where either globals or stat counters are written without atomics or any other kind of synchronization. Most of the time the worst that can happen is the stats being off a little bit.
But it looks at least like you've got everything working now and that the races you're seeing are real ones.
May I ask why you want to stick with 5.2 instead of jumping to a much more performant build like 5.6, as we've done a lot of optimizations under the hood?
Great to hear. Completely understandable of course.
I'm going on vacation for the next 3 weeks, and since I'm pretty much the only one familiar with the race detector, it might be a while before you hear back from me if you have further questions.
This is the tool of the year for me, as we are clearly experiencing many memory issues, and many of them are clearly caused by race conditions. Finding them analytically is unproductive and very time consuming. With this tool it becomes much easier, so I really appreciate your efforts to make it work, especially on Windows builds.
With regard to already-fixed races, that is the first thing we did - we went over your commits to find what could be fixed. There are still some questions, though; e.g. we detect races in some places where the race looks widespread enough that it should also be detected on your side, yet it is not fixed in mainline. So it's unclear - is it a problem only on our side, or maybe we don't understand something? Anyway, for questions with callstacks I will create separate threads so as not to clutter this one.
Below is an example of such a report. From my understanding, when we construct an atomic, it makes a non-atomic write to its internal storage, so to make it visible to other threads we need an explicit release store after initialization. Having a relaxed read-modify-write in AddRef() is not enough, as it only guarantees atomicity, not ordering. I didn't find any fix in UE Main, which made me think that either I don't understand something or it is known but benign. At the same time, what is not clear to me is why it complains only in the FThreadStats context? Could it be related to how Pipe gets access to the task pointer?
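To illustrate the concern in isolation (a standalone example, not UE code): the constructor of std::atomic performs a plain initialization, so the newly constructed object has to be published to other threads through some release/acquire edge before they can safely touch it:

```cpp
// Standalone illustration: std::atomic's constructor is a plain, non-atomic write, so a
// newly created object must be published with release/acquire (here via a pointer store)
// before another thread increments its counter with a relaxed RMW.
#include <atomic>
#include <thread>

struct FTaskLike
{
	std::atomic<unsigned int> RefCount { 1 }; // plain initialization inside the constructor
};

std::atomic<FTaskLike*> GPublishedTask { nullptr };

int main()
{
	std::thread Consumer([]
	{
		FTaskLike* Task = nullptr;
		while (!(Task = GPublishedTask.load(std::memory_order_acquire))) {} // acquire the publication
		Task->RefCount.fetch_add(1, std::memory_order_relaxed);             // safe only because of it
	});

	FTaskLike* Task = new FTaskLike();                      // non-atomic init of RefCount
	GPublishedTask.store(Task, std::memory_order_release);  // release store makes it visible
	Consumer.join();
	delete Task;
	return 0;
}
```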
WARNING: RaceDetector: data race detected
AtomicReadWrite of size 4 at 00000250F4051E28 made at clock 765565 by thread GameThread (68648) which is now at clock 765565:
Engine\Source\Runtime\Core\Public\Tasks\TaskPrivate.h (109): [Inline Frame] UE::Tasks::Private::FTaskBase::AddRef() 0x00007FF689A4B169
Engine\Source\Runtime\Core\Private\Tasks\Pipe.cpp (27): UE::Tasks::FPipe::PushIntoPipe() 0x00007FF689A4B169
Engine\Source\Runtime\Core\Private\Tasks\TaskPrivate.cpp (273): UE::Tasks::Private::FTaskBase::TryPushIntoPipe() 0x00007FF689A4DA02
Engine\Source\Runtime\Core\Public\Tasks\TaskPrivate.h (526): UE::Tasks::Private::FTaskBase::TryUnlock() 0x00007FF67466F472
Engine\Source\Runtime\Core\Public\Tasks\TaskPrivate.h (303): [Inline Frame] UE::Tasks::Private::FTaskBase::TryLaunch() 0x00007FF68956228D
Engine\Source\Runtime\Core\Public\Tasks\Pipe.h (75): [Inline Frame] UE::Tasks::FPipe::Launch() 0x00007FF68956228D
Engine\Source\Runtime\Core\Private\Stats\Stats2.cpp (1404): FThreadStats::WaitForStats() 0x00007FF68956228D
Engine\Source\Runtime\Core\Private\Stats\Stats2.cpp (239): FStats::AdvanceFrame() 0x00007FF689560112
Engine\Source\Runtime\Launch\Private\LaunchEngineLoop.cpp (5913): FEngineLoop::Tick() 0x00007FF67450AEFA
Engine\Source\Runtime\Launch\Private\Launch.cpp (83): [Inline Frame] EngineTick() 0x00007FF67451E498
Engine\Source\Runtime\Launch\Private\Launch.cpp (271): GuardedMain() 0x00007FF67451E498
Engine\Source\Runtime\Launch\Private\Windows\LaunchWindows.cpp (108): GuardedMainWrapper() 0x00007FF67454ACE9
Engine\Source\Runtime\Launch\Private\Windows\LaunchWindows.cpp (270): LaunchWindowsStartup() 0x00007FF67454B490
Engine\Source\Runtime\Launch\Private\Windows\LaunchWindows.cpp (337): WinMain() 0x00007FF67454B632
Previous Write of size 4 at 00000250F4051E28 at clock 2524241 by thread Foreground Worker #1 (63516 alive) which is now at clock 2524334:
VC\Tools\MSVC\14.41.34120\INCLUDE\atomic (964): [Inline Frame] std::_Atomic_storage<unsigned int,4>::_Atomic_storage() 0x00007FF689567AE3
VC\Tools\MSVC\14.41.34120\INCLUDE\atomic (1482): [Inline Frame] std::_Atomic_integral<unsigned int,4>::_Atomic_storage() 0x00007FF689567AE3
VC\Tools\MSVC\14.41.34120\INCLUDE\atomic (1673): [Inline Frame] std::_Atomic_integral_facade<unsigned int>::_Atomic_storage() 0x00007FF689567AE3
VC\Tools\MSVC\14.41.34120\INCLUDE\atomic (2166): [Inline Frame] std::atomic<unsigned int>::_Atomic_storage() 0x00007FF689567AE3
Engine\Source\Runtime\Core\Public\Tasks\TaskPrivate.h (135): [Inline Frame] UE::Tasks::Private::FTaskBase::FTaskBase() 0x00007FF689567AE3
Engine\Source\Runtime\Core\Public\Tasks\TaskPrivate.h (730): [Inline Frame] UE::Tasks::Private::TExecutableTaskBase<lambda at .\Runtime\Core\Private\Stats\Stats2.cpp:1120:41',void,void>::TExecutableTaskBase() 0x00007FF689567AE3
Engine\Source\Runtime\Core\Public\Tasks\TaskPrivate.h (750): [Inline Frame] UE::Tasks::Private::TExecutableTask<lambda at .\Runtime\Core\Private\Stats\Stats2.cpp:1120:41'>::TExecutableTask() 0x00007FF689567AE3
Engine\Source\Runtime\Core\Public\Tasks\TaskPrivate.h (756): [Inline Frame] UE::Tasks::Private::TExecutableTask<lambda at .\Runtime\Core\Private\Stats\Stats2.cpp:1120:41'>::Create() 0x00007FF689567AE3
Engine\Source\Runtime\Core\Public\Tasks\Pipe.h (73): [Inline Frame] UE::Tasks::FPipe::Launch() 0x00007FF689567AE3
Engine\Source\Runtime\Core\Private\Stats\Stats2.cpp (1120): FThreadStats::SendMessage_Async() 0x00007FF689567AE3
Engine\Source\Runtime\Core\Private\Stats\Stats2.cpp (1188): FThreadStats::FlushRegularStats() 0x00007FF689567362
Engine\Source\Runtime\Core\Private\Stats\Stats2.cpp (1112): FThreadStats::Flush() 0x00007FF689565938
Engine\Source\Runtime\Core\Public\Stats\Stats2.h (1492): [Inline Frame] FThreadStats::AddMessage() 0x00007FF68957B899
Engine\Source\Runtime\Core\Private\Stats\Stats2.cpp (443): FStatGroupEnableManager::UpdateMemoryUsage() 0x00007FF68957B899
Engine\Source\Runtime\Core\Private\Stats\Stats2.cpp (969): FStatsThread::Process() 0x00007FF6895B115A
Engine\Source\Runtime\Core\Private\Stats\Stats2.cpp (934): FStatsThread::StatMessage() 0x00007FF6895B07D3
Engine\Source\Runtime\Core\Private\Stats\Stats2.cpp (1120): [Inline Frame] FThreadStats::SendMessage_Async::<lambda_65>::operator()() 0x00007FF6895B6EF5
Engine\Source\Runtime\Core\Public\Templates\Invoke.h (47): [Inline Frame] Invoke() 0x00007FF6895B6EF5
Engine\Source\Runtime\Core\Public\Tasks\TaskPrivate.h (717): [Inline Frame] UE::Tasks::Private::TExecutableTaskBase<lambda at .\Runtime\Core\Private\Stats\Stats2.cpp:1120:41',void,void>::TryExecuteTaskVirtual::<lambda_1>::operator()() 0x00007FF6895B6EF5
...
Engine\Source\Runtime\Core\Private\Sanitizer\RaceDetectorWindows.cpp (155): InstrumentedStartThread() 0x00007FF689515DFF
Last known sync clock between both threads is 2523534:
Engine\Source\Runtime\Core\Private\ProfilingDebugging\MetadataTrace.cpp (16): FMetadataTrace::SaveStack() 0x00007FF689499683
I looked over 32857964 and it seems it contains many different things which we don't have in our UE5.2 version. Our version of Pipe.cpp is likely the same as 23850955 (//UE/Release-5.2). It seems that to backport it we would need to backport some other dependent changes as well.
I'm wondering whether it's possible to find the easiest cure for that issue. For instance, looking at the CL you mentioned, there is different synchronization code for LastTask_Local (in FPipe::PushIntoPipe) and Task_Local (FPipe::ClearTask): it adds acq_rel semantics on line 18 and changes the barriers to acq_rel/acquire on line 36. However, it seems to cause some sort of livelock on the main thread in WaitForStats() when I tried to apply it.
With regard to callstack verbosity: I'm actually building a Development build, not Debug. This is the only report I've seen so far that went inside std library code. Other reports end up in UE code.
There are 2 other data races which look like obvious fixes at first glance, but I didn't find fixes for them in Perforce. I guess that's mostly because the actual fix was to eliminate thread contention during PSO compilation, and maybe in how the render thread works with the RHI thread.
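To show what I mean by an obvious fix (a generic illustration only, not the actual engine code): in the first report below, one thread bumps a stat counter with an interlocked increment while another thread reads it with a plain load; making the read atomic as well (even relaxed) would be enough to make it data-race free:

```cpp
// Generic illustration of the suspected fix pattern; names are hypothetical.
#include <atomic>
#include <cstdint>

static std::atomic<int64_t> GNumPSOsLogged { 0 }; // stand-in for the stat counter

// Writer side (e.g. RHI thread updating stats):
void UpdateStatsIllustration()
{
	GNumPSOsLogged.fetch_add(1, std::memory_order_relaxed);
}

// Reader side (e.g. render thread polling the counter): a plain read here is what gets
// flagged; a relaxed atomic load suffices when only the value itself matters.
int64_t NumPSOsLoggedIllustration()
{
	return GNumPSOsLogged.load(std::memory_order_relaxed);
}
```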
P.S.: I had to exclude some lines from the callstacks as UDN restricts long messages.
First:
WARNING: RaceDetector: data race detected
Read of size 8 at 000001C65E8C2DD8 made at clock 2462619 by thread RenderThread (70652) which is now at clock 2462619:
\Engine\Source\Runtime\RHI\Private\PipelineFileCache.cpp (3964): FPipelineFileCacheManager::NumPSOsLogged() 0x00007FF691D6C2CD
\Engine\Source\Runtime\RenderCore\Private\ShaderPipelineCache.cpp (1570): FShaderPipelineCache::ReadyForAutoSave() 0x00007FF692189605
\Engine\Source\Runtime\RenderCore\Private\ShaderPipelineCache.cpp (1779): FShaderPipelineCache::Tick() 0x00007FF69218DFDB
\Engine\Source\Runtime\RenderCore\Private\RenderingThread.cpp (285): TickRenderingTickables() 0x00007FF69202D583
\Engine\Source\Runtime\RenderCore\Private\RenderingThread.cpp (633): [Inline Frame] FRenderingThreadTickHeartbeat::Run::<lambda_1>::operator()() 0x00007FF6920616DE
\Engine\Source\Runtime\RenderCore\Private\RenderingThread.cpp (417): RenderingThreadMain() 0x00007FF69202E118
\Engine\Source\Runtime\RenderCore\Private\RenderingThread.cpp (542): FRenderingThread::Run() 0x00007FF6920571F4
\Engine\Source\Runtime\Core\Private\Windows\WindowsRunnableThread.cpp (150): FRunnableThreadWin::Run() 0x00007FF68BA0CEFD
\Engine\Source\Runtime\Core\Private\Windows\WindowsRunnableThread.cpp (81): FRunnableThreadWin::GuardedRun() 0x00007FF68BA0CD0C
\Engine\Source\Runtime\Core\Private\Windows\WindowsRunnableThread.h (38): FRunnableThreadWin::_ThreadProc() 0x00007FF68BA46B04
\Engine\Source\Runtime\Core\Private\Sanitizer\RaceDetectorWindows.cpp (155): InstrumentedStartThread() 0x00007FF68B53731F
Previous AtomicReadWrite of size 8 at 000001C65E8C2DD8 at clock 628104 by thread RHIThread (75164 alive) which is now at clock 629437:
\Engine\Source\Runtime\Core\Public\Windows\WindowsPlatformAtomics.h (52): [Inline Frame] FWindowsPlatformAtomics::InterlockedIncrement() 0x00007FF691D80A3A
\Engine\Source\Runtime\RHI\Private\PipelineFileCache.cpp (230): [Inline Frame] FPipelineStateStats::UpdateStats() 0x00007FF691D80A3A
\Engine\Source\Runtime\RHI\Private\PipelineStateCache.cpp (459): [Inline Frame] FPipelineState::AddUse() 0x00007FF691D80A3A
\Engine\Source\Runtime\RHI\Private\PipelineStateCache.cpp (2866): ExecuteSetGraphicsPipelineState() 0x00007FF691D80A3A
\Engine\Source\Runtime\RHI\Public\RHICommandListCommandExecutes.inl (277): FRHICommandSetGraphicsPipelineState::Execute() 0x00007FF691E459E0
\Engine\Source\Runtime\RHI\Public\RHICommandList.h (1147): FRHICommand<FRHICommandSetGraphicsPipelineState,FRHICommandSetGraphicsPipelineStateString1593>::ExecuteAndDestruct() 0x00007FF691DA5DAB
\Engine\Source\Runtime\RHI\Private\RHICommandList.cpp (605): FRHICommandListBase::Execute() 0x00007FF691E5341C
\Engine\Source\Runtime\RHI\Private\RHICommandList.cpp (917): [Inline Frame] FRHICommandListImmediate::ExecuteAndReset::<lambda_16>::operator()() 0x00007FF691E87693
\Engine\Source\Runtime\Core\Private\Windows\WindowsRunnableThread.cpp (150): FRunnableThreadWin::Run() 0x00007FF68BA0CEFD
\Engine\Source\Runtime\Core\Private\Windows\WindowsRunnableThread.cpp (81): FRunnableThreadWin::GuardedRun() 0x00007FF68BA0CD0C
\Engine\Source\Runtime\Core\Private\Windows\WindowsRunnableThread.h (38): FRunnableThreadWin::_ThreadProc() 0x00007FF68BA46B04
\Engine\Source\Runtime\Core\Private\Sanitizer\RaceDetectorWindows.cpp (155): InstrumentedStartThread() 0x00007FF68B53731F
Last known sync clock between both threads is 627002:
\Engine\Source\Runtime\Core\Public\Containers\LockFreeList.h (198): [Inline Frame] FIndexedPointer::AtomicRead() 0x00007FF68AEA9741
\Engine\Source\Runtime\Core\Public\Containers\LockFreeList.h (704): FStallingTaskQueue<FBaseGraphTask,64,2>::Push() 0x00007FF68AEA9741
Second:
WARNING: RaceDetector: data race detected
Write of size 8 at 000001C6C1F7CC20 made at clock 62187 by thread Background Worker #5 (55424) which is now at clock 62187:
\Engine\Source\Runtime\Core\Public\Templates\RefCounting.h (237): TRefCountPtr<FRHIGraphicsPipelineState>::operator=() 0x00007FF691DAB33E
\Engine\Source\Runtime\RHI\Private\PipelineStateCache.cpp (1822): FCompilePipelineStateTask::CompilePSO() 0x00007FF691DAEB0D
\Engine\Source\Runtime\RHI\Private\PipelineStateCache.cpp (1758): [Inline Frame] FCompilePipelineStateTask::DoTask() 0x00007FF691DC9F99
\Engine\Source\Runtime\Core\Public\Async\TaskGraphInterfaces.h (1409): TGraphTask<FCompilePipelineStateTask>::ExecuteTask() 0x00007FF691DC9F99
\Engine\Source\Runtime\Core\Public\Async\TaskGraphInterfaces.h (952): [Inline Frame] FBaseGraphTask::Execute() 0x00007FF68AEAB13D
\Engine\Source\Runtime\Core\Private\Async\TaskGraph.cpp (2210): [Inline Frame] FTaskGraphCompatibilityImplementation::QueueTask::<lambda_1>::operator()() 0x00007FF68AEAB13D
\Engine\Source\Runtime\Core\Private\Async\Fundamental\Scheduler.cpp (377): LowLevelTasks::FScheduler::WorkerMain() 0x00007FF68AE4CBA7
\Engine\Source\Runtime\Core\Private\Async\Fundamental\Scheduler.cpp (70): [Inline Frame] LowLevelTasks::FScheduler::CreateWorker::<lambda_1>::operator()() 0x00007FF68AED3A21
\Engine\Source\Runtime\Core\Public\Templates\Invoke.h (47): [Inline Frame] Invoke() 0x00007FF68AED3A21
\Engine\Source\Runtime\Core\Private\HAL\Thread.cpp (74): FThreadImpl::Run() 0x00007FF68B143F6C
\Engine\Source\Runtime\Core\Private\Windows\WindowsRunnableThread.cpp (150): FRunnableThreadWin::Run() 0x00007FF68BA0CEFD
\Engine\Source\Runtime\Core\Private\Windows\WindowsRunnableThread.cpp (81): FRunnableThreadWin::GuardedRun() 0x00007FF68BA0CD0C
\Engine\Source\Runtime\Core\Private\Windows\WindowsRunnableThread.h (38): FRunnableThreadWin::_ThreadProc() 0x00007FF68BA46B04
\Engine\Source\Runtime\Core\Private\Sanitizer\RaceDetectorWindows.cpp (155): InstrumentedStartThread() 0x00007FF68B53731F
Previous Read of size 8 at 000001C6C1F7CC20 at clock 49029 by thread RenderThread (70652 alive) which is now at clock 281493:
\Engine\Source\Runtime\Core\Public\Templates\RefCounting.h (268): [Inline Frame] TRefCountPtr<FRHIGraphicsPipelineState>::operator FRHIGraphicsPipelineState *() 0x00007FF691D70ACD
\Engine\Source\Runtime\RHI\Private\PipelineStateCache.cpp (678): SetGraphicsPipelineState() 0x00007FF691D70ACD
\Engine\Source\Runtime\RenderCore\Private\ShaderPipelineCache.cpp (1066): FShaderPipelineCacheTask::Precompile() 0x00007FF692181041
\Engine\Source\Runtime\RenderCore\Private\ShaderPipelineCache.cpp (1449): FShaderPipelineCacheTask::PrecompilePipelineBatch() 0x00007FF6921870F1
\Engine\Source\Runtime\RenderCore\Private\ShaderPipelineCache.cpp (1836): FShaderPipelineCacheTask::Tick() 0x00007FF69218F554
\Engine\Source\Runtime\RenderCore\Private\ShaderPipelineCache.cpp (1806): FShaderPipelineCache::Tick() 0x00007FF69218E2B6
\Engine\Source\Runtime\RenderCore\Private\RenderingThread.cpp (285): TickRenderingTickables() 0x00007FF69202D583
\Engine\Source\Runtime\Core\Public\Async\TaskGraphInterfaces.h (952): [Inline Frame] FBaseGraphTask::Execute() 0x00007FF68AEA7576
\Engine\Source\Runtime\Core\Private\Async\TaskGraph.cpp (784): FNamedTaskThread::ProcessTasksNamedThread() 0x00007FF68AEA7576
\Engine\Source\Runtime\Core\Private\Async\TaskGraph.cpp (670): FNamedTaskThread::ProcessTasksUntilQuit() 0x00007FF68AEA465D
\Engine\Source\Runtime\Core\Private\Async\TaskGraph.cpp (2349): FTaskGraphCompatibilityImplementation::ProcessThreadUntilRequestReturn() 0x00007FF68AEA18D2
\Engine\Source\Runtime\RenderCore\Private\RenderingThread.cpp (417): RenderingThreadMain() 0x00007FF69202E118
\Engine\Source\Runtime\RenderCore\Private\RenderingThread.cpp (542): FRenderingThread::Run() 0x00007FF6920571F4
\Engine\Source\Runtime\Core\Private\Windows\WindowsRunnableThread.cpp (150): FRunnableThreadWin::Run() 0x00007FF68BA0CEFD
\Engine\Source\Runtime\Core\Private\Windows\WindowsRunnableThread.cpp (81): FRunnableThreadWin::GuardedRun() 0x00007FF68BA0CD0C
\Engine\Source\Runtime\Core\Private\Windows\WindowsRunnableThread.h (38): FRunnableThreadWin::_ThreadProc() 0x00007FF68BA46B04
\Engine\Source\Runtime\Core\Private\Sanitizer\RaceDetectorWindows.cpp (155): InstrumentedStartThread() 0x00007FF68B53731F
Last known sync clock between both threads is 49023:
\Engine\Source\Runtime\Core\Private\Windows\WindowsPlatformProcess.cpp (1554): FEventWin::Wait() 0x00007FF68BA033CE
The reason is the following: our project is huge, it is live now, and we have many local changes, various plugins of our own, and our own engine systems, so upgrading by itself is a painful and slow process. We are actually upgrading to 5.6 in parallel, but it is not expected to be merged earlier than 6 months from now.