We’ve been seeing lock-ups on PS4 which we’ve traced to TaskGraph, possibly always during a garbage collect. The main thread calls BroadcastSlow_OnlyUseForSpecialPurposes to wake up the task threads, but sometimes one of them remains in the pthread_cond_wait in GenericPlatformProcess.cpp. Often this results in the main thread getting stuck waiting for a thread which either never received the request to wake up, or possibly didn’t receive a job to keep it awake.
We have a local hack to get around the problem, replacing the end of FTaskGraphInterface::BroadcastSlow_OnlyUseForSpecialPurposes with:
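For anyone following along, one generic way a thread can stay parked in a condition wait is a lost wake-up: the signal is sent while the thread is not yet waiting and nothing records that it happened. We have not established that this is the cause here. The sketch below is plain standard C++, not the engine's event code, and SketchEvent is a made-up name. It only shows the usual guard, which is a flag written under the same mutex the waiter holds.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical manual-reset event for illustration; not the engine's implementation.
class SketchEvent
{
public:
    // Record the signal under the lock so a waiter that arrives later still sees it.
    void Trigger()
    {
        {
            std::lock_guard<std::mutex> Lock(Mutex);
            bTriggered = true;
        }
        Cond.notify_all();
    }

    // Returns true if the event was already triggered, or becomes triggered
    // before Timeout elapses; returns false on timeout.
    bool WaitFor(std::chrono::milliseconds Timeout)
    {
        assert(Timeout.count() >= 0);
        std::unique_lock<std::mutex> Lock(Mutex);
        // The predicate is re-checked under the lock, which also covers spurious wake-ups.
        return Cond.wait_for(Lock, Timeout, [this] { return bTriggered; });
    }

private:
    std::mutex Mutex;
    std::condition_variable Cond;
    bool bTriggered = false;
};
```

With this shape, calling Trigger() before WaitFor() still lets the waiter through, because the flag outlives the notification.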
    double StartTime = FPlatformTime::Seconds();
    // NT_START NT_CP_HACK NT_CP_THREADLOCK
    double SnoozeTime = StartTime;
    TaskGraphImplementationSingleton->StartAllTaskThreads( bDoBackgroundThreads ); // Because we quite likely didn't actually do this
    // NT_END
    while (StallForTaskThread.GetValue())
    {
        FPlatformProcess::SleepNoStats(.0001f);
        // this is probably not needed, but task are not generally tested to never miss a start
        // NT_START NT_CP_THREADLOCK
        if (FPlatformTime::Seconds() - SnoozeTime > 0.5)
        {
            SnoozeTime = FPlatformTime::Seconds();
            // NT_END
            TaskGraphImplementationSingleton->StartAllTaskThreads(bDoBackgroundThreads);
        }
        else if (FPlatformTime::Seconds() - StartTime > 3.0)
        {
            StartTime = FPlatformTime::Seconds();
            UE_LOG(LogTaskGraph, Error, TEXT("Broadcast failed after three seconds"));
        }
    }
which seems to avoid the problem in the repro we have. This implies that even a call to TaskGraphImplementationSingleton->StartAllTaskThreads doesn’t reliably wake all threads. I note that the initial half-second delay is a problem, because one of the task-graph threads is often woken more than once, being in the list StalledUnnamedThreads more than once; as such, it seems we should send the message to all threads before waiting.
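To make the "signal everyone first, then wait and retry" idea concrete outside the engine, here is a minimal standard C++ sketch of the same loop shape. Everything in it is hypothetical: BroadcastAndWait is not an engine function, WakeAll stands in for StartAllTaskThreads, and Pending stands in for the StallForTaskThread counter. It shows the control flow only and makes no claim about how the engine's threads behave.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper, not an engine API.
// WakeAll: callback that signals every worker (stand-in for StartAllTaskThreads).
// Pending: number of workers that have not yet acknowledged (stand-in for StallForTaskThread).
// Returns true once Pending reaches zero, false if Timeout elapses first.
bool BroadcastAndWait(const std::function<void()>& WakeAll,
                      const std::atomic<int>& Pending,
                      std::chrono::milliseconds ResendInterval,
                      std::chrono::milliseconds Timeout)
{
    assert(WakeAll); // an empty callback would make the loop pointless

    using Clock = std::chrono::steady_clock;
    const Clock::time_point Start = Clock::now();
    Clock::time_point LastSend = Start;

    WakeAll(); // signal every worker up front, before any waiting

    while (Pending.load() > 0)
    {
        std::this_thread::sleep_for(std::chrono::microseconds(100));
        const Clock::time_point Now = Clock::now();
        if (Now - Start > Timeout)
        {
            return false; // caller decides whether to log, retry, or abort
        }
        if (Now - LastSend > ResendInterval)
        {
            LastSend = Now;
            WakeAll(); // periodic retry in case a wake-up was missed
        }
    }
    return true;
}
```

Unlike the hack above, this version checks the overall timeout independently of the resend timer and hands the failure back to the caller instead of logging and looping forever. Whether giving up is acceptable during a garbage collect is a separate question.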
We may also have seen a case where the main thread waited for only three of the four task-graph threads before leaving BroadcastSlow_OnlyUseForSpecialPurposes, and then thread-locked at a later point in the garbage collect, so it’s not clear that the hack we’ve got will fix our problems.