Thread lock on PS4 from TaskGraph

We’ve been seeing lock-ups on PS4 which we’ve traced to TaskGraph, perhaps always during a garbage collect. The main thread calls BroadcastSlow_OnlyUseForSpecialPurposes to wake up the task threads, but sometimes one of them remains in the pthread_cond_wait in GenericPlatformProcess.cpp. This often results in the main thread getting stuck waiting for a thread which has not received a request to wake up, or which possibly didn’t receive a job to keep it awake.
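To illustrate the kind of failure described above, here is a minimal portable sketch of a wake event whose request is recorded in a flag under the mutex, so a trigger that arrives before the waiter blocks is not lost. It assumes nothing about the actual PS4 or engine event implementation; `WakeEvent` and `TriggerBeforeWaitIsObserved` are hypothetical names used only for this example.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a task thread's wake event; this is NOT the engine's event class.
class WakeEvent
{
public:
    // Record the wake request under the lock, then notify the waiter.
    void Trigger()
    {
        {
            std::lock_guard<std::mutex> Lock(Mutex);
            bTriggered = true;
        }
        Condition.notify_one();
    }

    // Returns true if a wake request was seen before the timeout. Auto-resets the flag.
    bool Wait(std::chrono::milliseconds Timeout)
    {
        assert(Timeout.count() >= 0); // a negative timeout is a caller error
        std::unique_lock<std::mutex> Lock(Mutex);
        // The predicate is checked before blocking, so an earlier Trigger() is still observed.
        const bool bWoken = Condition.wait_for(Lock, Timeout, [this] { return bTriggered; });
        bTriggered = false;
        return bWoken;
    }

private:
    std::mutex Mutex;
    std::condition_variable Condition;
    bool bTriggered = false;
};

// Triggers the event before the waiter thread starts waiting and
// returns whether the waiter still observed the wake request.
bool TriggerBeforeWaitIsObserved()
{
    WakeEvent Event;
    Event.Trigger();
    bool bWoken = false;
    std::thread Waiter([&] { bWoken = Event.Wait(std::chrono::milliseconds(2000)); });
    Waiter.join();
    return bWoken;
}
```

If the flag were not checked under the same mutex, a notify sent just before the waiter blocked could be missed, which would look like a thread left sitting in its condition wait.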

We have a local hack to get around the problem, which replaces the end of FTaskGraphInterface::BroadcastSlow_OnlyUseForSpecialPurposes with:

double StartTime = FPlatformTime::Seconds();
// NT_START   NT_CP_HACK   NT_CP_THREADLOCK
double SnoozeTime = StartTime;
TaskGraphImplementationSingleton->StartAllTaskThreads( bDoBackgroundThreads ); // Because we quite likely didn't actually do this
// NT_END
while (StallForTaskThread.GetValue())
{
	FPlatformProcess::SleepNoStats(.0001f);
	// this is probably not needed, but task are not generally tested to never miss a start
	// NT_START NT_CP_THREADLOCK
	if (FPlatformTime::Seconds() - SnoozeTime > 0.5)
	{
		SnoozeTime = FPlatformTime::Seconds();
	// NT_END
		TaskGraphImplementationSingleton->StartAllTaskThreads(bDoBackgroundThreads);
	}
	else if (FPlatformTime::Seconds() - StartTime > 3.0)
	{
		StartTime = FPlatformTime::Seconds();
		UE_LOG(LogTaskGraph, Error, TEXT("Broadcast failed after three seconds"));
	}
}

which seems to avoid the problem in the repro we have. This implies that even TaskGraphImplementationSingleton->StartAllTaskThreads doesn’t wake all threads. I note that the initial half-second delay is a problem, because it’s often the case that one of the task-graph threads is woken more than once, appearing in the list StalledUnnamedThreads more than once; as such, it seems we should send the message to all threads before waiting.
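The "send to all threads first, then wait and periodically resend" idea can be sketched outside the engine as follows. This is only an illustration in standard C++ under stated assumptions: `StalledWorker`, `WakeAllWorkers`, `BroadcastAndWait` and `RunBroadcastDemo` are hypothetical names, the pending counter stands in for StallForTaskThread, and the 500 ms resend interval simply mirrors the 0.5 second value in the hack above.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for one stalled task thread; not an engine type.
struct StalledWorker
{
    std::mutex Mutex;
    std::condition_variable Condition;
    bool bWakeRequested = false;
};

// Sends the wake request to every worker without waiting on any of them.
void WakeAllWorkers(std::vector<StalledWorker>& Workers)
{
    for (StalledWorker& Worker : Workers)
    {
        {
            std::lock_guard<std::mutex> Lock(Worker.Mutex);
            Worker.bWakeRequested = true;
        }
        Worker.Condition.notify_one();
    }
}

// Wakes everyone up front, then polls the pending counter and resends the
// wake request every ResendInterval. Returns how many resends were needed.
int BroadcastAndWait(std::vector<StalledWorker>& Workers,
                     std::atomic<int>& Pending,
                     std::chrono::milliseconds ResendInterval)
{
    WakeAllWorkers(Workers);
    int NumResends = 0;
    auto LastSend = std::chrono::steady_clock::now();
    while (Pending.load() > 0)
    {
        std::this_thread::sleep_for(std::chrono::microseconds(100));
        const auto Now = std::chrono::steady_clock::now();
        if (Now - LastSend > ResendInterval)
        {
            LastSend = Now;
            WakeAllWorkers(Workers);
            ++NumResends;
        }
    }
    return NumResends;
}

// Demo: starts NumWorkers waiting threads, broadcasts, and returns the
// number of workers still pending once the broadcast has returned.
int RunBroadcastDemo(int NumWorkers)
{
    assert(NumWorkers > 0);
    std::vector<StalledWorker> Workers(NumWorkers);
    std::atomic<int> Pending(NumWorkers);
    std::vector<std::thread> Threads;
    for (StalledWorker& Worker : Workers)
    {
        Threads.emplace_back([&Worker, &Pending]
        {
            std::unique_lock<std::mutex> Lock(Worker.Mutex);
            Worker.Condition.wait(Lock, [&Worker] { return Worker.bWakeRequested; });
            Lock.unlock();
            Pending.fetch_sub(1); // acknowledge the wake
        });
    }
    BroadcastAndWait(Workers, Pending, std::chrono::milliseconds(500));
    for (std::thread& Thread : Threads)
    {
        Thread.join();
    }
    return Pending.load();
}
```

Because every worker is signalled before the main thread starts polling, no worker has to wait for the first resend interval to elapse; the resend remains only as a safety net.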

We may have also seen a case where the main thread only waited for three of the four task-graph threads before leaving BroadcastSlow_OnlyUseForSpecialPurposes, and thread-locked at a later point in the garbage collect, so it’s not clear that the hack we’ve got will fix our problems.