Engine tick deadlock

The only specificity I can think of in our use case is that some of our tasks depend on the engine tick. For example:

Our task does some work, then waits for an event from the game thread, and starts over.

I would be tempted to conclude that having one background thread still executing a task might prevent other threads from picking up pending tasks…

As far as we know, using std::thread instead of Unreal Engine tasks did not expose the issue.

Stacks:

[Image Removed]

Foreground workers:

[Image Removed]

All workers:

[Image Removed]

except one (ours), which is waiting to be unblocked by the engine's next tick:

[Image Removed]

I would expect the background workers to pick up tasks; unless we cause starvation by blocking all workers, the application should keep running and ticking…

Please advise.

[Attachment Removed]

Steps to Reproduce
Hi,

We have a mechanism that converts data into Unreal native assets (which are then used as levels). During the conversion, we start the editor as a commandlet without any RHI to minimize footprint and favor efficiency. However, we occasionally see a deadlock in the process (it is non-deterministic and its frequency varies by platform).

As far as we understand the problem, the tick cannot complete because it waits for a task that never gets executed. This is completely unexpected, as the background (and foreground) threads have nothing to do…

Call stacks will follow.

[Attachment Removed]

For completeness, the game thread is waiting for the physics solver task FPhysicsSolverProcessPushDataTask, which is created each frame. As far as I could see, the task is usually processed by a foreground worker and occasionally by a background worker.

[Attachment Removed]

Hey Basile,

Could you provide us an image exported from the parallel stacks view in Visual Studio?

That would include all the threads and make it easier to see if there’s anything obvious going on.

In general, why are you waiting in your task for the game thread?

Such waits are usually an anti-pattern.

If you need to schedule further work on the game thread, you can do that as a follow-up task or just set it up through task dependencies/prerequisites from the start.

If your task is a long-running one that does work every frame that it gets from the game thread, then the task system is likely the wrong tool. You could schedule a task each frame instead of having a blocking task that waits.

In these cases you’ve seen, is there no task executing the FPhysicsSolverProcessPushDataTask at all?

Kind Regards,

Sebastian

[Attachment Removed]

Hi Sebastian,

Following your request, I ran the process tonight and generated the view:

[Image Removed]

I also attached it to the post in case it is not readable on the forum.

I had never used that view; it is quite handy indeed. On the far left, we can see the pattern I tried to describe. There are 30 workers and 29 seem idle. The last one is running one of our own tasks, which is waiting for the next tick (which in turn requires the physics task to complete).

Now, trying to answer your questions:

> In general, why are you waiting in your task for the game thread?
> Such waits are usually an anti-pattern.

--> We can load data from an external format and stream it directly at runtime. To do so, loading is broken into small pieces: some sections are executed synchronously (on the game thread) and others in asynchronous tasks. Historically, everything ran on the game thread, but we are offloading work to tasks to get a better and smoother frame rate. When a step requires the game thread, the async task waits for the game thread to pick it up before continuing. We use a semaphore to avoid data races.

> If you need to schedule further work on the game thread you can do that as a follow-up task or just set it up through task dependencies/prerequisites from the start.

--> I guess we might but that would require more refactoring and possibly lead to code being less readable (at least for us).

> If your task is a long-running one that does work every frame that it gets from the game thread then the task system is likely the wrong tool. You could schedule a task each frame instead of having a blocking task that waits.

--> This is not what we do. We have a lengthy task that may take many frames to complete and which, at some point, requires game thread action.

> In these cases you’ve seen, is there no task executing the FPhysicsSolverProcessPushDataTask at all?

--> As I said, I looked at all the workers and I cannot see that task being executed (or pending). It feels like the scheduler just dropped it, which would obviously be an issue. I am now running a version of the engine that I built locally. If there is any further inspection I can do, let me know.

One last note: this happens in the editor. Whenever the deadlock was seen, we were creating assets (loading from another standard and saving as Unreal native assets). So far, it has never been observed at runtime / in game mode.

Thanks,

[Attachment Removed]

Thanks for the info!

> I guess we might but that would require more refactoring and possibly lead to code being less readable (at least for us).

The task graph has a lot of complexity to make sure it works correctly in all scenarios. That includes oversubscription when all cores are busy, task stealing on waits, and a lot more details. These only work correctly if you use our primitives and the task graph properly knows about any dependencies.

By using your custom wait based on a semaphore from the standard library, you are essentially bypassing these protections, so it would not be surprising if your wait is causing the deadlock.

If your task really is a loop that waits on individual work from the game thread, you should be able to replace the semaphore call with a dispatch of a task with minimal effort. If you need follow-up work on the game thread, you can schedule it at the same time as a second task, with the first as a dependency.
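To make the shape of that concrete, here is a minimal, engine-independent sketch in plain standard C++. It is not the Unreal task API: `GameThreadQueue`, `Construction` and `RunConstruction` are hypothetical names made up for this illustration, and the "async work" is a placeholder value. The point is only that the worker never parks on a semaphore; it hands the game-thread part over as a follow-up and returns, and the follow-up launches the next step.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical stand-in for "work the game thread runs during its tick".
class GameThreadQueue
{
public:
    // Callable from any thread.
    void Post(std::function<void()> Work)
    {
        std::lock_guard<std::mutex> Lock(Mutex);
        Pending.push(std::move(Work));
    }

    // Called by the "game thread" once per tick: runs what is already queued.
    void PumpOnce()
    {
        std::queue<std::function<void()>> Local;
        {
            std::lock_guard<std::mutex> Lock(Mutex);
            std::swap(Local, Pending);
        }
        while (!Local.empty())
        {
            Local.front()();
            Local.pop();
        }
    }

private:
    std::mutex Mutex;
    std::queue<std::function<void()>> Pending;
};

// Hypothetical multi-step job: each step has an async part and a game-thread part.
class Construction
{
public:
    Construction(GameThreadQueue& InQueue, int InSteps)
        : Queue(InQueue), StepsLeft(InSteps) {}

    void LaunchStep()
    {
        // Assigning a new future joins the previous worker, which has already
        // posted its follow-up and is about to exit, so this does not stall.
        Worker = std::async(std::launch::async, [this]
        {
            const int Result = 42; // placeholder for the real async work
            // Hand over the game-thread part instead of blocking until the tick.
            Queue.Post([this, Result] { OnGameThread(Result); });
        });
    }

    bool IsDone() const { return StepsLeft <= 0; }
    int GameThreadCalls() const { return Calls; }

private:
    // Runs on the "game thread" only; it is the only code touching the counters.
    void OnGameThread(int Result)
    {
        assert(Result == 42);
        ++Calls;
        if (--StepsLeft > 0)
        {
            LaunchStep();
        }
    }

    GameThreadQueue& Queue;
    int StepsLeft;
    int Calls = 0;
    std::future<void> Worker;
};

// Simulated tick loop; returns how many game-thread parts were executed.
int RunConstruction(int Steps)
{
    if (Steps <= 0)
    {
        return 0;
    }
    GameThreadQueue Queue;
    Construction Job(Queue, Steps);
    Job.LaunchStep();
    while (!Job.IsDone())
    {
        Queue.PumpOnce(); // one "tick"
        std::this_thread::yield();
    }
    return Job.GameThreadCalls();
}
```

In the engine, the queue and the worker would be replaced by tasks and prerequisites so that the scheduler knows about the dependency; the sketch only shows the control flow.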

> I looked at all the workers and I cannot see that task being executed (or pending)

The task would not be pending unless its prerequisites are satisfied.

This likely means that the FPhysicsSolverProcessPushDataTask has additional prerequisites that are not yet satisfied.

That task depends on other physics tasks and any uncompleted physics work from the last frame.

Could you look into the prerequisites for the task that ProcessUntilTasksComplete waits on (which should be the physics push task) and see if there are any that still need to be processed?

If so, you might need to walk the chain recursively until you reach the last prerequisite and find out what it is waiting on; that should lead to the root cause.

Could you show how you are currently scheduling your task? Are you specifying any additional dependencies yourself?
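As an illustration of that walk only, here is a small standalone sketch in standard C++. `TaskNode` is a hypothetical, simplified model of what you would read from the debugger (a name, a completion flag, a list of prerequisites); it is not an engine type. Starting from the task being waited on, it collects the incomplete tasks that have no incomplete prerequisite of their own, which are the ones worth inspecting first.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical, simplified view of a task as read from the debugger.
struct TaskNode
{
    std::string Name;
    bool bCompleted;
    std::vector<const TaskNode*> Prerequisites;
};

// Depth-first walk over incomplete prerequisites. A task is reported when it
// is incomplete and none of its own prerequisites is still incomplete.
static void CollectRootBlockers(const TaskNode& Task,
                                std::unordered_set<const TaskNode*>& Visited,
                                std::vector<const TaskNode*>& OutBlockers)
{
    // Skip completed tasks and tasks that were already visited.
    if (Task.bCompleted || !Visited.insert(&Task).second)
    {
        return;
    }

    bool bHasPendingPrerequisite = false;
    for (const TaskNode* Prerequisite : Task.Prerequisites)
    {
        if (Prerequisite != nullptr && !Prerequisite->bCompleted)
        {
            bHasPendingPrerequisite = true;
            CollectRootBlockers(*Prerequisite, Visited, OutBlockers);
        }
    }

    if (!bHasPendingPrerequisite)
    {
        OutBlockers.push_back(&Task);
    }
}

// Returns the names of the tasks at the end of the incomplete chain(s).
std::vector<std::string> FindRootBlockers(const TaskNode& WaitedOn)
{
    std::unordered_set<const TaskNode*> Visited;
    std::vector<const TaskNode*> Blockers;
    CollectRootBlockers(WaitedOn, Visited, Blockers);

    std::vector<std::string> Names;
    for (const TaskNode* Blocker : Blockers)
    {
        Names.push_back(Blocker->Name);
    }
    return Names;
}
```

If the only name returned is the waited-on task itself, all of its prerequisites are complete and the question becomes why it was never picked up.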

Kind Regards,

Sebastian

[Attachment Removed]

Hi,

Our task is quite simple:

    _workerTask.Launch (
        UE_SOURCE_LOCATION,
        [this] ()
        {
            while (!isFinalized () && !needToStopConstruction ())
            {
                _GameThreadSemaphore.acquire (); // wait for the semaphore
                _GameThreadSemaphore.release (); // release it directly to be able to acquire it again at next loop

                if (_CDBBaseTileConstructionState)
                {
                    updateConstructionFSM ();
                }
            }
        },
        LowLevelTasks::ETaskPriority::BackgroundHigh);

When it gets stuck, it is blocked on the acquire call.

Using the std::thread “equivalent”, the deadlock was *never* observed:

    auto asyncConstruction =
        [this] () -> void
        {
            while (!isFinalized () && !needToStopConstruction ())
            {
                _GameThreadSemaphore.acquire (); // wait for the semaphore
                _GameThreadSemaphore.release (); // release it directly to be able to acquire it again at next loop

                if (_CDBBaseTileConstructionState)
                {
                    updateConstructionFSM ();
                }
            }
        };

    std::thread launch (asyncConstruction);
    launch.detach ();

The physics task is not ours. It is created entirely here:

  • FGraphEventRef FPhysicsSolverBase::AdvanceAndDispatch_External(FReal InDt)

I am not sure what its purpose is here, especially since we are not in game mode. From inspecting the game thread stack and local variables, FPhysicsSolverProcessPushDataTask is the last task I could spot in the dependency chain (and so the first one that should be executed).

I hope this helps,

Basile

[Attachment Removed]

Hi Sebastian,

Is there anything I can check when the deadlock is observed?

I understand we may not be using all the capabilities of the scheduler, but that should not lead to a hard deadlock like this one.

From my perspective, such a deadlock means one of two things:

  • The physics task has *not* been submitted, in which case the game thread should not be waiting on it; that would be a defect in the physics solver.
  • The physics task was submitted but never processed despite the many available threads, which would be a defect in the scheduler.

Having our tasks use the scheduler appears, again to me, to be a trigger but not a cause.

Also, while reviewing other data, I occasionally came across patterns like this one:

The ParallelFor is reporting oversubscription, but there is an idle thread:

[Image Removed]

[Image Removed]

Worker 27 is idle and even looking for a task during the reported “oversubscription” time.

I imagine (and hope) that I am misunderstanding the oversubscription report, but I suspect something is off.

Thanks,

Basile

[Attachment Removed]

Hi Basile,

Based on the information you shared, I have a feeling that the code in updateConstructionFSM is doing operations that are not thread safe and/or happen at the wrong moment. The fact that using a std::thread instead of a task avoids the issue seems to be an interesting lead. I’m guessing that what you shared was pseudo-code, but here is what I feel is happening.

  • Task version: The task is executed on a thread with the “TPri_BelowNormal” priority.
  • std::thread version: The thread inherits the priority of the launching thread. I’m assuming that is the game thread, so that would mean “TPri_Normal”.

What I think is happening is that, with the thread version, your work wakes up earlier in the frame than with the task version and thereby avoids a race condition.

Can you share some details regarding the work happening in updateConstructionFSM ? Could some of the work interfere with Chaos?

Can you share the trace? We might be able to better understand what is happening. Have you instrumented your code? It would be important to see when your task is running. You can use TRACE_CPUPROFILER_EVENT_SCOPE(Name) to add events.

Regards,

Martin

[Attachment Removed]

Hi Martin,

Thanks for the answer. You are right that using std::thread introduces another difference, namely the priority. If I follow correctly, adding FPlatformProcess::SetThreadPriority (EThreadPriority::TPri_Lowest); at the very beginning of our lambda when using std::thread may make the issue occur more often by lowering the priority even further. And using below normal should lead to results similar to using background tasks…

Now that you mention it, and after consulting the team, the issue occurs more often for large tiles. That could be consistent with a task that did not complete.

FSM stands for “finite state machine”. There are multiple actions that could be triggered by that method. One of them is indeed creating the bodies and physics components. It must somehow interact with Chaos, even if I do not see any direct link with task scheduling.

I added a few trace events to get a better understanding. This may be a stupid question, but how do you gather a trace in such a situation:

  • We are starting the editor commandlet, not a dev application.
  • Since the application is deadlocking, I suppose the tracefile argument is not applicable.

Am I missing something ?

I guess I could replace the acquire with a try_acquire with an unreasonably long timeout, coupled with a graceful exit.

I am mostly focused on performance improvements at the moment, but I would like to make progress on this front as well to ensure that database production is reliable.
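Something along these lines, as a rough standalone sketch rather than our actual code. `TimedSemaphore`, `EWaitResult` and `WaitForGameThread` are hypothetical names for the illustration; the small semaphore class is only there so that the snippet compiles without C++20, where std::binary_semaphore::try_acquire_for provides the same kind of timed wait.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Minimal binary semaphore, for this sketch only.
class TimedSemaphore
{
public:
    explicit TimedSemaphore(bool bInitiallyAvailable)
        : bAvailable(bInitiallyAvailable) {}

    void release()
    {
        {
            std::lock_guard<std::mutex> Lock(Mutex);
            bAvailable = true;
        }
        Condition.notify_one();
    }

    // Returns false if the semaphore could not be acquired within Timeout.
    template <class Rep, class Period>
    bool try_acquire_for(const std::chrono::duration<Rep, Period>& Timeout)
    {
        std::unique_lock<std::mutex> Lock(Mutex);
        if (!Condition.wait_for(Lock, Timeout, [this] { return bAvailable; }))
        {
            return false;
        }
        bAvailable = false;
        return true;
    }

private:
    std::mutex Mutex;
    std::condition_variable Condition;
    bool bAvailable;
};

enum class EWaitResult { Acquired, TimedOut };

// Hypothetical replacement for the acquire/release pair in the worker loop.
// On timeout the caller is expected to log and leave its loop gracefully
// instead of blocking forever.
template <class Rep, class Period>
EWaitResult WaitForGameThread(TimedSemaphore& Semaphore,
                              const std::chrono::duration<Rep, Period>& Timeout)
{
    if (!Semaphore.try_acquire_for(Timeout))
    {
        return EWaitResult::TimedOut;
    }
    Semaphore.release(); // release it directly, as in the original loop
    return EWaitResult::Acquired;
}
```

In the worker loop, a TimedOut result would break out of the while loop so the process can exit and the trace and log can be flushed.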

Thanks,

Basile

[Attachment Removed]

Hi,

Commandlets can be traced. You can do it by adding the following argument: -trace=default

You can add -tracefile if you want the trace to be written under Saved\Profiling. Otherwise, it will be written to the trace store.

The trace should be valid despite the deadlock. It will contain events that never end on some of the threads, but it should be readable and should show what leads to the lock. You should make sure to instrument updateConstructionFSM so that we can see when it runs.

Regards,

Martin

[Attachment Removed]

Noted. I was under the impression that the trace file was written at exit, so your comment is good news. I will try that when I have a chance (I am on site this week and next, but I may still try a nightly build).

Also, I tried using a lowest-priority thread and I saw a different sort of deadlock. Timing seems to be a reasonable lead.

Keeping you posted.

[Attachment Removed]

Hi,

I tried a few executions last week with tracing enabled, without seeing the deadlock. Either the problem was somehow fixed by changes in our code base, which has evolved, or tracing alters the timings enough to hide the problem.

I will keep trying to capture traces when a failure occurs, but if I do not comment back here, it means we cannot reproduce it anymore.

The only thing I saw in the traces is oversubscription, but that should not be enough to cause a deadlock!

Thanks,

[Attachment Removed]

Hi Martin,

You will find attached the traces for a deadlock that I observed this morning.

The zip file contains two traces from two consecutive tiles in our database pipeline. The first one executes correctly to the end. The second one runs until I kill the process…

I also included the log file from the execution that fails.

Is there anything you can tell from it?

Thanks,

Basile

[Attachment Removed]

The Insights traces are not really helpful. The deadlock is likely in a method that doesn’t emit an event under FActorComponentTickFunction::ExecuteTick. You could try to add some instrumentation and also capture the Task channel. This channel captures the dependencies between the tasks, so you might get a better idea of which task is being waited on.

Martin

[Attachment Removed]

Hi Martin,

Just to keep you posted, I enabled the extra trace channel in my script, but I have not reproduced the issue since.

If this pops up again I will let you know, but I can hardly focus on the problem if it is not seen more often…

Best,

[Attachment Removed]