Crash in dtNavMesh::init

We have had an intermittent crash in our project that occurs when the recast nav mesh tries to serialize. When we debugged this it looked like the recast nav mesh actor was being destroyed on the main thread while simultaneously serializing on the worker thread. Essentially the null pointer check for RecastNavMeshImpl in SerializeRecastNavMesh passes, but once it gets inside that function something on the main thread calls Destroy on the recast nav mesh actor. In ANavigationData::Destroyed it calls UnregisterAndCleanUp, which calls ARecastNavMesh::CleanUp, which calls DestroyRecastPImpl and deletes RecastNavMeshImpl out from under the async serialization process.

We experimented with two different fixes for this:

Fix #1: In ARecastNavMesh::Serialize, before the else if (RecastNavMeshSizeBytes > 4) check, we added this block of code

else if (IsActorBeingDestroyed())
{
	UE_LOG(LogNavigation, Warning, TEXT("%s: ARecastNavMesh: Navmesh destroyed prior to load. Skipping serialization. \n"), *GetFullName());
	CleanUpBadVersion();
}

However, I get the feeling that rather than fix the problem this simply shrinks the timing window considerably. So the crash theoretically could still happen, but in actual practice maybe not.

Fix #2: Remove the call to DestroyRecastPImpl in ARecastNavMesh::CleanUp. The ARecastNavMesh destructor will destroy it for us anyway, so we let it handle it which should be a threadsafe area. This also fixed the crash for us, and between the two fixes this feels like the more comprehensive fix.

Have you guys seen anything like this? Are there other fixes you would recommend?

Thanks!

I do not recall seeing anything for this previously. Do you have the callstack for the main thread as well so we can see what is happening in both locations? It sounds like there may be a race happening but narrowing down the code path could help with tracing what events can lead to this.

-James

Have you run into this again? I have checked our crash reporter tool and the crashes coming out of dtNavMesh::init that have been reported are around out of memory errors. I also think your fixes may have cleared the issue.

Were both changes needed for stopping the issue? Changing some of the cleanup code for navmesh requires a lot of testing internally for FNBR and other projects to ensure we do not have any regressions in memory or perf.

-James

Thank you for the update! We are very aware of how busy things get. I’ll talk it over with the team about this and what approach they may want to take. I am hopeful you don’t experience it again. And thanks for bringing it to our attention!

-James

At the moment I cannot reproduce the crash, being a timing issue and all. I do remember going up the callstack to ARecastNavMesh::SerializeRecastNavMesh and seeing that RecastNavMeshImpl was explicitly null, not just junk, which indicates that DestroyRecastPImpl was indeed called. If I do manage to repro it again I will include the main thread callstack

Sorry for not responding sooner. I have not run into this issue again

I did not need both fixes to resolve the crash. Only one or the other was necessary. But I also don’t know at this point if it they both truly fixed it or the issue just happened to not repro anymore.

No problem thanks!