UE-182925 - OnDestroyPhysicsState crash while cooking 5.7

5.7.0-release with unrelated engine modifications for our title (nothing around the cook / build process)

Cook a small map with a very heavily componentized blueprint actor instance placed in-game: hundreds of components plus child actor components.

Crash due to an invalid solver pointer when components go through GC to destroy their physics state.

Note: some line numbers may not be exact to 5.7.0-release due to small engine mods we’ve made in these files. Function callstack / path is unchanged though.

[Attachment Removed]

Steps to Reproduce
Hello,

Following up on Unreal Engine Issues and Bug Tracker (UE-182925), which has EPS post [After upgrading to 5.4 the cooker crashes at the end of the [Content removed]

We’re able to reproduce this 100% of the time while cooking in 5.7.0-release. It happens near the start of a cook, and in particular seems to occur for us on a blueprint actor that has a lot of components (200+), including child actor components, inside a very small “lobby” level.

So far what I’ve seen is:

* Level gets loaded and rapidly goes through world creation, serialization, and world destruction. (The level is small enough that, once all DDC content is generated, it takes under 3s to load / save / destroy the map.)

* The phys solver is destroyed upon world destruction (effectively pulling the rug out from under the physics bodies still lingering in primitive components). I set data breakpoints to catch this occurring and to see where the physics reference was being updated in the final solver sync during destruction.

* Very shortly after, GC clean-up occurs and goes through destroying physics state. It crashes due to an invalid pointer while trying to resolve the Solver to destroy the underlying physics state objects.

* I’d expect world destruction to be tearing down / cleaning up the components, but it seems that’s not the case (concerning).
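The failure mode in the steps above can be sketched outside the engine: a body holding a raw pointer to a solver that dies with its world is the classic dangling-reference pattern. A minimal stand-alone sketch (all types here are hypothetical stand-ins, not engine types) shows how a weak reference would let the GC-time teardown detect the dead solver instead of dereferencing freed memory:

```cpp
#include <memory>

// Hypothetical stand-in for the Chaos solver owned by the world.
struct FSolverSketch
{
	int NumBodies = 0;
};

// Hypothetical stand-in for a primitive component's physics body.
struct FBodySketch
{
	// A weak reference instead of a cached raw pointer: once the world
	// (and with it the solver) is destroyed, lock() fails cleanly.
	std::weak_ptr<FSolverSketch> Solver;

	bool DestroyPhysicsState()
	{
		if (std::shared_ptr<FSolverSketch> Pinned = Solver.lock())
		{
			--Pinned->NumBodies; // solver still alive: normal teardown
			return true;
		}
		return false; // solver already gone with the world: skip teardown
	}
};
```

With a raw pointer in place of the weak_ptr, the late GC-time `DestroyPhysicsState` call would dereference freed memory, which matches the crash described.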

The callstack is different. What I haven’t been able to verify yet is whether this is widespread in our cook or tied to the specific content (potentially a content issue vs. a system issue, as a short-term distinction). I have started to look at intercepting the initial body creation work that can occur in the finalization steps of solver teardown, and actually nulling out the reference, but I’m not sure yet if that will produce any meaningful results.

Note: if we actually skip new creation when we’re shutting down, we get the original callstack at the end of the cook. Specifically:

  1. FPBDRigidsSolver::ProcessSinglePushedData_Internal, ProcessProxyPT lambda - modify const bIsNew to not create new elements when shutting down
  2. PrimitiveComponentPhysics.cpp - GetPhysicsObjectById - add a null check (BodyInstance.ActorHandle->GetPhysicsThreadAPI() == nullptr) to catch it on the game side (ugly)

This gets us past the immediate blow-up during GC, and we then hit the original issue as reported. Please let me know if I can provide any more information. I’m highly doubtful I can package up the content project, but I can try to assist with any other debugging. The parallel callstacks didn’t provide much at the time of the crash.

Anyway, this is a 100% repro for us. In particular, the incorrect lifetime management of the physics teardown relative to the world it belongs to seems like the culprit. I feel like waiting for GC will always carry a high risk that the world itself is gone, and so is the solver / scene associated with it.

[Image Removed]

PBDRigidsSolver.cpp - const bool bIsNew = !IsShuttingDown() && !Proxy->IsInitialized();
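The bIsNew modification above boils down to a shutdown guard on proxy initialization: once the solver is tearing down, no proxy should be treated as "new", so no fresh physics-thread state gets created during teardown. A minimal stand-alone sketch of that guard (types are hypothetical stand-ins, not the actual Chaos classes):

```cpp
// Hypothetical stand-in for a physics proxy pushed to the solver.
struct FProxySketch
{
	bool bInitialized = false;
};

// Hypothetical stand-in for the solver's push-data processing.
struct FSolverGuardSketch
{
	bool bShuttingDown = false;
	int NumCreated = 0;

	void ProcessProxy(FProxySketch& Proxy)
	{
		// Mirrors the modified check: never treat a proxy as "new" once
		// shutdown has begun, so no new elements are created mid-teardown.
		const bool bIsNew = !bShuttingDown && !Proxy.bInitialized;
		if (bIsNew)
		{
			Proxy.bInitialized = true;
			++NumCreated;
		}
	}
};
```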

PrimitiveComponentPhysics.cpp 

Chaos::FPhysicsObject* UPrimitiveComponent::GetPhysicsObjectById(Chaos::FPhysicsObjectId Id) const
{
	if (!BodyInstance.IsValidBodyInstance() || BodyInstance.ActorHandle->GetPhysicsThreadAPI() == nullptr)
	{
		return nullptr;
	}

	return BodyInstance.GetPhysicsActor()->GetPhysicsObject();
}

[Attachment Removed]

Worth noting, we’ve never experienced the original issue or this issue before starting the upgrade to 5.7.0. I’ll be merging 5.7.1 today and can test there, but the release notes don’t suggest this has been resolved / touched in the process. We’ve recently upgraded from 5.5.4 to 5.6.1 and now 5.7.0.

Update: 5.7.1 no change

[Attachment Removed]

Thinking about it a bit more, it almost feels like the actors in question are not playing by the rules. If they were, they’d be destroyed by the world, the game-thread-side shutdown would trigger, and we’d be cleaned up before the world is destroyed. Perhaps tracking these actors to see why they go out of band in this process is the right approach?

[Attachment Removed]

Hi Bryan,

Are you able to replicate this in a vanilla version of UE, in a test project you can send across?

We are aware of some build issues, but we’ve not been able to get our hands on a solid repro in order to verify some information.

Best

Geoff Stacey

Developer Relations

Epic Games

[Attachment Removed]

Unfortunately, our team does not have bandwidth to do that right now. I’m guessing it’s going to be very difficult without our actual asset setup, and is probably stemming from complexity. The asset in question has an enormous amount of attached gameplay in native / blueprints too.

Is there anything else I can do to help validate? TBH I’m pretty suspicious of child actors in general, given the editor reinstancing / respawn flows and the fact that our blueprint probably has well over 100 of them.

Is there anything more you can share with me (re: build issues) to help diagnose? As mentioned, the fact that aspects of actor lifetime are outliving the world seems highly irregular for the normal teardown process. I’d half expect the issue to manifest just doing level transitions, unless level transitions are more “complete” about teardown than the cook process.

Thanks, hope we can keep moving this forward.

[Attachment Removed]

Hi Bryan,

Have you managed to find anything further on this?

Would you mind trying a speculative fix for this? There is a discontinuity between the ChaosScene CTOR and DTOR: can you null the SceneSolver->PhysSceneHack pointer in the ChaosScene DTOR and see if that then solves the issue?
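For reference, the shape of that speculative fix, clearing in the destructor the back-pointer the constructor set, can be sketched stand-alone (types here are hypothetical stand-ins, not the actual engine classes):

```cpp
// Hypothetical stand-in for the solver holding a back-pointer to its scene.
struct FSceneSolverSketch
{
	void* PhysSceneHack = nullptr;
};

// Hypothetical stand-in for ChaosScene.
struct FChaosSceneSketch
{
	FSceneSolverSketch* SceneSolver = nullptr;

	explicit FChaosSceneSketch(FSceneSolverSketch* InSolver)
		: SceneSolver(InSolver)
	{
		SceneSolver->PhysSceneHack = this; // set in the CTOR
	}

	~FChaosSceneSketch()
	{
		// The speculative fix: mirror the CTOR in the DTOR, so any late
		// caller sees null instead of a dangling scene pointer.
		if (SceneSolver && SceneSolver->PhysSceneHack == this)
		{
			SceneSolver->PhysSceneHack = nullptr;
		}
	}
};
```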

Best

Geoff

[Attachment Removed]

Hi Geoff,

Sorry, I have not had the bandwidth to jump further into this. This sounds simple enough to try, I will report back sometime this week!

[Attachment Removed]

Thanks Bryan,

We think part of this may be solved by that change - but it looks like it may write over that memory anyway at times, which means the ‘null’ won’t stay null.

[Attachment Removed]