Chaos Crash - TransformedAABB Seg Fault inside Update Overlaps

This is a very commonly used feature, and we get one or two of these crashes per month. So it’s pretty rare, but crashing a whole server is pretty disruptive.

We’ve been trying to reduce physics geometry as a mitigation, but because it’s player controlled, and we allow them to place over a thousand items (many of which never get physics representation on the server though), it’s hard to prevent certain worst case scenarios from players.

[Attachment Removed]

Steps to Reproduce
Repro is not currently known, but the callstack indicates that the client told the dedicated server via RPC to move an item within a space where the player can place and move items around.

[Attachment Removed]

Simplified callstack here, because it doesn’t look like it wants to show…

  • Crashed in non-app -> Server-Linux-Shipping +0x61605c2
    • TransformedAABB
  • Chaos::TransformedAABBHelperISPC (AABB.cpp:415)
  • Chaos::TAABB<T>::TransformedAABB (AABB.cpp:513)
  • Chaos::FImplicitObject::CalculateTransformedBounds (ImplicitObject.h:266)
  • TSQVisitor<T>::Visit<T> (SQVisitor.h:206)
  • Chaos::TSpatialVisitor<T>::VisitOverlap (ISpatialAcceleration.h:449)
  • Chaos::TAABBTreeLeafArray<T>::OverlapFast<T> (AABBTree.h:362)
  • Chaos::TAABBTree<T>::QueryImp<T> (AABBTree.h:3063)
  • Chaos::TAABBTree<T>::QueryImp<T> (AABBTree.h:2865)
  • Chaos::TAABBTree<T>::OverlapFast<T> (AABBTree.h:1011)
  • Chaos::TSpatialAccelerationCollectionHelper<T>::OverlapFast<T> (SpatialAccelerationCollection.h:265)
  • Chaos::TSpatialAccelerationCollection<T>::Overlap<T> (SpatialAccelerationCollection.h:530)
  • Chaos::TSpatialAccelerationCollection<T>::Overlap (SpatialAccelerationCollection.h:524)
  • OverlapHelper<T> (SQVisitor.h)
  • const::lambda::operator()<T> (SceneQueryLowLevel.cpp:106)
  • Chaos::Utilities::CastHelper<T> (CastingUtilities.h)
  • (anonymous namespace)::FGenericChaosSQAccelerator<T>::Overlap<T> (SceneQueryLowLevel.cpp:104)
  • LowLevelOverlap<T> (SceneQueryLowLevel.cpp:340)
  • GeomOverlapMultiImp<T>::lambda::operator() (SceneQuery.cpp:972)
  • UE::Core::Private::Function::TFunctionRefBase<T>::operator() (Function.h:470)
  • FPhysInterface_Chaos::ExecuteRead (PhysInterface_Chaos.cpp:549)
  • GeomOverlapMultiImp<T> (SceneQuery.cpp:960)
  • GeomOverlapMultiHelper<T> (SceneQuery.cpp:1079)
  • FGenericPhysicsInterface::GeomOverlapMulti<T> (SceneQuery.cpp:1143)
  • const::lambda::operator() (BodyInstance.cpp:4142)
  • Invoke<T> (Invoke.h:47)
  • UE::Core::Private::Function::TFunctionRefCaller<T>::Call (Function.h:315)
  • UE::Core::Private::Function::TFunctionRefBase<T>::operator() (Function.h:470)
  • FPhysInterface_Chaos::ExecuteRead (PhysInterface_Chaos.cpp:512)
  • FBodyInstance::OverlapMulti (BodyInstance.cpp:4109)
  • UPrimitiveComponent::ComponentOverlapMultiImpl (PrimitiveComponent.cpp:4215)
  • UPrimitiveComponent::ComponentOverlapMulti (PrimitiveComponent.h:3131)
  • UPrimitiveComponent::UpdateOverlapsImpl (PrimitiveComponent.cpp:4027)
  • USceneComponent::UpdateOverlaps (SceneComponent.cpp:988)
  • USceneComponent::UpdateOverlapsImpl (SceneComponent.cpp:3068)
  • USceneComponent::UpdateOverlaps (SceneComponent.cpp:988)
  • USceneComponent::UpdateOverlapsImpl (SceneComponent.cpp:3068)
  • USceneComponent::UpdateOverlaps (SceneComponent.cpp:988)
  • USceneComponent::UpdateOverlapsImpl (SceneComponent.cpp:3068)
  • USceneComponent::UpdateOverlaps (SceneComponent.cpp:988)
  • USceneComponent::MoveComponentImpl (SceneComponent.cpp:3171)
  • USceneComponent::MoveComponent (SceneComponent.h:1639)
  • USceneComponent::SetRelativeLocationAndRotation (SceneComponent.cpp:1478)
  • USceneComponent::SetWorldLocationAndRotation (SceneComponent.cpp)
  • AActor::TeleportTo (Actor.cpp:746)
  • [OurComp]::MoveItem ([OurComp].cpp:5717)
  • [OurComp]::MoveItem ([OurComp].cpp:5761)

[Attachment Removed]

Unfortunately, there’s not too much we can do with the available info. The best guess is that this is a use-after-free. If you could run the program with ASAN, that should help track down the issue.

[Attachment Removed]

Hi again, just wanted to check in and see if there was any more we can do or if it’s ok to close this. Thanks

[Attachment Removed]

Hm…that does sound unusual, but points to something else being the culprit and possibly corrupting data. Without a repro/crash dump I’m not sure if we can do more though

[Attachment Removed]

We encounter a very similar crash. Mostly the same callstack in terms of Engine code. Ours originates from a Chaos Vehicle Sim, where the suspension trace then ends up producing the rest of the callstack. We don’t have a repro either, but this at least hints at it not being project-specific. I’m sadly not sure what other information I could share. There is a lot going on when the server crashes, with lots of players and multiple vehicles, so I don’t even know, at the time of writing, which vehicle could have caused it.

Edit: I think OP has a Linux server, based on the “seg fault” mention. Ours is also a Linux build. Not sure I can even reproduce this on a Windows server.

[Attachment Removed]

Yeah, unfortunately that still isn’t enough to really investigate further. That function only really breaks if the input data is corrupt, so the issue likely happened somewhere else. I’m not sure a crash dump will even help. Our best suggestion is to run TSAN if possible to get more info.
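As an illustration of the "corrupt input data" point, here is a minimal sketch of a debug-only sanity check on bounds before they are transformed. `Vec3`, `Box3`, `IsSaneBox` and `CheckBoxBeforeTransform` are hypothetical stand-ins, not Chaos types or functions. It also assumes an inverted box is always invalid, which may not hold in engines that encode empty bounds that way. A freed object can still contain plausible-looking values, so a check like this would only catch some cases.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical stand-ins for illustration; these are not the Chaos types.
struct Vec3 { double X, Y, Z; };
struct Box3 { Vec3 Min, Max; };

// True when every component is a finite number (not NaN or +/-infinity).
inline bool IsFinite(const Vec3& V)
{
    return std::isfinite(V.X) && std::isfinite(V.Y) && std::isfinite(V.Z);
}

// True when the box is finite and Min does not exceed Max on any axis.
// Assumption: inverted boxes are never legitimate in this sketch.
inline bool IsSaneBox(const Box3& B)
{
    return IsFinite(B.Min) && IsFinite(B.Max)
        && B.Min.X <= B.Max.X && B.Min.Y <= B.Max.Y && B.Min.Z <= B.Max.Z;
}

// Debug-only guard for a call site that is about to transform the box.
inline void CheckBoxBeforeTransform(const Box3& B)
{
    assert(IsSaneBox(B) && "corrupt bounds reached the query path");
}
```

A guard like this is cheapest to place where data enters the system (for example, where a client-supplied transform is first applied), since by the time the query runs the source of the corruption is usually long gone.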

I’ll keep this in the back of my mind and see if we get other crash dumps I can correlate.

[Attachment Removed]

Sounds good. In the meantime, I’ll close this and you can re-open if you get more info. I get pinged every day that this is left unanswered otherwise :sweat_smile:

[Attachment Removed]

OOM crash or segfault crash? OOM sounds different than the original issue. I think to catch this you’d have to add some other checks/asserts somewhere else. I’m not sure where would be good, as this looks like an implicit that was freed but is still in the AABB tree, so there’s no easy “check this pointer is valid” kind of thing. You wouldn’t happen to be able to see what other threads were doing, would you? I wonder whether it’s a use-after-free or a threading issue. We had a case recently with something else in physics not taking a write lock, which we were able to see because another thread was often doing stuff at the time. I think that should be unrelated to your case though, as it was related to spatial readiness, which I think is only in LEGO Fortnite.
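To illustrate why a raw pointer gives no easy validity check, and what an alternative could look like, here is a minimal sketch of generational handles. This is not how Chaos stores implicit objects; `Handle`, `ShapePool` and their members are hypothetical names. The idea is that a spatial structure holding a handle can detect that its entry is stale, whereas a raw pointer to freed memory looks the same as a live one.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical illustration; not how Chaos stores implicit objects.
struct Handle { std::uint32_t Index; std::uint32_t Generation; };

class ShapePool
{
public:
    // Allocates a slot and returns a handle stamped with its generation.
    Handle Create(float Radius)
    {
        std::uint32_t Index;
        if (!FreeList.empty()) { Index = FreeList.back(); FreeList.pop_back(); }
        else { Index = static_cast<std::uint32_t>(Slots.size()); Slots.push_back({}); }
        Slots[Index].Radius = Radius;
        Slots[Index].bAlive = true;
        return {Index, Slots[Index].Generation};
    }

    // Frees the slot and bumps its generation so old handles stop matching.
    void Destroy(Handle H)
    {
        if (!IsValid(H)) return;
        Slots[H.Index].bAlive = false;
        ++Slots[H.Index].Generation;
        FreeList.push_back(H.Index);
    }

    // A handle is valid only if the slot is live and the generation matches.
    bool IsValid(Handle H) const
    {
        return H.Index < Slots.size() && Slots[H.Index].bAlive
            && Slots[H.Index].Generation == H.Generation;
    }

    // Returns nullptr for stale handles instead of touching reused data.
    const float* TryGetRadius(Handle H) const
    {
        return IsValid(H) ? &Slots[H.Index].Radius : nullptr;
    }

private:
    struct Slot { float Radius = 0.f; std::uint32_t Generation = 0; bool bAlive = false; };
    std::vector<Slot> Slots;
    std::vector<std::uint32_t> FreeList;
};
```

In this sketch, a tree entry that still refers to a destroyed shape fails `IsValid` even after the slot has been reused, which turns a silent stale read into something an assert can catch.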

[Attachment Removed]

No idea how likely it is. I think it’s one of those “if you see something in another thread, then that points to something, but if you don’t, then you’re still in the dark”.

I think use-after-free is by far the most likely, which unfortunately means the cause could be just about anything somewhere else…

[Attachment Removed]

This sounds reasonable.

We have noticed that this function uses mostly nullptr checks and not validity checks, and it’s based on an RPC server function that manipulates the item in a similar way to how other players can manipulate the item.
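To make the nullptr-versus-validity distinction concrete, here is a minimal standalone sketch. `MockItem`, `IsItemValid` and `ServerMoveItem` are hypothetical stand-ins, not our actual code or engine types. In Unreal the analogous check for a `UObject` pointer is `IsValid(Object)`, which also rejects objects that are being destroyed. Whether this gap is related to the crash is unknown.

```cpp
#include <cassert>

// Hypothetical stand-in for an engine object; not a real UObject.
struct MockItem
{
    bool bPendingDestroy = false;   // set once destruction has been requested
    int  MoveCount = 0;             // how many move requests were applied
};

// Validity check: the pointer is non-null AND the object is not being destroyed.
inline bool IsItemValid(const MockItem* Item)
{
    return Item != nullptr && !Item->bPendingDestroy;
}

// Hypothetical server-side handler: rejects requests for null or dying items.
inline bool ServerMoveItem(MockItem* Item)
{
    if (!IsItemValid(Item))
    {
        return false;   // a bare `if (Item)` would have let a dying item through
    }
    ++Item->MoveCount;
    return true;
}
```

The point of the sketch is only that a non-null pointer can still refer to an object that is on its way out, so a request arriving from another player at the wrong moment passes a nullptr check but fails a validity check.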

Unfortunately, all of our repros are theoretical, as we have not created the crash in a dev environment…

[Attachment Removed]

Oh, I did discover something new… There was a Procedural Mesh Component on one commonly used blueprint that was being initialized with some basic information in construction script, but wasn’t actually being used and should have had no collision. It and one other unused mesh component were at depths 4 and 5 of that, which would match the Update Overlaps callstack depth of the crash. I made a change in main a few weeks ago to delete that component, and I haven’t seen any crashes in our local test cases (but we never had any crashes anyway).

I saw some messages around the internet that there can be similar symptoms from just having too many physics bodies. So maybe this would mitigate by having less… maybe it won’t make a difference… Or maybe the procedural mesh component in itself was a problem.

[Attachment Removed]

Yes, we have a Linux Server; good call out.

And for extra clarity, it’s a Dedicated Server.

[Attachment Removed]

Note: that local change finally made it to live players last week, but we are still experiencing this crash, so the Procedural Mesh Component seems to have been a red herring… Unless someone added another one to our game without my knowledge (doubtful).

[Attachment Removed]

I appreciate that, and if we ever identify a local repro with literally any amount of consistency, I’ll update here.

[Attachment Removed]

Sounds fair. As a note, our next hot spot is a bunch of actors that had a common case of 30 static mesh instances cycling on the server upon several different instances. These are common, but it shouldn’t be possible to acquire more than 30, and that would be a high-end player.

One interesting thing about it… I found that we actually had some triggers for a clear and rebuild… So in the editor I ran a test that forced a clear and rebuild 100 times every second for a full set (so ~100,000 physics bodies cleared and built per second, if my math worked out) and let that sit in a Play In Editor session… Obviously not a Linux server, but I just wanted to see what would happen. It handled it OK, until maybe 10 minutes later (didn’t time it) I did get an OOM crash.

We actually are trying to slowly push this to players today, so we’ll see how much it helps soon maybe…

[Attachment Removed]

An update: that change seems to have minimal effect unfortunately… This weekend’s occurrence rate is very similar to last’s.

Back to investigating.

[Attachment Removed]

The server is still segfault crashing with pretty much no real callstack coming in over Sentry. (And we still haven’t gotten a single instance in our Dev Servers)

The OOM was a semi-related thing I found while trying to pressure a potential culprit system in order to generate a repro case… in a local setup… So yeah, the fact that fixing that scenario had seemingly no impact makes me believe I can investigate deeper on that particular case to find something for the segfault…

The Sentry report does have some small amount of other thread info… I forgot about that. Checking now, it’s all HTTP/network layer or just a bunch of waiting worker threads. I think that implies a use-after-free… What’s the likelihood that, by checking through the reports, we could find one where the freeing thread is still in the middle of the free?

[Attachment Removed]