Chaos Flush External Acceleration Queue Timestamp

I’ve been profiling and reading through the chaos physics tick, mainly around StartPhysics and EndPhysics which occur on the game thread and noticed a piece of code that I would expect to cause redundant work when we’re ending a physics frame. Namely after we copy our solver acceleration structure we flush our external acceleration queue, which makes sense, but we do it for all pending spatial data after the acceleration structure’s timestamp or with the same timestamp of the acceleration structure. My understanding is that by definition any pending spatial data with the same time stamp as the acceleration structure should already have been included during the call to FPBDRigidsEvolutionBase::FlushAsyncAccelerationQueue in the physics thread. If I understand correctly, that would mean we’re reupdating the acceleration structure for work that occurred prior to the physics frame beginning when we’re only meant to be including work that occurred since the frame started. Is this a bug or is there some reason that I’m missing that we would need to evaluate operations with the same timestamp as the acceleration structure? The time savings are non-trivial for this change depending on the number of bodies being moved by the game thread (over 0.5ms for us in certain areas) so it would be nice to confirm this is a bug.

void FPBDRigidsEvolutionBase::FlushExternalAccelerationQueue(FAccelerationStructure& Acceleration, FPendingSpatialDataQueue& ExternalQueue)
{
    //update structure with any pending operations. Note that we must keep those operations around in case next structure still hasn't consumed them (async mode)
    const int32 SyncTimestamp = Acceleration.GetSyncTimestamp();
    for (int32 Idx = ExternalQueue.PendingData.Num() - 1; Idx >=0; --Idx)
    {
       const FPendingSpatialData& SpatialData = ExternalQueue.PendingData[Idx];
 
       // UDN Note: Why is this a >= instead of >?
       if(SpatialData.SyncTimestamp >= SyncTimestamp)
       {
           // ...
       }
       else
       {
          //operation was already considered by sim, so remove it
          //going in reverse order so PendingData will stay valid
          ExternalQueue.Remove(SpatialData.UniqueIdx());
       }
    }
}

Hello Robert,

So I did some testing and it seems this does need to be >= vs. >.

This is updating the external structure which is used for game thread. During the physics thread, we update the sim structure but not the external structure. It’s pretty hard to follow, but when we copy the structures from physics to game thread, we actually don’t copy everything, namely the dynamic structure isn’t copied. If we don’t do this we’ll miss data.

I verified this by running some tests and saw cases where the == check does update the tree with an aabb with different values.

Let me know if you have more questions.

I may have been conflating external and async. Async does a very odd thing where it double swaps the dynamic tree around, but that should be unrelated to this.

I did a bunch of investigating and testing today and have some info but nothing too conclusive yet.

First, I was able to observe a simple scenario (a box dropping under gravity) where the aabb that was updated in the == case was not contained by the old aabb. So there are indeed some cases where the aabb is stale if that isn’t called. However, I did some further testing and was unable to notice a discrepancy between the internal/external, likely because of the deep copying. I did notice that sometimes it just shrinks, so I think there’s a bug somewhere in there with the aabb tree expansion factors, but didn’t investigate further. I think the issue here is the update vs. insert case and if your object moves a lot it removes then re-inserts.

Now all of this is in the case of synchronous physics, if it’s async physics, we interpolate the data back to the game thread using this queue, and in this case it would be incorrect to skip the == case.

From my understanding of what’s supposed to be happening, the async queue is used to update the GT structure with any pending changes that may have been missed. I believe there are cases where GT can start up physics, then make a change, and then when that’s publish back to the GT the timestamps would be the same.

I’ll try to do some more digging and asking around, but from my understanding it is needed, I’ve just been unable to make a case that shows the discrepancy

I believe it’s technically correct that it does get updated with old data, at least sometimes. If both GT and PT update an object, we try to ensure that GT wins. Now if GT isn’t doing any changes that sounds wrong to me. When you noted the old update, was it still getting clobbered later? Oh wait, you said it was clobbered. Also is this all dynamic you’re observing?

Just to make sure, are you running async physics?

That is an interesting observation about the different aabbs between GT/PT. I can potentially understand why given that PT might be the only one who knows the true shape (without a lock).

So I spent a little more time digging deep and have some data, but I do need to talk with someone else to help confirm a few things.

At a bare minimum, the timestamp needs to be == otherwise GT changes that happen in-between StartFrame and EndFrame would be lost.

As far as PT data, I believe in an average case, the data is duplicated, but I think there are chances data could be missed. We update with current data from integration in ComputeIntermediateSpatialAcceleration, but we queue pull data in CompleteSceneSimulation which happens at the end of the time step. I think there’s a large chance changes could happen to transforms (e.g. callbacks, position correction, etc…) in between.

I graphed out some of the calls and timestamps to help understand this a bit.From my understanding, whenever ComputeIntermediateSpatialAcceleration has a completed task, the flushes will always ensure that the structures are fully up-to-date with the current frame. That finished structure is queued there and consumed at the end frame later.

I do believe there’s room for optimizations here, but I don’t think there’s an easy answer that won’t lose data. Likely the correct solution is to reconfigure how OnSyncBodies interacts with the queue so we don’t double up the changes.

Oh yeah, my bad. Any changes that happen later get picked up in OnSyncBodies which have the later timestamp (assuming no where else posts changes that I missed). There may be some world where if you could detect the only change was integration that you could skip posting those which I would assume causes the majority of the queue updates

When you say we don’t copy the dynamic acceleration structure, I assume you’re referring to the fact that in the async acceleration structure task (FPBDRigidsEvolutionBase::FChaosAccelerationStructureTask::UpdateStructure(…)) we don’t copy substructures that are dynamic. However, after the acceleration structures are rebuilt we call FPBDRigidsEvolutionBase::CopyUnBuiltDynamicAccelerationStructures(…) which seems to do what it says on the tin.

I also noticed that in FlushAsyncAccelerationQueue we don’t update dynamic trees. However, right after that in ComputeIntermediateSpatialAcceleration we call FlushInternalAccelerationQueue() which does update our dynamic trees. Then we swap our internal acceleration structures and use the async internal structure (which should now have the up to date dynamic data) to update our external one in CopyUnBuiltDynamicAccelerationStructures. It’s very possible I got something wrong in that series of events though so if you’re referring to some other mechanism that would prevent the dynamic accel structure copy let me know.

As far as updates actually changing aabb’s in the external structure - I have seen that there are instances where acceleration structure bounds are changing as a result of operations with the same timestamp as the acceleration structure, which I’m also trying to track down, but I’m having trouble seeing the connection to the dynamic acceleration structure. When I debug draw both the internal and external acceleration structures there is actually a consistent mismatch in bounding box size for dynamic objects which leads me to think these changes are due to that rather than actual updates. In fact, if I debug draw the acceleration structure before and after the flush, we see a consistent growing of bounding box size, not some movement dependent change as I would expect to see.

Here’s a view of both the internal and external acceleration structures. Notice how each body has two mismatched bounding boxes, the smaller one being the internal (physics thread) acceleration structure.[Image Removed]

Here’s a view debug drawing the external acceleration structure right before and after the flush. It looks quite similar to the previous picture, but notice how all it does is grow the bounding boxes, not actually move them based on body translation as we would expect if we were in some way updating out of date data.

[Image Removed]

For reference, the code I added to debug draw pre and post flush looks something like this.

void FPBDRigidsEvolutionBase::UpdateExternalAccelerationStructure_External(
       ISpatialAccelerationCollection<FAccelerationStructureHandle, FReal, 3>*& StructToUpdate, FPendingSpatialDataQueue& PendingExternal)
    {
       // ...
 
       if (ensure(StructToUpdate) && NewStruct != nullptr)
       {
#if CHAOS_DEBUG_DRAW
          IConsoleVariable* DrawExternalSTructureCVar = IConsoleManager::Get().FindConsoleVariable(TEXT("p.Chaos.Solver.DebugDrawExternalSpatialAccelerationStructure"));
          if (DrawExternalSTructureCVar && DrawExternalSTructureCVar->GetInt() == 1)
          {
             DebugDraw::DrawSpatialAccelerationStructure(*StructToUpdate, nullptr);
          }
#endif
          
          FlushExternalAccelerationQueue(*StructToUpdate, PendingExternal);
 
#if CHAOS_DEBUG_DRAW
          if (DrawExternalSTructureCVar && DrawExternalSTructureCVar->GetInt() == 1)
          {
             DebugDraw::DrawSpatialAccelerationStructure(*StructToUpdate, nullptr);
          }
#endif
       }
    }

I am still in active investigation as well. There are definitely times when the external flush does result in us adding elements that weren’t previously in acceleration structure, but I have yet to figure out the cause. That being said, it’s really the Update Ops I care about and leaving the Add ops at >= and changing Update ops to be > would still be really nice.

With more investigation I have actually found the == case will update the acceleration structure with old data. Namely when we flush updates on an object simulating physics, the == update actually ends up giving us out of date bounds (from the previous frame). So this flush is actually doing the opposite of what it’s supposed to do in this case.

This is a cube in free fall. The red bounds are the ones in the acceleration structure copied from the physics thread. The green bounds are the one after the external acceleration queue flush.

[Image Removed]

This does end up getting fixed up because we sync bodies and pull data after we copy our acceleration structure which gives us the correct bounds. Still it is interesting to note.

I also wanted to note that implicit shapes with rotation is part of the cause for the difference in bounding box size in the internal and external acceleration structure. If an implicit shape is rotated then it seems the bounds as calculated on the physics thread is the aabb containing the rotated shape while on the game thread it’s calculated as the aabb containing the rotated aabb of the shape, giving enlarged bounds based on the shape’s orientation.

Given a rotating capsule, the external/GT bounds will grow and shrink (green) while the internal/PT bounds (blue) will stay consistent.

[Image Removed]

This does not appear to be the case for non-implicit shapes (e.g. convex hulls). For these shapes, aabb will be orientation dependent for both the GT and PT.

None of this is directly related to the question at hand necessarily. I just wanted to record my findings on all the reasons why the external acceleration queue flush is causing the bounds to change that aren’t actually meaningful updates so they can be ruled out.

Yep, for our project bTickPhysicsAsync is disabled. I haven’t looked much into async physics and its logic so I do have a bit of a blind spot there.

Yep all of these objects are dynamic.

Also for the falling box updating with old aabb example it should just be the PT updating the position since it’s just a static mesh with physics simulation enabled. I believe the order of events that causes this is:

  1. On GT we pull state in end physics (Timestamp 1). This causes us to update our external acceleration structure AND add the update to our pending external acceleration queue associated with timestamp 1
  2. On the subsequent frame we start our physics update with timestamp 1 THEN update our GT timestamp to 2
  3. Our async acceleration structure rebuild finishes so on the PT we do all the necessary swapping/flushing. The PT pending update flush causes an update to the current timestamp for the rebuilt acceleration structure (which is 1 for the PT).
  4. In GT end physics we copy the rebuilt acceleration structure and flush our pending updates. Since the rebuilt acceleration structure has timestamp 1 and the update from when we pulled last frame also has timestamp 1 we end up updating the acceleration structure with the bounds we pulled from last frame.
  5. Then we pull for this frame and overwrite the out of date bounds and the cycle repeats.

Again, this isn’t necessarily a huge deal since we end up overwriting the out of date bounds from the pending queue flush with the correct bounds from pull immediately, but it is odd behavior.

Yeah I agree that there doesn’t seem to be a solution that is as easy as I thought and it may not be reasonable to find a satisfactory solution at the moment. I think due to the fact that kinematic bodies may not be pulled and instead directly set the body position on the GT could definitely cause discrepancies that need to be accounted for with the pending queue flush. So if we want a fully accurate acceleration structure on the GT we have to perform the update in the == case even though in the majority of cases kinematic bodies will match on GT and PT.

However, I don’t think we’ll miss any changes between StartFrame and EndFrame without the ==. If I followed the timestamp correct, then any changes after StartFrame should have a timestamp higher than the one associated with the queued up rebuilt external acceleration structure since we increment the external timestamp during StartFrame after we’ve already copied it internally for the PT. Here’s the flow I have recorded for that (execution flow generally being top to bottom, left to right):

[Image Removed]