To catch this race condition, I added an `MTAccessDetector` to `FSkeletonRemapping` to mark up all its reads and writes, then did a cold DDC cook to force DDC data to be regenerated.
This caught the following scenario:
The Game Thread was:
- Serializing a Skeletal Mesh
- Rebuilding Ref Skeleton from our SK_CharacterTemplate_Template_Skeleton
- Calling UE::Anim::FSkeletonRemappingRegistry::Get().RefreshMappings(Skeleton)
- ExistingMapping = SK_CharacterTemplate_Template_Skeleton -> SK_Mannequin
- Calling ExistingMapping->RegenerateMapping()
- !! Write: ExistingMapping.RetargetingTable.Reset()
- !! Write: ExistingMapping.GenerateMapping()
While a task thread was:
- Calling UAnimSequence::GetAnimationPose() on SK_Mannequin
- Calling UE::Anim::Decompression::DecompressPose()
- DecompContext:
- SourceSkeleton = SK_CharacterTemplate_Template_Skeleton
- SkeletonRemapping = SK_CharacterTemplate_Template_Skeleton -> SK_Mannequin
- (same object being used on game thread)
- !! Read: SkeletonRemapping.GetTargetSkeletonBoneIndex()
- !! Read: SkeletonRemapping.RetargetBoneRotationToTargetSkeleton()
Nothing prevents the game thread from regenerating the skeleton remapping while the task thread is trying to read from it.
Hi, thanks for reporting this. I took a look in the 5.6 codebase, and I see that there should already be locks around the access of the skeleton remapping data from RefreshMappings. Does the implementation of the function in your version of the engine look like this?
void FSkeletonRemappingRegistry::RefreshMappings(const USkeleton* InSkeleton)
{
TArray<TSharedPtr<FSkeletonRemapping>> ExistingMappings;
{
UE::TReadScopeLock ReadLock(MappingsLock);
PerSkeletonMappings.MultiFind(InSkeleton, ExistingMappings);
}
{
UE::TWriteScopeLock WriteLock(MappingsLock);
for(const TSharedPtr<FSkeletonRemapping>& ExistingMapping : ExistingMappings)
{
ExistingMapping->RegenerateMapping();
}
}
}
I think that this should probably be covered entirely by a single write lock rather than read/write, but it doesn’t sound like that’s the same issue you found.
Great, thanks for letting me know. I’m going to go ahead and get these changes integrated. They’ll go into the 5.8 release. I’m going to close out this thread, but if we happen to run into any problems with the change once it goes into Fortnite, I’ll reopen it and let you know.
Hi, just reopening this since I did find a potential problem with the code that I sent over previously. Turns out that we need to null check the Skeleton as the cloth tools can call RebuildRefSkeleton with a null skeleton. So the code change I sent you previously should actually look like this:
if (Skeleton && (this == &Skeleton->GetReferenceSkeleton()))
{
// Full rebuild of all compatible with this and with ones we are compatible with.
UE::Anim::FSkeletonRemappingRegistry::Get().RefreshMappings(Skeleton);
}
No problem, thanks again for flagging it!
Yes it looks like that. The race is not in constructing or handing out a FSkeletonRemapping, it’s that _after_ that nothing is stopping two competing threads from writing to the same mapping while another reads from it.
We have a temporary fix for this on our end, but it’s pretty extensive. Everywhere that calls `UE::Anim::FSkeletonRemappingRegistry::Get().GetRemapping()` now has to take a scope lock depending on whether they plan on reading or writing.
Examples:
const FSkeletonRemapping& SkeletonRemapping = UE::Anim::FSkeletonRemappingRegistry::Get().GetRemapping(BoneIndexSkeleton, LastUpdateSkeleton);
UE::TReadScopeLock SkeletonRemappingReadLock = SkeletonRemapping.LockForReading();
UE::TWriteScopeLock WriteLock(MappingsLock);
for(const TSharedPtr<FSkeletonRemapping>& ExistingMapping : ExistingMappings)
{
UE::TWriteScopeLock ExistingMappingWriteLock = ExistingMapping->LockForWriting();
ExistingMapping->RegenerateMapping();
}
Thanks, yeah I see what you mean. The option that you mentioned of locking whenever we get the remapping makes sense. As you say though, it means quite a few changes since that functionality is used extensively throughout the animation code wherever we get a bone index.
I think there is another option that we can look at. You mentioned originally that this happened when rebuilding the ref skeleton while the mesh was being loaded. But the actual skeleton (rather than the mesh’s ref skeleton) shouldn’t be affected by that. (If the actual skeleton was affected, we would have many more issues decompressing the animation against a skeleton that had changed.)
Looking at the code, it seems that FReferenceSkeleton::RebuildRefSkeleton is being overzealous about when it regenerates the skeleton remapping. We shouldn’t need to update it when just the ref skeleton on the mesh has been updated. So we could make the following change to FReferenceSkeleton::RebuildRefSkeleton:
RequiredVirtualBones.Add(NewBoneIndex);
UsedVirtualBoneData.Add(FVirtualBoneRefData(NewBoneIndex, SourceIndex, TargetIndex));
}
}
}
// new branch to only refresh mapping if the skeleton asset has changed
if (this == &Skeleton->GetReferenceSkeleton())
{
// Full rebuild of all compatible with this and with ones we are compatible with.
UE::Anim::FSkeletonRemappingRegistry::Get().RefreshMappings(Skeleton);
}
}
That would ensure that we only refresh the mappings when it’s the skeleton that is being regenerated, not the mesh ref skeleton. And we shouldn’t ever end up in a situation where the decompression code is being run at the same time as the skeleton is regenerated.
Would you be able to test that change against your repro?
Excellent, this makes much more sense. I’ll give it a try.
So far so good, our overnight ColdDDC cook passed successfully. We won’t really know until we get a few solid days of success in a row, but this is looking like it worked.
Thanks for following up, sounds promising. I’ll keep an eye out for any updates next week.
We’re looking really good, haven’t run into an issue since.
I see that this was committed to //UE5/Main in CL# 48060137. Thank you!