Hi! While investigating the slow PIE startup times, I’ve identified the inefficiencies in UAnimationSequencerDataModel::InitializeFKControlRig to be a major contributor. In the attached screenshot, you can see that loading ~1100 animation sequence assets takes approximately 9 seconds in total on the main thread. I’ve done some digging, and with some relatively straightforward changes reduced that to ~2.5 seconds; that’s still way too much - I believe many more optimizations are possible. The summary of my findings and optimization attempts are in the comment below (due to character limit for the post).
Unfortunately, these are all engine changes (and I believe further optimizations would require the changes to be quite significant), and to simplify our merges in the future I’d rather reduce the divergence of our codebase from the upstream as much as possible. Is there any interest from Epic in taking any of this? I’m happy to work with your maintainers of the relevant code to address any of your concerns.
The biggest inefficiency is the way URigHierarchy::GetSafeNewDisplayName is called:
UFKControlRig::CreateRigElements creates multiple FRigXXXElement structures, to represent bones etc
First it calls URigHierarchyController::ImportBones, which calls URigHierarchyController::AddBone for each bone (446k total calls in my example)
Then it calls CreateControlForBoneElement lambda, which calls URigHierarchyController::AddControl for a subset of bones (154k calls in my example)
URigHierarchyController::AddControl has a logic to ensure that Settings.DisplayName is unique among siblings by calling URigHierarchy::GetSafeNewDisplayName, which calls URigHierarchy::GetChildren to get the list of children elements of the parent
The list of children per parent is cached, so URigHierarchy::GetChildren calls EnsureCachedChildrenAreCurrent, which in turn calls UpdateCachedChildren if the cache is invalid to rebuild the cache by iterating over all elements and querying their parents
Every time the new element is added, the cache is invalidated (URigHierarchyController::AddElement calling URigHierarchy::IncrementTopologyVersion)
As a result, when calling AddControl in a loop, the cache is rebuilt and invalidated on every single iteration, which is of course very expensive
I believe this deduplication should not be required at all during import - the Control’s display name is currently set to be equal to the bone’s name (which guarantees a collision, because control is added as bone’s sibling); if instead we change the logic to be similar to what block below does for curve controls (ie adds a suffix to the display name), we would guarantee uniqueness, because bone names are already globally unique (this is guaranteed by ImportBones using GetSafeNewName). Then we could add a way to tell AddControl to skip the display name uniqueness check.
Note that actually there are two places where GetSafeNewDisplayName is called when adding a new control: the first one is described above, but then there’s another that happens when AddControl calls AddElement, which in turn calls URigHierarchyController::AddParent. So simply adding a way to skip the first one does almost nothing to the performance - the cost of cache rebuild moves to AddParent. The proper fix ensures both functions know to skip the uniqueness check.
The next inefficiency is URigHierarchyController::AddParent checking for parent-child cycles when adding new elements.
URigHierarchy::IsParentedTo / IsDependentOn does an expensive recursive exploration of the hierarchy, using TMap to store intermediate results
In fact, when adding new elements, this is completely unnecessary - new elements by construction have no children yet, meaning there’s no way they could ever participate in a cycle. I believe that AddElement should pass a flag to AddParent to skip the cycle check (note that we can’t just check whether passed element has no children - getting that would require obtaining element’s children, which would require rebuilding a cache, which would be invalidated later - see above).
The next inefficiency is URigHierarchy::GetSafeNewName
This function calls SanitizeName (which is itself quite expensive, discussed below), then calls IsNameAvailable in a loop (at least once, more if name is non unique), which does a whole bunch of redundant checks, including at least one extra SanitizeName call.
All the checks can be moved to GetSafeNewName to be done once, and then it can simply loop over checking whether name is duplicate (ie the loop condition would be `GetIndex(Key) != INDEX_NONE`).
The amount of string copies could also be reduced (eg building the base name and ensuring it’s short enough); especially considering that FName structure already provides the ability to increment number suffix without touching the string (at least with default compile options).
The next inefficiency is storage allocation in URigHierarchy::AllocateDefaultElementStorage
Every time URigHierarchyController::AddElement is called, it calls URigHierarchy::AllocateDefaultElementStorage to allocate matrices and dirty flags for elements with transform.
FRigReusableElementStorage::Allocate, assuming there’s nothing in the free list (which is the case during load), calls TArray::Reserve, passing the exact number of elements to add. The Reserve call does not use growth heuristic, as a result the storage reallocation happens for every single element we add. I believe it should use one of the Add* functions instead.
URigHierarchy::AllocateDefaultElementStorage, if bUpdateAllElements flag is set, iterates over all elements to update the storage pointer, even if reallocation did not actually happen. Instead, FRigReusableElementStorage::Allocate could return a flag (did reallocation happen), and if not - URigHierarchy::AllocateDefaultElementStorage can ignore the argument and update storage pointers only for the newly created element.
Finally, even with these changes reallocation was visible on the profiler - this can be avoided completely by precalculating the required storage size in UFKControlRig::CreateRigElements, and adding a ‘reserve storage’ API to the URigHierarchy and FRigReusableElementStorage.
A relatively minor optimization can be done in URigHierarchyController::ImportBones
If bReplaceExistingBones flag is not set, there’s no need to fill and look up BoneNameMap; the AddedBones array is guaranteed to have elements in the same order as input bones, so we can look up parent there by BoneInfo.ParentIndex and use it’s final (adjusted for sanitization & deduplication) name immediately.
A relatively minor but very trivial optimization can be done in UControlRig::SetBoneInitialTransformsFromRefSkeleton
The input skeleton is copied into the lambda, which costs like 100ms in my tests. Should be captured by reference instead.
And some more things that I’ve noticed but have not yet fully investigated:
For a subset of objects, InitializeFKControlRig is called twice (first from AnimSequence_PostLoad and then from AnimationSequencerDataModel_PostLoad), which is very expensive.
The reset after second call shows up on a profiler.
The URigHierarchy::SanitizeName function in general is quite inefficient, both in implementation and in the way it’s used.
In the scenario I’ve profiled, it was passed FRigName with FName, so it would always do the string conversion.
It’s called redundantly in a few places (eg URigHierarchyController::AddElement calls it on InDesiredName; when the key is built, the same input was already sanitized as part of GetSafeNewName - instead, the caller could explicitly sanitize the name first, then pass it both as desired name and into GetSafeNewName - and then the callees could expect the input to be already pre-sanitized).
In fact, it’s unclear why is it even needed in the first place (what’s the harm in allowing anything in the name?)
The FRigName structure design keeps FString <-> FName conversions opaque and makes it hard to reason when these relatively expensive operations happen.
There seems to be some redundant work involved in setting transforms while new elements are created.
URigHierarchyController::AddBone calls AddElement with bMaintainGlobalTransform always set to true.
The code traverses the parent chain to recalculate transform.
The transform is overwritten later anyway.
URigHierarchyController::SetSelection is called twice, not sure why it is even needed.
UpdateCachedChildren is still called twice after my optimizations, investigation tbd.
Thanks for the deep dive and vast amount of details! We’d be happy to work through the changes and get them integrated into mainline. There are some future facing plans to do a rework of the RigHierarchy in general but at the moment there is no firm timeline for this - so any improvements we can realize in the short-term would be a win. Do you have a stream within our depot to pass over a shelf, otherwise we could work with a diff or Github pull-request.
There is another inefficiency in the way DDC keys for animation sequences are calculated; this is for the same asset types, but it’s not really related to the issue described above. I’ve opened another PR to optimize this: https://github.com/EpicGames/UnrealEngine/pull/13362. On the same dataset, it reduces time spent on DDC key calculation from ~3.1s to ~0.7s (see attached screenshot).