World Partition streaming performance

anonymous-edc · April 25, 2025, 3:21pm

We took a trace and have identified a few things that are causing world partition streaming performance issues. First one. Large ISMS instantiate all of their collision at once. Epic carefully manages incremental registration of actors but not instances which still have a costly physics registration on a per-instance basis. We’re going to add optional timeslicing support. Epic may want to consider the same. Another high level issue is that components are budgeted based on s.LevelStreamingComponentsRegistrationGranularity which doesn’t consider the time taken to register an actor component. Not all are created equal. It would be better to have time budget here. Perhaps if the granularity is 16, we could apply a weighting system. Then a landscape component has a score of 16 causing the system to stop that frame and split the landscape components over multiple frames. Similarly instancedstaticmeshes could use their instance count to increase their score 1 being only a few instances and 16 (or whatever our max is) being 500 instances or more. This causes large isms with many instances to be streamed 1 per frame.

It’s worth mentioning that an attempt is made to time limit this stuff with s.LevelStreamingActorsUpdateTimeLimit however it only checks per increment and if an increment runs overboard, then the timelimit is not able to be respected. You can see below that 10 components get registered and then it realizes its overtime and ends that streaming tick. The timelimit is 5ms so it’s way overbudget.

[Image Removed]

The next issue is harder to diagnose, the main thread and renderthread seem to be waiting for something that is unclear to us. I checked all the background and foreground workers and all I could find is “Foreground Task”:

[Image Removed]

[Image Removed]StatNamedEvents seems to give us a few more details, it seems to be some chaos related build ops. Perhaps my initial idea of time slicing could help resolve this too.

[Image Removed] [Image Removed] Destroying physics state also extremely costly since it’s done per ism and not timesliced per instance. I’m out of allowed images but after this runs you get another big hitch from PhysicsTick->ChaosPushData and PhysicsTick->DestroyPendingProxies:

[Image Removed]Lastly, the renderthread seems to frequently hitch on CreateCommittedResource. In fact it seems to enlargen some kind of pool. We’re less familiar with this part of the code base so do you have any recommendations to mitigate this?

[Image Removed]

Svegn2 · April 29, 2025, 1:01pm

Hi Brenden,

Can you share the trace?

Martin

Svegn2 · April 30, 2025, 1:51pm

Hi Brenden,

The link doesn’t work. I’m getting an error saying the link is incorrect from drop box.

Martin

Svegn2 · April 30, 2025, 1:58pm

It works now! I’ll review it and come back with advices.

Are you planning on moving to newer releases? Version 5.6 will contain a lots of “streaming” related improvements. We have a list of the original CLs for some of those improvements for customers that are staying on 5.5. It should be possible to integrate some of the changes in 5.4 but it would require a bit more effort.

Svegn2 · April 30, 2025, 9:28pm

Version 5.5 introduces a new experimental feature that adds\removes the physic bodies from a worker thread. The feature was introduced with CL#35964445 and CL#36085421. There are likely other relevant CLs with tweaks\fixes to the system. It’s still not on by default in 5.6.

That being said, I’m not sure that it would really help with the amount of instances we are talking about. The deferred nature of the new code would like cause issues if the bodies are not added to the scene fast enough.

The worse case of RemoveFromWorld (4:50.2660) is removing 61 components for a total of 47,574 bodies. That is a lot of instances. The 2 biggest once have 11k+ instances. Do they really need collisions? Maybe there are better way to manage the same level of density like using runtime PCG. Can you share more details on the type of environment you are building? I could discuss the use case with our technical artists to see if they have suggestions.

I’m waiting from feedback from a few colleagues regarding the Chaos events and the creation of rendering resources.

Svegn2 · May 1, 2025, 1:56pm

I do agree that more\better time-slicing could be part of the solution for bodies. The current time-slicing code was tuned using the samples that where available at the time so it’s not totally surprising that you found some cases where they break. That can be a solution to get you to ship. Our current work is to move off lots of that work from the game thread and we are also working on aggressively reducing the number of Actors\Components. That should be ready when you need it in your next project. I would also recommend that you consider using runtime PCG for scattering smaller objects.

I did find some information on the CreateCommittedResource hitch. We have a private case that discusses lock contention when using Aftermath. Could it be your problem? The problematic CVar is r.GPUCrashDebugging.Aftermath.ResourceTracking. It is turned off by default so this might need further investigation.

The rendering team will want to know the type and size of the texture(s) that is taking long. I would suggest that you add some code to time the duration of CreateCommittedResource and put a breakpoint when it takes more than 10ms. Please open a new case with so it can be assigned to the proper people here.

anonymous-edc · April 29, 2025, 1:22pm

https://www.dropbox.com/scl/fi/3j727ndznlrbz64sgcdig/20250425\_225114\.utrace?rlkey\=k2i4mt25i3no16iwsqj8kulbs\&dl\=0

anonymous-edc · April 30, 2025, 1:54pm

Sorry about that. I edited the above link with the correct one.

anonymous-edc · April 30, 2025, 2:38pm

Right now its up for debate since engine transitions always result in a lot of bugs. We generally prefer to back-port where possible. Have you made improvements to chaos? Because chaos in general appears to be a huge part of the streaming bottlneck here. Frankly the timeslicing issues are easily managed on our end with some source code changes. I have no idea what to do about the create committed resource hitch though.

anonymous-edc · April 30, 2025, 9:37pm

We have small objects that can be picked up by the player. Sticks, rocks, crystals etc. The collision is just a simple sphere collider. That’s probably the biggest offender for so many bodies, otherwise its just normal foliage in the area, which would still likely contribute to a lot of bodies. It’s worth noting that we’ve built this kind of game before and used this kind of approach, but we were using physx and it was 4.21.

PhysX was much more performant when it came to creating physics bodies vs chaos, however for xbox we had to optimize it further. Our solution on clients was to intercept instanced static mesh physics body creation, store it in an octree and only create what is needed in an area around the player. We then timesliced the physics body creation to prevent too many being generated each frame.

Since we aren’t shipping on an old console like the original xbox one, I was hoping to skip the octree part. I’ll see if the background thread CL’s help. I think we can live with collision not being instantly created so long as the client can read that state and correct possible bugs (ie: teleporting to a new spot in the map and falling through the floor).

I think we can also mitigate a lot of this by simply timeslicing more. The streaming system should not be allowed to clean up 61 components with 47k bodies in a single frame. It should be breaking that up, but I have noticed when looking through streaming code that there are inner loops that don’t respect timelimits. For example UWorld::AddToWorld has a time limit on it’s loop which breaks out if the time limit is exceeded, but a single call within the loop can lead to several loops within AActor::IncrementalRegisterComponents which doesn’t pay attention to time limits at all.

Any thoughts on the create committed resource hitch?

anonymous-edc · May 1, 2025, 2:04pm

The current time slicing doesn’t really work unless you are dealing with single actors with limited components. Perhaps that was what it was originally tuned on, because if the code within the loop does several iterations without respecting the outer loop’s time limit, you end up with code that well overruns the time limit, completely defeating the purpose of its existence.

I don’t think runtime PCG is viable for us. The game is multiplayer so we’d have to sync all of this dynamically created content and there’d still be a lot of colliders generated, they’d just be limited to an area around the player. I would argue we’d get similar results by forcing those pickups to use a physics body creation octree, perhaps opting to make spatialized body creation an opt-in only feature and turning it on for these special cases.

Thanks for your advice, I’ll open another thread once I have more info on the create committed resource hitch. I can confirm that we’re not using that cvar.

anonymous-edc · May 17, 2025, 7:53pm

I back ported the changelist and it worked surprisingly great right out of the box with no issues. Performance is massively improved for streaming, but as soon as all those physics bodies are added, work associated with creating the acceleration structure appears to block the main thread significantly.

[Image Removed]

related question -- is there a engine-native way to track down number of bodies per mesh in a scene?

Svegn2 · May 20, 2025, 12:47pm

Hi Brenden,

There is a CVar that is used to force a full build when there is too much work. You might want to grow that number to keep time slicing going on. Check for AccelerationStructureTimeSlicingMaxQueueSizeBeforeForce in the Chaos code. The default is 1000 which is likely easily hit in your case. This will defer the creation of the internal structures so might cause other problems.

Martin

anonymous-edc · May 20, 2025, 1:55pm

Thanks is there a way to tune how many items in the queue are processed per tick? We cranked the number up to 100k but we’re still getting big blocks during streaming that are several seconds in length.

Is that time slicing supposed to affect these foreground tasks as well?

[Image Removed]

Also -- it doesn’t seem the change we integrated causes landscape collision to be created on a thread. Significant main thread time is spent creating landscape collision still:

[Image Removed]

Svegn2 · May 22, 2025, 9:44pm

Have you checked how many items are being processed when you get those multi-seconds calls to ComputeIntermediateSpatialAcceleration? You could try to disable the code that uses AccelerationStructureTimeSlicingMaxQueueSizeBeforeForce to prevent those “flushes”.

Regarding the Landscape heightfield, the support for async injection was added in CL 38662612

anonymous-edc · May 23, 2025, 2:52am

Hi Martin, I think I might need one more cl because PrimitiveComponent doesn’t implement AllowsAsyncPhysicsStateCreation() which LandscapeHeighfieldCollisionComponent wants to override.

Svegn2 · May 23, 2025, 12:50pm

I think it came with CL#37264207. We might be pulling on a very long series of changes. I’m starting to think that you should consider 5.6. There is the new FastGeo plugin that converts static Actors that are unreferenced into render\collision proxy which can dramatically reduce the number of UObjects (GC benefits).

anonymous-edc · May 23, 2025, 1:02pm

Thanks martin. An engine upgrade is non-trivial for us since we’ve made a number of source changes in other unrelated systems. It can be done but we’re close to a playtesting milestone and are a bit weary of the number of regressions we might face due to the many changes found in every engine upgrade. With that said we’ll certainly consider it. As for the fastgeo plugin, we kind of already have something like that, at cook time we pack static mesh components into isms if they are the same. This dramatically reduces component creation when streaming (though that doesn’t solve our physics problems, which has largely been the topic of this discussion).

I ran a simple test based on your suggestion like so:

[Image Removed]Unfortunately I’m still seeing lots of hitches that look like this:

[Image Removed] Are part of these tasks not currently timesliced at all? Or is there another case where flushes can occur that I’m not aware of?

Also we’re seeing some strange behaviour where sometimes world partition goes way over budget. We don’t have “block on slow streaming” checked and yet if we travel quickly through the world, it looks like world partition starts adding levels at a much greater rate regardless of being well over time:

[Image Removed]

Svegn2 · May 23, 2025, 4:55pm

For the Chaos part, look for FAABBTimeSliceCVars. This struct agregates 4 cvarsthat are defined and initialized at the end of AABBTree.cpp.

For the World Partition part, we compiled a list of cvars related to the 5.6 improvements. It does contain a CL related to AddToWorld which might help. I’m attaching a PDF containing those changes.

anonymous-edc · May 23, 2025, 7:20pm

Thank you this pdf is incredibly helpful. I have integrated a few changes already and will integrate a couple more and check back. As for FAABBTimeSliceCVars, I didn’t see anything related to blocking if overbudget. Those vars you called out are only used in HealdessChaosTestBroadphase.cpp so unless I’m looking at the wrong thing, I don’t think this accounts for why acceleration structure related hitches are still occurring.

[Image Removed]