Slow FLumenSceneData::ResetAndConsolidate() due to reallocations

Hello,

While profiling CPU performance on Windows, I found that sometimes `FLumenSceneData::ResetAndConsolidate()` takes a considerable amount of time (up to 24 ms).

The problem lies within `Cards.Consolidate()`, which shrinks the `Cards` array by a small amount (e.g. from 182969 elements to 182963). This involves reallocating a big chunk of memory (67 MB in this case) and copying it to a new location with memcpy(). This operation is very slow on Windows (~20 ms) because new pages are committed on demand, and the subsequent VirtualFree() call takes another ~4 ms, during which it blocks any other thread that wants to call VirtualFree() at the same time.

Is there any specific reason for repeated reallocations of that array?

I also found that the `FLumenCard` structure layout is not optimal: it spends 22 extra bytes on padding, which wastes RAM and slows down memcpy().
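To illustrate the padding point, here is a minimal sketch using two hypothetical structs (they are not the real `FLumenCard`, and the member names are made up). It only shows how ordering members by descending alignment can remove padding. The exact sizes depend on the compiler and ABI.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a card-like struct; NOT the real FLumenCard.
// Small members interleaved with 8-byte members force alignment padding.
struct FCardInterleaved
{
    bool    bVisible;      // 1 byte, typically followed by 7 bytes of padding
    double  Distance;      // 8 bytes, 8-byte aligned on common 64-bit ABIs
    uint8_t AxisIndex;     // 1 byte, typically followed by 3 bytes of padding
    int32_t MeshIndex;     // 4 bytes
    bool    bHeightfield;  // 1 byte, typically followed by 7 bytes of padding
    double  Resolution;    // 8 bytes
};

// Same members, ordered from largest to smallest alignment.
struct FCardSorted
{
    double  Distance;
    double  Resolution;
    int32_t MeshIndex;
    uint8_t AxisIndex;
    bool    bVisible;
    bool    bHeightfield;  // only tail padding remains after this member
};

// Bytes saved per array element by reordering (ABI dependent).
constexpr std::size_t PaddingSavedPerElement()
{
    return sizeof(FCardInterleaved) - sizeof(FCardSorted);
}

// Reordering by descending alignment should never make the struct larger.
static_assert(sizeof(FCardSorted) <= sizeof(FCardInterleaved),
              "sorted layout should not be larger than the interleaved one");
```

On common 64-bit ABIs the interleaved layout is 40 bytes and the sorted one is 24 bytes. Across an array of roughly 180000 elements, every byte saved per element also shortens each reallocation copy.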

Hi there!

Thanks for the feedback! We recently made some optimizations in this area that will likely affect these numbers:

CL#42910041 (6ce04d2) [Lumen scene update] CPU perf optimizations for UpdateLumenScenePrimitives

- Primitive groups that are not associated with ray tracing groups are now removed immediately instead of being deferred and storing their indices into a TSparseUniqueList. The cost of adding elements to the list in a tight loop becomes quite prohibitive when there is a lot to remove due to reallocating large containers.

- TSparseSpanArray now doesn’t shrink unless doing so reduces allocated size by more than half. Reallocating the Elements array from a large one into another fairly large one is very costly. This reduces some 10+ ms spikes (on a task thread) to less than 1/3 of the original cost in a test project.
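The shrink rule described above can be sketched as follows. This is not the actual `TSparseSpanArray` code: `ShouldShrink` and `ConsolidateWithHysteresis` are hypothetical names, and `std::vector` stands in for the engine container.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical policy helper: shrink only when doing so would release
// more than half of the currently allocated elements.
inline bool ShouldShrink(std::size_t AllocatedNum, std::size_t RequiredNum)
{
    return RequiredNum * 2 < AllocatedNum;
}

// Hypothetical consolidate step on a std::vector.
// Returns true if a shrink was requested, false if the allocation was kept.
template <typename T>
bool ConsolidateWithHysteresis(std::vector<T>& Elements)
{
    if (!ShouldShrink(Elements.capacity(), Elements.size()))
    {
        // Keep the existing allocation: no reallocation and no copy.
        return false;
    }

    // shrink_to_fit is a non-binding request in the C++ standard, but common
    // implementations reallocate to a tighter buffer here.
    Elements.shrink_to_fit();
    return true;
}
```

With this rule, the case from the original report (182969 elements shrinking to 182963) keeps its allocation, so there is no 67 MB copy. The trade-off is that up to half of the allocation can stay unused until the array shrinks further.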

CL#43003298 (736f4e2) [Lumen scene update] CPU perf optimizations for UpdateLumenScenePrimitives

- Adds a chunked sparse array implementation, TChunkedSparseArray, and uses it for FLumenSceneData::PrimitiveGroups. Compared to TSparseSpanArray, it has better performance overall but doesn’t support span allocation/free, which is not needed for PrimitiveGroups.

- Compacts FLumenPrimitiveGroup from 72 to 64 bytes. This helps in the situation where a level has a lot of instances (primitive groups). It reduces reallocation cost and cuts down memory usage.

- GetCustomId is now called only once for all the instances of a primitive.

- Reserves memory upfront for several transient arrays used in UpdateSurfaceCachePrimitives to avoid reallocations.

- Changes the sort for MeshCardsAdds to a partial sort since we only care about the first MeshCardsToAddPerFrame elements.

- In a test project, reduces the cost of the worst spike caused by PrimitiveGroups reallocation from ~10 ms to ~3 ms, and generally reduces UpdateLumenScenePrimitives spikes. Also reduces the average cost of the BeginUpdateLumenSceneTasks scope by ~30 us.
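As an illustration of the partial sort item in the list above, here is a minimal standard C++ sketch. It is not the engine code: `FMeshCardsAddSketch`, its fields, and the distance-based ordering are assumptions made up for the example. It also reserves a transient array upfront, in the spirit of the reservation item.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical element type; the real MeshCardsAdds entries may differ.
struct FMeshCardsAddSketch
{
    int   PrimitiveGroupIndex;
    float DistanceSquared;  // assumed sort key, for illustration only
};

// Orders only the first MaxToAddPerFrame elements (closest first).
// The remaining elements are left in unspecified order, which is fine
// when only the front of the array is consumed this frame.
inline void SortClosestFirst(std::vector<FMeshCardsAddSketch>& Adds,
                             std::size_t MaxToAddPerFrame)
{
    const std::size_t NumToSort = std::min(MaxToAddPerFrame, Adds.size());
    std::partial_sort(
        Adds.begin(),
        Adds.begin() + static_cast<std::ptrdiff_t>(NumToSort),
        Adds.end(),
        [](const FMeshCardsAddSketch& A, const FMeshCardsAddSketch& B)
        {
            return A.DistanceSquared < B.DistanceSquared;
        });
}

// Builds a transient candidate array, reserving once to avoid regrowth.
inline std::vector<FMeshCardsAddSketch> BuildCandidates(const std::vector<float>& Distances)
{
    std::vector<FMeshCardsAddSketch> Adds;
    Adds.reserve(Distances.size());  // single allocation upfront
    for (std::size_t Index = 0; Index < Distances.size(); ++Index)
    {
        Adds.push_back({static_cast<int>(Index), Distances[Index]});
    }
    return Adds;
}
```

For N candidates and K elements per frame, `std::partial_sort` does roughly O(N log K) comparisons instead of O(N log N) for a full sort, which matters when N is large and K is small.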

The FLumenCard members and layout have also changed, but I’ve passed along your feedback about optimizing the structure to the team.

Also, as a reminder, Epic is on holiday break from 6/30 - 7/11, returning on 7/14. Confidential issues will be unanswered during that time and responses to non-confidential issues may be slow.