Getting better performance out of UStaticMeshComponent (or: rendering lots of things)

I hope I’m posting this in the right subforum. Apologies if it needs to be moved! If it should be posted on AnswerHub instead, I can move it there. But it seems like more of an open-ended question, so I’m trying here first.

I would also like to apologize for the length of this post. I’m going to give as much background as possible, so that maybe someone can see where I’m coming from, and perhaps offer me some suggestions or advice.

A project I’m working on right now requires many small static mesh objects whose transforms and a per-object material parameter are animated. There are around 1000 to 2000 individually animated objects, depending on the scene. They do not all have unique meshes, but they are not all the same. There is no use of physics or overlap events for these objects. They are used entirely for visual effect.

I’m using UStaticMeshComponents to do this. However, I’m limited by CPU performance. My project needs to work well with HMD/Oculus VR, so it’s important that the game thread finishes as soon as possible, to give plenty of time for the GPU to render.

The transforms and material parameters for the meshes are calculated procedurally using a 3D motion graphics framework I have embedded into Blueprint. The pure Blueprint nodes emit a data structure which is consumed by a simple graph optimizer and turned into a flat set of vectorized operations over transforms and related data. The transform data itself is not actually passed through arrays in Blueprint. And because physics and game logic are not really being used, unlike most applications of UE4, the bottleneck is not occurring in physics or collision detection.

My initial limitation was with updating the transforms of all of the components. I found that MoveComponent (via SetRelativeTransform or related transformation update methods) was slow for this many objects.

My first thought was to switch to using UInstancedStaticMeshComponent. This would have been very nice, because then I would have been able to leverage hardware instancing, and also not worry about creating so many UObjects, which have a construction cost and must be exposed to the GC system. Objects with shared meshes could be grouped into being child instances on one of several UInstancedStaticMeshComponents. And each object typically only has 1 material parameter that needs to be animated, so the existing per-instance random value on instances could be leveraged to pass this data through to the material. Unfortunately, I quickly ran into a problem when trying to do this.
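The grouping I had in mind can be sketched roughly like this. This is a minimal standalone C++ illustration, not engine code: AnimatedObject, GroupByMesh and GatherValues are hypothetical names, and a real version would feed each batch to one UInstancedStaticMeshComponent instead of returning plain vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for one animated object; not an engine type.
struct AnimatedObject {
    int MeshId;          // identifies which shared mesh this object uses
    float AnimatedValue; // the single per-object material parameter
};

// One batch per distinct mesh: the indices of the objects that would
// become instances on the same instanced component.
using InstanceBatches = std::unordered_map<int, std::vector<std::size_t>>;

InstanceBatches GroupByMesh(const std::vector<AnimatedObject>& Objects) {
    InstanceBatches Batches;
    for (std::size_t i = 0; i < Objects.size(); ++i) {
        Batches[Objects[i].MeshId].push_back(i);
    }
    return Batches;
}

// Collect the per-instance parameter values for one batch, in instance order.
std::vector<float> GatherValues(const std::vector<AnimatedObject>& Objects,
                                const std::vector<std::size_t>& Batch) {
    std::vector<float> Values;
    Values.reserve(Batch.size());
    for (std::size_t Index : Batch) {
        assert(Index < Objects.size());
        Values.push_back(Objects[Index].AnimatedValue);
    }
    return Values;
}
```

With 1000 to 2000 objects sharing a handful of meshes, this would give a handful of batches rather than thousands of components.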

UInstancedStaticMesh instances do not properly render into the velocity buffer. When calculating the mesh instance’s transform of the previous frame for rendering into the velocity buffer, only the previous transform of the parent UInstancedStaticMeshComponent is considered. The previous frame’s transform of the individual instances is not stored anywhere. You can see this in LocalVertexFactory.usf (code is from 4.6, though these lines appear to be the same in the current promoted branch):



float4 VertexFactoryGetPreviousWorldPosition(FVertexFactoryInput Input, FVertexFactoryIntermediates Intermediates)
{
#if USE_INSTANCING
	float4x4 InstanceTransform = transpose(GetInstanceTransform(Input));
	return mul(mul(Input.Position, InstanceTransform), PreviousLocalToWorld);
...


This means that the animated instances would not have motion blur when HMD is disabled. More importantly, they would not have correct temporal AA in either HMD or traditional displays. The artifacting is quite noticeable.

I started to add this feature myself, but stopped when I realized I had no idea how FVertexFactoryInput and FPositionOnlyVertexFactoryInput actually mapped their data to the shader via RHI, and anything I touched would probably just break, and I had no idea how to debug it.

I also realized that nearly every person who uses instanced static meshes in UE4 will be using them for foliage and other things which will never have their transforms animated. Nearly doubling the memory requirement on instances for a feature that most people would not use would make it unlikely that any patch I wrote would be accepted into the engine, so I would either need to write an entirely separate code and shader path for animatable instances, or maintain my own fork of the engine.
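To show where the extra memory comes from, here is a minimal standalone C++ sketch of keeping a previous-frame copy per instance. It is not engine code and makes simplifying assumptions: InstanceHistory and Vec3 are hypothetical names, and only positions are stored, where a real implementation would keep full instance transforms and read them in the vertex factory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { float X, Y, Z; };

// Hypothetical per-instance history: current and previous positions.
class InstanceHistory {
public:
    explicit InstanceHistory(std::size_t Count)
        : Current(Count, Vec3{0.0f, 0.0f, 0.0f}),
          Previous(Count, Vec3{0.0f, 0.0f, 0.0f}) {}

    // Call once per frame before writing the new positions.
    void BeginFrame() { Previous = Current; }

    void SetPosition(std::size_t i, Vec3 P) {
        assert(i < Current.size());
        Current[i] = P;
    }

    // Per-frame displacement of one instance (current minus previous).
    Vec3 FrameDelta(std::size_t i) const {
        assert(i < Current.size());
        return Vec3{Current[i].X - Previous[i].X,
                    Current[i].Y - Previous[i].Y,
                    Current[i].Z - Previous[i].Z};
    }

    // Two copies of the per-instance data: twice the storage of one.
    std::size_t BytesUsed() const {
        return (Current.size() + Previous.size()) * sizeof(Vec3);
    }

private:
    std::vector<Vec3> Current;
    std::vector<Vec3> Previous;
};
```

The second array is exactly the cost that static foliage users would pay for nothing.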

Writing a separate path is beyond my ability and current understanding of UE4. And, after looking at the commits on the newer branches on Github, I saw that this code was in a state of change. The system was being reworked to support even higher performance for foliage, and anything I tried to modify myself seemed like it would soon conflict with upstream changes.

I considered my options, and decided that trying to get better performance out of UStaticMeshComponent was a better idea. However, I first needed to improve the performance of actually moving the components around.

To remedy this, I first restructured my code that calculates the positions for these objects to always work in world space. (Actually, it is a bit more nuanced than that, but let’s assume it is this way for the purposes of explanation.) I then worked with the assumption that the UStaticMeshComponents that are being updated would always have absolute translation and rotation, so I made my code set those flags to true. For my special case, I do not need to worry about these components having child transforms that need to be updated. I also don’t need to worry about occlusion culling, as occlusion is mostly managed by hand in this project, especially for these groups of high-count dynamic objects. (The entire group of objects is hidden and updating logic is deactivated through scripted events.) After profiling and eliminating procedures that didn’t seem to need to be performed in my special case, I ended up with code that was simplified into something like:



...
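for (int32 i = 0; ...)
{
	// Write the precomputed world-space transform directly, bypassing MoveComponent.
	Components[i]->ComponentToWorld = CalculatedTransforms[i];
	// Reuse the bounds of the whole group instead of computing per-mesh bounds.
	Components[i]->Bounds = ParentBounds;
	// Tell the renderer that this component's transform changed.
	Components[i]->MarkRenderTransformDirty();
}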
...


This seems to work fine. I might lose performance and quality if shadow casting were enabled on these meshes (because proper bounds are not calculated for each mesh individually), but shadows are currently not used for these types of objects in this project. Updating the transforms in this way is faster than going through SetRelativeTransform or SetWorldTransform, which invoke MoveComponent and take a significant amount of time when multiplied by the number of objects being moved.

By the way, I would like to point out that UE4’s CPU profiling system (via the scoped cycle counter macros) is excellent, among the best I’ve ever used.

The main bottleneck was no longer with moving the components or anything related to transforms. After tweaking loops and optimizing a few more things, I reached the point where UMaterialInstanceDynamic::SetScalarParameterValue was occupying about 30% to 40% of the game thread tick time. This method is being called once per tick on each animated object. Unfortunately, I can’t seem to find a way to speed this up or bypass unnecessary operations, and I am still over my CPU performance budget for anything but very high-end consumer CPUs.

(Additionally, UpdatePrimitiveTransform (GT) in Post Tick Component Update also uses up a bit of time, though significantly less than SetScalarParameterValue.)

In reality, it takes only a few milliseconds and causes no problems on a mid-range CPU when used with a traditional display, but the tight latency requirements of Oculus/HMD mean that there is not enough breathing room for the frame to finish. The GPU often ends up being underutilized or fails to hit 75 Hz.

I need to further improve the performance of getting the animated material parameter into the materials, but I’m not sure what to do. One idea would be to pass the scalar value to the objects’ material via a texture that is modified on the game thread. However, this requires the artists to set up the materials properly with a texture input and material function that fetches the right value from it. It’s doable, but suboptimal.
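Here is a rough standalone C++ sketch of the CPU side of the texture idea, assuming one float texel per object in a row-major layout. It is not engine code: ParameterTexture, MakeParameterTexture, SetValue and TexelCenterUV are hypothetical names, and uploading the buffer to an actual texture each tick is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU-side buffer holding one scalar per object, row-major.
struct ParameterTexture {
    std::size_t Width = 0;
    std::size_t Height = 0;
    std::vector<float> Texels;
};

// Size the buffer so that every object gets its own texel.
ParameterTexture MakeParameterTexture(std::size_t ObjectCount, std::size_t Width) {
    assert(Width > 0);
    ParameterTexture T;
    T.Width = Width;
    T.Height = (ObjectCount + Width - 1) / Width; // round up to whole rows
    T.Texels.assign(T.Width * T.Height, 0.0f);
    return T;
}

// Written once per object per tick on the game thread.
void SetValue(ParameterTexture& T, std::size_t ObjectIndex, float Value) {
    assert(ObjectIndex < T.Texels.size());
    T.Texels[ObjectIndex] = Value;
}

struct UV { float U, V; };

// The coordinate a material function would sample to fetch this object's
// value: the center of the object's texel.
UV TexelCenterUV(const ParameterTexture& T, std::size_t ObjectIndex) {
    assert(ObjectIndex < T.Texels.size());
    const float X = static_cast<float>(ObjectIndex % T.Width);
    const float Y = static_cast<float>(ObjectIndex / T.Width);
    return UV{(X + 0.5f) / static_cast<float>(T.Width),
              (Y + 0.5f) / static_cast<float>(T.Height)};
}
```

For 2000 objects and a width of 64 this gives a 64 by 32 buffer. Each object would still need its own constant UV (or index) passed to its material once, which is the setup burden on the artists mentioned above.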

In theory, because each static mesh component is already being done in a separate draw call, I should just be able to get the parameter into the material without all of this overhead. But a lot of time is being spent on UMaterialInstanceDynamic::SetScalarParameterValue, and I’m not sure how I can get around it.

I also have problems with the amount of time it takes to construct new UStaticMeshComponents. It takes a lot longer than I thought it would. It’s normally not a problem, but it can cause CPU spikes and possible hitches if a lot need to be created all at once. I could use pooling to avoid this, but it would be nice if meshes could somehow be rendered properly without this overhead.
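The pooling fallback would look something like the following minimal standalone C++ sketch. It is not engine code: PooledMesh stands in for a UStaticMeshComponent, and MeshPool, Acquire and Release are hypothetical names. A real version would also have to swap the mesh and material on reuse and keep the components registered.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a component that is expensive to construct.
struct PooledMesh {
    bool bVisible = false;
    int MeshId = -1;
};

class MeshPool {
public:
    // Pay the construction cost up front, for example during a loading screen.
    explicit MeshPool(std::size_t Prewarm) {
        for (std::size_t i = 0; i < Prewarm; ++i) {
            Grow();
        }
    }

    // Reuse a free object if one exists; construct a new one only as a fallback.
    PooledMesh* Acquire(int MeshId) {
        if (Free.empty()) {
            Grow();
        }
        PooledMesh* Mesh = Free.back();
        Free.pop_back();
        Mesh->MeshId = MeshId;
        Mesh->bVisible = true;
        return Mesh;
    }

    // Hide the object and make it available again instead of destroying it.
    void Release(PooledMesh* Mesh) {
        assert(Mesh != nullptr);
        Mesh->bVisible = false;
        Free.push_back(Mesh);
    }

    std::size_t TotalConstructed() const { return Storage.size(); }

private:
    void Grow() {
        Storage.push_back(std::make_unique<PooledMesh>());
        Free.push_back(Storage.back().get());
    }

    std::vector<std::unique_ptr<PooledMesh>> Storage; // owns every object
    std::vector<PooledMesh*> Free;                    // currently unused objects
};
```

This only moves the construction cost to a time when a spike is acceptable. It does not remove the per-component overhead itself.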

Here’s a summary of my current bottlenecks:

  1. UMaterialInstanceDynamic::SetScalarParameterValue needs to be called many times to set the animated parameter on many meshes. Each mesh has its own unique animated value, so using a material parameter collection is not an option. Roughly 30% to 40% of the game thread tick time is being spent on these calls.
  2. Constructing UStaticMeshComponents is a bit slow.
  3. I’m losing about 1ms on UpdatePrimitiveTransform (GT), but I don’t know if it’s possible to avoid that.

Does anyone have any suggestions or advice? I’m willing to try anything. If it involves changing the engine or adding new code, I would be more than happy to submit any code changes or additions upstream to UE4, if I’m offered a little guidance from anyone at Epic.

Thanks for reading!

Hi, I’m having the same problem. I’m wondering if you found a way to make this better.

the year is 2077 and I still have questions whether you ever found a solution to this?