Why are skeletal meshes so extremely expensive?

In old games like Age Of Mythology or The Lord of the Rings strategy games, there was no harm to performance when 300-1000 skeletal models were on screen at once. It worked just fine on old PCs, but now in Unreal, if we place just 100 skeletal meshes, we get a huge performance hit. Why is that?

It’s not only the skeletal mesh, it’s the BP you use for the skeletal mesh.

You have to use tick group aggregation and animation sharing to get anything decent performance-wise.

And even then, the skeletal mesh component may need to be partially rewritten for optimization.
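
Before rewriting anything, it's worth flipping the built-in switches first. A minimal sketch of the usual per-component knobs - update-rate optimization and visibility-based ticking - with names as they appear in recent engine versions (check yours):

```cpp
#include "Components/SkeletalMeshComponent.h"

// Sketch: cheapen a skeletal mesh component's per-frame animation work.
void ConfigureCheapAnimation(USkeletalMeshComponent* Mesh)
{
    // Skip pose evaluation entirely while the mesh is off-screen.
    Mesh->VisibilityBasedAnimTickOption =
        EVisibilityBasedAnimTickOption::OnlyTickPoseWhenRendered;

    // Update-rate optimization: distant/small meshes evaluate animation
    // every Nth frame and interpolate in between.
    Mesh->bEnableUpdateRateOptimizations = true;

    // Use fixed bounds instead of recomputing them from the pose each frame.
    Mesh->bComponentUseFixedSkelBounds = true;
}
```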

Most projects with billions of things (like fish, or armies) end up writing custom mesh code and animating via shader.

I agree with the previous post. I've noticed a lot of it depends on how dense the triangles on the mesh are. If your character has 250k tris, it's going to be expensive. I imported one of the high-poly MetaHumans into a BLANK project, and because it had hair strands my PC could barely render that one character. I turned the hair off and made him bald, which helped tremendously, but it just looked stupid.

I was able to put 100 characters on screen at roughly 25k tris each, and it only dipped performance marginally, maybe 20%. Those were instances though, so I was only getting the draw calls from the single character blueprint; the 100 characters barely had any CPU draw cost whatsoever, but your graphics card still gets bound when trying to render all of them.

The GPU hits its limits when calculating vertex positions for all the models, not when rendering even 10 materials times 100 characters as separate draw calls.

Sure, it's not in any way performance-smart, but 1,000 draw calls is peanuts for any rendering engine…

So you're saying it's the vertex positioning that's creating the bind on the GPU? I ran some performance tests on my experiment above (the 100 characters at 25k tris) using stat RHI, and it showed the majority of the burden was on texture rendering, albeit still very low. It's pretty generalized and doesn't quite say what it's rendering; the stat window shows the triangle count on screen, but I don't recall whether it gave any performance data on that specifically. I haven't cranked it past 5,000,000 triangles on screen, but that hardly makes the engine budge with a decent GPU.

There are some hard limits there, dependent on the GPU.
The engine will do things internally - like simply not rendering a whole mesh - if the allowance is exceeded. And yes, it is tied to vertex count because of the operation load required to render it. More vertices = more math.
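
To put rough numbers on that (an illustrative back-of-envelope, assuming the typical four bone influences per vertex): 100 characters × 25,000 vertices × 4 influences ≈ 10 million bone-matrix blends per frame for skinning alone, before a single pixel is shaded.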

To be clear on this:
No, it's not “JUST” that. It's a combination of factors.
You are rendering a bunch of separate (or instanced) objects. The draw-call total is one concern.
Each mesh you render has its own tick event - that's another concern.
Each rendered mesh has its own mesh deformation (animation or whatnot) - that's yet another concern.

Math-wise on the GPU, these costs add up and eventually drag performance down.
HOWEVER
Most of those costs are handled by, and affect, the CPU, so you are more likely to be CPU bound than GPU bound.

It’s important to look at “how” things are being done.

If you just add “character” class instances around, you are more likely to be CPU bound than GPU bound.
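
A quick way to check which side you're actually on is the standard stat commands at the console:

```
stat unit            (Frame, Game, Draw and GPU times; whichever tracks Frame is your bottleneck)
stat anim            (CPU cost of animation evaluation)
stat scenerendering  (draw call and primitive counts)
```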

Thanks for the info. It's always been an experiment to see what's actually going on by adding things in and watching what happens under the hood, GPU- or CPU-wise. It seemed like certain things I instanced produced fewer draw calls, but my GPU usage would skyrocket when there were many of them. It's been hard to pinpoint such things when you throw shadows or volumetric fog in and you're working in the editor. I've run performance checks in standalone thinking I was about on par with my target FPS, then packaged and been delighted to find it actually performed even better than I anticipated. Which is good, but it still leaves me wondering how much the editor was eating in the background.

A lot.
And most of the time, while you're working, you have tabs upon tabs and windows upon windows open, all of which add to the rendering load in PIE.

That said, this engine is absolutely, 100%, the worst-performing one on the market.
Sure, it can “look pretty”, but it comes at an average cost of 60 fps versus all the competitors at 4k native.

So far, the engine's official response to this has been “suck it”, and “but we implemented an internal method so you never have to render at 4k native”.
Nothing more than excuses.

And I say this from testing on a 3090 as well… it's just bad. Period.

I have written a 3D engine that runs consistently at 240 Hz at 4k resolution on a GTX 1080.

It also: looks terrible; the art path is laughable; and nobody but me knows how to use it.

Win some, lose some! :slight_smile:

It is what it is.
It is what it is.
But if you make a project out of it that runs at that speed with a decent, PBR-like, realistic feel, you'd probably make a big pile of bucks (you also have no legal constraints of any type on it, so I'm all for that too).
It's true, however, that making the art pipeline less of a hassle is demanding on the render side.

I think whatever you made is still less laughable than the last 2 years of Epic’s development… :stuck_out_tongue:

And to be perfectly clear - this stems from wanting to do EVERYTHING - and not being able to do the most basic things, like finish the features they started working on 4 years ago…

I would hope - and ask - for more teaching material on “best practices” for using Unreal Engine. As an IT professional without formal training in this field, and a non-native English speaker as well, it is very hard for me to find out what impact the concepts I find at random and estimate to be good will actually have on performance.

@VladimirDBVY

Unreal's anim system is well optimized while supporting lots of advanced animation features that those two games never had. Obviously a more robust system runs slower, and further optimization isn't possible without “cutting” the nice features Unreal's anim system has. You shouldn't compare eggs to oranges; though both are spherical in shape, they are very different things indeed.

If you wish to animate thousands of models in a single frame, you must cut the Unreal animation system down significantly, or better yet implement your own simple custom animation system, which will have a small feature set but render your armies far faster. Studios often do the latter, replacing a few key systems in game engines with their own custom-designed solutions in order to produce their amazing games. That being said, Unreal is likely capable of rendering your large armies fine; you just have to know the systems involved very well, so you know what to optimize for your custom uses.
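
As an illustration of the “tiny custom system” idea (a sketch, not a drop-in solution): UPoseableMeshComponent skips the anim graph entirely and lets you write bone transforms yourself, so you pay for skinning plus only the math you actually do. The bone name below is hypothetical:

```cpp
#include "CoreMinimal.h"
#include "Components/PoseableMeshComponent.h"

// Sketch: drive one bone directly, no AnimBlueprint involved.
void UpdateSimpleSwing(UPoseableMeshComponent* Mesh, float TimeSeconds)
{
    const FName Bone(TEXT("upperarm_r")); // hypothetical bone name
    const FRotator Swing(0.f, 0.f, 30.f * FMath::Sin(TimeSeconds));
    Mesh->SetBoneRotationByName(Bone, Swing, EBoneSpaces::ComponentSpace);
}
```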

@GeorgeVD You can learn about engine systems by browsing and reading the Unreal Engine source code. If you have Visual Studio open, just travel down the rabbit hole by examining the stack; you can figure out which code drives which portion of the engine, which will eventually give you a sense of which design approach will perform better than another. Also, you can output tons of metrics on screen (using stat commands), and torture tests can give you hints about which limitations you may run into. This always takes time; after all, there's a lot to learn.

Can you please share more about that? So, basically, if we don't use an anim BP, will it be highly performant? I tested this, and even without an anim BP it's still not comparable to using a static mesh, for example.

How exactly can we “cut” features to reduce costs? For example, if I only do simple things in the anim BP (changing the position of bones), how can I cut all the other stuff? Do I need to create an inherited C++ anim class and override some functionality? Or do I need to create my own C++ skeletal mesh and anim system from scratch?
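
For the “use only one anim” case, one cheap option (a sketch of a standard engine facility, not necessarily what the posters here had in mind) is to skip the AnimBlueprint entirely and play a single asset through the component, which uses a lightweight single-node instance instead of a full anim graph:

```cpp
#include "Components/SkeletalMeshComponent.h"
#include "Animation/AnimSequence.h"

// Sketch: loop one animation with no AnimBlueprint at all.
void PlayOneAnim(USkeletalMeshComponent* Mesh, UAnimSequence* LoopAnim)
{
    Mesh->SetAnimationMode(EAnimationMode::AnimationSingleNode);
    Mesh->PlayAnimation(LoopAnim, /*bLooping=*/true);
}
```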

“high performance” is not well defined.

You have some particular goals for animation – 1000 animated characters on screen at 120 Hz, or 10 characters where each strand of hair is physically simulated at 30 Hz, or 10,000 fragments of rock flying through the air when a building explodes. The only way to know what “high performance” means for you is to build the thing (or something close enough) and measure it. Measure it in release mode, on your target hardware. Use real low-level profiling tools. Anything from NSight to VTune can help.

Games generally deal with “many” of everything, which means that “batch” is the way to go wherever at all possible. This means that things like “data structure cache coherency” are more important than, say, “flexible object decomposition.” If your previous experience is with “how to write code that's easy to add features to,” or “how to write code that serves a database web request with acceptable performance,” then many of the best practices you've used before may end up going against game development needs.
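
A toy illustration of the cache-coherency point (not engine code, just the shape of the idea):

```cpp
#include "CoreMinimal.h"

// Array-of-structs: each update hops across large, padded objects.
struct FAgent { FVector Pos; FVector Vel; /* ... lots of cold data ... */ };

// Struct-of-arrays: the hot loop streams through tightly packed memory.
struct FAgentBatch
{
    TArray<FVector> Pos;
    TArray<FVector> Vel;

    void Integrate(float Dt)
    {
        for (int32 i = 0; i < Pos.Num(); ++i)
        {
            Pos[i] += Vel[i] * Dt; // contiguous reads/writes, cache friendly
        }
    }
};
```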

Note that this is not Unreal-specific – this is a property of what it means to “develop games” as opposed to, say, “develop car window control software” or “develop music-playing web pages” or whatever.

The goal is, e.g., 1000 characters on one screen with different skeletal animations (non-shader-driven) in real time. The PC's specs don't matter (it will be laggy on any PC).

It's enough to put in 10 skeletal meshes (e.g. the default mannequin character) and I see the cost in the profiler; if I put in 10 static meshes, I don't. So that's the point: why is it so expensive, and how can I make it less expensive when the animation requirements are very simple (e.g. just moving one bone, or using only one anim)?

I already told you precisely why the cost is elevated.

Do research on how to do tick aggregation and animation sharing.

There is literally no way around it.
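
For the simplest form of animation sharing, the engine already has leader/master pose: several skinned components follow one evaluated pose, so only the driver runs the anim graph. A sketch (this is the in-actor form used for modular characters; the Animation Sharing plugin generalizes the idea across many actors, and UE5 renames the call to SetLeaderPoseComponent):

```cpp
#include "Components/SkeletalMeshComponent.h"

// Sketch: followers copy the driver's bones instead of evaluating
// their own animation.
void ShareAnimation(USkeletalMeshComponent* Driver,
                    const TArray<USkeletalMeshComponent*>& Followers)
{
    for (USkeletalMeshComponent* Follower : Followers)
    {
        Follower->SetMasterPoseComponent(Driver);
    }
}
```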

@VladimirDBVY :rofl::rofl::rofl: Should I ask someone who already knows how, to teach me to drive a car? Or should I take lessons in thermodynamics, applied physics and law to figure out from the bits and pieces how it probably works?

Note that Unreal could be more performant on the render/skinning side by batching on a per-skeleton/mesh-pair basis. Correct me if I'm wrong, but for now there are some GPU calls for each mesh of each character being rendered; these could be put in a buffer and “batched” per skeleton/geometry mesh, saving a lot of calls when there are many characters, or many meshes per character. The skeleton positions can be batch-buffered to render several characters at once efficiently.

That's what we do for our crowd plugin under another DCC, where we had to write our own display. We have our own animation node that provides animation for our crowd, and we feed it either to SkeletalMesh actors or to StaticMesh actors (far more efficient if the meshes are rigid, as they are batched in that case). In the other case, such a batched implementation would require rewriting a lot of code around SkeletalRenderGPUSkin and GPUSkinVertexFactory to handle such batches and build them automatically.

Not sure you have that right -
I wouldn't expect more than one draw call per bone.
And even at that, I would imagine it is batched to some extent.
Otherwise 10 animated meshes with 256 bones each would already compromise the rendering pipeline.

I think that's actually above the maximum allowed bone count; it may just be 128.
Same difference, though.

In either case, that's where LODs come into play for performance reduction.
At a distance you'll never need to render/animate the fingers of a character, or the facial setup, so you can reduce the load on the system quite a lot that way too…
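
If the automatic LOD selection isn't aggressive enough for a crowd, you can also clamp it yourself. A sketch (SetForcedLOD takes a 1-based index; 0 restores automatic selection):

```cpp
#include "Components/SkeletalMeshComponent.h"

// Sketch: lock background crowd members to a cheap LOD so neither the
// vertex count nor the animated bone set stays at hero quality.
void ForceCrowdLOD(USkeletalMeshComponent* Mesh, int32 LodIndex)
{
    Mesh->SetForcedLOD(LodIndex + 1); // e.g. LodIndex 2 -> LOD2
}
```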

The best way to test it is to throw the default skeletal mesh, which has 2 bones, into an empty level, and use the stat commands to see how the draw call count changes when you duplicate it…