C++ Performance Optimization (of likely negligible importance)

Here’s another question I’ll likely never see answered.

In Unreal, what is better?

Storing a socket reference and querying the transform property 3 times (one per vector dimension) each tick,

Or storing and working off the vector, so it’s one call to update the variable, then accessing it directly via X, Y, Z?

Or storing X, Y, and Z as separate floats?
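
For concreteness, something like this is what I mean - Mesh and "Muzzle" are made-up placeholder names, and GetSocketLocation is just the accessor I’d reach for here, so substitute whatever you actually call:

// Option 1: query the socket each time an axis is needed (three lookups per tick).
const float X1 = Mesh->GetSocketLocation(TEXT("Muzzle")).X;
const float Y1 = Mesh->GetSocketLocation(TEXT("Muzzle")).Y;
const float Z1 = Mesh->GetSocketLocation(TEXT("Muzzle")).Z;

// Option 2: query once per tick, keep the FVector, and read X/Y/Z off the copy.
const FVector Loc = Mesh->GetSocketLocation(TEXT("Muzzle"));
const float X2 = Loc.X;
const float Y2 = Loc.Y;
const float Z2 = Loc.Z;

// Option 3: split that copy into three separate floats (same data, just stored separately).
float SeparateX = Loc.X;
float SeparateY = Loc.Y;
float SeparateZ = Loc.Z;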

It’s the silliest example I can come up with for what the real issue is…
Which is the fact that UE4 C++ has absolutely no established best practices and varies wildly between template samples - or even within engine code.

Basically this is almost entirely dependent on what the engine does under the hood.

In my experience, storing the vector and accessing X/Y/Z once per tick should be faster than querying a socket reference for the location 3 times.
However, that may in fact not be true for every engine build.

Another silly example, and perhaps a common mistake made all over, would be instantiating a variable once every tick.

The engine seems to be OK with it - no red flags to be seen, at least. Garbage collection seems OK too…
But is it possible that it is in fact overloading something, or will cause issues down the line?
And yeah, this goes more to dev work vs. proper code cleanup / following a style, which is why it isn’t my main example.
Nor really an issue.

Curious to see if any other devs have factual stats on the millisecond execution times of functions…
And how to pull those out from the engine side would also be helpful… though I doubt that exists…

Well I don’t do C++ but I’d say storing the vector as one variable is better. Don’t take my word for it, I’ve been doing this for maybe 6 weeks and only done blueprints.

Maybe your questions are not being answered because you are not formulating them very precisely.

A transform consists of more than just one FVector; I assume you are talking about the translation.

I guess you are asking if it’s better to save a temporary vector and access X, Y, Z from that instead of doing something like Ref->GetTransform().GetLocation().X, Ref->GetTransform().GetLocation().Y…
You are not specifying which exact function you are referring to, so again I can only guess, but if the GetTransform function you are using is just an accessor, it will probably be inlined and there should be little difference between the two. If it returns a reference to the transform, you can even skip copying the vector and there will probably be no difference at all. If GetTransform does more than that (like taking in a name and searching for a specific socket), the first option will most likely be more performant. In general, fewer instructions is better in many cases. If you really want to be sure you have to measure, or at least take a look at the generated assembly. However, as you have pointed out yourself, it is highly unlikely that this will give you any tangible performance benefit.
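
To make those cases concrete, a small sketch - Ref, Mesh, and DoSomethingWith are stand-ins for your own code, and whether your GetTransform returns a const reference is exactly the assumption being discussed:

// Case: the accessor returns a const reference - hold the reference, no copy at all.
const FTransform& Xform = Ref->GetTransform();
DoSomethingWith(Xform.GetLocation());          // FTransform::GetLocation() hands back the translation

// Case: the call does real work (e.g. a socket lookup by name) - cache the result once.
const FVector Cached = Mesh->GetSocketLocation(TEXT("SomeSocket"));
DoSomethingWith(Cached);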

These are general C++ questions, and the generated instructions in the end depend on which compiler you are using; it is not specific to Unreal. Many different engineers work on the source code, all have their own style, and some may be more performance-conscious than others.

In some cases, but since the source code is public you can easily take a look under the hood when you need to.

When it comes to optimization you should only ever care about the shipping build.

What are you talking about? A variable of what type? Allocated how? Everything you wrote after this made very little sense.

You are correct in that I wasn’t very clear. On purpose.

This doesn’t extend to a specific problem I need solved, but to general engine performance.

And while, yes, you can look at the source, the source won’t tell you ■■■■.
It’s about how long a call stack takes to execute - which, as far as I have seen, there is no way to debug in-engine where it matters.

For instance:
GetSocket could take .002 ms to return while GetBone takes .001, making it much better to dump sockets and just use bones.
Stupid example, but it’s something that, if the engine were able to return it (or if I could find where the timing for the stack is reported, since it’s not in the stack view), would help dial things in.

I guess I could build the engine from source and pull an accurate debug stack with timings by using breakpoints, but those should really just be made available within the debugging tools, as it’s something pretty common.

Re the last bit.
It doesn’t matter.
You can do whatever you want on tick, even create a whole new game mode and spawn everything in it.

Obviously any variable you declare within tick is local to the function as with anything else.

But because it’s not really a stack situation like it would be for GetSocket, there isn’t any way to see how long setting that particular variable took. Short of debugging with breakpoints before and after, I guess.
Painful to say the least.
This is regardless of the type of variable or its malloc situation. Obviously, the nastier the memory required, the longer allocation would take, causing the function to “drag”.

Btw, regarding the rest: when this is actually compiled and running as a process, regardless of what you compile it as, the timings for functions are often different.

And usually an inline ternary vs. writing out an if has no bearing on timings (other than the ability to measure with breakpoints).

Meaning

Variable = SomeBool ? UseThis : UseThat;

Vs

if (SomeBool)
    Variable = UseThis;
else
    Variable = UseThat;

Which is also something else that being able to determine engine execution timings at runtime would be dreadfully helpful for…

Since it’s mostly a convenience factor of writing less - which may actually affect performance positively or adversely… or not at all if it’s like regular C++ projects.

Hopefully that explains a bit better of what I’m looking for/at.

And things like these should definitely be listed in some master BestPractices document, but aren’t.
Despite their importance being completely negligible in 99.9% of cases…

Of course it will tell you something. If it’s a massive or deeply nested function it will almost certainly take longer than a short flat one.

You cannot expect the engine to profile every function by default, development and debug builds would slow down to a crawl, but Unreal has good profiling tools that you can use to measure any code you write. E.g. to compare GetSocket and GetBone you could just use the SCOPE_CYCLE_COUNTER macro and compare the times for both functions. Look at the profiling documentation.
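
If it helps, this is roughly what that looks like. The group and stat names below (STATGROUP_MyProfiling, STAT_SocketQuery) are placeholders you would declare yourself, and AMyActor/Mesh are stand-ins for your own classes:

// Declare a stat group and a cycle stat once, e.g. near the top of your .cpp file.
DECLARE_STATS_GROUP(TEXT("MyProfiling"), STATGROUP_MyProfiling, STATCAT_Advanced);
DECLARE_CYCLE_STAT(TEXT("Socket query"), STAT_SocketQuery, STATGROUP_MyProfiling);

void AMyActor::Tick(float DeltaSeconds)
{
    Super::Tick(DeltaSeconds);

    {
        // Everything inside this scope gets attributed to STAT_SocketQuery.
        SCOPE_CYCLE_COUNTER(STAT_SocketQuery);
        const FVector Loc = Mesh->GetSocketLocation(TEXT("SomeSocket"));
        // ... use Loc ...
    }
}

// Then check the numbers with the console command "stat MyProfiling" or in the profiler.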

Also very easy to profile with the tools Unreal gives you, but that is something you could even assess by just looking at the generated assembly. Unless you are using the ternary op to directly initialize the variable upon creation it will probably compile down to the same instructions. Same instructions = same performance.

Unreal Engine is a regular C++ project; any difference in the performance of equivalent code comes down to the compiler you are using. Unreal does not change how C++ works; it doesn’t even have its own compiler, and even if it did, it would adhere to the same ISO C++ standard.


Will need to look further into the SCOPE_CYCLE_COUNTER thing.

However, my point is that GENERALLY, when you produce an engine that compiles stuff on its own end, regardless of how it does it, you also publish function timings on a “best possible” build/machine as guidance.
There’s just nothing in the way of that for Unreal that I have ever seen - and to be frank, this kind of stuff probably matters way more in games than in nasty C# sites where people do this a lot.

I haven’t dived into the toolchain yet.
From what you are saying, maybe I should just replace .NET with GCC here for my specific test purposes.

And yes, I guess I could pull down the assembly to have a look.

One thing though: just because functions are nested doesn’t necessarily mean they take longer to return. That’s kind of why having “best possible” timings listed for each core function the libraries provide would give you a better idea of how to approach some general coding…
Particularly since Unreal is now an SDK?

I have never used or even heard of a library that does this.

You can’t replace something you are not using; the .NET compiler is for C#, and game code in Unreal is written in C++. GCC also doesn’t support C#.

Unless all function calls are inlined by the compiler they do take longer. Every function call has an overhead (however small it may be).

How are you imagining that would work, “best possible” timings? The time to execute depends on the hardware. You want them to run all functions with all possible inputs on the “best” computer and give you times? We have big O notation for this precise reason, because citing execution time in time units is impractical across different hardware. If you need to optimize performance you have to measure on your target hardware yourself in most cases. There are best practices which are known to be more performant on most modern hardware, but these apply to C++ in general (and specific compilers) and that is not something that you should expect from the Unreal Engine documentation.


I think you do not at all understand what “inlining” a function means to a compiler.

… ? the engine isn’t compiling anything

well, since the transform might change each tick, you probably would get wildly different results if you attempted to store the vector and use it the next tick. You could theoretically create an optimization by ensuring that all code that uses that vector receives a reference, thereby never having to do anything but cache the reference to it, but that’s the sort of optimization you’d really only ever see done by either mad-persons or kernel programmers (which may have a very aligned looking Venn diagram :smiley: )

So, assuming that you know that the socket reference will never change, the sane path is to cache the socket, and ask it for the value when you need it.

what do you mean ‘instantiating’ a ‘variable’? At the most basic level (and this is something I haven’t studied optimizations on in decades, so this may be a quite simplistic explanation), when you enter a function (such as Tick), the program will allocate enough memory off the stack to account for all of the local variables that function will need. Creating local function variables is basically free beyond the first one, as the allocator for the stack just has to move a pointer, no matter how many variables you have to fill.
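
A minimal sketch of what that means in practice (AMyActor is just a placeholder actor class):

void AMyActor::Tick(float DeltaSeconds)
{
    Super::Tick(DeltaSeconds);

    // This local lives on the stack: "allocating" it is just the stack pointer moving,
    // and it goes away automatically when Tick returns. No heap, no garbage collection.
    FVector Scratch = GetActorLocation();
    Scratch.Z += 10.0f;
    SetActorLocation(Scratch);
}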

well, yeah. what issues would there be? why would there be anything to garbage collect?

… a call stack? Check out the profiling documentation, particularly Unreal Insights | Unreal Engine Documentation

Usually “stack” on its own (at least in most programming circles) refers to the memory area known as the “stack”, which is where function local variables are allocated from.

If you’re concerned about microseconds, you may be targeting the wrong hardware. But if that’s something that actually does concern you for some reason, yes, feel free to lose convenience features (sockets) for the microseconds saved not doing a lookup into the sockets table. And yes, reading the engine source can tell you a whole lot about what is going on inside each function, which you can use to determine how quickly a function will run in comparison to some other function.

If you’re concerned about the speed of malloc, you’re probably in way over your head with whatever you’re trying to accomplish. If you’re doing a lot of things that cause mallocs, you probably want to avoid doing a lot of those during runtime. Malloc itself is usually quite quick, and with decades of development behind various malloc implementations, it’s probably the least of your problems … but you shouldn’t be in a situation where you’re in need of allocating a lot of heap memory at any point after startup.

Of course it’s going to be quite different. You can run the same compiled code dozens, hundreds, or thousands of times, on exactly the same piece of hardware, and get different timings. Modern CPUs are always doing all kinds of crazy optimizations at runtime.

I would challenge you to find a situation where a simple bool ternary and an if else on the same ternary compile down to different code.

Of course it’s “like regular C++ projects”. It’s all C++.

…? Unreal is in C++, not C#. Some of the tools that go along with it are in C#.

Actually, that’s exactly the case. If you turn a 10 line function into 10 individual function calls of 1 line, it’s going to take longer, unless the compiler optimizes all that crap out.

What library provides “best possible timings” ? I’ve been doing this for 40 years, and the closest thing I can think of is the occasional warning in documentation that a function is O(n^2) or some such (which I know I’ve seen in some Unreal docs recently)

I feel like you’re trying to approach this from some sort of comp-sci theoretical direction, without having much knowledge of comp-sci. This is not an attempt to insult, merely pointing out that your questions aren’t making sense.


Yeah, but what is the actual millisecond cost of pulling out the location 10 times vs. storing the location once as a vector and calling it a day (in the same tick)?

Imagine if on each tick you create a variable containing pi to the 1000th decimal place (so a string?), or whatever you can imagine being the most improbable and impossible variable.

Before the next tick, you are now forced to discard that variable, or your CPU/RAM (or whatever, even a flat file) will not be able to store another one of possibly the same size.

C++ does this for us.
You don’t mess with malloc - but it’s still collecting garbage somewhere, releasing that memory after the tick function concludes.
This has to be happening pretty much no matter what, even if in different forms. It is part of garbage collection…

Yeah, but also no.
At that point I would just edit it, add a start and end clock, and manually profile the cost of the function myself…
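
Something like this is usually enough for a quick manual check, no source build needed - FunctionUnderTest is a stand-in for whatever you want to time:

const double StartSeconds = FPlatformTime::Seconds();
FunctionUnderTest();                                              // the call being measured
const double ElapsedMs = (FPlatformTime::Seconds() - StartSeconds) * 1000.0;
UE_LOG(LogTemp, Log, TEXT("FunctionUnderTest took %f ms"), ElapsedMs);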

True. And I’m not.
However the question still applies when trying to come up with a “best possible” guideline to follow.

Take it to an extreme, like spawning in a billion skeletal meshes (good luck on any system).
If you know what the cost is per mesh, you can limit and adjust - perhaps even create a custom pool to manage it (rough sketch below)…

This could probably be very much the same for what you should or shouldn’t do when coding stuff.
Particularly with regards to On Tick.
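
Regarding the custom pool idea above, a very rough sketch of what I mean - APooledThing is a placeholder class, and you’d also want to reset its state on reuse:

// Grab an instance from the pool, spawning only when the pool is empty.
APooledThing* AcquireFromPool(UWorld* World, TArray<APooledThing*>& Pool)
{
    if (Pool.Num() > 0)
    {
        APooledThing* Reused = Pool.Pop();
        Reused->SetActorHiddenInGame(false);
        return Reused;                            // no spawn cost paid here
    }
    return World->SpawnActor<APooledThing>();
}

// Hand an instance back to the pool instead of destroying it.
void ReleaseToPool(APooledThing* Thing, TArray<APooledThing*>& Pool)
{
    Thing->SetActorHiddenInGame(true);
    Pool.Push(Thing);
}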

GCC handles both C and C++.
I guess using Clang could do too, since UE4 ships with it.
The engine does compile with GCC, however.

Sort of, but not really. And I have been in this muck for 20-plus years. Not as respectable as your 40, perhaps.

You are both correct in that I haven’t recently seen a single library with best timings listed.

They used to be commonplace things. Particularly when dealing with small footprint systems running on very limited resources.
Even VB used to have some of those recorded at one point…

Oh. And regardless.

I would still expect the engine to be able to give me the timings of the functions without having to go and compile from source after adding in start/stop timer calls on all the functions I want profiled.

It’s not like it doesn’t know the times, given that it compiles a nice little ms cost list for the CPU plus a real-time graph.
It’s just that the individual times aren’t there unless you add timers into kismet functions.

That’s a decent talk for a change.

You can say that twice. But I did warn people by labeling this properly as being of “negligible import”.

If it continues the way it is going it’ll only be worse. Really.

The engineers seem only interested in Best Possible setups.

This has kind of always been the case with UE4? I mean ARK used to rip a new 1080ti apart when the 1080ti was the pinnacle of top tier.

Heck to work in the editor without burning the hardware I had to go out of my way and buy an EVGA water cooling system to manually install (that was fun btw).

I wouldn’t be surprised if this explained why Chaos performance is still subpar to PhysX in testing.
If you work on a Threadripper or i9-10900K, it’s kind of no wonder that you don’t need to optimize anything…
Try running the same thing on a Pentium II :laughing:

Anyway, let’s not drift the topic off topic.
I think I’ll post some code on how to read out some proper timings into custom classes.
At least if anyone is interested in knowing they’ll be able to find out.

Just wish one didn’t have to go out of their way to find these things out.

Take the Tick video for instance.
If Unreal just made and distributed an aggregation system to begin with and warned people about tick misuse properly…
Around 90% of projects wouldn’t be dreadful :stuck_out_tongue_winking_eye:

Entirely dependent on the hardware the device is running on, and due to hardware optimizations, you might well never get the same result twice.

C++ doesn’t do garbage collection. If you’re mallocing something, it’s up to you to deal with it. If you’re just creating local function variables, the stack pointer gets moved at the start of the function to accommodate the local variables sizes, and then gets moved back at the end of the function. Simply creating local variables has a fixed instruction cost in their allocation and removal, and that is the number of instructions it takes to move the stack pointer. What you do with those variables in the meantime, is what matters.

Which will tell you how long that function took to run that time, but doesn’t tell you anything about the complexity of the function, and how long it will always take to run, or anything about how it will run in any different software or hardware circumstance.

That is exactly what you do. You find out the cost given your specific hardware and software requirements (ie, take a worst-case scenario mesh / actor, and put 10 or 100 or 1000 of them into the world, and measure the impact), and then you don’t exceed your limits.

In the project that I’m working on right now, you absolutely never spawn anything at runtime. That, IMO, is the way anything should be structured. Spawning is absolutely the most costly thing you will do, so it should all be done before the user can be impacted by it. Pretty much all other concerns are rendering. Code-wise, you have to do extremely inefficient things to really have an effect on your game’s framerate.

That would be identical to adding profiling to all functions. There’s no need for that, and it just mucks up the profiler’s view of things, when you’ve got hundreds of negligible things.

Engine has a pretty wide range of built in profile points, which you can examine with Insights. You can add your own anywhere you please, but the more you add, the harder it’s going to be to actually drill down, because it’ll slow things down significantly and give you a much larger amount of data to go over.

I think what you’re looking for is something like when the engine had UnrealScript, and you could do a profiling dump on everything that happened in script. That’s not useful anymore. You have to make more educated decisions on what to profile, because you can’t profile everything.

yeah, super early start. Back when if you wanted a computer to do something, you had to write it yourself, because what little software existed you couldn’t afford! :smiley:

An Olivetti 386 was our first household PC for me, but I’m really only counting since after high school, given that prior to that I wasn’t paid to do stuff…
Still, to date the only game that’s held up over time is probably Monkey Island…

Nostalgia aside, thanks for the tips.
Will have to review stuff further.

Nah… you can always download more RAM!


I remember when I discovered that Maya was used to build Final Fantasy 10.

I wanted to learn the software, but all I learned was that Maya was $12,000 and 3DS Max was something around $8K.

I had to, for a long time, try to convince my boss to buy Maya for the company, so I could learn to use it…


Anyway, no idea how this topic got to this lol