Compiler performance problems. TSharedPtr vs std::shared_ptr.

Background: I’ve been working on a 3D tile-based level system built on a modified marching cubes/tetrahedra approach. While populating the lookup table for tile selection and orientation, I have been experiencing progressively longer build times that appear to be out of proportion with the amount of code I’m adding. By around the 200th entry, build times have reached ~6 minutes. These times are the same both via Build (not Rebuild, not Clean then Build, just a straight Build) in VS 2013 Express (12.0.30110.00 Update 1) and via the Compile button in the Editor.

I searched the answers site and the forums (“Extremely slow compiling” and “How to improve compile times for a C++ project?” on the UE4 AnswerHub) and the suggestions I found made no difference.

So I did some testing and found that the main culprit appears to be that the compile time increases greatly (relatively) for every TSharedPtr that is created (it might be references to TSharedPtr rather than creation specifically that are responsible; I haven’t tested that).

Included below is the data I’ve collected, based on creating and placing shared pointers in a fixed-length array. I’ve been comparing compile times using TSharedPtr against compile times using std::shared_ptr. All shared pointers are created and placed into the arrays individually, not within a loop (see code at the end of the post).

I’ve separated the std::shared_ptr results onto their own graph so that the effect on compile time can be better seen. As you can see, the number of entries in the array has an approximately linear effect on compile time when using std::shared_ptr, while the compile time when using TSharedPtr increases on a curve. With only 50 entries in the array the TSharedPtr version takes 3 times as long to compile, but with 400 entries being placed into the array it takes nearly 100 times longer.

Here’s a set of classes that should reproduce the problem (copied from my project and rearranged for brevity, so they won’t compile without some modifications).

Notes: I tested with the TSharedPtr array in the UClass and in the normal C++ class; there is no difference. I couldn’t include <memory> for std::shared_ptr in the UClass without errors, so I tested it in the normal class. I tested an array of std::shared_ptr<FRotator> and the compile time is fine (so it’s not a general issue with UE4 classes).

#pragma once

class Simplest
{
public:
	Simplest() {}
};

#pragma once

#include "Simplest.h"

class TestTSharedPtr
{
public:
	TestTSharedPtr() {
		lookupTable[0] = TSharedPtr<Simplest>(new Simplest());
		lookupTable[1] = TSharedPtr<Simplest>(new Simplest());
		lookupTable[2] = TSharedPtr<Simplest>(new Simplest());
		lookupTable[3] = TSharedPtr<Simplest>(new Simplest());
		// ... entries 4 through 197 elided for brevity ...
		lookupTable[198] = TSharedPtr<Simplest>(new Simplest());
		lookupTable[199] = TSharedPtr<Simplest>(new Simplest());
	}

	TSharedPtr<Simplest> lookupTable[200];
};

#pragma once

#include <memory>
#include "Simplest.h"

class TestStdSharedPtr
{
public:
	TestStdSharedPtr() {
		lookupTable[0] = std::shared_ptr<Simplest>(new Simplest());
		lookupTable[1] = std::shared_ptr<Simplest>(new Simplest());
		lookupTable[2] = std::shared_ptr<Simplest>(new Simplest());
		lookupTable[3] = std::shared_ptr<Simplest>(new Simplest());
		// ... entries 4 through 197 elided for brevity ...
		lookupTable[198] = std::shared_ptr<Simplest>(new Simplest());
		lookupTable[199] = std::shared_ptr<Simplest>(new Simplest());
	}

	std::shared_ptr<Simplest> lookupTable[200];
};


#pragma once

#include "TestStdSharedPtr.h" // or TestTSharedPtr.h

#include "GameFramework/Actor.h"
#include "GeneratedMap.generated.h"

UCLASS()
class AGeneratedMap : public AActor
{
	GENERATED_UCLASS_BODY()

public:
	TestStdSharedPtr* testCase;
	//TestTSharedPtr* testCase;
};



#include "Blackbeard.h"
#include "GeneratedMap.h"

AGeneratedMap::AGeneratedMap(const class FPostConstructInitializeProperties& PCIP)
	: Super(PCIP)
{
	testCase = new TestStdSharedPtr(); // or new TestTSharedPtr();
}

I like how you put the effort to even make graphs for compile time.
From what I see, my guess is the problem comes from template instantiation. That is, TSharedPtr<T> can be much more complex and as such, takes longer to instantiate. Consider explicit template instantiation and typedef-ing the type.

Very interesting find, ChrisJD! We will definitely look into this soon. Improving compile times is one of our highest priorities.


Some further notes for anyone looking into it. It appears to be related to the compiler optimising the classes: compiling with optimisation disabled results in compile times similar to std::shared_ptr. After pulling the TSharedPtr code out of the engine into a separate test project and doing some further digging, it specifically appears to involve the operator= implementation (TSharedPtr& operator=(TSharedPtr const& InSharedPtr), or its internals) in TSharedPtr, at least for the test case mentioned above. The first line of that method contributes most of the extra compile time, but the other two lines also add a noticeable amount.

Awesome, that’s good news.

Awesome find, ChrisJD – compile times are one of my many pet peeves :slight_smile: We have someone here investigating as well - please keep us up to date with your findings.

It would be interesting to see a graph comparing compile times with just a bare pointer as well. I would expect it to show linear growth like std::shared_ptr, but I’m curious whether std::shared_ptr is only incurring its template instantiation cost once, which is what I would expect, whereas TSharedPtr looks like it’s incurring template instantiation cost for each instance, even though it’s the same type.

In my test project the compile times for TSharedPtr become much more reasonable if the FORCEINLINE on the operator= overload with the signature FORCEINLINE TSharedPtr& operator=( TSharedPtr const& InSharedPtr ) (line 612 in SharedPointer.h) is replaced with inline. This change brings the compile time for the slowest test down to ~5.7 seconds (from ~341 seconds using TSharedPtr without any modification, a 60x difference). Redefining FORCEINLINE to inline for the whole of SharedPointerInternals.h and SharedPointer.h brings the compile time down to ~2.6 seconds (std::shared_ptr is ~2.5 seconds).
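For anyone wanting to reproduce the whole-header experiment, it amounts to something like this at the top of a throwaway test translation unit (a hedged sketch only; in the real engine FORCEINLINE is defined per-platform, e.g. in WindowsPlatform.h, so this is not a change you would ship):

```cpp
// Throwaway test only: neutralize forced inlining before the shared-pointer
// headers are parsed, so the optimizer decides what to inline.
#undef FORCEINLINE
#define FORCEINLINE inline
#include "SharedPointerInternals.h"
#include "SharedPointer.h"
```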

Making a blanket change so that all the methods currently defined with __forceinline are defined with inline instead, as in the latter case above, to get the best compile-time result doesn’t seem like a good idea. I assume quite a bit of thought and profiling went into determining what needed to be force-inlined to get the best runtime performance. However, the huge compile-time cost of force-inlining the operator= doesn’t seem worthwhile, at least for builds running in the editor during development.

I haven’t done a full build of UE4 yet to verify that the change will provide the same kind of compile time reduction normally but I don’t see any reason why it wouldn’t.

I can’t see any reason why force-inlining that particular method results in compile times increasing on a curve. I’d expect the increase to be linear even if the amount of code being inlined is large; the compile-time cost shouldn’t grow with each call (I didn’t notice anything recursive). Someone with more knowledge of compilers than I have might be able to explain it.

Edit: I imagine the same issue exists with SharedRef and WeakPtr as they have nearly identical operator= methods as well but I haven’t tested them.

A difference in how forced inlining behaves compared to the usual inline hint could explain the curve.

Even if a function is inlined by the optimizer, the conventional rules still apply because, inline or not, the code has to behave the same: arguments are evaluated first, and so on. This reduces the possible interleaving of the instructions of successive function calls.

It is possible, and it looks like it is the case, that with force-inlined functions the compiler deletes sequence points when they are not significant. That results in a larger set of instructions that can be reordered together, and the number of possible permutations of a set grows exponentially with the number of elements.

By the way, the force-inline attribute is used far too much in UE4. It has become the default flag, applied without real thought. Compilers are far cleverer than humans; preventing them from choosing what is best by themselves with that kind of flag is most often a bad idea. Only a profiling session should determine where a function call has a real negative impact compared to inlining, and then you profile again to be sure you got a win.

For your specific case, I would probably solve it by factoring the creation of the shared pointers into a loop, so there is only one instantiation of the assignment operator instead of a function with hundreds of copy-and-pasted lines.

Given that compilers like to unroll loops in some cases, it might not make that big a difference :wink:

Anyway, I don’t think the original problem is hundreds of copy/pasted lines. The original problem is in real source code, where TSharedPtr is used in proper places. The OP just made the effort to reduce it to a synthetic test case which shows the problem clearly, instead of dumping megabytes of cpp files on us and shouting “it is slow”. Optimizing the slowness out of the synthetic test case is kind of counterproductive…

I ran some experiments earlier today using UE4Game Win64 Development configuration, and it doesn’t look like it’ll have a major impact in the vanilla code base:

Two baseline runs of rebuilding (no engine mods):

Changing the various operator= calls in SharedPointer.h:

Changing FORCEINLINE in WindowsPlatform.h to inline:

The variance here is probably in the noise floor for disk cache + using the computer while compiling. However, someone on the core team is also having a look at the implementations in SharedPointer.h and will probably chime in to this thread.

Michael Noland

That’s interesting Michael, I would have expected the change to have a more noticeable impact if it were a general problem. It’s producing huge improvements for me in my test projects and my actual project, but I haven’t tried doing tests with full engine builds as they take about 30 minutes for me.

Can anyone else actually replicate the kind of build times I’m having for the test case? If not… two steps backwards and I’m very open to suggestions as to what the actual underlying problem might be.

Here is a minimal project (generate the VS solution from the .uproject, then compile once to get the first slow build done, then compile again after a trivial change) I just created with the 4.0.2 editor binary that takes a long time to compile for me. As is, it takes ~30 seconds to compile (this and all further times are for rebuilds after trivial changes to the MyActor.cpp file, i.e. adding a space to the end of a line and choosing build). If the code that populates the second array is uncommented, the build time goes up to over 5 minutes. All times are just for “compiling”, not total UBT execution time (that time is generally about 8 seconds longer than the compile time).

Changing the various operator= calls in SharedPointer.h to inline:
3.1 seconds

Redefining FORCEINLINE in SharedPointer.h and SharedPointerInternals.h as inline:
1.2 seconds

The last case is the only one on par with the compile times I get when using std::shared_ptr instead of TSharedPtr.

I tested the same changes with my actual project and the improvements apply there as well, although the general change to inline makes much more of a difference (likely because I’m doing more with the pointers than just assigning them to an array). My normal build time with no changes is:
324 seconds

Changing the various operator= calls in SharedPointer.h to inline:
49 seconds

Redefining FORCEINLINE in SharedPointer.h and SharedPointerInternals.h as inline:
8.4 seconds

FYI, on my computer, on the Development_Editor x64 target, a full build with FORCEINLINE set to __forceinline versus inline shows a two-minute delta, from 13m00s to 11m00s. The result is stable and was tested a couple of times. That’s close to a 20% improvement in compile time.

For every 5 builds, you get one free :cool:. It is really easy to trigger large or even full builds when working on the engine, and the performance benefit of forcing over a simple inline compiler hint has not been proven yet. It would also be possible to keep it forced for retail/final builds, which are compiled far more rarely.

EDIT: Project on a Samsung SSD 840 Pro, Core i7-3930K and 32GB of RAM (I don’t remember the exact Corsair model, but very good ones of course :slight_smile:).

I had only compiled the standalone game (not the editor target) since the iteration time is faster for testing, so that’s another good data point. Thanks galopBn.

There are certainly a number of places where FORCEINLINE has a noticeable performance impact, especially in some math code where it’ll round-trip to memory instead of keeping things in registers if (frequently when) it fails to inline the call, but I do agree that we’re probably overzealous on using FORCEINLINE in the code base today.

Michael Noland

Math off-topic:

Are you talking about the still mostly scalar math code? FVector is not even aligned to 16 bytes (it is used on more than 6x as many lines as FVector4 in a basic search) and doesn’t help compilers produce SIMD for us. Don’t try to change that, though, or the UObject serialisation will blow up (we tried in UE3; it failed).

What makes math fast is not inlining, it is how you use it. As an example, take the heavy math usage of visibility culling, with its frustum-versus-bounds intersection tests. OOP is not suited to the usage: spread-out calls like bool visible = frustum.Intersect( oneSphere ); are just wrong compared to anArrayOfBool = frustum.Intersect( manySpheres ). The first option loses a lot of opportunities to factor out invariants. It also means the spheres are likely to come from some primitive.getSphere(), which is also scattered memory; the consequence is a less effective percentage of L2 cache line usage and more cache misses. What makes math slow is first memory, then doing things more than once when they could be done once.

Funny fact: I am surprised that you removed the octree frustum culling in UE4 and kept only a brute-force loop over the primitives for the main view. Brute force may outperform a hierarchical approach if done well, but that is not the case here (cf. the previous paragraph). Funnily enough, I did the exact opposite: I updated the octree traversal to branch to a different function without bounds tests once a node is fully visible, and removed the “brute force” per-primitive test that was running on primitives that were only in the list because they had already passed frustum culling, so it always yielded true. That is the funny fact: UE3 was doing frustum culling twice :slight_smile:

Another funny fact: using a really costly atan2 to compute an angle only to convert it back into consecutive multiples of sin and cos, when a simple 2D normalize and a simple iterative 2D rotation do the trick (SHBasisFunction). And this is only a drop in an ocean of possible optimizations in the code around SH (which was a real issue in UE3).

A change landed today that looks like it might fix this issue.

Personally, I think it is one of the more interesting changes I’ve seen go through master since the live check-ins were introduced. I’d love to see a technical write-up of how Epic went about fixing the problem. I think it would be a pretty interesting read :slight_smile:

I’m also curious about making TSharedRef move assignable but not move constructible. I understand why you can’t make it move constructible; what I don’t understand is why not also prevent it being move assignable? Right now, move assignment swaps the values. Couldn’t this be a source of programmer error, where a simple assignment has now swapped out my ref and it is no longer pointing to what I think?

Another question related to this code:

	FORCEINLINE TSharedRef( TSharedRef&& InSharedRef )
		: Object( InSharedRef.Object ),
		  SharedReferenceCount( InSharedRef.SharedReferenceCount )
	{
		// We're intentionally not moving here, because we don't want to leave InSharedRef in a
		// null state, because that breaks the class invariant.  But we provide a move constructor
		// anyway in case the compiler complains that we have a move assign but no move construct.
	}

Instead of copying in the move constructor, why not just =delete; it? I would think that would prevent the compiler from complaining.


I’ve done some build performance analysis of the operators and have some conclusions:

First, here are some timings on my PC from the original code, with the ‘array of 200’ code given above:

1> TestFile.cpp (0:42.57 at +0:00)
1> TestFile.cpp (0:42.16 at +0:00)
1> TestFile.cpp (0:42.26 at +0:00)

Some investigation reveals that the root cause of the problem is the extraneous OldSharedReferenceCount object on the first line of the assignment operator. It doesn’t seem to be doing any good, as the associated FSharedReferencer is already safe. It was likely causing combinatorial explosions in the compiler’s common subexpression elimination pass - we have seen this before.

There was some work to be done on TSharedPtr anyway, so I removed that object, reduced the number of template instantiations, and added some degree of rvalue reference support (not all conversions are done yet).

This gives the following build times with the same code:

1> TestFile.cpp (0:10.71 at +0:00)
1> TestFile.cpp (0:10.65 at +0:00)
1> TestFile.cpp (0:10.76 at +0:00)

Note that these builds were done in Development. Debug builds had no perceivable difference in the compile time.

Overall build times don’t really seem to have been affected in real terms. Here are some full rebuild times on a typical game project, before and after:

// Before:
1>Time Elapsed 00:04:57.16
1>Time Elapsed 00:05:05.92
1>Time Elapsed 00:05:36.96

// After:
1>Time Elapsed 00:04:45.78
1>Time Elapsed 00:05:16.86
1>Time Elapsed 00:05:26.94

Deleting the move constructor would give a compile error whenever you attempted to copy an rvalue TSharedRef, not what you want. :slight_smile: Unfortunately, there doesn’t seem to be a way of defaulting a move constructor to be equivalent to a copy, other than not defining it at all, but that’s more like normal overload resolution than ‘defaulting the move operator’.


I neglected to respond to this bit:

I admit that it’s slightly unusual, and we considered it for a while, but memswap or copying are the only real viable options for a TSharedRef assignment operator. And we chose memswap because it is faster in general.

What is ‘it’ here? I assume the RHS of a move assignment, because the LHS should definitely be pointing to what you think. So the RHS in a move assignment could either be a prvalue or an xvalue:

// prvalue
MySharedPtr = FunctionWhichReturnsSharedPtr();

// xvalue
MySharedPtr = MoveTemp(MyOtherSharedPtr);

In the prvalue case, these are generated by the compiler and you’ll never see them, so I’m not sure why you’d care what a temporary points to. You may get different destruction order relative to other temporaries in the same expression, but the order of temporary creation (and thus destruction) within an expression isn’t guaranteed by a compiler anyway.

In the xvalue case, MyOtherSharedPtr will now be pointing to what MySharedPtr used to point to. This may be contrary to your expectations. But what are your expectations? C++ doesn’t define any guarantees about what a moved-from state should be, not even for its own library types, other than they should be valid:

Basically, you can’t assume anything about the state of a moved-from object, except that it’s not invalid.

This is the model that we’ve chosen to follow, and so the moved-from state is simply the LHS of the expression, which is definitely valid. If this doesn’t match your assumptions, then I suspect many other C++ types may also disappoint you in that regard. But feel free to list your concerns here and we’ll change it if you make a compelling case. :slight_smile: