Compiler performance problems. TSharedPtr vs std::shared_ptr.



    Background: I've been working on a 3D tile-based level system using a modified marching cubes/tetrahedra approach. While populating the lookup table for tile selection and orientation, I have been experiencing progressively longer build times that appear to be out of proportion to the amount of code I'm adding. By around the 200th entry, build times have reached ~6 minutes. These times are the same both via Build (not Rebuild, not Clean then Build, just a straight Build) in VS 2013 Express (12.0.30110.00 Update 1) and via the Compile button in the Editor.

    I searched the answers site and the forums (https://answers.unrealengine.com/que...compiling.html https://answers.unrealengine.com/que...c-project.html) and the answers found made no difference.

    So I did some testing and found that the main culprit appears to be that the compile time increases greatly (relatively) for every TSharedPtr that is created (it might be references to TSharedPtr that are increasing it rather than creation specifically; I haven't tested that).

    Included below is the data I've collected based on creating and placing shared pointers in a fixed length array. I've been comparing compile times using TSharedPtr against compile times using std::shared_ptr. All shared pointers are created and placed into the arrays individually, not within a loop (see code at the end of post).



    I've separated the std::shared_ptr results onto their own graph so that the effect on compile time can be better seen. As you can see, the number of entries in the array using std::shared_ptr has an approximately linear effect on compile time. The compile time when using TSharedPtr is increasing on a curve. With only 50 entries in the array, the TSharedPtr version takes 3 times as long to compile, but with 400 entries being placed into the array it takes nearly 100 times longer to compile.


    Here's a set of classes that should reproduce the problem (copied from my project and rearranged for brevity, so it won't compile without some modifications).

    Notes: I tested with the TSharedPtr array both in the UCLASS and in the plain C++ class; there is no difference. I couldn't include <memory> for std::shared_ptr in the UCLASS without errors, so I tested it in the plain C++ class. I tested an array of std::shared_ptr<FRotator> and the compile time is fine (so it's not a general issue with UE4 classes).

    Code:
    #pragma once
    
    class Simplest
    {
    
    public:
    	Simplest() {};
    };
    
    
    ----------------------------------------------------------------------
    
    
    #pragma once
    
    #include "Simplest.h"
    
    class TestTSharedPtr
    {
    
    public:
    	TestTSharedPtr() {
    		lookupTable[0] = TSharedPtr<Simplest>(new Simplest());
    		lookupTable[1] = TSharedPtr<Simplest>(new Simplest());
    		lookupTable[2] = TSharedPtr<Simplest>(new Simplest());
    		lookupTable[3] = TSharedPtr<Simplest>(new Simplest());
    		...
    		lookupTable[198] = TSharedPtr<Simplest>(new Simplest());
    		lookupTable[199] = TSharedPtr<Simplest>(new Simplest());
    	};
    	
    	TSharedPtr<Simplest> lookupTable[200];
    };
    
    
    ----------------------------------------------------------------------
    
    #pragma once
    
    #include <memory>
    #include "Simplest.h"
    
    class TestStdSharedPtr
    {
    
    public:
    	TestStdSharedPtr() {
    		lookupTable[0] = std::shared_ptr<Simplest>(new Simplest());
    		lookupTable[1] = std::shared_ptr<Simplest>(new Simplest());
    		lookupTable[2] = std::shared_ptr<Simplest>(new Simplest());
    		lookupTable[3] = std::shared_ptr<Simplest>(new Simplest());
    		...
    		lookupTable[198] = std::shared_ptr<Simplest>(new Simplest());
    		lookupTable[199] = std::shared_ptr<Simplest>(new Simplest());
    	};
    	
    	std::shared_ptr<Simplest> lookupTable[200];
    };
    
    
    ----------------------------------------------------------------------
    
    
    #pragma once
    
    #include "TestStdSharedPtr.h" // or TestTSharedPtr.h
    
    #include "GameFramework/Actor.h"
    #include "GeneratedMap.generated.h"
    
    /**
     * 
     */
    UCLASS()
    class AGeneratedMap : public AActor
    {
    	GENERATED_UCLASS_BODY()
    
    	
    	TestStdSharedPtr* testCase;
    	//TestTSharedPtr* testCase;
    
    	
    };
    
    ----------------------------------------------------------------------
    
    #include "Blackbeard.h"
    #include "GeneratedMap.h"
    
    AGeneratedMap::AGeneratedMap(const class FPostConstructInitializeProperties& PCIP)
    	: Super(PCIP)
    {
    	testCase = new TestStdSharedPtr(); // TestTSharedPtr();
    };
    Last edited by ChrisJD; 04-15-2014, 02:45 AM.

    #2
    I like how you put the effort to even make graphs for compile time.
    From what I see, my guess is the problem comes from template instantiation. That is, TSharedPtr<T> can be much more complex and as such, takes longer to instantiate. Consider explicit template instantiation and typedef-ing the type.

    Comment


      #3
      Very interesting find, ChrisJD! We will definitely look into this soon. Improving compile times is one of our highest priorities.

      --Mike

      Comment


        #4
        Some further notes for anyone looking into it. It appears to be related to the compiler optimising the classes: compiling with optimisation disabled results in compile times similar to std::shared_ptr. After pulling the TSharedPtr code out of the engine into a separate test project and doing some further digging, it specifically appears to involve the operator= implementation (TSharedPtr& operator=(TSharedPtr const& InSharedPtr), or its internals) in TSharedPtr (for the test case mentioned above, at least). The first line of that method contributes most of the extra compile time, but the other two lines also add a noticeable amount.
        Last edited by ChrisJD; 04-21-2014, 02:27 AM.

        Comment


          #5
          Originally posted by Mike Fricker
          Improving compile times is one of our highest priorities.
          Awesome, that's good news.

          Comment


            #6
            Awesome find, ChrisJD -- compile times are one of my many pet peeves :-) We have someone here investigating as well - please keep us up to date with your findings.

            Comment


              #7
              It would be interesting to see a graph comparing compile times with just a bare pointer as well. I would expect it to show linear growth like std::shared_ptr, but I'm curious whether std::shared_ptr is only incurring its template instantiation cost once, which is what I would expect, whereas TSharedPtr looks like it's incurring template instantiation cost for each instance - even though it's the same type.

              Comment


                #8
                In my test project the compile times for TSharedPtr become much more reasonable if the FORCEINLINE on the operator= overload with the signature FORCEINLINE TSharedPtr& operator=( TSharedPtr const& InSharedPtr ) (line 612 in SharedPointer.h) is replaced with inline. This change brings the compile time for the slowest test down to ~5.7 seconds (from ~341 seconds using TSharedPtr without any modification, a 60x difference). Redefining FORCEINLINE to inline for the whole of SharedPointerInternals.h and SharedPointer.h brings the compile time down to ~2.6 seconds (std::shared_ptr is ~2.5 seconds).

                Making a blanket change so that all the methods defined with __forceinline are defined with inline instead, as in the latter case above, doesn't seem like a good idea just to get the best compile-time result. I assume quite a bit of thought and profiling went into determining what needed to be force-inlined to get the best runtime performance. However, the huge compile-time cost of force-inlining operator= doesn't seem worthwhile, at least for builds running in the editor during development.

                I haven't done a full build of UE4 yet to verify that the change will provide the same kind of compile time reduction normally but I don't see any reason why it wouldn't.

                I can't see any reason why force-inlining that particular method results in compile times increasing on a curve. I'd expect the increase to be linear even if the amount of code being inlined is large; the compile-time cost shouldn't grow with each call (I didn't notice anything recursive). Someone with more knowledge of compilers than I have might be able to explain it.

                Edit: I imagine the same issue exists with TSharedRef and TWeakPtr, as they have nearly identical operator= methods, but I haven't tested them.
                Last edited by ChrisJD; 04-22-2014, 07:13 AM.

                Comment


                  #9
                  A difference in how forced-inline functions behave compared to ordinary inlining hints could explain the curve.

                  Even when a function is inlined by the optimizer, the conventional rules still apply, because, inlined or not, the code has to behave the same - arguments are evaluated first, and so on. That limits how much the instructions of successive function calls can be interleaved.

                  It is possible, and it looks likely, that with forced-inline functions the compiler deletes sequence points when they are not significant. That leaves a larger pool of instructions that can be reordered together, and the number of possible permutations of a set grows exponentially with the number of elements.

                  By the way, the force-inline attribute is by far overused in UE4. It has become the default flag, applied without real thought. Compilers are far cleverer than humans; preventing them from choosing what is best by themselves with that kind of flag is most often a bad idea. Only a profiling session can show where a function call has a real negative impact compared to inlining - and then you profile again to be sure you actually won.

                  For your specific case, I would probably solve it by factoring the creation of the shared pointers into a loop, so there is only one instantiation of the assignment operator instead of a function with hundreds of copy-and-pasted lines.

                  Comment


                    #10
                    Originally posted by galopBn View Post
                    For your specific case, I would probably solve it by factoring the creation of the shared pointers into a loop, so there is only one instantiation of the assignment operator instead of a function with hundreds of copy-and-pasted lines.
                    Given that compilers like to unroll loops in some cases, it might not make that big a difference.

                    Anyway, I don't think the original problem is hundreds of copy-pasted lines. The original problem is in real source code, where TSharedPtr is used in its proper places. The OP just made the effort to reduce it to a synthetic test case that shows the problem clearly, instead of dumping megabytes of .cpp files on us and shouting "it is slow". Optimizing the slowness out of the synthetic test case is kind of counterproductive...

                    Comment


                      #11
                      I ran some experiments earlier today using UE4Game Win64 Development configuration, and it doesn't look like it'll have a major impact in the vanilla code base:

                      Two baseline runs of rebuilding (no engine mods):
                      4:36
                      4:32

                      Changing the various operator= calls in SharedPointer.h:
                      4:28

                      Changing FORCEINLINE in WindowsPlatform.h to inline:
                      4:25

                      The variance here is probably within the noise floor for disk caching + using the computer while compiling. However, someone on the core team is also having a look at the implementations in SharedPointer.h and will probably chime in on this thread.

                      Cheers,
                      Michael Noland

                      Comment


                        #12
                        That's interesting, Michael. I would have expected the change to have a more noticeable impact if it were a general problem. It's producing huge improvements for me in my test projects and my actual project, but I haven't tried doing tests with full engine builds, as they take about 30 minutes for me.

                        Can anyone else actually replicate the kind of build times I'm having for the test case? If not... two steps backwards and I'm very open to suggestions as to what the actual underlying problem might be.

                        Here is a minimal project I just created with the 4.0.2 editor binary that takes a long time to compile for me (generate a VS solution from the .uproject, compile once to get the first slow build out of the way, then compile again after a trivial change). As is, it takes ~30 seconds to compile (this and all further times are for rebuilds after trivial changes to the MyActor.cpp file, i.e. adding a space to the end of a line and choosing Build). If the code that populates the second array is uncommented, the build time goes up to over 5 minutes. All times are just for "compiling", not total UBT execution time (which is generally about 8 seconds longer than the compile time).

                        Changing the various operator= calls in SharedPointer.h to inline:
                        3.1 seconds

                        Redefining FORCEINLINE in SharedPointer.h and SharedPointerInternals.h as inline:
                        1.2 seconds

                        The last case is the only one on par with the compile times I get when using std::shared_ptr instead of TSharedPtr.



                        I tested the same changes with my actual project and the improvements apply there as well, although the general change to inline makes much more of a difference (likely because I'm doing more with the pointers than just assigning them to an array). My normal build time with no changes is:
                        324 seconds

                        Changing the various operator= calls in SharedPointer.h to inline:
                        49 seconds

                        Redefining FORCEINLINE in SharedPointer.h and SharedPointerInternals.h as inline:
                        8.4 seconds
                        Last edited by ChrisJD; 04-23-2014, 08:12 AM.

                        Comment


                          #13
                          FYI, on my computer, on the Development_Editor x64 target, a full build with FORCEINLINE defined as __forceinline versus inline shows a 2-minute delta, from 13m00s to 11m00s. The result is stable and was tested a couple of times. That's nearly a 20% improvement in compile time.

                          For every 5 builds, you get one free, and it is really easy to trigger large or even full builds when working on the engine, while the performance benefit of forcing over a simple inline compiler hint hasn't been proven yet. It is also possible to keep it forced for retail/final builds, which are compiled far more rarely.

                          EDIT: Project on a Samsung SSD 840 Pro, a Core i7-3930K, and 32 GB of RAM (I don't remember the exact Corsair model, but very good ones of course).
                          Last edited by galopBn; 04-23-2014, 07:25 AM. Reason: add the spec of the computer

                          Comment


                            #14
                            I had only compiled the standalone game (not the editor target) since the iteration time is faster for testing, so that's another good data point. Thanks galopBn.

                            There are certainly a number of places where FORCEINLINE has a noticeable performance impact, especially in some math code where it'll round-trip to memory instead of keeping things in registers if (frequently, when) it fails to inline the call, but I do agree that we're probably overzealous in our use of FORCEINLINE in the code base today.

                            Cheers,
                            Michael Noland

                            Comment


                              #15
                              Math off-topic:

                              Are you talking about the still mostly scalar math code? FVector isn't even aligned to 16 bytes (it's used on more than 6x as many lines as FVector4 in a basic search) and doesn't help compilers produce SIMD for us. Don't try to change that, though, or the UObject serialization will blow up (we tried on UE3; it failed).

                              What makes math fast is not inlining, it is how you use it. As an example, take the heavy math usage of visibility culling, with its frustum-versus-bounds intersection tests. OOP is not suited to that usage: spread-out calls like `bool visible = frustum.Intersect( oneSphere );` are just wrong compared to `anArrayOfBool = frustum.Intersect( manySpheres )`. The first option loses a lot of opportunities to factor out invariants. It also means the spheres are likely to come from some `primitive.getSphere()`, which is also spread-out memory; the consequence is a less effective percentage of L2 cache line usage and more cache misses. What makes math slow is memory first, then doing things more than once when they could be done once.

                              Funny fact: I am surprised that you removed the octree frustum culling in UE4 and kept only a brute-force loop over the primitives for the main view. Brute force may outperform hierarchical culling if done well, but that is not the case here (cf. the previous paragraph). Funnily enough, I did the exact opposite: I updated the octree traversal to branch to a different function without bounds tests once a node is fully visible, and removed the "brute force" per-primitive tests that ran on primitives which were only in the list because they had already passed frustum culling, so the tests always yielded true. That is the funny fact: UE3 was doing frustum culling twice.

                              Another funny fact: using a really costly `atan2` to compute an angle, only to convert it back into consecutive multiples of sin and cos, when a simple 2D normalize and a simple iterative 2D rotation do the trick (SHBasisFunction). And this is only a drop in an ocean of possible optimizations in the code around SH (which was a real issue in UE3).
                              Last edited by galopBn; 04-23-2014, 07:25 PM. Reason: funny fact 2

                              Comment
