Created a new Instanced Skeletal Mesh Component -- encountering strange perf issues in D3D

    Hi all,

    For the past month I've been involved as a software engineer in a AAA project that involves a very complex scene with thousands of static and skeletal meshes rendering at once in VR. To optimize this scene, which sadly cannot benefit much from occlusion culling and similar techniques, our designers have been making extensive use of Unreal's mesh combination tool, as well as the instanced static mesh components (both the hierarchical and non-hierarchical varieties). My primary task for the past month has been to implement an Instanced Skeletal Mesh Component in the engine so that we can take advantage of instancing for many of the characters in our game (think marching bands, armies, or large crowds of people).

    Anyway, I'm happy to report that I finally succeeded at this task last week: the new Instanced Skeletal Meshes (ISKMs) are skinned only once per component, and only a single draw call is issued to render them on screen. Unfortunately, our testing revealed some strange performance results: in OpenGL, they yield an impressive performance boost when rendering large crowds of characters. In D3D... not so much. I'll start with the good details.

    When rendering in OpenGL on my work computer (equipped with an NVIDIA GTX 1080), I was able to achieve 120 fps while rendering 1024 high-poly characters on screen at once (each one had over 40,000 triangles, which comes out to over 42,000,000 triangles for the whole scene). This was easily three times faster than rendering each skeletal mesh with individual skinning and draw calls, so I'm quite pleased with the results...
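
As a rough sanity check on these figures, the arithmetic can be sketched in a few lines of C++. The function names are made up for illustration, and "over 40,000" is treated as exactly 40,000 per mesh, so the results are lower bounds.

```cpp
#include <cassert>
#include <cstdint>

// Total triangles submitted per frame for a crowd of identical meshes.
std::int64_t TrianglesPerFrame(std::int64_t instances,
                               std::int64_t trianglesPerMesh) {
    assert(instances >= 0 && trianglesPerMesh >= 0);
    return instances * trianglesPerMesh;
}

// Time available for one frame, in milliseconds, at a given frame rate.
double FrameBudgetMs(double framesPerSecond) {
    assert(framesPerSecond > 0.0);
    return 1000.0 / framesPerSecond;
}

// Triangles processed per second at a given frame rate.
double TrianglesPerSecond(std::int64_t trianglesPerFrame,
                          double framesPerSecond) {
    return static_cast<double>(trianglesPerFrame) * framesPerSecond;
}
```

With 1024 instances at exactly 40,000 triangles each, `TrianglesPerFrame` gives 40,960,000, so the quoted total of over 42,000,000 corresponds to roughly 41,000 triangles per mesh. At 120 fps the frame budget is about 8.3 ms.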

    [Image: opengl4_noshadows_noculling.png]
    [Image: opengl4_noshadows_noculling_stat_engine.png]

    When rendering in D3D, however, I had some pretty severe performance problems. Despite confirming once again that I was indeed skinning the mesh only once per frame and issuing a single draw call for all instances, the ISKMs rendered significantly slower than their non-instanced counterparts in all of my stress tests...

    [Image: d3d_noshadows_noculling.png]

    A bit of profiling revealed the major sources of the slowdown. The most critical appears to be in base pass rendering. stat SceneRendering showed most cycles there being eaten up by some kind of "RenderQuery Result" process, while ProfileGPU attributed them to the "Dynamic" process (drawing dynamic elements?). Neither of these facts has been very helpful in leading me to the source of the performance loss, though...

    [Image: scenerendering.png]
    [Image: profilegpu.png]
    [Image: statengine.png]

    Another weird thing I noticed (though it doesn't seem to affect performance) is that stat InitViews reports a lot of time being spent determining visibility (again though, only in D3D). I say that it doesn't affect performance, because toggling occlusion culling on/off doesn't raise or lower my performance at all; just changes the graphs around. (The above reports were generated with occlusion culling off because of this).

    [Image: initviews.png]

    Strangely, I also noticed that normal skeletal meshes do not seem to affect the reported occlusion culling times in any significant way, so I wonder why this is the case for ISKMs (and again, only in D3D).

    Anyway, the main big question is: what is "RenderQuery Result", and why is it eating all of my cycles when D3D is active?

    I suspect that something weird is happening in the D3D pipeline, like materials constantly being loaded and unloaded for every instance (or something strange like that) but I can't really be sure.

    If anyone could give me some ideas about this though, I'd be much obliged.

    As this is not my own personal project, I am unfortunately restricted in how much of my code I am able to share right now. So I cannot, for example, just zip up my entire modified engine source and project files for everybody to inspect... BUT, I am interested in receiving help to solve this issue, so my team has authorized me to share as many code snippets and questions/answers as necessary to solve the problem.

    Personally, I would love if I could open-source this work at some point, since I suspect some other people could find ISKMs useful in certain scenarios... but we shall see.

    If anybody has any questions / comments / suggestions / answers, I would love to hear them. Cheers,

    Sheridan

    #2
    ISMs will make occlusion culling slower since they count as one huge thing for the purposes of occlusion.

    Also "Dynamic" entry in GPU Profiler means that much time is spent on Dynamic actors/meshes as opposed to Static ones (you know, the switch in actor's properties, just below the transform, right hand side below the list of all actors in level)


      #3
      Originally posted by Zireael07:
      ISMs will make occlusion culling slower since they count as one huge thing for the purposes of occlusion.
      I guess that clears up the question over why occlusion culling takes so long

      Originally posted by Zireael07:
      Also "Dynamic" entry in GPU Profiler means that much time is spent on Dynamic actors/meshes as opposed to Static ones (you know, the switch in actor's properties, just below the transform, right hand side below the list of all actors in level)
      Understood, but as these are animating skeletal meshes, I sadly don't have the choice to make them static; skeletal meshes are always dynamic objects.

      Moreover, it doesn't explain why the D3D pipeline spends such a long time rendering them in the base pass compared to OpenGL...


        #4
        Originally posted by Zireael07:
        ISMs will make occlusion culling slower since they count as one huge thing for the purposes of occlusion.
        As I understand it, one ISMC counts as one object in the scene, so there has to be exactly one occlusion check for it: it just takes the bounds of all instances together, and if any part of that is in view then it's rendered. So occlusion culling is super fast on it, one check versus many individual checks on regular meshes. So I don't really get why you say it would be slower.
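
The "bounds of all instances together" idea can be sketched in standalone C++ as below. This is only an illustration with hypothetical names (`Aabb`, `Merge`, `ComponentBounds`, `Overlaps`), not Unreal's own bounds or occlusion code, and the box overlap test merely stands in for a real visibility query.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Axis-aligned bounding box (hypothetical type for this sketch).
struct Aabb {
    float lo[3];
    float hi[3];
};

// Smallest box that encloses both a and b.
Aabb Merge(const Aabb& a, const Aabb& b) {
    Aabb r;
    for (int i = 0; i < 3; ++i) {
        r.lo[i] = std::min(a.lo[i], b.lo[i]);
        r.hi[i] = std::max(a.hi[i], b.hi[i]);
    }
    return r;
}

// Bounds of the whole instanced component: the union of all instance boxes.
Aabb ComponentBounds(const std::vector<Aabb>& instances) {
    assert(!instances.empty());
    Aabb result = instances.front();
    for (std::size_t i = 1; i < instances.size(); ++i) {
        result = Merge(result, instances[i]);
    }
    return result;
}

// True if the two boxes overlap on every axis.
bool Overlaps(const Aabb& a, const Aabb& b) {
    for (int i = 0; i < 3; ++i) {
        if (a.lo[i] > b.hi[i] || b.lo[i] > a.hi[i]) return false;
    }
    return true;
}
```

One test against `ComponentBounds` replaces one test per instance. The trade-off is that the merged box can pass the test even when no individual instance would, in which case every instance is still drawn.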

        ISMCs are slower on the GPU, but the CPU work (culling, draw calls) is way faster than for static meshes (or in this case skeletal meshes).


        @srathbun-vl Awesome that you implemented this! It would definitely be great if you could open source this at some time, I always wanted to have instanced skeletal meshes
        Easy to use UMG Mini Map on the UE4 Marketplace.
        Forum thread: https://forums.unrealengine.com/show...-Plug-and-Play


          #5
          When running DirectX, try these console commands in sequence before running, and see if anything changes with regard to RenderQuery Result:


          r.HZBOcclusion 2
          r.AllowOcclusionQueries 1
          r.OneFrameThreadLag 1
          r.NumBufferedOcclusionQueries 20
          | Savior | USQLite | FSM | Object Pool | Sound Occlusion | Property Transfer | Magic Nodes | MORE |


            #6
            Hi Bruno,

            Thanks for the advice! Interestingly, those console commands diminished the reported impact of "RenderQuery Result" to nothing. However, in its place, RenderViewFamily + InitViews + View Visibility + Occlusion Cull mysteriously rose to fill the gap and reduce performance to the exact same level (still 30fps). Which doesn't make a lot of sense to me...

            Of those commands, r.AllowOcclusionQueries seems to be the key in toggling "RenderQuery Result" as the reported perf sink, rather than all of the others.

            EDIT: Even if r.AllowOcclusionQueries is set to 1, if I move the camera within the bounds of the ISKM, view visibility costs drop to zero and the cost of RenderQuery Result skyrockets back up to ~37ms. Which again is very puzzling... what is RenderQuery Result and who's calling it?
            Last edited by srathbun-vl; 12-20-2016, 01:45 PM.


              #7
              It doesn't make sense that the CPU would just be idle while the gpu is rendering... seems to me that's why the engine is multi-threaded in the first place. Instanced Static Meshes don't have this problem...

              It would be really nice if one of the devs at Epic could comment on this problem. I am working on other issues at the moment so it's not as though I'm not preoccupied, but I'd be lying if I said I have many other ideas to pursue for solving this particular problem at the moment.


                #8
                So I've spent the past few days fixing various bugs with the component... mostly editor implementation bugs (you couldn't select the ISKM through the viewport, selection outlines were bugged, etc.) as well as LOD selection (the meshes would not LOD with distance as the component's apparent size grew smaller). Now all of these bugs are fixed, so I'm really down to the final issue, which is this D3D perf problem.

                I'm determined to fix it so I will ask again, if any devs are viewing this board: any advice or comments you could give regarding my situation would be highly appreciated.

                Regards,

                Sheridan


                  #9
                  A few days ago I got back from my holiday vacation, so I've been off the problem for a while. But since I came back I've been hard at work on the D3D performance problem and now I'm pleased to say I've finally solved it. The solution I discovered was absolutely bizarre though, and I'd appreciate some expert's opinion on why things worked the way they did... but anyway, onto my explanation.

                  Unreal can skin meshes on either the GPU or the CPU, depending on the situation. Most of the time, meshes are skinned on the GPU, and CPU-based skinning is reserved for legacy hardware or for extremely complex meshes whose rigging data doesn't fit in the shader pipeline. Early on, while developing my ISKMs, I chose to have them *always* skinned on the CPU, because at the shader level Unreal only has instancing integrated into its LocalVertexFactory pipeline, and GPU-skinned meshes go through an entirely different pipeline. By going with CPU-based skinning, I avoided having to write quite a lot of code (and thus saved valuable time) by reusing as many existing components as possible... but in doing so, I encountered an enormous performance variation between OpenGL and DirectX, and ultimately discovered something very strange about the performance of static vertex buffers vs. dynamic vertex buffers in DirectX that was directly related to my performance issues on that platform.
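
For readers unfamiliar with CPU skinning, here is a minimal standalone sketch of linear blend skinning. It is a generic textbook version under assumed types (`Vec3`, `BoneMatrix`, `SkinnedVertex`, `SkinOnCpu` are hypothetical names), not Unreal's implementation. The point is that the CPU produces a fresh array of positions every frame, which then has to be copied into a vertex buffer.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// 3x4 affine bone transform: 3x3 rotation/scale plus a translation column.
struct BoneMatrix { float m[3][4]; };

struct SkinnedVertex {
    Vec3 position;                // bind-pose position
    std::array<int, 4> bone;      // indices into the bone palette
    std::array<float, 4> weight;  // blend weights, expected to sum to 1
};

// Apply one bone transform to a point.
Vec3 Transform(const BoneMatrix& b, const Vec3& p) {
    return {
        b.m[0][0] * p.x + b.m[0][1] * p.y + b.m[0][2] * p.z + b.m[0][3],
        b.m[1][0] * p.x + b.m[1][1] * p.y + b.m[1][2] * p.z + b.m[1][3],
        b.m[2][0] * p.x + b.m[2][1] * p.y + b.m[2][2] * p.z + b.m[2][3]
    };
}

// Linear blend skinning: each output position is the weighted sum of the
// bind-pose position transformed by every bone that influences the vertex.
void SkinOnCpu(const std::vector<SkinnedVertex>& in,
               const std::vector<BoneMatrix>& palette,
               std::vector<Vec3>& out) {
    out.resize(in.size());
    for (std::size_t v = 0; v < in.size(); ++v) {
        Vec3 acc{0.0f, 0.0f, 0.0f};
        for (std::size_t k = 0; k < 4; ++k) {
            const float w = in[v].weight[k];
            if (w == 0.0f) continue;  // unused influence slot
            const int index = in[v].bone[k];
            assert(index >= 0 &&
                   static_cast<std::size_t>(index) < palette.size());
            const Vec3 t = Transform(palette[index], in[v].position);
            acc.x += w * t.x;
            acc.y += w * t.y;
            acc.z += w * t.z;
        }
        out[v] = acc;
    }
}
```

Calling `SkinOnCpu` once per frame and drawing the result many times matches the "skinned only once per component" behaviour described earlier in the thread. How the output reaches the GPU (dynamic or static buffer) is a separate choice.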

                  In CPU-based skinning, since you are updating your buffer every frame, it makes sense that you'd want to have a dynamic vertex buffer so that you can update it constantly at a reduced performance penalty. So Epic implemented CPU-skinning just like that, and under most situations it seems to work fine. But while I was attempting to profile and debug my problem with RenderDoc, I noticed that if I switched out a dynamic buffer for a dummy static buffer in the ISKM's vertex factory, my performance increased hugely... and I didn't think much of it, since of course CPU based skinning needs a dynamic buffer to work, right?

                  Well, turns out it doesn't. For ***** and giggles I decided to modify FFinalSkinVertexBuffer::InitVertexData to create and use a static buffer rather than a dynamic buffer for the skinned mesh, and guess what, the performance skyrocketed up to OpenGL levels of awesomeness.

                  So at this point, I can guess as to why a dynamic buffer would be so much slower to draw than a static one... if, for example, I render a thousand instances of a complex mesh, then essentially I will have n² data complexity (instances x vertices) in my GPU draw, and I could expect to encounter cache misses or other inefficiencies as my GPU repeatedly iterates through the dynamic buffer to draw all of my meshes a thousand times over... right?
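
To put rough numbers on the instances x vertices argument, the sketch below compares how much data is written into the buffer per frame with an upper bound on how much is read back out while drawing. The function names are hypothetical, and the vertex count and stride used in the note afterwards are placeholder assumptions, not measurements from this project.

```cpp
#include <cassert>
#include <cstdint>

// Bytes the CPU writes per frame: the skinned mesh is written once.
std::int64_t BytesWrittenPerFrame(std::int64_t vertices,
                                  std::int64_t strideBytes) {
    assert(vertices >= 0 && strideBytes >= 0);
    return vertices * strideBytes;
}

// Upper bound on bytes the GPU fetches per frame: every instance re-reads
// every vertex. Ignores vertex caching and index reuse.
std::int64_t BytesReadPerFrame(std::int64_t instances,
                               std::int64_t vertices,
                               std::int64_t strideBytes) {
    assert(instances >= 0);
    return instances * BytesWrittenPerFrame(vertices, strideBytes);
}
```

Assuming, purely for illustration, 25,000 vertices at 32 bytes each, the CPU writes 800,000 bytes per frame while 1024 instances can read up to 819,200,000 bytes. The cost is the product of two different counts, and reads outnumber writes by the instance count, so how quickly the GPU can read the buffer matters far more than how cheaply the CPU can write it.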

                  But on the other hand... isn't the very purpose of a dynamic buffer to allow access on the CPU at minimal cost? And if so, why does cpu-skinning work at all if I initialize my buffer with BUF_Static?

                  For this I don't have an answer so I'd appreciate some comments from experienced developers. Either way though, I really do hope I can share this work with all of you who need it in the future...

                  Regards,

                  Sheridan
                  Last edited by srathbun-vl; 01-04-2017, 08:37 PM.


                    #10
                    Hi Sheridan,

                    this sounds pretty awesome. Respect
                    Do you have any plans to release this as a plugin?
                    Cheers Pascal


                      #11
                      Bumping the thread! It would be very interesting to hear some answers from experienced devs.


                        #12
                        @srathbun-vl Hi, any news or progress? I would be interested in it as a future topic for me...
                        Edge of Chaos RTS
                        "Age of Total Heroes" - RTS Pathfinding and Movement System for UE4
                        RTS Camera C++ Tutorial


                          #13
                          Would you be able to share any details on how you did this? I recently had to solve a similar problem, animated crowds of thousands, but I took a very different approach. I used instanced static meshes and baked the animations as vertex deformations stored in a texture then used vertex offset in the shader. How did you do this? Can you animate the instances out of sync with each other? Can you have different instances playing different animations?
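
The baked-offset approach described in this post can be sketched on the CPU as follows. This is a minimal illustration with hypothetical names (`BakedAnimation`, `FrameFor`, `AnimatedPosition`): a flat array stands in for the texture, a plain function stands in for the vertex shader, and it says nothing about how the ISKM component in this thread works.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Baked animation laid out like a vertex animation texture:
// one row per frame, one entry per vertex.
struct BakedAnimation {
    std::size_t frameCount;
    std::size_t vertexCount;
    float framesPerSecond;
    std::vector<Vec3> offsets;  // frameCount * vertexCount entries
};

// Looping frame index for one instance. The per-instance time offset shifts
// the phase, so instances sharing the same data can play out of sync.
std::size_t FrameFor(const BakedAnimation& anim,
                     float timeSeconds,
                     float instanceOffsetSeconds) {
    assert(anim.frameCount > 0);
    const float frames = static_cast<float>(anim.frameCount);
    float t = std::fmod(
        (timeSeconds + instanceOffsetSeconds) * anim.framesPerSecond, frames);
    if (t < 0.0f) t += frames;  // keep negative times inside the loop
    // Truncate to the lower frame (no interpolation in this sketch).
    return static_cast<std::size_t>(t) % anim.frameCount;
}

// Bind-pose position plus the baked offset for the instance's current frame.
Vec3 AnimatedPosition(const BakedAnimation& anim,
                      const Vec3& basePosition,
                      std::size_t vertex,
                      float timeSeconds,
                      float instanceOffsetSeconds) {
    assert(vertex < anim.vertexCount);
    assert(anim.offsets.size() == anim.frameCount * anim.vertexCount);
    const std::size_t frame =
        FrameFor(anim, timeSeconds, instanceOffsetSeconds);
    const Vec3& d = anim.offsets[frame * anim.vertexCount + vertex];
    return {basePosition.x + d.x, basePosition.y + d.y, basePosition.z + d.z};
}
```

In a scheme like this, the only per-instance state is the time offset, so desynchronising instances is cheap. Different animations per instance would need something extra, for example a per-instance starting row into a larger table, which is an assumption here rather than something stated in the post.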


                            #14
                            Bumpity bump bump.
                            George Rolfe.
                            Technical Coordinator at Orbit Solutions Pty Ltd.


                              #15
                              Originally posted by srathbun-vl:
                              Hi all,
                              Hey! Is there any public info about this work? A plugin, or tutorial, or an article... Anything? :-) Thank you!
                              Working on the Knightmare Lands game.
                              Contacts: telegram, twitter, website, email
