Rendering 5000 sprites in UE4 10x slower than Unity5?


    #16
    Originally posted by Nate
    Just to clarify, all default post-processing was turned off in the Unreal 5,000-sprite test:
    [Attachment 30581]

    Glad to hear that batching support is coming in 4.8. Is the code already in the master branch? And will the batching support be only for Paper2D, or will there be general batching for 3D meshes too?
    The batching support isn't in yet, I'll probably start on it in the next week or two. It will be specific to sprites, and the first form will be a similar interface to instanced static mesh components (but not using hardware instancing).

    Originally posted by TheJamsh
    3D Meshes already have batching IIRC, Instanced Meshes definitely do. Separate meshes might not, though I don't see why that would be the case.

    I know that Mesh particles in UE4 are done in batches now, but they're probably instanced meshes too.
    For 3D meshes, we have the instanced static mesh component for dynamic batching, and you can also use Simplygon if you have it to mesh merge (I think having a built-in mesh merging system is on the backlog). Loose static mesh components are not batched at the RHI level, but there is already a lot of render-thread optimization for static meshes with mobility set to Static. They still end up being multiple API draw calls, but they go down a much faster path on the render thread with fairly low per-mesh overhead.

    Cheers,
    Michael Noland



      #17
      Interesting note about Simplygon; I didn't know about it.
      | Savior | USQLite | FSM | Object Pool | Sound Occlusion | Property Transfer | Magic Nodes | MORE |



        #18
        Ran the same test with Urho3D on my machine (in OpenGL mode):

        5,000 sprites: 678 fps
        50,000 sprites: 90 fps
        100,000 sprites: 45 fps

        I tried to run the 50,000-sprite test on UE4, but it failed with a black screen and stopped responding; I had to kill it in Task Manager.

        Another issue that puzzles me is the huge memory usage in the UE4 test: it took 10x more memory than Unity or Urho.
        | twitter | github | #ue4tip



          #19
          Yes, DirectX works better on Windows than OpenGL. Even so, those are good FPS numbers.
          Hevedy - Instance Tools: https://hevedy.itch.io/hevedyinstances
          Hevedy - Image Tools: https://hevedy.itch.io/imagetools



            #20
            Originally posted by BrUnO XaVIeR
            Interesting note about Simplygon; I didn't know about it.
            That's because it doesn't come with UE4; it has to be licensed separately.



              #21
              Thanks for the info Nate, I'll post back here with before/after results when I get started and will also look into the memory usage you're reporting.

              In the meantime, you can run stat memory (displays some brief high-level info) and memreport (writes out a file with more detailed stats) to see roughly where the memory is going.

              Cheers,
              Michael Noland



                #22
                @Michael Noland, I wonder what the status of batching support is. I've been following your commits on the master branch, and it seems a lot of work has been done on the editor side lately. Thanks!



                  #23
                  It's in progress right now and still planned for 4.8, but there are no numbers to report yet. I'm doing some refactoring to let me provide clean workflows in the editor to merge/split batches, etc.

                  Cheers,
                  Michael Noland



                    #24
                    Originally posted by Michael Noland
                    For 3D meshes, we have the instanced static mesh component for dynamic batching, and you can also use Simplygon if you have it to mesh merge (I think having a built-in mesh merging system is on the backlog). Loose static mesh components are not batched at the RHI level, but there is already a lot of render-thread optimization for static meshes with mobility set to Static.
                    That's interesting. If I understand correctly, instanced meshes fall back to a single draw call per mesh, is that right? (Plus one call per material on top, unless it's two-sided, in which case two calls per material?) Or is it actually one call per material, with the meshes handled some other way?

                    Is it even possible to batch loose meshes into fewer draws? I know that at DICE they took a higher-level approach and attached all of their static meshes together to reduce draw calls, since I imagine that gave them more of a performance benefit than culling them individually.

                    EDIT: Sorry slight Hijack, just interesting stuff



                      #25
                      The grouped sprite component is in 4.8 now and out of experimental, so you should be able to check it out in 4.8 preview 3. Here are some numbers with a test that I think approximates yours.

                      I have two test cases: 5K elements using a large sprite, and 100K using a (spatially) smaller sprite, because the 100K test with the large sprite ended up GPU bound due to overdraw.

                      100K grouped instances:
                      [Image: Grouped100K.png]
                      • 5K large separate components - 85 ms render thread time, 5001 draw calls
                      • 5K large instances on a grouped sprite component - 1.9 ms render thread time, 2 draw calls
                      • 100K small separate components - Got tired of waiting for stats after a few minutes
                      • 100K small instances on a grouped sprite component - 1.85 ms render thread time, 2 draw calls (as long as the group isn't dynamic, RT time is now constant, only GPU time increases as the # of sprites increases)
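As a rough sanity check on the 5K figures above, the arithmetic can be sketched in a few lines of Python. This is only back-of-the-envelope math on the quoted numbers; treating the render thread time as dominated by per-draw-call work is an assumption made for illustration, not something measured here.

```python
# Figures quoted in the post above: render thread time (ms) and draw calls.
separate_ms, separate_draws = 85.0, 5001
grouped_ms, grouped_draws = 1.9, 2

# Naive average render-thread cost per draw call for loose components,
# in microseconds. Assumes per-draw work dominates (an assumption).
per_draw_us = separate_ms / separate_draws * 1000.0

# Ratio of render thread times between the two 5K cases.
rt_ratio = separate_ms / grouped_ms

print(f"~{per_draw_us:.1f} us per draw call")
print(f"~{rt_ratio:.1f}x less render thread time when grouped")
```

On these numbers that works out to roughly 17 microseconds of render thread time per loose sprite, and a reduction of about 45x for the grouped case.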


                      Note: Unfortunately, I wasn't able to do much about the cache uniform expressions overhead when using loose sprites, since it isn't possible to safely make a persistent override proxy if the game code is using a MID. Grouped sprites dramatically reduce the number of components typically necessary, though, and should help quite a bit there.

                      Working with groups is pretty simple:
                      • You can build them programmatically like a UInstancedStaticMeshComponent works, but they have fewer limitations (you can mix/match sprites and materials in one, it will generate additional draw calls as necessary)
                      • You can convert selections in the level editor into a grouped component using the Merge button in the details panel if all selected items are sprite actors, or the right-click context menu if the selection is mixed (it will leave non-sprite objects alone but delete sprite actors, replacing them with a merged actor).
                      • You can split a sprite group back into separate actors if you need to move/reposition something, and can then re-merge them.
                      • You can sort them based on the rendering project settings TranslucencySortAxis, so that batches that contain translucent sprites render as expected.
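To illustrate why mixing sprites and materials in one group costs extra draw calls, here is a minimal Python model of the idea. It is not engine code and does not use the Paper2D API: `SpriteInstance`, its fields, and `build_batches` are hypothetical names, and treating each distinct (material, texture) pair as one draw call is a simplifying assumption.

```python
from collections import defaultdict
from typing import NamedTuple


class SpriteInstance(NamedTuple):
    # Hypothetical fields, for illustration only.
    material: str
    texture: str
    position: tuple


def build_batches(instances):
    """Bucket instances by (material, texture); each bucket models one draw call."""
    batches = defaultdict(list)
    for inst in instances:
        batches[(inst.material, inst.texture)].append(inst)
    return dict(batches)


# 5,000 instances that share only two material/texture combinations.
instances = (
    [SpriteInstance("Masked", "tiles.png", (float(i), 0.0, 0.0)) for i in range(4000)]
    + [SpriteInstance("Translucent", "fx.png", (float(i), 0.0, 1.0)) for i in range(1000)]
)

batches = build_batches(instances)
print(len(instances), "sprites ->", len(batches), "modelled draw calls")
```

In this model the draw call count follows the number of distinct material/texture combinations rather than the number of sprites, which is the behaviour the list above describes.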


                      Note: All sprites in a group will be drawn as one or a few draw calls (the minimum required given the materials and textures). This means that:
                      • Culling will be done as a whole unit. Either all instances are drawn or none of the instances are drawn. You probably don't want to group sprites from opposite ends of the map together.
                      • Sorting will be done as a whole unit. If you have some translucent foreground sprites and some translucent background sprites, you probably don't want to group them together. With the sort button they'll sort correctly relative to each other, but a translucent player in the mid-ground can't pass in between the two of them; it'll draw either in front of both or behind both. These kinds of sorting issues only apply to Translucent materials. Masked materials don't have the same issues, but they only work for binary (0 or 1) opacity.
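The two caveats above can be sketched with a small Python model. This is an illustration under assumptions, not how the engine implements it: the helper names are hypothetical, culling is modelled as an axis-aligned box overlap test against a box-shaped view region, and the direction of the sort (ascending projection onto the axis) is an arbitrary convention chosen for the example.

```python
def union_bounds(points):
    """Axis-aligned box (min corner, max corner) enclosing every position."""
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


def boxes_overlap(a, b):
    """True if two (min, max) axis-aligned boxes intersect."""
    (amin, amax), (bmin, bmax) = a, b
    return all(amin[i] <= bmax[i] and bmin[i] <= amax[i] for i in range(3))


def sort_along_axis(points, axis):
    """Order positions by their projection onto a sort axis (ascending)."""
    return sorted(points, key=lambda p: sum(c * a for c, a in zip(p, axis)))


# One sprite near the camera and one at the far end of the map, in one group.
near, far = (0.0, 10.0, 0.0), (10000.0, -10.0, 0.0)
group = [near, far]
view_box = ((-100.0, -100.0, -100.0), (100.0, 100.0, 100.0))

# Culling as a unit: the group's combined bounds touch the view, so the whole
# group is submitted even though the far sprite alone would have been culled.
group_visible = boxes_overlap(union_bounds(group), view_box)
far_alone_visible = boxes_overlap((far, far), view_box)
print(group_visible, far_alone_visible)

# Sorting inside the group: instances are ordered relative to each other only.
ordered = sort_along_axis(group, (0.0, 1.0, 0.0))
print(ordered)
```

The sort fixes the order of instances inside the group, but the group is still submitted as one unit, so a separate translucent object cannot be interleaved between its members.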


                      Cheers,
                      Michael Noland
                      Last edited by Michael Noland; 05-18-2015, 04:44 PM.



                        #26
                        BTW, I did that test with t.MaxFPS set to 0, so it was capping at 60 Hz. Changing to t.MaxFPS 1000 for the 100K grouped test shows a frame time around 4 ms / 250 fps, but that's not a very meaningful number; you need to keep adding work until you're back in the target range.
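For reference, the frame time and fps figures convert as in the short Python sketch below. The 60 Hz target is taken from the cap mentioned above; expressing the result as a fraction of the frame budget is an added illustration, not a measurement from the test.

```python
def ms_to_fps(frame_ms):
    """Frames per second for a given frame time in milliseconds."""
    return 1000.0 / frame_ms


def fps_to_ms(fps):
    """Frame time in milliseconds for a given frame rate."""
    return 1000.0 / fps


uncapped_fps = ms_to_fps(4.0)        # the ~4 ms frame quoted above
budget_ms = fps_to_ms(60.0)          # frame budget at a 60 Hz target
budget_used = 4.0 / budget_ms        # fraction of the budget consumed

print(f"{uncapped_fps:.0f} fps uncapped, {budget_ms:.2f} ms budget at 60 Hz, "
      f"{budget_used:.0%} of budget used")
```

Seen this way, the 100K grouped test uses about a quarter of a 60 Hz frame, which is why the uncapped fps figure says little until more work is added to the scene.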

                        Cheers,
                        Michael Noland



                          #27
                          Awesome work Michael!
                          Ryan Brucks
                          Principal Technical Artist, Epic Games



                            #28
                            Very cool!

                            So you're telling me that you also exceeded the Unity implementation by an extra 16 fps? Stunning!



                              #29
                              Michael, could you share the map/scene or the project, or give an example of how the scene was set up?



                                #30
                                I love the 'fighting talk' nature of this thread. I wonder what I could persuade one of the engine developers to do if I could come up with something that another app does better. How about this: in Unity you can output a signal with an alpha channel to a dedicated broadcast output card (I don't know if that's true), so why can't UE? (That's a highly specific problem I personally face.)
