How to use hierarchical LOD

anonymous_user_d9ccfe00 · August 1, 2015, 9:19pm

Hello

Can someone tell me how hierarchical LOD works? Not how to activate it, but how it actually works.
I can’t find any tutorials or documentation. One question, just to get it right: this does create LODs?

Thanks

Slayemin · August 4, 2016, 1:00am

Normally, when you create a static mesh which has various LOD settings and then place it into the world, the engine is able to switch LOD’s automatically based on the screen size your mesh occupies.
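To make the screen-size switch concrete, here is a minimal sketch in Python. It is not engine code: the function name `select_lod` and the threshold values are made up for illustration, and the only assumption taken from the post is that the LOD is picked from the fraction of the screen the mesh covers.

```python
def select_lod(screen_size, thresholds):
    """Return the index of the LOD to draw.

    thresholds[i] is the smallest screen size (fraction of the screen, 0..1)
    at which LOD i is still used. Values must be in descending order.
    """
    for lod_index, min_size in enumerate(thresholds):
        if screen_size >= min_size:
            return lod_index
    # Smaller than every threshold: fall back to the coarsest LOD.
    return len(thresholds) - 1


# Hypothetical thresholds for a mesh with five LODs (LOD 0 is the most detailed).
THRESHOLDS = [0.5, 0.25, 0.1, 0.05, 0.0]

print(select_lod(0.8, THRESHOLDS))   # large on screen: detailed LOD
print(select_lod(0.07, THRESHOLDS))  # small on screen: coarse LOD
```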

Let’s say you have thousands of the same static mesh which you want to place all over your level. On the backend side, each placed static mesh is a separate draw call!!! If you have thousands of the exact same mesh, this could get expensive very quickly! How does a draw call traditionally work? You take a static mesh from main memory and then you send its vertex data down the graphics pipeline to the graphics card. It then draws that mesh based off of the vertex data and your material.

So, the graphics card manufacturers and DirectX came together and said, “Hey, a lot of people are using the exact same mesh thousands of times, but just with a little bit of variance in the mesh parameter data. How can we make this more efficient?” What they came up with is a technique called “Instancing”. You send a single batch of vertex data down to the graphics card once, and then you send another batch of data which contains “instance” data, which commonly contains things such as the instance location, rotation, scale, and any other per-instance parameters. The graphics card then pretty much takes a rubber stamp and plops down hundreds of instances of your static mesh, ALL IN ONE DRAW CALL!!! This is SUPER awesome because suddenly performance is through the roof! You gotta remember that the graphics card absolutely thrives on doing the same task over and over again because it can easily parallelize those things. Instancing is really great and is used all over the place. All of those fancy particle systems you see? Especially with GPU sprites? Instancing magic!
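As a rough sketch of the bookkeeping (not of any real graphics API), the Python below counts draw calls for the same scene with and without instancing. The function `count_draw_calls` and the `wall_tile` mesh name are hypothetical; the only idea modelled is "one draw call per placed mesh" versus "one draw call per unique mesh, plus a list of per-instance transforms".

```python
from collections import defaultdict


def count_draw_calls(placed_meshes, instanced):
    """placed_meshes is a list of (mesh_name, transform) tuples."""
    if not instanced:
        # Without instancing, every placed mesh is submitted on its own.
        return len(placed_meshes)

    # With instancing, meshes sharing the same vertex data form one batch,
    # and the transforms become the per-instance data for that batch.
    batches = defaultdict(list)
    for mesh_name, transform in placed_meshes:
        batches[mesh_name].append(transform)
    return len(batches)


# 1,000 copies of the same mesh, each at a different location.
scene = [("wall_tile", (x * 100.0, 0.0, 0.0)) for x in range(1000)]

print(count_draw_calls(scene, instanced=False))
print(count_draw_calls(scene, instanced=True))
```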

So, instancing is great, right? But it could be even better.

I have a flat plane which I use to draw walls all over my level. It has 5 levels of LOD. I have hundreds of these static mesh planes, and at the highest LOD, each tile contains 200 triangles! When you move away, I quickly LOD away to 2 triangles. This would be a GREAT candidate for instancing, right? But how does instancing work with a mesh that has 5 different LODs? Enter the “Hierarchical Instancing” system! What this does on the backend is it takes EVERY LOD in your static mesh and creates an instance batch for each one. In my case, 5 batches, one for each LOD! Then, the engine goes through and figures out the correct LOD for each static mesh position and includes that in the instanced data block. It’s like, “Hey, you know that list of five LODs I sent you? Use #2 for this particular instance!” Then, the engine goes through and invokes a draw call on all of your instances. The number of draw calls is equal to the number of LODs in your mesh! If I had 1,000 static mesh planes and each one had 200 triangles, and I just did a regular old instanced draw call on that, then the GPU would be drawing 1000 * 200 = 200,000 triangles … that’s quite a bit. If I use Hierarchical Instancing and 90% of my quads are reduced to my lowest LOD, then I’m looking at 1000 * 0.1 * 200 + 1000 * 0.9 * 2 = 21,800 triangles. This is WAY fewer triangles! So, the question of the day we need to ask: Is it faster to draw 200,000 triangles in one draw call, or is it faster to draw 21,800 triangles in five draw calls? (Let that be answered by your performance analyser, don’t guess!)
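The arithmetic above can be checked with a few lines of Python. This is only a sketch of the triangle budget, not of the engine: `triangles_drawn` is a made-up helper, and only the two LODs the post gives numbers for (200 and 2 triangles) are modelled, so it shows two batches rather than five.

```python
def triangles_drawn(batches):
    """batches is a list of (instance_count, triangles_per_instance),
    with one entry per draw call."""
    return sum(count * triangles for count, triangles in batches)


# Plain instancing: all 1,000 planes are drawn at the 200-triangle LOD.
plain = [(1000, 200)]

# Per-LOD batches: 10% stay at the highest LOD, 90% drop to the 2-triangle LOD.
per_lod = [(100, 200), (900, 2)]

print(triangles_drawn(plain), "triangles in", len(plain), "draw call(s)")
print(triangles_drawn(per_lod), "triangles in", len(per_lod), "draw call(s)")
```

Whether the smaller triangle count wins over the extra draw calls is still something to measure in a profiler, as the post says.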

Where do you see Hierarchical Instanced LOD’s in the engine? Pretty much any time you use the foliage tool with a foliage mesh which contains LOD’s, you’re using it. You can create massive landscapes filled with instanced grass meshes which auto LOD as you get closer with a negligible impact on performance.

Anyways, that’s how it all works.

Slayemin · August 4, 2016, 5:44am

Thanks. I wrote my own engine from scratch before switching to UE4 and I implemented instancing using my own custom vertex buffers. Their HISM implementation is a good step beyond what I did, but my instancing with custom vertex buffer code and shader code was extremely lean. Anyways, I’ll try to weigh in a bit more on a point by point basis here.

I created my own tiling static mesh blueprint which has a drop down menu where I can toggle whether I want it to be a static mesh, an instanced static mesh, or a HISM. The implementation was done in the construction script and the node graph is very similar across all three implementations. Once you got one implementation down, implementing the other two is about a 5 minute job and requires relatively little thought.

Generally, I tend to implement first, look for performance bottlenecks, and then optimize if I find any. You don’t want to prematurely optimize something, particularly if you can’t measurably say that it actually is a performance bottleneck. In some cases though, it’s a pretty plain no-brainer: 200,000 triangles vs. 21,800? The GPU still has to process every vertex, even if it’s a half pixel on the screen due to distance. The more you can lighten the polygon count in areas where it doesn’t matter, the more budget you have in areas that do matter.

So, you really only start to see huge performance gains with instancing if there are lots of similar meshes to actually instance. I wouldn’t waste my time instancing 10 meshes, but when you start looking at 100’s and 1000’s, you start getting interested. In the case of rock formations, you’d want to make a lot of them to make it worth instancing. One other area you could start looking for performance gains is also in memory pools. Rather than constantly allocating and deallocating memory on your heap, which gradually gets fragmented, you can create a contiguous array of memory which recycles objects. If you combine memory pools with instancing, you can get some blazing fast performance. Don’t quote me officially on this, but I think merging actors isn’t really the same comparison to make with instancing (you may have multiple static meshes being merged, which all need their own draw calls unless you can instance batch those on the backend).
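For anyone unfamiliar with memory pools, here is a minimal sketch of the recycling pattern in Python. It only illustrates the bookkeeping (allocate everything once, then hand slots out and take them back); the real benefit around heap fragmentation shows up in languages with manual memory management such as C++. The `ObjectPool` class and its method names are hypothetical.

```python
class ObjectPool:
    """Fixed-size pool: objects are created once and then recycled."""

    def __init__(self, factory, capacity):
        # All slots are allocated up front, in one contiguous list.
        self._slots = [factory() for _ in range(capacity)]
        # Indices of slots that are currently unused.
        self._free = list(range(capacity))

    def acquire(self):
        """Hand out an unused slot as (index, object)."""
        if not self._free:
            raise RuntimeError("pool exhausted")
        index = self._free.pop()
        return index, self._slots[index]

    def release(self, index):
        """Return a slot to the pool so it can be reused."""
        if index in self._free:
            raise ValueError("slot released twice")
        self._free.append(index)


pool = ObjectPool(factory=dict, capacity=2)
index, particle = pool.acquire()
particle["position"] = (0.0, 0.0, 0.0)
pool.release(index)

# The next acquire reuses the same object instead of allocating a new one.
index_again, recycled = pool.acquire()
print(recycled is particle)
```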

I don’t think there’s a difference. All instancing does is batch draw calls on the GPU, which cuts draw call overhead and increases frame rate (the polygon savings come from the per-instance LODs, not from instancing itself). The physics stuff is generally still going to be on the CPU side, and even if you use the NVidia PhysX stuff, I don’t think it makes a difference in terms of performance. Someone smarter than me on the PhysX coding side could probably correct me, but I generally haven’t gotten far on the GPU specific API calls. You can see the use of instanced static meshes on the GPU particle emitters though: Each point sprite is just 2 triangles which get instanced and the physics collision stuff is done with NVidia PhysX calls on the GPU. Keep in mind that the GPU side physics collisions are going to be a lot more limiting in how you can respond than what you can do on the CPU side though. Once data gets down to the metal in the graphics card, there is no going back up to main memory to poll a variable value, so … those flaming sparks particles probably shouldn’t ignite that flammable straw on the GPU side.

janpec · August 4, 2016, 7:29am

Nice explanation. I wish HLOD worked with dynamic actors as well, since I am using many rocks that are placed as dynamic from the get-go.

Slayemin · August 4, 2016, 8:46pm

I’m not quite sure what you mean here. HISM can be used with actors which are dynamic (see: all particle emitters). If you’re placing hundreds of rocks and each rock has an LOD, you could use HISM, but it would take a bit of extra work upfront.

I’d probably create a blueprint which has a list of all the different rock types (array of meshes), and then create an array of transforms for each rock type. So, rock type 1 has a corresponding array #1, rock type 2 has array #2, etc.

In your construction script, you’d run through each mesh template and place an instance at each array transform for the corresponding mesh.

It would be a bit clunky to use in the editor, but it’d work. If you need the rocks to move (such as in an avalanche), you can go through each instance and update the transform position per tick.
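The layout described above (one array of transforms per rock type, filled in by the construction script, optionally updated per tick) could be sketched like this in Python. This is not Blueprint or engine code: `build_instance_batches`, `move_instances`, and the mesh names are hypothetical, and transforms are reduced to plain (x, y, z) positions.

```python
def build_instance_batches(rock_meshes, transforms_per_mesh):
    """Pair rock type i with transform array i, giving one batch per mesh."""
    if len(rock_meshes) != len(transforms_per_mesh):
        raise ValueError("need exactly one transform array per rock type")
    return {
        mesh: list(transforms)
        for mesh, transforms in zip(rock_meshes, transforms_per_mesh)
    }


def move_instances(transforms, offset):
    """Per-tick update: shift every instance of one batch by the same offset."""
    dx, dy, dz = offset
    return [(x + dx, y + dy, z + dz) for x, y, z in transforms]


rock_meshes = ["rock_small", "rock_large"]
transforms_per_mesh = [
    [(0.0, 0.0, 0.0), (50.0, 10.0, 0.0)],  # positions for rock_small
    [(200.0, 0.0, 0.0)],                   # positions for rock_large
]

batches = build_instance_batches(rock_meshes, transforms_per_mesh)

# Avalanche-style update: slide the small rocks downhill by one step.
batches["rock_small"] = move_instances(batches["rock_small"], (0.0, 0.0, -5.0))
print(batches["rock_small"])
```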

Alternatively, the better solution is to just let the foliage tool in the editor place your rocks. It would handle the placement and back end management of each rock and give you good tools to edit and customize each instance. This is good for rocks which don’t move though, so rock avalanches would be out.

Naveed · November 7, 2016, 3:01pm

Just in case anyone stumbles upon this thread, there is now HLOD documentation:

https://docs.unrealengine.com/latest/INT/Engine/HLOD/index.html

Oskar_Swierad · November 7, 2016, 4:26pm

@Slayemin: that’s a great explanation of … Hierarchical ISMs. Not HLOD ;] HLOD is for grouping actors into bigger batches, to reduce their number.

Slayemin · November 7, 2016, 9:16pm

Yeah, you’re kind of right and kind of wrong at the same time, and I was also slightly right and wrong about the implementation of the HLOD on the engine side. I had to read up on it a bit to understand what the approach is.

So, where I’m wrong is when I mistakenly assumed that HISM == HLOD. Where I’m right is that they’re still taking advantage of instancing. They have to be. The DirectX API only does instanced and non-instanced primitives, and instancing is far faster on the GPU because the vertex data is sent down the GPU pipeline only once and then the GPU creates the instances, so the data throughput is insanely small and the GPU is insanely fast at doing parallel processing. Anything other than instanced primitives generally has its own separate draw call, which is why instancing is so much faster.

So, the HLOD system is a really cool approach. You create a cluster of primitives. Those primitives may have various LOD’s as well. When it comes time to do a draw call, I imagine that the HLOD system is asking “How many primitives with this vertex set do I need to draw?” and when it receives a number, it does an instanced draw call. Each LOD is a unique vertex set. The best way for the engine to take advantage of instancing is to look through the entire scene to look for primitive vertex sets which can be instanced, so when you’re creating a set of rocks, turn it into an HLOD cluster, and then drop multiple clusters into your scene. The scene renderer should be spanning across all clusters to see if there are shared primitives it can batch into an instanced draw call. It also does something cool with textures by creating a single texture atlas for all the meshes in a cluster, and then sending that one texture down to the GPU. Any duplicate clusters can share that global texture atlas until the rendering pass is complete. This further reduces the amount of data going down the GPU pipeline, which in turn means higher frame rates.

Oskar_Swierad · November 14, 2016, 10:28pm

Training Stream - Hierarchical Level of Detail - Nov 15th - Live From Epic HQ

KVogler · November 14, 2016, 10:37pm

Here is the doc:

https://docs.unrealengine.com/latest/INT/Engine/HLOD/index.html

anonymous_user_c94f194d · January 2, 2017, 2:11pm

@Slayemin: You are only guessing about the implementation. That is not how HLOD works at all. HLOD pre-generates new static meshes under the hood and can optionally bake the combined materials into textures to produce simplified materials. Instancing is not useful here at all, because each cluster is a unique mesh. Because of this, additional optimizations are possible, such as removing triangles that lie under the landscape.
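A toy Python sketch of the idea in this post: merge the meshes of a cluster into one new mesh and drop triangles that sit entirely below the ground. This is an illustration only, not the engine's algorithm. The real system also simplifies geometry and bakes materials, and the flat `ground_z` plane standing in for the landscape, as well as the name `merge_cluster`, are assumptions made for the example.

```python
def merge_cluster(meshes, ground_z=None):
    """Merge several meshes into one proxy mesh.

    meshes is a list of (vertices, triangles), where vertices are
    world-space (x, y, z) tuples and triangles are index triples.
    If ground_z is given, triangles fully below that height are dropped.
    """
    merged_vertices = []
    merged_triangles = []
    for vertices, triangles in meshes:
        # Indices of this mesh must be shifted past the vertices already merged.
        offset = len(merged_vertices)
        merged_vertices.extend(vertices)
        for a, b, c in triangles:
            triangle = (a + offset, b + offset, c + offset)
            if ground_z is not None and all(
                merged_vertices[i][2] < ground_z for i in triangle
            ):
                continue  # hidden under the (assumed flat) landscape
            merged_triangles.append(triangle)
    return merged_vertices, merged_triangles


rock_above = ([(0, 0, 1), (1, 0, 1), (0, 1, 1)], [(0, 1, 2)])
rock_buried = ([(0, 0, -2), (1, 0, -2), (0, 1, -2)], [(0, 1, 2)])

vertices, triangles = merge_cluster([rock_above, rock_buried], ground_z=0.0)
print(len(vertices), "vertices,", len(triangles), "triangle(s) in one mesh")
```

Because the result is a single unique mesh per cluster, it is drawn with one ordinary draw call, which is why instancing does not come into it.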

Slayemin · January 3, 2017, 12:06am

Yup, I’m totally wrong. The training stream is available now and gives a lot more detail on what all of this is and how it works. Implementation details can be found in the engine source code.