Optimizing repetitive calculations in the material system

Hello everyone.

I have a material with over 300 instructions, all of which basically depend on constant data: one texture. From that texture I generate material data like base color, metallic, normal map, etc.

As far as I understand how rendering works, a material gets compiled to shader(s), which, in turn, execute every frame. The problem is that the values of all those 300+ instructions are constant from frame to frame.
So, I guess, the engine has an opportunity to cache those values somewhere and cut out all those instructions. From what the editor is reporting, it seems that, so far, neither the shader compiler nor the runtime does this kind of optimization, but correct me if I'm wrong.

That is probably tricky to implement in a "fully automated" manner, but how about giving artists a way to explicitly create "cached", or "constant up to this point", pieces of data in the material editor?

It may seem like a useless thing to do, but it could be of service when you'd like to cut down on the number of textures you pack into the game (since at runtime you'd still have to load those "caches") while also being limited in shader complexity.

Um, I don't think you understand how shaders really work. A material is calculated per pixel, per frame, because the object exists in 3D space and can appear anywhere in the frame. The camera can also move every frame. And even if neither the object nor the camera moves, you still have the sky moving and screen-space reflections updating in realtime based on what's going on around the object. You can't just cache the lighting data for a single frame and expect it to work forever. Remember, all that shader data gets fed into the lighting and GI, and shadows get cast as well. So nothing can ever be constant with a shader in a realtime environment.

UE4 will automatically compile your material and make optimizations during compilation. I think another optimization pass is done when the game is finished and packaged, to take care of things like static parameters, but it's not going to magically shrink a material that's 300 instructions down to something reasonable.

I think your problem has to do with using extra shader instructions to generate data that you could simply feed in with a texture map. Deriving normal map and metallic data from the same texture? Yeah, UE4 has shared samplers now, so you can have up to 128 textures in the same material without costing any more draw calls per texture. The main issue with textures is not shader instructions but draw calls, and those have been taken care of. You have enough memory to use textures for stuff like this!

Keep finding ways to make your shaders simpler: remove extraneous features. Don't overlay too many normal maps. If you can separate your shader into different materials on the same object at the polygon level, do that if it's getting too heavy. Contract the math beforehand as best you can. I tested an animated sine wave by adding 1 and dividing by 2 (remapping -1..1 to 0..1), then lerping between two parameters, and that was 10 instructions in all. The same sine wave lerped from 0.5 to 1 is 8 instructions, so wherever I'm able to perform the math beforehand, I come out ahead of the hard-coded method.
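To illustrate the contraction, here is a minimal sketch in plain C++ standing in for the node math (the function and parameter names are illustrative, not engine code):

```cpp
#include <cmath>

// Naive version: remap sin() from [-1, 1] to [0, 1], then lerp between A and B.
float naiveWave(float time, float A, float B) {
    float t = (std::sin(time) + 1.0f) * 0.5f; // -1..1 => 0..1
    return A + (B - A) * t;                   // lerp(A, B, t)
}

// Contracted version: fold the remap and the lerp endpoints into two constants
// computed beforehand, so only one multiply-add remains around the sine.
float contractedWave(float time, float scale, float bias) {
    // scale = (B - A) * 0.5f and bias = (A + B) * 0.5f, precomputed offline
    return std::sin(time) * scale + bias;
}
```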

If you want something constant outside of GPU code, compute it on the CPU (Blueprint or C++) and send the result to the shader (material) as a parameter.
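For example, a minimal UE4 C++ sketch of that approach (the parameter name "PrecomputedValue" and the surrounding function are placeholders, not a prescribed setup):

```cpp
#include "Components/StaticMeshComponent.h"
#include "Materials/MaterialInstanceDynamic.h"

// Feed a frame-invariant result into a material once, from the CPU side.
void ApplyPrecomputedValue(UStaticMeshComponent* Mesh, float Precomputed)
{
    // Create a dynamic instance of the mesh's first material slot
    if (UMaterialInstanceDynamic* MID = Mesh->CreateAndSetMaterialInstanceDynamic(0))
    {
        // The shader now reads this as a constant instead of recomputing it per pixel
        MID->SetScalarParameterValue(TEXT("PrecomputedValue"), Precomputed);
    }
}
```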

Hello guys. The problem is not about making the material simpler. It's about the final output results that the UE4 shaders are using. Here is my material…

There are 280 instructions, which make this material pretty slow. It also uses one texture as input, and that is all, so the input data will never change. That means the Roughness, Metallic, Color, and normal outputs WON'T change. So the question is: can the engine convert all of these and make this:

Or does UE actually cache these outputs to apply them in real time?

Once you've built the material, you have to compile it. Unreal does a lot of "black magic" behind the scenes and a big heap of optimization that, most of the time, is actually more efficient than coding in raw HLSL (because Epic probably know what they're doing more than most). I can't remember if the stats readout shows the final instruction count or the material editor's instruction count.

Either way, 280-300 instructions is still only around the mid-point of where you can aim. You can reach in excess of 500 instructions and the engine still won't complain (hell, a 350-instruction material shows up as dark green in 'Shader Complexity View' in the viewport on my machine). Another thing: it's a pixel shader, so it's actually 300 instructions per pixel, per frame. Imagine you have a 1920 * 1080 scene. If your shader covers the entire area, that'd total up to:

**622,080,000 Instructions Per Frame for the GPU to do.**
Now, it doesn't actually work like that, thanks to the magic of things like caching, buffers, and a whole heap of other stuff I don't understand, but the end result is that it's actually super efficient as far as realtime rendering goes. You're a very brave or intelligent person if you think you can optimize UE4's renderer. However, you do have to add things like lighting, post process, reflections, lightmaps, and translucency on top, and that's where you can get killed.

What you're effectively saying in the post, if I'm not mistaken, is something that fundamentally wouldn't work. You can collapse those nodes you've made down to functions in the material editor, but the functionality you need won't change. If you have a bunch of static operations (such as altering the colour with a multiply, or changing the contrast), it's better to export the texture back out, do those operations on it in Photoshop, then re-import it and remove the functionality from the material. Unless, of course, you need that functionality to be dynamic or instance-able, in which case it has to be done in the material. Unless you want to create a load of slightly altered textures and import them, that's the only way to do it. (The latter method is horribly inefficient, btw.)

The image you linked is too small for me to see the other operations you're doing, but it looks like there's some UV math in there as well. That you can't do in an external editor; it has to be done in the material. Whether the input texture changes or not, if you have to perform operations on its UVs, you can't do that any other way.

I’m surprised that 280 instructions runs slowly though. I have a scene where 350 instructions pretty much fills up the entire viewport the entire time and it’s coping just lovely.

Edit: MarioMGuy pretty much nailed it. Should really read threads before I comment…

Well, I only guess it runs slowly. I have a GTX 770, so everything runs smoothly, and I cannot tell whether my material is expensive or not. I'm just going by the instruction count.

At this point, I disagree with you. (Fragment) shaders are indeed calculated per pixel, per frame, the camera can move, etc., but there are some things that are common across multiple frames. Not the whole shader program, but some pieces of it keep executing on the same data, therefore leading to the same values. This is what we could call the constant part. The idea is to store the intermediate results of the shader program (the former material) somewhere (most probably texture memory, or maybe somewhere else on the GPU, like CUDA-allocated memory, if possible) and cut out this repetitive part. This would be a trade-off of GPU memory for shader complexity. Again, that would be difficult to make into a fully automatic optimization (during the shader compile process) due to the non-deterministic nature of the trade-off, meaning it depends on whether the material developer wants it or not.
I don't know if this technique is useful in games, since it seems to take a lot of effort to implement for wide use.
As proposed:

we could precompute these calculations with the CPU, or maybe with CUDA, or by rendering parts of the initial shader to some separate buffer, and then pass the results via shader parameters to wherever they are applicable.

By the way, GPUs now operate at the scale of gigaflops - that's 10^9 floating point operations per second. My GeForce 470 has 1088.6 GFLOPS according to Wikipedia, which is more than enough to execute 622,080,000 (0.62208 * 10^9) instructions per frame at 60 fps (60 * 0.62208 * 10^9 = 37.3248 * 10^9 per second). Of course, there would be some optimizations to cut down on shader invocations, but I guess that, in general, shaders in Unreal are invoked the way shaders usually are. I wouldn't mind some advanced high-level overview of the rendering optimization techniques currently deployed in Unreal, though.
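A quick check of that arithmetic in C++ (the instruction count is from earlier in the thread; the GFLOPS figure is Wikipedia's, and FLOPS are of course not a 1:1 proxy for shader instructions):

```cpp
#include <cstdio>

int main() {
    const double perFrame  = 1920.0 * 1080.0 * 300.0; // 622,080,000 instructions
    const double perSecond = perFrame * 60.0;         // ~3.73e10 at 60 fps
    const double gflops470 = 1088.6e9;                // GeForce 470 peak, per Wikipedia

    std::printf("needed: %.4e ops/s, available: %.4e FLOPS (%.1fx headroom)\n",
                perSecond, gflops470, gflops470 / perSecond); // ~29x headroom
}
```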

Getting back on topic: such a thing would require not only shader compiler support but also runtime support to work. I'm really curious what Epic's opinion on this is.

PS: even though I started the whole thing, this seems like an edge case (in terms of use) to me, but I think it's still worth discussing.

No, a shader will not automatically generate textures when you compile: it does the layering in realtime. This is because of stuff like UVs: you can layer one texture over another and shift it ever so slightly, such that the only way to compress the result down to a texture is to use a very large texture that is not a power of two. The benefit of shader calculations running in realtime is that flexibility: you can do virtually anything you want in the shader. You can tile a texture a thousand times and layer it over another texture of a different resolution that tiles 752.37 times, and then pan it using values generated during the game. But textures are cheap. If you can combine the textures outside the material editor and achieve a good result, do that. Everything you do in the shader is a waste of resources, and everything you do in the shader that could have been done in Photoshop is a complete and utter waste of resources to the highest degree. There are some constants and cross-referencing, but that's it.

Normal maps are also not standard textures: the engine automatically applies a -1 to 1 remap to the RG values so the normals function appropriately. But it does this for each normal map you decide to use. If you need to overlay more than two normals, it might be cheaper to import them as standard textures, mask the RG values, lerp them to -1 to 1, and then append the B values, just so this calculation is done once. I'm not sure if UE4's compiler has the capability of rearranging operations, and when you're overlaying 5 normal maps with each other, there's a huge waste there. Just try combining the normal maps into fewer textures using a different program.
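A minimal C++ sketch of the idea (the blend here is a simple additive "UDN-style" overlay, one common approach and not necessarily the exact node setup described above; the point is that the 0..1 to -1..1 unpack is written in one place):

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

// Unpack a stored texel channel from 0..1 to the -1..1 range normals need.
static float unpack(float c) { return c * 2.0f - 1.0f; }

// Blend two tangent-space normal map texels (given as raw 0..1 values):
// add the XY perturbations, keep the base Z, renormalize. The unpack
// happens once here instead of once per normal map node in the material.
Vec3 blendNormals(Vec3 base, Vec3 detail) {
    float x = unpack(base.x) + unpack(detail.x);
    float y = unpack(base.y) + unpack(detail.y);
    float z = unpack(base.z); // keep the base normal's Z
    float len = std::sqrt(x * x + y * y + z * z);
    return { x / len, y / len, z / len };
}
```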

Just because you CAN make a shader waste 500 instructions per pixel, DON'T DO THIS. You're going to want to look at the pixel fillrate to get a better idea of a graphics card's pixel shading performance. Then divide it by every single pixel (720p is 921,600 pixels; 1080p is 2,073,600 pixels). Then divide that by every single pixel that has a stationary or dynamic light touching it, for every stationary/dynamic light that does (look at the light complexity view for this). Then divide all that by the number of instructions required per pixel to run the screen space reflections. Then divide that by the number of instructions the reflection environment costs per environment capture, per pixel (if your shader is unlit, you can skip all the way down to here). Then divide all of that by the cost of post processing per pixel. Then divide that by the number of times the frame needs to be drawn per second… and THEN you'll have a closer idea of what the real pixel rendering costs are for a game per second. Note this does not take into account vertex instructions or draw calls. Look at the fillrate, not the FLOPS, to determine how well a GPU handles pixel operations.

mariomguy

By stating that, you don't mean it's impossible to do, right? You are just stating that it does not work this way right now. What's the problem, though? Layering can also be done in the same manner with textures generated by some part of the engine. You also have the shader code. The proposal is to add the ability to extract the pieces of the shader code that generate some data, and to make them execute once and store the result into a texture (which is only stored in RAM/video memory, and not shipped with the game package, to cut down on its size), which, in turn, is then used by the shader in place of the code that generated that same data on each shader invocation. It is a shader code cache, in a sense.

I don't really get your formulas here. How about a practical example with real numbers? And why is 500 in particular bad?

Even though your math is strange, the idea that wasting instructions per pixel is bad is obvious. Good attempt at clarifying it, though. The thing is, that's exactly what I'm trying to deal with here. All of this is about lowering instructions per pixel.

You could, in theory, generate a texture from a bunch of math; these are often called look-up textures/tables. This is often done at the end of a development cycle to optimize the game, and it is done by programmers. It would be cool to have an option in Unreal Engine to do this for you (for example, you select a few nodes and generate a texture for them). This would make it possible for artists to make them. But it's not a good idea to just let the engine do this automatically, as a look-up texture will not be the same as the original nodes. This is because if you zoom in enough that the pixels of the generated texture are bigger than screen pixels, the values in between will be linearly interpolated, and this might not be the intent of the original material nodes.

But yeah, it could work if you let the end user tweak the resolution of the final texture. Just note that it's tricky to know when you need to do this kind of optimization; it all depends on the target hardware. If you have texture bandwidth to spare, then it's usually a good idea, but if you are already limited by texture bandwidth, then turning calculations into textures will probably hurt performance. Also, you would not be able to use all the nodes when doing this; nodes that depend on the current frame/view, like ddy/ddx, would not work, for example.
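A minimal sketch of the look-up-table idea in plain C++ (the baked function is a placeholder; a real bake would write to a texture, and the linear interpolation in `sampleLUT` mimics what a GPU sampler would do between texels):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Bake an expensive one-input function into a table once; the resolution
// parameter is exactly the knob discussed above.
std::vector<float> bakeLUT(int resolution) {
    std::vector<float> lut(resolution);
    for (int i = 0; i < resolution; ++i) {
        float u = float(i) / float(resolution - 1);
        lut[i] = std::sin(u * 20.0f) * std::cos(u * 7.0f); // placeholder math
    }
    return lut;
}

// Sample with linear interpolation, as a bilinear texture fetch would;
// between texels you get a straight line, not the original curve.
float sampleLUT(const std::vector<float>& lut, float u) {
    float f  = u * float(lut.size() - 1);
    int   i0 = static_cast<int>(f);
    int   i1 = std::min(i0 + 1, static_cast<int>(lut.size()) - 1);
    float t  = f - float(i0);
    return lut[i0] * (1.0f - t) + lut[i1] * t;
}
```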

I think what you want is image processing outside the shaders: since the data doesn't change and can just be cached forever, there's no point in using the GPU for that.

You're not thinking about what you're expecting the shader to do. If you want to arbitrarily scale two textures on top of each other, you can't just merge them into a single texture. If you blend a 512*512 texture over another 512*512 texture, and you scale one of them by 5 in UV space, the end result will tile every 2560 pixels. If you want the layers merged into a single texture, that's the lowest size at which you can do it. Now, you can't mipmap a texture that's not a power of two, and since no power of 2 is divisible by 10, you will not be able to mipmap that texture at all! Your final texture will be 26 MB and you will not be able to stream it.

That's why shaders perform texture UV calculations in realtime without compressing the result down into a single cached texture. Now, if you want to overlay textures in Photoshop and come up with a final result at 1k or 2k, that's fine; you can do whatever you want in Photoshop. As long as the end result is a power of 2, you'll always be fine doing that instead.
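A quick check of the power-of-two argument above:

```cpp
#include <cstdio>

// A power of two has exactly one bit set, so n & (n - 1) clears it to zero.
bool isPowerOfTwo(unsigned n) { return n != 0 && (n & (n - 1)) == 0; }

int main() {
    const unsigned period = 512 * 5; // smallest texture containing both tilings
    std::printf("period = %u, power of two: %s\n",
                period, isPowerOfTwo(period) ? "yes" : "no"); // 2560, "no"
}
```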

The cost of total pixel instructions is:

- Instructions for the shader, per pixel (includes the basic shader, lighting with the reflection environment and screen-space reflections, and Lightmass GI, all blended together)
- Re-evaluations of the shader per dynamic light influence, per pixel (with a dynamic sun, every pixel receiving light from the sun becomes twice as difficult to render)
- Translucency
- Post process, per pixel (bloom, any depth of field, any screen effects)

So say your material has 500 instructions for the basic shader, reflection environment, and Lightmass, and you're running the world's most powerful GPU: an NVIDIA GTX 980, capable of 78 billion pixel operations per second. That's not bad; 500 seems like nothing. But then you decide to have a stationary sun so you can get proper specular highlights in your level, so this bumps up to a cost of 1000 instructions per pixel. Now, at 1080p you have 1920x1080 pixels, or 2,073,600, so you have 2,073,600,000 instructions to render one frame. 2 billion for a frame, but 78 billion on the card… per second. If you have translucent objects, you need to add the cost of those pixels on top of the cost of the surface underneath, so in the translucent pass let's say you have some smoke effects taking up 25% of the screen, and the smoke costs 140 instructions per pixel to render. This is an extra 72,576,000 instructions, for a total of 2,146,176,000. Now you need post processing. I don't know how much bloom costs, but I do know it typically runs at 1/4 resolution and it's not free. For the hell of it, let's say this adds an extra 50 instructions per pixel in total. Now we have a grand total of 2,249,856,000 instructions per frame.

If you want this to run at 60 frames per second, you'll need a graphics card with a fillrate of 134,991,360,000 pixels per second, or about 135 Gp/s. But the GTX 980, currently the world's most powerful GPU (about $550), only has a peak pixel fillrate of 78 Gp/s. Theoretically, a graphics card that powerful would only be able to run a scene like this at around 35 frames per second, if the pixel fillrate is the only bottleneck to speak of. So if you insist on using 500 instructions for a wall, this is a more realistic depiction of how difficult it would be for the world's most powerful GPU to render that wall if a player decided to just stare at it.
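Reproducing that budget in C++ (all the instruction counts are the illustrative numbers from the post, not measured values):

```cpp
#include <cstdio>

int main() {
    const double pixels      = 1920.0 * 1080.0;  // 2,073,600 pixels at 1080p
    const double baseCost    = 1000.0;           // 500 base shader + stationary sun
    const double smokePixels = pixels * 0.25;    // smoke covers 25% of the screen
    const double smokeCost   = 140.0;            // instructions per smoke pixel
    const double postCost    = 50.0;             // post processing guess, per pixel

    const double perFrame = pixels * baseCost
                          + smokePixels * smokeCost
                          + pixels * postCost;   // 2,249,856,000
    const double perSecond = perFrame * 60.0;    // 134,991,360,000 at 60 fps

    const double gtx980Fillrate = 78e9;          // ~78 Gp/s peak
    std::printf("per frame: %.0f\nper second: %.0f\nmax fps on a GTX 980: %.1f\n",
                perFrame, perSecond, gtx980Fillrate / perFrame); // ~34.7 fps
}
```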

Now consider that games like Super Smash Bros can run 60 frames per second on the Wii U with a fillrate of only 4.4 billion pixels/second, or 5.6% the fillrate of the GTX 980. And they’re rendering a lot more than just a wall with some smoke.

The idea of deriving many constants from a single texture input, and/or generating several textures using math instructions only once/when needed rather than every frame, to me sounds like Allegorithmic’s Substance plugin might help solve at least a portion of that problem, though I’m not that familiar with the specifics of the UE4 integration. https://www.allegorithmic.com/substance-ue4

Yeah, I'm not entirely sure how Substance works. At first I thought it just used procedurals to generate textures, but it goes a bit further than that, because you can adjust values during gametime. I'm not sure if it's compatible with Lightmass. But it does take quite a few milliseconds to generate all the textures necessary for a decent Substance material: if it were rendering the procedural every frame, you'd be lucky to have a framerate higher than 1 per second.

Is this information accurate? The community ocean project on the forums has an ocean material of between 500 and 600 instructions. I never noticed any kind of performance hit from it other than the tessellation.

Yes, it's accurate. What kind of graphics card and resolution are you running? With 500-600 instructions on a shader at 1080p with dynamic lighting and GI, you'd definitely need a high-end card to run it properly. In a tiny little window, sure, pretty much anything "can" run it because there are so few pixels being rendered, and the only limitations you'd really have when rendering tiny viewport windows have to do with polygons. At that scale, polygons cost more than pixels. You can still have millions of polygons rendering in a tiny space, so tessellation does have a hard ceiling on each graphics card. Maybe I obsess over these things because I operate on a GT 640, but optimization should be important to ensure whatever material you're making will function properly on lower-spec PCs and consoles. If you have a GTX 980, sure, go crazy. Unfortunately, most people do not.

For people finding this thread after UE 4.13:

If I understood correctly, the OP wanted a way to calculate some outputs from his textures that wouldn't change over time, and to cache them. You can do that now using Render Texture nodes.
There’s an example in Getting the Most Out of Noise in UE4 - Unreal Engine that uses this technique to calculate a complex noise material only once.
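In C++ the same one-time bake looks roughly like this (a sketch, assuming a material whose emissive output carries the expensive math; the function name and "BakeMat" are placeholders):

```cpp
#include "Kismet/KismetRenderingLibrary.h"
#include "Engine/TextureRenderTarget2D.h"
#include "Materials/MaterialInterface.h"

// Bake an expensive, frame-invariant material into a render target once.
UTextureRenderTarget2D* BakeOnce(UObject* WorldContextObject, UMaterialInterface* BakeMat)
{
    UTextureRenderTarget2D* RT =
        UKismetRenderingLibrary::CreateRenderTarget2D(WorldContextObject, 1024, 1024);
    // Draw the heavy material a single time; afterwards the runtime material
    // just samples RT as an ordinary texture parameter
    // (e.g. via UMaterialInstanceDynamic::SetTextureParameterValue).
    UKismetRenderingLibrary::DrawMaterialToRenderTarget(WorldContextObject, RT, BakeMat);
    return RT;
}
```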