Article written by Matt O.
In this article, we’ll cover at a high level what shader permutations are, how they happen, the performance issues they create both at runtime and at compile time, and potential remedies.
Covere…
https://dev.epicgames.com/community/learning/knowledge-base/6PxP/unreal-engine-understanding-shader-permutations
Really a useful article!
But I’m still confused about using the Static Switch Parameter node. The documentation says: “Try to minimize the number of static parameters in the material and the number of permutations of those static parameters that are actually used.” (Parameter Expressions | Unreal Engine Documentation)
But it also says: “Static Switches can be used to remove an entire branch of a material with no runtime cost. Instances can have different values, making it possible to have a templated shader setup with no performance loss.”
Could you add more detail about the static switch parameter? I don’t know how to handle this when I’m writing a material template. Thanks!
I can have a look around and see if I can get you a better answer or if we need to elaborate on this in this article.
Wow, another article worth setting a bookmark.
For those of you who want an answer to this question: as the article points out, static switch parameters generate shader permutations, because each distinct combination of switch values used across material instances requires its own shader bytecode. Those shaders are compiled at cook time, so the compilation itself has no runtime performance cost, but cooking (and building) takes longer and the shader code takes more memory. This is why you should “try to minimize the number of static switch parameters”, to keep cook time and memory usage under control, while at the same time you can “have a templated shader setup with no performance loss”, since that compilation work is not done at runtime.
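To make the counting argument concrete, here is a minimal sketch in plain Python (not Unreal API). The switch names and material instances are made up for illustration, and the model ignores everything other than boolean static switches. It shows the gap between the theoretical upper bound of 2^n combinations and the combinations that instances actually use, which is the number the documentation asks you to keep small.

```python
# Hypothetical static switch names and material instances, for illustration only.
SWITCHES = ("UseDetailNormal", "UseEmissive", "UseParallax")

instances = {
    "MI_Rock":  {"UseDetailNormal": True, "UseEmissive": False, "UseParallax": False},
    "MI_Lava":  {"UseDetailNormal": True, "UseEmissive": True,  "UseParallax": False},
    "MI_Cliff": {"UseDetailNormal": True, "UseEmissive": False, "UseParallax": False},
}


def max_permutations(switch_count: int) -> int:
    """Upper bound: every boolean static switch doubles the possible combinations."""
    return 2 ** switch_count


def used_permutations(instances, switches):
    """Distinct switch-value combinations actually used by the instances."""
    return {tuple(values[name] for name in switches) for values in instances.values()}


upper_bound = max_permutations(len(SWITCHES))
used = used_permutations(instances, SWITCHES)
print(f"possible: {upper_bound}, actually used: {len(used)}")
```

In this toy data, MI_Rock and MI_Cliff share the same switch values, so they map to the same combination and do not add a second one.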
“graphics hardware’s time is spent compiling”
Compilation happens on the CPU. The GPU cannot compile shaders, which is why shader permutations are needed.
The “Run-time” section simply does not describe the real reason why “an increased number of permutations can result in performance issues” at run-time.
Actually, permutations produce simpler, non-divergent after-permutation-shaders that can run faster, thanks to lower register usage, higher wave occupancy, and no need to execute both branches. This is why permutations exist.
“GPU likes to do a lot of the same thing”: this really means the same thing within a wave (a draw using one after-permutation-shader can produce many waves), not across the whole GPU (which can run many draws with different after-permutation-shaders in parallel).
“Modern GPUs can evaluate a certain number of pixels in parallel, as long as they are using the exact same bytecode.” In fact, GPUs can render many pixels in parallel across many draws and after-permutation-shaders (with different bytecode), on different execution units/compute units/SIMDs.
“So if a 64-pixel area all uses the same bytecode, great!” The same bytecode can still diverge within a wave. A 64-pixel area can form a wave: when a GPU renders a draw (with a single after-permutation-shader, of course), it rasterizes the geometry into many waves, and all of those waves naturally use the same after-permutation-shader/bytecode.
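The point about divergence inside a wave can be sketched with a toy cost model in Python. Everything here is an assumption for illustration: the wave width of 64, the arbitrary cost units, and the simplification that a wave pays the full cost of each branch side taken by at least one lane. It ignores register pressure and occupancy.

```python
WAVE_SIZE = 64  # assumed wave width; real hardware varies


def dynamic_branch_cost(lane_conditions, cost_true, cost_false):
    """Cost of one wave running a shader that branches at runtime.

    Lanes in a wave run in lockstep, so the wave pays for every side of
    the branch that at least one lane takes.
    """
    cost = 0
    if any(lane_conditions):
        cost += cost_true
    if not all(lane_conditions):
        cost += cost_false
    return cost


def static_permutation_cost(condition, cost_true, cost_false):
    """Cost of one wave running a permutation whose branch was resolved at compile time."""
    return cost_true if condition else cost_false


COST_TRUE, COST_FALSE = 10, 4  # arbitrary units

uniform_wave = [True] * WAVE_SIZE
divergent_wave = [True] * (WAVE_SIZE // 2) + [False] * (WAVE_SIZE // 2)

uniform_cost = dynamic_branch_cost(uniform_wave, COST_TRUE, COST_FALSE)
divergent_cost = dynamic_branch_cost(divergent_wave, COST_TRUE, COST_FALSE)
permutation_cost = static_permutation_cost(True, COST_TRUE, COST_FALSE)
print(uniform_cost, divergent_cost, permutation_cost)
```

In this model the same bytecode costs more only when lanes of one wave disagree on the condition. A permutation never pays for the side that was compiled out.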
“different permutations of a shader result in different bytecode, so while it may appear that all of your materials are the same because they inherit from the same monolithic base material, each usage and switch combination of that material is different as far as the GPU is concerned”. Yes, but those different bytecodes/after-permutation-shaders/PSOs result in different draw calls that run in parallel/concurrently on different execution units/compute units/SIMDs/SIMD slots of the GPU. Different after-permutation-shaders are never packed into a single wave.
The reason why too many PSO switches lower performance is described in the References link: The Shader Permutation Problem - Part 1: How Did We Get Here? (therealmjp.github.io); search for “switch”. While switching PSOs may have both CPU and GPU costs, PSOs are a main reason why modern APIs set render state faster on the CPU, and switching PSOs is already much faster in modern APIs.
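As a rough illustration of what “too many PSO switches” means, the Python sketch below counts how often the bound PSO changes across a list of draws. The PSO names and draw order are hypothetical, and grouping draws by PSO is used here only to show that the same set of draws can need fewer switches. It is not a description of how any particular renderer orders its draws.

```python
def count_pso_switches(draw_psos):
    """Count how many times the bound PSO changes, including the first bind."""
    switches = 0
    previous = None
    for pso in draw_psos:
        if pso != previous:
            switches += 1
            previous = pso
    return switches


# Hypothetical frame: each entry is the PSO used by one draw call.
draws = ["PSO_A", "PSO_B", "PSO_A", "PSO_C", "PSO_B", "PSO_A"]

unsorted_switches = count_pso_switches(draws)
grouped_switches = count_pso_switches(sorted(draws))
print(unsorted_switches, grouped_switches)
```

Only permutations that are actually drawn in a frame contribute to this count, however many were generated at build time.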
And on AMD, you may hit the context-roll limitation of at most 7 different states at the same time.
Anyway, it depends on whether your game actually uses those permutations for rendering, which is a different matter from how many permutations are generated at build time.
The “Run-time” section, by contrast, says it could be slow because “many permutations are in the same 64-pixel wave”.