Then I read it wrong: that’s the sampling kernel’s size. A value of 3 gives you a 3x3 kernel to work with, whereas 1x1 just returns the same image, since you only sample the center pixel. So the 3 should be fine.
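To illustrate the 1x1 versus 3x3 point, here is a minimal sketch of a plain box blur in Python. It is not the engine's actual code; `box_blur`, the clamped edge handling, and the tiny test image are all assumptions made up for the example.

```python
def box_blur(image, size):
    """Average each pixel with its size x size neighbourhood.

    Hypothetical helper: assumes an odd `size` and clamps samples at the edges.
    """
    height, width = len(image), len(image[0])
    radius = size // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            total = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    # Clamp the sample position so it stays inside the image.
                    sy = min(max(y + dy, 0), height - 1)
                    sx = min(max(x + dx, 0), width - 1)
                    total += image[sy][sx]
            row.append(total / (size * size))
        result.append(row)
    return result


# A single bright pixel in the middle of a dark 3x3 image.
image = [
    [0.0, 0.0, 0.0],
    [0.0, 9.0, 0.0],
    [0.0, 0.0, 0.0],
]

same = box_blur(image, 1)     # 1x1: only the center is sampled, image unchanged
blurred = box_blur(image, 3)  # 3x3: the bright pixel is spread over its neighbours
```

With a size of 1 the loops visit only the pixel itself, so the output equals the input; with 3 each output pixel averages nine samples.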
Yes, it takes a little longer to read all this stuff. But either you get a 3x3 kernel as a minimum even if you requested 1 (a minimum lock), or it comes down to the way the GPU samples the pixels: since they come from cached values (instead of fetching the same pixel multiple times), the performance cost is significantly reduced as well.
Makes sense, it must be two separate blur passes then.
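For reference, this is roughly what two separate passes would look like, assuming a simple box filter (the real filter and edge handling may well differ). All function names here are hypothetical. A horizontal pass followed by a vertical pass gives the same result as the full 2D kernel, but with `2 * size` samples per pixel instead of `size * size`.

```python
def blur_rows(image, size):
    """One 1D pass: average `size` horizontal neighbours (edges clamped)."""
    radius = size // 2
    width = len(image[0])
    return [
        [
            sum(row[min(max(x + d, 0), width - 1)]
                for d in range(-radius, radius + 1)) / size
            for x in range(width)
        ]
        for row in image
    ]


def transpose(image):
    """Swap rows and columns so the row pass can be reused vertically."""
    return [list(column) for column in zip(*image)]


def separable_blur(image, size):
    """Horizontal pass, then vertical pass (done as a row pass on the transpose)."""
    horizontal = blur_rows(image, size)
    return transpose(blur_rows(transpose(horizontal), size))


def full_blur(image, size):
    """Reference single-pass 2D box blur with the same clamped edges."""
    height, width = len(image), len(image[0])
    radius = size // 2
    offsets = range(-radius, radius + 1)
    return [
        [
            sum(image[min(max(y + dy, 0), height - 1)][min(max(x + dx, 0), width - 1)]
                for dy in offsets for dx in offsets) / (size * size)
            for x in range(width)
        ]
        for y in range(height)
    ]


image = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 8.0, 7.0, 6.0],
]

two_pass = separable_blur(image, 3)
one_pass = full_blur(image, 3)

# Largest per-pixel difference between the two approaches (rounding noise only).
max_diff = max(
    abs(a - b)
    for row_a, row_b in zip(two_pass, one_pass)
    for a, b in zip(row_a, row_b)
)
```

For a 3x3 kernel that is 6 samples per pixel instead of 9, and the gap grows quickly with larger kernels.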
That’s probably where the blending happens, and doing it multiple times takes some time.
Maybe this? A third method could just be .25?
Yes indeed, since it would reduce the tonemapper's work as well.