Reduce bloom quality to a single gaussian blur?

The doc about bloom says:

I would like to reduce the number of Gaussian blurs to 1 to save performance. How do I do that? I did not find any settings that affect the performance of the bloom. All settings in the PP volume seem to be purely cosmetic, with no impact on performance (apart from setting Intensity to 0). Then there is r.BloomQuality: 0 disables bloom and 1 is the default. And r.Bloom.Cross, whose default is 0; setting it to 1 makes the bloom ugly, but performance is not affected.

So I think there should be some way to reduce the number of blurs that are done to render the bloom in order to improve performance, but how?

The reason I’m asking is that in VR, bloom is definitely very expensive, and not having any bloom at all isn’t really a solution either.

Profiling with bloom enabled:


Profiling with bloom disabled:

http://puu.sh/rQlUp/a70722839c.png

So bloom is currently taking 0.62 ms, which is too much in VR. I’m testing with 4.14 now and unfortunately bloom performance has not improved in recent engine versions.

+1, I’m quite interested in this too

According to the source, quality 1 should result in only 3 stages, and at quality 5 you get all 6 that are showing in your profiler screenshots.

These are hardcoded values unfortunately, so optimizing this area further will require modifying the engine code, but the improvement sounds reasonable.
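From memory, the relevant part of PostProcessing.cpp looks roughly like this (a sketch only; the exact preset values may differ between engine versions, so double check your source):

// Sketch from memory, not verbatim engine code: number of bloom stages
// used for each r.BloomQuality value from 1 to 5.
const static uint32 BloomQualityStages[] = { 3, 3, 4, 5, 6 };
const uint32 BloomStageCount = BloomQualityStages[BloomQuality - 1];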

There is, however, a possible issue: the result of just 1 pass may not give you the expected quality. A very clever reimplementation of a similar bloom idea also ended up using multiple passes (5, if I’m counting right) to get a nice bloom effect, but that obviously shouldn’t keep you from experimenting with it.

Ah, thanks very much for linking that source file! That number can easily be modified in the source; I have no problem with modifying engine source, so that’s fine with me.


So I did that and compiled. I definitely see the quality difference between r.BloomQuality 1 and r.BloomQuality 2 now, and the quality with 1 stage is still relatively acceptable. It’s blocky, but it’s still way better than no bloom at all, so that’s what I wanted.

Unfortunately, it didn’t help with performance at all though. Changing the number of stages to 1 only affects the “PostProcessWeightedSampleSum” entries in the profiler, and those are very cheap, so I saw no relevant performance difference between 1 stage and 3 stages. That’s probably why Epic didn’t make it possible to set it to 1 from the editor; it just makes no sense, since 1 stage is about as expensive as 3. (@Konflict: for every blur stage there are 2 “PostProcessWeightedSampleSum” entries in the profiler, so in the screenshots of my first post the number of stages was 3, not 6.)
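If I understand it right, each stage is a separable Gaussian, so it runs one horizontal and one vertical 1D pass. Here is a standalone CPU illustration of that idea (my own toy code; the function names are made up and have nothing to do with the engine, and Weights is assumed to be an odd-length, normalized kernel):

// Toy illustration of a separable Gaussian blur - not engine code. One bloom
// stage is two 1D weighted-sample-sum passes (horizontal, then vertical),
// which is why each stage shows up as two profiler entries.
#include <algorithm>
#include <vector>

// One 1D pass along direction (DX, DY), clamping samples at the image border.
std::vector<float> WeightedSampleSum1D(const std::vector<float>& In, int W, int H,
                                       int DX, int DY, const std::vector<float>& Weights)
{
	std::vector<float> Out(In.size(), 0.0f);
	const int R = static_cast<int>(Weights.size()) / 2;
	for (int Y = 0; Y < H; ++Y)
	{
		for (int X = 0; X < W; ++X)
		{
			float Sum = 0.0f;
			for (int I = -R; I <= R; ++I)
			{
				const int SX = std::clamp(X + I * DX, 0, W - 1);
				const int SY = std::clamp(Y + I * DY, 0, H - 1);
				Sum += Weights[I + R] * In[SY * W + SX];
			}
			Out[Y * W + X] = Sum;
		}
	}
	return Out;
}

// One full bloom blur stage: horizontal pass followed by vertical pass.
std::vector<float> GaussianBlurStage(const std::vector<float>& In, int W, int H,
                                     const std::vector<float>& Weights)
{
	std::vector<float> Blurred = WeightedSampleSum1D(In, W, H, 1, 0, Weights); // horizontal
	return WeightedSampleSum1D(Blurred, W, H, 0, 1, Weights);                  // vertical
}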

But why are the entries that are responsible for the majority of the bloom’s cost not affected by this? The most relevant ones are:

“Downsample 756x840”
“PostProcessBloomSetup 756x840”
“Downsample 378x420”

Those are responsible for the majority of the bloom’s cost and seem not to be affected at all by changing the number of stages. Also, the tonemapper seems to be quite a bit more expensive when bloom is enabled.

756x840 is half of the per-eye resolution, so that pass seems to be hardcoded to screen size * 0.5, but why? Would it be possible to change that to 0.25 or 0.125 of the screen resolution? That would make a much bigger difference to performance, it seems, than changing the number of blur stages.
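Just doing the math on those numbers (assuming a per-eye resolution of 1512x1680, i.e. double the 756x840 above):

1512 x 1680 = 2,540,160 pixels (full per-eye resolution)
756 x 840 = 635,040 pixels (1/2 resolution, what the bloom chain starts from now)
378 x 420 = 158,760 pixels (1/4 resolution)
189 x 210 = 39,690 pixels (1/8 resolution)

So starting the chain one step smaller would cut the work of those expensive setup/downsample passes to roughly a quarter.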

Where could I change that number in the source? I looked around the source in AddBloom() and the functions below that are called, but I couldn’t find it there. Any ideas?

Then I read it wrong; that’s the sampling kernel’s size, where 3 gives you a 3x3 kernel to work with, but doing only 1x1 means you get the same image back, since you only sample the center (with normalized weights, a single center tap has weight 1.0, so the pass just copies the image). So the 3 should be fine.

Yes, it takes a little longer to read through all this stuff, but either you get a 3x3 kernel as a minimum regardless of having requested 1 (a minimum lock), or it is down to the way the GPU samples the pixels: since they come from cached values (instead of sampling the same pixel multiple times), the performance cost also gets significantly reduced.

Makes sense, it must be two separate blur passes then.

Probably that’s where the blending happens, and it takes some time to do it multiple times.

Maybe this? A third method could just be .25?

Yes indeed, since it would reduce the tonemapper’s work as well.

I think that PostProcessDownsample is a general thing for all PP effects and not related only to bloom, right? Only the bloom should be calculated at a lower resolution.

It should be, but it might not be the right spot. How about this and the next line? Both seem to have an effect on the rendertarget size, to reach a lower resolution. But as you can see, there’s no simple way to just set the scale here; it is pretty much designed to always generate halved sizes. It’s a logical approach, though not optimal for VR requirements. I’d try to extend this class to add more configurable options, so the bloom could optionally request the very small maps. A size of .25 or .33 should be fine.
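Something along these lines is what I have in mind (purely a hypothetical sketch, not the real class layout; the point is just to make the divisor a parameter instead of a hardcoded 2):

// Hypothetical sketch only - not the actual engine class. The idea is that the
// downsample pass takes a divisor, and only the bloom setup passes 4.
class FRCPassPostProcessDownsample /* : existing pass base class */
{
public:
	explicit FRCPassPostProcessDownsample(/* existing arguments, */ int32 InDivisor = 2)
		: Divisor(InDivisor)
	{}

	// Process() and ComputeOutputDesc() would then divide by Divisor instead of
	// the hardcoded 2, so every other user of the pass keeps the current behaviour.

private:
	int32 Divisor;
};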

It’s also worth pointing out that the comments in the bloom code mention that eye adaptation also uses these maps to measure luminance, so I’d advise checking that feature regularly to see if it’s still working as it should.

Edit:
Try r.UseMobileBloom 1; that’s a different bloom and might have a lower cost. But as the name suggests, it was not designed for desktops.

Thanks, I’ve tried dividing by 4 there instead of 2, but unfortunately that makes the bloom appear in only a quarter of the screen, so the lower-res image is not scaled up to the full screen but just shown in the top left. So there are probably some more places where something would need to be changed to make it work… :frowning:

Eye adaptation doesn’t really make much sense in VR anyway, so that would not be an issue.

That’s very interesting, thanks! I didn’t know about that console variable. It works on desktop, but unfortunately the cost seems to be the same as, or even higher than, the regular bloom… :frowning:

The downsample pass should be followed by an upsample pass which brings the result back up to a higher resolution. The upscaling pass therefore must be changed as well before the blending happens. But as you can see, this part of the code is very rigid and designed to always trigger all downsample and upsample passes.

Thanks, I tried multiplying the UpScale variable by 2, but that didn’t change anything unfortunately. I don’t believe that with this kind of trial and error approach I will get to where I want to be…

I wish someone from Epic who knows how that bloom stuff works could just quickly comment here and tell us the easiest way to modify this.

I’m terribly sorry to hear that, since it is one of the greatest adventures that can happen while learning something new. I’m sure you did not mean that trial and error never helped you figure something out!

I’d also like to hear from the corporate programmers how to resolve this individual customer issue. I also find it a little bit odd that while the engine has this frontend design suggesting it will be easy to modify and customize without serious C++ knowledge, here is the bloom, probably one of the simplest effects you can get in 3D, and yet you face a rigid hardcoded pipeline and are left helpless to find a damn value to set the number of passes.

It’s doable anyway, but I don’t like the way it has to be done.

You don’t need to be sorry, trial and error definitely has often helped me, but trying to do something like this (messing around with the UE4 renderer) without having a clue what happens there just doesn’t seem to be too successful…

You have removed some hardcoded passes? Is it just one pass on 1/4 resolution now?

Yes, I’m afraid that’s the only way I have found so far. Removing the unnecessary downsample passes should help reduce the cost of the blending (even if you manage to omit the draw on them, the empty RTs would remain queued for blending at the last stage, so they had to go), and the downsample pass now does a 1/4 instead of a 1/2 operation. You were actually pretty close to getting the downsample right, but you did not align the Extent property to produce a 1/4 instead of 1/2, which is why the bloom appeared in the top-left quarter.
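To put numbers on that (using the per-eye resolution from earlier in the thread, assuming I read the code right): Process() with a divide-by-4 produces a 1512x1680 / 4 = 378x420 destination rect, but ComputeOutputDesc() still allocates a 1512x1680 / 2 = 756x840 render target, so the 378x420 result only covers the top-left quarter of that target. Dividing the Extent by 4 as well keeps the two in sync.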

There apparently were no consequences of forcing the downsample pass to do the 1/4 scaling, which is why I believe this is actually doable. I was, however, hoping to find a way to just kick the current bloom out of the postprocess graph, then fill in a low-cost bloom solution which would do the downsampling on its own.

What do you really need? I mean, I hardly believe that a nice Epic person will just hop in here one day and write exhaustive documentation for these postprocess classes. So it seems to me that they just gave us this great engine, and you either figure it out by yourself or you’re already done with it.

Anyway, PP is an interesting part of the engine that’s worth looking into more, and I will do just that :slight_smile:

Bumping the thread! I’m also interested in performance optimizations for the bloom effect. Konflict, would it be possible for you to recap what exactly needs to be changed in the source so that only one downsample is used at 1/4 the resolution?

Hey, @Norman3D.

Everything we have discussed here should be enough to figure this out, but for convenience I have just done the modification in 4.14.3, and here are the most significant changes that have to be made to accomplish the result:

Remove the additional stages, then restrict the BloomStageCount to 1 only



FBloomStage BloomStages[] =
		{
			/*{ Settings.Bloom6Size, &Settings.Bloom6Tint },
			{ Settings.Bloom5Size, &Settings.Bloom5Tint },
			{ Settings.Bloom4Size, &Settings.Bloom4Tint },
			{ Settings.Bloom3Size, &Settings.Bloom3Tint },
			{ Settings.Bloom2Size, &Settings.Bloom2Tint },*/
			{ Settings.Bloom1Size, &Settings.Bloom1Tint },
		};

const uint32 BloomStageCount = 1; // BloomQualityStages[BloomQuality - 1];


By restricting the bloom stage count you will no longer be able to make any changes with r.BloomQuality, but I think you can avoid confusion that way. The other way to set this up would be to change the preset value in BloomQualityStages[]; it’s up to you.
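If you would rather keep r.BloomQuality working, the alternative would look something like this (the values below are only an example; check what the array actually contains in your engine version, and verify which of the BloomStages entries the single stage ends up using):

// Example only: make r.BloomQuality 1 request a single stage, while the
// higher quality levels keep their usual stage counts.
const static uint32 BloomQualityStages[] = { 1, 3, 4, 5, 6 };
const uint32 BloomStageCount = BloomQualityStages[BloomQuality - 1];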

Second, the downsample sizing is here; you can adjust the downsample to be a quarter size:



void FRCPassPostProcessDownsample::Process(FRenderingCompositePassContext& Context)
{
	// ...
	FIntRect DestRect = FIntRect::DivideAndRoundUp(SrcRect, 4);
	SrcRect = DestRect * 4;


At the bottom of the code you will find this; change it to 4 and you are done:


	
FPooledRenderTargetDesc FRCPassPostProcessDownsample::ComputeOutputDesc(EPassOutputId InPassOutputId) const
{
	// ...
	Ret.Extent = FIntPoint::DivideAndRoundUp(Ret.Extent, 4);



The result is a basic bloom, but it does the job. If you wish to have a larger halo, just change it to 8 (an additional halving) in the code, then in the editor set the size scale to 8 in the postprocess settings.

Be aware that the changes in the downsample code are not restricted to the bloom only, so if you have any custom effect that depends on this downsample pass there will be complications. I believe the official engine version doesn’t use this pass for anything other than the bloom, but this might change later.
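If you want to be able to switch back easily, one option (just a sketch; the console variable name below is made up, it does not exist in the engine) is to put the divisor behind a cvar:

// Hypothetical console variable - the name is invented for this sketch.
static TAutoConsoleVariable<int32> CVarBloomDownsampleDivisor(
	TEXT("r.Bloom.DownsampleDivisor"),
	2,
	TEXT("Divisor used by the post process downsample pass (2 = engine default, 4 = quarter size)."),
	ECVF_RenderThreadSafe);

// Then inside FRCPassPostProcessDownsample::Process():
const int32 Divisor = CVarBloomDownsampleDivisor.GetValueOnRenderThread();
FIntRect DestRect = FIntRect::DivideAndRoundUp(SrcRect, Divisor);
SrcRect = DestRect * Divisor;

// ComputeOutputDesc() needs the same divisor so the render target size stays in sync:
// Ret.Extent = FIntPoint::DivideAndRoundUp(Ret.Extent, Divisor);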

Thank you so much! You saved me a couple of hours/days figuring this out. I’ll give this a try in GearVR!

Thanks very much Konflict!

It works great :slight_smile: I have tested it in 4.15 Preview 1 now.

Thanks for the update. I’m happy to see it works in the next release, and I am also a bit sad to realize they didn’t change anything in the postprocess so far, which means it is still very hard to make adjustments and optimizations in the current pipeline the engine has.

One more important thing:

It still needs to be resolved, and I just realized that I missed this important part. I’ll look into it when I have the time, to see how to reduce the cost of the blending, which doesn’t seem too apparent in the GPU profiling, but as far as I remember it still happens in the background. The GPU might be smart enough to realize there’s no data to be blended and skip these jobs every frame, but I prefer to see this for myself.

This you change:



{ NULL, TEXT("BloomDownsample1") /*, TEXT("BloomDownsample2"), TEXT("BloomDownsample3"), TEXT("BloomDownsample4"), TEXT("BloomDownsample5")*/ };


By commenting out the downsample passes after the 1st iteration.
That you modify:



typedef TBloomDownSampleArray<2/*DownSampleStages*/>   FBloomDownSampleArray;  


Yes, that is the number 2, which means we have 2 rendertargets in existence. Not sure if we require both, but maybe the reason for this lies in the way they implemented the blending of the passes, so it will always require one normal-sized RT to exist which is used to blend the final results.

As for mobiles and further reading, here is one very important part of the code which seems to implement a custom rendering pass for the whole postprocess chain. This pipeline might only be executed on certain hardware, which may or may not include mobile rendering as well. I can’t answer this just yet, but maybe you can test this to see if mobiles require this graph to be changed instead of the regular paths I just described here.

Thanks! I have changed that now too. I don’t see a difference in the profiler, so the GPU was probably smart enough not to do anything there. It definitely doesn’t hurt, though, to not have that unneeded blending there :slight_smile: