Understanding TSR RejectShading GPU cost

Hi, we’re taking a look at some optimizations for Steam Deck and noticed that even with sg.AntiAliasingQuality=1, TSR is quite expensive to run. Output from the `profilegpu` command:

```
0.8% 2.63ms TemporalSuperResolution(sg.AntiAliasingQuality=1) 640x360 -> 1280x720 5 dispatches
  0.0% 0.00ms   TSR ClearPrevTextures 640x360 1 dispatch 40x23 groups
  0.1% 0.20ms   TSR DilateVelocity(#0 MotionBlurDirections=0 OutputIsMoving) 640x360 1 dispatch 80x45 groups
  0.1% 0.26ms   TSR DecimateHistory(#7 ReprojectMoire ReprojectResurrection 16bit) 640x360 1 dispatch 80x45 groups
  0.4% 1.35ms   TSR RejectShading(#53 TileSize=26 PaddingCostMultiplier=1.5 WaveSize=32 VALU=16bit FlickeringFramePeriod=1.716346 Resurrection) 640x360 1 dispatch 25x14 groups
  0.3% 0.82ms   TSR UpdateHistory(#4 Quality=Low 16bit R11G11B10 SupportLensDistortion OutputMip1) 1280x720 1 dispatch 160x90 groups
```

We have the following cvars set in DefaultScalability.ini for AntiAliasingQuality@1 (the flickering cvar was changed by us for content reasons):

```
r.TSR.History.UpdateQuality=0
r.TSR.ShadingRejection.Flickering=1
r.TSR.RejectionAntiAliasingQuality=0
r.TSR.History.GrandReprojection=0
```

Obviously we could turn off AA entirely or fall back to the old TAA implementation to avoid this cost, but the results are not nearly as visually pleasing, even compared to a 50% TSR upscale, hence our desire to do some manual optimization on TSR if possible. I’m mainly trying to get a better sense of what in particular is limiting the GPU here and why RejectShading takes over 1.3ms at 640x360.

I did see a discussion in one of the NDA groups about potentially increasing the WaveSize to 64 to reduce register pressure, but I don’t know whether that’s still relevant in 5.5 or even applicable to the Steam Deck GPU at all. There doesn’t appear to be much guidance out there on profiling the Steam Deck GPU (beyond the usual `profilegpu` and an Unreal Insights trace), so any advice on that front would also be welcome.

Thanks.

Hi Andr3wV,

TSR’s RejectShading pass uses a lot of wave intrinsics and LDS to accelerate 3x3 convolutions (min3x3, max3x3, sum3x3), and it is quite VGPR bound. If increasing WaveSize to 64 helps reduce register pressure, feel free to adopt that after testing.
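If you want to A/B test this without a code change, a minimal sketch, assuming your 5.5 build exposes `r.TSR.WaveSize` (0 = automatic; the supported values depend on the RHI and GPU):

```ini
; Test override only -- set in DefaultEngine.ini or at the console,
; then re-run `profilegpu` to compare RejectShading timings.
[SystemSettings]
r.TSR.WaveSize=64   ; 0 = automatic (default); compare 32 vs 64 on the Deck
```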

I believe the anti-flickering logic you enabled for content reasons increases the reject shading time a lot; we turn it off at scalability low and medium for performance reasons. You could also try disabling `r.TSR.Resurrection` or lowering `r.TSR.History.UpdateQuality` if they do not introduce any noticeable artifacts. You can take a look at `BaseScalability.ini` to compare AntiAliasingQuality@0 and AntiAliasingQuality@1; a sketch of the suggested overrides follows below.
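For reference, a minimal sketch of what a cheaper AntiAliasingQuality@1 block in DefaultScalability.ini could look like with those suggestions applied (same cvars as discussed above; please verify the visual trade-offs on your content):

```ini
[AntiAliasingQuality@1]
r.TSR.History.UpdateQuality=0        ; you already have this at 0
r.TSR.RejectionAntiAliasingQuality=0 ; you already have this at 0
r.TSR.ShadingRejection.Flickering=0  ; the anti-flickering logic is the likely big cost
r.TSR.Resurrection=0                 ; skip history resurrection if the artifacts are acceptable
```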

Thanks,

Tiantian

If the rendering time still does not hit your target, you might want to use the secondary upscaler controlled by `r.SecondaryScreenPercentage.GameViewport`. Please see this document (https://dev.epicgames.com/documentation/en-us/unreal-engine/screen-percentage-with-temporal-upscale-in-unreal-engine#secondaryspatialupscale) for more detail. We used it to achieve 60fps on Xbox One for Fortnite.
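As an illustration only (the 83.33 value here is an example, not a recommendation), the secondary spatial upscale can be enabled project-wide like this:

```ini
; DefaultEngine.ini -- TSR outputs at 83.33% of the swap chain resolution,
; then a cheap spatial upscale fills the remaining pixels.
[SystemSettings]
r.SecondaryScreenPercentage.GameViewport=83.33
```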

Thanks, that does help give me some direction on what to investigate here. I did end up forcing the wave size to 64 and can definitely confirm that it performed worse than 32, so that likely doesn’t apply to the Steam Deck hardware.

Doing some A/B testing with r.TSR.Resurrection and r.TSR.ShadingRejection.Flickering, it seems that r.TSR.Resurrection is only responsible for ~0.1ms while r.TSR.ShadingRejection.Flickering is responsible for ~0.58ms (on Steam Deck). I’ll see if we’re fine with the visual trade-off of turning the anti-flickering logic back off since it costs so much, but that still leaves roughly 0.83ms for RejectShading with both disabled. Do you have a method of profiling that GPU cost at a lower level on the Steam Deck, or has that not received much attention yet?

Edit: we also already had r.TSR.History.UpdateQuality at 0

I have not used a Steam Deck myself, but I have asked around for suggestions on lower-level GPU profiling tools. Since it uses an AMD GPU, I assume you can use Radeon GPU Profiler to check instruction-level performance? I will let you know if I get more information.

One question: is async compute being used? By default, the first three passes should be on the async compute path, and only RejectShading and UpdateHistory are on the critical path.

Tiantian

Thanks, I’ll take a look at what’s involved in using the Radeon GPU Profiler on the Steam Deck.

Regarding async compute, I believe it should be enabled; we have r.TSR.AsyncCompute set to the default value of 2. Does that mean the timings reported by `profilegpu` are inaccurate in this case? If so, what’s the most accurate way to visualize the actual timings?

So, this is my first time using RGP and I may very well be doing something wrong or misinterpreting the data, but I did capture a trace of our game on Steam Deck and it seems… odd. The wavefront occupancy appears to be extremely sparse, and very little async compute occupancy shows up on the graph. For reference, I do have r.Shaders.ExtraData=1 and D3D12.EmitRgpFrameMarkers=1 set. The game is running through Proton (normally DX12 on Windows) using the current hotfix version at the time of posting.

Is there a way I can verify both that the trace is correct and that the TSR passes run async as expected?

[Image Removed]

ProfileGPU gives the total time with all rendering done in sync mode, so if the Steam Deck supports async compute, the actual rendering time of TSR should be lower; it should be able to hide the first 0.46ms before TSR RejectShading. In UE 5.6 you can use Unreal Insights to connect from the kit directly, e.g. with -tracehost=10.28.6.234 -trace=default. It should then show what is happening on both the graphics and async compute queues.

To have shaders output all debug info, you might need to set the following CVars:

r.Shaders.Symbols=1

r.Shaders.SymbolsInfo=1

r.Shaders.ExtraData=1

r.Shaders.Optimize=0

To see the full RDG events, you might also want to set r.RDG.Events=3. For newer engine versions, you might need to manually set the symbol path. A consolidated snippet is below.
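For convenience, here is one way those cvars could be grouped in ConsoleVariables.ini for a profiling-only build (same settings as above; remove them for shipping, since unoptimized shaders with symbols cost performance and disk space):

```ini
; ConsoleVariables.ini -- debug/profiling captures only
[Startup]
r.Shaders.Symbols=1       ; generate shader symbols
r.Shaders.SymbolsInfo=1   ; also write the symbols info
r.Shaders.ExtraData=1     ; keep shader names and extra debug data
r.Shaders.Optimize=0      ; unoptimized shaders map back to source more cleanly
r.RDG.Events=3            ; full RDG event names in GPU captures
```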

With async compute enabled, `stat gpu` should give you a better total for TSR’s rendering time.