Big ComputeLightGrid compute shader spikes during resolution changes

The ComputeLightGrid shader(s) seem to get bogged down during resolution changes.

I’m using a patch from John Alcatraz that allows viewport resizing on SteamVR without reallocating the render target (similar to adaptive pixel density on Oculus), but I get big hitches with it whenever I change resolution.

The hitches show up as “other” in the GPU portion of the performance graph, which I’ve read can come from compute shaders.

What I’m doing in that test is increasing the resolution by around 0.1% each frame and then walking it back down, over and over.
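The test loop itself is trivial. Here’s a rough sketch of the idea, called from a Tick each frame; I’m using r.ScreenPercentage and a made-up helper name as stand-ins, since the actual patch drives the viewport size elsewhere:


// Hypothetical test helper: ping-pongs a resolution scale by ~0.1% per frame.
// r.ScreenPercentage is a stand-in for whatever the patch actually drives.
#include "HAL/IConsoleManager.h"

static void OscillateResolutionScale()
{
    static float CurrentScale = 100.0f;   // percent
    static float Direction = 1.0f;

    // Walk up and back down between 100% and 110% in 0.1% steps.
    CurrentScale += Direction * 0.1f;
    if (CurrentScale >= 110.0f || CurrentScale <= 100.0f)
    {
        Direction = -Direction;
    }

    if (IConsoleVariable* CVar =
            IConsoleManager::Get().FindConsoleVariable(TEXT("r.ScreenPercentage")))
    {
        CVar->Set(CurrentScale, ECVF_SetByCode);
    }
}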

I ran the profiler and tracked it down to STAT_ComputeLightGrid, which runs in FDeferredShadingSceneRenderer::ComputeLightGrid.

I looked around at the cvars and found r.Forward.LightGridPixelSize. When I bump that up to around 512 (it defaults to 64), all the hitches stop and the “other” part of the GPU performance graph no longer has any spikes. I can freely change resolution every frame with no hitch.

Any thoughts on changes I can make to avoid this while still keeping a usable light grid (at 512 pixels per cell there are only a few cells)? Is it just that the first initialization of the light grid is much more expensive than subsequent updates? Would it be possible to base the grid on a fixed number of divisions instead of a pixel size, so that the grid dimensions don’t change with resolution (even if rounding means slightly different cells end up overlapping)?

I suspect this affects Oculus adaptive pixel density as well, but I haven’t verified it there yet.

I’ve tried working around this by adjusting LightGridPixelSize to hold the overall grid dimensions roughly stable across resolution changes. Unfortunately it causes a lot of artifacts. After digging around a good bit I’ve found that LightGridPixelSize has to be a power of two or artifacts appear: the shaders all index into the grid through ComputeLightGridCellIndex, which uses a bit shift instead of a divide, so the cell size must be a power of two.
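Sketched out, the only artifact-free version of that workaround I can see ends up snapping the cell size to a power of two, which is exactly what makes it too coarse to hold the grid dimensions steady. Rough illustration (made-up helper name, not engine code):


// Sketch of the workaround: pick a cell size that keeps the grid at (or under)
// a target number of X divisions, then round it up to a power of two because
// ComputeLightGridCellIndex indexes with a shift.
#include <cstdint>

static uint32_t ChooseLightGridPixelSize(uint32_t ViewSizeX, uint32_t TargetDivisionsX)
{
    // Smallest cell size that yields no more than TargetDivisionsX cells.
    uint32_t PixelSize = (ViewSizeX + TargetDivisionsX - 1) / TargetDivisionsX;

    // Round up to the next power of two. This is where the coarseness comes from:
    // crossing a power-of-two boundary doubles the cell size, and everything in
    // between leaves the cell count drifting with resolution anyway.
    uint32_t Rounded = 1;
    while (Rounded < PixelSize)
    {
        Rounded <<= 1;
    }
    return Rounded;
}

The result would then be fed into r.Forward.LightGridPixelSize whenever the resolution changes, but with only power-of-two values available it never actually keeps the cell count constant.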

From what I can tell, a floating point multiply by an inverse factor (passed in instead of the shift width) could be used instead, and it should actually be about twice as fast as a 32-bit integer shift (at least on NVIDIA):

http://docs.nvidia.com/cuda/cuda-c-p…c-instructions

A float’s 24-bit mantissa is enough for very large resolutions, but there would probably be conversion and rounding costs.

This is where the indexes are calculated:


uint ComputeLightGridCellIndex(uint2 PixelPos, float SceneDepth, uint EyeIndex)
{
    const LightGridData GridData = GetLightGridData(EyeIndex);
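    // Exponential depth slicing: LightGridZParams maps SceneDepth to a Z slice index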
    uint ZSlice = (uint)(max(0, log2(SceneDepth * GridData.LightGridZParams.x + GridData.LightGridZParams.y) * GridData.LightGridZParams.z));
    ZSlice = min(ZSlice, (uint)(GridData.CulledGridSize.z - 1));
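    // XY cell comes from shifting the pixel position right by LightGridPixelSizeShift,
    // which is why the cell size has to be a power of two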
    uint3 GridCoordinate = uint3(PixelPos >> GridData.LightGridPixelSizeShift, ZSlice);
    uint GridIndex = (GridCoordinate.z * GridData.CulledGridSize.y + GridCoordinate.y) * GridData.CulledGridSize.x + GridCoordinate.x;
    return GridIndex;
}

This line:


    uint3 GridCoordinate = uint3(PixelPos >> GridData.LightGridPixelSizeShift, ZSlice);

Would become something like:


    uint3 GridCoordinate = uint3(PixelPos * GridData.LightGridPixelSizeIndexFactor, ZSlice);

Will that get the rounding right? How expensive are the integer/float conversions? Are there other reasons for the power-of-two requirement (GPU warps, etc.)?
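To partially answer the rounding question for myself, here’s a quick standalone brute-force check (plain C++, not engine code) that compares the reciprocal-multiply index against the exact integer divide for every pixel position up to a generous resolution:


// Quick standalone check: does truncating Px * (1/CellSize) in fp32 match the
// exact integer divide Px / CellSize for every pixel position we care about?
#include <cstdint>
#include <cstdio>

int main()
{
    const uint32_t MaxPixelPos = 16384;   // generous upper bound on resolution

    for (uint32_t CellSize = 1; CellSize <= 512; ++CellSize)
    {
        const float InvCellSize = 1.0f / static_cast<float>(CellSize);
        uint32_t Mismatches = 0;

        for (uint32_t Px = 0; Px < MaxPixelPos; ++Px)
        {
            // What the proposed shader change would compute (the float-to-uint
            // conversion truncates, which matches floor for non-negative values)...
            const uint32_t MulIndex =
                static_cast<uint32_t>(static_cast<float>(Px) * InvCellSize);
            // ...versus the exact cell index.
            const uint32_t DivIndex = Px / CellSize;

            if (MulIndex != DivIndex)
            {
                ++Mismatches;
            }
        }

        if (Mismatches > 0)
        {
            printf("CellSize %u: %u mismatched pixel positions\n", CellSize, Mismatches);
        }
    }
    return 0;
}

The power-of-two sizes at least should come back clean, since their reciprocals are exact in floating point; it’s the non-power-of-two sizes I’m unsure about.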

(edit: from the same programming guide, type conversions run at about half the throughput of 32-bit shifts and two of them would be needed (uint to float and back); the multiply itself has twice the throughput of a shift, so relative to a single shift the whole replacement may cost roughly 2 + 2 + 0.5 ≈ 4.5x as much to run.)