Using groupshared variables to improve a compute shader

I am trying to figure out how to further improve a compute shader I have. My thread group size is 8, so my groupshared array is 8x8. Here is a code snippet:

As you can see, the idea is to access the texture only once per DispatchThreadID, which should save some performance (ignore the macros for the permutations). Then, after adding the proper memory barrier, I have my data ready to continue.

After the sync barrier I split my logic into two cases: when ThreadID.xy is on the border of the current group and when it is not. For a pixel that is not on the border the logic is easy: we recover the index and read from SharedColor. For the border case, I read somewhere that it is better to have a single case instead of trying to handle every different case with an if, due to GPU divergence, which I don't fully understand haha. Then, as you can see, if the pixel is on the border of the group, it samples its neighbors from the texture like a normal pixel shader would, I guess. It does improve things; here are my results in RenderDoc:

I sampled a whole bunch of frames and the results seem to be consistent: the gain is between 10% and 15%, so it is faster than accessing the 8 neighbors in a pixel shader fashion. But I wonder if I am doing something naive here. I was expecting a bigger improvement, but I don't know how to achieve it.
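To put rough numbers on the border case, here is a small Python sketch (not the shader itself) that counts texture fetches per 8x8 group. It assumes a 3x3 neighborhood, which is my reading of "the 8 neighbors", and that every border thread fetches all 8 neighbors from the texture. The names `is_border` and `count_fetches` are made up for this illustration.

```python
GROUP = 8       # threads per group side (from the post)
NEIGHBORS = 8   # 3x3 neighborhood minus the center (assumption)


def is_border(x, y, n=GROUP):
    """True when thread (x, y) sits on the edge of an n x n group."""
    return x == 0 or y == 0 or x == n - 1 or y == n - 1


def count_fetches(n=GROUP, neighbors=NEIGHBORS):
    """Return (border_threads, naive_fetches, groupshared_fetches) per group."""
    border = sum(is_border(x, y, n) for y in range(n) for x in range(n))
    total = n * n
    # Naive: every thread fetches its own texel plus all of its neighbors.
    naive = total * (1 + neighbors)
    # Groupshared: every thread fetches its own texel once; only border
    # threads go back to the texture for their neighbors.
    shared = total + border * neighbors
    return border, naive, shared


border, naive, shared = count_fetches()
print(f"border threads: {border} of {GROUP * GROUP}")
print(f"naive fetches per group: {naive}")
print(f"groupshared fetches per group: {shared}")
```

Under these assumptions 28 of the 64 threads are on the border, so the group goes from 576 texture fetches to 288, not to 64. Fetch count is not the same as time, since the texture cache probably absorbs many of the repeated reads already, so this only hints at why the gain might be smaller than I expected.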

If you guys know something, please let me know.