Usually there’s a dispersion factor like 0.99 or 0.999. Take this example on Shadertoy and look in Buffer B for the line: nu = 0.99*nu; I noticed your code is lacking that step.
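Here is a rough sketch of what I mean, in Python just for illustration. The names `step` and `run` are made up, the "simulation update" is a placeholder, and I'm assuming the factor is applied once per frame to the whole field, which is how I read the Shadertoy line:

```python
DAMPING = 0.99  # assumed value; 0.999 would drain the field more slowly


def step(field, damping=DAMPING):
    """Advance one frame, then scale every cell by the damping factor.

    The copy below is only a stand-in for the real solver update;
    the point is the multiplication that follows it.
    """
    updated = list(field)  # placeholder for the actual simulation update
    return [damping * value for value in updated]


def run(field, steps, damping=DAMPING):
    """Apply `step` repeatedly and return the final field."""
    for _ in range(steps):
        field = step(field, damping)
    return field
```

With nothing else feeding the field, each cell shrinks by a factor of damping**steps, so with 0.99 a value is down to roughly 37% of its starting size after 100 frames. Without the factor, whatever you inject just keeps accumulating.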
What sort of performance are you getting? I would still expect it to be faster to do this on the GPU and read the pixels back.