In that case, something is likely wrong in my GPU implementation: changing the grid size from 256x256 to 32x32 produces the same issues I was seeing in my CPU implementation.
I had assumed that my GPU implementation was correct, but it appears that something is not working properly.