Question Title

I deleted my question.

Are you reading the data back to RAM, and doing it multiple times per frame? PCIe latency is kind of the enemy here. Also, while the CPU is accessing the buffer to read it back, the GPU can't use that buffer, which stalls GPU execution. You could use a double buffer and flip-flop between the two, or, maybe better, have the GPU copy the result into a separate buffer for the readback. That way the GPU can compute into one buffer while the CPU copies the other one back, so the two processors don't clash or have to wait for each other to finish. You can't have both of them working on the same piece of memory at the same time.
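
To make the flip-flop idea concrete, here is a rough CPU-only sketch in Python. It does not use any real graphics API: `compute` is a hypothetical stand-in for the GPU pass, `read_back` is a hypothetical stand-in for the CPU copy, and joining the threads at the end of each frame plays the role of a fence. The only point is the indexing, where frame N writes buffer N % 2 while the CPU reads the buffer that frame N - 1 filled.

```python
import threading


def run_ping_pong(num_frames, buf_size):
    """Simulate double-buffered readback; returns one copied result per frame."""
    # Two buffers: one is written by the "GPU" while the other is read by the "CPU".
    buffers = [[0] * buf_size, [0] * buf_size]
    results = []

    def compute(frame, dst):
        # Hypothetical stand-in for the GPU pass: fill dst with this frame's data.
        for i in range(len(dst)):
            dst[i] = frame * 10 + i

    def read_back(src):
        # Hypothetical stand-in for the CPU readback: copy the finished buffer out.
        results.append(list(src))

    for frame in range(num_frames):
        write_buf = buffers[frame % 2]
        workers = [threading.Thread(target=compute, args=(frame, write_buf))]
        if frame > 0:
            # Read the buffer the previous frame filled, never the one being written.
            read_buf = buffers[(frame - 1) % 2]
            workers.append(threading.Thread(target=read_back, args=(read_buf,)))
        for w in workers:
            w.start()
        for w in workers:
            w.join()  # frame boundary: both sides are done before the buffers swap

    # Drain: the last frame's result has not been read back yet.
    read_back(buffers[(num_frames - 1) % 2])
    return results


if __name__ == "__main__":
    for frame_result in run_ping_pong(3, 4):
        print(frame_result)
```

The cost is that the CPU always sees results one frame late, which is usually an acceptable trade for not stalling. In a real renderer the join would be replaced by whatever fence or query mechanism your API provides.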