It looks really awesome! Could you show what it looks like at a fairly low resolution, so that the performance impact is only around 3 ms? Does it scale linearly with the particle count, i.e. will 100 times fewer particles run 100 times faster? In games you often don’t need super high accuracy.
Why do you mention the CPU here? Isn’t the fluid sim only heavy on the GPU?
Also, you mention CUDA. If you want this to be usable in games, you should make it use OpenCL, DirectCompute, or Vulkan compute so that it runs on AMD and Intel GPUs too. For now, DirectCompute is probably the best option.