There is no need for synchronization on the CPU: you can execute the whole thing as 32 parallel for loops, each processing 32 elements. Last but not least, the inverse Fourier transform is not mandatory either. You can evaluate the individual wave signals directly instead. This also gives you a valuable advantage for server-side logic or any other game logic where you need displacements at a time other than the current frame time. Of course, you will be limited to about 16-30 wave signals, as opposed to 1024, but you can importance-sample the spectrum: evaluate the complex amplitude of each wave, including the current wind settings and random values, and pick the 16 that contribute the most. This approach is faster when you need only a few dozen height queries per frame.
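To make the direct evaluation concrete, here is a rough sketch in Python. It is not taken from any particular engine: `make_spectrum`, `select_waves` and `height_at` are made-up names, the spectrum shape (a 1/k² falloff with a wind alignment factor and an arbitrary scale) is only a placeholder for whatever spectrum you actually use, and taking the real part of each wave is a simplification of the usual Hermitian-symmetric sum. The deep-water dispersion relation ω = sqrt(g·|k|) is assumed.

```python
import cmath
import math
import random

GRAVITY = 9.81  # m/s^2, used by the deep-water dispersion relation


def make_spectrum(n=32, patch_size=64.0, wind=(1.0, 0.0), seed=1):
    """Placeholder spectrum: returns a list of (kx, kz, h0) tuples.

    The amplitude formula is an assumption for illustration only;
    substitute the spectrum your water shader really uses.
    """
    rng = random.Random(seed)
    wind_len = math.hypot(wind[0], wind[1])
    waves = []
    for iz in range(n):
        for ix in range(n):
            kx = 2.0 * math.pi * (ix - n // 2) / patch_size
            kz = 2.0 * math.pi * (iz - n // 2) / patch_size
            k = math.hypot(kx, kz)
            if k == 0.0:
                continue  # skip the constant (DC) term
            # Waves travelling along the wind get more energy.
            alignment = abs(kx * wind[0] + kz * wind[1]) / (k * wind_len)
            amplitude = 0.001 * alignment / (k * k)  # arbitrary scale
            h0 = amplitude * complex(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
            waves.append((kx, kz, h0))
    return waves


def select_waves(waves, count=16):
    """Keep the `count` waves with the largest complex amplitude."""
    return sorted(waves, key=lambda w: abs(w[2]), reverse=True)[:count]


def height_at(waves, x, z, t):
    """Sum the selected waves at position (x, z) and any time t."""
    total = 0.0
    for kx, kz, h0 in waves:
        omega = math.sqrt(GRAVITY * math.hypot(kx, kz))
        phase = kx * x + kz * z - omega * t
        total += (h0 * cmath.exp(1j * phase)).real
    return total


waves = make_spectrum()
top = select_waves(waves, 16)
print(height_at(top, 10.0, 5.0, 2.5))  # query at an arbitrary time
```

Because `height_at` takes `t` as a plain argument, the same 16 waves can answer queries for past or future times without rerunning anything. The selection only needs to be redone when the wind settings or random values change.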
But for starters, mindlessly copy-pasting the code for each pass, wrapping it in a for x / for y loop, and resolving any issues along the way will get you working water height queries in several hours. From there, you can decide what the next step should be.
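As an illustration of the "wrap it in for x, for y" idea, a minimal sketch might look like the following. Everything here is hypothetical: `run_pass` stands in for a compute dispatch, and `evolve_kernel` is an assumed example of a pass body (time evolution of the spectrum with one common sign convention), not the code from any specific shader. Each pass runs to completion before the next one starts, which is why no explicit synchronization is needed.

```python
import cmath
import math


def run_pass(kernel, width, height, src, dst, **params):
    """CPU stand-in for a compute dispatch: one kernel call per (x, y)."""
    for y in range(height):
        for x in range(width):
            kernel(x, y, src, dst, **params)


def evolve_kernel(x, y, src, dst, t, patch_size, gravity=9.81):
    """Assumed pass body: advance the amplitude at texel (x, y) to time t."""
    n = len(src)
    kx = 2.0 * math.pi * (x - n // 2) / patch_size
    kz = 2.0 * math.pi * (y - n // 2) / patch_size
    omega = math.sqrt(gravity * math.hypot(kx, kz))
    dst[y][x] = src[y][x] * cmath.exp(-1j * omega * t)


n = 4  # tiny grid for the example; the real one would be e.g. 32
h0 = [[1 + 0j for _ in range(n)] for _ in range(n)]
ht = [[0j for _ in range(n)] for _ in range(n)]
run_pass(evolve_kernel, n, n, h0, ht, t=1.5, patch_size=64.0)
```

Reading from `src` and writing to a separate `dst` mirrors the usual ping-pong buffers, so a pasted pass body never sees half-updated data.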