We were looking into why some of our assets were consistently "slightly" different when cooking. After spending a week tracking it down, we discovered that the UE4::SSE::HalfToFloat function was to blame. The strange thing is that in isolation it is deterministic and matches the x64 intrinsic _mm_cvtph_ps perfectly, yet in practice it can still produce non-deterministic results.
When called repeatedly inside a ParallelLoop, it loses determinism, occasionally adding an extra 1/256th to some of the output values. Replace the ParallelLoop with a single-threaded loop and everything returns to being deterministic. Similarly, replacing the body of the function with the single intrinsic also makes it deterministic, even when run via a ParallelLoop.
So our two solutions for ensuring deterministic generation of an RGBA16F source texture are to either remove the ParallelLoop that generates it OR replace HalfToFloat with the single intrinsic instruction.
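For reference, a minimal sketch of what we mean by the single-intrinsic version (the function name is ours; requires F16C hardware support):

```cpp
#include <immintrin.h> // _mm_cvtph_ps (F16C), _mm_cvtsi32_si128, _mm_cvtss_f32
#include <cstdint>

// Convert one 16-bit half to a 32-bit float with a single hardware
// conversion. As discussed below, VCVTPH2PS is not sensitive to the
// rounding/denormal modes that make the software path diverge.
static float HalfToFloatIntrinsic(uint16_t Half)
{
    const __m128i HalfVec = _mm_cvtsi32_si128(Half);
    return _mm_cvtss_f32(_mm_cvtph_ps(HalfVec));
}
```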
Timings for this are as follows.
Original code, hyperthreads: 380.553125 ns for 262144 pixels - Not Deterministic
Original code, physical cores: 377.331250 ns for 262144 pixels - Not Deterministic
Single intrinsic HalfToFloat, physical cores: 137.078125 ns for 262144 pixels - Deterministic
ParallelLoop removed, physical cores: 231.250000 ns for 262144 pixels - Deterministic
Conclusions:
Hyperthreading results in FPU stalls in the original code.
ParallelLoop produces non-deterministic output with the original code.
The single intrinsic is faster in all circumstances AND deterministic.
Removal of ParallelLoop is usually slower on larger textures but can be faster on smaller ones.
So for now we've gone with removing the ParallelLoop on just the RGBA16F to RGBA32F conversion: even though the single intrinsic is faster, we know it may not be available on all of our x64 CPUs (F16C support isn't universal).
We found the same results on comparable Intel and AMD hardware.
Can someone take a look at this and confirm the same determinism loss via ParallelLoop execution, and if so, whether the single intrinsic should be implemented? Thanks.
Steps to Reproduce
Choose a default texture with an RGBA16F source format to test; we used VT_Lightning.
In BuildTextureMips, if you set a flag when DebugTexturePathName matches your chosen RGBA16F texture and make it generate the texture 32 times instead of once (via LinearizeToWorkingColorSpace), each time comparing the result with the originally converted result, you will see small differences in the generated output.
Our analysis of why this happens is below, along with the two solutions we found that ensure determinism.
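For illustration, a rough sketch of that repro loop, assuming a hypothetical ConvertOnce() helper that drives the RGBA16F to RGBA32F conversion and returns the raw output bytes:

```cpp
// Not engine code as-is; ConvertOnce() is a stand-in for the conversion
// performed via LinearizeToWorkingColorSpace.
TArray64<uint8> Reference = ConvertOnce(SourceImage);
for (int32 Iteration = 1; Iteration < 32; ++Iteration)
{
    TArray64<uint8> Result = ConvertOnce(SourceImage);
    // Any mismatch here is the non-determinism described above.
    check(Result == Reference);
}
```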
Thanks for the report. Our best guess is that this is caused by the FPU being set to a non-standard rounding or precision mode on some threads but not others.
E.g. some other operation runs on some of the threads and messes with the float control words.
VCVTPH2PS doesn’t suffer from the same problem because it isn’t sensitive to rounding/precision modes.
We could check if this hypothesis is correct by querying the FPU modes at the start of the work function on each thread, or by resetting the float control word to default on each thread when this task runs.
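A minimal sketch of that second option, assuming x64/SSE where the relevant state lives in the MXCSR register:

```cpp
#include <xmmintrin.h> // _mm_setcsr

// Reset SSE floating-point state to the power-on default: round-to-nearest,
// FTZ/DAZ clear (denormals preserved), all SSE exceptions masked.
static void ResetFPStateToDefault()
{
    _mm_setcsr(0x1F80);
}
```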
When I was looking at the ParallelLoop code, I noticed it seems to send most of the work off to what I presume are identically set-up worker threads used across the engine; however, it did seem to keep a "bit" of work for itself, which, if your hypothesis is correct, may be where the FPU difference comes from.
Well, it could be any task that has run on any thread that changed the controlfp state, so it could be on one of the task worker threads as well.
I’ve tried running tests here and have not been able to repro this problem. We aren’t aware of anywhere in Unreal that is intentionally changing the FP state, but it could be in some 3rd party code or in a plugin.
If possible, please put something like this at the start of CopyImage and see if it ever gets hit. If it does, then we have to go on a goose chase to figure out who is responsible for changing it.
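A hypothetical sketch of such a check, reading MXCSR and flagging any thread whose control bits have drifted from the default:

```cpp
#include <xmmintrin.h> // _mm_getcsr

// Compare the MXCSR control bits against the default (0x1F80), ignoring the
// sticky status flags in the low six bits. A mismatch means an earlier task
// changed the rounding mode or FTZ/DAZ on this thread.
static void CheckFPStateIsDefault()
{
    const unsigned int Mxcsr = _mm_getcsr();
    ensureMsgf((Mxcsr & ~0x3Fu) == 0x1F80,
               TEXT("Non-default MXCSR 0x%08x on this thread"), Mxcsr);
}
```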
Using your code we've found where this happens; you were correct that it's during StartupModule in a third-party plugin. In this case it's Havok, who have altered the denormal settings for extra speed. Since this is done on the GameThread but not on the worker threads, we find that every ParallelLoop that splits its work across WorkerThreads+GameThread has the same potential for non-determinism.
In the example we found, you could say that roughly 90% of the texture was converted from half to float on the worker threads, while the other 10% came out slightly different because the GameThread was used to help out.
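For illustration only (this is not Havok's actual code), the kind of denormal change involved looks like the following; applied on the GameThread but not the worker threads, it produces exactly the 90/10 split described above:

```cpp
#include <xmmintrin.h> // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h> // _MM_SET_DENORMALS_ZERO_MODE

// Flush denormals to zero for speed; this changes how the software
// HalfToFloat path handles tiny values on whichever thread it runs on.
static void EnableFastDenormals()
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}
```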
This concerns us, as there may be more places where having the Havok module active could damage determinism. We are left wondering whether Epic has any plans to make the engine more resilient to FPU state. We're in conversation with the Havok devs about how best to deal with this; one possibility is that they change and then restore other threads' FPU state when they go to do threaded work.
FYI, we spoke with Havok and in a future version they will fix their behavior to restore the FPU mode after they change it. Until they release a new version with that fix, you will have to continue fixing it on your side.
Great find, thanks! Probably the easiest short-term fix will be to restore the FPU control word after each call into Havok. Ideally they should be doing that internally, or not messing with it at all. We will also try to reach out to Havok directly.
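A minimal sketch of that short-term fix as an RAII guard (the type name is illustrative, not an engine type):

```cpp
#include <xmmintrin.h> // _mm_getcsr, _mm_setcsr

// Snapshot MXCSR on entry and restore it on scope exit, so a call into the
// plugin cannot leak FTZ/DAZ or rounding-mode changes into later work.
struct FScopedFPStateGuard
{
    FScopedFPStateGuard() : SavedMxcsr(_mm_getcsr()) {}
    ~FScopedFPStateGuard() { _mm_setcsr(SavedMxcsr); }
private:
    unsigned int SavedMxcsr;
};

// Usage:
// {
//     FScopedFPStateGuard Guard;
//     CallIntoHavok(); // hypothetical call that may alter FP state
// }
```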