Hi,
can someone explain why ParallelFor is not improving performance in my case? My plan is to populate a TMap<FVector, float> that holds data about my 3D grid of points.
Example:
0,0,0 - 0.5
0,0,10 - 0.7
0,0,20 - 0.4
etc.
At first I generated these values using nested “for” loops. For a 512x256x256 grid it took 35s to finish.
for (int z = 0; z < Height; ++z)
{
    for (int y = 0; y < Size; ++y)
    {
        for (int x = 0; x < Size; ++x)
        {
            // PointOffset is used to increase/decrease points spread
            gridPoint = FVector(
                _chunkLocation.X + (x * PointOffset),
                _chunkLocation.Y + (y * PointOffset),
                _chunkLocation.Z + (z * PointOffset));
            if (z <= GroundLevel)
            {
                _gridPoints.Add(gridPoint, GenerateValue(gridPoint));
                continue;
            }
            _gridPoints.Add(gridPoint, 0);
        }
    }
}
Because every iteration is independent of the others, I decided to use ParallelFor. It now takes 45s to finish…
ParallelFor(Height, [&](int z)
{
    ParallelFor(Size, [&](int y)
    {
        ParallelFor(Size, [&](int x)
        {
            FVector gridPoint = FVector(
                _chunkLocation.X + (x * ChunkInfo.PointOffset),
                _chunkLocation.Y + (y * ChunkInfo.PointOffset),
                _chunkLocation.Z + (z * ChunkInfo.PointOffset));
            if (z <= ChunkInfo.GroundLevel)
            {
                _gridPoints.Add(gridPoint, GenerateValue(gridPoint));
            }
            else
            {
                _gridPoints.Add(gridPoint, 0);
            }
        });
    });
});
As far as I understand, ParallelFor should distribute this work across multiple threads (through the task graph, to be more precise). For example, for the outer loop:
- 4 threads assigned
- Each thread takes Height/4 passes so that all passes are assigned properly
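The partitioning described in the two bullets above can be sketched in plain standard C++. This is only an illustration of the expected behavior, not how ParallelFor is actually implemented; `ChunkedParallelFor` is a hypothetical helper that uses `std::thread` instead of the task graph and assumes `NumThreads` is greater than zero.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Hypothetical helper (not an Unreal API): splits [0, Count) into up to
// NumThreads contiguous ranges and runs Body(index) for every index,
// one range per thread. Assumes NumThreads > 0.
template <typename BodyType>
void ChunkedParallelFor(int Count, int NumThreads, BodyType Body)
{
    std::vector<std::thread> Workers;
    // Round up so that every index lands in some range.
    const int ChunkSize = (Count + NumThreads - 1) / NumThreads;

    for (int t = 0; t < NumThreads; ++t)
    {
        const int Begin = t * ChunkSize;
        const int End = std::min(Count, Begin + ChunkSize);
        if (Begin >= End)
        {
            break; // fewer indices than threads, nothing left to assign
        }
        Workers.emplace_back([Begin, End, &Body]()
        {
            for (int i = Begin; i < End; ++i)
            {
                Body(i);
            }
        });
    }

    // Wait for all ranges to finish before returning.
    for (std::thread& Worker : Workers)
    {
        Worker.join();
    }
}
```

With `Count = Height` and `NumThreads = 4`, each thread would handle roughly Height/4 consecutive values of z, which is the behavior expected from the outer loop.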
But it seems like only one thread is doing the work, and the additional time comes from the synchronization mechanisms in ParallelFor and TDiscardableKeyValueCache (TMap does not seem to be thread-safe, so I replaced it with this).
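To make the suspicion about synchronization concrete, here is a minimal sketch in plain standard C++. It assumes a container that guards every insertion with a single lock; whether TDiscardableKeyValueCache behaves exactly like this is an assumption, and `LockedGrid` and `FillGrid` are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a thread-safe map: every Add takes the same lock.
class LockedGrid
{
public:
    void Add(long long Key, float Value)
    {
        // Only one thread at a time gets past this line, so concurrent
        // insertions run one after another.
        std::lock_guard<std::mutex> Guard(Mutex);
        Points.emplace(Key, Value);
    }

    std::size_t Num()
    {
        std::lock_guard<std::mutex> Guard(Mutex);
        return Points.size();
    }

private:
    std::mutex Mutex;
    std::unordered_map<long long, float> Points;
};

// Each thread inserts its own disjoint block of keys into the shared grid.
void FillGrid(LockedGrid& Grid, int NumThreads, int PointsPerThread)
{
    std::vector<std::thread> Workers;
    for (int t = 0; t < NumThreads; ++t)
    {
        Workers.emplace_back([&Grid, t, PointsPerThread]()
        {
            for (int i = 0; i < PointsPerThread; ++i)
            {
                const long long Key =
                    static_cast<long long>(t) * PointsPerThread + i;
                Grid.Add(Key, 0.0f);
            }
        });
    }
    for (std::thread& Worker : Workers)
    {
        Worker.join();
    }
}
```

In a setup like this the threads spend most of their time waiting for the lock whenever the per-point work is small compared to the insertion, so adding threads adds contention rather than throughput.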
I experimented with forcing individual loops to run on a single thread in different combinations, but it turns out that the original solution with three plain “for” loops is still the fastest. Whyyyy?