ParallelFor not improving performance

Hi,
can someone explain to me why ParallelFor is not improving performance in my case? My plan is to populate a TMap<FVector, float> that holds data about my 3D grid of points.

Example:
0,0,0 - 0.5
0,0,10 - 0.7
0,0,20 - 0.4
etc.

At first I generated these values using nested “for” loops. For a 512x256x256 grid it took 35s to finish.

for (int z = 0; z < Height; ++z)
{
	for (int y = 0; y < Size; ++y)
	{
		for (int x = 0; x < Size; ++x)
		{
			// PointOffset is used to increase/decrease the spread of the points
			FVector gridPoint = FVector(
				_chunkLocation.X + (x * PointOffset),
				_chunkLocation.Y + (y * PointOffset),
				_chunkLocation.Z + (z * PointOffset));

			if (z <= GroundLevel)
			{
				_gridPoints.Add(gridPoint, GenerateValue(gridPoint));
				continue;
			}

			_gridPoints.Add(gridPoint, 0);
		}
	}
}

Because every pass is independent of the others, I decided to use ParallelFor. It now takes 45s to finish…

ParallelFor(Height, [&](int z)
{
	ParallelFor(Size, [&](int y)
	{
		ParallelFor(Size, [&](int x)
		{
			FVector gridPoint = FVector(
				_chunkLocation.X + (x * ChunkInfo.PointOffset),
				_chunkLocation.Y + (y * ChunkInfo.PointOffset),
				_chunkLocation.Z + (z * ChunkInfo.PointOffset));

			if (z <= ChunkInfo.GroundLevel)
			{
				_gridPoints.Add(gridPoint, GenerateValue(gridPoint));
			}
			else
			{
				_gridPoints.Add(gridPoint, 0);
			}
		});
	});
});

As far as I understand, ParallelFor should assign multiple threads to this job (the task graph, to be more precise). For example (outer loop):

  • 4 threads assigned
  • Each thread takes Height/4 passes so that all passes are assigned properly

But it seems like only one thread is doing the job, and the additional time is the result of the synchronization mechanisms in ParallelFor and TDiscardableKeyValueCache (TMap doesn’t seem to be thread-safe, so I replaced it with that).

I experimented with forcing a single thread on each loop in different combinations, but it turns out that the original solution with three plain “for” loops is the best. Whyyyy?

Try running the loops without adding anything to your TMap, once in parallel and once without, and compare performance then. I might be way off here, but if I were to make a guess, I’d say that the bottleneck of the operation is the process of adding something to your TMap and not the loop itself.
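
A quick way to check that is to time both variants with FPlatformTime, once with and once without the Add calls. Just a rough sketch:

double StartTime = FPlatformTime::Seconds();

// Run one variant of the loops here (plain for / ParallelFor,
// with or without the _gridPoints.Add calls).

double Elapsed = FPlatformTime::Seconds() - StartTime;
UE_LOG(LogTemp, Log, TEXT("Grid generation took %.2f seconds"), Elapsed);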


For a start, adding elements to a map is not thread safe, so this code is completely broken. If you want to do this, you’ll need to generate one map locally on each thread and merge them when finished.
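
A rough sketch of that idea, reusing the names from your code and parallelizing only the outer z loop so each slice gets its own map:

// One local map per z slice; each task writes only its own map.
TArray<TMap<FVector, float>> SliceMaps;
SliceMaps.SetNum(Height);

ParallelFor(Height, [&](int32 z)
{
	TMap<FVector, float>& LocalMap = SliceMaps[z];

	for (int32 y = 0; y < Size; ++y)
	{
		for (int32 x = 0; x < Size; ++x)
		{
			FVector gridPoint = FVector(
				_chunkLocation.X + (x * PointOffset),
				_chunkLocation.Y + (y * PointOffset),
				_chunkLocation.Z + (z * PointOffset));

			LocalMap.Add(gridPoint, z <= GroundLevel ? GenerateValue(gridPoint) : 0.f);
		}
	}
});

// Merge on the calling thread once all tasks have finished.
_gridPoints.Reserve(Height * Size * Size);
for (const TMap<FVector, float>& LocalMap : SliceMaps)
{
	_gridPoints.Append(LocalMap);
}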

However, it seems like you don’t need a map at all, as you are generating a dense grid of values. That can be compacted to a 1D array with some clever indexing.
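
For example, inside your existing loops the map insert could be replaced by a flat index (just illustrating the idea, using your Size/Height):

// One float per grid point, stored in a flat array instead of a TMap.
TArray<float> GridValues;
GridValues.SetNumUninitialized(Size * Size * Height);

// (x, y, z) -> flat index
const int32 Index = x + y * Size + z * Size * Size;
GridValues[Index] = GenerateValue(gridPoint);

// ...and back again, if you ever need the coordinates from an index:
// x = Index % Size;  y = (Index / Size) % Size;  z = Index / (Size * Size);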

It’s not faster because you are creating too many tasks. The task graph is not meant to handle thousands of tiny tasks; if you want to use it, you’ll need to group your work into reasonable batches yourself.
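
For instance, you could run a single ParallelFor over a small number of batches and let each task handle a range of z slices (NumBatches is just an illustrative value):

const int32 NumBatches = 8; // illustrative; a handful of tasks is plenty
const int32 SlicesPerBatch = FMath::DivideAndRoundUp(Height, NumBatches);

ParallelFor(NumBatches, [&](int32 Batch)
{
	const int32 FirstZ = Batch * SlicesPerBatch;
	const int32 LastZ = FMath::Min(FirstZ + SlicesPerBatch, Height);

	for (int32 z = FirstZ; z < LastZ; ++z)
	{
		for (int32 y = 0; y < Size; ++y)
		{
			for (int32 x = 0; x < Size; ++x)
			{
				// ... per-point work for (x, y, z), writing only
				// to data owned by this batch ...
			}
		}
	}
});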

@Dutii, thx, good suggestion, I’ll try it tomorrow.

@Zeblote That’s why I’m using TDiscardableKeyValueCache instead of TMap when dealing with ParallelFor (I should have pointed that out more clearly, sorry about that). If the task graph is not meant for such tasks, then what’s the point of ParallelFor? The task graph should automatically calculate how to assign work to threads and optimize the number of threads needed, shouldn’t it?
Good point about the 1D array and proper indexing… But I don’t want to “invent” things before I’m sure that the solutions included in the engine won’t work.

Never heard of that before, but a quick skim over the code tells me it uses a read/write lock on every access. While it’s correct that this is thread safe, that’s useless for your use case.


The task graph is not Rayon. It has quite a bit of overhead for launching tasks. You’ll need to batch manually.

Yeah… I just checked it now too, ugh. I read too much into the “Implements a thread-safe” comment and hadn’t checked how it actually works. Thanks a lot for the help!

The beautiful part of flattening your indices to 1D is that you can create separate blocks of data in the same array (say, one for every slice along some axis) that each thread can then write to with no conflicts, so there is no need for any kind of synchronization.
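
Putting the two ideas together, something along these lines (one task per z slice, each writing a disjoint block of a preallocated array, names taken from the code above) should need no locking at all:

// Preallocate the whole flat array up front; each task writes only
// the contiguous block belonging to its own z slice.
TArray<float> GridValues;
GridValues.SetNumUninitialized(Size * Size * Height);

ParallelFor(Height, [&](int32 z)
{
	const int32 SliceStart = z * Size * Size;

	for (int32 y = 0; y < Size; ++y)
	{
		for (int32 x = 0; x < Size; ++x)
		{
			FVector gridPoint = FVector(
				_chunkLocation.X + (x * PointOffset),
				_chunkLocation.Y + (y * PointOffset),
				_chunkLocation.Z + (z * PointOffset));

			GridValues[SliceStart + y * Size + x] =
				(z <= GroundLevel) ? GenerateValue(gridPoint) : 0.f;
		}
	}
});

Each slice occupies its own contiguous Size * Size block, so two tasks never touch the same element.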


Yep, time for creating my own solutions ;D I came here from C# and also had some fun with OpenMP in C++, so I’m a little bit surprised that such things aren’t implemented here :stuck_out_tongue_winking_eye: And the documentation about dealing with multi-threading in Unreal is not that extensive…