ParallelFor Optimization

Hello all! I hope you all are doing well. I have some optimization questions I was hoping I could get answered, particularly about ParallelFor.

Right now, I am seeing some strange behavior where ParallelFor seems to be decreasing my performance instead of helping it. I have attached a video where you can see the comparison in stat Game.

I am working on boids flocking behavior, and I would like to use a ParallelFor where each boid samples the other members of the flock, to improve performance through multithreading. This is my first attempt at multithreading in Unreal, though I have done a little multithreading before. Here is the relevant code.

Ordinary For Loop:

for (AActor* boid : Flock)
{
	if (boid == GetOwner())
	{
		continue;
	}

	UBoidsMovementComponent* boidMovementComponent = boid->GetComponentByClass<UBoidsMovementComponent>();
	if (IsValid(boidMovementComponent))
	{
		// calculate distance
		FVector boidPos = boidMovementComponent->UpdatedComponent->GetComponentLocation();
		FVector currentPos = UpdatedComponent->GetComponentLocation();
		float distanceFromBoid = FMath::Max(FVector::Dist(boidPos, currentPos), 0.001f);

		// is the boid within the protected range? 
		if (distanceFromBoid < ProtectedRange)
		{
			// Separation
			FVector sepVector = ((currentPos - boidPos).GetSafeNormal()) / distanceFromBoid;
			cumulativeSeparation += sepVector;

			numProtectedBoids++;
		}

		// is the boid within the visual range? 
		if (distanceFromBoid < VisualRange)
		{
			// Alignment
			velocitiesSum += boidMovementComponent->Velocity / distanceFromBoid;
			
			// Cohesion
			positionsSum += boidPos / distanceFromBoid;

			numVisualBoids++;
		}

	}
}

ParallelFor:

ParallelFor(Flock.Num(), [&](int32 i)
	{
		AActor* boid = Flock[i];

		if (boid != GetOwner())
		{
			UBoidsMovementComponent* boidMovementComponent = boid->GetComponentByClass<UBoidsMovementComponent>();
			if (IsValid(boidMovementComponent))
			{
				// calculate distance
				FVector boidPos = boidMovementComponent->UpdatedComponent->GetComponentLocation();
				FVector currentPos = UpdatedComponent->GetComponentLocation();
				float distanceFromBoid = FMath::Max(FVector::Dist(boidPos, currentPos), 0.001f);

				// is the boid within the protected range? 
				if (distanceFromBoid < ProtectedRange)
				{
					// Separation
					FVector sepVector = ((currentPos - boidPos).GetSafeNormal()) / distanceFromBoid;
					cumulativeSeparation += sepVector;

					numProtectedBoids++;
				}

				// is the boid within the visual range? 
				if (distanceFromBoid < VisualRange)
				{
					// Alignment
					velocitiesSum += boidMovementComponent->Velocity / distanceFromBoid;

					// Cohesion
					positionsSum += boidPos / distanceFromBoid;

					numVisualBoids++;
				}

			}
		}
	},
	false);

Is there an important point about ParallelFor I am missing? One thing I was wondering is whether this is dispatching too many items at once. Is there a better option for my use case that I should look at? Thank you so much for taking the time to read this!

There’s typically overhead from locking, spawning threads and so on, so you won’t get the theoretical x-times speedup simply by using x threads. Most of the time there is a sweet spot: multiple threads only become faster once the amount of data you need to process is sufficiently large. Your example, with ~200 actors, probably isn’t large enough.
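
A rough way to see where that sweet spot comes from, as a simplified cost model rather than a measurement: assume N items that each cost c to process, P worker threads, and a fixed dispatch overhead o for going parallel. All four symbols are assumptions introduced here, and the model ignores per-task scheduling cost, contention and uneven load.

```latex
T_{\text{serial}} = N c, \qquad T_{\text{parallel}} \approx o + \frac{N c}{P}

T_{\text{parallel}} < T_{\text{serial}} \iff N > \frac{o}{c \left(1 - \frac{1}{P}\right)}
```

So the cheaper each item is, the more items are needed before the parallel version wins. Actual values for o and c have to come from profiling.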

If you want more accurate data, I’d suggest using the profiler and possibly adding your own STATs rather than just looking at the overall Tick Time.

I see, that makes sense. I might try some more testing with larger flocks and the profiler. Thank you for the answer! :smiley:

I don’t see the point of using concurrency/parallelism. There are no heavy calculations here.

This is an old thread, and the OP probably doesn’t need the answer anymore, but I will answer for people who come here in the future looking for one:

ParallelFor has overhead from dispatching your tasks to its thread pool.
So, if you have a large number of small, simple tasks, dispatching each one as its own task isn’t the best option.

What you want to do is divide your data into bigger chunks and execute those chunks on the threads.

For example, if you have 1000 calculations to do but each one is tiny, it is better to split them into, say, 30 chunks and run those chunks in a ParallelFor:

Example:

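		// Number of chunks to split the work into (30 is an arbitrary example value).
		const int32 Threads = 30;
		// Round up so the last chunk picks up any remainder instead of dropping it.
		const int32 IterationsPerThread = (MovementComponents.Num() + Threads - 1) / Threads;

		ParallelFor(Threads, [&](int32 Index)
		{
			// Each chunk processes its own contiguous range of indices.
			const int32 StartIDX = Index * IterationsPerThread;
			const int32 EndIDX = FMath::Min(StartIDX + IterationsPerThread, MovementComponents.Num());
			for (int32 I = StartIDX; I < EndIDX; I++)
			{
				MovementComponents[I]->CustomTick(DeltaTime);
			}
		});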
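
Applying the same chunking idea to the flock loop from the original question needs one extra step. That loop appears to add into sums shared by every iteration (cumulativeSeparation, velocitiesSum, positionsSum and the two counters), and unsynchronized += on shared variables from several threads is a data race. One way around it, sketched here in plain standard C++ rather than Unreal code so that it is self-contained, is to let each chunk accumulate into its own private partial result and combine the partials once all chunks have finished. FlockSums, AccumulateInChunks, and the use of std::thread in place of ParallelFor are assumptions made for illustration, not part of the original code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the per-boid sums in the original loop.
struct FlockSums
{
    double DistanceSum = 0.0;
    int NumVisualBoids = 0;
};

// Splits the input into NumChunks contiguous ranges, accumulates each range
// into a private partial result on its own thread, then merges the partials.
FlockSums AccumulateInChunks(const std::vector<double>& Distances, double VisualRange, int NumChunks)
{
    const std::size_t Num = Distances.size();
    const std::size_t Chunks = static_cast<std::size_t>(std::max(NumChunks, 1));
    // Round up so the final chunk covers any remainder.
    const std::size_t PerChunk = (Num + Chunks - 1) / Chunks;

    std::vector<FlockSums> Partial(Chunks);
    std::vector<std::thread> Workers;
    Workers.reserve(Chunks);

    for (std::size_t Chunk = 0; Chunk < Chunks; ++Chunk)
    {
        Workers.emplace_back([&, Chunk]()
        {
            const std::size_t Start = std::min(Chunk * PerChunk, Num);
            const std::size_t End = std::min(Start + PerChunk, Num);
            FlockSums Local; // private to this chunk, so no synchronization is needed
            for (std::size_t I = Start; I < End; ++I)
            {
                if (Distances[I] < VisualRange)
                {
                    Local.DistanceSum += Distances[I];
                    ++Local.NumVisualBoids;
                }
            }
            Partial[Chunk] = Local; // each worker writes only its own slot
        });
    }

    for (std::thread& Worker : Workers)
    {
        Worker.join();
    }

    // Merge on the calling thread after all workers have finished.
    FlockSums Total;
    for (const FlockSums& P : Partial)
    {
        Total.DistanceSum += P.DistanceSum;
        Total.NumVisualBoids += P.NumVisualBoids;
    }
    return Total;
}
```

In Unreal the same shape would presumably be a ParallelFor over the chunk index with a TArray of partial results, but that mapping is untested here. Whether it actually beats the plain loop for a few hundred boids still has to be checked in the profiler.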