Multithreading and Performance in Unreal

There’s a criminal lack of condensed documentation on how to implement actually useful multithreading in Unreal, and it’s also generally difficult to find a good overview of practices that lead to performant code, so here’s that.

MULTITHREADING

1. Overview



Unreal supports multithreading by default, but only makes partial use of it. While there are dedicated threads for audio, rendering and stats, most operations are still done in the game thread, including EventTicks and blueprints, with the rendering following one or two frames behind the main thread; this means most expensive calculations will invariably lead to a loss of framerate. That’s where threading comes in handy!

There are two major approaches to async work in Unreal:

Runnables

Run on dedicated, newly created threads over which you have full control. They will automatically stop once their work is completed, and are generally useful for big computations that require a nearly continuously running thread.

Tasks

Run on the TaskGraph, a job manager that tries to balance out the workload across multiple preexisting threads. This is ideal for sending batches of small operations, as it abstracts away the complexity of managing threads, and it also supports defining dependencies between Tasks.

Queuing Tasks causes little performance concern since the threads are already running, but the system may also be less reactive as it has to find slots to fit the work into a limited pool, and sending long Tasks should be avoided so as not to clog up the threads. It may also sometimes decide to run Tasks directly in the game thread, depending on the setup.

2. Threading Classes



Here’s a brief overview of the common classes provided by the engine’s Core:

FRunnable


The most comprehensive threading tool in Unreal. Simply derive a class from FRunnable like so:

// .h
#pragma once
#include "CoreMinimal.h"
#include "HAL/Runnable.h"
#include "HAL/RunnableThread.h"

class FMyThread : public FRunnable {
public:
  FMyThread( /*Parameters*/ ) {
    Thread = FRunnableThread::Create(this, TEXT("MyThread"));
  }

  virtual bool Init() override;
  virtual uint32 Run() override;
  virtual void Exit() override;
  virtual void Stop() override;

  FRunnableThread* Thread;
  bool bShutdown = false;
};
// .cpp
#include "FMyThread.h"

bool FMyThread::Init() {
  /* Should the thread start? */
  return true;
}

uint32 FMyThread::Run() {
  while (!bShutdown) {
    /* Work on a dedicated thread */
  }
  return 0;
}

void FMyThread::Exit() {
  /* Post-Run code, threaded */
}

void FMyThread::Stop() {
  bShutdown = true;
}

When you want to start your thread, include its header and call its constructor (keep the pointer at hand!):

auto* Thread = new FMyThread( /*Parameters*/ );

That constructor in turn calls FRunnableThread::Create(), with this (i.e. your FMyThread) as the FRunnable to be created.

The class provides four functions to override (automatically run by the class), in which you can implement your threaded code:

  • Init(): Gets called by FRunnableThread::Create(), runs in the game thread. Insert logic of your choice to decide whether or not the thread should initialize, and return true if it should.

  • Run(): Runs in the new thread, once Init() has completed. Do your implementation here. When the function ends, the thread will naturally exit, so you might want some form of loop to iterate until your calculations are done, or until you’ve told your thread to stop (e.g. with a boolean). Return 0 to indicate success.

  • Exit(): Runs in the new thread, once Run() has completed. If you want special logic to occur there.

  • Stop(): A normal function running in the game thread that does not get called automatically. Use Thread->Kill() to run it in the event you want to end your thread early in a controlled manner. It’s then up to you to implement how that happens (in this example by stopping the while loop). Thread->Kill(true) will also wait for the thread to finish before returning.

That is all you need; this will run by itself on a standalone thread and do your computations there.
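For cleanup, one common pattern (a sketch; placing this in a destructor is an assumption about your ownership setup, not an engine requirement) is to stop and delete the FRunnableThread when the runnable goes away:

FMyThread::~FMyThread() { // Declare the destructor in the header alongside the other overrides
  if (Thread) {
    Thread->Kill(true); // Calls Stop() and waits for Run() to return
    delete Thread;
    Thread = nullptr;
  }
}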

AsyncTask


If you want to run a small async operation without creating a dedicated class or starting a new thread and do not need the control logic for pausing or callbacks, you can put it inside an AsyncTask running on the TaskGraph:

AsyncTask(ENamedThreads::AnyHiPriThreadNormalTask, [this] () {
  /* Work on the TaskGraph */
  Caller->FunctionToThread(); // Caller is a member, reachable through the captured this
});

The structure here is similar to a TFunction, which is Unreal’s way of implementing C++ lambdas. That means you need to capture references or copies of all variables or functions declared outside the AsyncTask, usually using [this] for member variables and functions, and [&] to capture locals by reference (see lambdas in C++).

ENamedThreads gives you multiple choices for where to execute that threaded work, including back onto the game thread if calling from outside it. Since code is run on mostly preexisting threads in the TaskGraph, queuing numerous short Tasks has lower overhead than creating new FRunnable threads, but they may execute with more delay.

ParallelFor


A fancier version of AsyncTask that splits a for loop into multiple Tasks running in the TaskGraph.

ParallelFor(Array.Num(), [&](int32 i) { // Run Array.Num() operations, with current index i
  /* Work on the TaskGraph (order of execution is variable!) */
  ++Array[i];
});

There’s no guarantee about the order or the thread safety of the operation within, so you might want to use mutexes or atomics with it. MSVC has an analogous #pragma loop(hint_parallel(n)). Practically speaking, the contents of your loop must be significant to really benefit from this approach.

FNonAbandonableTask


A way to declare your own AsyncTasks, in a format that sits in between FRunnable and lambda-like AsyncTasks. You can implement your own code in a standalone class to be more reusable, which will run on the TaskGraph instead of inside a dedicated thread, but is missing some of FRunnable’s initializing and stopping logic.

#pragma once
#include "CoreMinimal.h"
#include "Async/AsyncWork.h"

class FMyTask : public FNonAbandonableTask {
  friend class FAsyncTask<FMyTask>;
  friend class FAutoDeleteAsyncTask<FMyTask>;

  FMyTask( /*Parameters*/ ) {
    /* Constructor */
  }
  void DoWork() {
    /* Work on the TaskGraph */
  }
  FORCEINLINE TStatId GetStatId() const { // Declares the Task to the stats/profiling system
    RETURN_QUICK_DECLARE_CYCLE_STAT(FMyTask, STATGROUP_ThreadPoolAsyncTasks);
  }
};

Start your custom Task like so:

auto* MyTask = new FAsyncTask<FMyTask>( /*Parameters*/ );
MyTask->StartBackgroundTask();

The Task can be declared as a friend of FAsyncTask or FAutoDeleteAsyncTask, with the latter being more self-managing (it deletes itself once the work completes).

If you are extremely lazy and don’t want to port your slow blueprints to C++, it is also possible to make an AsyncTask run blueprints through events.

FFunctionGraphTask and TGraphTask


FFunctionGraphTask is another way to queue Tasks using lambdas, with a single prerequisite and completion action as optionals.

FGraphEventRef PrerequisiteTask; // A task to be completed before

FGraphEventRef Task = FFunctionGraphTask::CreateAndDispatchWhenReady( []() {
  /* Work on the TaskGraph */
}, TStatId(), PrerequisiteTask, ENamedThreads::AnyThread);

FGraphEventArray TasksList; TasksList.Add(Task);

if (Task->IsComplete()) {
  /* On work completed */
}

TGraphTask takes an entire user-defined class and adds it to the queue, with an optional list of prerequisites. You need to implement your own DoTask() function within your class for it to work.

FGraphEventRef Task = TGraphTask<FMyTask>::CreateTask(&PrerequisitesList, ENamedThreads::GameThread).ConstructAndDispatchWhenReady();

CreateTask() returns an FConstructor that ConstructAndDispatchWhenReady() uses to make the FMyTask you provide it. ENamedThreads here is the thread the call is made from, not the one to run the Task on.

AsyncPool, AsyncThread, TFuture and TPromise


Some things in that engine are just purely undocumented, and I doubt what follows was meant to be used directly as-is.

AsyncPool seems to execute functions on a preprovided thread pool, and you can pass an optional callback function to execute when done:

FQueuedThreadPool* Pool = FQueuedThreadPool::Allocate(); // Or Pool = GThreadPool;
verify(Pool->Create(4, 32 * 1024, TPri_Normal));

TUniqueFunction<void()> Callback = []() { // Wrap in MoveTemp() to pass as parameter
  /* Callback code */
};
TFunction<void()> Body = []() {
  /* Work on a thread pool */
};

AsyncPool(*Pool, Body, MoveTemp(Callback));

The FQueuedThreadPool seems to also be used with functions such as this:

IQueuedWork* Work; Pool->AddQueuedWork(Work);

AsyncThread probably creates a new thread in the same vein as FRunnable, and can return a TFuture struct when done:

TFunction<bool()> Task = []() {
  /* Work on a new thread */
  return true;
};
TFuture<bool> FutureResult = AsyncThread(Task);

As to how TFuture and TPromise are meant to be used, well, I don’t know either. Seems to be a system of callbacks, with functions like Future.IsReady(), Future.IsValid() and Future = Promise.GetFuture().
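Still, a minimal sketch of the usual promise/future pattern (assuming the standard contract where Get() blocks until a value is set; note the Promise must outlive the task):

TPromise<int> Promise;
TFuture<int> Future = Promise.GetFuture();

AsyncTask(ENamedThreads::AnyBackgroundThreadNormalTask, [&Promise]() {
  Promise.SetValue(42); // Fulfill the promise from the worker
});

if (Future.IsValid()) {
  int Result = Future.Get(); // Blocks the caller until SetValue() has run
}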


3. Communicating with your Thread



Beware that modifying UObjects and AActors from a thread may not be supported, due to the way Unreal and its Garbage Collector are set up. Prefer doing your computations asynchronously, and apply the results back on the game thread.

The following assumes you are using FRunnable as your threading structure, but the other approaches behave similarly.

When starting the thread, you can pass any list of parameters, by copy, pointer or reference through the Constructor:

FMyThread(int _i, AMyActor* _Actor, FVector& _Vector) {
  i = _i;
  Actor = _Actor;
  Vector = _Vector; // Copies the value; a member reference can only be initialized via an initializer list (below)
}

Using an initializer list, you can even initialize member references, and make the formatting snappier while reusing names:

FMyThread(int i, AMyActor* Actor, FVector& Vector) : i{i}, Actor{Actor}, Vector{Vector} {}

FVector& Vector; // Member reference, cannot be initialized by other means than the above

How shared memory is managed


It is actually possible to read or modify variables and call functions from anywhere in or outside your thread.

uint32 FMyThread::Run() {
  Actor->bIsBeingThreaded = true; // Variable from another class
  while (!bShutdown) {
    Actor->DoThreadedWork(); // Function from another class
  }
  return 0;
}

The only important limitation to this is concurrent writing: while reading a variable is threadsafe, having multiple threads write to one can lead to data races, and operations that resize containers (such as TArray Add()) invalidate all pointers into that container, which is an EXCEPTION_ACCESS_VIOLATION waiting to happen.

There are multiple approaches to counter this:

  • Maximize the number of local variables in your threaded functions, and prefer sending your parameters through the constructor by copy. This can however become very expensive when copying large arrays of elements.

  • While one thread runs operations on a container, on the others avoid all operations that may trigger a memory reallocation, such as Add(), Remove() and SetNum(). Writing to TArray/TMap indices while reading into others with concurrent threads is actually feasible, as long as the operations are not on the same indices, and do not add or remove elements.

  • Use “FlipFlop” containers, such that each thread locks access to one while performing its work, and releases it for others to use when done.

  • Use Move Semantics; more about that further down.

C++11 and Unreal also provide multiple threadsafe tools:

thread_local

If you are running multiple threads within the same class, you may want to preface your global or static variables with thread_local. This guarantees that each thread receives its own individual instance of the variable.
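For example:

thread_local int ScratchCounter = 0; // Each thread gets its own independent copy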

std::atomic

By wrapping a variable in an atomic, such as std::atomic<int> i;, you ensure that reads and writes to it are always threadsafe, and you can also define how concurrent writes are synchronized, though the syntax may be more complex. Atomics can come at a substantial performance cost due to the added synchronization load, so try to use them only when inter-thread communication is necessary.
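A minimal example:

#include <atomic>

std::atomic<int> Counter{0};
Counter.fetch_add(1, std::memory_order_relaxed); // Threadsafe increment from any thread
int Snapshot = Counter.load(); // Threadsafe read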

An std::atomic can only wrap relatively simple types such as ints and bools; it will not compile around containers like TArray. A struct can be made atomic, but this only makes assigning the whole struct threadsafe, not modifying individual members inside it. An alternative is deriving the struct from std::mutex (or FCriticalSection) and calling Struct.Lock() before accessing members for thread safety.

Use std::atomics, as Unreal’s TAtomics are considered deprecated.

FPlatformAtomics

Using operations such as FPlatformAtomics::InterlockedAdd(&i, 10), you can perform atomic operations directly without having to declare either the variable itself as atomic, or using FCriticalSection. Includes modifying (Read, Add, Store, Increment, Decrement), logical comparisons (And, Or), etc. Some default classes implement those, such as FThreadSafeBool and FThreadSafeCounter.
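For example (a sketch; the variable must be visible to all participating threads):

volatile int32 Value = 0;

FPlatformAtomics::InterlockedAdd(&Value, 10); // Atomic add, returns the previous value
int32 Current = FPlatformAtomics::AtomicRead(&Value); // Atomic read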

TQueues

A TQueue allows for threadsafe atomic concurrent writing into a single container, with the usual setup being one or multiple threads (depending on the EQueueMode) adding elements, while a single other pops the results. You can even queue other containers, such as TQueue<TArray<int>>.

As it is a FIFO queue, elements get added at the tail with Enqueue() and removed from the head with Dequeue(). The container is a lock-free list, which avoids the usual performance overhead of mutexes around container accesses, and is contention-free in Single-producer mode (SPSC), but not in Multi-producer (MPSC).
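A minimal producer/consumer sketch:

TQueue<int, EQueueMode::Mpsc> Queue; // Defaults to EQueueMode::Spsc if omitted

// Producer thread(s)
Queue.Enqueue(42);

// Single consumer thread
int Out;
while (Queue.Dequeue(Out)) {
  /* Process Out */
}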

TCircularQueue is a variant based on TCircularBuffer, also lock-free and FIFO but only supporting SPSC and with a static capacity (rounded to a power of 2, minus 1). Some examples.

#include "Containers/CircularQueue.h"

class FMyClass : public FRunnable {
  FMyClass(uint32 QueueSize) : Queue{QueueSize} { // Capacity initialized from the initializer list
    TCircularQueue<int> QueueB(QueueSize); // Capacity initialized from a constructor
  }
  }
  TCircularQueue<int> Queue;
};

There’s one last, undocumented class deeply hidden in the source files called TSafeQueue, a FIFO Multi-producer Multi-consumer queue with minimal locks.

#include "UnrealAudio/Private/UnrealAudioUtilities.h"
using namespace UAudio;

TSafeQueue<int> Queue;

TLockFreePointerList and TListThreadSafe

TLockFreePointerList is, like TQueue, a list-based container, for threadsafe and ABA-resistant pointer storage. It can be either FIFO (first in, first out), LIFO (last in, first out) or unordered. Used inside the engine’s TaskGraph.

Chaos::TListThreadSafe is a lock-free, single-linked list, that is read out by being extracted as a TList (a convoluted setup; see the section on TList for how this is done). Seems to support Multi-producer Multi-consumer scenarios.

FCriticalSection and FScopeLock

FCriticalSection (Unreal’s implementation of std::mutex) prevents race conditions by blocking other threads from entering a locked section until the current one has exited it. This is fast until two threads enter the same lock concurrently, which pauses all but one (contention). Deadlocks can also occur when using multiple mutexes, if different threads each grab a lock required for another to proceed.

FCriticalSection Section;

uint32 FMyThread::Run() {
  Section.Lock();
  /* Threadsafe section */
  Section.Unlock();
  return 0;
}

FScopeLock will perform the same operation, except it will unlock by itself once the scope has been exited (i.e. the brace closes).

FCriticalSection Section;

uint32 FMyThread::Run() {
  {
    FScopeLock Lock(&Section);
    /* Threadsafe until the closing brace */
  }
  return 0;
}

FScopeLock is safer to handle due to automatically releasing the lock, but FCriticalSection is also useful in being able to wrap tightly around the parts requiring threadsafe access.

FSpinLock (UE5)

FSpinLock is a variant of FCriticalSection that repeatedly tries to acquire the lock instead of putting the thread to sleep. Useful only in the specific circumstance where your lock covers little code, and it is faster to wait by keeping the CPU busy rather than sleeping.

TSharedPtr

TSharedPtr provides a mode (ESPMode::ThreadSafe) in which its reference counting is done atomically. This, as is general with atomics, comes with a performance impact.
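For example (FVector as a stand-in payload):

// Reference counting is threadsafe; the pointed-to contents still aren't
TSharedPtr<FVector, ESPMode::ThreadSafe> Shared = MakeShared<FVector, ESPMode::ThreadSafe>(1.f, 2.f, 3.f);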

Callbacks from threads


If you plan on doing more useful operations than calculating the first 50,000 prime numbers, you’ll probably want the game thread to be notified or run events when your thread has finished its calculations; but you might also have noticed there’s no straightforward method provided by the aforementioned classes (FRunnable’s Run() and Exit() both run in the new thread, and there are no “post-execution” functions in the game thread).

Direct

The first and simplest method is as basic as calling a function of your choice, from within or outside your thread class.

uint32 FMyThread::Run() {
  /* Threaded code */
  Actor->ThreadCallback();
  Actor->bIsBeingThreaded = false; // Preferably atomic
  return 0;
}

There is no technical limitation against that, with the one big caveat that since you are doing it from within a threaded function, this will also run inside the same thread. Therefore avoid directly calling code you did not intend to run threaded, and prefer writing the information you need to transmit into variables or containers (safer when atomic), so that other game thread functions may read it back on their next EventTick.

Async

Alternatively, you can wrap that call into an AsyncTask running in the game thread. This avoids most of the aforementioned issues.

AsyncTask(ENamedThreads::GameThread, [this, Vector] () {
  Actor->ThreadCallback(Vector);
});

Using AsyncTasks to repeatedly send very short messages can be somewhat inefficient, so avoid this approach if you expect to make callbacks 1000 times a second, and where a “direct” call writing into atomics would suffice.

Delegates

A Delegate is at its core a function pointer: calling it calls the function it is bound to, you can pass parameters, have a return value, etc. They can be Multicast (call multiple functions at once), Dynamic (serializable and blueprint-friendly, but slower), Sparse (lower memory usage if not bound), or Events (only the declaring class can call them); and be bound with UFunctions, UObjects, shared pointers, lambdas…

// Declaring a Singlecast, one-parameter void Delegate (done before classes)
DECLARE_DELEGATE_OneParam(FMyDelegate, FVector /*, ParameterName (optional) */ )

// Create, bind and execute (use .AddUFunction() and .Broadcast() if Multicast)
FMyDelegate Delegate;
Delegate.BindUFunction(this, FName("MyFunction")); // "this" can be replaced by any class
Delegate.ExecuteIfBound(FVector(Vector));

When you execute/broadcast, ‘listeners’ to that delegate will run like normal functions in the thread they are called from; wrap in an AsyncTask if you want to avoid that.

Lambdas

TFunctions (i.e. lambdas) allow you to pass a function anonymously, which is useful for example to avoid circular dependencies in your code.

/* Passing a function returning a vector anonymously through a TFunctionRef in the constructor */
// Caller
TFunctionRef<void (FVector)> Callback = [&](FVector Vector) {
  ThreadCallback(Vector); // What the lambda will execute when called
};
auto* Thread = new FMyThread( /*Parameters, */ Callback );

// MyThread .h
FMyThread( /*Parameters, */ TFunctionRef<void (FVector)> Callback) : Callback{Callback} {}

TFunctionRef<void (FVector)> Callback; // Can be called like a normal function; the referenced lambda must outlive the thread

Beware that they will also run in the same thread they are called from. TFunction stores a copy of the function, while TFunctionRef avoids the copy cost by instead using a reference, and TUniqueFunction is moved in its entirety using MoveTemp().

Pausing, syncing and resuming threads


While it is possible to start a new thread for every discrete operation you may need calculations for, this is an expensive operation that also slows the game thread down when repeated. Prefer sending work in larger packages, or keeping the thread alive continuously by putting it to sleep until another call asks it to resume work.

If you’ve implemented a threadsafe callback after killing a thread, wait at least one frame before deleting objects/actors it shares references with as some of its internal logic may still be running.

Sleep

Somewhat similarly to blueprints’ delay node, Sleep() puts the thread it is called in to sleep without consuming resources. You might prefer not calling it from within the game thread if you do not intend to create a freezing simulator.

FPlatformProcess::Sleep(0.01f); // Time in seconds

Example usage is to check at regular intervals whether a boolean asking to resume has been set to true, and otherwise continue sleeping, as sketched below.
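For instance, a sketch inside Run(), with bResume an assumed flag set from the game thread (preferably atomic):

uint32 FMyThread::Run() {
  while (!bShutdown) {
    if (!bResume) {
      FPlatformProcess::Sleep(0.01f); // Doze, then check again
      continue;
    }
    /* Work */
  }
  return 0;
}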

ConditionalSleep() allows defining directly the condition to check for, through a lambda:

bool bResume = false;

FPlatformProcess::ConditionalSleep([this]() -> bool {
  return bResume;
});

Suspend and WaitForCompletion

FRunnable provides two functions to suspend a running thread, or synchronize another thread with it:

Thread->Suspend(true); // Suspends or resumes depending on the passed value
Thread->WaitForCompletion(); // Stalls the caller until the thread has completed

Suspend() seems to be only effective when the platform doesn’t support multithreading and instead implements FFakeThread or FSingleThreadRunnable, through disabling the Tick() on that ‘thread’.

FScopedEvent and FEvent

Once declared, an FScopedEvent will sleep the thread when the scope exits; you can then pass its pointer with Event.Get() to another thread and call Event.Trigger() inside it to resume execution.

{
  FScopedEvent Event;
  DoThreadedWork(Event.Get());
  /* Sleep at the closing brace until the other thread calls Event.Trigger(); */
}

FEvent similarly uses Event->Trigger(), but the thread can be put to sleep at any point using Event->Wait(). Since the class is abstract, you need to grab and return instances from a pool by pointer.

FEvent* Event;

FMyClass() { // Constructor
  Event = FGenericPlatformProcess::GetSynchEventFromPool(false);
}

~FMyClass() { // Destructor
  FGenericPlatformProcess::ReturnSynchEventToPool(Event);
  Event = nullptr;
}

Example usage is to force a worker thread to finish its work before the main thread can resume, or at the opposite end to sleep a worker thread until it is called again by the main thread.
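A sketch of the signaling pattern with the pooled event above:

// Worker thread: sleep until signaled
Event->Wait(); // Optionally Wait(Milliseconds)
/* Resume work */

// Game thread (or any other): wake the worker
Event->Trigger();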

PERFORMANCE

1. Containers



A major challenge with container access lies in whether functions take constant O(1), logarithmic O(log n) or linear O(n) time to execute, with n being the number of elements stored.

The textbook example to avoid is:

for (auto& n : ArrayA) { // For all n elements
  if (ArrayB.Contains(n)) // Checks up to all n elements, stops at the first match
    ArrayA.Remove(n); // Checks all n elements and removes if matching (also unsafe: modifying a container while range-iterating it)
}

Where inside a loop over all n elements, you’ve nested an O(n) Remove() and another up-to-O(n) Contains(). This will not scale well with big TArrays. Using different approaches or switching container types is the way to go; see the sketch below.
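For instance, a sketch of one faster equivalent, trading a bit of memory for hash lookups:

const TSet<int> SetB(ArrayB); // O(n) to build, O(1) lookups afterwards
ArrayA.RemoveAll([&SetB](int n) { return SetB.Contains(n); }); // Single pass over ArrayA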

Regular


Those containers can be accessed or created directly in blueprints.

Be warned however that big UPROPERTY containers can sometimes cause performance issues in editor (especially with EditAnywhere), and in general I recommend doing A/B tests against such situations in Shipping too, as removing the UPROPERTY is an easy but somewhat obscure fix.

TArray

TArrays are the bread and butter of most projects, but there are some things to watch out for:

Operations such as AddUnique() and Remove() have to scan the entire array at every call to make sure there are no duplicates, which is a big performance hog. You can improve on this by using RemoveSingle() (stops at the first find), or even Add() and RemoveAt(), which do not have to make expensive comparisons that scale poorly with many elements.

Using RemoveSwap(), RemoveAtSwap() or even RemoveSingleSwap(), you can also avoid the overhead of the array shifting elements to compact gaps, but this will not preserve the elements’ order; RemoveAt() also has an optional bAllowShrinking bool you can disable. A TSet is a better alternative if you don’t need an ordered container, especially if you make regular or nested Contains() or Remove() calls.

Add() and Insert() create a copy of the passed value which is efficient for small types like int, while Emplace() and EmplaceAt() construct a new instance inside the container, which is better for types bigger than 64-bit as it avoids an expensive temporary copy. Technically, Add() gets the value by reference, checks its validity and internally calls Emplace(). Both still work through copy if you do not use MoveTemp().
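For instance (illustrative; the gain only matters for types bigger than an FVector):

TArray<FVector> Array;
Array.Add(FVector(1, 2, 3)); // Builds a temporary FVector, then copies/moves it in
Array.Emplace(1, 2, 3); // Constructs the FVector directly inside the array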

AddUninitialized(), InsertUninitialized() and SetNumUninitialized() (and their Zeroed equivalents) can also improve performance if you plan on overwriting the contents before doing any meaningful reads.

TSet

If you’re fine with your container not being ordered, TSets can be incredibly useful. With many elements, they can be orders of magnitude faster (constant time) at Add(), Remove() and Find() operations than a TArray.

This makes TSets very useful to store large amounts of unique items and make regular comparisons to other containers. For example, iterating through a TArray to see if an element is contained by a TSet is substantially faster than the reverse, or than comparing to another TArray. Sets are however less compact in memory than arrays.

You can also enable duplicate elements using DefaultKeyFuncs:

TSet<FVector, DefaultKeyFuncs<FVector, true>> MySet;

Note that both MySet[] (by FSetElementId) and MySet.Find() work to access a key, but the latter returns a pointer that needs to be dereferenced to get the key, or nullptr if the value is not in the set.
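For example:

if (FVector* Found = MySet.Find(FVector(1, 2, 3))) {
  /* *Found is the stored key */
}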

TMap

A TMap combines the features of a TSet (unordered, unique, constant time) with the ability to pair the key with a second element of any type. Doing Add() on a preexisting key will overwrite the contents of the second element with the provided value, or with a default-constructed one if left empty.

As one of the advantages of being unordered, removing keys leaves gaps in memory that will be filled again when others are added, reducing the amount of expensive resizings and reallocations of the whole container.
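A small usage sketch:

TMap<FIntVector, float> Map;
Map.Add(FIntVector(0, 0, 0), 1.f); // Overwrites the value if the key already exists
float* Value = Map.Find(FIntVector(0, 0, 0)); // nullptr if the key is absent
Map.FindOrAdd(FIntVector(1, 0, 0)) += 1.f; // Default-constructs the value on first access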

C++ only


The following containers are not directly accessible through blueprints, but you can make your own UFUNCTION wrappers to rectify that.

TStaticArray

A TStaticArray has its size predefined at compilation, which doesn’t necessarily change much in terms of performance, but makes the intent clearer as to what it should and should not do. Pointers to it will also never become invalid due to reallocations.

TStaticArray<FVector, 32> MyArray;

TSparseArray

A TSparseArray couples a TArray with an internal TBitArray of bools to store which indices are allocated.

It is unordered as added elements get assigned the first vacant index, and memory usage is similar as it still allocates space for all potential elements in a contiguous index range. Adding and removing is very fast (constant time) as the non-contiguous indices allow gaps to appear and fill again later without shuffling elements, and iteration is fast thanks to the TBitArray.

Since TSet encapsulates a TSparseArray in a hashing system that also allows constant time finding of elements, you might only prefer a sparse array if you’re overly concerned with the added memory usage and don’t need fast finds.
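A short sketch of the slot-reuse behavior:

TSparseArray<FVector> Sparse;
int32 Index = Sparse.Add(FVector::ZeroVector); // Fills the lowest free slot and returns its index
Sparse.RemoveAt(Index); // Leaves a gap; other elements don't move, pointers stay valid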

TIndirectArray

In a regular TArray (equivalent to std::vector<Type>), all instances of Type are stored directly into the container, and have to be reconstructed when the array grows or its elements shuffle around. This is efficient if the sizeof(Type) is small like int (4~8 bytes), but becomes an issue with bigger objects, structs or nested containers.

TIndirectArray (equivalent to std::vector<Type*>) alleviates that by storing pointers to its elements instead of the elements themselves. When the elements are shuffled or reallocated, only pointers to them (8 bytes) have to be moved around instead of the actual elements.

Indirect arrays are very often used in source files to hold another dynamic container, a struct containing multiple dynamic containers or an FString (which is a TArray<TCHAR>); this is because of the size of the objects owning the containers (an std::vector is a compile-constant 24 bytes, containing start and end pointers to the allocated memory and some metadata), but this is relevant to any object with a large sizeof(). Doing this to store ints would be wasteful: reallocations would not perform better as ints are smaller than or of similar size to pointers, storing the pointers would take additional memory, and the added dereferencing on access also impacts performance.

This is a less relevant concern for TSet and TMap, as they are unordered and do not have to shuffle elements nearly as often.

TChunkedArray

A TChunkedArray stores its elements in discrete chunks of memory (themselves held in a TIndirectArray), so that adding new elements creates new chunks but never reallocates old ones. Useful if you want to keep piling up large types without repeatedly reallocating a progressively larger array, are concerned about allocation failures from memory fragmentation, or would like to keep pointers valid.
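A minimal sketch, assuming the default chunk size:

TChunkedArray<FVector> Chunked; // Optional second template parameter sets the target bytes per chunk
int32 Index = Chunked.AddElement(FVector::ZeroVector); // New chunks are created as needed; old ones never move
Chunked[Index].X += 1.f;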

TMultiMap and TSortedMap

TMultiMap is a variant of TMap that can store duplicate keys, with similar overall performance.

TSortedMap stores unique, automatically ordered elements, but as its underlying type is based on TArray it also shares its performance setbacks on adding and removing elements in big containers (O(n)), though finding is faster (O(Log n)). There is no variant of it for duplicate elements.

TList and TLinkedList

Lists allow adding and removing elements with constant time speed and without reallocating memory (and as such without invalidating pointers), but random access is impossible and elements are not stored contiguously which increases memory usage.

A TList is just a struct that stores a List->Element value and a List->Next pointer to the next link; only those two members exist, and it is up to you to create new elements and assign the pointers. Some examples to clarify:

// Creating
TList<int>* NewElement = new TList<int>(10, nullptr); // New List with nullptr Next

// Emptying
void ExtractList(TList<int>* List, TArray<int>& OutArray) {
  while (List) { // While non-empty
    OutArray.Add(List->Element); // Copy current element value to OutArray
    TList<int>* Elem = List; // Store a pointer to the current element
    List = List->Next; // Move to the next
    delete Elem; // Delete the current
  }
}

TLinkedList is a more fleshed-out version with dedicated linking functions. TDoubleLinkedList allows iterating backwards, like an std::list.

TCircularBuffer

TCircularBuffer is the underlying type of TCircularQueue. Useful if you want an equivalent to a FIFO TQueue with the constant time access of a circular buffer (doesn’t shuffle elements around as it is fixed-sized), but without the multithreading overhead. A list is preferable if you want constant time access but dynamic size. Isn’t really used much in the source files.

Avoiding copies


MoveTemp

One of the greatest innovations of C++11 is Move Semantics, which allows displacing the contents of one container into another by cannibalizing it, instead of making a slow, deep copy of it.

In Unreal, MoveTemp() can be used to move the contents of an entire container into another’s (this does not invalidate the old container):

TArray<int> NewArray = MoveTemp(OldArray); // "=" has constructor/assignment overloads
TSet<int> NewSet = MoveTemp(OldSet); // Old containers will be left empty
TMap<int, FVector> NewMap = MoveTemp(OldMap);
FMyStruct NewStruct = MoveTemp(OldStruct); // UStructs also have that overload

It also makes Append() more efficient, as otherwise that function makes an expensive deep copy of the container:

NewArray.Append(MoveTemp(OldArray)); // Moves OldArray’s contents instead of copying it

But, and that’s where Unreal’s documentation is thoroughly lacking, you can also move individual elements in and out of containers with MoveTemp(), as the function overloads already exist for it!

NewArray.Add(MoveTemp(Elem)); // All insertion functions have move overloads
NewSet.Add(MoveTemp(Elem));
NewMap.Add(i, MoveTemp(Elem));
Queue.Enqueue(MoveTemp(Elem));

For moving out, it’s more complex as the container does not necessarily know the element has been moved out if you do not use a member function; you should usually remove the used index/key afterwards:

int i = Array.Find(Elem); auto Out = MoveTemp(Array[i]); Array.RemoveAt(i); // or RemoveAtSwap
FSetElementId i = Set.FindId(Key); auto Out = MoveTemp(Set[i]); Set.Remove(i);
auto Out = MoveTemp(Map[Key]); Map.Remove(Key);

Some containers provide member functions which perform MoveTemp() internally:

// With a return value (optimized away thanks to RVO)
auto Out = Array.Pop(); // Returns the last element
auto Out = Map.FindAndRemoveChecked(Vector); // Throws if not found

// With a parameter passed as reference (moves into the uninitialized ref)
int Out; Map.RemoveAndCopyValue(Key, Out); // Doesn’t actually copy, nice misnomer
int Out; Queue.Dequeue(Out);

// Simplified internals of Dequeue: Moves the popped item into the out ref
bool Dequeue(ItemType& OutItem) { OutItem = MoveTemp(Popped->Item); return true; }

Return Value Optimization (RVO) will optimize away the return values for you, so you do not need to do MoveTemp(Array.Pop()). This also means you should avoid using move semantics on returns. Note that you cannot move an std::atomic, for very good reasons.

On large types or structs, this is a godsend: not only can you avoid expensive copies, you can also grab and return discrete elements at your leisure. A usage example is having a big container storing calculation results in the game thread, handing out individual elements to async threads and grabbing them back when finished, without copies, mutexes, or risking having threads read into a container that could reallocate at any time.

Usually only nested dynamic containers (heap-allocated) or structs containing dynamic containers benefit from MoveTemp(); it is wasted or even counter-productive on trivial types like int.

new

The new and delete keywords allow creating variables and containers directly on the heap, which means you are no longer bound by the scope of your function or class and can pass those pointers around classes freely.

auto* Array = new TArray<int>({2, 5}); // Array starting with two ints
auto* ArrayB = new TArray<int>(MoveTemp(OldArray)); // Array starting with another’s contents

Array->Add(10); // Safe to perform on a new TArray
delete Array; // Always delete after use!

However, this makes you enter the realm of dynamic memory management, so always ensure to keep your pointers at hand and use delete after you are done to avoid memory leaks. This otherwise only gets cleaned after the entire program shuts down.

There are also instances in the engine’s source of C-type arrays of TArrays.

auto* CArrayOfTArray = new TArray<int>[8]; // C array containing 8 TArrays

CArrayOfTArray[2].Add(10);
delete[] CArrayOfTArray; // Use delete[] instead

Call delete[] instead in such a case.

FMemory

The TArray wiki page has an interesting example which uses FMemory to copy a C array’s contents into a TArray:

int32 In[] = { 2, 3, 5, 7 };
TArray<int32> Out;
Out.AddUninitialized(4);
FMemory::Memcpy(Out.GetData(), In, 4*sizeof(int32));

Memmove() is a variant that additionally supports overlapping source and destination ranges (despite the name, it is unrelated to C++ move semantics):

FMemory::Memmove(Out.GetData(), In, 4*sizeof(int32));

There are other interesting functions, such as Memzero() to zero out a range of elements, or Malloc() and Free(), which replace C++’s native malloc functions.

Containers reallocation

An std::vector can be made to reallocate on grow through move semantics, if the element class it contains has both its destructor and move constructor declared noexcept (or if that class is trivial or default-constructed); otherwise the vector pessimizes and falls back to copy reallocations.

With Unreal containers, you will not see noexcept being used in the source files; they use bespoke allocator implementations for all of their functions, passed through FMemory.

The allocator used is declared in files like WindowsPlatformMemory.cpp, under BaseAllocator(): Shipping builds default to binned2 (a bin malloc), and the Editor to TBB (Intel’s Thread Building Blocks; scales with available processors). Also noteworthy are binned3 (64-bit only), Ansi (default C malloc), and Mimalloc (Microsoft’s open-source malloc).

Each allocator implements its own version of Realloc() as declared in MemoryBase.h, which in turns calls TryRealloc(). Most (binned2 included) use FMemory::Memcpy(), while TBB calls the lib’s scalable_aligned_realloc().

This means that for practical purposes Unreal containers will reallocate by copy when resizing, which is why it is notably advantageous to Reserve() your TArrays and make smart use of MoveTemp(). Otherwise, if you really need it, you can fall back on std::vector, as it does support move reallocation.

2. Code and Loops



A foreword: As a general rule, code compiled in DebugGame configuration will be slower than in Development Editor, which in turn is slower than standalone Shipping builds.

An accurate way to test your speeds is using std::chrono:

#pragma once
#include <chrono>
using namespace std::chrono;

class Timer {
public:
  void start() { t1 = high_resolution_clock::now(); }

  float stop() {
    auto t2 = high_resolution_clock::now();
    return duration_cast<microseconds>(t2 - t1).count(); // Elapsed microseconds
  }

private:
  time_point<high_resolution_clock> t1;
};
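Usage, with the log macro as one way to read the result out:

Timer T;
T.start();
/* Code to benchmark */
UE_LOG(LogTemp, Log, TEXT("Took %f us"), T.stop());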

Optimizing containers


If you’re reusing containers, consider declaring those as global variables to avoid reconstructing them entirely on every function iteration, and using Reset() instead of Empty() to keep the previous memory allocation. Using Init() to initialize with a specific value is also somewhat slower than SetNum().

Accessing methods

There are three ways to iterate through containers directly, without using Add() or the [] operator:

With a simple pointer:

auto start = Array.GetData(); // Equivalent to std vector.data()
auto end = Array.GetData() + Array.Num(); // Equivalent to std &vector.back()

With a generic iterator:

auto start = Array.CreateIterator(); // TIndexedContainerIterator
auto end = Array.CreateIterator(); end.SetToEnd();

With a checked iterator which checks against the container resizing during operation, but is slightly slower:

auto start = Array.begin(); // TCheckedPointerIterator
auto end = Array.end();

This last iterator is missing most of the expected functionality however, as it is primarily intended to be the underlying structure of range-based for loops.

In all three cases, note the difference between assigning a value and moving the pointer or iterator:

*start += 1; // Current element's value += 1
start += 1; // Pointer position += 1 (move to the next container element)

Accessing efficiency

Let’s assume a worst case scenario where you want a multidimensional TArray of elements, and compare different implementations from slowest to fastest:

Slowest: Using nested containers. Convoluted to work with and nothing the compiler can do to optimize further.

TArray<TArray<TArray<int>>> Array; // 3D array using nesting
// Let’s assume Size is an FIntVector you’ve used to SetNum() all the nested TArrays

for (int x = 0; x < Size.X; ++x) {
  for (int y = 0; y < Size.Y; ++y) {
    for (int z = 0; z < Size.Z; ++z) {
      Array[x][y][z] += 10;
    }
  }
}

Average: With a 1D array of cubed length, using the known dimensions in each axis to offset the reads and writes to emulate a multidimensional array. Easier to work with, UPROPERTY-compatible, less memory usage from not having types within types within types, and the compiler can optimize the offsets to bitshifts if the Size dimensions are powers of 2.

TArray<int> Array;
Array.SetNum(Size.X * Size.Y * Size.Z);

for (int x = 0; x < Size.X; ++x) {
  for (int y = 0; y < Size.Y; ++y) {
    for (int z = 0; z < Size.Z; ++z) {
      // Fastest when writing in sequential memory order; reorder the x/y/z loops if needed
      Array[x*Size.Y*Size.Z + y*Size.Z + z] += 10;
    }
  }
}

Fast: Replacing the [] operator with a pointer to which you add the expected memory offset, and then dereference. Unreal’s bracket operator is overloaded with bound-checking logic that makes it slower than an equivalent std::vector’s (at least outside of Shipping builds), so you can bypass that by accessing the memory location directly through pointers.

for (int x = 0; x < Size.X; ++x) {
  for (int y = 0; y < Size.Y; ++y) {
    for (int z = 0; z < Size.Z; ++z) {
      *(Array.GetData() + (x*Size.Y*Size.Z + y*Size.Z + z)) += 10;
    }
  }
}

Faster: If you assume elements in your array are always stored and retrieved in the same order, you can actually abstract away the entire function for offsetting and write linearly through iterators, for example with a range-based for loop or with Array.begin().

for (auto& i : Array) { // Assumes you will write AND read later in the same order
  i += 10; // Uses a checked iterator
}

Fastest: Range-based for loops are slightly slower than regular ones, as the TCheckedPointerIterator they use has to always make sure the container wasn’t resized while iterating. You can maximize speed by using a raw pointer to the array and a regular loop instead.

auto p = Array.GetData(); // May sometimes be faster if moved inside the loop’s scope
const int Num = Size.X*Size.Y*Size.Z; // Precompute the bound (see Precalculating below)

for (int i = 0; i < Num; ++i) {
  *p += 10; ++p; // Assign the current element, then move to the next one
}

Footnote: If you use std::vector, vector.at() is slower than vector[] as it also does bound-checking.

Adding efficiency

Every time you call Add(), there’s a chance the container exceeds its internal reserved space and decides to reallocate.

If you are using regular functions such as Add(), you can reserve memory by big increments to reduce that impact:

TArray<int> Array;
Array.Reserve(1024); // Guarantees there is enough memory for 1024 elements

for (int i = 0; i < 1024; ++i) {
  Array.Add(10); // No reallocation until the reserved memory is exceeded
}

If you are accessing through braces, iterators or pointers, use SetNum() or SetNumUninitialized(), as you may otherwise run over elements outside the container’s bounds.

Array.SetNumUninitialized(1024); // Doesn't call the Constructor, faster
auto p = Array.GetData(); // Call after SetNum(), otherwise this becomes nullptr!

for (int i = 0; i < 1024; ++i) {
  *p = 10; ++p; // *p has no initial value here, don't use +=
}

Be warned that once you resize a container, all pointers toward it become invalid; either avoid the situation or grab them again after size changes.

Optimizing loops


Variable types

int and size_t are platform-dependent, but on modern machines it’s usually:

  • Signed 32-bit for int, whether or not the platform supports 64-bit.

  • Unsigned for size_t, 32 or 64-bit depending on the platform (always the biggest type supported).

Both can be as low as 16-bit on certain platforms, but it’s unlikely you will ship Unreal on those.

int8, int16, int32 and int64 (and their uint unsigned equivalents) allow you to guarantee the size in a platform-agnostic way. They are simple aliases for respectively char, short, int (or long) and long long. int32-likes use Unreal-defined aliases, while the suffixed int32_t use std ones.

int is usually the fastest signed type and size_t the fastest unsigned one, as both scale to best fit a given platform’s memory register. Smaller types are not faster but take less memory, while types bigger than the platform’s register are slower as the CPU has to use instructions to ‘emulate’ a size it doesn’t natively support. Types such as int_fast8_t guarantee fastest execution with at least n bits (usually just int).

Using explicit casts such as static_cast<int>(MyEnum) adds no overhead at runtime, beside the inherent cost of the conversion itself.

const

Most articles online agree that const is insignificant for performance, but it helps clarify the intent in your code. const (runtime) or constexpr (compile-time) global variables however are useful, as they get inlined into code instead of using address calls to check whether the value has changed.

Inlining

Let’s say you’ve split away part of your loop logic into functions:

// .cpp
int AMyClass::GetID(int x, int y, int z) {
  return x*Size.Y*Size.Z + y*Size.Z + z; // Function will be called at runtime
}

void AMyClass::Loop() {
  for ( /* x, y, z */ ) {
    Array[GetID(x, y, z)] += 10;
  }
}

If declared in the .cpp, there’s a high chance your loop will run slower, as each iteration has the added cost of the function calls, and because the compiler cannot optimize and vectorize the loop as efficiently since the function is now a different structure in memory, or could make unexpected changes to global variables.

The solution is to inline the function, which is most easily done by moving it into the header:

// .h
inline int GetID(int x, int y, int z) { // inline keyword is redundant within a class body
  return x*Size.Y*Size.Z + y*Size.Z + z; // Code will be copied in place at compilation
}

This will instead place the function’s code directly at the call site when compiling. This effectively makes it as if you wrote it there yourself, and allows full optimizations to be applied. It is up to the compiler to decide whether to do it, however; you can use the FORCEINLINE macro to force the behavior where possible. Prefer inlining only functions that are relatively small and called repeatedly.

Pointers and references

Prefer passing most function parameters other than the simplest types by reference (FVector&) or pointer (FVector*) if possible; this reduces the amount of expensive temporary copies at runtime. Both methods fare similarly in speed, so use whichever you prefer.

int GetID(const FVector& Vector) { // FVector* also works but requires dereferencing
  return Vector.X*Size.Y*Size.Z + Vector.Y*Size.Z + Vector.Z;
}

Local variables

The compiler cannot know that a global variable will not change while iterating, so making a local copy makes it explicit that no checks against that are needed, which improves performance.

bool bMyTest_l = bMyTest;

for (int i = 0; i < 1024; ++i) {
  if (bMyTest_l) { // Compiler knows this won't change since it's local to the function
    /* Code */
  }
}

Scoping

Declaring variables in the innermost scope available reduces the memory footprint of your variables, and puts them closer to where they will be accessed.

int x1_A; // Has to hold resources for longer than necessary

for (int i = 0; i < 1024; ++i) {
  x1_A = i+1;
  int x1_B = i+1; // Only exists and is accessible for the minimal required time
  /* Code */
}

Do not move calculations to the innermost loop when not required. Sometimes refactoring loops to be more concise and readable actually degrades performance; the example below shows how the position of a single comparison can have a substantial impact:

for (int x = 0; x < 32; ++x) {
  for (int y = 0; y < 32; ++y) {
    for (int z = 0; z < 32; ++z) {
      if (bMyTest) { /* Code */ } // bMyTest is compared 32768 times!
      else { /* Code */ }
    }
  }
}

// Less concise, but substantially faster
for (int x = 0; x < 32; ++x) {
  for (int y = 0; y < 32; ++y) {
    if (bMyTest) { // bMyTest is compared "only" 1024 times
      for (int z = 0; z < 32; ++z) { /* Code */ }
    }
    else {
      for (int z = 0; z < 32; ++z) { /* Code */ }
    }
  }
}

Precalculating

Precalculate extremely aggressively. Every single calculation, assignment, pointer location or other logic done even just twice can benefit from being declared once and have its result reused; even the smallest int.

// x+1, y+1, z+1, GetID() of those, AND the array access are inefficiently done twice
for ( /* x, y, z */ ) {
  if (Array[GetID(x+1, y+1, z+1)] == 0)
    Array[GetID(x+1, y+1, z+1)] += 10;
}
}

// Less concise, but maximizes performance
for ( /* x, y, z */ ) {
  const int x1 = x+1; const int y1 = y+1; const int z1 = z+1;
  auto* NextID = &Array[GetID(x1, y1, z1)];
  if (*NextID == 0)
    *NextID += 10;
}

Even subtle operations such as the implicit conversion of pointers to bools benefit from it:

bool PointerValid = MyPointer ? true : false; // True if MyPointer is not null
for ( /* x, y, z */ ) {
  if (PointerValid) { /* Code */ } // Slightly faster than "if (MyPointer)"
}

Ternary operator

The ternary operator, a more concise alternative to If/Else, is not in itself faster (both usually compile to the same assembly code), but allows assigning a variable in the same breath as it is declared, and by extension also declaring it as a const or reference; all of these plus other cases can improve performance.

bool ResultA; // Must be declared before being assigned due to scope restrictions
if (MyCondition) ResultA = true;
else ResultA = false;

const bool& ResultB = MyCondition ? true : false; // Allows declaring as const, &, etc.

Pointer aliasing

In situations where two pointers in a loop could target overlapping memory regions, pointer aliasing can occur which prevents a lot of usual loop optimizations from being performed. Use a restrict keyword in that case:

int* __restrict MyPointer;

__restrict is for MSVC and __restrict__ for GCC (restrict itself is deprecated in C++).

Bitshifts


Using bitwise operators can yield increased performance in different ways:

Multiplications and divisions

With unsigned integers, you can replace a multiplication, division or modulo operation by a power of 2 with <<, >> or &, which in theory is faster:

uint8 a = 2; // (binary 00000010)

a = a << 3; // a leftshifted by 3 becomes 16 (00010000)
a = a >> 2; // a rightshifted by 2 becomes 4 (00000100)
a = a & 1; // Equivalent to "a = a % 2;"

This is most efficient when shifting by a static amount defined at compilation. Practically however, modern compilers will do the same optimizations under the hood if the conditions match (unsigned by powers of 2), and current CPU architectures have dedicated hardware to handle fast multiplications at similar speeds. Using shifts to perform divisions can still net performance gains, and you can use uints with powers of 2 to help your compiler optimize.

Packed integers

Bitshifts and dedicated bitfield structs allow packing multiple smaller values into a single int at different offsets; ex. 4 int8 within an int32. This reduces memory usage and copies, and is for example useful for sending compact vertex calculations to the GPU.

uint32 vertex(uint32 x, uint32 y, uint32 z, uint32 type, uint32 light, uint32 norm, uint32 ao) {
  return ao << 30 | norm << 27 | light << 23 | type << 18 | z-1 << 12 | y-1 << 6 | x-1;
}

Example taken from the Binary Greedy Meshing repo (good stuff).

1ULL is equivalent to a 1 cast as an unsigned long long (i.e. uint64), which prevents a bitshift on an int (32-bit) from overflowing before being stored into a larger type.

uint64 a = 0;
for (int shift = 0; shift < 8*sizeof(uint64); ++shift) { // For all bits of a uint64
  a |= 1ULL << shift; // Set the current bit to true, by shifting a 1 read as uint64
}

Bitmasks

Bitwise | (OR) and & (AND) can be used to create bitmasks, which package bools as individual bits in an integer:

constexpr uint8 FLAG_1{ 0b00000001 }; // binary representation with only bit 1 set
constexpr uint8 FLAG_2{ 0b00000010 }; // binary representation with only bit 2 set

uint8 Flags = 0; // Type must be large enough to hold the flags!
Flags |= FLAG_1 | FLAG_2; // Set the bits for FLAG_1 and FLAG_2 to 1

if ((Flags & (FLAG_1 | FLAG_2)) == (FLAG_1 | FLAG_2)) { // If both FLAG_1 and FLAG_2 are set
  Flags &= ~FLAG_1; // Set the bit for FLAG_1 to 0
}
}

| and & are used to compare masks, by returning respectively the OR and AND of every bit in the left and right operands. |= and &= set single bits in the bitmask to true or false, by comparing the two operands and assigning the result to the left one in a single operation.

Bitsets

Using the STD library, the std::bitset and std::vector<bool> containers package an array of bools as single bits, which minimizes space allocation and can improve speed; std::bitset is faster but its size cannot be changed at runtime.

Unreal equivalents are TStaticBitArray and TBitArray; here’s an example inspired by the engine’s source:

TBitArray<> Set; Set.Init(false, 128); // Or TStaticBitArray<128> StaticSet;
Set.SetRange(10, 6, true); // Sets all 6 bits starting at index 10 to true

uint32 NumWords = FBitSet::CalculateNumWords(Set.Num()); // Number of uint32s in the bitset
uint32 Hash = NumWords;
const uint32* Data = Set.GetData(); // Pointer to internal array of NumWords uint32s
for (uint32 i = 0; i < NumWords; ++i) {
  Hash ^= Data[i]; // Bitwise XOR assignment: Compares each uint32 to the Hash and stores the XOR of individual bits
}

SIMD


SIMD (Single Instruction Multiple Data), aka vector instructions, are the endgame of optimization for algebra: they allow you to process whole groups of floats or integers through large registers, improving speeds by factors of up to 4, 8 or even higher, depending on the situation.

The main instruction sets are SSE, AVX, and AVX512, using 128, 256 and 512-bit registers respectively; which are available depends on the platform’s hardware and the compiler settings. Just with SSE, you can for example process 2 doubles, 4 floats or int32s, or even 16 int8s for the price and clock cycles it would take to do one; double or quadruple that with AVX.

Unreal defaults to SSE due to it having the most widespread support. You can enable AVX by adding bUseAVX = true to your project’s Build.cs, but this also seems to require modifying the engine’s UE4Game.Target.cs and recompiling it from source. Doing so makes the VCToolChain.cs add the /arch:AVX flag to the compilation arguments for MSVC (this might therefore only apply to platforms that compile with the MSVC pipeline).

Functions

SIMD is famously finicky to learn and implement, and spending time understanding Intrinsics can be very daunting; writing an abstract on SIMD implementations would probably be as long as the entirety of the document up to this point, so we’ll dispense with that.

Unreal wraps a number of SSE Intrinsics through FMath; for example FMath::RoundToInt(float) internally calls UE4::SSE::RoundToInt() (or the SSE4 variant if available), which itself uses _mm_cvt_ss2si().

To my knowledge intrinsics are never used directly in implementations in Unreal, they always pass through wrappers like FMath that check for the platform availability. I recommend using your IDE to search the source files for ‘SSE’ and ‘AVX’ if you want to learn more; also take a look at sse_mathfun.h and the Align() function.

MSVC provides an optimize pragma to place before a function’s definition. It might or might not have any effect at all due to which flags the default toolchain uses, or may be using optimizations that are already on by default. I didn’t look into it much.

Libraries

The ‘default’ approach to vector instructions is through Intrinsics; as down to the machine and abstract as it gets. Intel Intrinsics work with both Intel and AMD CPUs as their underlying architectures are similar, but ARM (mobile) platforms use Neon Intrinsics instead.

Major compilers have extensions for vectorization pragmas and functions:

  • OpenMP in MSVC. The /openmp compiler option might be disabled by default in Unreal, requiring recompiling from source.

  • std::experimental::simd in GCC.

Third-party libraries also exist to abstract away some of the complexity of Intrinsics.

In all cases, be sure that what you want to do isn’t already doable through the platform-agnostic classes Unreal provides.

3. Modularity



There’s no hard rule against using std types (and the engine’s source makes ample use of them), but for containers it is usually recommended to use Unreal’s equivalents for a number of reasons:

  • They support the reflection system, and as such can be made accessible from blueprints, can safely store UObjects and AActors, etc.

  • They can be easily serialized (written to and loaded from archives).

  • They interoperate well with the large swath of bespoke Unreal features.

  • They come in an immense breadth of types, as the wall of text before this might indicate. What you want to do is very probably supported, if not necessarily documented.

But if you do not require any of the above, or want an easier time linking external libraries to your project, and so on, anything goes. For portability concerns, I doubt there’s much of a difference; most std containers are in use in the source files too, so your code wouldn’t be the first to break. One notable difference is that std containers can be made to reallocate by move, whereas Unreal’s are copy-only.
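As a minimal sketch of the GC and reflection point (AMyActor is hypothetical, and the usual generated.h boilerplate is omitted):

UCLASS()
class AMyActor : public AActor {
  GENERATED_BODY()
public:
  // Reflected: the GC sees these pointers and keeps the actors alive, and
  // the array can be exposed to blueprints.
  UPROPERTY(BlueprintReadWrite, Category = "Example")
  TArray<AActor*> TrackedActors;

  // Invisible to reflection: these raw pointers can dangle once the GC
  // collects the actors elsewhere.
  std::vector<AActor*> UntrackedActors;
};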

Nested containers


Resizing a nested array will never affect its parent: a container holding another container (or a struct of containers) only stores objects of compile-time-constant size, which hold pointers to their actual heap-allocated data.
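A quick sketch of what that means in practice:

TArray<TArray<int32>> Nested;
Nested.SetNum(4);       // four empty inner arrays
Nested[2].SetNum(1000); // reallocates only that inner array's heap block;
                        // the parent's own allocation is untouched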

Some examples of nested containers usage found in the engine’s source (and many more):

// TArray of TArray/TSet/TMap (there’s even some triple-nested TArrays)
TArray<TArray<int>> A; TArray<TSet<int>> B; TArray<TMap<int,FVector>> C;
// TQueue of TArray/TSet/TMap
TQueue<TArray<int>> D; TQueue<TSet<int>> E; TQueue<TMap<int,FVector>> F;
// TStaticArray of TArray/TSet
TStaticArray<TArray<int>,32> G; TStaticArray<TSet<int>,32> H;
TSet<TArray<int>> I; // TSet of TArray
auto* J = new TArray<int>[8]; // Dynamic C array of TArray
TQueue<int> K[8]; // Static C array of TQueue

Not found in the source files:

TMap<TArray<int>,FVector> X; // TMap of TArray/TSet/TMap
std::vector<TArray<int>> Y; // std::vector of TArray/TSet/TMap (or reverse)
TArray<int> Z[8]; // Static C array of TArray/TSet/TMap

Unreal does not allow you to mark nested containers as UPROPERTY. UObjects and AActors that are not referenced in at least one variable or container marked as UPROPERTY are not GC-safe and can be garbage-collected at any time.
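Concretely (the struct workaround in the next section gets around this):

// Rejected by the Unreal Header Tool:
// UPROPERTY()
// TArray<TArray<AActor*>> Nested;

// Compiles, but the inner pointers are invisible to the GC, so the actors
// they point to can be collected at any time:
TArray<TArray<AActor*>> Nested;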

Structs


Unlike nested containers, you can make a TArray of a struct, declare any variable or container as a member, and mark everything up for reflection (USTRUCT on the struct, UPROPERTY on its members and on the array) to access it all in blueprints:

USTRUCT(BlueprintType)
struct FMyStruct {
  GENERATED_BODY()
  FMyStruct();
  UPROPERTY(BlueprintReadWrite)
  TArray<FVector> MyArray;
};

UPROPERTY(BlueprintReadWrite)
TArray<FMyStruct> MyArrayOfStruct; // Blueprint-compatible
MyArrayOfStruct[i].MyArray.Add(FVector());

You can also define a Constructor and set default sizes for containers in newly created instances of your struct:

inline FMyStruct::FMyStruct() { // Constructor of FMyStruct
  MyArray.SetNum(Size); // Size can be a global constexpr int32
}

As with big containers, marking every sub-element in large structs as UPROPERTY can cause performance concerns.

Actors


Keep in mind that spawning Actors and Components is expensive (hundreds of milliseconds for 1k+ actors), even if they are nearly empty, and that this operation cannot be multithreaded. If you can get away with one Actor spawning multiple Components, or even 1-Actor 1-Component setups that repeat the same functions and reuse variables and containers, you will see benefits.
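A sketch of the one-Actor-many-Components approach (assuming a valid UWorld* World; names are illustrative):

void SpawnManyMeshes(UWorld* World) {
  // One actor owns all the visuals instead of one actor per mesh.
  AActor* Owner = World->SpawnActor<AActor>();

  USceneComponent* Root = NewObject<USceneComponent>(Owner);
  Owner->SetRootComponent(Root);
  Root->RegisterComponent(); // makes the component live in the world

  for (int32 i = 0; i < 100; ++i) {
    UStaticMeshComponent* Mesh = NewObject<UStaticMeshComponent>(Owner);
    Mesh->SetupAttachment(Root); // attach before registering
    Mesh->SetRelativeLocation(FVector(100.f * i, 0.f, 0.f));
    Mesh->RegisterComponent();
  }
}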

ChildActors are more expensive to spawn than regular Actors, and creating Actors or Components through C++ is somewhat faster than through Blueprints.

Consider taking a look into ECS-like (Entity Component System) architectures.



I did say condensed.


Also, google doc version with more colors.


Gold, thx for this. <3

Wow this is golden, thank you so much for sharing this! This should be turned into some kind of compendium like the Unreal Networking Compendium :smiley:

Amazing guide, thank u.

U seem to know a lot about UE threading, u dont happen to also know more about the new physics async thread? With all the callbacks, sim input/output etc.

Cuz there is also criminal lack of documentation of how this thing works, i know its in the character movement component and used by chaos vehicle plugin.

If u know anything detailed about that like this guide then please, do share.

Thank you for sharing! I am trying to understand async tasks in PCG plugin and this helps a lot!

Hey, man. How do you get the return value from Async?
For example:

AsyncTask(ENamedThreads::AnyHiPriThreadNormalTask, [FindedActors, this]()
{
    // something
});

How do I call a function (callback) once an “AsyncTask”, a

FGraphEventRef DoThisTask = FFunctionGraphTask::CreateAndDispatchWhenReady

or a UE::Tasks::TTask is finished?

You could have a class called AsyncMaker which has a delegate and a function that calls this AsyncTask. You call the delegate inside the AsyncTask block and bind to the delegate from the AsyncMaker class. Of course, you need to make sure the AsyncMaker instance doesn’t get GCed.

I found a solution for the callback using an additional async :slight_smile:

AsyncTask(ENamedThreads::AnyHiPriThreadNormalTask, [SelectedActors0, DelegateQ, this]()
{
    if (DelegateQ.IsBound())
    {
        // Call back from the GameThread
        AsyncTask(ENamedThreads::GameThread, [SelectedActors0, DelegateQ]()
        {
            UE_LOG(LogTemp, Warning, TEXT("AsyncTask Delegate"));
            DelegateQ.ExecuteIfBound(SelectedActors0);
        });
    }
});

RunAsyncTest2(FAsyncDelegateZ::CreateUObject(this, &USelectionToolAsync::HandleAsyncSave));

// DelegateQ is declared as:
DECLARE_DELEGATE_OneParam(FAsyncDelegateZ, const TArray<AActor*>&);


Nice article and great work! It has helped me to understand this system A LOT.

Regarding this part, I shall explain it in my understanding for future readers, using an analogy to demonstrate the relationship. You can skip it if you prefer the technical terms. Also, please correct me if I am wrong.

The analogy

Assume you are at a restaurant and about to order. After you complete your order with the waiter, the chef will make you a TPromise. This promise is that, eventually, you will get your food. And each promise object has a SetValue() function that can only be called ONCE; in our analogy, by the chef, to “fulfill this promise”.

Meanwhile, the waiter will give you a token, a TFuture, which can be used to query the result of the associated Promise. So you can do something else (async calls), or simply do nothing and query the result every tick.

To summarize:

  • Ordering (Promise): You make a promise to get a result (place an order).
  • Fulfillment (Promise.SetValue()): The entity producing the result fulfills the promise by setting the result (chef preparing the food).
  • Token (Future): You receive a token (Future) representing your order.
  • Retrieving (Future.Get()): You use the token (call Get()) to retrieve the result, blocking if needed (waiting at the chef’s counter).

Promise:
- Represents the commitment to produce a result in the future.
- Created by the entity initiating the asynchronous task (task producer).
- Allows the task producer to fulfill the promise with a result using SetValue() (std::promise also lets you reject it with an error via set_exception()).

Future:
- Represents the result of an asynchronous operation.
- Created by the entity using or waiting for the result of the asynchronous task (task consumer).
- Allows the task consumer to retrieve the result using Get().

Remember that a promise cannot be abandoned; it needs to be completed. So creating too many promises at one time will cause a performance drop. Also, neither the TFuture nor the TPromise holds the operation itself.
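For future readers, a minimal end-to-end sketch of the pair in Unreal types (names are illustrative):

#include "Async/Async.h"  // Async()
#include "Async/Future.h" // TPromise / TFuture

void RunPromiseExample() {
  TPromise<int32> Promise;                     // the chef's commitment
  TFuture<int32> Future = Promise.GetFuture(); // the customer's token

  // Fulfill the promise on a pool thread; SetValue() may only be called once.
  Async(EAsyncExecution::ThreadPool, [Promise = MoveTemp(Promise)]() mutable {
    Promise.SetValue(42);
  });

  const int32 Result = Future.Get(); // blocks until SetValue() runs
}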

You can check the example of using them in the PCG plugin at PCGAsync.cpp line 46 in UE 5.2.

// Setup [current, last, nb points] data per dispatch
TArray<TFuture<int32>> AsyncTasks;
AsyncTasks.Reserve(NumFutures);

Amazing detail. Thank you for compiling all this knowledge into one thread. :pray::pray: