Course: Neural Network Engine (NNE)

This is very useful information, thank you. I might have misunderstood when and where to pick a runtime type. I’m running a couple of models that aren’t image/rendering related: they analyze (preprocessed) input data and provide predictions that I postprocess and convert into in-game actions. The inference is run ad hoc when an event is triggered (not on tick or on a regular basis). My models run fine on the CPU, but I’d rather use the GPU if possible (to free up the CPU and because it should be faster (TBD)).

On my dev PC, I have an NVIDIA card so the models run directly on the GPU, but I figured that users without a CUDA interface would have to use RDG runtimes instead. Did I misunderstand?

Here is my very basic/crude runtime selection function (for visibility in case others are trying to do the same):

UNeuralNetworkModel* UNeuralNetworkModel::CreateModel(UObject* Parent, UNNEModelData* ModelData)
{
	using namespace UE::NNECore;

	TArray<TWeakInterfacePtr<INNERuntimeGPU>> GPURuntimes;
	TArray<TWeakInterfacePtr<INNERuntimeRDG>> RDGRuntimes;
	TArray<TWeakInterfacePtr<INNERuntimeCPU>> CPURuntimes;

	TArrayView<TWeakInterfacePtr<INNERuntime>> Runtimes = GetAllRuntimes();
	for (int32 i = 0; i < Runtimes.Num(); i++)
	{
		if (Runtimes[i].IsValid())
		{
			if (auto CPURuntime = Cast<INNERuntimeCPU>(Runtimes[i].Get()))
			{
				CPURuntimes.Add(CPURuntime);
				UE_LOG(LogTemp, Warning, TEXT("CPU runtime available: %s"), *Runtimes[i]->GetRuntimeName());
			}
			else if (auto GPURuntime = Cast<INNERuntimeGPU>(Runtimes[i].Get()))
			{
				GPURuntimes.Add(GPURuntime);
				UE_LOG(LogTemp, Warning, TEXT("GPU runtime available: %s"), *Runtimes[i]->GetRuntimeName());
			}
			else if (auto RDGRuntime = Cast<INNERuntimeRDG>(Runtimes[i].Get()))
			{
				RDGRuntimes.Add(RDGRuntime);
				UE_LOG(LogTemp, Warning, TEXT("RDG runtime available: %s"), *Runtimes[i]->GetRuntimeName());
			}
			else
			{
				UE_LOG(LogTemp, Warning, TEXT("Non CPU/GPU/RDG runtime: %s"), *Runtimes[i]->GetRuntimeName());
			}
		}
	}

	// pick the first available runtime starting from GPU, then RDG, then CPU
	if (GPURuntimes.Num() > 0)
	{
		TWeakInterfacePtr<INNERuntimeGPU> Runtime = GPURuntimes[0];
		if (Runtime.IsValid())
		{
			TUniquePtr<IModelGPU> UniqueModel = Runtime->CreateModelGPU(ModelData);
			if (UniqueModel.IsValid())
			{
				if (UNeuralNetworkModel* Result = NewObject<UNeuralNetworkModel>(Parent))
				{
					Result->GPUModel = TSharedPtr<IModelGPU>(UniqueModel.Release());
					UE_LOG(LogTemp, Warning, TEXT("GPU Neural Network model created"));
					return Result;
				}
			}
		}
	}

	if (RDGRuntimes.Num() > 0)
	{
		TWeakInterfacePtr<INNERuntimeRDG> Runtime = RDGRuntimes[0];
		if (Runtime.IsValid())
		{
			TUniquePtr<IModelRDG> UniqueModel = Runtime->CreateModelRDG(ModelData);
			if (UniqueModel.IsValid())
			{
				if (UNeuralNetworkModel* Result = NewObject<UNeuralNetworkModel>(Parent))
				{
					Result->RDGModel = TSharedPtr<IModelRDG>(UniqueModel.Release());
					UE_LOG(LogTemp, Warning, TEXT("RDG Neural Network model created"));
					return Result;
				}
			}
		}
	}

	if (CPURuntimes.Num() > 0)
	{
		TWeakInterfacePtr<INNERuntimeCPU> Runtime = CPURuntimes[0];
		if (Runtime.IsValid())
		{
			TUniquePtr<IModelCPU> UniqueModel = Runtime->CreateModelCPU(ModelData);
			if (UniqueModel.IsValid())
			{
				if (UNeuralNetworkModel* Result = NewObject<UNeuralNetworkModel>(Parent))
				{
					Result->CPUModel = TSharedPtr<IModelCPU>(UniqueModel.Release());
					UE_LOG(LogTemp, Warning, TEXT("CPU Neural Network model created"));
					return Result;
				}
			}
		}
	}

	return nullptr;
}

(note that running on GPU isn’t always the best option, so use with care).
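For completeness, a hypothetical call site could look like this (ModelData being a UNNEModelData asset imported from an .onnx file, and the call made from a UObject so `this` can serve as the outer; these names are placeholders, not plugin API):

// Hypothetical usage of the selection function above.
UNeuralNetworkModel* Model = UNeuralNetworkModel::CreateModel(this, ModelData);
if (Model == nullptr)
{
	UE_LOG(LogTemp, Error, TEXT("No available NNE runtime could create a model from the given model data"));
}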

So, my question was how to convert my data (floats) into a render buffer and how to get it back. Based on what you shared, the creation of input bindings for RDG runtimes would be done like this:

FRDGBufferDesc InputBufferDesc = FRDGBufferDesc::CreateBufferDesc(sizeof(float), NeuralNetworkInputSize.X * NeuralNetworkInputSize.Y * 3);
FRDGBufferRef InputBuffer = GraphBuilder.CreateBuffer(InputBufferDesc, *FString("NeuralPostProcessing::InputBuffer"));

(I’ll have to figure out how to convert the tensor shape into NeuralNetworkInputSize, but it shouldn’t be too bad since we have xyRGB).
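For what it’s worth, one hedged way to derive the buffer size from the model itself (assuming the 5.3-style API used later in this thread, where the model instance exposes its input shapes after SetInputTensorShapes has been called, and that FTensorShape::Volume() multiplies all dimensions, e.g. 1 x Height x Width x 3 for an xyRGB input) might be:

// Derive the number of floats from the model's first input tensor instead of hard-coding it.
const UE::NNE::FTensorShape& InputShape = ModelInstance->GetInputTensorShapes()[0];
const uint32 NumInputFloats = static_cast<uint32>(InputShape.Volume());

FRDGBufferDesc InputBufferDesc = FRDGBufferDesc::CreateBufferDesc(sizeof(float), NumInputFloats);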

Then populating the buffer is where I’m getting lost. From what I saw in the docs, I probably need to do something like this:

FRDGBufferRef IndexBuffer = GraphBuilder.CreateBuffer(
	FRDGBufferDesc::CreateUploadDesc(sizeof(uint32), NumIndices),
	TEXT("MyIndexBuffer"));

// Allocates an array of data using the internal RDG allocator for deferral.
FRDGUploadData<int32> Indices(GraphBuilder, NumIndices);

// Assign Data
Indices[0] = // ...;
Indices[1] = // ...;
Indices[NumIndices - 1] = // ...;

// Upload Data
GraphBuilder.QueueBufferUpload(IndexBuffer, Indices, ERDGInitialDataFlags::NoCopy);

Since I want to pass the data from CPU to GPU (and also upload an empty output binding), I need to figure out how to hook into the RDG execution, get a reference to my buffer, run the model, and get the data back from the GPU to the CPU to process the output. (All of that should become clearer once I fully grasp the Render Dependency Graph documentation.)

Did I get that right?

Thanks for your patience!

Hey @mattai ,

The runtime selection code looks good, and it is the way to go, as not all runtimes will be available on all target systems. Unfortunately it is even a bit more complicated/cumbersome: not all models run on every runtime (which, in addition, are not available on all systems). So you may run into the case where you have a runtime, but when you pass it some model data it will complain and not be able to create the model for you. Since this logic is very application specific, it is not contained inside the plugin but left for the developer to implement.
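To handle that case, you could extend your selection code to try each runtime of an interface in turn and keep the first one that accepts the model data. A rough sketch, reusing the GPURuntimes array and calls from your snippet above:

TUniquePtr<IModelGPU> GPUModel;
for (const TWeakInterfacePtr<INNERuntimeGPU>& Candidate : GPURuntimes)
{
	if (Candidate.IsValid())
	{
		// A runtime may refuse the model data (e.g. unsupported operators), so fall through to the next one.
		GPUModel = Candidate->CreateModelGPU(ModelData);
		if (GPUModel.IsValid())
		{
			break;
		}
		UE_LOG(LogTemp, Warning, TEXT("A GPU runtime could not create the model, trying the next one"));
	}
}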

The interfaces (currently CPU, GPU and RDG) are not really tied to the backend (CUDA, DirectML, …) but stand for the use case: based on whether you want to run on the CPU, or you have CPU data but want to run on the GPU, or you need to work inside a frame, you pick an interface (the template argument of GetRuntime). Then, based on what system you are running on, you pick a runtime (the function argument of GetRuntime). By the way, in theory, a runtime based on CUDA could implement both the GPU and the RDG interface! So using an RDG runtime does make sense if your input and output resources already reside on the GPU; if not, GPU would do as well.
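As a small illustration (hedged, since the namespace changed between versions: UE::NNECore in the experimental plugin, UE::NNE from 5.3 on; runtime names as they appear later in this thread):

// The interface (use case) is the template argument, the runtime (backend/system) the function argument.
TWeakInterfacePtr<INNERuntimeCPU> CpuRuntime = UE::NNE::GetRuntime<INNERuntimeCPU>(FString("NNERuntimeORTCpu"));
TWeakInterfacePtr<INNERuntimeGPU> GpuRuntime = UE::NNE::GetRuntime<INNERuntimeGPU>(FString("NNERuntimeORTDml"));
TWeakInterfacePtr<INNERuntimeRDG> RdgRuntime = UE::NNE::GetRuntime<INNERuntimeRDG>(FString("NNERuntimeRDGHlsl"));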

However: we do not have complete coverage with all runtimes yet. Thus, if there is an RDG runtime that runs on a system not covered by any GPU runtime, it could make sense for now to mimic the GPU behaviour: enqueue a render command and create your own FRDGBuilder (from the command list you get in the enqueue render command function) on which you enqueue your network through our RDG interface.
You will find code in the docs in the ‘Render Graph Builder’ section; just replace GraphBuilder.AddPass with Model.EnqueueRDG. For filling in the RDG buffer you are at the right point. By doing it that way, you let RDG decide when the best time to upload the data is. To access the resulting data, you may need to ‘extract’ the buffer (see the ‘External Resources’ section), then lock the resource after the graph has been executed and copy the memory. There are certainly better (more efficient) ways of doing this, but it could be an easy first draft.
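Roughly, that first draft could look like the sketch below. This is not official plugin code: it assumes the 5.3-style UE::NNE::IModelInstanceRDG and FTensorBindingRDG types used later in this thread, a model instance created beforehand, and placeholder buffer names; the final readback after Execute is only indicated in a comment.

#include "RenderGraphBuilder.h"
#include "RenderGraphResources.h"
#include "RenderingThread.h"
#include "NNERuntimeRDG.h" // header names may differ slightly between engine versions

void EnqueueInferenceOnRenderThread(TSharedPtr<UE::NNE::IModelInstanceRDG> ModelInstance, TArray<float> InputData, int32 NumOutputFloats)
{
	ENQUEUE_RENDER_COMMAND(NNEInference)(
		[ModelInstance, InputData = MoveTemp(InputData), NumOutputFloats](FRHICommandListImmediate& RHICmdList)
		{
			// Create our own graph builder from the command list, as described above.
			FRDGBuilder GraphBuilder(RHICmdList);

			// Create the input buffer and let RDG decide when to upload the CPU data.
			FRDGBufferRef InputBuffer = GraphBuilder.CreateBuffer(
				FRDGBufferDesc::CreateBufferDesc(sizeof(float), InputData.Num()), TEXT("NNE.Input"));
			GraphBuilder.QueueBufferUpload(InputBuffer, InputData.GetData(), InputData.Num() * sizeof(float));

			// Output buffer that the network will write into.
			FRDGBufferRef OutputBuffer = GraphBuilder.CreateBuffer(
				FRDGBufferDesc::CreateBufferDesc(sizeof(float), NumOutputFloats), TEXT("NNE.Output"));

			// Instead of GraphBuilder.AddPass, enqueue the network with its bindings.
			UE::NNE::FTensorBindingRDG InputBinding; InputBinding.Buffer = InputBuffer;
			UE::NNE::FTensorBindingRDG OutputBinding; OutputBinding.Buffer = OutputBuffer;
			ModelInstance->EnqueueRDG(GraphBuilder, { InputBinding }, { OutputBinding });

			// Extract the output so it survives graph execution (see 'External Resources').
			TRefCountPtr<FRDGPooledBuffer> PooledOutput;
			GraphBuilder.QueueBufferExtraction(OutputBuffer, &PooledOutput);

			GraphBuilder.Execute();

			// After execution, the pooled output buffer can be locked/read back and the
			// floats copied to CPU memory (omitted here; see the RDG docs for details).
		});
}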

In the long run, using an RDG runtime to do GPU interface work should not be necessary, as there will be GPU runtimes covering more platforms. For the time being, however, you may want to implement the above fallback using our HLSL runtime (which should be supported on almost all platforms but has limited operator support, so not all models will run).
Hope that will be sufficient for now to cover your use case.

Good work @matt, it is a pleasure to see people diving deep into this!

Note to future self and potentially other people landing on this thread who used the NNE plugin: NNE is being moved into the engine (it’s not going to be a plugin anymore).

See this commit for details on what’s changing and how to adapt your code if you were using the experimental plugin.

Yes @mattai, thanks for the note! There will be some changes for 5.3, which we will describe in this course soon. The commit you posted is on ue5-main, which will go into 5.4; after that, the core API of NNE will no longer be a plugin but part of the engine. Please note that you will still have to enable the plugins of the runtimes you want to use.

Does NNE in 5.3 support text inputs/outputs for tensors, or will that be available soon?

The input tensor bindings for CPU runtimes basically just need a raw pointer; the data type inside is defined by the neural network. So if your network supports input tensors of type char, you should be able to feed it. Or in other words, if you are able to run the model with e.g. ORT, it should also work with NNERuntimeORTCpu.
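As a small, hedged illustration (assuming a 5.3-style IModelInstanceCPU and a model whose first input expects int64 token IDs, as many ONNX NLP models do; OutputVolume is a placeholder), the binding really is just a pointer plus a size:

// Hypothetical token IDs; in practice these come from the tokenizer of your model.
TArray<int64> TokenIds = { 101, 2023, 2003, 1037, 3231, 102 };

UE::NNE::FTensorBindingCPU InputBinding;
InputBinding.Data = TokenIds.GetData();
InputBinding.SizeInBytes = TokenIds.Num() * sizeof(int64);

// Output buffer sized from the model's output shape (OutputVolume is a placeholder here).
TArray<float> OutputData;
OutputData.SetNumZeroed(OutputVolume);
UE::NNE::FTensorBindingCPU OutputBinding;
OutputBinding.Data = OutputData.GetData();
OutputBinding.SizeInBytes = OutputData.Num() * sizeof(float);

ModelInstance->RunSync({ InputBinding }, { OutputBinding });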

Thanks @ranierin. Finally, I was able to figure it out!

Hello! Thanks for the useful tutorial and exciting package. I was able to run the example code successfully and wanted to see how far I could take this using a large language model. I tried a few ONNX variants of LLMs, all with the same result: a crash on Model->CreateModelInstance with the following error:

Assertion failed: false [ORTExceptionHandler.cpp] [Line: 16] ONNXRuntime threw an exception with code 6, e.what(): "Exception during initialization: D:\build++UE5\Sync\Engine\Plugins\Experimental\NNERuntimeORTCpu\Source\ThirdParty\onnxruntime\Onnxruntime\Private\core\optimizer\initializer.cc:31 onnxruntime::Initializer::Initializer !model_path.IsEmpty() was false. model_path must not be empty. Ensure that a path is provided when the model is created or loaded. ".

The code I have is exactly the same as what I use for the mnist model example:

if (ManuallyLoadedModelData)
{
	TWeakInterfacePtr<INNERuntimeCPU> Runtime = UE::NNE::GetRuntime<INNERuntimeCPU>(FString("NNERuntimeORTCpu"));
	if (Runtime.IsValid())
	{
		ModelHelper = MakeShared<FMyModelHelper>();
		TUniquePtr<UE::NNE::IModelCPU> Model = Runtime->CreateModel(ManuallyLoadedModelData);
		if (Model.IsValid())
		{
			ModelHelper->ModelInstance = Model->CreateModelInstance(); // engine hard crashes here
I am assuming the error is a side effect of something going wrong with either the import of the model or something about it not being supported. I am wondering whether this is expected to work at all before sinking too much time into it, and whether any documentation exists about the current limitations/constraints relevant to supported ONNX models (e.g. supported operators, opset versions, model size, etc.). Any pointers appreciated, thanks!

Hey @macarran , I am just guessing: maybe the model you are trying to load stores its weights in a separate file (typically the case for models > 2GB). This is a feature that we do not currently support inside NNE, sorry :frowning:

@ranierin you’re spot on, it’s a larger model with weights split across different files. Thanks for confirming this is not yet supported. I don’t know if I missed this elsewhere, but it might be worth adding some lightweight documentation somewhere on what’s currently supported and any known limitations.

Hello, do you have an example of how to feed a text input and how to retrieve the output?

I don’t understand how to pass the parameters to my model and how to get the result.

Hi @darsac , I’m currently using a CNN-based model for my project. It’s basically similar to the mnist model, so for that too I had to turn the image into tensors. I used the OpenCV plugin to get the pixel data and flatten it to the required tensor shape. You can feed any data type as long as your model supports it. For NLP or text-input-based models, I expect you will need to tokenize the text so that it fits the tensor shape your model expects.
Finally, if your data is preprocessed and ready, you can easily fill the input tensor array and read the predicted values from the output tensor array.
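For what it’s worth, a generic sketch of that flattening step (no OpenCV specifics; Width, Height and Pixels are placeholders) for an mnist-like float input could look like this:

// Flatten 8-bit grayscale pixels into the float layout of a (1, 1, Height, Width) tensor.
TArray<float> InputTensorData;
InputTensorData.Reserve(Width * Height);
for (int32 Y = 0; Y < Height; ++Y)
{
	for (int32 X = 0; X < Width; ++X)
	{
		// Normalize to [0, 1] and append in row-major order.
		InputTensorData.Add(Pixels[Y * Width + X] / 255.0f);
	}
}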

@ranierin I want to use an ONNX model like bert-squad, but I don’t understand how to give the model the input and how to extract the output. Do you have any idea? Thanks

Hey @darsac , you got a pretty good answer from @Heyzonsteve : you can feed any data to an NNE model as long as the model supports it. E.g. the CPU tensor binding just takes a void*, pointing to any data type. Of course you need to make sure your memory points to data that your neural network expects; this depends on the model you are using. I quickly looked at bert-squad in the ONNX model zoo and it is like @Heyzonsteve described: you need to tokenize your text. How this is done can be found in the above link; they have some sample code.

So long story short: I recommend reading the NNE tutorials to understand how to pass any input to NNE and then read the documentation of the model you want to use to figure out how to prepare your input.

I’m having a lot of trouble tokenizing my string; there is no explanation in the documentation. It only uses their Python function, which is not possible at runtime in Unreal. Do you have any idea?

I am not familiar with the details of the bert tokenizer, sorry :frowning: I guess you will have to read up on that online. Maybe this explanation can help.

Hi everyone, I am working with the mnist-8 model from the guide and got the following runtimes to work. For context, I am on a Windows machine with an Intel CPU and an NVIDIA GPU, running Unreal Engine 5.3:

NNERuntimeORTCuda (strictly requires CUDA 11.4 and CUDNN 8.8.2.26)
NNERuntimeORTDml (Works with no extra requirements)
NNERuntimeORTCpu (Works with no extra requirements)

For NNERuntimeRDGHlsl → I could not even create a model; I got errors that some layers are not supported, like the MaxPooling layer for example, which I assume requires implementing those custom layers in HLSL somewhere in the code.

I am trying to get NNERuntimeRDGDml to work, and I think I am very close. Here are the steps:

  1. Load the model data (Manual, Automatic or Lazy) → works
  2. Create RDG Model → works
  3. Create RDG Model Instance → works
  4. Set inputs and outputs → works
  5. Create input and output RDG Buffers on the GPU → works
  6. Create input and output bindings → works
  7. Enqueue model, input bindings, output bindings to the RDG → works
  8. Execute the RDG → works, and then the editor crashes

I checked the logs and the last thing printed before the crash is “Graph Builder Executed”, which I print right after GraphBuilder.Execute(), so I assume this means inference is running on the render dependency graph successfully.
Also, the reason for the crash is a memory access violation, so maybe after or during graph execution the engine is trying to access a null pointer or something; I’m trying to figure it out.

There are 3 functions in my code (see below):

  1. void NNERDG::CreateModelRDG() → the main function where steps 1 to 8 are executed

  2. void NNERDG::CreateRDGBuffers(int32 InputSize, int32 OutputSize) → for step 5

  3. void NNERDG::RunInferenceRDG(TSharedPtr<UE::NNE::IModelInstanceRDG> ModelInstanceRDG, TArray NNEInputTensorArrayBuffer) → for steps 6, 7 and 8

I need help in

  1. Confirming that the input and output data setup and bindings are correct in steps 4, 5 and 6
  2. Confirming that step 7 is done correctly as well
  3. Understanding why the editor is crashing after the GraphBuilder is executed

Thank you

RDGModel.cpp (9.1 KB)

Hey @gabaly92

Nice work! You are right about the HLSL runtime; this is work in progress and missing most of the operators. With RDG, you also took on the biggest challenge, as it is the most complex to get running.

So first of all: if you have data on the CPU and you want to run a model on it on the GPU but then get the results back on the CPU, I would recommend using the INNERuntimeGPU interface (and probably the ORTDml runtime), as it handles the upload and download for you.

RuntimeRDG is meant to be used when you want to run a network as part of rendering a frame (e.g. post processing): consuming a resource that is generated during frame rendering and producing output that then contributes to the final frame.

However, for the fun of the exercise you can indeed do what you did and try to do the up- and download manually. But there are unfortunately a couple of issues with the code, and I recommend reading about the Render Dependency Graph first, especially the part on uploading buffers and buffer extraction.

E.g. you create the buffers on one GraphBuilder but then use them in the other (which only works if you register or convert them to external; see the sketch after the list below); you would typically move this code into the dispatch function to allow RDG to reuse resources. Also, you use the upload mechanism on the output buffer, but what you want there is to download the data from the buffer into your array.

So:

  1. Create the buffers inside ENQUEUE_RENDER_COMMAND on the same builder and set the input bindings there
  2. Looks correct
  3. It crashes because you use buffers that are not valid anymore, as they belong to another graph builder
    (4. You will not get any results back, as you are uploading your output array)
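For completeness, the “register or convert to external” point mentioned above could look roughly like the sketch below (hedged: RHICmdList, NumFloats and the buffer name are placeholders); the simpler and recommended route is still to keep everything on a single builder, as in point 1:

// Convert the buffer to an external (pooled) buffer so it outlives the first graph...
TRefCountPtr<FRDGPooledBuffer> PooledBuffer;
{
	FRDGBuilder FirstBuilder(RHICmdList);
	FRDGBufferRef Buffer = FirstBuilder.CreateBuffer(
		FRDGBufferDesc::CreateBufferDesc(sizeof(float), NumFloats), TEXT("NNE.SharedBuffer"));
	PooledBuffer = FirstBuilder.ConvertToExternalBuffer(Buffer);
	FirstBuilder.Execute();
}

// ...and re-register it on the second graph before using it there.
{
	FRDGBuilder SecondBuilder(RHICmdList);
	FRDGBufferRef Buffer = SecondBuilder.RegisterExternalBuffer(PooledBuffer);
	// ... use Buffer with EnqueueRDG or other passes here ...
	SecondBuilder.Execute();
}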

Hope that helps! As mentioned, it is difficult, so don’t be demotivated!

Thank you so much for the quick and detailed response. I tried moving everything under one graph builder, but I am still getting an access violation error. It looks like the problem is, as you mentioned, with how the input is created or uploaded; I’m not sure at this point, but it looks like the input and output data somehow become unavailable during or after graph execution.

As you mentioned, RDG is meant for models dealing with the rendered frame or the rendering pipeline, like DLSS (I assume) or running neural style transfer on the rendered frame. So at this point I will focus on the CPU and GPU runtimes; they are more reliable and less complex than the RDG pipeline in the current version of NNE. I will get back to the RDG runtime if I really need to use it.

1 Like