Course: Neural Network Engine (NNE)

Hello! I’m a newcomer to both NNE and Unreal Engine. Could you please clarify the current level of operator support in NNE?

  1. Does NNE currently support all standard ONNX operators?
  2. Where can I find information about supported operators/opset versions?

Thank you!

Hey everyone! I haven’t seen this issue mentioned anywhere else, so I thought I’d share my experience here in case it helps anyone out.

Today I ran into an issue with NNE where, in certain cases, the output data array wasn’t filled with values even though the result status of the “RunSync()” function was “EResultStatus::Ok”. It turns out this is caused by the code in NNERuntimeORTModel.cpp at line 406:

// The Memcpy into the user-provided output binding is skipped entirely if the binding
// is missing or smaller than the tensor: no warning, no change to the result status.
if (!InOutputBindings.IsEmpty() &&
	InOutputBindings[i].Data &&
	Tensor.GetDataSize() > 0 &&
	InOutputBindings[i].SizeInBytes >= Tensor.GetDataSize())
{
	FMemory::Memcpy(InOutputBindings[i].Data, OrtOutputTensors[i].GetTensorData<void>(), Tensor.GetDataSize());
}

If your output data array is under-allocated (which was totally my fault), it silently skips the copy and doesn’t write any data into the output array. This resulted in seemingly random behavior: the model would work perfectly in cases where the output was smaller, but would apparently randomly fail to return a result in others.

In my opinion it would be nice if there was at least a warning in the console saying this is happening; otherwise it’s very easy to spend hours troubleshooting the issue.

Yes, make sure you calculate and allocate the exact memory needed for the model inputs and outputs before running inference. You can easily do that using the input and output tensor dimensions and their types.
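Roughly something like this sketch (assuming the UE::NNE CPU interface, a model with only float tensors and concrete input shapes; ModelInstance stands in for your own IModelInstanceCPU pointer):

#include "NNERuntimeCPU.h" // UE::NNE CPU interface (IModelInstanceCPU, FTensorBindingCPU)

// Resolve the (symbolic) input shapes to concrete ones and pass them to the instance.
TConstArrayView<UE::NNE::FTensorDesc> InputDescs = ModelInstance->GetInputTensorDescs();
TArray<UE::NNE::FTensorShape> InputShapes;
for (const UE::NNE::FTensorDesc& Desc : InputDescs)
{
	// Assumes the shape has no variable dimensions; otherwise fill them in manually.
	InputShapes.Add(UE::NNE::FTensorShape::MakeFromSymbolic(Desc.GetShape()));
}
ModelInstance->SetInputTensorShapes(InputShapes);

// Allocate exactly Volume() elements per tensor (float model) and build the bindings.
TArray<TArray<float>> InputBuffers;
TArray<UE::NNE::FTensorBindingCPU> InputBindings;
for (const UE::NNE::FTensorShape& Shape : InputShapes)
{
	TArray<float>& Buffer = InputBuffers.AddDefaulted_GetRef();
	Buffer.SetNumZeroed((int32)Shape.Volume());

	UE::NNE::FTensorBindingCPU& Binding = InputBindings.AddDefaulted_GetRef();
	Binding.Data = Buffer.GetData();
	Binding.SizeInBytes = Buffer.Num() * sizeof(float);
}
// Do the same for the outputs (GetOutputTensorDescs), then call RunSync with both binding arrays.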

I totally agree with this: if you can get the dimensions ahead of time, you should do that. In my case the model has variable output dimensions, which according to the comments in the code are only resolved while running the model (I’m new to this, so I might be completely wrong and there might be a better way of handling this).
Either way, it was my bad code that was causing this, but I still think the system shouldn’t fail silently while returning EResultStatus::Ok. It makes it a lot harder to debug what went wrong.

@N2Man Operator support heavily depends on the runtime you are using. If you are using NNERuntimeORTCpu or Dml, all ONNX operators available in the ORT version inside the engine are supported.

@SouryGame Sorry for the missing documentation about this. It was a design choice (a feature, not a bug :wink: ), let me explain:

Some neural networks cannot provide the output shapes upfront (e.g. if the shape depends on some input values). In these cases, you cannot provide the proper shapes until you run. So the idea is that you run inference with just some allocated output. If it does not fit, you will not get results, but at least after that the output shapes will be known, so you can allocate proper outputs and run inference again. Since inference did run correctly, we decided not to make it an error.

But you are right, we should document it better.

Also a small side note: while we expect inputs to be exactly the volume defined by the shapes you set, for outputs you can pass more. So if you know the upper bound of memory needed for your outputs, you can just allocate that and it will always be fine.


Thanks for the clarification, that makes complete sense. Is there a built-in way to check whether the output array was filled after the first run, or should I just manually check whether the elements were left as zeroes? If not, it would be useful if EResultStatus contained an additional state for that: currently it’s just Ok or Fail, but it could contain a third result like OkOutputTooSmall (with some better name).

One more question: after experimenting with the CPU runtime I decided to switch to RDG, but I noticed that a lot of the operators aren’t implemented. I understand it’s a lot of work and you’re continuously adding more, so I decided to implement the ones I need myself. Currently there doesn’t seem to be a way of doing that without modifying the engine, which is a shame because it limits the ability to share expanded operator sets as plugins, and it would be helpful for custom operators even after all the ONNX operators are implemented in Unreal. Am I just missing something, or is there really no built-in way to expand the operator set without engine modification? If not, has a feature like this been considered?

@SouryGame We will consider adding an additional flag. The safest way is to get the output shapes after the inference call and make sure the output array you provided has at least as much memory as what is declared by the output shapes.
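In code that check could look roughly like this (just a sketch against the CPU interface, assuming a single float output tensor; OutputBuffer and the binding names are placeholders):

// First run with a guessed output size; RunSync can report Ok even if the copy was skipped.
ModelInstance->RunSync(InputBindings, OutputBindings);

// After the run the real output shapes are known, so verify the provided buffer was big enough.
TConstArrayView<UE::NNE::FTensorShape> OutputShapes = ModelInstance->GetOutputTensorShapes();
const uint64 RequiredBytes = OutputShapes[0].Volume() * sizeof(float);

if (OutputBindings[0].SizeInBytes < RequiredBytes)
{
	// The output was not written: reallocate to the declared size and run again.
	OutputBuffer.SetNumZeroed((int32)OutputShapes[0].Volume());
	OutputBindings[0].Data = OutputBuffer.GetData();
	OutputBindings[0].SizeInBytes = OutputBuffer.Num() * sizeof(float);
	ModelInstance->RunSync(InputBindings, OutputBindings);
}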

Regarding the second question: Which RDG runtime do you use? I imagine you are using NNERuntimeRDGHlsl if you are missing operators? Can you try NNERuntimeORTDml instead?

We do not plan to expose adding custom operators to Hlsl, and it is not possible with other runtimes. You could implement your own runtime if you have many custom operators, but that would be a lot of work.

Thank you very much for the answer. In the future I’ll use the method you mentioned for checking the output data size.

Sorry, I forgot to mention that. Yes, I’m using the NNERuntimeRDGHlsl runtime, as it seems to have broader platform support since it doesn’t require DX12.

I would assume it has better performance than NNERuntimeORTDml as well, because it runs entirely through Unreal’s RDG and HLSL. Do you happen to know if that’s the case, or is the difference negligible?

In either case I only have one operator left to implement to get my model running, so I may do some performance testing afterwards. It’s a shame there are no plans to make the operator set extensible, but I understand that decision from the API support and maintenance perspective. I guess it’s a custom-built engine forever!

@SouryGame DirectML is able to access special hardware (e.g. tensor cores) if available.
So depending on the model and the hardware, it can be noticeably faster than a compute shader based runtime.

On the other hand, we have found some models/circumstances where the Hlsl runtime was performing slightly better.

Yes, platform support of our Hlsl runtime is wider, as it uses the UE shader pipeline underneath.

One possible solution regarding the custom op: if it occurs at a single point in your model, you may be able to split your model into two parts, pre-custom-op and post-custom-op. Inside UE you then just enqueue the first model part, use the result to enqueue your custom compute shader, and use the compute shader output when enqueueing the second model.

This ‘sandwiching’ of ML and compute shaders is actually done quite often. Not sure if that would be applicable to your model as well.
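Only as a very rough sketch of the idea (assuming the UE::NNE RDG interface with EnqueueRDG and that the tensor shapes were already set on both instances; AddMyCustomOpPass and the buffer sizes are placeholders for your own compute shader pass):

// Sketch: model part A -> custom compute shader -> model part B, all on one FRDGBuilder.
void EnqueueSandwich(FRDGBuilder& GraphBuilder,
	UE::NNE::IModelInstanceRDG& ModelPartA,
	UE::NNE::IModelInstanceRDG& ModelPartB,
	FRDGBufferRef InputBuffer,
	FRDGBufferRef FinalOutputBuffer,
	uint32 PartAOutputNumFloats,
	uint32 CustomOpOutputNumFloats)
{
	// Intermediate buffers between the stages; sizes must match the tensor volumes.
	FRDGBufferRef PartAOutput = GraphBuilder.CreateBuffer(
		FRDGBufferDesc::CreateBufferDesc(sizeof(float), PartAOutputNumFloats), TEXT("NNE.PartAOutput"));
	FRDGBufferRef CustomOpOutput = GraphBuilder.CreateBuffer(
		FRDGBufferDesc::CreateBufferDesc(sizeof(float), CustomOpOutputNumFloats), TEXT("NNE.CustomOpOutput"));

	// 1) Enqueue the first model part.
	TArray<UE::NNE::FTensorBindingRDG> PartAInputs = { { InputBuffer } };
	TArray<UE::NNE::FTensorBindingRDG> PartAOutputs = { { PartAOutput } };
	ModelPartA.EnqueueRDG(GraphBuilder, PartAInputs, PartAOutputs);

	// 2) Your custom operator as a regular compute shader pass (placeholder helper).
	AddMyCustomOpPass(GraphBuilder, PartAOutput, CustomOpOutput);

	// 3) The second model part consumes the compute shader output.
	TArray<UE::NNE::FTensorBindingRDG> PartBInputs = { { CustomOpOutput } };
	TArray<UE::NNE::FTensorBindingRDG> PartBOutputs = { { FinalOutputBuffer } };
	ModelPartB.EnqueueRDG(GraphBuilder, PartBInputs, PartBOutputs);
}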


Hi, I am using the Unreal Engine (5.4.4/5.5.1) Neural Network Engine (NNE) with RDG HLSL, which works fine for a simple denoising neural network. However, I’m now creating a network for a super-resolution task and utilizing the ConvTranspose operator. Unfortunately, the RDG HLSL compilation throws the following error:

LogNNE: Warning: Found unsupported attribute ‘kernel_shape’.
LogNNE: Warning: RDG MLOperatorRegistry failed to validate operator:ConvTranspose
LogNNE: Warning: Model validator ‘RDG Model validator’ detected an error.
LogNNE: Warning: Model is not valid.
LogNNE: Warning: UNNERuntimeRDGHlsl cannot create a model from the model data with id 2D3254744755AB7C28B918878B33B370
LogTemp: Error: Could not create the RDG model

Upon investigating the Unreal Engine source code (located at E:\UnrealEngine\Engine\Plugins\Experimental\NNERuntimeRDG\Source\NNERuntimeRDG\Private\Hlsl\NNERuntimeRDGConvTranspose.cpp), I noticed that the kernel_shape attribute is commented out. Could you explain why this is the case and help me resolve the issue?

bool ValidateConvTransposeOperator(const NNE::FAttributeMap& AttributeMap, TConstArrayView<ENNETensorDataType> InputTypes, TConstArrayView<NNE::FSymbolicTensorShape> InputShapes)
{
	bool bIsValid = true;

	FAttributeValidator AttributeValidator;
	AttributeValidator.AddOptional(TEXT("auto_pad"), ENNEAttributeDataType::String);
	AttributeValidator.AddOptional(TEXT("dilations"), ENNEAttributeDataType::Int32Array);
	AttributeValidator.AddOptional(TEXT("group"), ENNEAttributeDataType::Int32);
	//AttributeValidator.AddOptional(TEXT("kernel_shape"), ENNEAttributeDataType::Int32Array);
	AttributeValidator.AddOptional(TEXT("output_padding"), ENNEAttributeDataType::Int32Array);
	//AttributeValidator.AddOptional(TEXT("output_shape"), ENNEAttributeDataType::Int32Array);
	AttributeValidator.AddOptional(TEXT("pads"), ENNEAttributeDataType::Int32Array);
	AttributeValidator.AddOptional(TEXT("strides"), ENNEAttributeDataType::Int32Array);

	bIsValid &= AttributeValidator.Validate(AttributeMap);

	FInputValidator InputValidator;
	InputValidator.AddSupportedType(ENNETensorDataType::Float);
	InputValidator.AddRequired();
	InputValidator.AddRequired();
	InputValidator.AddOptional();
	bIsValid &= InputValidator.Validate(InputTypes);

	return bIsValid;
}

As far as I understand, when creating a neural network in PyTorch with the ConvTranspose operator, the kernel_size attribute needs to be used; it is then renamed to kernel_shape in ONNX after conversion.

So I uncommented the line above (the kernel_shape one), rebuilt UE, and the model now compiles fine, but the output is a black screen (so the output is incorrect). I then used the stat GPU memory command for profiling, and it shows that the ConvTranspose operator is extremely slow. So I’m now wondering whether it was commented out on purpose.

Hey @jvin1011, yes, the HLSL runtime is a work in progress and highly experimental. Some operators or attributes are not supported yet. It can be quite a rabbit hole to add new (seemingly simple) features to it.

Sometimes you can get around this by modifying your model before exporting it, e.g. by replacing an unsupported operator with a supported one, or by making sure an attribute is explicitly defined.

The last option is to try the NNERuntimeORTDml runtime, which is also able to consume ONNX models.

I hope that helps!

Hi Nico, thank you for the update. However, I’m unable to run the NNERuntimeORTDml backend despite using Windows with DirectX support and enabling the NNERuntimeORT plugin. I’m encountering the following error:

LogNNERuntimeORT: Error: Failed to add DirectML execution provider to OnnxRuntime session options: D:\a\_work\1\s\onnxruntime\core\providers\dml\dml_provider_factory.cc(71)\onnxruntime.dll!000001A8F6881B44: (caller: 000001A8F68837FF) Exception(1) tid(65738) 80004002 No such interface supported
LogNNERuntimeORT: Error: Failed to configure session options for DirectML Execution Provider.
LogTemp: Error: Could not create the RDG model instance.

Could you advise on resolving this issue?

Hey @jvin1011, not 100% sure, but this looks like incompatible hardware/drivers. Do you have DirectX 12 or 11? Note that DirectML requires DirectX 12.

I have DirectX 12 installed on my machine, but I still can’t figure out why the NNERuntimeORTDml runtime isn’t working, so I will continue to use the Hlsl runtime for now. I replaced the ConvTranspose operator with Conv followed by Interpolation, and it seems to work fine. On a separate note, the NNE tutorial mentions that a lot of optimization is possible, but it isn’t discussed there. Could you suggest a few ways to optimize for performance?

Not sure which part of the tutorial you are referring to.

Typically, a runtime does some optimization on a model before running it (e.g. fusing certain operators if possible). Also, some operations can hit a fast path if certain requirements are met (e.g. a 3x3 convolution with stride 1 can be run through the Winograd algorithm). So you can try to change your model and see if performance changes.

Besides that, you can do the typical things like fp16 quantization, or try to adjust your hyperparameters to increase performance while maintaining quality.


Thanks, Nico. I was referring to the Neural Post Processing tutorial from the 5.3 version:

To keep things simple and explanatory, a lot of optimizations have been left out. Thus there is some open work left to the reader to make this code production ready.

Noted, I will try to change my model architecture and also work with an fp16-quantized model.


Ah, thanks for clarifying! I think that does not refer to model optimizations (since the model is actually just a simple conv kernel with fixed weights corresponding to an edge detection filter). It refers more to how to set up the code, dynamically get the input and output shapes, and so on. I see this is a little misleading, apologies for that!


Hi Nico. Before I get into the rabbit hole :slight_smile: I would like to ask whether it is possible to use NNE to run, let’s say, TinyLlama-1.1B-Chat-v1.0-ONNX or any similar LLM.

At the moment I am using APIs with LMStudio. It would be great to avoid that and have everything in one place.

Hi Nico. From what I understand, NNE is an interface that allows us to choose between different engines/runtimes like OnnxRuntime, HLSL, IREE, and BasicCPU. If I wanted to use TFLite instead, could I use it directly without integrating it into NNE? Since using TFLite directly might be easier, is my understanding correct?

@Soulast Yes, this should totally be possible. I’d recommend starting with NNERuntimeORTCpu as it has the widest operator support, and LLMs sometimes have some exotic operators inside. Also, you may have to prepare your data in a special way, e.g. implement the tokenizer, do the manual embedding, etc. And you may also have to do the iteration yourself to consecutively generate output tokens. It really depends on how the model you are using is set up / exported. So maybe also try out different LLMs if one does not work.
