Data copies in NNERuntimeRDG

In the NNERuntimeRDGHlsl runtime for NNE (part of the experimental NNERuntimeRDG plugin), operators that do not alter the incoming data in any way, such as Reshape, Squeeze, or Unsqueeze (which only manipulate the tensor's shape), are implemented with AddCopyBufferPass. For example, in NNERuntimeRDGReshape.cpp:

// ...
virtual void Dispatch(FRDGBuilder& GraphBuilder, TConstArrayView<FTensorRDGRef> InputTensors, TConstArrayView<FTensorRDGRef> OutputTensors) override
{
	check(InputTensors.Num() == 2);
	check(OutputTensors.Num() == 1);
	check(InputTensors[0] != nullptr);
	check(OutputTensors[0] != nullptr);

	const FTensorRDG& Data = *InputTensors[0];
	const FTensorRDG& Output = *OutputTensors[0];

	RDG_EVENT_SCOPE_STAT(GraphBuilder, FNNEOperatorReshape, "NNE.Operator.Hlsl.Reshape");
	RDG_GPU_STAT_SCOPE(GraphBuilder, FNNEOperatorReshape);

	AddCopyBufferPass(GraphBuilder, Output.GetBuffer(), Data.GetBuffer());
}
// ...

Ideally, these operations should be “free”: only the tensor's metadata changes, so the output could simply reuse the input buffer. And it is not uncommon for a model graph to perform many of these operations on large tensors. Does NNERuntimeRDGHlsl strictly require each operator's output to live in its own separate buffer? Are there plans to avoid these copies in the future?
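To illustrate what “free” would mean here, below is a minimal standalone sketch (not engine code; `TensorView` and its members are hypothetical) of a reshape that only rewrites shape metadata while aliasing the underlying buffer instead of copying it:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <numeric>
#include <vector>

// Hypothetical tensor view: shape metadata plus shared ownership of the storage.
struct TensorView
{
	std::vector<size_t> Shape;                   // metadata only
	std::shared_ptr<std::vector<float>> Buffer;  // shared, never copied on reshape

	size_t Volume() const
	{
		return std::accumulate(Shape.begin(), Shape.end(), size_t{1}, std::multiplies<size_t>());
	}
};

// Reshape changes only the shape metadata; the data buffer is aliased, not copied.
TensorView Reshape(const TensorView& Input, std::vector<size_t> NewShape)
{
	TensorView Output{std::move(NewShape), Input.Buffer};
	assert(Output.Volume() == Input.Volume()); // element count must be preserved
	return Output;
}
```

With this layout, a Reshape/Squeeze/Unsqueeze costs a small metadata update regardless of how large the tensor data is.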

Hi Javier,

Yes, the copy is unfortunate and the result of some shortcuts we had to take to be able to treat each operator individually. The memory overhead is not as bad as it may look, since RDG reuses intermediate buffers. But it is still a slight overhead, and there is also the cost of the copy itself.
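For readers unfamiliar with why the memory overhead stays modest: the idea behind intermediate-buffer reuse can be sketched in the spirit of a transient pool. This is a hypothetical standalone illustration (`FTransientPool` is not an engine type, and RDG's actual transient allocator is considerably more sophisticated); it only shows that a released allocation can be handed back to a later request instead of allocating anew:

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical transient pool: released buffers are recycled for later
// requests of a compatible size instead of triggering fresh allocations.
class FTransientPool
{
public:
	std::shared_ptr<std::vector<std::byte>> Acquire(size_t Size)
	{
		for (auto It = Free.begin(); It != Free.end(); ++It)
		{
			if ((*It)->size() >= Size)
			{
				auto Buffer = *It;
				Free.erase(It);
				return Buffer; // reuse an existing allocation
			}
		}
		return std::make_shared<std::vector<std::byte>>(Size); // no fit: allocate
	}

	void Release(std::shared_ptr<std::vector<std::byte>> Buffer)
	{
		Free.push_back(std::move(Buffer)); // make the allocation available again
	}

private:
	std::vector<std::shared_ptr<std::vector<std::byte>>> Free;
};
```

So even though each operator writes to its "own" output buffer, the peak memory footprint is bounded by the number of buffers that are live at the same time, not by the total number of intermediate tensors in the graph.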

Note, due to limited resources we need to pick our battles, and we are currently putting our efforts into bringing NNERuntimeIREE for RDG on par with our HLSL runtime. Performance-wise, the HLSL runtime still outperforms IREE, but with IREE we will eventually have a more sustainable solution. E.g. in the case above, IREE will remove the operator from the GPU entirely.

For this reason, further optimization of the HLSL runtime is not planned at the moment.

Apologies for the inconvenience and thanks for your understanding!

Thank you once more for your clear answer, Nico, that’s completely understandable. It’s great to know that RDG can reuse buffers.

I take it then that in the future, once it is ready, the NNERuntimeIREE RDG backend will be the recommended choice for running models on the GPU (perhaps alongside OnnxRuntime DML on DirectX platforms)? Being able to use IREE for both CPU and GPU by default would actually be pretty nice.

Exactly. If you are on a DirectX-based system, NNERuntimeORTDml has the widest model support and probably the fastest inference for most models. If DirectX is not available (e.g. when the Engine runs on a Vulkan RHI or on consoles), NNERuntimeRDGHlsl is currently probably still the best choice, and once we catch up on performance with NNERuntimeIREERdg, it should replace NNERuntimeRDGHlsl.

Note, it may be worth trying out the ARM runtime as well: they also implement the RDG interface and base their runtime on a Vulkan extension.