Hmm, this sounds odd: IREE compiles models down to executable code that makes use of hardware-specific instructions, vector registers, etc.
The performance on small MLPs was actually pretty good in our experience, as these have already been well optimized by the compiler community. I can imagine that the bad performance you observed was introduced by going through the ONNX importer.
We ran some tests a while ago comparing BasicCpu, IREE, and ORT. While ORT was a factor of five slower on small MLPs, IREE could almost match BasicCpu.
So while I have the feeling something is off, I would try IREE again, but this time export the model directly to MLIR (trying out different dialects like linalg, tosa, or stablehlo). That said, BasicCpu could be a valid alternative, as it would still give you a slight performance boost: it uses ISPC kernels and, to my knowledge, runs on current-gen consoles. It does have a custom input format though, so you would have to manually write out the weights and the graph information (both MLDeformer and LearningAgents contain reference code showing how to do so).
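In case it helps, here is a minimal sketch of the direct-to-MLIR route, assuming a PyTorch model and the torch-mlir Python package (the torch_mlir.compile entry point is from older torch-mlir releases; newer ones moved to torch_mlir.fx.export_and_import, so adjust to whatever version you use):

```python
import torch
import torch_mlir  # older torch-mlir releases expose torch_mlir.compile

# A small MLP as a stand-in for the model under test.
class MLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(8, 32),
            torch.nn.ReLU(),
            torch.nn.Linear(32, 4),
        )

    def forward(self, x):
        return self.net(x)

model = MLP().eval()
example_input = torch.randn(1, 8)

# Emit the same network in several dialects so each lowering path
# can be benchmarked through IREE independently of the ONNX importer.
for dialect in ["linalg-on-tensors", "tosa", "stablehlo"]:
    module = torch_mlir.compile(model, example_input, output_type=dialect)
    with open(f"mlp_{dialect}.mlir", "w") as f:
        f.write(str(module))
```

Each of the resulting .mlir files can then be fed to iree-compile (if I remember correctly, with --iree-input-type set to match the dialect) to see which path gives the best numbers.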
Note, an advantage of NNE over custom implementations is that it generalizes well to other runtimes and/or hardware. E.g. it would take just a single line of code to switch to running on an NPU on target devices that have one, freeing up some CPU resources. But I am not sure whether this applies to your use case.
Hope that helps!