How to add an image as an observation when using Learning Agents

I’m new to Unreal Engine and trying to find a way to take an image of the environment as an observation (like a human looking at the environment), alongside some other numeric observations, and output the actions. But in the Interactor’s Event Setup Observations node, there is nothing like an image. Is it not supported yet?

It’s not currently supported because it’s something we want to discourage for the time being. Images pretty much need to be processed on the GPU, and it’s not realistic to render a complex game on the GPU and do ML inference on the same GPU.

We will probably add it in the future, but right now we want to encourage users to use ground truth information, i.e. the game state directly.

Let me know what you think.


Thanks for your reply. Really looking forward to this functionality in the future, as I think the agents should work with the same amount of information as a human.


Technically you could add it yourself in user space. If you subclass the observation types, you can make a new one for images without needing to modify the plugin content itself.

The harder part would be adding in the 2D convolutions…
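For reference, this is roughly the kind of small encoder that would have to sit between the raw pixels and the policy’s input vector. It’s a minimal PyTorch sketch, not anything from the plugin, and all the names and layer sizes are just illustrative:

```python
# Minimal sketch (PyTorch, not plugin code): the kind of CNN encoder that
# would need to exist somewhere between raw pixels and the policy input.
import torch
import torch.nn as nn

class ImageObservationEncoder(nn.Module):
    def __init__(self, channels: int = 3, embedding_size: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.LazyLinear(embedding_size)  # infers the flattened size

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (batch, channels, height, width), values in [0, 1]
        x = self.conv(image)
        return self.head(x.flatten(start_dim=1))

# A 64x64 RGB observation compressed to a 64-float embedding that the policy
# could then concatenate with the other numeric observations.
encoder = ImageObservationEncoder()
embedding = encoder(torch.rand(1, 3, 64, 64))
print(embedding.shape)  # torch.Size([1, 64])
```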


Correct me if I’m wrong, but an image is just a numerical matrix, so you can transform it into a suitable (H, W, C) shape. Why is it hard?
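Just to show what I mean, here’s a quick NumPy sketch. The raw buffer is a random stand-in for a render target readback, not real engine code:

```python
# Raw pixel bytes are just numbers, so turning a readback into an (H, W, C)
# float array is one reshape away.
import numpy as np

height, width, channels = 64, 64, 3
raw_bytes = np.random.randint(0, 256, height * width * channels, dtype=np.uint8)  # stand-in for a readback

image = raw_bytes.reshape(height, width, channels).astype(np.float32) / 255.0  # (H, W, C) in [0, 1]
flat_observation = image.ravel()  # 12,288 floats if you skip convolutions entirely

print(image.shape, flat_observation.shape)  # (64, 64, 3) (12288,)
```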

It might be hard to add those operations without having to modify that part of the plugin. I haven’t put a ton of thought into whether that portion of the code is easily extensible or not. The actual content isn’t as concerning.


I absolutely could be wrong, but I’m wondering if there isn’t a slight misunderstanding happening.

I’m wondering, Allen.The.Alien, if you are thinking of just inputting the pixel data directly without performing convolutions.

While doable, it will not be particularly efficient, and there will be issues with the large increase in the size and complexity of the model. Even a 256×256 image per frame will dominate the other data passed to the model unless it is first transformed with either convolutions or attention.
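Some rough numbers to put that in perspective (a back-of-the-envelope Python sketch; the numeric observation count and hidden layer size are arbitrary):

```python
# Why raw pixels dominate everything else that is passed to the model.
image_inputs = 256 * 256 * 3          # 196,608 values per frame
numeric_inputs = 16                   # e.g. position, velocity, a few distances

print(image_inputs / numeric_inputs)  # the image outweighs the rest ~12,000 to 1

# Without convolutions, even a modest first fully connected layer explodes:
hidden_units = 256
fc_weights = image_inputs * hidden_units
print(fc_weights)                     # ~50 million weights in the first layer alone
```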

Rendering the images at a resolution similar to human sight would make training and inference rather difficult due to the sheer size of the data, and then you also have to account for the number of agents rendering that image every frame.

At the moment, it would probably be simpler and more effective to use raycasts that pass only the information the agent needs to learn and nothing more. A lot of the information passed through images will be unusable or even detrimental to the model’s performance.
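To illustrate the raycast idea, here’s a rough Python sketch. In the engine you would do the traces with line traces (e.g. UWorld::LineTraceSingleByChannel) and feed the distances in as float observations, so the trace function below is just a stand-in:

```python
# A compact raycast observation: a handful of normalized distances instead
# of ~200k raw pixel values.
import math
import random

def trace_distance(origin, angle, max_range):
    # Stand-in for an engine line trace: returns hit distance, or max_range on a miss.
    return random.uniform(0.0, max_range)

def raycast_observation(origin, num_rays=16, max_range=2000.0):
    # Cast num_rays evenly spaced rays and normalize the distances to [0, 1].
    angles = [2.0 * math.pi * i / num_rays for i in range(num_rays)]
    return [trace_distance(origin, a, max_range) / max_range for a in angles]

print(raycast_observation(origin=(0.0, 0.0, 0.0)))  # 16 floats in [0, 1]
```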

If I’m misunderstanding, let me know.