Hello everyone. I am creating a game with Learning Agents in UE 5.5, which is going great.
I’m making a simple proof of concept where RL agents run on a plane, collect stationary resources, and avoid stationary obstacles. So I decided to give the RL agent the following observation: at most the N closest objects to the agent, ordered by distance, in a “dynamic-sized” array observation. However, the agents seem to stutter, and much of it apparently happens while they are equidistant from several objects.
I recently noticed the “… utilizes attention. … use sparingly.” warning on “dynamic-sized” array observations. Curious, I went into the source code for it.
(I do understand that in this case I could use a static array whose remainder (if any) is filled with objects tagged “Empty”; however, I wonder whether the current implementation could be improved.)
As far as I can see, the elements fed into such an array are paired with their positional indices, after which they simply form a set. At the same time, pairs are structs under the hood, and structs act as an “And”-type observation element. So, being an “And” observation, it seems the indices are just concatenated onto the element data. I was wondering whether RoPE could be used at this point instead, if that makes any sense for RL, of course.
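To make the comparison concrete, here is a minimal NumPy sketch of what I mean. This is purely illustrative; the function names and the index normalization are my own, not anything from the plugin:

```python
import numpy as np

def concat_position(x, idx, num_elements):
    # The "And"-style approach as I read it: append the (normalized)
    # positional index as one extra feature on each element.
    return np.concatenate([x, [idx / max(num_elements - 1, 1)]])

def rope(x, idx, base=10000.0):
    # RoPE-style alternative: instead of appending the index, rotate
    # consecutive feature pairs by index-dependent angles. The feature
    # dimension must be even.
    d = x.shape[0] // 2
    freqs = base ** (-np.arange(d) / d)   # per-pair rotation frequency
    angles = idx * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]             # split features into pairs
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin       # 2-D rotation of each pair
    out[1::2] = x1 * sin + x2 * cos
    return out
```

The appeal (if it applies to RL at all) is that the rotation encodes position without widening the feature vector, and relative position falls out of the dot products between rotated queries and keys.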
Next I looked through the NNE (Neural Network Engine, NNERuntimeBasicCPU) code, and it seems the respective Q, K and V are calculated with linear layers for each element (and for each head). However, I couldn’t grasp what is happening in the attention part. For example, looking at OperatorAggregateDotProductAttention [ https://github.com/EpicGames/UnrealEngine/blob/2d53fcab0066b1f16dd956b227720841cad0f6f7/Engine/Plugins/Experimental/NNERuntimeBasicCpu/Source/NNERuntimeBasicCpu/Private/NNERuntimeBasicCpuModel.cpp#L941 ], the resulting attention dot-product matrix seems to hold only a single scalar per element per head, while I’d expect it to be [ElementNum][ElementNum][AttentionHeadNum], or at least [ElementNum][ElementNum] in case the input dimensions are somehow divided between the attention heads.
So Attention[ElementIdx][HeadIdx] seems to be just the dot product of Queries[ElementIdx][HeadIdx] $\cdot$ Keys[ElementIdx][HeadIdx].
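For what it’s worth, here is a small NumPy sketch of my reading (single head, helper names are mine), contrasted with standard self-attention. This is only my interpretation, not the actual implementation:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def full_self_attention_scores(Q, K):
    # Standard self-attention: every element attends to every element,
    # giving an [ElementNum][ElementNum] score matrix (per head).
    return Q @ K.T  # shape (N, N)

def per_element_score_pooling(Q, K, V):
    # My reading of the aggregation: each element's query is dotted only
    # with its *own* key, giving one scalar score per element; those
    # scalars are softmaxed across elements to form pooling weights.
    scores = np.einsum('nd,nd->n', Q, K)  # shape (N,), not (N, N)
    w = softmax(scores)
    return w @ V                          # a single pooled vector
```

If that reading is right, the per-element scalar acts as a softmax weight for collapsing the set of values into one vector, which looks closer to what is sometimes called attention(-based) pooling than to full pairwise self-attention.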
Am I right reading it like this? I wonder what’s the name for this mechanism to search for it further.
Thanks.