Grasping the details of Learning Agents' QKV attention mechanism, e.g. for position encoding with RoPE

Hello everyone. I am creating a game with Learning Agents in UE 5.5, which is going great.

I’m building a simple proof of concept in which RL agents run around on a plane, collecting stationary resources and avoiding stationary obstacles. So I decided to give each RL agent the following observation: the N closest objects (at most) to the agent, ordered by distance, in a “dynamic-sized” array observation. However, the agents exhibit stuttering, and much of it seems to happen while they are equidistant from several objects.
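Concretely, the per-step gathering looks roughly like this (a simplified sketch; `GetClosestObjects`, `Agent`, and `Objects` are my own placeholders, not Learning Agents API):

```cpp
#include "GameFramework/Actor.h"

// Sketch: collect at most N objects, sorted by distance to the agent.
// Note that TArray::Sort on a TArray<AActor*> dereferences the pointers,
// so the predicate receives AActor references.
TArray<AActor*> GetClosestObjects(const AActor* Agent, TArray<AActor*> Objects, int32 N)
{
	const FVector AgentLocation = Agent->GetActorLocation();

	Objects.Sort([&AgentLocation](const AActor& A, const AActor& B)
	{
		return FVector::DistSquared(AgentLocation, A.GetActorLocation())
		     < FVector::DistSquared(AgentLocation, B.GetActorLocation());
	});

	// Truncate to N; when two objects are equidistant, their order (and thus
	// their positional index in the observation) can flip from frame to frame,
	// which is where I suspect my stuttering comes from.
	if (Objects.Num() > N)
	{
		Objects.SetNum(N);
	}
	return Objects;
}
```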

I recently noticed the “… utilizes attention. … use sparingly.” warning on “dynamic-sized” array observations. Curious, I went into the source code for it.

(I understand, of course, that in this case I could use a static array whose remainder (if any) is filled with objects tagged “Empty”; still, I wonder whether the current implementation could be improved.)

As far as I can see, the elements fed into such an array are paired with their positional indices, after which they simply form a set. Under the hood a pair is a struct, and structs behave like an “And”-type observation element, so, being an “And” observation, it seems the indices are just appended to the element data. I was wondering whether RoPE could be used at this point, if that makes any sense for RL, of course.
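For reference, by RoPE I mean the standard rotary scheme, where each element's query/key vector is rotated by a position-dependent angle instead of having the index concatenated onto the features. A minimal sketch (plain C++, nothing Learning Agents-specific):

```cpp
#include <cmath>
#include <vector>

// Standard RoPE: rotate consecutive dimension pairs (2i, 2i+1) of a query or
// key vector by an angle proportional to the element's position. Because a
// dot product between a vector rotated by angle A and one rotated by angle B
// depends only on A - B, attention scores become a function of *relative*
// position.
void ApplyRoPE(std::vector<float>& Vec, int Position)
{
	const int Dim = static_cast<int>(Vec.size()); // assumed even
	for (int i = 0; i < Dim / 2; ++i)
	{
		const float Theta = Position * std::pow(10000.0f, -2.0f * i / Dim);
		const float C = std::cos(Theta);
		const float S = std::sin(Theta);
		const float X0 = Vec[2 * i];
		const float X1 = Vec[2 * i + 1];
		Vec[2 * i]     = X0 * C - X1 * S;
		Vec[2 * i + 1] = X0 * S + X1 * C;
	}
}
```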

Next I looked through the NNE (Neural Network Engine, NNERuntimeBasicCPU) code, and it seems the respective Q, K and V are computed with linear layers for each element (and for each head). However, I couldn't grasp what happens in the attention part. For example, looking at OperatorAggregateDotProductAttention [ https://github.com/EpicGames/UnrealEngine/blob/2d53fcab0066b1f16dd956b227720841cad0f6f7/Engine/Plugins/Experimental/NNERuntimeBasicCpu/Source/NNERuntimeBasicCpu/Private/NNERuntimeBasicCpuModel.cpp#L941 ], the resulting attention dot-product matrix appears to hold only a single scalar per element per head, whereas I would expect it to be [ElementNum][ElementNum][AttentionHeadNum], or at least [ElementNum][ElementNum] if, say, the input dimensions were somehow divided among the attention heads.

So Attention[ElementIdx][HeadIdx] seems to be just the dot product Queries[ElementIdx][HeadIdx] $\cdot$ Keys[ElementIdx][HeadIdx].
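In code, my reading is roughly the following (a sketch with my own names and std:: containers, not the engine's actual buffers):

```cpp
#include <vector>

// My reading of OperatorAggregateDotProductAttention, shape-wise: each element
// is dotted only with itself, giving one score per element per head — there is
// no [ElementNum][ElementNum] pairwise matrix anywhere.
std::vector<std::vector<float>> ElementwiseScores(
	const std::vector<std::vector<std::vector<float>>>& Queries, // [Element][Head][Dim]
	const std::vector<std::vector<std::vector<float>>>& Keys)    // [Element][Head][Dim]
{
	const size_t ElementNum = Queries.size();
	const size_t HeadNum = ElementNum > 0 ? Queries[0].size() : 0;

	std::vector<std::vector<float>> Attention(ElementNum, std::vector<float>(HeadNum, 0.0f));
	for (size_t E = 0; E < ElementNum; ++E)
		for (size_t H = 0; H < HeadNum; ++H)
			for (size_t D = 0; D < Queries[E][H].size(); ++D)
				Attention[E][H] += Queries[E][H][D] * Keys[E][H][D];
	return Attention;
}
```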

Am I reading it right? And what is the name of this mechanism, so I can search for it further?

Thanks.

Be that as it may, when I swapped the dynamic array for a static one, the average reward dropped sharply into the negative.

Hi Senyarik,

Thanks for taking a look at Learning Agents and digging so deep into it!

Yes, you are correct in your assessment of how it all works.

Right now we are not using a sophisticated position encoding at all for the Array Observation. I imagine that RoPE (or pretty much any other position encoding) would work better than what we have now, but we have not had the time to figure out the best default. At some point we will likely swap out the default position encoding used by the Array Observation for something better.

In the meantime, if you feel it could help, you could certainly replace the built-in Array Observation with your own version that uses a more sophisticated position encoding. I would be very curious to know whether that helps your situation!

In fact, for your situation you could consider using the “Set Observation” directly, and include the distance to the entity in each entity's observation. Then you should not get any jittering when the order of objects changes (the Set Observation is independent of the order of its elements). The distance will act a little like the “position encoding” in this setup.
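For example, each entity's features could look something like this (a sketch only; the struct and the obstacle/resource tagging are hypothetical, not Learning Agents API):

```cpp
#include <cmath>

// Hypothetical per-entity features for a Set Observation: position relative to
// the agent, an explicit distance (standing in for a position encoding), and a
// type tag. The order of entities then carries no information at all.
struct FEntityFeatures
{
	float RelX = 0.0f;
	float RelY = 0.0f;
	float Distance = 0.0f;
	float IsObstacle = 0.0f; // 1 = obstacle, 0 = resource
};

FEntityFeatures MakeEntityFeatures(
	float AgentX, float AgentY, float EntityX, float EntityY, bool bIsObstacle)
{
	FEntityFeatures Features;
	Features.RelX = EntityX - AgentX;
	Features.RelY = EntityY - AgentY;
	Features.Distance = std::sqrt(Features.RelX * Features.RelX +
	                              Features.RelY * Features.RelY);
	Features.IsObstacle = bIsObstacle ? 1.0f : 0.0f;
	return Features;
}
```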

The reason we don't have [ElementNum][ElementNum][AttentionHeadNum] attention elements is that the attention mechanism we are using summarizes along the element dimension to produce a single output. With a standard transformer that goes from sequence to sequence (or more specifically from “set” to “set”), you would indeed have [ElementNum][ElementNum][AttentionHeadNum] attention values. However, we are going from a sequence (or “set”) to a single output, so we don't need to compute the full pairwise attention matrix.
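Roughly, per head, the aggregation amounts to something like this (a sketch of the idea, not the exact engine code):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Set-to-vector attention pooling for one head: softmax the per-element
// scores along the element dimension, then take the weighted sum of the
// values. The whole set collapses to a single output vector.
std::vector<float> AggregatePool(
	const std::vector<float>& Scores,              // [ElementNum]
	const std::vector<std::vector<float>>& Values) // [ElementNum][ValueDim]
{
	const size_t ElementNum = Scores.size();
	if (ElementNum == 0)
	{
		return {};
	}
	const size_t ValueDim = Values[0].size();

	// Numerically stable softmax over the element dimension.
	const float MaxScore = *std::max_element(Scores.begin(), Scores.end());
	std::vector<float> Weights(ElementNum);
	float Sum = 0.0f;
	for (size_t E = 0; E < ElementNum; ++E)
	{
		Weights[E] = std::exp(Scores[E] - MaxScore);
		Sum += Weights[E];
	}

	// Weighted sum of values: one output vector, no pairwise matrix needed.
	std::vector<float> Output(ValueDim, 0.0f);
	for (size_t E = 0; E < ElementNum; ++E)
	{
		for (size_t D = 0; D < ValueDim; ++D)
		{
			Output[D] += (Weights[E] / Sum) * Values[E][D];
		}
	}
	return Output;
}
```

(This style of set summarization is often referred to as attention pooling, if you want a term to search for.)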

I hope that makes sense!

Thanks,

Dan
