Hi Senyarik,
Thanks for taking a look at Learning Agents and digging so deep into it!
Yes, you are correct in your assessment of how it all works.
Right now we are not using a sophisticated position encoding at all for the Array Observation. I imagine that RoPE (or pretty much any other position encoding) would work better than what we have now, but we have not had the time to find the best default. I imagine at some point we will swap out the default position encoding used by the Array Observation for something better.
In the meantime, if you felt it could help, you could certainly replace the built-in Array Observation with your own which uses a more sophisticated position encoding. I would be very curious to know if that helps your situation!
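To make the RoPE idea concrete, here is a minimal NumPy sketch of a rotary position encoding applied to per-element features of an array observation. This is illustrative only, not the Learning Agents API; the function name and shapes are hypothetical:

```python
import numpy as np

def rope_encode(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply a RoPE-style rotary position encoding to array-element features.

    x: [ElementNum, FeatureDim] with FeatureDim even. Each consecutive pair
    of feature dimensions is rotated by an angle that grows with the
    element's position in the array, so relative order is preserved under
    dot products.
    """
    num_elements, dim = x.shape
    assert dim % 2 == 0, "RoPE expects an even feature dimension"

    positions = np.arange(num_elements)[:, None]              # [ElementNum, 1]
    freqs = base ** (-np.arange(0, dim, 2) / dim)[None, :]    # [1, FeatureDim/2]
    angles = positions * freqs                                # [ElementNum, FeatureDim/2]

    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                           # split features into pairs

    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                        # 2D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Example: encode an array observation of 8 entities with 16 features each.
encoded = rope_encode(np.random.randn(8, 16))
```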
In fact, for your situation you could consider using the “Set Observation” directly, but include the distance to each entity in that entity’s observation. Then you should not get any kind of jittering when the order of objects in the array changes (the Set Observation is independent of the order of its elements). The distance will act a little like a “position encoding” in this setup.
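Here is a small sketch of that idea, again just illustrative and not the actual Set Observation implementation; the function and shapes are hypothetical. The point is that the distance becomes a per-entity feature, so any order-invariant summary over the entities is unaffected by shuffling:

```python
import numpy as np

def set_observation(entity_positions: np.ndarray,
                    agent_position: np.ndarray) -> np.ndarray:
    """Build per-entity observations that include distance to the agent.

    entity_positions: [EntityNum, 3], agent_position: [3].
    Returns [EntityNum, 4]: relative position plus distance. Any
    permutation-invariant encoder (sum/mean pooling, attention, etc.)
    applied to these rows ignores entity order, while the distance feature
    carries the "how near is this?" signal a position encoding would
    otherwise provide.
    """
    relative = entity_positions - agent_position                  # [EntityNum, 3]
    distance = np.linalg.norm(relative, axis=-1, keepdims=True)   # [EntityNum, 1]
    return np.concatenate([relative, distance], axis=-1)

entities = np.random.randn(5, 3)
obs = set_observation(entities, np.zeros(3))

# Shuffling entity order changes the rows but not an order-invariant summary:
shuffled = obs[np.random.permutation(len(obs))]
assert np.allclose(obs.sum(axis=0), shuffled.sum(axis=0))
```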
The reason we don’t have [ElementNum][ElementNum][AttentionHeadNum] attention elements is that the attention mechanism we are using summarizes along the element dimension to create a single output. With a standard transformer, which goes from sequence to sequence (or more specifically from “set” to “set”), you would indeed have [ElementNum][ElementNum][AttentionHeadNum] attention values. However, we are going from a sequence (or “set”) to a single output, so we don’t need to compute the full pairwise attention matrix.
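A quick sketch of that shape difference, assuming a single learned query vector rather than one query per element (this is the general technique, not the exact Learning Agents code, and the names are hypothetical):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(elements: np.ndarray, query: np.ndarray,
                   num_heads: int) -> np.ndarray:
    """Summarize a set of elements into one output with multi-head attention.

    elements: [ElementNum, Dim], query: [Dim] (a single query vector, not one
    per element). The attention weights come out as
    [AttentionHeadNum][ElementNum] -- one score per element per head --
    rather than the [ElementNum][ElementNum][AttentionHeadNum] of full
    set-to-set self-attention, because there is only one query.
    """
    num_elements, dim = elements.shape
    head_dim = dim // num_heads
    k = elements.reshape(num_elements, num_heads, head_dim)      # keys per head
    v = k                                                        # reuse as values for brevity
    q = query.reshape(num_heads, head_dim)

    scores = np.einsum('hd,nhd->hn', q, k) / np.sqrt(head_dim)   # [Heads, ElementNum]
    weights = softmax(scores, axis=-1)
    pooled = np.einsum('hn,nhd->hd', weights, v)                 # [Heads, HeadDim]
    return pooled.reshape(dim)                                   # single output vector

out = attention_pool(np.random.randn(10, 32), np.random.randn(32), num_heads=4)
assert out.shape == (32,)
```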
I hope that makes sense!
Thanks,
Dan