I use the 5.5 plugin and it works really well for both Reinforcement Learning and Imitation Learning, so keep at it. I generally train with 64-128 agents at a time; for simple cubes you can probably do more. Make sure your rewards make sense… to keep things from blowing up, try to scale everything (rewards and NN inputs) to the -1 to 1 range if possible. I can watch it train for 10-15 mins and tell if my rewards aren't set up properly; if they're off I'll pause the training, tweak them, and restart without reinitializing to continue training, as long as it hasn't hit some weird local minimum.
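In C++ terms, the scaling I'm describing is basically the Map Range Clamped node; here's a minimal sketch, assuming you pick sensible per-game ranges (MinValue/MaxValue are your constants, not anything from the plugin):

```cpp
#include "Math/UnrealMathUtility.h"

// Minimal sketch: squash a world-space value into [-1, 1] before feeding it
// to the network or using it as a reward term. MinValue/MaxValue are per-game
// constants you choose, not Learning Agents plugin values.
float NormalizeToUnitRange(float Value, float MinValue, float MaxValue)
{
    // C++ equivalent of the Map Range Clamped Blueprint node.
    return FMath::GetMappedRangeValueClamped(
        FVector2f(MinValue, MaxValue), FVector2f(-1.0f, 1.0f), Value);
}
```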
I would say start simple: the cube shouldn't know its own XY location in world space, but it should know its location relative to the center of the board (normalized to roughly -1 to 1), and it should know the target's location relative to itself (normalized). Make sure your values are ego-centric to the agent, not world-relative. Add a reward for distance to target and a reward for distance to center. Training it to avoid specific walls will be more difficult and depends on your design intent… for that kind of thing, giving your agents 'whiskers' using line traces is a pretty good solution, something like the sketch below.
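Here's roughly what I mean, sketched in C++. Names like NormalizeBy, NumWhiskers, HalfAngleDeg, and Length are placeholders for whatever fits your setup, not plugin API:

```cpp
#include "CoreMinimal.h"
#include "GameFramework/Actor.h"
#include "Engine/World.h"

// Ego-centric observation sketch: a world point (board center, target, etc.)
// expressed in the agent's local frame and scaled toward [-1, 1].
// NormalizeBy is a hypothetical constant like the board's half-size.
FVector GetEgoCentricObs(const AActor* Agent, const FVector& WorldPoint, double NormalizeBy)
{
    const FVector Local =
        Agent->GetActorTransform().InverseTransformPositionNoScale(WorldPoint);
    return (Local / NormalizeBy).BoundToCube(1.0);
}

// 'Whisker' sketch: a fan of line traces ahead of the agent, each observation
// being the normalized hit distance (1.0 = nothing within Length).
TArray<float> GetWhiskerObs(const AActor* Agent, int32 NumWhiskers,
                            float HalfAngleDeg, float Length)
{
    TArray<float> Obs;
    FCollisionQueryParams Params;
    Params.AddIgnoredActor(Agent);

    for (int32 i = 0; i < NumWhiskers; ++i)
    {
        // Spread the whiskers evenly from -HalfAngleDeg to +HalfAngleDeg.
        const float Angle = (NumWhiskers > 1)
            ? FMath::Lerp(-HalfAngleDeg, HalfAngleDeg, float(i) / (NumWhiskers - 1))
            : 0.0f;
        const FVector Dir =
            Agent->GetActorForwardVector().RotateAngleAxis(Angle, FVector::UpVector);
        const FVector Start = Agent->GetActorLocation();

        FHitResult Hit;
        const bool bHit = Agent->GetWorld()->LineTraceSingleByChannel(
            Hit, Start, Start + Dir * Length, ECC_Visibility, Params);
        Obs.Add(bHit ? float(Hit.Distance) / Length : 1.0f);
    }
    return Obs;
}
```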
As an example, my rewards for a flight combat game are set up to reward staying within an altitude range, keeping the target in front (dot product with the forward vector), maintaining a distance to the target, and minimizing roll (I use Map Range Clamped a lot here to scale world values to reward values; rough sketch below). The agent controls yaw/pitch/roll/throttle as floats and a boolean 'fire weapon'. With that setup I can get somewhat workable agents trained in a couple of hours that fly around pretty well… for training more advanced behavior I use Imitation Learning, with surprisingly good results if you take care creating good training data. It honestly feels like magic when it works.
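For concreteness, here's a hedged sketch of what a per-step reward like that can look like in C++. The weights, ranges, and parameter names (MinAlt, MaxAlt, IdealDist) are made up for the example, not my actual tuned values:

```cpp
#include "CoreMinimal.h"
#include "GameFramework/Actor.h"

// Sketch of per-step reward shaping for a flight agent.
// All ranges and weights are illustrative placeholders, not tuned values.
float ComputeFlightReward(const AActor* Agent, const AActor* Target,
                          float MinAlt, float MaxAlt, float IdealDist)
{
    const FVector Loc = Agent->GetActorLocation();

    // Altitude band: 1 at the center of [MinAlt, MaxAlt], falling to 0 at the edges.
    const float AltCenter = 0.5f * (MinAlt + MaxAlt);
    const float AltHalf   = 0.5f * (MaxAlt - MinAlt);
    const float AltReward = FMath::GetMappedRangeValueClamped(
        FVector2f(0.0f, AltHalf), FVector2f(1.0f, 0.0f),
        FMath::Abs(float(Loc.Z) - AltCenter));

    // Keep the target in front: dot of forward vector with direction to target,
    // +1 dead ahead, -1 directly behind.
    const FVector ToTarget = (Target->GetActorLocation() - Loc).GetSafeNormal();
    const float FacingReward =
        float(FVector::DotProduct(Agent->GetActorForwardVector(), ToTarget));

    // Maintain distance: 1 at IdealDist, falling to 0 when off by IdealDist or more.
    const float Dist = float(FVector::Dist(Loc, Target->GetActorLocation()));
    const float DistReward = FMath::GetMappedRangeValueClamped(
        FVector2f(0.0f, IdealDist), FVector2f(1.0f, 0.0f),
        FMath::Abs(Dist - IdealDist));

    // Minimize roll: 0 when level, approaching -1 when banked toward 180 degrees.
    const float RollPenalty =
        -FMath::Abs(float(Agent->GetActorRotation().Roll)) / 180.0f;

    // Average the terms so the total stays roughly in [-1, 1],
    // in line with the scaling advice above.
    return 0.25f * (AltReward + FacingReward + DistReward + RollPenalty);
}
```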