I followed the tutorial to create my own spaceship control Agent. During the Reset Agent Episode step, because my agent uses PhysicsConstraints and has SimulatePhysics enabled, I cannot complete the agent's position reset within a single tick; the engine needs a few ticks to settle at the desired position.
During this reset, LearningAgents continues to interact with the agent, leading to inconsistent sample data. For example, the observation in GatherObservation is based on the pre-reset position, while the reward in GatherReward is calculated from the post-reset position.
I attempted to implement my own AgentStatus control and added logic to callbacks like Gather Agent Observation to check whether sampling should be performed; for instance, if AgentStatus == PAUSE, these callbacks return early without sampling (see the sketch below). However, ProcessExperience currently requires the iteration counters for Observation, Action, Reward, and Completion to match, and without direct control over those counters this approach frequently produces Non-matching Iteration Number errors.
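Roughly, the gating I attempted looks like this (a minimal C++ sketch: EAgentStatus and AgentStatusByAgentId are my own bookkeeping, not part of LearningAgents, and I've abbreviated the GatherAgentObservation override signature since it varies between engine versions):

```cpp
#include "LearningAgentsInteractor.h"

// My own per-agent bookkeeping, not part of LearningAgents.
enum class EAgentStatus : uint8 { Active, Paused };
TMap<int32, EAgentStatus> AgentStatusByAgentId;

// In my ULearningAgentsInteractor subclass (override signature abbreviated;
// it differs between engine versions):
void USpaceshipInteractor::GatherAgentObservation(/* ... */ const int32 AgentId)
{
    // Skip sampling while this agent is mid-reset...
    if (AgentStatusByAgentId[AgentId] == EAgentStatus::Paused)
    {
        // ...but nothing else advances this agent's observation iteration
        // counter, so it falls behind Action/Reward/Completion and
        // ProcessExperience raises the Non-matching Iteration Number error.
        return;
    }
    // Gather position/velocity observations as usual.
}
```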
I also tried replacing the reset with RemoveAgent/AddAgent, but encountered similar issues during agent initialization (sketched below). Specifically:
If I wait for the agent's position initialization to complete before adding the agent, I cannot guarantee the agent starts at the beginning of a new sampling cycle, which leads to a Non-Matching Iteration Number error.
If I add the agent before the position initialization completes, the sampled data is polluted by the in-flight reset.
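For reference, that second attempt looked roughly like this (a C++ sketch: TeleportToSpawn and the timer delay are my own placeholders for the multi-tick physics reset, while AddAgent/RemoveAgent follow the 5.3/5.4-era ULearningAgentsManager API):

```cpp
#include "LearningAgentsManager.h"

// Sketch of the remove/re-add attempt. TeleportToSpawn and the 0.1 s delay
// are placeholders for my multi-tick physics reset.
void ASpaceship::BeginEpisodeReset()
{
    Manager->RemoveAgent(AgentId); // stop sampling this agent during the reset
    TeleportToSpawn();             // hypothetical: kicks off the physics reset

    // Wait a few ticks for the PhysicsConstraints to settle before re-adding.
    GetWorldTimerManager().SetTimer(ResetTimerHandle,
        FTimerDelegate::CreateLambda([this]()
        {
            // Problem: the manager may be mid-iteration by now, so the
            // re-added agent doesn't start at the beginning of a sampling
            // cycle -> Non-Matching Iteration Number error. Re-adding any
            // earlier samples the half-finished reset instead.
            AgentId = Manager->AddAgent(this);
        }),
        0.1f, /*bLoop=*/ false);
}
```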
In summary, would it be possible to add a control function at the agent level for Gathering/Performing? This would let developers implement simple sampling-control logic directly, such as pausing, starting fresh at the beginning of the next cycle, or discarding the current cycle.
This would greatly help in managing agent interactions.
Thanks, this is valuable feedback as I haven’t experimented with a use-case like this yet.
Have you tried using the "lower-level" API provided by the trainer? E.g. here's a training program I set up in my manager for playing both sides of a Connect Four-like game.
Instead of calling RunTraining/RunInference, you can call functions like:
Begin Training
Gather Completions/Rewards
Process Experience
etc.
Then have your on/off control logic here in the manager instead of inside the GatherObservations function (i.e. the manager should control who is training, not the agents).
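In C++ terms the manual loop might look roughly like this (a sketch against the 5.3/5.4-era ULearningAgentsTrainer / ULearningAgentsPolicy API; bAgentIsResetting is a hypothetical flag your manager maintains, and I've left the BeginTraining settings arguments at their defaults since parameters vary by version):

```cpp
#include "LearningAgentsTrainer.h"
#include "LearningAgentsPolicy.h"

// Rough sketch of a manual training tick in the manager.
void ASpaceshipManager::Tick(float DeltaTime)
{
    Super::Tick(DeltaTime);

    if (!Trainer->IsTraining())
    {
        Trainer->BeginTraining(); // settings arguments left at defaults
        return;
    }

    // The on/off control lives here in the manager, not in GatherObservations:
    // skip the entire gather/process/inference step while a reset is in flight.
    if (bAgentIsResetting)
    {
        return;
    }

    Trainer->GatherCompletions();
    Trainer->GatherRewards();
    Trainer->ProcessExperience();
    Policy->RunInference();
}
```

Because the whole step is skipped together, the Observation/Action/Reward/Completion iteration counters all stand still at the same time, which should avoid the Non-matching Iteration Number error.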
Let me know if (and why) that doesn't work if you can; otherwise I'll think about your use case some more and see if I can't find a way to make it more convenient for you.
I talked to my colleague who also trains agents using physics, and his solution is to use the training settings to trim the first few samples from each episode:
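Roughly like the following sketch (the field names are recalled from a 5.4-era build, so verify the exact names, and which settings struct they live on, in your engine version):

```cpp
#include "LearningAgentsTrainer.h"

// Sketch: drop the first few steps of every episode so samples taken while
// the physics is still settling never reach training. Field names recalled
// from a 5.4-era build; verify against your engine version.
FLearningAgentsTrainerSettings TrainerSettings;
TrainerSettings.NumberOfStepsToTrimAtStartOfEpisode = 5; // ticks the reset needs to settle
TrainerSettings.NumberOfStepsToTrimAtEndOfEpisode = 0;
// Pass TrainerSettings when constructing the trainer (e.g. via MakeTrainer).
```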
A related question: how do you handle invalid moves? (Although perhaps Connect Four doesn't have invalid moves.) There doesn't seem to be support for masking valid moves in the framework, and I'm having a hard time just getting my agent to learn what an invalid move is.
Part of the issue is that the invalid move happens at Run Inference (so at the end of the training chain: completion, reward, process experience, run inference). Now I'm in a situation where the agent tried to make an invalid move, so the game cannot proceed. I would catch this in a completion, but the completion doesn't run until the next time the agent tries to move (and before that the other player, which is just random, also needs to move).
Do you have suggestions on how to deal with this situation in a clean way?
Masking is coming in 5.6. It’s currently available on UE5-Main if you compile from source.
We still need to finish testing it, so it may contain bugs, but it's "code complete" and has passed some preliminary testing.
Not sure what you should do in the meantime. The best you could do with the tools available would be to Terminate the episode with a negative reward as a penalty; the policy should eventually learn not to make invalid moves, but that's not a great solution (IMHO).
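For what it's worth, that workaround might be sketched like this (assuming the 5.4-style GatherAgentReward/GatherAgentCompletion overrides; bInvalidMoveByAgentId and ComputeGameReward are hypothetical, and the enum value names should be verified against your engine version):

```cpp
#include "LearningAgentsTrainer.h"

// Hypothetical flag set by the game when PerformAgentAction produces an
// illegal move; cleared when the episode resets.
TMap<int32, bool> bInvalidMoveByAgentId;

void UConnectFourTrainer::GatherAgentReward(float& OutReward, const int32 AgentId)
{
    OutReward = bInvalidMoveByAgentId[AgentId]
        ? -1.0f                       // flat penalty for the illegal move
        : ComputeGameReward(AgentId); // hypothetical normal win/loss reward
}

void UConnectFourTrainer::GatherAgentCompletion(ELearningAgentsCompletion& OutCompletion, const int32 AgentId)
{
    // Terminate immediately so the agent never keeps acting from the
    // stuck game state the invalid move produced.
    OutCompletion = bInvalidMoveByAgentId[AgentId]
        ? ELearningAgentsCompletion::Termination
        : ELearningAgentsCompletion::Running;
}
```

The key to the "completion only runs on the agent's next turn" problem is to stop advancing the game as soon as the invalid move is detected: leave the board untouched and skip the random opponent's move, so the next trainer step exists only to deliver the penalty and the termination.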