Learning Agents training randomly fails

Hi, I’m using the Learning Agents plugin and trying to build a simple task for my agent: trace a target in the map. However, the training process randomly fails with the following output:

LogLearning: Display: Training Process: Creating Replay Buffer...
LogLearning: Display: Training Process: Creating Networks...
LogLearning: Display: Training Process: Sending Policy...
LogLearning: Display: Training Process: Creating Optimizer...
LogLearning: Display: Training Process: Creating PPO Policy...
LogLearning: Display: Training Process: Opening TensorBoard...
LogLearning: Display: Training Process: Begin Training...
LogLearning: Display: Training Process: Profile| Pull Experience           157571ms
LogLearning: Display: Training Process: Traceback (most recent call last):
LogLearning: Display: Training Process:   File "D:\softwares\unreal\UE_5.3\Engine\Plugins\Experimental\LearningAgents\Content\Python\train_ppo.py", line 361, in <module>
LogLearning: Display: Training Process:     train_ppo(config, trainer)
LogLearning: Display: Training Process:   File "D:\softwares\unreal\UE_5.3\Engine\Plugins\Experimental\LearningAgents\Content\Python\train_ppo.py", line 199, in train_ppo
LogLearning: Display: Training Process:     assert response == UE_RESPONSE_SUCCESS
LogLearning: Display: Training Process: AssertionError
LogLearning: Warning: Training Process finished with warnings or errors
LogLearning: Display: BP_MLManager_C_UAID_482AE33EAAE729A501_1524821621: Resetting Agents [0].
LogLearning: Error: NewLearningAgentsTrainer: Error waiting for policy from trainer. Check log for errors.
LogLearning: Display: NewLearningAgentsTrainer: Stopping training...
LogLearning: Display: NewLearningAgentsTrainer: Sending / Receiving initial policy...

Any help? Much appreciated!


Hello,

Sorry to see that you’re running into a problem. It looks like the Python process is having an issue receiving the training data from the UE process.

LogLearning: Display: Training Process: Profile| Pull Experience 157571ms

This time is how long the Python process waited to receive experience from UE, which includes the time for the game to run and gather that experience. At roughly two and a half minutes, this is a really long time, which is suspicious.

I’m guessing you are running into a timeout.
Can you add a print(response) at line 195 in “D:\softwares\unreal\UE_5.3\Engine\Plugins\Experimental\LearningAgents\Content\Python\train_ppo.py” so we can figure out whether it’s a timeout or some other unexpected error?
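The idea is simply to surface the failing response code before the assert fires. A minimal sketch (the constant value and helper function below are placeholders for illustration, not the plugin’s actual definitions):

```python
# Hypothetical sketch of the debug print described above. The constant's
# value is a placeholder; the real response codes are defined in the
# plugin's Python sources.
UE_RESPONSE_SUCCESS = 0

def check_response(response):
    """Print the raw response code, then assert success as train_ppo.py does."""
    print(response)  # shows up in the UE log, revealing which code failed
    assert response == UE_RESPONSE_SUCCESS
    return True

# A successful pull passes the check; any other code now prints itself
# to the log before raising AssertionError.
check_response(0)
```

With the print in place, the log will show the numeric code right before the AssertionError, which tells you which failure case you hit.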

Here is my initial advice:

  • Increase data gathering speed by using multiple agents, e.g. duplicate the game world
  • Consider decreasing the max step num. The default is 300, which is about 5 seconds in a 60 fps game where the trainer is ticking every frame.

Also, check out Learning to Drive | Tutorial if you have not already.

Let me know if you continue to run into issues or if it does not appear to be a timeout.

Thanks,
Brendan

Thanks @Deathcalibur, it prints 4, which according to the response-code definitions is a timeout.

Increase data gathering speed by using multiple agents, e.g. duplicate the game world

Do you mean add more agents to the map and increase the Max Agent Num in the Learning agents manager?

  • Consider decreasing the max step num. The default is 300, which is about 5 seconds in a 60 fps game where the trainer is ticking every frame.

Where should I change this value? I’ve been looking for this parameter for a while, because I found that the agents were always reset after 300 steps rather than at the steps I defined in the Event Set Completions.

Many thanks for your work building this training environment. I’ve been following the thread for a while, and, sure, I’ve read your tutorial :smile:

Confirmed that adding more agents helps! With only one agent in the map, training failed during the first epoch. With four agents added, training has continued (reaching 3k iterations). However, in previous experiments, I recall that training still failed at random points despite having multiple agents.

Ok that’s good to know it is a timeout. That starts to get us somewhere :smiley:

Where should I change this value? I’ve been looking for this parameter for a while, because I found that the agents were always reset after 300 steps rather than at the steps I defined in the Event Set Completions.

This is in the Trainer Settings which is a struct you pass into the Setup Trainer node during BeginPlay. But I’m guessing this isn’t the issue if you haven’t touched it. I was concerned that perhaps you had increased it significantly.

So the way that training data is sent to the python process is that the gathered experience is put into a buffer, and when that buffer is full, that triggers the training iteration. The “fullness” is controlled by the MaxEpisodeNum or the MaxStepNum on the Trainer Settings:
(screenshot: Trainer Settings struct)

If you’re using the default settings, you’ll typically hit the Max Recorded Steps Per Iteration once you have ~33 episodes (10000 / 300, assuming every episode runs to the max step count). I think you’re hitting a timeout from this taking too long, although I don’t remember off the top of my head. By adding more agents, you significantly speed up the time it takes to gather those episodes.
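As a back-of-the-envelope check of the arithmetic above (a sketch assuming the defaults: 10000 recorded steps per iteration, 300 steps per episode, 60 fps, one step per agent per tick):

```python
# Rough arithmetic for one training iteration's experience gathering,
# using the assumed defaults discussed above.
max_recorded_steps = 10000  # Max Recorded Steps Per Iteration
max_step_num = 300          # steps per episode before reset
fps = 60                    # game frames (trainer ticks) per second

episodes_per_iteration = max_recorded_steps / max_step_num

def seconds_to_fill(num_agents):
    """Wall-clock seconds to fill the buffer, assuming every episode
    runs to the max step count and each agent adds one step per tick."""
    return max_recorded_steps / (fps * num_agents)

print(round(episodes_per_iteration))  # ~33 episodes
print(round(seconds_to_fill(1)))      # ~167 s with one agent
print(round(seconds_to_fill(4)))      # ~42 s with four agents
```

Under these assumptions, a single agent needs roughly 167 seconds per iteration, which lines up with the ~157 s Pull Experience time in the log and would explain a timeout; four agents cut that to around 42 seconds.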

Let me know if you run into further issues but it sounds like you’re back on track?

Brendan

Yes, after adding more agents, the crash disappeared! Many thanks for your help and detailed explanation!


Hey, same error on my side. I’m trying to get around it: I tried adding more agents, up to 30, and I also tried decreasing the steps to 100. I don’t understand what I am doing wrong here. If it counts for future updates, this really needs improvement.

LogLearning: Display: Training Process: Creating Replay Buffer...
LogLearning: Display: Training Process: Creating Networks...
LogLearning: Display: Training Process: Sending Policy...
LogLearning: Display: Training Process: Creating Optimizer...
LogLearning: Display: Training Process: Creating PPO Policy...
LogLearning: Display: Training Process: Opening TensorBoard...
LogLearning: Display: Training Process: Begin Training...
LogLearning: Display: Training Process: Profile| Pull Experience 155577ms
LogLearning: Display: Training Process: Traceback (most recent call last):
LogLearning: Display: Training Process: File "C:\Program Files\Epic Games\UE_5.3\Engine\Plugins\Experimental\LearningAgents\Content\Python\train_ppo.py", line 361, in <module>
LogLearning: Display: Training Process: train_ppo(config, trainer)
LogLearning: Display: Training Process: File "C:\Program Files\Epic Games\UE_5.3\Engine\Plugins\Experimental\LearningAgents\Content\Python\train_ppo.py", line 199, in train_ppo
LogLearning: Display: Training Process: assert response == UE_RESPONSE_SUCCESS
LogLearning: Display: Training Process: AssertionError

There’s also something in the logs that I’m trying to understand and that looks like it shouldn’t be here:

LogAutomationController: Ignoring very large delta of 2.72 seconds in calls to FAutomationControllerManager::Tick() and not penalizing unresponsive tests

What is the relation of this to the training process?

Hi, I faced the same issue. I checked all the data I send to the framework, such as the rewards and observations, and it turned out I was sending NaN values. I recommend verifying that all your data is valid during SetReward and SetObservations. I suspect a Python crash behind the scenes causes a timeout while processing the replay buffer.
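A minimal sketch of that kind of validation in plain Python (a hypothetical helper; in practice the checks would sit wherever you assemble your observation and reward values before SetObservations / SetReward):

```python
import math

def validate(values, label="value"):
    """Raise early if any entry is NaN or infinite, instead of letting the
    trainer silently receive bad data and stall."""
    for i, v in enumerate(values):
        if not math.isfinite(v):
            raise ValueError(f"{label}[{i}] is not finite: {v!r}")
    return values

# Usage: validate before handing data to the trainer.
validate([0.5, -1.2, 3.0], "observation")      # passes
# validate([0.5, float("nan")], "reward")      # would raise ValueError
```

Failing fast at the point where the bad value is produced makes the root cause visible immediately, rather than surfacing later as an opaque timeout in the Python trainer.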

I just had to increase the batch size; I maxed it out to 4000 and then I could play with the parameters however I wanted. But I could not make the agent learn anything. I struggled for around 3 weeks, more than I should have anyway. I feel this lacks sufficient documentation: I managed to implement the training by looking through the plugin code and comments, with help from GPT. Adjusting the training params for my case, however: impossible. Not even close.

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.