Learning agent training failed

xrotted · June 9, 2024, 11:24am

I just set up Learning Agents again after upgrading from UE 5.3 to 5.4, and I get this error when I start training. Where can I find the log with the errors?

LogLearning: Display: Training Process: Begin Training...
LogLearning: Display: Training Process: Profile| Pull Experience 90537ms
LogLearning: Display: Training Process: Profile| PPO load tensors 2ms
LogLearning: Display: Training Process: Profile| PPO gae 9ms
LogLearning: Display: Training Process: Profile| PPO log prob 354ms
LogLearning: Display: Training Process: Profile| Training 1890ms
LogLearning: Display: Training Process: Traceback (most recent call last):
LogLearning: Display: Training Process: File "G:\unreal engine 5\UE_5.4\Engine\Plugins\Experimental\LearningAgents\Content\Python\train_ppo.py", line 393, in <module>
LogLearning: Display: Training Process: train_ppo(config, trainer)
LogLearning: Display: Training Process: File "G:\unreal engine 5\UE_5.4\Engine\Plugins\Experimental\LearningAgents\Content\Python\train_ppo.py", line 217, in train_ppo
LogLearning: Display: Training Process: stats = ppo_trainer.train(
LogLearning: Display: Training Process: ^^^^^^^^^^^^^^^^^^
LogLearning: Display: Training Process: File "G:\unreal engine 5\UE_5.4\Engine\Plugins\Experimental\LearningAgents\Content\Python\ppo.py", line 466, in train
LogLearning: Display: Training Process: logp = self.compute_old_logp(obs, mem, act)
LogLearning: Display: Training Process: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
LogLearning: Display: Training Process: File "G:\unreal engine 5\UE_5.4\Engine\Plugins\Experimental\LearningAgents\Content\Python\ppo.py", line 225, in compute_old_logp
LogLearning: Display: Training Process: logp = schema_log_prob(self.action_schema, act_dist, act)
LogLearning: Display: Training Process: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
LogLearning: Display: Training Process: File "G:\unreal engine 5\UE_5.4\Engine\Plugins\Experimental\LearningAgents\Content\Python\train_common.py", line 1282, in schema_log_prob
LogLearning: Display: Training Process: total[elem_indices[ei]] += schema_log_prob(
LogLearning: Display: Training Process: ^^^^^^^^^^^^^^^^
LogLearning: Display: Training Process: File "G:\unreal engine 5\UE_5.4\Engine\Plugins\Experimental\LearningAgents\Content\Python\train_common.py", line 1251, in schema_log_prob
LogLearning: Display: Training Process: total += schema_log_prob(
LogLearning: Display: Training Process: RuntimeError: output with shape [5009] doesn't match the broadcast shape [5009, 5009]
LogLearning: Warning: Training Process finished with warnings or errors
LogLearning: Error: Trainer_0: Error waiting for policy from trainer: Unexpected communication received. Check log for additional errors.
LogLearning: Display: Trainer_0: Stopping training...
LogLearning: Error: Trainer_0: Training has failed. Check log for errors.
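On the question of where the log lives: an Unreal project typically writes its editor log to the project's `Saved/Logs` folder, and the same `LogLearning` lines also appear in the Output Log window. As a rough sketch (the function name `summarize_learning_log` is hypothetical, not part of Learning Agents), a few lines of Python can pull the Python exception and the `LogLearning` errors out of such a log:

```python
import re

# Lines in the log file may carry a timestamp prefix, so search instead of
# anchoring at the start of the line. "Training Process: " is optional because
# only output forwarded from the Python trainer has it.
LEARNING_LINE = re.compile(
    r"LogLearning: (Display|Warning|Error): (?:Training Process: )?(.*)$"
)
# A Python exception line looks like "SomeError: details".
EXCEPTION_LINE = re.compile(r"^\w+(Error|Exception): ")


def summarize_learning_log(lines):
    """Return (last Python exception line or None, list of LogLearning errors)."""
    exception = None
    errors = []
    for line in lines:
        match = LEARNING_LINE.search(line.strip())
        if match is None:
            continue  # not a LogLearning line
        level, message = match.groups()
        if level == "Error":
            errors.append(message)
        elif EXCEPTION_LINE.match(message):
            exception = message
    return exception, errors
```

Any iterable of strings works as input, so an open file handle for the project's `.log` file can be passed directly.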

Deathcalibur · June 10, 2024, 1:36pm

Hello,

Thanks for reporting this issue. With the information provided, this looks like an actual Learning Agents bug to me and not an error on your end.

Can you share some details of your interactor's observation and action code? I think just the Specify function implementations will be sufficient.

I will see if I can reproduce this error on my end and let you know if there is a workaround.

Thanks,
Brendan

xrotted · June 11, 2024, 1:01am

Thanks for the reply, here are the screenshots
Specify agent observation: [screenshot not preserved]

Specify agent actions: [screenshot not preserved]

Deathcalibur · June 11, 2024, 2:05pm

Hey,

Thanks for reporting this issue! We are making a bug fix to the Python code. Apparently you found a combination of actions we had not tested!

In the meantime, you can edit the following lines in \Engine\Plugins\Experimental\LearningAgents\Content\Python\train_common.py:

Line 1133 to:

    if act_type == 'Null':
        return torch.zeros([len(act_dist)], device=act_dist.device)

Line 1225 to:

    if act_type == 'Null':
        return torch.zeros([len(act_dist)], device=act_dist.device)

Line 1342 to:

    if act_type == 'Null':
        return torch.zeros([len(act_dist)], device=act_dist.device)

Hopefully that should fix the issue for you!

Thanks,
Brendan
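For readers wondering why the error message mentions `[5009]` and `[5009, 5009]`: an in-place `total += x` requires that broadcasting `total` against `x` leaves `total`'s shape unchanged. The sketch below restates that rule in plain Python so it runs without PyTorch. The helper names are hypothetical, and the `[5009, 1]` shape is only an assumption about what the unpatched `'Null'` branch may have returned; the thread does not show the original lines.

```python
def broadcast_shape(a, b):
    """Broadcast two shapes using the NumPy/PyTorch rule (align from the right)."""
    result = []
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1  # missing leading dims count as 1
        db = b[-i] if i <= len(b) else 1
        if da != db and da != 1 and db != 1:
            raise ValueError(f"shapes {a} and {b} are not broadcastable")
        result.append(db if da == 1 else da)
    return tuple(reversed(result))


def check_inplace_add(out_shape, other_shape):
    """Mimic the shape check that an in-place `out += other` performs."""
    target = broadcast_shape(tuple(out_shape), tuple(other_shape))
    if target != tuple(out_shape):
        raise RuntimeError(
            f"output with shape {list(out_shape)} doesn't match "
            f"the broadcast shape {list(target)}"
        )
    return target


batch = 5009
# Patched case: a 1-D zeros tensor of length `batch` adds cleanly.
check_inplace_add((batch,), (batch,))
# Assumed unpatched case: an extra trailing dimension breaks the in-place add.
try:
    check_inplace_add((batch,), (batch, 1))
except RuntimeError as err:
    message = str(err)
```

Under that assumption, returning `torch.zeros([len(act_dist)], ...)` works because its shape is exactly `[batch]`, matching the running `total`.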

xrotted · June 11, 2024, 2:38pm

That worked, no errors now. Thanks!
As a side note, every time training starts or an episode ends, Unreal Engine freezes for about 5-10 seconds. Is this normal? It doesn't freeze like that in my 5.3 project.

Deathcalibur · June 11, 2024, 6:00pm

I think it's because you're using the exclusive union layers: your network is a bit slower to train, and training is currently a blocking call.

In 5.5, we will be supporting async calls to the training process so the UE editor won’t hang anymore.

Brendan
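To illustrate the blocking versus async distinction in general terms, here is a minimal sketch in plain Python. It is not the Learning Agents API; `train_step` and `run_non_blocking` are hypothetical stand-ins. The slow call runs on a worker thread while the main loop keeps ticking, which is the behavior a blocking call prevents.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def train_step(release: threading.Event) -> str:
    # Hypothetical stand-in for a slow training iteration: it does not
    # finish until the event is set.
    release.wait()
    return "policy-updated"


def run_non_blocking(frames_before_done: int = 3):
    """Tick 'frames' on the main thread while training runs in the background.

    frames_before_done must be at least 1.
    """
    release = threading.Event()
    ticks = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(train_step, release)
        while not future.done():
            ticks += 1  # the main loop stays responsive here
            if ticks == frames_before_done:
                release.set()    # let the simulated training finish
                future.result()  # wait for it so the loop exits cleanly
    return ticks, future.result()
```

A blocking version would call `train_step` directly on the main thread, so no frames could tick until it returned.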

xrotted · June 12, 2024, 12:22am

Ah, I see, thanks.
