Possible Learning Agents bug

Hi all, and @Deathcalibur (Brendan Mulcahy) if you watch these threads.

I’m currently using Learning Agents to create a flying AI that can fly to randomly chosen goals.
In my example the goals are glowing sticks randomly placed around the level. The agents also randomly spawn at one of about 80 spawn points.

Since there’s no well-defined end to an episode in my case, I just set the step limit to a fixed number and let the episode end there; by that point enough agents have completed some goals and the next iteration can begin.

What started happening is that I’d get a crash when the step limit was about 5000, so I lowered it to 2300, then 1000, and each time I lower it the crashes eventually come back. Training ran for about 3 hours from initialisation without problems, but now it seems to crash within a minute or two of resuming training.

Reinitializing and starting training anew does not result in a crash; using my pretrained data asset, however, does. So something must go wrong as training goes on in my case.

I’m training on CPU as I have an AMD GPU. (Feature request: can we get ROCm support through WSL2? I have ROCm on my WSL2 Ubuntu install, and it works pretty seamlessly for any PyTorch applications I’ve worked with after modifying the device to be DirectML.)

I currently have 160 agents training at once, though this network was trained from initialisation with 200 agents without problems.

Here is a video of the crash circumstances.

As you can see, I’m just spectating one of the 160 agents, so the actual perspective might not show anything interesting, but I assure you that if I reinitialize the network it probably won’t crash for at least 4 or so hours. This crash is somehow related to the network once it has some training behind it.

Here is the text from the crash report:
LoginId:7cf9ecfe470cb61e695978a151311948
EpicAccountId:2f350cd904834d6aa4c6658c48ba984f

Assertion failed: FMath::IsFinite(View.GetData()[Idx]) && View.GetData()[Idx] != (3.402823466e+38F) && View.GetData()[Idx] != -(3.402823466e+38F) [File:D:\build++UE5\Sync\Engine\Plugins\Experimental\LearningAgents\Source\Learning\Public\LearningArray.h] [Line: 347] Invalid value -inf found at flat array index 0

UnrealEditor_Learning
UnrealEditor_LearningAgents
UnrealEditor_LearningAgents
UnrealEditor_LearningAgentsTraining
UnrealEditor_LearningAgentsTraining
UnrealEditor_CoreUObject
UnrealEditor_CoreUObject
UnrealEditor_CoreUObject
UnrealEditor_CoreUObject
UnrealEditor_CoreUObject
UnrealEditor_CoreUObject
UnrealEditor_CoreUObject
UnrealEditor_CoreUObject
UnrealEditor_CoreUObject
UnrealEditor_CoreUObject
UnrealEditor_Engine
UnrealEditor_Engine
UnrealEditor_Engine
UnrealEditor_Engine
UnrealEditor_Engine
UnrealEditor_Engine
UnrealEditor_Engine
UnrealEditor_Core
UnrealEditor_Core
UnrealEditor_Core
UnrealEditor_Engine
UnrealEditor_Engine
UnrealEditor_Engine
UnrealEditor_Engine
UnrealEditor_UnrealEd
UnrealEditor_UnrealEd
UnrealEditor
UnrealEditor
UnrealEditor
UnrealEditor
UnrealEditor
UnrealEditor
kernel32
ntdll

Hey,

I’m still learning myself, so I may be way off here. 160 agents is a lot for your PC to process. What is your frame rate while running the training? If you train just one agent, does that cause a crash?

Also, on watching your video again I notice you’re printing some large values in the log. Are you normalising your observations and reward? Some large values could be getting used, which could be why the assert comes from FMath::IsFinite, i.e. some value is blowing up:

FMath::IsFinite(View.GetData()[Idx]) && View.GetData()[Idx] != (3.402823466e+38F) && View.GetData()[Idx] != -(3.402823466e+38F)
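
One quick thing you could try is to log anything suspicious right before the values are handed over to the framework, so you can see which index is blowing up. This is just a rough sketch (the array here stands in for however you gather your observation/reward values):

```cpp
#include "CoreMinimal.h"

// Rough sketch: scan the values about to be fed in as observations (or used
// as rewards) and log anything non-finite or suspiciously large.
void CheckValuesBeforeTraining(const TArray<float>& Values)
{
    for (int32 Idx = 0; Idx < Values.Num(); ++Idx)
    {
        const float Value = Values[Idx];
        if (!FMath::IsFinite(Value))
        {
            UE_LOG(LogTemp, Error, TEXT("Non-finite value at index %d: %f"), Idx, Value);
        }
        else if (FMath::Abs(Value) > 10.0f) // anything this big is probably not normalised
        {
            UE_LOG(LogTemp, Warning, TEXT("Suspiciously large value at index %d: %f"), Idx, Value);
        }
    }
}
```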

I thought it could be the large number of agents, but this exact crash occurs even with 1 agent, with everything looking fairly standard.

Bottom line is you may be very close to the answer here.

The large values I print are probably the distance each agent is from a goal when it’s close enough (around 2000 units) for me to give it the reward. That value isn’t passed as an input to the network, nor is it an output.

However, your point about normalisation is an interesting one, and I hope the designer of the plugin makes it easier to detect normalisation issues in the future; it’s not well documented how the scale applies to the actions or the observations. I’ve taken to treating the scale as a divisor, so that if I know my input is in the range of 0 to 30k, the scale is 30k and the framework will just compute x/30k to normalise everything. Or perhaps I should just set the scale to 1 and do the normalising myself… That error is possibly related to some NaN being received by or output from the network. I’m not sure, but if the plugin designer sees the log I posted, I’m certain he’d be able to identify the issue instantly.
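
For what it’s worth, this is roughly what I mean by doing the normalising myself instead of relying on the scale (just a sketch; the helper name and range are my own):

```cpp
// Divide by the expected range and clamp, so nothing outside [-1, 1]
// (or a NaN/inf from bad input) ever reaches the network.
float NormaliseInput(float RawValue, float ExpectedRange)
{
    if (ExpectedRange <= KINDA_SMALL_NUMBER || !FMath::IsFinite(RawValue))
    {
        return 0.0f; // never pass NaN/inf through, and avoid dividing by zero
    }
    return FMath::Clamp(RawValue / ExpectedRange, -1.0f, 1.0f);
}

// e.g. for a distance known to be in the range 0 to 30k units:
// const float Normalised = NormaliseInput(DistanceToGoal, 30000.0f);
```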

Bottom line, normalisation issues cause crashes like this after some running.

After clamping every input to the interactions and the trainer to 1, I haven’t had a crash.
I’d request that the framework output some console warning when normalisation is inadequate.

This issue can be closed.

Never mind, it still crashes even if I clamp all the inputs. It is not a normalisation problem; there’s a real bug here.

This looks unrelated to the number of agents, I think. Does the crash report have the function calls for this part of the stack trace?

UnrealEditor_LearningAgents
UnrealEditor_LearningAgents
UnrealEditor_LearningAgentsTraining
UnrealEditor_LearningAgentsTraining

The first index contains a negative infinity, so some part of the math is probably going wrong in either an observation or an action. With the full stack trace I could help you pinpoint the error and tell you more.

Yeah, if your total map size is 30K then your scale should be 30000. Redesigning this to be clearer has come up before, so we will think about it some more.
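
Roughly speaking, with numbers purely for illustration:

```cpp
// With a map spanning about 30,000 units, Scale = 30000 means a raw world
// position gets divided down into roughly the [-1, 1] range before the
// network sees it.
const float Scale = 30000.0f;
const FVector RawPosition(15000.0f, -30000.0f, 600.0f);
const FVector Normalised = RawPosition / Scale; // ~(0.5, -1.0, 0.02)
```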

What are your first observation and first action, in addition to the stack trace? :smiley:

Hi Brendan,
I have found the source of the crash definitively.

Setting ActionNoiseMin and ActionNoiseMax to values other than the default of 0.25 results in that IsFinite math check failing, at least in my setup.

I’ve got pretty good results training my X-wings to fly to their objectives for now.

I would say that my observations and actions are quite vanilla.

Could you advise on some sensible ranges for ActionNoiseMin and ActionNoiseMax?
I think it started crashing when I adjusted them from 0.25 to 0.35.


Thanks for following up and investigating the issue with ActionNoiseMin/Max. I will see if I can reproduce the issue on my end and then figure out what to do about it.

For now, the default noise is probably fine. You would most likely want to set the noise to 0.0 when you are doing inference.
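
In case it helps, here’s a rough sketch of what I mean. I’m assuming the settings live in FLearningAgentsPolicySettings and that the fields are named ActionNoiseMin/ActionNoiseMax as we’ve been calling them here, so double-check against the Learning Agents headers in your engine version:

```cpp
// Sketch only: struct and field names are assumptions, check the Learning
// Agents headers for your engine version.
FLearningAgentsPolicySettings PolicySettings;

const bool bIsTraining = true; // set this however fits your own setup

if (bIsTraining)
{
    // The defaults (~0.25) are probably fine; values far from the default
    // were what triggered the -inf assert in this thread.
    PolicySettings.ActionNoiseMin = 0.25f;
    PolicySettings.ActionNoiseMax = 0.25f;
}
else
{
    // For inference you most likely want no exploration noise at all.
    PolicySettings.ActionNoiseMin = 0.0f;
    PolicySettings.ActionNoiseMax = 0.0f;
}
```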

I had the same problem when changing the value of this action noise.
