Hi all, and @Deathcalibur (Brendan Mulcahy) if you watch these threads.
I’m currently using Learning Agents to create a flying AI that can fly to randomly chosen goals.
In my example the goals are glowing sticks randomly placed around the level. The agents also randomly spawn at one of about 80 spawn points.
Since there’s no well-defined end to an episode in my case, I just set the step limit to a fixed number and let the episode end there, so that enough agents have completed some goals in that time and the next iteration can begin.
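To make the episode setup concrete, here is a minimal Python sketch of a fixed-length episode with no terminal state. It is only an illustration, not the Learning Agents API: `MAX_STEP_NUM`, `run_episode` and `goal_chance` are hypothetical names, and the random check stands in for the real flight simulation.

```python
import random

MAX_STEP_NUM = 1000  # hypothetical step limit; in practice tuned by hand


def run_episode(max_steps, goal_chance=0.002, seed=0):
    """Run one fixed-length episode and count how many goals were reached.

    There is no terminal state: reaching a goal only hands out a new one,
    and the episode is cut off when the step limit is hit.
    """
    rng = random.Random(seed)
    goals_reached = 0
    steps = 0
    while steps < max_steps:
        steps += 1
        # Stand-in for the simulation: a small chance per step that the
        # agent touches its current goal.
        if rng.random() < goal_chance:
            goals_reached += 1  # a new random goal would be picked here
    return steps, goals_reached


steps, goals = run_episode(MAX_STEP_NUM)
print(f"episode ended after {steps} steps with {goals} goals reached")
```

The point is just that the step limit is the only thing ending the episode, so it has to be long enough for some agents to reach a goal.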
What started happening is that I’d get a crash when the step limit was about 5000, so I lowered it to 2300, then to 1000, but each time I lower it the crash eventually comes back. Training ran for about 3 hours from initialisation without a problem, but now it seems to crash within a minute or two of resuming training.
Reinitializing and starting training anew does not result in a crash; using my pretrained data asset, however, does. So something must go wrong as training goes on in my case.
I’m training on the CPU as I have an AMD GPU. (Feature request: can we get ROCm support through WSL2? I have ROCm on my WSL2 Ubuntu install, and it works pretty seamlessly for any PyTorch applications I’ve worked with after modifying the device to be DirectML.)
I currently have 160 agents training at once; however, this network was trained from initialisation with 200 agents without problems.
Here is a video of the crash circumstances.
As you can see, I’m just spectating one of the 160 agents, so the actual perspective might not show anything interesting. But I assure you that if I reinitialize the network, it will probably not crash for at least 4 or so hours. This crash is somehow related to the network once it has some training behind it.
Here is the text from the crash report:
LoginId:7cf9ecfe470cb61e695978a151311948
EpicAccountId:2f350cd904834d6aa4c6658c48ba984f
Assertion failed: FMath::IsFinite(View.GetData()[Idx]) && View.GetData()[Idx] != (3.402823466e+38F) && View.GetData()[Idx] != -(3.402823466e+38F) [File:D:\build++UE5\Sync\Engine\Plugins\Experimental\LearningAgents\Source\Learning\Public\LearningArray.h] [Line: 347] Invalid value -inf found at flat array index 0
UnrealEditor_Learning
UnrealEditor_LearningAgents
UnrealEditor_LearningAgents
UnrealEditor_LearningAgentsTraining
UnrealEditor_LearningAgentsTraining
UnrealEditor_CoreUObject
UnrealEditor_CoreUObject
UnrealEditor_CoreUObject
UnrealEditor_CoreUObject
UnrealEditor_CoreUObject
UnrealEditor_CoreUObject
UnrealEditor_CoreUObject
UnrealEditor_CoreUObject
UnrealEditor_CoreUObject
UnrealEditor_CoreUObject
UnrealEditor_Engine
UnrealEditor_Engine
UnrealEditor_Engine
UnrealEditor_Engine
UnrealEditor_Engine
UnrealEditor_Engine
UnrealEditor_Engine
UnrealEditor_Core
UnrealEditor_Core
UnrealEditor_Core
UnrealEditor_Engine
UnrealEditor_Engine
UnrealEditor_Engine
UnrealEditor_Engine
UnrealEditor_UnrealEd
UnrealEditor_UnrealEd
UnrealEditor
UnrealEditor
UnrealEditor
UnrealEditor
UnrealEditor
UnrealEditor
kernel32
ntdll
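
For anyone trying to narrow this down, the condition in the assertion above can be mirrored in a few lines of Python. This is only a sketch of the check as the message spells it out (finite, and not equal to plus or minus 3.402823466e+38); `first_invalid_index` is a hypothetical helper, not something from Learning Agents, and I don’t know which array in the plugin is actually being checked.

```python
import math

FLT_MAX = 3.402823466e+38  # the float32 maximum named in the assertion


def first_invalid_index(values):
    """Return the flat index of the first invalid value, or None if all pass.

    A value counts as invalid if it is NaN, +/-inf, or equal to +/-FLT_MAX,
    which is the condition the assertion message spells out.
    """
    for idx, value in enumerate(values):
        if not math.isfinite(value) or abs(value) == FLT_MAX:
            return idx
    return None


# A -inf in the first slot is reported at index 0, as in the crash message.
print(first_invalid_index([float("-inf"), 0.25, 1.0]))
```

Running a check like this on whatever gets logged or exported during training (observations, rewards, network outputs) might show where the first non-finite value appears before the editor asserts.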