Course: Learning Agents (5.5)

Deathcalibur · November 14, 2024, 7:30pm

Get familiar with Learning Agents: a machine learning plugin for AI bots. Learning Agents allows you to train your NPCs via reinforcement & imitation learning. It can be used to create game-playing agents, physics-based animations, automated QA bots, and much more!

https://dev.epicgames.com/community/learning/courses/GAR/unreal-engine-learning-agents-5-5

FlameTheory · November 15, 2024, 7:05pm

Attempting to get this working on macOS.

I’ve switched the Spawn Shared Memory and Make Shared Memory nodes for Socket nodes but I am getting the error: MakeSocketCommunicator: Failed to connect to training process: Communication timeout. Check log for additional errors.

Any tips would be appreciated!

Deathcalibur · November 15, 2024, 9:58pm

Sorry to hear you are having trouble. I would make sure you are passing the same settings, in particular the same port, to both the training process and the communicator.

The socket stuff was a little janky on Mac. I had some issues where a slow server startup caused problems: for whatever reason, if the game tries to talk on the socket before the Python server has started up, it gets locked up forever. It’s really janky, but right now UE sleeps for 1 second to give the Python server time to start. This could be the issue. I plan on getting this fixed up when we have time. In the meantime, you could try sleeping a bit longer in the game after starting the Python server.
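
To illustrate the general idea of retrying instead of sleeping for a fixed time, here is a minimal Python sketch. It is not the plugin's code (the real client side lives in the engine's C++), and the function name `connect_with_retry` and its parameters are hypothetical:

```python
import socket
import time


def connect_with_retry(host: str, port: int, timeout_s: float = 30.0,
                       poll_s: float = 0.25) -> socket.socket:
    """Keep trying to connect until the server is listening or timeout_s elapses.

    Returns the connected socket; raises TimeoutError if the server never comes up.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # create_connection raises OSError (e.g. connection refused)
            # while nothing is listening on the port yet.
            return socket.create_connection((host, port), timeout=poll_s)
        except OSError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"No server on {host}:{port} after {timeout_s}s")
            time.sleep(poll_s)
```

The point of the pattern is that a slow-starting server only costs extra waiting time, rather than depending on a fixed 1 second delay being long enough.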

BTW, UE just got shared memory on Mac integrated so I can finally look to support that for UE 5.6.

FlameTheory · November 15, 2024, 11:49pm

Thanks for the info!

I made some progress by tweaking the port and timeout settings. I can now get the initial connection working somewhat consistently but the training then fails after about 20 seconds with:

Error waiting for policy from trainer: Communication timeout. Check log for additional errors.

vmartineau · November 17, 2024, 3:18pm

Hello, I updated my tutorial project from 5.4 to 5.5 and adapted the blueprints to match the 5.5 tutorial. When I try to launch, I get the following error:

LogLearning: Display: PPOTrainer_0: Sending config...
LogLearning: Display: Sending config signal...
LogLearning: Display: PPOTrainer_0: Sending initial policy...
LogLearning: Display: Subprocess: Traceback (most recent call last):
LogLearning: Display: Subprocess:   File "D:\Unreal\UE_5.5\Engine\Plugins\Experimental\LearningAgents\Content\Python\train.py", line 32, in <module>
LogLearning: Display: Subprocess:     module = import_module(trainer_module_name)
LogLearning: Display: Subprocess:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
LogLearning: Display: Subprocess:   File "D:\Unreal\UE_5.4\Engine\Binaries\ThirdParty\Python3\Win64\Lib\importlib\__init__.py", line 126, in import_module
LogLearning: Display: Subprocess:     return _bootstrap._gcd_import(name[level:], package, level)
LogLearning: Display: Subprocess:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
LogLearning: Display: Subprocess:   File "<frozen importlib._bootstrap>", line 1204, in _gcd_import
LogLearning: Display: Subprocess:   File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
LogLearning: Display: Subprocess:   File "<frozen importlib._bootstrap>", line 1147, in _find_and_load_unlocked
LogLearning: Display: Subprocess:   File "<frozen importlib._bootstrap>", line 690, in _load_unlocked
LogLearning: Display: Subprocess:   File "<frozen importlib._bootstrap_external>", line 940, in exec_module
LogLearning: Display: Subprocess:   File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
LogLearning: Display: Subprocess:   File "D:\Unreal\UE_5.5\Engine\Plugins\Experimental\LearningAgents\Content\Python\train_ppo.py", line 9, in <module>
LogLearning: Display: Subprocess:     import torch
LogLearning: Display: Subprocess:   File "D:\Unreal Projects\RLAgent1\Intermediate\PipInstall\Lib\site-packages\torch\__init__.py", line 148, in <module>
LogLearning: Display: Subprocess:     raise err
LogLearning: Display: Subprocess: OSError: [WinError 126] The specified module could not be found. Error loading "D:\Unreal Projects\RLAgent1\Intermediate\PipInstall\Lib\site-packages\torch\lib\fbgemm.dll" or one of its dependencies.
LogLearning: Error: PPOTrainer_0: Error sending policy to trainer: Unexpected communication received. Check log for additional errors.
LogLearning: Error: PPOTrainer_0: Training has failed. Check log for errors.

Is this error linked to my Learning Agents setup, or is it because I didn’t follow the tutorial properly? Looking at the message, I feel the problem is my setup, but I wouldn’t know how to force an install of torch.

FlameTheory · November 17, 2024, 8:37pm

Update: I now have this working on macOS.

I had to increase the timeout values in LearningExternalTrainer.cpp (line 720) and LearningSocketTraining.cpp (lines 49 and 62).

Deathcalibur · November 18, 2024, 2:02pm

Try increasing the timeout to 30 or 60 seconds to rule out an actual timeout. If it still times out at 60 seconds, then it’s probably hitting an error on the Python side. Once the timeout occurs, check the console/log for errors and see if any are related to LogLearning or Python. Let me know what you find. BTW, you need to wait for the timeout because that’s when the Python process should finish flushing to the log.
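
If the log is long, a small script can pull out the relevant lines. This is only a sketch: the function name `find_learning_errors` is hypothetical, and the keywords and the `LogPython` category are assumptions based on the log excerpts in this thread and may need adjusting for your log.

```python
from typing import Iterable, List

# Log categories worth looking at; LogPython is an assumption, adjust as needed.
CATEGORIES = ("LogLearning", "LogPython")
# Substrings that usually mark a failure in the subprocess output.
KEYWORDS = ("Error", "Traceback", "Exception")


def find_learning_errors(lines: Iterable[str]) -> List[str]:
    """Return log lines from the watched categories that look like errors."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(cat in line for cat in CATEGORIES)
        and any(key in line for key in KEYWORDS)
    ]
```

For example, `find_learning_errors(open(path_to_log, encoding="utf-8", errors="replace"))` would scan a saved log file, where `path_to_log` is wherever your project writes its log.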

When things are working well, you should never be timing out with that long of a timeout setting but I’m curious if there is some other issue here. We have relatively powerful development workstations so some of these sorts of race-conditions can slip through the cracks occasionally.

BTW, if it’s not clear, Learning Agents is experimental still haha! Thanks for the engagement and feedback though.

If you’re on ue5-main or 5.6 (which is not out anytime soon), you could also start the Python trainer process from an IDE, like VS Code. If you run python train.py --help, you might be able to figure out how to launch it.

EDIT: Thank you so much for following up with your resolution. I will make a ticket and we will look into it soon! Thanks for helping make the world a better place

Deathcalibur · November 18, 2024, 2:08pm

Strange, I have never seen anything like this before, but I normally recreate the project from scratch each patch and don’t do an in-place upgrade. I do have internal projects which are constantly updating though and haven’t seen this.

Can you try nuking your {game-project-workspace}\Intermediate\PipInstall directory? These will re-install automatically on startup of the UE Editor. I’m guessing the dependencies got messed up and need a clean install.
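
If you would rather script the cleanup than delete the folder by hand, something like the following should do it. This is a minimal sketch: the function name `remove_pip_install` is hypothetical, and it assumes the editor is closed so that no files in the folder are in use.

```python
import shutil
from pathlib import Path


def remove_pip_install(project_dir: str) -> bool:
    """Delete <project_dir>/Intermediate/PipInstall if it exists.

    Returns True if the directory was removed, False if it was not there.
    """
    target = Path(project_dir) / "Intermediate" / "PipInstall"
    if not target.is_dir():
        return False
    shutil.rmtree(target)
    return True
```

Call it with the root folder of your game project, for example `remove_pip_install(r"D:\Unreal Projects\RLAgent1")`, then restart the editor so the dependencies are reinstalled.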

Let me know if that does or does not work either way so I can make a note of it.

Thanks!

FlameTheory · November 18, 2024, 6:04pm

I also added another clause in train_ppo.py at line 60:

elif torch.backends.mps.is_available():
    device = 'mps'

MPS should give us much better performance on Mac!
https://pytorch.org/docs/stable/notes/mps.html
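
For context, here is how the full selection chain might look. This is a sketch rather than the actual contents of train_ppo.py: it assumes the script already prefers CUDA and falls back to the CPU, and the helper name `select_device` is hypothetical.

```python
def select_device(cuda_available: bool, mps_available: bool) -> str:
    """Pick a PyTorch device string: CUDA first, then Apple MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# Usage with PyTorch installed (both availability checks are real PyTorch APIs):
#   import torch
#   device = select_device(torch.cuda.is_available(),
#                          torch.backends.mps.is_available())
```

Taking the availability flags as arguments keeps the ordering logic easy to check without a GPU at hand.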

AntiSarumyan · November 18, 2024, 9:47pm

I had exactly the same error. The project was created completely from scratch. I also tried to delete the “PipInstall” folder. The error still occurs.

UPD. While writing, I found a temporary solution:

1. Run your project.
2. Open the command line (Windows console) as administrator.
3. Enter the command "cd {game-project-workspace}\Intermediate\PipInstall\Scripts" (substitute the path to your own project). You may need to switch drives first.
4. Remove the library with the command “pip uninstall torch”.
5. Reinstall the library with the command “pip install torch torchvision torchaudio”.

After these steps, everything worked for me. A similar problem was with version 5.4 and using “TensorBoard”. I do not know why UE does not register python normally in the system. I will be grateful if someone reports a simpler solution.

Deathcalibur · November 19, 2024, 1:28pm

This is awesome. We will get this tested and integrated for the next release. Thanks!

Deathcalibur · November 19, 2024, 1:33pm

Hmm, interesting. We have a solution for making tensorboard installation easier, and hopefully we can bring in an alternative to tensorboard next release… maybe MLflow? Let me know if y’all have suggestions for alternatives to tensorboard that you prefer.

I made some tickets around these python issues and we will have a think about simplifying the entire python setup process if possible.

If you run into issues and can grab the log, it usually contains pip installation errors, and it would help us out if you could forward those our way.

Thanks for taking the time to report these issues! This feedback is super valuable and the main way we can make things better!

AntiSarumyan · November 21, 2024, 9:00am

UPD: after googling the information I realized that to solve the error “winerror 126” you need to download “libomp140.x86_64.dll” and place it in “C:\Windows\System32”. After that everything worked for me and now there is no need to delete and reinstall the “torch” module via the console every time you start the project. Just in case, I will attach a link to the DLL that personally fixed the problem for me:
https://www.dllme.com/dll/files/libomp140_x86_64/00637fe34a6043031c9ae4c6cf0a891d

Deathcalibur · November 21, 2024, 1:56pm

Hmm, that’s really unexpected.

Thanks for sharing!

a_mohame · November 27, 2024, 3:00pm

Thanks a lot for the tutorial, it’s really great to see this happening in Unreal Engine.
I wanted to ask: I now have a camera attached to the vehicle and I would like to use its images as the observation for the network. Is it possible to do so? And is there any tutorial about this?

Deathcalibur · December 3, 2024, 3:49pm

No, you can’t currently use images. We don’t really think this is practical for game development hence we haven’t prioritized it yet.

You can do ray casts from the camera actor and get a pretty good approximation… depends on what you are trying to accomplish though.

p34c · December 9, 2024, 7:35pm

Hello,

I’m running 5.5 on macOS (M1). Here’s the log I’m getting after configuring the socket communicator:

LogLearning: Display: Subprocess: Starting Socket Communicator...
LogLearning: Display: Subprocess: Creating Socket Trainer Server (127.0.0.1:48491)...
LogLearning: Display: Subprocess: Listening...
LogLearning: Display: Subprocess: Traceback (most recent call last):
LogLearning: Display: Subprocess:   File "/Users/Shared/Epic Games/UE_5.5/Engine/Plugins/Experimental/LearningAgents/Content/Python/train.py", line 83, in <module>
LogLearning: Display: Subprocess:     s.bind((host, port))
LogLearning: Display: Subprocess: OSError: [Errno 48] Address already in use
LogLearning: Error: MakeSocketCommunicator: Failed to connect to training process: Unexpected communication received. Check log for additional errors.

Deathcalibur · December 9, 2024, 7:44pm

I would check if you ended up with an orphaned python process or if you can simply switch to a different port.

Did you try that already?
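
A quick way to check whether the port is still held by something is to try binding to it yourself. This is a minimal sketch with a hypothetical helper name, `is_port_free`; it only tells you the port is taken, not which process holds it.

```python
import socket


def is_port_free(host: str, port: int) -> bool:
    """Return True if we can bind to (host, port), False if it is already in use."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "Address already in use", as in the traceback above.
            return False
    return True
```

For the log above, `is_port_free("127.0.0.1", 48491)` returning False before you start training would point to a leftover process or another program using that port.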

p34c · December 9, 2024, 7:51pm

There was indeed an orphaned python process, good catch !

peac@peac-mbp work % ps aux | grep python
peac             49372   0,0  0,3 412616784  88256   ??  S     5:35     5:31.48 /Users/peac/Documents/Unreal Projects/testRL/Intermediate/PipInstall/bin/python3 /Users/Shared/Epic Games/UE_5.5/Engine/Plugins/Experimental/LearningAgents/Content/Python/train.py /Users/Shared/Epic Games/UE_5.5/Engine/Binaries/Mac/ train_ppo Socket 127.0.0.1:48491 /Users/peac/Documents/Unreal Projects/testRL/Intermediate/LearningAgents 1
peac             56385   0,0  0,0 410749648   1584 s000  S+    8:47     0:00.00 grep python
peac@peac-mbp work % kill -9 49372
peac@peac-mbp work % ps aux | grep python
peac             56406   0,0  0,0 410733264   1488 s000  S+    8:47     0:00.00 grep python

Now I get the timeout error mentioned earlier in the thread. Is the only way forward to compile from source?

Deathcalibur · December 9, 2024, 8:15pm

You can increase the timeouts by editing the settings objects, such as the training settings.

Let me know if you can/can’t figure it out.