Tutorial: Learning to Drive

Deathcalibur · August 7, 2024, 12:46pm

Hmm, it’s not currently exposed in an easily accessible way. In the past, I have added a counter to the manager’s tick (or wherever you call RunTraining) and then manually saved/loaded snapshots when hitting certain counts.
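
As a rough illustration of the counter idea, here is a minimal Python sketch. In a real project this logic would sit in the manager's tick in Blueprint or C++. `SnapshotCounter` and `on_snapshot` are hypothetical names, and the callback stands in for whatever snapshot save/load call your setup uses.

```python
from typing import Callable, List


class SnapshotCounter:
    """Calls `on_snapshot` every `interval` ticks (hypothetical helper)."""

    def __init__(self, interval: int, on_snapshot: Callable[[int], None]) -> None:
        if interval <= 0:
            raise ValueError("interval must be positive")
        self.interval = interval
        self.on_snapshot = on_snapshot
        self.count = 0

    def tick(self) -> None:
        # Call this from the same place that calls RunTraining.
        self.count += 1
        if self.count % self.interval == 0:
            self.on_snapshot(self.count)


# Record the tick counts at which a snapshot would have been saved.
saved: List[int] = []
counter = SnapshotCounter(interval=3, on_snapshot=saved.append)
for _ in range(7):
    counter.tick()
print(saved)  # [3, 6]
```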

We’ll think about this and see if we can’t add something.

jeanluc97233 · August 17, 2024, 11:33pm

Hi. This tutorial is really great. I have just one little problem: in the Reset Agent Episode step, when I paste, the function Reset To Random Point On Spline doesn’t appear in the blueprint… I’ve been looking for a while and I don’t know why.

Thanks for the response.

gmartinez006 · October 8, 2024, 6:26pm

Hello everyone. I’ll start off with thanks for the tutorial. It was great, and I got a lot of information from it. But I seem to be getting lots of LogLearning: Warning messages. Did I miss a reference object? Also, the scene episodes reset every so often. Is that normal?

Deathcalibur · October 8, 2024, 6:42pm

The error is indicating that something got wired up wrong in one of these parts. Make sure that the “Location along spline observation” is being put into the “location” on the map, and the same for the direction one.

Hopefully you didn’t edit the tags, but the tags need to match. The specify location tag needs to match the make location’s tag.
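
The matching rule can be sketched in plain Python as a check over two dictionaries, one for the tags used on the specify side and one for the make side. This is only an illustration of the rule, not Learning Agents code: `find_tag_mismatches` and the tag strings below are made up.

```python
from typing import Dict, List


def find_tag_mismatches(specified: Dict[str, str], made: Dict[str, str]) -> List[str]:
    """Describe every map entry whose specify tag and make tag disagree."""
    problems = []
    for key in sorted(set(specified) | set(made)):
        spec_tag = specified.get(key)
        make_tag = made.get(key)
        if spec_tag != make_tag:
            problems.append(f"{key}: specify tag {spec_tag!r} != make tag {make_tag!r}")
    return problems


# Hypothetical tags, for illustration only.
specified = {"Location": "LocationAlongSpline", "Direction": "DirectionAlongSpline"}
made = {"Location": "LocationAlongSpline", "Direction": "Direction"}
mismatches = find_tag_mismatches(specified, made)
print(mismatches)
```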

gmartinez006 · October 8, 2024, 9:43pm

Thank you. I had two Specify Location Along Spline Observation nodes in Specify Agent Observation, which was costing a lot of memory.

barthdamon · November 12, 2024, 5:25am

Hello! I’m trying to run headless with snapshots but I’m not seeing any in the snapshots folder. I see lots of config json files but no .bin snapshots. I have Save Snapshots set to true in my Trainer Training Settings. Is there another step to save snapshots beyond checking that box? What is the best place to debug why snapshots might not be saving? Do I need to run it for a certain period of time before snapshots start saving?

I also have tensorboard set up. I followed the steps to install it, and I can run it and see the window in my browser connected to localhost, but when I run the game headless with Use Tensorboard set to true, it looks like it can’t find tensorboard:

LogLearning: Display: Training Process: Warning: Failed to Load TensorBoard: No module named 'tensorboard'. Please add manually to site-packages.

Is there an additional configuration step required for that to recognize tensorboard beyond the local file paths?

Thank you!

Deathcalibur · November 12, 2024, 4:13pm

You need to adjust these paths:

To get tensorboard to work, follow the tensorboard tutorial here:

Snapshots might be getting saved to a directory you’re not expecting. Have you seen https://www.voidtools.com/ ? A great search tool.
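
If you would rather search from a script, a small Python sketch along these lines can look for snapshot files under any directory. It assumes the snapshots use the .bin extension mentioned earlier in the thread. `find_snapshots` and the file names in the demo are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List


def find_snapshots(root: str, pattern: str = "*.bin") -> List[Path]:
    """Recursively list files under `root` matching `pattern`, newest first."""
    files = [p for p in Path(root).rglob(pattern) if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)


# Demo on a throwaway directory; point `find_snapshots` at your project instead.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "Snapshots").mkdir()
    (Path(tmp) / "Snapshots" / "policy_0.bin").write_bytes(b"")
    (Path(tmp) / "config.json").write_text("{}")
    found = [p.name for p in find_snapshots(tmp)]
print(found)
```

Sorting by modification time puts the most recently written snapshot first, which helps when checking whether anything was saved during the latest run.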

Snapshots save on startup and every 1000 game steps (this needs to be fixed in a later version of LA to give users more control - I guess you could edit the python code pretty easily)

Debugging is best accomplished through reading the log.

Good luck,
Brendan

P.S. let me know if you’re still stuck after trying to figure it out

barthdamon · November 20, 2024, 2:47am

I’ve been looking at/adding logs to train_ppo.py to get a grasp on what is going on with my setup. It seems to receive/create everything and start training. Then, when I call StopTraining after about 8 seconds of training (during which my agent is clearly receiving commands), it looks like it pulls the experience, but none of the training seems to actually be saved:

LogLearning: Display: Training Process: Receiving Policy...
LogLearning: Display: Training Process: Receiving Critic...
LogLearning: Display: Training Process: Receiving Encoder...
LogLearning: Display: Training Process: Receiving Decoder...
LogLearning: Display: Training Process: Creating Optimizer...
LogLearning: Display: Training Process: Creating PPO Policy...
LogLearning: Display: Training Process: Opening TensorBoard...
LogLearning: Display: Training Process: Begin Training...
LogLearning: Display: Training Process: Profile| Pull Experience             2528ms
LogLearning: Display: Training Process: Done!
LogLearning: Display: Training Process: Exiting...

I never see it actually running any of the “push” functionality it looks like it should be… It is as if this code:

trainer.recv_experience(
    trim_episode_start,
    trim_episode_end)

never receives a response until I stop the training, even though I am calling the RunTraining function every tick. Should I be seeing the push logs while training is running?

Deathcalibur · November 20, 2024, 3:41pm

If you have the version of Learning Agents from UE5-Main (eventually coming to UE 5.6), you can run the python process separately from UE, which means you can easily run train.py from VS Code or your favorite python debugger. This makes it 100% easier to debug compared to adding a bunch of prints.

I think it actually works with 5.5, but I kind of forget. We switched the Python CLI from positional arguments to argparse, so 5.6 will be much better. I wish I could help you more with 5.5, but I’m really busy working on 5.6.
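
For anyone unfamiliar with argparse, the sketch below shows the general shape of such a CLI. It is a generic illustration, not the actual Learning Agents command line: every argument name here is hypothetical.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Hypothetical trainer CLI")
    # All argument names below are made up for illustration.
    parser.add_argument("config_path", help="path to a training config file")
    parser.add_argument("--log-to-tensorboard", action="store_true")
    parser.add_argument("--snapshot-interval", type=int, default=1000)
    return parser.parse_args(argv)


# Passing an explicit list makes the parser easy to exercise from a debugger.
args = parse_args(["config.json", "--snapshot-interval", "250"])
print(args.config_path, args.snapshot_interval, args.log_to_tensorboard)
```

Named flags with defaults are easier to launch from an IDE run configuration than a long list of positional values, because you only pass the ones you want to change.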

Thanks for pushing through these issues though! I know it’s a pain and can get frustrating (for me at least).

ath_toh · November 21, 2024, 3:56pm

Hello @Deathcalibur and thank you for your tutorial. I followed the 5.4 version and it worked perfectly until I turned on my PC and opened the project again the next day. The cars didn’t move anymore, and in the Output Log I got the error:

LogLearning: Error: SportsCarTrainer_0: Can't find Python executable "[Project Path]/Intermediate/PipInstall/Scripts/python.exe".

I am sorry I’m not experienced enough to understand what’s happening, but if you have any idea it would help. Thank you!

Deathcalibur · November 21, 2024, 4:06pm

That’s odd. I would navigate to that folder and see if the python exe is there. If it’s not, then I would close the editor and re-open it. The editor runs python installation scripts during startup, and it’s possible they failed for whatever reason.

If you close and open the editor, it should fix itself. If it doesn’t, then try deleting the Intermediate directory’s contents after closing the editor, and then open the editor again.
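
The "is the python exe there" check can also be scripted. This minimal sketch builds the path from the error message for a given project directory and tests whether the file exists. The helper names are hypothetical and the project path in the example is a placeholder.

```python
from pathlib import Path


def pip_install_python(project_dir: str) -> Path:
    """Build the interpreter path from the error message for a project directory."""
    return Path(project_dir) / "Intermediate" / "PipInstall" / "Scripts" / "python.exe"


def python_is_installed(project_dir: str) -> bool:
    """True if the editor's Python install step left an interpreter in place."""
    return pip_install_python(project_dir).is_file()


# Placeholder path; replace with your own project directory.
print(python_is_installed("/path/to/MyProject"))
```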

In the future, we are hoping we can put a UI on the python installation process, but it’s not clear when or even if that will be able to come.

Thanks,
Brendan

ath_toh · November 21, 2024, 4:35pm

It works, thank you so much for answering so fast.

barthdamon · November 22, 2024, 6:35am

It looks like shared_memory_recv_experience_multiprocess is getting hung up here waiting for these controls:

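    for controls in processes_controls:
        # Wait until this process signals that its experience is ready.
        while not controls[UE_SHARED_MEMORY_EXPERIENCE_SIGNAL]:
            # Bail out if the stop signal was raised while waiting.
            if controls[UE_SHARED_MEMORY_STOP_SIGNAL]:
                controls[UE_SHARED_MEMORY_STOP_SIGNAL] = 0
                return UE_RESPONSE_STOPPED, None, None
            print("Sleeping")  # debug print
            time.sleep(0.001)

    print("Moving On")  # debug print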

I see it print “Sleeping” until the process stops, and it never moves on. Any idea why the controls never get the shared memory experience signal that would let it move on?

Deathcalibur · November 22, 2024, 2:34pm

Each of the worker processes (UE game processes) will write a “1” to the shared memory at the appropriate index when it is done writing experience to the shared memory. So there must be something going on with UE where it’s not writing the data appropriately. I would check the logs on the UE side and step through to see what is going on over there.
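
To make the handshake concrete, here is a toy Python sketch of the same pattern. It uses threads and plain lists instead of real shared memory and UE processes, so it only shows the signalling logic. The slot indices, `worker`, `wait_for_experience`, and the timeout are all assumptions added for illustration.

```python
import threading
import time
from typing import List

# Hypothetical slot indices; the real constants live in the Learning Agents Python code.
EXPERIENCE_SIGNAL = 0
STOP_SIGNAL = 1


def worker(controls: List[int], delay: float) -> None:
    """Stand-in for a UE game process: write experience, then raise the flag."""
    time.sleep(delay)  # pretend to write experience into shared memory
    controls[EXPERIENCE_SIGNAL] = 1


def wait_for_experience(all_controls: List[List[int]], timeout: float = 5.0) -> bool:
    """Poll every worker's flag; return False on a stop signal or timeout."""
    deadline = time.monotonic() + timeout
    for controls in all_controls:
        while not controls[EXPERIENCE_SIGNAL]:
            if controls[STOP_SIGNAL] or time.monotonic() > deadline:
                return False
            time.sleep(0.001)
    return True


all_controls = [[0, 0], [0, 0]]
threads = [threading.Thread(target=worker, args=(c, 0.01)) for c in all_controls]
for t in threads:
    t.start()
received = wait_for_experience(all_controls)
for t in threads:
    t.join()

# If a worker never raises its flag, the wait gives up instead of hanging.
stalled = wait_for_experience([[0, 0]], timeout=0.05)
print(received, stalled)
```

The second call mirrors the hang described above: a flag that is never set keeps the loop sleeping, and only the timeout added in this sketch lets it return.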

Sorry this has been such a hassle for you.

barthdamon · November 25, 2024, 5:32am

I got snapshots working! I had to set the MaximumRecordedEpisodesPerIteration to a lower number to trigger snapshot saving at a more frequent interval for my setup.

I was overflowing memory at first when I got it working but now I’ve simplified/packed down my observation space. Do you have any insight on what might be causing this error? I get it sporadically now when training:

Engine\Plugins\Experimental\LearningAgents\Content\Python\ppo.py", line 531, in train
LogLearning: Display: Training Process:     policy_batch = torch.randint(0, len(window_indices), size=[policy_batch_size])
LogLearning: Display: Training Process:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
LogLearning: Display: Training Process: RuntimeError: random_ expects 'from' to be less than 'to', but got from=0 >= to=0

I’ve been messing with the training settings for my use case, so likely something on my end. I imagine every use case requires different configurations but that there are certain principles (which I’m still learning) that hold true in every case for ML to be effective…

Deathcalibur · November 25, 2024, 1:58pm

I haven’t run into this before, but I believe the issue is that you need to ensure that the policy batch size is larger than the policy window size in these settings:

What values were you using? I can reproduce the issue and add a warning/error message in UE.

EDIT: It could also be the case that your episode has zero steps in it, which would mean that you’re trying to get a random value from the empty range [0, 0), which isn’t possible. So double-check whether that is actually the issue.
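
For context, torch.randint(low, high) samples from the half-open range [low, high), so low=0 and high=0 leaves nothing to draw from. The sketch below reproduces that situation with the standard library's random.randrange as a stand-in for torch.randint, plus a guard that fails with a clearer message. `sample_batch_indices` is a hypothetical helper, not part of ppo.py.

```python
import random
from typing import List


def sample_batch_indices(num_items: int, batch_size: int) -> List[int]:
    """Sample `batch_size` indices in [0, num_items) with replacement."""
    if num_items <= 0:
        raise ValueError(
            "nothing to sample from: the half-open range [0, 0) is empty "
            "(e.g. an episode with zero steps)"
        )
    return [random.randrange(0, num_items) for _ in range(batch_size)]


# An empty window triggers the guard instead of an opaque sampling error.
try:
    sample_batch_indices(0, 4)
    error_message = ""
except ValueError as exc:
    error_message = str(exc)

indices = sample_batch_indices(10, 4)
print(error_message)
print(indices)
```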