GC FSM - Event-driven, hierarchical finite state machines in blueprint

Albie_123 · August 13, 2018, 3:21pm

Hello,

Sorry to double post, but I appear to have picked up another bug. The attached state machine is located in a **User Widget**, and some component of it (I suspect the local states) appears to be getting collected by the garbage collector. This stops the state machine, even when it is meant to keep running.

Note that without the “collect garbage” node, the same bug occurs - I just put it in there to test whether that was in fact why the state machine was stopping.

EDIT: This appears to happen with separate State Classes as well, unfortunately.

gamecentric · August 13, 2018, 4:55pm

Hi Albie_123, thanks for reporting. I will look at it. It might be related to the other GC-related bug reported by eanticev just last week, so it’s good to have a repro case that doesn’t involve multiplayer.
PS: I have good news about the nativization problems: making some BlueprintInternalUseOnly methods public instead of private actually solves the issue.

Albie_123 · August 14, 2018, 10:31am

That’s great to hear. Any ETA on that nativisation fix being pushed onto the marketplace version?

By the way, I can’t manage to reproduce the Garbage Collection error on my Player Controller or Game State blueprints - state machines on these blueprints aren’t getting collected, luckily.

gamecentric · August 14, 2018, 10:14pm

I’d like to address the GC issue also. However, if I see more time is needed for that, I will submit the nativization fix by the end of this week.

Albie_123 · August 15, 2018, 7:03am

Awesome. Was hoping to use it for the jam this weekend, but if it’s not ready, that’s all good.

woodzong · August 17, 2018, 6:42am

eanticev:

Hi there, there seems to be a garbage collection bug / issue with how this plugin works with a dedicated server multiplayer (not replication).

Find sample project here: https://1drv.ms/u/s!Asdf-Mafqv3XjypqwvW37ahuXXmW

Steps to reproduce without sample project:

Create game mode with a simple PING-PONG state machine that switches from ping to pong (etc) every 1s.

Launch FSM in BeginPlay (note that it’s not intended to be a replicating FSM, just a local FSM to the server in the game mode)

OnTick log out “FSM is running: {0} in State {1}”

Launch dedicated server using a command (with your UE4 path and IP address) such as

"C:\Program Files\Epic Games\UE_4.20\Engine\Binaries\Win64\UE4Editor.exe" "%CD%\..\FSMTest.uproject" ThirdPersonExampleMap?listen -server -log -nosteam -port=1234

Launch client to connect to server using a command such as

"C:\Program Files\Epic Games\UE_4.20\Engine\Binaries\Win64\UE4Editor.exe" "%CD%\..\FSMTest.uproject" 192.168.1.157:1234 -game -ResX=1280 -ResY=720 -WinX=0 -WinY=20 -log -nosteam -WINDOWED

After about 20 seconds the FSM context gets garbage collected and you see a transition in the server log that looks like:

[2018.08.11-19.45.33:584][807]LogBlueprintUserMessages: [ThirdPersonGameMode_C_0] FSM is running: 1 in State Ping
[2018.08.11-19.45.34:583][837]LogBlueprintUserMessages: [ThirdPersonGameMode_C_0] FSM is running: 1 in State Pong
[2018.08.11-19.45.34:627][838]LogGCFSM: Error: Object ThirdPersonGameMode_C_0 is not running FSMs
[2018.08.11-19.45.34:698][840]LogBlueprintUserMessages: [ThirdPersonGameMode_C_0] FSM is running: 0 in State

What you see above is that **GetContext() in the FSM is failing because the weak pointer starts pointing to NULL** about 20s after the first client connects.

I got the same error in a similar situation.
In my case, the FSM is not running on a server.
My FSM is created in an actor’s blueprint and launched from the “Event BeginPlay” node. After running for some seconds or minutes, it logs “LogGCFSM: Error: Object xxx is not running FSMs”, just like in eanticev’s report. The actor is still in the map and running fine, but the FSM in it seems to be destroyed. I tried to trace the bug in the source code. It seems there is something wrong with the FSM’s garbage collection.
Waiting for help… I can’t keep going with my work until this bug is fixed. So sad…
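For anyone trying to picture the PING-PONG machine from the quoted repro steps outside of Blueprint, here is a minimal standalone C++ sketch. It is not GC FSM code and does not use any Unreal API: the class name `PingPongFsm`, the `Tick` signature and the 1-second constant are all assumptions made up for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical switch interval, matching the "every 1s" in the repro steps.
constexpr float kSwitchIntervalSeconds = 1.0f;

enum class PingPongState { Ping, Pong };

// Toy two-state machine: flips between Ping and Pong once per interval.
class PingPongFsm {
public:
    // Advance the machine by DeltaSeconds, flipping state for each full interval elapsed.
    void Tick(float DeltaSeconds) {
        Elapsed += DeltaSeconds;
        while (Elapsed >= kSwitchIntervalSeconds) {
            Elapsed -= kSwitchIntervalSeconds;
            State = (State == PingPongState::Ping) ? PingPongState::Pong
                                                   : PingPongState::Ping;
        }
    }

    // Name of the current state, as it would appear in a log line.
    std::string StateName() const {
        return State == PingPongState::Ping ? "Ping" : "Pong";
    }

private:
    PingPongState State = PingPongState::Ping;
    float Elapsed = 0.0f;
};
```

The point of the repro is that a machine this simple should keep alternating forever; if it stops, something other than the machine's own logic has torn it down.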

gamecentric · August 17, 2018, 10:31am

I have some good news. I fixed the GC issue reported by Albie_123. For some obscure reason, Unreal is marking some types of objects (all Widgets, for instance) as “unreachable”, and my code was not expecting that. I am now going to check whether the fix also solves other cases, such as the client/server issue that eanticev is reporting. If that’s the case, I will submit the update today.

gamecentric · August 17, 2018, 1:02pm

Yes! As I hoped, all three issues reported by eanticev, Albie_123 and woodzong are indeed related. I have reason to believe the problem was introduced by changes to the garbage collector in Unreal 4.20, because the reproducibility is so high that I can’t explain how the bug could have gone unnoticed until now. Here’s the catch: during garbage collection, a few objects (in particular the GameMode object, but occasionally also other actors) may be temporarily marked “unreachable”. The process was changed in Unreal 4.20 to introduce some form of parallelism, so I presume the GCFSM code related to garbage collection is now called at a time when objects may be unreachable, a case that did not occur before Unreal 4.20. When the GCFSM code found an unreachable context object, the context was abandoned and its FSMs were stopped. Since I don’t want to mess with Unreal’s internal flags, I have replaced a FWeakObjectPtr::Get() call with a FWeakObjectPtr::GetEvenIfUnreachable() call, and everything is now working.

I am packaging the fix right now and will push it immediately. It usually takes a day or two to be online, I’ll keep you posted.

Thanks to all of you for the reports!

PS: Widgets (as in Albie_123’s report) seem to always be marked unreachable… that was helpful in addressing the bug, since it removed the slightly non-deterministic behaviour of actors that may or may not be marked unreachable.
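To illustrate the difference between the two lookups described above, here is a toy model in plain C++. It is only a sketch: `ToyObject`, `ToyWeakPtr` and the two `ShouldKeepRunning…` helpers are hypothetical stand-ins, not Unreal's or GC FSM's actual classes, and the real engine tracks far more state than two booleans.

```cpp
#include <cassert>

// Hypothetical stand-in for an engine object with two collector-related flags.
struct ToyObject {
    bool bUnreachable = false;  // may be set temporarily while a GC pass is in progress
    bool bPendingKill = false;  // set when the object has been explicitly destroyed
};

// Hypothetical stand-in for a weak object pointer (simplified on purpose).
class ToyWeakPtr {
public:
    explicit ToyWeakPtr(ToyObject* InObject) : Object(InObject) {}

    // Strict lookup: hides objects that are flagged unreachable or pending kill.
    ToyObject* Get() const {
        if (Object == nullptr || Object->bUnreachable || Object->bPendingKill) {
            return nullptr;
        }
        return Object;
    }

    // Lenient lookup: returns the object regardless of its flags.
    ToyObject* GetEvenIfUnreachable() const { return Object; }

private:
    ToyObject* Object;
};

// Old behaviour in this model: a temporarily unreachable context looks dead.
bool ShouldKeepRunningStrict(const ToyWeakPtr& Context) {
    return Context.Get() != nullptr;
}

// New behaviour in this model: only a null or pending-kill context stops the FSM.
bool ShouldKeepRunningLenient(const ToyWeakPtr& Context) {
    const ToyObject* Obj = Context.GetEvenIfUnreachable();
    return Obj != nullptr && !Obj->bPendingKill;
}
```

In this model, a context that is merely flagged unreachable stops the FSM under the strict check but not under the lenient one, while a pending-kill context stops it in both cases.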

Albie_123 · August 18, 2018, 4:44am

That’s great to hear. It also makes a lot of sense, since my actors would occasionally do the same thing but not all the time, so at first I assumed I’d just screwed up somehow - I could only reliably replicate the issue with widgets.

Thanks for being so quick with the patch!

woodzong · August 20, 2018, 2:04pm

Great to hear that!!! Thanks!!!

gamecentric · August 20, 2018, 2:23pm

I can’t double check, since I’m not in front of my PC, but I received a notification from Epic saying that the fix went online just a few minutes ago. Thanks again for all your reports and patience.

eanticev · August 20, 2018, 5:06pm

I’m concerned that the FWeakObjectPtr::GetEvenIfUnreachable() will not work. I tried that myself and it just introduces a crash a few minutes later due to another cascading issue.

Also, consider what we’re doing here by making this change… we now might be not garbage collecting the FSM at the right time, thus leaving it running even if an object is unreachable.

I’m not sure what your test-case looks like, but you ideally have to run the scripts I suggested because the engine behaves slightly differently in editor than standalone.

I’m happy to connect and go over these issues via screenshare and figure out a solution.

eanticev · August 20, 2018, 5:08pm

If I remember correctly with this line change I was getting a crash after a few minutes on line 87 in GCFSMUtilities where it’s trying to get


auto context = rootState->GetContextObject();

eanticev · August 20, 2018, 5:27pm

Instead of this check


return context && !context->IsPendingKill();

You might need to do something like:


return context->IsValidLowLevel() && !context->IsPendingKill();

I seem to remember that just a nullptr check is not always 100% as opposed to IsValidLowLevel

Albie_123 · August 21, 2018, 12:53am

When are you getting crashes eanticev? I don’t seem to be having any problems with the new version but would love to help test what’s going on.

gamecentric · August 21, 2018, 6:44am

Hi eanticev, I understand your concerns; you have a point here. I’ll dig deeper into the issue using your test project as a test case. If you’re comfortable with Slack, would you join me there? It might make it easier for me to share hotfixes with you. Just send me your email address via private message and I’ll invite you to my dedicated workspace.

gamecentric · August 21, 2018, 12:59pm

I may have oversimplified in my post: replacing Get() with GetEvenIfUnreachable() is a start, but it’s not enough to fix the issue. Did you reproduce those crashes with the v1.5.3 version?

When the context object is destroyed, the FSMs will stop ticking and will therefore stop “running” immediately. It’s true, however, that there may be a situation where the root state object and all its FSMs stay around until the next garbage collection cycle, instead of being purged as soon as its context object is. The situation occurs only if the context object becomes unreachable without first being marked pending kill. YMMV, but I believe this case doesn’t occur very often and even if it does, the situation eventually heals itself without memory leaks.

That’s good advice. The standalone build does behave differently from the editor, so I must be more careful and check that as well. That said, I could reproduce neither bugs nor crashes with your test project using v1.5.3, even in standalone.

eanticev:

If I remember correctly with this line change I was getting a crash after a few minutes on line 87 in GCFSMUtilities where it’s trying to get

auto context = rootState->GetContextObject();

Instead of this check

return context && !context->IsPendingKill();

You might need to do something like:

return context->IsValidLowLevel() && !context->IsPendingKill();

I seem to remember that just a nullptr check is not always 100% as opposed to IsValidLowLevel

The problem here could be that there is a situation where the call to AddReferencedObject() might nullify rootState. It has nothing to do with context having an invalid value. Adding a null check on rootState is probably safe; I would just like to understand whether it’s really needed, because I did not encounter this problem during regular use. BTW, because of the way the value of the context variable is obtained (it’s the value of a UPROPERTY), a null check is OK; using IsValidLowLevel() would be a huge waste of time.

Anyway, I’m thinking of rewriting this part of GC FSM. It looks a bit fragile, and I would like to make it less dependent on garbage collection internals.
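As a rough illustration of the null check on rootState discussed above, here is a standalone C++ sketch. It is not the actual GCFSMUtilities code: `ToyContext`, `ToyRootState` and `HasLiveContext` are hypothetical names, and only the two quoted lines (`GetContextObject()` and the `context && !context->IsPendingKill()` check) come from the thread.

```cpp
#include <cassert>

// Hypothetical stand-in for the context object that owns the FSMs.
struct ToyContext {
    bool bPendingKill = false;
    bool IsPendingKill() const { return bPendingKill; }
};

// Hypothetical stand-in for the root state, which holds a pointer to its context.
struct ToyRootState {
    ToyContext* Context = nullptr;
    ToyContext* GetContextObject() const { return Context; }
};

// Returns true only when the root state exists and its context is still alive.
bool HasLiveContext(const ToyRootState* rootState) {
    // Guard for the case where rootState itself has been nullified;
    // without it, the call on the next line would dereference a null pointer.
    if (rootState == nullptr) {
        return false;
    }
    const ToyContext* context = rootState->GetContextObject();
    // Same shape as the check quoted in the thread: plain null check plus pending kill.
    return context && !context->IsPendingKill();
}
```

The guard costs a single pointer comparison, so under these assumptions it would be cheap to add even if the null rootState case turns out to be rare.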

Albie_123 · August 21, 2018, 3:26pm

For what it’s worth I’ve tried my state machines now on a variety of classes (controllers, gamestate, widgets, etc.) with both forced GC and just letting Unreal do its thing, and in both packaged and in-editor I haven’t had any issues (either the state stopping, OR the state machine not getting GC’d when the context object does) or crashes.

I don’t think it’s necessary for the state machine to be GC’d immediately rather than waiting for the next GC cycle, especially since virtually everything else in Unreal works this way and IMO it would make more sense to keep it consistent with the rest of the engine. There might be a specific use case I’ve not considered though.

Kilrogg · September 20, 2018, 3:16am

Hi,

I’m pretty sure I ran into a bug, or I don’t understand how Submachines and Local States are supposed to interact.

What I was trying to do was have an “abort” transition on all my states by using a Submachine as described in the documentation. Here’s a simplified example:

If I run FSM_0, I see the states change properly but none of the OnEntry/Exit/Tick nodes inside the Local States fire. If I run the NewSubmachine_0 directly instead of as a Submachine, everything fires as expected.

Am I doing something wrong?

Thanks!

gamecentric · September 20, 2018, 6:50am

Hi @JeromeParent ,
your understanding is correct: events in the local FSMs should be triggered in your scenario. I will look into the issue and get back to you shortly.
Thanks for the report and your patience,