Discovered an OnlineSubsystemGDK bug that often causes GDK builds to crash around startup

Discovered in engine version 5.4.2.

The issue stems from the FOnlineSubsystemGDK::OnNetworkConnectivityHintChanged function, specifically when the CVarXboxRecreateGDKContextByConnectivity CVar is set (which it is by default). This function is meant to clean up and restore the GDK context as appropriate for network disconnects and reconnects, which it does just fine when those happen at runtime. When the connectivity hint indicates an inactive network connection, the cleanup code runs. It removes the GDK context and internal tracking for all users registered with the identity interface, and it calls XblCleanupAsync, which crucially invalidates the SCID returned by any XblGetScid calls.

Again, this works just fine for runtime disconnects, but the issue occurs when launching an instance that does not have a valid network connection at boot. The delegate that ultimately results in this function call is hooked by the XNetworkingRegisterConnectivityHintChanged API call in the Init function. As soon as it is hooked, it fires, and the OSS reacts to it. This happens during engine init, specifically before the OnFEngineLoopInitComplete core delegate fires. The GDK identity manager only adds local users in response to that core delegate (via the RefreshGamepadsAndUsers call in FOnlineIdentityGDK::OnEngineInitComplete), so the CleanupGDKContextForNetworkConnectivityLoss function does not actually delete the GDK context for any users, because no users are cached yet.
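To make the ordering problem concrete, here is a minimal standalone model of it. This is not engine code: FMockOnlineSubsystem and all of its members are hypothetical stand-ins I made up for illustration, and the real OSS does far more than this.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the relevant OSS state. Not engine code.
struct FMockOnlineSubsystem
{
    std::vector<std::string> CachedUsers; // users tracked by the identity interface
    bool bXblInitialized = true;          // becomes false once the XblCleanupAsync equivalent runs

    // Models the cleanup path taken when the connectivity hint reports no network.
    void OnConnectivityLost()
    {
        CachedUsers.clear();     // effectively a no-op if no users have been added yet
        bXblInitialized = false; // the SCID is invalid from here on
    }

    // Models RefreshGamepadsAndUsers running when engine init completes.
    void OnEngineInitComplete()
    {
        CachedUsers.push_back("LocalUser0");
    }

    // Models the upstream gate: calls go ahead as long as a user is cached.
    bool WouldAttemptScidCall() const { return !CachedUsers.empty(); }
};

// Replays the offline boot order: the hint fires first, users are added second.
inline FMockOnlineSubsystem SimulateOfflineBoot()
{
    FMockOnlineSubsystem Oss;
    Oss.OnConnectivityLost();   // fired as soon as the delegate is hooked in Init
    Oss.OnEngineInitComplete(); // users are added only after the cleanup already ran
    assert(Oss.WouldAttemptScidCall() && !Oss.bXblInitialized); // the inconsistent state
    return Oss;
}
```

In the runtime disconnect order (users added first, hint second), the same model ends with no cached users, so the gate holds and nothing tries to read the SCID.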

This ultimately leaves the system with users cached in the identity interface (added after the cleanup had already run) and an invalid SCID. With the SCID invalid, the XblGetScid API call returns a non-zero HRESULT and no valid SCID. That becomes a problem because no call to XblGetScid anywhere in the OSSGDK checks its result; they are all just assumed to complete successfully. Normally, that isn't an issue, because the OSS uses those internally tracked users in the identity interface to gate all these calls further upstream, with the idea that the only way for XblGetScid to fail is to not have any users with a valid context.
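For reference, this is roughly the kind of check I would expect around those calls. It is only a sketch under some assumptions: I am assuming XblGetScid takes a const char** out parameter and returns an HRESULT, and I stub it with two hypothetical functions (FakeGetScidOk and FakeGetScidFail) so the snippet compiles without the GDK. TryGetScid, HResult and FGetScidFn are names I made up.

```cpp
#include <cstdint>
#include <string>

// Local alias so this compiles without Windows headers. Assumption: a negative value means failure.
using HResult = std::int32_t;
constexpr HResult kOk = 0;
constexpr HResult kFail = -2147467259; // 0x80004005 (E_FAIL) as a signed 32-bit value

// Assumed shape of the XblGetScid entry point.
using FGetScidFn = HResult (*)(const char** OutScid);

// Hypothetical stub: behaves like a successful call.
inline HResult FakeGetScidOk(const char** OutScid)
{
    *OutScid = "00000000-0000-0000-0000-000000000000";
    return kOk;
}

// Hypothetical stub: behaves like a call made after cleanup has run.
inline HResult FakeGetScidFail(const char** OutScid)
{
    *OutScid = nullptr;
    return kFail;
}

// Copies the SCID into OutScid only when the call succeeded and the pointer is usable.
inline bool TryGetScid(FGetScidFn GetScid, std::string& OutScid)
{
    const char* Scid = nullptr;
    const HResult Result = GetScid(&Scid);
    if (Result < 0 || Scid == nullptr)
    {
        return false; // caller should skip the request instead of dereferencing Scid
    }
    OutScid = Scid;
    return true;
}
```

A caller such as the presence update path would then bail out early when TryGetScid returns false, rather than passing a null SCID along.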

However, as I’ve described, launching an instance without a network connection breaks this assumption. This can lead to XblGetScid calls completing unsuccessfully, with the results of those calls being assumed to be valid, causing null reference exceptions and the like. In our particular case, this happened when attempting to send a presence update for a user later on. This was just the case we ran into, but I’d imagine pretty much all uses of XblGetScid could be a problem in some way.

We do currently have a local workaround that delays the processing of that initial connectivity hint until after the identity manager has filled out its users, but I wanted to share what I had discovered in the hope of seeing an official fix.
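The general shape of that kind of deferral looks something like the following. This is a simplified standalone sketch, not our actual change and not engine code: FDeferredConnectivityHandler and its members are hypothetical names, and Process stands in for whatever OnNetworkConnectivityHintChanged would normally do.

```cpp
// Hypothetical helper that holds back connectivity hints until users are populated.
struct FDeferredConnectivityHandler
{
    bool bUsersPopulated = false;   // set once the identity manager has added local users
    bool bHasPendingHint = false;   // true while a hint is waiting to be processed
    bool bPendingConnected = false; // most recent hint received before users were populated

    int ProcessedCount = 0;              // number of hints actually processed
    bool bLastProcessedConnected = true; // value of the last processed hint

    // Called from the connectivity hint callback.
    void OnHintChanged(bool bConnected)
    {
        if (!bUsersPopulated)
        {
            // Too early: remember only the latest hint and handle it later.
            bHasPendingHint = true;
            bPendingConnected = bConnected;
            return;
        }
        Process(bConnected);
    }

    // Called after the identity manager has filled out its users.
    void OnUsersPopulated()
    {
        bUsersPopulated = true;
        if (bHasPendingHint)
        {
            bHasPendingHint = false;
            Process(bPendingConnected);
        }
    }

    // Stand-in for the real cleanup/restore logic.
    void Process(bool bConnected)
    {
        ++ProcessedCount;
        bLastProcessedConnected = bConnected;
    }
};
```

With this ordering, the cleanup for an offline boot runs after the users exist, so it removes them as intended and the upstream gating works again.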