Hey Alex
Thanks for providing some more info about the intended use case of the replay functionality; however, there are still some things I wonder about. For additional context, we've been running with the aforementioned fixes during July and they appear to fix most of the issues we've been having.
When we're talking about "AckState", I'm referring primarily to the NetGUIDAckStatus of FPackageMapState. I understand we're simply recording past data (hence the replay connection is treated as "reliable" and sets the bInternalAck option to true), so the concept of acking or resending data doesn't really apply. In this case I'm treating NetGUIDAckStatus as an indication of whether we've previously exported the full object path at least once for the given NetGUID (i.e. has ShouldSendFullPath been true at least once for this object?).
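In other words, on a replay connection I picture the ack state as nothing more than the following (a throwaway stand-in sketch with made-up types and names, not the actual engine code or the real ShouldSendFullPath logic):

#include <cstdint>
#include <unordered_map>

// Stand-in type purely for illustration, not the real FNetworkGUID / ack-state containers.
using FNetGUIDKey = uint32_t;

struct FReplayAckStateSketch
{
    // The only thing "ack status" means to us on a replay connection:
    // has the full object path for this GUID been written into the stream at least once?
    std::unordered_map<FNetGUIDKey, bool> FullPathExported;

    bool NeedsFullPathExport(FNetGUIDKey NetGUID) const
    {
        // With bInternalAck there is no ack round-trip and nothing to resend,
        // so this lookup is the whole question.
        const auto It = FullPathExported.find(NetGUID);
        return It == FullPathExported.end() || !It->second;
    }
};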
The issue we've been dealing with is that state changes during amortized checkpoints (new actor channels, channels closed due to dormancy or destruction) are missed entirely because of how amortized checkpoints are set up. For actors, for example, a list is cached (see CheckpointSaveContext.PendingCheckpointActors) at the start of the new checkpoint, so any actors created between that point and the moment the amortized checkpoint is finished will not be part of the checkpoint. Consider the following example:
- SaveCheckpoint is called, caching actors A and B in the list of actors to process in ProcessCheckpointActors
- Actor C is created
- FlushCheckpoint is called, checkpoint is finished. This checkpoint will contain two Open bunches, one for A and one for B
- Regular traffic for A, B and C comes in; these are non-Open bunches, as the channels are already open on the recording side.
During playback, when we scrub to our checkpoint we only have actor channels for A and B. However, as we start processing traffic that came in after that checkpoint, most notably traffic for C, we will receive a bunch for a channel that isn't open, yielding errors in the net driver.
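To make the timing gap concrete, here is a tiny stand-alone model of it (plain C++ with made-up types, not engine code; the names only mirror the steps above):

#include <cstdio>
#include <set>
#include <vector>

using ActorId = int; // stand-in for an actor / NetGUID

struct CheckpointSim
{
    std::vector<ActorId> PendingCheckpointActors; // snapshot taken when the checkpoint starts
    std::set<ActorId>    OpenBunchesWritten;      // actors that got an Open bunch into the checkpoint
};

int main()
{
    std::set<ActorId> LiveActors = {1 /*A*/, 2 /*B*/};

    // "SaveCheckpoint": the pending list is snapshotted up front.
    CheckpointSim Checkpoint;
    Checkpoint.PendingCheckpointActors.assign(LiveActors.begin(), LiveActors.end());

    // Actor C spawns while the checkpoint is still being amortized over several ticks.
    LiveActors.insert(3 /*C*/);

    // "ProcessCheckpointActors" / "FlushCheckpoint": only the snapshot gets Open bunches.
    for (ActorId Actor : Checkpoint.PendingCheckpointActors)
        Checkpoint.OpenBunchesWritten.insert(Actor);

    // Playback scrubs to this checkpoint: C has regular (non-Open) traffic recorded after the
    // checkpoint started, but no open channel, which is where the net driver errors come from.
    for (ActorId Actor : LiveActors)
        if (Checkpoint.OpenBunchesWritten.count(Actor) == 0)
            std::printf("Actor %d has post-checkpoint traffic but no Open bunch in the checkpoint\n", Actor);
}

The guid cache snapshot in SerializeGuidCache (next paragraph) has the same shape of problem, just with NetGUID exports instead of Open bunches.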
SerializeGuidCache has the same issue: it takes a snapshot, and any new objects that enter the guid cache between then and the end of the checkpoint are missed, so any post-checkpoint traffic for those objects is missing important "initialization".
Now, given that we're doing server-side recordings, network traffic goes into both QueuedDemoPackets and QueuedCheckpointPackets during the checkpoint amortization period, even without our divergence to record regular demo frames during checkpoints. Let's take another example:
- SaveCheckpoint is called, actor A is added to the list of pending actors
- Super::TickFlush is called in UDemoNetDriver::TickFlushInternal, causing A to become dormant. Thus a Close bunch is recorded into QueuedDemoPackets (NOT CheckpointPackets)
- ProcessCheckpointActors is called for actor A, recording an Open bunch for Actor A.
During playback from this checkpoint, any regular traffic for actor A will assume the channel is closed, because that is the most recent event we've seen for that actor. However, when we start playback from our checkpoint, the opposite is true: its channel is open. This results in errors like "Received channel open command for channel that was already opened locally" or "Reliable bunch before channel was fully open".
To fix this, we acknowledge that we essentially have two streams of data during amortized checkpoints: "regular" (QueuedDemoPackets) and "checkpoint" (QueuedCheckpointPackets) traffic. The "regular" data can essentially be seen as data the checkpoint snapshot missed due to unfortunate timing, or, in the case of server recordings, data that came from externally triggered events such as dormancy, which aren't caught by the regular ProcessCheckpointActors flow where data is redirected to the QueuedCheckpointPackets array.
Thus, to fully "load" a checkpoint and make sure we're not missing any critical history or state for a given NetGUID (actor), we first process the checkpoint data and then any regular data gathered during the checkpoint period. Given that there are no guarantees about when during a checkpoint a given actor is replicated relative to when it may or may not receive regular data from other events (mentioned above), we'll almost always end up applying bunches for a given NetGUID in the wrong chronological order. To get around this we add a linearly incrementing counter to each network bunch on replay connections and keep a map of NetGUID -> CounterID during playback, skipping any bunch whose counter is older than the newest one we've already applied for that NetGUID.
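Roughly, the scheme looks like this (a simplified stand-alone sketch of the idea with made-up types and names, not our actual patch):

#include <cstdint>
#include <unordered_map>

using FNetGUIDKey = uint32_t; // stand-in for the engine's NetGUID type

// Recording side: every bunch written to either queue gets the next counter value,
// so the two streams share a single chronological sequence.
struct FReplayBunchHeader
{
    FNetGUIDKey NetGUID   = 0;
    uint64_t    CounterID = 0;
};

struct FReplayBunchRecorder
{
    uint64_t NextCounter = 1;

    FReplayBunchHeader Tag(FNetGUIDKey NetGUID)
    {
        return FReplayBunchHeader{ NetGUID, NextCounter++ };
    }
};

// Playback side: remember the newest counter applied per NetGUID and drop anything older.
struct FReplayBunchFilter
{
    std::unordered_map<FNetGUIDKey, uint64_t> NewestApplied;

    bool ShouldApplyBunch(const FReplayBunchHeader& Header)
    {
        uint64_t& Newest = NewestApplied[Header.NetGUID];
        if (Header.CounterID <= Newest)
        {
            return false; // we've already applied this bunch or something newer for this NetGUID
        }
        Newest = Header.CounterID;
        return true;
    }
};

Applied to the actor A example above: the regular Close bunch and the checkpoint Open bunch get consecutive counters, so whichever of the two was recorded last takes effect and the older one is skipped rather than being applied on top, regardless of which queue playback happens to read first.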
Finally, onto the topic of NetGUIDs specifically. When a client receives the data from the server (checkpoint + chunks of traffic), it has nothing acked, i.e. it has never processed the full object path for any object at all. I believe this is the purpose of SerializeGuidCache: to ensure all net-addressable objects have been registered before we start processing traffic for them. As I'm writing this, I realize that perhaps the root cause for us is that we're processing traffic for objects that didn't get registered in SerializeGuidCache, especially given that we take traffic recorded during a checkpoint into account in order to ensure everything is up to date. For example, if an actor is destroyed during the checkpoint amortization period, before the SerializeGuidCache step, it will not be recorded as part of the GuidCache snapshot, whereas our current fix ensures it will get NetGUID exports from its regular traffic bunches instead (or checkpoint bunches, whichever is processed first). With that, I also understand the reasoning behind SavePackageMapExportAckStatus better: it isn't meant as a primary mechanism for NetGUID exports, it just ensures that ack status keeps being recorded as normal from the point where the checkpoint started.
So rather than forcing ExportNetGUIDForReplay to be called a lot more, I think the proper fix for us might be to ensure SerializeGuidCache actually includes ALL objects for which the checkpoint contains traffic. Right now SerializeGuidCache does an "if (Object && (NetworkGUID.IsStatic() || Object->IsNameStableForNetworking()))" check, which excludes destroyed objects. But if that destruction happened during our amortization period, the object is left out of the snapshot even though we still process its traffic, including the destruction event, as part of our checkpoint load (the kind of relaxed condition I have in mind is sketched further below). For example:
- Actor A is created before our checkpoint
- SaveCheckpoint is called, A is added to the pending list
- ProcessCheckpointActors is called for A; we record an Open bunch for A with no NetGUID exports (with our local fix we do actually get exports here, since we reset the ack state). That info was in the bunch from step 1, which was omitted from the checkpoint.
- A is destroyed, a Close bunch is recorded as “regular” traffic.
- CacheNetGuids and SerializeGuidCache are called, omitting A since it is destroyed.
- FlushCheckpoint is called, checkpoint is finalized.
During loading of this checkpoint, we fail to process the checkpoint recording of A since it lacks NetGUID exports, and no exports were made in CacheNetGuids because at that point the actor was already destroyed. One could argue that A shouldn't be part of the checkpoint at all since it was destroyed during it, but the destroy event came in after the actor had been recorded as part of the checkpoint, so in order to ensure the actor is already destroyed during playback we now also need to include the destroy event (the Close bunch). One could also argue that a serialization error for an actor that will be destroyed momentarily isn't an issue; after all, in this case the Close bunch is included as part of the checkpoint fast-forward step, so the actor will never be visible to players. However, in our case this specific actor had components whose BeginPlay (on the client) depended on their replicated variables being initialized with server-given data; with the CDO default values of those properties present, several assertions triggered and the game crashed. And because the Open bunch for the actor failed to find NetGUID exports for its components, the initial bunches for the components couldn't be processed, which is what left those properties at their CDO defaults.
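To make the proposed direction concrete, this is roughly the shape of condition I have in mind for SerializeGuidCache (sketched with simplified stand-in fields rather than the real guid cache entry, and with bReferencedByCheckpointTraffic as a made-up stand-in for however we would track that the checkpoint, or the regular traffic recorded during the amortization period, references the GUID):

#include <string>

// Simplified stand-ins for the relevant bits of a guid cache entry.
struct FCacheEntrySketch
{
    bool        bObjectStillAlive = false; // the object is still resolvable (not destroyed)
    bool        bIsStatic         = false; // NetworkGUID.IsStatic()
    bool        bNameStable       = false; // Object->IsNameStableForNetworking()
    std::string PathName;                  // cached path info we could still export
};

// Current behaviour as I read it: destroyed objects are dropped from the snapshot.
bool ShouldSerialize_Current(const FCacheEntrySketch& Entry)
{
    return Entry.bObjectStillAlive && (Entry.bIsStatic || Entry.bNameStable);
}

// Proposed: also keep entries whose object is already gone but that the checkpoint (or the
// regular traffic recorded during the amortization period) still references, as long as the
// cached path info needed to export them is still available.
bool ShouldSerialize_Proposed(const FCacheEntrySketch& Entry, bool bReferencedByCheckpointTraffic)
{
    if (ShouldSerialize_Current(Entry))
    {
        return true;
    }
    return bReferencedByCheckpointTraffic && !Entry.PathName.empty();
}

The intent is that A from the example above would still get its path exported by SerializeGuidCache, even though the object itself is gone by the time CacheNetGuids runs.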
Sorry for the wall of text, but I hope this clarified what we're doing a bit more. Whilst I do believe some of these issues are self-inflicted by our divergence to record during amortized checkpoints, things like having two data streams on server recordings and missing critical events that come in during an amortization period feel like they would be problematic in regular Unreal as well. Do you think my proposed solution, i.e. avoiding forced NetGUID exports for both checkpoint and regular data traffic during checkpoint amortization periods and instead making sure CacheNetGuids and SerializeGuidCache don't miss any objects, sounds like a reasonable way forward, given our use case?