Hey Alex
Thanks for providing some more info about the intended use case of the replay functionality; however, there are still some things I wonder about. For additional context, we've been running with the aforementioned fixes during July and they appear to fix most of the issues we've been having.
When we're talking about "AckState", I'm referring primarily to the NetGUIDAckStatus of FPackageMapState. I understand we're simply recording past data (hence the replay connection is treated as "reliable" and sets the bInternalAck option to true), so the concept of acking or resending data doesn't really apply. In this case I'm treating NetGUIDAckStatus as an indication of whether we've previously exported the full object path at least once for the given NetGUID (i.e. has ShouldSendFullPath been true at least once for this object?).
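In other words, on a replay connection I picture the ack state as nothing more than the following (a throwaway stand-in sketch with made-up types and names, not the actual engine code or the real ShouldSendFullPath logic):

#include <cstdint>
#include <unordered_map>

// Stand-in type purely for illustration, not the real FNetworkGUID / ack-state containers.
using FNetGUIDKey = uint32_t;

struct FReplayAckStateSketch
{
    // The only thing "ack status" means to us on a replay connection:
    // has the full object path for this GUID been written into the stream at least once?
    std::unordered_map<FNetGUIDKey, bool> FullPathExported;

    bool NeedsFullPathExport(FNetGUIDKey NetGUID) const
    {
        // With bInternalAck there is no ack round-trip and nothing to resend,
        // so this lookup is the whole question.
        const auto It = FullPathExported.find(NetGUID);
        return It == FullPathExported.end() || !It->second;
    }
};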
The issue we've been dealing with is that state changes during amortized checkpoints (new actor channels, channels closed due to dormancy or destruction) are missed entirely because of how amortized checkpoints are set up. For actors, for example, a list is cached (see CheckpointSaveContext.PendingCheckpointActors) at the start of the new checkpoint, so any actors created between that point and the moment the amortized checkpoint is finished will not be part of the checkpoint. Consider the following example:
- SaveCheckpoint is called, caching actors A and B in the list of actors to process in ProcessCheckpointActors
- Actor C is created
- FlushCheckpoint is called, checkpoint is finished. This checkpoint will contain two Open bunches, one for A and one for B
- Regular traffic for A, B and C comes in; these are non-Open bunches, as the channels are already open on the recording side.
During playback, when we scrub to our checkpoint we only have actor channels for A and B. However, as we start processing traffic that came in after that checkpoint, most notably traffic for C, we will receive a bunch for a channel that isn't open, yielding errors in the net driver.
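To make the timing gap concrete, here is a tiny stand-alone model of it (plain C++ with made-up types, not engine code; the names only mirror the steps above):

#include <cstdio>
#include <set>
#include <vector>

using ActorId = int; // stand-in for an actor / NetGUID

struct CheckpointSim
{
    std::vector<ActorId> PendingCheckpointActors; // snapshot taken when the checkpoint starts
    std::set<ActorId>    OpenBunchesWritten;      // actors that got an Open bunch into the checkpoint
};

int main()
{
    std::set<ActorId> LiveActors = {1 /*A*/, 2 /*B*/};

    // "SaveCheckpoint": the pending list is snapshotted up front.
    CheckpointSim Checkpoint;
    Checkpoint.PendingCheckpointActors.assign(LiveActors.begin(), LiveActors.end());

    // Actor C spawns while the checkpoint is still being amortized over several ticks.
    LiveActors.insert(3 /*C*/);

    // "ProcessCheckpointActors" / "FlushCheckpoint": only the snapshot gets Open bunches.
    for (ActorId Actor : Checkpoint.PendingCheckpointActors)
        Checkpoint.OpenBunchesWritten.insert(Actor);

    // Playback scrubs to this checkpoint: C has regular (non-Open) traffic recorded after the
    // checkpoint started, but no open channel, which is where the net driver errors come from.
    for (ActorId Actor : LiveActors)
        if (Checkpoint.OpenBunchesWritten.count(Actor) == 0)
            std::printf("Actor %d has post-checkpoint traffic but no Open bunch in the checkpoint\n", Actor);
}

The guid cache snapshot in SerializeGuidCache (next paragraph) has the same shape of problem, just with NetGUID exports instead of Open bunches.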
SerializeGuidCache has the same issue: it takes a snapshot, and any new objects that enter the guid cache between then and the end of the checkpoint are missed, so any post-checkpoint traffic for those objects is missing important "initialization".
Now, given that we're doing server-side recordings, network traffic goes into both QueuedDemoPackets and QueuedCheckpointPackets during the checkpoint amortization period, even without our divergence to record regular demo frames during checkpoints. Let's take another example:
- SaveCheckpoint is called, actor A is added to the list of pending actors
- Super::TickFlush is called in UDemoNetDriver::TickFlushInternal, causing A to become dormant. Thus a Close bunch is recorded into QueuedDemoPackets (NOT CheckpointPackets)
- ProcessCheckpointActors is called for actor A, recording an Open bunch for Actor A.
During playback from this checkpoint, any regular traffic for actor A will assume the channel is closed, because that is the most recent event we've seen for that actor. However, when we start playback from our checkpoint, the opposite is true: its channel is open. This results in errors like "Received channel open command for channel that was already opened locally" or "Reliable bunch before channel was fully open".
To fix this, we acknowledge that we essentially have two streams of data during amortized checkpoints: "regular" (QueuedDemoPackets) and "checkpoint" (QueuedCheckpointPackets) traffic. The "regular" data can essentially be seen as data the checkpoint snapshot missed due to unfortunate timing, or, in the case of server recordings, data that came from externally triggered events such as dormancy, which aren't caught by the regular ProcessCheckpointActors flow where data is redirected to the QueuedCheckpointPackets array.
Thus, to fully "load" a checkpoint and make sure we're not missing any critical history or state for a given NetGUID (actor), we first process the checkpoint data and then any regular data gathered during the checkpoint period. Given that there are no guarantees about when during a checkpoint a given actor is replicated relative to when it may or may not receive regular data from other events (mentioned above), we'll almost always end up applying bunches for a given NetGUID in the wrong chronological order. To get around this we add a linearly incrementing counter to each network bunch on replay connections and keep a map of NetGUID -> CounterID during playback, skipping any bunch whose counter is older than the newest one we've already applied for that NetGUID.
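Roughly, the scheme looks like this (a simplified stand-alone sketch of the idea with made-up types and names, not our actual patch):

#include <cstdint>
#include <unordered_map>

using FNetGUIDKey = uint32_t; // stand-in for the engine's NetGUID type

// Recording side: every bunch written to either queue gets the next counter value,
// so the two streams share a single chronological sequence.
struct FReplayBunchHeader
{
    FNetGUIDKey NetGUID   = 0;
    uint64_t    CounterID = 0;
};

struct FReplayBunchRecorder
{
    uint64_t NextCounter = 1;

    FReplayBunchHeader Tag(FNetGUIDKey NetGUID)
    {
        return FReplayBunchHeader{ NetGUID, NextCounter++ };
    }
};

// Playback side: remember the newest counter applied per NetGUID and drop anything older.
struct FReplayBunchFilter
{
    std::unordered_map<FNetGUIDKey, uint64_t> NewestApplied;

    bool ShouldApplyBunch(const FReplayBunchHeader& Header)
    {
        uint64_t& Newest = NewestApplied[Header.NetGUID];
        if (Header.CounterID <= Newest)
        {
            return false; // we've already applied this bunch or something newer for this NetGUID
        }
        Newest = Header.CounterID;
        return true;
    }
};

Applied to the actor A example above: the regular Close bunch and the checkpoint Open bunch get consecutive counters, so whichever of the two was recorded last takes effect and the older one is skipped rather than being applied on top, regardless of which queue playback happens to read first.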
Finally, onto the topic of NetGUIDs specifically. When a client receives the data from the server (checkpoint + chunks of traffic), it has nothing acked, i.e. it has never processed the full object path for any object at all. I believe this is the purpose of SerializeGuidCache: to ensure all net-addressable objects have been registered before we start processing traffic for them. As I'm writing this, I realize that perhaps the root cause for us is that we're processing traffic for objects that didn't get registered in SerializeGuidCache, especially given that we take traffic recorded during a checkpoint into account in order to ensure everything is up to date. For example, if an actor is destroyed during the checkpoint amortization period, before the SerializeGuidCache step, it will not be recorded as part of the GuidCache snapshot, whereas our current fix ensures it will get NetGUID exports from its regular traffic bunches instead (or checkpoint bunches, whichever is processed first). With that, I also understand the reasoning behind SavePackageMapExportAckStatus better: it isn't meant as a primary mechanism for NetGUID exports, it just ensures that ack status keeps being recorded as normal from the point where the checkpoint started.
So rather than forcing ExportNetGUIDForReplay to be called a lot more, I think the proper fix for us might be to ensure SerializeGuidCache actually includes ALL objects for which the checkpoint contains traffic. Right now SerializeGuidCache does an "if (Object && (NetworkGUID.IsStatic() || Object->IsNameStableForNetworking()))" check, which excludes destroyed objects. But if that destruction happened during our amortization period, the object is left out of the snapshot even though we still process its traffic, including the destruction event, as part of our checkpoint load (the kind of relaxed condition I have in mind is sketched further below). For example:
- Actor A is created before our checkpoint
- SaveCheckpoint is called, A is added to the pending list
- ProcessCheckpointActors is called for A; we record an Open bunch for A with no NetGUID exports (with our local fix we do actually get exports here, since we reset the ack state). That info was in the bunch from step 1, which was omitted from the checkpoint.
- A is destroyed, a Close bunch is recorded as “regular” traffic.
- CacheNetGuids and SerializeGuidCache are called, omitting A since it is destroyed.
- FlushCheckpoint is called, checkpoint is finalized.
During loading of this checkpoint, we fail to process the checkpoint recording of A since it lacks NetGUID exports, and no exports were made in CacheNetGuids because at that point the actor was already destroyed. One could argue that A shouldn't be part of the checkpoint at all since it was destroyed during it, but the destroy event came in after the actor had been recorded as part of the checkpoint, so in order to ensure the actor is already destroyed during playback we now also need to include the destroy event (the Close bunch). One could also argue that a serialization error for an actor that will be destroyed momentarily isn't an issue; after all, in this case the Close bunch is included as part of the checkpoint fast-forward step, so the actor will never be visible to players. However, in our case this specific actor had components whose BeginPlay (on the client) depended on their replicated variables being initialized with server-given data; with the CDO default values of those properties present, several assertions triggered and the game crashed. And because the Open bunch for the actor failed to find NetGUID exports for its components, the initial bunches for the components couldn't be processed, which is what left those properties at their CDO defaults.
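To make the proposed direction concrete, this is roughly the shape of condition I have in mind for SerializeGuidCache (sketched with simplified stand-in fields rather than the real guid cache entry, and with bReferencedByCheckpointTraffic as a made-up stand-in for however we would track that the checkpoint, or the regular traffic recorded during the amortization period, references the GUID):

#include <string>

// Simplified stand-ins for the relevant bits of a guid cache entry.
struct FCacheEntrySketch
{
    bool        bObjectStillAlive = false; // the object is still resolvable (not destroyed)
    bool        bIsStatic         = false; // NetworkGUID.IsStatic()
    bool        bNameStable       = false; // Object->IsNameStableForNetworking()
    std::string PathName;                  // cached path info we could still export
};

// Current behaviour as I read it: destroyed objects are dropped from the snapshot.
bool ShouldSerialize_Current(const FCacheEntrySketch& Entry)
{
    return Entry.bObjectStillAlive && (Entry.bIsStatic || Entry.bNameStable);
}

// Proposed: also keep entries whose object is already gone but that the checkpoint (or the
// regular traffic recorded during the amortization period) still references, as long as the
// cached path info needed to export them is still available.
bool ShouldSerialize_Proposed(const FCacheEntrySketch& Entry, bool bReferencedByCheckpointTraffic)
{
    if (ShouldSerialize_Current(Entry))
    {
        return true;
    }
    return bReferencedByCheckpointTraffic && !Entry.PathName.empty();
}

The intent is that A from the example above would still get its path exported by SerializeGuidCache, even though the object itself is gone by the time CacheNetGuids runs.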
Sorry for the wall of text, but I hope this clarified what we're doing a bit more. Whilst I do believe some of these issues are self-inflicted by our divergence to record during amortized checkpoints, things like having two data streams on server recordings and missing critical events that come in during an amortization period feel like they would be problematic in regular Unreal as well. Do you think my proposed solution, i.e. avoiding forced NetGUID exports for both checkpoint and regular data traffic during checkpoint amortization periods and instead making sure CacheNetGuids and SerializeGuidCache don't miss any objects, sounds like a reasonable way forward, given our use case?