Parallel Iris: Guidance needed on use of NoSync flags in Net Trace events.

Hello, I have some questions regarding the changes made in this commit to enable tracing for parallel polling in Iris: https://github.com/EpicGames/UnrealEngine/commit/da1a30a3f74782a7c3d6d998ae19ee87a0fad3d8\#diff\-144c2e5a19eb9944eb081dc4525f4b2185aad18bfe7b08ecb3a7085cc91da1d9

I recently modified our engine and Iris to execute the client connection tick phase (`STAT_NetDriver_TickClientConnections`) in parallel.

I’m happy to say that we’ve been running our servers with these modifications without issue for about 2 months now.

We run automated scale tests every day and 2-3 large scale playtests a week so I’m fairly comfortable in saying it’s free of any obvious bugs.

However, we do have an issue related to traces.

Roughly 2 in 5 traces run into an issue when loading the file into Insights which causes Insights to freeze for several minutes before opening the trace.

This freezing is caused by an extreme number of log statements being flushed to the program log file. The log in question being:

UE_LOG(LogNetTrace, Warning, TEXT("PacketContentEvent GameInstanceId: %u, ConnectionId: %u %s, Missing NameIndex: %llu"), (uint32)GameInstanceId, (uint32)ConnectionId, ConnectionMode ? TEXT("Incoming") : TEXT("Outgoing"), DecodedNameOrObjectId);

The log files produced when opening such a trace ends up around 3GB in size for about 1 minute of trace data.

If I disable this log line, Insights is able to open the file without issue but due to the log triggering in the first place I fear there may be missing Net Trace data.

Realistically, if we wanted to get reliable packet info then we can just disable our parallel ticking when we want an accurate trace of the net data, but this isn’t ideal.

I’m in the process of testing multiple things, but I believe the main issue to be the use of `NoSync` for the Net Trace Name Event:

// Trace a name
UE_TRACE_EVENT_BEGIN(NetTrace, NameEvent, NoSync)
    UE_TRACE_EVENT_FIELD(uint32, NameId)
    UE_TRACE_EVENT_FIELD(UE::Trace::AnsiString, Name)
UE_TRACE_EVENT_END()

If I understand correctly, Insights is unable to locate the Name ID for the Packet Content Event, which could be due to the Name Event not being recorded reliably.

I will be running several scale tests to check if the issue still occurs after removing the flag to try and verify this.

However, before removing the `NoSync` flag from the event I would like to ask for information regarding the bug mentioned in the original commit.

It mentions that NoSync was added to deal with a bug in the tracing code itself, but I couldn’t find any information about such a bug elsewhere.

Hi,

The issue we encountered was that when tracing the parallelized polling, the trace would occasionally not open with the following error in the TraceAnalyzer: “Serial gaps detection failed.” This seems to be due to a bug with how sync trace events are output from multiple threads. We observed that a sync event may be emitted but not appear in the trace stream, causing all events after the missing one to remain unprocessed.

I don’t believe we have a public issue tracker available for this. Work is ongoing to fix this bug though, and we hope to have a fix submitted soon.

After discussing this with a developer more familiar with the work around parallelizing Iris, it is okay to remove the “NoSync” flag, but you will likely run into the same issue if Iris’ parallel polling logic is enabled.

They also wanted to call out that we’re currently working to officially support running NetConnectionTick in parallel on servers, allowing for separate tasks to be run for each connection and for Iris’ write phase to be parallelized.

Thanks,

Alex

Hey Alex, thanks for the fast response. I’ll keep an eye out for that fix on Github.

I’ve performed some scale tests with the flag removed and the only issue I’ve encountered so far is that the InitEvent is not reliably produced in the Trace Data on one of our scale tests (weapons).

This leads to Insights not processing the net trace data and prevents the Networking window from being opened.

I’ve marked it as NoSync + Important which seems to have resolved the issue, but the only way to know for sure will be to keep doing traces.

Also, it’s great to hear that there’s someone working on parallelizing this phase at Epic. If you’d like, we’re open to sharing the details of our implementation with you all; since we’ve had this running for a while now without any issues (aside from tracing).

A few more scale tests later and I’ve encountered two issues:

  1. One of the traces would open the Networking Insights tab but not show any data.
    1. There were no Game Instances listed in the dropdown menu when this happened. Seems to be a missing InstanceUpdatedEvent.
  2. The traces would not display any client IDs in the dropdown menu, despite showing me the packet data.
    1. Going back through older traces, this seems to have been an ongoing issue probably since I initially implemented multithreading.

Hi,

It turns out that fix was actually recently submitted! You can find it here on GitHub: https://github.com/EpicGames/UnrealEngine/commit/fce424196d6269127b66bf709cb5d9dab0d376b7

If you’re able to integrate that change, please let us know if you run into any other issues with tracing.

Thanks,

Alex

Awesome, I’ve integrated the changes from that commit along with removing NoSync from the NameEvent.

I was never able to encounter the issue you had during my earlier testing unfortunately. And I do believe the debug logging is enabled, as our TargetAnalyzer.cs contains:

GlobalDefinitions.Add("UE_TRACE_ANALYSIS_DEBUG_API=1");
GlobalDefinitions.Add("UE_TRACE_ANALYSIS_DEBUG=1");
GlobalDefinitions.Add("UE_TRACE_ANALYSIS_DEBUG_LOG_IMPL=1");
GlobalDefinitions.Add("UE_TRACE_ANALYSIS_DEBUG_LEVEL=2");

The only issue I encountered after integrating the change was the one I mentioned in my reply, which was InitEvent.

That seems to be resolved by marking it as Important, as afterwards I ran our scale tests and each trace file could be opened normally and displayed the data without issue.

Hi,

Great, glad that’s working for you! If you have any further questions or run into any other problems, please don’t hesitate to reach out.

Thanks,

Alex

Hey, sorry for the late response.

I’ve just checked the last few days of our scale test results and all seems to be in working order.

Thanks for your help with this.

And again, if you’re willing we would love to share our implementation for this with Epic.

I understand that parallelizing this isn’t a priority at Epic currently, and other work may take priority, but we’d love the chance to give back.

Hi,

You’re very welcome!

If you’d like to further discuss parallelization, we’d recommend making a new question at this point, so it can be routed to the appropriate SME.

Thanks,

Alex