[Networking] Huge packet loss and replication loop (TArray replication)

Rumbleball · April 5, 2017, 6:13pm

Hi,

we need to send lots of data over the network and are currently only testing locally (single machine/LAN).
We have lots of actors that need to be replicated, each containing a TArray of about 100 to 70,000 bytes in size. The TArray itself is marked as replicated, and replication seems to work pretty well as long as not all actors try to replicate at once.

Clients can join late, at which point all actors (testing with 200-300 at the moment) need to be replicated at once (the actors are clustered within a radius of about 50 m). Somehow replication gets stuck trying to replicate these TArrays. When I remove the TArray from replication, all those actors get replicated without problems.

What seems to happen is that a lot of packets get lost on the network (for whatever reason) and the engine tries to resend them, and they get lost again. The engine gets stuck in a loop, repeatedly trying to send the lost data.

Here is a STAT NET shot:

The question is basically: why do those packets get lost? What is the issue?
This replication needs to happen only once.

Any help/suggestions would be appreciated, thanks.

Jambax · April 5, 2017, 6:53pm

At the moment your out rate per-frame (if I’m understanding the screenshot correctly) is 0.125Mb, which is a bit insane. In a 60 FPS game, that’s 7.5 Mb/sec - few internet connections have that kind of speed. You’re suffering from saturation, and basically throttling the connection. If this doesn’t work over LAN, it will never work online. Unreal does the best it can to reduce the network load but it has its own limits.

Before I mention this btw, there are ways to assist the replication system. You can tell actors how often per-second they should be considered (default is 100, so every frame usually - drop that) - and their priority. Reducing those might help the network system send data, eventually. Still though, you’re replicating way too much data. Therefore beyond this, there are two approaches to fix it.

The first is to manually batch together actors to send. You’ll have to write your own system that sends actors to the client and, when the client acknowledges them, sends the next batch. You’d have to manage that yourself, and it would probably require at least some engine level changes.

The second approach, and by far the best one, is to reduce the amount of data you’re sending. You simply cannot send that much data in a networked game, it’s unreasonable. If you’re crushing a LAN connection, then this is never going to work in the real world (most players in the world have terrible internet connections, an unfortunate fact). The most obvious question here is what is it that these actors have that you’re trying to replicate? Surely this is generated data of some kind - in which case you can probably generate it locally on the clients, and only send across a few minimal parameters. That’s the best approach for things like this.

EDIT: Btw - with 300 actors each with a 70,000 byte array, that’s over a gig of data. Not gonna happen

Rumbleball · April 6, 2017, 10:37am

Thanks TheJamsh,

Ok, first off: We are not creating a game in the common sense. We are working on a tool for creating shapes in real time. We are already not sending mesh data, but rather data that can be used to recreate the shape. And we are developing for VR -> 90 FPS.

I guess it already shows bytes per second, not per frame. The data rate is set to 200 kB/sec per client. Looking into the NetworkProfiler, it sends the 0.125 MB (megaBYTES) and waits for about a second before the next bunch of data is sent.
Within a local network this should be far less than what is possible. We are planning for singleplayer as well, but we need that high bandwidth for multiplayer. Number of players on the network: 2-5.
The client needs about 300 KB/sec downstream; I don’t see where this is an issue. The client should easily be able to work through the network buffer. Upstream from the client is far below that rate. And it is only needed when joining, not all the time.

This gets set within the constructor of the actor:

NetUpdateFrequency = 1;
MinNetUpdateFrequency = 0.01;
NetPriority = 1;

I thought unreal gathers network data per frame and sends that bunched together already?

see above.

300 * 70,000 = 21,000,000 bytes = 21 MB. A reasonable size for today’s networking. Sure, not everyone has that bandwidth -> that person would simply have to skip multiplayer.
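
To put the numbers in this post side by side, here is a rough back-of-the-envelope sketch in plain C++ (not engine code). It assumes 1 KB = 1000 bytes, assumes every actor carries the worst-case 70,000-byte array, and ignores all protocol overhead and packet loss. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Total payload if every actor carries an array of the given size.
std::uint64_t TotalPayloadBytes(std::uint64_t NumActors, std::uint64_t BytesPerActor)
{
    return NumActors * BytesPerActor;
}

// Ideal transfer time in seconds at a fixed rate, ignoring overhead and loss.
double IdealTransferSeconds(std::uint64_t PayloadBytes, std::uint64_t BytesPerSecond)
{
    assert(BytesPerSecond > 0);
    return static_cast<double>(PayloadBytes) / static_cast<double>(BytesPerSecond);
}
```

With 300 actors at 70,000 bytes each this gives 21,000,000 bytes, which would take 105 seconds at 200 KB/sec and 21 seconds at 1 MB/sec under these idealized assumptions.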

[EDIT]
By the way, if the issue is packet loss due to client saturation, the packets should get through at some point, as the client network buffer gets emptied. Another reason why I don’t understand what is going on.
For testing I set the out rate back to 10 KB/sec, but the result is more or less the same. Replication gets stuck at some point.

[EDIT2]
Increasing the bandwidth to 1 MB/sec actually works better than 200 KB/sec. Still packet loss and looping resends.

Rumbleball · April 8, 2017, 9:04pm

Made some more tests and need to somewhat correct the statement above. It does not seem to be the packet loss. As the image above shows, there is an OutPackets of 312 and an OutLoss of 32, which means this should not be an issue. Most of the packets are still reaching their target.

With the Blueprint node GetAllActorsOfClass I’m checking the number of actors on the server and the client. It shows that most of the time the client does not spawn the actors at all. It seems Unreal sends property data before the actor even exists on the client. The client does not know what to do with that data and discards it, so the server sends it again. In the end only property data is sent, for actors that have never been spawned on the client. This occupies the whole bandwidth, and actor spawn messages never get through. This would also explain why increasing the bandwidth works better: there is a higher chance for actor spawn messages to get through, property data can get assigned, and there is more room for other data. This for sure is a major issue.

To check this further, I made use of DOREPLIFETIME_ACTIVE_OVERRIDE within AActor::PreReplication. The big arrays are not replicated at the beginning. This leaves room on the line for the actor spawn messages, and all actors on the server get replicated to the client. Once all actors exist, I set the big arrays to replicated and there you go. Works out pretty well. There are still more issues once the arrays grow too big, which should be a bug somewhere as well. The amount of data is NO issue at all; the replication system seems bugged.

EDIT:
Kept digging through the source and noticed that there are no special spawn messages. When the first message for an actor comes in, a Channel for that actor is created. After that the Channel is checked for a valid actor reference; if it is not valid, the actor gets spawned.
I also need to correct my statement that no actors get spawned at all. Some actors do get spawned; it was just that the player, which checks for the other actors, had not been spawned yet at that point.
No idea what the issue is so far, still digging.

Jambax · April 10, 2017, 12:53pm

Okay yeah so reading some of this back I made a few errors - if you’ve done a network profiler test and it only shows 0.125Mb outgoing then that seems reasonable. (Mixed up my Byte vs Bit terms hence the huge number). Actor Spawning is expensive, unfortunately. It sounds tbh like the system is doing the best it can, but this is a pretty unique use-case to spawn that many actors in quick succession and replicate such a large amount of data.

It might be worth using RPCs to send data rather than relying on the automated / variable replication system. At the end of the day, it’s a lot of work to ask the Property system to check 70,000 elements of an array for 300 actors to see if any elements have changed. If you think about it, the network system is optimized to send the minimal amount of data possible since it’s built with games in mind.

From my experience, this just sounds like a Saturation issue - especially if you’ve modified PreReplication to send array data after actor data. Maybe there is an underlying bug somewhere, but IMO if the replication system was truly bugged in the sense that it isn’t reliably updating clients most of the time - then that would surely affect all games and projects regardless of data size, not just your own.

Also, check this thread out - according to Epic, streaming large amounts of data isn’t something they’ve had time to integrate nicely yet. You could look at doing your own Sockets, maybe this thread has something that can help.
https://answers.unrealengine.com/questions/151532/what-is-the-best-way-to-replicate-large-amounts-of.html

anonymous_user_f77a82de · April 10, 2017, 11:57pm

sending 20 mb upstream? I think you should do more research.

Rumbleball · April 11, 2017, 6:34pm

Made an EDIT to my last post: https://forums.unrealengine.com/showthread.php?141591-Networking-Huge-packet-loss-and-replication-loop-(TArray-replication)&p=692631&viewfull=1#post692631

Thanks TheJamsh. I had already come across that post; I was reading everything about TArray replication in the forum that I could find.
I don’t see the point in doing my own sockets, as the amount of data sent up/down would be the same. Just the handling would be different. As we cover join in progress, Unreal’s replication system does exactly what we need: it keeps already joined players up to date and sends everything to players coming in late.
Digging through the source I have already come across some hardcoded limits.

What I know so far (please correct me if something is different):
Each Actor for each connection has a Channel.
There is a ReplicationManager per actor that tracks which data changed at what point in time. The rep data is only checked once per actor and then distributed to the client instances using timestamps of the last replication.
The data of an Actor is grouped into a Bunch (a Bunch is a combination of multiple properties; a Bunch only contains data of a single actor). If a Bunch gets too big, it gets split into multiple smaller bunches (PartialBunch). The Bunch/PartialBunches make up OutgoingBunches. OutgoingBunches.Num() must stay <= 256, otherwise the reliable output buffer overflows and the client is disconnected. Each Bunch/PartialBunch can hold 4008 bits. A Bunch is also the data for a single network packet. The numbers defining those values are hardcoded and can thus only be changed with a custom build of the engine.
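
The limits listed above can be turned into a quick calculation. This is a plain C++ sketch, not engine code. The two constants are simply the figures quoted in this post (not verified against the engine source), the helper names are made up, and bunch headers are ignored.

```cpp
#include <cassert>
#include <cstdint>

// Values as quoted in the post above; assumptions, not checked against engine source.
constexpr std::uint64_t MaxBitsPerPartialBunch = 4008;
constexpr std::uint64_t MaxOutgoingBunches = 256;

// Number of partial bunches needed for a payload, rounding up.
constexpr std::uint64_t PartialBunchesFor(std::uint64_t PayloadBytes)
{
    return (PayloadBytes * 8 + MaxBitsPerPartialBunch - 1) / MaxBitsPerPartialBunch;
}

// Largest payload in whole bytes that fits into the outgoing buffer at once.
constexpr std::uint64_t MaxBytesInFlight()
{
    return MaxOutgoingBunches * MaxBitsPerPartialBunch / 8;
}
```

Under these assumptions a 70,000-byte array needs 140 partial bunches, and the outgoing buffer holds at most 128,256 bytes at a time. That is in the same range as the 0.125 MB bursts seen in the NetworkProfiler, although the match may be a coincidence.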

The sentence you were referring to is a bit misleading when read without everything else. 21 MB was just the amount of data in that example, not the data rate.
If you think uploading 21 MB is a lot, I’m up for discussion or links.

Still stuck on the replication; tons of data is sent out for several minutes without issues. Looking at the NetworkProfiler for the client computer, tons of data is coming in. The data is just not processed as intended.
Still digging.

ExtraLifeMatt · April 11, 2017, 6:58pm

I don’t think the answer here is to make Unreal work with larger amounts of replicated data. I think the answer is to find ways to send less data. You could try something as simple as using some type of compression scheme (LZH, Huffman) on your data stream before sending it out. You could better organize the data so you only send the closest/only visible shapes to an actor first and do the others later (if at all). You could create “Network friendly” versions of all your shapes which use things like 16bit floats rather than 32bits for things like position (There’s tons of examples of this in the engine, just search for NetQuantize).

Right now your approach seems very brute force. If you break things into more manageable quantities - not only will you be sending less data, but you’ll have a far easier time debugging things.
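
As an illustration of the “network friendly” idea, here is a minimal sketch that packs a 32-bit float into 16 bits. Note the swap: this is fixed-point quantization over a known range, not a 16-bit float and not the engine’s NetQuantize types. The function names and the idea of a fixed value range are assumptions for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Map a value in [Min, Max] onto 16 bits (0..65535). Out-of-range input is clamped.
std::uint16_t QuantizeTo16Bits(float Value, float Min, float Max)
{
    assert(Max > Min);
    const float Clamped = std::fmin(std::fmax(Value, Min), Max);
    const float Normalized = (Clamped - Min) / (Max - Min);
    return static_cast<std::uint16_t>(std::lround(Normalized * 65535.0f));
}

// Inverse mapping; the result differs from the original by at most about half a step.
float DequantizeFrom16Bits(std::uint16_t Quantized, float Min, float Max)
{
    assert(Max > Min);
    return Min + (static_cast<float>(Quantized) / 65535.0f) * (Max - Min);
}
```

For positions inside a 50 m radius, for example, a range of -50 to 50 gives a step of roughly 1.5 mm at half the size of a float. Whether that precision is acceptable depends on the data being sent.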

kamrann · April 11, 2017, 11:26pm

You say the replication needs to happen only once, but then some of your other comments seem to contradict that. Can you clarify - does this array, once filled, never change? It makes a big difference to what the right approach is.

Rumbleball · April 12, 2017, 7:10pm

Thanks for your answer. I’m new to Unreal and especially to networking, and need to understand some basic principles first before building our own solution. Unreal does basically everything we need. Sure, it would be good to implement our own replication logic that only does what is really needed, but that’s not something you do within a week.
I already use the smallest datatypes possible for net transfer and savegames.

The array fills slowly. The connected clients need that data right away, while the array is filling. Newly connected clients need to get the whole array, plus changes to it if it is still filling. Once complete, there are no more changes at the moment, but there might be in the future.
As Unreal uses delta replication for arrays, there is not that much data when only the changes (delta) are replicated. Sure, Unreal needs to check the arrays again and again for changes, which will cost a lot of computation as the world grows. Fast TArray replication is not an option, as the data needs to stay in order.

ExtraLifeMatt · April 12, 2017, 7:29pm

If I were you, I’d write a simple manager on the server that feeds data in chunks to a client (as well as handles sending updates) using RPCs.

The benefit being you could place clients in a simple state machine (NeedsBaseline, Dirty, etc), and simply feed them the info while keeping track of client state internally.

So, when a client connects you add them to your manager and mark them as NeedsBaseline. Every frame, if you don’t have an outstanding request sent to a client, you send X amount of elements of your array. When the client receives it, it replies back that it got X elements. You mark that section as done and continue until you have sent all data (and received acknowledgements from the client). Once a client has the baseline, you’re good unless you need to send a delta which you do in a similar fashion. The SEND/ACK paradigm is how things like TCP/Reliable UDP work so it’s a reliable paradigm.
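
A minimal sketch of the per-client bookkeeping described above, in plain C++ rather than engine code. It assumes the array is append-only and that one chunk is in flight at a time. All type and function names are hypothetical, and the actual transport (presumably a reliable client RPC for the chunk and a server RPC for the acknowledgement) is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class EClientSyncState { NeedsBaseline, Dirty, UpToDate };

// A chunk the server wants a client to apply: Data belongs at position Offset.
struct FChunk
{
    std::size_t Offset = 0;
    std::vector<std::uint8_t> Data;
};

// Server-side bookkeeping for one client and one append-only array.
class FClientFeed
{
public:
    explicit FClientFeed(std::size_t InChunkSize) : ChunkSize(InChunkSize)
    {
        assert(ChunkSize > 0);
    }

    // Call once per tick. Returns true and fills OutChunk if a chunk should go out now.
    // Nothing is sent while an earlier chunk still waits for its acknowledgement.
    bool TryGetNextChunk(const std::vector<std::uint8_t>& Source, FChunk& OutChunk)
    {
        if (bAwaitingAck || AckedBytes >= Source.size())
        {
            return false;
        }
        const std::size_t Count = std::min(ChunkSize, Source.size() - AckedBytes);
        OutChunk.Offset = AckedBytes;
        OutChunk.Data.assign(Source.data() + AckedBytes, Source.data() + AckedBytes + Count);
        PendingBytes = Count;
        bAwaitingAck = true;
        if (State == EClientSyncState::UpToDate)
        {
            State = EClientSyncState::Dirty; // array grew after the baseline was done
        }
        return true;
    }

    // Call when the client confirms the chunk that started at Offset.
    void OnChunkAcked(std::size_t Offset, const std::vector<std::uint8_t>& Source)
    {
        if (!bAwaitingAck || Offset != AckedBytes)
        {
            return; // stale or duplicate acknowledgement, ignore it
        }
        AckedBytes += PendingBytes;
        PendingBytes = 0;
        bAwaitingAck = false;
        if (AckedBytes >= Source.size())
        {
            State = EClientSyncState::UpToDate;
        }
    }

    EClientSyncState GetState() const { return State; }
    std::size_t GetAckedBytes() const { return AckedBytes; }

private:
    std::size_t ChunkSize;
    std::size_t AckedBytes = 0;
    std::size_t PendingBytes = 0;
    bool bAwaitingAck = false;
    EClientSyncState State = EClientSyncState::NeedsBaseline;
};
```

A late joiner would get a fresh FClientFeed and start in NeedsBaseline. A client that already has the baseline only receives the newly appended tail. Resending a chunk after a timeout is not handled in this sketch.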

kamrann · April 12, 2017, 11:20pm

In that case, exactly what ^he^ just said.

You may be right that there’s an issue, but UE4’s replication simply wasn’t meant for this kind of use. RPCs are by far a better solution in this case (lower level socket code would make the most sense, but is also more work).

Rumbleball · April 13, 2017, 4:32pm

Unfortunately such a simple manager can easily grow pretty big because of some “small” requirements.

ExtraLifeMatt · April 13, 2017, 4:39pm

Sure, I never said network programming was easy. The devil is always in the details.

However, I don’t see any other option. You can either try and get replication to fit into your use case (which is a bit of a round peg-square hole dilemma), or write a custom solution (either better organize your data into smaller sizes so replication can handle the load, or write a manager to handle large data distribution).

Rumbleball · April 13, 2017, 5:25pm

had another look at: https://answers.unrealengine.com/questions/151532/what-is-the-best-way-to-replicate-large-amounts-of.html
and found what should be the issue. Somehow I did not get that the first two times I read the thread. As ExtraLifeMatt mentions, “The devil is always in the details”.

But why split a packet (Bunch) into partial packets (PartialBunch) and send small packets, when the loss of a single partial packet causes the whole bunch to fail? I guess someone got something wrong there.
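
The concern above can be made concrete with a little probability. The sketch below is plain C++, not engine code, and rests on loud assumptions: each packet is lost independently with the same probability, and a bunch only counts as delivered if every one of its partial bunches arrives. The function name is made up.

```cpp
#include <cassert>
#include <cmath>

// Chance that all partial bunches of one bunch arrive, assuming each packet
// is lost independently with probability LossRate.
double WholeBunchArrivalProbability(double LossRate, unsigned NumPartialBunches)
{
    assert(LossRate >= 0.0 && LossRate <= 1.0);
    return std::pow(1.0 - LossRate, static_cast<double>(NumPartialBunches));
}
```

Taking the STAT NET figures quoted earlier in the thread at face value as 32 lost out of 312 sent (about 10%), and 140 partial bunches for a 70,000-byte array at 4008 bits each, WholeBunchArrivalProbability(32.0 / 312.0, 140) comes out below one in a million. A bunch that fits into a single packet would get through roughly 9 times out of 10. If these assumptions hold, that would explain why the big arrays keep being resent.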

Well then, trying my luck on custom replication.

Thank you guys for your replies/help.

Zoc · April 14, 2017, 5:10am

Sorry to join late on the topic, but have you noticed this UProperty?

https://docs.unrealengine.com/latest/INT/Programming/UnrealArchitecture/Reference/Properties/Specifiers/RepRetry/index.html

I hope it helps!

Pierdek · April 14, 2017, 11:06am

Just checked in the 4.15 code what this thing does:

case EVariableSpecifier::RepRetry:
{
	FError::Throwf(TEXT("'RepRetry' is deprecated."));
}

Rumbleball · April 18, 2017, 2:08pm

Thanks @Pierdek. However, as the documentation for this states, it was only for structs that contain references to other replicated actors. Thanks anyway @Zoc

anonymous_user_3b4c8d46 · April 20, 2017, 8:27am

Hey guys, this is a bit irrelevant, but how can I run the stat net command?
It doesn’t work to call the console command “Stat net” for me

Rumbleball · April 20, 2017, 11:39am

I’m using a “Development Editor” build of the engine, which should be the default one. No issues here using “stat net”. Do other stat commands work on your side?