Hi, I’ve been looking for a side project to work on, and was considering a GPU-accelerated implementation of server-side replication (using CUDA) to help address high server CPU usage. GPU acceleration wouldn’t replace the bulk of the replication pipeline, as there’s far too much complexity in the replication system to shift all of it to the GPU. However, with a targeted approach there looks to be a lot of room for improvement, potentially even an order of magnitude, by replicating only basic data types such as transforms on the GPU (and possibly using dormancy/FlushNetDormancy to handle the remaining, more complex but less frequently changing data types).
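To make the transform-only idea a bit more concrete, here's the rough shape of kernel I have in mind: one thread per (connection, relevant actor) pair quantizing a staged transform into that connection's send buffer, with the packed bytes copied back and spliced into the normal bunches on the CPU. This is purely a hypothetical sketch, none of it is engine code, and the types, names, and quantization steps are placeholders; it also ignores relevancy, priority, and reliability entirely:

```cpp
#include <cuda_runtime.h>
#include <cstdint>
#include <math.h>

struct Transform
{
    float pos[3];
    float rot[4]; // quaternion
};

struct PackedTransform
{
    int32_t pos[3]; // fixed point, 1 cm steps (placeholder precision)
    int16_t rot[4]; // quaternion components mapped from [-1, 1] to int16
};

// One thread per (connection, relevant actor) pair. blockIdx.y selects the
// connection; actor_indices holds the concatenated per-connection relevancy
// lists and list_offsets[conn]..list_offsets[conn + 1] is that connection's slice.
__global__ void PackTransforms(const Transform* transforms,
                               const int* actor_indices,
                               const int* list_offsets,
                               PackedTransform* out)
{
    const int conn  = blockIdx.y;
    const int begin = list_offsets[conn];
    const int end   = list_offsets[conn + 1];
    const int i     = begin + blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= end)
        return;

    const Transform t = transforms[actor_indices[i]];
    PackedTransform p;
    for (int c = 0; c < 3; ++c)
        p.pos[c] = (int32_t)lrintf(t.pos[c] * 100.0f);   // 1 cm steps
    for (int c = 0; c < 4; ++c)
        p.rot[c] = (int16_t)lrintf(t.rot[c] * 32767.0f); // assumes a normalized quaternion
    out[i] = p;
}

// Host-side launch would look something like:
//   dim3 block(256);
//   dim3 grid((maxActorsPerConnection + 255) / 256, numConnections);
//   PackTransforms<<<grid, block>>>(dTransforms, dIndices, dOffsets, dOut);
```

The CPU would still own connections, channels, and bunch framing; the GPU would only produce the packed payload bytes, which is why I think a targeted approach is feasible without rewriting the whole pipeline.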
Whether that would make a material difference I’m not really sure, which is the purpose of this post: to get an idea of what limitations people have been running into at high client counts. Are people primarily hitting the CPU cost of packet generation (specifically UNetDriver::TickFlush, where most of the work is done), or bandwidth limits? Or maybe world/physics update CPU limits, in which case optimizing the network tick would still help, just not as much?
In my base test case I have 100 player characters and 270 replication-enabled physics actors. When I test with 200 active mock client connections, I see 120ms+ in the server tick (the TickFlush method), making it CPU-limited; that works out to roughly 0.6ms of TickFlush time per connection per frame. Since all the actors are idle, the bandwidth is negligible (4 bytes per frame). The server tick time scales roughly linearly with the number of connections. This was using stock Unreal Engine 5.1 (no Iris, Push Model, or Replication Graph).
There’s further scope to reduce transmission bandwidth by using delta compression techniques that may be prohibitively expensive to implement on the CPU (this comes with some other trade-offs, such as losing replay support, which can in turn be mitigated with other techniques). In a simple test running an actor in a straight line, the delta between the last client-acknowledged position and the sent position is a few meters, while the delta between the predicted position and the sent position is often under one centimeter, giving the potential to use well under 50% of the bandwidth of sending the absolute position in this best-case, cherry-picked example.

Taking this one step further, you could employ distance-based LOD to reduce precision. Currently you can reduce the net update frequency for distant actors, which is one form of LOD that comes at the cost of visual latency on the client. You could stack reduced precision on top of this reduced frequency to cut bandwidth further, or increase the frequency again while using a similar amount of bandwidth (a distant actor moving in a straight line is probably fine at a reduced frequency, but when they jump you may want to see that change reflected sooner).
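To illustrate what I mean by combining the delta encoding with distance-based precision, here's another minimal sketch. Everything in it is hypothetical: the names, the LOD thresholds and step sizes, and the assumption that the server runs the same cheap extrapolation the client uses so both sides agree on the "predicted" position:

```cpp
#include <cuda_runtime.h>
#include <cstdint>
#include <math.h>

__device__ float QuantStep(float distToViewer)
{
    // Placeholder LOD curve: ~1 mm steps up close, coarsening to ~10 cm far away
    // (distances in Unreal units, i.e. cm).
    if (distToViewer < 2500.0f)  return 0.1f;
    if (distToViewer < 10000.0f) return 1.0f;
    return 10.0f;
}

// One thread per actor for a given connection: quantize (actual - predicted)
// with a step chosen by distance from that connection's viewer. The client
// reconstructs pos = predicted + delta * step. A real version would fall back
// to an absolute update whenever the delta doesn't fit in 16 bits per axis.
__global__ void PackPositionDeltas(const float3* actual,      // authoritative positions this tick
                                   const float3* predicted,   // extrapolated from the last acked state
                                   const float*  distToViewer,
                                   int           count,
                                   short3*       outDeltas)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count)
        return;

    const float step = QuantStep(distToViewer[i]);
    outDeltas[i].x = (short)lrintf((actual[i].x - predicted[i].x) / step);
    outDeltas[i].y = (short)lrintf((actual[i].y - predicted[i].y) / step);
    outDeltas[i].z = (short)lrintf((actual[i].z - predicted[i].z) / step);
}
```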
The obvious downside to all of this is requiring a GPU accelerator for the server, and what that might mean for server costs; the economics would depend on just how expensive your network tick is. If you’ve run up against network scaling limitations in your project, I’d be interested to hear how far you managed to push things (actor/client counts) before things started breaking down. I believe Replication Graph can help a great deal with these issues; however, I think(?) it can still run into similar limitations if your clients all congregate in the same spatial area, which limits its ability to filter out lower-priority actors.