
How to efficiently simulate hundreds of instances of something at once?

Thanks for the pointers!

Yeah, I was going to try ActorIterator just to see what would happen. Out of interest, why use DistSquared instead of Distance? Doesn’t that just add to the complexity without changing the behaviour? No idea why I used GetSafeNormal on the Forward vector, that was a boo boo on my part. You might be right about the distance/cone culling too, though I haven’t tested that yet. I did make the distance smaller (15K) since then.

The test was with only two of the actor components in the world to test against. I think I might be being limited by the drawing of the missiles at this stage rather than the simulation, though I’m not sure about that, will find out in a second.

The ordnance can’t collide with each other, and they ignore the collision channel (at least, they should). The collision check for the ordnance is just a small sphere primitive, set up the same way UT/ShooterGame’s is. I’ll have to dig into the profiler now to figure out if the physics engine is still testing overlaps for the projectiles.

EDIT:
The main costs appear to come from two things: I get a big spike both when spawning and when destroying the actors, but the code itself seems to add less than 2ms of processing time when it has something to lock onto. I’m spawning 250 of them at once. Staggering the spawning reduces the spike, but increases the overall processing time, since the UpdateTarget() calls are spread out across lots of frames instead of being processed in one.

The spawning might still be related to the overlapping, however (the spawn spike is certainly bigger than the destruction one), since the code checks on overlap whether the overlapping actor is the instigator or not.

Finding the squared length of a vector is actually a prerequisite to working out its length, so when either will do, the squared length saves you doing a sqrt.
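To make that concrete, here’s a minimal sketch of why the squared distance is enough for comparisons. A plain `Vec3` struct stands in for UE4’s `FVector` here (`FVector::DistSquared` works the same way):

```cpp
// Sketch: comparing distances without a sqrt. Since sqrt is monotonic,
// comparing squared distances gives the same ordering as comparing
// real distances, so you skip one sqrt per candidate.
struct Vec3 { float X, Y, Z; };

inline float DistSquared(const Vec3& A, const Vec3& B)
{
    const float DX = A.X - B.X, DY = A.Y - B.Y, DZ = A.Z - B.Z;
    return DX * DX + DY * DY + DZ * DZ;
}

// True if A is closer to Origin than B is.
inline bool IsACloser(const Vec3& Origin, const Vec3& A, const Vec3& B)
{
    return DistSquared(Origin, A) < DistSquared(Origin, B);
}
```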

With only two of the actors, it seems clear that the targeting code is not the bottleneck here anyway. I haven’t done anything involving large numbers of actors in UE4 yet, however it’s quite possible that the spawning is the issue. Will the lifetime of the missiles generally be short? You might want to have a pool of missile actors which you recycle, and just toggle their active/visibility states rather than constantly spawn and destroy them.

You could go further if you really want this optimised as much as possible. If you forgo all the conveniences of actors and components for your missiles, you could create an AMissileManager actor which you just have a single instance of, rather than having an actor per missile (the same kind of idea as AInstancedStaticMesh). You wouldn’t even need a component per missile. Just store whatever you need in an array and deal with collision manually. It sounds like you could probably model a missile as a single point and do away even with the sphere overlap tests. Then you get the cache benefits Camille was talking about too. You’d have to look into how you can tap directly into the underlying collision octree, and write your own rendering component too. It wouldn’t be straightforward, but it definitely saves a lot of unnecessary overhead and gives you complete control.

I’d be interested to know what exactly Camille was suggesting with the buffer. Since you’d have to iterate through and dereference the actors in order to update the missile locations, it’s not clear to me how this would help. But then, a lot of what she said went over my head too!

Ah of course, that makes sense!

Here’s the current stage of it, this isn’t a stress test at all, but it shows the individual missiles working really well.

Yeah. If you only need distances for comparisons, then the squared size is sufficient. It’s a common minor optimization. Only use exact distances when you need them, for instance for linearly interpolating damage according to distance from a blast or some such.

You definitely want to pool actors to avoid constant respawning, yeah. Which is another point in favour of a centralized missile manager.
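The shape of a pool is roughly this. A plain struct stands in for the actual missile actor here (in UE4 you’d keep spawned missile actors around and toggle visibility/collision rather than destroying them; all the names below are illustrative):

```cpp
// Sketch of a fixed-size pool. "Spawning" becomes finding an inactive
// slot and reactivating it; "destroying" just clears the active flag,
// so there's no allocation churn or GC pressure per missile.
#include <vector>

struct PooledMissile
{
    bool bActive = false;
    // ... location, direction, homing target, etc.
};

class MissilePool
{
public:
    explicit MissilePool(int Capacity) : Missiles(Capacity) {}

    // "Spawn": reuse the first inactive missile, or fail if exhausted.
    PooledMissile* Acquire()
    {
        for (PooledMissile& M : Missiles)
        {
            if (!M.bActive) { M.bActive = true; return &M; }
        }
        return nullptr;
    }

    // "Destroy": deactivate only; the slot is reused by the next Acquire.
    void Release(PooledMissile* M) { M->bActive = false; }

private:
    std::vector<PooledMissile> Missiles;
};
```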

Remember we’re looking to speed up iterating over the missiles. The critical path of the algorithm is processing the locations, filtering, etc. If you take dereferencing actors out of said critical path, the important part can go that much faster. Inputs for the algorithm are target location, missile location and missile direction. Outputs depend on your missile’s movement component, currently it seems to take the missile’s homing target but it could also just be missile direction. Your missile’s data buffer would end up looking like this:


struct FMissileData
{
	FVector Location;     // input: current missile location
	FVector Direction;    // input: current missile direction
	AActor* HomingTarget; // output: chosen target, written by the update pass
};

By using a manager that arranges all of this information in a contiguous, cache-friendly fashion, the costly part of the algorithm can go much faster. It might not sound very intuitive, but the added overhead of copying data in and out of these data buffers can still result in a net gain. Each missile tracked by the manager has an index property that keeps track of where its data is in the buffers, and you use that index to copy data to/from the buffers during the copy pass.

Single threaded

  1. Data copy: Copy all target locations to a buffer, copy all missile locations and directions to a buffer
  2. Data update: Iterate over missile data, choose targets, write outputs
  3. Data copy: Iterate over missile data, copy missile targets from buffer into movement component
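The middle pass is where the payoff is: it never dereferences an actor. A rough standalone sketch, with a plain `Vec3` standing in for `FVector` and target indices standing in for the `AActor*` that pass 3 would resolve (all hypothetical names):

```cpp
// Sketch of pass 2: iterate over flat, contiguous missile data and
// choose the nearest target. Passes 1 and 3 (not shown) are the only
// places actors get touched; this loop reads and writes buffers only.
#include <vector>
#include <limits>

struct Vec3 { float X, Y, Z; };

inline float DistSquared(const Vec3& A, const Vec3& B)
{
    const float DX = A.X - B.X, DY = A.Y - B.Y, DZ = A.Z - B.Z;
    return DX * DX + DY * DY + DZ * DZ;
}

struct MissileData { Vec3 Location; int TargetIndex = -1; };

void UpdateTargets(std::vector<MissileData>& Missiles,
                   const std::vector<Vec3>& TargetLocations)
{
    for (MissileData& M : Missiles)
    {
        float Best = std::numeric_limits<float>::max();
        for (int i = 0; i < (int)TargetLocations.size(); ++i)
        {
            const float D = DistSquared(M.Location, TargetLocations[i]);
            if (D < Best) { Best = D; M.TargetIndex = i; }
        }
    }
}
```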

Remember that I’m always keeping multithreading on the table as an eventual further optimization, and that’s where data rearrangement stands to gain the most. To illustrate:

Multithreaded

  1. Data copy: Copy all target locations to a buffer, copy all missile locations and directions to a buffer
  2. Kick off (1…n) worker threads
  3. Blocking wait: Until worker threads are finished
  4. Parallel data update: 1…n threads iterate over missile data, choose targets, write outputs
  5. Data copy: Iterate over missile data, copy missile targets from buffer into movement component
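Steps 2–4 above amount to slicing the missile buffer across workers and joining. A standalone sketch using `std::thread` (in UE4 you’d more likely reach for `ParallelFor` or `FRunnable`, but the shape is the same):

```cpp
// Sketch of the blocking parallel update: each worker owns a disjoint
// slice of the missile buffer, so no synchronization is needed beyond
// the final join (the "blocking wait" step).
#include <vector>
#include <thread>
#include <limits>
#include <functional>
#include <algorithm>

struct Vec3 { float X, Y, Z; };
struct MissileData { Vec3 Location; int TargetIndex = -1; };

inline float DistSquared(const Vec3& A, const Vec3& B)
{
    const float DX = A.X - B.X, DY = A.Y - B.Y, DZ = A.Z - B.Z;
    return DX * DX + DY * DY + DZ * DZ;
}

void UpdateRange(MissileData* Begin, MissileData* End,
                 const std::vector<Vec3>& Targets)
{
    for (MissileData* M = Begin; M != End; ++M)
    {
        float Best = std::numeric_limits<float>::max();
        for (int i = 0; i < (int)Targets.size(); ++i)
        {
            const float D = DistSquared(M->Location, Targets[i]);
            if (D < Best) { Best = D; M->TargetIndex = i; }
        }
    }
}

void ParallelUpdate(std::vector<MissileData>& Missiles,
                    const std::vector<Vec3>& Targets, int NumThreads)
{
    std::vector<std::thread> Workers;
    const size_t Chunk = (Missiles.size() + NumThreads - 1) / NumThreads;
    for (int t = 0; t < NumThreads; ++t)
    {
        const size_t First = t * Chunk;
        if (First >= Missiles.size()) break;
        const size_t Last = std::min(First + Chunk, Missiles.size());
        Workers.emplace_back(UpdateRange, Missiles.data() + First,
                             Missiles.data() + Last, std::cref(Targets));
    }
    for (std::thread& W : Workers) W.join(); // blocking wait
}
```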

That still doesn’t make the best use of parallelism, because you’re still blocking the game thread while waiting for jobs to finish. A fully multithreaded and asynchronous solution would avoid blocking the main thread, so that the only penalty from a sudden, huge spike in missiles flying around would be that they are slower to acquire their targets, rather than affecting the frame rate. Async would look roughly like this:

Multithreaded async

  1. Non-blocking wait: If there are still worker threads busy, skip this tick
  2. Data copy: Copy all missile targets from buffer into movement component, copy missile location/direction into buffer, copy all target locations to a buffer
  3. Kick off (1…n) worker threads
  4. Parallel data update: 1…n threads iterate over missile data, choose targets, write outputs
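The key difference is step 1: the game thread only polls, it never waits. A bare-bones skeleton of that tick (hypothetical shape; in UE4 you’d likely use FAsyncTask or the task graph rather than raw threads):

```cpp
// Sketch of the non-blocking async tick. The game thread checks an
// atomic "busy" flag; if workers are still running, it skips this
// tick entirely instead of blocking on a join.
#include <atomic>
#include <thread>

std::atomic<bool> bWorkersBusy{false};

// Called every tick on the game thread. Returns true if work was kicked.
bool TickMissileManager()
{
    if (bWorkersBusy.load())
        return false; // 1. Non-blocking wait: workers busy, skip this tick.

    // 2. Data copy: results out of the buffers, fresh inputs in (omitted).

    // 3. Kick off workers.
    bWorkersBusy.store(true);
    std::thread([] {
        // 4. Parallel data update would run here (omitted).
        bWorkersBusy.store(false);
    }).detach();
    return true;
}
```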

Then you could further refine this by adding double buffering, so that missile update threads can still work on a “back” buffer even while the main thread copies data to/from the “front” buffer. But that would only be a gain if you further rearrange the entire algorithm so that even missile movement is part of the worker threads, and said threads essentially never stop working for a data copy.

The biggest caveat with this approach is that you have a nested loop, i.e.: iterate over missiles then iterate over targets. There’s no way to tell without writing it and profiling it, but I think having to switch between missile and target “contexts” will still end up causing cache misses and defeat the purpose of cramming all your data in a linear buffer. You might need to rearrange the algorithm so that you iterate over targets and then over missiles. Or maybe somehow include target data inline with missile data – if you have a quadtree, your data copy could single out (n) closest targets and only operate on those.

Another possible approach is to ignore the linear buffer altogether but still use multithreading. Each worker thread would still dereference actors and get their location, direction, etc. This is a pretty risky rabbit hole to go down, however, as UObject access is not very thread safe. It might be made to work if your missile manager strictly controls UObject allocations so that worker threads never have a missile object deleted from under them. Instead of destroying spent missiles, you just flag them, and worker threads would just ignore those missiles when encountering them. But Epic has repeatedly stated that UObjects are not safe to use outside of the game thread, I’ve just never investigated to what extent this is the case.

Yet another potential lead for multithreading is that scene components support using a “custom location” through bRequiresCustomLocation. You could turn that on and have your missiles’ root components return their location as known by the missile manager, instead of the RelativeLocation stored in the component. Hell, you could take this further and, similarly to what kamrann suggests, have a single actor manage all your missiles while still using StaticMeshComponents for each of them.

tl;dr Most likely this will leave you more confused than you were before – just remember this is basically a long term overview of possible optimization paths. It may well be that you don’t need to go that far. I certainly haven’t seen enough missiles flying around in that video of yours to justify anything this aggressive, I’ve been approaching this with the idea of 1000+ missiles in mind.

Also, multithreading might sound scary, but once you get past the initial hump of learning the terminology, synchronization and the common pitfalls, it’s not that different from single threaded programming. The above algorithms wouldn’t even require thread synchronization, just a shared counter for the position in the missile buffer, that each worker thread would update using FPlatformAtomics::InterlockedIncrement.
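That shared-counter scheme is only a few lines. A standalone sketch using `std::atomic::fetch_add`, the standard C++ counterpart of FPlatformAtomics::InterlockedIncrement (the per-missile "work" here is just a placeholder increment):

```cpp
// Sketch: worker threads share nothing but an atomic counter. Each
// fetch_add atomically claims one missile index, so every index is
// processed exactly once with no locks.
#include <atomic>
#include <thread>
#include <vector>

void ProcessAll(std::vector<int>& Missiles, int NumThreads)
{
    std::atomic<int> NextIndex{0};
    auto Worker = [&] {
        for (;;)
        {
            const int i = NextIndex.fetch_add(1); // claim one missile
            if (i >= (int)Missiles.size()) return; // buffer exhausted
            Missiles[i] += 1; // stand-in for the real per-missile update
        }
    };
    std::vector<std::thread> Threads;
    for (int t = 0; t < NumThreads; ++t) Threads.emplace_back(Worker);
    for (std::thread& T : Threads) T.join();
}
```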

Awesome post Camille, really appreciate that you give the time to share your knowledge.

I’m very clear on the caching thing now. Mostly I’d overlooked the fact that it’s a two dimensional problem (missiles and targets), so you can obviously get a huge gain by doing only O(M+N) work involving dereferences, then O(M*N) on a cache-friendly data structure. But your explanation has me convinced that it could be beneficial even in a one dimensional scenario. Interesting about the ‘context switching’ too - I don’t have the experience to know, but had wondered if that might be a problem.

Eh, I’m misusing the term “context switching”. It normally refers to the operating system switching out a core’s workload, which also happens to invalidate the cache. I was basically saying that the extra indirection of looping through enemies within the missile iteration might cause cache misses. But I never assume this sort of thing anymore, I’ve been repeatedly surprised by how resilient L2 cache can be.

Yep I realised that, but it seemed to describe the issue well so I misused it too. Probably not a good idea given that you’d also spoken of multithreading :slight_smile:
My thoughts had been that depending on cache size and how much is read in at a time, it could make things awkward by immediately overwriting stuff that you wanted kept in the cache. But like I say, I don’t know any of those details so wasn’t sure if it would be a big problem or a non-issue. Presumably this would be quite a common situation and thus can perhaps be taken into account by the OS in its caching strategy.

The OS doesn’t work at that level, the CPU caches are managed entirely by the chip. The OS is just that jerk that randomly decides, “Hey, Java needs an update, I’m going to take precious CPU time away from your game engine and schedule it to display an obnoxious pop-up instead!”. :wink:

Typically, CPU caching algorithms are LRU, and since the target locations are used every missile iteration, they might not get evicted that easily. But if you have many targets, the target data might end up evicting missile data. At 12 bytes per target location (3 floats), this would be a lot of targets however. I’d be curious to experiment on this but it’s still up in the air whether cache optimization, prefetching, etc. would even be necessary in this case.
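For a sense of scale on "a lot of targets", the arithmetic is easy (the 256 KB L2 figure is an assumption; it varies by CPU):

```cpp
// Back-of-the-envelope: how many 12-byte target locations fit in a
// 256 KB L2 cache, ignoring everything else competing for it.
constexpr int BytesPerTarget = 3 * sizeof(float); // 3 floats = 12 bytes
constexpr int L2Bytes = 256 * 1024;
constexpr int TargetsInL2 = L2Bytes / BytesPerTarget; // ~21.8k targets
```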

Thing about missiles though is players likely won’t notice the difference between each missile having a mind of its own and a cluster of missiles having one mind. They’re just missiles, so maybe it’s not so bad. In the case of a game like Planet Side 2, you probably wouldn’t even have missiles beyond a certain distance be considered relevant to a client. It might switch to showing something that looks like a swarm of missiles from far away.

I can kind of see how it would be cool to have many individual objects active, but I’d probably optimize out the missiles and make them dumber and concentrate on having even more actual units in the game at once.

I’m basically describing how I’d achieve insane amounts of missiles flying around in a game I’d make. To the average player it would look like there are in fact lots of missiles all over the place. I honestly don’t even think it would look bad or anything. It would likely not be noticeable at all even to an experienced game developer.

They did something similar with the AI in Dynasty Warriors, where large groups of soldiers were controlled by one main AI, so they were all essentially one character moving together. That was slightly more noticeable if you knew specifically to look for it.