Yes and no. It’s basically being aware of how the CPU cache works and coding in a way that makes the best use of it. It’s a broad topic, but the gist of it is that memory was not created equal. A RAM access is incredibly costly because the CPU has to wait many cycles before it can use the read value. While the CPU can sometimes schedule other work while waiting on RAM, it can’t always hide that latency. So the CPU cache is a layer between RAM and the CPU that is much faster and allows access in just a few cycles. It’s also fairly limited in size, because storage gets dramatically more expensive the closer it sits to the CPU. L1 and L2 cache, for instance, are generally measured in kilobytes, but are incredibly fast (a few cycles per read for L1). (As a related anecdote, one of the reasons AMD has been lagging behind in CPU performance is the comparatively slow architecture of their cache compared to Intel’s.)
When you read from memory, the architecture assumes you’ll likely be reading nearby addresses next, so an entire cache line (a small block of adjacent memory) gets pulled into each cache level for subsequent reads. If the memory you are reading is not in the L1 cache, then L2 is tried, then L3, and so on. Further cache levels sometimes exist, but in current technology, L3 is the last on-chip cache level. Failing to read from the cache and having to pull from main memory is called a cache miss, and the ensuing stall is undesirable for high-speed algorithms.
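To make the miss pattern concrete, here’s a tiny plain-C++ sketch of my own (not from the original discussion): both functions do the same additions over the same buffer, but the strided one lands on a different cache line almost every read, so on a buffer much larger than the cache it spends most of its time stalled.

```cpp
#include <cstddef>
#include <vector>

// One 64-byte cache line holds 16 floats.
constexpr std::size_t kStride = 16;

// Sequential walk: consecutive reads share cache lines and the hardware
// prefetcher can stream the data in ahead of time.
float SumSequential(const std::vector<float>& Data)
{
    float Sum = 0.0f;
    for (std::size_t i = 0; i < Data.size(); ++i)
    {
        Sum += Data[i];
    }
    return Sum;
}

// Same total work, but each read jumps a full cache line ahead, so on a
// buffer much larger than the cache nearly every access is a miss.
float SumStrided(const std::vector<float>& Data)
{
    float Sum = 0.0f;
    for (std::size_t Offset = 0; Offset < kStride; ++Offset)
    {
        for (std::size_t i = Offset; i < Data.size(); i += kStride)
        {
            Sum += Data[i];
        }
    }
    return Sum;
}
```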
Being aware of this, you can plan algorithms so that your data is contiguous and causes fewer cache misses. Since each actor’s location is stored in its root component, and those actors are scattered all over the place in memory, it is very hard to cache this data. On the other hand, if you create a special buffer where you copy all of those locations, you end up in a situation where the CPU can just walk through its cache and churn out all its calculations with little to no cost in read cycles. An added advantage on multi-core CPUs is that the L3 cache (and on some chips L2) is normally shared between cores, so if you have multiple worker cores running the same algorithm, the same data will be there for all of them to access.
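As a rough sketch of that gather-then-process idea in UE4 terms (the actor set and the distance-squared pass are placeholder work I made up for the example; the point is just that the hot loop only reads a tightly packed TArray<FVector> instead of chasing scattered root components):

```cpp
// Hedged sketch: gather the scattered locations once, then do all the
// number crunching over contiguous memory.
void ComputeDistances(const TArray<AActor*>& Actors, const FVector& Origin, TArray<float>& OutDistSq)
{
    // Gather pass: one cache-unfriendly walk over the scattered actors...
    TArray<FVector> Locations;
    Locations.Reserve(Actors.Num());
    for (const AActor* Actor : Actors)
    {
        if (Actor)
        {
            Locations.Add(Actor->GetActorLocation());
        }
    }

    // ...then the hot loop streams through a contiguous buffer only.
    OutDistSq.Reset(Locations.Num());
    for (const FVector& Location : Locations)
    {
        OutDistSq.Add(FVector::DistSquared(Location, Origin));
    }
}
```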
Making the best use of caching is a tricky thing, but it’s one of the biggest gains when optimizing. Profilers for console platforms tend to have a lot of information on memory access; it’s trickier on desktop platforms because the game doesn’t have exclusive access to the CPU and the OS can schedule time away from it. Visual Studio’s profiler provides information about cache misses, though I haven’t used it for that on Windows yet.
Anyway, if it comes to that level of optimization, you’ll be in for a treat. In the meantime, use UE4’s built-in profiling tools and try to eke out every little bit of performance.
Also, when illYay mentioned querying PhysX’s scene, it wasn’t meant as doing actual traces so much as trying to harness whatever spatial organization structure it uses internally. I’m not familiar with its internals, but I do know that static collision is baked into some sort of octree. I don’t know whether dynamic objects are also stored within that tree or whether they use a different mechanism. In any event, all of that is tucked away behind the physics interface, so getting access to it directly is likely fairly challenging.
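That said, what is reachable without cracking PhysX open is the engine’s own scene query interface, which hands the culling to whatever acceleration structure the physics scene keeps underneath. Something along these lines (the object channels and the sphere radius are placeholder choices on my part):

```cpp
#include "Engine/World.h"

// Hedged sketch: ask UE4's scene query API for dynamic objects near a point,
// letting the physics scene's internal spatial structure do the filtering.
void FindNearbyDynamicActors(UWorld* World, const FVector& Center, float Radius, TArray<FOverlapResult>& OutOverlaps)
{
    FCollisionObjectQueryParams ObjectParams;
    ObjectParams.AddObjectTypesToQuery(ECC_WorldDynamic);
    ObjectParams.AddObjectTypesToQuery(ECC_Pawn);

    // We only ever see the results that come back through the interface;
    // the internal tree/grid stays hidden.
    World->OverlapMultiByObjectType(
        OutOverlaps,
        Center,
        FQuat::Identity,
        ObjectParams,
        FCollisionShape::MakeSphere(Radius));
}
```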