When we look at GC performance, we can make a number of simplifying assumptions that help decrease the complexity of our algorithm. And in GC, algorithmic simplicity is key: no matter what changes we make internally to the GC, the number of data objects being allocated and managed stays constant (unless other parts of the system optimize their own memory usage, which is worth doing too). The same goes for the amount of work required per object: we must visit each object at least once to determine whether it is alive and to mark it, and visit it again in order to sweep it. So there isn't a whole lot of performance we're going to eke out of the basic procedure itself. However, there are four things we can really do to improve GC performance*:
- Improve the efficiency of allocation/deallocation. We need to decrease the algorithmic complexity of an allocation, because it's a very common operation, and deallocations are (in theory) exactly as common as allocations. (There's a small sketch of what this looks like right after this list.)
- Linearize allocation/deallocation. If we can allocate items from a nice, large, fresh stretch of memory, we can avoid searching for appropriately-sized chunks or copying/compacting existing data.
- Decrease the number of items needing to be marked and swept. We can't control how much junk the program generates, but we can control how often we mark/sweep all of it. This is done through heuristics such as generational GC algorithms, which mark/sweep only a subset of all objects during a GC run instead of every object every time. It can also be done with an incremental core, where we mark/sweep a portion of the heap on each run.
- Decrease the frequency of GC mark/sweep runs. This basically means we increase our memory footprint, or increase collector latency. More memory available to our program means we need to do less searching for free space.
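To make that first point a little more concrete, here's a minimal sketch of a constant-time fixed-size allocator built on an intrusive free list. All the names here (Pool, FreeNode, pool_alloc, pool_free) are hypothetical and for illustration only; this isn't Parrot's actual allocator code.

```c
/* A minimal sketch of a constant-time fixed-size allocator built on an
 * intrusive free list. Hypothetical names, for illustration only. */
#include <stddef.h>

typedef struct FreeNode {
    struct FreeNode *next;  /* each free slot holds a link to the next one */
} FreeNode;

typedef struct Pool {
    FreeNode *free_list;    /* head of the free list */
} Pool;

/* Allocation: pop the head of the free list -- O(1). */
static void *pool_alloc(Pool *pool)
{
    FreeNode *node = pool->free_list;
    if (node == NULL)
        return NULL;        /* empty: the caller must grow the pool */
    pool->free_list = node->next;
    return node;
}

/* Deallocation: push the object back onto the free list -- O(1).
 * Assumes objects are at least pointer-sized, so the link fits inside
 * the dead object itself and no extra bookkeeping memory is needed. */
static void pool_free(Pool *pool, void *obj)
{
    FreeNode *node = (FreeNode *)obj;
    node->next = pool->free_list;
    pool->free_list = node;
}
```

Both operations are a couple of pointer moves, which is about as cheap as allocation gets. The interesting question is what happens when the free list runs empty, which is where the lazy scheme described below comes in.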
When Parrot allocates a new swath of memory for its fixed-size allocator, it immediately iterates over the entire block, breaking it up into individual objects and adding each one to the free list. The loops used for this are very tight and efficient in terms of the number of machine instructions needed to operate them. However, we lose a lot of performance to processor cache misses: we end up pulling the entire block, often piecemeal, into the processor's cache for no other purpose than to prove that it's there. What we can do instead is allocate the block and keep only a pointer to the beginning of it. When we need to allocate a new object, we first check whether the free list is empty, and if so we allocate the next object from the most recent memory block. We're still iterating over the entire block eventually, but we only touch objects when we actually need to allocate and use them, keeping the vast majority out of the processor cache until they are needed.
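Here's a rough sketch of what that change could look like, extending the hypothetical FreeNode from the sketch above with a bump pointer into the newest arena. Again, all the names and sizes here are made up for illustration; this shows the idea, not Parrot's actual implementation.

```c
/* A sketch of eager vs. lazy arena initialization, reusing the
 * hypothetical FreeNode from the sketch above. Illustration only. */
#include <stddef.h>

#define OBJ_SIZE 64         /* assumed fixed object size, in bytes */

typedef struct LazyPool {
    FreeNode *free_list;    /* recycled objects, as before */
    char     *arena_next;   /* next never-used slot in the newest arena */
    char     *arena_end;    /* one past the end of the newest arena */
} LazyPool;

/* The current (eager) scheme: O(n) in the arena size, and it drags the
 * whole fresh block through the processor cache just to link it up. */
static void add_arena_eager(LazyPool *pool, char *block, size_t nobjs)
{
    size_t i;
    for (i = 0; i < nobjs; i++) {
        FreeNode *node = (FreeNode *)(block + i * OBJ_SIZE);
        node->next = pool->free_list;   /* push every slot up front */
        pool->free_list = node;
    }
}

/* The lazy scheme: just remember where the new arena is -- O(1). */
static void add_arena_lazy(LazyPool *pool, char *block, size_t nobjs)
{
    pool->arena_next = block;
    pool->arena_end  = block + nobjs * OBJ_SIZE;
}

/* Allocation checks the free list first; if it's empty, it bumps a
 * pointer into the newest arena. Slots we never reach never enter
 * the processor cache. */
static void *lazy_pool_alloc(LazyPool *pool)
{
    if (pool->free_list != NULL) {
        FreeNode *node = pool->free_list;   /* reuse a swept object */
        pool->free_list = node->next;
        return node;
    }
    if (pool->arena_next < pool->arena_end) {
        void *obj = pool->arena_next;
        pool->arena_next += OBJ_SIZE;       /* fresh, untouched memory */
        return obj;
    }
    return NULL;    /* both exhausted: add a new arena, or run GC */
}
```

This is also where the extra per-allocation check discussed below comes from: every allocation now tests the free list and possibly the bump pointer before it can hand back an object.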
Allocating a memory arena in Parrot right now is O(n) in the arena size, whether we end up using any of the objects in it or not, because we need to iterate over the entire arena and add every item to the free list up front. Increasing the size of the memory arena therefore increases allocation time more or less linearly, which offsets the gains we would make from requiring fewer GC runs. With the lazy allocator instead, allocating a new arena is always O(1) no matter how large the arena is, allowing us to increase the size of the startup pools, and thereby decrease the number of GC runs during startup, without penalty.
It's not all roses and unicorns, though: having to check both the free list and the pointer into the newly allocated block on every allocation adds a little work per allocation. But that cost should be more than offset by the better processor cache behavior, so I think it will be a net win.
chromatic estimates, and I think I agree in theory, that we could see about a 10-15% improvement in the runtime of our test suite from these kinds of changes, once we add the necessary features and tune all the values and sizes. If we could, for instance, make the startup memory pools large enough that Rakudo never needs to run GC at all during startup, they would see some significant gains in the performance of their test suite too. Of course, that might turn Parrot into a huge memory hog, so we need to tread lightly and tune our pool sizes carefully. It also wouldn't hurt if we could find ways for things like PGE and Rakudo to allocate fewer objects, but that's a different topic altogether.
[* When I talk about "GC performance", I'm talking about a combination of throughput and latency. We want more of the first and less of the second. Basically, we want the GC to cause less of a slowdown for the user, and be able to turn over junk quickly.]