Tuesday, August 11, 2009

Lazy GC And Fixed-Size Allocation

So I've mentioned off-hand recently some of the small projects I've been working on in the Parrot GC. When I do talk about GC it's usually on one of two topics: "fix a bug" or "make drastic improvements". The things I've been working on recently are neither of those: they are small, incremental changes that will provide only modest gains.

When we look at GC performance, we can make a number of simplifying assumptions that help decrease the complexity of our algorithm. And in GC, algorithmic simplicity is key: no matter what changes we make internally to the GC, the number of data objects being allocated and managed will remain constant (unless other parts of the system optimize their own memory usage, which is worthwhile too). The same is true of the minimum work required per object: we must visit each object at least once to determine whether it is alive and to mark it, and visit it again in order to sweep it. So there isn't a whole lot of performance we are going to eke out of this basic procedure. However, there are four things we can really do to improve GC performance*:
  1. Improve the efficiency of Allocation/Deallocation. We need to decrease the algorithmic complexity of an allocation, because it's a very common operation. Deallocations are (in theory) exactly as common as Allocations.
  2. Linearize Allocation/Deallocation. If we can allocate items from a nice, large, fresh stretch of memory we can avoid needing to search for appropriate-sized chunks, or needing to copy/compact existing data.
  3. Decrease the number of items needing to be marked and swept. We can't control how much junk the program generates, but we can control how often we mark/sweep all of it. This is done through heuristics such as generational GC algorithms which choose to mark/sweep a subset of all objects during a GC run, instead of all objects every time. This can also be done in an incremental core where we sweep a portion on each run.
  4. Decrease the frequency of GC mark/sweep runs. This basically means we increase our memory footprint, or increase collector latency. More data available to our program means we need to do less searching for free memory.
Parrot uses (in theory anyway) two types of aggregate objects: PMCs and STRINGs. All other types of data items that get allocated are attached in one way or another to these two types. When we allocate a PMC, we also need to allocate and initialize its Attributes structure, which is a separate memory object. Currently this is done through an additional malloc() call, although in the auto_attrs branch it is done using the new fixed-size allocator instead, which should be less expensive than normal malloc calls (#1). By allocating these items from our own managed pools and using a lazy technique to do it, we can improve allocator linearization (#2). By realizing that these fixed-size data items are always considered part of the PMC, we can simplify the marking and sweeping algorithm to ignore them completely and treat them as just an extension of the single PMC object (#3). And finally, by using the lazy allocator and increasing the initial size of our pools, we can decrease the number of GC runs that need to occur during startup, maybe eliminating them entirely (#4).
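To make #1 and #3 concrete, here's a minimal sketch of the idea. None of these names are Parrot's actual API; allocate_pmc, pool_alloc, and the struct layouts are purely illustrative:

```c
/* Hypothetical sketch -- illustrative names, not Parrot's real API.
 * The PMC and its attributes structure both come from fixed-size
 * pools, so no separate malloc() call is needed. */
typedef struct Pool Pool;       /* fixed-size object pool           */
void *pool_alloc(Pool *pool);   /* sketched after the next paragraph */

typedef struct PMC {
    const void *vtable;
    void       *attrs;  /* per-type attributes, a separate memory object */
} PMC;

PMC *allocate_pmc(Pool *pmc_pool, Pool *attr_pool) {
    PMC *pmc   = pool_alloc(pmc_pool);
    /* The attributes come from their own fixed-size pool (#1) and are
     * treated as a mere extension of the PMC, so the mark and sweep
     * phases never have to visit them separately (#3).
     * Error handling is elided for brevity. */
    pmc->attrs = pool_alloc(attr_pool);
    return pmc;
}
```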

When Parrot allocates a new swath of memory for its fixed-size allocator, it immediately iterates over the entire block, breaking it up into individual objects and adding each of them to the free list. The loops used for this are very tight and efficient in terms of the number of machine instructions needed to run them. However, we lose a lot of performance to processor cache misses: essentially we need to move the entire block, often piecemeal, into the processor's cache for no other purpose than to prove that it's there. What we can do instead is allocate the block and keep only a pointer to the beginning of it. When we need to allocate a new object, we first check the free list, and if it's empty we allocate the next object from the most recent memory block. We still end up iterating over the entire block eventually, but we only touch objects when we need to allocate and use them, keeping the vast majority out of the processor cache until they are needed.
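Here's roughly what that allocation path might look like, as a sketch. Again, the names (Pool, pool_alloc, next_obj) are hypothetical stand-ins, not the real Parrot internals:

```c
#include <stddef.h>

typedef struct FreeObj {
    struct FreeObj *next;
} FreeObj;

typedef struct Pool {
    size_t   obj_size;   /* bytes per fixed-size object               */
    FreeObj *free_list;  /* objects recycled by the sweep phase       */
    char    *next_obj;   /* first never-used byte of the newest arena */
    char    *arena_end;  /* one past the end of the newest arena      */
} Pool;

/* Allocate one fixed-size object. The free list is checked first;
 * failing that, the next object is carved lazily out of the newest
 * arena, so memory we never touch never enters the processor cache. */
void *pool_alloc(Pool *pool) {
    if (pool->free_list != NULL) {
        FreeObj *obj    = pool->free_list;
        pool->free_list = obj->next;
        return obj;
    }
    if (pool->next_obj + pool->obj_size <= pool->arena_end) {
        void *obj       = pool->next_obj;
        pool->next_obj += pool->obj_size;
        return obj;
    }
    return NULL;  /* both exhausted: a new arena is needed (next sketch) */
}

/* The sweep phase recycles a dead object by pushing it back onto the
 * free list (this assumes obj_size >= sizeof(FreeObj)). */
void pool_free(Pool *pool, void *obj) {
    FreeObj *f      = obj;
    f->next         = pool->free_list;
    pool->free_list = f;
}
```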

Allocating a memory arena in Parrot right now, whether we use any of the objects in the arena or not, is O(n) complexity because we need to iterate over the entire arena and add all items to the free list at once. Increasing the size of the memory arena therefore increases allocation time more or less linearly, which offsets the gains we would make from requiring fewer GC runs. By using the lazy allocator instead, allocating a new arena is always O(1) no matter what the size of the arena is, allowing us to increase the size of the startup pools without cost, and decreasing the number of GC runs during startup without penalty.
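Continuing the hypothetical Pool sketch from above, attaching a new arena then reduces to a couple of pointer assignments, no matter how large the arena is:

```c
#include <stdlib.h>

/* Attach a fresh arena to the pool. With lazy carving this is O(1)
 * for any arena size: we just record the bounds and return, instead
 * of looping over every object to push it onto the free list up
 * front. (A real pool would also chain its arenas together so the
 * sweep phase can find them; that bookkeeping is elided here.) */
int pool_add_arena(Pool *pool, size_t num_objects) {
    char *arena = malloc(num_objects * pool->obj_size);
    if (arena == NULL)
        return 0;
    pool->next_obj  = arena;
    pool->arena_end = arena + num_objects * pool->obj_size;
    return 1;
}
```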

It's not all roses and unicorns, though: having to check both the free list and the pointer into the newly allocated block on every allocation adds a little complexity, but that should be offset by better processor cache behavior, so I think it will be a net win.

chromatic estimates, and in theory I agree, that we could see about a 10-15% improvement in the runtime of our test suite from these kinds of changes, once we add the necessary features and then tune all the values and sizes. If we could, for instance, make the startup memory pools large enough that we don't need to run GC at all during Rakudo startup, the Rakudo folks would see some significant gains in the performance of their test suite too. Of course, that might turn Parrot into a huge memory hog, so we need to tread lightly and tune our pool sizes carefully. It also wouldn't hurt if we could find ways for things like PGE and Rakudo to allocate fewer objects, but that's a different topic altogether.

[* When I talk about "GC performance", I'm talking about a combination of throughput and latency. We want more of the first and less of the second. Basically, we want the GC to cause less of a slowdown for the user, and be able to turn over junk quickly.]
