Monday, August 3, 2009

ParrotTheory: Garbage Collection

So I've talked a lot about garbage collection on my blog before, and I think now is the time to really go in depth about some of the theory behind it. Garbage Collection is a subset of a larger field of study on Memory Management. In fact, much of the functionality in the Parrot GC system is concerned only with memory management, not the actual detection and collection of garbage.

When we look at a non-garbage-collected program, memory is typically allocated from the operating system with malloc and its variants, and is returned to the OS with free. This is fine for tiny little programs, but as complexity increases you start running into memory leaks, and you end up spending more and more effort allocating memory, checking whether it is still in use, and returning it to the OS manually. In short, it's not a system that scales well.
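
As a tiny sketch of the manual discipline this requires (the function names here are made up for illustration):

#include <stdlib.h>
#include <string.h>

/* The caller owns the returned buffer and must free() it. */
char *copy_string(const char *src) {
    char *dst = malloc(strlen(src) + 1);   /* allocate from the OS/allocator */
    if (dst != NULL)
        strcpy(dst, src);
    return dst;
}

void use_it(void) {
    char *s = copy_string("hello");
    /* ... use s ... */
    free(s);   /* forget this on any code path and the memory leaks */
}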

Enter Garbage Collection. GC is a system in your program that automatically handles memory allocation and deallocation. It does this by detecting when allocated memory is no longer being used, and freeing or recycling it automatically. There are two basic types of Garbage Collection: Cooperative and Non-Cooperative.

Cooperative GC

Cooperative GC, more commonly known as "reference counting", is a form of GC where the entire program needs to participate in the GC algorithm in order for it to work correctly. In a cooperative GC system, every time a pointer to an object is made, the reference count needs to be increased. Every time a pointer is deleted or redirected, the reference count needs to be decreased. When the reference count for a particular object reaches 0, nothing points to it and it can be automatically and immediately freed. Cooperative GC can be made easier through the use of a pair of macros:

#define ADD_REFERENCE(obj, newref) do { \
    (obj)->refcount++;                  \
    (newref) = (obj);                   \
} while(0)

#define REMOVE_REFERENCE(obj) do {      \
    (obj)->refcount--;                  \
    if ((obj)->refcount == 0)           \
        free_my_object(obj);            \
} while(0)
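
A quick sketch of how these might be used, assuming a hypothetical my_object_t struct with a refcount field:

typedef struct my_object {
    int refcount;
    /* ... payload ... */
} my_object_t;

void example(my_object_t *obj) {
    my_object_t *mine;
    ADD_REFERENCE(obj, mine);    /* copying the pointer: count goes up */
    /* ... work with mine ... */
    REMOVE_REFERENCE(mine);      /* discarding it: count goes down */
}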

So that's not such a huge problem, and all the code fits into these two little macros. The problem is remembering to faithfully use these macros for every single pointer access throughout the entire program. Notice that, especially in concurrent systems, this really does need to be every single access, because another thread could be decreasing the refcount at the same time that you are using the object, and free it out from under you.
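
One common way to get that thread-safety is to make the count updates atomic; a minimal sketch using C11 atomics (the struct and function names are hypothetical):

#include <stdatomic.h>
#include <stdlib.h>

typedef struct shared_object {
    atomic_int refcount;
    /* ... payload ... */
} shared_object_t;

void add_reference(shared_object_t *obj) {
    atomic_fetch_add(&obj->refcount, 1);
}

void remove_reference(shared_object_t *obj) {
    /* atomic_fetch_sub returns the value *before* the decrement,
       so a return of 1 means we just dropped the last reference */
    if (atomic_fetch_sub(&obj->refcount, 1) == 1)
        free(obj);
}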

Non-Cooperative GC

The other form of GC, and the type that Parrot uses, is non-cooperative GC. In a non-cooperative system, the GC module is completely separate, and the rest of the program does not even need to be aware of its existence. In the Boehm-Demers-Weiser collector, for example, the only change that needs to be made to the program is to replace calls to malloc/free with the collector's allocation functions, and everything else proceeds as normal, with garbage collection happening invisibly in the background. The benefit of this system is that you write a single GC system, and can then write normal code throughout the rest of your program without ever having to remember to reference-count your data again.
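
As a sketch of what that looks like in practice (this assumes the libgc headers are installed and the program is linked against libgc):

#include <gc.h>      /* Boehm-Demers-Weiser collector */
#include <stdio.h>

int main(void) {
    GC_INIT();       /* start the collector */
    for (int i = 0; i < 1000000; i++) {
        int *p = GC_MALLOC(100 * sizeof(int));   /* instead of malloc() */
        p[0] = i;
        /* no free(): blocks that become unreachable are collected */
    }
    printf("done\n");
    return 0;
}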

Non-cooperative collectors typically work in two stages: mark and sweep. In the mark phase, the collector walks the stack and the heap looking for pointers to data. Every data item has an associated flag, and when we find a pointer to an item we mark its flag with a special "alive" value. We then look through that data item for other pointers to follow and mark. The collector starts at a root set of pointers that we know we can reach (current processor register contents, current stack contents, global variables, etc.) and traces through memory as if it were a large graph. In the sweep phase, we iterate over the data items in the heap looking for any that have not been marked. Anything that is not marked is not reachable, and can be freed.
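
A minimal sketch of both phases, assuming a hypothetical object layout where every object carries a mark flag, an array of child pointers, and a link on a global list of all allocations:

#include <stdlib.h>

typedef struct obj {
    int          marked;       /* the "alive" flag */
    size_t       nchildren;
    struct obj **children;     /* pointers this object holds to other objects */
    struct obj  *next_alloc;   /* global list threading every allocation */
} obj_t;

static obj_t *all_objects;     /* the whole heap, as a linked list */

void mark(obj_t *o) {
    if (o == NULL || o->marked)
        return;                        /* already visited */
    o->marked = 1;
    for (size_t i = 0; i < o->nchildren; i++)
        mark(o->children[i]);          /* trace the reachability graph */
}

void sweep(void) {
    obj_t **p = &all_objects;
    while (*p != NULL) {
        obj_t *o = *p;
        if (!o->marked) {              /* never reached: garbage */
            *p = o->next_alloc;
            free(o->children);
            free(o);
        } else {
            o->marked = 0;             /* reset for the next collection */
            p = &o->next_alloc;
        }
    }
}

void gc_run(obj_t **roots, size_t nroots) {
    for (size_t i = 0; i < nroots; i++)
        mark(roots[i]);                /* mark phase, from the root set */
    sweep();                           /* sweep phase */
}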

I don't want to do a huge compare/contrast of the two GC types; they both have pros and cons. Perl5 uses Cooperative GC, while Parrot (and therefore Rakudo Perl 6) uses Non-Cooperative GC instead. The only real differences are where and when objects are detected to be dead. For the rest of this post I'll just focus on non-cooperative systems, because that's what Parrot uses.

Non-cooperative GC tends to run all at once, such as when a new allocation is requested by the program. The GC will look for some available memory, and if none is found it will mark and then sweep memory to try to free some up. The mark and sweep algorithm can be expensive, and this will cause a system slow-down or even a complete pause while it is running. The basic mark and sweep algorithm is notorious for these pauses. Luckily, a number of newer algorithms have been developed over the years to help mitigate their effects.

Algorithms

In an Incremental GC, we break GC runs up into pieces. We do a subsection of the entire algorithm each time, with the intention that we still have pauses, but the pauses are each shorter and more spread out. Incremental GCs are typically implemented using a tri-color mark: instead of a memory object being marked "alive" or "dead", we use three colors: "black", "grey", and "white". White objects are completely unmarked ("dead"). Grey objects are marked, but their children are not marked; children in this case are pointers in one memory object that point to other memory objects. Black objects are completely marked, including all their children. By keeping track of the states of objects and the states of their children, we can pause at almost any time and resume by taking the list of all grey objects and continuing the mark from there. Incremental GC helps to spread the pauses out, but doesn't fundamentally decrease the total amount of time spent paused over a long execution run.
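
A sketch of the tri-color bookkeeping, reusing the same hypothetical object-layout idea as before; the mark begins by shading the root set, and the program then calls mark_step() in small slices between bursts of its own work:

#include <stddef.h>

typedef enum { WHITE, GREY, BLACK } color_t;

typedef struct tobj {
    color_t       color;
    size_t        nchildren;
    struct tobj **children;
    struct tobj  *grey_next;   /* intrusive list linking all grey objects */
} tobj_t;

static tobj_t *grey_list;      /* the set of grey objects, as a stack */

/* white -> grey: the object is reached, but its children still need a look */
void shade(tobj_t *o) {
    if (o != NULL && o->color == WHITE) {
        o->color = GREY;
        o->grey_next = grey_list;
        grey_list = o;
    }
}

/* Run a bounded slice of the mark phase; returns 1 when marking is done. */
int mark_step(size_t budget) {
    while (budget-- > 0) {
        tobj_t *o = grey_list;
        if (o == NULL)
            return 1;              /* no grey left: everything live is black */
        grey_list = o->grey_next;
        for (size_t i = 0; i < o->nchildren; i++)
            shade(o->children[i]); /* children become grey */
        o->color = BLACK;          /* object and its children are now marked */
    }
    return 0;                      /* budget used up; resume later */
}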

In a Generational GC, we use the heuristic that most memory objects are created, used, and abandoned very rapidly; think about temporary variables in a tight loop, for instance. So we only perform the GC mark and sweep on certain groups of memory objects, called generations. Memory objects are allocated into the "young" generation, which is processed by the GC very frequently. Objects that survive a sweep can be graduated into an older generation which is swept less frequently. This algorithmic change means less pause time for the GC (because not all memory is being marked and swept). However, dead objects in the older generations won't be found and freed as quickly, resulting in longer delays before some objects are destroyed (think of a file handle being closed when its containing scope exits), and a larger memory footprint to hold these dead objects.
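
A sketch of the sweep side of such a scheme, with a hypothetical age counter and an arbitrary promotion threshold:

#include <stdlib.h>

enum { PROMOTE_AFTER = 3 };    /* survivals before graduating (arbitrary) */

typedef struct gobj {
    int          marked;
    int          age;          /* how many collections this object survived */
    struct gobj *next;
} gobj_t;

static gobj_t *young_gen;      /* swept on every frequent minor collection */
static gobj_t *old_gen;        /* swept only by rare major collections */

/* After marking the young generation, sweep only the young list:
   free the dead, age the survivors, promote the long-lived. */
void minor_sweep(void) {
    gobj_t **p = &young_gen;
    while (*p != NULL) {
        gobj_t *o = *p;
        if (!o->marked) {                  /* died young: the common case */
            *p = o->next;
            free(o);
        } else if (++o->age >= PROMOTE_AFTER) {
            *p = o->next;                  /* graduate into the old generation */
            o->next = old_gen;
            old_gen = o;
            o->marked = 0;
        } else {
            o->marked = 0;
            p = &o->next;
        }
    }
}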

Concurrent GC makes use of system-level parallelism to perform aspects of the GC algorithm in separate threads, which negates pauses entirely. Actually, that isn't entirely true: in a system with more running threads than there are hardware processors to execute them, we still have pauses in the form of thread context switches. Concurrent GC has the benefit that we can run nearly any algorithm we want, regardless of execution time or complexity, so long as it runs in another thread, without any additional performance penalty. In a concurrent system, however, there is typically a need for significant synchronization, plus the overhead inherent in executing separate threads (which is more expensive on some systems than on others).
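
A sketch of just the thread plumbing, using POSIX threads; collect() stands in for whatever mark/sweep algorithm the collector runs, and the single coarse heap lock here is exactly the kind of synchronization cost mentioned above (a real concurrent collector uses much finer-grained techniques):

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  gc_wanted = PTHREAD_COND_INITIALIZER;
static bool memory_low;

extern void collect(void);     /* mark/sweep over the heap (not shown) */

/* Collector thread: sleeps until the allocator signals memory pressure,
   then runs a collection while holding the heap lock. */
void *gc_thread(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&heap_lock);
        while (!memory_low)
            pthread_cond_wait(&gc_wanted, &heap_lock);
        collect();
        memory_low = false;
        pthread_mutex_unlock(&heap_lock);
    }
    return NULL;
}

/* Called by the allocator when it runs short of free memory. */
void request_gc(void) {
    pthread_mutex_lock(&heap_lock);
    memory_low = true;
    pthread_cond_signal(&gc_wanted);
    pthread_mutex_unlock(&heap_lock);
}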

A basic mark and sweep algorithm has another problem: memory fragmentation. We allocate memory linearly, but start to deallocate it randomly, leaving large odd-shaped holes between live data. The allocator runs quickly at first, but as fragmentation increases it becomes more complicated to find suitable holes in memory for new allocations. To avoid this problem, we can use a special GC algorithm called a compacting or copying collector. A compacting collector actually moves data in memory to keep fragmentation down. If we think of memory as a long chain, we move objects to the front of the chain, forming a "dense prefix" where live data tends to be, and a sparsely-populated area at the end of the chain where there are few if any live data items and where the allocator can therefore execute more quickly. We can then focus our mark and sweep efforts on the dense prefix, because that's where all the data items are.
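
A sketch of why the dense prefix pays off: once live data has been compacted to the front, allocation degenerates into a cheap pointer bump (the heap layout and sizes here are illustrative):

#include <stddef.h>

static char   heap[1 << 20];   /* a 1 MB heap */
static size_t top;             /* end of the dense prefix of live data */

void *bump_alloc(size_t n) {
    n = (n + 7) & ~(size_t)7;  /* keep 8-byte alignment */
    if (top + n > sizeof heap)
        return NULL;           /* out of room: time to collect and compact */
    void *p = &heap[top];      /* allocate right past the live data */
    top += n;
    return p;
}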

In a region-based memory system, we break memory up into chunks called "regions", and associate each region with an executing context. When the context exits (such as through a subroutine return) we can free the entire region at once. A freed region can then be used by the allocator to quickly allocate memory linearly.
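
A sketch of such a region (sometimes called an arena), with hypothetical names and error checking omitted for brevity:

#include <stddef.h>
#include <stdlib.h>

typedef struct region {
    char  *base;               /* backing memory for this context */
    size_t size, used;
} region_t;

region_t *region_new(size_t size) {
    region_t *r = malloc(sizeof *r);
    r->base = malloc(size);
    r->size = size;
    r->used = 0;
    return r;
}

/* Allocation is just a pointer bump within the region. */
void *region_alloc(region_t *r, size_t n) {
    n = (n + 7) & ~(size_t)7;
    if (r->used + n > r->size)
        return NULL;
    void *p = r->base + r->used;
    r->used += n;
    return p;
}

/* When the owning context (e.g. a subroutine) exits,
   everything it allocated dies in one cheap operation. */
void region_free(region_t *r) {
    free(r->base);
    free(r);
}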

These various algorithms aren't mutually exclusive; we can mix and match features from each to suit individual needs. For example, Java's new Garbage First collector uses a compacting region-based collector that focuses its attention on sparsely-populated regions, clearing garbage out of the way of the allocator. This gives Java good linear allocation performance while also decreasing the number of long pauses the collector needs to take.

Overview

Garbage collection methods and algorithms are a trade-off among three components: memory footprint, throughput, and latency. Memory footprint relates to how aggressive the GC is: if it is very aggressive and always frees dead objects quickly, it will use the least amount of memory. Throughput involves the overall speed of the GC in a long-running application; the GC should be able to process a lot of data quickly. Latency is the amount of pause time necessary for a GC run; we want to minimize this. The general rule of thumb is that you can maximize two of these requirements only at the expense of the third.

If we are running a GUI application such as a videogame, we want to decrease latency, because pauses in a GUI application are unacceptable. We may also want to maximize throughput, because those kinds of applications churn through a lot of memory for all their physics and graphics calculations. If we are running a background service process, we may want to keep the memory footprint low, but may not care at all about latency. A web server application may want high throughput and low latency, but can be allowed to hog extra memory.

So that's garbage collection in a nutshell. I can always go more in-depth on any topics people are interested in.

4 comments:

  1. <nit>
    With reference counting, you only need to use the macros when you are *copying* a pointer to an object (ADD_REFERENCE), or discarding one (REMOVE_REFERENCE).
    </nit>

  2. Ah yes, good point. I was oversimplifying a little too much.

    There are situations, such as iterating over memory pools at a very low level, where you are actually creating a new pointer to an object, and you would want to mark it so it doesn't disappear while you are processing it. Of course, that's esoteric and rare enough that most refcounting systems like Perl5 can avoid it entirely. Just mentioning for the sake of completeness :)

  3. I think your terminology of "Cooperative GC" vs. "Non-Cooperative GC" is a bit misleading.

    First of all, reference-counted systems cannot really count as garbage collected systems because of the problem of circular references: You can have a strongly connected component of objects that are not reachable but that will not be freed because each object by itself is being pointed to by some other object.

    Second, what you call "Non-Cooperative GCs" can also benefit from cooperation, and in fact, cooperation is often used. For example, whenever you mark a pointer explicitly as belonging to the root set, you are cooperating with the GC. Another example would be non-conservative GCs that have knowledge of the layout of your program's structures. If the GC knows that certain fields in your structures do not contain pointers, then it is, in a sense, using a type of cooperation (because this information has to be supplied to the GC).

  4. You're right that the terminology is a little bit confusing, but these aren't terms that I created. Cooperative/Non-Cooperative terms are used frequently in books that discuss garbage collection (See: Dragon Book) and in academic papers on the subject.

    Reference Counting is a type of garbage collection because reference counts are used to detect dead objects and collect them. Now, I know a lot of people use the term "Garbage Collection" as an alternative to "Reference Counting", but that's not really accurate. This is why terms like "Cooperative" and "Non-Cooperative" are used.

    Parrot is a great example of what you mention here: its "Non-Cooperative" GC in fact gets a lot of cooperation from the rest of the core system. Of course, the Parrot GC doesn't work on data objects of arbitrary sizes like most GCs do; it is currently restricted to PObjs, which means we can optimize significantly by looking only in specific places for pointers and not needing to sweep the entire stack/heap.

    In reality, GCs exist on a continuum between the cooperative and non-cooperative extremes. No GC system is perfectly one or the other, even though these terms are used. It's analogous to the way no language is perfectly dynamic, even though we routinely label some as "dynamic languages".

