
Friday, June 18, 2010

ParrotTheory: Locks and Synchronization

I've talked a good amount about threading, both in the past and more recently, so today I'm going to talk about some of the small bits that make threading and concurrency work. I'm thinking that we're also going to need to implement some of these primitives in Parrot eventually, so consider this blog post an extremely verbose way of planning for that.

Let's start the discussion by looking at two small code snippets; a structure definition and a routine that uses that structure:
typedef struct _array {
  int length;
  int *values;
} array;

void push(array *ary, int x) {
  ary->length++;
  ary->values = realloc(ary->values, ary->length * sizeof(int));
  ary->values[ary->length - 1] = x;
}
This isn't nearly as contrived an example as it may seem initially, though I purposefully made the code a little bit naive. It's worth noting here that the idea of a structure which contains both an array and the integer length of that array is very common. It's how Parrot implements its STRING type and several of its array PMC types as well. In fact, the push_int VTABLE method for the ResizableIntegerArray PMC probably looks extremely similar to this example code.

Astute observers and veterans of threading trench warfare will see a problem with this code: there are no locks and no concurrency safeguards, so this code isn't thread safe. Let's take a short walkthrough of this code where a preemptive thread switch happens in the middle, to another thread also attempting to call this same function on the same object:
  1. We have an array object with length 5 and items {1, 2, 3, 4, 5}
  2. Thread A attempts to push the number 6 to the array, Thread B attempts to push the number 7.
  3. Thread A enters the function. We set length to 6 and realloc to gain a sixth slot. The array now contains values {1, 2, 3, 4, 5, ?}, because the last value isn't initialized by realloc.
  4. At this point, there is a preemptive thread switch and Thread B takes over control.
  5. Thread B enters the function and sets length to 7 and reallocs again. The array now contains values {1, 2, 3, 4, 5, ?, ?}.
  6. Thread B sets the seventh element (index 6) to 7. The array is now {1, 2, 3, 4, 5, ?, 7}
  7. Thread B gets preempted, Thread A takes over.
  8. In Thread A, the value of length is now 7, so ary->length - 1 is 6. When we write x into that slot, we get {1, 2, 3, 4, 5, ?, 6}: the sixth slot was never initialized and Thread B's 7 has been overwritten.
There are some things we could try here, such as saving the values we need from the structure into local variables, so that even if we get preempted in the middle of the function the length field will still be an accurate index. Or we could rearrange the order of operations so some errors appear less frequently or less obviously. The real problem here is that we have three operations that need to stay coupled: increasing the known length of the array, reallocating the array storage, and assigning an item to that storage. Any break between a set of coupled operations causes a problem. We call these areas of code, where operations are sensitive to cross-thread interference, critical sections.

As another example of a very small critical section, consider this one line of code:
foo = i++;
This seems pretty simple, but in fact it is not. This line of code requires several assembly language instructions, especially when we are talking about a non-optimized build:
  1. Fetch the value of i from memory into a processor register
  2. Copy the value of i to the variable foo
  3. Increment the register containing the value for i
  4. Copy the value of the register back to memory where i is located.
A preemptive thread switch can happen in between any of these steps. Consider the case where we break between steps 2 and 3 and switch to another thread performing the same operation: if i starts at 5, Thread 1's foo is 5, Thread 2's foo is also 5, and at the end of the code snippet i is only 6, because one of the increments has been lost. If this code snippet is, for instance, in an SQL database generating unique integer keys for rows in a particular table, we've just generated non-unique keys and created a world of hurt for ourselves.
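To see this happen for real, here's a small demonstration (my own addition, not part of the original example, assuming POSIX threads; compile with -pthread). Two threads each increment a shared counter a million times with no synchronization, and on most machines the final value comes up short because increments get lost exactly as described above:
#include <pthread.h>
#include <stdio.h>

static int i = 0;

static void *worker(void *arg) {
  (void)arg;
  for (int n = 0; n < 1000000; n++) {
    int foo = i++;  // non-atomic read-modify-write: this is the race
    (void)foo;
  }
  return NULL;
}

int main(void) {
  pthread_t a, b;
  pthread_create(&a, NULL, worker, NULL);
  pthread_create(&b, NULL, worker, NULL);
  pthread_join(a, NULL);
  pthread_join(b, NULL);
  printf("i = %d\n", i);  // expected 2000000, usually prints less
  return 0;
}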

To get around these kinds of problems, one solution is to use a synchronization primitive. A synchronization primitive is any of a class of algorithms and objects that are designed to synchronize and limit access to shared resources. In this sense, a resource is anything that multiple threads might want to access: an IO stream, a global variable, a shared pointer, even a sensitive sequence of instructions, etc. Any time we have a critical section of code that is sensitive to sharing we want to find a way to limit access to a finite number of simultaneous threads (usually one). There are several ways to do this.

A mutex (short for "mutual exclusion") object is a type of lock that helps prevent access to a critical section. To pick a pertinent example, think of a mutex like a basketball. Only the one player in the game with the basketball can do ball things: shoot, pass, and dribble. Other players on the team can do other stuff like running, covering, or posting up, but they cannot do ball stuff without the ball. You cannot shoot the ball if you do not have the ball, and you cannot pass the ball if you do not have the ball. This is a convention of the sport. If we were playing Calvinball instead, maybe we could do these things without looking preposterous. By the same kind of convention, if we as programmers declare that a certain shared resource can only be accessed by a thread (player) holding the mutex (ball), then those are the rules for our system (game) and things can move along smoothly. Here's an example of that convention in action:
Mutex *m;
ACQUIRE_MUTEX(m);
// critical section code
RELEASE_MUTEX(m);
The power in this code is that the ACQUIRE_MUTEX() function will attempt to gain ownership of the mutex, and will wait indefinitely for the mutex to become available if some other thread already owns it. ACQUIRE_MUTEX is like waving your arms in the air, shouting "I'm open" until the current ball carrier passes the ball to you. Until you get the ball, you just have to stand there with your arms in the air. Because of that behavior, no two threads can be inside the same critical section at once, assuming of course that the programmer (you) has properly protected that critical section with the mutex. Keep in mind that there is no intrinsic property of the critical section itself that prevents multiple threads from running it simultaneously and corrupting data. The exclusion comes from the proper and pervasive use of locks like our mutex to keep the critical section safe. Here's another example:
Mutex m;

int pop(array* a) {
  ACQUIRE_MUTEX(m);
  int item = a->values[a->length - 1];
  a->length--;
  RELEASE_MUTEX(m);
  return item;
}

void push(array* a, int item) {
  a->values[a->length] = item;
  a->length++;
}
In this example we can see that we aren't using the mutex everywhere, so we can't guarantee that we won't get corrupt data. Multiple threads can enter the push function simultaneously just as easily as they could attempt to enter the pop function, and a thread in push can collide with a thread in pop because push never takes the lock at all. If you don't use the mutex everywhere the resource is touched, it's almost as bad as not using it anywhere. This is a convention that the coder must decide upon beforehand and follow diligently.
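For comparison, here's a sketch of the same idea with the lock used consistently in both functions, written with POSIX mutexes instead of the hypothetical ACQUIRE_MUTEX/RELEASE_MUTEX macros (my own illustration; the lock would be initialized once with PTHREAD_MUTEX_INITIALIZER or pthread_mutex_init):
#include <pthread.h>
#include <stdlib.h>

typedef struct _locked_array {
  int length;
  int *values;
  pthread_mutex_t lock;  // one lock per array, guarding both length and values
} locked_array;

void push(locked_array *a, int item) {
  pthread_mutex_lock(&a->lock);
  a->length++;
  a->values = realloc(a->values, a->length * sizeof(int));
  a->values[a->length - 1] = item;
  pthread_mutex_unlock(&a->lock);
}

int pop(locked_array *a) {
  pthread_mutex_lock(&a->lock);
  int item = a->values[a->length - 1];
  a->length--;
  pthread_mutex_unlock(&a->lock);
  return item;
}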


There are multiple ways to implement locks and mutexes. One idea is a spinlock, which repeatedly tests a flag in a while-loop until it can atomically claim it. A busy while-loop can be very inefficient on a processor, but if we call a sleep or yield command inside the loop to allow other threads to run while we wait it isn't such a big problem. Spinlocks implemented by the OS inside the kernel can be very efficient indeed. In fact, as a general rule, if the OS implements locking primitives they tend to be much better to use than anything you can write in userspace.
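To make the idea concrete, here's a minimal userspace spinlock sketch (my own illustration, assuming a C11 compiler with <stdatomic.h> and POSIX sched_yield; as noted above, an OS-supplied lock is usually the better choice in practice). atomic_flag_test_and_set atomically sets the flag and returns its previous value, so the loop spins until we're the thread that flipped it from clear to set:
#include <stdatomic.h>
#include <sched.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void spin_acquire(void) {
  while (atomic_flag_test_and_set(&lock_flag)) {
    sched_yield();  // let other threads run instead of burning the CPU
  }
}

void spin_release(void) {
  atomic_flag_clear(&lock_flag);
}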

Another type of lock primitive is a semaphore, though it is subtly different. A semaphore allows a finite number of threads to access a finite number of shared resources at a time. Where a normal mutex, like a spinlock, allows only one thread to enter at a time, a semaphore may allow more than one. Consider a case where we have five worker threads in a web server, and 100 incoming connections. A semaphore uses a first-come-first-served method to assign incoming connections to available threads. Each incoming connection attempts to access the semaphore. As requests are completed, threads signal their availability and the semaphore assigns the next connection in the list to that thread. A semaphore with only one shared object acts like a normal mutex or spinlock.
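As a rough sketch of that scenario using POSIX semaphores (my own illustration; handle_connection is a hypothetical stand-in for real request handling), the semaphore is initialized to the number of worker slots, so at most five connections are being serviced at any moment and the rest block in sem_wait until a slot frees up:
#include <semaphore.h>
#include <unistd.h>

#define WORKER_SLOTS 5
static sem_t slots;  // initialized once at startup: sem_init(&slots, 0, WORKER_SLOTS);

static void handle_connection(int conn_fd) {
  (void)conn_fd;
  usleep(1000);  // hypothetical stand-in for servicing the request
}

void serve_connection(int conn_fd) {
  sem_wait(&slots);            // block until one of the five slots is free
  handle_connection(conn_fd);
  sem_post(&slots);            // give the slot back for the next connection
}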

The overhead of a lock is the amount of effort it takes to acquire and manage the lock. In a uniprocessor system the lock may be very simple to obtain: First disable interrupts so we cannot be preempted by another thread, check the status of the lock, obtain the lock if it's available, and re-enable interrupts. In a multiprocessor system, especially one with shared memory, the overhead and error-checking involved can be much higher. In these systems the performance gain from using threads can be much higher too, so it's a trade-off.

Granularity is the amount of stuff in your critical section protected by a lock. Coarse granularity means that we have lots of code inside our critical section. This is good because we need fewer locks and therefore experience lower overhead. Plus, it's easier as a programmer to make sure we acquire fewer locks over large swaths of our program. The downside is that the larger our protected critical section is, the more likely other threads are going to be blocked waiting to enter it. This, in turn, can create problems like high latency. Fine granularity is the opposite, where we lock as little code as possible. The upside is that we don't have to worry about multiple threads blocking for long on small bits of code. The downside is that acquiring more locks means more lock overhead, and more programmer effort to implement all the locks consistently and safely. Fine granularity can also lead to deadlock, where multiple threads are stuck waiting for locks that other threads own.
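As a rough illustration of the trade-off (my own sketch, using a hypothetical hash-table-ish structure; the mutexes are assumed to be initialized with PTHREAD_MUTEX_INITIALIZER), coarse granularity serializes every insert on a single lock, while fine granularity only makes inserts into the same bucket contend with each other:
#include <pthread.h>

#define NBUCKETS 16

typedef struct _bucket {
  pthread_mutex_t lock;  // used only by the fine-grained variant
  int count;
  int keys[64];
} bucket;

typedef struct _table {
  pthread_mutex_t table_lock;  // used only by the coarse-grained variant
  bucket buckets[NBUCKETS];
} table;

// Coarse granularity: one lock, few acquisitions, but every insert is serialized.
void insert_coarse(table *t, int key) {
  pthread_mutex_lock(&t->table_lock);
  bucket *b = &t->buckets[key % NBUCKETS];
  b->keys[b->count++] = key;
  pthread_mutex_unlock(&t->table_lock);
}

// Fine granularity: one lock per bucket, more lock overhead, less contention.
void insert_fine(table *t, int key) {
  bucket *b = &t->buckets[key % NBUCKETS];
  pthread_mutex_lock(&b->lock);
  b->keys[b->count++] = key;
  pthread_mutex_unlock(&b->lock);
}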

The Python interpreter, as an example, implements a single Global Interpreter Lock, which is a lock to govern the entire Python interpreter. Only one operating system thread can be running the interpreter at once, to prevent corruption of global data. I think new versions of Ruby do this too.

There are other methods of synchronizing access to shared resources. One method is to make all data immutable; If you can't modify data, you can't corrupt it. Since Parrot's strings are immutable, you shouldn't ever need a lock when working with them. You may still need to worry about playing with a mutable container PMC which holds strings, or the mutable registers which point to strings, however.

Parrot is definitely going to want to make use of OS-supplied locks in some fashion. Maybe we want to make a PMC wrapper around system lock primitives, or we want to create some kind of lock manager that uses a single system mutex to distribute a series of immutable tokens to worker threads. The exact details of locking are certainly up for debate, but the fact that we don't want to brew our own should be obvious.

The fact that locks need to be used consistently to be of any use at all strongly hints that Parrot should probably do the locking internally. We probably don't want to apply locks to every single operation, since the common case is a single-threaded program and we don't want to apply the performance penalty of lock overhead to programs which don't need it. If Parrot can identify only those PMCs which are shared, it can apply locks selectively to those PMCs only, limiting overhead to only the places where it is necessary. For instance, if we add a synchronize op:
$P0 = synchronize $P1
We can create some kind of wrapper PMC type whose vtables enter a lock, call the vtable of the synchronized PMC, and then release the lock. In this example, if everybody used $P0 when they wanted to modify $P1, all operations would be safe. The onus would be on the programmer to explicitly mark the PMC as synchronized, of course, and many programmers will probably forget to do that.
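Here's a language-agnostic sketch of that wrapper idea (my own illustration; this is not Parrot's actual PMC or vtable API, just the shape of the technique): the wrapper owns a lock and a pointer to the wrapped object, and every wrapper operation takes the lock, delegates to the underlying object, and releases the lock:
#include <pthread.h>

typedef struct _wrapped_obj {
  int value;
} wrapped_obj;

typedef struct _sync_wrapper {
  pthread_mutex_t lock;
  wrapped_obj *wrapped;
} sync_wrapper;

// Hypothetical vtable-style accessors on the underlying object.
static int obj_get_integer(wrapped_obj *o) { return o->value; }
static void obj_set_integer(wrapped_obj *o, int v) { o->value = v; }

// The wrapper's versions of the same operations: lock, delegate, unlock.
int sync_get_integer(sync_wrapper *w) {
  pthread_mutex_lock(&w->lock);
  int result = obj_get_integer(w->wrapped);
  pthread_mutex_unlock(&w->lock);
  return result;
}

void sync_set_integer(sync_wrapper *w, int v) {
  pthread_mutex_lock(&w->lock);
  obj_set_integer(w->wrapped, v);
  pthread_mutex_unlock(&w->lock);
}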

Maybe instead of passing PMC references between threads directly, we create and pass clones and modify them separately on different threads. Then, when we want our changes from one thread to appear in another thread, we would call some kind of propagate op:
propagate thread, obj
This would pause the specified thread, update the object contents, and then unpause the thread. This would be very similar to the message passing that languages like Erlang use (not exactly the same, because Erlang wouldn't pause the recipient for this, but you get the idea).

Maybe we have a system where we only share read-only copies. So thread A would own the PMC, but thread B could get a read-only copy of it. This would completely obviate the need to lock the PMC since only one thread could write to it, but then we need some kind of mechanism where thread B could make modifications back if necessary, or maybe B could gain the writable copy and make A's copy read-only. This system could get very complicated very quickly, however.

We could also avoid most locks if we used a transactional memory system to prevent memory corruption, but that could still add overhead to the single-threaded common case, and we would still want a system of locks for other operations that transactional memory on PMCs wouldn't cover.

These are only a handful of the many potential options that Parrot has ahead of it, and I can go into greater detail about any of them that people are interested in thinking about. I think Parrot is going to want at least some kind of locking mechanism, so we could start prototyping those things immediately if we wanted. How these mechanisms get implemented and applied within Parrot is obviously the bigger issue that we can ignore for now.
