Thursday, October 29, 2009

ParrotTheory: Microprocessor Design

Long before I was involved in helping with the Parrot virtual machine, I was in school learning about and even designing actual microprocessor hardware. In my posts I nonchalantly talk about things like caches, pipelines, and hazards. Now I'm going to explain a little bit about what these things are and why they matter to Parrot (and all code, for that matter).

A typical production-grade microprocessor, not a toy or a colorful explanatory diagram in a book, is broken up into several stages. Each stage represents a portion of work or computation that can be done in a single clock cycle. I'll call these "clock cycles" for simplicity, but it's worth mentioning that there is not necessarily a strict relationship between the clock frequency and the instruction frequency. It's also worth mentioning that the less work each stage does, the faster your clock can operate: the clock frequency is limited by the time it takes to execute your longest stage. For example, if the slowest stage takes 0.5 nanoseconds, the clock can run no faster than 2GHz.

If we look at a simple and classic design, like the original MIPS microprocessor, it is broken up into 5 distinct stages: Instruction Fetch (IF), Instruction Decode (ID), Register Lookup (REG), Execute (EX) and Write Back (WB). In the IF stage, the instruction is retrieved from memory. ID breaks that instruction up into a series of individual components like control sequences, register names, etc. REG retrieves the operand values named in the instruction word from the register file (an array of register values). EX performs the desired operation and produces a result. WB takes that result and stores it back into the register file. Wash, rinse, repeat.
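To make that flow concrete, here's a little C sketch I'm adding for illustration: it walks a single (made-up) add instruction through the five stages, one stage per simulated cycle. The instruction encoding and the tiny register file are inventions for this example, not anything MIPS actually uses.

#include <stdio.h>

/* A made-up instruction word: an opcode, two source registers, one destination. */
typedef struct { int opcode, src1, src2, dest; } instruction_t;

enum { OP_ADD = 0 };

int main(void) {
    int register_file[8] = { 0, 10, 20, 0, 0, 0, 0, 0 };
    instruction_t memory[1] = { { OP_ADD, 1, 2, 3 } };   /* r3 = r1 + r2 */

    /* IF: fetch the instruction from (simulated) memory. */
    instruction_t inst = memory[0];

    /* ID: decode it into control information. */
    int opcode = inst.opcode;

    /* REG: read the operand values out of the register file. */
    int a = register_file[inst.src1];
    int b = register_file[inst.src2];

    /* EX: perform the operation. */
    int result = (opcode == OP_ADD) ? a + b : 0;

    /* WB: write the result back into the register file. */
    register_file[inst.dest] = result;

    printf("r3 = %d\n", register_file[3]);   /* prints 30 */
    return 0;
}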

A given instruction is only being executed in one stage at a time, so in this example it takes 5 clock cycles for an instruction to execute fully, and 1 instruction completes every 5 cycles. Those may sound like two ways of saying the same thing, but keep the distinction between latency and throughput in mind; it matters in a moment. Executing this way would be a bit of a waste, because at any given time 4 of the 5 stages sit empty and unused. To prevent this, microprocessors use a technique called pipelining, where we stagger execution to keep all the stages full. At any given time, each stage can be processing a different instruction. Any one instruction still takes 5 cycles to complete, but we can now complete 1 instruction every cycle. That's a 5x speedup in throughput!

[Note: some readers will realize that I'm talking about a multicycle processor here, not the simpler single-cycle design. This is on purpose.]
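To visualize the staggering, here's a rough sketch that just prints which instruction occupies each stage on each cycle. The instructions are identified only by number; the point is to watch the pipeline fill up and then retire one instruction per cycle.

#include <stdio.h>

#define STAGES 5
#define INSTRUCTIONS 8

int main(void) {
    const char *stage_names[STAGES] = { "IF", "ID", "REG", "EX", "WB" };

    /* Instruction i enters IF on cycle i+1, so on a given cycle, stage s
       holds instruction (cycle - s) if such an instruction exists. */
    for (int cycle = 0; cycle < INSTRUCTIONS + STAGES - 1; cycle++) {
        printf("cycle %2d:", cycle + 1);
        for (int s = 0; s < STAGES; s++) {
            int inst = cycle - s;
            if (inst >= 0 && inst < INSTRUCTIONS)
                printf("  %3s=i%d", stage_names[s], inst + 1);
            else
                printf("  %3s=--", stage_names[s]);
        }
        printf("\n");
    }
    return 0;
}

After the first 5 cycles the pipeline is full, and from then on one instruction finishes WB every cycle.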

The original MIPS design is a very simple example; modern processors from SPARC, AMD, or Intel might have more than 20 stages.

We run into problems called pipeline hazards if we do things like try to read a value from a register that is being calculated concurrently by another instruction in the processor. In this case, the processor bubbles, or delays for a few cycles until all the necessary data is ready. A smart optimizing compiler can automatically rearrange instructions to help reduce bubbling. To see how, take a look at this sequence:

a = b + c
d = a + 2
e = f + g

In this sequence, the value a is used immediately after it is calculated. This will cause the processor to bubble for a few cycles until the value of a is ready. Remember, it takes 5 cycles to compute a single instruction, so in our simplified model the first instruction must complete fully before the second instruction can begin. This means that for 4 cycles, new instructions cannot enter the pipeline and are instead replaced with empty "bubbles". By simply reordering things a little bit, we can speed things up:

a = b + c
e = f + g
d = a + 2

Same exact code, but now it executes a little bit faster. This is because the instruction "e = f + g" does not depend on the value of a, so it can enter the pipeline before the value of a is calculated. Under our simplified model, we're down from a total of 11 cycles to 10 to perform the exact same work.

[Note: I'm definitely simplifying. Don't shoot me!]
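To see where the 11 and 10 come from, here's a hedged sketch that counts cycles under the same simplified rule (a dependent instruction waits until its producer has finished completely). The three-field instruction encoding is made up just for this example; a real compiler's scheduler tracks far more than this.

#include <stdio.h>

typedef struct { int dest, src1, src2; } inst_t;

#define LATENCY 5   /* cycles from entering IF to finishing WB in our 5-stage model */

/* Returns the cycle on which the last instruction finishes, using the
   simplified rule that a dependent instruction cannot even enter the
   pipeline until its producer has completely finished. */
static int total_cycles(const inst_t *code, int n) {
    int complete[16];   /* completion cycle of each instruction */
    int issue = 1;      /* cycle on which the next instruction enters IF */
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < i; j++) {
            /* Does instruction i read a register that an earlier,
               still-unfinished instruction j is going to write? */
            if (code[j].dest == code[i].src1 || code[j].dest == code[i].src2)
                if (complete[j] >= issue)
                    issue = complete[j] + 1;   /* insert bubbles */
        }
        complete[i] = issue + LATENCY - 1;
        issue++;        /* otherwise the next instruction follows one cycle later */
    }
    return complete[n - 1];
}

int main(void) {
    /* Registers: a=r1, b=r2, c=r3, d=r4, e=r5, f=r6, g=r7; src2 = 0 stands
       in for an immediate operand (like the constant 2) in this toy encoding. */
    inst_t original[]  = { {1, 2, 3}, {4, 1, 0}, {5, 6, 7} };   /* a=b+c; d=a+2; e=f+g */
    inst_t reordered[] = { {1, 2, 3}, {5, 6, 7}, {4, 1, 0} };   /* a=b+c; e=f+g; d=a+2 */
    printf("original order: %d cycles\n", total_cycles(original, 3));    /* 11 */
    printf("reordered:      %d cycles\n", total_cycles(reordered, 3));   /* 10 */
    return 0;
}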

The EX stage, which is the actual workhorse of the processor, is relatively complicated in itself. It consists of two separate components, the arithmetic-logic unit (ALU) and the Memory Unit (MEM). Instead of being a series of stages in a row, the ALU is a set of parallel execution cores. Each core performs a different operation, so one may add/subtract, one may multiply, one may bitshift, etc. Data comes in and is copied to every core, every operation is performed on it, and a multiplexer is used at the end to select which value we want to keep. The MEM component can write or read data in memory at a given address. It's the MEM component that is usually one of the slowest parts of the processor, and serves as a huge bottleneck for the rest of the hardware. This is because MEM needs to interface with the cache and the RAM, which both may be running at a significantly slower clock frequency.
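As a rough software analogy, and nothing more, you can picture the ALU as computing every operation on its inputs at once and then using a multiplexer to keep the one result we asked for. The operation names here are invented for the illustration.

#include <stdio.h>

typedef enum { ALU_ADD, ALU_SUB, ALU_MUL, ALU_SHL } alu_op_t;

/* Every "core" computes its result; the switch plays the role of the
   multiplexer that selects the one we actually wanted. In hardware all of
   these are computed simultaneously; in C we just write them all out. */
static int alu(alu_op_t op, int a, int b) {
    int sum     = a + b;
    int diff    = a - b;
    int product = a * b;
    int shifted = a << (b & 31);

    switch (op) {               /* the output multiplexer */
        case ALU_ADD: return sum;
        case ALU_SUB: return diff;
        case ALU_MUL: return product;
        case ALU_SHL: return shifted;
    }
    return 0;
}

int main(void) {
    printf("%d %d %d %d\n", alu(ALU_ADD, 6, 3), alu(ALU_SUB, 6, 3),
                            alu(ALU_MUL, 6, 3), alu(ALU_SHL, 6, 3));
    return 0;
}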

The ALU can be remarkably fast, because there are often very straightforward tradeoffs to be made between the number of transistors and speed. When you hear statistics about modern CPUs having billions of transistors, you'll know that they're trading more transistors for improved performance. I can talk about this more later if people are interested.

What we can do to make things even faster is to start doubling up on parts of the pipeline. We can fetch and decode instructions in large batches and pass them to multiple parallel EX cores. So long as we have the necessary hardware to detect hazards, we can essentially execute multiple instructions together at each stage, with multiple stages in process. Some modern Intel CPUs, for instance, may contain two ALUs and two specialized MEM units (one to read and one to write). This means they can execute two mathematical instructions and two memory instructions at the same time. A smart compiler will know to order instructions like this to maximize parallelism. This design, which uses parallel pipelines in a single processor core, is called superscalar. When a compiler rearranges instructions to take advantage of this parallelism or to avoid hazards, it is called instruction pairing.
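Here's a hedged sketch of the pairing check itself: two adjacent instructions can issue in the same cycle only if the second one doesn't read or overwrite what the first is about to produce. Real hardware also cares about which execution units are free, which I'm ignoring here, and the encoding is the same toy one as before.

#include <stdio.h>

typedef struct { int dest, src1, src2; } inst_t;

/* Two instructions can be paired (issued in the same cycle) if the second
   does not read or overwrite the register the first one writes. This ignores
   structural limits such as how many ALUs or memory ports exist. */
static int can_pair(inst_t first, inst_t second) {
    if (second.src1 == first.dest) return 0;   /* read-after-write hazard  */
    if (second.src2 == first.dest) return 0;
    if (second.dest == first.dest) return 0;   /* write-after-write hazard */
    return 1;
}

int main(void) {
    inst_t a = {1, 2, 3};   /* r1 = r2 + r3 */
    inst_t b = {4, 1, 5};   /* r4 = r1 + r5: needs r1, can't pair with a */
    inst_t c = {6, 7, 8};   /* r6 = r7 + r8: independent, pairs fine     */

    printf("pair(a, b) = %d\n", can_pair(a, b));   /* 0 */
    printf("pair(a, c) = %d\n", can_pair(a, c));   /* 1 */
    return 0;
}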

Consider a modern processor running at 2GHz with 20 pipeline stages that can complete 1 instruction per cycle. That's up to 20 instructions in flight at once and 2 billion instructions completed per second, if everything is running at its maximum efficiency. Not too bad. However, keeping things maximally efficient is a difficult task. Reducing hazards is a great thing, but the processor can only run fast if it can keep its pipeline full, and it can only keep the pipeline full if it knows what instructions to execute ahead of time. When we look at a conditional in C, we may not know what instructions to execute next until we've already calculated the value of the conditional:

if(a == 0) {
    a = 1;
} else {
    a = 2;
}

If the comparison a == 0 is in the pipeline now, we don't know which instruction to load next: a = 1 or a = 2. Without knowing, we would have to wait up to 20 cycles for the comparison to complete before we could load the next instruction into the pipeline. When you consider more stages and more parallel pipelines, stalling to wait for a result to be computed can cost a lot of processor time. You can start to see huge performance decreases from these kinds of situations. To help resolve them, we have branch predictors.

Branch predictors are specialized pieces of hardware that attempt to predict where control flow will go, and use that information to speculatively load the pipeline with instructions. If the branch predictor is correct, things chug along like normal and everything is fast and wonderful. If the branch predictor is incorrect, we need to flush the incorrect instructions out of the pipeline and start fetching the correct ones. This is expensive: we have to completely refill the pipeline, which means fetching more instructions from slow-moving memory.
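To give a feel for how simple a predictor can be, here's a sketch of the classic textbook 2-bit saturating counter. Real predictors keep a whole table of these (indexed by branch address) plus much fancier history tracking, but the core idea fits in a few lines.

#include <stdio.h>

/* A 2-bit saturating counter: 0-1 predict "not taken", 2-3 predict "taken".
   It takes two wrong guesses in a row to flip the prediction, so a loop
   branch that is almost always taken only mispredicts at the loop exit. */
typedef struct { int counter; } predictor_t;

static int predict(const predictor_t *p) {
    return p->counter >= 2;             /* 1 = predict taken */
}

static void update(predictor_t *p, int taken) {
    if (taken  && p->counter < 3) p->counter++;
    if (!taken && p->counter > 0) p->counter--;
}

int main(void) {
    predictor_t p = { 2 };              /* start weakly "taken" */
    int mispredicts = 0;

    /* Simulate a branch that is taken 9 times and then falls through once,
       like the backward branch at the bottom of a 10-iteration loop. */
    for (int i = 0; i < 10; i++) {
        int taken = (i < 9);
        if (predict(&p) != taken) mispredicts++;
        update(&p, taken);
    }
    printf("mispredicts: %d of 10\n", mispredicts);   /* 1 of 10 */
    return 0;
}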

I haven't made much mention of it so far, but memory caching is a huge deal. Memory runs at a much slower speed than the processor does. To keep the processor moving quickly, we bring small blocks of memory into a small bit of fast-access storage inside the processor called the cache. The cache is small but very fast, and information in the processor cache can be accessed very quickly. If we try to access memory that isn't in the cache, we need to stop everything and load that data from memory into the cache before we can continue. This is called a cache miss and can be very expensive. If we need data that isn't in main memory at all but is sitting out on disk, we have what's called a page fault, which is more expensive still, but discussing that is well beyond the scope of this post.
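To sketch what "in the cache" means mechanically, here's a toy direct-mapped cache lookup: the address is split into a tag and an index, and a hit means the tag stored at that index matches. The sizes here are arbitrary choices for the example, not any particular processor's numbers.

#include <stdio.h>
#include <stdint.h>

#define CACHE_LINES 256        /* arbitrary toy size           */
#define LINE_BYTES  64         /* a common cache line size     */

typedef struct { uint32_t tag; int valid; } cache_line_t;

static cache_line_t cache[CACHE_LINES];

/* Returns 1 on a hit, 0 on a miss (and fills the line on a miss). */
static int cache_access(uint32_t address) {
    uint32_t line  = address / LINE_BYTES;    /* which 64-byte block of memory   */
    uint32_t index = line % CACHE_LINES;      /* the only slot it is allowed in  */
    uint32_t tag   = line / CACHE_LINES;      /* identifies who lives in the slot */

    if (cache[index].valid && cache[index].tag == tag)
        return 1;                             /* hit: fast                        */

    cache[index].valid = 1;                   /* miss: fetch from RAM, evicting   */
    cache[index].tag   = tag;                 /* whatever was in the slot before  */
    return 0;
}

int main(void) {
    printf("%d\n", cache_access(0x1000));     /* 0: first touch is a miss         */
    printf("%d\n", cache_access(0x1004));     /* 1: same line, now a hit          */
    printf("%d\n", cache_access(0x1000 + CACHE_LINES * LINE_BYTES));  /* 0: conflict miss */
    return 0;
}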

If the branch predictor is wrong, we need to flush the pipeline and maybe flush the cache too. This can become prohibitively expensive, with slowdowns around 100x or more. This is why sometimes your 3GHz quad-core processor grinds along like an old 486: poorly written code is causing all sorts of hazards and misses, and drives efficiency down to the floor. It's not the processor that's slow, it's the processor needing to wait for the pipeline to refill, or the cache to refill, or even the virtual memory manager to load the correct page from disk. It's the waiting that's a huge performance killer.
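A classic way to watch the cache, rather than the raw clock speed, dominate performance is to walk the same array in two different orders: row by row (sequential accesses, cache friendly) versus column by column (large strides, miss after miss). This is only a rough microbenchmark sketch; the actual numbers depend entirely on your hardware and compiler settings.

#include <stdio.h>
#include <time.h>

#define N 2048
static int grid[N][N];     /* 16MB: much bigger than any cache */

int main(void) {
    long sum = 0;
    clock_t start;

    start = clock();
    for (int i = 0; i < N; i++)          /* row by row: sequential, cache friendly */
        for (int j = 0; j < N; j++)
            sum += grid[i][j];
    printf("row-major:    %.3f s\n", (double)(clock() - start) / CLOCKS_PER_SEC);

    start = clock();
    for (int j = 0; j < N; j++)          /* column by column: strided, cache hostile */
        for (int i = 0; i < N; i++)
            sum += grid[i][j];
    printf("column-major: %.3f s\n", (double)(clock() - start) / CLOCKS_PER_SEC);

    return (int)(sum & 1);    /* use sum so the compiler can't delete the loops */
}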

In a lot of cases, doing more can actually be better for performance. Consider this code, for instance:

if(a != NULL) a = NULL;

This is very straightforward, and only does the work of setting a to NULL if it isn't already. What you don't see is that this is probably a net performance loss: the branch gives the branch predictor a chance to guess wrong (up to about half the time if the values are unpredictable), the comparison already pulls the value of a into the cache, and we pay for an extra compare instruction before we set the value anyway. In this case, it is likely much faster to just write

a = NULL;

But then again, that's not the whole story. It's not just a matter of keeping the pipeline full; it's also a matter of how much effort it takes to read from and write to the cache. If we have to write through the cache to main memory, we'll waste a lot more time than we would have with the conditional. There are always tradeoffs to make, and some of them are painful.

In other cases, it's better to write more code to be faster. Here's an example:

int i;
void *a[8];
for(i = 0; i < 8; i++) a[i] = NULL;

We could make this a little faster by telling C that i can avoid memory entirely, declaring it as "register int i", but that's a small optimization. The loop is going to cause at least one branch predictor failure: either on entering the loop (the first jump back to the top is unexpected) or on exiting it (the predictor doesn't know when to stop jumping back to the top). By unrolling the loop we can make things much faster:

a[0] = NULL;
a[1] = NULL;
a[2] = NULL;
a[3] = NULL;
a[4] = NULL;
a[5] = NULL;
a[6] = NULL;
a[7] = NULL;

I don't want to talk about all potential optimizations here; that would be a huge list. But by being conscious of the kinds of factors involved in performance, I hope other developers will be able to spot situations where code is far less efficient than it could be.
