Whiteknight's World: June 2009

Monday, June 29, 2009

Understanding Opcode Dispatch

I've been reading a lot of papers lately with respect to opcode dispatch because I've been trying hard to prepare for a possible migration to L1. As I've mentioned on other posts before, L1 is going to enable a number of cool techniques and optimization possibilities that we don't really have access to right now, and make a number of other optimizations which are currently possible more beneficial. What I want to do is understand all these possibilities and their requirements so we can design L1 specifically to provide them. The goal of this post is to go over some ideas about opcode dispatch and runcores from the literature I've been reading, to make sure everybody has a basic understanding of some of these concepts.

This is the start of what I hope will be an ongoing series of posts where I try to discuss theory and algorithms behind some of Parrot's systems, and theories about things we would like to add to Parrot in the future. I'll tag posts like this "ParrotTheory".

"Dispatch" is the action of moving control flow into the logic body of an opcode, based on the value of a "program counter" (PC) pointer. In microprocessor hardware terminology, the PC is a hardware register that points to the memory address of the currently executing machine code instruction. x86 aficionados will probably recognize this as the "IP" (Instruction Pointer) register. Same concept, different name. Bytecode comes in to the interpreter as a large block of sequential opcodes, and the current opcode is pointed to by *PC. Moving to the next opcode means we increment PC, so PC++ or something similar.

Most simplistic VMs will use a switch-based core, which uses the C switch statement to dispatch:


for(pc = program_start; pc < program_end; pc++) switch(*pc) {
 case INSTR_PRINT:
   ...
   break;
 case INSTR_PUSH:
   ...
   break;
 case INSTR_POP:
   ...
   break;
}

Control flow in this snippet is a complete mess because we're following the switch, breaking to the bottom of the switch, and then jumping back up to the top of the loop to repeat the process. A good optimizing C compiler will probably convert this switch statement into a dispatch table of jump addresses, which creates a series of indirect jumps. Hardware has trouble with indirect jumps in general because it needs to load the address to jump to from a different address in memory (which can change at runtime and therefore cannot be easily predicted). I can describe this in more detail elsewhere for anybody who is interested (I'm doing a particularly shitty job right now). We end up running the CPU at a fraction of it's maximum throughput speed. It has to keep refilling the instruction pipeline because it can't anticipate where control flow will move to next. This is bad.

A system called direct threading stores all the opcode bodies together in a single large function, similar to the switch core. However, instead of using a loop and a switch, it uses a goto instruction to jump directly to a label. Each opcode has a label, and those are usually stored in a table somewhere. So instead of the example above, we have this:


INSTR_PRINT:
...
goto jmptable[*pc++];
INSTR_PUSH:
...
goto jmptable[*pc++];
INSTR_POP:
...
goto jmptable[*pc++];

It turns out that this is a little faster, but the hardware microprocessor is still not able to properly anticipate where all the indirect jumps lead. Sometimes it can, and some architectures handle this case better then others, but it's far from perfect. Parrot has two cores that implement direct-threading: the computed goto (CG) and predereferenced computed goto (PCG or CGP) cores. Because of limitations in some compilers and the C99 spec, CG and PCG are only available when Parrot is built with GCC. These can be accessed with the "-R cgoto" and "-R cgp" flags respectively.

[Update 01/07/09: As a note of clarification, switch-based cores and indirect-threading cores perform differently on different systems. Some platforms handle one better then the other. Most compilers will generate bounds-checking logic for the switch core that does not exist in the direct-threaded core. I've seen numbers that show the switch core leads to almost a 100% branch prediction failure on some systems, while direct-threading leads to only about a 50% branch misprediction on those same systems. I would like to see a more robust cross-platform comparison of these two.]

In these cases, it's really interesting to note that the number of machine code instructions needed to dispatch an opcode is not nearly as important to system performance as the ability of the microprocessor to anticipate control flow and keep it's pipeline full. Getting a speculation wrong means that the processor will have to flush it's pipeline and reload instructions. Some processors will even stall, and not execute anything until the new address is known. These problems, called pipeline hazards are very very bad for performance.

Pipeline hazards are bad for performance, but cache hazards are even worse. Cache hazards occur when the processor attempts to access memory that is not stored in it's cache, and has to load the necessary data from the comparatively slow RAM. We run into a cache hazard in terms of opcode dispatch when the code size of the opcodes is very large, and we can't fit it all into the processor cache. So what we want is a dispatch mechanism that is good in the processor cache, but also makes things easier on the branch predictor. This is one of my motivations for making L1 a very small opcode set, to ensure that all opcodes can easily fit into the processor cache without creating all sorts of cache hazards.

Inlining, which is a technique frequently used in JIT and modern optimizers, tries to help with branch prediction by converting a stream of opcodes with indirect branches into a stream of machine code instructions with relative branches. Instead of dispatching to the opcode body, the opcode body is copied into a running stream of machine code and executed directly by the processor. This completely eliminates pipeline hazards due to indirect opcode dispatch. However, you end up with more code to cache because you have an entire stream of machine code, where there may be duplication of individual ops. You also spend a lot of overhead calling memcpy repeatedly on the opcodes. This technique increases memory footprint and can reduce cache performance.

A subroutine-threaded core stores each opcode as a separate C-level function. Each op in sequence is called and then the op returns back to the runcore. This is two branch instructions to dispatch each op, compared to only one for a direct-threaded core. However, recent benchmarks I have seen in Parrot show that the subroutine core actually performs faster then the direct-threaded core does. This is because modern microprocessors have lots of hardware dedicated to predicting and optimizing control flow in call/return situations, because that is one of the most common idioms in modern software. This is a nonintuitive situation where more machine code instructions actually execute faster then fewer instructions. Parrot's default "slow" core ("-R slow") and the so-called "fast" core ("-R fast") use this technique (actually, these cores aren't exactly "subroutine-threaded", but it's close). From the numbers I have seen, the fast core is the fastest in Parrot. Here's how it works, basically:


for (pc = program_start; pc < program_end; pc++) {
   functable[*pc](interp, args);
}

[Update 01/07/09: There is some disagreement in the literature about what "Subroutine-threading" really is. Some sources refer to the example I posted above as being Call-Threaded code, and use the term "subroutine threading" more in the way that I am describing context-threading below.]

Context threading, which is a slightly more advanced technique, combines some ideas from the subroutine-threaded and direct-threaded modes, and a little bit of JIT magic. We create what's called a Context Thread Table, which is actually a memory buffer of executable machine code. We store opcodes as functions like we do in the subroutine-threaded system, and use a simple JIT engine to create a series of calls to those functions. This PASM code:


new P0, 'foo'
add_i P0, 1
push P1, P0

Becomes this sequence in machine code:


call Parrot_op_new
call Parrot_op_add_i
call Parrot_op_push

What context threading does, in essence, is aligns the VMs PC pointer with the hardware IP register, and maximizes the ability of the hardware to predict branches. There are some complexities here involving branches at the PIR level, but they aren't insurmountable. Parrot does not currently have a context-threaded core, but I would very much like it if somebody added one. Numbers I've seen show that a good context-threaded core reduces pipeline and cache hazards by significant amounts, which has the effect of increasing performance significantly.

So that's a quick overview of some basic opcode dispatch mechanisms. I know I am leaving a few issues out, and I know I misrepresented a few topics here for brevity and clarity. Next time I think I will talk about method dispatch and some optimization techniques used for that.

Sunday, June 28, 2009

Juggling Projects

Lot's of work has been going on in Parrot land post-YAPC. Lots of people are doing cool new work, and there has been a steady influx of interested new participants. Some people suggested that my Parrot4Newbies posts recently have been bringing them in, but the reality is that there are lots of interested people already who are just looking for opportunities to get involved, and anybody can make those opportunities public.

I have been blogging like a wild man this past week, and particle suggests I can lower the rate of Parrot4Newbies posts to give the community time to think of new ideas for it. I'm fine with that, I'll try to get a nice good post for it out again next week. As is my style, I have about half a dozen posts written, they just need some finishing touches before I publish them.

During YAPC I was working on a new IO architecture where IO PMC types delegated non-subclassable attributes to a new Handle type. The idea was to make things like FileHandle and Socket types subclassable from PIR, and just delegate the messy stuff to a non-subclassable type. Basically, it was a utilitarian solution that resolved the immediate need while not addressing the underlying problem of subclassing PMC types. Allison convinced me that the right way forward was not to write a hackaround that's specific to the IO system, but to fix PMC subclassability entirely.

The real problem is that some PMC attributes are subclassable (PMC*, STRING*, INTVAL, FLOATVAL), and all other types are not. So when you create an object of a C type, only the subclassable attributes are accessable. If you try to access non-subclassable attributes, Parrot throws an exception. This behavior is obviously wrong, we shouldn't be allowed to even create a subclass of a type that's going to be unusable for what we want. Or, we should find a way to make all attribute types properly subclassable. The way to do this is to allow arbitrary C types to autobox into PMCs so that they can be stored in the ResizablePMCArray of the Object's attributes. Then, the GETATTR/SETATTR macros need to be taught how to box and unbox types at runtime. It's simultaneously a very hard and very easy problem to solve, and if anybody is interested in attacking it before I am ready myself, that would be awesome.

So, I backed out all the changes concerning the subclassability issue from the IO work, and focused on other things. Right now I'm stalled on that because there are some issues involving the new Pipe code that Infinoid wrote not playing well with the old FileHandle code. Hopefully we can get that branch fixed and merged to trunk this week, and then the field is set for AIO.

Austin Hastings put his Close compiler source code up onto Google Code, and I've been eagerly reading through it. Haven't had a chance to play with it yet, but it looks awesome: like the replacement for PIR that we've always wanted. A few changes and improvements need to be made, but it's awesome already.

A few days ago Coke started a new branch to begin the conversion of Contexts into a new PMC type. This will make them properly garbage collectable and fix all the leaky memory that's associated with Contexts right now. At least, that's the hope. I started playing with this branch yesterday, doing some of the conversions, and I have to say that it is a lot of work. Contexts, simply put, are fundamental to the operations of Parrot. Also, interacting with them doesn't happen through a particularly clean or well-encapsulated API. In hindsight we probably should have worked to encapsulate it better first, but we've got an open branch right now and might as well get to work on it. Hopefully any lessones that we're going to learn the hard way can be learned quickly. I have some hopes that this branch may succeed, but I'm also planning backup steps in case it doesn't

I've also been looking closely at the GC now, and have even put together a little bit of test code locally for implementing a "very concurrent" GC core based on a paper cotto sent me a while back. I probably won't have time to break ground on this before 1.5, so if anybody is interested in helping with the GC please let me know and we can start putting together a game plan now.

I've also seen some interest in the threading system recently. Infinoid found some problems with it while trying to write tests for the new Pipe types, and we've had some new hackers interested in working on the threading system to support the concurrency primitives in Rakudo, so maybe we will get some good work done on that system in the near future.

So that's a brief glimpse of some of the things that are going on in Parrot this week, I would love to hear about some other projects that people are working on too (especially since we didn't have #parrotsketch last week to share).

Friday, June 26, 2009

L1: Let's Review FAQ

Buzz has been steadily increasing about this whole L1 idea, especially after some initial information about it was loaded onto the Parrot website. We had a big conversation last evening on #parrot, and I've received a few emails and blog comments about it as well. I would like to go through and try to address some of the common questions and concerns that I've been hearing.

Question 1: Why Is L1 Better Then PASM?

Another common rephrasing of the question would be "Why don't we just extend PASM to do what we want L1 to do?". I would turn the tables and ask "What is so good about PASM that we can't even conceive of a replacement?". Popular lore has it that Thomas Edison churned through 10,000 failed designs before he came up with the working lightbulb. We should not be so foolhardy as to think that we can't have made a fundamental mistake with PASM.

The problem with PASM is that it's the wrong level of abstraction. Frankly, it's too high level and is focused more on ease of use by the programmer then ease of execution by the VM. Don't believe me? I suggest you look at all the hundreds of files that are currently written in PIR: Main programs for HLLs, runtime libraries in HLLs, NCI wrappers for libraries, the entire test suite, etc. If PIR wasn't so great for people to be writing, it would long ago have been identified as a pain point and replaced with something better. Writing PIR is currently a required part of building an HLL compiler, and very few people are complaining about that, or complaining enough to have it changed! This in and of itself is proof that PIR (which is just PASM in disguise) is more suited for use by humans then by the machine.

And it's not like we don't have the tools or the skills necessary to write a replacement language compiler. In Parrot world you can't shake a stick without hitting somebody who has written one. If PIR was too low-level, we would have a compiler for a replacement language prototyped in less then 5 hours. Seriously.

L1 is better then PASM/PIR in this case because we can specially design it from the ground up with the purpose that it should be easy and fast for the machine to execute. We can take lots of steps to make it easy and fast to execute, a lot of steps that the designers of PASM didn't take because they didn't know. Back when Parrot was first being designed, how could the developers have know about the amazing performance of JavaScript engines such as SquirrelFish or TraceMonkey that have come out recently? The answer is that they couldn't have known what we know now about VM design.

We can focus on execution performance to the exclusion of human readability because people shouldn't be writing it directly. Let me repeat: we have the tools, the expertise, and the experience as a community to write a higher-level language compiler to do the hard work for us. Regardless of what intermediate language we end up with on Parrot, I don't expect humans will have to be manually writing any of it.

Question 2: Won't another layer make things slower?

Here's what we have right now:

People assume that I'm planning about just adding another compiler step and then another target interchange format for L1 at the end of this list. Not so. PCT has the potential (though it hasn't been implemented yet) to output code in any format. Why don't we do this instead:

And if you see that image and ask "Why do we need PIR/PASM in this case at all?", now you're thinking along the right lines. Imagine a world where we never ever need to write anything in PIR or PASM by hand again. Now imagine that same world where our tools like PCT don't output PIR as an intermediate form but, unbeknownst to you the programmer, directly outputs executable bytecode like L1. In such a world as we are imagining here, do we need to keep PIR or PASM around? Do they serve any purpose? Maybe as a very small bootstrapping layer to build PCT in the first place, in which case there is a lot of bloat that we could throw away, and we would only need a PIR compiler for the miniparrot build step and wouldn't need to build IMCC into libparrot or the Parrot executable.

It's not that we're adding a new layer of complexity, we're starting to realize that PASM is the wrong level of abstraction for our needs in Parrot, so we are replacing it wholesale with L1.

Now obviously there are intermediate steps betwen here and there where things are going to be slower and more complicated: We will need to use PIR as an intermediate step until PCT is capable of outputting L1 directly. But let me ask this: Do we want PCT to output something that executes slowly, or something that executes more quickly, when all the necessary work is done?

Question 3: L1 is going to be lower-level then programmers like

No question that PASM/PIR are far more friendly to the programmer then any conception of L1 is going to be. PIR and PASM are supposed to be sufficiently low-level that people don't want to program in them, but at the same time they aren't enough of a pain point that we made a replacement language into a priority. If L1 is such a pain point for programmers as I think it will be, we will write a compiler for a better language so we never need to write L1 directly.

We already have NQP which is a thin Perl6-like language that works well with Parrot's execution model. We also have Close in development which is going to be a C-like language that does almost the same thing. Assuming both these two languages have all the capabilities, why wouldn't we use these to do all our coding and never use PIR, PASM, or L1 again? Remember, I keep saying this: We have PCT, one of the most powerful compiler construction tools ever conceived. We should be using that to make languages that insulate us from whatever intermediate language Parrot uses. We should be reaching a point where we never ever ever have to write or even read another line of PIR or PASM ever again. Ever.

And when we do reach that point where we are so completely insulated from it, it won't matter to the programmer what intermediate language Parrot uses, because we won't be seeing it.

Question 4: A lot of development effort is going to be wasted

This is a good concern, that a lot of things people have spent a lot of effort to develop are basically going to become obsolete. But I have to also ask the question: "Why don't we all just write in Fortran, considering that so many people have spent so much energy on that compiler?". The effort isn't all wasted, we did learn a lot of important lessons about Parrot and the right way forward, and we've used all the things we've developed to bootstrap the creation of better tools and better ideas.

PIR and PASM got us to a place where we have PCT and several HLLs in active development. That doesn't mean we need to be trapped underneath these languages forever. We use them as a bootstrapping layer to build better tools, and use those instead.

Question 5: How does L1 compare to C?

Alternatively, why are we rewriting all our C code in L1? Let me ask what is the difference between C code and Assembly code? Everything you can do in assembly you can do in C. You will probably need to call a library to do some things, but if you can do it in ASM, you can do it in C. A compiler converts both down to the same machine code that is executed by the same hardware. You can say that one is "faster" then the other, because your average compiler isn't always smart enough to generate "good" machine code, but given the proper optimizations, there should be no speed difference. In other words, given the exact same machine code, it doesnt matter to the processor whether that code was written by a human in assembly and assembled to machine code, or written in C and compiled into machine code.

All that information out of the way, let's conceive of L1 as being a portable assembly language with all the same exact capabilities as C. Given the same exact capabilities and a good compiler (for our purposes, a JIT engine is just a "compiler"), both sets will be able to do the same stuff, and can be converted down into the same machine code for execution by the hardware at the same speed.

So given that the two are equivalent, why have L1 at all, and why not keep everything in C? The difference is semantics and algorithmic complexity. It's a difference of where the primary control flow happens. Right now we have a register-based system (PIR) running on top of a stack-based system (C). Switching between the two is slow, and unfortunately we switch pretty often because neither one supports all the semantics we need. L1 gives us an opportunity to access the low-level details that C has (calling native functions, accessing memory pointers, etc) but to use the high-level semantics that PIR requires (register-based operation, etc). We gain the ability to keep control flow executing in only a single context, and to eliminate all the algorthmic complexity of switching between semantic contexts.

An L1 dispatcher handles all control flow, able to call C functions directly as needed and redirect to offsets in bytecode. It can handle function calls, exception throwing, vtable overrides, and other CPS magic without having to recurse down the C system stack. And we gain all sorts of potential for optimizations because we have a unified environment like this.

In short, L1 allows us to design a language with the power and flexibility of C and the semantics of PIR, which in turn will let us reduce algorithmic complexity at the global level.

Question 6: Why not just fix PASM/PIR?

We can't really fix PASM in place because we have conflicting needs: We simultaneously need to extend it in order to match all the power of C, and shrink it to make it easier to analyze, optimize and JIT compile it. What we would end up doing is growing PASM to include what I now think of as being "L1", and then we would need to shrink away most of what we now think of as being "PASM", and we would be essentially left with only L1.

Keep in mind that this is not necessarily a bad development path for us to follow, but the end result is L1, not what we currently know as PASM. We can start by creating an op library for the new L1 ops, start transitioning everything over to use them, and then start deprecating away our old PIR ops (or rewriting them in terms of L1). What remains at the end of this is the small selection of L1 ops and various HLLs (I'm including PIR/PASM here as an HLL) which compile to L1, and L1 is executed directly by Parrot.

Question 7: What does L1 Buy Us?

L1 is going to give us a number of benefits:

Decreased algorithmic complexity, because we're not having to shuffle data between the PIR registers and the C system stack
Improved, easier, faster JIT. A "fixed" JIT
Potential to plug more easily into existing JIT engines (LLVM and libJIT)
Potential for trace-based JIT, where we trace out "hot spots" in the code and JIT them with type-specific information to speed up dispatch.
Potential for high-level optimizations including subroutine inlining, dead code elimination, etc
Potential for context threading, where we try to align the VM with the control flow of the underlying machine, to maximize branch prediction and caching at the hardware level
Potential for improved GC and resource allocation performance because it will be easier to analyze where memory is allocated and where it falls out of scope. This includes "escape analysis", where we determine the lifespan of a reference and are able to deallocate it more quickly and efficiently.
Potential for easier profiling, because everything will be in L1, we only need one tool to analyze L1 control flow (which we can share with the trace-based JIT and the GC escape analyzer)

It's quite an impressive list of possibilities, but I will stress that L1 doesn't get us these things for free: It does however give us the potential to do these things, a potential that we don't have right now with C/PIR.

Parrot4Newbies: PMCs

This is the third installment in my series of Parrot4Newbies, and today I am going to talk about PMCs. As a general reminder, if you're interested in any of the topics I've discussed so far, make sure to keep track of the comments. I'm going to post more tasks on each topic in the comments, and I hope other people will post new ideas there as well.

I've talked about documentation and the test suite as great places to get involved in Parrot quickly. However, neither of these involve a lot of code, especially not a lot of C code which many people know very well. If you know C pretty well and want to get right into Parrot's guts, I can't think of any place better to get started then the PMC system. PMC types are defined in /src/pmc/*.pmc files. Parrot's core API for dealing with PMCs is located in src/pmc.c, and our implementation of objects (which are based on the Object PMC) is located in src/oo.c. The current compiler tool for converting PMCs from their strange C-like language into pure C is located in lib/Parrot/Pmc2c/*.

Refactor Messy Code

Several of the core PMC types are very old and in need of a major cleanup. Some things need to be refactored for readability and to maximize code reuse. Other things, especially critical core types, need a little bit of optimization lovin'. Some good candidates here for general cleanup are Integer, Float and String PMCs, which are used frequently to autobox INTVAL, FLOATVAL, and STRING* core primitive types. Also, some of the newer types such as Socket and Sockaddr could use some tweaking to become more mature and stable.

Certain types that deal with the underlying system, such as OS, Env, and File, always need tweaking, extending, and testing to provide all the functionality that users are going to expect in order to examine and manipulate the underlying machine in a consistent way.

Verify the Spec

The various design documents in docs/pdds/* refer to some of the Core PMC types and discuss some of the important components of each. Of specific interest are PDDs 15, 17, 20, 21, 22, 23, 24, and 28. Also, there are several design documents still in draft in docs/pdds/draft/*. Take a look through some of these documents to see how the drafts compare to the current implementations. Feel free to make one conform more to the other, and submit patches for both the PMC types and the spec documents to update them. Of particular interest here are PDDs 8 and 14.

Rename API Functions

This is a task that actually applies to multiple subsystems, not just the PMC system. There is a page on the Parrot Wiki that we're editing right now to provide more details about other projects in this area.

Basically, functions in src/pmc.c need to all be renamed to Parrot_pmc_*. We also need to evaluate which functions represent the public-facing PMC API (which should all have the PARROT_EXPORT directive), and which items should not be part of that API. So the function pmc_new, should be renamed to Parrot_pmc_new.

Renaming just one function at a time (which would be ideal in terms of small, easy-to-review patches) should be a trivial matter for a coder who's any good with Perl5 (or even sed, if you're into that kind of stuff).

Also, the file src/pmc.c should probably be moved to src/pmc/api.c, and broken into subfiles depending on functionality. Check out the page on the wiki for details about a move like this.

Conclusion

PMCs are Parrot's basic aggregate structure, and the various core PMC types encapsulate a large amount of Parrot's functionality. Unfortunately, many of the central PMCs and PMC mechanisms are old and in need of some tender loving care from a decent coder. It's a system that's easy to get involved in, and there are some real benefits to the project that won't cost more then a small amount of time with the right tools and right knowhow.

Thursday, June 25, 2009

Concurrent Garbage Collection

I mentioned in passing a few posts ago that cotto sent me a link to a paper about a concurrent garbage collector VCGC. eternaleye sent me another paper that proposed improvements to the VCGC algorithm, while following the same general concept. The second one is called MCGC. I read over those two quite enthusiastically, and then moved on to read a paper about G1, the new Java VM garbage collector. There are some similarities between the two approaches, but plenty of differences too. In the end I think a collector somewhere in this family is going to be the GC of choice for Parrot (at least, for birds built on multithreaded systems). Of course, it may be just one of a possibly large stable of collectors, each suited for different needs.

The first paper introduces the "Very Concurrent GC" (VCGC) which is able to use a multithreaded approach without the need for fine-grained thread synchronization. "Balderdash!" I can hear you saying, but have faith: I've read the paper and the algorithm is so beautiful in it's simplicity that I can't believe I didn't think of it first. And it's very plausible. Here's the gist of it: Each memory allocation round has a color. We allocate memory with color x and operate like normal until we hit a synchronization point. At the synchronization point, we increase x++. All memory allocated before the synchronization point now has color x-1, and all memory allocated thereafter will have a color of x. We continue executing again until the next synchronization point, and then we bump x up again. Memory allocated in the first window now have color x-2, Memory allocated in the second window now have color x-1, etc. All memory chunks with color x-2 are swept and reclaimed to the system.

In this system, memory blocks implicitly change color because we change what the color numbers mean at each synchronization point. Without proactive marking, the blocks "fall off" the end of the window and get collected by a collection thread. The collection thread has no other job then to iterate over the allocation space and free all memory with color x-2. We call this thread the "Sweeper". Saving chunks from certain doom is the job of the "Marker", a thread that is the only thread in the system capable of changing the color of a block. The Marker runs a normal GC mark algorithm, starting at the root set and bumping all memory that's reachable to color x. The point when both the Marker thread and the Sweeper thread have completed their current run is called the synchronization point, which is when we increment x (called the "epoch") and restart both threads to run again. It's simple and low-overhead because it doesn't require any moving or compacting, and it only has to twiddle a handful of bits to make everything work. Also, it appears to scale well to multicore systems and multithreaded programs.

G1 is a very interesting collector which I find is conceptually not entirely dissimilar from VCGC, although I'm probably minimizing the differences in my own mind. It seems to be based on more of a lazy approach, picking low-hanging fruit to reduce the need for complete end-to-end GC runs and making allocation more efficient. G1 divides the heap into regions, and focuses on regions where there are the fewest active blocks. In the region, G1 frees any garbage it finds and copies any live items to a "dense prefix" somewhere else. This allows the entire region to be used by the allocator for easy linear allocations.

G1 appears to be very heterogenous in that memory of all sizes is allocated from a single pool. In that sense, a G1-like collector may be suitable for use with our STRING system, which is badly in need of performance tuning. Something like VCGC would probably be more useful for the common case of homogenous header pools, like our PMC pools and our sized pools, which facilitate very rapid array indexing through the pool.

VCGC and variants suffer from a few worst-case scenarios, such as situations where garbage is very long-lived (thus wasting time where the sweeper repeatedly checks things that are not garbage) and situations where garbage is very short-lived (where blocks become garbage quickly, and need to wait for two epochs to be swept, which increases memory consumption). The benefit of course is simplicity in the algorithm. Adding complexity, such as in MCGC, can reduce these problems.

My plan in the near future, probably after the AIO project, is to start prototyping a new concurrent collector core modeled on VCGC. I will use a simple and direct implementation of it initially, no bells or whistles or fancy-schmance optimizations. Once we have a basic concurrent core installed and working properly, it will be easier to add these kinds of optimizations in an incremental fashion.

And I have plenty of potential optimizations in mind, including some simple tweaks to the basic algorithm that I think will add some time-saving generational semantics. If you have some ideas too, I would love to hear them. I may create a planning page on the wiki soon to start putting ideas together for this.

What's really most important at this point, and pmichaud mentioned this several times at YAPC::NA, was just getting a second core working. We don't even care what core it is, we just need a second one to prove that our architecture is pluggable and work out any kinks in the API. Once we have a second core in place, it will be that much easier to add a third and a fourth, etc. With all this in mind, we really don't need to be swinging for the fences right now, just looking to add something quick that maybe offers some small performance benefit over the current system.

Parrot4Newbies: Test Suite

The second post in my Parrot4Newbies series, today I am going to talk about Parrot's test suite. This is another great way for the average new user to get a deeper understanding about Parrot's capabilities because the test suite is, in theory, a comprehensive exercise of Parrot.

Parrot's test suite has historically been written in Perl 5, with a series of modules derived from Test::More and Test::Builder and others with some custom methods to handle creating, compiling, and executing PIR and PASM code. Now don't get me wrong, we do like Perl 5. However, we want to get rid of as much Perl 5 from the Parrot repository as we possibly can. For one thing we want to reduce the number of dependencies. For another, we have to look forward to a future where Perl 5 is actually running on Parrot, which would create a very fun bootstrapping problem. Maybe that's just wishful thinking on my part, but it's a fun goal to have in mind (and a fun project for an adventurous team of new developers to attempt!)

Run Tests, Report Problems

The fastest way to get started with tests is to get the current version of Parrot from SVN, and run this little script:


perl Configure.pl [args]
make
make test

The "test" target runs the core functionality tests, the documentation tests, and the coding standards tests, plus a few others. This can take quite a long time to execute, so running the "make coretest" is probably more attractive for most coders. The "coretest" target only runs tests of the core functionality.

If you're looking to really optimize and run tests quickly, you will want this incantation:


make -j9 [testname] TEST_JOBS=5

Where [testname] can be "test", "coretest", "fulltest", "distro_tests", or any of the other test targets. This particular command executes testing in 5 threads in parallel, which can go very very quickly on some systems.

When you run tests and see failures, you can either submit a bug report to trac, or you can be ambitious by fixing it and submitting a patch. Patches are always very nice to have, but even reports of failures (including information about your system and build options) are very good too.

Upload Smolder Reports

The smolder service maintains an online log of test progress, so we can see what tests are doing on each platform. If you're running tests regularly, please consider "make smolder_tests". This will run through the test suite and upload the results to the smolder server so our developers can keep track of them.

Even better would be to set up a smolder slave bot. Set up a script on your system to run this sequence at intervals:


cd $HOME &&
svn co https://svn.parrot.org/parrot/trunk parrot-smoke &&
cd parrot-smoke &&
perl Configure.pl &&
make &&
make smoke

This is especially important if you're on a rare system (something that isn't x86+Windows, x86+Linux, or x86+Darwin). We can never have enough reports from rare systems.

Convert Tests to PIR

Many of our tests are written in old-style PASM, and many of them are still managed by Perl 5 scripts. We need to rewrite all these test files to use pure PIR instead. Here's an example of a Perl 5 test:


use strict;
use warnings;
use Test::More
use Parrot::Test tests > 1;
pasm_output_is( <<'CODE', <<OUTPUT, "gt_ic_i_ic" );
     set I0, 10
     gt 11, I0, ok1
     print "nok gt\n"
ok1:
     print "ok 1\n"
     gt 9, I0, nok1
     print "ok 2\n"
     branch ok2
nok1:
     print "nok gt 2\n"
ok2:
     end
CODE
ok 1
ok 2
OUTPUT

And here is that same test rewritten in PIR:


.sub main :main
plan(2)
gt_ic_i_ic_1()
gt_ic_i_ic_2()
.end

.sub gt_ic_i_ic_1
I0 = 10
if 11 > $I0 goto ok_1
ok(0, "gt_ic_i_ic1")
goto end_1
ok_1:
ok(1, "gt_ic_i_ic1")
.return()
.end

.sub gt_ic_i_ic_2
$I0 = 10
if 9 > $I0 goto ok_2
ok(1, "gt_ic_i_ic1")
goto end_2
ok_2:
ok(0, "gt_ic_i_ic2")
.return()
.end

So the transformation isn't entirely straight forward but the end result is pure PIR, not a hard-to-read mixture of PASM and Perl 5. Notice that I am not line-for-line translating the PASM into PIR, I rearrange it a lot to make it more readable. The end result, the features being tested, are absolutely the same.

PMC Testing

As the policy stands right now, every file in src/pmc/ should have an associated file in t/pmc/. There is actually a test to verify that we have all these necessary tests! Just having these test files isn't enough, we need to make sure the tests are actually giving the PMCs a good workout. Look through the various tests and make sure each PMC is well tested. Some PMCs such as Integer implement many VTABLEs, and we want at least one test to exercise each of them. Writing tests like this helps not only to figure out what all the core PMC types do, but you're also forced to trace through some code to figure out where the various VTABLEs are called from so you can figure out how to test them.

Conclusion

The test suite is a key part of the Parrot development process. It helps us to find small regressions and bug earlier so they don't grow and fester into larger bugs later. The more comprehensive our test suite is, the more comfortable we can be making Parrot releases because we know that HLLs and other users of Parrot that use the features we test for will work as expected. It's also a great way for new developers to get involved and start getting an idea about the capabilities, current state, and limitations of Parrot. Once you see for yourself the areas where Parrot needs some work, you'll be more able to start making the necessary fixes yourself.

Wednesday, June 24, 2009

Parrot4Newbies: Documentation

In response to some requests I received and some signs of interest that I saw at YAPC, I decided to start a little series of posts that I am going to call Parrot4Newbies. The goal of these posts are to take a look at some particular subsystems and projects in Parrot-land, describe what kinds of jobs and work need to be done, and break down these tasks into small steps that new users can follow in order to become acclimatized and familiarized to the Parrot development process. I will be publishing these posts pretty frequently, in addition to my "normal" posts which I try to put out at least 5x per week, so the volume here may be getting pretty heavy. I will be using the "Parrot4Newbies" tag on these posts, so you can filter them if you want.

The first area where Parrot needs help, which is actually a common need of most open source software projects, is documentation. Writing and improving documentation is a great way to get familiarized with the codebase for people new to Parrot internals. There are actually several components to our documentation, all of which need help.

There are two good ways to help: (1), open a bug report (requires a username) when you find a problem that you would like somebody else to fix. (2) Even better is to fix the problem yourself and create a bug report with an attached patch.

Project Documentation

We have a lot of documentation in the repository in the folders docs/. However, our documentation is far from comprehensive. Find files that look incomplete, out-of-date, or just plain confusing and misleading, and fix them up.

Our design documents in docs/pdds/ are the documentation that describes what Parrot is and how it is implemented. Read through these to get a good understanding of Parrot, and look for places that are incomplete, not true to reality (which may mean the code is wrong or the specs are wrong), or misleading.

Parrot Book

We have a book at docs/book/ that needs work. We have a first edition being published soon, but it needs a lot of work so we can put out a second edition after Parrot 2.0 is released. Read over the book and find things like core topics that aren't covered (or aren't covered well), typos, out-of-date errors, etc.

When you see code examples, try them and make sure they work too! Don't just trust our word for it make sure it does what it says it does!

Examples and Tutorials

We have a large amount of examples in examples/, and a PIR tutorial in examples/tutorial/ that need lots of help. Read through the examples, try them out locally to see if they work and how, and clean them up a little. Our tutorials especially are incomplete and don't cover all the topics that a new user needs to know. Feel free to write new tutorial pages entirely if a particular topic isn't well covered already.

File- and Function-Level Documentation

All our .c files contain POD-formatted documentation. This is broken into two distinct types: File-level which is the stuff at the top of the file that describes the particular system in a high-level overview; and Function-level which provides documentation for each individual function. Both of these things are required and are verified by our test suite. To run these tests, you type "make codetest" from the parrot repo, after configuring and building Parrot. This will run through a series of tests in addition to the documentation ones. Specifically, the test t/codingstd/c_function_docs.t covers the function-level tests, and there are several files failing this test:

compilers/imcc/instructions.c
compilers/imcc/optimizer.c
compilers/imcc/parser_util.c
compilers/imcc/pbc.c
compilers/imcc/pcc.c
compilers/imcc/reg_alloc.c
compilers/imcc/symreg.c
compilers/pirc/src/bcgen.c
compilers/pirc/src/pircapi.c
compilers/pirc/src/pircompiler.c
compilers/pirc/src/piremit.c
compilers/pirc/src/pirmacro.c
compilers/pirc/src/pirpcc.c
compilers/pirc/src/pirregalloc.c
compilers/pirc/src/pirsymbol.c
config/gen/platform/ansi/dl.c
config/gen/platform/ansi/exec.c
config/gen/platform/ansi/time.c
config/gen/platform/darwin/dl.c
config/gen/platform/darwin/memalign.c
config/gen/platform/generic/dl.c
config/gen/platform/generic/env.c
config/gen/platform/generic/exec.c
config/gen/platform/generic/math.c
config/gen/platform/generic/memalign.c
config/gen/platform/generic/memexec.c
config/gen/platform/generic/stat.c
config/gen/platform/generic/time.c
config/gen/platform/netbsd/math.c
config/gen/platform/openbsd/math.c
config/gen/platform/openbsd/memexec.c
config/gen/platform/solaris/math.c
config/gen/platform/solaris/time.c
examples/c/nanoparrot.c
examples/compilers/japhc.c
examples/embed/lorito.c
src/atomic/gcc_x86.c
src/debug.c
src/gc/gc_malloc.c
src/gc/generational_ms.c
src/gc/res_lea.c
src/io/io_string.c
src/jit/amd64/jit_defs.c
src/jit/arm/exec_dep.c
src/jit/i386/exec_dep.c
src/jit/ppc/exec_dep.c
src/nci_test.c
src/pbc_dump.c
src/pbc_info.c
src/pic.c
src/pic_jit.c
src/string/charset/ascii.c
src/string/charset/binary.c
src/string/charset/iso-8859-1.c
src/string/charset/unicode.c
src/tsq.c
src/jit/alpha/jit_emit.h
src/jit/arm/jit_emit.h
src/jit/hppa/jit_emit.h
include/parrot/atomic/gcc_pcc.h
src/jit/ia64/jit_emit.h
src/jit/mips/jit_emit.h
src/jit/ppc/jit_emit.h
src/jit/skeleton/jit_emit.h
src/jit/sun4/jit_emit.h

Some of these files are complicated, which is why nobody has been brave enough to document them. However, while documenting things you should feel free to clean the code up a little bit so it becomes easier to document and understand. Also, most of these are not core files, they tend to be files on the edge of the codebase in some of our complicated systems: JIT, the IMCC and PIRC compilers, etc, so you're going to need to read up on some core systems in order to what is going on in all of these. More reading and looking at code means you learn more, more quickly!

Most of the rest of the code files in the repository have some documentation, but it might not be high quality. Feel free to look into other files that interest you as well and improve existing documentation to be easier to read and more accurate. This is a great way to get started as well!

Conclusion

So that's a quick list of things that a new user should feel free to take a look at as a way of getting familarized with Parrot and it's documentation. Documentation is a great way to learn the code because you have to understand it and describe it in your own words in order to write the docs. It's a great way to get involved and will have an immediate impact on the quality of Parrot and our ability to attract even more new developers. Fixing some documentation

Please do not hesitate to send in patches, however small. Also, if you have questions and need clarifications, please feel free to ask on #parrot or on the parrot mailing list. We would love to help you help us.

YAPC: L1 Recap

On request, I have been asked to summarize some of the discussions we had about L1 at YAPC. We talked about L1 as a small group on Sunday during the PVMW unconference (Myself, cotto, particle plus two others), and again as a slightly larger group at Lulu's restaurant for Tuesday lunch (Myself, chromatic, pmichaud, particle, cotto, jhorwitz and three others). I will try, as well as I am able, to summarize what we discussed both times. I'm going to post most of this post also to the Parrot wiki so more then just my readers here will have access to it.

Microprocessor analogy

When you look at modern microprocessors, you see these complex instruction sets like x86. In traditional machine code, each bit in a machine code opcode travels down wires and acts as control signals to the actual circuit hardware. The problem with this is that the opcode structure becomes intimately tied to the architecture of the hardware: If you change one, you need to change the other (because different bits now need to travel down different wires to different circuits), and then things built on top break: Your assemblers now need to output new machine code, your pre-existing binary executables are no longer valid, etc.

In addition to this, some machine code words are very complex. x86 has hundreds of operations, and each operation can use a large number of addressing modes. Combine these together and you get hundreds, maybe millions of permutations. A single op, depending on the types of the sources and the destinations can have a number of distinct behaviors. add ax, bx is pretty far different from add [ax], bx, which is different from add ax, [bx], etc. Having to decide what behaviors to activate for each opcode from a large set of opcodes is hard to do.

Reasons for L1

Consider the idea now of converting those "high-level" machine code words into "low-level" microcode where each microcode operation is small, simple, atomic, and fast. Because each of these lower level instructions is more simple and atomic, they are easier to use in automated translation and with JIT.

Right now we have a lot of our critical code paths are written in C, and a lot of them are written in PIR. It's immensely expensive moving data between PIR registers and the C system stack to call functions in each language, and we are incurring that cost very frequently. In addition, we lose out on lots of opportunities because of the boundary between the two: We cannot easily optimize, we cannot easily inline functions, we cannot easily profile (we can typically profile C or profile PIR, not both at once), etc.

L1 is an abstraction that allows us to insert a new abstraction layer into Parrot to help reduce algorithmic complexity on the global level and give us more opportunities for optimization then we have in the PIR/C combination so far. Other projects like TraceMonkey are using cool optimization techniques such as polymorphic inline caching (which Parrot has a basic and messy implemenation of now, which has been deprecated), trace-based JIT, and context threading. These optimizations together are producing some stunning benchmark results for leading-edge javascript interpreters.

What L1 will look like

So the question invariably arises, what will L1 look like to the average programmer? Will it look like some grizzled assembly language? Will we be forced to reimplement large swaths of Parrot in some freaky home-brewed assembly? The answer, I think we are all pleased to consider now, is no. We have skills and background and tools for programming language design and we have several existing parsers for languages that we could utilize. One specific example of things that we could do, as mentioned by pmichaud, is to write core PMC types in NQP. It would have to have some extensions to allow direct register addressing, definition of vtables, and definition of attributes which may be references to arbitrary C data structures. Also, it would have to have a new backend which outputs L1 bytecode instead of PIR->PBC as PCT normally does now. We also have a new language in development by Austin Hastings called "Close", which is a very low-level language based on C that will eventually have all the capability of PIR (and therefore L1), but with familiar C-like syntax.

What this means is that there won't be a human-editable form of L1 bytecode, because we can write it in NQP, Close, or even other high-level languages which currently run on Parrot (although we will want to find one with a small runtime, or even a variant with no runtime). The short answer: L1 looks like any language we want it to look like (but probably Perl6-like in NQP or C-like in Close).

Next Steps

So what are the next steps? Well, pmichaud suggests that a good way forward is to actually start prototyping some key core PMC types in NQP now, so we can get a feel for what syntax and semantics we will actually want and need. Once we know what our needs are, we can start modifying NQP to support our new syntax for dealing with these things.

With the synatx requirements set, we can then modify the NQP parser to output normal C for now, and eventually redo the backend to output L1 instead.

chromatic suggests that with our current plug-in architecture we could create a library of L1 dynops, and create an NQP backend to target those. Once we have core PMCs as written in NQP executing only on L1 dynops, we can start the rest of the conversion process.

Here is a quick checklist:

Write core PMC types in NQP, making careful note of what syntax we need to add to support all existing operations. Good candidates for this prototyping would be Integer, ResizablePMCArray, and Hash (which is soon to be renamed to AssociativePMCArray).
Write existing PIR opcodes in NQP, making careful note of what syntax we need to add to support existing operations.
Write a dynop lib for L1 ops
Modify the NQP compiler to have all the necessary syntax from steps #1 and #2, and to output only those L1 opcodes created in step #3.
Modify IMCC (or PIRC, depending on time frame) to output L1 or PBC, depending on some kind of switch
Move L1 opcodes into the core instead of being a dynoplib.
Make the final switch. All ops and PMCs get compiled from NQP into L1 during the normal build process (possibly using a bootstrapping step)
Optimize like rabid ferrets.

Monday, June 22, 2009

IO work at YAPC.

The sysadmin heritage of Perl has really been shining through here at YAPC, a lot of our newcomers are really interested in process management and interprocess communication topics. Unfortunately, this seems to be an area where Parrot is particullarly weak. Luckily I managed to get a pretty good idea of the kinds of things that the workshop attendies are interested in, and even picked up a few good ideas about implementation. We had an "unconference" yesterday which is so awesome I would like to do it again some time. Earlier this morning I talked to a group of interested parties about the state of Parrot's GC, and may have found a volunteer to help implement the concurrent GC core I've been planning. I also did a little bit of talking about L1 to introduce the idea to people who haven't been following my blog (everybody on the planet besides myself). I'm just finishing up a session about IPC now, and have heard a lot about the things that people want, and heard lots of cool ideas about how to implement them.

I've been doing a lot of hacking on the io_cleanups branch, which I started two days ago. It was started with the goal of cleaning up the IO system a little bit and better integrating Sockets and Pipes. It's gone a little bit beyond that now, and I'm getting pretty close to having a cool result to share. I posted a patch to the IRC chatroom in hopes Infinoid would work his magic and tell me what I'm doing wrong. What I need to do is hunker down with the debugger and see where my stuff isn't working correctly, I just haven't been able to focus on it for long enough because of the festivities. I'm sure the problem has to do with my mishandling of the buffering code somewhere.

The gist of the work in the branch is this: The IO system should sanely handle all it's PMC types (FileHandle, Socket, Pipe, StringHandle), and handle them in a way such that they are subclassable from PIR. Currently we don't even have a separate Pipe type in trunk yet. Infinoid sent me a patch to add Pipes, and once I get past my current set of issues I will add it into the branch.

In order to make this all work, I changed the Handle type (which had been a mostly-useless parent type of all the IO PMCs) into a delegate type that implements all the non-subclassable attribute types that FileHandle used to use: the low-level file descriptor, the output buffer, etc. So now, FileHandle does not extend Handle, instead it has a Handle PMC as an attribute. Initially I ran into an interesting memory corruption problem where Handles were being destroyed before their parent. When FileHandle was destroyed it tried to flush it's contents to the Handle (which had already been garbage collected), and hilarity ensued. To get around that problem I moved all the buffering logic into Handle, which means any other PMC types that delegate to Handle (specifically Pipe seems like a good candidate for this) will get to have buffering for free if they want to use it. There are still some kinks to work out with this method, but once I get the implementation working it will be easier to get feedback from people.

In other YAPC news, Austin Hastings is working on a very cool C-like language compiler called Close. Basically, it's C-like syntax (curly brackets, if/else/for/while, etc). It looks to be on the same kind of level as NQP (low complexity, no runtime, good for writing PIR-level stuff that's not in PIR, etc). It really appeals to me because it's essentially a familiar syntax layer over PIR, which I find can be very tedious to write large things in as-is. He says he has a 0.1 release he's going to make public sometime soon, and I'm looking forward to that.

I talked to Patrick yesterday about my AIO plans, and he suggested I speak up about it on #perl6 to get some use-case information there. Obviously I don't want to tailor the work directly to Perl6's needs, but a flexible-enough system should satisfy them without too much wrapping and layering. Unfortunately the IO Synopses isn't entirely finalized, and there are a lot of things up in the air still. My AIO implementation, if well-enough received, could play a part in influencing that document.

Also, I MET LARRY WALL. HE WAS IN MY CAR! It was like nerd nirvana. I did, fortunately, manage to contain my school girl giggles and get us to the arrival dinner without making a complete ass of myself. Awesomeness.

So that was my Sunday at YAPC. It's now Monday morning and I'm looking forward to a full day of meeting people, watching cool presentations, and hacking hacking hacking. More blog posts to come, I'm sure.

Saturday, June 20, 2009

YAPC: Day 1

Wake up at 3am. Get in the car, drive 5 hours through the pounding rain. Get lost on CMU's godforsaken labyrinth of a campus. Finally, a local student showed me how to get to the PVM workshop (I think he had to answer the riddle of the sphinx first, I didn't see).

But here I am, and it's kicking ass. Already I've met cotto, particle, pmichaud, Util, kid51 and a whole bunch of new faces. I've seen some great presentations (all by pmichaud, I missed the earlier ones by particle) and I've gotten a lot of good hacking done already.

Infinoid was digging through the threading code and managed to plug a 62MB leak from the threading code. Spurred on by that, I made a few small cleanups to the GC and PMC reuse code, and then started getting to work on the io_cleanups branch. The task on the front burner right now is getting sockets optimized away from unnecessary PCCINVOKE calls like we did for FileHandles and StringHandles in the io_rewiring branch. I also want to get working on Pipes too. In fact, it was this pipe work that clued Infinoid in to the problems in the threading code in the first place. Hopefully we get moving forward with that soon.

So the threading system needs a lot of work, and I may push that into my task queue in the not-so-distant future. I'm sure I'm going to have to do some refactors and fixups on the road to AIO, but I will tackle that when I get there.

I'm watching cotto do some awesome work on the pmcc (the PCT-based utility to parse PMCs). I can't really help with that so I'm just an enthusiastic spectator for now.

Several of the attendees are giving Parrot and Rakudo a workout and are filing tickets and patches. I look forward to digging through some of those later, if nobody beats me to them.

A good start to the conference today, I'm looking forward to the next 3 days here.

Friday, June 19, 2009

L1: The Big Picture

What I think I have been missing in my past several blog posts about L1 (and the ensuing comments), as judged from questions I've received, is an idea about the big picture. What exactly is this thing called L1, and what would it mean for the average Parrot developer? This question is the subject of this post, which by necessity will be quite long.

What Changes?
Let's take a look at normal work flow from the perspective of the VM. The programmer writes a program. This could be directly in PIR, or in a high-level language which is compiled down to PIR. The exact path doesn't matter as far as Parrot is concerned, in either case it takes a PIR input file to execute.

IMCC, Parrot's PIR compiler front-end compiles the PIR code into PBC byte code. It's this PBC that's actually executed by Parrot. Parrot's runcore (I use the singular here, but Parrot actually has several distinct cores, a fact that is not important here) reads the bytecode and separates out the various elements. The opcodes, which have a one-to-one relationship with the PIR statements, are integer numbers in PBC. These opcode numbers are typically passed into a lookup table and a corresponding handler function is called to perform the actions of that op. In C, this operation looks similar to this:


while(!halt) {
    (interp->opcode_table[opcode])(interp, args)
}

This, in essence, is the "core" of Parrot. Each opcode is implemented by a separate C function which is called from a dispatch table by numerical index. Those opcode functions in turn call various API functions to do their calculations. Quick Recap: Opcode is a numerical index into a lookup table. It calls a C function, which in turn calls API functions to perform the required operation. Execution state is saved in the interpreter object (or in a large variety of structures which are attached to it, including the register arrays).

So what does L1 do to change this picture? Actually, not a whole hell of a lot, despite all the wonderful things I've tried to claim that L1 will do. The runcore? doesn't really change. The internal APIs? no change at all really (well, maybe a small change to the interface, but nothing huge). L1 ops are written in C, taking the place that PASM ops occupy now, and PASM ops instead are written in terms of L1 ops. Also, we're going to rewrite PMC VTABLEs in L1, but that's not exactly the core. The runcore now executes L1 bytecode instead of PASM bytecode, but almost everything else is the same.

So what changes? Why add that new layer? I'll explain a few reasons below.

L1 and JIT
Let's look at the JIT system again, god knows I love talking about it. What JIT does is convert an incoming bytecode stream into a machine code stream and executes that directly. So instead of the runcore loop example I showed above, we unroll the loop entirely and call all the functions in a long precompiled sequence. Or, we can get more fancy and eliminate the function calls entirely and simply execute the function guts in sequence. In this case, JIT is the runcore, not just a component of it. JIT takes over the execution of the opcodes by assembling them into machine code and executing the whole block of code directly. Basically, JIT executes the same opcode function logic, but gives us the ability to eliminate the dispatch logic that my previous runcore example used.

Also, a good JIT engine will perform machine-level optimizations, which gives us an added boost.

We want JIT because it lets us squeeze performance out of every loop iteration, and a program with one million opcodes will iterate over that runcore loop one million times. Plus, JIT gets us another benefit, which comes more or less for free: the ability to compile PIR programs into native machine code executables, like you can do already with C. You already have the machine code in memory, all we need to do is output the right file structure and then dump that machine code into a file. Bada-bing, executables.

So we want JIT. No question about that.

The big problem is that the opcode definition itself is very different from the JIT version of that opcode. The opcode logic is written in C and executed directly. The JIT version is a C function that writes the opcode logic at runtime and then executes it. Currently in Parrot, we have to write these things differently. I wrote a post recently that shows examples of each. See the files in /src/ops/* for the opcode definitions, and src/jit/* for the JIT versions of them (which are messy and ugly). When we change one we have to manually change the other for all platforms, which can be difficult if the person making the changes aren't familiar with all our platforms.

What we need is a way to automatically translate PASM opcodes into their JIT definitions AND their direct-executable function definitions. We do this by taking arbitrarily complex ops and decomposing them into atomic parts.

L1 and Control Flow
Let's now look at control flow. We have a PIR program that defines a class Foo and instantiates an object of that type. Here's a snippet:


$P0 = new 'Foo'
$P0 = 1
$P0 += 1

.namespace ["Foo"]

.sub add_i :vtable
...
.end

The third line in this snippet calls the add_i VTABLE, which calls the function src/pmc/object.pmc:add_i(). This in turn checks to see if we have a PIR-based override. We do, so we call back into PIR to execute that override. This creates a new runloop further down on the C stack to execute the PBC of the override. When that returns, the sub-runloop exits, returns to the C code in object.pmc, returns to the op definition in the top runloop, and then returns to the top runloop to execute the next op in our hypothetical little program.

This creates all sorts of problems here, and I'll give an example of one. Let's say an unhanded exception is thrown in the vtable override. The interpreter can't find a handler, so the runloop terminates and exits. This returns back to the code in object.pmc, which returns back to the opcode definition, which returns back to the top runloop, which has absolutely no knowledge of the exception. An unhandled exception should cause the VM to exit, but it gets lost in the call stack and does not have this behavior. The problem is that we have recursive calls into separate runloops separated by an arbitrary depth of C functions in between.

The solution to this problem is to unify to have only a single runloop active at any time in a given execution context, and to not allow recursive calls into other runloops. We do this, in turn, by having a single execution environment that is consistent on itself. This is L1.

L1 and Optimization
Besides the benefits in simplifying JIT, I haven't really discussed how L1 will help speed up Parrot. It's not what L1 itself will do that provides optimizations, it's what L1 enables that will do it. L1 provides a consistent input language where flow control and variable usage can be analyzed. We don't have that now because some of our flow control happens in C and some in PIR, and there's just no way to consistently analyze that. We can do flow analysis to optimize algorithms at the high level before we pass it to the JIT engine to optimize at the low level. We also gain the ability to do escape analysis and lifetime analysis on our PMCs for a variety of optimizations, especially in the GC, the PMC allocator, and maybe a few other places. We can also simplify (maybe eliminate entirely!) the stackwalking code in the GC. We can do all these things only after we have L1.

Conclusion
So I've tried to give a wholistic high-level overview of the current environment and why I think we need a new low-level code form. In terms of architecture, not a lot has to change to enable L1. However, the benefits that we can get from having it are huge in the long run.

L1: The Implementation

Finishing my short trilogy of L1-related posts today, I am going to talk about the actual implementation of this phantasm. This is an area where chromatic and I disagree pretty strongly, so I will just say that this is not canonical and there are plenty of other differing opinions about all this stuff. PerlJam said pretty eloquently on IRC today:

PerlJam> whoever implements something first is likely to have a strong influence on what the final implementation looks like

With that in mind, I've been reticent to publish this blog post for fear any lousy ideas and misconceptions I have will negatively taint the entire design process. I was even more hesitant when I saw three times today alone my previous L1 blog posts being used to explain the idea to other parroteers. It turns out that what I've written so far is the only concise written account of this little language, besides the disjoined conversations that have occurred randomly on IRC and email.

My conception of L1 is a low-level set of operations that represents a nearly direct way to interface with the Parrot API. I'm envisioning a one-to-one relationship between special API functions (in C), including VTABLEs, and L1 operations. We will have L1 ops for most of the VTABLE functions (with the implicit understanding that the number of VTABLE functions is going to shrink dramatically). I've mentioned the analogy of microcode before, and I think it's a fitting one. Microcode, as traditionally used, is a set of lowest-level machine instructions that usually map pretty closely to the underlying circuitry. Complicated and multi-stage operations can be implemented as a sequence of multiple atomic micro-ops. Also, there is the benefit that the machine code format becomes divorced from the underlying circuitry, so that one can be more easily modified without having to modify the other.

In my mind L1 should satisfy these conditions:

Should map closely to the underlying "circuitry". That is, the operations that Parrot understands (API Functions) should map directly to individual L1 ops. Notice that the capabilities of our "machine" (virtual machine) is more expansive then the capabilities of an ordinary hardware machine.
We need to be able to separate PIR (which should stay relatively stable for our end users) from the underlying engine that runs it. This way we can make major improvements to one without having to redo the other.
The operations should be simple and atomic enough to be executed quickly, JITted without much effort, and easily analyzed for optimization. This means they should be very simple logically, and as much as possible approach a common implementation pattern.
Provide all the functionality needed to implement all flow control devices currently implemented in PIR and C. Right now we switch back and forth between PIR <-> C, and that's costly for a number of reasons. In an L1-enabled Parrot, control flow should never leave L1, and should be able to do everything both PIR and C do (including calling into arbitrary external library functions and arbitrary PIR functions).

All that said, here's a basic list of operators that I think will be good L1 ops:

Converting to/from PMCs: boxint, boxfloat, boxstring, unboxint, unboxfloat, unboxstring
PMC Ops: clone (shallow), new, destroy, getattr, setattr, vtable (this could be one large op, or a family of ops to call each VTABLE entry)
Register Ops: move, clear
Basic Arithmetics: addint (i = i + i), addfloat (f = f + f), subint, subpmc, subfloat. (Include int, and float variants of arithmetic operators mul, div and mod).
String ops: concat (s = s . s), sfront (s = substr s, 0, n), sback (s = substr s, n)
Parrot Core Ops: sched (sets a PMC for execution, backend for throw, raise, schedule. Also used for launching a new thread, etc), findpfunc, findcfunc, findfuncsig (find a function by name and signature) loadlib, exec (basically "invoke"), cont (create continuation. a return is a "p = cont, exec p" operation). setnamespace, getnamespace.
Control Flow: jump and branch primitives, halt, etc.

This is a pretty limited list of ops, and I admit that I'm probably missing a few and have probably added a few that are extraneous. Keep in mind when reading this though that L1 is designed for speed and ease of execution, not for programmer convenience. L1 ops need to be trivially simple: a single arithmetic operation, a single VTABLE call, a single API function call, etc. Painfully, obnoxiously simple. PASM and PIR are better for the average programmer to use.

L1 is going to be register based, it doesn't make sense to do it any other way. I don't know how exactly we are going to handle registers, since each L1 op is going to need to be able to take in registers for its destination and source arguments, and also needs access to temporaries so that the underlying L1 op stream isn't overwriting data in PIR-level registers. I draw strong parallels to the MIPS architecture which reserves certain hardware registers to be used for implementing instruction macros without having to overwrite user-accessable registers.

Enough with the explanations, here's a quick idea of some ways that PIR ops could be written in L1:


op add(out PMC, in PMC, in PMC) {
vtable[add] $1, $2, $3
}

This snippet shows a few things. First off, I have no idea how to intelligently provide access to several hundred VTABLEs. Do we have one L1 op per vtable, or do we provide a lookup mechanism? I don't think we want anything like this (in pseudocode):


func = find_vtable obj, "add"
result = call func, args

VTABLE calls are very common operations and it makes good sense that it would be a simple, atomic operation to invoke one. Looking it up by string is definitely not a simple or reasonable way to do it. Using symbolic integer constants to look them up instead of using strings is slightly better, but you still have the problem of having to find the vtable first and invoke it second, two steps for such a basic common operation seems lousy to me. The alternative is to use one L1 op per vtable:


vtable_add result, src1, src2

But again, we would end up with a huge number of such ops to call a huge number of VTABLES. This might provide impetus for us to decrease the total number of them in existence, or it might just lead to unhelpful explanations to "do it all in PIR instead" when L1 becomes so ugly and unmanagable that people sweep it under the rug and choose to ignore it completely.

In the example above I used the same argument placeholders $1, $2, etc that are currently used in .ops files, and I think that's fitting. These op definitions are little more then macros and the values for these arguments can be substituted in very simply when a program is converted from PBC to L1BC. If we're using that notation for arguments (which represent the register set that is available for use from PIR), then we would have to use a different notation for representing the set of temporary registers which are reserved for use internally by L1. I'll kick out a few ideas, but I don't have any strong preference how these register sets are delimited. Consider for instance the case of the swap opcode:


op swap(inout PMC, inout PMC) {
move [P0], $1
move $1, $2,
move $2, [P0]
}

Here, the [] brackets denote a temporary L1-only register. Again, just one out of a hundred equally good (or bad) ideas for this. We do need a way to specify whether the register is a P, S, I or N still, and to address them from each family, so in some respects we're stuck with the "P0" base string. I won't really discuss this any further because it's so open and completely fruitless.

Implementing PIR in terms of L1 ops is only part of the problem, the rest is trying to implement PMCs in it too. For PMCs, we need these kinds of operations, in addition to the list I mentioned above:

allocate, access, and free memory buffers and custom storage
Access and manipulate attributes
freeze and thaw, including attributes and children
mark attributes and children for GC
access and manipulate PMC flags, pmc_ext, pmc_sync, and other fields

Again, I'm sure this list isn't complete, but it should give a good idea for the kinds of things we need L1 to be capable of.

Tuesday, June 16, 2009

Parrot 1.3.0 "Andean Swift"

I put out the 1.3.0 release earlier this afternoon, here's the release announcement:

On behalf of the Parrot team, I'm proud to announce Parrot 1.3.0 "Andean Swift." Parrot is a virtual machine aimed at running all dynamic languages.
Parrot 1.3.0 is available on Parrot's FTP site, or follow the download instructions. For those who would like to develop on Parrot, or help develop Parrot itself, we recommend using Subversion on our source code repository to get the latest and best Parrot code.
Parrot 1.3.0 News:
- Core
+ Optimized parts of the IO system
+ Fixed inheritance hierarchy of FileHandle and Socket PMC types
+ Fixed leaks involving subroutines and Parrot_Context
+ Cleaned up and refactored GC internals, including fixes and optimizations
+ Optimized PMC class manipulations to use type numbers instead of string names
+ Fixed problems involving hashval calculations in strings
+ Removed unnecessary MULTI dispatches in built-in PMCs
+ Fixed memory leaks involving PMCs that were not properly destroyed
+ Fixed creation of PMCProxy PMCs in correct namespaces
+ Added preliminary Pipe support
+ Fixed cloning of Object PMCs
+ Added root_new opcode
+ Added initial versions of Packfile PMCs with read/write capabilities
- Compilers
+ Fixed several memory leaks in IMCC
+ Updated PCT to use root_new opcode
+ Added support for keyword "self" in NQP
- Documentation
+ Improved and expanded /docs/book
+ Updated project documentation
+ Defined 'experimental' status and procedures in DEPRECATED.pod
- Miscellaneous
+ Cleaned code and improved code-level documentation
+ Various bugfixes, code cleanups, and coding standard fixes
+ Added an experimental compiler library to help use PIR libraries from HLLs
+ Updated OpenGL library and examples to support experimental HLL import
Thanks to all our contributors for making this possible, and our sponsors for supporting this project. Our next release is 21 July 2009.

Enjoy!

I picked the name "Andean Swift" to pay homage to the serious optimization work we've done this month. I wanted a fast bird, and naturally I thought of using "Peregrine Falcon" first. However, considering chromatic's awesome work in plugging memory leaks, I thought it would be more fitting to pick a bird that was both fast and small. We can reserve "peregrine falcon" for a future optimized Parrot par excellence. Maybe that will be the one where we introduce the super-fast L1 operations, or a faster concurrent GC, or something else extraordinary.

Monday, June 15, 2009

Pair Programming

I was talking with chromatic tonight about Parrot development. One thing that we both want to talk about at YAPC is development priorities, although admittedly we want to talk about them from different angles. chromatic suggests that group work, even pairing up would be a great way to get high-quality code written for important tasks, and I agree. Some of the best work in Parrot-land has been written by pairs of people who are highly synchronized. Just this month Infinoid and I did some amazing work on the IO system together and squeezed a lot of performance out of the bird.

I've said before and I'll say again, that getting people to focus in a volunteer-driven community is like herding cats. You just can't force volunteers to work on any one particular task, although I am a firm believer that you can do a lot to motivate them so the choose to do what's needed. Of course, this is more of a sprint technique then a marathon one, you're not going to get people to do the work that they don't personally want to do for too long. Any longer and it leads to serious burnout.

I am definitly able and willing to work on things that the project needs, although I have plenty of my own pet projects that provide me personal satisfaction. I am, for instance, hell-bent on putting AIO together starting this month. I've also developed a personal vendetta against the GC system that I plan to tackle in the not-so-distant future. Of course, I'm pretty sure the GC is going to become a community priority eventually, and then I'll just be another part of a (hopefully large) development team.

When prompted, chromatic listed these few things off the cuff that he thought were current development priorities:

Parrot Calling Conventions system refactoring and optimizing
Installable Parrot
HLLs running from an Installable Parrot
Fixing Multiple Dispatch semantics (which is really an ancillary part of the PCC work, I think)

To this list I would suggest that the following things are also, more or less, pre-2.0 development priorities. In no particular order:

Replace IMCC with PIRC
The Garbage Collector
The JIT system
The packfile system
L1

I'm sure there are a few more things that I am just not thinking about right now, but I think this is a pretty good representative list of ongoing development tasks that need some work. There are also plenty of other things which aren't priorities but would definitely be nice to have. Follow that with a long list of "green fields" wishlist features that aren't necessary but would be very cool to have.

I'm hoping beyond hope that we can talk about all this cool stuff at YAPC next week. I know the time is really only allocated for a hacking workshop, but maybe some afterhours conversation can address some of these things.

Sunday, June 14, 2009

User Access Restrictions

I'm a big believer in the open culture movement. My work on open-source software and open content education materials should be demonstrative of that. However, I know that "open" isn't good all the time, and there are some legitimate places where openness is a bad thing and a few access restrictions are necessary. That said, I will say that organizations tend to go overboard in this department, not able to separate which restrictions are truly "necessary" and which only seem that way to people who don't know how liberating openness can be.

The work I am doing for WiTTIE is proving to be quite challenging because we are having to add lots of custom user access permission facilities to the wiki. The problem is that we are working together with classrooms from around the world, and teachers in different places are bound by different rules and laws and needs. Some teachers are going to need to prevent all public access to the materials that their kids write (even read access!), some will not. In all cases we are preventing non-users from editing, and we are preventing non-participants from getting accounts in the first place, so hopefully spam and evil nonsense will be completely non-existent.

The problem is that we are trying not only to use wikis for a particular educational experiment, but also to sell it as a viable teaching tool that teachers can rely on going forward. If we try to push a radical ideology in addition to a strange new software tool, we're likely to sell very few teachers on either. We have to take into account the needs of our target audience and let some bits of the ideology fall by the wayside. It's my hope that the continued successes of wiki platforms in promoting open collaboration will incrementally change the minds of people who demand closedness as the only path forward, but some people can be so entrenched in their processes and beliefs that such incremental changes will never happen.

MediaWiki, if you are willing to disable caching and a few extra services can provide a very robust platform to manage read and write access to pages based on a matrix of user groups versus namespaces. This is what I am calling course-grained control. Groups of users have certain permissions across an entire group of pages. All members of the group have the same sets of permissions to each page in the namespace, and this schema works very well in MediaWiki. The problem arises when we need more fine-grained controls, specifying that individual users have different rights from the rest of the group for particular pages.

Consider the case of where a student is tasked to perform a particular assignment without any collaborative assistance. You want to see exactly what the student can produce without relying on help or input from other students. In this case, you would want to make the page read-only but whitelist the particular student so he would be able to get around the permission for that one page. Or, consider the case where a teacher posts grades for completed assignments on the wiki, and only wants each student to be able to view their own. In that case we would make pages completely read restricted, but allow students to be whitelisted to view the pages that contain their particular grades only.

And because these kinds of permissions are going to change over the course of a semester, we need a decent interface for teachers to be able to set them. Notice that what is "decent" for a tech-savvy coder like myself is far different from a tech-n00b teacher in rural Africa, or India.

I've kicked around a few ideas for how to make pages secure using this kind of fine-grained mechanism. We're looking at a lot of existing user-access extensions now to get more ideas, and hopefully we can find one that does exactly what we need. I doubt it, but it is possible. What I suspect we are going to end up with is a whitelisting mechanism where pages in certain namespaces are locked down for zero access by default, but add a teacher-editable whitelist page to specify certain users are allowed to bypass those restrictions.

So if we have a namespace Permissions: that was only editable by teachers, it could pages for each page in other namespaces. So the permissions page for "User:Whiteknight/" would be at "Permissions:User:Whiteknight/" or even "User permissions:Whiteknight", etc. For ease we might just create a new database table entirely and use a special page Special:PageWhitelist to deal with our needs. Each page would then have a tab that points to the associated permissions interface. From there you would show the default access settings for all pages in the namespace plus a user-editable whitelist. I've been hoping to find a way to unify this fine-grained functionality with the existing page protection mechanism (since that gives us a per-page editing restriction), but I'm not sure it will be doable that way. I'll keep looking at that, and would be very grateful if somebody more knowledgable then myself could give some insight about it.

Jason is getting started with the feedback and peer-review extension, and Nick is starting to take a look at existing extensions to tackle these user access restrictions. We've also added a new member to the team, Michael, who is a student in Jen's class and will be helping primarily as a translator between teacher-speak and coder-speak. He'll also be helping to get requirements for our various components nailed down, because there is still a lot of wiggle in the specifications at this point. Once we get specifications nailed down I have high hopes that we will be able to find suitable solutions quickly.

Blog Closed

Pages