Microprocessor analogy
When you look at modern microprocessors, you see complex instruction sets like x86. In traditional machine code, each bit of an opcode travels down wires and acts as a control signal to the actual circuit hardware. The problem is that the opcode structure becomes intimately tied to the architecture of the hardware: if you change one, you need to change the other (because different bits now need to travel down different wires to different circuits), and then things built on top break: your assemblers need to output new machine code, your pre-existing binary executables are no longer valid, and so on.
In addition, some machine code words are very complex. x86 has hundreds of operations, and each operation can use a large number of addressing modes. Multiply those together and you get thousands of permutations. A single op can have a number of distinct behaviors depending on the types of its sources and destinations: add ax, bx is quite different from add [ax], bx, which is different from add ax, [bx], and so on. Deciding which behaviors to activate for each opcode, out of such a large set of opcodes, is hard to do.
Reasons for L1
Now consider the idea of converting those "high-level" machine code words into "low-level" microcode, where each microcode operation is small, simple, atomic, and fast. Because each of these lower-level instructions is simpler and atomic, they are easier to use in automated translation and JIT compilation.
Right now many of our critical code paths are written in C, and many are written in PIR. It is immensely expensive to move data between PIR registers and the C system stack to call functions in each language, and we incur that cost very frequently. In addition, the boundary between the two costs us opportunities: we cannot easily optimize across it, we cannot easily inline functions, and we cannot easily profile (we can typically profile C or profile PIR, not both at once).
L1 is an abstraction that lets us insert a new layer into Parrot to reduce algorithmic complexity at the global level and give us more opportunities for optimization than we have in the PIR/C combination so far. Other projects like TraceMonkey are using powerful optimization techniques such as polymorphic inline caching (of which Parrot has a basic and messy implementation now, already deprecated), trace-based JIT, and context threading. Together these optimizations are producing some stunning benchmark results for leading-edge JavaScript interpreters.
What L1 will look like
So the question invariably arises: what will L1 look like to the average programmer? Will it look like some grizzled assembly language? Will we be forced to reimplement large swaths of Parrot in some freaky home-brewed assembly? The answer, I think we are all pleased to consider now, is no. We have skills, background, and tools for programming language design, and we have several existing parsers for languages that we could utilize. One specific possibility, as mentioned by pmichaud, is to write the core PMC types in NQP. NQP would need some extensions: direct register addressing, definition of vtables, and definition of attributes which may be references to arbitrary C data structures. It would also need a new backend that outputs L1 bytecode instead of the PIR->PBC path that PCT uses now. We also have a new language in development by Austin Hastings called "Close", a very low-level language based on C that will eventually have all the capability of PIR (and therefore L1), but with familiar C-like syntax.
What this means is that there won't need to be a human-editable form of L1 bytecode, because we can write it in NQP, Close, or even other high-level languages that currently run on Parrot (although we will want one with a small runtime, or even a variant with no runtime). The short answer: L1 looks like any language we want it to look like (but probably Perl6-like in NQP or C-like in Close).
Next Steps
So what are the next steps? Well, pmichaud suggests that a good way forward is to actually start prototyping some key core PMC types in NQP now, so we can get a feel for what syntax and semantics we will actually want and need. Once we know what our needs are, we can start modifying NQP to support our new syntax for dealing with these things.
With the syntax requirements set, we can modify the NQP parser to output plain C for now, and eventually redo the backend to output L1 instead.
chromatic suggests that with our current plug-in architecture we could create a library of L1 dynops, and create an NQP backend to target those. Once we have core PMCs as written in NQP executing only on L1 dynops, we can start the rest of the conversion process.
Here is a quick checklist:
1. Write core PMC types in NQP, making careful note of what syntax we need to add to support all existing operations. Good candidates for this prototyping would be Integer, ResizablePMCArray, and Hash (which is soon to be renamed to AssociativePMCArray).
2. Write existing PIR opcodes in NQP, making careful note of what syntax we need to add to support existing operations.
3. Write a dynop lib for the L1 ops.
4. Modify the NQP compiler to have all the necessary syntax from steps #1 and #2, and to output only those L1 opcodes created in step #3.
5. Modify IMCC (or PIRC, depending on time frame) to output L1 or PBC, depending on some kind of switch.
6. Move the L1 opcodes into the core instead of being a dynoplib.
7. Make the final switch: all ops and PMCs get compiled from NQP into L1 during the normal build process (possibly using a bootstrapping step).
8. Optimize like rabid ferrets.
The cost of moving stuff between C and PIR is mainly due to the fact that Parrot's register frame is completely dynamic: the assortment and number of registers may vary in each function call. This makes it easier to write compilers, but at the cost of added complexity in the interface between managed and native code. With a fixed register set, moving data between native and managed code incurs a predictable, well-defined cost, and JIT optimization of managed code becomes considerably easier. The complexity is then moved back toward the compiler writer, who takes responsibility for optimizing register allocation. But this is not at all a new story in compiler theory, and a pool of good algorithms is available to solve these problems. Why not choose to simplify the architecture of the VM and demand a bit more from compiler writers?