Question 1: Why Is L1 Better Than PASM?
Another common rephrasing of the question would be "Why don't we just extend PASM to do what we want L1 to do?". I would turn the tables and ask "What is so good about PASM that we can't even conceive of a replacement?". Popular lore has it that Thomas Edison churned through 10,000 failed designs before he came up with the working lightbulb. We should not be so foolhardy as to think that we can't have made a fundamental mistake with PASM.
The problem with PASM is that it's the wrong level of abstraction. Frankly, it's too high level and is focused more on ease of use by the programmer than ease of execution by the VM. Don't believe me? I suggest you look at all the hundreds of files that are currently written in PIR: Main programs for HLLs, runtime libraries in HLLs, NCI wrappers for libraries, the entire test suite, etc. If PIR wasn't so great for people to be writing, it would long ago have been identified as a pain point and replaced with something better. Writing PIR is currently a required part of building an HLL compiler, and very few people are complaining about that, or complaining enough to have it changed! This in and of itself is proof that PIR (which is just PASM in disguise) is more suited for use by humans than by the machine.
And it's not like we don't have the tools or the skills necessary to write a replacement language compiler. In Parrot world you can't shake a stick without hitting somebody who has written one. If PIR was too low-level, we would have a compiler for a replacement language prototyped in less than 5 hours. Seriously.
L1 is better than PASM/PIR in this case because we can specially design it from the ground up with the purpose that it should be easy and fast for the machine to execute. We can take lots of steps to make it easy and fast to execute, a lot of steps that the designers of PASM didn't take because they didn't know. Back when Parrot was first being designed, how could the developers have known about the amazing performance of JavaScript engines such as SquirrelFish or TraceMonkey that have come out recently? The answer is that they couldn't have known what we know now about VM design.
We can focus on execution performance to the exclusion of human readability because people shouldn't be writing it directly. Let me repeat: we have the tools, the expertise, and the experience as a community to write a higher-level language compiler to do the hard work for us. Regardless of what intermediate language we end up with on Parrot, I don't expect humans will have to be manually writing any of it.
Question 2: Won't another layer make things slower?
Here's what we have right now:
People assume that I'm planning on just adding another compiler step and then another target interchange format for L1 at the end of this list. Not so. PCT has the potential (though it hasn't been implemented yet) to output code in any format. Why don't we do this instead:
And if you see that image and ask "Why do we need PIR/PASM in this case at all?", now you're thinking along the right lines. Imagine a world where we never ever need to write anything in PIR or PASM by hand again. Now imagine that same world where our tools like PCT don't output PIR as an intermediate form but, unbeknownst to you the programmer, directly output executable bytecode like L1. In such a world as we are imagining here, do we need to keep PIR or PASM around? Do they serve any purpose? Maybe as a very small bootstrapping layer to build PCT in the first place, in which case there is a lot of bloat that we could throw away, and we would only need a PIR compiler for the miniparrot build step and wouldn't need to build IMCC into libparrot or the Parrot executable.
It's not that we're adding a new layer of complexity. Rather, we're starting to realize that PASM is the wrong level of abstraction for our needs in Parrot, so we are replacing it wholesale with L1.
Now obviously there are intermediate steps between here and there where things are going to be slower and more complicated: We will need to use PIR as an intermediate step until PCT is capable of outputting L1 directly. But let me ask this: Do we want PCT to output something that executes slowly, or something that executes more quickly, when all the necessary work is done?
Question 3: L1 is going to be lower-level than programmers like
No question that PASM/PIR are far more friendly to the programmer than any conception of L1 is going to be. PIR and PASM are supposed to be sufficiently low-level that people don't want to program in them, but at the same time they aren't enough of a pain point that we made a replacement language into a priority. If L1 is such a pain point for programmers as I think it will be, we will write a compiler for a better language so we never need to write L1 directly.
We already have NQP which is a thin Perl6-like language that works well with Parrot's execution model. We also have Close in development which is going to be a C-like language that does almost the same thing. Assuming both of these languages have all the capabilities we need, why wouldn't we use these to do all our coding and never use PIR, PASM, or L1 again? Remember, I keep saying this: We have PCT, one of the most powerful compiler construction tools ever conceived. We should be using that to make languages that insulate us from whatever intermediate language Parrot uses. We should be reaching a point where we never ever ever have to write or even read another line of PIR or PASM ever again. Ever.
And when we do reach that point where we are so completely insulated from it, it won't matter to the programmer what intermediate language Parrot uses, because we won't be seeing it.
Question 4: A lot of development effort is going to be wasted
This is a good concern, that a lot of things people have spent a lot of effort to develop are basically going to become obsolete. But I have to also ask the question: "Why don't we all just write in Fortran, considering that so many people have spent so much energy on that compiler?". The effort isn't all wasted, we did learn a lot of important lessons about Parrot and the right way forward, and we've used all the things we've developed to bootstrap the creation of better tools and better ideas.
PIR and PASM got us to a place where we have PCT and several HLLs in active development. That doesn't mean we need to be trapped underneath these languages forever. We use them as a bootstrapping layer to build better tools, and use those instead.
Question 5: How does L1 compare to C?
Alternatively, why are we rewriting all our C code in L1? Let me ask what is the difference between C code and Assembly code? Everything you can do in assembly you can do in C. You will probably need to call a library to do some things, but if you can do it in ASM, you can do it in C. A compiler converts both down to the same machine code that is executed by the same hardware. You can say that one is "faster" than the other, because your average compiler isn't always smart enough to generate "good" machine code, but given the proper optimizations, there should be no speed difference. In other words, given the exact same machine code, it doesn't matter to the processor whether that code was written by a human in assembly and assembled to machine code, or written in C and compiled into machine code.
All that information out of the way, let's conceive of L1 as being a portable assembly language with all the same exact capabilities as C. Given the same exact capabilities and a good compiler (for our purposes, a JIT engine is just a "compiler"), both sets will be able to do the same stuff, and can be converted down into the same machine code for execution by the hardware at the same speed.
So given that the two are equivalent, why have L1 at all, and why not keep everything in C? The difference is semantics and algorithmic complexity. It's a difference of where the primary control flow happens. Right now we have a register-based system (PIR) running on top of a stack-based system (C). Switching between the two is slow, and unfortunately we switch pretty often because neither one supports all the semantics we need. L1 gives us an opportunity to access the low-level details that C has (calling native functions, accessing memory pointers, etc) but to use the high-level semantics that PIR requires (register-based operation, etc). We gain the ability to keep control flow executing in only a single context, and to eliminate all the algorithmic complexity of switching between semantic contexts.
An L1 dispatcher handles all control flow, able to call C functions directly as needed and redirect to offsets in bytecode. It can handle function calls, exception throwing, vtable overrides, and other CPS magic without having to recurse down the C system stack. And we gain all sorts of potential for optimizations because we have a unified environment like this.
In short, L1 allows us to design a language with the power and flexibility of C and the semantics of PIR, which in turn will let us reduce algorithmic complexity at the global level.
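To make the idea of a single dispatcher a little more concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the op names (set, add, native, call, jump_if_pos, ret), the run function, and the toy program are assumptions, not Parrot's or L1's actual design. What it tries to show is that VM-level calls live on an explicit return stack inside one loop, so a deeply recursive program never recurses on the host stack, while host functions can still be called directly.

```python
import math

def run(code, natives):
    """Flat dispatch loop (hypothetical op set, not real L1)."""
    regs = {}            # toy register file
    return_stack = []    # explicit stack of return offsets, replaces host recursion
    max_depth = 0
    pc = 0
    while True:
        op, *args = code[pc]
        if op == "set":              # set dst, constant
            regs[args[0]] = args[1]
            pc += 1
        elif op == "add":            # add dst, a, b
            regs[args[0]] = regs[args[1]] + regs[args[2]]
            pc += 1
        elif op == "native":         # native dst, name, src: call a host function
            regs[args[0]] = natives[args[1]](regs[args[2]])
            pc += 1
        elif op == "call":           # call offset: push the return offset, then jump
            return_stack.append(pc + 1)
            max_depth = max(max_depth, len(return_stack))
            pc = args[0]
        elif op == "jump_if_pos":    # jump_if_pos reg, offset
            pc = args[1] if regs[args[0]] > 0 else pc + 1
        elif op == "ret":            # return to caller, or halt at top level
            if not return_stack:
                return regs, max_depth
            pc = return_stack.pop()
        else:
            raise ValueError(f"unknown op {op!r}")

# Toy program: a sub at offset 6 adds n to total, decrements n,
# and calls itself again while n is still positive.
PROGRAM = [
    ("set", "n", 5000),                   # 0
    ("set", "total", 0),                  # 1
    ("set", "minus1", -1),                # 2
    ("call", 6),                          # 3
    ("native", "root", "isqrt", "total"), # 4
    ("ret",),                             # 5: halts, return stack is empty
    ("add", "total", "total", "n"),       # 6: start of the sub
    ("add", "n", "n", "minus1"),          # 7
    ("jump_if_pos", "n", 10),             # 8
    ("ret",),                             # 9
    ("call", 6),                          # 10: VM-level recursion
    ("ret",),                             # 11
]

regs, depth = run(PROGRAM, natives={"isqrt": math.isqrt})
```

Under these assumptions the toy program nests 5000 calls deep at the VM level, which is well past Python's default recursion limit of 1000, yet the host stack never grows because every call is just a push and a jump inside the same loop.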
Question 6: Why not just fix PASM/PIR?
We can't really fix PASM in place because we have conflicting needs: We simultaneously need to extend it in order to match all the power of C, and shrink it to make it easier to analyze, optimize and JIT compile it. What we would end up doing is growing PASM to include what I now think of as being "L1", and then we would need to shrink away most of what we now think of as being "PASM", and we would be essentially left with only L1.
Keep in mind that this is not necessarily a bad development path for us to follow, but the end result is L1, not what we currently know as PASM. We can start by creating an op library for the new L1 ops, start transitioning everything over to use them, and then start deprecating away our old PIR ops (or rewriting them in terms of L1). What remains at the end of this is the small selection of L1 ops and various HLLs (I'm including PIR/PASM here as an HLL) which compile to L1, and L1 is executed directly by Parrot.
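As a rough illustration of what "rewriting old ops in terms of L1" could look like, here is a small Python sketch. The op names on both sides (const, add, mul for the core; inc, double, muladd for the legacy ops) are hypothetical and are not taken from PASM or from any L1 proposal. The sketch only shows the shape of the idea: legacy ops become templates over a small core set, and only the core set is ever executed.

```python
# Hypothetical core op set that the VM would execute directly.
L1_OPS = {"const", "add", "mul"}

# Hypothetical legacy ops, each described as a template of core ops.
# "tmp" is a scratch register reserved for expansions.
LEGACY_EXPANSIONS = {
    "inc": lambda dst: [("const", "tmp", 1), ("add", dst, dst, "tmp")],
    "double": lambda dst: [("const", "tmp", 2), ("mul", dst, dst, "tmp")],
    "muladd": lambda dst, a, b: [("mul", "tmp", a, b), ("add", dst, dst, "tmp")],
}

def lower(program):
    """Rewrite a mixed program so that it contains only core ops."""
    out = []
    for op, *args in program:
        if op in L1_OPS:
            out.append((op, *args))
        else:
            out.extend(LEGACY_EXPANSIONS[op](*args))
    return out

def execute(program, regs):
    """Tiny interpreter that understands only the core ops."""
    for op, *args in program:
        if op == "const":
            regs[args[0]] = args[1]
        elif op == "add":
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "mul":
            regs[args[0]] = regs[args[1]] * regs[args[2]]
        else:
            raise ValueError(f"not a core op: {op!r}")
    return regs

legacy = [
    ("const", "x", 3),
    ("inc", "x"),                 # x = 4
    ("double", "x"),              # x = 8
    ("muladd", "x", "x", "x"),    # x = 8 + 8 * 8
]
lowered = lower(legacy)
regs = execute(lowered, {})
```

The interpreter never needs to learn about inc, double, or muladd, which is the point: the legacy ops can be deprecated at whatever pace is convenient while the executed op set stays small.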
Question 7: What does L1 Buy Us?
L1 is going to give us a number of benefits:
- Decreased algorithmic complexity, because we're not having to shuffle data between the PIR registers and the C system stack
- Improved, easier, faster JIT. A "fixed" JIT
- Potential to plug more easily into existing JIT engines (LLVM and libJIT)
- Potential for trace-based JIT, where we trace out "hot spots" in the code and JIT them with type-specific information to speed up dispatch.
- Potential for high-level optimizations including subroutine inlining, dead code elimination, etc
- Potential for context threading, where we try to align the VM with the control flow of the underlying machine, to maximize branch prediction and caching at the hardware level
- Potential for improved GC and resource allocation performance because it will be easier to analyze where memory is allocated and where it falls out of scope. This includes "escape analysis", where we determine the lifespan of a reference and are able to deallocate it more quickly and efficiently.
- Potential for easier profiling, because everything will be in L1, we only need one tool to analyze L1 control flow (which we can share with the trace-based JIT and the GC escape analyzer)
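To give a feel for why a small, uniform op set makes the optimizations in this list easier, here is a sketch of dead code elimination in Python. It assumes a hypothetical straight-line program of side-effect-free ops in the form (op, destination, sources...), which is an illustration only and not a real L1 format. A single backward pass tracks which registers are still needed and drops any op whose result is never read.

```python
def eliminate_dead_code(program, live_out):
    """Drop ops whose destination register is never read afterwards.

    Assumes straight-line code and ops with no side effects (hypothetical format).
    """
    live = set(live_out)          # registers needed after the program ends
    kept = []
    for op, dst, *srcs in reversed(program):
        if dst in live:
            kept.append((op, dst, *srcs))
            live.discard(dst)     # this op defines dst, so earlier values are dead
            # register operands are strings; numeric constants are skipped
            live.update(s for s in srcs if isinstance(s, str))
        # otherwise the result is never used, so the op is dropped
    kept.reverse()
    return kept

program = [
    ("const", "a", 2),
    ("const", "b", 3),
    ("mul", "c", "a", "b"),
    ("add", "d", "a", "a"),       # d is never read again
    ("add", "e", "c", "a"),
]
optimized = eliminate_dead_code(program, live_out={"e"})
```

With only a handful of op shapes to consider, the same kind of pass could in principle be shared by the JIT, the profiler, and the escape analyzer, which is the sort of reuse the last bullet is hoping for.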
> "We should not be so foolhardy as to think that we can't have made a fundamental mistake with PASM."
PASM might not be a mistake but perhaps the assumptions Parrot is based on?
> "The answer is that they couldn't have known what we know now about VM design."
I'm not sure that this is true, who gave the link to
ftp://ftp.create.ucsb.edu/pub/Smalltalk/Squeak/docs/OOPSLA.Squeak.html for what purpose?
> "the experience as a community to write a higher-level language compiler"
I hesitate to point to the awful truth, but ... almost none of the HL-languages targeting Parrot are close to being complete (useful, efficient, ...); most are far away from it. Most of them are "dormant", "inactive", "retired" or just toys. https://trac.parrot.org/parrot/wiki/Languages
I'm a bit torn on this whole thing. I can see tremendous advantages on a simpler low level language IF:
1) the operations it performs have a contract that makes it clear when machine-specific features are in use inside an op, so JIT knows it's OK to use them.
2) ops do not need to pass everything through a return/parameter hoop. They should be able to float some registers -- both abstract L1 registers and architecture feature registers -- to the next operation. And JIT needs not to stomp on those either.
3) ops should be coroutines, in that a single opcode would normally be invoked multiple times to complete a task in steps, but if JIT is told how, it could also
4) ops should, except for very rare cases, be restricted to a very small runtime. That means if you are for example strncmp'ing two strings in an op, and they are big enough to cause huge latency, your op should yield back somewhere in there and continue the job.
...but all 1, 2, and 4 of those suggestions were sort of met with "meh, we don't need that." But you do. If you want any sort of control over the application's latency, something, somewhere has to step in and tell developers "this is how much work you can do in one chunk, anything longer has to be split up." Machine language and microcode both come with cycle counts in the documentation and a complete and thorough description of every side effect and possible outcome. L1 could be an opportunity to bring the code and documentation up to this standard. But it doesn't sound like that's what's on the agenda.
There are also a lot of advantages to moving away from C for PMC code. For one, a different language could clean up pointer references so it's harder to reference pointers without passing them through GC hoops, if the particular GC that is running wants that to happen.
It could define a framework to actually formally specify the structures we are using -- e.g. these types of structures are always from this type of GC pool, that type of structure does or does not have a buffer that might be mutated by GC. It could un-uglify working with the internal API and hide all that nasty C-namespace Parrot_subsystem_of_this_other_subsystem_just_friggin_do_it_already() nonsense.
But is NQP malleable enough to do so? Is that really what Close is aiming at? We don't know that yet.
At the same time I'm not entirely convinced that the barriers presented as arguments against PIR are all that insurmountable. PIR(PASM), as you have noted, is also a pretty compact representation, more so than L1 would be.
I'd like to encourage those who would advocate for the "let's fix PIR" approach to suggest how PIR might meet as many of the stated goals of L1 as they can, rather than blow them off as nonconcerns -- because they are well founded criticisms of where we are now.
More good questions!
1) if we use a JIT engine like LLVM or libJIT, those will be able to optimize in machine-specific ways that they know about. I'm not sure if this answers your question.
2) I was actually thinking about this topic today. Do a Google search for "context threading" to see ways that ops can be dispatched to help minimize instruction cache hazards. I think for most compilers we could also specify that ops are dispatched with "fastcall" semantics, and hopefully the optimizer will prevent necessary registers from getting clobbered on every dispatch. I was actually going to play with "fastcall" tonight to see if it would yield any speedups at all.
3) I don't think we need them to be coroutines. Coroutines require a certain notion of saved state which would be too expensive for ops that are called so frequently. A good fast dispatcher to normal subroutines should be sufficient.
4) Parrot doesn't make any hard realtime guarantees, so it doesn't necessarily make sense to make individual ops conform to a particular time profile. This is especially true of "call" ops which can call arbitrary C functions (which themselves could be very expensive).
And why not a human-readable pbc instead of a new language?
My only comment is... given how much investment there's been in PIR/PASM to date, would it not be possible to actually do the "extra stage" approach first, get some PASM/PIR -> L1 translation in place, and look for the other opportunities later on?
Reading this, I kept asking myself, for those pesky low level things you're going to invariably need to do implementing your HLL, if not PIR/PASM, then what? Allusions are made to a yet-to-be-fully-conceived "other" language to implement HLLs in, but PIR makes up the majority of any language which can't rely wholly on the builtins provided by parrot.
Sequence I would think least destabilizing, while still allowing for the refactoring of things...
- you get to focus on implementing a single language -> L1 path first
- there is a ton of existing code you can throw at it for testing, tests around those, etc.
- lets you focus mostly on L1 and the VM side implementation at the outset
- once that's in place, nothing to stop you from building out new compiler bits that target L1 directly
- new intermediate languages built "natively" to L1 emerge with aim to be what you implement an HLL against.
- HLL authors flock like drunken dwarves to a keg to migrate their languages to exploit these languages, due to the sheer pwnage of their capabilities.
Sure... none of that existing code will be able to take full advantage of what L1 offers, and there may be some speed impacts, but it will all still continue to work, the HLL authors get to enjoy a period of stability while the new approach matures, and migration can be taken in pieces in many cases once it's all ready.
I think a period of said stability is important for building a community around all of this, everyone would love perfection, but for the moment, "working the same as it did last month" is an important milestone you may want to consider for a time, preserve some of the runtime as it is, offer a compatibility path as the new is introduced, and you will get people accumulating around this.
I leave you with a story, myself as a newbie into the *nix world and wanting to pick up some languages within. At the time, I was attracted to TCL, and taught myself some, and went out to buy a book on it. Got home, sat down, and after running into a few snags, discovered that what I was trying to teach myself wasn't even syntactically valid on currently shipping TCL versions. I turned then to Perl5 and never looked back, because I wasn't going to bother with something still going through that much flux between releases.
I can't help but feel the same thing every time I take a stab at parrot.
Anyhow, hope this was in some way interesting, from the perspective of someone on the outside who's been looking in for a while.
There are two good options right now for this magical "other language" that I am talking about. The first is NQP, the Perl6-like bootstrapping layer that PCT uses to help write compilers. NQP is nice because it uses Perl6 syntax but has no runtime library, so it's very lightweight. Because it's modeled on Perl6, it doesn't have a lot of the constructs that we would need to make the L1 conversion (mostly pointer-related operations), but those could be added via some kind of extension.
The second option for this language is Close, which actually got posted online yesterday afternoon. Close is a C-like language that has a very small runtime and is very close to the underlying VM. Because it's modeled on C syntax, I think it would be less of a stretch to extend it to do the pointer stuff that we would end up needing.
The sequence that we are generally looking at following to implement L1 is this:
1) Write PMCs and Ops in NQP or Close or whatever, to figure out what syntax and capabilities those languages would need to have
2) Write a dynamically-loadable library of new L1 ops
3) Modify PCT to be able to output L1 ops instead of normal PASM ops based on a switch. Here we can test that HLLs and other programs work with L1.
4) Start compiling the PMCs and PASM Ops into L1, and testing that
5) Make the final switch: Move the L1 ops into the core, make PCT output L1 by default, replace all PMCs and Ops with their counterparts written in NQP/Close.
Thanks for all the comments!
Fastcall is close to what I'm getting at with the "coroutine" meme. Except using more registers, including the extra goodies of whatever architecture one is running on.
Say we defined abstract families of registers (and a small chunk of immediate state memory.) These would not necessarily be modeled in a traditional fashion like "general" or "fpu" but would likely each be an assorted grab-bag.
Over time we'd get a feel for what would make a good set of typical register families and they would each develop a personality -- e.g. family 1 is typically used for this, family 2 for that... and adjust per-architecture which family certain registers are assigned to.
Now suppose we had a special (non calling) L1 opcode that said: "stay out of these families, JIT, until further notice" or "start using this passing style". This opcode would be inserted before any opcode that did anything unusual.
Now the runcore knows A) it needs to stay clear of that family and B) that family should be treated as in-use for green-thread purposes and in certain scenarios may need a save/restore and C) how to shuffle operands around, if special needs are in force. And that's all available to JIT as well.
The point is the data could be kept in registers between operations. Certain sets of operations could be coded to expect this and use it.
So then let's take a look at one of those unusual types of operations. Say we have an operation that loaded family 2 with a set of data, then yielded, then when called a second time, summed the data in family 2 and placed the result in the family normally used to pass return values.
In a normal invocation, the opcodes would look like this when humanized:
USEFAMILY(2)
SUMWITHFAMILY2(foo, 8)
DONEFAMILY(2)
That would sum 8 values located at foo and the result would reside wherever results normally reside. It would actually call SUM twice to do it, however. The default would be to re-call until there is no yield.
Now suppose we had another opcode designed to work with SUM, that takes the absolute values of things already located in family 2. In this case we could get the sum of absolute values by:
USEFAMILY(2)
YIELD1 SUM(foo, 8)
ABSFAMILY2(8)
SUM(foo, 8)
DONEFAMILY(2)
The YIELD1 opcode modifier tells the runcore to proceed to the next instruction after the yield, instead of continuing to call SUM until it completes. We could have any number of interchangeable middle opcodes to jam between the two sums.
Now this example is not that compelling, but I am sure someone could come up with better ones.
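For anyone who wants to play with the idea, the two-step SUM op and the YIELD1 modifier can be roughly modeled with Python generators. This is only a sketch under loud assumptions: the register family is modeled as a plain list, and the function names (sum_op, abs_family_op, run_to_completion, run_with_yield1) are invented for the illustration and do not correspond to any real Parrot or L1 API.

```python
def sum_op(family, source, count, result):
    """Two-step op: step 1 loads the family, step 2 sums whatever is in it."""
    family[:] = source[:count]       # step 1: load data into the register family
    yield                            # hand control back to the runcore
    result["value"] = sum(family)    # step 2: sum the family into the result slot

def abs_family_op(family, count):
    """Middle op: takes absolute values of data already sitting in the family."""
    for i in range(count):
        family[i] = abs(family[i])

def run_to_completion(op):
    """Default runcore behaviour: keep re-calling the op until it stops yielding."""
    for _ in op:
        pass

def run_with_yield1(op, middle):
    """YIELD1 behaviour: after the first yield, run the next op, then resume."""
    next(op)          # run the op up to its first yield
    middle()          # the interchangeable middle opcode
    for _ in op:      # resume the op until it completes
        pass

foo = [3, -4, 5, -6, 7, -8, 9, -10]

# USEFAMILY(2); SUM(foo, 8); DONEFAMILY(2)
family2, plain = [], {}
run_to_completion(sum_op(family2, foo, 8, plain))

# USEFAMILY(2); YIELD1 SUM(foo, 8); ABSFAMILY2(8); SUM(foo, 8); DONEFAMILY(2)
family2, absolute = [], {}
run_with_yield1(sum_op(family2, foo, 8, absolute),
                lambda: abs_family_op(family2, 8))
```

In this model the data stays in family2 between the two halves of the op, and any middle op that works on the family in place can be slotted between them.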
> And why not a human-readable pbc instead of a new language?
Such a language already exists: it's PASM!
"PASM is just a text form of PBC", see PDD06.
But I have been following the development of parrot for 3 years now, and I think that Whiteknight is simply right in thinking that PIR/PASM is not the correct abstraction level: the move from PIR to L1 is similar to a move from CISC to RISC.