Monday, January 25, 2010

The Problem with PIR

I've disliked PIR for a long time, but until now I've never quite been able to put my finger on why. It's a reasonably nice language to use for low-level interaction with Parrot, and it's certainly friendlier than most other assembly-level languages I've used. So what's my problem with PIR?

Let's spin the question around and first ask this: What is the purpose of PIR?

When you talk about an assembly language, you tend to talk about a human-readable language that has a 1:1 relationship with the underlying machine code. On an x86/Windows computer for example I can write MASM code, compile that into machine code and then disassemble that back into MASM code that would be nearly identical to the original input. I can use a simple table lookup to convert between assembly language mnemonics and binary machine code words. This allows, among other things, live disassembly of code in a debugger and, within reason, in-place code modification. OllyDbg is a great program in this regard and was one of my favorites back when I was doing more work in this area.

PIR is not an assembly language. It's close, but if you take my paragraph above as a strict definition of what an "assembly language" is, PIR does not qualify. PIR does not have a 1:1 relationship with the underlying bytecode, and converting between the two forms takes more than a table lookup. The conversion is not significantly more difficult, but it is not that simple either.
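One common way a friendlier syntax loses the 1:1 property is that several surface forms lower to the same instruction. The sketch below uses a made-up toy syntax, not PIR's actual grammar; the `lower` function and the three spellings are assumptions chosen purely for illustration.

```python
import re


def lower(line):
    """Lower one line of a hypothetical sugared syntax to a canonical tuple."""
    line = line.strip()
    m = re.fullmatch(r"(\w+)\s*=\s*(\w+)\s*\+\s*(\w+)", line)
    if m:  # infix sugar: x = y + z
        return ("add", *m.groups())
    m = re.fullmatch(r"(\w+)\s*\+=\s*(\w+)", line)
    if m:  # compound-assignment sugar: x += y
        return ("add", m.group(1), m.group(1), m.group(2))
    # Plain assembly form: mnemonic arg, arg, arg
    op, _, rest = line.partition(" ")
    return (op, *[a.strip() for a in rest.split(",")])


# Three different spellings, one instruction.
forms = ["x = x + y", "x += y", "add x, x, y"]
print({lower(f) for f in forms})
```

Because all three lines produce the same tuple, a disassembler looking at the output has no way to tell which spelling the programmer wrote, so the round trip is no longer exact.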

Another benefit of assembly languages is that they tend to be easy to parse. Ridiculously easy, in fact. Almost all assembly languages use this format for their instructions:

<label>: <mnemonic> <arg>, <arg>, <arg> ...

If all your statements take this form (where the label is typically optional and the number of args depends on the particular mnemonic), parsing is painfully simple. And fast. Here's pseudocode for a basic assembler using statements of this form:

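Statement stmt;
while (!file.end_of_file()) {
    char *line = file.readline();
    byte machinecode[INSTR_LENGTH];
    stmt = parse(line);
    // Remember where each label points so jumps can be fixed up later
    if (stmt.has_label)
        save_label_address(stmt.label);
    // The opcode goes first, followed by one encoded word per argument
    machinecode[0] = mnemonic_to_machine_instr(stmt.mnemonic);
    for (int i = 0; i < stmt.args.length; i++)
        machinecode[i + 1] = convert_to_machine_addr(stmt.args[i]);
    // Append the encoded instruction to the output
    emit(machinecode);
}
// Second pass: patch label addresses into the emitted jumps
update_jump_addresses();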

This example calls several simple, single-purpose helper functions which I won't discuss in detail here, though I could explain them if any readers are interested. The point I am trying to make is that we could build a reasonably fast, stable assembler for a PASM-like assembly language if we optimized for parsability. The whole parser could fit within 300 lines of code and be completely readable and straightforward.
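For readers who want something runnable, here is a minimal Python sketch of the same two-pass idea. The opcode table, the `I<n>` register spelling, and the `#` comment character are all assumptions invented for this toy machine; none of them is real PASM.

```python
# Hypothetical opcode table: mnemonic -> (opcode word, operand count).
OPCODES = {"end": (0, 0), "set": (1, 2), "add": (2, 3), "jump": (3, 1)}


def parse(line):
    """Split '<label>: <mnemonic> <arg>, <arg>' into (label, mnemonic, args)."""
    line = line.split("#", 1)[0].strip()  # drop comments and outer whitespace
    label = None
    if ":" in line:
        label, line = (part.strip() for part in line.split(":", 1))
    mnemonic, _, rest = line.partition(" ")
    args = [a.strip() for a in rest.split(",")] if rest.strip() else []
    return label, mnemonic or None, args


def encode_operand(arg, labels):
    """Labels become addresses, 'I<n>' becomes n, anything else must be an integer."""
    if arg in labels:
        return labels[arg]
    if arg.startswith("I") and arg[1:].isdigit():
        return int(arg[1:])
    return int(arg)


def assemble(source):
    statements, labels, address = [], {}, 0
    # Pass 1: parse every line and record the address of each label.
    for line in source.splitlines():
        label, mnemonic, args = parse(line)
        if label:
            labels[label] = address
        if mnemonic:
            opcode, argc = OPCODES[mnemonic]
            if len(args) != argc:
                raise ValueError(f"{mnemonic} expects {argc} operands: {line!r}")
            statements.append((opcode, args))
            address += 1 + argc
    # Pass 2: emit opcode and operand words, resolving labels to addresses.
    words = []
    for opcode, args in statements:
        words.append(opcode)
        words.extend(encode_operand(a, labels) for a in args)
    return words


PROGRAM = """
        set I0, 0
        set I1, 1
loop:   add I0, I0, I1    # label resolves to word address 6
        jump loop
        end
"""
print(assemble(PROGRAM))  # [1, 0, 0, 1, 1, 1, 2, 0, 0, 1, 3, 6, 0]
```

Recording labels in a first pass and encoding in a second is one simple way to handle forward jumps; the pseudocode above gets the same effect by patching addresses after the loop.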

This assembler, of course, wouldn't provide hooks for things like optimizations but that's the point exactly. What comes out of this assembler would have an exact 1:1 relationship with what went into it. We could make a disassembler that had almost exactly the same program structure too, and do complete back-and-forth translations of the code losing nothing but variable and label names, comments and whitespace.
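The round trip can be sketched with a single table used in both directions. This is a toy under stated assumptions: the opcode numbers are invented, operands are plain integers, and labels are left out so the example stays focused on the table lookup.

```python
# Hypothetical opcode table: mnemonic -> (opcode word, operand count).
OPCODES = {"end": (0, 0), "set": (1, 2), "add": (2, 3), "jump": (3, 1)}
# Inverse table for disassembly: opcode word -> (mnemonic, operand count).
MNEMONICS = {code: (name, argc) for name, (code, argc) in OPCODES.items()}


def assemble(lines):
    """Encode label-free 'mnemonic a, b, c' statements with integer operands."""
    words = []
    for line in lines:
        mnemonic, _, rest = line.strip().partition(" ")
        operands = [int(a) for a in rest.split(",")] if rest.strip() else []
        opcode, argc = OPCODES[mnemonic]
        if len(operands) != argc:
            raise ValueError(f"bad operand count in {line!r}")
        words += [opcode, *operands]
    return words


def disassemble(words):
    """Walk the word stream, using the opcode to decide how many operands follow."""
    lines, pc = [], 0
    while pc < len(words):
        mnemonic, argc = MNEMONICS[words[pc]]
        operands = words[pc + 1 : pc + 1 + argc]
        text = mnemonic + " " + ", ".join(str(w) for w in operands)
        lines.append(text.strip())
        pc += 1 + argc
    return lines


source = ["set 0, 10", "add 0, 0, 1", "jump 3", "end"]
assert disassemble(assemble(source)) == source
```

Feeding in a line with irregular spacing such as `"  set   0,10 "` comes back as `"set 0, 10"`: the instruction survives exactly and only the whitespace is lost.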

And herein lies an interesting dichotomy: assembly language is the lowest-level interface to the machine, but is almost completely disregarded in favor of coding in C or C++. Most programmers don't want to program assembly. Most programmers shouldn't. I can entertain arguments about that point in the comments or even address them in a different post. Assembly language can be a huge waste of time, a huge headache, and your bug rate will tend to be higher than if you used a higher-level language. If the computer that you are using right now has software which represents over a billion lines of source code total, I would estimate that less than one ten-thousandth of one percent of that code was written directly in assembly language. Your bootloader is probably entirely written in assembly language, and small snippets in your kernel and device drivers may have been as well, but that's it. Almost nobody else uses assembly for anything because it doesn't make sense to do it. For almost the same capabilities and almost none of the same headache (or, different but less severe headache) people use C or C++ instead for low-level applications.

I've digressed, but I think some readers are going to understand my point. PIR isn't really an assembly language; it has had too many additions and syntax "improvements" to aid readability and usability for the developer, and it has lost the intimate relationship with the underlying bytecode representation because of them. At the same time, it's not really not an assembly language either: it's not as high-level relative to Parrot as C is relative to the hardware, and it doesn't provide the same kinds of structures or paradigms that a normal structured programming language would. It's nice for the programmer, but not nice enough. Plus, programmers don't want to be using an assembly language in the first place. Why pretty up an assembly-level language for developers who don't want to use one? Why not just give them a low-level structured programming language that presents structures and idioms they will be more familiar with? Sure, PIR offers a large series of complex and limited macros to simulate high-level control structures (which, by the way, are error-prone on many levels and difficult to parse), but why not give developers a real language with those structures properly integrated from the beginning?

So what's the alternative?

NQP seems like a decent language to jump in and serve as the "system language" of choice for most developers on Parrot. Because of its use in PCT, NQP does enjoy a certain amount of "market share" among developers, and that kind of installed base is hard to ignore. Also, it's well-documented. I'm a little bit mistrustful of NQP because it is tied very closely to the Perl 6 language and doesn't necessarily represent the needs of Parrot developers in an agnostic way. I don't have any particular complaints, just a general mistrust of relying on software that has a different motivation than what I need. Also, NQP isn't even hosted in the Parrot repo, though the newest versions of it are copied to the repo for releases. A good systems language, in my opinion, has no motivation besides effectively bridging the gap between the raw power of the machine and the friendly abstraction needed by programmers. I have high hopes for Austin Hastings's Close language, though that project appears to be on hold for now while he works on other foundational things. We want something simple but structured, with access to the intimate details of the Parrot VM but with a layer of abstraction to help keep programmers from getting lost in the arcana and minutiae.

What I would like to see on the road to 3.0 are these things:
  1. PIR should really be deprecated as the primary interface to Parrot. To program on Parrot you should either use PASM (typically as the raw text-based output of automated code generators), NQP (or another similar system language), or one of the HLLs. PIR does not fill a useful niche; it has been thrust upon people as the way to program on Parrot even though it has no particular merits in this regard.
  2. PIR can continue to exist, but as a separate compiler and not a part of Parrot directly. PIRC should be set up as a stand-alone compiler tool that converts PIR->PBC. The Parrot executable should have the ability to load this as an HLL compiler, but Parrot should natively act on PASM and PBC only.
  3. PASM should be fixed to accurately represent the Parrot bytecode format in a 1:1 way. It's not far off from this, but a few fixes are needed. PASM should also not have things like macros or any kind of "syntax sugar", and should have limited assembler directives.
  4. PCT should be updated to directly output PBC or, barring that, to output PASM instead of PIR. Of course, once we have PASM output, it's a matter of a simple table lookup to convert that into PBC. If PASM is the primary human-readable interface to Parrot, and if we have simple, fast routines for converting PASM to PBC on the fly, more programs and utilities will output PBC directly. This, in turn, improves performance for everybody.
  5. Extend NQP or whatever systems language we want to push as the default to have access to all or almost all capabilities present in PASM. Possibly give the option to inline small PASM sequences easily.
There are lots of benefits to doing this, I think: decreasing developer confusion, increasing performance, and gaining momentum to build languages that people actually want to use and that are fit for particular purposes. I think it's a really good idea and I would like to hear feedback about it. I say we let the assembly languages (PASM) be assembly languages, intended to be close to the machine, and that we give developers structured programming languages by default that people will actually be comfortable programming in. It seems win-win to me.

11 comments:

  1. I've long been a proponent of PASM being a direct translation of PBC, so I'm all with you on that point.

    I think you may have forgotten Winxed as a potential "high level way to write low-level Parrot code". I haven't used it myself, but IIRC that was the original intent.

    As for NQP, as Plumage and other projects use NQP more seriously, we've been finding lots of features we want to make NQP more useful outside the Perl 6 implementation effort. These requests are informally tracked on a page on the NQP-rx wiki:

    http://wiki.github.com/perl6/nqp-rx/requests

    So far pmichaud++ et al. have been very good about making improvements to NQP-rx for us; I was able to convert hundreds of lines of PIR to NQP over time.

    -'f (japhb)

  2. I like Winxed too, though I don't know enough to say whether it would be good for this purpose. When you look at a "system language" like C you see a few common themes: the language is small and there is very little runtime included by default. I like NQP because it explicitly doesn't have a large runtime. I also like the Kakapo project because it provides some highly-useful optional functionality to NQP.

    I like NQP, though as I mentioned above I'm hesitant to rely on projects that have other driving motivations than I have. Of course, if NQP is meeting the needs of developers and is working with other projects, my concerns disappear. Plus, if everybody is using it, it becomes the de facto standard whether we like it or not (and I think a lot of people DO like it).

  3. This comment has been removed by the author.

  4. You compared PIR to HLA here. Reading about HLA made PIR sound attractive. Could it be that PIR is not as good a high-level-assembler for PBC as HLA is for x86? If so, maybe PIR needs to be improved to match HLA's quality.

    Yours,
    Tom

  5. I compared PIR to HLA as a way to show what PIR is: It's not a "bare bones" assembly language, it has nicer syntax and a macro facility that many programmers will feel familiar with. I don't want to claim that PIR has the same quality as HLA.

    People who use assembly language need good tools to work with, like HLA. However, my bigger point is that most developers don't need to use assembly in the first place. In fact, most developers would prefer not to use it if possible. So no matter how nice PIR becomes, it's still the wrong tool for the job.

    And this all ignores the fact that the "nicer" PIR becomes in terms of syntax and usability, the harder it becomes to parse and convert quickly to bytecode.

  6. In your post on the Neko VM, you seem to favor PIR over PASM. Now, in this post, you seem to be favoring PASM over PIR. Is this because PASM is not ready yet, so you think that PIR should be used now, but then deprecated in favor of PASM?

  7. I don't favor PIR over PASM, but most other developers do.

    Look at it this way: We don't have a "system language" like the kind I talk about in this post; a good structured language that people can use for low-level development. The only options are PIR and PASM.

    PASM is far too low-level to write directly. It's an assembly language and assembly languages are a huge pain in the ass for most people.

    So the only alternative we have right now is PIR. I don't like PIR though, so I'm hoping we develop a new alternative (or settle on NQP as a reliable one).

  8. Winxed has zero runtime. The stage 0 compiler generates some predefined functions that can be seen as a sort of runtime library, but stage 1 inserts code inline instead of inserting subs and calling them.

    I don't think we should have a de facto standard that everyone uses whether they like it or not. There's more than one way to do it.

  9. There is more than one way to do it, of course. But at the moment we basically say to people that PIR is one of the preferred ways of interacting with Parrot, and I don't think it should be. PIR is a terrible language for programmers to use and is ridiculously sub-optimal for parsing.

    Think about C. In Linuxland, C is a de facto standard for most types of infrastructure programming where compiling to efficient machine code is a good thing. Sure, you could use any number of other languages: Haskell, C++, Ada, even Fortran. But many people use C for these applications, and because of that there are hundreds of libraries and resources, not to mention fantastic optimizing compilers for C.

    If you talk about client-side web programming, the de facto standard is JavaScript. And because so many people use it there are some remarkably efficient VMs for it, great libraries, resources, frameworks, etc.

    The bigger point is this: PIR is not a good way to interact with Parrot, so we shouldn't recommend it to new programmers. When new developers get started with Parrot, we shouldn't say "Here's PIR, use that to write your programs", we should say instead "Here's NQP, use that to start writing programs".

  10. C is a de facto standard, agreed, but no one is forced to use it. All you need to do for interoperability is to be able to provide and use functions with C calling conventions.

    Given that most people interested in Parrot have some Perl background, NQP is a good choice. Some people, like me, are more comfortable with a syntax closer to C/C++/Java/JavaScript, thus I wrote Winxed, and consequently I'll recommend it to anyone like me, when it reaches some stability.

    I agree that PIR is not a good choice for writing programs, but it is a good target for compilers. PIR takes care of calling convention details, so they can be changed without breaking PIR code or code generators. Generating PASM is doable, but has some cost.

  11. This is why I had high hopes for the Close project: it's C-like but also is tailored for use on Parrot. It, or something like it, would be an ideal choice for people new to Parrot but familiar with C/C++/JavaScript.

    If Winxed satisfies that too, good. I would love to see more Winxed.

