
Tuesday, February 9, 2010

GSoC Idea: Bytecode improvements

Bytecode.

It's a part of the VM that really "just works", and so nobody spends much time playing with it. This is not to say that nobody has touched it, but in my tenure as a Parrot developer I have not seen nearly as much development on this subsystem as I have seen elsewhere. That has been changing recently with some cleanups to the freeze/thaw system, which is related to the bytecode system. There are several packfile-related PMC types waiting in the wings, but Parrot only uses them as a thin wrapper layer to provide PBC access from PIR.

There are several issues that need attention in the realm of Parrot's bytecode. Some of these issues, or combinations thereof, could make for very interesting--and very rewarding--GSoC projects.

The first project I would like to see is making Parrot's bytecode more portable. Currently, bytecode isn't portable between different systems the way it should be, nor is it portable between different Parrot versions. Every time somebody bumps the bytecode version number in the PBC_COMPAT file, all previously-generated bytecode files become invalid.

One way to tackle this problem would be to include metadata in the bytecode file. This metadata could include a list of PMC types with their corresponding type numbers, and of opcodes with their corresponding index numbers. A verification step at load time could check versions and, if they differ, loop over this metadata and attempt to remap the old numbers to the new ones.
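
As a rough illustration of what that remapping pass could look like, here is a minimal C sketch. Everything in it is hypothetical: the structure, the lookup callback, and the assumption that the code stream has already been split into opcode slots. None of it reflects Parrot's actual packfile API.

    /* A minimal sketch, assuming the loader can read a name/number table
     * from the PBC file's metadata and can look up the current number for
     * an opcode name. All names are invented for illustration. */

    #include <stddef.h>

    typedef struct {
        const char *name;     /* opcode name recorded in the file's metadata */
        int         old_num;  /* the number it had when the file was written */
    } OpMetadata;

    /* Rewrite each opcode slot in `code` from the old numbering to the
     * current one. `lookup` maps an opcode name to its current number,
     * or -1 if the opcode no longer exists. Assumes `code` holds only
     * opcode slots; walking operands is a separate concern. */
    int
    remap_opcodes(int *code, size_t code_len,
                  const OpMetadata *meta, size_t meta_len,
                  int (*lookup)(const char *name))
    {
        size_t i, j;
        for (i = 0; i < code_len; i++) {
            for (j = 0; j < meta_len; j++) {
                if (code[i] == meta[j].old_num) {
                    int new_num = lookup(meta[j].name);
                    if (new_num < 0)
                        return -1;   /* opcode was removed: refuse to load */
                    code[i] = new_num;
                    break;
                }
            }
        }
        return 0;
    }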

...And while we're talking about bytecode portability, Parrot needs a "robust testing strategy" for verifying PBC files. I mention this because I see that warning several times per test run, every test run, every day. Seriously, somebody fix this warning!

The packfile system code is terrible and very complicated. We do have those PMC types that act as thin wrappers over the packfile structures and APIs. One thing that I would like to see is a complete refactor of the system that replaces all of the packfile structures with their equivalent, properly-encapsulated PMCs.

Parrot's bytecode also stores long lists of the constants used in the code. These lists, unfortunately, often contain duplicates. A post-processing step on the generated bytecode could merge those duplicates and perform some basic constant folding, decreasing the final size of the generated PBC file.
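
A sketch of what such a deduplication pass could look like, assuming a flat table of string constants and a remap array that a later pass would use to rewrite constant references in the code. The names here are made up and do not come from Parrot's packfile code.

    #include <stddef.h>
    #include <string.h>

    /* Collapse duplicate string constants in place. remap[old_index] is
     * set to the surviving index for every entry, so a second pass can
     * rewrite constant references in the opcode stream. Returns the new
     * (smaller or equal) table size. */
    size_t
    dedup_constants(const char **consts, size_t n, size_t *remap)
    {
        size_t kept = 0, i, j;
        for (i = 0; i < n; i++) {
            size_t target = kept;
            for (j = 0; j < kept; j++) {
                if (strcmp(consts[i], consts[j]) == 0) {
                    target = j;               /* duplicate of an earlier entry */
                    break;
                }
            }
            if (target == kept)
                consts[kept++] = consts[i];   /* first occurrence: keep it */
            remap[i] = target;
        }
        return kept;
    }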

And speaking of constant folding, there are plenty of other optimizations that could be performed on bytecode. Hooks for adding optional optimizations would be a most welcome addition, especially if we could notably decrease file size or even improve performance (though performance-improving optimizations would probably be better suited to the PIR or PAST levels). One such optimization that could be useful is a translator routine that converts bytecode files to a "native" format where data is already in the correct byte order.
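
For the byte-order idea specifically, the translation itself is simple; the interesting part is deciding when to run it. Here is a minimal sketch, assuming the file header records its byte order and that the segment is a stream of 32-bit words; the flag handling is invented for illustration and is not Parrot's real header layout.

    #include <stddef.h>
    #include <stdint.h>

    /* Swap the bytes of a single 32-bit word. */
    static uint32_t
    swap32(uint32_t w)
    {
        return ((w & 0x000000FFu) << 24)
             | ((w & 0x0000FF00u) <<  8)
             | ((w & 0x00FF0000u) >>  8)
             | ((w & 0xFF000000u) >> 24);
    }

    /* Convert `len` 32-bit words to host order in place. Run once at load
     * (or install) time so the inner interpreter loop never has to swap. */
    void
    to_native_order(uint32_t *words, size_t len,
                    int file_is_big_endian, int host_is_big_endian)
    {
        size_t i;
        if (file_is_big_endian == host_is_big_endian)
            return;                  /* already in native order */
        for (i = 0; i < len; i++)
            words[i] = swap32(words[i]);
    }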

The pbc_dump program, which attempts a basic disassembly of a bytecode file, is woefully inadequate for most needs. Major improvements there would be nice; a full-featured PBC disassembler that could produce round-trip-capable code would be even better. The pbc_merge program is similarly in need of some love, and improvements to it could be made as part of a larger set of refactors. Tangentially related but still important is a proper PIR->PASM translation utility. IMCC has an option to do this, but it produces output code that is known to be incorrect and incomplete.

The freeze/thaw system converts individual PMCs to serialized strings suitable for saving to a file and loading again at a later time. Better integration of the freeze/thaw code would let us store all Parrot data, be it bytecode or user data, to a similarly-structured file. Eventually we could make the freeze/thaw types inheritable, so users could define how their own data is serialized, and information in the file could tell Parrot which serialization types are required to read the data back later. Some ideas for subtypes of the PMC serializer are serializers that checksum data for later verification, or types that encrypt or obfuscate the data for security purposes.
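
To make the checksumming idea a little more concrete, here is a minimal sketch of what such a subtype might do around a frozen image. The checksum, layout, and function names are all invented; a real implementation would live in a serializer PMC and use a proper digest rather than this toy sum.

    #include <stddef.h>
    #include <stdint.h>

    /* A deliberately trivial checksum, standing in for a real digest. */
    static uint32_t
    frozen_checksum(const unsigned char *buf, size_t len)
    {
        uint32_t sum = 0;
        size_t i;
        for (i = 0; i < len; i++)
            sum = (sum * 31u) + buf[i];
        return sum;
    }

    /* A checksumming subtype would store frozen_checksum(image) alongside
     * the frozen image, and refuse to thaw if verification fails. */
    int
    verify_frozen_image(const unsigned char *image, size_t len,
                        uint32_t stored_sum)
    {
        return frozen_checksum(image, len) == stored_sum;
    }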

So these are a few random ideas about projects that have to do with the PBC system. I think this is an area that is ripe for some GSoC-level projects. I also think projects in this area would benefit a lot of people and would help make bytecode a compelling format for shipping portable executables to end users. If you like one of these ideas, or if you have ideas of your own for fixing these systems, please do let me know!

3 comments:

  1. Seeing the subject, I expected this post to deal with Lorito (aka L1), but surprisingly you don't say a word about it:
    What you are proposing here is more or less a cleanup of the current bytecode rather than the complete refactoring that Lorito is intended to be.

    So my question is the following: what is the state of Lorito and wouldn't it be wiser to focus on a complete redesign?

  2. Good question. I talk about Lorito as if it is some exotic new thing, when the reality is that Lorito is more of an idealized subset of PASM.

    I don't think the PASM (and, by extension, PIR) opset was particularly well-designed in the beginning. There are places where it grew very organically and haphazardly, and there were definitely points in Parrot's history when "the more, the merrier" was a driving design philosophy. As a result, PASM became extremely complex, much more CISC than RISC. This complexity in turn drives up the algorithmic complexity of IMCC and makes things like JIT insanely difficult to do correctly.

    Lorito is intended to be a redesigned opset with a focus on low-level primitives that are easy to optimize and JIT. Even with Lorito, some of the problems we have now still remain: we still need a mechanism to serialize Lorito into bytecode; we still need a way to work with that bytecode from PIR; we still need a sane, translatable, and portable bytecode format; and we still need major cleanups in the whole serialization system.

    So if we have two dialects of opcodes (PASM and Lorito), both of them need to be converted to bytecode, which means the bytecode system needs to be refactored and cleaned up. The act of creating the new opset is largely orthogonal to this work.

  3. > The act of creating the new opset is largely orthogonal to this work.

    Thanks for the quick reply; that was exactly the part that wasn't clear to me.

    As a side note: I know that you are still in the design phase for Lorito and the experimentation phase for the JIT, but IMHO you should define separate subtasks for Lorito and create tickets targeting different milestones on the roadmap.

    Secondly: the Parrot roadmap notes on Google Spreadsheets are not in sync with the corresponding tasks in Trac; which one is correct? :)

