Blog Closed

This blog has moved to Github. This page will not be updated and is not open for comments. Please go to the new site for updated content.

Friday, June 26, 2009

Parrot4Newbies: PMCs

This is the third installment in my series of Parrot4Newbies, and today I am going to talk about PMCs. As a general reminder, if you're interested in any of the topics I've discussed so far, make sure to keep track of the comments. I'm going to post more tasks on each topic in the comments, and I hope other people will post new ideas there as well.

I've talked about documentation and the test suite as great places to get involved in Parrot quickly. However, neither of these involves a lot of code, especially not a lot of C code, which many people know very well. If you know C pretty well and want to get right into Parrot's guts, I can't think of any place better to get started than the PMC system. PMC types are defined in src/pmc/*.pmc files. Parrot's core API for dealing with PMCs is located in src/pmc.c, and our implementation of objects (which are based on the Object PMC) is located in src/oo.c. The current compiler tool for converting PMCs from their strange C-like language into pure C is located in lib/Parrot/Pmc2c/*.

Refactor Messy Code

Several of the core PMC types are very old and in need of a major cleanup. Some things need to be refactored for readability and to maximize code reuse. Other things, especially critical core types, need a little bit of optimization lovin'. Some good candidates here for general cleanup are Integer, Float and String PMCs, which are used frequently to autobox INTVAL, FLOATVAL, and STRING* core primitive types. Also, some of the newer types such as Socket and Sockaddr could use some tweaking to become more mature and stable.

Certain types that deal with the underlying system, such as OS, Env, and File, always need tweaking, extending, and testing to provide all the functionality that users are going to expect in order to examine and manipulate the underlying machine in a consistent way.

Verify the Spec

The various design documents in docs/pdds/* refer to some of the Core PMC types and discuss some of the important components of each. Of specific interest are PDDs 15, 17, 20, 21, 22, 23, 24, and 28. Also, there are several design documents still in draft in docs/pdds/draft/*. Take a look through some of these documents to see how the drafts compare to the current implementations. Feel free to make one conform more to the other, and submit patches for both the PMC types and the spec documents to update them. Of particular interest here are PDDs 8 and 14.

Rename API Functions

This is a task that actually applies to multiple subsystems, not just the PMC system. There is a page on the Parrot Wiki that we're editing right now to provide more details about other projects in this area.

Basically, the functions in src/pmc.c all need to be renamed to Parrot_pmc_*. We also need to evaluate which functions represent the public-facing PMC API (which should all have the PARROT_EXPORT directive), and which items should not be part of that API. For example, the function pmc_new should be renamed to Parrot_pmc_new.

Renaming just one function at a time (which would be ideal in terms of small, easy-to-review patches) should be a trivial matter for a coder who's any good with Perl5 (or even sed, if you're into that kind of stuff).
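
As a rough illustration of how small such a rename can be, here is a minimal sketch in Python rather than Perl5 or sed. The helper name rename_function and the sample C lines are made up for this example; only pmc_new and Parrot_pmc_new come from the task above. It is a plain text-level rename, so it will also touch matches inside comments and string literals, and the resulting diff still needs a human review.

```python
import re


def rename_function(source: str, old: str, new: str) -> str:
    """Rename whole-identifier occurrences of `old` to `new` in C source text.

    Hypothetical helper for illustration; not part of Parrot's tooling.
    """
    # \b only matches at identifier boundaries, so longer names that merely
    # contain `old` (or an already-renamed Parrot_pmc_new) are left alone.
    pattern = re.compile(r"\b" + re.escape(old) + r"\b")
    # A callable replacement avoids any backslash interpretation in `new`.
    return pattern.sub(lambda _match: new, source)


# Made-up sample lines standing in for real call sites.
sample = (
    "PMC *p = pmc_new(interp, type);\n"
    "PMC *q = pmc_new_noinit(interp, type);\n"
    "PMC *r = Parrot_pmc_new(interp, type);\n"
)

print(rename_function(sample, "pmc_new", "Parrot_pmc_new"))
```

Only the first sample line changes. The other two contain pmc_new as part of a longer identifier, which the word-boundary match skips, so the script is safe to re-run on a partially converted tree.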

Also, the file src/pmc.c should probably be moved to src/pmc/api.c, and broken into subfiles depending on functionality. Check out the page on the wiki for details about a move like this.

Conclusion

PMCs are Parrot's basic aggregate structure, and the various core PMC types encapsulate a large amount of Parrot's functionality. Unfortunately, many of the central PMCs and PMC mechanisms are old and in need of some tender loving care from a decent coder. It's a system that's easy to get involved in, and there are some real benefits to the project that won't cost more than a small amount of time with the right tools and the right know-how.

2 comments:

  1. String definitely needs a major rework.

    First off, the hashvals: the whole "seed" thing is kind of silly when we are using the current hash function, which is very predictable no matter what the seed is to anyone sophisticated enough to actually launch an algorithmic complexity attack. There are two choices here.

    We could start using a stronger hash. However, there is not much stronger than that which would be computationally cheap.

    Or, we could scrap the hash seed altogether and later on develop a proactive system to detect algorithmic complexity attacks and deal with them in a more direct fashion. The advantage is we get to keep the incremental properties of the current hash (or any Karp-Rabin equivalent).

    ...which, sigh, we are not using. But a further refactor would allow us to keep hashvals valid during string concatenations and substitutions, since our current hash has the property of being able to be shifted by multiplication and is thereafter additive.

    Finally, there's the prospect of dealing with in-place substrings so less unnecessary copying happens during primitives.

    In any case, all the classes Str, Array, and Hash could use a unified system whereby they can vary their internal representation. That is to say, we'd have more than one type of String PMC -- one for short strings that stores the data right inside the PMC, and another for longer strings that has the extra storage. Likewise for hashes of fewer than 4 buckets (which is 90+% of them in current use).

    However, when I asked about this on IRC there was some inclination towards not having separate PMCs for -- let's call them "nanoobjects" -- and to just typecast and switch-statement everything to death.

    With a bunch of charsets on top of all that, plus SIMD code to speed things up that would vary by architecture -- the code is going to get messy fast. I don't know if anyone can suggest a better plan for "nanoobjects" than either a mess of switch statements and ifdefs, or a self-promoting PMC that changes from a nanoobject to a bigger object on resize.

    At the very least, I think that a flagbit indicating whether the PMC is a "nanoobject" would be good. It may be we already have this via an opportune GC flag. But someone who knows a bunch about GC is going to have to chirp up and say, "yeah, that flag can be used for that."

  2. The string code is definitely messy for a variety of reasons. Bacek has recently been doing some tangentially-related cleanup work on keys, and I remember Simon Cozens talking about a major refactor of the entire string system a while back (although I don't know how far he got).

    The string hashing mechanism is definitely a sore thumb in terms of something that needs lots of work. We cache computed hashes in the STRING structure itself, and only compute those hashes lazily to amortize some of the costs of a more robust algorithm.

    We've looked into using an external hashing library too, but I don't know what progress was made on that decision, or what information was gathered.

    But you're right that our current algorithm is very shitty because it's so vulnerable to attack. A bigger issue is that the strings and hashes subsystems are both very messy and in need of major refactoring and cleaning before we can do much about improving the algorithm. Not a strict requirement, but we're going to want cleaner code if we plan on doing any major work here (and we do plan on doing major work!).

    Your idea about "nanoobjects" is very interesting to me because we can implement faster algorithms for smaller data types, which as you mentioned tend to be the more common case. I would be very interested to see an implementation of something like this, although I don't know what the best way would be to do it. You may consider writing a subclass of the String and Hash PMCs, and "promoting" from your subclass to the real class when things get larger than your optimizations can handle. I think we do have at least one flag available for this purpose, although I'm not sure it's the best way to mark nanoobjects like this.

    Please keep me updated about any more ideas you have on that topic.

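
The "shifted by multiplication and thereafter additive" property from the first comment can be sketched in a few lines. This is not Parrot's actual hash function. It is a generic seedless polynomial hash of the Karp-Rabin kind, and BASE, MOD, poly_hash, and concat_hash are all names and values assumed for illustration.

```python
BASE = 31        # assumed multiplier, chosen only for illustration
MOD = 2 ** 32    # assumed 32-bit wraparound


def poly_hash(data: bytes) -> int:
    """Seedless polynomial hash: each step shifts by BASE, then adds a byte."""
    h = 0
    for byte in data:
        h = (h * BASE + byte) % MOD
    return h


def concat_hash(hash_a: int, hash_b: int, len_b: int) -> int:
    """Hash of a + b computed from hash(a), hash(b) and len(b) alone.

    Shifting hash(a) by BASE ** len(b) makes room for b, after which the
    two hashes simply add. Neither string has to be rescanned.
    """
    return (hash_a * pow(BASE, len_b, MOD) + hash_b) % MOD


a, b = b"parrot", b"4newbies"
combined = concat_hash(poly_hash(a), poly_hash(b), len(b))
print(combined == poly_hash(a + b))
```

With a non-zero seed folded into the initial value the identity no longer holds in this simple form, which is one reason the seedless variant is convenient for keeping cached hashvals valid across concatenation.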

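
The self-promoting "nanoobject" idea from the comments can also be sketched at a high level. Nothing below reflects how Parrot's Hash PMC is implemented. SmallMap, INLINE_LIMIT, and the promoted flag are hypothetical names, and the threshold of 4 only loosely echoes the bucket count mentioned in the first comment.

```python
class SmallMap:
    """Tiny map that stores pairs inline and promotes itself when it grows.

    Hypothetical illustration of a self-promoting container.
    """

    INLINE_LIMIT = 4  # assumed threshold for the small representation

    def __init__(self):
        self._pairs = []    # small form: flat list of (key, value) pairs
        self._table = None  # large form: a real hash table, created on demand

    @property
    def promoted(self) -> bool:
        """Plays the role of the flag bit marking which form is in use."""
        return self._table is not None

    def set(self, key, value):
        if self._table is not None:
            self._table[key] = value
            return
        # Small form: linear scan, overwrite an existing key if present.
        for i, (k, _) in enumerate(self._pairs):
            if k == key:
                self._pairs[i] = (key, value)
                return
        if len(self._pairs) < self.INLINE_LIMIT:
            self._pairs.append((key, value))
            return
        # Out of inline room: promote once, then store in the large form.
        self._table = dict(self._pairs)
        self._pairs = []
        self._table[key] = value

    def get(self, key, default=None):
        if self._table is not None:
            return self._table.get(key, default)
        for k, v in self._pairs:
            if k == key:
                return v
        return default


m = SmallMap()
for i in range(5):
    m.set(i, i * i)
print(m.promoted, m.get(3))
```

Callers only ever see set and get, so the switch between representations stays in one place instead of spreading switch statements through every operation.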