Blog Closed

This blog has moved to Github. This page will not be updated and is not open for comments. Please go to the new site for updated content.
Showing posts with label GSOCIdeas. Show all posts
Showing posts with label GSOCIdeas. Show all posts

Tuesday, March 23, 2010

Potential GSoC students

I've gotten several emails from prospective GSoC students in response to some of the blog posts I've written about potential project ideas. It looks like we're going to have interested applicants for several of the projects, including a few new ones that I hadn't even considered. These steps are especially important if you are new to the Parrot project and are applying for a GSoC position for the first time with us.

In each case, the advice I give the students is always basically the same:

  1. If you aren't a super C coder, or only have limited experience, pick up a copy of K&R. Read it cover-to-cover. "real world" code, like that found in Parrot, can differ significantly from the simplified, mickey-mouse code you experience in most introductory programming classes or the exercises in the back of a "Learn C in 24 hours" book. K&R isn't as in-depth as some tutorial books, but it's complete, concise, and serves as a great reference. Likewise, if your project is centered on Perl code but you aren't a great Perl coder, pick up a copy of the Camel book. Nothing beats having a good reference available as you work. If you don't know all the answers, you should know where to find all the answers.
  2. Get a copy of the Parrot source code, build it and run the tests. Start reading the code. You'll definitely want to skim over at least most of the src directory, and you will want to pay particular focus to the subsystems involved in your proposed project. If you're doing a GC project, read the code in src/gc. Strings? src/string. As you read code, make changes; especially changes to code formatting, readability, and documentation. Submit these changes as patches and be prepared for feedback. If you see problems but aren't able to fix them yourself, submit bug reports. It's hard to accept a GSoC project if we aren't familiar with the quality (or quantity) of your work.
  3. Get involved. Join the parrot-dev mailing list. This list isn't super-high traffic, so don't worry about getting your inbox flooded. Send a welcome message to the list to introduce yourself. If you're really interested to see what's going on at a deeper level, join the parrot-tickets list as well This list represents a live feed of events happening in our issue tracker. This will give you an idea of the kinds of problems users have and how we go about fixing them. One of the best things you can do is to join the #parrot chatroom on IRC. Introduce yourself to people there, follow the conversations, and get involved. Ask questions. It's hard to accept a GSoC project if we aren't familiar with you as a person, and if you aren't an active community member.
So here are my three recommendations in a nutshell: Beef up your skills, Get familiar with the project, and get involved in the community. Do these things, along with submitting a good proposal, and your chances of getting accepted go way up. Good luck!

Wednesday, March 10, 2010

GSoC Idea: GMP Bindings

This conversation happened yesterday on IRC, with some off-topic things edited out:

darbelo: That reminds me. I hate our bignums and want them to die...
whiteknight: darbelo: I agree with the bignums thing 100%. I want bignums out of the repo and moved to their own project
whiteknight: There's no sense keeping them when I suspect a majority of users can't use them because they don't have GMP installed
darbelo: Actually, I wouldn't mind them being in the core if they weren't dependant on a lib I don't have.
darbelo: I actually had started to write a stand-alone BigInteger PMC after last year's SoC.
whiteknight: that would make an awesome project too.
whiteknight: I think we should have lots of projects like that, and for developers to be able to pick which solution they want
whiteknight: as we are now, it's easier to force BigInt to pretend to do what we need instead of just using the best solution, which might be DecNumber or something else
darbelo: Maybe, but GMP is much, much more than just bignums. It's a pretty big library.
darbelo: Our PMCs don't even start to scratch the surface of what GMP can do.
whiteknight: darbelo: so moving those PMCs out to a separate library and adding wrappers for other functionality might be nice
darbelo: I would consider a GMP binding much more valuble to parrot than our current use of the lib, yes.
bubaflub: last year i worked a bit with GMP library and suggested to dukeleto we work on a GMP binding for parrot
bubaflub: last year with GSOC and perl 5
darbelo: bubaflub: That would be nice to have.
bubaflub: though we used an existing perl 5 binding (Math::GMPz)
whiteknight: yes, that would be a wonderful project
bubaflub: but we could nab the test suite and what not
whiteknight: exactly. We have the two PMC types, and we could write wrappers for the rest of the library and get all sorts of additional power
bubaflub: i think access to the GMP library in general would be nice; the stuff i worked on last year was setting some foundational stuff for cryptography libraries

Parrot has two PMC types that wrap GMP: BigInt nd BigNum. I think, and apparently a few people agree, that these two types have no business being in the core Parrot repository and should be moved to another project. The immediate benefit to this would be that the bindings for GMP could be improved and expanded independently, instead of only providing what little functionality Parrot actually makes direct use of.

A good GSoC project for this year would be to move (or fork) the current BigInt and BigNum PMC types to a new project and use them as the cornerstone for writing a more comprehensive interface for the GMP library.This could include other PMC types, NCI function wrappers, PMC methods, ops, and other things to allow access to the power of the GMP library. Adding custom Integer-like and Float-like PMCs that autopromote to their Big- counterparts would be nice too.

For more info about this project, you could probably get in touch with myself, darbelo, or bubaflub.

Tuesday, March 9, 2010

GSoC: Parrot in Summer 2010

Jonathan Leto sent out a great email to the list today about Parrot's involvement in GSoC this year. Parrot will be combining together with the Perl foundation again and entering as a single organization. I very much like this arrangement, under the blind assumption that we do better together in terms of student allotment than we do apart. I have no reason to doubt that.

Mentors: If you want to sign up to be a potential mentor, you can do it on the Perl foundation wiki.

Project Ideas: If you have any project ideas (I know I do!), list them on the Perl foundation wiki. If you tell me about the ideas as well, I'll feature them in a blog post and hopefully drum up some interest among prospective students.

Thursday, February 11, 2010

GSoC Idea: Immutable Strings

Immutable strings are an interesting idea. Fundamentally, in a system with immutable strings the actual string storage is considered immutable at the lowest level. This means that when we modify strings, we create copies instead of modifying the strings in place. Immutable strings carry some benefits. First and foremost, we never need to resize strings in place. This means that memory can be allocated more linearly, and can be kept track of more simply. Instead of a string header needing to contain both the size of the currently allocated buffer and the size of the string that buffer contains, we only need to store one size value.

When we look at a code example like this:

$S0 = foo($S0)

It will look to the programmer as if the string register $S0 has been modified to become whatever the output of the foo function is. However, this is only true of the header being modified. Internally, the header will be pointing to a new memory buffer entirely.

Parrot currently uses a system called "Copy on Write", or COW. In a COW system making copies are cheap because we make incomplete copies. When we do this in Parrot:

$S1 = $S0

We aren't copying the string, we are only copying the header. Both headers will point to the same underlying memory buffer but now a COW flag will be set. If an attempt is made to modify either string, a complete copy is made first and then the modification happens. In theory this is a great optimization: We create copies lazily, and don't have to actually copy memory buffers (which can be expensive) until a modification is made somewhere.

The problem with the COW system is that Parrot has to make a lot of copies of strings, especially for strings which are constant or are being passed as parameters to subroutines. Consider this code:

$P0 = new ["foo"]
$S0 = typeof $P0
$S0 = $S0 . "bar"

Here $S0 is orignally the typename of the $P0 object, "foo". Obviously the name of the class should be constant, if we are allowed to just rename the class like this, we could cause all sorts of problems in the code. So the typeof operator returns a COW copy of the string, so that any modifications happen to a lazy copy.

The problem is that most accesses of class type names are not like above, but are instead like shown below:

$S0 = typeof $P0
if $S0 == "foo" goto ...

or

$S0 = typeof $P0
$P1 = new [$S0]

In cases like these, $S0 is treated as a constant string in the code to key changes in program flow. Most of the time the output of the typeof operator is treated as a constant so a copy doesn't need to be made, lazily or otherwise. In fact, having to make COW headers every time can become extremely wasteful. A similar situation happens in the PCC system, where strings passed as parameters to subroutines are passed as COW copied, not as straight references. This likewise can become extremely expensive when a COW header isn't needed.

I've rambled enough, I can talk about the theory behind COW and Immutable strings another time. For now, let's get to the GSoC project.

For GSoC, myself and several other developers (chromatic especially) would like to see Parrot converted to use immutable strings instead of COW. This could be done in several steps:
  1. Conversion of all string functions to take a arguments and return a string pointer result, instead of modifying argument pointers directly.
  2. Modification of the string allocator to use immutable strings instead of COW (involves changes to a particularly messy portion of the GC)
  3. Deprecating and eventually deleting all the functions that deal specifically with COW
  4. Creation of a new PMC type, which would be a piece-wise string builder type.
  5. Optimizations and improvements throughout the codebase to rely on the behavior of immutable strings to prevent unwanted changes to constant strings.
  6. Benchmarks of the new system over the old system to compare performance changes
So that's the GSoC idea involving immutable strings. It's slightly ambitious, but it is pretty easy to break it up into smaller, easier steps. Interested, motivated, and talented students are definitely encouraged to apply.

Tuesday, February 9, 2010

GSoC Idea: Bytecode improvements

Bytecode.

It's a part of the VM that really "just works" and so nobody spends much time playing with it. This is not to say that nobody has touched it, but in my tenure as a Parrot developer I have not seen nearly so much development on the entire subsystem as I have seen in other places. This has been changing recently with some cleanups to the freeze/thaw system, which is related to the bytecode system. There are several packfile-related PMC types waiting in the wings, but Parrot only uses them as a thin wrapper layer to provide PBC access from PIR.

There are several issues that need attention in the realm of Parrot's bytecode. Some of these issues, or combinations thereof, could make for very interesting--and very rewarding--GSoC projects.

First project that I would like to see is in making Parrot's bytecode more portable. Currently bytecode isn't really portable between different systems like it should be, nor is it portable between different versions. Everytime somebody increases the bytecode version number in the PBC_COMPAT file, all previously-generated bytecode files become invalid.

One way to tackle this problem would be to include metadata in the bytecode file. This metadata could include a list of PMC types and their corresponding type numbers, and opcodes and their corresponding index numbers. A verification step at Parrot load could check versions and, if different, could loop over this metadata and attempt to remap old numbers to the new numbers.

...And while we're talking about bytecode portability, Parrot needs a "robust testing strategy" for testing and verifying PBC files. I mention this because I see the warning several times per test run, every test run, every day. Seriously, somebody fix this warning!

The packfile system code is terrible and very complicated. We do have those PMC types that act as thin wrappers over the packfile structures and APIs. One thing that I would like to see is a complete refactor of the system that replaced all of the packfile structures with the equivalent, properly-encapsulated, PMCs.

Parrot's bytecode also stores long lists of the constants used in the code. These long lists, unfortunately, often contain duplicates. A post-processing step on the generated bytecode could perform some basic constant folding on the code to decrease the final size of the generated PBC file.

And speaking of constant folding, there are plenty of other optimizations that could be performed on bytecode. Hooks to add optional optimizations would be a most welcome addition, especially if we could notably decrease file size or even improve performance (though performance-improving optimizations would probably be better suited at the PIR or PAST levels). One such optimization that could be useful is a translator routine that converts bytecode files to a "native" format where data is in the correct byte order.

The pbc_dump program, which attempts a basic disassembly of a bytecode file, is woefully inadequate for most needs. Major improvements to this would be nice. A full-featured PBC disassembler that could produce round-trip capable code would be even better. The pbc_merge program similarly is in need of some love. Major improvements to this could be made as part of a larger set of refactors. Tangentially related but still important is a proper PIR->PASM translation utility. IMCC has an option to do this, but produces output code that is known to be incorrect and incomplete.

The freeze/thaw system converts individual PMCs to serialized strings suitable for saving to a file and then loading again at a later time. Better integration of the freeze/thaw code would enable storing all Parrot data, be it bytecode or data, to a similarly-structured file. Eventually we could see the freeze/thaw types be inheritable, so users could define how their data is serialized. Information in the file could direct Parrot which serialization types are required to read the data later. Some ideas of subtypes of the PMC serializer type are serializers that checksum data for later verification, or types which encrypt or obfuscate the data for security purposes.

So these are a few random ideas about projects that have to do with the PBC system. I think this is an area that is ripe for some GSoC-level projects. Also, I think projects in this area would be majorly beneficial to a lot of people and would help to make bytecode a compelling format for shipping portable executable files to end-users. If you like one of these ideas, or if you have ideas of your own to fix these systems, please do let me know!

Monday, February 8, 2010

GSoC Idea: Parrot + RTEMS

By background I'm an embedded systems guy. In college I spent years focusing on microcontroller systems and FPGA hardware. I'm not going to spend the whole post reminiscing about the "good-ole' days" when I still worked down on the metal, but I do want to express my particular fondness for this next GSoC project idea: The Parrot-on-RTEMS port.

Jonathan Leto has been spearheading the project so far, but it's become clear that there are some major deficiencies in Parrot's cross-compilation facilities to enable Parrot to build and run properly on a real-time OS like RTEMS. Also, there are other problems with Parrot's algorithms that prevent it from meeting strict real-time deadlines.

In a message to the mailing list tonight Johnathan put out an official-looking call for interested participants, and also advertised an extremely informational page on the RTEMS wiki.

I will let the RTEMS page discuss most of the good details about what is needed, so I will cut off this blog post here. But if you're an interested student and want more information about this project, head on over to the RTEMS wiki and get in touch with Jonathan Leto.

Saturday, January 30, 2010

GSoC Idea: Implement a new GC

Blah Blah Blah Parrot needs a new GC. Blah Blah the current GC is lousy Blah Blah.

I've said it all before. At length. Ad nausea. Hell, I even attempted this same GSoC project myself two years ago.

The garbage collector is one of those systems that absolutely effects almost every aspect of Parrot's operation and performance. A bad GC means a slow, obnoxious VM with pauses and huge memory consumption. Of course, there is no one single "best" GC algorithm to use that provides optimal behavior to all classes of programs. This is why Parrot was designed to have a pluggable GC system where the most pertinent GC from among multiple options can be selected.

At least, that's the theory.

Parrot really only has one GC right now and it's a very simplistic Mark and Sweep collector with some less-than-desirable properties. bacek has been making good progress on porting the Boehm GC to Parrot, but that has it's drawbacks as well (though drawbacks are being mitigated). What we need is more people working on it, more people trying new things, and more eyes on the code.

So what kinds of GC could Parrot use? The projects page on the wiki describes three types that Parrot is specifically interested in:
  1. Generational
  2. Incremental
  3. Concurrent
These adjectives are not orthogonal, you could easily have an incremental-concurrent collector, or a generational-incremental collector, or even a generational-incremental-concurrect collector. But, these three things tend to have nice properties.

Generational collectors tend to have nice throughput because we subdivide the memory space and only take the time to mark and sweep a subset. Incremental collectors break the mark and sweep phases up into smaller increments which in turn decrease pause times. Concurrent collectors utilize multiprocessor systems to operate the GC in parallel with the program code and therefore experience no pauses and high throughput (but lousy performance on single-processor systems, or multiprocessor systems with many other running processes).

The aspiring GSoC student working on GC will benefit from two years of hard cleanup and refactoring work from myself and other contributors. Plus, there is strong community support to get a new GC up and running as quickly as possible to replace our current one. It's a project that will be tough no matter what, but where there are huge gains to be made for the Parrot community and a large amount of available support.

As I mentioned above, the GC system is supposed to be pluggable. So in reality you don't need to implement a great GC that will solve all our problems, you just need to implement any GC to prove that the system is, indeed, pluggable. If the GC you implement has some nice properties that's just a nice bonus.

Friday, January 29, 2010

GSOC Idea: Strings and NFG

Unicode is a messy thing. You wouldn't think that something so trivial as representing all the text in all the languages in all the world, keeping track of the relationships between various related symbols and characters, and creating a clear delineation between languages and dialects would be such a hassle, but apparently it is. Pause for a second while I look for a decent way to signal my sarcasm.

Unicode text doesn't have a single format. In fact, the same string of text can be represented in a number of different normalization forms. These normalization forms can optimize for byte-width by combining character codes with modifiers like accents into a single symbol, or breaking these complex characters into a sequence of a character and modifiers. Current Unicode normalization forms are called NFC, NFKC, NFD, and NFKD.

The Parrot design documents speak of a new, special normalization form that can be used internally to Parrot to implement some of our internal string algorithms. We call this new normalization form "NFG" because it normalizes over individual graphemes. In NFG, each grapheme (which is a fancy word for a single symbol) corresponds to a single sequence of characters and composing modifers. In other normalization forms these sequences don't need to be unique.

NFG has some interesting properties that make it useful internally to Parrot as an intermediate form for string operations. Being able to convert strings from multiple encodings and charsets into a single unique character sequence could be a big help in a lot of ways. At the moment when two strings are combined together, Parrot tries to convert directly from the more restrictive encoding/charset to the less restrictive one. For N different encoding/charset combinations, we have potentially N2 such transformations. This is non-ideal.

With NFG we only need 2N transformations: One to convert each encoding/charset combo to and from NFG. Strings are converted to NFG, composed together, and then converted out into whatever format they are needed in. It's really a nice system on paper and I have high hopes that it would be a big benefit to Parrot in terms of scalability and performance.

Parrot needs two things to satisfy the letter of PDD28: A comprehensive string subsystem cleanup and refactor, and implementation of the new NFG normalization form. The later is probably more suitable for a GSoC project, while the former would be a good job for a prospective student to start looking at now to become familiarized with the system.

Thursday, January 28, 2010

GSoC Idea: Asynchronous IO System

I talked to Coke briefly yesterday evening about GSoC. I hadn't heard specifically that Parrot would be applying for GSoC again this year and even though I was hopeful I didn't want to presume. He didn't have a firm answer about whether Parrot was planning to apply as a mentoring orgnization, but he did say "I think we'd be crazy not to". That's good enough for me.

There are a few topics that I've covered more than others on this blog. One of them has been asynchronous IO. I've covered that topic at length, and was hoping to have the time to implement the system myself. Regrettably I don't think I will have time to do it before summer, which means it's ripe for prospective GSoC students to gander at.

Parrot has a basic, synchronous-blocking IO system right now. This is all well and good but let's all face facts: IO tends to be among the slowest operations that a program performs. IO represents a huge bottleneck, especially for certain types of IO-intensive programs. To alleviate this, Parrot's design documents describe an asynchronous system where IO operations are performed in the background while the program itself can continue performing non-IO operations at the same time.

I'm purposefully conflating ideas of "asynchronous" and "non-blocking" ideas, though strictly speaking there are differences between the two. What Parrot really needs is an "asynchronous non-blocking" IO system, which I will just call "AIO" for short.

In an AIO system, IO requests are dispatched to the operating system and program execution immediately resumes. The dispatch operation returns a request object. There are two ways to keep track of the IO request, if you need to keep track at all: First is to poll the request object to see if it's completed (or had an error). The second is to provide a callback with the request and the OS will execute the callback when the request has been satisfied (again, with error information if the request was not successful).

A current Parrot print operation looks like this:

print "Hello World\n"

An AIO implemention would add new forms like this:

request = print "Hello World\n", callback

This presents a lot of flexibility, and could potentially improve performance significantly for IO-intensive programs.

The IO system could use a few miscellaneous cleanups but is generally not in bad condition. A good implementation of AIO will be well-integrated with Parrot's event scheduler, and possibly with with the threading implementation as well. The scheduler is in good condition but the threading system is not. The aspiring AIO developer may find herself making fixes to any of these systems along the way. If AIO sounds like the job for you, you may want to start reading through the source code now so there are no surprises over the summer.

Wednesday, January 27, 2010

GSoC Summer 2010 Projects

I saw an email today from Leslie Hawthorn, the program director for Google's GSoC program. I have not heard official confirmation from anybody, but I can't imagine that Parrot won't apply to be a mentoring organization this year. We've had such good success in the past with it, and so many of the students (including myself!) have stayed on to become long-term productive community members.

Mentoring organizations need to apply by the beginning of the March. Students can begin submitting their applications by the end of March. Before that, assuming Parrot gets accepted as a mentoring organization, I think we should have a few ideas for projects prepared. Of course a student is free to come up with any proposal they want, but I think it's good to have some ideas floating around to inspire people beforehand. Parrot has a wiki page devoted to project ideas like this, though that page is a little threadbare right now. Any ideas I get will be cross-post to that page as well.

I'm going to try to post some good GSoC project ideas on this blog in the coming weeks. If you have a good idea for a project, let me know and I'll write about it. If you're a student and want more information about a particular idea, get in touch and I will do the best I can to steer you in the right direction. Let's hope that this summer turns out as good as previous ones have!