Blog Closed

This blog has moved to Github. This page will not be updated and is not open for comments. Please go to the new site for updated content.

Friday, October 30, 2009

Proper Matrices in Matrixy

I started a branch in Matrixy (its own little adventure!) to begin working on improved integration with parrot-linear-algebra. Matrixy currently uses nested ResizablePMCArray PMCs to implement matrices, which carries a few problems with it and is a huge performance bottleneck.

When I started work on Matrixy I had in mind to create classes for the fundamental things like Matrices and Vectors, but also the other types like cell arrays and structs. My work at the time was very naive, and I expected to be implementing all the various numerical algorithms myself. It wasn't until Blair mentioned it to me that I really thought about using an optimized library like BLAS, although that brings its own hardships. Libraries like BLAS save us the effort of implementing all the operations from scratch, but require that we be able to write glue code for interfacing with the library functions. Tradeoffs.

Nested RPAs were easy because they were a pure-PIR solution that could be made to work quickly. The PMC manages its own memory, so our only job was to manage the individual arrays. There is a certain amount of effort involved in keeping the matrix square when we resize (very expensive) and a certain amount of effort involved in marshalling data into BLAS API functions (very expensive), but we were able to get a lot of development done using the quick prototype.

In Parrot-Linear-Algebra we're providing PMC types that store data in a low-level C data buffer, in a format that is amenable to direct use in BLAS. Also, since it's a square buffer by default, we don't have to do anything special to keep it square on resize: we create a new buffer of the necessary size and block-copy items from the smaller old buffer to the larger new one. This might not be the most efficient way to handle it, but considering that matrices can be preallocated quickly, it's not going to be a huge bottleneck, and a careful developer can avoid it almost entirely. The new PMC type also provides all the necessary VTABLEs, so we can interact with it in a very natural way from PIR. Code that we were executing in large PIR functions before can now be performed with single opcodes or single method calls.
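As a rough illustration of the resize strategy described above, here's a minimal C sketch assuming a simple row-major buffer of doubles; the function and field names are hypothetical, not the actual NumMatrix2D internals:

#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: grow a row-major matrix buffer by allocating the
 * larger buffer and block-copying each old row into its new position. */
static double *
resize_matrix(double *old, int old_rows, int old_cols, int new_rows, int new_cols)
{
    double *buf = calloc((size_t)new_rows * new_cols, sizeof (double));
    if (!buf)
        return NULL;
    for (int r = 0; r < old_rows; r++)
        memcpy(&buf[r * new_cols], &old[r * old_cols], old_cols * sizeof (double));
    free(old);
    return buf;
}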

So the general plan for the branch is to change Matrixy to have a strict dependency on parrot-linear-algebra. I will cut out all the code that manages custom RPA types and replace it with (hopefully smaller and better) code that interacts with NumMatrix2D PMCs instead. Along the way, I'll be adding METHODs to the PMC type to perform the interactions with BLAS that we were previously doing in PIR. This should provide a significant speedup in most cases, clean up the code significantly, and get us on the right path for future developments. I've already added several METHODs and VTABLEs for various purposes: clone, transpose, attribute access, fill, etc. These all need testing and improving, of course, but I've been on too much of a roll with Matrixy to stop for that just yet.

One cool method that I added last night was an "iterate function" method. It takes a Sub object, which it invokes on every element in the matrix and replaces the current value with the result of the Sub. Since Matrixy has a lot of functions that act independently on every item in the array, this was a pretty natural thing to add. Also, since we're doing this in C now instead of PIR, we open up all sorts of possibilities like automatically threading large requests for improved performance. I think Parrot's threads system has some work to do yet before this becomes a viable optimization, but it's still interesting to think about.
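For a sense of what that looks like on the C side, here is a highly simplified sketch; a plain C function pointer stands in for the Parrot Sub that the real METHOD would invoke, so the names and signature here are hypothetical:

/* Simplified stand-in for iterate_function_inplace: call the supplied
 * function on every element and store the result back in place. */
typedef double (*elem_fn)(double value, double extra);

static void
iterate_inplace(double *data, int rows, int cols, elem_fn fn, double extra)
{
    int i;
    for (i = 0; i < rows * cols; i++)
        data[i] = fn(data[i], extra);
}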

Here's a PIR example of iterating a function:

.sub main :main
    $P0 = new ['NumMatrix2D']
    .const Sub helper = "add_to_each"
    $P0[1;1] = 4
    $P0[0;1] = 3
    $P0[1;0] = 2
    $P0[0;0] = 1
    # 1 2
    # 3 4
    $P0.'iterate_function_inplace'(helper, 2)
    # 3 4
    # 5 6
    $P0.'iterate_function_inplace'(helper, 5)
    # 8 9
    # 10 11
.end

.sub 'add_to_each'
    .param pmc matrix
    .param num value
    .param num to_add
    $N0 = value + to_add
    .return($N0)
.end

I haven't done any testing on this yet, but this is what it will look like when it's all working properly. There are a lot of functions that act item-wise on each element in a matrix, and this little helper routine is going to make those all much easier and cleaner to implement.

Thursday, October 29, 2009

ParrotTheory: Microprocessor Design

Long before I was involved in helping with the Parrot virtual machine, I was in school learning about and even designing actual microprocessor hardware. In my posts I nonchalantly talk about things like caches, pipelines, and hazards. Now I'm going to explain a little bit about what these things are and why they matter to Parrot (and all code, for that matter).

A typical production-grade microprocessor, not a toy or a colorful explanatory diagram in a book, is broken up into several stages. Each stage represents a portion of work or computation that can be done in a single clock cycle. I'll call these "clock cycles" for simplicity, but it's worth mentioning that there is not necessarily a strict relationship between the clock frequency and the instruction frequency. It's also worth mentioning here that the smaller your stages are, the faster your clock can operate: the clock frequency is limited by the length of time it takes to execute your longest stage.

If we look at a simple and classic design, like the original MIPS processor, it is broken up into 5 distinct stages: Instruction Fetch (IF), Instruction Decode (ID), Register Lookup (REG), Execute (EX) and Write Back (WB). In the IF stage, the instruction is retrieved from memory. ID breaks that instruction up into a series of individual components like control sequences, register names, etc. REG retrieves the values of the operand registers from the register file (an array of register values), based on the information in the instruction word. EX performs the desired operation and produces a result. WB takes that result and stores it back into the register file. Wash, rinse, repeat.

A given instruction is only being executed in one stage at a time, so in this example it takes 5 clock cycles for an instruction to execute fully, and 1 instruction is completed every 5 cycles. Those may sound like two ways of saying the same thing, but as we'll see in a moment they are not. This arrangement would be a bit of a waste, because at any given time 4 of the 5 stages would be empty and unused. To prevent this, microprocessors use a technique called pipelining, where we stagger execution to keep all the stages full. At any given time, each stage can be processing a different instruction. It still takes 5 cycles to complete any single instruction, but we can now complete 1 instruction every cycle. This is a 5x speedup!

[Note: some readers will realize that I'm talking about a multicycle processor here, not the more simple single cycle design. This is on purpose.]

The original MIPS design is a very simple example; modern processors from SPARC, AMD, or Intel might have more than 20 stages.

We run into problems called pipeline hazards if we do things like try to read a value from a register that is still being calculated by another instruction in the processor. In this case, the processor bubbles, or stalls for a few cycles, until all the necessary data is ready. A smart optimizing compiler can automatically rearrange instructions to help reduce bubbling. To see how, take a look at this sequence:

a = b + c
d = a + 2
e = f + g

In this sequence, the value a is used immediately after it is calculated. This will cause the processor to bubble for a few cycles until the value of a is ready. Remember, it takes 5 cycles to compute a single instruction, so the first instruction must complete fully before the second instruction can begin. This means that for 4 cycles, new instructions cannot enter the pipeline, being instead replaced with empty "bubbles". By simply reordering things a little bit, we can speed things up:

a = b + c
e = f + g
d = a + 2

Same exact code, but now it executes a little bit faster. This is because the instruction "e = f + g" does not depend on the value of a, so it can enter the pipeline before the value of a has been calculated. Now we're down from a total of 7 cycles to 6 to perform the exact same work.

[Note: I'm definitely simplifying. Don't shoot me!]

The EX stage, which is the actual workhorse of the processor, is relatively complicated in itself. It consists of two separate components, the arithmetic-logic unit (ALU) and the Memory Unit (MEM). Instead of being a series of stages in a row, the ALU is a set of parallel execution cores. Each core performs a different operation, so one may add/subtract, one may multiply, one may bitshift, etc. Data comes in and is copied to every core, every operation is performed on it, and a multiplexer is used at the end to select which value we want to keep. The MEM component can write or read data in memory at a given address. It's the MEM component that is usually one of the slowest parts of the processor, and serves as a huge bottleneck for the rest of the hardware. This is because MEM needs to interface with the cache and the RAM, which both may be running at a significantly slower clock frequency.

The ALU can be remarkably fast, because there are often very straight-forward tradeoffs to be made between numbers of transistors and speed. When you hear statistics about modern CPUs having billions of transistors, you'll know that they're trading more transistors for improved performance. I can talk about this more later if people are interested.

What we can do to make things even faster is to start doubling up on parts of the pipeline. We can fetch and decode instructions in large batches and pass them to multiple parallel EX cores. So long as we have the necessary hardware to detect hazards, we can essentially execute multiple instructions together at each stage, with multiple stages in progress. Some modern Intel CPUs, for instance, may contain two ALUs and two specialized MEM units (one to read and one to write). This means they can execute two mathematical instructions and two memory instructions at the same time. A smart compiler will know to order instructions like this to maximize parallelism. This design, which uses parallel pipelines in a single processor core, is called superscalar. When a compiler rearranges instructions to take advantage of this parallelism or to avoid hazards, it is called instruction pairing.

Consider a modern processor running at 2GHz with 20 pipeline stages that can complete 1 instruction per cycle. With the pipeline full, that's 2 billion instructions completed per second, with up to 20 instructions in flight at once, if everything is running at its maximum efficiency. Not too bad. However, keeping things maximally efficient is a difficult task. Reducing hazards is a great thing, but the processor can only run fast if it can keep its pipeline full, and it can only keep the pipeline full if it knows what instructions to execute ahead of time. When we look at a conditional in C, we may not know what instructions to execute next until we've already calculated the value of the conditional:

if(a == 0) {
a = 1
} else {
a = 2
}

If the comparison a == 0 is in the pipeline now, we don't know which instruction to load next: a = 1 or a = 2. Without knowing, we would have to wait 20 cycles for the comparison to complete before we could load the next instruction into the pipeline. When you consider the case of more stages and more parallel pipelines, having to stall and wait for a result to be computed can cost a lot of processor time. You can start to see huge performance decreases from these kinds of situations. To help resolve them we have branch predictors.

Branch predictors are specialized pieces of hardware that attempt to predict where control flow will go, and use that information to speculatively load the pipeline with instructions. If the branch predictor is correct, things chug along like normal and everything is fast and wonderful. If the branch predictor is incorrect, we need to flush the incorrect instructions out of the pipeline and start fetching the correct ones. This is expensive: we need to completely refill the pipeline, which means fetching more instructions from comparatively slow memory.

I haven't made much mention of it so far, but memory caching is a huge deal. Memory runs at a much slower speed than the processor does. To keep the processor moving quickly we bring small blocks of memory into a small bit of fast-access storage inside the processor called the cache. The cache is small but very fast, and information in the processor cache can be accessed very quickly. If we try to access memory that isn't in the cache, we need to stop everything and load that data from memory into the cache before we can continue. This is called a cache miss and can be very expensive. If we look for data that is not in main memory either but is located on disk, we have what's called a page fault, which is more expensive still, but discussing that is well beyond the scope of this post.

If the branch predictor is wrong, we need to flush the pipeline and maybe flush the cache too. This can become prohibitively expensive, with slowdowns around 100x or more. This is why sometimes your 3GHz quad-core processor grinds along like an old 486: poorly written code is causing all sorts of hazards and misses, and drives efficiency down to the floor. It's not the processor that's slow, it's the processor needing to wait for the pipeline to refill, or the cache to refill, or even the virtual memory manager to load the correct page from disk. It's the waiting that's a huge performance killer.

In a lot of cases, doing more can actually be better for performance. Consider this code, for instance:

if(a != NULL) a = NULL;

This is very straightforward, and only does the work of setting a to NULL if it isn't already. What you don't see is that this is probably a net performance loss: the branch gives us a 50% chance of missing in the branch predictor, the comparison instruction is already moving the value of a into the cache, and we pay for an extra compare instruction before we set the value. In this case, it is likely much faster to just write

a = NULL;

But then again, that's not the whole story. It's not just a matter of keeping the pipeline full, it's also a matter of how much effort it takes to read from and write to the cache. If we need to write through the cache to main memory, we'll waste a lot more time than we would have if we had done the conditional. There are always tradeoffs to make, some of them painful.

In other cases, it's better to write more code to be faster. Here's an example:

int i;
void *a[8];
for(i = 0; i < 8; i++) a[i] = NULL;

We could make this faster by telling C that i can avoid memory entirely by defining it as "register int i", but that's a small optimization. The loop is going to cause at least one branch predictor failure, either entering the loop (and not knowing the first time to jump back to the top) or exiting the loop (and not knowing when not to jump back to the top). By unrolling the loop we can make things much faster:

a[0] = NULL;
a[1] = NULL;
a[2] = NULL;
a[3] = NULL;
a[4] = NULL;
a[5] = NULL;
a[6] = NULL;
a[7] = NULL;

I don't want to talk about all potential optimizations here; that would be a huge list. But by being conscious of the kinds of factors involved in performance, I hope other developers will be able to spot situations where code is far less efficient than it could be.

Wednesday, October 28, 2009

CS Education

Let me preface this post by saying that I don't have a CS (Computer Science) degree. I majored in Electrical Engineering for my undergrad and Computer Engineering for my master's degree.

Yesterday I read a very interesting series of blog posts about the state of CS education. First up was "Joel on Software" with a post about how lousy CS education is. Next up was "Stochastic Geometry" with a post rebuking Joel. Finally, Ovid had a nice post about his reactions to them.

I can tell that the state of computer engineering education is lousy, and I have heard enough from professors and students to know that the state of computer science education is lousy too. It's hard to really put a finger on what exactly is wrong with the system, however. First, let's distill the good points of what Joel has to say:
  1. Students aren't doing enough team work
  2. Students aren't good at time management
  3. Students don't write large programs
  4. Students don't have to spend much time debugging, stabilizing, securing, or maintaining software in the long term.
These points are all true, and go largely unaddressed by Stochastic Geometry. However, on the flip side, let's look at what the response had to say:
  1. Any particular tool is unimportant compared to learning a class of tools
  2. Learning the theoretical fundamentals is important, because it allows the student to learn specifics more easily.
  3. Teaching one particular methodology is harmful, because there are no standards and no proof that any one methodology is beneficial or even widely used.
Since starting my current job, I've had the privilege to participate in a few interviews of new applicants. Most of the jobs we've been looking to fill are entry-level, so many of our applicants were fresh out of school. We had one candidate in particular with a dual major in CS and CE from a school that, I thought, had a decent program. He had a nice resume and a decent GPA, so we were hopeful.

We started with some low-ball sanity questions. First up: "How many bits are in a word?" There are a few good ways to answer this one, and while it may seem esoteric it is profoundly important when talking about embedded systems. Savvy readers will realize that this is sort of a trick question: the term "word" in computing has a number of different meanings depending on context. An acceptable, though naive, answer would have been "16". A better answer would have been something like "it depends on the system: 32-bit computers have a 32-bit word, 64-bit computers have a 64-bit word, etc." A great answer would have included the previous one, and added "though in some programming spheres, 'word' is defined to always mean 16 bits, and double-word (32) and quad-word (64) are defined to mean other things, regardless of the size of the machine word". All of these would have been acceptable. His answer: "I don't know."

We were stunned, and tried to give him some help, but he simply didn't know the answer. So we moved to a less-difficult question, to try and coax some information that was buried a little too deep: "How many bits in a byte?". He didn't know that either. Thanks for coming in today, but the interview is now over.
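As an aside, a few lines of C are enough to show why "it depends on the system" is the better answer to the word-size question; this prints different values on 32-bit and 64-bit machines:

#include <stdio.h>
int main(void)
{
    /* The machine word tracks the pointer size rather than a fixed 16 bits. */
    printf("void*: %zu bits\n", sizeof (void *) * 8); /* 32 on a 32-bit box, 64 on a 64-bit box */
    printf("long:  %zu bits\n", sizeof (long) * 8);   /* commonly follows the machine word on Unix systems */
    return 0;
}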

Without fail, without any exceptions, the best potential applicants I've seen for my place of work, and the best software developers that I've known from other contexts, have significant programming experience outside of school. Sometimes this comes from independent learning and small personal projects. Sometimes this takes the form of Open Source Software contributions. I make no secret of the fact that my participation in Parrot has had one of the biggest beneficial effects on my skill as a coder. I am far more influenced by my work in Parrot than I was by anything I did in school. I'm more influenced by it than I am by my full-time job. It really is an immeasurably wonderful environment to work in, and gives the kind of practical immersion that a university degree really can't provide.

One applicant we talked to listed PHP on his resume. I asked, "Where did you learn PHP, in school?". To which he replied, "No, I taught it to myself."

"Why?"

"Just because I was interested in it and wanted to know how it works."

That applicant got the job, and has proven to be very successful at it. It didn't hurt that he knew how many bits were in a byte.

There was a time when people went into computer programming for the money. Back in the heyday of Silicon Valley and the .com bubble, there was money to be made hand over fist for any programmer who was able to dip into it. Reality has set in now, and this isn't the field to get into if you're only looking for some easy cash. In fact, if you don't stand out from the crowd in some way, you're more likely to be unemployed. I graduated with a lot of other EE and CS majors who settled for jobs in sales or marketing when they couldn't find a tech-related one. I knew one master's degree recipient who had to work at Target for a while because there were no jobs for him whatsoever.

Every recent college graduate will tell you that almost every single job listing they find requires at least one year of experience. The naive will say "How am I supposed to get the experience in the first place, if I can't get a job that requires prior experience?" In some cases internships can go a long way towards greasing those skids. However, the bigger question is this: If you're fresh out of school and have all the fun facts and theory fresh in your mind, why do you need experience? You need it because school doesn't really prepare you for a job at all: it prepares you to prepare yourself to get a job.

I don't want to scare anybody away, but I do need to make this point: if you aren't good at it, you won't easily find a good job programming or web developing, and coming out of college you will not be any good at it if you don't have any other experience. This is an indisputable fact: a college education simply does not and cannot prepare you for a real job in the real world of computer programming. You're going to have to put in the extra effort yourself. You're going to have to fill in the gaps yourself. You're going to have to learn the important lessons yourself. College made you capable of learning; now the onus is on you to actually learn. Going back for a master's degree is a good idea. Internships are good. Doing little contracts for small local businesses can be good too. Classes and training courses are good, and they happen in all sorts of places at all times of the year. Participating in an open-source software project is, in my estimation, the best.

The last option is my favorite, but so many people shy away from it: "I don't want to spend my free time writing code". To this I have two replies. First, if you don't want to write code, why are you pursuing a career in it? Second, what are you going to say when your future boss hands you a blackberry and tells you that you are on-call 24/7 and are expected to jump anytime the system goes down? Or when your boss tells you that the entire team is expected to put in 10-20 additional hours of overtime per week, every week, until the next release?

So in conclusion, my point is this: a college education, especially in the fields of CS or CE, will teach you the basics and the fundamentals, and put you in a position to learn new things and adapt to new work environments. It will not directly prepare you for a good job as a programmer. What a college education does (and this is extremely important, even if it doesn't sound like it) is prepare you to learn the new things you will need to be a programmer. You have to do the learning yourself. Joel is right to point out that students are coming out of college woefully unprepared for work. He is also right to point out that the problem belongs to the students themselves. Stochastic Geometry is right to point out that colleges are doing what they should, and that teaching the fundamentals is very important. The missing link is that the students need to go through the last few preparation steps themselves. Students need to get the extra real-world experience themselves to complement the fundamentals they've learned in college. Participating in a good open source project will teach you the necessary lessons, will look great on a resume, and will act as that final bit of preparation to get you into the good jobs you want.

Monday, October 26, 2009

Parrot Project Proliferation Part 1

It seems I can't shake a stick without hitting a new Parrot-related development project. They're everywhere, increasing in number, attracting new contributors, and growing like wildfire. We all know about Rakudo, so I don't need to give them any more free advertising. But there are plenty of newer, smaller projects that could use a little bit of exposure. So, in a small miniseries of posts, I would like to discuss some of the new Parrot-related projects that are worth some attention:

Smart-Make

There haven't been any commits to the project in about a month now, judging from the commit log on the project website. However, this is still a very cool project, and one that I've wanted to tackle myself for some time. It's Smart-Make, a "make" variant implemented on top of Parrot. It supports GNU Make compatibility, but also appears to implement another, better syntax as well. I'm not versed in all the details of it, but it looks like it could be an awesome project and an absolute staple of the Parrot development world.

I've wanted to put together a build tool on Parrot for a while now. There are a few reasons for this:
  1. I've always hated the syntax of GNU Make, and anything that could improve on that would be awesome.
  2. A rule-based decision engine or inference engine could be the backbone for a large number of projects. Prolog is a great example.
So, it appears that Parrot now has a make utility. The only question is: what do we need to do to start using it? Is anybody currently using it?

Plumage

I've mentioned it a few times, but I don't think I've really described it. Plumage is to Parrot approximately as CPAN is to Perl5. It's the "module ecosystem", a central metadata repository that helps coordinate a wide array of interdependent projects. I haven't done more than a cursory walkabout in Plumage, but it looks like a really great start for now. Plus, I'm happy that Matrixy and Parrot-Linear-Algebra are being tracked with it.

Eventually, all the installation directions for Matrixy and Parrot-Linear-Algebra will say "Use Plumage", it will just work, and everything will be very easy.

PL/Parrot

When I heard about this one I got excited: I thought it was a PL/1 compiler for Parrot. I mean, I wouldn't ever write a program in PL/1 other than some cutesy proof-of-concept stuff. But it would be cool to have in a "we have more nerd cred than you" kind of way.

Fortunately for everybody, PL/Parrot is not a PL/1 compiler. It's bindings for PostgreSQL. Or, it will be.

PostgreSQL has a cool interface that allows procedural programming languages to be plugged into it. By writing a binding for Parrot, we suddenly get the ability to plug any language that runs on Parrot into PostgreSQL without having to duplicate any of the effort. Cool, no?

Other Projects

I'd like to highlight other Parrot-related projects here on my blog. Start a cool one yourself? Know somebody who's started one? I'd love to hear about it and I would love to talk about it here.

Sunday, October 25, 2009

Matrices

Yesterday I put together a prototype of a matrix PMC for the parrot-linear-algebra project. Today, it's building and can be used in Parrot, thanks to efforts from darbelo and desertm4x. It doesn't do much yet, since we haven't implemented many bindings for the CBLAS library. But it's a good way to store a 2-dimensional matrix of floating point numbers. The new PMC type, which I hope is only the first of many related PMCs, is called NumMatrix2D.

Linear algebra is a big deal in the world of computing. When people talk about high performance computing, mathematical or scientific simulations, or other related topics, they're generally talking about applications involving linear algebra. Computer hardware optimizes for certain matrix and vector operations, and highly-tuned software libraries implement a wide set of other operations. Linear algebra operations can be used to implement large sets of numerical algorithms. But each matrix operation is composed of a large number of scalar operations. For this reason, if we want to implement a linear algebra interface for Parrot, the performance of the various primitive operations is critical.

Matrixy currently uses nested ResizablePMCArray PMCs to implement its matrices. This is problematic for a number of reasons, and performance suffers significantly because of it. What I want to do soon, now that NumMatrix2D builds, is to use it in Matrixy as the fundamental matrix data type. This should drastically improve performance and help get rid of a lot of unnecessary code.

Parrot-Linear-Algebra has a few things that it needs now. I'll discuss each separately.

Be installable

We need Parrot-Linear-Algebra to be installable so we can use the NumMatrix2D PMC type in other languages and projects easily. I tried a few naive things today, but obviously don't know enough about even my own platform to make this happen, much less to make it happen on many target platforms.

Test Suite

Now that we have something to test, we need to write a test suite. We want lots of tests so that we can keep track of regressions for NumMatrix2D, and maybe even do a little bit of proper Test-Driven-Development of new features.

In general, I would really like to have the test harness be not written in Perl5. I would like to avoid Perl5 as a dependency throughout this project, and I would like to try to only rely on Parrot and languages which run on Parrot. To my knowledge, nobody has yet written a complete test harness that runs on top of Parrot, so that may have to be a new project I start work on soon.

[Update 27 Oct 2009: I've been informed by dukeleto that Plumage in fact has a test harness written in pure NQP. I plan to "borrow" that later for both Matrixy and Parrot-Linear-Algebra.]

More BLAS Bindings

The goal for Parrot-Linear-Algebra, at least as of the last "meeting" I had with some of the other committers, was that we were going to try to build BLAS bindings into the NumMatrix2D PMC itself, and add separate NCI bindings for other libraries (LAPACK, etc).

Some of the API functions will fit neatly into VTABLEs: add, subtract, multiply. Others will need to be made into METHODs instead. This gives us lots of flexibility while keeping the most common operations fast.
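For a sense of what the glue inside such a VTABLE or METHOD looks like, here's a small self-contained C sketch that multiplies two row-major matrices with the standard CBLAS call; this is plain CBLAS usage for illustration, not the actual NumMatrix2D code:

#include <cblas.h>

/* C = A * B for row-major n x n matrices of doubles. A multiply VTABLE or
 * METHOD would wrap a call like this around the PMC's internal buffers. */
static void
matmul(int n, const double *a, const double *b, double *c)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, a, n,
                b, n,
                0.0, c, n);
}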

BLAS Detection

At the moment, we don't do any automatic detection of BLAS or any of its various implementations. What we need to do is write a series of configure probes that will look for the various implementations and generate the necessary interfaces automatically. We would like to support CBLAS, ATLAS, and GSL directly, if possible. Other implementations (and believe me, there are many) would be great too. Since BLAS is critical for high-performance computing and computing benchmarks, there are many implementations that are tuned and optimized for individual processor types. We want to support the common ones initially, but provide enough flexibility that people can implement new probes for other similar libraries as needed.
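A probe like that usually boils down to trying to compile and link a tiny program against each candidate library and seeing whether the link succeeds. Here's a minimal sketch of such a probe program (the configure harness around it is left out):

/* If this compiles and links against the candidate library (-lcblas, ATLAS,
 * GSL's CBLAS, ...), a usable CBLAS implementation is present. */
#include <cblas.h>
int main(void)
{
    double x[2] = { 1.0, 2.0 };
    double y[2] = { 3.0, 4.0 };
    /* ddot is a trivially small call that still exercises the library. */
    return cblas_ddot(2, x, 1, y, 1) > 0.0 ? 0 : 1;
}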

Road Ahead

There is obviously a lot of work left to do, but we've made the important first few steps already. Highly performant matrix types and bindings to optimized linear algebra libraries will open the door for all sorts of high-performance computing applications to start targeting Parrot. It will also, of course, remove some roadblocks from Matrixy development. Anybody who's interested in joining the effort should feel free to get in touch here, via email, or on IRC.

Thursday, October 22, 2009

Text Editor

Szabgab posted a very interesting poll for Perl programmers to see which text editors they have been using for writing Perl. Obviously I think he is trying to measure usage of Padre, which actually fares pretty well considering how young the project is. Vim and Emacs (and variants) do the best, with a variety of other smaller editors making up the remainder of the pie.

This reminds me again that I'm on a constant search for an editor that I can call my own. On Windows I use Notepad++. There is no question: it's the editor for me on Win32, and I install it on every single Windows computer I use. The one exception is that I use Visual Studio for my C# programming needs, but I don't think anybody will fault me for that.

On Linux, however, it's a different story entirely. I don't have a go-to editor of choice on Linux, and I've never found myself to be 100% happy with any of the offerings. Surprisingly, I'm not even happy with Notepad++ on Wine, although that seems like it should be the obvious choice. I was using MEdit for a while, GEdit a little before that, and now I'm using Kate. MEdit was far too light in the features department, and Kate is turning out to be far too buggy. This may be because I am using Kate in my Gnome environment, or because I'm using the wrong version of a critical library, or whatever. The fact is, I've found a shittonne of bugs in Kate: some are annoying but harmless, some make the software downright unusable. Plus, Kate doesn't really offer all the features I want in an editor, so that's an extra layer of annoyance. I tried Eclipse too, but that was far too much for me, and I was having a lot of difficulty just making basic edits to simple ASCII files. I don't need a huge integrated IDE (don't want it either), just a simple text editor with some features for programmers.

So what do I want? Here's a list of some must have features and properties that I need in an editor:
  1. Absolute, 100% configurable syntax highlighting. I should be able to specify the color scheme in any way that I choose, and I should be able to easily define schemes for new languages. Being able to define things on a per-language level would be best. This is the #1 requirement and the primary reason why I left MEdit. This is also the reason why I haven't switched to Padre. I know that a lot of highlighting libraries use default schemes and system-wide settings, so this is a little tricky.
  2. The editor must be lightweight, should start quickly, and should just support simple plain-text editing. It should be easy to find the options and actions that I need. The "learning curve" for a lightweight text editor should take all of 30 seconds, if that. This is why I don't use Eclipse.
  3. Line numbering. Most editors do this.
  4. Multiple documents (tabbed is fine, but not the only way). I should be able to switch between documents with a simple key combination. Firefox lets me switch tabs with CTRL+PgUp and CTRL+PgDn, so that would be ideal in an editor too. Not having a good key combination (at least not one that I have found) to do this in Kate is a big reason why I am moving away from that editor.
  5. Whitespace handling. Must be able to replace tabs with spaces, must be able to specify the number of spaces to use for a tab, and must be able to indicate or automatically remove trailing whitespace.
  6. Must work well with C. Most of what I write is C, and I need an editor that can tailor itself to that. Being able to handle Perl and M would be cool too. Proper XML/HTML handling is always a plus but far less important.
  7. Proper Tab button action. Selecting text and pressing Tab should indent the selection. Shift+Tab should unindent the selection. Most editors that I've seen do this.
So that's my list of requirements in an editor. Fail these, and I won't use it no matter what other features it has. Assuming these are satisfied, here are features that I would also like to see, if possible:
  1. A good search tool. Literals good, regexes better, regex find/replace best.
  2. Automatic indenting is good. Automatic code formatting (especially with per-language rules) is better. Being able to specify custom rules for formatting and indenting is best.
  3. Brace matching. If I could custom set the highlighted color, that would be even better.
  4. Being able to autoformat code (especially whitespace) on paste would be awesome.
  5. Being able to open documents side-by-side: Being able to clone a document into multiple views would be awesome. Being able to link scrollbars so files scroll together would be better. Notice that Notepad++ does this very very well.
  6. Autocomplete for certain language constructs would be very nice. "Go to Definition" would be good too
  7. Displaying a guide at a fixed column width (especially a configurable width)
  8. A multi-item clipboard that I can store multiple snippets in at once.
That all said, here is a list of features that I won't use and don't care about:
  1. An embedded console window.
  2. The ability to build or run code from inside the editor
  3. Macros
  4. Zoom
  5. Code or comment folding
  6. Spell Check or Grammar Check
  7. "Sessions" or "Profiles"
  8. All sorts of complex key bindings. I like some but not too many. I definitely don't like having to rely on them when my hand is already on the mouse. This is why I don't (and won't) use Vim or Emacs.
  9. "Tip of the Day". Hate.
  10. Too many buttons, huge menus, and no ability to hide the crap that I don't want or need.
  11. Writing extensions or commands
So that, in a nutshell, is my list of requirements for a new text editor. I'm definitely hard to please, and I know that some of my priorities won't make sense to other coders. I'm also well aware that some of my needs above seem contradictory. I'm fine with all that. There's a particular way that I work, and particular features that I need, and I'm not going to use an editor that doesn't support my flow. It's not a popularity contest or me hating on certain editors, I just know that certain things work for me and certain things don't.

I would be very happy to hear suggestions about editors that people use that support some of these needs. I wouldn't even be surprised to hear that no editor on Linux really satisfies all of this, so I don't expect anything to be 100% perfect.

Wednesday, October 21, 2009

PCC Branch Lands!

Public Service Announcement: A few minutes ago, the pcc_reapply branch was merged into trunk. There is an ongoing cleanup effort right now to make sure everything is working the way it is supposed to.

Monday, October 19, 2009

PCC Branch Test Results Update

A little bit of an update, following up on my post from the other day: I'm hearing the first reports now that the pcc_reapply branch is passing all tests on some platforms. I will post updates as I get them.

Matrixy and Parrot-Linear-Algebra

I wrote a post about the current state of Matrixy and my ongoing plans for its development. The feedback was swift and very positive. I've now moved Matrixy to Github. In addition I separated out the CLAPACK and CBLAS library binding stuff (what little there was) into another repo, also on Github. The library package will be known by the absolutely uncreative name "parrot-linear-algebra". I might rename it if somebody comes up with something very clever, but a nice simple utilitarian name suits the project just fine for now. It is very simple and doesn't actually "do" anything yet, so best to worry about creative branding later when there is something to show.

I was also able to find some interested participants, and have handed out some commit bits to dukeleto, darbelo, and desertm4x to help work on these things. All of them have shown a lot of interest in the projects, and I'm hoping we can attract a few more people as well. Matrixy is interesting for people who have used M before and would like to use it to write programs on top of Parrot. This is a relatively limited set; as I've mentioned before, M is a bit of a niche language. However, the linear algebra package has huge potential to affect a number of people, because it could be integrated into all sorts of other libraries and compiler projects. Of course, the linear algebra package is much less mature than Matrixy is becoming, so we need a lot of help with that.

Parrot-Linear-Algebra Tasklist

These are the things that Parrot-Linear-Algebra needs in the short term:
  1. Get suitable licensing information in place, along with a README and other metadata. I think I like Artistic 2.0 for the license, but I could be persuaded to use an alternative pretty easily.
  2. CLAPACK and CBLAS API functions have a lot of very exotic function signatures. We need to either add permanent NCI bindings to Parrot to handle these or (preferably) give Parrot a proper NCI frame builder to build these on all systems.
  3. We need to write wrappers for all the functions, and there are a LOT of functions to wrap. A nice lazy Wrapper-Builder routine, or some kind of PMC type that could rapidly generate NCI PMCs on demand and cache them would be a nice addition, because it would prevent us from having to create tons of unused and unneeded bindings up front.
  4. We need a proper 2-D matrix data type PMC that will be able to interface quickly with the library wrappers. This is probably the single biggest piece of primary development that we need, at least initially.
Matrixy Tasklist

Here is a list of things that I think Matrixy really needs in the near-term:
  1. It needs to build against an installed Parrot. darbelo was looking at this problem last night and the prognosis is not good: Configure.pl and Makefile both probably need to be completely rewritten. Update: At the time of publishing, darbelo has already fixed this. Matrixy should now build against an installed Parrot at version 1.6.0 or higher.
  2. Set up Matrixy to require the Parrot-Linear-Algebra package. We should be able to detect whether the library exists and (if we have git installed) grab the latest copy from Github.
  3. Set up all the fancy-pants stuff that Rakudo has for requiring a specific version of Parrot (and also Parrot-Linear-Algebra), and being able to check out and build that version if necessary.
Once we have these things, primary feature development will be able to continue like normal and we'll be able to start adding all the missing syntax and cool new stuff that Matrixy really needs.

So Matrixy development will hopefully be getting back into full swing, and I'll be able to add some of the remaining syntax items (variadic input and output arguments, for instance) that I hadn't been able to implement before now. The linear algebra package also needs a lot of work and a few interested volunteers to get up to speed and become a useful resource for other projects.

Sunday, October 18, 2009

Calling All Mac Users

A quick shout-out to any developers out there who are using a Mac. Parrot has some outstanding problems on Mac OS 10.4 (I hear the problems are gone in later versions, but have not confirmed).

If you have a Mac, preferably running 10.4 or older, we could use your help:
  1. Set up an automatic test bot, so we can get up-to-date test reports on your platform
  2. See if you are able to fix some outstanding bugs and get Parrot to build and install on your system
  3. If you can't fix problems yourself, I would like to ask if you are willing to grant shell access to me or another Parrot dev who could try.
  4. If you have old, unused Mac machines laying around, maybe you would consider making a hardware donation to the Parrot Foundation so we can start setting up our own dedicated build/test machines.
Once we get Parrot building and installing properly on all Mac versions, we'll be able to put out proper packages for end-users to download and install.

Optimizing Parrot

Let's all face reality here: All the recent refactors of Parrot, including the ongoing PCC refactors, have either a neutral effect or a negative effect on Parrot performance. dukeleto showed me an eye-opening comparison between several versions of Parrot and the most recent rev of pcc_reapply that showed a stunning 400% execution time increase for some benchmarks between the 0.9.0 release and the current branch revision. That's terrible.

The Context refactors added about a 200% execution time increase. The PCC refactors are currently adding about another 200%. The fact that these two systems overlap so closely is certainly only exacerbating the problems. So why are these two refactors responsible for so much slowdown? Let's look at each in detail.

The Context refactor converted the Parrot_Context structure, which was manually memory-managed using bit-twiddled reference counts, into a garbage-collectible PMC type. Had the changes stopped there, I have the strong belief that this would have been a neutral change in terms of performance. If anything, I would suspect that this change would have been a small win overall. No matter how lousy and inefficient our garbage collector is, I have to believe that the tight loops and spatial locality of keeping PMCs together would be a good thing. Also, things like the memory allocator in the collector are pretty efficient (although they could definitely be improved). However, the refactors did not stop there. bacek also went through and properly encapsulated the Context PMC type so that there were fewer direct accesses of its internal members. All accesses now happen through API calls, which are significantly more expensive, and every field access adds an additional pointer dereference too. Together I don't really know how these things could have added a 200% performance penalty, but it's pretty well documented that they did.
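To make the cost concrete, here's a contrived C comparison; the struct and function names are invented, not Parrot's, but the pattern of a direct field read versus an opaque accessor (which in Parrot lives in another compilation unit and so can't be inlined) is what the paragraph describes:

/* Contrived illustration, not Parrot code. */
typedef struct { int *fields; } context_like;

/* Encapsulated access: a function call plus an extra dereference per field. */
int get_field(const context_like *ctx, int idx)
{
    return ctx->fields[idx];
}

int sum_direct(const context_like *ctx, int n)
{
    int total = 0, i;
    for (i = 0; i < n; i++)
        total += ctx->fields[i];      /* direct struct access in a tight loop */
    return total;
}

int sum_encapsulated(const context_like *ctx, int n)
{
    int total = 0, i;
    for (i = 0; i < n; i++)
        total += get_field(ctx, i);   /* accessor call for every field */
    return total;
}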

The PCC refactor adds a new abstraction step to the calling process. This makes the code prettier, and much more maintainable. The cost is in making some of the operations a lot less efficient. Some of this is the fault of the new CallSignature PMC, which naively integrates several array and hash PMCs, inheriting from the oddly-named Capture PMC in the process. All strings, integers, and numbers are autoboxed into PMCs during the creation of the signature, and every return value gets a CPointer PMC to hold its pointer. Add to that the fact that several of these values are only accessible through named lookups, and you start to see a major performance problem. We're creating a shittonne of new PMCs for every call, and then using relatively expensive VTABLE accesses to recursively access the data in them.

chromatic has been doing some great work in the past few days rewriting the naive CallSignature implementation into a much more efficient form. Many of the operations can be specialized for the task at hand, and the number of created PMCs can be reduced dramatically. bacek has also been doing some good work in this area, mostly small bits and pieces as he finds them. So the situation isn't hopeless, but we are going to have to tighten several things up if we want Parrot to be as screaming fast as it was in 0.9.0 (if not faster).

Now that we have a very well-designed system, there are plenty of places where we will probably want to break encapsulation for a performance win, and a few places where we will want to write some ugly code instead of the current prettier code, also for a win. We'll just make sure to mark these places with lots of documentation.

I've been focusing my attentions on fixing bugs. This weekend chromatic and bacek have been looking into them also, and together we're down to 1 remaining failure at the time that I write this. The last few tests were all very tricky, but this last one is turning out to be the worst of the bunch. I think we'll be able to beat it this weekend, however. After that, we're focusing on optimization until we're ready to merge the branch, probably soon after the 1.7 release.

So that's where things stand in terms of performance. Parrot has a lot of ground to recover since 0.9.0, but I am convinced that there are plenty of opportunities to do just that. Some of it will be algorithmic improvements, some will be small intelligent tweaks to individual functions and datastructures, and some will be real low-level bit-twiddly nonsense. However it happens, we really need to make Parrot faster, and I have very high confidence that we will be able to do just that.

Tuesday, October 13, 2009

Matrixy: Moving Forward

The world of Parrot is an ever-moving target, but my poor compiler project Matrixy has been lying fallow for several months. Matrixy, for those who haven't seen me write about it before, is an implementation of the M language (MATLAB/Octave, not the new, weird .NET language). MATLAB/Octave is the standard language of choice for many branches of science and engineering. I have never seen another language, except maybe R, that is so catered to the tasks of design and simulation, and it has some of the best integrated mathematics support I have ever seen.

Matrixy History

I started what is now called Matrixy about two years ago, when I was still in grad school. Part of my thesis involved writing an assembler in MATLAB for a new processor architecture I was designing in the visual programming environment Simulink. Since I was just getting involved in Parrot at the time (and was reading the classic Dragon Book cover-to-cover in my leisure time), I decided an M compiler would be a great project to start. I put it on hold during GSoC and immediately thereafter, when I was still buried knee-deep in Parrot core internals work. Luckily the idea was rescued by a guy named Blair, who set up a project on Googlecode, and together we made a lot of progress implementing the syntax and semantics of the core language. Part of the standard M package is a huge list of built-in functions, something we didn't even attempt to reproduce.

Blair moved on to other projects, and I haven't turned back to Matrixy in months. It's sat completely dormant for months now, and needs a breath of life pumped into it. I can't think of a better time to do that either: With PCC refactors landing soon (I hope) and Plumage taking off, there is plenty of opportunity to get Matrixy up and running again. Here's a list of things I would like to do with the project, and I am looking for helpers and volunteers who would like to get involved.

Linear Algebra Libraries

Integral to M is comprehensive linear algebra support. This is what makes it such a great tool for engineers to use. To implement these features, we've written bindings for Parrot to the CLAPACK and CBLAS linear algebra libraries. With Plumage quickly growing into a great ecosystem project, I would like to separate out the CBLAS and CLAPACK bindings into a separate "Linear Algebra Pack" to be downloaded through Plumage.

This is going to require some work, however. CBLAS and CLAPACK have several functions with very exotic function signatures, and currently we would need to recompile Parrot to include them. Once Parrot has a proper NCI call frame builder, we won't need to worry about that.

Separating our libraries out into separate projects is going to help other languages and projects on Parrot that want to use them. However, it's going to require adding non-local dependencies to Matrixy's makefile. A small price to pay for the greater good.

Proper Matrix Data Types

One thing that M requires is a native matrix data type. We've managed to get along so far with an ad hoc system of nested ResizablePMCArray objects, but that's a terrible (and slow!) solution. Better would be to write a good set of dedicated PMC types that implement a 2-D floating point matrix or vector. Having that PMC type support a sparse mode, or having a separate sparse array type, would be fantastic too. If we included these PMC types with the "Linear Algebra Pack" that I talked about above, we would be able to quickly integrate robust linear algebra support into any language or project running on top of Parrot.

Imagine that in a few months' time you could be using Perl6 or PHP or TCL to manipulate numerical matrices and perform complex linear algebra on them at blinding native-code speeds!

Make Matrixy Installable

Matrixy currently doesn't work with an installable Parrot. In fact, Matrixy isn't really installable by itself either. We need to fix these things.

Matrixy needs to work against an installed Parrot so we can start properly tracking point releases instead of SVN head. Of course I always expect there to be exceptions to the rule, but it's a good policy to get into the swing of.

Matrixy also needs to be able to be packaged up and installed, especially through Plumage.

Move Matrixy to Github

Actually, this is something I think I can do on my own. I don't have any problem with Googlecode, but I feel the sea change pulling us towards Git and I would like to host at least one project through Git so I can start getting good with the tool. Github seems like a great place and I really like the social atmosphere.

Conclusion

Obviously there is a lot of primary development to be done too. We're a long way away from even having the complete syntax working, much less the semantics and the gigantic runtime library. However, with a few small changes Matrixy could be well along its way to being not only a first-class compiler project on Parrot, but a contributing force to other language projects as well.

If you have any knowledge about working with an installed Parrot, or working with Plumage, I would love to hear from you! Once I get things moved up to Github, I'll have plenty of commit bits to hand out to interested contributors.

Saturday, October 10, 2009

PCC Hackathon

I'm busy moving today, so I scheduled a blog post ahead of time to help spur the troops on. PCC is a big deal; last weekend we kicked some serious ass as a design team. People were developing, debugging, testing, talking. It was awesome.

This weekend, let's take some advice from one of my favorite internet memes, Courage Wolf:



PCC is a big deal, probably one of the biggest. I can't think of any system that is more central to normal operations in Parrot, except the runcores (and they're much simpler). We've got a big task if we want to completely rewrite PCC from the ground up, not only without losing any functionality, but actually adding new features in a sane way. It's big. It's more than we can chew, but we're going to chew it anyway. I have all the faith in the world in our development team.

The big question is whether we can chew it before the release. I sincerely hope so, and I know a lot of other people hope so too.

Thursday, October 8, 2009

Hackathon Schedule

This week at the #parrotsketch meeting we decided a few things:
  1. We need to keep up momentum on the PCC work. So, we're doing the hackathon thing again this weekend
  2. Hackathons are good things, but they need to be used sparingly.
We can't just hackathon all the time, and the fact that we are trying to do it two weekends in a row is more a testament to the extreme necessity of these refactors. We want it and we're going to make an extended push to get it.

Conversely, if every day is a hackathon, no day is. We need to do it infrequently enough that people don't get used to it and tired of it. However, we need to have them frequently enough that we get good focus on big projects and develop good momentum. After this cycle, we're looking to do about one hackathon per month, depending on need and interest. More need means we do it a little more frequently, less interest means we do it a little less frequently. We've got plans penciled in for the next two months, ideas after that would be welcome.

This is the month of PCC. Next month we're looking at PIRC. It's not a huge necessity for us since IMCC still "just works", but then again there is a painfully small bus number for that system and we would like to get knowledge and interest higher.

The month after that I think it's going to be time for JIT. I don't think I am going to wait for a random hackathon in two months to get started, but hopefully we can lay enough groundwork for other developers to really latch onto and get going. Working in a void is hard, working on a pre-made foundation is much easier.

Besides development hackathons, we also want to do testing/debugging hackathons. In the historical notes I think these used to be called "bug days", and they tended to be held the Saturday before a release. If we could make that a regular occurrence again, it would be a great thing for Parrot. A day dedicated to closing tickets, tracking down bugs, submitting smoke reports, and getting Parrot building and passing tests on various platforms would be a huge benefit. It would also get us into a better habit of keeping things clean and stable before the release, which is something we probably should be better about.

I'm moving to a new apartment this weekend, so likely won't be able to participate in most of the hackathoning. I will check in as often as possible.

Tuesday, October 6, 2009

Closing the Hotmail Account

I haven't used it in months really, but I've been keeping my hotmail account around because I still occasionally got mail sent there from people who haven't updated their contacts. Well, today I finally got a good reason to just go ahead and close the account. So I did that.

Monday, October 5, 2009

PCC Hackathon Day!

It was only decided at the #parrotsketch meeting on Tuesday, so there wasn't a lot of advance notice. But we decided that on Saturday we were going to have a proper, focused hackathon on the PCC refactors. It was supposed to be an event that lasted all day on Saturday, but it ended up going from Friday night through the entire weekend. The momentum we've built up is huge, and even though we haven't finished the task yet I have very high hopes that we will be finished quite soon. I personally would like to see this branch land by 1.7, but I don't want to make any promises.

Mission Recap

In current Parrot trunk, argument passing happens in a variety of ways. Predominantly, there are two: calls made from PIR and calls made from C. Calls made from PIR store hardcoded lists of arguments, as register indices, in the compiled opcode stream. When a function fills its parameter list, it looks through the opcode stream for those indices and uses them to look up register values in the caller's context. These are processed and massaged to fit into the various parameters that the function defines.

The second way to make a call is from C. In this case, the arguments are passed as a variadic argument list, and passed around as a pointer to that list through the various processing functions. Finally, values are extracted from there and processed into the callee's context parameters.

This means we need six functions (or rather, six families of functions): pack arguments into a variadic argument list pointer, pack arguments into an opcode stream, unpack variadic argument lists into C variables, unpack variadic argument lists into a PIR context, unpack an opcode stream into C variables, and unpack an opcode stream into a PIR context. And encapsulation is broken all over the place, because the callee needs to be aware of how it was called.

And this only covers passing arguments to a function, not the act of returning values from that function (which is similar, but currently distinct). In short, the calling conventions are messy and desperately need a refactor if we want to do any amount of new development on them.

Current Refactors

The goal of this current work is to unify the code paths for both types of invocation. In both cases we extract arguments (from the opcode stream or from the variadic argument list) into a new PMC called a CallSignature. The CallSignature contains information about the call, the list of arguments, etc. By marshalling everything through this new abstraction layer, we can make a call from anywhere to anywhere without either side having to be aware of the details. All anyone needs to worry about is how to put arguments into the CallSignature, and how to get them back out again.
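As a rough, hypothetical C sketch of the shape of that abstraction (the real CallSignature is a PMC with far more machinery than this): the caller packs arguments into one structure, the callee pulls them back out, and neither side cares whether the call originated in PIR or in C.

/* Hypothetical illustration only, not Parrot's actual CallSignature PMC. */
typedef enum { ARG_INT, ARG_NUM, ARG_STR, ARG_PMC } arg_kind;

typedef struct {
    arg_kind    kind;
    const char *name;     /* NULL for positional arguments */
    union { long i; double n; const char *s; void *p; } val;
} call_arg;

typedef struct {
    int       n_args;
    call_arg *args;       /* positional and named arguments together */
    int       n_results;
    call_arg *results;    /* return values travel back the same way */
} call_signature;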

I've mentioned these refactors several times on my blog in the past. They've been in progress for several months now and needed something like this hackathon to really get moving.

Argument Processing Algorithms

One of the problems we were having over the weekend is that the algorithm for implementing the argument processing mechanism is very complex. Parrot supports a wide variety of argument and parameter options:
  1. Normal positional arguments. Positional arguments must be passed in order and all must be accounted for without overflow or underflow.
  2. Optional arguments, which might or might not be provided without causing an error
  3. "opt_flag" arguments, which are coupled with an optional argument and provide a boolean value whether the previous optional was supplied or not.
  4. Named arguments, which can come in any order, and which use a string name to identify them.
  5. Slurpy arguments, which take a variadic list of positional args and combine them into a single array PMC
  6. Named slurpy arguments, which are similar to normal slurpies except they group all extra named parameters into a hash.
And it turns out that supporting all these things in a sane way is not trivial. I've put outlines of potential algorithms on the wiki, so the reader can get an idea about how complex these algorithms really are. The code to implement these is much worse, of course.

Current Status

The branch is currently down to a very manageable number of test failures, most of them stemming from a few issues requiring primary development. Specifically, :named and :slurpy return values aren't supported, which produces several failures and prevents PCT from building. Last I checked there were no more segfaults or prematurely aborted test files. The only failure I couldn't explain was in the subroutine tests dealing with IMCC. I'll look into those a little more too, if they haven't disappeared by now. Of course, once we get all the core test files working, we're going to want to get the HLLs building and testing on the branch, which could expose a whole new world of failures.

Parrot has a great team of developers, and when they got motivated to work on a single task we made some awesome progress. Hopefully we can keep the momentum going and get this refactor locked down and merged soon.

Thursday, October 1, 2009

libJIT NCI Frame Builder Branch

I alluded to it earlier, but today is the big announcement: Newcomer Peter Lobsinger has developed a patchset to replace Parrot's ad hoc NCI frame builder with a libJIT-based solution instead, and sent it to me as part of my JIT Challenge. Well, today he set up a branch of Parrot on github with the patch for public review. I recommend that all Parrot developers check it out to see what kinds of things are possible in a libJIT-integrated Parrot.