Comments on Whiteknight's World: L1: The Implementation

skids (2009-06-19 20:11):

I'll look forward to the next post, then.

(BTW, the XOR trick takes 3 ops, and, unless your writes are guaranteed atomic and SMP caches are coherent, and things do not blow up if both locations hold the same value, you still have to worry about concurrency when using the temp-variable approach. If you can guarantee the temp is a register and doesn't cause a spill, using a temp is the better approach.)

Whiteknight (2009-06-19 19:55):

Thanks for the explanation; it makes a lot more sense to me now where you are coming from. I actually did EE in school, but did coding in my spare time at home in front of the TV, so I know exactly where you are coming from!

Parrot is at a very interesting development stage right now. First, the majority of its major systems are at least prototyped, if not at some level of stability and maturity. However, several systems are still in need of major cleanup, improvement, and basic implementation.

Because a lot of the systems are implemented now as initial prototypes, there is massive room for optimization at the function level and at the architectural level. Some systems, while functional, are in such a deplorable state that the only path forward is complete reimplementation. JIT is one of those systems. GC is another. Both of these are on-the-metal systems where optimizations would have a major effect on Parrot performance.

L1, while a big project, is actually going to occupy a relatively small niche in Parrot land. I'll try to put together a new blog post tonight to really explain what it will do from a perspective that I think you will want to see.
What L1 is mostly going to affect is the front-end compiler (which will be modified to output L1 streams instead of PBC streams) and the central execution core (which is about a five-line function). All the core systems are basically going to be unaffected by the transition, so work on them should be fine.

skids (2009-06-19 19:35):

Well, here's the perspective I am coming from.

About half a year ago I took a read through the Parrot source. Some of it I understood, some of it not so much. I don't program for a living, usually. My degree is in Electrical and Computer Systems Engineering, not CS. I never went into the CSE field professionally; opportunity led me into WAN/LAN. But still, programmatically I think down to the wire. When I see code I see the bits going from here to there, gears meshing, valves valvifying, whatnot.

When I hear "multiple inheritance with parametric roles" my eyes glaze over. I realize that to some developers such features are incredibly important. I'll use those features from time to time when it helps, and study up on them when I need to mod something that uses them, but they aren't critical to me. For sheer lack of time, I'll probably never write something so complicated that they'll be critical to code management.

I'd place good odds that a majority of potential Parrot contributors who are not contributing are in a similar situation -- they really want Parrot and/or Rakudo for one specific set of features.
They could take or leave 90% of the rest, and they don't have three solid weeks to ferret out an understanding of the entire system and the development disposition from disparate sources.

I write two types of things on the job -- scripts in high-level languages, and very low-level API bit shuffling, usually involving network packets.

Data structures, those I can understand. So I read into hash.c and the string class. I see in there that a lot of streamlining could happen. I nopasted some observations to IRC, and nobody told me I was delusional, so I figured, hey, maybe I'll try to hand-code some functions to see if we can calculate hashes at memory-line speed when doing a long compare. And if that works out, and someone likes the sample and maybe it even gets in, I could do some other architectures.

Now, I've had about six false starts trying to be productive on Rakudo and two on Parrot before this. I'm thinking to myself, this just might be a low-level enough job that I could tuck myself in a corner, not bother anyone, and just crank out a bunch of optimized functions while watching TV every afternoon after work.

Perfect. Just the sort of thing I actually enjoy coding. Finally, even if it's a modest performance gain, no more dealing with a shifting landscape; here's something I can zoom in on.

But then, just when I'm starting to talk myself into it, there's now this mystery thing on the horizon, called L1, that looks like it has the potential to render that work useless.
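[Ed.: the "calculate hashes while doing a long compare" idea could look something like the sketch below. It is purely illustrative -- the function name and the choice of FNV-1a are my own assumptions, not what Parrot's hash.c actually does.]

```c
#include <stddef.h>
#include <stdint.h>

/* Compare two buffers byte-by-byte while accumulating an FNV-1a hash
 * of `a` in the same pass, so the hash comes "for free" alongside the
 * compare. Returns 1 if the buffers are equal, 0 otherwise. */
static int compare_and_hash(const uint8_t *a, const uint8_t *b,
                            size_t len, uint32_t *hash_out)
{
    uint32_t hash = 2166136261u;           /* FNV-1a 32-bit offset basis */
    int equal = 1;
    for (size_t i = 0; i < len; i++) {
        equal &= (a[i] == b[i]);           /* branch-free equality accumulation */
        hash = (hash ^ a[i]) * 16777619u;  /* FNV-1a 32-bit prime */
    }
    *hash_out = hash;
    return equal;
}
```

A real version aimed at "memory-line speed" would process a word at a time rather than a byte at a time, but the single-pass structure is the point here.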
And I'm trying to determine A) if the operations I was thinking of working on are low-level enough that they would still be in C, and B) if I'm going to be able to use the vector processor, or if it is going to be reserved across the entire codebase because JIT might need it (even though JIT is only going to kick in in the middle of tight loops).

And that's why understanding the full implications of L1 is important to me.

Whiteknight (2009-06-19 18:43):

Please do poke this idea with a stick, and poke it rudely. I would be far less happy if my posting here didn't spark any questions, discussion, or argument.

I called the swap thing a premature optimization because it's not really necessary at this stage. In the broader sense, whether we implement swap as one machine instruction or three (two using the XOR trick is bad for reasons I could explain separately) is absolutely unimportant. swap isn't an instruction that is used very often, so optimizing it on some platforms is a very low priority.

You are right about making sure new users are able to get involved, and I think Parrot does a good job of it. What you talk about is also a big part of why I have this blog in the first place: keeping a written record of things, not only post facto but also throughout the planning and implementation stages. I like to write about things I am planning so that people (myself included) can get insight into what I have in mind.

Think about L1 as being a tool. It does add a layer of algorithmic complexity, but it actually provides a large amount of simplicity too. It's going to simplify the way the whole VM operates and the way all of its components are developed and maintained.
It will simplify the way individual components are updated, simplify the logic in our GC, simplify control flow, simplify our exceptions system, simplify our threading system, and the list goes on. We add a whole new layer to our little programming flowchart, and yet the whole thing gets less complicated, not more.

L1 is a long way away (at least in my estimation; it is notoriously difficult to estimate progress on volunteer-driven projects), and I will be doing a lot more blogging in the interim to try to express my vision for L1 and to make it clearer for other people to understand. If through all that discussion we decide that it's actually a bad idea, we won't implement it at all.

skids (2009-06-19 17:13):

Well, first, as to the philosophical stuff: obstructing optimization is about as "bad" as prematurely optimizing, and the reasons why are the same. One must understand precisely what those reasons are.

"Prematurely optimized" code requires the people working near it to pay mind to the optimization, and slows development due to attempts to remain compatible with the extra demands of the optimization while modifying behavior. This is just about the ONLY real reason it is bad.

If there were a function that nobody ever needed to maintain, because it's part of a very stable codebase, and someone came along and optimized it, then that's not "premature optimization" in the sense meant by the oft-quoted precept, even if that optimized function really isn't run enough to be in the 10% of the 90/10 rule.

If they did that when there were better things to be doing, that's a different matter, but we have to keep in mind that volunteer coder skill sets vary widely...
Not all developers are capable of grasping a project the size of Parrot, and they must restrict their attention to subsystems -- either that, or do no coding and just a lot of talking about coding.

On the other hand, erecting barriers to optimization has a price, too, and not just the frustration of developers waiting for their test suite to complete.

Code maintainability can suffer if there are walls beyond which only automated utilities dare tread. The custom automations become the thing catered to and worked around, instead of the optimizations. They become jargon within the codebase; new developers face a steeper learning curve, and the ability to work on parts of the system without having to develop an accurate "big picture" is diminished. When the core developers finally look up from their voodoo, they are speaking an entirely different language from anyone whose attention they might be able to get.

I'm not against L1, and I thank you greatly for trying to shed some light on it. I'm just not sold on it yet, so do pardon me if I poke it rudely with a stick. My attitude is that if it's another jargon, it had better damn well be worth the negative impact on accessibility to new blood, because I'm new blood, and I certainly don't feel like this project is something I could make a positive impact on without devoting way more spare time than most people have.

Now, on to a few specific points:

The number 1300 doesn't scare me, as long as there really are that many unique operations to perform. Modern CPUs are within an order of magnitude of that (even RISC CPUs, which sometimes exceed CISC CPUs), and when we talk about internal microcode, much of that is really just eliminating redundant logic used by multiple instructions.
There is still a lot of logic unique to many instructions; you just don't need the extra in-silicon copy of "and then byteswap your result" or whatnot.

I kind of EXPECT a higher-level VM to have more instructions (or MMD overloading achieving the same thing) because I expect to be able to do more with it.

While there are plenty of real-world success cases for JIT, just using it doesn't mean it will necessarily work. I can understand how L1 is trying to address that need by reining in the code to behave better on the back end.

But there are at least four other areas I view as much more important: concurrency and AIO; at least soft realtime; stable APIs for direct memory access (I guess this is called "C pointer support" in Parrot-speak); and, lest we forget, much better documentation and updated code comments that do not lead you on goose chases. You know, all that stuff core developers avoid doing by embarking on the next big adventure :-)

So maybe we could live without JIT? I mean, I know it's the "thing to have" because the JVM has it, but somehow, despite that, java_vm is the slowest pig of a process on my desktop... so...

Whiteknight (2009-06-19 15:35):

I don't think the swap example was a bad one, and I don't think we need to worry about optimizing it down. Premature optimization is a bad thing, and it's worth realizing that not all platforms where Parrot is supported or will be supported have a hardware swap. Remember, L1 ops need to be trivial to JIT, which means that they have to be very small and atomic. A swap operation, if a hardware swap isn't provided, is composed of at least two separate sub-operations.
So we have some cases where the swap L1 opcode would be far less trivial to JIT than in other cases.

We can leave it to the JIT engine to detect patterns like this and optimize them for each particular platform. Every JIT engine I've ever seen has options to specify the level of desired optimization, which is always a trade-off between initial compilation effort and execution efficiency. In many cases we won't want to take the performance hit of lengthy optimizations on the front end, when such a process could take longer than execution of the non-optimized program! It is well beyond the scope of Parrot to implement per-platform machine-code optimizations while also trying to maintain a programming environment that is supposed to be independent of platform differences. A script written in PIR should execute equivalently on all platforms where Parrot is compiled and built.

PIR doesn't really use opcode overloading; it's an illusion. Each operation has a unique long name, and the PIR compiler allows the user to use a non-unique "short name" if enough information is available at runtime to derive the long name from it. The number of PIR ops is damn near 1300 right now, although there are some efforts to eliminate a number of these and move several sets of them into non-core dynamic libraries.

"Breaking even with JIT" is a trade-off between the initial compilation overhead and the amount of time saved throughout execution. For small one-off programs in any language, JIT is rarely as quick as direct execution.
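[Ed.: the break-even trade-off can be put into back-of-envelope form. The helper below and its numbers are invented for illustration; they are not measurements of Parrot's JIT.]

```c
/* Back-of-envelope JIT break-even model, in arbitrary time units.
 * Interpreting n iterations costs n * t_interp.
 * JIT costs t_compile once up front, then n * t_jitted.
 * Setting the two totals equal:
 *   n * t_interp = t_compile + n * t_jitted
 *   =>  n = t_compile / (t_interp - t_jitted)
 * JIT only wins once the loop runs more than that many iterations. */
static double jit_breakeven_iterations(double t_compile, double t_interp,
                                       double t_jitted)
{
    return t_compile / (t_interp - t_jitted);
}
```

For example, with a one-time compile cost of 5000 units, 10 units per interpreted iteration, and 2 units per jitted iteration, JIT breaks even at 625 iterations; anything shorter-lived runs faster interpreted.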
For longer-lived programs, or programs that execute long loops, JIT is better because the per-iteration efficiency savings outweigh the initial compilation overhead over time.

skids (2009-06-19 15:18):

The swap example hits right at what I was talking about on the last post.

Some CPUs have a swap opcode. Were L1 not to have a swap opcode, you'd have to do it like the above (or use the XOR trick to avoid using a temporary variable, assuming there is an L1 XOR op).

In any case you are talking about invoking three things to do what can be done in one (often even atomic) operation on many CPUs.

I'll grant that it was probably a bad pick for an example, as there probably would be an L1 swap opcode. Or one could argue that JIT would notice and reduce the code -- but breaking even with JIT is basically a game of hoping enough logic gets strung end-to-end that the cost of doing the JIT doesn't exceed the cost of the operations it saves.

As to the number of opcodes needed, it is worth noting that the way PIR/PBC has managed to exercise control over the need to manage a large mapping of binary codes to operations is by overloading single opcodes on the type of the operands.

Anonymous (2009-06-19 15:17):

vtable[add] $1, $2, $3
or

vtable{add} $1, $2, $3

could be converted by the compiler to:

vtable_add($1, $2, $3)

vtable_add_ppp($1, $2, $3)

&func = find_vtable obj, "add"
result = call(&func, $1, $2, $3)

so I would recommend the former over the latter.

op swap(inout PMC, inout PMC) {
    move $.P0, $1    # move into lexical var
    move $1, $2
    move $2, $.P0
}
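[Ed.: for reference, the two swap strategies debated in this thread -- the temporary-variable version sketched just above and the XOR trick from skids' first comment -- look like this in plain C. This is a sketch of the generic techniques, not Parrot code, and it makes no concurrency guarantees either way.]

```c
/* Swap with a temporary: three moves, and the temp can live in a
 * register, so on most machines this is the cheaper, safer choice. */
static void swap_tmp(int *a, int *b)
{
    int tmp = *a;
    *a = *b;
    *b = tmp;
}

/* The XOR trick: also three ops, no temporary, but it silently zeroes
 * the value if both pointers alias the same location (x ^ x == 0),
 * which is one reason it is "bad for reasons I could explain
 * separately" above. */
static void swap_xor(int *a, int *b)
{
    if (a == b)
        return;  /* guard against the aliasing pitfall */
    *a ^= *b;
    *b ^= *a;
    *a ^= *b;
}
```

Without the aliasing guard, `swap_xor(&x, &x)` would set `x` to zero; the temp version has no such failure mode.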