Blog Closed

This blog has moved to Github. This page will not be updated and is not open for comments. Please go to the new site for updated content.

Thursday, November 12, 2009

String Handling in M

I've been doing a lot of work recently on the pla_integration branch for Matrixy. The goals of this branch (which is likely to just become the new master) are to integrate Matrixy with the Parrot-Linear-Algebra project, and use it's new matrix types instead of home-brewing our own types.

I'm already seeing some good results: I've got lots of important tests passing and performance seems to be nice (though startup time and PCT processing time overall are worse, but that's another issue for another day). However, the one area where I am still having a lot of problems is in fixing the handling of strings.

Strings in M are very idiosyncratic. This is especially true when we start mixing strings with matrices. One good thing I am finding is that the various idiosyncracies and even--dare I say--inconsistencies help give a certain insight into the way Octave does it's parsing. We should be able to take those insights and try to produce a sane and compliant parser for Matrixy. Best way to proceed is through a series of examples. As a reminder about M, lines not terminated by a semicolon print their results to the screen, and the % sign is the comment delimiter. I'll show the output of each line in comments below it:

x = ["ok";"w00t"]
% ok
% w00t

Here is a very simple case. We have a matrix with two rows, each row contains a string. When printed, each row of the matrix is printed on a newline. One thing to notice and remember for later is that the strings on these two lines are not the same lengths.

x = ["ok"; 65, 66]
% ok
% AB

M has a lot of close ties to Fortran, which is what the original version of Matlab was developed in. In later times I believe it was ported to C and then to Java, taking characteristics of each implementation language along the way. In any case, there are some very obvious influences from Fortan and C on the M language. One such influence is that strings are simply arrays of characters. When we are printing out a matrix that has some kind of internal "I am a matrix of strings" flag set, integer values are converted to their ASCII equivalents and treated as characters.

x = ["ok"; 65, 66, 67]
% "Error: number of columns must match (3 != 2)"

Here is a slightly strange result. In the first example I showed, we can have a matrix where the string literals on each row are different lengths. In the second example I showed that we can treat integers as characters in forming string literals inside a matrix. However here we see a surprising result: If we try to make a row of all integers and a row that is a string, they must have the same length.

x = ["A", 66; "C", 67]
% AB
% CD

Mixing integers and strings on a single row works.

x = ["A", 66; "C", 68, 69]
% "Error: number of columns must match (3 != 2)"

...but the rows must be the same lengths when we build them like this.

x = ["ABC"; "D"; "E", 70]
% "Error: number of columns must match (1 != 3)

And we can see from this error message that suddenly line 2 ("D") throws an error because it's not the same length as line 1 ("ABC"), even though this would have worked if we hadn't included line 3 ("E", 70). As a more complicated example, and to clarify how strings of uneven lengths are stored, see this example:

x = ["ABCDE"; "F"];
x(2, 5) = "G";
x
% ABCDE
% F G

So we can see from here that strings aren't inserted into matrices with arbitrary lengths, they are padded out to be the same length with spaces. Finally:

foo = "Awesome";
x = [foo; 65]
% "Error: number of columns must mach (1 != 7)

So we can see that the checks for these matrix sizings are happening at runtime, not at parse time. (This small example could be explained away by aggressive constant propagation in the optimizer, but I will assure the reader that this holds true in "real" cases as well).

We can divine a few parser rules from all this:
  1. If we have strings in the matrix, we flag the matrix as being a stringified one and print it as a string. This means converting any integers in the matrix to characters in the row-string.
  2. If we have integer or number literals in the matrix, even if they can be converted to ASCII characters, the rows of the matrix must have the same lengths.
  3. Judging from the third-to-last example, they appear to do these length checks and string checks on the matrix after parsing is completed (otherwise, why would it have errored on lines 1 and 2 not being equal length when it didn't see the integer until line 3?).
  4. These checks happen at runtime.
What I think we need to do for parsing these literals is the following:
  1. We parse each row separately, and pass them to a row object constructor at runtime.
  2. If the row contains any strings, set the "has strings" flag. If the row contains any numbers, set the "has numbers" flag.
  3. We pass all row objects to a matrix constructor
  4. If all rows are strings, and no rows have numbers, pad the strings with spaces and insert them into a character matrix (like a normal matrix, but defaults to printing like an array of ASCII characters). Done.
  5. Check row lengths. If all are not the same at this point, throw an error. Done.
  6. If any rows contain strings, create a new character matrix object and populate it with all data, no padding. Done.
  7. If rows only contain numbers, create a normal NumMatrix2D PMC, populate it and return that. Done.
As an aside, there's another example that is worth showing:

x = ["ABC"; 68.1, 69.2, 70.3]
% "ABC"
% "DEF"
x(2, 2)
% ans = E

We see here that floating-point numbers are converted to ASCII integers when inserted into the character matrix, and that rounding sticks: you can't get the original non-integral value back after the conversion. So all my ideas above with the character matrix type should work with this.

So that's what I think we're going to have to do if we want to faithfully reproduce the behavior of Octave. This system will make matrix literals in the code a bit of a performance drain, but that's what we're going to have to live with for now.

2 comments:

  1. Matrixy will be PDL( http://pdl.perl.org/ ) ?

    ReplyDelete
  2. Similar concept, but not the same. PDL is a perlish interpretation of numerical processing software like Matlab/Octave. Matrixy instead tries to be a compiler for the M language (which is definitely not Perl!)

    So yes, same capabilities as PDL, but different syntax. Plus, Matrixy runs on top of Parrot, not Perl5.

    ReplyDelete

Note: Only a member of this blog may post a comment.