I wanted to write a parser for a key-value text based data format recently. Something similar to JSON and KeyValues, but easier to write than JSON, and with more features than KeyValues (mainly arrays).
The format looks like this:
// Keys can have multiple values in parenthesis
one(two "three")
// or spread out over many statements
"one" four
one five
one
six
// Keys can have no value and a block after them
one
{
// Blocks contain more key-value pairs
two three
}
one
{
two("three" four five)
}
// Keys can have many values and a block after them
one(two three four)
{
five six
seven eight
}
// Blocks can be arbitrarily nested
one(two)
{
three(four)
{
five(six)
{
seven eight
}
}
nine(ten eleven)
}
Ozxy and I have named this the Quick Data Format, or QDF.
We had a little competition to see who could write the smallest (fewest characters) parser for this format.
After a few days work I have landed on this:
#define e exit(1);
#define R return
#define M calloc(1,24)
#define W while(
#define D R strndup(s,d);
#define I if(
#define Z >255){P v=M;v[1]=t;
typedef char**P,C;C*w;C*n(){C*s,c,d;W(c=*w)-1u<32)w++;
I!c||strchr("(){}",c)){w++;R c;}I c==47){W*(w++)-10);R
n();}I c==34){w++;s=w;W*w-34)w++;d=w-s;w++;D}s=w;W(c=*
w)-48u<75)w++;d=w-s;D}C*q(){C*f=w,*t=n();w=f;R t;}P m(
);P p(){C*t=n(),*k;P _=M,l=_;I t<255)e _[1]=t;k=w;t=n(
);I t==40){W(t=n())Z l=*l=v;}}else I t Z*_=v;}else I t
=='{')w=k;I q()=='{'){n();_[2]=m();n();}R _;}P m(){P o
=0,l,t;W q()>255){I!o)l=o=M;else{P r=M;l=*l=r;}l[1]=p(
);}R o;}
This code is 590 characters (newlines added for the reader’s sanity). My first version of this code (for the first draft of this post) was 1048 characters. That’s a 458 character improvement, almost half the original size!.
The code is now small enough to fit in an ethernet packet twice, and almost small enough to fit in the MBR boot sector of a floppy disk.
The reason the code has shrunk so much is almost entirely thanks to competing with Ozxy’s parser. After a long and painful few days debugging our parsers we have both ended up at the same character count.
In this post I will explain the design of my parser, explain how my code works, and some of the tricks I used (or copied from Ozxy) that helped me save space.
The code starts off with some #defines
which I use to save space. These are just commonly used functions and expressions.
There’s one weird looking #define
here, Z
. I’ll get into that later. The rest should look simple enough.
I use calloc()
instead of malloc()
because it zeroes the memory it allocates. This is useful since my main data structure just consists of three pointers, and it saves space if they are initialized to 0.
I allocate 24 bytes because the tree data structure (described below) consists of three pointers, so on 64 bit systems, 3 * 8 = 24 bytes total.
#define e exit(1); // Used to report an error
#define R return
#define M calloc(1,24)
#define W while(
#define D R strndup(s,d); // Copy the first d bytes of string s and return
#define I if(
#define Z >255){P v=M;v[1]=t;
Next I define some data types for ease of use. Earlier versions of this code used a struct (and even a union!) but that was all cut to save code.
P
here is the main data type of my parser. It represents three pointers. Since this is C, they’re all implicitly cast to a char *
when assigning to the type.
P
represents a tree with a two branches and a value. Internally it is equivalent to an array of three char pointers. Note that branches and value don’t have to be strings, in fact they usually aren’t. C will implicitly cast any pointers assigned to indices of a P
to char *
, so type safety is thrown out to save space.
P[0]
represents the right branch of the tree (rather confusingly, I know, this is a relic from when P
was a struct); P[1]
represents the value, and P[2]
represents the left branch.
The global variable w
is the input buffer.
typedef char **P, C;
C *w;
The Lexer
The first function n()
is the lexer. It skips any whitespace, returns a special character if it finds it, and then parses a quoted or unquoted string.
On line 19 I use -1u < 32
to check if the character is a white space character. Any character less than 32 other than 0 is considered white space. This uses an integer underflow instead of two comparisons because it saves a few chars. 0 - 1u
will under flow up to 255, while any character from 1 to 33 will then be less than 32. This technique helps save some space in the code.
I use a char *
as the basic token type in my parser. I consider any token less than 255 to be a special character, and anything above that a string. C allows treating pointers as integers (with some warnings), and this saves space.
C *n()
{
C *s, // Used to store the start of a string
c, // Stores the current char
d; // Stores the length of the string
// While the character is greater than 0 and less than 33,
// step forward in the input (skips whitespace)
W (c = *w) -1u < 32)
w++;
// If this is the end of the input, or the input is a special
// character, return it verbatim
I !c || strchr("(){}",c))
{
w++;
R c;
}
// If the input starts with a comment, continue to the end of
// the line
I c == 47)
{
W *(w++) - 10)
;
R n();
}
// If the input starts with a quote ("), save the start of the
// string, and continue until the closing quote. Then, copy
// the string and return it.
I c == 34)
{
w++; // skip the "
s = w;
// Continue until the end of the string
W *w - 34)
w++;
// d is used by the D macro to know how long the string is
d = w - s;
w++;
D
}
// Otherwise, assume the string is unquoted and store its start
s = w;
// Again, integer underflow. Anything from 48 (ascii 0) to 123
// (48 + 75, ascii z) is considered part of the string
W (c = *w) -48u < 75)
w++;
d = w - s;
D
}
The q()
function is simple, it just peeks at the next token in the input. Because the input stream is just a char *
that the lexer increments, saving the parser state is trivial.
C *q()
{
C *f = w, // store the old state
*t = n(); // parse the next token
w = f; // restore the old state
R t;
}
The Parser
The next part starts with a forward declaration of m()
, which parses many key-value pairs, either in a block or at the top level.
The p()
function parses a key-value pair. It starts by reading a token from the input (the key), and saving the position after the key. It then allocates a tree (M
) and stores it in _
and l
.
If the key is not a string (< 255
) it throws an error. Then it saves the string as the value of _
(remember, [1]
is the value).
It then saves the position of the parser after the key, and reads another token. If the token is a (
it continues reading strings from the token stream, storing them as the right branch of the last tree (l
), and setting them as the last tree.
This part can be a little confusing. Remember that I
starts an if statement, W
starts a while loop, and Z
expands to >255){P v=M;v[1]=t;
.
P m();
P p()
{
C *t = n(), // last token
*k; // save state
P _ = M, // root tree
l = _; // last tree
// throw an error if the key is not a string
I t < 255)
e
// set the value [1] of the root tree to the key
_[1] = t;
k = w;
t = n(); // read the next token
// if the token is a (
I t == 40)
{
// Read the next token, store it in t, then, if it is greater
// than 255, allocate a tree `v` and set its value to the
// token.
// Remember, `Z` expands to >255){P v=M;v[1]=t;
W (t = n()) Z
// Store the new tree as the right branch ([0] or *) of
// the last tree, then set it as the last tree.
// This is evaluated right to left.
l = *l = v;
}
}
// Same as in the while loop above, if the value is a string,
// allocate a new tree, set its value to the string, and set it
// as the right branch of the root tree. No need to use `l` here
// since there will only be one value.
else I t Z
*_ = v;
}
// If the token after the key is a `{` then there is no value.
// Reset the input to the position after the key to prepare
// for the next part.
else I t == '{')
w = k;
// If there is a block after the key-value pair, skip the opening
// {, and set the left branch ([2]) of the root tree to the content
// of the block. Then skip the closing }
I q() == '{')
{
n();
_[2] = m();
n();
}
// Finally, return the root tree.
R _;
}
The last function m()
is pretty simple. Here o
is the root tree, l
is the last tree, and t
is the current token. o
is initialized to 0 to ensure that an empty block ({}
) will return null.
P m()
{
P o = 0, // root tree
l, // last tree
t; // current token (unused but I don't want to change the code)
// While the input starts with a string
W q() > 255)
{
// If the root node is null, allocate a tree and set l to
// point to the root.
I !o)
l = o = M;
else
{
// Otherwise allocate a new tree
P r = M;
// Set the right branch of l to point to the new tree,
// then set l to point to the root tree.
// Remember, this evaluates right to left.
l = *l = r;
}
// Set the value of the last tree to the key-value pair
// parsed by p().
l[1]=p();
}
// Return the root tree.
R o;
}
Using the Parser
I also wrote some helper functions to help visualize the parsed data. This can also give you an idea of how to actually use the tree structure to query what was parsed.
#include "min.c"
// [0] == r
// [1] == v
// [2] == l
// Display a token
void dt(C *t)
{
if (t>255)
printf("%s\n", t);
else
printf("%c\n", (C)t);
}
// Display a pair
void dp(P v, int i)
{
for (int j = 0; j < i; j++)
printf(" ");
printf("%s( ", v[1]);
for (v = v[0]; v; v = v[0])
{
printf("'%s' ", v[1]);
}
printf(")\n");
}
// Display many pairs
void dm(P p, int i)
{
for (; p; p = p[0]) {
P v = p[1];
dp(v, i);
if (v[2])
dm(v[2], i + 1);
}
}
int main(int argc, char **argv)
{
// Test bench
w = "one(two\"three\")\"one\"two one two\tone\ttwo//com\n"
"//\r\n"
"one//\n"
"two\n"
"one{two three}\tone{two\"three\"four five}"
"one(two three four){five six seven eight}one//\n"
"(//\n"
"two//\n"
"three//\n"
"four//\n"
"){//\n"
"five//\n"
"six//\n"
"seven//\n"
"eight//\n"
"}one(two){three(four){five(six){seven eight}}nine(ten eleven)}//";
dm(m(), 0);
}
When run with the string shown above, the test code will output the following:
one( 'two' 'three' )
one( 'two' )
one( 'two' )
one( 'two' )
one( 'two' )
one( )
two( 'three' )
one( )
two( 'three' )
four( 'five' )
one( 'two' 'three' 'four' )
five( 'six' )
seven( 'eight' )
one( 'two' 'three' 'four' )
five( 'six' )
seven( 'eight' )
one( 'two' )
three( 'four' )
five( 'six' )
seven( 'eight' )
nine( 'ten' 'eleven' )
The parser is far from practical. The error handling isn’t very useful, and the tree structure is far from efficient. But it works, and it’s quite small so I’m pretty happy with it.