Beginning parsers with PEG.js
PEG or Parsing Expression Grammars are similar to CFG (context-free grammars) with some modifications. There's an excellent library called PEG.js that I'm going to introduce here; I found it to be an amazing parser generator for most of my needs parsing simple languages. The most amazing thing with PEG.js is that it allows inline javascript to be used, so instead of just generating boring parse trees etc., you can even solve problems from within the parser.
I did write a long post before I wrote this paragraph. It was too technical, and I deleted it. I hope this version sticks. Also, this is my first post here, so any feedback would be really appreciated.
Getting ready
Instead of explaining, defining and going the proper route. I'd recommend you mosey over to the online peg.js runner and start blurting out some grammar.
You'll find on the left a text area, and just because you feel like following my lead, delete the entire text. Now you'll find that Peg.js didn't quite like this, and is giving us a warning below the text area. Lets fix that. Type in:
myspecialrule = ""
That would make the warning go away, but you're still getting another warning with the test input. For now, we're not doing anything, so lets just delete that too. We'll add our test input later on.
Ok, now that we've had a clean slate, time to think about what to make. For our very first experience, lets try and get a simple word counter going.
My first PEG.js grammar
Here's what we want: For the input 'this is my first post on coderwall', we should get 7 as the result. Now I know, I know, it's stupidly easy to do this in javascript, but bear with me.
Let's start with the inital rule. The first rule PEG.js encounters in your text is the 'main' rule. All parsing starts from there. So, first rule
wordCounter = word*
Now, if you're familiar with Regular Expressions, this looks a lot like them, but here word
represents another rule. For those who are unfamiliar with Regular Expressions, please do read up and learn how to use them, they're awesome. In Regular Expressions, a *
means that the preceding token (say in some case a*
, a
is the token) can be present 0 or many times within the input string (0-n). The change here is that word
is referencing another rule and PEG.js expects that to be defined, so, if you type this in, you'll find PEG.js complaining that it can't find the word
rule.
Lets make that rule, then. Right now we're not really caring about internationalization, so we'll make this really simple. (Last I checked, there wasn't any support for internationalization within PEG.js proper, but it can be done. If you want help with that, you can ask me or check out the examples in the peg.js source)
word = letter+
Again, like Regular Expressions, we're using the +
operator. The +
operator means the token has to be present atleast once but may be repeated many times (1 - n). Now we've solved the problem of the word
rule, but created another rule letter
that's not present. Moving on…
letter = [a-zA-Z0-9]
The []
operator basically means that any character inside the brackets can be matched. It's like a set. Anything within in matched. If you look closely, there's even more funky syntax. Within the set operator []
, a-z
would let PEG.js know (and most regular expression engines too, since this is also valid regular expression syntax) to interpolate the characters. So, in short, a-z
literally means characters from a to z.
Now we're getting the green signal that everything's fine. Lets add in a test input. Say 'Hello'. We were hoping magically to get 1. Alas, it isn't working like that, instead, Hello has been taken apart and that's the output. No good.
Lets add in some inline javascript here to count the words. Since we're talking about counting words, the only rule that knows how many words are matched is the wordCount
rule. Lets add some javascript to it.
To add javascript, append the expression with {}
containing whatever javascript you want. So, we modify wordCount
to be:
wordCount = word* { return 1; }
So, we're getting the desired result, but we're cheating. In order to count the words, we need to be able to reference it within the javascript. To reference something, we prefix it with a label like this:
wordCount = w:word* { return 1; }
The label:
syntax enables you to access the actual matched text within your javascript code. Pretty cool eh? Now to put it to good use.
wordCount = w:word* { return w.length; }
Tokens matched with *
or +
automatically become javascript arrays within the javascript code which is great.
You can see we're getting 1 as the result. Nice! However, if we add more words, you'll see we get an error. Upon closer inspection, you'll see that PEG.js was expecting the letter
rule, but instead was dumbfounded when it got a ' ' (space) instead. Since words are broken by spaces, we need to add that to our rule set. Again, the only rule dealing with whole words is wordCount
we need to modify it. But before, we need to actually add a rule that matches spaces. (We could've just added " " to the wordCount
rule, but in practice, extracting common snippets into rules makes the grammar more readable)
space = " "
and then we need to add it to the wordCount
rule. There's another trick coming up though. If we write,
wordCount = w:word* space { return w.length; }
it wont do anything other than expect a space after a word. Multiple words are still not being matched. To fix this, we need to explain our needs better to PEG.js. We don't wants many words and the input should end with a space, we need words seperated by spaces.
wordCount = w:(word space)* { return w.length; }
This gets us most of the way, but we're still seeing an error. PEG.js still expects that the last word be followed by a space. We don't need that, so to fix:
wordCount = w:(word space?)* { return w.length; }
The ?
operator here means that we expect the rule space
to be present sometimes, and not be present at other times (0-1). Now, try out the input, and you'll find that the parser is working as expected. Yay!
A little more complex example
Now that we're done, lets try to do something which would be a bit harder to do with plain javascript. Lets add a command system here. Say, if someone types in wc: Hello world
we should return 2. But if someone says lc: Hello world
we should return the number of letters in the text, in this case 10 (not 11, we're not counting spaces). Uptil now we've only dealt with simple rules for matching the same thing, but now we have to deal with a much more complex system. Here's the parser uptil now:
wordCount = w:(word space?)* {return w.length;}
word = letter+
letter = [a-zA-Z0-9]
space = " "
Let's add in a new top-level rule, that will parse the input string, detect which kind of operation to run on the remaining text and run that rule.
textWorks
= "wc:" space* wc:wordCount { return wc; }
/ "lc:" space* lc:letterCount { return lc; }
Whenever I work with parsers, I like to generate the top-most rules first, so that I can get an idea of how everything connects. The only new thing in the above syntax is the use of the /
operator. Think of the /
operator as an OR
operator. If the parser can't match the first option, it looks at the next option, and so on until either it finds the first option that matches, or none, in which case it goes back to the parent rule and repeats. If all routes are exhausted and the input string isn't matched fully, it returns and tells you where it failed. So in this case if we started an input text with 'pc:', the parser would fail. Lets make letterCount
next.
letterCount = w:(iw:word space? {return iw;})* {
var total = 0;
for (var i = 0; i < w.length; i++) {
total += w[i].length;
}
return total;
}
Now, this a bit more complex, but what it's essentially doing is
- Within
()
if there is more than one token, PEG.js converts it into an array of results, however, each()
can have its own inline javascript, so I used that to just return the result of theword
rule, ignoring the space rule's matched text. - iterating through the words matched, we keep counting the letters in each word and return the total
This works because we defined word
as letter+
which by default returns an array of matched letters.
Now, try it out. You'll see that when you type in 'wc: Hello World' it returns 2, and change 'wc' to 'lc' and you get 10.
There's still a lot of stuff I haven't covered here, but I hope I've given an idea about the power of PEG.js and how to get started with it. You can download the resulting parser from the online version after you're done making it. You can also download it using npm install -g pegjs
if you have node. You could also run the parser within the browser at runtime, but I think that has limited uses. Just precompile the parser for better performance, unless you plan on dynamically generating the grammar itself.
Links
PEG.js
PEG.js Online Version
PEG.js Documentation
Github Project
PEG.js is developed by David Majda (@dmajda).
Written by Mutahhir
Related protips
7 Responses
Thanks a bunch for this!
@alpacaaa Glad you found it useful!
Very insightful example! Thanks!
Very nice, thanks for writing this up
well explained.
i have a question though (i think it is tricky, i am missing something!):
in the grammar part, i wrote this:
rootexpression = simplecondition
simplecondition = analogcondition / timecondition
analogcondition = ws (!analogoperator inputname)+ ws analogoperator ws analogvalue ws
timecondition = ws "time" ws timeoperator ws timevalue ws
analogoperator = "=="
timeoperator = "=="
analogvalue = integer
timevalue = digit digit ":" digit digit ":" digit digit
ws "whitespace" = [ \t\n\r]*
integer = digit+
digit = [0-9]
input_name = alphanum
alphanum = [a-zA-Z0-9]
and in the test part - i just write this: time == 10:10:00
for some reason, the parser wont recognize it, and it should.
if i write this: time == 10
it will accept it - as it matches "analogcondition"
how can i make it recognize the timecondition as well?!
(i know it should recognize the timecondition because when i remove the "analogcondition" from the simple condition rule - it matches).
Thanks,
Jim.
Nice intro. Thanks!
Wouldn't letterCounter = l:(letter space?)* { return l.length; }
be a more concise way to make that rule?