Last Updated: May 11, 2020

Beginning parsers with PEG.js

PEG or Parsing Expression Grammars are similar to CFG (context-free grammars) with some modifications. There's an excellent library called PEG.js that I'm going to introduce here; I found it to be an amazing parser generator for most of my needs parsing simple languages. The most amazing thing with PEG.js is that it allows inline javascript to be used, so instead of just generating boring parse trees etc., you can even solve problems from within the parser.

I did write a long post before I wrote this paragraph. It was too technical, and I deleted it. I hope this version sticks. Also, this is my first post here, so any feedback would be really appreciated.

Getting ready

Instead of explaining, defining and going the proper route. I'd recommend you mosey over to the online peg.js runner and start blurting out some grammar.

You'll find on the left a text area, and just because you feel like following my lead, delete the entire text. Now you'll find that Peg.js didn't quite like this, and is giving us a warning below the text area. Lets fix that. Type in:

myspecialrule = ""

That would make the warning go away, but you're still getting another warning with the test input. For now, we're not doing anything, so lets just delete that too. We'll add our test input later on.

Ok, now that we've had a clean slate, time to think about what to make. For our very first experience, lets try and get a simple word counter going.

My first PEG.js grammar

Here's what we want: For the input 'this is my first post on coderwall', we should get 7 as the result. Now I know, I know, it's stupidly easy to do this in javascript, but bear with me.

Let's start with the inital rule. The first rule PEG.js encounters in your text is the 'main' rule. All parsing starts from there. So, first rule

wordCounter = word*

Now, if you're familiar with Regular Expressions, this looks a lot like them, but here word represents another rule. For those who are unfamiliar with Regular Expressions, please do read up and learn how to use them, they're awesome. In Regular Expressions, a * means that the preceding token (say in some case a*, a is the token) can be present 0 or many times within the input string (0-n). The change here is that word is referencing another rule and PEG.js expects that to be defined, so, if you type this in, you'll find PEG.js complaining that it can't find the word rule.

Lets make that rule, then. Right now we're not really caring about internationalization, so we'll make this really simple. (Last I checked, there wasn't any support for internationalization within PEG.js proper, but it can be done. If you want help with that, you can ask me or check out the examples in the peg.js source)

word = letter+

Again, like Regular Expressions, we're using the + operator. The + operator means the token has to be present atleast once but may be repeated many times (1 - n). Now we've solved the problem of the word rule, but created another rule letter that's not present. Moving on…

letter = [a-zA-Z0-9]

The [] operator basically means that any character inside the brackets can be matched. It's like a set. Anything within in matched. If you look closely, there's even more funky syntax. Within the set operator [], a-z would let PEG.js know (and most regular expression engines too, since this is also valid regular expression syntax) to interpolate the characters. So, in short, a-z literally means characters from a to z.

Now we're getting the green signal that everything's fine. Lets add in a test input. Say 'Hello'. We were hoping magically to get 1. Alas, it isn't working like that, instead, Hello has been taken apart and that's the output. No good.

Lets add in some inline javascript here to count the words. Since we're talking about counting words, the only rule that knows how many words are matched is the wordCount rule. Lets add some javascript to it.

To add javascript, append the expression with {} containing whatever javascript you want. So, we modify wordCount to be:

wordCount = word* { return 1; }

So, we're getting the desired result, but we're cheating. In order to count the words, we need to be able to reference it within the javascript. To reference something, we prefix it with a label like this:

wordCount = w:word* { return 1; }

The label: syntax enables you to access the actual matched text within your javascript code. Pretty cool eh? Now to put it to good use.

wordCount = w:word* { return w.length; }

Tokens matched with * or + automatically become javascript arrays within the javascript code which is great.

You can see we're getting 1 as the result. Nice! However, if we add more words, you'll see we get an error. Upon closer inspection, you'll see that PEG.js was expecting the letter rule, but instead was dumbfounded when it got a ' ' (space) instead. Since words are broken by spaces, we need to add that to our rule set. Again, the only rule dealing with whole words is wordCount we need to modify it. But before, we need to actually add a rule that matches spaces. (We could've just added " " to the wordCount rule, but in practice, extracting common snippets into rules makes the grammar more readable)

space = " "

and then we need to add it to the wordCount rule. There's another trick coming up though. If we write,

wordCount = w:word* space { return w.length; }

it wont do anything other than expect a space after a word. Multiple words are still not being matched. To fix this, we need to explain our needs better to PEG.js. We don't wants many words and the input should end with a space, we need words seperated by spaces.

wordCount = w:(word space)* { return w.length; }

This gets us most of the way, but we're still seeing an error. PEG.js still expects that the last word be followed by a space. We don't need that, so to fix:

wordCount = w:(word space?)* { return w.length; }

The ? operator here means that we expect the rule space to be present sometimes, and not be present at other times (0-1). Now, try out the input, and you'll find that the parser is working as expected. Yay!

A little more complex example

Now that we're done, lets try to do something which would be a bit harder to do with plain javascript. Lets add a command system here. Say, if someone types in wc: Hello world we should return 2. But if someone says lc: Hello world we should return the number of letters in the text, in this case 10 (not 11, we're not counting spaces). Uptil now we've only dealt with simple rules for matching the same thing, but now we have to deal with a much more complex system. Here's the parser uptil now:

wordCount = w:(word space?)* {return w.length;}
word = letter+
letter = [a-zA-Z0-9]
space = " "

Let's add in a new top-level rule, that will parse the input string, detect which kind of operation to run on the remaining text and run that rule.

textWorks 
  = "wc:" space* wc:wordCount { return wc; }
  / "lc:" space* lc:letterCount { return lc; }

Whenever I work with parsers, I like to generate the top-most rules first, so that I can get an idea of how everything connects. The only new thing in the above syntax is the use of the / operator. Think of the / operator as an OR operator. If the parser can't match the first option, it looks at the next option, and so on until either it finds the first option that matches, or none, in which case it goes back to the parent rule and repeats. If all routes are exhausted and the input string isn't matched fully, it returns and tells you where it failed. So in this case if we started an input text with 'pc:', the parser would fail. Lets make letterCount next.

letterCount = w:(iw:word space? {return iw;})* {
   var total = 0;
   for (var i = 0; i < w.length; i++) {
       total += w[i].length;
   }
   return total;
 }

Now, this a bit more complex, but what it's essentially doing is

Within () if there is more than one token, PEG.js converts it into an array of results, however, each () can have its own inline javascript, so I used that to just return the result of the word rule, ignoring the space rule's matched text.
iterating through the words matched, we keep counting the letters in each word and return the total

This works because we defined word as letter+ which by default returns an array of matched letters.

Now, try it out. You'll see that when you type in 'wc: Hello World' it returns 2, and change 'wc' to 'lc' and you get 10.

There's still a lot of stuff I haven't covered here, but I hope I've given an idea about the power of PEG.js and how to get started with it. You can download the resulting parser from the online version after you're done making it. You can also download it using npm install -g pegjs if you have node. You could also run the parser within the browser at runtime, but I think that has limited uses. Just precompile the parser for better performance, unless you plan on dynamically generating the grammar itself.

Links

PEG.js
PEG.js Online Version
PEG.js Documentation
Github Project

PEG.js is developed by David Majda (@dmajda).

#parser

#javascript

#pegjs

Written by Mutahhir

Say Thanks

Respond

Related protips

Parse XML in Android(tm) with xml-drafts-framework

1.253K

There is no freely available T-SQL parser

1.023K

Weather and Money Simple Parser

433

7 Responses

Add your response

alpacaaa

Thanks a bunch for this!

over 1 year ago ·

mutahhir

@alpacaaa Glad you found it useful!

over 1 year ago ·

vijayrudraraju

Very insightful example! Thanks!

over 1 year ago ·

timcash

Very nice, thanks for writing this up

over 1 year ago ·

jimfridlin

well explained.
i have a question though (i think it is tricky, i am missing something!):

in the grammar part, i wrote this:
rootexpression = simplecondition

simplecondition = analogcondition / timecondition

analogcondition = ws (!analogoperator inputname)+ ws analogoperator ws analogvalue ws

timecondition = ws "time" ws timeoperator ws timevalue ws
analogoperator = "=="

timeoperator = "=="

analogvalue = integer

timevalue = digit digit ":" digit digit ":" digit digit

ws "whitespace" = [ \t\n\r]*

integer = digit+

digit = [0-9]

input_name = alphanum

alphanum = [a-zA-Z0-9]

and in the test part - i just write this: time == 10:10:00
for some reason, the parser wont recognize it, and it should.
if i write this: time == 10
it will accept it - as it matches "analogcondition"
how can i make it recognize the timecondition as well?!
(i know it should recognize the timecondition because when i remove the "analogcondition" from the simple condition rule - it matches).

Thanks,
Jim.

over 1 year ago ·