Parsing Expression Languages
Can Parsica parse expression? Why yes, I'm glad you asked!
An expression, roughly, is anything that can evaluated to a value, such as
- arithmetic expressions
(1 + 2) * 3
, - boolean expressions
x and (y or z)
, - code inside a template language
{{ user.loggedIn ? 'Hello ' ~ user.name : 'Log in' }}
, - spreadsheet formulas
=SUM(A1:A10) * B1
, - and anything else you can think of.
The tricky thing about parsing expressions is that you often have to deal with things like recursion, associativity, and operator precedence. These can make it pretty tricky to build a parser. Parsica provides the expression()
function, which offers a simple way to create a parser for your custom expression language.
Arithmetic
Let's build a simple calculator, that can evaluate expressions like 1 + 2 * (2 - 3)
to -1
.
Let's handle whitespace first. (See the chapter on "Dealing with Space" for details.)
Next, we define a parser for our terms. For this example, let's keep it simple and support only natural numbers:
Let's do parentheses next. Parsica's between()
combinator will do the job nicely, but let's wrap it in our combinator for clarity and reusability:
Now let's define our first expression, using expression()
. In our language, an expression can be:
- A naked term like
12
- A term between parentheses
(12)
- An operator and its arguments
1 + 2
- The arguments are expressions themselves, as in
1 + (2 + 3)
An expression is defined using expressions, so this calls for recursion. (See the chapter on Recursion.) Let's ignore operators for now, and do the simplest recursive expression parser:
We're saying here that $primary
is either an expression wrapped in parens, or a term. $expr
is an expression that uses $primary
as its primary parser.
Now let's add the plus operator. We need a parser for the symbol itself, in this case a simple char('+')
will do, but it could be anything. For example, PHP has two 'not equal' operators, which we could parse in one go either(string('!='), string('<>'))
.
We also need to decide what to do with the terms that we parse, using a transformation. This is a function that will take the left and the right operands from our +
. As we're building a calculator, we're simple going to add up the two terms, using fn($left, $right) => $left + $right
. (Later we will use this to create abstract syntax trees.)
Finally, we need to tell the expression parser that +
is a binary operator, and that we want it to be left associative. Let's put it all together:
The second argument to expression()
is an array of operators. The order is important: it determines the precedence. +
and -
have the same precedence, whereas *
and /
have the same precedence as each other, but higher precedence than +
and -
. We can solve this easily by grouping each precedence level, and putting the highest precedence levels first.
You can play around with the precedence and the associativity to see how it impacts the result. As an exercise, make a parser that solves 1 - 2 - 3 = (1 - (2 - 3) = (1 - (-1)) = 2
.
You can add unary operators, such as the negation prefix operator -
, and the increment and decrement postfix operators ++
and --
.
Parsing to an AST
Building calculators isn't that interesting of course. Typically you'll want your parser to output a datastructure that represents your expression, called an Abstract Syntax Tree or AST. This structure can then be used for whatever the next step in your program is, ranging from evaluation to compilation, static analysis, typechecking, optimisation, rendering and formatting...
Let's build a simple Boolean expression language, starting with the types for AST. Everything else will be pretty similar to the calculator example above, but instead of evaluating the expressions on the fly, we use the transform functions to create the datastructure.
Now the AST can be used for whatever purposes you need. In our Boolean example above, as an exercise you can
- add a
render()
method to write the expression back to a pretty formatted string, - add a
reduce()
method that simplifies the AST (eg turningTRUE AND TRUE
intoTRUE
), - add an
evaluate(['isBlue' => true, 'isEdible' => false, ...])
method that calculates the final result - ...