In this part we’ll extend our simple parser to match against symbols preceded by arbitrary amounts of whitespace:
$ ./main '
Found match: $
We’ll use BNF to describe grammars from now on - if you don’t know BNF check out the above wikipedia entry and familiarize yourself with it.
The grammar we want to match is:
rule ::= whitespace symbol
whitespace ::= whitespace1 whitespace | ""
whitespace1 ::= ' ' | '\t'
whitespace1
is either a space or a tab. In general we won’t worry
about new lines.
Previously to print out a match we used this very convoluted construction:
Using some append
predicate for strings wouldn’t be actually that much
better:
For situations such as these you can use functions. Mercury has some support for functional programming but it’s embedded in the “logical world” we discussed before.
By default functions in mercury behave just like det
predicates but
have one special “output” argument. They aren’t just syntax sugar -
functions occupy their own name space and you can have functions names
with the same arity (+1) as a predicate.
In general function declarations have the form:
Where t1
, .., tn
- are input types (can be different, same as in a
predicate), and to
is the output type. Functions of two input
arguments can also use an infix style - and that style is used in the
++
function that can append two strings (from the string
module):
We can unify the output of a function with either on the left or the right side:
So using ++
and another function from the string
module
char_to_string
we can rewrite the above to:
We will use functions to simplyfy the parser code.
To be able to parse one or more (or in this case 0 or more) characters we need to introduce the concept of recursion.
In logic programming as well as functional programming recursion replaces the imperative notion of looping - it is important to remember that recursion can represent everything that simple looping could.
For an example of recursion let’s analyze a predicate calculating a length of a list:
+
is a function defined on int
s and it works exactly like you think
it should ;). The A
in list(A)
stands for a generic argument - the
list predicate should work for all types of lists.
Let’s analyze how this works. First let’s try unifying with an empty list
Now for this list: [a, b, c]
You may have noticed that every expansion of the list_length
introduces a variable “lagging” behind waiting to be unified - this is
the unfortunate side-effect of structuring predicates this way. Those
variables will take up space on the stack and in extrema cases may blow
it.
There is a “trick” to make that work however, consider this two predicate:
The whole logic moved to the helper list_length_aux
method - and there
the introduction of helper variables might seem weird but let’s see this
at work:
As you see now we don’t pollute the stack with unnecessary variables. This technique is known in fp-circles as “tail call optimization” (TCO) and in logic programming circles as “last call optimization (LCO).
If we can represent the recursion in a predicate in such a way that the recursive unification is the last thing done (the predicate is not unified) the mercury runtime will be able to “trim” away unneeded stack so it will behave exactly like we have shown above.
This often needs the introduction of an auxiliary variable - sometimes called the accumulator and (and this technique is sometimes called accumulator passing style).
To ease the use of such predicates some “public” predicate most often
then not are used to (like list_length
, and list_length_aux
here).
You can view recursion like mathematical induction - to help you get started think about “the base case” and the “next step”.
Implement all the following predicates using accumulator passing style,
if you think it is worth it implement using “simple recursion”. It may
be worthwhile trying to analyze the predicates as I did above for
list_length
.
Unifies with the last element of the list or fails on empty list.
Unifies with the sum of elements in the list or 0 on empty list.
Unifies the output list with the reverse of the input list. Hints:
Unifies the output list to the concatenation of the first list input list with the second one.
Hints:
true
and false
To explicitly fail
a predicate (like for example on some condition)
use false
, likewise true
will always succeed. This will help us
define some predicates (in particular whitespace1
).
So, after this theoretical introduction we’re now ready to extend our program.
To parse the grammar defined by the BNF above we’ll introduce three new predicates corresponding to the rules:
:- pred rule(char::out, list(char)::in, list(char)::out) is semidet.
:- pred whitespace(list(char)::in, list(char)::out) is det.
:- pred whitespace1(list(char)::in, list(char)::out) is semidet.
The full code for this part is here
Let’s skip the obvious parts that haven’t changed:
So for the meat and bones let’s have a look at the whitespace1
predicate
So the whitespace1
parser first tries to extract the head and tail of
the list, we can unify ListOut
with the tail here - whitespace1
only
consumes one char.
Then we compare the char with ‘ ‘ - in the consequent we explicitly
succeed using true
and in the alternative we compare C
with = ‘\t’
(either failing or succeeding).
Here we have the recursive definition of whitespace - nothing fancy. The
base case is the 0-repetitions of whitespace where we succeed unifying
ListIn
to ListOut
- and for the recursive step we unify the Rest
of the characters with whitespace.
Here we just pass the input list to the whitespace
parser possibly
stripping out any optional whitespace and then passing that to the
symbol
parser.
The updated main predicate
looks like this:
Uff! That’s all for this part - in the next part we’ll define the core data structures for our interpreter, then we’ll have sidetrack a bit and discuss better parsing methods.