In this part we will add parsers to the data types that we created previously, also we’ll learn about some syntax sugar that makes writing parsers somewhat easier. Let’s get started!
Unifies if the input char is matches “a-zA-Z”
Unifies if the input char is matches “a-zA-Z”
Tries to parse the string as an integer value. Fails if not possible.
Converts a list of characters to a string
. Note the uo
mode in the
output argument and see the next section.
In the previous parts we said that uo
was free >> unique
and that
out
modes were free
» ground
. We can convert a variable that’s
unique
to ground
- but that unique
variable will cease to be
unique.
Now you might wonder “why”?
uo
anywhere but in effectful predicates?For 1. - no! If you think about it you’ll see that in the main
predicate expects a pair of (di, uo)
arguments - and doing anything to
an unique
variable will clobber it. The mode system would not allow
anything done to the io
type that would not preserve uniquness.
As for 2 - for unique variables mercury is able to do some magic optimizations - such as destructive updates (if we know that a string is unique we can change it in place and in the end the output won’t matter).
But don’t think too strongly about point 2 for now :).
While writing a parser using “just” plain old predicates is certainly possible we’ll introduce DCGs which will help ease the pain of writing out parsers.
We won’t go much into the theory of DCGs - I recommend reading the wikipedia entry and this prolog tutorial about DCGs. We also won’t cover “advanced” usages of DCGs such as passing state.
You may have noticed that the parser predicates that we wrote follow a certain pattern:
This was done explicitly for the purpose of compatibility with DCGs.
We will view DCGs as a simple syntax transformation (syntax sugar) that can transform predicates in the form:
to the explicit forms we have writen above (t1, …, tn - is a list of arguments “not counting” the input/output lists - can be empty).
In the body of a DCG rule:
is equivalent to:
{
and }
will be compiled down to “normal”
mercury predicates (so without implicit passing of input/output
lists).is equivalent to:
Why is this useful? Let’s tackle an example of a BNF grammar from this page (simplified to remove repetitions ‘{‘ X ‘}’ with recursion)
and here’s how this would look like in mercury using DCG-syntax (full code here)
Hopefully it’s easy to see the similarity between DCGs and the BNF grammar - without having to explicit pass the input and output char lists it’s much much clearer to see the “what” and not the “how”.
Rewrite the following DCGs into predicate form, add signatures where non-trivial. Write the BNF grammar that corresponds to the DCG.
Rewrite the following predicates into DCG form, add signatures where non-trivial. Write the BNF grammar that corresponds to the DCG.
The goal of this part is to parse this grammar:
Caveats:
()
, (a)
and (a b)
we introduce 3 rules
in <list>
- one for empty list, one for 1 element list and one for
lists with spaces after the first element.This should explain the weird looking dotted-list
rule - intuitively
we’d expect that as:
but because <list>
consumes spaces in last alternative and failures
will
Parsing is only the first step - we also need to produce values:
<atom>
rule’s output should produce:
lisp_boolean(yes)
if “#t” was parsedlisp_boolean(no)
if “#f” was parsedlisp_atom(PARSED)
(where PARSED was not “#t” nor “#f”)<number>
rule’s output should produce lisp_number(PARSED)
<quoted>
rule’s output should produce
lisp_list([lisp_atom("quote"), PARSED_EXPRESSION])
<list>
rule’s should unify with lisp_list(PARSED_EXPRESSIONS)
<dotted-list>
rule’s output should unify with
lisp_dotted_list(PARSED_EXPRESSIONS_BEFORE_THE_DOT, PARSED_EXPRESSION)
Parsing and producing values should be done in module parsing
only
top_level_expression
should be exported.
We’ll also need changes in the main
module. We want to discern between
“partial” matches and full matches - we have to check the output char
list and see if it’s empty after unifying with top_level_expression
.
number
parser first and flesh out
top_level_expression
, expression
and the changes in the main
module.string
, atom
, quoted
and then list
and finally
dotted_list
predicates. Test after each new addition.parse_number
to unify with
the digits themselves but the number
predicate takes the output of
parse_number
and has one number output that unifies with an
integer.For the whole project you can check out this directory .
Let’s start from the changes in the main module:
This is getting a little convoluted but bare with me - we will learn
some syntax sugar that helps with code such as that in due time. The
code while looking scary is simple - just nested if
s. And the only
real part that has changed is this:
So like we said before we want to find either an partial or a full match
Rest
and match it against the empty list. If the list is
indeed empty we have a full match, if it’s not - it’s a partial match.Now for the parser
module. The header:
Nothing fancy - we export the only predicate that we will use - and it’s
a parser that will unify with lisp_val
s.
In the implementation section we need to import other modules that we will use:
Now the definitions of top_level_expression
and expression
:
top_level_expression
just follows the BNF grammar, so nothing
surprising - it uses the optional_space
predicate that uses recursion.
If we’d try to analyze it we can see that if it find an ' '
empty
space it will try to match more, same with the tab '\t'
and if it
finds neither it stops.
expression
on the other had just delegates to the other parsers
.
Let’s analyze them one by one starting from the easiest:
number
For all the parsers that produce lisp_val
s we will also write a parser
that just matches tha characters without any additional logic. For
clarity we will prefix them with parse_
.
The definition of parse_number
is:
digit
first matches any character then uses a normal (i.e non DCG)
predicate to check if that character is in the range 0-9. So we can say
it just dcg-ifies the digit
predicate.
parse_number
tries to consume as many digits as possible. Recursively
we append the current digit with the rest of the digits where the empty
list is base case for recursion.
For the lisp_val
-producing predicate:
We unify with parse_number(Cs)
first - then obtain a string from the
character list using string.from_char_list
, next we convert the
string
into an integer
using integer.from_string
(possibly failing
here if our parser ain’t right) and finally we wrap the integer with the
lisp_number
constructor.
string
String is not that much different from number, first let’s look at the
parse_string
predicate:
parse_string
is really simple just using a combination of three
other rules.
We use not_quote
as a helper rule - nothing interesting going on in
there. string_content
is tries to consume as many “not quotes” as
possible unifying with Chars
with the current not-quote-character
cons-ed with the rest of the not-quote-characters in the input list.
And now string
:
Here we just convert the char list into a string and wrap that in the
lisp_string
constructor.
Parsing code:
parse_atom
will unify with the characters of the atom name, the head
must either be letter
or symbol
- the former is a simple predicate
and the latter we know from the previous parts.
If everything succeeded up to this point we then parse the rest of the
atom name using the parse_atom_rest
predicate and we unify the name by
appending the first char with the output list of parse_atom_rest
.
We should be used to sights like this by now - again a recursive rule and failing will unify the output with the empty list.
And this is all used by atom
:
First we use the parse_atom
rule then we take the unified chars and
covert them into a string - and we match on that. Either
lisp_bool(yes)
, lisp_bool(no)
or lisp_atom(AtomName)
will be
produced.
The quoted rule is really not that difficult to grasp:
It uses just uses the previously defined expression
passes that to a
lisp_lisp
with two elements lisp_atom("quote")
and the second being
the expression.
Note the mutual recursion between expression
and quoted
- think
about this for a bit if it seems hard - but it’s really no different to
“normal” recursion.
While the parse_
predicates before didn’t produce lisp_val
s the
parse_list
here also won’t produce a single lisp_val
- but it will
produce a list of them. We define parse_list
so it can be reused later
on in parse_dotted_list
.
Just nested if-then-else
statements just like in the BNF grammar, we
use expression
once again and the non optional space
rule which
looks like this:
space
tries to consume at least one ' '
or '\t'
and then it
“delegates” to optional_space
rule.
Finally we wrap the list(lisp_val)
in a lisp_list
in the list
predicate:
In parse_dotted_list
we reuse the parse_list
rule then match against
the dot .
, space
and expression
- see the caveats section in the
goal if you don’t know why it looks like that.
parse_dotted_list
unifies with two output variables - a list and an
expression we use that in dotted_list
…
… wrapping both in the lisp_dotted_list
constructor.
In the next parts we will concentrate on writing simple evaluators for the values that we parsed starting with converting them to strings.