Reconstructing Ruby, Part 4: Our First Parser
Read Part 3 in case you missed it.
Now that our lexer is in a pretty good situation, we can start working on our parser. We'll be using Bison. Bison works very similar to Flex, but instead of generating a lexer, it will generate a parser.
Bison is an improvement on the program Yacc. It's an LALR parser generator. LALR stands for Look-Ahead Left-to-right, Rightmost derivation parser, which is a mouthful. I'm not going to even try to explain this.
In programming, a grammar refers to what tokens are allowed to be next to each other, as well as what tokens, when next to each other, represent a higher level concept.
First let's describe a program. In Ruby, a program is just a list of expressions. In programming terms, an expression is a piece of code that returns a value. In Ruby pretty much everything will return a value so we can consider everything an expression.
Other languages have an alternative to expressions called statements. Statements are like expresions but do not have a return value. Ruby technically has no statements as everything returns a value.
To start our grammar, we'll say a program is just a list of expressions:
program: expressions
program
and expressions
are both known as non-terminals. You can think of a non-terminal as "not a token". The tokens on the other hand we'll refer to as terminals.
The main idea here is that every non-terminal must eventually resolve to a terminal. To some degree that makes sense because our program is just built up of tokens so we eventually want to get to a point to where we're talking about those tokens. One way to generalize a parser is to say it takes an array of tokens and from them builds a tree like structure known as an AST or Abstract Syntax Tree. You can find out more about ASTs on this Wikipedia page.
An interesting part of definining a grammar is dealing with recursion. For instance we've described a program as a list of expressions, but what is a list of expressions? The way we'd describe this in Bison would look something like this:
expressions: expressions expression
| expression
The |
is kind of like an or. So in this case expressions
are either an expression
or what was previously determined to be expressions
followed by another expression
. The way this get's simplified is like this. Imagine we have 3 expression
s following each other (I've added numbers to the end for clarity):
expression#1 expression#2 expression#3
When we parse this we're always trying to work our way to our starting non-terminal program
, so we simplify the first expression
to expressions
expressions expression#2 expression#3
Since an acceptable simplification of expressions
would be expressions expression
we can collapse them together:
expressions expression#3
Then we have a similar situation where we can simplify again to just:
expressions
Now since we can't simplify any futher expressions
becomes program
:
program
Bison files end with a .y
extension due to Yacc. The format is very similar to our Flex file. The Bison manual defines it as:
%{
Prologue
%}
Bison declarations
%%
Grammar rules
%%
Epilogue
To start our grammar, let's work downwards from the program
definition. Make a file called parse.y
and put the following code into it:
%{
#include <stdio.h>
%}
%token tNUMBER
%start program
%%
program: expressions
expressions: expressions expression
| expression
expression: tNUMBER { printf("PARSED(%d)\n", $1); }
Here we've defined a token tNUMBER
which is a terminal. A common idiom is to prefix your tokens with the letter t. If a tNUMBER
token is matched it will print $1 which is a reference back to tNUMBER
because it's in the first position. If we had a larger simplification like expressions expression
then expressions
would be stored in $1 and expression
in $2 due to their positions. We'll see more $
variables when our patterns get larger.
We've also defined the non-terminals program
, expressions
and expression
. All of these non-terminals will eventually be expressible in tokens. A program
contains expressions
which is a set of expression
which must be a tNUMBER
.
Let's modify our Makefile
to compile this now:
SRC=main.c parse.tab.c lex.yyc
all: ruby
ruby: ${SRC}
> cc -o ruby ${SRC}
lex.yy.c: ruby.l
> flex ruby.l
parse.tab.c: parse.y
> bison -d parse.y
clean:
> rm -rf ruby lex.yy.c parse.tab.c parse.tab.h
We've introduced SRC
to make adding new files easier. Now as the list of files we need to compile grows we can specify them in one locations.
The -d
flag for Bison produces the header file parse.tab.h
. This header file will include definitions for tokens, like tNUMBER
, which is useful for our lexer since all of the tokens will be defined in our parser.
If you run make
you'll probably get this noise:
$ make
bison -d parse.y
flex ruby.l
cc -o ruby main.c lex.yy.c parse.tab.c
parse.tab.c:1257:16: warning: implicit declaration of function
'yylex' is invalid in C99 [-Wimplicit-function-declaration]
yychar = YYLEX;
^
parse.tab.c:601:16: note: expanded from macro 'YYLEX'
# define YYLEX yylex ()
^
parse.tab.c:1385:7: warning: implicit declaration of function
'yyerror' is invalid in C99 [-Wimplicit-function-declaration]
yyerror (YY_("syntax error"));
^
2 warnings generated.
Undefined symbols for architecture x86_64:
"_yyerror", referenced from:
_yyparse in parse-d9f59c.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to
see invocation)
make: *** [ruby] Error 1
The reason this is happening is because in order for Bison to work, it expects two functions to be defined - yylex
and yyerror
. We'll make use of extern
again to reference yylex
and we'll define a yyerror
. At the top of our parse.y
file add the following:
%{
#include <stdio.h>
extern int yylex(void);
void yyerror(char const *s) { fprintf(stderr, "%s\n", s); }
%}
Now we'll need to have our main.c
file call yyparse
instead of yylex
. At the top of our main.c
add:
#include "parse.tab.h"
Replace:
extern int yylex(void)
With:
extern int yyparse(void)
And replace the call to yylex
with yyparse
. If we change program.rb
to contain just the number 1, compile and run our program, then we'll see:
$./ruby program.rb
NUMBER(1)
syntax error
Oops, we'll need to change our lexer to return the correct token. Open ruby.l
and replace:
{NUMBER} { VTYPE("NUMBER", yytext); }
with:
{NUMBER} { yylval = atoi(yytext); return tNUMBER; }
yylval
holds the value of our token, which is currently the integer value of yytext
thanks to atoi
. Next we return the token type tNUMBER
.
The last change we need to make to ruby.l
is to add:
#include "parse.tab.h"
Right under:
#include <stdio.h>
This lets our lexer knows about the tokens we've defined in our parser. When we compile and run our program now we should see:
./ruby program.rb
PARSED(1)
From this point we're ready to expand our grammar, define more tokens, and refine our lexer to utilize them. Let's start simple. Let's change program.rb
to contain:
1 + 3
And when we run the program we want the output to be 4
. Running the program currently outputs:
PARSED(1)
PLUS
PARSED(3)
Notice that while PLUS
is printed it wasn't parsed. It was instead printed from our lexer. We'll fix that by adding a token for PLUS
that we can return from our lexer. Add:
%token tPLUS
Below the tNUMBER
token. Then change:
"+" { TYPE("PLUS"); }
to:
"+" { return tPLUS; }
If you run make
and run the program now we'll end up with a syntax error. The reason for this is that we haven't described in our grammar a valid "sentence" with plus in it. So let's extend out expression grammer a little. We're going to change:
expression: tNUMBER { printf("PARSED(%d)\n", $1); }
to:
expression: tNUMBER
| expression tPLUS expression { printf("%d\n", $1 + $3); }
Running the program now will produce the output of 4
. The reason for this is because when the parser encounters a tNUMBER it will just return that value (which is currently an integer). So when it sees an expression of the form expression tPLUS expression
that's the same as 1 tPLUS 3
which in our case $1 = 1
and $3 = 3
. If our program.rb
had 2 + 4
in it then $1 = 2
and $3 = 4
and the output would be 6
.
At this point we're ready to extend out grammar to cover more of the ruby language. In the next post We'll define some more complex expressions in our grammar so that:
# The Greeter class
class Greeter
def initialize(name)
@name = name.capitalize
end
def salute
puts "Hello #{@name}!"
end
end
# Create a new object
g = Greeter.new("world")
# Output "Hello World!"
g.salute
will be considered a valid program by our language.
If you're having any problems you can check the reference implementation on GitHub or look specifically at the diff to see what I've changed. Additionally, if you have any comments or feedback I'd greatly appreciate if you left a comment!
Update: read Part 5 of this series.