summaryrefslogtreecommitdiff
path: root/docs/tutorial/LangImpl2.rst
diff options
context:
space:
mode:
authorSean Silva <silvas@purdue.edu>2012-12-05 00:26:32 +0000
committerSean Silva <silvas@purdue.edu>2012-12-05 00:26:32 +0000
commitee47edfd8e2dd048522ebd47305aeefbe9d8729c (patch)
tree1149ccaddfcba655771ab114e383a2cae3b6b200 /docs/tutorial/LangImpl2.rst
parent4e5448053163e0d9c2107b240ccdb5a95c107b07 (diff)
downloadllvm-ee47edfd8e2dd048522ebd47305aeefbe9d8729c.tar.gz
llvm-ee47edfd8e2dd048522ebd47305aeefbe9d8729c.tar.bz2
llvm-ee47edfd8e2dd048522ebd47305aeefbe9d8729c.tar.xz
docs: Sphinxify `docs/tutorial/`
Sorry for the massive commit, but I just wanted to knock this one down and it is really straightforward. There are still a couple trivial (i.e. not related to the content) things left to fix: - Use of raw HTML links where :doc:`...` and :ref:`...` could be used instead. If you are a newbie and want to help fix this it would make for some good bite-sized patches; more experienced developers should be focusing on adding new content (to this tutorial or elsewhere, but please _do not_ waste your time on formatting when there is such dire need for documentation (see docs/SphinxQuickstartTemplate.rst to get started writing)). - Highlighting of the kaleidoscope code blocks (currently left as bare `::`). I will be working on writing a custom Pygments highlighter for this, mostly as training for maintaining the `llvm` code-block's lexer in-tree. I want to do this because I am extremely unhappy with how it just "gives up" on the slightest deviation from the expected syntax and leaves the whole code-block un-highlighted. More generally I am looking at writing some Sphinx extensions and keeping them in-tree as well, to support common use cases that currently have no good solution (like "monospace text inside a link"). git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@169343 91177308-0d34-0410-b5e6-96231b3b80d8
Diffstat (limited to 'docs/tutorial/LangImpl2.rst')
-rw-r--r--docs/tutorial/LangImpl2.rst1098
1 files changed, 1098 insertions, 0 deletions
diff --git a/docs/tutorial/LangImpl2.rst b/docs/tutorial/LangImpl2.rst
new file mode 100644
index 0000000000..0d62894a24
--- /dev/null
+++ b/docs/tutorial/LangImpl2.rst
@@ -0,0 +1,1098 @@
+===========================================
+Kaleidoscope: Implementing a Parser and AST
+===========================================
+
+.. contents::
+ :local:
+
+Written by `Chris Lattner <mailto:sabre@nondot.org>`_
+
+Chapter 2 Introduction
+======================
+
+Welcome to Chapter 2 of the "`Implementing a language with
+LLVM <index.html>`_" tutorial. This chapter shows you how to use the
+lexer, built in `Chapter 1 <LangImpl1.html>`_, to build a full
+`parser <http://en.wikipedia.org/wiki/Parsing>`_ for our Kaleidoscope
+language. Once we have a parser, we'll define and build an `Abstract
+Syntax Tree <http://en.wikipedia.org/wiki/Abstract_syntax_tree>`_ (AST).
+
+The parser we will build uses a combination of `Recursive Descent
+Parsing <http://en.wikipedia.org/wiki/Recursive_descent_parser>`_ and
+`Operator-Precedence
+Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_ to
+parse the Kaleidoscope language (the latter for binary expressions and
+the former for everything else). Before we get to parsing though, lets
+talk about the output of the parser: the Abstract Syntax Tree.
+
+The Abstract Syntax Tree (AST)
+==============================
+
+The AST for a program captures its behavior in such a way that it is
+easy for later stages of the compiler (e.g. code generation) to
+interpret. We basically want one object for each construct in the
+language, and the AST should closely model the language. In
+Kaleidoscope, we have expressions, a prototype, and a function object.
+We'll start with expressions first:
+
+.. code-block:: c++
+
+ /// ExprAST - Base class for all expression nodes.
+ class ExprAST {
+ public:
+ virtual ~ExprAST() {}
+ };
+
+ /// NumberExprAST - Expression class for numeric literals like "1.0".
+ class NumberExprAST : public ExprAST {
+ double Val;
+ public:
+ NumberExprAST(double val) : Val(val) {}
+ };
+
+The code above shows the definition of the base ExprAST class and one
+subclass which we use for numeric literals. The important thing to note
+about this code is that the NumberExprAST class captures the numeric
+value of the literal as an instance variable. This allows later phases
+of the compiler to know what the stored numeric value is.
+
+Right now we only create the AST, so there are no useful accessor
+methods on them. It would be very easy to add a virtual method to pretty
+print the code, for example. Here are the other expression AST node
+definitions that we'll use in the basic form of the Kaleidoscope
+language:
+
+.. code-block:: c++
+
+ /// VariableExprAST - Expression class for referencing a variable, like "a".
+ class VariableExprAST : public ExprAST {
+ std::string Name;
+ public:
+ VariableExprAST(const std::string &name) : Name(name) {}
+ };
+
+ /// BinaryExprAST - Expression class for a binary operator.
+ class BinaryExprAST : public ExprAST {
+ char Op;
+ ExprAST *LHS, *RHS;
+ public:
+ BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
+ : Op(op), LHS(lhs), RHS(rhs) {}
+ };
+
+ /// CallExprAST - Expression class for function calls.
+ class CallExprAST : public ExprAST {
+ std::string Callee;
+ std::vector<ExprAST*> Args;
+ public:
+ CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
+ : Callee(callee), Args(args) {}
+ };
+
+This is all (intentionally) rather straight-forward: variables capture
+the variable name, binary operators capture their opcode (e.g. '+'), and
+calls capture a function name as well as a list of any argument
+expressions. One thing that is nice about our AST is that it captures
+the language features without talking about the syntax of the language.
+Note that there is no discussion about precedence of binary operators,
+lexical structure, etc.
+
+For our basic language, these are all of the expression nodes we'll
+define. Because it doesn't have conditional control flow, it isn't
+Turing-complete; we'll fix that in a later installment. The two things
+we need next are a way to talk about the interface to a function, and a
+way to talk about functions themselves:
+
+.. code-block:: c++
+
+ /// PrototypeAST - This class represents the "prototype" for a function,
+ /// which captures its name, and its argument names (thus implicitly the number
+ /// of arguments the function takes).
+ class PrototypeAST {
+ std::string Name;
+ std::vector<std::string> Args;
+ public:
+ PrototypeAST(const std::string &name, const std::vector<std::string> &args)
+ : Name(name), Args(args) {}
+ };
+
+ /// FunctionAST - This class represents a function definition itself.
+ class FunctionAST {
+ PrototypeAST *Proto;
+ ExprAST *Body;
+ public:
+ FunctionAST(PrototypeAST *proto, ExprAST *body)
+ : Proto(proto), Body(body) {}
+ };
+
+In Kaleidoscope, functions are typed with just a count of their
+arguments. Since all values are double precision floating point, the
+type of each argument doesn't need to be stored anywhere. In a more
+aggressive and realistic language, the "ExprAST" class would probably
+have a type field.
+
+With this scaffolding, we can now talk about parsing expressions and
+function bodies in Kaleidoscope.
+
+Parser Basics
+=============
+
+Now that we have an AST to build, we need to define the parser code to
+build it. The idea here is that we want to parse something like "x+y"
+(which is returned as three tokens by the lexer) into an AST that could
+be generated with calls like this:
+
+.. code-block:: c++
+
+ ExprAST *X = new VariableExprAST("x");
+ ExprAST *Y = new VariableExprAST("y");
+ ExprAST *Result = new BinaryExprAST('+', X, Y);
+
+In order to do this, we'll start by defining some basic helper routines:
+
+.. code-block:: c++
+
+ /// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
+ /// token the parser is looking at. getNextToken reads another token from the
+ /// lexer and updates CurTok with its results.
+ static int CurTok;
+ static int getNextToken() {
+ return CurTok = gettok();
+ }
+
+This implements a simple token buffer around the lexer. This allows us
+to look one token ahead at what the lexer is returning. Every function
+in our parser will assume that CurTok is the current token that needs to
+be parsed.
+
+.. code-block:: c++
+
+
+ /// Error* - These are little helper functions for error handling.
+ ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
+ PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
+ FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
+
+The ``Error`` routines are simple helper routines that our parser will
+use to handle errors. The error recovery in our parser will not be the
+best and is not particular user-friendly, but it will be enough for our
+tutorial. These routines make it easier to handle errors in routines
+that have various return types: they always return null.
+
+With these basic helper functions, we can implement the first piece of
+our grammar: numeric literals.
+
+Basic Expression Parsing
+========================
+
+We start with numeric literals, because they are the simplest to
+process. For each production in our grammar, we'll define a function
+which parses that production. For numeric literals, we have:
+
+.. code-block:: c++
+
+ /// numberexpr ::= number
+ static ExprAST *ParseNumberExpr() {
+ ExprAST *Result = new NumberExprAST(NumVal);
+ getNextToken(); // consume the number
+ return Result;
+ }
+
+This routine is very simple: it expects to be called when the current
+token is a ``tok_number`` token. It takes the current number value,
+creates a ``NumberExprAST`` node, advances the lexer to the next token,
+and finally returns.
+
+There are some interesting aspects to this. The most important one is
+that this routine eats all of the tokens that correspond to the
+production and returns the lexer buffer with the next token (which is
+not part of the grammar production) ready to go. This is a fairly
+standard way to go for recursive descent parsers. For a better example,
+the parenthesis operator is defined like this:
+
+.. code-block:: c++
+
+ /// parenexpr ::= '(' expression ')'
+ static ExprAST *ParseParenExpr() {
+ getNextToken(); // eat (.
+ ExprAST *V = ParseExpression();
+ if (!V) return 0;
+
+ if (CurTok != ')')
+ return Error("expected ')'");
+ getNextToken(); // eat ).
+ return V;
+ }
+
+This function illustrates a number of interesting things about the
+parser:
+
+1) It shows how we use the Error routines. When called, this function
+expects that the current token is a '(' token, but after parsing the
+subexpression, it is possible that there is no ')' waiting. For example,
+if the user types in "(4 x" instead of "(4)", the parser should emit an
+error. Because errors can occur, the parser needs a way to indicate that
+they happened: in our parser, we return null on an error.
+
+2) Another interesting aspect of this function is that it uses recursion
+by calling ``ParseExpression`` (we will soon see that
+``ParseExpression`` can call ``ParseParenExpr``). This is powerful
+because it allows us to handle recursive grammars, and keeps each
+production very simple. Note that parentheses do not cause construction
+of AST nodes themselves. While we could do it this way, the most
+important role of parentheses are to guide the parser and provide
+grouping. Once the parser constructs the AST, parentheses are not
+needed.
+
+The next simple production is for handling variable references and
+function calls:
+
+.. code-block:: c++
+
+ /// identifierexpr
+ /// ::= identifier
+ /// ::= identifier '(' expression* ')'
+ static ExprAST *ParseIdentifierExpr() {
+ std::string IdName = IdentifierStr;
+
+ getNextToken(); // eat identifier.
+
+ if (CurTok != '(') // Simple variable ref.
+ return new VariableExprAST(IdName);
+
+ // Call.
+ getNextToken(); // eat (
+ std::vector<ExprAST*> Args;
+ if (CurTok != ')') {
+ while (1) {
+ ExprAST *Arg = ParseExpression();
+ if (!Arg) return 0;
+ Args.push_back(Arg);
+
+ if (CurTok == ')') break;
+
+ if (CurTok != ',')
+ return Error("Expected ')' or ',' in argument list");
+ getNextToken();
+ }
+ }
+
+ // Eat the ')'.
+ getNextToken();
+
+ return new CallExprAST(IdName, Args);
+ }
+
+This routine follows the same style as the other routines. (It expects
+to be called if the current token is a ``tok_identifier`` token). It
+also has recursion and error handling. One interesting aspect of this is
+that it uses *look-ahead* to determine if the current identifier is a
+stand alone variable reference or if it is a function call expression.
+It handles this by checking to see if the token after the identifier is
+a '(' token, constructing either a ``VariableExprAST`` or
+``CallExprAST`` node as appropriate.
+
+Now that we have all of our simple expression-parsing logic in place, we
+can define a helper function to wrap it together into one entry point.
+We call this class of expressions "primary" expressions, for reasons
+that will become more clear `later in the
+tutorial <LangImpl6.html#unary>`_. In order to parse an arbitrary
+primary expression, we need to determine what sort of expression it is:
+
+.. code-block:: c++
+
+ /// primary
+ /// ::= identifierexpr
+ /// ::= numberexpr
+ /// ::= parenexpr
+ static ExprAST *ParsePrimary() {
+ switch (CurTok) {
+ default: return Error("unknown token when expecting an expression");
+ case tok_identifier: return ParseIdentifierExpr();
+ case tok_number: return ParseNumberExpr();
+ case '(': return ParseParenExpr();
+ }
+ }
+
+Now that you see the definition of this function, it is more obvious why
+we can assume the state of CurTok in the various functions. This uses
+look-ahead to determine which sort of expression is being inspected, and
+then parses it with a function call.
+
+Now that basic expressions are handled, we need to handle binary
+expressions. They are a bit more complex.
+
+Binary Expression Parsing
+=========================
+
+Binary expressions are significantly harder to parse because they are
+often ambiguous. For example, when given the string "x+y\*z", the parser
+can choose to parse it as either "(x+y)\*z" or "x+(y\*z)". With common
+definitions from mathematics, we expect the later parse, because "\*"
+(multiplication) has higher *precedence* than "+" (addition).
+
+There are many ways to handle this, but an elegant and efficient way is
+to use `Operator-Precedence
+Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_.
+This parsing technique uses the precedence of binary operators to guide
+recursion. To start with, we need a table of precedences:
+
+.. code-block:: c++
+
+ /// BinopPrecedence - This holds the precedence for each binary operator that is
+ /// defined.
+ static std::map<char, int> BinopPrecedence;
+
+ /// GetTokPrecedence - Get the precedence of the pending binary operator token.
+ static int GetTokPrecedence() {
+ if (!isascii(CurTok))
+ return -1;
+
+ // Make sure it's a declared binop.
+ int TokPrec = BinopPrecedence[CurTok];
+ if (TokPrec <= 0) return -1;
+ return TokPrec;
+ }
+
+ int main() {
+ // Install standard binary operators.
+ // 1 is lowest precedence.
+ BinopPrecedence['<'] = 10;
+ BinopPrecedence['+'] = 20;
+ BinopPrecedence['-'] = 20;
+ BinopPrecedence['*'] = 40; // highest.
+ ...
+ }
+
+For the basic form of Kaleidoscope, we will only support 4 binary
+operators (this can obviously be extended by you, our brave and intrepid
+reader). The ``GetTokPrecedence`` function returns the precedence for
+the current token, or -1 if the token is not a binary operator. Having a
+map makes it easy to add new operators and makes it clear that the
+algorithm doesn't depend on the specific operators involved, but it
+would be easy enough to eliminate the map and do the comparisons in the
+``GetTokPrecedence`` function. (Or just use a fixed-size array).
+
+With the helper above defined, we can now start parsing binary
+expressions. The basic idea of operator precedence parsing is to break
+down an expression with potentially ambiguous binary operators into
+pieces. Consider ,for example, the expression "a+b+(c+d)\*e\*f+g".
+Operator precedence parsing considers this as a stream of primary
+expressions separated by binary operators. As such, it will first parse
+the leading primary expression "a", then it will see the pairs [+, b]
+[+, (c+d)] [\*, e] [\*, f] and [+, g]. Note that because parentheses are
+primary expressions, the binary expression parser doesn't need to worry
+about nested subexpressions like (c+d) at all.
+
+To start, an expression is a primary expression potentially followed by
+a sequence of [binop,primaryexpr] pairs:
+
+.. code-block:: c++
+
+ /// expression
+ /// ::= primary binoprhs
+ ///
+ static ExprAST *ParseExpression() {
+ ExprAST *LHS = ParsePrimary();
+ if (!LHS) return 0;
+
+ return ParseBinOpRHS(0, LHS);
+ }
+
+``ParseBinOpRHS`` is the function that parses the sequence of pairs for
+us. It takes a precedence and a pointer to an expression for the part
+that has been parsed so far. Note that "x" is a perfectly valid
+expression: As such, "binoprhs" is allowed to be empty, in which case it
+returns the expression that is passed into it. In our example above, the
+code passes the expression for "a" into ``ParseBinOpRHS`` and the
+current token is "+".
+
+The precedence value passed into ``ParseBinOpRHS`` indicates the
+*minimal operator precedence* that the function is allowed to eat. For
+example, if the current pair stream is [+, x] and ``ParseBinOpRHS`` is
+passed in a precedence of 40, it will not consume any tokens (because
+the precedence of '+' is only 20). With this in mind, ``ParseBinOpRHS``
+starts with:
+
+.. code-block:: c++
+
+ /// binoprhs
+ /// ::= ('+' primary)*
+ static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
+ // If this is a binop, find its precedence.
+ while (1) {
+ int TokPrec = GetTokPrecedence();
+
+ // If this is a binop that binds at least as tightly as the current binop,
+ // consume it, otherwise we are done.
+ if (TokPrec < ExprPrec)
+ return LHS;
+
+This code gets the precedence of the current token and checks to see if
+if is too low. Because we defined invalid tokens to have a precedence of
+-1, this check implicitly knows that the pair-stream ends when the token
+stream runs out of binary operators. If this check succeeds, we know
+that the token is a binary operator and that it will be included in this
+expression:
+
+.. code-block:: c++
+
+ // Okay, we know this is a binop.
+ int BinOp = CurTok;
+ getNextToken(); // eat binop
+
+ // Parse the primary expression after the binary operator.
+ ExprAST *RHS = ParsePrimary();
+ if (!RHS) return 0;
+
+As such, this code eats (and remembers) the binary operator and then
+parses the primary expression that follows. This builds up the whole
+pair, the first of which is [+, b] for the running example.
+
+Now that we parsed the left-hand side of an expression and one pair of
+the RHS sequence, we have to decide which way the expression associates.
+In particular, we could have "(a+b) binop unparsed" or "a + (b binop
+unparsed)". To determine this, we look ahead at "binop" to determine its
+precedence and compare it to BinOp's precedence (which is '+' in this
+case):
+
+.. code-block:: c++
+
+ // If BinOp binds less tightly with RHS than the operator after RHS, let
+ // the pending operator take RHS as its LHS.
+ int NextPrec = GetTokPrecedence();
+ if (TokPrec < NextPrec) {
+
+If the precedence of the binop to the right of "RHS" is lower or equal
+to the precedence of our current operator, then we know that the
+parentheses associate as "(a+b) binop ...". In our example, the current
+operator is "+" and the next operator is "+", we know that they have the
+same precedence. In this case we'll create the AST node for "a+b", and
+then continue parsing:
+
+.. code-block:: c++
+
+ ... if body omitted ...
+ }
+
+ // Merge LHS/RHS.
+ LHS = new BinaryExprAST(BinOp, LHS, RHS);
+ } // loop around to the top of the while loop.
+ }
+
+In our example above, this will turn "a+b+" into "(a+b)" and execute the
+next iteration of the loop, with "+" as the current token. The code
+above will eat, remember, and parse "(c+d)" as the primary expression,
+which makes the current pair equal to [+, (c+d)]. It will then evaluate
+the 'if' conditional above with "\*" as the binop to the right of the
+primary. In this case, the precedence of "\*" is higher than the
+precedence of "+" so the if condition will be entered.
+
+The critical question left here is "how can the if condition parse the
+right hand side in full"? In particular, to build the AST correctly for
+our example, it needs to get all of "(c+d)\*e\*f" as the RHS expression
+variable. The code to do this is surprisingly simple (code from the
+above two blocks duplicated for context):
+
+.. code-block:: c++
+
+ // If BinOp binds less tightly with RHS than the operator after RHS, let
+ // the pending operator take RHS as its LHS.
+ int NextPrec = GetTokPrecedence();
+ if (TokPrec < NextPrec) {
+ RHS = ParseBinOpRHS(TokPrec+1, RHS);
+ if (RHS == 0) return 0;
+ }
+ // Merge LHS/RHS.
+ LHS = new BinaryExprAST(BinOp, LHS, RHS);
+ } // loop around to the top of the while loop.
+ }
+
+At this point, we know that the binary operator to the RHS of our
+primary has higher precedence than the binop we are currently parsing.
+As such, we know that any sequence of pairs whose operators are all
+higher precedence than "+" should be parsed together and returned as
+"RHS". To do this, we recursively invoke the ``ParseBinOpRHS`` function
+specifying "TokPrec+1" as the minimum precedence required for it to
+continue. In our example above, this will cause it to return the AST
+node for "(c+d)\*e\*f" as RHS, which is then set as the RHS of the '+'
+expression.
+
+Finally, on the next iteration of the while loop, the "+g" piece is
+parsed and added to the AST. With this little bit of code (14
+non-trivial lines), we correctly handle fully general binary expression
+parsing in a very elegant way. This was a whirlwind tour of this code,
+and it is somewhat subtle. I recommend running through it with a few
+tough examples to see how it works.
+
+This wraps up handling of expressions. At this point, we can point the
+parser at an arbitrary token stream and build an expression from it,
+stopping at the first token that is not part of the expression. Next up
+we need to handle function definitions, etc.
+
+Parsing the Rest
+================
+
+The next thing missing is handling of function prototypes. In
+Kaleidoscope, these are used both for 'extern' function declarations as
+well as function body definitions. The code to do this is
+straight-forward and not very interesting (once you've survived
+expressions):
+
+.. code-block:: c++
+
+ /// prototype
+ /// ::= id '(' id* ')'
+ static PrototypeAST *ParsePrototype() {
+ if (CurTok != tok_identifier)
+ return ErrorP("Expected function name in prototype");
+
+ std::string FnName = IdentifierStr;
+ getNextToken();
+
+ if (CurTok != '(')
+ return ErrorP("Expected '(' in prototype");
+
+ // Read the list of argument names.
+ std::vector<std::string> ArgNames;
+ while (getNextToken() == tok_identifier)
+ ArgNames.push_back(IdentifierStr);
+ if (CurTok != ')')
+ return ErrorP("Expected ')' in prototype");
+
+ // success.
+ getNextToken(); // eat ')'.
+
+ return new PrototypeAST(FnName, ArgNames);
+ }
+
+Given this, a function definition is very simple, just a prototype plus
+an expression to implement the body:
+
+.. code-block:: c++
+
+ /// definition ::= 'def' prototype expression
+ static FunctionAST *ParseDefinition() {
+ getNextToken(); // eat def.
+ PrototypeAST *Proto = ParsePrototype();
+ if (Proto == 0) return 0;
+
+ if (ExprAST *E = ParseExpression())
+ return new FunctionAST(Proto, E);
+ return 0;
+ }
+
+In addition, we support 'extern' to declare functions like 'sin' and
+'cos' as well as to support forward declaration of user functions. These
+'extern's are just prototypes with no body:
+
+.. code-block:: c++
+
+ /// external ::= 'extern' prototype
+ static PrototypeAST *ParseExtern() {
+ getNextToken(); // eat extern.
+ return ParsePrototype();
+ }
+
+Finally, we'll also let the user type in arbitrary top-level expressions
+and evaluate them on the fly. We will handle this by defining anonymous
+nullary (zero argument) functions for them:
+
+.. code-block:: c++
+
+ /// toplevelexpr ::= expression
+ static FunctionAST *ParseTopLevelExpr() {
+ if (ExprAST *E = ParseExpression()) {
+ // Make an anonymous proto.
+ PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>());
+ return new FunctionAST(Proto, E);
+ }
+ return 0;
+ }
+
+Now that we have all the pieces, let's build a little driver that will
+let us actually *execute* this code we've built!
+
+The Driver
+==========
+
+The driver for this simply invokes all of the parsing pieces with a
+top-level dispatch loop. There isn't much interesting here, so I'll just
+include the top-level loop. See `below <#code>`_ for full code in the
+"Top-Level Parsing" section.
+
+.. code-block:: c++
+
+ /// top ::= definition | external | expression | ';'
+ static void MainLoop() {
+ while (1) {
+ fprintf(stderr, "ready> ");
+ switch (CurTok) {
+ case tok_eof: return;
+ case ';': getNextToken(); break; // ignore top-level semicolons.
+ case tok_def: HandleDefinition(); break;
+ case tok_extern: HandleExtern(); break;
+ default: HandleTopLevelExpression(); break;
+ }
+ }
+ }
+
+The most interesting part of this is that we ignore top-level
+semicolons. Why is this, you ask? The basic reason is that if you type
+"4 + 5" at the command line, the parser doesn't know whether that is the
+end of what you will type or not. For example, on the next line you
+could type "def foo..." in which case 4+5 is the end of a top-level
+expression. Alternatively you could type "\* 6", which would continue
+the expression. Having top-level semicolons allows you to type "4+5;",
+and the parser will know you are done.
+
+Conclusions
+===========
+
+With just under 400 lines of commented code (240 lines of non-comment,
+non-blank code), we fully defined our minimal language, including a
+lexer, parser, and AST builder. With this done, the executable will
+validate Kaleidoscope code and tell us if it is grammatically invalid.
+For example, here is a sample interaction:
+
+.. code-block:: bash
+
+ $ ./a.out
+ ready> def foo(x y) x+foo(y, 4.0);
+ Parsed a function definition.
+ ready> def foo(x y) x+y y;
+ Parsed a function definition.
+ Parsed a top-level expr
+ ready> def foo(x y) x+y );
+ Parsed a function definition.
+ Error: unknown token when expecting an expression
+ ready> extern sin(a);
+ ready> Parsed an extern
+ ready> ^D
+ $
+
+There is a lot of room for extension here. You can define new AST nodes,
+extend the language in many ways, etc. In the `next
+installment <LangImpl3.html>`_, we will describe how to generate LLVM
+Intermediate Representation (IR) from the AST.
+
+Full Code Listing
+=================
+
+Here is the complete code listing for this and the previous chapter.
+Note that it is fully self-contained: you don't need LLVM or any
+external libraries at all for this. (Besides the C and C++ standard
+libraries, of course.) To build this, just compile with:
+
+.. code-block:: bash
+
+ # Compile
+ clang++ -g -O3 toy.cpp
+ # Run
+ ./a.out
+
+Here is the code:
+
+.. code-block:: c++
+
+ #include <cstdio>
+ #include <cstdlib>
+ #include <string>
+ #include <map>
+ #include <vector>
+
+ //===----------------------------------------------------------------------===//
+ // Lexer
+ //===----------------------------------------------------------------------===//
+
+ // The lexer returns tokens [0-255] if it is an unknown character, otherwise one
+ // of these for known things.
+ enum Token {
+ tok_eof = -1,
+
+ // commands
+ tok_def = -2, tok_extern = -3,
+
+ // primary
+ tok_identifier = -4, tok_number = -5
+ };
+
+ static std::string IdentifierStr; // Filled in if tok_identifier
+ static double NumVal; // Filled in if tok_number
+
+ /// gettok - Return the next token from standard input.
+ static int gettok() {
+ static int LastChar = ' ';
+
+ // Skip any whitespace.
+ while (isspace(LastChar))
+ LastChar = getchar();
+
+ if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
+ IdentifierStr = LastChar;
+ while (isalnum((LastChar = getchar())))
+ IdentifierStr += LastChar;
+
+ if (IdentifierStr == "def") return tok_def;
+ if (IdentifierStr == "extern") return tok_extern;
+ return tok_identifier;
+ }
+
+ if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+
+ std::string NumStr;
+ do {
+ NumStr += LastChar;
+ LastChar = getchar();
+ } while (isdigit(LastChar) || LastChar == '.');
+
+ NumVal = strtod(NumStr.c_str(), 0);
+ return tok_number;
+ }
+
+ if (LastChar == '#') {
+ // Comment until end of line.
+ do LastChar = getchar();
+ while (LastChar != EOF && LastChar != '\n' && LastChar != '\r');
+
+ if (LastChar != EOF)
+ return gettok();
+ }
+
+ // Check for end of file. Don't eat the EOF.
+ if (LastChar == EOF)
+ return tok_eof;
+
+ // Otherwise, just return the character as its ascii value.
+ int ThisChar = LastChar;
+ LastChar = getchar();
+ return ThisChar;
+ }
+
+ //===----------------------------------------------------------------------===//
+ // Abstract Syntax Tree (aka Parse Tree)
+ //===----------------------------------------------------------------------===//
+
+ /// ExprAST - Base class for all expression nodes.
+ class ExprAST {
+ public:
+ virtual ~ExprAST() {}
+ };
+
+ /// NumberExprAST - Expression class for numeric literals like "1.0".
+ class NumberExprAST : public ExprAST {
+ double Val;
+ public:
+ NumberExprAST(double val) : Val(val) {}
+ };
+
+ /// VariableExprAST - Expression class for referencing a variable, like "a".
+ class VariableExprAST : public ExprAST {
+ std::string Name;
+ public:
+ VariableExprAST(const std::string &name) : Name(name) {}
+ };
+
+ /// BinaryExprAST - Expression class for a binary operator.
+ class BinaryExprAST : public ExprAST {
+ char Op;
+ ExprAST *LHS, *RHS;
+ public:
+ BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
+ : Op(op), LHS(lhs), RHS(rhs) {}
+ };
+
+ /// CallExprAST - Expression class for function calls.
+ class CallExprAST : public ExprAST {
+ std::string Callee;
+ std::vector<ExprAST*> Args;
+ public:
+ CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
+ : Callee(callee), Args(args) {}
+ };
+
+ /// PrototypeAST - This class represents the "prototype" for a function,
+ /// which captures its name, and its argument names (thus implicitly the number
+ /// of arguments the function takes).
+ class PrototypeAST {
+ std::string Name;
+ std::vector<std::string> Args;
+ public:
+ PrototypeAST(const std::string &name, const std::vector<std::string> &args)
+ : Name(name), Args(args) {}
+
+ };
+
+ /// FunctionAST - This class represents a function definition itself.
+ class FunctionAST {
+ PrototypeAST *Proto;
+ ExprAST *Body;
+ public:
+ FunctionAST(PrototypeAST *proto, ExprAST *body)
+ : Proto(proto), Body(body) {}
+
+ };
+
+ //===----------------------------------------------------------------------===//
+ // Parser
+ //===----------------------------------------------------------------------===//
+
+ /// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
+ /// token the parser is looking at. getNextToken reads another token from the
+ /// lexer and updates CurTok with its results.
+ static int CurTok;
+ static int getNextToken() {
+ return CurTok = gettok();
+ }
+
+ /// BinopPrecedence - This holds the precedence for each binary operator that is
+ /// defined.
+ static std::map<char, int> BinopPrecedence;
+
+ /// GetTokPrecedence - Get the precedence of the pending binary operator token.
+ static int GetTokPrecedence() {
+ if (!isascii(CurTok))
+ return -1;
+
+ // Make sure it's a declared binop.
+ int TokPrec = BinopPrecedence[CurTok];
+ if (TokPrec <= 0) return -1;
+ return TokPrec;
+ }
+
+ /// Error* - These are little helper functions for error handling.
+ ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
+ PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
+ FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
+
+ static ExprAST *ParseExpression();
+
+ /// identifierexpr
+ /// ::= identifier
+ /// ::= identifier '(' expression* ')'
+ static ExprAST *ParseIdentifierExpr() {
+ std::string IdName = IdentifierStr;
+
+ getNextToken(); // eat identifier.
+
+ if (CurTok != '(') // Simple variable ref.
+ return new VariableExprAST(IdName);
+
+ // Call.
+ getNextToken(); // eat (
+ std::vector<ExprAST*> Args;
+ if (CurTok != ')') {
+ while (1) {
+ ExprAST *Arg = ParseExpression();
+ if (!Arg) return 0;
+ Args.push_back(Arg);
+
+ if (CurTok == ')') break;
+
+ if (CurTok != ',')
+ return Error("Expected ')' or ',' in argument list");
+ getNextToken();
+ }
+ }
+
+ // Eat the ')'.
+ getNextToken();
+
+ return new CallExprAST(IdName, Args);
+ }
+
+ /// numberexpr ::= number
+ static ExprAST *ParseNumberExpr() {
+ ExprAST *Result = new NumberExprAST(NumVal);
+ getNextToken(); // consume the number
+ return Result;
+ }
+
+ /// parenexpr ::= '(' expression ')'
+ static ExprAST *ParseParenExpr() {
+ getNextToken(); // eat (.
+ ExprAST *V = ParseExpression();
+ if (!V) return 0;
+
+ if (CurTok != ')')
+ return Error("expected ')'");
+ getNextToken(); // eat ).
+ return V;
+ }
+
+ /// primary
+ /// ::= identifierexpr
+ /// ::= numberexpr
+ /// ::= parenexpr
+ static ExprAST *ParsePrimary() {
+ switch (CurTok) {
+ default: return Error("unknown token when expecting an expression");
+ case tok_identifier: return ParseIdentifierExpr();
+ case tok_number: return ParseNumberExpr();
+ case '(': return ParseParenExpr();
+ }
+ }
+
+ /// binoprhs
+ /// ::= ('+' primary)*
+ static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
+ // If this is a binop, find its precedence.
+ while (1) {
+ int TokPrec = GetTokPrecedence();
+
+ // If this is a binop that binds at least as tightly as the current binop,
+ // consume it, otherwise we are done.
+ if (TokPrec < ExprPrec)
+ return LHS;
+
+ // Okay, we know this is a binop.
+ int BinOp = CurTok;
+ getNextToken(); // eat binop
+
+ // Parse the primary expression after the binary operator.
+ ExprAST *RHS = ParsePrimary();
+ if (!RHS) return 0;
+
+ // If BinOp binds less tightly with RHS than the operator after RHS, let
+ // the pending operator take RHS as its LHS.
+ int NextPrec = GetTokPrecedence();
+ if (TokPrec < NextPrec) {
+ RHS = ParseBinOpRHS(TokPrec+1, RHS);
+ if (RHS == 0) return 0;
+ }
+
+ // Merge LHS/RHS.
+ LHS = new BinaryExprAST(BinOp, LHS, RHS);
+ }
+ }
+
+ /// expression
+ /// ::= primary binoprhs
+ ///
+ static ExprAST *ParseExpression() {
+ ExprAST *LHS = ParsePrimary();
+ if (!LHS) return 0;
+
+ return ParseBinOpRHS(0, LHS);
+ }
+
+ /// prototype
+ /// ::= id '(' id* ')'
+ static PrototypeAST *ParsePrototype() {
+ if (CurTok != tok_identifier)
+ return ErrorP("Expected function name in prototype");
+
+ std::string FnName = IdentifierStr;
+ getNextToken();
+
+ if (CurTok != '(')
+ return ErrorP("Expected '(' in prototype");
+
+ std::vector<std::string> ArgNames;
+ while (getNextToken() == tok_identifier)
+ ArgNames.push_back(IdentifierStr);
+ if (CurTok != ')')
+ return ErrorP("Expected ')' in prototype");
+
+ // success.
+ getNextToken(); // eat ')'.
+
+ return new PrototypeAST(FnName, ArgNames);
+ }
+
+ /// definition ::= 'def' prototype expression
+ static FunctionAST *ParseDefinition() {
+ getNextToken(); // eat def.
+ PrototypeAST *Proto = ParsePrototype();
+ if (Proto == 0) return 0;
+
+ if (ExprAST *E = ParseExpression())
+ return new FunctionAST(Proto, E);
+ return 0;
+ }
+
+ /// toplevelexpr ::= expression
+ static FunctionAST *ParseTopLevelExpr() {
+ if (ExprAST *E = ParseExpression()) {
+ // Make an anonymous proto.
+ PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>());
+ return new FunctionAST(Proto, E);
+ }
+ return 0;
+ }
+
+ /// external ::= 'extern' prototype
+ static PrototypeAST *ParseExtern() {
+ getNextToken(); // eat extern.
+ return ParsePrototype();
+ }
+
+ //===----------------------------------------------------------------------===//
+ // Top-Level parsing
+ //===----------------------------------------------------------------------===//
+
+ static void HandleDefinition() {
+ if (ParseDefinition()) {
+ fprintf(stderr, "Parsed a function definition.\n");
+ } else {
+ // Skip token for error recovery.
+ getNextToken();
+ }
+ }
+
+ static void HandleExtern() {
+ if (ParseExtern()) {
+ fprintf(stderr, "Parsed an extern\n");
+ } else {
+ // Skip token for error recovery.
+ getNextToken();
+ }
+ }
+
+ static void HandleTopLevelExpression() {
+ // Evaluate a top-level expression into an anonymous function.
+ if (ParseTopLevelExpr()) {
+ fprintf(stderr, "Parsed a top-level expr\n");
+ } else {
+ // Skip token for error recovery.
+ getNextToken();
+ }
+ }
+
+ /// top ::= definition | external | expression | ';'
+ static void MainLoop() {
+ while (1) {
+ fprintf(stderr, "ready> ");
+ switch (CurTok) {
+ case tok_eof: return;
+ case ';': getNextToken(); break; // ignore top-level semicolons.
+ case tok_def: HandleDefinition(); break;
+ case tok_extern: HandleExtern(); break;
+ default: HandleTopLevelExpression(); break;
+ }
+ }
+ }
+
+ //===----------------------------------------------------------------------===//
+ // Main driver code.
+ //===----------------------------------------------------------------------===//
+
+ int main() {
+ // Install standard binary operators.
+ // 1 is lowest precedence.
+ BinopPrecedence['<'] = 10;
+ BinopPrecedence['+'] = 20;
+ BinopPrecedence['-'] = 20;
+ BinopPrecedence['*'] = 40; // highest.
+
+ // Prime the first token.
+ fprintf(stderr, "ready> ");
+ getNextToken();
+
+ // Run the main "interpreter loop" now.
+ MainLoop();
+
+ return 0;
+ }
+
+`Next: Implementing Code Generation to LLVM IR <LangImpl3.html>`_
+