--- /dev/null
- the parsing routines can be given C-like parameters. So, for example
+.\" $Id$
+.\" Run this paper off with
+.\" refer [options] -p LLgen.refs LLgen.doc | [n]eqn | tbl | (nt)roff -ms
+.if '\*(>.'' \{\
+. if '\*(<.'' \{\
+. if n .ds >. .
+. if n .ds >, ,
+. if t .ds <. .
+. if t .ds <, ,\
+\}\
+\}
+.cs 5 22u
+.ND
+.EQ
+delim @@
+.EN
+.TL
+LLgen, an extended LL(1) parser generator
+.AU
+Ceriel J. H. Jacobs
+.AI
+Dept. of Mathematics and Computer Science
+Vrije Universiteit
+Amsterdam, The Netherlands
+.AB
+\fILLgen\fR provides a
+tool for generating an efficient recursive descent parser
+with no backtrack from
+an Extended Context Free syntax.
+The \fILLgen\fR
+user specifies the syntax, together with code
+describing actions associated with the parsing process.
+\fILLgen\fR
+turns this specification into a number of subroutines that handle the
+parsing process.
+.PP
+The grammar may be ambiguous.
+\fILLgen\fR contains both static and dynamic facilities
+to resolve these ambiguities.
+.PP
+The specification can be split into several files, for each of
+which \fILLgen\fR generates an output file containing the
+corresponding part of the parser.
+Furthermore, only output files that differ from their previous
+version are updated.
+Other output files are not affected in any
+way.
+This allows the user to recompile only those output files that have
+changed.
+.PP
+The subroutine produced by \fILLgen\fR calls a user supplied routine
+that must return the next token. This way, the input to the
+parser can be split into single characters or higher level
+tokens.
+.PP
+An error recovery mechanism is generated almost completely
+automatically.
+It is based on so called \fBdefault choices\fR, which are
+implicitly or explicitly specified by the user.
+.PP
+\fILLgen\fR has successfully been used to create recognizers for
+Pascal, C, and Modula-2.
+.AE
+.NH
+Introduction
+.PP
+\fILLgen\fR
+provides a tool for generating an efficient recursive
+descent parser with no backtrack from an Extended Context Free
+syntax.
+A parser generated by
+\fILLgen\fR
+will be called
+\fILLparse\fR
+for the rest of this document.
+It is assumed that the reader has some knowledge of LL(1) grammars and
+recursive descent parsers.
+For a survey on the subject, see reference
+.[ (
+griffiths
+.]).
+.PP
+Extended LL(1) parsers are an extension of LL(1) parsers. They are
+derived from an Extended Context-Free (ECF) syntax instead of a Context-Free
+(CF) syntax.
+ECF syntax is described in section 2.
+Section 3 provides an outline of a
+specification as accepted by
+\fILLgen\fR and also discusses the lexical conventions of
+grammar specification files.
+Section 4 provides a description of the way the
+\fILLgen\fR
+user can associate
+actions with the syntax. These actions must be written in the programming
+language C,
+.[
+kernighan ritchie
+.]
+which also is the target language of \fILLgen\fR.
+The error recovery technique is discussed in section 5.
+This section also discusses what the user can do about it.
+Section 6 discusses
+the facilities \fILLgen\fR offers
+to resolve ambiguities and conflicts.
+\fILLgen\fR offers facilities to resolve them both at parser
+generation time and during the execution of \fILLparse\fR.
+Section 7 discusses the
+\fILLgen\fR
+working environment.
+It also discusses the lexical analyzer that must be supplied by the
+user.
+This lexical analyzer must read the input stream and break it
+up into basic input items, called \fBtokens\fR for the rest of
+this document.
+Appendix A gives a summary of the
+\fILLgen\fR
+input syntax.
+Appendix B gives an example.
+It is very instructive to compare this example with the one
+given in reference
+.[ (
+yacc
+.]).
+It demonstrates the struggle \fILLparse\fR and other LL(1)
+parsers have with expressions.
+Appendix C gives an example of the \fILLgen\fR features
+allowing the user to recompile only those output files that
+have changed, using the \fImake\fR program.
+.[
+make
+.]
+.NH
+The Extended Context-Free Syntax
+.PP
+The extensions of an ECF syntax with respect to an ordinary CF syntax are:
+.IP 1. 10
+An ECF syntax contains the repetition operator: "N" (N represents a positive
+integer).
+.IP 2. 10
+An ECF syntax contains the closure set operator without and with
+upperbound: "*" and "*N".
+.IP 3. 10
+An ECF syntax contains the positive closure set operator without and with
+upperbound: "+" and "+N".
+.IP 4. 10
+An ECF syntax contains the optional operator: "?", which is a
+shorthand for "*1".
+.IP 5. 10
+An ECF syntax contains brackets "[" and "]" which can be
+used for grouping.
+.PP
+We can describe the syntax of an ECF syntax with an ECF syntax :
+.DS
+.ft CW
+grammar : rule +
+ ;
+.ft R
+.DE
+This grammar rule states that a grammar consists of one or more
+rules.
+.DS
+.ft CW
+rule : nonterminal ':' productionrule ';'
+ ;
+.ft R
+.DE
+A rule consists of a left hand side, the nonterminal,
+followed by ":",
+the \fBproduce symbol\fR, followed by a production rule, followed by a
+";", in\%di\%ca\%ting the end of the rule.
+.DS
+.ft CW
+productionrule : production [ '|' production ]*
+ ;
+.ft R
+.DE
+A production rule consists of one or
+more alternative productions separated by "|". This symbol is called the
+\fBalternation symbol\fR.
+.DS
+.ft CW
+production : term *
+ ;
+.ft R
+.DE
+A production consists of a possibly empty list of terms.
+So, empty productions are allowed.
+.DS
+.ft CW
+term : element repeats
+ ;
+.ft R
+.DE
+A term is an element, possibly with a repeat specification.
+.DS
+.ft CW
+element : LITERAL
+ | IDENTIFIER
+ | '[' productionrule ']'
+ ;
+.ft R
+.DE
+An element can be a LITERAL, which basically is a single character
+between apostrophes, it can be an IDENTIFIER, which is either a
+nonterminal or a token, and it can be a production rule
+between square brackets.
+.DS
+.ft CW
+repeats : '?'
+ | [ '*' | '+' ] NUMBER ?
+ | NUMBER ?
+ ;
+.ft R
+.DE
+These are the repeat specifications discussed above. Notice that
+this specification may be empty.
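+.PP
+As an illustration, the following terms show each form of repeat
+specification.
+The nonterminal name "item" is made up for this example, and the
+comments give the intended reading of each form, following the list of
+extensions above.
+.DS
+.ft CW
+item 3      /* exactly three items */
+item *      /* zero or more items */
+item * 3    /* at most three items */
+item +      /* one or more items */
+item + 3    /* at least one and at most three items */
+item ?      /* an optional item, the same as "item * 1" */
+.ft R
+.DE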
+.PP
+The class of ECF languages
+is identical with the class of CF languages. However, in many
+cases recursive definitions of language features can now be
+replaced by iterative ones. This tends to reduce the number of
+nonterminals and gives rise to very efficient recursive descent
+parsers.
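+.PP
+A small sketch of this effect, using the made-up names "list", "rest"
+and "item": a comma-separated list written recursively in a CF syntax
+.DS
+.ft CW
+list : item rest
+ ;
+rest : ',' item rest
+ | /* empty */
+ ;
+.ft R
+.DE
+can be written iteratively in an ECF syntax with a single rule:
+.DS
+.ft CW
+list : item [ ',' item ]*
+ ;
+.ft R
+.DE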
+.NH
+Grammar Specifications
+.PP
+The major part of a
+\fILLgen\fR
+grammar specification consists of an
+ECF syntax specification.
+Names in this syntax specification refer to either tokens or nonterminal
+symbols.
+\fILLgen\fR
+requires token names to be declared as such. This way it
+can be avoided that a typing error in a nonterminal name causes it to
+be accepted as a token name. The token declarations will be
+discussed later.
+A name will be regarded as a nonterminal symbol, unless it is declared
+as a token name.
+If there is no production rule for a nonterminal symbol, \fILLgen\fR
+will complain.
+.PP
+A grammar specification may also include some C routines,
+for instance the lexical analyzer and an error reporting
+routine.
+Thus, a grammar specification file can contain declarations,
+grammar rules and C-code.
+.PP
+Blanks, tabs and newlines are ignored, but may not appear in names or
+keywords.
+Comments may appear wherever a name is legal (which is almost
+everywhere).
+They are enclosed in
+/* ... */, as in C. Comments do not nest.
+.PP
+Names may be of arbitrary length, and can be made up of letters, underscore
+"\_" and non-initial digits. Upper and lower case letters are distinct.
+Only the first 50 characters are significant.
+Notice however, that the names for the tokens will be used by the
+C-preprocessor.
+The number of significant characters therefore depends on the
+underlying C-implementation.
+A safe rule is to make the identifiers distinct in the first six
+characters, case ignored.
+.PP
+There are two kinds of tokens:
+those that are declared and are denoted by a name,
+and literals.
+.PP
+A literal consists of a character enclosed in apostrophes "'".
+The "\e" is an escape character within literals. The following escapes
+are recognized :
+.TS
+center;
+l l.
+\&'\en' newline
+\&'\er' return
+\&'\e'' apostrophe "'"
+\&'\e\e' backslash "\e"
+\&'\et' tab
+\&'\eb' backspace
+\&'\ef' form feed
+\&'\exxx' "xxx" in octal
+.TE
+.PP
+Names representing tokens must be declared before they are used.
+This can be done using the "\fB%token\fR" keyword,
+by writing
+.nf
+.ft CW
+.sp 1
+%token name1, name2, . . . ;
+.ft R
+.fi
+.PP
+\fILLparse\fR is designed to recognize special nonterminal
+symbols called \fBstart symbols\fR.
+\fILLgen\fR allows for more than one start symbol.
+Thus, grammars with more than one entry point are accepted.
+The start symbols must be declared explicitly using the
+"\fB%start\fR" keyword. It can be used whenever a declaration is
+legal, f.i.:
+.nf
+.ft CW
+.sp 1
+%start LLparse, specification ;
+.ft R
+.fi
+.sp 1
+declares "specification" as a start symbol and associates the
+identifier "LLparse" with it.
+"LLparse" will now be the name of the C-function that must be
+called to recognize "specification".
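+.PP
+Putting these pieces together, a complete but minimal specification
+file might look as follows.
+This is only a sketch: the names "NUMBER", "parse" and "sum" are
+invented for the example, and the lexical analyzer and the other
+routines described in section 7 still have to be supplied.
+.nf
+.ft CW
+.sp 1
+{
+#include <stdio.h>     /* C-code, copied to the output */
+}
+
+%token NUMBER ;        /* a named token */
+%start parse, sum ;    /* parse() recognizes "sum" */
+
+sum : NUMBER [ '+' NUMBER ]*
+ ;
+.sp 1
+.ft R
+.fi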
+.NH
+Actions
+.PP
+\fILLgen\fR
+allows arbitrary insertions of actions within the right hand side
+of a production rule in the ECF syntax. An action consists of a number of C
+statements, enclosed in the brackets "{" and "}".
+.PP
+\fILLgen\fR
+generates a parsing routine for each rule in the grammar. The actions
+supplied by the user are just inserted in the proper place.
+There may also be declarations before the statements in the
+action, as
+the "{" and "}" are copied into the target code along with the
+action. The scope of these declarations terminates with the
+closing bracket "}" of the action.
+.PP
+In addition to actions, it is also possible to declare local variables
+in the parsing routine, which can then be used in the actions.
+Such a declaration consists of a number of C variable declarations,
+enclosed in the brackets "{" and "}". It must be placed
+right in front of the ":" in the grammar rule.
+The scope of these local variables consists of the complete
+grammar rule.
+.PP
+In order to facilitate communication between the actions and
+\fILLparse\fR,
- expr(int level, int *val;) { int expr; } :
++the parsing routines can be given C-like parameters.
++Each parameter must be declared separately, and each of these declarations must
++end with a semicolon.
++For the last parameter, the semicolon is optional.
++.PP
++So, for example
+.nf
+.ft CW
+.sp 1
+expr(int *pval;) { int fact; } :
+ /*
+ * Rule with one parameter, a pointer to an int.
+ * Parameter specifications are ordinary C declarations.
+ * One local variable, of type int.
+ */
+ factor (&fact) { *pval = fact; }
+ /*
+ * factor is another nonterminal symbol.
+ * One actual parameter is supplied.
+ * Notice that the parameter passing mechanism is that
+ * of C.
+ */
+ [ '+' factor (&fact) { *pval += fact; } ]*
+ /*
+ * remember the '*' means zero or more times
+ */
+ ;
+.sp 1
+.ft R
+.fi
+is a rule to recognize a number of factors, separated by "+", and
+to compute their sum.
+.PP
+\fILLgen\fR
+generates C code, so the parameter passing mechanism is that of
+C, as is shown in the example above.
+.PP
+Actions often manipulate attributes of the token just read.
+For instance, when an identifier is read, its name must be
+looked up in a symbol table.
+Therefore, \fILLgen\fR generates code
+such that at a number of places in the grammar rule
+it is defined which token has last been read.
+After a token, the last token read is this token.
+After a "[" or a "|", the last token read is the next token to
+be accepted by \fILLparse\fR.
+At all other places, it is undefined which token has last been
+read.
+The last token read is available in the global integer variable
+\fILLsymb\fR.
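+.PP
+For instance, the following sketch (the rule name "addop" and its
+parameter are made up for this example) records which operator was
+recognized.
+Each action directly follows a token, so \fILLsymb\fR is defined there.
+.nf
+.ft CW
+.sp 1
+addop(int *op;) :
+    '+'  { *op = LLsymb; }    /* LLsymb is '+' here */
+  | '-'  { *op = LLsymb; }    /* LLsymb is '-' here */
+  ;
+.sp 1
+.ft R
+.fi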
+.PP
+The user may also specify C-code wherever a \fILLgen\fR-declaration is
+legal.
+Again, this code must be enclosed in the brackets "{" and "}".
+This way, the user can define global declarations and
+C-functions.
+To avoid name-conflicts with identifiers generated by
+\fILLgen\fR, \fILLparse\fR only uses names beginning with
+"LL"; the user should avoid such names.
+.NH
+Error Recovery
+.PP
+The error recovery technique used by \fILLgen\fR is a
+modification of the one presented in reference
+.[ (
+automatic construction error correcting
+.]).
+It is based on \fBdefault choices\fR:
+at every point in the grammar where there is a
+choice, one of the possibilities is designated as the default.
+Thus, in an alternation, one of the productions is marked as a
+default choice, and in a term with a non-fixed repetition
+specification there will also be a default choice (between
+doing the term (once more) and continuing with the rest of the
+production in which the term appears).
+.PP
+When \fILLparse\fR detects an error after having parsed the
+string @s@, the default choices enable it to compute one
+syntactically correct continuation,
+consisting of the tokens @t sub 1~...~t sub n@,
+such that @s~t sub 1~...~t sub n@ is a string of tokens that
+is a member of the language defined by the grammar.
+Notice, that the computation of this continuation must
+terminate, which implies that the default choices may not
+invoke recursive rules.
+.PP
+At each point in this continuation, a certain number of other
+tokens could also be syntactically correct, f.i. the token
+@t@ is syntactically correct at point @t sub i@ in this
+continuation, if the string @s~t sub 1~...~t sub i~t~s sub 1@
+is a string of the language defined by the grammar for some
+string @s sub 1@ and i >= 0.
+.PP
+The set @T@
+containing all these tokens (including @t sub 1 ,~...,~t sub n@) is computed.
+Next, \fILLparse\fR discards zero
+or more tokens from its input, until a token
+@t@ \(mo @T@ is found.
+The error is then corrected by inserting i (i >= 0) tokens
+@t sub 1~...~t sub i@, such that the string
+@s~t sub 1~...~t sub i~t~s sub 1@ is a string of the language
+defined by the grammar, for some @s sub 1@.
+Then, normal parsing is resumed.
+.PP
+The above is difficult to implement in a recursive descent
+parser, and is not the way \fILLparse\fR does it, but the
+effect is the same. In fact, \fILLparse\fR maintains a list
+of tokens that may not be discarded, which is adjusted as
+\fILLparse\fR proceeds. This list is just a representation
+of the set @T@ mentioned
+above. When an error occurs, \fILLparse\fR discards tokens until
+a token @t@ that is a member of this list is found.
+Then, it continues parsing, following the default choices,
+inserting tokens along the way, until this token @t@ is legal.
+The selection of
+the default choices must guarantee that this will always
+happen.
+.PP
+The default choices are explicitly or implicitly
+specified by the user.
+By default, the default choice in an alternation is the
+alternative with the shortest possible terminal production.
+The user can select one of the other productions in the
+alternation as the default choice by putting the keyword
+"\fB%default\fR" in front of it.
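+For example, in the sketch below (all names are made up), the second
+alternative is made the default choice, regardless of which alternative
+has the shortest terminal production:
+.nf
+.ft CW
+.sp 1
+statement : assignment
+  | %default empty_statement
+  ;
+.sp 1
+.ft R
+.fi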
+.PP
+By default, for terms with a repetition count containing "*" or
+"?" the default choice is to continue with the rest of the rule
+in which the term appears, and
+.sp 1
+.ft CW
+.nf
+ term+
+.fi
+.ft R
+.sp 1
+is treated as
+.sp 1
+.nf
+.ft CW
+ term term* .
+.ft R
+.fi
+.PP
+It can never be the default choice to do
+the term (once more), because this could cause the parser to
+loop, inserting tokens forever.
+However, the user may not want the parser to skip
+tokens that would have been kept if doing the term
+had been the default choice.
+In that case, the skipping of such a term can be prevented by
+using the keyword "\fB%persistent\fR".
+For instance, the rule
+.sp 1
+.ft CW
+.nf
+commandlist : command* ;
+.fi
+.ft R
+.sp 1
+could be changed to
+.sp 1
+.ft CW
+.nf
+commandlist : [ %persistent command ]* ;
+.fi
+.ft R
+.sp 1
+The effects of this in case of a syntax error are twofold:
+The set @T@ mentioned above will be extended as if "command" were
+in the default production, so that fewer tokens will be
+skipped.
+Also, if the first token that is not skipped is a member of the
+subset of @T@ arising from the grammar rule for "command",
+\fILLparse\fR will enter that rule.
+So, in fact the default choice
+is determined dynamically (by \fILLparse\fR).
+Again, \fILLgen\fR checks (statically)
+that \fILLparse\fR will always terminate, and if not,
+\fILLgen\fR will complain.
+.PP
+An important property of this error recovery method is that,
+once a rule is started, it will be finished.
+This means that all actions in the rule will be executed
+normally, so that the user can be sure that there will be no
+inconsistencies in his data structures because of syntax
+errors.
+Also, as the method is in fact error correcting, the
+actions in a rule only have to deal with syntactically correct
+input.
+.NH
+Ambiguities and conflicts
+.PP
+As \fILLgen\fR generates a recursive descent parser with no backtrack,
+it must at all times be able to determine what to do,
+based on the current input symbol.
+Unfortunately, this cannot be done for all grammars.
+Two kinds of conflicts can arise :
+.IP 1) 10
+the grammar rule is of the form "production1 | production2",
+and \fILLparse\fR cannot decide which production to choose.
+This we call an \fBalternation conflict\fR.
+.IP 2) 10
+the grammar rule is of the form "[ productionrule ]...",
+where ... specifies a non-fixed repetition count,
+and \fILLparse\fR cannot decide whether to
+choose "productionrule" once more, or to continue.
+This we call a \fBrepetition conflict\fR.
+.PP
+There can be several causes for conflicts: the grammar may be
+ambiguous, or the grammar may require a more complex parser
+than \fILLgen\fR can construct.
+The conflicts can be examined by inspecting the verbose
+(-\fBv\fR) option output file.
+The conflicts can be resolved by rewriting the grammar
+or by using \fBconflict resolvers\fR.
+The mechanism described here is based on the attributed parsing
+of reference
+.[ (
+milton
+.]).
+.PP
+An alternation conflict can be resolved by putting an \fBif condition\fR
+in front of the first conflicting production.
+It consists of a "\fB%if\fR" followed by a
+C-expression between parentheses.
+\fILLparse\fR will then evaluate this expression whenever a
+token is met at this point on which there is a conflict, so
+the conflict will be resolved dynamically.
+If the expression evaluates to
+non-zero, the first conflicting production is chosen,
+otherwise one of the remaining ones is chosen.
+.PP
+An alternation conflict can also be resolved using the keywords
+"\fB%prefer\fR" or "\fB%avoid\fR". "\fB%prefer\fR"
+is equivalent in behaviour to
+"\fB%if\fR (1)". "\fB%avoid\fR" is equivalent to "\fB%if\fR (0)".
+In these cases however, "\fB%prefer\fR" and "\fB%avoid\fR" should be used,
+as they resolve the conflict statically and thus
+give rise to better C-code.
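+.PP
+A classic case is the "dangling else".
+In the sketch below (the token and nonterminal names are assumptions
+made for this example), an ELSESYM token following a statement could
+either start the first alternative or follow the empty one;
+"\fB%prefer\fR" selects the first alternative, so that the else-part is
+bound to the nearest if-statement:
+.nf
+.ft CW
+.sp 1
+if_statement :
+    IFSYM '(' expression ')' statement
+    [ %prefer ELSESYM statement
+    | /* empty */
+    ]
+  ;
+.sp 1
+.ft R
+.fi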
+.PP
+A repetition conflict can be resolved by putting a \fBwhile condition\fR
+right after the opening bracket "[". This while condition
+consists of a "\fB%while\fR" followed by a C-expression between
+parentheses. Again, \fILLparse\fR will then
+evaluate this expression whenever a token is met
+at this point on which there is a conflict.
+If the expression evaluates to non-zero, the
+repeating part is chosen, otherwise the parser continues with
+the rest of the rule.
+Appendix B gives an example of these features.
+.PP
+A useful aid in writing conflict resolvers is the "\fB%first\fR" keyword.
+It is used to declare a C-macro that forms an expression
+returning 1 if the parameter supplied can start a specified
+nonterminal, f.i.:
+.sp 1
+.nf
+.ft CW
+%first fmac, nonterm ;
+.ft R
+.sp 1
+.fi
+declares "fmac" as a macro with one parameter, whose value
+is a token number. If the parameter
+X can start the nonterminal "nonterm", "fmac(X)" is true,
+otherwise it is false.
+.NH
+The LLgen working environment
+.PP
+\fILLgen\fR generates a number of files: one for each input
+file, and two other files: \fILpars.c\fR and \fILpars.h\fR.
+\fILpars.h\fR contains "#\ define"s for the token names.
+\fILpars.c\fR contains the error recovery routines and tables.
+Only those output files that differ from their previous version
+are updated. See appendix C for a possible application of this
+feature.
+.PP
+The names of the output files are constructed as
+follows:
+in the input file name, the suffix after the last point is
+replaced by a "c". If no point is present in the input file
+name, ".c" is appended to it. \fILLgen\fR checks that the
+filename constructed this way in fact represents a previous
+version, or does not exist already.
+.PP
+The user must provide some environment to obtain a complete
+program.
+Routines called \fImain\fR and \fILLmessage\fR must be defined.
+Also, a lexical analyzer must be provided.
+.PP
+The routine \fImain\fR must be defined, as it must be in every
+C-program. It should eventually call one of the startsymbol
+routines.
+.PP
+The routine \fILLmessage\fR must accept one
+parameter, whose value is a token number, zero or -1.
+.br
+A zero parameter indicates that the current token (the one in
+the external variable \fILLsymb\fR) is deleted.
+.br
+A -1 parameter indicates that the parser expected end of file, but didn't get
+it.
+The parser will then skip tokens until end of file is detected.
+.br
+A parameter that is a token number (a positive parameter)
+indicates that this
+token is to be inserted in front of the token currently in
+\fILLsymb\fR.
+The user can give the token the proper attributes.
+Also, the user must take care, that the token currently in
+\fILLsymb\fR is again returned by the \fBnext\fR call to the
+lexical analyzer, with the proper attributes.
+So, the lexical analyzer must have a facility to push back one
+token.
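+.PP
+A minimal sketch of such a routine is shown below.
+It only reports the three cases; the push-back of the current token,
+which is needed in the insertion case, is left as a comment because it
+depends on the user's lexical analyzer.
+The messages and the parameter name "tk" are just an example.
+.nf
+.ft CW
+.sp 1
+{
+#include <stdio.h>
+extern int LLsymb;
+
+LLmessage(tk)
+    int tk;
+{
+    if (tk > 0) {
+        /* Token tk is inserted in front of the one in LLsymb.
+         * Arrange here for the lexical analyzer to return the
+         * token in LLsymb again on its next call.
+         */
+        fprintf(stderr, "token %d inserted\en", tk);
+    }
+    else if (tk == 0) {
+        fprintf(stderr, "token %d deleted\en", LLsymb);
+    }
+    else {
+        fprintf(stderr, "end of file expected\en");
+    }
+}
+}
+.sp 1
+.ft R
+.fi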
+.PP
+The user may also supply his own error recovery routines, or handle
+errors differently. For this purpose, the name of a routine to be called
+when an error occurs may be declared using the keyword \fB%onerror\fR.
+This routine takes two parameters.
+The first one is either the token number of the
+token expected, or 0. In the latter case, the error occurred at a choice.
+In both cases, the routine must ensure that the next call to the lexical
+analyzer returns the token that replaces the current one. Of course,
+that could well be the current one, in which case
+.I LLparse
+recovers from the error.
+The second parameter contains a list of tokens that are not skipped at the
+error point. The list is in the form of a null-terminated array of integers,
+whose address is passed.
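+.PP
+As a sketch, such a routine could be declared and defined as follows.
+The name "recover" and the parameter names are made up for this
+example; the routine merely prints the information it receives.
+How the replacing token is handed back to the lexical analyzer depends
+on that analyzer and is therefore only indicated by a comment.
+.nf
+.ft CW
+.sp 1
+%onerror recover ;
+{
+#include <stdio.h>
+
+recover(expected, safe)
+    int expected;    /* token number expected, or 0 at a choice */
+    int *safe;       /* null-terminated list of tokens not skipped */
+{
+    if (expected) fprintf(stderr, "token %d expected\en", expected);
+    else fprintf(stderr, "syntax error\en");
+    while (*safe) {
+        fprintf(stderr, "  token %d is not skipped\en", *safe);
+        safe++;
+    }
+    /* Arrange here that the next call to the lexical analyzer
+     * returns the token that replaces the current one.
+     */
+}
+}
+.sp 1
+.ft R
+.fi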
+.PP
+The user must supply a lexical analyzer to read the input stream and
+break it up into tokens, which are passed to
+.I LLparse .
+It should be an integer valued function, returning the token number.
+The name of this function can be declared using the
+"\fB%lexical\fR" keyword.
+This keyword can be used wherever a declaration is legal and may appear
+only once in the grammar specification, f.i.:
+.sp 1
+.nf
+.ft CW
+%lexical scanner ;
+.ft R
+.fi
+.sp 1
+declares "scanner" as the name of the lexical analyzer.
+The default name for the lexical analyzer is "yylex".
+The reason for this funny name is that a useful tool for constructing
+lexical analyzers is the
+.I Lex
+program,
+.[
+lex
+.]
+which generates a routine of that name.
+.PP
+The token numbers are chosen by \fILLgen\fR.
+The token number for a literal
+is the numerical value of the character in the local character set.
+If the tokens have a name,
+the "#\ define" mechanism of C is used to give them a value and
+to allow the lexical analyzer to return their token numbers symbolically.
+These "#\ define"s are collected in the file \fILpars.h\fR which
+can be "#\ include"d in any file that needs the token-names.
+The maximum token number chosen is defined in the macro \fILL_MAXTOKNO\fP.
+.PP
+The lexical analyzer must signal the end
+of input to \fILLparse\fR
+by returning a number less than or equal to zero.
+.NH
+Programs with more than one parser
+.PP
+\fILLgen\fR offers a simple facility for having more than one parser in
+a program: in this case, the user can change the names of global procedures,
+variables, etc, by giving a different prefix, like this:
+.sp 1
+.nf
+.ft CW
+%prefix XX ;
+.ft R
+.fi
+.sp 1
+The effect of this is that all global names start with XX instead of LL, for
+the parser that has this prefix. This holds for the variables \fILLsymb\fP,
+which now is called \fIXXsymb\fP, for the routine \fILLmessage\fP,
+which must now be called \fIXXmessage\fP, and for the macro \fILL_MAXTOKNO\fP,
+which is now called \fIXX_MAXTOKNO\fP.
+\fILL.output\fP is now \fIXX.output\fP, and \fILpars.c\fP and \fILpars.h\fP
+are now called \fIXXpars.c\fP and \fIXXpars.h\fP.
+.bp
+.SH
+References
+.[
+$LIST$
+.]
+.bp
+.SH
+Appendix A : LLgen Input Syntax
+.PP
+This appendix has a description of the \fILLgen\fR input syntax,
+as a \fILLgen\fR specification. As a matter of fact, the current
+version of \fILLgen\fR is written with \fILLgen\fR.
+.nf
+.ft CW
+.sp 2
+/*
+ * First the declarations of the terminals
+ * The order is not important
+ */
+
+%token IDENTIFIER; /* terminal or nonterminal name */
+%token NUMBER;
+%token LITERAL;
+
+/*
+ * Reserved words
+ */
+
+%token TOKEN; /* %token */
+%token START; /* %start */
+%token PERSISTENT; /* %persistent */
+%token IF; /* %if */
+%token WHILE; /* %while */
+%token AVOID; /* %avoid */
+%token PREFER; /* %prefer */
+%token DEFAULT; /* %default */
+%token LEXICAL; /* %lexical */
+%token PREFIX; /* %prefix */
+%token ONERROR; /* %onerror */
+%token FIRST; /* %first */
+
+/*
+ * Declare LLparse to be a C-routine that recognizes "specification"
+ */
+
+%start LLparse, specification;
+
+specification
+ : declaration*
+ ;
+
+declaration
+ : START
+ IDENTIFIER ',' IDENTIFIER
+ ';'
+ | '{'
+ /* Read C-declaration here */
+ '}'
+ | TOKEN
+ IDENTIFIER
+ [ ',' IDENTIFIER ]*
+ ';'
+ | FIRST
+ IDENTIFIER ',' IDENTIFIER
+ ';'
+ | LEXICAL
+ IDENTIFIER
+ ';'
+ | PREFIX
+ IDENTIFIER
+ ';'
+ | ONERROR
+ IDENTIFIER
+ ';'
+ | rule
+ ;
+
+rule : IDENTIFIER parameters? ldecl?
+ ':' productions
+ ';'
+ ;
+
+ldecl : '{'
+ /* Read C-declaration here */
+ '}'
+ ;
+
+productions
+ : simpleproduction
+ [ '|' simpleproduction ]*
+ ;
+
+simpleproduction
+ : DEFAULT?
+ [ IF '(' /* Read C-expression here */ ')'
+ | PREFER
+ | AVOID
+ ]?
+ [ element repeats ]*
+ ;
+
+element : '{'
+ /* Read action here */
+ '}'
+ | '[' [ WHILE '(' /* Read C-expression here */ ')' ]?
+ PERSISTENT?
+ productions
+ ']'
+ | LITERAL
+ | IDENTIFIER parameters?
+ ;
+
+parameters
+ : '(' /* Read C-parameters here */ ')'
+ ;
+
+repeats : /* empty */
+ | [ '*' | '+' ] NUMBER?
+ | NUMBER
+ | '?'
+ ;
+
+.fi
+.ft R
+.bp
+.SH
+Appendix B : An example
+.PP
+This example gives the complete \fILLgen\fR specification of a simple
+desk calculator. It has 26 registers, labeled "a" through "z",
+and accepts arithmetic expressions made up of the C operators
++, -, *, /, %, &, and |, with their usual priorities.
+The value of the expression is
+printed. As in C, an integer that begins with 0 is assumed to
+be octal; otherwise it is assumed to be decimal.
+.PP
+Although the example is short and not very complicated, it
+demonstrates the use of if and while conditions. In
+the example they are in fact used to reduce the number of
+nonterminals, and to reduce the overhead due to the recursion
+that would be involved in parsing an expression with an
+ordinary recursive descent parser. In an ordinary LL(1)
+grammar there would be one nonterminal for each operator
+priority. The example shows how we can do it all with one
+nonterminal, no matter how many priority levels there are.
+.sp 1
+.nf
+.ft CW
+{
+#include <stdio.h>
+#include <ctype.h>
+#define MAXPRIO 5
+#define prio(op) (ptab[op])
+
+struct token {
+ int t_tokno; /* token number */
+ int t_tval; /* Its attribute */
+} stok = { 0,0 }, tok;
+
+int nerrors = 0;
+int regs[26]; /* Space for the registers */
+int ptab[128]; /* Attribute table */
+
+struct token
+nexttok() { /* Read next token and return it */
+ register c;
+ struct token new;
+
+ while ((c = getchar()) == ' ' || c == '\et') { /* nothing */ }
+ if (isdigit(c)) new.t_tokno = DIGIT;
+ else if (islower(c)) new.t_tokno = IDENT;
+ else new.t_tokno = c;
+ if (c >= 0) new.t_tval = ptab[c];
+ return new;
+} }
+
+%token DIGIT, IDENT;
+%start parse, list;
+
+list : stat* ;
+
+stat { int ident, val; } :
+ %if (stok = nexttok(),
+ stok.t_tokno == '=')
+ /* The conflict is resolved by looking one further
+ * token ahead. The grammar is LL(2)
+ */
+ IDENT
+ { ident = tok.t_tval; }
+ '=' expr(1,&val) '\en'
+ { if (!nerrors) regs[ident] = val; }
+ | expr(1,&val) '\en'
+ { if (!nerrors) printf("%d\en",val); }
+ | '\en'
+ ;
+
++expr(int level; int *val;) { int expr; } :
+ factor(val)
+ [ %while (prio(tok.t_tokno) >= level)
+ /* Swallow operators as long as their priority is
+ * larger than or equal to the level of this invocation
+ */
+ '+' expr(prio('+')+1,&expr)
+ { *val += expr; }
+ /* This states that '+' groups left to right. If it
+ * should group right to left, the rule should read:
+ * '+' expr(prio('+'),&expr)
+ */
+ | '-' expr(prio('-')+1,&expr)
+ { *val -= expr; }
+ | '*' expr(prio('*')+1,&expr)
+ { *val *= expr; }
+ | '/' expr(prio('/')+1,&expr)
+ { *val /= expr; }
+ | '%' expr(prio('%')+1,&expr)
+ { *val %= expr; }
+ | '&' expr(prio('&')+1,&expr)
+ { *val &= expr; }
+ | '|' expr(prio('|')+1,&expr)
+ { *val |= expr; }
+ ]*
+ /* Notice the "*" here. It is important.
+ */
+ ;
+
+factor(int *val;):
+ '(' expr(1,val) ')'
+ | '-' expr(MAXPRIO+1,val)
+ { *val = -*val; }
+ | number(val)
+ | IDENT
+ { *val = regs[tok.t_tval]; }
+ ;
+
+number(int *val;) { int base; }
+ : DIGIT
+ { base = (*val=tok.t_tval)==0?8:10; }
+ [ DIGIT
+ { *val = base * *val + tok.t_tval; }
+ ]* ;
+
+%lexical scanner ;
+{
+scanner() {
+ if (stok.t_tokno) { /* a token has been inserted or read ahead */
+ tok = stok;
+ stok.t_tokno = 0;
+ return tok.t_tokno;
+ }
+ if (nerrors && tok.t_tokno == '\en') {
+ printf("ERROR\en");
+ nerrors = 0;
+ }
+ tok = nexttok();
+ return tok.t_tokno;
+}
+
+LLmessage(insertedtok) {
+ nerrors++;
+ if (insertedtok) { /* token inserted, save old token */
+ stok = tok;
+ tok.t_tval = 0;
+ if (insertedtok < 128) tok.t_tval = ptab[insertedtok];
+ }
+}
+
+main() {
+ register *p;
+
+ for (p = ptab; p < &ptab[128]; p++) *p = 0;
+ /* for letters, their attribute is their index in the regs array */
+ for (p = &ptab['a']; p <= &ptab['z']; p++) *p = p - &ptab['a'];
+ /* for digits, their attribute is their value */
+ for (p = &ptab['0']; p <= &ptab['9']; p++) *p = p - &ptab['0'];
+ /* for operators, their attribute is their priority */
+ ptab['*'] = 4;
+ ptab['/'] = 4;
+ ptab['%'] = 4;
+ ptab['+'] = 3;
+ ptab['-'] = 3;
+ ptab['&'] = 2;
+ ptab['|'] = 1;
+ parse();
+ exit(nerrors);
+} }
+.fi
+.ft R
+.bp
+.SH
+Appendix C : How to use \fILLgen\fR
+.PP
+This appendix demonstrates how \fILLgen\fR can be used in
+combination with the \fImake\fR program, to make effective use
+of the \fILLgen\fR-feature that it only changes output files
+when necessary. \fIMake\fR uses a "makefile", which
+is a file containing dependencies and associated commands.
+A dependency usually indicates that some files depend on other
+files. When a file depends on another file and is older than
+that other file, the commands associated with the dependency
+are executed.
+.PP
+So, \fImake\fR seems just the program that we always wanted.
+However, it
+is not very good at handling programs that generate more than
+one file.
+As usual, there is a way around this problem.
+A sample makefile follows:
+.sp 1
+.ft CW
+.nf
+# The grammar consists of the files decl.g, stat.g and expr.g.
+# The ".o"-files are the result of a C-compilation.
+
+GFILES = decl.g stat.g expr.g
+OFILES = decl.o stat.o expr.o Lpars.o
+LLOPT =
+
+# As make doesn't handle programs that generate more than one
+# file well, we just don't tell make about it.
+# We just create a dummy file, and touch it whenever LLgen is
+# executed. This way, the dummy in fact depends on the grammar
+# files.
+# Then, we execute make again, to do the C-compilations and
+# such.
+
+all: dummy
+ make parser
+
+dummy: $(GFILES)
+ LLgen $(LLOPT) $(GFILES)
+ touch dummy
+
+parser: $(OFILES)
+ $(CC) -o parser $(LDFLAGS) $(OFILES)
+
+# Some dependencies without actions :
+# make already knows what to do about them
+
+Lpars.o: Lpars.h
+stat.o: Lpars.h
+decl.o: Lpars.h
+expr.o: Lpars.h
+
+.fi
+.ft R