Added more precise info about parameters

author ceriel <none@none>

Tue, 20 Dec 1994 12:40:21 +0000 (12:40 +0000)

committer ceriel <none@none>

Tue, 20 Dec 1994 12:40:21 +0000 (12:40 +0000)
diff --cc doc/LLgen/LLgen.n

index d91edb1,0000000..3d9786a

mode 100644,000000..100644
--- 1/doc/LLgen/LLgen.n
--- /dev/null
+++ b/doc/LLgen/LLgen.n
@@@ -1,1072 -1,0 +1,1077 @@@
- the parsing routines can be given C-like parameters. So, for example
+ +.\"   $Id$
+ +.\"   Run this paper off with
+ +.\"   refer [options] -p LLgen.refs LLgen.doc | [n]eqn | tbl | (nt)roff -ms
+ +.if '\*(>.'' \{\
+ +.     if '\*(<.'' \{\
+ +.             if n .ds >. .
+ +.             if n .ds >, ,
+ +.             if t .ds <. .
+ +.             if t .ds <, ,\
+ +\}\
+ +\}
+ +.cs 5 22u
+ +.ND
+ +.EQ
+ +delim @@
+ +.EN
+ +.TL
+ +LLgen, an extended LL(1) parser generator
+ +.AU
+ +Ceriel J. H. Jacobs
+ +.AI
+ +Dept. of Mathematics and Computer Science
+ +Vrije Universiteit
+ +Amsterdam, The Netherlands
+ +.AB
+ +\fILLgen\fR provides a
+ +tool for generating an efficient recursive descent parser
+ +with no backtrack from
+ +an Extended Context Free syntax.
+ +The \fILLgen\fR
+ +user specifies the syntax, together with code
+ +describing actions associated with the parsing process.
+ +\fILLgen\fR
+ +turns this specification into a number of subroutines that handle the
+ +parsing process.
+ +.PP
+ +The grammar may be ambiguous.
+ +\fILLgen\fR contains both static and dynamic facilities
+ +to resolve these ambiguities.
+ +.PP
+ +The specification can be split into several files, for each of
+ +which \fILLgen\fR generates an output file containing the
+ +corresponding part of the parser.
+ +Furthermore, only output files that differ from their previous
+ +version are updated.
+ +Other output files are not affected in any
+ +way.
+ +This allows the user to recompile only those output files that have
+ +changed.
+ +.PP
+ +The subroutine produced by \fILLgen\fR calls a user supplied routine
+ +that must return the next token. This way, the input to the
+ +parser can be split into single characters or higher level
+ +tokens.
+ +.PP
+ +An error recovery mechanism is generated almost completely
+ +automatically.
+ +It is based on so-called \fBdefault choices\fR, which are
+ +implicitly or explicitly specified by the user.
+ +.PP
+ +\fILLgen\fR has successfully been used to create recognizers for
+ +Pascal, C, and Modula-2.
+ +.AE
+ +.NH
+ +Introduction
+ +.PP
+ +\fILLgen\fR
+ +provides a tool for generating an efficient recursive
+ +descent parser with no backtrack from an Extended Context Free
+ +syntax.
+ +A parser generated by
+ +\fILLgen\fR
+ +will be called
+ +\fILLparse\fR
+ +for the rest of this document.
+ +It is assumed that the reader has some knowledge of LL(1) grammars and
+ +recursive descent parsers.
+ +For a survey on the subject, see reference
+ +.[ (
+ +griffiths
+ +.]).
+ +.PP
+ +Extended LL(1) parsers are an extension of LL(1) parsers. They are
+ +derived from an Extended Context-Free (ECF) syntax instead of a Context-Free
+ +(CF) syntax.
+ +ECF syntax is described in section 2.
+ +Section 3 provides an outline of a
+ +specification as accepted by
+ +\fILLgen\fR and also discusses the lexical conventions of
+ +grammar specification files.
+ +Section 4 provides a description of the way the
+ +\fILLgen\fR
+ +user can associate
+ +actions with the syntax. These actions must be written in the programming
+ +language C,
+ +.[
+ +kernighan ritchie
+ +.]
+ +which also is the target language of \fILLgen\fR.
+ +The error recovery technique is discussed in section 5.
+ +This section also discusses what the user can do about it.
+ +Section 6 discusses
+ +the facilities \fILLgen\fR offers
+ +to resolve ambiguities and conflicts.
+ +\fILLgen\fR offers facilities to resolve them both at parser
+ +generation time and during the execution of \fILLparse\fR.
+ +Section 7 discusses the
+ +\fILLgen\fR
+ +working environment.
+ +It also discusses the lexical analyzer that must be supplied by the
+ +user.
+ +This lexical analyzer must read the input stream and break it
+ +up into basic input items, called \fBtokens\fR for the rest of
+ +this document.
+ +Appendix A gives a summary of the
+ +\fILLgen\fR
+ +input syntax.
+ +Appendix B gives an example.
+ +It is very instructive to compare this example with the one
+ +given in reference
+ +.[ (
+ +yacc
+ +.]).
+ +It demonstrates the struggle \fILLparse\fR and other LL(1)
+ +parsers have with expressions.
+ +Appendix C gives an example of the \fILLgen\fR features
+ +allowing the user to recompile only those output files that
+ +have changed, using the \fImake\fR program.
+ +.[
+ +make
+ +.]
+ +.NH
+ +The Extended Context-Free Syntax
+ +.PP
+ +The extensions of an ECF syntax with respect to an ordinary CF syntax are:
+ +.IP 1. 10
+ +An ECF syntax contains the repetition operator: "N" (N represents a positive
+ +integer).
+ +.IP 2. 10
+ +An ECF syntax contains the closure set operator without and with
+ +upper bound: "*" and "*N".
+ +.IP 3. 10
+ +An ECF syntax contains the positive closure set operator without and with
+ +upper bound: "+" and "+N".
+ +.IP 4. 10
+ +An ECF syntax contains the optional operator: "?", which is a
+ +shorthand for "*1".
+ +.IP 5. 10
+ +An ECF syntax contains square brackets "[" and "]", which can be
+ +used for grouping.
+ +.PP
+ +We can describe the syntax of an ECF syntax with an ECF syntax :
+ +.DS
+ +.ft CW
+ +grammar         : rule +
+ +                ;
+ +.ft R
+ +.DE
+ +This grammar rule states that a grammar consists of one or more
+ +rules.
+ +.DS
+ +.ft CW
+ +rule            : nonterminal ':' productionrule ';'
+ +                ;
+ +.ft R
+ +.DE
+ +A rule consists of a left hand side, the nonterminal,
+ +followed by ":",
+ +the \fBproduce symbol\fR, followed by a production rule, followed by a
+ +";", in\%di\%ca\%ting the end of the rule.
+ +.DS
+ +.ft CW
+ +productionrule  : production [ '|' production ]*
+ +                ;
+ +.ft R
+ +.DE
+ +A production rule consists of one or
+ +more alternative productions separated by "|". This symbol is called the
+ +\fBalternation symbol\fR.
+ +.DS
+ +.ft CW
+ +production      : term *
+ +                ;
+ +.ft R
+ +.DE
+ +A production consists of a possibly empty list of terms.
+ +So, empty productions are allowed.
+ +.DS
+ +.ft CW
+ +term            : element repeats
+ +                ;
+ +.ft R
+ +.DE
+ +A term is an element, possibly with a repeat specification.
+ +.DS
+ +.ft CW
+ +element         : LITERAL
+ +                | IDENTIFIER
+ +                | '[' productionrule ']'
+ +                ;
+ +.ft R
+ +.DE
+ +An element can be a LITERAL, which basically is a single character
+ +between apostrophes, it can be an IDENTIFIER, which is either a
+ +nonterminal or a token, and it can be a production rule
+ +between square brackets.
+ +.DS
+ +.ft CW
+ +repeats         : '?'
+ +                | [ '*' | '+' ] NUMBER ?
+ +                | NUMBER ?
+ +                ;
+ +.ft R
+ +.DE
+ +These are the repeat specifications discussed above. Notice that
+ +this specification may be empty.
+ +.PP
+ +The class of ECF languages
+ +is identical with the class of CF languages. However, in many
+ +cases recursive definitions of language features can now be
+ +replaced by iterative ones. This tends to reduce the number of
+ +nonterminals and gives rise to very efficient recursive descent
+ +parsers.
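As a rough illustration of how iteration replaces recursion, the ECF rule "list : item [ ',' item ]* ;" maps onto a plain loop. The following is a hand-written sketch, not LLgen output. The function name count_items and the convention that an item is a single lowercase letter are assumptions made for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Hand-written sketch (not LLgen output) of a recognizer for
 *     list : item [ ',' item ]* ;
 * An item is assumed to be a single lowercase letter.
 * Returns the number of items, or -1 on a syntax error.
 */
int count_items(const char *s)
{
    int n = 0;

    assert(s != NULL);
    if (!islower((unsigned char)*s))
        return -1;                      /* the first item is mandatory */
    s++;
    n++;
    while (*s == ',') {                 /* [ ',' item ]* becomes a loop */
        s++;
        if (!islower((unsigned char)*s))
            return -1;
        s++;
        n++;
    }
    return *s == '\0' ? n : -1;         /* all input must be consumed */
}
```

A purely context-free formulation would need an extra nonterminal for the tail of the list and one recursive call per item; here the closure operator costs only a loop iteration.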
+ +.NH
+ +Grammar Specifications
+ +.PP
+ +The major part of a
+ +\fILLgen\fR
+ +grammar specification consists of an
+ +ECF syntax specification.
+ +Names in this syntax specification refer to either tokens or nonterminal
+ +symbols.
+ +\fILLgen\fR
+ +requires token names to be declared as such. This prevents a
+ +typing error in a nonterminal name from causing it to
+ +be accepted as a token name. The token declarations will be
+ +discussed later.
+ +A name will be regarded as a nonterminal symbol, unless it is declared
+ +as a token name.
+ +If there is no production rule for a nonterminal symbol, \fILLgen\fR
+ +will complain.
+ +.PP
+ +A grammar specification may also include some C routines,
+ +for instance the lexical analyzer and an error reporting
+ +routine.
+ +Thus, a grammar specification file can contain declarations,
+ +grammar rules and C-code.
+ +.PP
+ +Blanks, tabs and newlines are ignored, but may not appear in names or
+ +keywords.
+ +Comments may appear wherever a name is legal (which is almost
+ +everywhere).
+ +They are enclosed in
+ +/* ... */, as in C. Comments do not nest.
+ +.PP
+ +Names may be of arbitrary length, and can be made up of letters, underscore
+ +"\_" and non-initial digits. Upper and lower case letters are distinct.
+ +Only the first 50 characters are significant.
+ +Notice however, that the names for the tokens will be used by the
+ +C-preprocessor.
+ +The number of significant characters therefore depends on the
+ +underlying C-implementation.
+ +A safe rule is to make the identifiers distinct in the first six
+ +characters, case ignored.
+ +.PP
+ +There are two kinds of tokens:
+ +those that are declared and are denoted by a name,
+ +and literals.
+ +.PP
+ +A literal consists of a character enclosed in apostrophes "'".
+ +The "\e" is an escape character within literals. The following escapes
+ +are recognized :
+ +.TS
+ +center;
+ +l l.
+ +\&'\en'       newline
+ +\&'\er'       return
+ +\&'\e''       apostrophe "'"
+ +\&'\e\e'      backslash "\e"
+ +\&'\et'       tab
+ +\&'\eb'       backspace
+ +\&'\ef'       form feed
+ +\&'\exxx'     "xxx" in octal
+ +.TE
+ +.PP
+ +Names representing tokens must be declared before they are used.
+ +This can be done using the "\fB%token\fR" keyword,
+ +by writing
+ +.nf
+ +.ft CW
+ +.sp 1
+ +%token  name1, name2, . . . ;
+ +.ft R
+ +.fi
+ +.PP
+ +\fILLparse\fR is designed to recognize special nonterminal
+ +symbols called \fBstart symbols\fR.
+ +\fILLgen\fR allows for more than one start symbol.
+ +Thus, grammars with more than one entry point are accepted.
+ +The start symbols must be declared explicitly using the
+ +"\fB%start\fR" keyword. It can be used whenever a declaration is
+ +legal, e.g.:
+ +.nf
+ +.ft CW
+ +.sp 1
+ +%start LLparse, specification ;
+ +.ft R
+ +.fi
+ +.sp 1
+ +declares "specification" as a start symbol and associates the
+ +identifier "LLparse" with it.
+ +"LLparse" will now be the name of the C-function that must be
+ +called to recognize "specification".
+ +.NH
+ +Actions
+ +.PP
+ +\fILLgen\fR
+ +allows arbitrary insertions of actions within the right hand side
+ +of a production rule in the ECF syntax. An action consists of a number of C
+ +statements, enclosed in the brackets "{" and "}".
+ +.PP
+ +\fILLgen\fR
+ +generates a parsing routine for each rule in the grammar. The actions
+ +supplied by the user are just inserted in the proper place.
+ +There may also be declarations before the statements in the
+ +action, as
+ +the "{" and "}" are copied into the target code along with the
+ +action. The scope of these declarations terminates with the
+ +closing bracket "}" of the action.
+ +.PP
+ +In addition to actions, it is also possible to declare local variables
+ +in the parsing routine, which can then be used in the actions.
+ +Such a declaration consists of a number of C variable declarations,
+ +enclosed in the brackets "{" and "}". It must be placed
+ +right in front of the ":" in the grammar rule.
+ +The scope of these local variables consists of the complete
+ +grammar rule.
+ +.PP
+ +In order to facilitate communication between the actions and
+ +\fILLparse\fR,
- expr(int level, int *val;) {       int     expr; } :
++the parsing routines can be given C-like parameters.
++Each parameter must be declared separately, and each of these declarations must
++end with a semicolon.
++For the last parameter, the semicolon is optional.
++.PP
++So, for example
+ +.nf
+ +.ft CW
+ +.sp 1
+ +expr(int *pval;) { int fact; } :
+ +                /*
+ +                 * Rule with one parameter, a pointer to an int.
+ +                 * Parameter specifications are ordinary C declarations.
+ +                 * One local variable, of type int.
+ +                 */
+ +        factor (&fact)          { *pval = fact; }
+ +                /*
+ +                 * factor is another nonterminal symbol.
+ +                 * One actual parameter is supplied.
+ +                 * Notice that the parameter passing mechanism is that
+ +                 * of C.
+ +                 */
+ +        [ '+' factor (&fact)    { *pval += fact; } ]*
+ +                /*
+ +                 * remember the '*' means zero or more times
+ +                 */
+ +        ;
+ +.sp 1
+ +.ft R
+ +.fi
+ +is a rule to recognize a number of factors, separated by "+", and
+ +to compute their sum.
+ +.PP
+ +\fILLgen\fR
+ +generates C code, so the parameter passing mechanism is that of
+ +C, as is shown in the example above.
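To make the parameter passing concrete, here is a hand-written C sketch of routines that behave like the rule above. It is not the code LLgen generates. The input cursor, the single-digit factor, and the wrapper sum_digits are assumptions introduced so that the sketch stands alone.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

static const char *input;       /* hypothetical input cursor */

/* factor: assumed here to be a single decimal digit; result goes to *pval */
static void factor(int *pval)
{
    assert(isdigit((unsigned char)*input));
    *pval = *input++ - '0';
}

/* Mirrors: expr(int *pval;) { int fact; } : factor(&fact) [ '+' factor(&fact) ]* ; */
static void expr(int *pval)
{
    int fact;                   /* the rule's local variable */

    factor(&fact);
    *pval = fact;               /* action after the first factor */
    while (*input == '+') {     /* zero or more repetitions */
        input++;
        factor(&fact);
        *pval += fact;          /* action inside the repetition */
    }
}

/* Hypothetical wrapper: returns the sum of the digits in "d+d+...". */
int sum_digits(const char *s)
{
    int val;

    assert(s != NULL);
    input = s;
    expr(&val);
    return val;
}
```

Because C passes arguments by value, a result can only flow back to the caller through a pointer, which is why both the rule and the sketch take an int pointer.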
+ +.PP
+ +Actions often manipulate attributes of the token just read.
+ +For instance, when an identifier is read, its name must be
+ +looked up in a symbol table.
+ +Therefore, \fILLgen\fR generates code
+ +such that at a number of places in the grammar rule
+ +it is defined which token has last been read.
+ +After a token, the last token read is this token.
+ +After a "[" or a "|", the last token read is the next token to
+ +be accepted by \fILLparse\fR.
+ +At all other places, it is undefined which token has last been
+ +read.
+ +The last token read is available in the global integer variable
+ +\fILLsymb\fR.
+ +.PP
+ +The user may also specify C-code wherever a \fILLgen\fR-declaration is
+ +legal.
+ +Again, this code must be enclosed in the brackets "{" and "}".
+ +This way, the user can define global declarations and
+ +C-functions.
+ +To avoid name-conflicts with identifiers generated by
+ +\fILLgen\fR, \fILLparse\fR only uses names beginning with
+ +"LL"; the user should avoid such names.
+ +.NH
+ +Error Recovery
+ +.PP
+ +The error recovery technique used by \fILLgen\fR is a
+ +modification of the one presented in reference
+ +.[ (
+ +automatic construction error correcting
+ +.]).
+ +It is based on \fBdefault choices\fR: at every point in the
+ +grammar where there is a choice, one of the possibilities
+ +is designated as the default.
+ +Thus, in an alternation, one of the productions is marked as a
+ +default choice, and in a term with a non-fixed repetition
+ +specification there will also be a default choice (between
+ +doing the term (once more) and continuing with the rest of the
+ +production in which the term appears).
+ +.PP
+ +When \fILLparse\fR detects an error after having parsed the
+ +string @s@, the default choices enable it to compute one
+ +syntactically correct continuation,
+ +consisting of the tokens @t sub 1~...~t sub n@,
+ +such that @s~t sub 1~...~t sub n@ is a string of tokens that
+ +is a member of the language defined by the grammar.
+ +Notice that the computation of this continuation must
+ +terminate, which implies that the default choices may not
+ +invoke recursive rules.
+ +.PP
+ +At each point in this continuation, a certain number of other
+ +tokens could also be syntactically correct; e.g., the token
+ +@t@ is syntactically correct at point @t sub i@ in this
+ +continuation, if the string @s~t sub 1~...~t sub i~t~s sub 1@
+ +is a string of the language defined by the grammar for some
+ +string @s sub 1@ and i >= 0.
+ +.PP
+ +The set @T@
+ +containing all these tokens (including @t sub 1 ,~...,~t sub n@) is computed.
+ +Next, \fILLparse\fR discards zero
+ +or more tokens from its input, until a token
+ +@t@ \(mo @T@ is found.
+ +The error is then corrected by inserting i (i >= 0) tokens
+ +@t sub 1~...~t sub i@, such that the string
+ +@s~t sub 1~...~t sub i~t~s sub 1@ is a string of the language
+ +defined by the grammar, for some @s sub 1@.
+ +Then, normal parsing is resumed.
+ +.PP
+ +The above is difficult to implement in a recursive descent
+ +parser, and is not the way \fILLparse\fR does it, but the
+ +effect is the same. In fact, \fILLparse\fR maintains a list
+ +of tokens that may not be discarded, which is adjusted as
+ +\fILLparse\fR proceeds. This list is just a representation
+ +of the set @T@ mentioned
+ +above. When an error occurs, \fILLparse\fR discards tokens until
+ +a token @t@ that is a member of this list is found.
+ +Then, it continues parsing, following the default choices,
+ +inserting tokens along the way, until this token @t@ is legal.
+ +The selection of
+ +the default choices must guarantee that this will always
+ +happen.
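The discarding step can be sketched in C as follows. This is only an illustration under stated assumptions, not the code in the generated parser: the token stream and the list of tokens that may not be discarded are both modelled as zero-terminated integer arrays, and the name skip_to_acceptable is hypothetical. The subsequent insertion of tokens along the default choices is not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if tok occurs in the zero-terminated array set, else 0. */
static int in_set(int tok, const int *set)
{
    assert(set != NULL);
    for (; *set != 0; set++)
        if (*set == tok)
            return 1;
    return 0;
}

/*
 * Discard tokens from input, starting at index pos, until a token is
 * found that is a member of no_skip. Returns the index of that token,
 * or the index of the terminating 0 if the input runs out first.
 */
size_t skip_to_acceptable(const int *input, size_t pos, const int *no_skip)
{
    assert(input != NULL);
    while (input[pos] != 0 && !in_set(input[pos], no_skip))
        pos++;                  /* this token is discarded */
    return pos;
}
```

The sketch stops at the end of the input even when no acceptable token was found; that is a simplification chosen for the example.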
+ +.PP
+ +The default choices are explicitly or implicitly
+ +specified by the user.
+ +By default, the default choice in an alternation is the
+ +alternative with the shortest possible terminal production.
+ +The user can select one of the other productions in the
+ +alternation as the default choice by putting the keyword
+ +"\fB%default\fR" in front of it.
+ +.PP
+ +By default, for terms with a repetition count containing "*" or
+ +"?" the default choice is to continue with the rest of the rule
+ +in which the term appears, and
+ +.sp 1
+ +.ft CW
+ +.nf
+ +                term+
+ +.fi
+ +.ft R
+ +.sp 1
+ +is treated as
+ +.sp 1
+ +.nf
+ +.ft CW
+ +                term term* .
+ +.ft R
+ +.fi
+ +.PP
+ +It is also clear that it can never be the default choice to do
+ +the term (once more), because this could cause the parser to
+ +loop, inserting tokens forever.
+ +However, when the user does not want the parser to skip
+ +tokens that would not have been skipped if the term
+ +had been the default choice,
+ +the skipping of such a term can be prevented by
+ +using the keyword "\fB%persistent\fR".
+ +For instance, the rule
+ +.sp 1
+ +.ft CW
+ +.nf
+ +commandlist : command* ;
+ +.fi
+ +.ft R
+ +.sp 1
+ +could be changed to
+ +.sp 1
+ +.ft CW
+ +.nf
+ +commandlist : [ %persistent command ]* ;
+ +.fi
+ +.ft R
+ +.sp 1
+ +The effects of this in case of a syntax error are twofold:
+ +The set @T@ mentioned above will be extended as if "command" were
+ +in the default production, so that fewer tokens will be
+ +skipped.
+ +Also, if the first token that is not skipped is a member of the
+ +subset of @T@ arising from the grammar rule for "command",
+ +\fILLparse\fR will enter that rule.
+ +So, in fact the default choice
+ +is determined dynamically (by \fILLparse\fR).
+ +Again, \fILLgen\fR checks (statically)
+ +that \fILLparse\fR will always terminate, and if not,
+ +\fILLgen\fR will complain.
+ +.PP
+ +An important property of this error recovery method is that,
+ +once a rule is started, it will be finished.
+ +This means that all actions in the rule will be executed
+ +normally, so that the user can be sure that there will be no
+ +inconsistencies in his data structures because of syntax
+ +errors.
+ +Also, as the method is in fact error correcting, the
+ +actions in a rule only have to deal with syntactically correct
+ +input.
+ +.NH
+ +Ambiguities and conflicts
+ +.PP
+ +As \fILLgen\fR generates a recursive descent parser with no backtrack,
+ +it must at all times be able to determine what to do,
+ +based on the current input symbol.
+ +Unfortunately, this cannot be done for all grammars.
+ +Two kinds of conflicts can arise :
+ +.IP 1) 10
+ +the grammar rule is of the form "production1 | production2",
+ +and \fILLparse\fR cannot decide which production to choose.
+ +This we call an \fBalternation conflict\fR.
+ +.IP 2) 10
+ +the grammar rule is of the form "[ productionrule ]...",
+ +where ... specifies a non-fixed repetition count,
+ +and \fILLparse\fR cannot decide whether to
+ +choose "productionrule" once more, or to continue.
+ +This we call a \fBrepetition conflict\fR.
+ +.PP
+ +There can be several causes for conflicts: the grammar may be
+ +ambiguous, or the grammar may require a more complex parser
+ +than \fILLgen\fR can construct.
+ +The conflicts can be examined by inspecting the verbose
+ +(-\fBv\fR) option output file.
+ +The conflicts can be resolved by rewriting the grammar
+ +or by using \fBconflict resolvers\fR.
+ +The mechanism described here is based on the attributed parsing
+ +of reference
+ +.[ (
+ +milton
+ +.]).
+ +.PP
+ +An alternation conflict can be resolved by putting an \fBif condition\fR
+ +in front of the first conflicting production.
+ +It consists of a "\fB%if\fR" followed by a
+ +C-expression between parentheses.
+ +\fILLparse\fR will then evaluate this expression whenever a
+ +token is met at this point on which there is a conflict, so
+ +the conflict will be resolved dynamically.
+ +If the expression evaluates to
+ +non-zero, the first conflicting production is chosen,
+ +otherwise one of the remaining ones is chosen.
+ +.PP
+ +An alternation conflict can also be resolved using the keywords
+ +"\fB%prefer\fR" or "\fB%avoid\fR". "\fB%prefer\fR"
+ +is equivalent in behaviour to
+ +"\fB%if\fR (1)". "\fB%avoid\fR" is equivalent to "\fB%if\fR (0)".
+ +In these cases however, "\fB%prefer\fR" and "\fB%avoid\fR" should be used,
+ +as they resolve the conflict statically and thus
+ +give rise to better C-code.
+ +.PP
+ +A repetition conflict can be resolved by putting a \fBwhile condition\fR
+ +right after the opening bracket "[". This while condition
+ +consists of a "\fB%while\fR" followed by a C-expression between
+ +parentheses. Again, \fILLparse\fR will then
+ +evaluate this expression whenever a token is met
+ +at this point on which there is a conflict.
+ +If the expression evaluates to non-zero, the
+ +repeating part is chosen, otherwise the parser continues with
+ +the rest of the rule.
+ +Appendix B will give an example of these features.
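In plain C terms, a while condition acts like the test of a loop that decides whether to take the repeating part once more. The sketch below is hand-written and is not LLgen output. It evaluates single-digit expressions with one routine and a priority level; the names cur, prio, and evaluate and the priority values are assumptions made for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

static const char *cur;         /* hypothetical input cursor */

/* Priority of a binary operator; 0 means "not an operator". */
static int prio(int op)
{
    switch (op) {
    case '+': case '-': return 1;
    case '*': case '/': return 2;
    default:            return 0;
    }
}

static int factor(void)         /* assumed: a single decimal digit */
{
    assert(isdigit((unsigned char)*cur));
    return *cur++ - '0';
}

/* Handle operators whose priority is at least level (level >= 1). */
static int expr(int level)
{
    int val = factor();

    /* This test plays the role of the %while condition. */
    while (prio(*cur) >= level) {
        int op = *cur++;
        int rhs = expr(prio(op) + 1);   /* +1 groups left to right */

        switch (op) {
        case '+': val += rhs; break;
        case '-': val -= rhs; break;
        case '*': val *= rhs; break;
        case '/': val /= rhs; break;
        }
    }
    return val;
}

int evaluate(const char *s)
{
    assert(s != NULL);
    cur = s;
    return expr(1);
}
```

Without the test on the priority, the routine could not tell whether an operator belongs to the current invocation or to an enclosing one, which is exactly the repetition conflict.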
+ +.PP
+ +A useful aid in writing conflict resolvers is the "\fB%first\fR" keyword.
+ +It is used to declare a C-macro that forms an expression
+ +returning 1 if the parameter supplied can start a specified
+ +nonterminal, e.g.:
+ +.sp 1
+ +.nf
+ +.ft CW
+ +%first fmac, nonterm ;
+ +.ft R
+ +.sp 1
+ +.fi
+ +declares "fmac" as a macro with one parameter, whose value
+ +is a token number. If the parameter
+ +X can start the nonterminal "nonterm", "fmac(X)" is true,
+ +otherwise it is false.
+ +.NH
+ +The LLgen working environment
+ +.PP
+ +\fILLgen\fR generates a number of files: one for each input
+ +file, and two other files: \fILpars.c\fR and \fILpars.h\fR.
+ +\fILpars.h\fR contains "#\ define"s for the token names.
+ +\fILpars.c\fR contains the error recovery routines and tables.
+ +Only those output files that differ from their previous version
+ +are updated. See appendix C for a possible application of this
+ +feature.
+ +.PP
+ +The names of the output files are constructed as
+ +follows:
+ +in the input file name, the suffix after the last point is
+ +replaced by a "c". If no point is present in the input file
+ +name, ".c" is appended to it. \fILLgen\fR checks that the
+ +filename constructed this way in fact represents a previous
+ +version, or does not exist already.
+ +.PP
+ +The user must provide some environment to obtain a complete
+ +program.
+ +Routines called \fImain\fR and \fILLmessage\fR must be defined.
+ +Also, a lexical analyzer must be provided.
+ +.PP
+ +The routine \fImain\fR must be defined, as it must be in every
+ +C-program. It should eventually call one of the start symbol
+ +routines.
+ +.PP
+ +The routine \fILLmessage\fR must accept one
+ +parameter, whose value is a token number, zero or -1.
+ +.br
+ +A zero parameter indicates that the current token (the one in
+ +the external variable \fILLsymb\fR) is deleted.
+ +.br
+ +A -1 parameter indicates that the parser expected end of file, but didn't get
+ +it.
+ +The parser will then skip tokens until end of file is detected.
+ +.br
+ +A parameter that is a token number (a positive parameter)
+ +indicates that this
+ +token is to be inserted in front of the token currently in
+ +\fILLsymb\fR.
+ +The user can give the token the proper attributes.
+ +Also, the user must take care that the token currently in
+ +\fILLsymb\fR is again returned by the \fBnext\fR call to the
+ +lexical analyzer, with the proper attributes.
+ +So, the lexical analyzer must have a facility to push back one
+ +token.
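A minimal sketch of an \fILLmessage\fR routine together with a lexical analyzer that can push back one token might look like this. It is an illustration, not the code of any existing parser: the variable LLsymb is defined here only so that the sketch compiles on its own (in a real program the generated parser supplies it), tokens are assumed to be single characters, and set_input and the message texts are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

int LLsymb;                     /* normally supplied by the generated parser */

static const char *src = "";    /* hypothetical input: one char per token */
static int pushed_tok;          /* the token that was pushed back */
static int have_pushed = 0;     /* nonzero if pushed_tok is valid */

void set_input(const char *s)
{
    assert(s != NULL);
    src = s;
    have_pushed = 0;
}

/* Lexical analyzer: a pushed-back token is returned first; 0 at end of input. */
int yylex(void)
{
    if (have_pushed) {
        have_pushed = 0;
        return pushed_tok;
    }
    return *src != '\0' ? (unsigned char)*src++ : 0;
}

void LLmessage(int tk)
{
    if (tk == 0) {
        fprintf(stderr, "token %d deleted\n", LLsymb);
    } else if (tk < 0) {
        fprintf(stderr, "end of file expected\n");
    } else {
        fprintf(stderr, "token %d inserted before token %d\n", tk, LLsymb);
        pushed_tok = LLsymb;    /* the next yylex() call returns it again */
        have_pushed = 1;
    }
}
```

A single push-back slot is enough here because at most one token, the one currently in LLsymb, ever has to be returned a second time.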
+ +.PP
+ +The user may also supply his own error recovery routines, or handle
+ +errors differently. For this purpose, the name of a routine to be called
+ +when an error occurs may be declared using the keyword \fB%onerror\fR.
+ +This routine takes two parameters.
+ +The first one is either the token number of the
+ +token expected, or 0. In the latter case, the error occurred at a choice.
+ +In both cases, the routine must ensure that the next call to the lexical
+ +analyzer returns the token that replaces the current one. Of course,
+ +that could well be the current one, in which case
+ +.I LLparse
+ +recovers from the error.
+ +The second parameter contains a list of tokens that are not skipped at the
+ +error point. The list is in the form of a null-terminated array of integers,
+ +whose address is passed.
+ +.PP
+ +The user must supply a lexical analyzer to read the input stream and
+ +break it up into tokens, which are passed to
+ +.I LLparse .
+ +It should be an integer valued function, returning the token number.
+ +The name of this function can be declared using the
+ +"\fB%lexical\fR" keyword.
+ +This keyword can be used wherever a declaration is legal and may appear
+ +only once in the grammar specification, e.g.:
+ +.sp 1
+ +.nf
+ +.ft CW
+ +%lexical scanner ;
+ +.ft R
+ +.fi
+ +.sp 1
+ +declares "scanner" as the name of the lexical analyzer.
+ +The default name for the lexical analyzer is "yylex".
+ +The reason for this funny name is that a useful tool for constructing
+ +lexical analyzers is the
+ +.I Lex
+ +program,
+ +.[
+ +lex
+ +.]
+ +which generates a routine of that name.
+ +.PP
+ +The token numbers are chosen by \fILLgen\fR.
+ +The token number for a literal
+ +is the numerical value of the character in the local character set.
+ +If the tokens have a name,
+ +the "#\ define" mechanism of C is used to give them a value and
+ +to allow the lexical analyzer to return their token numbers symbolically.
+ +These "#\ define"s are collected in the file \fILpars.h\fR which
+ +can be "#\ include"d in any file that needs the token-names.
+ +The maximum token number chosen is defined in the macro \fILL_MAXTOKNO\fP.
+ +.PP
+ +The lexical analyzer must signal the end
+ +of input to \fILLparse\fR
+ +by returning a number less than or equal to zero.
+ +.NH
+ +Programs with more than one parser
+ +.PP
+ +\fILLgen\fR offers a simple facility for having more than one parser in
+ +a program: in this case, the user can change the names of global procedures,
+ +variables, etc, by giving a different prefix, like this:
+ +.sp 1
+ +.nf
+ +.ft CW
+ +%prefix XX ;
+ +.ft R
+ +.fi
+ +.sp 1
+ +The effect of this is that all global names start with XX instead of LL, for
+ +the parser that has this prefix. This holds for the variable \fILLsymb\fP,
+ +which now is called \fIXXsymb\fP, for the routine \fILLmessage\fP,
+ +which must now be called \fIXXmessage\fP, and for the macro \fILL_MAXTOKNO\fP,
+ +which is now called \fIXX_MAXTOKNO\fP.
+ +\fILL.output\fP is now \fIXX.output\fP, and \fILpars.c\fP and \fILpars.h\fP
+ +are now called \fIXXpars.c\fP and \fIXXpars.h\fP.
+ +.bp
+ +.SH
+ +References
+ +.[
+ +$LIST$
+ +.]
+ +.bp
+ +.SH
+ +Appendix A : LLgen Input Syntax
+ +.PP
+ +This appendix has a description of the \fILLgen\fR input syntax,
+ +as a \fILLgen\fR specification. As a matter of fact, the current
+ +version of \fILLgen\fR is written with \fILLgen\fR.
+ +.nf
+ +.ft CW
+ +.sp 2
+ +/*
+ + * First the declarations of the terminals
+ + * The order is not important
+ + */
+ +
+ +%token  IDENTIFIER;            /* terminal or nonterminal name */
+ +%token  NUMBER;
+ +%token  LITERAL;
+ +
+ +/*
+ + * Reserved words
+ + */
+ +
+ +%token  TOKEN;         /* %token */
+ +%token  START;         /* %start */
+ +%token  PERSISTENT;    /* %persistent */
+ +%token  IF;            /* %if */
+ +%token  WHILE;         /* %while */
+ +%token  AVOID;         /* %avoid */
+ +%token  PREFER;        /* %prefer */
+ +%token  DEFAULT;       /* %default */
+ +%token  LEXICAL;       /* %lexical */
+ +%token  PREFIX;        /* %prefix */
+ +%token  ONERROR;       /* %onerror */
+ +%token  FIRST;         /* %first */
+ +
+ +/*
+ + * Declare LLparse to be a C-routine that recognizes "specification"
+ + */
+ +
+ +%start  LLparse, specification;
+ +
+ +specification
+ +        : declaration*
+ +        ;
+ +
+ +declaration
+ +        : START
+ +                IDENTIFIER ',' IDENTIFIER
+ +          ';'
+ +        | '{'
+ +                /* Read C-declaration here */
+ +          '}'
+ +        | TOKEN
+ +                IDENTIFIER
+ +                [ ',' IDENTIFIER ]*
+ +          ';'
+ +        | FIRST
+ +                IDENTIFIER ',' IDENTIFIER
+ +          ';'
+ +        | LEXICAL
+ +                IDENTIFIER
+ +          ';'
+ +        | PREFIX
+ +                IDENTIFIER
+ +          ';'
+ +        | ONERROR
+ +                IDENTIFIER
+ +          ';'
+ +        | rule
+ +        ;
+ +
+ +rule    : IDENTIFIER parameters? ldecl?
+ +                ':' productions
+ +          ';'
+ +        ;
+ +
+ +ldecl   : '{'
+ +                /* Read C-declaration here */
+ +          '}'
+ +        ;
+ +
+ +productions
+ +        : simpleproduction
+ +          [ '|' simpleproduction ]*
+ +        ;
+ +
+ +simpleproduction
+ +        : DEFAULT?
+ +        [ IF '(' /* Read C-expression here */ ')'
+ +          | PREFER
+ +          | AVOID
+ +          ]?
+ +          [ element repeats ]*
+ +        ;
+ +
+ +element : '{'
+ +                /* Read action here */
+ +          '}'
+ +        | '[' [ WHILE '(' /* Read C-expression here */ ')' ]?
+ +                PERSISTENT?
+ +                productions
+ +          ']'
+ +        | LITERAL
+ +        | IDENTIFIER parameters?
+ +        ;
+ +
+ +parameters
+ +        : '(' /* Read C-parameters here */ ')'
+ +        ;
+ +
+ +repeats : /* empty */
+ +        | [ '*' | '+' ] NUMBER?
+ +        | NUMBER
+ +        | '?'
+ +        ;
+ +
+ +.fi
+ +.ft R
+ +.bp
+ +.SH
+ +Appendix B : An example
+ +.PP
+ +This example gives the complete \fILLgen\fR specification of a simple
+ +desk calculator. It has 26 registers, labeled "a" through "z",
+ +and accepts arithmetic expressions made up of the C operators
+ ++, -, *, /, %, &, and |, with their usual priorities.
+ +The value of the expression is
+ +printed. As in C, an integer that begins with 0 is assumed to
+ +be octal; otherwise it is assumed to be decimal.
+ +.PP
+ +Although the example is short and not very complicated, it
+ +demonstrates the use of if and while conditions. In
+ +the example they are in fact used to reduce the number of
+ +nonterminals, and to reduce the overhead due to the recursion
+ +that would be involved in parsing an expression with an
+ +ordinary recursive descent parser. In an ordinary LL(1)
+ +grammar there would be one nonterminal for each operator
+ +priority. The example shows how we can do it all with one
+ +nonterminal, no matter how many priority levels there are.
+ +.sp 1
+ +.nf
+ +.ft CW
+ +{
+ +#include <stdio.h>
+ +#include <ctype.h>
+ +#define MAXPRIO      5
+ +#define prio(op)     (ptab[op])
+ +
+ +struct token {
+ +        int     t_tokno;        /* token number */
+ +        int     t_tval;         /* Its attribute */
+ +} stok = { 0,0 }, tok;
+ +
+ +int     nerrors = 0;
+ +int     regs[26];               /* Space for the registers */
+ +int     ptab[128];              /* Attribute table */
+ +
+ +struct token
+ +nexttok() {  /* Read next token and return it */
+ +        register        c;
+ +        struct token    new;
+ +
+ +        while ((c = getchar()) == ' ' || c == '\et') { /* nothing */ }
+ +        if (isdigit(c)) new.t_tokno = DIGIT;
+ +        else if (islower(c)) new.t_tokno = IDENT;
+ +        else new.t_tokno = c;
+ +        /* ptab has 128 entries; anything else (EOF included) gets 0 */
+ +        new.t_tval = (c >= 0 && c < 128) ? ptab[c] : 0;
+ +        return new;
+ +}   }
+ +
+ +%token  DIGIT, IDENT;
+ +%start  parse, list;
+ +
+ +list    : stat* ;
+ +
+ +stat    {       int     ident, val; } :
+ +        %if (stok = nexttok(),
+ +             stok.t_tokno == '=')
+ +                    /* The conflict is resolved by looking one further
+ +                     * token ahead. The grammar is LL(2)
+ +                     */
+ +          IDENT
+ +                                {       ident = tok.t_tval; }
+ +          '=' expr(1,&val) '\en'
+ +                                {       if (!nerrors) regs[ident] = val; }
+ +        | expr(1,&val) '\en'
+ +                                {       if (!nerrors) printf("%d\en",val); }
+ +        | '\en'
+ +        ;
+ +
++expr(int level; int *val;) {       int     expr; } :
+ +          factor(val)
+ +          [ %while (prio(tok.t_tokno) >= level)
+ +                    /* Swallow operators as long as their priority is
+ +                     * larger than or equal to the level of this invocation
+ +                     */
+ +              '+' expr(prio('+')+1,&expr)
+ +                                {       *val += expr; }
+ +                    /* This states that '+' groups left to right. If it
+ +                     * should group right to left, the rule should read:
+ +                     * '+' expr(prio('+'),&expr)
+ +                     */
+ +            | '-' expr(prio('-')+1,&expr)
+ +                                {       *val -= expr; }
+ +            | '*' expr(prio('*')+1,&expr)
+ +                                {       *val *= expr; }
+ +            | '/' expr(prio('/')+1,&expr)
+ +                                {       *val /= expr; }
+ +            | '%' expr(prio('%')+1,&expr)
+ +                                {       *val %= expr; }
+ +            | '&' expr(prio('&')+1,&expr)
+ +                                {       *val &= expr; }
+ +            | '|' expr(prio('|')+1,&expr)
+ +                                {       *val |= expr; }
+ +          ]*
+ +                    /* Notice the "*" here. It is important.
+ +                     */
+ +        ;
+ +
+ +factor(int *val;):
+ +            '(' expr(1,val) ')'
+ +          | '-' expr(MAXPRIO+1,val)
+ +                                {       *val = -*val; }
+ +          | number(val)
+ +          | IDENT
+ +                                {       *val = regs[tok.t_tval]; }
+ +        ;
+ +
+ +number(int *val;) {       int base; }
+ +        : DIGIT
+ +                                {       base = (*val=tok.t_tval)==0?8:10; }
+ +          [ DIGIT
+ +                                {       *val = base * *val + tok.t_tval; }
+ +          ]*        ;
+ +
+ +%lexical scanner ;
+ +{
+ +scanner() {
+ +        if (stok.t_tokno) { /* a token has been inserted or read ahead */
+ +                tok = stok;
+ +                stok.t_tokno = 0;
+ +                return tok.t_tokno;
+ +        }
+ +        if (nerrors && tok.t_tokno == '\en') {
+ +                printf("ERROR\en");
+ +                nerrors = 0;
+ +        }
+ +        tok = nexttok();
+ +        return tok.t_tokno;
+ +}
+ +
+ +LLmessage(insertedtok) {
+ +        nerrors++;
+ +        if (insertedtok) { /* token inserted, save old token */
+ +                stok = tok;
+ +                tok.t_tval = 0;
+ +                if (insertedtok < 128) tok.t_tval = ptab[insertedtok];
+ +        }
+ +}
+ +
+ +main() {
+ +        register *p;
+ +
+ +        for (p = ptab; p < &ptab[128]; p++) *p = 0;
+ +        /* for letters, their attribute is their index in the regs array */
+ +        for (p = &ptab['a']; p <= &ptab['z']; p++) *p = p - &ptab['a'];
+ +        /* for digits, their attribute is their value */
+ +        for (p = &ptab['0']; p <= &ptab['9']; p++) *p = p - &ptab['0'];
+ +        /* for operators, their attribute is their priority */
+ +        ptab['*'] = 4;
+ +        ptab['/'] = 4;
+ +        ptab['%'] = 4;
+ +        ptab['+'] = 3;
+ +        ptab['-'] = 3;
+ +        ptab['&'] = 2;
+ +        ptab['|'] = 1;
+ +        parse();
+ +        exit(nerrors);
+ +}   }
+ +.fi
+ +.ft R
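The %while loop in expr above is a form of what is often called precedence climbing. As a rough illustration only, the same control flow can be written by hand in plain C. The sketch below is not LLgen output: the names src, peek and calc are hypothetical, registers and error recovery are left out, and the input is read from a string rather than from getchar().

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define MAXPRIO 5                 /* as in the LLgen example */

static const char *src;           /* rest of the input; hypothetical name */

/* Binary-operator priorities, same values as ptab[] in the example. */
static int prio(int op)
{
    switch (op) {
    case '*': case '/': case '%': return 4;
    case '+': case '-':           return 3;
    case '&':                     return 2;
    case '|':                     return 1;
    default:                      return 0;   /* not a binary operator */
    }
}

/* Skip blanks and return the next character without consuming it. */
static int peek(void)
{
    while (*src == ' ' || *src == '\t') src++;
    return (unsigned char)*src;
}

static int expr(int level);

static int factor(void)
{
    int c = peek(), val = 0, base;

    if (c == '(') {
        src++;
        val = expr(1);
        if (peek() == ')') src++;
    } else if (c == '-') {
        src++;
        val = -expr(MAXPRIO + 1);     /* binds tighter than any binary operator */
    } else if (isdigit(c)) {
        base = (c == '0') ? 8 : 10;   /* a leading 0 means octal */
        while (isdigit(c = (unsigned char)*src)) {
            val = base * val + (c - '0');
            src++;
        }
    }
    return val;
}

/* One routine for all priority levels, like the expr rule above. */
static int expr(int level)
{
    int val = factor();
    int op, rhs;

    while (prio(op = peek()) >= level) {
        src++;
        rhs = expr(prio(op) + 1);     /* the +1 makes operators group left to right */
        switch (op) {
        case '+': val += rhs; break;
        case '-': val -= rhs; break;
        case '*': val *= rhs; break;
        case '/': val /= rhs; break;  /* no zero check, as in the example */
        case '%': val %= rhs; break;
        case '&': val &= rhs; break;
        case '|': val |= rhs; break;
        }
    }
    return val;
}

/* Hypothetical entry point: evaluate one expression held in a string. */
int calc(const char *s)
{
    assert(s != NULL);
    src = s;
    return expr(1);
}
```

With these definitions, calc("1+2*3") yields 7 and calc("10-4-3") yields 3, which shows the priorities and the left-to-right grouping. Passing prio(op) instead of prio(op) + 1 in the recursive call would make the operators group right to left, as the comment in the grammar notes.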
+ +.bp
+ +.SH
+ +Appendix C. How to use \fILLgen\fR.
+ +.PP
+ +This appendix demonstrates how \fILLgen\fR can be used in
+ +combination with the \fImake\fR program, to make effective use
+ +of the fact that \fILLgen\fR only changes output files
+ +when necessary. \fIMake\fR uses a "makefile", which
+ +is a file containing dependencies and associated commands.
+ +A dependency usually indicates that some files depend on other
+ +files. When a file depends on another file and is older than
+ +that other file, the commands associated with the dependency
+ +are executed.
+ +.PP
+ +So, \fImake\fR seems just the program that we always wanted.
+ +However, it
+ +is not very good at handling programs that generate more than
+ +one file.
+ +As usual, there is a way around this problem.
+ +A sample makefile follows:
+ +.sp 1
+ +.ft CW
+ +.nf
+ +# The grammar consists of the files decl.g, stat.g and expr.g.
+ +# The ".o"-files are the result of a C-compilation.
+ +
+ +GFILES = decl.g stat.g expr.g
+ +OFILES = decl.o stat.o expr.o Lpars.o
+ +LLOPT =
+ +
+ +# As make doesn't handle programs that generate more than one
+ +# file well, we just don't tell make about it.
+ +# We just create a dummy file, and touch it whenever LLgen is
+ +# executed. This way, the dummy in fact depends on the grammar
+ +# files.
+ +# Then, we execute make again, to do the C-compilations and
+ +# such.
+ +
+ +all:  dummy
+ +        make parser
+ +
+ +dummy:  $(GFILES)
+ +        LLgen $(LLOPT) $(GFILES)
+ +        touch dummy
+ +
+ +parser: $(OFILES)
+ +        $(CC) -o parser $(LDFLAGS) $(OFILES)
+ +
+ +# Some dependencies without actions :
+ +# make already knows what to do about them
+ +
+ +Lpars.o:        Lpars.h
+ +stat.o:         Lpars.h
+ +decl.o:         Lpars.h
+ +expr.o:         Lpars.h
+ +
+ +.fi
+ +.ft R