doc/LLgen/LLgen.n

   1 .\"     $Id: LLgen.n,v 1.16 1994/12/20 12:40:21 ceriel Exp $
   2 .\"     Run this paper off with
   3 .\"     refer [options] -p LLgen.refs LLgen.doc | [n]eqn | tbl | (nt)roff -ms
   4 .if '\*(>.'' \{\
   5 .       if '\*(<.'' \{\
   6 .               if n .ds >. .
   7 .               if n .ds >, ,
   8 .               if t .ds <. .
   9 .               if t .ds <, ,\
  10 \}\
  11 \}
  12 .cs 5 22u
  13 .ND
  14 .EQ
  15 delim @@
  16 .EN
  17 .TL
  18 LLgen, an extended LL(1) parser generator
  19 .AU
  20 Ceriel J. H. Jacobs
  21 .AI
  22 Dept. of Mathematics and Computer Science
  23 Vrije Universiteit
  24 Amsterdam, The Netherlands
  25 .AB
  26 \fILLgen\fR provides a
  27 tool for generating an efficient recursive descent parser
  28 with no backtrack from
  29 an Extended Context Free syntax.
  30 The \fILLgen\fR
  31 user specifies the syntax, together with code
  32 describing actions associated with the parsing process.
  33 \fILLgen\fR
  34 turns this specification into a number of subroutines that handle the
  35 parsing process.
  36 .PP
  37 The grammar may be ambiguous.
  38 \fILLgen\fR contains both static and dynamic facilities
  39 to resolve these ambiguities.
  40 .PP
  41 The specification can be split into several files, for each of
  42 which \fILLgen\fR generates an output file containing the
  43 corresponding part of the parser.
  44 Furthermore, only output files that differ from their previous
  45 version are updated.
  46 Other output files are not affected in any
  47 way.
  48 This allows the user to recompile only those output files that have
  49 changed.
  50 .PP
  51 The subroutine produced by \fILLgen\fR calls a user supplied routine
  52 that must return the next token. This way, the input to the
  53 parser can be split into single characters or higher level
  54 tokens.
  55 .PP
  56 An error recovery mechanism is generated almost completely
  57 automatically.
  58 It is based on so called \fBdefault choices\fR, which are
  59 implicitly or explicitly specified by the user.
  60 .PP
  61 \fILLgen\fR has succesfully been used to create recognizers for
  62 Pascal, C, and Modula-2.
  63 .AE
  64 .NH
  65 Introduction
  66 .PP
  67 \fILLgen\fR
  68 provides a tool for generating an efficient recursive
  69 descent parser with no backtrack from an Extended Context Free
  70 syntax.
  71 A parser generated by
  72 \fILLgen\fR
  73 will be called
  74 \fILLparse\fR
  75 for the rest of this document.
  76 It is assumed that the reader has some knowledge of LL(1) grammars and
  77 recursive descent parsers.
  78 For a survey on the subject, see reference
  79 .[ (
  80 griffiths
  81 .]).
  82 .PP
  83 Extended LL(1) parsers are an extension of LL(1) parsers. They are
  84 derived from an Extended Context-Free (ECF) syntax instead of a Context-Free
  85 (CF) syntax.
  86 ECF syntax is described in section 2.
  87 Section 3 provides an outline of a
  88 specification as accepted by
  89 \fILLgen\fR and also discusses the lexical conventions of
  90 grammar specification files.
  91 Section 4 provides a description of the way the
  92 \fILLgen\fR
  93 user can associate
  94 actions with the syntax. These actions must be written in the programming
  95 language C,
  96 .[
  97 kernighan ritchie
  98 .]
  99 which also is the target language of \fILLgen\fR.
 100 The error recovery technique is discussed in section 5.
 101 This section also discusses what the user can do about it.
 102 Section 6 discusses
 103 the facilities \fILLgen\fR offers
 104 to resolve ambiguities and conflicts.
 105 \fILLgen\fR offers facilities to resolve them both at parser
 106 generation time and during the execution of \fILLparse\fR.
 107 Section 7 discusses the
 108 \fILLgen\fR
 109 working environment.
 110 It also discusses the lexical analyzer that must be supplied by the
 111 user.
 112 This lexical analyzer must read the input stream and break it
 113 up into basic input items, called \fBtokens\fR for the rest of
 114 this document.
 115 Appendix A gives a summary of the
 116 \fILLgen\fR
 117 input syntax.
 118 Appendix B gives an example.
 119 It is very instructive to compare this example with the one
 120 given in reference
 121 .[ (
 122 yacc
 123 .]).
 124 It demonstrates the struggle \fILLparse\fR and other LL(1)
 125 parsers have with expressions.
 126 Appendix C gives an example of the \fILLgen\fR features
 127 allowing the user to recompile only those output files that
 128 have changed, using the \fImake\fR program.
 129 .[
 130 make
 131 .]
 132 .NH
 133 The Extended Context-Free Syntax
 134 .PP
 135 The extensions of an ECF syntax with respect to an ordinary CF syntax are:
 136 .IP 1. 10
 137 An ECF syntax contains the repetition operator: "N" (N represents a positive
 138 integer).
 139 .IP 2. 10
 140 An ECF syntax contains the closure set operator without and with
 141 upperbound: "*" and "*N".
 142 .IP 3. 10
 143 An ECF syntax contains the positive closure set operator without and with
 144 upperbound: "+" and "+N".
 145 .IP 4. 10
 146 An ECF syntax contains the optional operator: "?", which is a
 147 shorthand for "*1".
 148 .IP 5. 10
 149 An ECF syntax contains parentheses "[" and "]" which can be
 150 used for grouping.
 151 .PP
 152 We can describe the syntax of an ECF syntax with an ECF syntax :
 153 .DS
 154 .ft CW
 155 grammar         : rule +
 156                 ;
 157 .ft R
 158 .DE
 159 This grammar rule states that a grammar consists of one or more
 160 rules.
 161 .DS
 162 .ft CW
 163 rule            : nonterminal ':' productionrule ';'
 164                 ;
 165 .ft R
 166 .DE
 167 A rule consists of a left hand side, the nonterminal,
 168 followed by ":",
 169 the \fBproduce symbol\fR, followed by a production rule, followed by a
 170 ";", in\%di\%ca\%ting the end of the rule.
 171 .DS
 172 .ft CW
 173 productionrule  : production [ '|' production ]*
 174                 ;
 175 .ft R
 176 .DE
 177 A production rule consists of one or
 178 more alternative productions separated by "|". This symbol is called the
 179 \fBalternation symbol\fR.
 180 .DS
 181 .ft CW
 182 production      : term *
 183                 ;
 184 .ft R
 185 .DE
 186 A production consists of a possibly empty list of terms.
 187 So, empty productions are allowed.
 188 .DS
 189 .ft CW
 190 term            : element repeats
 191                 ;
 192 .ft R
 193 .DE
 194 A term is an element, possibly with a repeat specification.
 195 .DS
 196 .ft CW
 197 element         : LITERAL
 198                 | IDENTIFIER
 199                 | '[' productionrule ']'
 200                 ;
 201 .ft R
 202 .DE
 203 An element can be a LITERAL, which basically is a single character
 204 between apostrophes, it can be an IDENTIFIER, which is either a
 205 nonterminal or a token, and it can be a production rule
 206 between square parentheses.
 207 .DS
 208 .ft CW
 209 repeats         : '?'
 210                 | [ '*' | '+' ] NUMBER ?
 211                 | NUMBER ?
 212                 ;
 213 .ft R
 214 .DE
 215 These are the repeat specifications discussed above. Notice that
 216 this specification may be empty.
 217 .PP
 218 The class of ECF languages
 219 is identical with the class of CF languages. However, in many
 220 cases recursive definitions of language features can now be
 221 replaced by iterative ones. This tends to reduce the number of
 222 nonterminals and gives rise to very efficient recursive descent
 223 parsers.
 224 .NH
 225 Grammar Specifications
 226 .PP
 227 The major part of a
 228 \fILLgen\fR
 229 grammar specification consists of an
 230 ECF syntax specification.
 231 Names in this syntax specification refer to either tokens or nonterminal
 232 symbols.
 233 \fILLgen\fR
 234 requires token names to be declared as such. This way it
 235 can be avoided that a typing error in a nonterminal name causes it to
 236 be accepted as a token name. The token declarations will be
 237 discussed later.
 238 A name will be regarded as a nonterminal symbol, unless it is declared
 239 as a token name.
 240 If there is no production rule for a nonterminal symbol, \fILLgen\fR
 241 will complain.
 242 .PP
 243 A grammar specification may also include some C routines,
 244 for instance the lexical analyzer and an error reporting
 245 routine.
 246 Thus, a grammar specification file can contain declarations,
 247 grammar rules and C-code.
 248 .PP
 249 Blanks, tabs and newlines are ignored, but may not appear in names or
 250 keywords.
 251 Comments may appear wherever a name is legal (which is almost
 252 everywhere).
 253 They are enclosed in
 254 /* ... */, as in C. Comments do not nest.
 255 .PP
 256 Names may be of arbitrary length, and can be made up of letters, underscore
 257 "\_" and non-initial digits. Upper and lower case letters are distinct.
 258 Only the first 50 characters are significant.
 259 Notice however, that the names for the tokens will be used by the
 260 C-preprocessor.
 261 The number of significant characters therefore depends on the
 262 underlying C-implementation.
 263 A safe rule is to make the identifiers distinct in the first six
 264 characters, case ignored.
 265 .PP
 266 There are two kinds of tokens:
 267 those that are declared and are denoted by a name,
 268 and literals.
 269 .PP
 270 A literal consists of a character enclosed in apostrophes "'".
 271 The "\e" is an escape character within literals. The following escapes
 272 are recognized :
 273 .TS
 274 center;
 275 l l.
 276 \&'\en' newline
 277 \&'\er' return
 278 \&'\e'' apostrophe "'"
 279 \&'\e\e'        backslash "\e"
 280 \&'\et' tab
 281 \&'\eb' backspace
 282 \&'\ef' form feed
 283 \&'\exxx'       "xxx" in octal
 284 .TE
 285 .PP
 286 Names representing tokens must be declared before they are used.
 287 This can be done using the "\fB%token\fR" keyword,
 288 by writing
 289 .nf
 290 .ft CW
 291 .sp 1
 292 %token  name1, name2, . . . ;
 293 .ft R
 294 .fi
 295 .PP
 296 \fILLparse\fR is designed to recognize special nonterminal
 297 symbols called \fBstart symbols\fR.
 298 \fILLgen\fR allows for more than one start symbol.
 299 Thus, grammars with more than one entry point are accepted.
 300 The start symbols must be declared explicitly using the
 301 "\fB%start\fR" keyword. It can be used whenever a declaration is
 302 legal, f.i.:
 303 .nf
 304 .ft CW
 305 .sp 1
 306 %start LLparse, specification ;
 307 .ft R
 308 .fi
 309 .sp 1
 310 declares "specification" as a start symbol and associates the
 311 identifier "LLparse" with it.
 312 "LLparse" will now be the name of the C-function that must be
 313 called to recognize "specification".
 314 .NH
 315 Actions
 316 .PP
 317 \fILLgen\fR
 318 allows arbitrary insertions of actions within the right hand side
 319 of a production rule in the ECF syntax. An action consists of a number of C
 320 statements, enclosed in the brackets "{" and "}".
 321 .PP
 322 \fILLgen\fR
 323 generates a parsing routine for each rule in the grammar. The actions
 324 supplied by the user are just inserted in the proper place.
 325 There may also be declarations before the statements in the
 326 action, as
 327 the "{" and "}" are copied into the target code along with the
 328 action. The scope of these declarations terminates with the
 329 closing bracket "}" of the action.
 330 .PP
 331 In addition to actions, it is also possible to declare local variables
 332 in the parsing routine, which can then be used in the actions.
 333 Such a declaration consists of a number of C variable declarations,
 334 enclosed in the brackets "{" and "}". It must be placed
 335 right in front of the ":" in the grammar rule.
 336 The scope of these local variables consists of the complete
 337 grammar rule.
 338 .PP
 339 In order to facilitate communication between the actions and
 340 \fILLparse\fR,
 341 the parsing routines can be given C-like parameters.
 342 Each parameter must be declared separately, and each of these declarations must
 343 end with a semicolon.
 344 For the last parameter, the semicolon is optional.
 345 .PP
 346 So, for example
 347 .nf
 348 .ft CW
 349 .sp 1
 350 expr(int *pval;) { int fact; } :
 351                 /*
 352                  * Rule with one parameter, a pointer to an int.
 353                  * Parameter specifications are ordinary C declarations.
 354                  * One local variable, of type int.
 355                  */
 356         factor (&fact)          { *pval = fact; }
 357                 /*
 358                  * factor is another nonterminal symbol.
 359                  * One actual parameter is supplied.
 360                  * Notice that the parameter passing mechanism is that
 361                  * of C.
 362                  */
 363         [ '+' factor (&fact)    { *pval += fact; } ]*
 364                 /*
 365                  * remember the '*' means zero or more times
 366                  */
 367         ;
 368 .sp 1
 369 .ft R
 370 .fi
 371 is a rule to recognize a number of factors, separated by "+", and
 372 to compute their sum.
 373 .PP
 374 \fILLgen\fR
 375 generates C code, so the parameter passing mechanism is that of
 376 C, as is shown in the example above.
 377 .PP
 378 Actions often manipulate attributes of the token just read.
 379 For instance, when an identifier is read, its name must be
 380 looked up in a symbol table.
 381 Therefore, \fILLgen\fR generates code
 382 such that at a number of places in the grammar rule
 383 it is defined which token has last been read.
 384 After a token, the last token read is this token.
 385 After a "[" or a "|", the last token read is the next token to
 386 be accepted by \fILLparse\fR.
 387 At all other places, it is undefined which token has last been
 388 read.
 389 The last token read is available in the global integer variable
 390 \fILLsymb\fR.
 391 .PP
 392 The user may also specify C-code wherever a \fILLgen\fR-declaration is
 393 legal.
 394 Again, this code must be enclosed in the brackets "{" and "}".
 395 This way, the user can define global declarations and
 396 C-functions.
 397 To avoid name-conflicts with identifiers generated by
 398 \fILLgen\fR, \fILLparse\fR only uses names beginning with
 399 "LL"; the user should avoid such names.
 400 .NH
 401 Error Recovery
 402 .PP
 403 The error recovery technique used by \fILLgen\fR is a
 404 modification of the one presented in reference
 405 .[ (
 406 automatic construction error correcting
 407 .]).
 408 It is based on \fBdefault choices\fR, which just are
 409 what the word says, default choices at
 410 every point in the grammar where there is a
 411 choice.
 412 Thus, in an alternation, one of the productions is marked as a
 413 default choice, and in a term with a non-fixed repetition
 414 specification there will also be a default choice (between
 415 doing the term (once more) and continuing with the rest of the
 416 production in which the term appears).
 417 .PP
 418 When \fILLparse\fR detects an error after having parsed the
 419 string @s@, the default choices enable it to compute one
 420 syntactically correct continuation,
 421 consisting of the tokens @t sub 1~...~t sub n@,
 422 such that @s~t sub 1~...~t sub n@ is a string of tokens that
 423 is a member of the language defined by the grammar.
 424 Notice, that the computation of this continuation must
 425 terminate, which implies that the default choices may not
 426 invoke recursive rules.
 427 .PP
 428 At each point in this continuation, a certain number of other
 429 tokens could also be syntactically correct, f.i. the token
 430 @t@ is syntactically correct at point @t sub i@ in this
 431 continuation, if the string @s~t sub 1~...~t sub i~t~s sub 1@
 432 is a string of the language defined by the grammar for some
 433 string @s sub 1@ and i >= 0.
 434 .PP
 435 The set @T@
 436 containing all these tokens (including @t sub 1 ,~...,~t sub n@) is computed.
 437 Next, \fILLparse\fR discards zero
 438 or more tokens from its input, until a token
 439 @t@ \(mo @T@ is found.
 440 The error is then corrected by inserting i (i >= 0) tokens
 441 @t sub 1~...~t sub i@, such that the string
 442 @s~t sub 1~...~t sub i~t~s sub 1@ is a string of the language
 443 defined by the grammar, for some @s sub 1@.
 444 Then, normal parsing is resumed.
 445 .PP
 446 The above is difficult to implement in a recursive decent
 447 parser, and is not the way \fILLparse\fR does it, but the
 448 effect is the same. In fact, \fILLparse\fR maintains a list
 449 of tokens that may not be discarded, which is adjusted as
 450 \fILLparse\fR proceeds. This list is just a representation
 451 of the set @T@ mentioned
 452 above. When an error occurs, \fILLparse\fR discards tokens until
 453 a token @t@ that is a member of this list is found.
 454 Then, it continues parsing, following the default choices,
 455 inserting tokens along the way, until this token @t@ is legal.
 456 The selection of
 457 the default choices must guarantee that this will always
 458 happen.
 459 .PP
 460 The default choices are explicitly or implicitly
 461 specified by the user.
 462 By default, the default choice in an alternation is the
 463 alternative with the shortest possible terminal production.
 464 The user can select one of the other productions in the
 465 alternation as the default choice by putting the keyword
 466 "\fB%default\fR" in front of it.
 467 .PP
 468 By default, for terms with a repetition count containing "*" or
 469 "?" the default choice is to continue with the rest of the rule
 470 in which the term appears, and
 471 .sp 1
 472 .ft CW
 473 .nf
 474                 term+
 475 .fi
 476 .ft R
 477 .sp 1
 478 is treated as
 479 .sp 1
 480 .nf
 481 .ft CW
 482                 term term* .
 483 .ft R
 484 .fi
 485 .PP
 486 It is also clear, that it can never be the default choice to do
 487 the term (once more), because this could cause the parser to
 488 loop, inserting tokens forever.
 489 However, when the user does not want the parser to skip
 490 tokens that would not have been skipped if the term
 491 would have been the default choice,
 492 the skipping of such a term can be prevented by
 493 using the keyword "\fB%persistent\fR".
 494 For instance, the rule
 495 .sp 1
 496 .ft CW
 497 .nf
 498 commandlist : command* ;
 499 .fi
 500 .ft R
 501 .sp 1
 502 could be changed to
 503 .sp 1
 504 .ft CW
 505 .nf
 506 commandlist : [ %persistent command ]* ;
 507 .fi
 508 .ft R
 509 .sp 1
 510 The effects of this in case of a syntax error are twofold:
 511 The set @T@ mentioned above will be extended as if "command" were
 512 in the default production, so that fewer tokens will be
 513 skipped.
 514 Also, if the first token that is not skipped is a member of the
 515 subset of @T@ arising from the grammar rule for "command",
 516 \fILLparse\fR will enter that rule.
 517 So, in fact the default choice
 518 is determined dynamically (by \fILLparse\fR).
 519 Again, \fILLgen\fR checks (statically)
 520 that \fILLparse\fR will always terminate, and if not,
 521 \fILLgen\fR will complain.
 522 .PP
 523 An important property of this error recovery method is that,
 524 once a rule is started, it will be finished.
 525 This means that all actions in the rule will be executed
 526 normally, so that the user can be sure that there will be no
 527 inconsistencies in his data structures because of syntax
 528 errors.
 529 Also, as the method is in fact error correcting, the
 530 actions in a rule only have to deal with syntactically correct
 531 input.
 532 .NH
 533 Ambiguities and conflicts
 534 .PP
 535 As \fILLgen\fR generates a recursive descent parser with no backtrack,
 536 it must at all times be able to determine what to do,
 537 based on the current input symbol.
 538 Unfortunately, this cannot be done for all grammars.
 539 Two kinds of conflicts can arise :
 540 .IP 1) 10
 541 the grammar rule is of the form "production1 | production2",
 542 and \fILLparse\fR cannot decide which production to chose.
 543 This we call an \fBalternation conflict\fR.
 544 .IP 2) 10
 545 the grammar rule is of the form "[ productionrule ]...",
 546 where ... specifies a non-fixed repetition count,
 547 and \fILLparse\fR cannot decide whether to
 548 choose "productionrule" once more, or to continue.
 549 This we call a \fBrepetition conflict\fR.
 550 .PP
 551 There can be several causes for conflicts: the grammar may be
 552 ambiguous, or the grammar may require a more complex parser
 553 than \fILLgen\fR can construct.
 554 The conflicts can be examined by inspecting the verbose
 555 (-\fBv\fR) option output file.
 556 The conflicts can be resolved by rewriting the grammar
 557 or by using \fBconflict resolvers\fR.
 558 The mechanism described here is based on the attributed parsing
 559 of reference
 560 .[ (
 561 milton
 562 .]).
 563 .PP
 564 An alternation conflict can be resolved by putting an \fBif condition\fR
 565 in front of the first conflicting production.
 566 It consists of a "\fB%if\fR" followed by a
 567 C-expression between parentheses.
 568 \fILLparse\fR will then evaluate this expression whenever a
 569 token is met at this point on which there is a conflict, so
 570 the conflict will be resolved dynamically.
 571 If the expression evaluates to
 572 non-zero, the first conflicting production is chosen,
 573 otherwise one of the remaining ones is chosen.
 574 .PP
 575 An alternation conflict can also be resolved using the keywords
 576 "\fB%prefer\fR" or "\fB%avoid\fR". "\fB%prefer\fR"
 577 is equivalent in behaviour to
 578 "\fB%if\fR (1)". "\fB%avoid\fR" is equivalent to "\fB%if\fR (0)".
 579 In these cases however, "\fB%prefer\fR" and "\fB%avoid\fR" should be used,
 580 as they resolve the conflict statically and thus
 581 give rise to better C-code.
 582 .PP
 583 A repetition conflict can be resolved by putting a \fBwhile condition\fR
 584 right after the opening parentheses. This while condition
 585 consists of a "\fB%while\fR" followed by a C-expression between
 586 parentheses. Again, \fILLparse\fR will then
 587 evaluate this expression whenever a token is met
 588 at this point on which there is a conflict.
 589 If the expression evaluates to non-zero, the
 590 repeating part is chosen, otherwise the parser continues with
 591 the rest of the rule.
 592 Appendix B will give an example of these features.
 593 .PP
 594 A useful aid in writing conflict resolvers is the "\fB%first\fR" keyword.
 595 It is used to declare a C-macro that forms an expression
 596 returning 1 if the parameter supplied can start a specified
 597 nonterminal, f.i.:
 598 .sp 1
 599 .nf
 600 .ft CW
 601 %first fmac, nonterm ;
 602 .ft R
 603 .sp 1
 604 .fi
 605 declares "fmac" as a macro with one parameter, whose value
 606 is a token number. If the parameter
 607 X can start the nonterminal "nonterm", "fmac(X)" is true,
 608 otherwise it is false.
 609 .NH
 610 The LLgen working environment
 611 .PP
 612 \fILLgen\fR generates a number of files: one for each input
 613 file, and two other files: \fILpars.c\fR and \fILpars.h\fR.
 614 \fILpars.h\fR contains "#-define"s for the tokennames.
 615 \fILpars.c\fR contains the error recovery routines and tables.
 616 Only those output files that differ from their previous version
 617 are updated. See appendix C for a possible application of this
 618 feature.
 619 .PP
 620 The names of the output files are constructed as
 621 follows:
 622 in the input file name, the suffix after the last point is
 623 replaced by a "c". If no point is present in the input file
 624 name, ".c" is appended to it. \fILLgen\fR checks that the
 625 filename constructed this way in fact represents a previous
 626 version, or does not exist already.
 627 .PP
 628 The user must provide some environment to obtain a complete
 629 program.
 630 Routines called \fImain\fR and \fILLmessage\fR must be defined.
 631 Also, a lexical analyzer must be provided.
 632 .PP
 633 The routine \fImain\fR must be defined, as it must be in every
 634 C-program. It should eventually call one of the startsymbol
 635 routines.
 636 .PP
 637 The routine \fILLmessage\fR must accept one
 638 parameter, whose value is a token number, zero or -1.
 639 .br
 640 A zero parameter indicates that the current token (the one in
 641 the external variable \fILLsymb\fR) is deleted.
 642 .br
 643 A -1 parameter indicates that the parser expected end of file, but didn't get
 644 it.
 645 The parser will then skip tokens until end of file is detected.
 646 .br
 647 A parameter that is a token number (a positive parameter)
 648 indicates that this
 649 token is to be inserted in front of the token currently in
 650 \fILLsymb\fR.
 651 The user can give the token the proper attributes.
 652 Also, the user must take care, that the token currently in
 653 \fILLsymb\fR is again returned by the \fBnext\fR call to the
 654 lexical analyzer, with the proper attributes.
 655 So, the lexical analyzer must have a facility to push back one
 656 token.
 657 .PP
 658 The user may also supply his own error recovery routines, or handle
 659 errors differently. For this purpose, the name of a routine to be called
 660 when an error occurs may be declared using the keyword \fB%onerror\fR.
 661 This routine takes two parameters.
 662 The first one is either the token number of the
 663 token expected, or 0. In the last case, the error occurred at a choice.
 664 In both cases, the routine must ensure that the next call to the lexical
 665 analyser returns the token that replaces the current one. Of course,
 666 that could well be the current one, in which case
 667 .I LLparse
 668 recovers from the error.
 669 The second parameter contains a list of tokens that are not skipped at the
 670 error point. The list is in the form of a null-terminated array of integers,
 671 whose address is passed.
 672 .PP
 673 The user must supply a lexical analyzer to read the input stream and
 674 break it up into tokens, which are passed to
 675 .I LLparse.
 676 It should be an integer valued function, returning the token number.
 677 The name of this function can be declared using the
 678 "\fB%lexical\fR" keyword.
 679 This keyword can be used wherever a declaration is legal and may appear
 680 only once in the grammar specification, f.i.:
 681 .sp 1
 682 .nf
 683 .ft CW
 684 %lexical scanner ;
 685 .ft R
 686 .fi
 687 .sp 1
 688 declares "scanner" as the name of the lexical analyzer.
 689 The default name for the lexical analyzer is "yylex".
 690 The reason for this funny name is that a useful tool for constructing
 691 lexical analyzers is the
 692 .I Lex
 693 program,
 694 .[
 695 lex
 696 .]
 697 which generates a routine of that name.
 698 .PP
 699 The token numbers are chosen by \fILLgen\fR.
 700 The token number for a literal
 701 is the numerical value of the character in the local character set.
 702 If the tokens have a name,
 703 the "#\ define" mechanism of C is used to give them a value and
 704 to allow the lexical analyzer to return their token numbers symbolically.
 705 These "#\ define"s are collected in the file \fILpars.h\fR which
 706 can be "#\ include"d in any file that needs the token-names.
 707 The maximum token number chosen is defined in the macro \fILL_MAXTOKNO\fP.
 708 .PP
 709 The lexical analyzer must signal the end
 710 of input to \fILLparse\fR
 711 by returning a number less than or equal to zero.
 712 .NH
 713 Programs with more than one parser
 714 .PP
 715 \fILLgen\fR offers a simple facility for having more than one parser in
 716 a program: in this case, the user can change the names of global procedures,
 717 variables, etc, by giving a different prefix, like this:
 718 .sp 1
 719 .nf
 720 .ft CW
 721 %prefix XX ;
 722 .ft R
 723 .fi
 724 .sp 1
 725 The effect of this is that all global names start with XX instead of LL, for
 726 the parser that has this prefix. This holds for the variables \fILLsymb\fP,
 727 which now is called \fIXXsymb\fP, for the routine \fILLmessage\fP,
 728 which must now be called \fIXXmessage\fP, and for the macro \fILL_MAXTOKNO\fP,
 729 which is now called \fIXX_MAXTOKNO\fP.
 730 \fILL.output\fP is now \fIXX.output\fP, and \fILpars.c\fP and \fILpars.h\fP
 731 are now called \fIXXpars.c\fP and \fIXXpars.h\fP.
 732 .bp
 733 .SH
 734 References
 735 .[
 736 $LIST$
 737 .]
 738 .bp
 739 .SH
 740 Appendix A : LLgen Input Syntax
 741 .PP
 742 This appendix has a description of the \fILLgen\fR input syntax,
 743 as a \fILLgen\fR specification. As a matter of fact, the current
 744 version of \fILLgen\fR is written with \fILLgen\fR.
 745 .nf
 746 .ft CW
 747 .sp 2
 748 /*
 749  * First the declarations of the terminals
 750  * The order is not important
 751  */
 752
 753 %token  IDENTIFIER;            /* terminal or nonterminal name */
 754 %token  NUMBER;
 755 %token  LITERAL;
 756
 757 /*
 758  * Reserved words
 759  */
 760
 761 %token  TOKEN;         /* %token */
 762 %token  START;         /* %start */
 763 %token  PERSISTENT;    /* %persistent */
 764 %token  IF;            /* %if */
 765 %token  WHILE;         /* %while */
 766 %token  AVOID;         /* %avoid */
 767 %token  PREFER;        /* %prefer */
 768 %token  DEFAULT;       /* %default */
 769 %token  LEXICAL;       /* %lexical */
 770 %token  PREFIX;        /* %prefix */
 771 %token  ONERROR;       /* %onerror */
 772 %token  FIRST;         /* %first */
 773
 774 /*
 775  * Declare LLparse to be a C-routine that recognizes "specification"
 776  */
 777
 778 %start  LLparse, specification;
 779
 780 specification
 781         : declaration*
 782         ;
 783
 784 declaration
 785         : START
 786                 IDENTIFIER ',' IDENTIFIER
 787           ';'
 788         | '{'
 789                 /* Read C-declaration here */
 790           '}'
 791         | TOKEN
 792                 IDENTIFIER
 793                 [ ',' IDENTIFIER ]*
 794           ';'
 795         | FIRST
 796                 IDENTIFIER ',' IDENTIFIER
 797           ';'
 798         | LEXICAL
 799                 IDENTIFIER
 800           ';'
 801         | PREFIX
 802                 IDENTIFIER
 803           ';'
 804         | ONERROR
 805                 IDENTIFIER
 806           ';'
 807         | rule
 808         ;
 809
 810 rule    : IDENTIFIER parameters? ldecl?
 811                 ':' productions
 812           ';'
 813         ;
 814
 815 ldecl   : '{'
 816                 /* Read C-declaration here */
 817           '}'
 818         ;
 819
 820 productions
 821         : simpleproduction
 822           [ '|' simpleproduction ]*
 823         ;
 824
 825 simpleproduction
 826         : DEFAULT?
 827           [ IF '(' /* Read C-expression here */ ')'
 828           | PREFER
 829           | AVOID
 830           ]?
 831           [ element repeats ]*
 832         ;
 833
 834 element : '{'
 835                 /* Read action here */
 836           '}'
 837         | '[' [ WHILE '(' /* Read C-expression here */ ')' ]?
 838                 PERSISTENT?
 839                 productions
 840           ']'
 841         | LITERAL
 842         | IDENTIFIER parameters?
 843         ;
 844
 845 parameters
 846         : '(' /* Read C-parameters here */ ')'
 847         ;
 848
 849 repeats : /* empty */
 850         | [ '*' | '+' ] NUMBER?
 851         | NUMBER
 852         | '?'
 853         ;
 854
 855 .fi
 856 .ft R
 857 .bp
 858 .SH
 859 Appendix B : An example
 860 .PP
 861 This example gives the complete \fILLgen\fR specification of a simple
 862 desk calculator. It has 26 registers, labeled "a" through "z",
 863 and accepts arithmetic expressions made up of the C operators
 864 +, -, *, /, %, &, and |, with their usual priorities.
 865 The value of the expression is
 866 printed. As in C, an integer that begins with 0 is assumed to
 867 be octal; otherwise it is assumed to be decimal.
 868 .PP
 869 Although the example is short and not very complicated, it
 870 demonstrates the use of if and while conditions. In
 871 the example they are in fact used to reduce the number of
 872 nonterminals, and to reduce the overhead due to the recursion
 873 that would be involved in parsing an expression with an
 874 ordinary recursive descent parser. In an ordinary LL(1)
 875 grammar there would be one nonterminal for each operator
 876 priority. The example shows how we can do it all with one
 877 nonterminal, no matter how many priority levels there are.
 878 .sp 1
 879 .nf
 880 .ft CW
 881 {
 882 #include <stdio.h>
 883 #include <ctype.h>
 884 #define MAXPRIO      5
 885 #define prio(op)     (ptab[op])
 886
 887 struct token {
 888         int     t_tokno;        /* token number */
 889         int     t_tval;         /* Its attribute */
 890 } stok = { 0,0 }, tok;
 891
 892 int     nerrors = 0;
 893 int     regs[26];               /* Space for the registers */
 894 int     ptab[128];              /* Attribute table */
 895
 896 struct token
 897 nexttok() {  /* Read next token and return it */
 898         register        c;
 899         struct token    new;
 900
 901         while ((c = getchar()) == ' ' || c == '\et') { /* nothing */ }
 902         if (isdigit(c)) new.t_tokno = DIGIT;
 903         else if (islower(c)) new.t_tokno = IDENT;
 904         else new.t_tokno = c;
 905         if (c >= 0) new.t_tval = ptab[c];
 906         return new;
 907 }   }
 908
 909 %token  DIGIT, IDENT;
 910 %start  parse, list;
 911
 912 list    : stat* ;
 913
 914 stat    {       int     ident, val; } :
 915         %if (stok = nexttok(),
 916              stok.t_tokno == '=')
 917                     /* The conflict is resolved by looking one further
 918                      * token ahead. The grammar is LL(2)
 919                      */
 920           IDENT
 921                                 {       ident = tok.t_tval; }
 922           '=' expr(1,&val) '\en'
 923                                 {       if (!nerrors) regs[ident] = val; }
 924         | expr(1,&val) '\en'
 925                                 {       if (!nerrors) printf("%d\en",val); }
 926         | '\en'
 927         ;
 928
 929 expr(int level; int *val;) {       int     expr; } :
 930           factor(val)
 931           [ %while (prio(tok.t_tokno) >= level)
 932                     /* Swallow operators as long as their priority is
 933                      * larger than or equal to the level of this invocation
 934                      */
 935               '+' expr(prio('+')+1,&expr)
 936                                 {       *val += expr; }
 937                     /* This states that '+' groups left to right. If it
 938                      * should group right to left, the rule should read:
 939                      * '+' expr(prio('+'),&expr)
 940                      */
 941             | '-' expr(prio('-')+1,&expr)
 942                                 {       *val -= expr; }
 943             | '*' expr(prio('*')+1,&expr)
 944                                 {       *val *= expr; }
 945             | '/' expr(prio('/')+1,&expr)
 946                                 {       *val /= expr; }
 947             | '%' expr(prio('%')+1,&expr)
 948                                 {       *val %= expr; }
 949             | '&' expr(prio('&')+1,&expr)
 950                                 {       *val &= expr; }
 951             | '|' expr(prio('|')+1,&expr)
 952                                 {       *val |= expr; }
 953           ]*
 954                     /* Notice the "*" here. It is important.
 955                      */
 956         ;
 957
 958 factor(int *val;):
 959             '(' expr(1,val) ')'
 960           | '-' expr(MAXPRIO+1,val)
 961                                 {       *val = -*val; }
 962           | number(val)
 963           | IDENT
 964                                 {       *val = regs[tok.t_tval]; }
 965         ;
 966
 967 number(int *val;) {       int base; }
 968         : DIGIT
 969                                 {       base = (*val=tok.t_tval)==0?8:10; }
 970           [ DIGIT
 971                                 {       *val = base * *val + tok.t_tval; }
 972           ]*        ;
 973
 974 %lexical scanner ;
 975 {
 976 scanner() {
 977         if (stok.t_tokno) { /* a token has been inserted or read ahead */
 978                 tok = stok;
 979                 stok.t_tokno = 0;
 980                 return tok.t_tokno;
 981         }
 982         if (nerrors && tok.t_tokno == '\en') {
 983                 printf("ERROR\en");
 984                 nerrors = 0;
 985         }
 986         tok = nexttok();
 987         return tok.t_tokno;
 988 }
 989
 990 LLmessage(insertedtok) {
 991         nerrors++;
 992         if (insertedtok) { /* token inserted, save old token */
 993                 stok = tok;
 994                 tok.t_tval = 0;
 995                 if (insertedtok < 128) tok.t_tval = ptab[insertedtok];
 996         }
 997 }
 998
 999 main() {
1000         register *p;
1001
1002         for (p = ptab; p < &ptab[128]; p++) *p = 0;
1003         /* for letters, their attribute is their index in the regs array */
1004         for (p = &ptab['a']; p <= &ptab['z']; p++) *p = p - &ptab['a'];
1005         /* for digits, their attribute is their value */
1006         for (p = &ptab['0']; p <= &ptab['9']; p++) *p = p - &ptab['0'];
1007         /* for operators, their attribute is their priority */
1008         ptab['*'] = 4;
1009         ptab['/'] = 4;
1010         ptab['%'] = 4;
1011         ptab['+'] = 3;
1012         ptab['-'] = 3;
1013         ptab['&'] = 2;
1014         ptab['|'] = 1;
1015         parse();
1016         exit(nerrors);
1017 }   }
1018 .fi
1019 .ft R
1020 .bp
1021 .SH
1022 Appendix C. How to use \fILLgen\fR.
1023 .PP
1024 This appendix demonstrates how \fILLgen\fR can be used in
1025 combination with the \fImake\fR program, to make effective use
1026 of the \fILLgen\fR-feature that it only changes output files
1027 when neccessary. \fIMake\fR uses a "makefile", which
1028 is a file containing dependencies and associated commands.
1029 A dependency usually indicates that some files depend on other
1030 files. When a file depends on another file and is older than
1031 that other file, the commands associated with the dependency
1032 are executed.
1033 .PP
1034 So, \fImake\fR seems just the program that we always wanted.
1035 However, it
1036 is not very good in handling programs that generate more than
1037 one file.
1038 As usual, there is a way around this problem.
1039 A sample makefile follows:
1040 .sp 1
1041 .ft CW
1042 .nf
1043 # The grammar exists of the files decl.g, stat.g and expr.g.
1044 # The ".o"-files are the result of a C-compilation.
1045
1046 GFILES = decl.g stat.g expr.g
1047 OFILES = decl.o stat.o expr.o Lpars.o
1048 LLOPT =
1049
1050 # As make does'nt handle programs that generate more than one
1051 # file well, we just don't tell make about it.
1052 # We just create a dummy file, and touch it whenever LLgen is
1053 # executed. This way, the dummy in fact depends on the grammar
1054 # files.
1055 # Then, we execute make again, to do the C-compilations and
1056 # such.
1057
1058 all:    dummy
1059         make parser
1060
1061 dummy:  $(GFILES)
1062         LLgen $(LLOPT) $(GFILES)
1063         touch dummy
1064
1065 parser: $(OFILES)
1066         $(CC) -o parser $(LDFLAGS) $(OFILES)
1067
1068 # Some dependencies without actions :
1069 # make already knows what to do about them
1070
1071 Lpars.o:        Lpars.h
1072 stat.o:         Lpars.h
1073 decl.o:         Lpars.h
1074 expr.o:         Lpars.h
1075
1076 .fi
1077 .ft R