sphinx/_sources/lr1.rst.txt

   1 The ``lr1`` module
   2 ==================
   3
   4 .. automodule:: ndcode.piyacc.lr1
   5
   6 Design of the module
   7 --------------------
   8
   9 The parser is structured as a list of productions, each of which describes a
  10 consecutive series of symbols that can be "reduced" to a new symbol.
  11
  12 Traditionally an LR1 parser is defined as a collection of head symbols, each
  13 of which has a collection of alternatives, each a string of symbols, e.g.::
  14
  15   expr <- number | binary_expr | '(' expr ')'
  16   binary_expr <- unary_expr | binary_expr binary_op unary_expr
  17   unary_expr <- expr | unary_op unary_expr
  18   binary_op <- '+' | '-'
  19   unary_op <- '-'
  20
  21 Here we take a different approach by defining the LR1 parser as a collection
  22 of productions, each of which reduces to a unique head symbol. Then instead of
  23 referring to a head symbol which has alternatives, at each reference we list
  24 out the alternative head symbols explicitly. For example, based on the above,
  25 we would first have to split each head symbol into several indepedent ones::
  26
  27   expr0 <- num
  28   expr1 <- bin_expr
  29   expr2 <- '(' expr ')'
  30   bin_expr0 <- un_expr
  31   bin_expr1 <- bin_expr bin_op un_expr
  32   un_expr0 <- expr
  33   un_expr1 <- un_op un_expr
  34   bin_op0 <- '+'
  35   bin_op1 <- '-'
  36   un_op0 <- '-'
  37
  38 Then, each symbol would have to be changed into a set of acceptable symbols::
  39
  40   expr0 <- {num}
  41   expr1 <- {bin_expr0, bin_expr1}
  42   expr2 <- {'('} {expr0, expr1, expr2} {')'}
  43   bin_expr0 <- {un_expr0, un_expr1}
  44   bin_expr1 <- {bin_expr0, bin_expr1} {bin_op0, bin_op1} {un_expr0, un_expr1}
  45   un_expr0 <- {expr0, expr1, expr2}
  46   un_expr1 <- {un_op0} {un_expr0, un_expr1}
  47   bin_op0 <- {'+'}
  48   bin_op1 <- {'-'}
  49   un_op0 <- {'-'}
  50
  51 Whilst our representation is equivalent, it is more efficient in several ways:
  52
  53 * It can handle simple character input streams itself without needing a
  54   lexical analysis front end. With the lexical analysis front-end, the parser
  55   receives a stream of lexical tokens. A typical ``lex`` definition for how
  56   to recognize a token ``NUM`` and return it to the parser might look like::
  57
  58     [0-9]+ return NUM;
  59
  60   Whereas, without the front-end, the parser receives a stream of characters,
  61   these might be ASCII characters in the range ``0..0xff``. An equivalent rule
  62   to the above could be written in the traditional parser language as follows::
  63
  64     digit <- '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
  65     num <- digit | num digit
  66
  67   Or in our parser language (using symbol sets rather than alternatives) by::
  68
  69     num0 <- {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'}
  70     num1 <- {num0, num1} {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'}
  71
  72   We find the second way to be more efficient, since the parser now naturally
  73   supports character classes, and repurposes the same mechanism to implement
  74   alternatives. This principle is carried through to the generated automaton,
  75   so that the ``LR1DFA`` recognizes character classes the same way as a regular
  76   ``DFA`` automaton (such as would be generated for the ``lex``-based example).
  77
  78 * It is common to see (non-recursive) head symbols where all alternatives have
  79   length one, and these can be expanded in-place and eliminated, since the
  80   framework has natural support for sets. Furthermore, after splitting each
  81   head symbol into separate productions, some productions might be eliminated::
  82
  83     expr1 <- {bin_expr0, bin_expr1}
  84     expr2 <- {'('} {num, expr1, expr2} {')'}
  85     bin_expr0 <- {un_expr0, un_expr1}
  86     bin_expr1 <- {bin_expr0, bin_expr1} {'+', '-'} {un_expr0, un_expr1}
  87     un_expr0 <- {num, expr1, expr2}
  88     un_expr1 <- {'-'} {un_expr0, un_expr1}
  89
  90 Symbols are mapped to integers. The usual mapping has the following ranges:
  91
  92 * ``0..N_CHARACTERS - 1`` represent single characters, or possibly if a lexical
  93   analyzer front-end is being used, shorthand for particular lexical symbols;
  94 * ``N_CHARACTERS..n_terminals - 1`` represent terminal symbols which are not
  95   single characters, usually meaning a lexical analyzer front-end is used;
  96 * ``n_terminals..n_terminals + len(productions) - 1`` represent nonterminal
  97   symbols, i.e. the head symbols to which the productions can be reduced.
  98
  99 A special symbol called "eof_terminal" is defined, which must be in the range
 100 ``N_CHARACTERS <= eof_terminal < n_terminals``.
 101
 102 If a lexical analyzer front-end is not used, ``eof_terminal`` will be the only
 103 terminal other than the characters, so assuming 8-bit ASCII, the characters
 104 will be ``0..0xff``, EOF will be ``0x100``, and nonterminals will be ``0x101``
 105 and above.
 106
 107 In our scheme, nonterminal symbols do not have alternatives, meaning that each
 108 production reduces to a unique nonterminal. Hence there is a 1-1 mapping
 109 between productions and nonterminal symbols, and the symbol number is obtained
 110 by taking the position of the production in the productions list and adding
 111 the symbol number of the first nonterminal, which we call ``n_terminals``
 112 above. Thus it is not necessary to store what symbol a production reduces to.
 113
 114 Since symbols are integers, symbol sets are sets of integers. We represent a
 115 set of integers using a list of breaks -- see the ``bisect_set`` module.
 116
 117 Then the internal representation of a grammar is quite simple, as it is just a
 118 list of productions, each of which is a list of symbol sets, each of which is
 119 a list of breaks (i.e. starting symbol numbers of included/excluded symbols).
 120 Production number 0 is always the start production, and it must end with EOF.
 121
 122 We keep symbol sets separated into terminals and non-terminals, so each symbol
 123 set is actually two lists, not one. This is convenient for computing lookahead
 124 sets and generating the automaton, since terminals must be treated differently,
 125 so it would be annoying to have to split the list into two at each step. Note
 126 that in this format the non-terminals are not offset by ``n_terminals``, so the
 127 numbers in the non-terminals set are simply the indices into ``productions``.
 128
 129 We allow the user to store a reference data item with each production, and we
 130 also keep a workspace for computing the production's lookahead sets and whether
 131 the production can be empty. This is calculated by scanning the production in a
 132 forward direction, checking for each element of each symbol set the element's
 133 lookahead set and can-be-empty status. We save the intermediate results of the
 134 scan, i.e. the lookahead set and nullability of each prefix of each production.
 135
 136 Additionally, in a separate data structure we keep a map from nonterminals to
 137 associativity and precedence, for resolving shift/reduce conflicts. For
 138 compatibility, the algorithm is deliberately identical to YACC. The map is kept
 139 in a similar way to the sets discussed earlier -- see the ``bisect_set`` module
 140 for how we keep sets of integers. For maps from integers to arbitary objects
 141 we keep two lists: a list of ``breaks``, and a list of ``objects``. Then symbol
 142 ``symbol`` maps to ``objects[i]`` for the smallest ``i`` such that ``symbol <
 143 breaks[i]``, or if none of these holds, it maps to ``objects[len(breaks)]``.