--- /dev/null
+.. πyacc documentation master file, created by
+ sphinx-quickstart on Sat Feb 15 22:14:34 2020.
+ You can adapt this file completely to your liking, but it should at least
+ contain the root `toctree` directive.
+
+Implementation of the πyacc parser generator
+============================================
+
+.. toctree::
+ :maxdepth: 2
+ :caption: Contents:
+
+The parser generator proceeds through the following stages:
+
+1. Processing the command line
+2. Parsing and analyzing the YACC/Bison-compatible input file
+3. Generating a parser definition and corresponding automaton
+4. Writing C or Python code to implement the requested parser
+
+Step 1 is fairly straightforward Python code using ``getopt.getopt()``.
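+For illustration, a minimal sketch of such ``getopt.getopt()``-based command
+line processing (the option names here are invented for the example and are
+not necessarily πyacc's actual flags):

```python
import getopt

def parse_command_line(argv):
    # illustrative only: '-o'/'--output' and '-p'/'--python' are example
    # options, not piyacc's real interface
    opts, args = getopt.getopt(argv[1:], 'o:p', ['output=', 'python'])
    out_file = None
    python = False
    for opt, arg in opts:
        if opt in ('-o', '--output'):
            out_file = arg
        elif opt in ('-p', '--python'):
            python = True
    return out_file, python, args

# e.g. parse_command_line(['piyacc', '-o', 'y.tab.c', 'grammar.y'])
```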
+
+Step 2 is more complicated: it uses a πlex/πyacc/πtree generated parser which
+automatically creates an abstract syntax tree representing the input, and then
+performs several passes over the tree to extract the needed information, such
+as the lists of terminals and non-terminals, the rules, the alternatives of
+each rule, and the symbols, actions, and other directives of each alternative.
+How it does this is somewhat beyond the scope of this documentation, as we
+would first have to refer you to tutorials on how to use πlex/πyacc/πtree.
+
+Step 3 is the focus of this document. Internally it uses a Python class library
+that we have developed for dealing with LR1 grammars and automata. See the
+classes ``ndcode.piyacc.lr1.LR1`` and ``ndcode.piyacc.lr1dfa.LR1DFA``. These
+classes are general-purpose and there is nothing to stop you from pulling them
+out and using them in your own project (subject to the license, which is
+GPLv2). At the
+moment the classes are only used for generating automata to be used later, but
+there are commented functions in each class that can perform parsing directly.
+
+Step 4 is fairly straightforward Python code using ``str.format()``. Basically
+it reads a skeleton file line-by-line looking for special lines similar to::
+
+ # GENERATE SECTION1
+
+and then expands these into a sequence like::
+
+ # GENERATE SECTION1 BEGIN
+ ...
+ # GENERATE SECTION1 END
+
+where the middle part (denoted ``...``) is printed with a ``str.format()``
+format string containing the C or Python code to be inserted into the skeleton.
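+The expansion loop can be sketched as follows (simplified: here the generated
+code for each section is looked up in a dict, whereas the real generator
+builds it with ``str.format()``):

```python
def expand_skeleton(skeleton_lines, sections):
    # sections maps a name like 'SECTION1' to the generated code for it
    out = []
    for line in skeleton_lines:
        words = line.split()
        if len(words) == 3 and words[:2] == ['#', 'GENERATE']:
            name = words[2]
            out.append('# GENERATE {0:s} BEGIN\n'.format(name))
            out.append(sections[name])
            out.append('# GENERATE {0:s} END\n'.format(name))
        else:
            out.append(line)
    return ''.join(out)
```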
+
+Often this C or Python code will contain repetitive material (e.g. switch
+cases or action handler functions), which is passed into the format string as
+a string argument: a ``str.join()`` of a list of items, each formatted in turn
+with ``str.format()``. These items may themselves contain repetitive material
+embedded in a similar manner, up to several levels deep. To illustrate
+how this works, here's an example where we are given a list of strings in the
+variable ``lines`` and we generate C code to write them out with ``fwrite()``::
+
+ import sys
+
+ sys.stdout.write(
+ '''#include <stdio.h>
+
+ int main(void) {{
+ {0:s}}}
+ '''.format(
+ ''.join(
+ [
+ ' fwrite("{1:s}", 1, {0:d}, stdout);\n'.format(
+ len(line),
+ line
+ )
+ for line in lines
+ ]
+ )
+ )
+ )
+
+
+LR1 parser
+==========
+
+The parser is structured as a list of productions, each of which describes a
+consecutive series of symbols that can be "reduced" to a new symbol.
+
+Traditionally an LR1 parser is defined as a collection of head symbols, each
+of which has a collection of alternatives, each a string of symbols, e.g.::
+
+ expr <- number | binary_expr | '(' expr ')'
+ binary_expr <- unary_expr | binary_expr binary_op unary_expr
+ unary_expr <- expr | unary_op unary_expr
+ binary_op <- '+' | '-'
+ unary_op <- '-'
+
+Here we take a different approach by defining the LR1 parser as a collection
+of productions, each of which reduces to a unique head symbol. Then instead of
+referring to a head symbol which has alternatives, at each reference we list
+out the alternative head symbols explicitly. For example, based on the above,
+we would first have to split each head symbol into several independent ones::
+
+ expr0 <- num
+ expr1 <- bin_expr
+ expr2 <- '(' expr ')'
+ bin_expr0 <- un_expr
+ bin_expr1 <- bin_expr bin_op un_expr
+ un_expr0 <- expr
+ un_expr1 <- un_op un_expr
+ bin_op0 <- '+'
+ bin_op1 <- '-'
+ un_op0 <- '-'
+
+Then, each symbol would have to be changed into a set of acceptable symbols::
+
+ expr0 <- {num}
+ expr1 <- {bin_expr0, bin_expr1}
+ expr2 <- {'('} {expr0, expr1, expr2} {')'}
+ bin_expr0 <- {un_expr0, un_expr1}
+ bin_expr1 <- {bin_expr0, bin_expr1} {bin_op0, bin_op1} {un_expr0, un_expr1}
+ un_expr0 <- {expr0, expr1, expr2}
+ un_expr1 <- {un_op0} {un_expr0, un_expr1}
+ bin_op0 <- {'+'}
+ bin_op1 <- {'-'}
+ un_op0 <- {'-'}
+
+Whilst our representation is equivalent, it is more efficient in several ways:
+
+* It can handle simple character input streams itself without needing a
+ lexical analysis front end. With the lexical analysis front-end, the parser
+ receives a stream of lexical tokens. A typical ``lex`` definition for how
+ to recognize a token ``NUM`` and return it to the parser might look like::
+
+ [0-9]+ return NUM;
+
+ Whereas, without the front-end, the parser receives a stream of characters;
+ these might be ASCII characters in the range ``0..0xff``. An equivalent rule
+ to the above could be written in the traditional parser language as follows::
+
+ digit <- '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
+ num <- digit | num digit
+
+ Or in our parser language (using symbol sets rather than alternatives) by::
+
+ num0 <- {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'}
+ num1 <- {num0, num1} {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'}
+
+ We find the second way to be more efficient, since the parser now naturally
+ supports character classes, and repurposes the same mechanism to implement
+ alternatives. This principle is carried through to the generated automaton,
+ so that the ``LR1DFA`` recognizes character classes the same way as a regular
+ ``DFA`` automaton (such as would be generated for the ``lex``-based example).
+
+* It is common to see (non-recursive) head symbols where all alternatives have
+ length one, and these can be expanded in-place and eliminated, since the
+ framework has natural support for sets. Furthermore, after splitting each
+ head symbol into separate productions, some productions might be eliminated::
+
+ expr1 <- {bin_expr0, bin_expr1}
+ expr2 <- {'('} {num, expr1, expr2} {')'}
+ bin_expr0 <- {un_expr0, un_expr1}
+ bin_expr1 <- {bin_expr0, bin_expr1} {'+', '-'} {un_expr0, un_expr1}
+ un_expr0 <- {num, expr1, expr2}
+ un_expr1 <- {'-'} {un_expr0, un_expr1}
+
+Symbols are mapped to integers. The usual mapping has the following ranges:
+
+* ``0..N_CHARACTERS - 1`` represent single characters, or possibly if a lexical
+ analyzer front-end is being used, shorthand for particular lexical symbols;
+* ``N_CHARACTERS..n_terminals - 1`` represent terminal symbols which are not
+ single characters, usually meaning a lexical analyzer front-end is used;
+* ``n_terminals..n_terminals + len(productions) - 1`` represent nonterminal
+ symbols, i.e. the head symbols to which the productions can be reduced.
+
+A special symbol called ``eof_terminal`` is defined, which must be in the range
+``N_CHARACTERS <= eof_terminal < n_terminals``.
+
+If a lexical analyzer front-end is not used, ``eof_terminal`` will be the only
+terminal other than the characters, so assuming 8-bit ASCII, the characters
+will be ``0..0xff``, EOF will be ``0x100``, and nonterminals will be ``0x101``
+and above.
+
+In our scheme, nonterminal symbols do not have alternatives, meaning that each
+production reduces to a unique nonterminal. Hence there is a 1-1 mapping
+between productions and nonterminal symbols, and the symbol number is obtained
+by taking the position of the production in the productions list and adding
+the symbol number of the first nonterminal, which we call ``n_terminals``
+above. Thus it is not necessary to store what symbol a production reduces to.
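+In other words, the mapping is simple arithmetic; with the default numbering
+(8-bit characters, no lexical analyzer front-end) it can be sketched as:

```python
# default numbering: 8-bit characters, EOF terminal, then nonterminals
N_CHARACTERS = 0x100            # alphabet size (0x11000 for Unicode)
eof_terminal = N_CHARACTERS     # 0x100: the only non-character terminal
n_terminals = N_CHARACTERS + 1  # nonterminal symbols start here (0x101)

def production_to_symbol(i):
    # head symbol of productions[i]; the reduced-to symbol need not be
    # stored with the production
    return n_terminals + i

def symbol_to_production(symbol):
    # inverse mapping, valid only for nonterminal symbol numbers
    assert symbol >= n_terminals
    return symbol - n_terminals
```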
+
+Since symbols are integers, symbol sets are sets of integers. We represent a
+set of integers by a list of intervals, so that we can efficiently represent
+sets of consecutive characters specified in a manner similar to character
+classes in regexes. For example, the character class ``[0-9A-Za-z]`` contains
+the ASCII intervals ``[0x30, 0x3a), [0x41, 0x5b), [0x61, 0x7b)``. Note how the
+upper end of each interval is exclusive. We then just flatten out the interval
+pairs into a list like ``[0x30, 0x3a, 0x41, 0x5b, 0x61, 0x7b]``. In this way
+even positions give the start of included characters, and odd positions give
+the start of excluded characters. Adjacent intervals must be merged, in such a
+way that the list is always strictly increasing; it must also always be even
+in length.
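+A sketch of this break-list encoding, using the ``[0-9A-Za-z]`` example above
+(the ``in_set()`` helper is illustrative only, not part of the library):

```python
import bisect

# [0-9A-Za-z] as a strictly increasing, even-length break list:
# even positions start runs of included symbols, odd positions start
# runs of excluded symbols
digits_letters = [0x30, 0x3a, 0x41, 0x5b, 0x61, 0x7b]

def in_set(breaks, symbol):
    # symbol is included iff it falls after an odd number of breaks
    return bisect.bisect_right(breaks, symbol) & 1
```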
+
+Then the internal representation of a grammar is quite simple, as it is just a
+list of productions, each of which is a list of symbol sets, each of which is
+a list of breaks (i.e. starting symbol numbers of included/excluded symbols).
+Production number 0 is always the start production, and it must end with EOF.
+
+We keep symbol sets separated into terminals and non-terminals, so each symbol
+set is actually two lists, not one. This is convenient for computing lookahead
+sets and generating the automaton, since terminals must be treated differently,
+so it would be annoying to have to split the list into two at each step. Note
+that in this format the non-terminals are not offset by ``n_terminals`` so the
+numbers in the non-terminals set are simply the indices into ``productions``.
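+As an illustration (the exact field layout inside ``LR1`` may differ in
+detail), the ``num0``/``num1`` productions from earlier might be encoded as:

```python
# num0 <- {'0'..'9'}
# num1 <- {num0, num1} {'0'..'9'}
# productions[0] is num0, productions[1] is num1; each symbol set is a
# (terminal breaks, nonterminal breaks) pair of break lists, and the
# nonterminal numbers are indices into productions (not offset by
# n_terminals)
productions = [
    [([0x30, 0x3a], [])],                 # num0: one terminal set
    [([], [0, 2]), ([0x30, 0x3a], [])],   # num1: {num0, num1}, then digits
]
```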
+
+We allow the user to store a reference data item with each production, and we
+also keep a workspace for computing the production's lookahead sets and whether
+the production can be empty. This is calculated by scanning the production in a
+forward direction, checking for each element of each symbol set the element's
+lookahead set and can-be-empty status. We save the intermediate results of the
+scan, i.e. the lookahead set and nullability of each prefix of each production.
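+The can-be-empty and lookahead computation is essentially a fixed-point
+iteration; a simplified sketch follows (computing only a whole-production
+FIRST set and nullability, without the per-prefix intermediate results the
+text mentions, and using plain Python sets rather than break lists):

```python
def first_and_nullable(productions):
    # each production is a list of (terminals, nonterminals) pairs, with
    # terminals a set of symbol numbers and nonterminals a set of
    # production indices; iterate until nothing changes
    n = len(productions)
    first = [set() for _ in range(n)]
    nullable = [False] * n
    changed = True
    while changed:
        changed = False
        for i, production in enumerate(productions):
            f = set()
            null = True
            for terminals, nonterminals in production:
                if null:
                    # this element can still be the first symbol matched
                    f |= terminals
                    for j in nonterminals:
                        f |= first[j]
                # the element matches empty only via a nullable
                # nonterminal in its set
                null = null and any(nullable[j] for j in nonterminals)
            if null and not nullable[i]:
                nullable[i] = True
                changed = True
            if not f <= first[i]:
                first[i] |= f
                changed = True
    return first, nullable
```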
+
+Additionally, in a separate data structure (like the sets above, but as a map
+not a set) we keep associativity and precedence, for resolving shift/reduce
+conflicts. For compatibility, the algorithm is deliberately identical to YACC.
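+For a shift/reduce conflict, YACC's resolution rule can be sketched as follows
+(the helper function is illustrative, not πyacc's actual code; the
+``ASSOCIATIVITY_*`` values match the constants on the ``LR1`` class below):

```python
ASSOCIATIVITY_RIGHT = 0
ASSOCIATIVITY_LEFT = 1
ASSOCIATIVITY_NON = 2

SHIFT, REDUCE, ERROR = 'shift', 'reduce', 'error'

def resolve(token_precedence, rule_precedence, associativity):
    # YACC's rule: higher precedence wins; on a tie, left associativity
    # reduces, right associativity shifts, and non-associativity is an
    # error
    if token_precedence > rule_precedence:
        return SHIFT
    if token_precedence < rule_precedence:
        return REDUCE
    if associativity == ASSOCIATIVITY_LEFT:
        return REDUCE
    if associativity == ASSOCIATIVITY_RIGHT:
        return SHIFT
    return ERROR
```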
+
+.. automodule:: ndcode.piyacc.lr1
+ :members:
+
+Indices and tables
+==================
+
+* :ref:`genindex`
+* :ref:`modindex`
+* :ref:`search`
#from ndcode.piyacc import work
# defines the alphabet size, set this to 0x11000 for unicode
-n_characters = 0x100
+N_CHARACTERS = 0x100
class LR1:
+ '''
+An LR1 object stores a grammar definition for an LR1 parser. Information about
+the grammar can either be passed into the constructor LR1() or added to the LR1
+object progressively after it has been constructed. Once the grammar is set up,
+methods on the LR1 object can create an LR1DFA object containing an automaton.
+'''
+
ASSOCIATIVITY_RIGHT = 0
ASSOCIATIVITY_LEFT = 1
ASSOCIATIVITY_NON = 2
productions = [],
precedences = ([], [], [], []),
associativities = [],
- n_terminals = n_characters + 1,
- eof_terminal = n_characters
+ n_terminals = N_CHARACTERS + 1,
+ eof_terminal = N_CHARACTERS
):
# productions: list of production
# production: (
# )
def to_clr1(self):
+ '''
+Produces a canonical LR1 parser automaton in the form of an LR1DFA object.
+
+This is based around the concept of an "item", which means a production from
+the list of productions, plus a position within that production which we have
+matched input up to. To advance the item we simulate receiving a particular
+symbol (or an interval of consecutively numbered symbols, if they are equivalent).
+An item which does not match drops out, whereas an item which matches gets its
+pointer advanced. Then, given a set of items that have survived the advance by
+one symbol, we compute the closure, which means inspecting the upcoming
+symbol(s) of each item, expanding out all possible productions that can
+reduce to those symbol(s), and recursively doing this until the set stabilizes.
+
+So the algorithm here is to generate an automaton by first creating an initial
+state and an item, pointing to the start of the start production (production
+number 0), and then simulate the arrival of a possible symbol (or an interval
+of consecutively-numbered symbols, if they are identical). When the arriving
+symbol leads to a new set of items that hasn't been seen before, a new state
+is admitted, otherwise an existing state built from the same items is chosen.
+When all have been seen (each state of the automaton links only to existing
+states and not to new states to be created), the process completes and the
+automaton is ready to use. Note that these canonical LR automata are large.
+See to_lalr1() below for a more efficient way based on merging lookahead sets.
+'''
+
_lr1dfa = lr1dfa.LR1DFA(
[],
[
return _lr1dfa
def to_lalr1(self):
+ '''
+Produces an LALR1 parser automaton in the form of an LR1DFA object. Basically
+the same as to_clr1() described above, but each time a new state is to be
+created, there is an attempt to merge it with an existing state (if any) that
+has the same item set and only differs in the lookahead sets. If such a state is
+found, then a union is taken of the lookahead sets and the states are merged.
+This can potentially create conflicts, but is generally robust in practice.
+'''
+
_lr1dfa = lr1dfa.LR1DFA(
[],
[