Add sphinx documentation, integrated into our navigation and colour scheme
[ndcode_site.git] / sphinx / lr1.html
1
2 <!DOCTYPE html>
3
4 <html xmlns="http://www.w3.org/1999/xhtml">
5   <head>
6     <meta charset="utf-8" />
7     <title>The lr1 module &#8212; πyacc  documentation</title>
8     <link rel="stylesheet" href="_static/classic.css" type="text/css" />
9     <link rel="stylesheet" href="_static/pygments.css" type="text/css" />
10     
11     <script id="documentation_options" data-url_root="./" src="_static/documentation_options.js"></script>
12     <script src="_static/jquery.js"></script>
13     <script src="_static/underscore.js"></script>
14     <script src="_static/doctools.js"></script>
15     <script src="_static/language_data.js"></script>
16     
17     <link rel="index" title="Index" href="genindex.html" />
18     <link rel="search" title="Search" href="search.html" />
19     <link rel="prev" title="The element module" href="element.html" /> 
20   </head><body>
21     <div class="related" role="navigation" aria-label="related navigation">
22       <h3>Navigation</h3>
23       <ul>
24         <li class="right" style="margin-right: 10px">
25           <a href="genindex.html" title="General Index"
26              accesskey="I">index</a></li>
27         <li class="right" >
28           <a href="py-modindex.html" title="Python Module Index"
29              >modules</a> |</li>
30         <li class="right" >
31           <a href="element.html" title="The element module"
32              accesskey="P">previous</a> |</li>
33         <li class="nav-item nav-item-0"><a href="index.html">πyacc  documentation</a> &#187;</li> 
34       </ul>
35     </div>  
36
37     <div class="document">
38       <div class="documentwrapper">
39         <div class="bodywrapper">
40           <div class="body" role="main">
41             
42   <div class="section" id="module-ndcode.piyacc.lr1">
43 <span id="the-lr1-module"></span><h1>The <code class="docutils literal notranslate"><span class="pre">lr1</span></code> module<a class="headerlink" href="#module-ndcode.piyacc.lr1" title="Permalink to this headline"></a></h1>
44 <div class="section" id="design-of-the-module">
45 <h2>Design of the module<a class="headerlink" href="#design-of-the-module" title="Permalink to this headline"></a></h2>
46 <p>The parser is structured as a list of productions, each of which describes a
47 consecutive series of symbols that can be “reduced” to a new symbol.</p>
48 <p>Traditionally an LR1 parser is defined as a collection of head symbols, each
49 of which has a collection of alternatives, each a string of symbols, e.g.:</p>
50 <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">expr</span> <span class="o">&lt;-</span> <span class="n">number</span> <span class="o">|</span> <span class="n">binary_expr</span> <span class="o">|</span> <span class="s1">&#39;(&#39;</span> <span class="n">expr</span> <span class="s1">&#39;)&#39;</span>
51 <span class="n">binary_expr</span> <span class="o">&lt;-</span> <span class="n">unary_expr</span> <span class="o">|</span> <span class="n">binary_expr</span> <span class="n">binary_op</span> <span class="n">unary_expr</span>
52 <span class="n">unary_expr</span> <span class="o">&lt;-</span> <span class="n">expr</span> <span class="o">|</span> <span class="n">unary_op</span> <span class="n">unary_expr</span>
53 <span class="n">binary_op</span> <span class="o">&lt;-</span> <span class="s1">&#39;+&#39;</span> <span class="o">|</span> <span class="s1">&#39;-&#39;</span>
54 <span class="n">unary_op</span> <span class="o">&lt;-</span> <span class="s1">&#39;-&#39;</span>
55 </pre></div>
56 </div>
57 <p>Here we take a different approach by defining the LR1 parser as a collection
58 of productions, each of which reduces to a unique head symbol. Then instead of
59 referring to a head symbol which has alternatives, at each reference we list
60 out the alternative head symbols explicitly. For example, based on the above,
61 we would first have to split each head symbol into several indepedent ones:</p>
62 <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">expr0</span> <span class="o">&lt;-</span> <span class="n">num</span>
63 <span class="n">expr1</span> <span class="o">&lt;-</span> <span class="n">bin_expr</span>
64 <span class="n">expr2</span> <span class="o">&lt;-</span> <span class="s1">&#39;(&#39;</span> <span class="n">expr</span> <span class="s1">&#39;)&#39;</span>
65 <span class="n">bin_expr0</span> <span class="o">&lt;-</span> <span class="n">un_expr</span>
66 <span class="n">bin_expr1</span> <span class="o">&lt;-</span> <span class="n">bin_expr</span> <span class="n">bin_op</span> <span class="n">un_expr</span>
67 <span class="n">un_expr0</span> <span class="o">&lt;-</span> <span class="n">expr</span>
68 <span class="n">un_expr1</span> <span class="o">&lt;-</span> <span class="n">un_op</span> <span class="n">un_expr</span>
69 <span class="n">bin_op0</span> <span class="o">&lt;-</span> <span class="s1">&#39;+&#39;</span>
70 <span class="n">bin_op1</span> <span class="o">&lt;-</span> <span class="s1">&#39;-&#39;</span>
71 <span class="n">un_op0</span> <span class="o">&lt;-</span> <span class="s1">&#39;-&#39;</span>
72 </pre></div>
73 </div>
74 <p>Then, each symbol would have to be changed into a set of acceptable symbols:</p>
75 <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">expr0</span> <span class="o">&lt;-</span> <span class="p">{</span><span class="n">num</span><span class="p">}</span>
76 <span class="n">expr1</span> <span class="o">&lt;-</span> <span class="p">{</span><span class="n">bin_expr0</span><span class="p">,</span> <span class="n">bin_expr1</span><span class="p">}</span>
77 <span class="n">expr2</span> <span class="o">&lt;-</span> <span class="p">{</span><span class="s1">&#39;(&#39;</span><span class="p">}</span> <span class="p">{</span><span class="n">expr0</span><span class="p">,</span> <span class="n">expr1</span><span class="p">,</span> <span class="n">expr2</span><span class="p">}</span> <span class="p">{</span><span class="s1">&#39;)&#39;</span><span class="p">}</span>
78 <span class="n">bin_expr0</span> <span class="o">&lt;-</span> <span class="p">{</span><span class="n">un_expr0</span><span class="p">,</span> <span class="n">un_expr1</span><span class="p">}</span>
79 <span class="n">bin_expr1</span> <span class="o">&lt;-</span> <span class="p">{</span><span class="n">bin_expr0</span><span class="p">,</span> <span class="n">bin_expr1</span><span class="p">}</span> <span class="p">{</span><span class="n">bin_op0</span><span class="p">,</span> <span class="n">bin_op1</span><span class="p">}</span> <span class="p">{</span><span class="n">un_expr0</span><span class="p">,</span> <span class="n">un_expr1</span><span class="p">}</span>
80 <span class="n">un_expr0</span> <span class="o">&lt;-</span> <span class="p">{</span><span class="n">expr0</span><span class="p">,</span> <span class="n">expr1</span><span class="p">,</span> <span class="n">expr2</span><span class="p">}</span>
81 <span class="n">un_expr1</span> <span class="o">&lt;-</span> <span class="p">{</span><span class="n">un_op0</span><span class="p">}</span> <span class="p">{</span><span class="n">un_expr0</span><span class="p">,</span> <span class="n">un_expr1</span><span class="p">}</span>
82 <span class="n">bin_op0</span> <span class="o">&lt;-</span> <span class="p">{</span><span class="s1">&#39;+&#39;</span><span class="p">}</span>
83 <span class="n">bin_op1</span> <span class="o">&lt;-</span> <span class="p">{</span><span class="s1">&#39;-&#39;</span><span class="p">}</span>
84 <span class="n">un_op0</span> <span class="o">&lt;-</span> <span class="p">{</span><span class="s1">&#39;-&#39;</span><span class="p">}</span>
85 </pre></div>
86 </div>
87 <p>Whilst our representation is equivalent, it is more efficient in several ways:</p>
88 <ul>
89 <li><p>It can handle simple character input streams itself without needing a
90 lexical analysis front end. With the lexical analysis front-end, the parser
91 receives a stream of lexical tokens. A typical <code class="docutils literal notranslate"><span class="pre">lex</span></code> definition for how
92 to recognize a token <code class="docutils literal notranslate"><span class="pre">NUM</span></code> and return it to the parser might look like:</p>
93 <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="p">[</span><span class="mi">0</span><span class="o">-</span><span class="mi">9</span><span class="p">]</span><span class="o">+</span> <span class="k">return</span> <span class="n">NUM</span><span class="p">;</span>
94 </pre></div>
95 </div>
96 <p>Whereas, without the front-end, the parser receives a stream of characters,
97 these might be ASCII characters in the range <code class="docutils literal notranslate"><span class="pre">0..0xff</span></code>. An equivalent rule
98 to the above could be written in the traditional parser language as follows:</p>
99 <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">digit</span> <span class="o">&lt;-</span> <span class="s1">&#39;0&#39;</span> <span class="o">|</span> <span class="s1">&#39;1&#39;</span> <span class="o">|</span> <span class="s1">&#39;2&#39;</span> <span class="o">|</span> <span class="s1">&#39;3&#39;</span> <span class="o">|</span> <span class="s1">&#39;4&#39;</span> <span class="o">|</span> <span class="s1">&#39;5&#39;</span> <span class="o">|</span> <span class="s1">&#39;6&#39;</span> <span class="o">|</span> <span class="s1">&#39;7&#39;</span> <span class="o">|</span> <span class="s1">&#39;8&#39;</span> <span class="o">|</span> <span class="s1">&#39;9&#39;</span>
100 <span class="n">num</span> <span class="o">&lt;-</span> <span class="n">digit</span> <span class="o">|</span> <span class="n">num</span> <span class="n">digit</span>
101 </pre></div>
102 </div>
103 <p>Or in our parser language (using symbol sets rather than alternatives) by:</p>
104 <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">num0</span> <span class="o">&lt;-</span> <span class="p">{</span><span class="s1">&#39;0&#39;</span><span class="p">,</span> <span class="s1">&#39;1&#39;</span><span class="p">,</span> <span class="s1">&#39;2&#39;</span><span class="p">,</span> <span class="s1">&#39;3&#39;</span><span class="p">,</span> <span class="s1">&#39;4&#39;</span><span class="p">,</span> <span class="s1">&#39;5&#39;</span><span class="p">,</span> <span class="s1">&#39;6&#39;</span><span class="p">,</span> <span class="s1">&#39;7&#39;</span><span class="p">,</span> <span class="s1">&#39;8&#39;</span><span class="p">,</span> <span class="s1">&#39;9&#39;</span><span class="p">}</span>
105 <span class="n">num1</span> <span class="o">&lt;-</span> <span class="p">{</span><span class="n">num0</span><span class="p">,</span> <span class="n">num1</span><span class="p">}</span> <span class="p">{</span><span class="s1">&#39;0&#39;</span><span class="p">,</span> <span class="s1">&#39;1&#39;</span><span class="p">,</span> <span class="s1">&#39;2&#39;</span><span class="p">,</span> <span class="s1">&#39;3&#39;</span><span class="p">,</span> <span class="s1">&#39;4&#39;</span><span class="p">,</span> <span class="s1">&#39;5&#39;</span><span class="p">,</span> <span class="s1">&#39;6&#39;</span><span class="p">,</span> <span class="s1">&#39;7&#39;</span><span class="p">,</span> <span class="s1">&#39;8&#39;</span><span class="p">,</span> <span class="s1">&#39;9&#39;</span><span class="p">}</span>
106 </pre></div>
107 </div>
108 <p>We find the second way to be more efficient, since the parser now naturally
109 supports character classes, and repurposes the same mechanism to implement
110 alternatives. This principle is carried through to the generated automaton,
111 so that the <code class="docutils literal notranslate"><span class="pre">LR1DFA</span></code> recognizes character classes the same way as a regular
112 <code class="docutils literal notranslate"><span class="pre">DFA</span></code> automaton (such as would be generated for the <code class="docutils literal notranslate"><span class="pre">lex</span></code>-based example).</p>
113 </li>
114 <li><p>It is common to see (non-recursive) head symbols where all alternatives have
115 length one, and these can be expanded in-place and eliminated, since the
116 framework has natural support for sets. Furthermore, after splitting each
117 head symbol into separate productions, some productions might be eliminated:</p>
118 <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">expr1</span> <span class="o">&lt;-</span> <span class="p">{</span><span class="n">bin_expr0</span><span class="p">,</span> <span class="n">bin_expr1</span><span class="p">}</span>
119 <span class="n">expr2</span> <span class="o">&lt;-</span> <span class="p">{</span><span class="s1">&#39;(&#39;</span><span class="p">}</span> <span class="p">{</span><span class="n">num</span><span class="p">,</span> <span class="n">expr1</span><span class="p">,</span> <span class="n">expr2</span><span class="p">}</span> <span class="p">{</span><span class="s1">&#39;)&#39;</span><span class="p">}</span>
120 <span class="n">bin_expr0</span> <span class="o">&lt;-</span> <span class="p">{</span><span class="n">un_expr0</span><span class="p">,</span> <span class="n">un_expr1</span><span class="p">}</span>
121 <span class="n">bin_expr1</span> <span class="o">&lt;-</span> <span class="p">{</span><span class="n">bin_expr0</span><span class="p">,</span> <span class="n">bin_expr1</span><span class="p">}</span> <span class="p">{</span><span class="s1">&#39;+&#39;</span><span class="p">,</span> <span class="s1">&#39;-&#39;</span><span class="p">}</span> <span class="p">{</span><span class="n">un_expr0</span><span class="p">,</span> <span class="n">un_expr1</span><span class="p">}</span>
122 <span class="n">un_expr0</span> <span class="o">&lt;-</span> <span class="p">{</span><span class="n">num</span><span class="p">,</span> <span class="n">expr1</span><span class="p">,</span> <span class="n">expr2</span><span class="p">}</span>
123 <span class="n">un_expr1</span> <span class="o">&lt;-</span> <span class="p">{</span><span class="s1">&#39;-&#39;</span><span class="p">}</span> <span class="p">{</span><span class="n">un_expr0</span><span class="p">,</span> <span class="n">un_expr1</span><span class="p">}</span>
124 </pre></div>
125 </div>
126 </li>
127 </ul>
128 <p>Symbols are mapped to integers. The usual mapping has the following ranges:</p>
129 <ul class="simple">
130 <li><p><code class="docutils literal notranslate"><span class="pre">0..N_CHARACTERS</span> <span class="pre">-</span> <span class="pre">1</span></code> represent single characters, or possibly if a lexical
131 analyzer front-end is being used, shorthand for particular lexical symbols;</p></li>
132 <li><p><code class="docutils literal notranslate"><span class="pre">N_CHARACTERS..n_terminals</span> <span class="pre">-</span> <span class="pre">1</span></code> represent terminal symbols which are not
133 single characters, usually meaning a lexical analyzer front-end is used;</p></li>
134 <li><p><code class="docutils literal notranslate"><span class="pre">n_terminals..n_terminals</span> <span class="pre">+</span> <span class="pre">len(productions)</span> <span class="pre">-</span> <span class="pre">1</span></code> represent nonterminal
135 symbols, i.e. the head symbols to which the productions can be reduced.</p></li>
136 </ul>
137 <p>A special symbol called “eof_terminal” is defined, which must be in the range
138 <code class="docutils literal notranslate"><span class="pre">N_CHARACTERS</span> <span class="pre">&lt;=</span> <span class="pre">eof_terminal</span> <span class="pre">&lt;</span> <span class="pre">n_terminals</span></code>.</p>
139 <p>If a lexical analyzer front-end is not used, <code class="docutils literal notranslate"><span class="pre">eof_terminal</span></code> will be the only
140 terminal other than the characters, so assuming 8-bit ASCII, the characters
141 will be <code class="docutils literal notranslate"><span class="pre">0..0xff</span></code>, EOF will be <code class="docutils literal notranslate"><span class="pre">0x100</span></code>, and nonterminals will be <code class="docutils literal notranslate"><span class="pre">0x101</span></code>
142 and above.</p>
143 <p>In our scheme, nonterminal symbols do not have alternatives, meaning that each
144 production reduces to a unique nonterminal. Hence there is a 1-1 mapping
145 between productions and nonterminal symbols, and the symbol number is obtained
146 by taking the position of the production in the productions list and adding
147 the symbol number of the first nonterminal, which we call <code class="docutils literal notranslate"><span class="pre">n_terminals</span></code>
148 above. Thus it is not necessary to store what symbol a production reduces to.</p>
149 <p>Since symbols are integers, symbol sets are sets of integers. We represent a
150 set of integers using a list of breaks – see the <code class="docutils literal notranslate"><span class="pre">bisect_set</span></code> module.</p>
151 <p>Then the internal representation of a grammar is quite simple, as it is just a
152 list of productions, each of which is a list of symbol sets, each of which is
153 a list of breaks (i.e. starting symbol numbers of included/excluded symbols).
154 Production number 0 is always the start production, and it must end with EOF.</p>
155 <p>We keep symbol sets separated into terminals and non-terminals, so each symbol
156 set is actually two lists, not one. This is convenient for computing lookahead
157 sets and generating the automaton, since terminals must be treated differently,
158 so it would be annoying to have to split the list into two at each step. Note
159 that in this format the non-terminals are not offset by <code class="docutils literal notranslate"><span class="pre">n_terminals</span></code>, so the
160 numbers in the non-terminals set are simply the indices into <code class="docutils literal notranslate"><span class="pre">productions</span></code>.</p>
161 <p>We allow the user to store a reference data item with each production, and we
162 also keep a workspace for computing the production’s lookahead sets and whether
163 the production can be empty. This is calculated by scanning the production in a
164 forward direction, checking for each element of each symbol set the element’s
165 lookahead set and can-be-empty status. We save the intermediate results of the
166 scan, i.e. the lookahead set and nullability of each prefix of each production.</p>
167 <p>Additionally, in a separate data structure we keep a map from nonterminals to
168 associativity and precedence, for resolving shift/reduce conflicts. For
169 compatibility, the algorithm is deliberately identical to YACC. The map is kept
170 in a similar way to the sets discussed earlier – see the <code class="docutils literal notranslate"><span class="pre">bisect_set</span></code> module
171 for how we keep sets of integers. For maps from integers to arbitary objects
172 we keep two lists: a list of <code class="docutils literal notranslate"><span class="pre">breaks</span></code>, and a list of <code class="docutils literal notranslate"><span class="pre">objects</span></code>. Then symbol
173 <code class="docutils literal notranslate"><span class="pre">symbol</span></code> maps to <code class="docutils literal notranslate"><span class="pre">objects[i]</span></code> for the smallest <code class="docutils literal notranslate"><span class="pre">i</span></code> such that <code class="docutils literal notranslate"><span class="pre">symbol</span> <span class="pre">&lt;</span>
174 <span class="pre">breaks[i]</span></code>, or if none of these holds, it maps to <code class="docutils literal notranslate"><span class="pre">objects[len(breaks)]</span></code>.</p>
175 </div>
176 </div>
177
178
179           </div>
180         </div>
181       </div>
182       <div class="sphinxsidebar" role="navigation" aria-label="main navigation">
183         <div class="sphinxsidebarwrapper">
184   <h3><a href="index.html">Table of Contents</a></h3>
185   <ul>
186 <li><a class="reference internal" href="#">The <code class="docutils literal notranslate"><span class="pre">lr1</span></code> module</a><ul>
187 <li><a class="reference internal" href="#design-of-the-module">Design of the module</a></li>
188 </ul>
189 </li>
190 </ul>
191
192   <h4>Previous topic</h4>
193   <p class="topless"><a href="element.html"
194                         title="previous chapter">The <code class="docutils literal notranslate"><span class="pre">element</span></code> module</a></p>
195   <div role="note" aria-label="source link">
196     <h3>This Page</h3>
197     <ul class="this-page-menu">
198       <li><a href="_sources/lr1.rst.txt"
199             rel="nofollow">Show Source</a></li>
200     </ul>
201    </div>
202 <div id="searchbox" style="display: none" role="search">
203   <h3 id="searchlabel">Quick search</h3>
204     <div class="searchformwrapper">
205     <form class="search" action="search.html" method="get">
206       <input type="text" name="q" aria-labelledby="searchlabel" />
207       <input type="submit" value="Go" />
208     </form>
209     </div>
210 </div>
211 <script>$('#searchbox').show(0);</script>
212         </div>
213       </div>
214       <div class="clearer"></div>
215     </div>
216     <div class="related" role="navigation" aria-label="related navigation">
217       <h3>Navigation</h3>
218       <ul>
219         <li class="right" style="margin-right: 10px">
220           <a href="genindex.html" title="General Index"
221              >index</a></li>
222         <li class="right" >
223           <a href="py-modindex.html" title="Python Module Index"
224              >modules</a> |</li>
225         <li class="right" >
226           <a href="element.html" title="The element module"
227              >previous</a> |</li>
228         <li class="nav-item nav-item-0"><a href="index.html">πyacc  documentation</a> &#187;</li> 
229       </ul>
230     </div>
231     <div class="footer" role="contentinfo">
232         &#169; Copyright 2020, NDCODE project.
233       Created using <a href="http://sphinx-doc.org/">Sphinx</a> 2.4.1.
234     </div>
235   </body>
236 </html>