README.md

   1 NDCODE project
   2
   3 # PiTree: Specification Language for Abstract Syntax Trees
   4
   5 ## Introduction
   6
   7 The πtree program is a companion program to πlex and πyacc. It implements a
   8 specification language for abstract syntax trees. In general you will have a
   9 lexical specification such as `my_language.l`, a grammar specification such
  10 as `my_language.y`, and an AST specification such as `my_language.t`. These
  11 will be compiled by πlex, πyacc, and πtree respectively. Then when you run the
  12 generated lexer/parser it will instantiate classes from the AST specification.
  13
  14 The `my_language.t` AST specification is formatted something like a Lex or
  15 YACC specification. It has three parts, separated by %% lines. The parts are
  16 (i) directives and header information, (ii) the class declarations for the
  17 different objects that can be in an AST, in a C++-like syntax, (iii) free-form
  18 code which is appended to the πtree output and can contain subroutines, etc.
  19 Note that in πtree the free-form section serves a second purpose, which is to
  20 attach methods to the classes declared in the classes section. We do this with
  21 regular Python syntax, including a decorator that we provide for this purpose.
  22
  23 It is possible to define your own class hierarchy to use with πlex and πyacc,
  24 without using πtree. You can either define action code in your Lex and YACC
  25 specification to instantiate the objects (the traditional way), or you can use
  26 πlex and πyacc's markup syntax embedded within the πlex and πyacc rules, to say
  27 what you want generated. If you do it the latter way, then the classes need to
  28 have a constructor defined in a specific way so that πlex or πyacc can easily
  29 pass in the child nodes of the AST and any field values to store in the node.
  30
  31 The reason why it is convenient to use πtree to define the class hierarchy is
  32 (i) πtree automatically declares the right kind of constructor that can take
  33 the child nodes and field values, and deal with them appropriately by either
  34 storing or passing into the superclass's constructor, (ii) πtree automatically
  35 adds serialization/deserialization methods to the class to/from πtree's XML
  36 format. While you could use `pickle` or similar, the XML format is advantageous
  37 because it is human readable. It embeds the original parsed text, with markup
  38 giving class names and field values from your πlex/πyacc/πtree specification.
  39
  40 ## Installation
  41
  42 From PyPI we can install πtree by a command like `pip3 install --user pitree`.
  43 This will put a `pitree` executable in `$HOME/.local/bin` where `$HOME` is your
  44 home directory. You need to make sure this is in your path, so if the directory
  45 `$HOME/.local/bin` is being created for the first time, you may wish to log out
  46 and back in again. At least on Debian systems it will then appear in your path.
  47
  48 As well as the `pitree` executable, the `ndcode.pitree` package needs to be
  49 importable from Python, since the first statement of the generated code (from
  50 running the `pitree` executable) will usually be something like `from
  51 ndcode.pitree import element`. With the installation method recommended above,
  52 this module will be placed somewhere like
  53 `$HOME/.local/lib/python3.6/site-packages/pitree-0.0.1-py3.6.egg/ndcode/pitree/element.py`
  54 where it is accessible to the generated code.
  55
  56 An alternative way, if you have already checked out the source repository from
  57 `https://git.ndcode.org/public/pitree.git` or similar, is to run a command like
  58 `. ./env.sh` in the root of the repository. This will put the current directory
  59 (the root of the repository) into your `PYTHONPATH`. The source files have been
  60 put in a directory `ndcode/pitree` so you can then `import ndcode.pitree.XXX`.
  61 With this method, instead of just `pitree` you must run `ndcode/pitree/cli.py`.
  62 Also, you'll have to run `make` in `ndcode/pitree` before anything will work.
  63
  64 ## Tutorial
  65
  66 We are going to define a simple AST for a four-function calculator language.
  67 The leaf nodes of the tree will be literals expressed as strings like '5.0'.
  68 The non-leaf nodes will be operators. The negation operator will have one
  69 child being the expression to negate. The add/subtract/multiply/divide nodes
  70 will have two children being the expressions to add/subtract/multiply/divide.
  71
  72 ### Tutorial: defining a tree
  73
  74 Create a file called `calc.t` containing the following:
  75 ```
  76 %{
  77   from ndcode.pitree import element
  78 %}
  79
  80 %%
  81
  82 class Expression;
  83 class LiteralExpression: Expression;
  84 class NegExpression: Expression;
  85 class AddExpression: Expression;
  86 class SubExpression: Expression;
  87 class MulExpression: Expression;
  88 class DivExpression: Expression;
  89 ```
  90
  91 The text is free-form like C, whitespace is insignificant, and statements
  92 are delimited by semicolons. We have given a simple directives section that
  93 defines some Python code to be copied to the output before anything else,
  94 here an `import` statement that imports the `ndcode.pitree.element` module.
  95
  96 Then we have defined an abstract class `Expression`, which acts as a superclass
  97 to the literal and operator expression classes. Finally, we have gone ahead and
  98 defined the other classes, using πtree's `:` syntax to denote the superclass.
  99
 100 Note that `Expression` has an implicit superclass of `element.Element`, using
 101 the `element` module imported in the directives section. Hence, some `element`
 102 module always needs to be imported (but you can use your own `element` module).
 103
 104 ### Tutorial: compiling the tree
 105
 106 To compile this specification, run a command like `pitree --python calc.t`.
 107 By default this creates an output file called `t_def.py`. (Similarly, πlex
 108 defaults to creating `lex_yy.py` and πyacc defaults to creating `y_tab.py`,
 109 behaviour which is inherited from the C versions of the tools). The name
 110 `t_def.py` stands for "Tree Definition". You can also specify your own output
 111 file name using `-o`, for example `pitree --python -o calc_def.py calc.t`.
 112
 113 Let us now have a quick look at the generated output, it is pretty simple:
 114 ```
 115 # ... copyright stuff (omitted for brevity) ...
 116
 117 import json
 118
 119 # GENERATE SECTION1 BEGIN
 120 from ndcode.pitree import element
 121 # GENERATE END
 122
 123 # GENERATE SECTION2 BEGIN
 124 class Expression(element.Element):
 125   tag = 'Expression'
 126 class LiteralExpression(Expression):
 127   tag = 'LiteralExpression'
 128 class NegExpression(Expression):
 129   tag = 'NegExpression'
 130 class AddExpression(Expression):
 131   tag = 'AddExpression'
 132 class SubExpression(Expression):
 133   tag = 'SubExpression'
 134 class MulExpression(Expression):
 135   tag = 'MulExpression'
 136 class DivExpression(Expression):
 137   tag = 'DivExpression'
 138 tag_to_class = {
 139   'Expression': Expression,
 140   'LiteralExpression': LiteralExpression,
 141   'NegExpression': NegExpression,
 142   'AddExpression': AddExpression,
 143   'SubExpression': SubExpression,
 144   'MulExpression': MulExpression,
 145   'DivExpression': DivExpression
 146 }
 147 # GENERATE END
 148
 149 def method(_class):
 150   def decorator(func):
 151     setattr(_class, func.__name__, func)
 152     return func
 153   return decorator
 154
 155 # GENERATE SECTION3 BEGIN
 156 # GENERATE END
 157 ```
 158
 159 The wanted classes are just defined as regular Python classes, and there is
 160 a small amount of stuff to support serializing/deserializing. The `tag` field
 161 of each class is for serializing and converts a class to a textual name, the
 162 `tag_to_class` dictionary is for deserializing and goes the other way. The
 163 `method` function is a decorator, and will be discussed later in the tutorial.
 164 You can insert arbitary code in the SECTION1 and SECTION3 parts of the output.
 165
 166 The `json` import is not referenced in this particular example, but in later
 167 examples it will be used in the generated code to serialize and deserialize
 168 Python native types such as `str`, `int` and so on. This is a bit lazy on our
 169 part, since we know what type is being serialized/deserialized, so we could
 170 generate more specific code to do this, but the `json` way is easier for now.
 171
 172 ### Tutorial: serializing the tree
 173
 174 Let us now create a short program which imports and uses the example tree
 175 definition from above. Create a script `serialize.py` containing the following:
 176 ```
 177 #!/usr/bin/env python3
 178
 179 from ndcode.pitree import element
 180 import sys
 181 import t_def
 182
 183 expression = t_def.AddExpression(
 184   children = [
 185     t_def.MulExpression(
 186       children = [
 187         t_def.LiteralExpression(text = ['5']),
 188         t_def.LiteralExpression(text = ['2'])
 189       ]
 190     ),
 191     t_def.LiteralExpression(text = ['7'])
 192   ]
 193 )
 194 element.serialize(expression, sys.stdout)
 195 ```
 196
 197 Make it executable, `chmod a+x ./serialize.py`. For the remaining scripts in
 198 the tutorial, we'll assume that you know to do this step, or that you will run
 199 the scripts via the `python3` command if you prefer them not to be executable.
 200
 201 This program programmatically constructs a simple AST for the calculator
 202 expression `5 * 2 + 7`, and then serializes it to `sys.stdout` for inspection.
 203
 204 Basically all the heavy lifting is done in the `ndcode.pitree.element` module,
 205 via a combination of the utility routine `element.serialize()`, and supporting
 206 methods that we have defined on the superclass `ndcode.pitree.element.Element`.
 207
 208 Run this and redirect the output to a file, e.g. `./serialize.py >expr.xml`.
 209 You should get a file containing something like:
 210 ```
 211 <root>
 212   <AddExpression ref="0"><MulExpression><LiteralExpression>5</LiteralExpression><LiteralExpression>2</LiteralExpression></MulExpression><LiteralExpression>7</LiteralExpression></AddExpression>
 213 </root>
 214 ```
 215
 216 The `<root>` and `</root>` tags and the `ref="0"` attribute are added by the
 217 serializer, to support the attachment of arbitrary structures to AST nodes,
 218 such as from semantic analysis. We will not consider such use for now. But note
 219 that when deserializing, the node returned is the one with the `ref="0"` on it.
 220
 221 We have deliberately not indented this in the normal XML way, for reasons that
 222 should become clear further in the tutorial. If we do indent it manually, it
 223 becomes more obvious that this expression is nested like `(5 * 2) + 7`, e.g.:
 224 ```
 225 <root>
 226   <AddExpression ref="0">
 227     <MulExpression>
 228       <LiteralExpression>5</LiteralExpression>
 229       <LiteralExpression>2</LiteralExpression>
 230     </MulExpression>
 231     <LiteralExpression>7</LiteralExpression>
 232   </AddExpression>
 233 </root>
 234 ```
 235
 236 ### Tutorial: methods on the tree
 237
 238 What we will do now is create a method to perform the requested calculation,
 239 e.g. with the above example tree, it would add 5 and 2, then multiply by 7.
 240
 241 We are going to do this in an object-oriented manner, by defining a method
 242 `evaluate()` on the `Expression` class. This method will be specialized in the
 243 leaf class `LiteralExpression` and the non-leaf classes like `NegExpression`,
 244 so that each class knows how to evaluate itself. For the leaf class this means
 245 creating a number to operate on, for the non-leaf classes this means evaluating
 246 the child nodes and then operating on these results by negation, addition, etc.
 247
 248 We will define these methods in a slightly unconventional way, by attaching
 249 them to the class after the class has been defined. This serves two purposes,
 250 (i) the code generator for classes is simplified since it does not need to
 251 worry about our methods, (ii) our methods can be grouped by method name rather
 252 than by class name, making it easy to see how each class evaluates itself. The
 253 point ii makes our Python code look a bit like Haskell pattern matching code.
 254
 255 To try it, add a free-form section to `calc.t` beginning with `%%`, as follows:
 256 ```
 257 %%
 258
 259 @method(Expression)
 260 def evaluate(self):
 261   raise NotImplementedError
 262 @method(LiteralExpression)
 263 def evaluate(self):
 264   return float(self.text[0])
 265 @method(NegExpression)
 266 def evaluate(self):
 267   return -self.children[0].evaluate()
 268 @method(AddExpression)
 269 def evaluate(self):
 270   return self.children[0].evaluate() + self.children[1].evaluate()
 271 @method(SubExpression)
 272 def evaluate(self):
 273   return self.children[0].evaluate() - self.children[1].evaluate()
 274 @method(MulExpression)
 275 def evaluate(self):
 276   return self.children[0].evaluate() * self.children[1].evaluate()
 277 @method(DivExpression)
 278 def evaluate(self):
 279   return self.children[0].evaluate() / self.children[1].evaluate()
 280 del evaluate
 281 ```
 282
 283 The same-named function `evaluate()` is defined a number of times, using the
 284 `@method` decorator to attach the different definitions to different classes.
 285 Unfortunately the `def` statement doesn't know that the decorator has put the
 286 function elsewhere, so it also defines the function in the `t_def` module. To
 287 clean up, we therefore do `del evaluate` after we've done all of the attaching.
 288
 289 In the leaf node method, we refer to text stored in the node by `self.text[0]`.
 290 In the non-leaf node methods, we refer to the child nodes by `self.children[0]`
 291 for the first child, `self.children[1]` for the second child and so on. The
 292 methods simply assume that the correct number of children are present: 0 for
 293 `LiteralExpression`, 1 for `NegExpression`, 2 for `AddExpression` and the rest.
 294
 295 Note that following best Python practice, the abstract class `Expression` also
 296 gets a definition of the method, which raises `NotImplementedError`. It serves
 297 as a catch-all, in case we forget to define it on one of the derived classes.
 298
 299 Don't forget to recompile the tree definition, by `pitree --python calc.t`.
 300 Then, to test the new method, create a script `evaluate.py` as follows:
 301 ```
 302 #!/usr/bin/env python3
 303
 304 import t_def
 305
 306 expression = t_def.AddExpression(
 307   children = [
 308     t_def.MulExpression(
 309       children = [
 310         t_def.LiteralExpression(text = ['5']),
 311         t_def.LiteralExpression(text = ['2'])
 312       ]
 313     ),
 314     t_def.LiteralExpression(text = ['7'])
 315   ]
 316 )
 317 print(expression.evaluate())
 318 ```
 319 Running `./evaluate.py` should show the answer of evaluating the tree, `17.0`.
 320
 321 ### Tutorial: deserializing the tree
 322
 323 To support deserialization, we have to add a tiny bit of boilerplate code to
 324 our tree definition `calc.t` in the free-form section, as follows:
 325 ```
 326 def factory(tag, *args, **kwargs):
 327   return tag_to_class[tag](*args, **kwargs)
 328 ```
 329
 330 What this does is, given a tag name and a set of constructor arguments, calls
 331 the appropriate constructor to create the requested node, e.g. it constructs a
 332 `LiteralExpression` when `tag` is set to `'LiteralExpression'`, and so on.
 333
 334 The reason we do this here (instead of doing it automatically in the generated
 335 code) is for expandability. For example, we can define a larger language which
 336 uses `calc.t` as a mini-language for some of its subtrees, with a more complex
 337 factory that can use the `calc.t` factory as a subfactory for particular nodes.
 338
 339 Don't forget to recompile the tree definition, by `pitree --python calc.t`.
 340 Then, to test the deserializer, create a script `deserialize.py` as follows:
 341 ```
 342 #!/usr/bin/env python3
 343
 344 from ndcode.pitree import element
 345 import sys
 346 import t_def
 347
 348 expression = element.deserialize(sys.stdin, t_def.factory)
 349 print(expression.evaluate())
 350 ```
 351
 352 If you still have the `expr.xml` file that you created when testing the
 353 serializer, you can now feed it to the deserializer by running a command like
 354 `./deserialize.py <expr.xml`. It will deserialize, then evaluate the `17.0`.
 355 This is quite powerful, as it makes an existing XML file become "intelligent".
 356
 357 ### Tutorial: fields on a node
 358
 359 Suppose we want to cut down the number of different classes in the tree. We now
 360 want to have a `UnaryExpression` class to cover all unary operators, including
 361 the former `NegExpression`, and a `BinaryExpression` class to cover all binary
 362 operators, including the former `AddExpression` and so on. We will keep track
 363 of which kind of operator is requested, by storing it in a field on the node.
 364 We use "field" to mean the same as Python instance variables or XML attributes.
 365
 366 To avoid confusion we'll call the field `unary_operator` for `UnaryExpression`
 367 and `binary_operator` for `BinaryExpression`. For simplicity's sake we'll make
 368 the field contain a Python `str`, even though in it would be more efficient to
 369 make it an `int` containing an enumeration value. πtree supports a fairly rich
 370 type system, subject to types being declared upfront (not Duck typing). We can
 371 have `str`, `int`, `bool`, `list(str)` and so on. It's best to think of this in
 372 a C++-like way, so `list(str)` is similar to `std::vector<std::string>`, etc.
 373
 374 Here is a revised tree definition `calc1.t` using fields, including a factory
 375 to support the deserializer, and an appropriately updated `evaluate()` method:
 376 ```
 377 %{
 378   from ndcode.pitree import element
 379 %}
 380
 381 %%
 382
 383 class Expression;
 384 class LiteralExpression: Expression;
 385 class UnaryExpression: Expression {
 386   str unary_operator;
 387 };
 388 class BinaryExpression: Expression {
 389   str binary_operator;
 390 };
 391
 392 %%
 393
 394 def factory(tag, *args, **kwargs):
 395   return tag_to_class[tag](*args, **kwargs)
 396
 397 @method(Expression)
 398 def evaluate(self):
 399   raise NotImplementedError
 400 @method(LiteralExpression)
 401 def evaluate(self):
 402   return float(self.text[0])
 403 @method(UnaryExpression)
 404 def evaluate(self):
 405   value = self.children[0].evaluate()
 406   if self.unary_operator == '-':
 407     return -value
 408   raise NotImplementedError
 409 @method(BinaryExpression)
 410 def evaluate(self):
 411   value0 = self.children[0].evaluate()
 412   value1 = self.children[1].evaluate()
 413   if self.binary_operator == '+':
 414     return value0 + value1
 415   if self.binary_operator == '-':
 416     return value0 - value1
 417   if self.binary_operator == '*':
 418     return value0 * value1
 419   if self.binary_operator == '/':
 420     return value0 / value1
 421   raise NotImplementedError
 422 del evaluate
 423 ```
 424
 425 Whilst this is a bit of a contrived example, it illustrates how sometimes an
 426 attribute on a node can be more compact than defining a set of classes for the
 427 different node types. Here it allows some common code, the evaluation of the
 428 child nodes, to be expressed just once for unary and once for binary operators.
 429
 430 To avoid confusion let us compile this to `t_def1.py` instead of `t_def.py`.
 431 Run `pitree --python -o t_def1.py calc1.t`. Then, create an `evaluate1.py`:
 432 ```
 433 #!/usr/bin/env python3
 434
 435 import t_def1
 436
 437 expression = t_def1.BinaryExpression(
 438   binary_operator = '+',
 439   children = [
 440     t_def1.BinaryExpression(
 441       binary_operator = '*',
 442       children = [
 443         t_def1.LiteralExpression(text = ['5']),
 444         t_def1.LiteralExpression(text = ['2'])
 445       ]
 446     ),
 447     t_def1.LiteralExpression(text = ['7'])
 448   ]
 449 )
 450 print(expression.evaluate())
 451 ```
 452
 453 To test the serialization with fields, create a `serialize1.py` as follows:
 454 ```
 455 #!/usr/bin/env python3
 456
 457 from ndcode.pitree import element
 458 import sys
 459 import t_def1
 460
 461 expression = t_def1.BinaryExpression(
 462   binary_operator = '+',
 463   children = [
 464     t_def1.BinaryExpression(
 465       binary_operator = '*',
 466       children = [
 467         t_def1.LiteralExpression(text = ['5']),
 468         t_def1.LiteralExpression(text = ['2'])
 469       ]
 470     ),
 471     t_def1.LiteralExpression(text = ['7'])
 472   ]
 473 )
 474 element.serialize(expression, sys.stdout)
 475 ```
 476
 477 Running `./serialize1.py >tree1.xml` should produce output in which the field
 478 values `unary_operator` and `binary_operator` are stored in XML attributes:
 479 ```
 480 <root>
 481   <BinaryExpression binary_operator="&quot;+&quot;" ref="0"><BinaryExpression binary_operator="&quot;*&quot;"><LiteralExpression>5</LiteralExpression><LiteralExpression>2</LiteralExpression></BinaryExpression><LiteralExpression>7</LiteralExpression></BinaryExpression>
 482 </root>
 483
 484 Note that `&quot;` in the above represents `'`, and hence the value stored for
 485 `binary_operator` is actually `'+'` or `'*'` rather than `+` or `*` directly.
 486 The reason for this, is that if the type of the field were say `list(str)`
 487 rather than `str`, it would be necessary to quote the strings in order to be
 488 able to separate the list items. In the future, we might try to quote only if
 489 needed, to produce a more readable file. For now we use Python's `json` module.
 490
 491 The deserializer test program for the new format using fields, is the same as
 492 `deserialize.py` with `t_def` changed to `t_def1`. It prints `17.0` as before.
 493
 494 ### Tutorial: storing the whitespace
 495
 496 An important part of the XML serialization process is that it the AST stores
 497 the original parsed text within it, which can be recovered either by stripping
 498 the serialized XML output of all tags, or by calling `element.to_text()` on an
 499 in-memory tree. When in memory, this text is stored in the `text` array of each
 500 node. We saw previously, how leaf nodes can get their text by `self.text[0]`.
 501
 502 Logically, the contents of a node is its fields, and the interleaving of its
 503 text and its children, starting and ending with text. So there is always one
 504 more entry in the `text` array than the `children` array. For example, suppose
 505 a node has 2 children, e.g. the `AddExpression` or `BinaryExpression` nodes
 506 seen in the previous examples. Then it will have 3 pieces of text. Its internal
 507 structure can be summarized `text[0] children[0] text[1] children[1] text[2]`.
 508
 509 We will now show a revised `serialize2.py` which stores some text in the nodes:
 510 ```
 511 #!/usr/bin/env python3
 512
 513 from ndcode.pitree import element
 514 import sys
 515 import t_def
 516
 517 expression = t_def.AddExpression(
 518   text = ['', ' + ', ''],
 519   children = [
 520     t_def.MulExpression(
 521       text = ['', ' * ', ''],
 522       children = [
 523         t_def.LiteralExpression(text = ['5']),
 524         t_def.LiteralExpression(text = ['2'])
 525       ]
 526     ),
 527     t_def.LiteralExpression(text = ['7'])
 528   ]
 529 )
 530 element.serialize(expression, sys.stdout)
 531 ```
 532
 533 This would generate the following serialized output:
 534 ```
 535 <root>
 536   <AddExpression ref="0"><MulExpression><LiteralExpression>5</LiteralExpression> * <LiteralExpression>2</LiteralExpression></MulExpression> + <LiteralExpression>7</LiteralExpression></AddExpression>
 537 </root>
 538 ```
 539
 540 If we either strip the tags from the XML output (only the `AddExpression` node)
 541 or add a line like `print(element.to_text(expression))` in the test program,
 542 then we can see the stripped version is `5 * 2 + 7` which is more satisfying
 543 than `527` which would have been printed by the earlier versions without text.
 544
 545 Suppose we use πlex, πyacc, and πtree in integrated fashion to parse general
 546 computer languages such as C. Then the text-storage facility is used in leaf
 547 nodes to store things like identifiers, constants and such, and in non-leaf
 548 nodes mainly to store the whitespace separators between tokens. So for example,
 549 a source code formatter can run over the tree adjusting the inter-token space.
 550
 551 Interestingly though, much of the incoming syntax can also be treated as
 552 inter-token space, because it is redundant in respect of the structure imposed
 553 by the AST. For example, suppose the original expression was `5 * (2 + 7)`.
 554 Since the AST already encodes that the `+` node is nested inside the `*` node,
 555 the parentheses are effectively whitespace from the viewpoint of the AST.
 556
 557 To see how this works, consider this `serialize3.py` which uses parentheses:
 558 ```
 559 #!/usr/bin/env python3
 560
 561 from ndcode.pitree import element
 562 import sys
 563 import t_def
 564
 565 expression = t_def.MulExpression(
 566   text = ['', ' * (', ')'],
 567   children = [
 568     t_def.LiteralExpression(text = ['5']),
 569     t_def.AddExpression(
 570       text = ['', ' + ', ''],
 571       children = [
 572         t_def.LiteralExpression(text = ['2']),
 573         t_def.LiteralExpression(text = ['7'])
 574       ]
 575     )
 576   ]
 577 )
 578 element.serialize(expression, sys.stdout)
 579 ```
 580
 581 This produces output like:
 582 ```
 583 <root>
 584   <MulExpression ref="0"><LiteralExpression>5</LiteralExpression> * (<AddExpression><LiteralExpression>2</LiteralExpression> + <LiteralExpression>7</LiteralExpression></AddExpression>)</MulExpression>
 585 </root>
 586 ```
 587
 588 Whilst it is irrelevant whether the `(` and `)` characters appear inside the
 589 `AddExpression` tags or not, we have placed them outside. This is because in
 590 the parsing stage an `AddExpression` would be recognized first, followed by a
 591 notional `ParenthesesExpression` that contains the parentheses, but does not
 592 actually generate any markup. Using πyacc we can control the markup precisely.
 593
 594 Another time this occurs is when recognizing keyword-delimited structures such
 595 as `for (declaration; expression; expression) statement` in C. In the AST this
 596 would appear as a `ForStatement` class with 4 children: the declaration, the
 597 two expressions and the statement. The keyword `for` and all delimeters are
 598 treated as whitespace, since `ForStatement` already implies the keyword `for`.
 599
 600 ### Tutorial: leading and trailing whitespace
 601
 602 In order to be able to store the contents of a source file exactly (plus the
 603 markup that results from recognizing its lexical and grammatical structure),
 604 we need to be able to store the leading and trailing whitespace. To do this,
 605 we will define a root AST node called `AST`. Furthermore, a πtree convention is
 606 to use inner classes for node types that are expected to appear inside others.
 607
 608 Therefore we will create a `calc4.t` with this modification, as follows:
 609 ```
 610 %{
 611   from ndcode.pitree import element
 612 %}
 613
 614 %%
 615
 616 class AST {
 617   class Expression;
 618   class LiteralExpression: Expression;
 619   class NegExpression: Expression;
 620   class AddExpression: Expression;
 621   class SubExpression: Expression;
 622   class MulExpression: Expression;
 623   class DivExpression: Expression;
 624 };
 625 ```
 626
 627 Whilst Python does not have good inner class support in the same way as Java,
 628 it does allow them. Apart from the lexical nesting, the inner classes are
 629 treated as independent from their outer class. We just define them as inner
 630 classes by convention, as it cleans up the namespace and tends to group them.
 631
 632 Compile this with `pitree --python -o t_def4.py calc4.t` and then create an
 633 example `serialize4.py` that has leading and trailing whitespace as follows:
 634 ```
 635 #!/usr/bin/env python3
 636
 637 from ndcode.pitree import element
 638 import sys
 639 import t_def4
 640
 641 expression = t_def4.AST(
 642   text = ['/* calculator source file */\n\n', '\n'],
 643   children = [
 644     t_def4.AST.MulExpression(
 645       text = ['', ' * (', ')'],
 646       children = [
 647         t_def4.AST.LiteralExpression(text = ['5']),
 648         t_def4.AST.AddExpression(
 649           text = ['', ' + ', ''],
 650           children = [
 651             t_def4.AST.LiteralExpression(text = ['2']),
 652             t_def4.AST.LiteralExpression(text = ['7'])
 653           ]
 654         )
 655       ]
 656     )
 657   ]
 658 )
 659 element.serialize(expression, sys.stdout)
 660 ```
 661
 662 This serializes as follows:
 663 ```
 664 <root>
 665   <AST ref="0">/* calculator source file */
 666
 667 <AST_MulExpression><AST_LiteralExpression>5</AST_LiteralExpression> * (<AST_AddExpression><AST_LiteralExpression>2</AST_LiteralExpression> + <AST_LiteralExpression>7</AST_LiteralExpression></AST_AddExpression>)</AST_MulExpression>
 668 </AST>
 669 </root>
 670 ```
 671
 672 Now, the `AST` tag contains a leading comment (treated as whitespace) and some
 673 newlines, then the real AST content (its root is an `AST_MulExpression`), and
 674 finally a trailing newline between the real AST and the end of the `AST` tag.
 675
 676 Note that inner classes must be prefixed with their outer class name at every
 677 reference, e.g. `t_def.AST.MulExpression` in the Python code, and are stored
 678 with underscores in the serialized output, e.g. `AST_MulExpression`. In future
 679 it might be possible to generate the XML without the fully qualified names, for
 680 example just `<MulExpression>...</MulExpression` when inside `<AST>...</AST>`.
 681
 682 ### Tutorial: semantic analysis and markup
 683
 684 Another important feature of the serialization format is that arbitrary objects
 685 can be attached to a node, subject to those objects being declared to πtree and
 686 following the ordinary πtree conventions. As well as πtree objects, we can also
 687 attach many Python intrinsic types, such as numbers, strings, lists, and so on.
 688
 689 To attach πtree objects to a node, you simply declare a field of type `ref`.
 690 Whereas basic field types like `int` or `list(int)` store the corresponding
 691 Python intrinsic types, fields of type `ref` store Python references, provided
 692 they refer to a πtree object. Aggregates like `list(ref)` are also available.
 693
 694 Such attached πtree objects are serialized by reference rather than by direct
 695 nesting in the XML file, so that multiple pointers to the same object generate
 696 multiple references to that object rather than multiple copies. Deserialization
 697 reverses the process to reproduce the original structure of Python references.
 698 The `None` value is allowed and will be serialized and deserialized correctly.
 699
 700 Note that for the sake of readability in the generated XML, Python intrinsics
 701 such as lists are not stored by reference, and neither are child nodes. Thus,
 702 they cannot be multiple, and care should be taken to ensure that multiple
 703 pointers to those kinds of objects do not occur in the tree. An exception to
 704 this is, you can attach a reference to a subtree that also has a parent node,
 705 provided there is at most one parent node (the serializer relies upon this).
 706
 707 The following example is based on `calc4.t`, but there are now two kinds of
 708 literal nodes, `IntLiteralExpression` and `FloatLiteralExpression`. Every
 709 expression has a `type` (declared on the base class `Expression` to ensure it
 710 is always present), which refers to either a `IntType` or `FloatType` object.
 711
 712 We'll also create `UnaryExpression` and `BinaryExpression` classes to put the
 713 semantic analysis code on, to take advantage that the semantic analysis does
 714 not really care which operators are involved specifically. The `DivExpression`
 715 will override the default semantic analysis to always generate a `FloatType`.
 716
 717 Create the basic `calc5.t` defining the needed class hierarchy, as follows:
 718 ```
 719 %{
 720   from ndcode.pitree import element
 721 %}
 722
 723 %%
 724
 725 class Type;
 726 class IntType: Type;
 727 class FloatType: Type;
 728
 729 class AST {
 730   class Expression {
 731     ref type;
 732   };
 733   class IntLiteralExpression: Expression;
 734   class FloatLiteralExpression: Expression;
 735   class UnaryExpression: Expression;
 736   class NegExpression: UnaryExpression;
 737   class BinaryExpression: Expression;
 738   class AddExpression: BinaryExpression;
 739   class SubExpression: BinaryExpression;
 740   class MulExpression: BinaryExpression;
 741   class DivExpression: BinaryExpression;
 742 };
 743 ```
 744
 745 Add a free-form section beginning with `%%` and a `semantic_analysis()` method:
 746 ```
 747 %%
 748
 749 @method(AST.Expression)
 750 def semantic_analysis(self):
 751   raise NotImplementedError
 752 @method(AST.IntLiteralExpression)
 753 def semantic_analysis(self):
 754   self.type = IntType()
 755 @method(AST.FloatLiteralExpression)
 756 def semantic_analysis(self):
 757   self.type = FloatType()
 758 @method(AST.UnaryExpression)
 759 def semantic_analysis(self):
 760   self.children[0].semantic_analysis()
 761   self.type = self.children[0].type
 762 @method(AST.BinaryExpression)
 763 def semantic_analysis(self):
 764   self.children[0].semantic_analysis()
 765   self.children[1].semantic_analysis()
 766   self.type = (
 767     self.children[0].type
 768   if isinstance(self.children[0].type, FloatType) else
 769     self.children[1].type
 770   )
 771 @method(AST.DivExpression)
 772 def semantic_analysis(self):
 773   self.children[0].semantic_analysis()
 774   self.children[1].semantic_analysis()
 775   self.type = FloatType()
 776 ```
 777
 778 Compile this to `t_def5.py` by `pitree --python -o t_def5.py calc5.t`, and then
 779 create the corresponding test program `serialize5.py` containing the following:
 780 ```
 781 #!/usr/bin/env python3
 782
 783 from ndcode.pitree import element
 784 import sys
 785 import t_def5
 786
 787 expression = t_def5.AST(
 788   text = ['/* calculator source file */\n\n', '\n'],
 789   children = [
 790     t_def5.AST.MulExpression(
 791       text = ['', ' * (', ')'],
 792       children = [
 793         t_def5.AST.IntLiteralExpression(text = ['5']),
 794         t_def5.AST.AddExpression(
 795           text = ['', ' + ', ''],
 796           children = [
 797             t_def5.AST.IntLiteralExpression(text = ['2']),
 798             t_def5.AST.FloatLiteralExpression(text = ['7.5'])
 799           ]
 800         )
 801       ]
 802     )
 803   ]
 804 )
 805 expression.children[0].semantic_analysis()
 806 element.serialize(expression, sys.stdout)
 807 ```
 808
 809 Running this should produce output like
 810 ```
 811 <root>
 812   <AST ref="0">/* calculator source file */
 813
 814 <AST_MulExpression type="3"><AST_IntLiteralExpression type="1">5</AST_IntLiteralExpression> * (<AST_AddExpression type="3"><AST_IntLiteralExpression type="2">2</AST_IntLiteralExpression> + <AST_FloatLiteralExpression type="3">7.5</AST_FloatLiteralExpression></AST_AddExpression>)</AST_MulExpression>
 815 </AST>
 816   <IntType ref="1" />
 817   <IntType ref="2" />
 818   <FloatType ref="3" />
 819 </root>
 820 ```
 821
 822 We can see that the `type` attribute of each expression node in the XML gives
 823 an integer, which is a reference to one of the `IntType` or `FloatType` objects
 824 later in the file. The `IntType` and `FloatType` objects are unparented with
 825 respect to the AST, and thus appear as children of the `<root>...</root>` node.
 826
 827 We can also see that due to some optimization in the `semantic_analysis()`
 828 method of the `BinaryExpression` class, which reuses the type of either the
 829 left child or the right child as appropriate, many of the type references are
 830 the same. This makes it fairly cheap to add the semantic information to a node.
 831
 832 The efficiency of this approach is highlighted for semantic analysis with more
 833 complicated type systems, for example in C we can have "int", "pointer to int",
 834 "pointer to pointer to int", "array of int" and so on. There can be a lot of
 835 reuse, e.g. consider the "address of" operator in C -- the "address of" node
 836 has a type of a "pointer to" node, containing a reference to the child's type.