doc/pascal/internal.doc

   1 .pl 12.5i
   2 .sp 1.5i
   3 .NH
   4 The compiler
   5
   6 .nh
   7 .LP
   8 The compiler can be divided roughly into four modules:
   9
  10 \(bu lexical analysis
  11 .br
  12 \(bu syntax analysis
  13 .br
  14 \(bu semantic analysis
  15 .br
  16 \(bu code generation
  17 .br
  18
  19 The four modules are grouped into one pass. The activity of these modules
  20 is interleaved during the pass.
  21 .br
  22 The lexical analyzer, some expression handling routines and various
  23 datastructures from the Modula-2 compiler contributed to the project.
  24 .sp 2
  25 .NH 2
  26 Lexical Analysis
  27
  28 .LP
  29 The first module of the compiler is the lexical analyzer. In this module, the
  30 stream of input characters making up the source program is grouped into
  31 \fItokens\fR, as defined in \fBISO 6.1\fR. The analyzer is hand-written,
  32 because the lexical analyzer generator, which was at our disposal,
  33 \fILex\fR [LEX], produces much slower analyzers. A character table, in the file
  34 \fIchar.c\fR, is created using the program \fItab\fR which takes as input
  35 the file \fIchar.tab\fR. In this table each character is placed into a
  36 particular class. The classes, as defined in the file \fIclass.h\fR,
  37 represent a set of tokens. The strategy of the analyzer is as follows: the
  38 first character of a new token is used in a multiway branch to eliminate as
  39 many candidate tokens as possible. Then the remaining characters of the token
  40 are read. The constant INP_NPUSHBACK, defined in the file \fIinput.h\fR,
  41 specifies the maximum number of characters the analyzer looks ahead. The
  42 value has to be at least 3, to handle input sequences such as:
  43 .br
  44               1e+4  (which is a real number)
  45 .br
  46               1e+a  (which is the integer 1, followed by the identifier "e", a plus, and the identifier "a")
  47
  48 Another aspect of this module is the insertion and deletion of tokens
  49 required by the parser for the recovery of syntactic errors (see also section
  50 2.2). A generic input module [ACK] is used to avoid the burden of I/O.
  51 .sp 2
  52 .NH 2
  53 Syntax Analysis
  54
  55 .LP
  56 The second module of the compiler is the parser, which is the central part of
  57 the compiler. It invokes the routines of the other modules. The tokens obtained
  58 from the lexical analyzer are grouped into grammatical phrases. These phrases
  59 are stored as parse trees and handed over to the next part. The parser is
  60 generated using \fILLgen\fR[LL], a tool for generating an efficient recursive
  61 descent parser with no backtrack from an Extended Context Free Syntax.
  62 .br
  63 An error recovery mechanism is generated almost completely automatically. A
  64 routine called \fILLmessage\fR had to be written, which gives the necessary
  65 error messages and deals with the insertion and deletion of tokens.
  66 The routine \fILLmessage\fR must accept one parameter, whose value is
  67 a token number, zero or -1. A zero parameter indicates that the current token
  68 (the one in the external variable \fILLsymb\fR) is deleted.
  69 A -1 parameter indicates that the parser expected end of file, but did
  70 not get it. The parser will then skip tokens until end of file is detected.
  71 A parameter that is a token number (a positive parameter) indicates that
  72 this token is to be inserted in front of the token currently in \fILLsymb\fR.
  73 Also, care must be taken, that the token currently in \fILLsymb\fR is again
  74 returned by the \fBnext\fR call to the lexical analyzer, with the proper
  75 attributes. So, the lexical analyzer must have a facility to push back one
  76 token.
  77 .br
  78 Calls to the two standard procedures \fIwrite\fR and \fIwriteln\fR can be
  79 different from calls to other procedures. The syntax of a write-parameter
  80 is different from the syntax of an actual-parameter. We decided to include
  81 them, together with \fIread\fR and \fIreadln\fR, in the grammar. An alternate
  82 solution would be to make the syntax of an actual-parameter identical to the
  83 syntax of a write-parameter. Afterwards the parameter has to be checked to
  84 see whether it is used properly or not.
  85 .bp
  86 As the parser is LL(1), it must always be able to determine what to do,
  87 based on the last token read (\fILLsymb\fR). Unfortunately, this was not the
  88 case with the grammar as specified in [ISO]. Two kinds of problems
  89 appeared, viz. the \fBalternation\fR and \fBrepetition\fR conflict.
  90 The examples given in the following paragraphs are taken from the grammar.
  91
  92 .NH 3
  93 Alternation conflict
  94
  95 .LP
  96 An alternation conflict arises when the parser can not decide which
  97 production to choose.
  98 .br
  99 \fBExample:\fR
 100 .in +2m
 101 .ft 5
 102 .nf
 103 procedure-declaration    : procedure-heading \fB';'\f5 directive |
 104 .br
 105 \h'\w'procedure-declaration    : 'u'procedure-identification \fB';'\f5 procedure-block |
 106 .br
 107 \h'\w'procedure-declaration    : 'u'procedure-heading \fB';'\f5 procedure-block ;
 108 .br
 109 procedure-heading        : \fBprocedure\f5 identifier [ formal-parameter-list ]? ;
 110 .br
 111 procedure-identification : \fBprocedure\f5 procedure-identifier ;
 112 .fi
 113 .ft R
 114 .in -2m
 115
 116 A sentence that starts with the terminal \fBprocedure\fR is derived from the
 117 three alternative productions. This conflict can be resolved in two ways:
 118 adjusting the grammar, usually some rules are replaced by one rule and more
 119 work has to be done in the semantic analysis; using the LLgen conflict
 120 resolver, "\fB%if\fR (C-expression)", if the C-expression evaluates to
 121 non-zero, the production in question is chosen, otherwise one of the
 122 remaining rules is chosen. The grammar rules were rewritten to solve this
 123 conflict. The new rules are given below. For more details see the file
 124 \fIdeclar.g\fR.
 125
 126 .in +2m
 127 .ft 5
 128 .nf
 129 procedure-declaration : procedure-heading \fB';'\f5 ( directive | procedure-block ) ;
 130 .br
 131 procedure-heading     : \fBprocedure\f5 identifier [ formal-parameter-list ]? ;
 132 .fi
 133 .ft R
 134 .in -2m
 135
 136 A special case of an alternation conflict, which is common to many block
 137 structured languages, is the \fI"dangling-else"\fR ambiguity.
 138
 139 .in +2m
 140 .ft 5
 141 .nf
 142 if-statement : \fBif\f5 boolean-expression \fBthen\f5 statement [ else-part ]? ;
 143 .br
 144 else-part    : \fBelse\f5 statement ;
 145 .fi
 146 .ft R
 147 .in -2m
 148
 149 The following statement that can be derived from the rules above is ambiguous:
 150
 151 .ti +2m
 152 \fBif\f5 boolean-expr-1 \fBthen\f5 \fBif\f5 boolean-expr-2 \fBthen\f5 statement-1 \fBelse\f5 statement-2
 153 .ft R
 154
 155
 156 .ps 8
 157 .vs 7
 158 .PS
 159 move right 1.1i
 160 S: line down 0.5i
 161 "if-statement" at S.start above
 162 .ft B
 163 "then" at S.end below
 164 .ft R
 165 move to S.start then down 0.25i
 166 L: line left 0.5i then down 0.25i
 167 box ht 0.33i wid 0.6i "boolean" "expression-1"
 168 move to L.start then left 0.5i
 169 L: line left 0.5i then down 0.25i
 170 .ft B
 171 "if" at L.end below
 172 .ft R
 173 move to L.start then right 0.5i
 174 L: line right 0.5i then down 0.25i
 175 "statement" at L.end below
 176 move to L.end then down 0.10i
 177 L: line down 0.25i dashed
 178 "if-statement" at L.end below
 179 move to L.end then down 0.10i
 180 L: line down 0.5i
 181 .ft B
 182 "then" at L.end below
 183 .ft R
 184 move to L.start then down 0.25i
 185 L: line left 0.5i then down 0.25i
 186 box ht 0.33i wid 0.6i "boolean" "expression-2"
 187 move to L.start then left 0.5i
 188 L: line left 0.5i then down 0.25i
 189 .ft B
 190 "if" at L.end below
 191 .ft R
 192 move to L.start then right 0.5i
 193 L: line right 0.5i then down 0.25i
 194 box ht 0.33i wid 0.6i "statement-1"
 195 move to L.start then right 0.5i
 196 L: line right 0.5i then down 0.25i
 197 .ft B
 198 "else" at L.end below
 199 .ft R
 200 move to L.start then right 0.5i
 201 L: line right 0.5i then down 0.25i
 202 box ht 0.33i wid 0.6i "statement-2"
 203 move to S.start
 204 move right 3.5i
 205 L: line down 0.5i
 206 "if-statement" at L.start above
 207 .ft B
 208 "then" at L.end below
 209 .ft R
 210 move to L.start then down 0.25i
 211 L: line left 0.5i then down 0.25i
 212 box ht 0.33i wid 0.6i "boolean" "expression-1"
 213 move to L.start then left 0.5i
 214 L: line left 0.5i then down 0.25i
 215 .ft B
 216 "if" at L.end below
 217 .ft R
 218 move to L.start then right 0.5i
 219 S: line right 0.5i then down 0.25i
 220 "statement" at S.end below
 221 move to S.start then right 0.5i
 222 L: line right 0.5i then down 0.25i
 223 .ft B
 224 "else" at L.end below
 225 .ft R
 226 move to L.start then right 0.5i
 227 L: line right 0.5i then down 0.25i
 228 box ht 0.33i wid 0.6i "statement-2"
 229 move to S.end then down 0.10i
 230 L: line down 0.25i dashed
 231 "if-statement" at L.end below
 232 move to L.end then down 0.10i
 233 L: line down 0.5i
 234 .ft B
 235 "then" at L.end below
 236 .ft R
 237 move to L.start then down 0.25i
 238 L: line left 0.5i then down 0.25i
 239 box ht 0.33i wid 0.6i "boolean" "expression-2"
 240 move to L.start then left 0.5i
 241 L: line left 0.5i then down 0.25i
 242 .ft B
 243 "if" at L.end below
 244 .ft R
 245 move to L.start then right 0.5i
 246 L: line right 0.5i then down 0.25i
 247 box ht 0.33i wid 0.6i "statement-1"
 248 .PE
 249 .ps
 250 .vs
 251 \h'615u'(a)\h'1339u'(b)
 252 .sp
 253 .ce
 254 Two parse trees showing the \fIdangling-else\fR ambiguity
 255 .sp 2
 256 According to the standard, \fBelse\fR is matched with the nearest preceding
 257 unmatched \fBthen\fR, i.e. parse tree (a) is valid (\fBISO 6.8.3.4\fR).
 258 This conflict is statically resolved in LLgen by using "\fB%prefer\fR",
 259 which is equivalent in behaviour to "\fB%if\fR(1)".
 260 .bp
 261 .NH 3
 262 Repetition conflict
 263
 264 .LP
 265 A repetition conflict arises when the parser can not decide whether to choose
 266 a production once more, or not.
 267 .br
 268 \fBExample:\fR
 269 .in +2m
 270 .ft 5
 271 .nf
 272 field-list : [ ( fixed-part [ \fB';'\f5 variant-part ]? | variantpart ) [;]? ]? ;
 273 .br
 274 fixed-part : record-section [ \fB';'\f5 record-section ]* ;
 275 .fi
 276 .in -2m
 277 .ft R
 278
 279 When the parser sees the semicolon, it can not decide whether another
 280 record-section or a variant-part follows. This conflict can be resolved in
 281 two ways: adjusting the grammar or using the conflict resolver,
 282 "\fB%while\fR (C-expression)". The grammar rules that deal with this conflict
 283 were completely rewritten. For more details, the reader is referred to the
 284 file \fIdeclar.g\fR.
 285 .sp 2
 286 .NH 2
 287 Semantic Analysis
 288
 289 .LP
 290 The third module of the compiler is the checking of semantic conventions of
 291 ISO-Pascal. To check the program being parsed, actions have been used in
 292 LLgen. An action consists of several C-statements, enclosed in brackets
 293 "{" and "}". In order to facilitate communication between the actions and
 294 \fILLparse\fR, the parsing routines can be given C-like parameters and
 295 local variables. An important part of the semantic analyzer is the symbol
 296 table. This table stores all information concerning identifiers and their
 297 definitions. Symbol-table lookup and hashing is done by a generic namelist
 298 module [ACK]. The parser turns each program construction into a parse tree,
 299 which is the major datastructure in the compiler. This parse tree is used
 300 to exchange information between various routines.
 301 .sp 2
 302 .NH 2
 303 Code Generation
 304
 305 .LP
 306 The final module in the compiler is that of code generation. The information
 307 stored in the parse trees is used to generate the EM code [EM]. EM code is
 308 generated with the help of a procedural EM-code interface [ACK]. The use of
 309 static exchanges is not desired, since the fast back end can not cope with
 310 static code exchanges, hence the EM pseudoinstruction \fBexc\fR is never
 311 generated.
 312 .br
 313 Chapter 3 discusses the code generation in more detail.
 314 .sp 2
 315 .NH 2
 316 Error Handling
 317
 318 .LP
 319 The first three modules have in common that they can detect errors in the
 320 Pascal program being compiled. If this is the case, a proper message is given
 321 and some action is performed. If code generation has to be aborted, an error
 322 message is given, otherwise a warning is given. The constant MAXERR_LINE,
 323 defined in the file \fIerrout.h\fR, specifies the maximum number of messages
 324 given per line. This can be used to avoid long lists of error messages caused
 325 by, for example, the omission of a ';'. Three kinds of errors can be
 326 distinguished: the lexical error, the syntactic error, and the semantic error.
 327 Examples of these errors are respectively, nested comments, an expression with
 328 unbalanced parentheses, and the addition of two characters.
 329 .sp 2
 330 .NH 2
 331 Memory Allocation and Garbage Collection
 332
 333 .LP
 334 The routines \fIst_alloc\fR and \fIst_free\fR provide a mechanism for
 335 maintaining free lists of structures, whose first field is a pointer called
 336 \fBnext\fR. This field is used to chain free structures together. Each
 337 structure, suppose the tag of the structure is ST, has a free list pointed
 338 by h_ST. Associated with this list are the operations: \fInew_ST()\fR, an
 339 allocating mechanism which supplies the space for a new ST struct; and
 340 \fIfree_ST()\fR, a garbage collecting mechanism which links the specified
 341 structure into the free list.
 342 .bp