doc/ceg/ceg.tr

   1 .nr PS 12
   2 .nr VS 14
   3 .nr LL 6i
   4 .tr ~
   5 .TL
   6 The Code Expander Generator
   7 .AU
   8 Frans Kaashoek
   9 Koen Langendoen
  10 .AI
  11 Dept. of Mathematics and Computer Science
  12 Vrije Universiteit
  13 Amsterdam, The Netherlands
  14 .NH
  15 Introduction
  16 .PP
  17 A \fBcode expander\fR (\fBce\fR for short) is a part of the
  18 Amsterdam Compiler Kit
  19 .[
  20 toolkit
  21 .]
  22 (\fBACK\fR) and provides the user with
  23 high-speed generation of medium-quality code. Although conceptually
  24 equivalent to the more usual \fBcode generator\fR, it differs in some
  25 aspects.
  26 .PP
  27 Normally, a program to be compiled with \fBACK\fR
  28 is first fed to the preprocessor. The output of the preprocessor goes
  29 into the appropriate front end, which produces EM
  30 .[
  31 block
  32 .]
  33 (a
  34 machine-independent, low-level intermediate code). The generated EM code is fed
  35 into the peephole optimizer, which scans it with a window of a few instructions,
  36 replacing certain inefficient code sequences by better ones. After the
  37 peephole optimizer a back end follows, which produces high-quality assembly code.
  38 The assembly code goes via the target optimizer into the assembler and the
  39 object code then goes into the
  40 linker/loader, the final component in the pipeline.
  41 .PP
  42 For various applications
  43 this scheme is too slow. When debugging, for example,
  44 compile time is more important than execution time of a program.
  45 For this purpose a new scheme is introduced:
  46 .IP \ \ 1:
  47 The code generator and assembler are
  48 replaced by a library, the \fBcode expander\fR, consisting of a set of
  49 routines, one for every EM-instruction. Each routine expands its EM-instruction
  50 into relocatable object code. In contrast, the usual ACK code generator uses
  51 expensive pattern matching on sequences of EM-instructions.
  52 The peephole and target optimizers are not used.
  53 .IP \ \ 2:
  54 These routines replace the usual EM-generating routines in the front end; this
  55 eliminates the overhead of intermediate files.
  56 .LP
  57 This results in a fast compiler producing an object file, ready to be
  58 linked and loaded, at the cost of unoptimized object code.
  59 .PP
  60 Because of the
  61 simple nature of the code expander, it is much easier to build, to debug, and to
  62 test. Experience has demonstrated that a code expander can be constructed,
  63 debugged, and tested in less than two weeks.
  64 .PP
  65 This document describes the tools for automatically generating a
  66 \fBce\fR (a library of C files) from two tables and
  67 a few machine-dependent functions.
  68 A thorough knowledge of EM is necessary to understand this document.
  69 .NH
  70 The code expander generator
  71 .PP
  72 The code expander generator (\fBceg\fR) generates a code expander from
  73 two tables and a few machine-dependent functions. This section explains how
  74 \fBceg\fR works. The first half describes the transformations that are done on
  75 the two tables. The
  76 second half tells how these transformations are done by the \fBceg\fR.
  77 .PP
  78 A code expander consists of a set of routines that convert EM-instructions
  79 directly to relocatable object code. These routines are called by a front
  80 end through the EM_CODE(3ACK)
  81 .[
  82 EM_CODE
  83 .]
  84 interface. To free the table writer of the burden of building
  85 an object file, we supply a set of routines that build an object file
  86 in the ACK.OUT(5ACK)
  87 .[
  88 aout
  89 .]
  90 format (see appendix B). This set of routines is called
  91 the
  92 \fBback\fR-primitives (see appendix A). In short, a code expander consists of a
  93 set of routines that map the EM_CODE interface onto the
  94 \fBback\fR-primitives interface.
  95 .PP
  96 To avoid repetition of the same sequences of
  97 \fBback\fR-primitives in different
  98 EM-instructions
  99 and to improve readability, the EM-to-object information must be supplied in
 100 two
 101 tables. The EM_table maps EM to an assembly language, and the as_table
 102 maps
 103 assembly code to \fBback\fR-primitives. The assembly language is chosen by the
 104 table writer. It can either be an actual assembly language or his ad-hoc
 105 designed language.
 106 .LP
 107 The following picture shows the dependencies between the different components:
 108 .sp
 109 .PS
 110 linewid = 0.5i
 111 A: line down 2i
 112 B: line down 2i with .start at A.start + (1.5i, 0)
 113 C: line down 2i with .start at B.start + (1.5i, 0)
 114 D: arrow right with .start at A.center - (0.25i, 0)
 115 E: arrow right with .start at B.center - (0.25i, 0)
 116 F: arrow right with .start at C.center - (0.25i, 0)
 117 "EM_CODE(3ACK)" at A.start above
 118 "EM_table" at B.start above
 119 "as_table" at C.start above
 120 "source language  " at D.start rjust
 121 "EM" at 0.5 of the way between D.end and E.start
 122 G: "assembly" at 0.5 of the way between E.end and F.start
 123 H: "  back primitives" at F.end ljust
 124 "(user defined)" at G - (0, 0.2i)
 125 "   (ACK.OUT)" at H - (0, 0.2i) ljust
 126 .PE
 127 .PP
 128 The picture suggests that, during compilation, the EM instructions are
 129 first transformed into assembly instructions and then the assembly instructions
 130 are transformed into object-generating calls. This
 131 is not what happens in practice, although the user is free to think it does.
 132 Actually, however, the EM_table and the as_table are combined during code
 133 expander generation time, yielding an imaginary compound table that results in
 134 routines from the EM_CODE interface that generate object code directly.
 135 .PP
 136 As already indicated, the compound table does not exist either. Instead, each
 137 assembly instruction in the as_table is converted to a routine generating C
 138 .[
 139 Kernighan
 140 .]
 141 code
 142 that calls the \fBback\fR-primitives. The EM_table is
 143 converted into a program that for each EM instruction generates a routine,
 144 using the routines generated from the as_table. Execution of the latter program
 145 will then generate the code expander.
 146 .PP
 147 This scheme allows great flexibility
 148 in the table writing, while still
 149 resulting in a very efficient code expander. One implication is that the
 150 as_table is interpreted twice and the EM_table only once. This has consequences
 151 for their structure.
 152 .PP
 153 To illustrate what happens, we give an example. The example is an entry in
 154 the tables for the VAX-machine. The assembly language chosen is a subset of the
 155 VAX assembly language.
 156 .PP
 157 One of the most fundamental operations in EM is ``loc c'', load the value of c
 158 on the stack. To expand this instruction the
 159 tables contain the following information:
 160 .DS
 161 EM_table   :
 162 .ft CW
 163    C_loc   ==>   "pushl $$$1".
 164      /* $1 refers to the first argument of C_loc.
 165       * $$ is a quoted $. */
 166
 167
 168 \fRas_table   :
 169 .ft CW
 170    pushl  src : CONST   ==>
 171                          @text1( 0xd0);
 172                          @text1( 0xef);
 173                          @text4( %$( src->num)).
 174 \fR
 175 .DE
 176 .LP
 177 The as_table is transformed into the following routine:
 178 .DS
 179 .ft CW
 180 pushl_instr(src)
 181 t_operand *src;
 182 /* ``t_operand'' is a struct defined by the
 183  * table writer. */
 184 {
 185    printf("swtxt();");
 186    printf("text1( 0xd0 );");
 187    printf("text1( 0xef );");
 188    printf("text4(%s);", substitute_dollar( src->num));
 189 }
 190 \fR
 191 .DE
 192 Using ``pushl_instr()'', the following routine is generated from the EM_table:
 193 .DS
 194 .ft CW
 195 C_loc( c)
 196 arith c;
 197 /* text1() and text4() are library routines that fill the
 198  * text segment. */
 199 {
 200     swtxt();
 201     text1( 0xd0);
 202     text1( 0xef);
 203     text4( c);
 204 }
 205 \fR
 206 .DE
 207 .LP
 208 A compiler call to ``C_loc()'' will cause the 1-byte numbers ``0xd0''
 209 and ``0xef''
 210 and the 4-byte value of the variable ``c'' to be stored in the text segment.
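The effect of such a call can be sketched in ANSI C with mock versions of the library routines. The buffer ``text_seg'', the counter ``text_len'', and the bodies of swtxt(), text1(), and text4() below are assumptions made for illustration; the real routines belong to the \fBback\fR-primitives library. The sketch assumes the default byte order, least significant byte first.

```c
#include <stddef.h>

/* Hypothetical stand-in for the text segment kept by the back-primitives. */
static unsigned char text_seg[64];
static size_t text_len = 0;

/* Mock of text1(): append one byte to the text segment. */
static void text1(unsigned int byte)
{
    if (text_len < sizeof text_seg)
        text_seg[text_len++] = (unsigned char)(byte & 0xff);
}

/* Mock of text4(): append four bytes, least significant byte first
 * (assumed default order, i.e. BYTES_REVERSED is not defined). */
static void text4(unsigned long value)
{
    int i;

    for (i = 0; i < 4; i++)
        text1((unsigned int)((value >> (8 * i)) & 0xff));
}

/* Mock of swtxt(): switching segments is a no-op in this sketch. */
static void swtxt(void)
{
}

/* The generated routine from the example, rewritten in ANSI C. */
static void C_loc(long c)
{
    swtxt();
    text1(0xd0);
    text1(0xef);
    text4((unsigned long)c);
}
```

With these mocks, one call to C_loc() appends six bytes to ``text_seg'': the two opcode bytes followed by the four bytes of the argument.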
 211 .PP
 212 The transformations on the tables are done automatically by the code expander
 213 generator.
 214 The code expander generator is made up of two tools:
 215 \fBemg\fR and \fBasg\fR. \fBAsg\fR
 216 transforms
 217 each assembly instruction into a C routine. These C routines generate calls
 218 to the \fBback\fR-primitives. The generated C routines are used
 219 by \fBemg\fR to generate the actual code expander from the EM_table.
 220 .PP
 221 The link between \fBemg\fR and \fBasg\fR is an assembly language.
 222 We did not enforce a specific syntax for the assembly language;
 223 instead we have given the table writer the freedom
 224 to make an ad-hoc assembly language or to use an actual assembly language
 225 suitable for his purpose. Apart from greater flexibility, this
 226 has another advantage: if the table writer adopts the assembly language that
 227 runs on the machine at hand, he can test the EM_table independently from the
 228 as_table. Of course there is a price to pay: the table writer has to
 229 do the decoding of the operands himself. See section 4 for more details.
 230 .PP
 231 Before we describe the structure of the tables in detail, we will give
 232 an overview of the four main phases.
 233 .IP "phase 1:"
 234 .br
 235 The as_table is transformed by \fBasg\fR. This results in a set of C routines.
 236 Each assembly-opcode generates one C routine. Note that a call to such a
 237 routine does not generate the corresponding object code; it generates C code,
 238 which, when executed, generates the desired object code.
 239 .IP "phase 2:"
 240 .br
 241 The C routines generated by \fBasg\fR are used by \fBemg\fR to expand the EM_table.
 242 This
 243 results in a set of C routines, the code expander, which conform to the
 244 procedural interface EM_CODE(3ACK). A call to such a routine does indeed
 245 generate the desired object code.
 246 .IP "phase 3:"
 247 .br
 248 The front end that uses the procedural interface is linked/loaded with the
 249 code expander generated in phase 2 and the \fBback\fR-primitives (a supplied
 250 library). This results in a compiler.
 251 .IP "phase 4:"
 252 .br
 253 The compiler runs. The routines in the code expander are
 254 executed and produce object code.
 255 .RE
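The difference between phase 1 and phase 2 can be sketched as follows. Both functions are hypothetical stand-ins, not output of \fBasg\fR or \fBemg\fR: the first plays the role of an \fBasg\fR-generated routine but writes its C text into a buffer instead of printing it, and the second wraps that text in a routine definition, as the program derived from the EM_table is described to do. The operand string "c" is assumed to be the result of substituting $1.

```c
#include <stdio.h>

/* Hypothetical stand-in for an asg-generated routine: it does not emit
 * object code, it emits C statements that will emit object code later. */
static int pushl_instr_sketch(const char *operand, char *out, size_t size)
{
    return snprintf(out, size,
                    "swtxt();text1( 0xd0 );text1( 0xef );text4(%s);",
                    operand);
}

/* Hypothetical stand-in for the program derived from the EM_table:
 * it wraps the emitted statements in the definition of C_loc(). */
static int emit_C_loc(char *out, size_t size)
{
    char body[128];

    pushl_instr_sketch("c", body, sizeof body);
    return snprintf(out, size, "C_loc( c) arith c; { %s }", body);
}
```

Only when the emitted text is compiled and linked into the compiler (phases 3 and 4) do the calls to text1() and text4() actually run.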
 256 .NH
 257 Description of the EM_table
 258 .PP
 259 This section describes the EM_table. It contains four subsections.
 260 The first three subsections describe the syntax of the EM_table,
 261 the
 262 semantics of the EM_table, and the functions and
 263 constants that must be present in the EM_table, in the file ``mach.c'' or in
 264 the file ``mach.h''. The last subsection explains how a table writer can generate
 265 assembly code instead of object code. The subsection on
 266 semantics contains many examples.
 267 .NH 2
 268 Grammar
 269 .PP
 270 The following grammar describes the syntax of the EM_table.
 271 .VS +4
 272 .TS
 273 center tab(%);
 274 l c l.
 275 TABLE%::=%( RULE)*
 276 RULE%::=%C_instr   ( COND_SEQUENCE | SIMPLE)
 277 COND_SEQUENCE%::=%( condition   SIMPLE)*   ``default''   SIMPLE
 278 SIMPLE%::=% ``==>'' ACTION_LIST
 279 ACTION_LIST%::=%[ ACTION   ( ``;'' ACTION)* ]   ``.''
 280 ACTION%::=%AS_INSTR
 281 %|%function-call
 282 AS_INSTR%::=%``"'' [ label ``:'']   [ INSTR] ``"''
 283 INSTR%::=%mnemonic   [ operand   ( ``,''   operand)* ]
 284 .TE
 285 .VS -4
 286 .PP
 287 The ``('' ``)'' brackets are used for grouping, ``['' ... ``]''
 288 means ... zero or one time,
 289 a ``*'' means zero or more times, and
 290 a ``|'' means
 291 a choice between left or right. A \fBC_instr\fR is
 292 a name in the EM_CODE(3ACK) interface. \fBcondition\fR is a C expression.
 293 \fBfunction-call\fR is a call of a C function. \fBlabel\fR, \fBmnemonic\fR,
 294 and \fBoperand\fR are arbitrary strings. If an \fBoperand\fR
 295 contains brackets, the
 296 brackets must match. There is an upper bound on the number of
 297 operands; the maximum number is defined by the constant MAX_OPERANDS in the
 298 file ``const.h'' in the directory assemble.c. Comments in the table should be
 299 placed between ``/*'' and ``*/''.
 300 The table is processed by the C preprocessor, before being parsed by
 301 \fBemg\fR.
 302 .NH 2
 303 Semantics
 304 .PP
 305 The EM_table is processed by \fBemg\fR. \fBEmg\fR generates a C function
 306 for every instruction in the EM_CODE(3ACK).
 307 For every EM-instruction not mentioned in the EM_table, a
 308 C function that prints an error message is generated.
 309 It is possible to divide the EM_CODE(3ACK)-interface into four parts :
 310 .IP \0\01:
 311 text instructions      (e.g., C_loc, C_adi, ..)
 312 .IP \0\02:
 313 pseudo instructions    (e.g., C_open, C_df_ilb, ..)
 314 .IP \0\03:
 315 storage instructions   (e.g., C_rom_icon,  ..)
 316 .IP \0\04:
 317 message instructions   (e.g., C_mes_begin, ..)
 318 .LP
 319 This section starts with giving the semantics of the grammar. The examples
 320 are text instructions. The section ends with remarks on the pseudo
 321 instructions and the storage instructions. Since message instructions are not
 322 useful for a code expander, they are ignored.
 323 .PP
 324 .NH 3
 325 Actions
 326 .PP
 327 The EM_table is made up of rules describing how to expand a \fBC_instr\fR
 328 defined by the EM_CODE(3ACK)-interface (corresponding
 329 to an EM instruction) into actions.
 330 There are two kinds of actions: assembly instructions and C function calls.
 331 An assembly instruction is defined as a mnemonic followed by zero or more
 332 operands separated by commas. The semantics of an assembly instruction is
 333 defined by the table writer. When the assembly language is not expressive
 334 enough, then, as an escape route, function calls can be made. However, this
 335 reduces
 336 the speed of the actual code expander. Finally, actions can be grouped into
 337 a list of actions; actions are separated by a semicolon and terminated
 338 by a ``.''.
 339 .DS
 340 .ft CW
 341 C_nop   ==> .
 342        /* Empty action list : no operation. */
 343
 344 C_inc   ==> "incl (sp)".
 345        /* Assembler instruction, which is evaluated
 346         * during expansion of the EM_table */
 347
 348 C_slu   ==> C_sli( $1).
 349        /* Function call, which is evaluated during
 350         *  execution of the compiler. */
 351 \fR
 352 .DE
 353 .NH 3
 354 Labels
 355 .PP
 356 Since an assembly language without instruction labels is a rather weak
 357 language, labels inside a contiguous block of assembly instructions are
 358 allowed. When using labels two rules must be observed:
 359 .IP \0\01:
 360 The name of a label should be unique inside an action list.
 361 .IP \0\02:
 362 The labels used in an assembler instruction should be defined in the same
 363 action list.
 364 .LP
 365 The following example illustrates the usage of labels.
 366 .DS
 367 .ft CW
 368    /* Compare the two top elements on the stack. */
 369 C_cmp      ==>     "pop bx";
 370                    "pop cx";
 371                    "xor ax, ax";
 372                    "cmp cx, bx";
 373                 /* Forward jump to local label */
 374                    "je 2f";
 375                    "jb 1f";
 376                    "inc ax";
 377                    "jmp 2f";
 378                    "1: dec ax";
 379                    "2: push ax".
 380 \fR
 381 .DE
 382 We will come back to labels in the section on the as_table.
 383 .NH 3
 384 Arguments of an EM instruction
 385 .PP
 386 In most cases the translation of a \fBC_instr\fR depends on its arguments.
 387 The arguments of a \fBC_instr\fR are numbered from 1 to \fIn\fR, where \fIn\fR
 388 is the
 389 total number of arguments of the current \fBC_instr\fR (there are a few
 390 exceptions, see Implicit arguments). The table writer may
 391 refer to an argument as $\fIi\fR. If a plain $-sign is needed in an
 392 assembly instruction, it must be preceded by an extra $-sign.
 393 .PP
 394 There are two groups of \fBC_instr\fRs whose arguments are handled specially:
 395 .RS
 396 .IP "1: Instructions dealing with local offsets"
 397 .br
 398 The value of the $\fIi\fR argument referring to a parameter ($\fIi\fR >= 0)
 399 is increased by ``EM_BSIZE''. ``EM_BSIZE'' is the size of the return status block
 400 and must be defined in the file ``mach.h'' (see section 3.3). For example :
 401 .DS
 402 .ft CW
 403 C_lol   ==>     "push $1(bp)".
 404        /* automatic conversion of $1 */
 405 \fR
 406 .DE
 407 .IP "2: Instructions using global names or instruction labels"
 408 .br
 409 All the arguments referring to global names or instruction labels will be
 410 transformed into a unique assembly name. To prevent name clashes with library
 411 names the table writer has to provide the
 412 conversions in the file ``mach.h''. For example :
 413 .DS
 414 .ft CW
 415 C_bra   ==>     "jmp $1".
 416         /* automatic conversion of $1 */
 417         /* type arith is converted to string */
 418 \fR
 419 .DE
 420 .RE
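The conversion of local offsets described above amounts to the following sketch. The function name and the value 4 for ``EM_BSIZE'' are assumptions made for illustration; the real value comes from ``mach.h''. Leaving negative offsets (locals) unchanged is inferred from the text, which mentions an adjustment only for parameters.

```c
/* Hypothetical value; the real EM_BSIZE is defined in mach.h. */
#define EM_BSIZE 4

/* Sketch of the automatic conversion of a local offset:
 * parameters (offset >= 0) are shifted past the return status block,
 * locals (offset < 0) are left unchanged. */
static long convert_local_offset(long offset)
{
    return offset >= 0 ? offset + EM_BSIZE : offset;
}
```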
 421 .NH 3
 422 Conditionals
 423 .PP
 424 The rules in the EM_table can be divided into two groups: simple rules and
 425 conditional rules. The simple rules are made up of a \fBC_instr\fR followed by
 426 a list of actions, as described above. The conditional rules (COND_SEQUENCE)
 427 allow the table writer to select an action list depending on the value of
 428 a condition.
 429 .PP
 430 A COND_SEQUENCE is a list of boolean expressions, each with a corresponding
 431 simple rule. If
 432 the expression evaluates to true then the corresponding simple rule is carried
 433 out. If more than one condition evaluates to true, the first one is chosen.
 434 The last case of a COND_SEQUENCE of a \fBC_instr\fR must handle
 435 the default case.
 436 The boolean expressions in a COND_SEQUENCE must be C expressions. Besides the
 437 ordinary C operators and constants, $\fIi\fR references can be used
 438 in an expression.
 439 .DS
 440 .ft CW
 441     /* Load address of LB $1 levels back. */
 442 C_lxl
 443     $1 == 0    ==>    "pushl fp".
 444     $1 == 1    ==>    "pushl 4(ap)".
 445     default    ==>    "movl $$$1, r0";
 446                       "jsb .lxl";
 447                       "pushl r0".
 448 \fR
 449 .DE
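The selection order of a COND_SEQUENCE can be pictured as an if-else chain. The function below is a hypothetical sketch, not code produced by \fBemg\fR: it tests the conditions of the C_lxl rule in table order, falls back to the default, and writes the chosen assembly text into a buffer instead of expanding it.

```c
#include <stdio.h>

/* Sketch: choose the action list for C_lxl. The conditions are tested
 * in the order in which they appear in the table; the first one that
 * holds wins, and the default is taken when none holds. */
static void lxl_rule_sketch(long levels, char *out, size_t size)
{
    if (levels == 0)
        snprintf(out, size, "pushl fp");
    else if (levels == 1)
        snprintf(out, size, "pushl 4(ap)");
    else
        snprintf(out, size, "movl $%ld, r0; jsb .lxl; pushl r0", levels);
}
```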
 450 .NH 3
 451 Abbreviations
 452 .PP
 453 EM instructions with an external as an argument come in three variants in
 454 the EM_CODE(3ACK) interface. In most cases it will be possible to take
 455 these variants together. For this purpose the ``..'' notation is introduced.
 456 For the code expander there is no difference between the
 457 following instructions.
 458 .DS
 459 .ft CW
 460 C_loe_dlb    ==>    "pushl $1 + $2".
 461 C_loe_dnam   ==>    "pushl $1 + $2".
 462 C_loe        ==>    "pushl $1 + $2".
 463 \fR
 464 .DE
 465 So it can be written in the following way.
 466 .DS
 467 .ft CW
 468 C_loe..      ==>    "pushl $1 + $2".
 469 \fR
 470 .DE
 471 .NH 3
 472 Implicit arguments
 473 .PP
 474 In the last example ``C_loe'' has two arguments, but in the EM_CODE interface
 475 it has one argument. This argument depends on the current ``hol''
 476 block; in the EM_table this is made explicit. Every \fBC_instr\fR whose
 477 argument depends on a ``hol'' block has one extra argument; argument 1 refers
 478 to the ``hol'' block.
 479 .NH 3
 480 Pseudo instructions
 481 .PP
 482 Most pseudo instructions are machine independent and are provided
 483 by \fBceg\fR. The table writer has only to supply the following functions,
 484 which are used to build a stackframe:
 485 .DS
 486 .ft CW
 487 C_prolog()
 488 /* Performs the prolog, for example save
 489  * return address */
 490
 491 C_locals( n)
 492 arith n;
 493 /* Allocate n bytes for locals on the stack */
 494
 495 C_jump( label)
 496 char *label;
 497 /* Generates code for a jump to ``label'' */
 498 \fR
 499 .DE
 500 .LP
 501 These functions can be defined in ``mach.c'' or in the EM_table (see
 502 section 3.3).
 503 .NH 3
 504 Storage instructions
 505 .PP
 506 The storage instructions ``C_bss_\fIcstp()\fR'', ``C_hol_\fIcstp()\fR'',
 507 ``C_con_\fIcstp()\fR'', and ``C_rom_\fIcstp()\fR'', except for the instructions
 508 dealing with constants of type string (C_..._icon, C_..._ucon, C_..._fcon), are
 509 generated automatically. No information is needed in the table.
 510 To generate the C_..._icon, C_..._ucon, C_..._fcon instructions
 511 \fBceg\fR only has to know how to convert a number of type string to bytes;
 512 this can be defined with the constants ONE_BYTE, TWO_BYTES, and FOUR_BYTES.
 513 C_rom_icon, C_con_icon, C_bss_icon, C_hol_icon can be abbreviated by ..icon.
 514 This also holds for ..ucon and ..fcon.
 515 For example :
 516 .DS
 517 .ft CW
 518 \\.\\.icon
 519     $2 == 1   ==>  gen1( (ONE_BYTE) atoi( $1)).
 520     $2 == 2   ==>  gen2( (TWO_BYTES) atoi( $1)).
 521     $2 == 4   ==>  gen4( (FOUR_BYTES) atol( $1)).
 522     default   ==>   arg_error( "..icon", $2).
 523 \fR
 524 .DE
 525 Gen1(), gen2() and gen4() are \fBback\fR-primitives (see appendix A), and
 526 generate one, two, or four byte constants. Atoi() is a C library function that
 527 converts strings to integers.
 528 The constants ``ONE_BYTE'', ``TWO_BYTES'', and ``FOUR_BYTES'' must be defined in
 529 the file ``mach.h''.
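The conversion performed by the ..icon rule can be sketched as a single C function. The name ``icon_to_bytes'' and the output buffer are assumptions made for illustration; the sketch uses strtol() in place of atoi() and atol(), stores the bytes least significant byte first, and returns -1 where the rule would call arg_error().

```c
#include <stdlib.h>

/* Sketch of the ..icon rule: convert the decimal string `str' to a
 * constant of `size' bytes and store it in `out', least significant
 * byte first. Returns the number of bytes written, or -1 for a size
 * the rule does not handle (the arg_error() case). */
static int icon_to_bytes(const char *str, int size, unsigned char *out)
{
    unsigned long value;
    int i;

    if (size != 1 && size != 2 && size != 4)
        return -1;
    value = (unsigned long)strtol(str, NULL, 10);
    for (i = 0; i < size; i++)
        out[i] = (unsigned char)((value >> (8 * i)) & 0xff);
    return size;
}
```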
 530 .NH 2
 531 User supplied definitions and functions
 532 .PP
 533 If the table writer uses all the default functions he has only to supply
 534 the following constants and functions :
 535 .TS
 536 tab(#);
 537 l c lw(10c).
 538 C_prolog()#:#T{
 539 Do prolog
 540 T}
 541 C_jump( l)#:#T{
 542 Perform a jump to label l
 543 T}
 544 C_locals( n)#:#T{
 545 Allocate n bytes on the stack
 546 T}
 547 #
 548 NAME_FMT#:#T{
 549 Print format describing name to a unique name conversion. The format must
 550 contain %s.
 551 T}
 552 DNAM_FMT#:#T{
 553 Print format describing data-label to a unique name conversion. The  format
 554 must contain %s.
 555 T}
 556 DLB_FMT#:#T{
 557 Print format describing numerical-data-label to a unique name conversion.
 558 The format must contain a %ld.
 559 T}
 560 ILB_FMT#:#T{
 561 Print format describing instruction-label to a unique name conversion.
 562 The format must contain %d followed by %ld.
 563 T}
 564 HOL_FMT#:#T{
 565 Print format describing hol-block-number to a unique name conversion.
 566 The format must contain %d.
 567 T}
 568 #
 569 EM_WSIZE#:#T{
 570 Size of a word in bytes on the target machine
 571 T}
 572 EM_PSIZE#:#T{
 573 Size of a pointer in bytes on the target machine
 574 T}
 575 EM_BSIZE#:#T{
 576 Size of base block in bytes on the target machine
 577 T}
 578 #
 579 ONE_BYTE#:#T{
 580 A C type that can hold one byte on the machine where the \fBce\fR runs
 581 T}
 582 TWO_BYTES#:#T{
 583 A C type that can hold two bytes on the machine where the \fBce\fR runs
 584 T}
 585 FOUR_BYTES#:#T{
 586 A C type that can hold four bytes on the machine where the \fBce\fR runs
 587 T}
 588 #
 589 BSS_INIT#:#T{
 590 The default value that the loader puts in the bss segment
 591 T}
 592 #
 593 BYTES_REVERSED#:#T{
 594 Must be defined if the byte order must be reversed.
 595 By default the least significant byte is output first.\fR\(dg
 596 .FS
 597 \fR\(dg When both byte orders are used, for
 598 example NS 16032, the table writer has to
 599 supply his own set of routines.
 600 .FE
 601 T}
 602 WORDS_REVERSED#:#T{
 603 Must be defined if the word order must be reversed.
 604 By default the least significant word is output first.
 605 T}
 606 .TE
 607 .LP
 608 An example of the file ``mach.h'' for the vax4.
 609 .TS
 610 tab(:);
 611 l l l.
 612 #define : ONE_BYTE : int
 613 #define : TWO_BYTES : int
 614 #define : FOUR_BYTES : long
 615 :
 616 #define : EM_WSIZE : 4
 617 #define : EM_PSIZE : 4
 618 #define : EM_BSIZE : 0
 619 :
 620 #define : BSS_INIT : 0
 621 :
 622 #define : NAME_FMT : "_%s"
 623 #define : DNAM_FMT : "_%s"
 624 #define : DLB_FMT  : "_%ld"
 625 #define : ILB_FMT  : "I%03d%ld"
 626 #define : HOL_FMT  : "hol%d"
 627 .TE
 628 Notice that EM_BSIZE is zero. The vax ``call'' instruction automatically takes
 629 care of the base block.
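How the name formats might be applied can be sketched with snprintf(). The three helper functions are hypothetical; only the format strings are taken from the vax4 example. What the %d argument of ILB_FMT stands for is not stated here, so the procedure number used below is an assumption.

```c
#include <stdio.h>

/* Formats copied from the vax4 example. */
#define NAME_FMT "_%s"
#define DLB_FMT  "_%ld"
#define ILB_FMT  "I%03d%ld"

/* Sketch: build the unique assembly name for a global name. */
static void fmt_name(char *out, size_t size, const char *name)
{
    snprintf(out, size, NAME_FMT, name);
}

/* Sketch: build the unique name for a numerical data label. */
static void fmt_dlb(char *out, size_t size, long label)
{
    snprintf(out, size, DLB_FMT, label);
}

/* Sketch: build the unique name for an instruction label. The meaning
 * of the %d argument (here a procedure number) is an assumption. */
static void fmt_ilb(char *out, size_t size, int procno, long label)
{
    snprintf(out, size, ILB_FMT, procno, label);
}
```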
 630 .PP
 631 There are three primitives that have to be defined by the table writer, either
 632 as functions in the file ``mach.c'' or as rules in the EM_table.
 633 For example, for the 8086 they look like this:
 634 .DS
 635 .ft CW
 636 C_jump       ==>       "jmp $1".
 637
 638 C_prolog     ==>       "push bp";
 639                      "mov bp, sp".
 640
 641 C_locals
 642   $1  == 0   ==>     .
 643   $1  == 2   ==>     "push ax".
 644   $1  == 4   ==>     "push ax";
 645                      "push ax".
 646   default    ==>     "sub sp, $1".
 647 \fR
 648 .DE
 649 .NH 2
 650 Generating assembly code
 651 .PP
 652 When the code expander generator is used for generating assembly instead of
 653 object code (see section 5), additional print formats have to be defined
 654 in ``mach.h''. The following table lists these formats.
 655 .TS
 656 tab(#);
 657 l c lw(10c).
 658 BYTE_FMT#:#T{
 659 Print format to allocate and initialize one byte. The format must
 660 contain %ld.
 661 T}
 662 WORD_FMT#:#T{
 663 Print format to allocate and initialize one word. The format must
 664 contain %ld.
 665 T}
 666 LONG_FMT#:#T{
 667 Print format to allocate and initialize one long. The format must
 668 contain %ld.
 669 T}
 670 BSS_FMT#:#T{
 671 Print format to allocate space in the bss segment. The format must
 672 contain %ld (number of bytes).
 673 T}
 674 COMM_FMT#:#T{
 675 Print format to declare a "common". The format must contain a %s (name to be declared
 676 common), followed by a %ld (number of bytes).
 677 T}
 678
 679 SEGTXT_FMT#:#T{
 680 Print format to switch to the text segment.
 681 T}
 682 SEGDAT_FMT#:#T{
 683 Print format to switch to the data segment.
 684 T}
 685 SEGBSS_FMT#:#T{
 686 Print format to switch to the bss segment.
 687 T}
 688
 689 SYMBOL_DEF_FMT#:#T{
 690 Print format to define a label. The format must contain %s.
 691 T}
 692 GLOBAL_FMT#:#T{
 693 Print format to declare a global name. The format must contain %s.
 694 T}
 695 LOCAL_FMT#:#T{
 696 Print format to declare a local name. The format must contain %s.
 697 T}
 698
 699 RELOC1_FMT#:#T{
 700 Print format to initialize a byte with an address expression. The format must
 701 contain %s (name) and %ld (offset).
 702 T}
 703 RELOC2_FMT#:#T{
 704 Print format to initialize a word with an address expression. The format must
 705 contain %s (name) and %ld (offset).
 706 T}
 707 RELOC4_FMT#:#T{
 708 Print format to initialize a long with an address expression. The format must
 709 contain %s (name) and %ld (offset).
 710 T}
 711
 712 ALIGN_FMT#:#T{
 713 Print format to align a segment.
 714 T}
 715 .TE
 716 .NH 1
 717 Description of the as_table
 718 .PP
 719 This section describes the as_table. Like the previous section, it is divided
 720 into
 721 four parts: the first two parts describe the grammar and the semantics of the
 722 as_table; the third part gives an overview
 723 of the functions and the constants that must be present in the as_table (in
 724 the file ``as.h'' or in the file ``as.c''); the last part describes the case when
 725 assembly is generated instead of object code.
 726 The part on semantics contains examples that appear in the as_table for the
 727 VAX or for the 8086.
 728 .NH 2
 729 Grammar
 730 .PP
 731 The form of the as_table is given by the following grammar :
 732 .VS +4
 733 .TS
 734 center tab(#);
 735 l c l.
 736 TABLE#::=#( RULE)*
 737 RULE#::=#( mnemonic | ``...'')   DECL_LIST   ``==>''   ACTION_LIST
 738 DECL_LIST#::=#DECLARATION   ( ``,''   DECLARATION)*
 739 DECLARATION#::=#operand   [ ``:''   type]
 740 ACTION_LIST#::=#ACTION   ( ``;''   ACTION)* ``.''
 741 ACTION#::=#IF_STATEMENT
 742 #|#function-call
 743 #|#``@''function-call
 744 IF_STATEMENT#::=#``@if''   ``('' condition ``)''   ACTION_LIST
 745 ##( ``@elsif''   ``('' condition ``)''   ACTION_LIST)*
 746 ##[ ``@else''   ACTION_LIST]
 747 ##``@fi''
 748 function-call#::=#function-identifier ``('' [arg (,arg)*] ``)''
 749 arg#::=#argument
 750 #|#reference
 751 .TE
 752 .VS -4
 753 .LP
 754 \fBmnemonic\fR, \fBoperand\fR, and \fBtype\fR are all C identifiers;
 755 \fBcondition\fR is a normal C expression;
 756 \fBfunction-call\fR must be a C function call. A function can be called with
 757 standard C arguments or with a reference (see section 4.2.4).
 758 Since the as_table is
 759 interpreted during code expander generation as well as during code
 760 expander execution, two levels of calls are present in it. A ``function-call''
 761 is done during code expander generation, a ``@function-call'' during code
 762 expander execution.
 763 .NH 2
 764 Semantics
 765 .PP
 766 The as_table is made up of rules that map assembly instructions onto
 767 \fBback\fR-primitives, a set of functions that construct an object file.
 768 The table is processed by \fBasg\fR, which generates a C function
 769 for each assembler mnemonic. The names of
 770 these functions are the assembler mnemonics postfixed
 771 with ``_instr'' (e.g., ``add'' becomes ``add_instr()''). These functions
 772 will be used by the function
 773 assemble() during the expansion of the EM_table.
 774 After explaining the semantics of the as_table the function
 775 assemble() will be described.
 776 .NH 3
 777 Rules
 778 .PP
 779 A rule in the as_table is made up of a left and a right hand side;
 780 the left hand side describes an assembler
 781 instruction (mnemonic and operands); the
 782 right hand side gives the corresponding actions as \fBback\fR-primitives or as
 783 functions defined by the table writer, which call \fBback\fR-primitives.
 784 Two simple examples from the VAX as_table and the 8086 as_table, respectively:
 785 .DS
 786 .ft CW
 787 movl src, dst  ==> @text1( 0xd0);
 788                    gen_operand( src);
 789                    gen_operand( dst).
 790     /* ``gen_operand'' is a function that encodes
 791      * operands by calling back-primitives. */
 792
 793 rep ens:MOVS   ==>  @text1( 0xf3);
 794                     @text1( 0xa5).
 795
 796 \fR
 797 .DE
 798 .NH 3
 799 Declaration of types.
 800 .PP
 801 In general, a machine instruction is encoded as an opcode followed by zero or
 802 more
 803 operands. There are two methods for mapping assembler mnemonics
 804 onto opcodes: the mnemonic determines the opcode, or mnemonic and operands
 805 together determine the opcode. Both cases can be
 806 easily expressed in the as_table.
 807 The first case is obvious.
 808 The second case is handled by introducing type fields for the operands.
 809 .PP
 810 When mnemonic and operands together determine the opcode, the table writer has
 811 to give a separate rule for each combination of mnemonic and operands. The rules
 812 differ in the type fields of the operands.
 813 The table writer has to supply functions that check the type
 814 of the operand. The name of such a function is the name of the type; it
 815 has one argument: a pointer to a struct of type \fIt_operand\fR; it returns
 816 non-zero when the operand is of this type, otherwise it returns 0.
 817 .PP
 818 This will usually lead to a list of rules per mnemonic. To reduce the amount of
 819 work an abbreviation is supplied. Once the mnemonic is specified it can be
 820 referred to in the following rules by ``...''.
 821 One has to make sure
 822 that each mnemonic is mentioned only once in the as_table, otherwise
 823 \fBasg\fR will generate more than one function with the same name.
 824 .PP
 825 The following example shows the usage of type fields.
 826 .DS
 827 .ft CW
 828  mov dst:REG, src:EADDR  ==>
 829           @text1( 0x8b);                /* opcode */
 830           mod_RM( %d(dst->reg), src). /* operands */
 831
 832  ... dst:EADDR, src:REG  ==>
 833           @text1( 0x89);                /* opcode */
 834           mod_RM( %d(src->reg), dst). /* operands */
 835 \fR
 836 .DE
 837 The table-writer must supply the restriction functions,
 838 .ft CW
 839 REG\fR and
 840 .ft CW
 841 EADDR\fR in the previous example, in ``as.c'' or ``as.h''.
 842 .NH 3
 843 The function of the @-sign and the if-statement.
 844 .PP
 845 The right hand side of a rule is made up of function calls.
 846 Since the as_table is
 847 interpreted on two levels, during code expander generation and during code
 848 expander execution, two levels of calls are present in it. A function-call
 849 without an ``@''-sign
 850 is called during code expander generation (e.g., the
 851 .ft CW
 852 gen_operand()\fR in the
 853 first example).
 854 A function call with an ``@''-sign is called during code
 855 expander execution (e.g.,
 856 the \fBback\fR-primitives). Calls in this last group therefore become part of the compiler.
 857 .PP
 858 The need for the ``@''-sign construction arises, for example, when
 859 implementing push/pop optimization (e.g., ``push x'' followed by ``pop y''
 860 can be replaced by ``move x, y'').
 861 In this case flags need to be set, unset, and tested during the execution of
 862 the compiler:
 863 .DS L
 864 .ft CW
 865 PUSH src  ==>   /* save in ax */
 866                 mov_instr( AX_oper, src);
 867                 /* set flag */
 868                 @assign( push_waiting, TRUE).
 869 \fR
 870 .DE
 871 .DS
 872 .ft CW
 873 POP dst   ==>   @if ( push_waiting)
 874                        /* ``mov_instr'' is asg-generated */
 875                        mov_instr( dst, AX_oper);
 876                        @assign( push_waiting, FALSE).
 877                 @else
 878                        /* ``pop_instr'' is asg-generated */
 879                        pop_instr( dst).
 880                 @fi.
 881 \fR
 882 .DE
 883 .LP
 884 Although the @-sign is followed syntactically by a
 885 function name, this function can very well be the name of a macro defined in C.
 886 This is in fact the case with ``@assign()'' in the above example.
 887 .PP
 888 It may happen that information is needed that is not known
 889 until execution of
 890 the compiler.  For example, one may need to know whether a ``$\fIi\fR'' argument fits in
 891 one byte.
 892 In this case one can use a special if-statement provided
 893 by \fBasg\fR: @if, @elsif, @else, @fi. The conditions of this statement
 894 are evaluated at
 895 run time of the \fBce\fR. Such a condition may of course refer
 896 to the ``$\fIi\fR'' arguments. For example, constants can be
 897 packed into one or two byte arguments as follows:
 898 .DS
 899 .ft CW
 900 mov dst:ACCU, src:DATA ==>
 901                        @if ( fits_byte( %$(src->expr)))
 902                             @text1( 0xc0);
 903                             @text1( %$(src->expr)).
 904                        @else
 905                             @text1( 0xc8);
 906                             @text2( %$(src->expr)).
 907                        @fi.
 908 .DE
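.LP
The function fits_byte() used above is not one of the \fBback\fR-primitives,
so, like ``@assign()'', it has to be supplied by the table writer.
A minimal sketch, assuming it is written as a macro in ``as.h'' and that
``one byte'' means the signed range (both are assumptions, not taken from an
existing table), could look as follows:
.DS
.ft CW
/* Hypothetical helper: non-zero if val fits in a signed byte. */
#define fits_byte( val)   ( (val) >= -128  &&  (val) <= 127)
\fR
.DE
.LP
Because the macro is reached through ``@if'', the test is compiled into the
code expander and evaluated while the compiler runs.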
 909 .NH 3
 910 References to operands
 911 .PP
 912 As noted before, the operands of an assembler instruction may be used as
 913 pointers to the struct \fIt_operand\fR in the right hand side of the table.
 914 Because of the free format assembler, the types of the fields in the struct
 915 \fIt_operand\fR are unknown to \fBasg\fR. As these fields can appear in calls
 916 to functions, \fBasg\fR must know
 917 these types. This section explains how these types must be specified.
 918 .PP
 919 References to operands come in three forms: ordinary operands, operands that
 920 contain ``$\fIi\fR'' references, and operands that refer to names of local labels.
 921 The ``$\fIi\fR'' in operands represent names or numbers of a \fBC_instr\fR and must
 922 be given as arguments to the \fBback\fR-primitives. Labels in operands
 923 must be converted to a number that tells the distance, the number of bytes,
 924 between the label and the current position in the text-segment.
 925 .LP
 926 All three cases are treated in a uniform way. When the table writer
 927 makes a reference to an operand of an assembly instruction, he must describe
 928 the type of the operand in the following way.
 929 .VS +4
 930 .TS
 931 center tab(#);
 932 l c l.
 933 reference#::=#``%'' conversion
 934 ##``('' operand-name ``\->'' field-name ``)''
 935 conversion#::=# printformat
 936 #|#``$''
 937 #|#``dist''
 938 printformat#::=#see PRINT(3ACK)
 939 .[
 940 PRINT
 941 .]
 942 .TE
 943 .VS -4
 944 .LP
 945 The three cases differ only in the conversion field. The printformat conversion
 946 applies to ordinary operands. The ``%$'' applies to operands that contain
 947 a ``$\fIi\fR''. The expression between parentheses must result in a pointer to
 948 a char. The
 949 result of ``%$'' is of the type of ``$\fIi\fR''. The ``%dist''
 950 applies to operands that refer to a local label. The expression between
 951 parentheses must result in a pointer to a char. The result of ``%dist'' is
 952 of type arith.
 953 .PP
 954 The following example illustrates the usage of ``%$''. (For an
 955 example that illustrates the usage of ordinary fields see
 956 the section on ``User supplied definitions and functions'').
 957 .DS
 958 .ft CW
 959 jmp dst ==>
 960     @text1( 0xe9);
 961     @reloc2( %$(dst->lab), %$(dst->off), PC_REL).
 962 \fR
 963 .DE
 964 .PP
 965 A useful function concerning $\fIi\fRs is arg_type(), which takes as input a
 966 string starting with $\fIi\fR and returns the type of the \fIi\fRth argument
 967 of the current EM-instruction, which can be STRING, ARITH or INT. One may need
 968 this function while decoding operands if the context of the $\fIi\fR does not
 969 give enough information.
 970 If the function arg_type() is used, the file
 971 arg_type.h must contain the definitions of STRING, ARITH and INT.
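.LP
As an illustration, arg_type() can be used while decoding an operand to decide
whether the operand refers to a label.
The helper below is only a sketch: its name matches the contains_label() called
in the ``as.c'' example in the section on assemble() and block_assemble(),
but the body is an assumption and not taken from an existing table.
.DS
.ft CW
#include "arg_type.h"

/* Sketch: non-zero if str holds a ``$i'' whose argument is a string. */
contains_label( str)
char *str;
{
        for ( ; *str; str++)
                if ( *str == '$'  &&  arg_type( str) == STRING)
                        return 1;
        return 0;
}
\fR
.DE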
 972 .PP
 973 %dist is only guaranteed to work when called as a parameter of text1(), text2() or text4().
 974 The goal of the %dist conversion is to reduce the number of reloc1(), reloc2()
 975 and reloc4()
 976 calls, saving space and time (no relocation at compiler run time).
 977 The following example illustrates the usage of ``%dist''.
 978 .DS
 979 .ft CW
 980  jmp dst:ILB    ==> /* label in an instruction list */
 981      @text1( 0xeb);
 982      @text1( %dist( dst->lab)).
 983
 984  ... dst:LABEL  ==> /* global label */
 985      @text1( 0xe9);
 986      @reloc2( %$(dst->lab), %$(dst->off), PC_REL).
 987 \fR
 988 .DE
 989 .NH 3
 990 The functions assemble() and block_assemble()
 991 .PP
 992 The functions assemble() and block_assemble() are provided by \fBceg\fR.
 993 If, however, the table writer is not satisfied with the way they work
 994 he can
 995 supply his own assemble() or block_assemble().
 996 The default function assemble() splits an assembly string into a
 997 label, mnemonic,
 998 and operands and performs the following actions on them:
 999 .IP \0\01:
1000 It processes the local label; it records the name and current position. Thereafter it calls the function process_label() with one argument of type string,
1001 the label. The table writer has to define this function.
1002 .IP \0\02:
1003 Thereafter it calls the function process_mnemonic() with one argument of
1004 type string, the mnemonic. The table writer has to define this function.
1005 .IP \0\03:
1006 It calls process_operand() for each operand. Process_operand() must be
1007 written by the table-writer since no fixed representation for operands
1008 is enforced. It has two arguments: a string (the operand to decode)
1009 and a pointer to the struct \fIt_operand\fR. The declaration of the struct
1010 \fIt_operand\fR must be given in the
1011 file ``as.h'', and the table-writer can put all the information needed for
1012 encoding the operand in machine format in it.
1013 .IP \0\04:
1014 It examines the mnemonic and calls the associated function, generated by
1015 \fBasg\fR, with pointers to the decoded operands as arguments. This makes it
1016 possible to use the decoded operands in the right hand side of a rule (see
1017 below).
1018 .LP
1019 If the default assemble() does not work the way the table writer wants, he
1020 can supply his own version of it. Assemble() has the following arguments:
1021 .DS
1022 .ft CW
1023 assemble( instruction )
1024     char *instruction;
1025 \fR
1026 .DE
1027 \fIinstruction\fR points to a null-terminated string.
1028 .PP
1029 The default function block_assemble() is called with a sequence of assembly
1030 instructions that belong to one action list. It calls assemble() for
1031 every assembly instruction in
1032 this block. But if a special action is
1033 required on a block of assembly instructions, the table writer only has to
1034 rewrite this function to get a new \fBceg\fR that meets his needs.
1035 The function block_assemble() has the following arguments:
1036 .DS
1037 .ft CW
1038 block_assemble( instructions, nr, first, last)
1039       char   **instructions;
1040       int      nr, first, last;
1041 \fR
1042 .DE
1043 \fIInstructions\fR points to an array of pointers to strings representing
1044 assembly instructions. \fINr\fR is
1045 the number of instructions that must be assembled. \fIFirst\fR
1046 and \fIlast\fR have no function in the default block_assemble(), but are
1047 useful when optimizations are done in block_assemble().
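.LP
The default behaviour described above amounts to a simple loop.
The following is a minimal sketch, not the actual \fBceg\fR source; it assumes
only what is stated above, namely that assemble() is called once for every
instruction and that \fIfirst\fR and \fIlast\fR are ignored.
.DS
.ft CW
block_assemble( instructions, nr, first, last)
      char   **instructions;
      int      nr, first, last;
{
      int i;

      /* ``first'' and ``last'' are not used in this sketch. */
      for ( i = 0; i < nr; i++)
              assemble( instructions[i]);
}
\fR
.DE
.LP
A table writer who wants to optimize over a whole action list would replace
the body of the loop and leave assemble() untouched.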
1048 .PP
1049 Four things have to be specified in ``as.h'' and ``as.c''. First the user must
1050 give the declaration of struct \fIt_operand\fR in ``as.h'', and the functions
1051 process_operand(), process_mnemonic(), and process_label() must be given
1052 in ``as.c''. If the right hand side of the as_table
1053 contains function calls other than the \fBback\fR-primitives, these functions
1054 must also be present in ``as.c''. Note that both the ``@''-sign (see 4.2.3)
1055 and ``references'' (see 4.2.4) also work in the functions defined in ``as.c''.
1056 .PP
1057 The following example shows representative and essential parts of the
1058 8086 ``as.h'' and ``as.c'' files.
1059 .nr PS 10
1060 .nr VS 12
1061 .LP
1062 .DS L
1063 .ft CW
1064 /* Constants and type definitions in as.h */
1065
1066 #define        UNKNOWN                0
1067 #define        IS_REG                 0x1
1068 #define        IS_ACCU                0x2
1069 #define        IS_DATA                0x4
1070 #define        IS_LABEL               0x8
1071 #define        IS_MEM                 0x10
1072 #define        IS_ADDR                0x20
1073 #define        IS_ILB                 0x40
1074
1075 #define AX                0
1076 #define BX                3
1077 #define CL                1
1078 #define SP                4
1079 #define BP                5
1080 #define SI                6
1081 #define DI                7
1082
1083 #define REG( op)         ( op->type & IS_REG)
1084 #define ACCU( op)        ( op->type & IS_REG  &&  op->reg == AX)
1085 #define REG_CL( op)      ( op->type & IS_REG  &&  op->reg == CL)
1086 #define DATA( op)        ( op->type & IS_DATA)
1087 #define LABEL( op)       ( op->type & IS_LABEL)
1088 #define ILB( op)         ( op->type & IS_ILB)
1089 #define MEM( op)         ( op->type & IS_MEM)
1090 #define ADDR( op)        ( op->type & IS_ADDR)
1091 #define EADDR( op)       ( op->type & ( IS_ADDR | IS_MEM | IS_REG))
1092 #define CONST1( op)      ( op->type & IS_DATA  && strcmp( "1", op->expr) == 0)
1093 #define MOVS( op)        ( op->type & IS_LABEL&&strcmp("\"movs\"", op->lab) == 0)
1094 #define IMMEDIATE( op)   ( op->type & ( IS_DATA | IS_LABEL))
1095
1096 struct t_operand {
1097         unsigned type;
1098         int reg;
1099         char *expr, *lab, *off;
1100        };
1101
1102 extern struct t_operand saved_op, *AX_oper;
1103 \fR
1104 .DE
1105 .nr PS 12
1106 .nr VS 14
1107 .LP
1108 .nr PS 10
1109 .nr VS 12
1110 .DS L
1111 .ft CW
1112
1113 /* Some functions in as.c. */
1114
1115 #include "arg_type.h"
1116 #include "as.h"
1117
1118 #define last( s)     ( s + strlen( s) - 1)
1119 #define LEFT         '('
1120 #define RIGHT        ')'
1121 #define DOLLAR       '$'
1122
1123 process_operand( str, op)
1124 char *str;
1125 struct t_operand *op;
1126
1127 /*        expr            ->        IS_DATA or IS_LABEL
1128  *        reg             ->        IS_REG or IS_ACCU
1129  *        (expr)          ->        IS_ADDR
1130  *        expr(reg)       ->        IS_MEM
1131  */
1132 {
1133         char *ptr, *index();
1134
1135         op->type = UNKNOWN;
1136         if ( *last( str) == RIGHT) {
1137                 ptr = index( str, LEFT);
1138                 *last( str) = '\0';
1139                 *ptr = '\0';
1140                 if ( is_reg( ptr+1, op)) {
1141                         op->type = IS_MEM;
1142                         op->expr = ( *str == '\0' ? "0" : str);
1143                 }
1144                 else {
1145                         set_label( ptr+1, op);
1146                         op->type = IS_ADDR;
1147                 }
1148         }
1149         else
1150                 if ( is_reg( str, op))
1151                         op->type = IS_REG;
1152                 else {
1153                         if ( contains_label( str))
1154                                 set_label( str, op);
1155                         else {
1156                                 op->type = IS_DATA;
1157                                 op->expr = str;
1158                         }
1159                 }
1160 }
1161
1162 /*********************************************************************/
1163
1164 mod_RM( reg, op)
1165 int reg;
1166 struct t_operand *op;
1167
1168 /* This function helps to encode operands in machine format.
1169  * Note the $-operators
1170  */
1171 {
1172       if ( REG( op))
1173               R233( 0x3, reg, op->reg);
1174       else if ( ADDR( op)) {
1175               R233( 0x0, reg, 0x6);
1176               @reloc2( %$(op->lab), %$(op->off), ABSOLUTE);
1177       }
1178       else if ( strcmp( op->expr, "0") == 0)
1179               switch( op->reg) {
1180                 case SI : R233( 0x0, reg, 0x4);
1181                           break;
1182
1183                 case DI : R233( 0x0, reg, 0x5);
1184                           break;
1185
1186                 case BP : R233( 0x1, reg, 0x6);        /* exception! */
1187                           @text1( 0);
1188                           break;
1189
1190                 case BX : R233( 0x0, reg, 0x7);
1191                           break;
1192
1193                 default : fprint( STDERR, "Wrong index register %d\en",
1194                                   op->reg);
1195               }
1196       else {
1197               @if ( fits_byte( %$(op->expr)))
1198                       switch( op->reg) {
1199                         case SI : R233( 0x1, reg, 0x4);
1200                                   break;
1201
1202                         case DI : R233( 0x1, reg, 0x5);
1203                                   break;
1204
1205                         case BP : R233( 0x1, reg, 0x6);
1206                                   break;
1207
1208                         case BX : R233( 0x1, reg, 0x7);
1209                                   break;
1210
1211                         default : fprint( STDERR, "Wrong index register %d\en",
1212                                           op->reg);
1213                       }
1214                       @text1( %$(op->expr));
1215               @else
1216                       switch( op->reg) {
1217                         case SI : R233( 0x2, reg, 0x4);
1218                                   break;
1219
1220                         case DI : R233( 0x2, reg, 0x5);
1221                                   break;
1222
1223                         case BP : R233( 0x2, reg, 0x6);
1224                                   break;
1225
1226                         case BX : R233( 0x2, reg, 0x7);
1227                                   break;
1228
1229                         default : fprint( STDERR, "Wrong index register %d\en",
1230                                           op->reg);
1231                       }
1232                       @text2( %$(op->expr));
1233               @fi
1234       }
1235 }
1236 \fR
1237 .DE
1238 .nr PS 12
1239 .nr VS 14
1240 .NH 2
1241 Generating assembly code
1242 .PP
1243 It is possible to generate assembly instead of object files (see section 5), in
1244 which case there is no need to supply ``as_table'', ``as.h'', and ``as.c''.
1245 This option is useful for debugging the EM_table.
1246 .NH 1
1247 Building a code expander
1248 .PP
1249 This section describes how to generate a code expander in two phases.
1250 In phase one, the EM_table is
1251 written and assembly code is generated. If the assembly language used is an actual
1252 language, the EM_table can be tested by assembling and running the generated
1253 code.
1254 If an ad-hoc assembly language is used by the table writer, it is not possible
1255 to test the EM_table, but the code generated is at least in readable form.
1256 In the second phase, the as_table is written and object code is generated.
1257 After the generated object code is fed into the loader, it can be tested.
1258 .NH 2
1259 Phase one
1260 .PP
1261 The following is a list of instructions to make a
1262 code expander that generates assembly instructions.
1263 .IP \0\01:
1264 Create a new directory.
1265 .IP \0\02:
1266 Create the ``EM_table'', ``mach.h'', and ``mach.c'' files; there is no need
1267 for ``as_table'', ``as.h'', and ``as.c'' at this moment.
1268 .IP \0\03:
1269 Type
1270 .br
1271 .ft CW
1272 install_ceg -as
1273 \fR
1274 .br
1275 install_ceg will create a Makefile and three directories: ceg, ce, and back.
1276 Ceg will contain the program ceg; this program will be
1277 used to turn ``EM_table'' into a set of C source files (in the ce directory),
1278 one for each
1279 EM-instruction. All these files will be compiled and put in a library called
1280 \fBce.a\fR.
1281 .br
1282 The option
1283 .ft CW
1284 -as\fR means that a \fBback\fR-library will be
1285 generated (in the directory ``back'') that
1286 supports the generation of assembly language. The library is named ``back.a''.
1287 .IP \0\04:
1288 Link a front end, ``ce.a'', and ``back.a'' together resulting in a compiler
1289 that generates assembly code.
1290 .LP
1291 If the table writer has chosen an actual assembly language, the EM_table can be
1292 tested (e.g., by running the compiler on the EM test set). If an error occurs,
1293 change the EM_table and type
1294 .IP
1295 .br
1296 .ft CW
1297 update_ceg\fR \fBC_instr
1298 \fR
1299 .br
1300 .LP
1301 where \fBC_instr\fR stands for the name of the erroneous EM-instruction.
1302 If the table writer has chosen an ad-hoc assembly language, he can at least
1303 read the generated code and look for possible errors. If an error is found,
1304 the same procedure as described above can be followed.
1305 .NH 2
1306 Phase two
1307 .PP
1308 The next phase is to generate a \fBce\fR that produces relocatable object
1309 code.
1310 .IP \0\01:
1311 Remove the ``ce'', ``ceg'', and ``back'' directories.
1312 .IP \0\02:
1313 Write the ``as_table'', ``as.h'', and ``as.c'' files.
1314 .IP \0\03:
1315 Type
1316 .sp
1317 .ft CW
1318 install_ceg -obj \fR
1319 .sp
1320 The option
1321 .ft CW
1322 -obj\fR means that ``back.a'' will contain a library
1323 for generating
1324 ACK.OUT(5ACK) object files, see appendix B.
1325 If the writer does not want to use the default ``back.a'',
1326 the
1327 .ft CW
1328 -obj\fR flag must be omitted and a ``back.a'' should be supplied that
1329 generates object code in the desired format.
1330 .IP \0\04:
1331 Link a front end, ``ce.a'', and ``back.a'' together resulting in a compiler
1332 that generates object code.
1333 .LP
1334 The as_table is ready to be tested. If an error occurs, adapt the table.
1335 Then there are two ways to proceed:
1336 .IP \0\01:
1337 recompile the whole EM_table,
1338 .sp
1339 .ft CW
1340 update_ceg ALL \fR
1341 .sp
1342 .IP \0\02:
1343 recompile just the few EM-instructions that contained the error,
1344 .sp
1345 .ft CW
1346 update_ceg \fBC_instr\fR
1347 .sp
1348 where \fBC_instr\fR is an erroneous EM-instruction.
1349 This has to be done for every EM-instruction that contained the erroneous
1350 assembly instruction.
1351 .NH
1352 Acknowledgements
1353 .PP
1354 We want to thank Henri Bal, Dick Grune, and Ceriel Jacobs for their
1355 valuable suggestions and the critical reading of this paper.
1356 .NH
1357 References
1358 .LP
1359 .[
1360 $LIST$
1361 .]
1362 .bp
1363 .SH
1364 Appendix A, \fRthe \fBback\fR-primitives
1365 .PP
1366 This appendix describes the routines available to generate relocatable
1367 object code. If the default back.a is used, the object code is in
1368 ACK.OUT(5ACK) format.
1369 In the default back.a, the names defined here are remapped to more hidden names,
1370 to avoid name conflicts with, for instance, names used in the front-end. This
1371 remapping is done in an include-file, "back.h".
1372 A user-implemented back.a should do the same thing.
1373 .nr PS 10
1374 .nr VS 12
1375 .PP
1376 .IP A1.
1377 Text and data generation; with ONE_BYTE b; TWO_BYTES w; FOUR_BYTES l; arith n;
1378 .VS +4
1379 .TS
1380 tab(#);
1381 l c lw(10c).
1382 text1( b)#:#T{
1383 Put one byte in text-segment.
1384 T}
1385 text2( w)#:#T{
1386 Put word (two bytes) in text-segment, byte-order is defined by
1387 BYTES_REVERSED in mach.h.
1388 T}
1389 text4( l)#:#T{
1390 Put long (two words) in text-segment, word-order is defined by
1391 WORDS_REVERSED in mach.h.
1392 T}
1393 #
1394 con1( b)#:#T{
1395 Same for CON-segment.
1396 T}
1397 con2( w)#:
1398 con4( l)#:
1399 #
1400 rom1( b)#:#T{
1401 Same for ROM-segment.
1402 T}
1403 rom2( w)#:
1404 rom4( l)#:
1405 #
1406 gen1( b)#:#T{
1407 Same for the current segment, only to be used in the ``..icon'', ``..ucon'', etc.
1408 pseudo EM-instructions.
1409 T}
1410 gen2( w)#:
1411 gen4( l)#:
1412 #
1413 bss( n)#:#T{
1414 Put n bytes in bss-segment, value is BSS_INIT.
1415 T}
1416 common( n)#:#T{
1417 If there is a saved label, generate a "common" for it, of size
1418 n. Otherwise, it is equivalent to bss(n).
1419 (see also the save_label routine).
1420 T}
1421 .TE
1422 .VS -4
1423 .IP A2.
1424 Relocation; with char *s; arith o; int r;
1425 .VS +4
1426 .TS
1427 tab(#);
1428 l c lw(10c).
1429 reloc1( s, o, r)#:#T{
1430 Generates relocation-information for 1 byte in the current segment.
1431 T}
1432 ##s\0:\0the string which must be relocated
1433 ##o\0:\0the offset in bytes from the string.
1434 ##T{
1435 r\0:\0relocation type. It can have the values ABSOLUTE or PC_REL. These
1436 two constants are defined in the file ``back.h''
1437 T}
1438 reloc2( s, o, r)#:#T{
1439 Generates relocation-information for 1 word in the
1440 current segment. Byte-order according to BYTES_REVERSED in mach.h.
1441 T}
1442 reloc4( s, o, r)#:#T{
1443 Generates relocation-information for 1 long in the
1444 current segment. Word-order according to WORDS_REVERSED in mach.h.
1445 T}
1446 .TE
1447 .VS -4
1448 .IP A3.
1449 Symbol table interaction; with int seg; char *s;
1450 .VS +4
1451 .TS
1452 tab(#);
1453 l c lw(10c).
1454 switch_segment( seg)#:#T{
1455 Sets current segment to ``seg'', and does alignment if necessary. ``seg''
1456 can be one of the four constants defined in ``back.h'': SEGTXT, SEGROM,
1457 SEGCON, SEGBSS.
1458 T}
1459 #
1460 symbol_definition( s)#:#T{
1461 Define s in symbol-table.
1462 T}
1463 set_local_visible( s)#:#T{
1464 Record scope-information in symbol table.
1465 T}
1466 set_global_visible( s)#:#T{
1467 Record scope-information in symbol table.
1468 T}
1469 .TE
1470 .VS -4
1471 .IP A4.
1472 Start/end actions; with char *f;
1473 .VS +4
1474 .TS
1475 tab(#);
1476 l c lw(10c).
1477 open_back( f)#:#T{
1478 Directs output to file ``f''; if f is the null pointer, output must be written to
1479 standard output.
1480 T}
1481 close_back()#:#T{
1482 Close output stream.
1483 T}
1484 init_back()#:#T{
1485 Only used with user-written back-library, gives the opportunity to initialize.
1486 T}
1487 end_back()#:#T{
1488 Only used with user-written back-library.
1489 T}
1490 .TE
1491 .VS -4
1492 .IP A5.
1493 Label generation routines; with int n; arith g; char *l; These routines all
1494 return a "char *" to a static area, which is overwritten at each call.
1495 .VS +4
1496 .TS
1497 tab(#);
1498 l c lw(10c).
1499 extnd_pro( n)#:#T{
1500 Label set at the end of procedure \fIn\fP, to generate space for locals.
1501 T}
1502 extnd_start( n)#:#T{
1503 Label set at the beginning of procedure \fIn\fP, to jump back to after generating
1504 space for locals.
1505 T}
1506 extnd_name( l)#:#T{
1507 Create a name for a procedure named \fIl\fP.
1508 T}
1509 extnd_dnam( l)#:#T{
1510 Create a name for an external variable named \fIl\fP.
1511 T}
1512 extnd_dlb( g)#:#T{
1513 Create a name for numeric data label \fIg\fP.
1514 T}
1515 extnd_ilb( l, n)#:#T{
1516 Create a name for instruction label \fIl\fP in procedure \fIn\fP.
1517 T}
1518 extnd_hol( n)#:#T{
1519 Create a name for HOL block number \fIn\fP.
1520 T}
1521 extnd_part( n)#:#T{
1522 Create a unique label for the C_insertpart mechanism.
1523 T}
1524 extnd_cont( n)#:#T{
1525 Create another unique label for the C_insertpart mechanism.
1526 T}
1527 extnd_main( n)#:#T{
1528 Create yet another unique label for the C_insertpart mechanism.
1529 T}
1530 .TE
1531 .VS -4
1532 .IP A6.
1533 Some miscellaneous routines, with char *l;
1534 .VS +4
1535 .TS
1536 tab(#);
1537 l c lw(10c).
1538 save_label( l)#:#T{
1539 Save label \fIl\fP. Unfortunately, in EM, when a label is encountered,
1540 it is not yet
1541 known in which segment it will end up. The save_label/dump_label mechanism
1542 is there to solve this problem.
1543 T}
1544 dump_label()#:#T{
1545 If there is a label saved, force definition for it now.
1546 T}
1547 align_word()#:#T{
1548 Align to a word boundary, if the current segment is not a text segment.
1549 T}
1550 .TE
1551 .VS -4
1552 .nr PS 12
1553 .nr VS 14
1554 .bp
1555 .SH
1556 Appendix B, description of ACK-a.out library
1557 .PP
1558 The object file produced by \fBce\fR is by default in ACK.OUT(5ACK)
1559 format. The object file is made up of one header, followed by
1560 four segment headers, followed by text, data, relocation information,
1561 symbol table, and the string area. The object file is tuned for the ACK-LED,
1562 so there are some special things done just before the object file is dumped.
1563 First, four relocation records are added which contain the names of the four
1564 segments. Second, all the local relocation is resolved. This is done by the
1565 function do_relo(). If there is a record belonging to a local
1566 name this address is relocated in the segment to which the record belongs.
1567 Besides doing the local relocation, do_relo() changes the ``nami''-field
1568 of the local relocation records. This field receives the index of one of the
1569 four
1570 relocation records belonging to a segment. After the local
1571 relocation has been resolved the routine output_back() dumps the
1572 ACK object file.
1573 .LP
1574 If a different a.out format is wanted, one can choose between three strategies:
1575 .IP \ \ 1:
1576 The simplest one is to use a conversion program, which converts the ACK
1577 a.out format to the wanted a.out format. Such a program exists for almost
1578 all machines on which ACK runs. However,
1579 not all conversion programs can generate relocation information.
1580 The disadvantage is that the compiler will become slower.
1581 .IP \ \ 2:
1582 A better solution is to change the functions output_back(), do_relo(),
1583 open_back(), and close_back() in such a way
1584 that they produce the wanted a.out format. This strategy saves a lot of I/O.
1585 .IP \ \ 3:
1586 If this still is not satisfactory, the
1587 \fBback\fR-primitives can be adapted to produce the wanted a.out format.