From: kaashoek
Date: Tue, 12 Apr 1988 14:31:05 +0000 (+0000)
Subject: Initial revision
X-Git-Tag: release-5-5~3445
X-Git-Url: https://git.ndcode.org/public/gitweb.cgi?a=commitdiff_plain;h=ca94deb2c261c27b6a015279e5c23a5150570dca;p=ack.git

Initial revision
---

diff --git a/doc/ceg/ceg.tr b/doc/ceg/ceg.tr
new file mode 100644
index 000000000..b0def7e58
--- /dev/null
+++ b/doc/ceg/ceg.tr
@@ -0,0 +1,1244 @@
+.nr PS 12
+.nr VS 14
+.nr LL 6i
+.TL
+Code expander generator
+.AU
+Frans Kaashoek
+Koen Langendoen
+.AI
+Dept. of Mathematics and Computer Science
+Vrije Universiteit
+Amsterdam, The Netherlands
+.NH
+Introduction
+.PP
+A \fBcode expander\fR (\fBce\fR for short) is a part of the
+Amsterdam Compiler Kit (\fBACK\fR), which provides the user with
+high-speed generation of medium-quality code. Although conceptually
+equivalent to the more usual \fBcode generator\fR, it differs in some
+aspects.
+.LP
+Normally, a program to be compiled with \fBACK\fR
+is first fed into the preprocessor. The output of the preprocessor goes
+into the appropriate front end, whose job it is to produce EM (a
+machine independent low level intermediate code). The generated EM code is fed
+into the peephole optimizer, which scans it with a window of a few instructions,
+replacing certain inefficient code sequences by better ones. After the
+peephole optimizer a backend follows, which produces high quality assembly code.
+The assembly code goes via the target optimizer into the assembler and the
+object code then goes into the
+linker/loader, the final component in the pipeline.
+.LP
+For various applications
+this scheme is too slow, for example, for debugging programs; in this case
+the program has to be compiled fast and the runtime of the program may be
+slower. For this purpose a new scheme is introduced:
+.IP \ \ 1:
+The code generator and assembler have
+been replaced by one program: the \fBcode expander\fR, which directly expands
+the EM-instructions into a relocatable object file.
+The peephole and target optimizers are not used.
+.IP \ \ 2:
+The front end and \fBce\fR have been combined into a single
+program, eliminating the overhead of intermediate files.
+.LP
+This results in a fast compiler producing object files, ready to be
+linked and loaded, at the cost of unoptimized object code.
+.LP
+An extra speedup is gained by the way the code expander works. Instead of
+trying to generate code for a sequence of EM-instructions, like the usual
+code generator, it expands each EM-instruction separately.
+.LP
+Because of the
+simple nature of the code expander, it is much easier to build, to debug and to
+test. Experience has demonstrated that a code expander can be constructed,
+debugged and tested in less than two weeks.
+.LP
+This document describes the tools for automatically generating a
+\fBce\fR (a library of "C"-files), from two tables and
+a few machine-dependent functions.
+To understand this document and the examples it is necessary to have a
+thorough knowledge of EM.
+.NH
+An overview
+.PP
+A code expander consists of a set of routines that convert EM-instructions
+directly to relocatable object code. These routines are called by a front
+end through the
+EM_CODE(3L) interface. To free the table writer of the burden of building
+an object file, we supply a set of routines that build an object file
+in the NEW_A.OUT(5L) format (see appendix B). This set of routines is called
+the
+\fBback\fR-primitives (see appendix A).
+.PP
+To avoid repetition of the same sequences of
+\fBback\fR-primitives in different
+EM-instructions
+and to improve readability, the EM to object information must be supplied in
+two
+tables: one that maps EM to an assembly language, the EM_table, and one
+that maps
+assembler to \fBback\fR-primitives, the as_table. The assembler may be an
+actual assembler or ad-hoc designed by the table writer.
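+.LP
+To make the role of the \fBback\fR-primitives more concrete, the following is a
+minimal sketch in C of two routines that append bytes to a text segment held in
+memory. The names seg_text1() and seg_text4(), the fixed-size buffer and the
+error convention are assumptions made for this illustration only; the actual
+primitives are the ones listed in appendix A. The 4-byte routine writes the
+least significant byte first, which this document gives as the default order.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a text segment: a small fixed-size buffer. */
enum { SEG_MAX = 64 };
static uint8_t seg[SEG_MAX];
static size_t seg_len = 0;

/* Append one byte; returns 0 on success, -1 if the buffer is full. */
int seg_text1(uint8_t b)
{
    if (seg_len >= SEG_MAX)
        return -1;
    seg[seg_len++] = b;
    return 0;
}

/* Append a 4-byte value, least significant byte first. */
int seg_text4(uint32_t v)
{
    int i;

    for (i = 0; i < 4; i++) {
        if (seg_text1((uint8_t)(v >> (8 * i))) != 0)
            return -1;
    }
    return 0;
}
```

+.LP
+A generated routine would call such primitives in sequence, one call per
+opcode byte or operand field.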
+.LP
+The following picture shows the dependencies between the different components:
+.sp
+.PS
+linewid = 0.5i
+A: line down 2i
+B: line down 2i with .start at A.start + (1.5i, 0)
+C: line down 2i with .start at B.start + (1.5i, 0)
+D: arrow right with .start at A.center - (0.25i, 0)
+E: arrow right with .start at B.center - (0.25i, 0)
+F: arrow right with .start at C.center - (0.25i, 0)
+"EM_CODE(3L)" at A.start above
+"EM_table" at B.start above
+"as_table" at C.start above
+"source language " at D.start rjust
+"EM" at 0.5 of the way between D.end and E.start
+G: "assembler" at 0.5 of the way between E.end and F.start
+H: " back primitives" at F.end ljust
+"(user defined)" at G - (0, 0.2i)
+" (NEW_A.OUT)" at H - (0, 0.2i) ljust
+.PE
+.PP
+The entries in the as_table map assembly instructions onto \fBback\fR-primitives.
+The as_table is used to transform the EM - assembly mapping into an EM -
+\fBback\fR-primitives mapping;
+the expanded EM_table is then transformed into a set of C-routines, which are
+normally incorporated in a compiler. All this happens during compiler
+generation time. The C-routines are activated during the
+execution of the compiler.
+.PP
+To illustrate what happens, we give an example. The example is an entry in
+the tables for the VAX-machine. The assembly language chosen is a subset of the
+VAX assembly language.
+.PP
+One of the most fundamental operations in EM is 'loc c', load the value of c
+on the stack. To expand this instruction the
+tables contain the following information:
+.DS
+\f5
+EM_table : C_loc ==> "pushl $$$1".
+           /* $1 refers to the first argument of C_loc. */
+
+
+as_table : pushl src : CONST ==>
+                @text1( 0xd0);
+                @text1( 0xef);
+                @text4( %$( src->num)).
+\fR
+.DE
+.LP
+The following routine will be generated for C_loc:
+.DS
+\f5
+C_loc( c)
+arith c;
+{
+    swtxt();
+    text1( 0xd0);    /* text1(), text4() are library routines, */
+    text1( 0xef);    /* which fill the text segment */
+    text4( c);
+}
+\fR
+.DE
+.LP
+A call by the compiler to 'C_loc' causes the 1-byte numbers '0xd0'
+and '0xef'
+and the 4-byte value of the variable 'c' to be stored in the text segment.
+.PP
+The transformations on the tables are done automatically by the code expander
+generator.
+The code expander generator consists of two tools, one to handle the EM_table,
+emg, and one to handle the as_table, \fBasg\fR. Asg transforms
+each assembly instruction into a C-routine. These C-routines generate calls
+to the \fBback\fR-primitives. Finally, the generated C-routines are used
+by emg to generate from the EM_table the actual code expander.
+.PP
+The link between emg and \fBasg\fR is an assembly language.
+We didn't enforce a specific syntax for the assembly language;
+instead we have chosen to give the table writer the freedom
+to make an ad-hoc assembly language or to use an actual assembly language
+suitable for his purpose. Apart from a greater flexibility this
+has another advantage; if the table writer adopts the assembly language that
+runs on the machine at hand, he can test the EM_table independently from the
+as_table. Of course there is a price to pay; the table writer has to
+do the decoding of the operands himself. See section 4 for more details.
+.PP
+Before we explain the several parts of the ceg, we will give an overview of
+the four important phases.
+.IP "phase 1):"
+.br
+The as_table is transformed by \fBasg\fR. This results in a set of C-routines.
+Each assembly-opcode generates one C-routine.
+.IP "phase 2):"
+.br
+The C-routines generated by \fBasg\fR are used by emg to expand the EM_table.
+This
+results in a set of C-routines, the code expander, which form the procedural
+interface EM_CODE(3L).
+.IP "phase 3):"
+.br
+The front end that uses the procedural interface is linked/loaded with the
+code expander generated in phase 2) and the \fBback\fR-primitives.
+This results in a compiler.
+.IP "phase 4):"
+.br
+Execution of the compiler; the routines in the code expander are
+executed and produce object code.
+.RE
+.NH
+Description of the EM_table
+.PP
+This section describes the EM_table. It contains four subsections:
+a section that describes the syntax of the EM_table; a section that deals with the
+semantics of the EM_table; a section that gives a list of the functions and
+constants that must be present in the EM_table, in the file 'mach.c' or in
+the file 'mach.h'; a section that deals with the case that the table
+writer wants to generate assembly instead of object code. The section on
+semantics contains many examples.
+.NH 2
+Grammar
+.PP
+The following grammar describes the syntax of the EM_table.
+.VS +4
+.TS
+center tab(%);
+l c l.
+TABLE%::=%( RULE)*
+RULE%::=%C_instr ( CONDITIONAL | SIMPLE)
+CONDITIONAL%::=%( condition SIMPLE)+ 'default' SIMPLE
+SIMPLE%::=%( '==>' | '::=') ACTION_LIST
+ACTION_LIST%::=%[ ACTION ( ';' ACTION)* ] '.'
+ACTION%::=%AS_INSTR
+%|%function-call
+.sp
+AS_INSTR%::=%'"' [ label ':'] [ INSTR] '"'
+INSTR%::=%mnemonic [ operand ( ',' operand)* ]
+.TE
+.VS -4
+.PP
+\'(' ')' brackets are used for grouping, '[' ... ']' means ... 0 or 1 time,
+\'*' means zero or more times, '+' means one or more times and '|' means
+a choice between left or right. A \fBC_instr\fR is
+a name in the EM_CODE(3L) interface. \fBcondition\fR is a 'C' expression.
+\fBfunction-call\fR is a call of a 'C' function. \fBlabel\fR, \fBmnemonic\fR
+and \fBoperand\fR are arbitrary strings. If an \fBoperand\fR contains brackets the
+brackets must match. In reality there is an upper bound to the number of
+operands; the maximum number is defined by the constant MAX_OPERANDS in the
+file 'const.h' in the directory assemble.c.
Comments in the table should be
+placed between '/*' and '*/'. Finally, before the table is parsed, the
+C-preprocessor runs.
+.NH 2
+Semantics
+.PP
+The EM_table is processed by \fBemg\fR. \fBEmg\fR generates a C function for every
+instruction in the EM_CODE(3L) interface.
+For every EM-instruction not mentioned in the EM_table a
+C function that prints an error message is generated.
+It is possible to divide the EM_CODE(3L)-interface in four parts:
+.IP \0\01)
+text instructions (e.g., C_loc, C_adi, ..)
+.IP \0\02)
+pseudo instructions (e.g., C_open, C_df_ilb, ..)
+.IP \0\03)
+storage instructions (e.g., C_rom_icon, ..)
+.IP \0\04)
+message instructions (e.g., C_mes_begin, ..)
+.LP
+This section starts with giving the semantics of the grammar. The examples
+are text instructions. The section ends with remarks on the pseudo
+instructions and the storage instructions. Since message instructions aren't
+useful for a code expander, they are ignored.
+.PP
+.NH 3
+Actions
+.PP
+The EM_table consists of rules which describe how to expand a \fBC_instr\fR
+from the EM_CODE(3L)-interface, an EM instruction, into actions.
+There are two kinds of actions: assembly instructions and C function calls.
+An assembly instruction is defined as a mnemonic followed by zero or more
+operands, separated by commas. The semantics of an assembly instruction are
+defined by the table writer. When the assembly language is not expressive
+enough, then, as an escape route, function calls can be made. However, this
+reduces
+the speed of the actual code expander. Finally, actions can be grouped into
+a list of actions; actions are separated by a semicolon and terminated
+by a '.'.
+.DS
+\f5
+C_nop ==> .              /* Empty action list : no operation. */
+
+C_inc ==> "incl (sp)".   /* Assembler instruction, which is evaluated
+                          * during expansion of the EM_table */
+
+C_slu ==> C_sli( $1).    /* Function call, which is evaluated during
+                          * execution of the compiler.
*/
+\fR
+.DE
+.NH 3
+Labels
+.PP
+Since an assembly language without instruction labels is a rather weak
+language, labels inside a contiguous block of assembly instructions are
+allowed. When using labels two rules must be observed:
+.IP \0\01)
+The name of a label should be unique inside an action list.
+.IP \0\02)
+The labels used in an assembler instruction should be defined in the same
+action list.
+.LP
+The following example illustrates the usage of labels.
+.DS
+\f5
+C_cmp ==> "pop bx";        /* Compare the two top */
+          "pop cx";        /* elements on the stack. */
+          "xor ax, ax";
+          "cmp cx, bx";
+          "je 2f";         /* Forward jump to local label */
+          "jb 1f";
+          "inc ax";
+          "jmp 2f";
+          "1: dec ax";
+          "2: push ax".
+\fR
+.DE
+We will come back to labels in the section on the as_table.
+.NH 3
+Arguments of an EM instruction
+.PP
+In most cases the translation of a \fBC_instr\fR depends on its arguments.
+The arguments of a \fBC_instr\fR are numbered from 1 to \fIn\fR, where \fIn\fR
+is the
+total number of arguments of the current \fBC_instr\fR (there are a few
+exceptions, see Implicit arguments). The table writer may
+refer to an argument as $\fIi\fR. If a plain $-sign is needed in an
+assembly instruction, it must be preceded by an extra $-sign.
+.PP
+There are two groups of \fBC_instr\fRs whose arguments are specially handled:
+.RS
+.IP "1) Instructions dealing with local offsets."
+.br
+The value of the $\fIi\fR argument referring to a parameter ($\fIi\fR >= 0),
+is increased by 'EM_BSIZE'. 'EM_BSIZE' is the size of the return status block
+and must be defined in the file 'mach.h', see section 3.3. For example :
+.DS
+\f5
+C_lol ==> "push $1(bp)".   /* automatic conversion of $1 */
+\fR
+.DE
+.IP "2) Instructions using global names or instruction labels"
+.br
+All the arguments referring to global names or instruction labels will be
+transformed into a unique assembly name.
To prevent name clashes with library
+names the table writer has to provide the
+conversions in the file 'mach.h'. For example :
+.DS
+\f5
+C_bra ==> "jmp $1".   /* automatic conversion of $1 */
+                      /* type arith is converted to string */
+\fR
+.DE
+.RE
+.NH 3
+Conditionals
+.PP
+The rules in the EM_table can be divided in two groups: simple rules and
+conditional rules. The simple rules consist of a \fBC_instr\fR followed by
+a list of actions, as described above. The conditional rules (CONDITIONAL)
+allow the table writer to select an action list depending on the value of
+a condition.
+.PP
+A CONDITIONAL is a list of boolean expressions, each with a corresponding
+simple rule. If
+the expression evaluates to true then the corresponding simple rule is carried
+out. If more than one condition evaluates to true, an arbitrary one is chosen.
+The last case of a CONDITIONAL of a \fBC_instr\fR must handle the default case.
+The boolean expression in a CONDITIONAL must be a 'C' expression. Besides the
+ordinary 'C' operators and constants, $\fIi\fR references can be used
+in an expression.
+.DS
+\f5
+C_lxl                 /* Load address of LB $1 levels back. */
+    $1 == 0 ==> "pushl fp".
+    $1 == 1 ==> "pushl 4(ap)".
+    default ==> "movl $$$1, r0";
+                "jsb .lxl";
+                "pushl r0".
+\fR
+.DE
+.NH 3
+Equivalence rule
+.PP
+Among the simple rules there is a special case:
+the equivalence rule. This rule declares two \fBC_instr\fRs equivalent. To
+distinguish it from the usual simple rule '==>' is replaced by a '::='.
+The benefit of an equivalence rule is that the arguments are not converted.
+.DS
+\f5
+C_slu ::= C_sli( $1).
+\fR
+.DE
+.NH 3
+Abbreviations
+.PP
+EM instructions with an external as argument come in three variants in
+the EM_CODE(3L) interface. In most cases it will be possible to take
+these variants together. For this purpose the '..' notation is introduced.
+.DS
+\f5
+/* For the code expander there is no difference between the following
+ * instructions.
*/
+C_loe_dlb  ==> "pushl $1 + $2".
+C_loe_dnam ==> "pushl $1 + $2".
+C_loe      ==> "pushl $1 + $2".
+
+/* So it can be written in the following way.
+ */
+C_loe..    ==> "pushl $1 + $2".
+\fR
+.DE
+.NH 3
+Implicit arguments
+.PP
+In the last example 'C_loe' has two arguments, but in the EM_CODE interface
+it has one argument. However, this argument depends on the current 'hol'
+block; in the EM_table this is made explicit. Every \fBC_instr\fR whose
+argument depends on a 'hol' block has one extra argument; argument 1 refers
+to the 'hol' block.
+.NH 3
+Pseudo instructions
+.PP
+Most pseudo instructions are machine independent and are provided
+by \fBceg\fR. The table writer has only to supply the functions :
+.DS
+\f5
+prolog()
+/* Performs the prolog, for example save return address */
+
+locals( n)
+arith n;
+/* Allocate n bytes for locals on the stack */
+
+jump( label)
+char *label;
+/* Generates code for a jump to 'label' */
+\fR
+.DE
+.LP
+These functions can be defined in 'mach.c' or in the EM_table.
+.NH 3
+Storage instructions
+.PP
+The storage instructions 'C_bss_\fIcstp()\fR', 'C_hol_\fIcstp()\fR',
+\&'C_con_\fIcstp()\fR' and 'C_rom_\fIcstp()\fR', except for the instructions
+dealing with constants of type string ( C_..._icon, C_..._ucon, C_..._fcon), are
+generated automatically. No information is needed in the table.
+To generate the C_..._icon, C_..._ucon, C_..._fcon instructions
+\fBceg\fR only has to know how to convert a number of type string to bytes;
+this can be defined with the constants ONE_BYTE, TWO_BYTES, and FOUR_BYTES.
+C_rom_icon, C_con_icon, C_bss_icon, C_hol_icon can be abbreviated by ..icon.
+This also holds for ..ucon and ..fcon.
+For example :
+.DS
+\f5
+\\.\\.icon
+    $2 == 1 ==> gen1( (ONE_BYTE) atoi( $1)).
+    $2 == 2 ==> gen2( (TWO_BYTES) atoi( $1)).
+    $2 == 4 ==> gen4( (FOUR_BYTES) atoi( $1)).
+    default ==> arg_error( "..icon", $2).
+\fR
+.DE
+Gen1(), gen2() and gen4() are \fBback\fR-primitives, see appendix A, and
+generate one, two or four byte constants. Atoi() is a 'C' library function which
+converts strings to integers.
+The constants 'ONE_BYTE', 'TWO_BYTES' and 'FOUR_BYTES' must be defined in
+the file 'mach.h'.
+.NH 2
+User supplied definitions and functions
+.PP
+If the table writer uses all the default functions he has only to supply
+the following constants and functions :
+.TS
+tab(#);
+l c lw(10c).
+prolog()#:#T{
+Do prolog
+T}
+jump( l)#:#T{
+Perform a jump to label l
+T}
+locals( n)#:#T{
+Allocate n bytes on the stack
+T}
+#
+NAME_FMT#:#T{
+Print format describing name to a unique name conversion. The format must
+contain %s.
+T}
+DNAM_FMT#:#T{
+Print format describing data-label to a unique name conversion. The format
+must contain %s.
+T}
+DLB_FMT#:#T{
+Print format describing numerical-data-label to a unique name conversion.
+The format must contain a %ld.
+T}
+ILB_FMT#:#T{
+Print format describing instruction-label to a unique name conversion.
+The format must contain %d followed by %ld.
+T}
+HOL_FMT#:#T{
+Print format describing hol-block-number to a unique name conversion.
+The format must contain %d.
+T}
+#
+EM_WSIZE#:#T{
+Size of a word in bytes on the target machine
+T}
+EM_PSIZE#:#T{
+Size of a pointer in bytes on the target machine
+T}
+EM_BSIZE#:#T{
+Size of base block in bytes on the target machine
+T}
+#
+ONE_BYTE#:#T{
+\\'C'-type which occupies one byte on the machine where the \fBce\fR runs
+T}
+TWO_BYTES#:#T{
+\\'C'-type which occupies two bytes on the machine where the \fBce\fR runs
+T}
+FOUR_BYTES#:#T{
+\\'C'-type which occupies four bytes on the machine where the \fBce\fR runs
+T}
+#
+BSS_INIT#:#T{
+The default value which the loader puts in the bss segment
+T}
+#
+BYTES_REVERSED#:#T{
+Must be defined if you want the byte order reversed.
+By default the least significant byte is output first.
+T}
+WORD_REVERSED#:#T{
+Must be defined if you want the word order reversed.
+By default the least significant word is output first.
+T}
+.TE
+.LP
+An example of the file 'mach.h' for the vax4 with 4.1 BSD - UNIX.
+.TS
+tab(:);
+l l l.
+#define : ONE_BYTE : char
+#define : TWO_BYTES : short
+#define : FOUR_BYTES : long
+:
+#define : EM_WSIZE : 4
+#define : EM_PSIZE : 4
+#define : EM_BSIZE : 0
+:
+#define : BSS_INIT : 0
+:
+#define : NAME_FMT : "_%s"
+#define : DNAM_FMT : "_%s"
+#define : DLB_FMT : "_%ld"
+#define : ILB_FMT : "I%03d%ld"
+#define : HOL_FMT : "hol%d"
+.TE
+.nr PS 12
+.nr VS 20
+Notice that EM_BSIZE is zero. The vax4 takes care of this automatically.
+.PP
+There are three routines which have to be defined by the table writer. The
+table writer can define them as ordinary "C"-functions in the file "mach.c" or
+define them in the EM_table. For example, for the 8086 it looks like this:
+.DS
+\f5
+jump ==> "jmp $1".
+
+prolog ==> "push bp";
+           "mov bp, sp".
+
+locals
+    $1 == 0 ::= .
+    $1 == 2 ==> "push ax".
+    $1 == 4 ==> "push ax";
+                "push ax".
+    default ==> "sub sp, $1".
+\fR
+.DE
+.NH 2
+Generating assembly
+.PP
+When assembly is generated instead of object code, the constants 'BYTES_REVERSED' and 'WORD_REVERSED' are not needed.
+.NH 1
+Description of the as_table
+.PP
+This section describes the as_table. Like the previous section it is divided in
+four parts: the first part describes the grammar of the as_table; the second
+part describes the semantics of the as_table; the third part gives an overview
+of the functions and the constants that must be present in the as_table, in
+the file 'as.h' or in the file 'as.c'; the last part describes the case when
+assembly is generated instead of object code.
+The part on semantics contains examples which appear in the as_table for the
+VAX or for the 8086.
+.NH 2
+Grammar
+.PP
+The formal form of the as_table is given by the following grammar :
+.VS +4
+.TS
+center tab(#);
+l c l.
+TABLE#::=#( RULE)*
+RULE#::=#( mnemonic | '...') DECL_LIST '==>' ACTION_LIST
+DECL_LIST#::=#DECLARATION ( ',' DECLARATION)*
+DECLARATION#::=#operand [ ':' type]
+ACTION_LIST#::=#ACTION ( ';' ACTION)* '.'
+ACTION#::=#IF_STATEMENT
+#|#function-call
+#|#@function-call
+IF_STATEMENT#::=#'@if' '(' condition ')' ACTION_LIST
+##( '@elsif' '(' condition ')' ACTION_LIST)*
+##[ '@else' ACTION_LIST]
+##'@fi'
+.TE
+.VS -4
+.LP
+\fBmnemonic\fR, \fBoperand\fR and \fBtype\fR are all C-identifiers,
+\fBcondition\fR is a normal C-expression.
+\fBfunction-call\fR must be a C function call.
+.NH 2
+Semantics
+.PP
+The as_table consists of rules which map assembly instructions onto
+\fBback\fR-primitives, a set of functions that write in the object file.
+The table is processed by \fBasg\fR, which generates a set of C-functions,
+one for each assembler mnemonic. (The names of
+these functions are the assembler mnemonics postfixed with '_instr', e.g.
+\'add' becomes 'add_instr()'.) These functions will be used by the function
+assemble() during the expansion of the EM_table.
+After explaining the semantics of the as_table the function
+assemble() will be described.
+.NH 3
+Rules
+.PP
+A rule in the as_table consists of a left and right side;
+the left side describes an assembler instruction (mnemonic and operands); the
+right side gives the corresponding actions as \fBback\fR-primitives or as
+functions, defined by the table writer, that call \fBback\fR-primitives.
+A simple example from the VAX as_table and the 8086 as_table:
+.DS L
+\f5
+movl src, dst ==> @text1( 0xd0);
+                  gen_operand( src);   /* function that encodes operands */
+                  gen_operand( dst).   /* by calling back-primitives. */
+
+rep ens:MOVS  ==> @text1( 0xf3);
+                  @text1( 0xa5).
+
+\fR
+.DE
+.NH 3
+Declaration of types.
+.PP
+In general a machine instruction is encoded as an opcode optionally followed by
+the operands, but there are two methods for mapping assembler mnemonics
+onto opcodes : the mnemonic determines the opcode, or mnemonic and operands
+determine the opcode. Both cases can be easily expressed in the as_table.
+The first case is obvious. For the second case type fields for the operands
+are introduced.
+.LP
+When both mnemonic and operands determine the opcode, the table writer has
+to give several rules, one for each combination of mnemonic and operand types. The rules
+differ in the type fields of the operands.
+The table writer has to supply functions that check the type
+of the operand. The name of such a function is the name of the type; it
+has one argument: a pointer to a struct of type t_operand; it returns
+1 when the operand is of this type, otherwise it returns 0.
+.LP
+This will usually lead to a list of rules per mnemonic. To reduce the amount of
+work an abbreviation is supplied. Once the mnemonic is specified it can be
+referred to in the following rules by '...'.
+One has to make sure
+that each mnemonic is mentioned only once in the as_table, otherwise \fBasg\fR will
+generate more than one function with the same name.
+.LP
+The following example shows the usage of type fields.
+.DS L
+\f5
+    mov dst:REG, src:EADDR ==> @text1( 0x8b);               /* opcode */
+                               mod_RM( %d(dst->reg), src).  /* operands */
+
+    ... dst:EADDR, src:REG ==> @text1( 0x89);               /* opcode */
+                               mod_RM( %d(src->reg), dst).  /* operands */
+\fR
+.DE
+The table-writer must supply the restriction functions, \f5REG\fR and
+\f5EADDR\fR in the previous example, in 'as.c'/'as.h'.
+.NH 3
+The function of the @-sign and the if-statement.
+.PP
+The righthand side of a rule consists of function calls. Some of the
+functions generate object code directly (e.g., the \fBback\fR-primitives),
+others are needed for further assembly (e.g., \f5gen_operand()\fR in the
+first example).
The last group will be evaluated during the expansion
+of the EM_table, while the first group is incorporated in the compiler.
+This is denoted by the @-sign in front of the \fBback\fR-primitives.
+.LP
+The next example concerns the use of the '@'-sign in front of a function
+written
+by the table writer. The need for this construction arises when you implement push/pop
+optimization; flags need to be set/unset and tested during the execution of
+the compiler:
+.DS L
+\f5
+PUSH src ==> mov_instr( AX_oper, src);       /* save in ax */
+             @assign( push_waiting, TRUE).   /* set flag */
+
+POP dst  ==> @if ( push_waiting)
+                 mov_instr( dst, AX_oper);   /* asg-generated */
+                 @assign( push_waiting, FALSE).
+             @else
+                 pop_instr( dst).            /* asg-generated */
+             @fi.
+\fR
+.DE
+.PP
+A problem arises when information is needed that is not known until execution of
+the compiler. For example one needs to know if a '$\fIi\fR' argument fits in
+one byte.
+In this case one can use a special if-statement provided by \fBasg\fR:
+@if, @elsif, @else, @fi. This means that the conditions will be evaluated at
+runtime of the \fBce\fR. In such a condition one may of course refer to the
+\&'$\fIi\fR' arguments. For example, constants can be packed into one or two byte
+arguments:
+.DS L
+\f5
+mov dst:ACCU, src:DATA ==> @if ( fits_byte( %$(src->expr)))
+                               @text1( 0xc0);
+                               @text1( %$(src->expr)).
+                           @else
+                               @text1( 0xc8);
+                               @text2( %$(src->expr)).
+                           @fi.
+.DE
+.NH 3
+References to operands
+.PP
+As mentioned before, the operands of an assembler instruction may be used as
+pointers, to the struct t_operand, in the righthand side of the table.
+Because of the free format assembler, the types of the fields in the struct
+t_operand are unknown to \fBasg\fR. Clearly \fBasg\fR must know these types.
+This section explains how these types must be specified.
+.LP
+References to operands come in three forms: ordinary operands, operands that
+contain '$\fIi\fR' references, and operands that refer to names of local labels.
+The '$\fIi\fR' in operands represent names or numbers of a \fBC_instr\fR and must
+be given as arguments to the \fBback\fR-primitives. Labels in operands
+must be converted to a number that tells the distance, the number of bytes,
+between the label and the current position in the text-segment.
+.LP
+All three cases are treated in a uniform way. When the table writer
+makes a reference to an operand of an assembly instruction, he must describe
+the type of the operand in the following way.
+.DS
+\f5
+    reference   := '%' conversion '(' operand-name '->' field-name ')'
+    conversion  := printformat |
+                   '$' |
+                   'dist'
+    printformat := see PRINT(3ACK)
+\fR
+.DE
+The three cases differ only in the conversion field. The first conversion
+applies to ordinary operands. The second applies to operands that contain
+a '$\fIi\fR'. The expression between brackets must be of type char *. The
+result of '%$' is of the type of '$\fIi\fR'. The
+third applies to operands that refer to a local label. The expression between
+the brackets must be of type char *. The result of '%dist' is of type arith.
+.LP
+The following example illustrates the usage of '%$'. (For an
+example that illustrates the usage of ordinary fields see the example in
+the section on 'User supplied definitions and functions'.)
+.DS L
+\f5
+jmp dst ==> @text1( 0xe9);
+            @reloc2( %$(dst->lab), %$(dst->off), PC_REL).
+\fR
+.DE
+.LP
+A useful function concerning $\fIi\fRs is arg_type(), which takes as input a
+string starting with $\fIi\fR and returns the type of the \fIi\fR'th argument
+of the current EM-instruction, which can be STRING, ARITH or INT. One may need
+this function while decoding operands if the context of the $\fIi\fR doesn't
+give enough information.
+If the function arg_type() is used, the file
+arg_type.h must contain the definition of STRING, ARITH and INT.
+.LP
+%dist is only guaranteed to work when called as a parameter of text1(), text2() or text4().
+The goal of the %dist conversion is to reduce the number of reloc1(), reloc2()
+and reloc4()
+calls, saving space and time (no relocation at compiler runtime).
+.LP
+The following example illustrates the usage of '%dist'.
+.DS L
+\f5
+    jmp dst:ILB   ==> @text1( 0xeb);    /* label in an instructionlist */
+                      @text1( %dist( dst->lab)).
+
+    ... dst:LABEL ==> @text1( 0xe9);    /* global label */
+                      @reloc2( %$(dst->lab), %$(dst->off), PC_REL).
+\fR
+.DE
+.NH 3
+The functions assemble() and block_assemble()
+.PP
+Assemble() and block_assemble() are two functions that are provided by \fBceg\fR.
+However, if one is not satisfied with the way they work the table writer can
+supply his own assemble() or block_assemble().
+The default function assemble() splits an assembly string into a label, mnemonic
+and operands and performs the following actions on them:
+.IP \0\01)
+It processes the local label: it records the name and current position. Thereafter it calls the function process_label() with one argument of type string,
+the label. The table writer has to define this function.
+.IP \0\02)
+Thereafter it calls the function process_mnemonic() with one argument of
+type string, the mnemonic. The table writer has to define this function.
+.IP \0\03)
+It calls process_operand() for each operand. Process_operand() must be
+written by the table-writer since no fixed representation for operands
+is enforced. It has two arguments, a string (the operand to decode)
+and a pointer to the struct t_operand. The declaration of the struct
+t_operand must be given in the
+file 'as.h', and the table-writer can put in it all the information needed for
+encoding the operand in machine format.
+.IP \0\04)
+It examines the mnemonic and calls the associated function, generated by
+\fBasg\fR, with pointers to the decoded operands as arguments. This makes it
+possible to use the decoded operands in the right hand side of a rule (see
+below).
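+.LP
+The splitting step that precedes the four actions above can be sketched in C as
+follows. This is an illustration only, not the assemble() supplied by \fBceg\fR:
+the names split_instr() and struct parsed_instr are hypothetical, the value of
+MAX_OPERANDS is assumed (the real bound is defined in 'const.h'), and the calls
+to process_label(), process_mnemonic() and process_operand() are left out.
+Commas inside matching brackets are not treated as operand separators.

```c
#include <string.h>

#define MAX_OPERANDS 4   /* assumed value for this sketch only */

struct parsed_instr {
    char *label;                  /* NULL when the string has no label */
    char *mnemonic;               /* NULL when only a label is present */
    char *operand[MAX_OPERANDS];
    int   n_operands;
};

/* Skip leading blanks and tabs. */
static char *skip_blanks(char *s)
{
    while (*s == ' ' || *s == '\t')
        s++;
    return s;
}

/* Cut trailing blanks and tabs off a string, in place. */
static void trim_tail(char *s)
{
    size_t n = strlen(s);

    while (n > 0 && (s[n - 1] == ' ' || s[n - 1] == '\t'))
        s[--n] = '\0';
}

/* Split "label: mnemonic op1, op2" in place.
 * Returns 0 on success, -1 when there are too many operands. */
int split_instr(char *s, struct parsed_instr *out)
{
    char *p, *colon = NULL;
    int depth = 0;

    out->label = out->mnemonic = NULL;
    out->n_operands = 0;

    /* A label is the text before a ':' that occurs before any blank. */
    s = skip_blanks(s);
    for (p = s; *p != '\0' && *p != ' ' && *p != '\t'; p++) {
        if (*p == ':') {
            colon = p;
            break;
        }
    }
    if (colon != NULL) {
        *colon = '\0';
        out->label = s;
        s = skip_blanks(colon + 1);
    }
    if (*s == '\0')
        return 0;

    /* The mnemonic runs up to the first blank. */
    out->mnemonic = s;
    while (*s != '\0' && *s != ' ' && *s != '\t')
        s++;
    if (*s == '\0')
        return 0;
    *s++ = '\0';
    s = skip_blanks(s);
    if (*s == '\0')
        return 0;

    /* Operands are separated by commas outside brackets. */
    for (;;) {
        if (out->n_operands == MAX_OPERANDS)
            return -1;
        out->operand[out->n_operands++] = s;
        for (p = s; *p != '\0'; p++) {
            if (*p == '(')
                depth++;
            else if (*p == ')')
                depth--;
            else if (*p == ',' && depth == 0)
                break;
        }
        if (*p == '\0') {
            trim_tail(s);
            return 0;
        }
        *p = '\0';
        trim_tail(s);
        s = skip_blanks(p + 1);
    }
}
```

+.LP
+The string is modified in place, so the caller must pass a writable copy of the
+assembly string rather than a string literal.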
+.PP
+The default function block_assemble() is called with a sequence of assembly
+instructions that belong to one action list. For every assembly instruction
+in
+this block assemble() is called. But, if a special action is
+required on a block of assembly instructions, the table writer only has to
+rewrite this function to get a new \fBceg\fR that complies with his wishes.
+.PP
+Only four things have to be specified in 'as.h' and 'as.c'. First the user must
+give the declaration of struct t_operand in 'as.h', and the functions
+process_operand(), process_mnemonic() and process_label() must be given
+in 'as.c'. If the right side of the as_table
+contains function calls other than the \fBback\fR-primitives, these functions
+must also be present in 'as.c'. Note that both the '@'-sign and 'references'
+also work in
+the functions defined in 'as.c'. Example, part of 8086 'as.h' and 'as.c'
+files :
+.nr PS 10
+.nr VS 12
+.DS L
+\f5
+/*============== as.h ========================================*/
+
+/* type of operand */
+#define UNKNOWN 0
+#define IS_REG 0x1
+#define IS_ACCU 0x2
+#define IS_DATA 0x4
+#define IS_LABEL 0x8
+#define IS_MEM 0x10
+#define IS_ADDR 0x20
+#define IS_ILB 0x40
+
+/* restriction macros */
+#define REG( op) ( op->type & IS_REG)
+#define DATA( op) ( op->type & IS_DATA)
+#define LABEL( op) ( op->type & IS_LABEL)
+#define ILB( op) ( op->type & IS_ILB)
+#define MEM( op) ( op->type & IS_MEM)
+#define ADDR( op) ( op->type & IS_ADDR)
+
+/* decoded information */
+struct t_operand {
+    unsigned type;
+    int reg;
+    char *expr, *lab, *off;
+    };
+\fR
+.DE
+.DS L
+\f5
+/*============== as.c ========================================*/
+
+#include "as.h"
+#include "arg_type.h"
+
+
+#define last( s) ( s + strlen( s) - 1)
+#define LEFT '('
+#define RIGHT ')'
+
+decode_operand( str, op)
+char *str;
+struct t_operand *op;
+
+/* Operands in i86-assembly have the following syntax :
+ *
+ * expr -> IS_DATA | IS_LABEL | IS_ILB
+ * reg -> IS_REG
+ * (expr) -> IS_ADDR
+ *
expr(reg) -> IS_MEM + */ +{ + char *ptr, *index(); + + op->type = UNKNOWN; + if ( *last( str) == RIGHT) { /* (expr) or expr(reg) */ + ptr = index( str, LEFT); + *last( str) = '\\\\0'; + *ptr = '\\\\0'; + if ( is_reg( ptr+1, op)) { /* expr(reg) */ + op->type = IS_MEM; + op->expr = ( *str == '\\\\0' ? "0" : str); + } + else { + set_label( ptr+1, op); /* (expr) */ + op->type = IS_ADDR; + } + } + else + if ( is_reg( str, op)) + op->type = IS_REG; + else { + if ( contains_label( str)) + set_label( str, op); + else { + op->type = IS_DATA; + op->expr = str; + } + } +} + + +mod_RM( reg, op) +int reg; +struct t_operand *op; + +/* This function helps to encode operands in machine format, + * note the $-operators + */ +{ + if ( REG( op)) + @R233( 0x3, reg, op->reg); + else if ( ADDR( op)) { + @R233( 0x0, reg, 0x6); + @reloc2( %$(op->lab), %$(op->off), !PC_REL); + } + else if ( strcmp( op->expr, "0") == 0) + switch( op->reg) { + case SI : @R233( 0x0, %d(reg), 0x4); + break; + + case DI : @R233( 0x0, %d(reg), 0x5); + break; + + case BP : @R233( 0x1, %d(reg), 0x6); + @text1( 0); + break; + + case BX : @R233( 0x0, %d(reg), 0x7); + break; + + default : fprint( STDERR, "Wrong index register %d\\\\n", + op->reg); + } + else { + switch( op->reg) { + case SI : @R233( 0x2, %d(reg), 0x4); + break; + + case DI : @R233( 0x2, %d(reg), 0x5); + break; + + case BP : @R233( 0x2, %d(reg), 0x6); + break; + + case BX : @R233( 0x2, %d(reg), 0x7); + break; + + default : fprint( STDERR, "Wrong index register %d\\\\n", + op->reg); + } + @text2( %$(op->expr)); + } +} +\fR +.DE +.nr PS 12 +.nr VS 20 +If one is unsatisfied with the default assemble() function, one may put one's +own version in the file 'as.c'; assemble() has one string-argument. +.NH 2 +Generating assembly +.PP +It is possible to generate assembly instead of object files (see section 5), in +which case one doesn't have to supply 'as_table', 'as.h' and 'as.c'. This option +is useful for debugging the EM_table.
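Returning to the mod_RM() example in 'as.c' above: the calls to R233() build the 8086 ModRM byte, which consists of a 2-bit mode field, a 3-bit register field and a 3-bit r/m field. The definition of R233() is not shown in this section, so the sketch below merely assumes that it packs its three arguments into one byte in that order before handing the byte to text1(). The name pack233() is hypothetical and is used to keep the sketch apart from the real routine.

```c
#include <assert.h>

/* Hypothetical stand-in for the byte that R233( mod, reg, rm) is
 * assumed to emit: bits 7-6 = mod, bits 5-3 = reg, bits 2-0 = r/m.
 */
static unsigned char pack233(unsigned mod, unsigned reg, unsigned rm)
{
    assert(mod <= 0x3 && reg <= 0x7 && rm <= 0x7);
    return (unsigned char)((mod << 6) | (reg << 3) | rm);
}
```

Under this assumption, the register-to-register case of mod_RM() (mode 0x3) with reg 0 and op->reg 3 gives the byte 0xc3, the direct-address case (mode 0x0, r/m 0x6) with reg 2 gives 0x16, and the BP case with a zero displacement (mode 0x1, r/m 0x6) with reg 1 gives 0x4e, after which mod_RM() emits the one-byte displacement 0.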
+.NH 1 +Building a ce +.PP +This section describes how to generate a code expander. The best way to +generate one is to build it in two phases. In phase one, the EM_table is +written and tested. In the second phase, the as_table is written and tested. +.NH 2 +Phase one +.PP +The following is a list of instructions that describe how to make a +code expander that generates assembly instructions. +.IP \0\0-1 +Create a new directory. +.IP \0\0-2 +Create the 'EM_table', 'mach.h' and 'mach.c' files; there is no need +for 'as_table', 'as.h' and 'as.c' at this moment. +.IP \0\0-3 +Type +.br +\f5 +install_ceg -as +\fR +.br +install_ceg will create a Makefile and three directories: ceg, ce and back. +Ceg will contain the program ceg; this program will be +used to turn 'EM_table' into a set of C-source files (in the ce directory), +one for each +EM-instruction. All these files will be compiled and put in a library called +\fBce.a\fR. +.br +The option \f5-as\fR means that a \fBback\fR-library will be generated (in the directory back) that +supports the generation of assembly language. The library is named 'back.a'. +.IP \0\0-4 +Link a front end, 'ce.a' and 'back.a' together resulting in a compiler. +.LP +Now, the EM_table can be tested; if an error occurs, change the table +and type +\f5 +.DS +\f5update\fR \fBC_instr\fR + where \fBC_instr\fR stands for the name of the erroneous EM-instruction. +.DE +\fR +.NH 2 +Phase two +.PP +The next phase is to generate a \fBce\fR that produces relocatable object +code. +.IP \0\0-1 +Remove the 'ce' and 'ceg' directories. +.IP \0\0-2 +Write the 'as_table', 'as.h' and 'as.c' files. +.IP \0\0-3 +Type +.br +\f5 +install_ceg -obj +\fR +.br +The option \f5-obj\fR means that 'back.a' will contain a library for generating +NEW_A.OUT(5L) object files, see appendix B. If another 'back.a' is used, +omit the \f5-obj\fR flag. +.IP \0\0-4 +Link a front end, 'ce.a' and 'back.a' together resulting in a compiler.
+.LP +The as_table is ready to be tested. If an error occurs, change the table. +Then there are two ways to proceed: +.IP \0\0-1 +recompile the whole EM_table, +.br +\f5 +update ALL +\fR +.br +.IP \0\0-2 +recompile just the few EM-instructions that contained the error, +\f5 +.br +update \fBC_instr\fR +.br +where \fBC_instr\fR is an erroneous EM-instruction. +\fR +.NH +References +.PP +.IP \ \1: +PRINT(3ACK), an ACK manual page. +.IP \ \2: +EM_CODE(3L), an ACK manual page. +.IP \ \3: +NEW_A.OUT(5L), an ACK manual page. +.IP \ \4: +The C programming language, B.W. Kernighan & D.M. Ritchie. +.IP \ \5: +Description of a Machine Architecture for use with Block Structured +Languages (IR-81), A.S. Tanenbaum & H. van Staveren & E.G. Keizer & +J.H. Stevenson. +.bp +.SH +Appendix A, \fRthe \fBback\fR-primitives +.PP +This appendix describes the routines available to generate relocatable +object code. If the default back.a is used, the object code is in +ACK NEW_A.OUT(5L) format. +.nr PS 10 +.nr VS 12 +.PP +.IP A1. +Text and data generation; with ONE_BYTE b; TWO_BYTES w; FOUR_BYTES l; arith n; +.VS +4 +.TS +tab(#); +l c lw(10c). +text1( b)#:#T{ +Put one byte in text-segment. +T} +text2( w)#:#T{ +Put word (two bytes) in text-segment, byte-order is defined by +BYTES_REVERSED in mach.h. +T} +text4( l)#:#T{ +Put long (two words) in text-segment, word-order is defined by +WORDS_REVERSED in mach.h. +T} +# +con1( b)#:#T{ +Same for CON-segment. +T} +con2( w)#: +con4( l)#: +# +rom1( b)#:#T{ +Same for ROM-segment. +T} +rom2( w)#: +rom4( l)#: +# +gen1( b)#:#T{ +Same for the current segment, only to be used in the "..icon", "..ucon", etc. +pseudo EM-instructions. +T} +gen2( w)#: +gen4( l)#: +# +bss( n)#:#T{ +Put n bytes in bss-segment, value is BSS_INIT. +T} +.TE +.VS -4 +.IP A2. +Relocation; with char *s; arith o; int r; +.VS +4 +.TS +tab(#); +l c lw(10c). +reloc1( s, o, r)#:#T{ +Generates relocation-information for 1 byte in the current segment.
+T} +##s\0:\0the string which must be relocated +##o\0:\0the offset in bytes from the string. +##T{ +r\0:\0relocation type. It can have the values ABSOLUTE or PC_REL. These +two constants are defined in the file 'back.h'. +T} +reloc2( s, o, r)#:#T{ +Generates relocation-information for 1 word in the +current segment. Byte-order according to BYTES_REVERSED in mach.h. +T} +reloc4( s, o, r)#:#T{ +Generates relocation-information for 1 long in the +current segment. Word-order according to WORDS_REVERSED in mach.h. +T} +.TE +.VS -4 +.IP A3. +Symbol table interaction; with int seg; char *s; +.VS +4 +.TS +tab(#); +l c lw(10c). +switch_segment( seg)#:#T{ +Sets current segment to 'seg', and does alignment if necessary. +'seg' can be one of the four constants defined in 'back.h': SEGTXT, SEGROM, +SEGCON, SEGBSS. +T} +# +symbol_definition( s)#:#T{ +Define s in symbol-table. +T} +set_local_visible( s)#:#T{ +Record scope-information in symbol table. +T} +set_global_visible( s)#: +.TE +.VS -4 +.IP A4. +Start/end actions; with char *f; +.VS +4 +.TS +tab(#); +l c lw(10c). +do_open( f)#:#T{ +Directs output to file 'f'; if f is the null pointer, output must be given on +standard output. +T} +output()#:#T{ +End of the job, flush output. +T} +do_close()#:#T{ +Close output stream. +T} +init_back()#:#T{ +Only used with user-written back-library, gives the opportunity to initialize. +T} +end_back()#:#T{ +Only used with user-written back-library. +T} +.TE +.VS -4 +.nr PS 12 +.nr VS 14 +.bp +.SH +Appendix B, description of ACK-a.out library +.PP +The object file produced by \fBce\fR is by default in ACK NEW_A.OUT(5L) +format. The object file consists of one header, followed by +four segment headers, followed by text, data, relocation information, +symbol table and the string area. The object file is tuned for the ACK-LED, +so there are some special things done just before the object file is dumped. +First, the four relocation records are added which contain the names of the four +segments.
Second, all the local relocation is resolved. This is done by the +function do_relo(). If there is a record belonging to a local +name this address is relocated in the segment to which the record belongs. +Besides doing the local relocation, do_relo() changes the 'nami'-field +of the local relocation records. This field receives the index of one of the +four +relocation records belonging to a segment. After the local +relocation has been resolved the routine output() dumps the ACK object file. +.LP +If a different a.out format is wanted, one can choose between three strategies: +.IP \ \1: +The simplest one is to use a conversion program, which converts the ACK +a.out format to the wanted a.out format. This program exists for almost +all machines on which ACK runs. The disadvantage is that the compiler +will become slower. +.IP \ \2: +A better solution is to change the functions output(), do_relo(), do_open() +and do_close() in such a way +that they produce the wanted a.out format. This strategy saves a lot of I/O. +.IP \ \3: +If you are still not satisfied and have a lot of spare time, change the +\fBback\fR-primitives in such a way that they produce the wanted a.out format.
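The local relocation step performed by do_relo(), as described earlier in this appendix, can be pictured roughly as follows. This is a simplified sketch and not the ACK implementation: the structs sym and reloc, the function resolve_local() and the table seg_name_index are hypothetical, only absolute two-byte relocation with the low byte first is handled, and it is an assumption that 'nami' ends up referring to the record that names the segment in which the local symbol is defined.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified records; the real NEW_A.OUT(5L) layout differs. */
struct sym {
    int      is_local;   /* non-zero: name is local to this object file */
    int      seg;        /* segment in which the name is defined */
    unsigned value;      /* address of the name within that segment */
};

struct reloc {
    int      seg;        /* segment containing the bytes to patch */
    unsigned addr;       /* offset of those bytes within the segment */
    int      nami;       /* index of the name this record refers to */
};

/* Resolve every relocation record that refers to a local name.
 * seg_name_index[s] is assumed to be the index of the record that
 * carries the name of segment s.
 */
static void resolve_local(unsigned char *segdata[],
                          const int seg_name_index[],
                          const struct sym *syms,
                          struct reloc *rel, size_t n_rel)
{
    size_t i;

    assert(segdata != NULL && syms != NULL && rel != NULL);
    for (i = 0; i < n_rel; i++) {
        const struct sym *s = &syms[rel[i].nami];
        unsigned char *p;
        unsigned v;

        if (!s->is_local)
            continue;                    /* left for the linker */

        /* add the symbol's address to the two bytes in the segment */
        p = segdata[rel[i].seg] + rel[i].addr;
        v = (unsigned)p[0] | ((unsigned)p[1] << 8);
        v += s->value;
        p[0] = (unsigned char)(v & 0xff);
        p[1] = (unsigned char)((v >> 8) & 0xff);

        /* the record now refers to a segment name, not to the local name */
        rel[i].nami = seg_name_index[s->seg];
    }
}
```

Records that refer to non-local names are left untouched in this sketch, since resolving them is the task of the linker.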