From b52eecb904ec0ade5d2d6610b7399ce428a46384 Mon Sep 17 00:00:00 2001
From: kaashoek <none@none>
Date: Mon, 16 May 1988 10:32:30 +0000
Subject: [PATCH] Initial revision

---
 doc/ceg/proposal.tr  | 284 +++++++++++++++++++++++++++++++++++++++++++
 doc/ceg/prototype.tr | 276 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 560 insertions(+)
 create mode 100644 doc/ceg/proposal.tr
 create mode 100644 doc/ceg/prototype.tr

diff --git a/doc/ceg/proposal.tr b/doc/ceg/proposal.tr
new file mode 100644
index 000000000..0063bb623
--- /dev/null
+++ b/doc/ceg/proposal.tr
@@ -0,0 +1,284 @@
+.TL
+
+Code Expander
+.br
+(proposal)
+
+.SH
+Introduction
+.LP
+The \fBcode expander\fR, \fBce\fR, is a program that translates EM-code to
+objectcode. The main goal is to translate very fast. \fBce\fR is an instance
+of the EM_CODE(3L)-interface. During execution of \fBce\fR, \fBce\fR will build
+in core a machine independent objectfile ( NEW A.OUT(5L)). With \fBcv\fR or 
+with routines supplied by the user the machine independent objectcode will
+be converted to a machine dependent object code. \fBce\fR needs 
+information about the targetmachine (e.g. the opcode's). We divide the
+information into two parts:
+.IP
+- The description in assembly instructions of EM-code instructions.
+.IP
+- The description in objectcode of assembly instructions.
+.LP
+With these two tables we can make a \fBcode expander generator\fR which
+generates a \fBce\fR. It is possible to put the information in one table
+but that will probably introduce (propable) more bugs in the table. So we 
+divide and conquer.  With this approach it is also possible to generate 
+assembly code ( rather yhan objectcode), wich is useful for debugging.
+There is of course a link between the two tables, the link 
+consist of a restriction on the assembly format. Every assembly
+instruction must have the following format:
+.sp
+	INSTR ::= LABEL : MNEMONIC  [ OPERAND ( "," OPERAND)* ]
+.sp
+.LP
+\fBCeg\fR uses the following algorithm:
+.IP \0\0a)
+The assembly table will be converted to a (C-)routine assemble().
+assemble() gets as argument a string, the assembler instruction,
+and can use the MNEMONIC to execute the corresponding action in the 
+assembly table.
+.IP \0\0b)
+The routine assemble() can now be used to convert the EM-code table to
+a set of C-routines, wich together form an instance of the
+EM_CODE(3L).
+.SH
+The EM-instruction table
+.LP
+We use the following grammar:
+.sp
+.TS
+center box ;
+l.
+TABLE ::= (ROW)*
+ROW   ::= C_instr ( SPECIAL | SIMPLE)
+SPECIAL ::= ( CONDITION SIMPLE)+  'default'  SIMPLE
+SIMPLE ::= '==>' ACTIONLIST | '::=' ACTIONLIST
+ACTIONLIST ::= [ ACTION ( ';' ACTION)* ] '.'
+ACTION ::= function-call | assembly-instruction
+.TE
+.LP
+An example for the 8086:
+.LP
+.DS
+C_lxl
+	$arg1 == 0  ==>  "push bp".
+	$arg1 == 1  ==>  "push EM_BSIZE(bp)".
+        default     ==>  "mov cx, $arg1";
+		         "mov si, bp";
+		         "1: mov si, EM_BSIZE(si);
+		         "loop 1b"
+		         "push si".
+.DE
+.sp
+Some remarks:
+.sp
+* The C_instr is a function indentifier in the EM_CODE(3L)-interface.
+.LP
+* CONDITION is a "boolean" C-expression. 
+.LP
+* The arguments of an EM-instruction can be used in CONDITION and in assembly
+instructions. They are referred by $arg\fIi\fR. \fBceg\fR modifies the 
+arguments as follows:
+.IP \0\0-
+For local variables at positive offsets it increases this offset by EM_BSIZE
+.IP \0\0-
+It makes names en labels unique. The user must supply the formats (see mach.h).
+.LP
+* function-call is allowed to implement e.g. push/pop optimization.
+For example:
+.LP
+.DS
+C_adi   
+	$arg1 == 2   ==> combine( "pop ax");
+		 	 combine( "pop bx");
+		 	 "add ax, bx";
+                         save( "push ax").
+        default      ==> arg_error( "C_adi", $arg1).
+.DE
+.LP
+* The C-functions called in the EM-instructions table have to use the routine
+assemble()/gen?(). "assembler-instr" is in fact assemble( "assembler-instr").
+.LP
+* \fBceg\fR takes care not only about the conversions of arguments but also 
+about
+changes between segments. There are situation when one doesn't want 
+conversion of arguments. This can be done by using ::= in stead of ==>.
+This is usefull when two C_instr are equivalent. For example:
+.IP
+C_slu  ::=  C_sli( $arg1)
+.LP
+* There are EM-CODE instructions wich are machine independent (e.g. C_open()).
+For these EM_CODE instructions \fBceg\fR will generate \fIdefault\fR-
+instructions. There is one exception: in the case of C_pro() the tablewriter 
+has to supply a function prolog().
+.LP
+* Also the EM-pseudoinstructions C_bss_\fIcstp\fR(), C_hol_\fIcstp\fR(),
+C_con_\fIcstp\fR() and C_rom_\fIcstp\fR can be translated automaticly.
+\fBceg\fR only has to know how to interpretate string-constants: 
+.DS
+\&..icon  $arg2 == 1  ==>  gen1( (char) atoi( $arg1))
+          $arg2 == 2  ==>  gen2( atoi( $arg1))
+          $arg2 == 4  ==>  gen4( atol( $arg1))
+\&..ucon  $arg2 == 1  ==>  gen1( (char) atoi( $arg1))
+	  $arg2 == 2  ==>  gen2( atoi( $arg1))
+    	  $arg2 == 4  ==>  gen4( atol( $arg1))
+\&..fcon  ::=  not_implemented( "..fcon")
+.DE
+.LP
+* Still, life can be made easier for the tablewriter; For the routines wich 
+he/she didn't implement \fBceg\fR will generate a default instruction wich
+generates an error-message. \fBceg\fR seems to generate :
+.IP
+C_xxx  ::=  not_implemented( "C_xxx")
+.SH
+The assembly table
+.LP
+How to map assembly on objectcode.
+.LP
+Each row in the table consists of two fields, one field for the assembly
+instruction, the other field for the corresponding objectcode. The tablewriter
+can use the following primitives to generate code for the machine
+instructions :
+.IP "\0\0gen1( b)\0\0:" 17
+generates one byte in de machine independent objectfile.
+.IP "\0\0gen2( w)\0\0:" 17
+generates one word ( = two bytes), the table writer can change the byte
+order by setting the flag BYTES_REVERSED.
+.IP "\0\0gen4( l)\0\0:" 17
+generates two words ( = four bytes), the table writer can change the word
+order by setting the flag WORDS_REVERSED.
+.IP "\0\0reloc( n, o, r)\0\0:" 17
+generates relocation information for a label ( = name + offset +
+relocationtype).
+.LP
+Besides these primitives the table writer may use his self written
+C-functions. This allows the table writer e.g. to write functions to set
+bitfields within a byte.
+.LP
+There are more or less two methods to encode the assembly instructions:
+.IP \0\0a)
+MNEMONIC and OPERAND('s) are encoded independently of each other. This can be
+done when the target machine has an orthogonal instruction set (e.g. pdp-11).
+.IP \0\0b)
+MNEMONIC and OPERAND('s) together determine the opcode. In this case the
+assembler often uses overloading: one MNEMONIC is used for several
+different machine-instructions. For example : (8086)
+.br
+	mov ax, bx
+.br
+	mov ax, variable
+.br
+These instructions have different opcodes.
+.LP
+As the transformation MNEMONIC-OPCODE is not one to
+one the table writer must be allowed to put restrictions on the operands.
+This can be done with type declarations. For example:
+.LP
+.DS
+	mov  dst:REG, src:MEM  ==>
+		gen1( 0x8b);
+		modRM( op2.reg, op1);
+.DE
+.DS
+	mov  dst:REG, src:REG  ==>
+		gen1( 0x89);
+		modRM( op2.reg, op1);
+.DE
+.LP
+modRM() is a function written by the tablewriter and is used to encode
+the operands. This frees the table writer of endless typing.
+.LP
+The table writer has to do the "typechecking" by himself. But typechecking
+is almost the same as operand decoding. So it's more efficient to do this
+in one function. We now have all the tools to describe the function
+assemble(). 
+.IP
+assemble() first calls the function
+decode_operand() ( by the table writer written), with two arguments: a 
+string ( the operand) and a
+pointer to a struct. The struct is declared by the table writer and must
+consist of at least a field called type. ( the other fields in the struct can
+be used to remember information about the decoded operand.) Now assemble()
+fires a row wich is selected by mapping the MNEMONIC and the type of the 
+operands. 
+.br
+In the second field of a row there may be references to other
+fields in the struct (e.g. op2.reg in the example above).
+.LP
+We ignored one problem. It's possible when the operands are encoded, that
+not everything is known. For example $arg\fIi\fR arguments in the
+EM-instruction table get their value at runtime. This problem is solved by
+introducing a function eval(). eval() has a string as argument and returns
+an arith. The string consists of constants and/or $arg\fIi\fR's and the value
+returned by eval() is the value of the string. To encode the $arg\fIi\fR's
+in as few bytes as possible the table writer can use the statements %if,
+%else and %endif. They can be used in the same manner as #if, #else and
+#endif in C and result in a runtime test. An example : 
+.LP
+.DS
+ -- Some rows of the assembly table
+ 
+ mov dst:REG, src:DATA  ==>
+        %if  sfit( eval( src), 8)   /* does the immediate-data fit in 1 byte? */
+ 		R53( 0x16 , op1.reg);
+ 	     	gen1( eval( src));
+        %else
+ 	      	R53( 0x17 , op1.reg);
+ 	      	gen2( eval( src));
+        %endif
+.LD
+ 
+ mov dst:REG, src:REG  ==>
+        gen1( 0x8b);
+        modRM( op1.reg, op2);
+ 
+.DE 
+.DS
+ -- The corresponding part in the function assemble() :
+ 
+ case MNEM_mov : 
+ 		decode_operand( arg1, &op1);
+ 		decode_operand( arg2, &op2);
+ 		if ( REG( op1.type) && DATA( op2.type)) {
+ 			printf( "if ( sfit( %s, 8)) {\\\\n", eval( src));
+ 			R53( 0x16 , op1.reg);
+ 			printf( "gen1( %s)\\\\n", eval( arg2));
+ 			printf( "}\\\\nelse {\\\\n");
+ 			R53( 0x17 , op1.reg);
+ 			printf( "gen2( %s)\\\\n", eval( arg2));
+ 			printf( "}\\\\n");
+ 		}
+ 		else if ( REG( op1.type) && REG( op2.type)) {
+ 			gen1( 0x8b);
+ 			modRM( op1.reg, op2);
+ 		}
+ 
+ 
+.DE
+.DS
+ -- Some rows of the right part of the EM-instruction table are translated
+ -- in the following C-functions.
+
+ "mov ax, $arg1" ==>
+ 	if ( sfit( w, 8)) {	/* w is the actual argument of C_xxx( w) */
+ 		gen1( 176);	/* R53() */
+ 		gen1( w);
+ 	}
+ 	else {
+ 		gen1( 184);
+ 		gen2( w);
+ 	}
+.LD 
+
+ "mov ax, bx"    ==> 
+ 	gen1( 138);
+  	gen1( 99);		/* modRM() */
+.DE
+.SH
+Restrictions
+.LP
+.IP \0\01)
+The EM-instructions  C_exc() is not implemented.
+.IP \0\03)
+All messages are ignored.
diff --git a/doc/ceg/prototype.tr b/doc/ceg/prototype.tr
new file mode 100644
index 000000000..c5c5d91bd
--- /dev/null
+++ b/doc/ceg/prototype.tr
@@ -0,0 +1,276 @@
+.TL 
+A prototype Code expander
+.NH
+Introduction
+.PP
+A program to be compiled with ACK is first fed into the preprocessor.
+The output of the preprocessor goes into the appropiate front end,
+whose job it is to produce EM. The EM code generated is
+fed into the peephole optimizer, wich scans it with a window of few 
+instructions, replacing certain inefficient code sequences by better
+ones. Following the peephole optimizer follows a backend wich produces
+good assembly code. The assembly code goes into the assembler and the objectcode
+then goes into the loader/linker, the final component in the pipeline.
+.PP
+For various applications this scheme is too slow. For example for testing
+programs; In this case the program has to be translated fast and the 
+runtime of the objectcode may be slower. A solution is to build a code
+expander ( \fBce\fR) wich translates EM code to objectcode. Of course this 
+has to
+be done automaticly by a code expander generator, but to get some feeling
+for the problem we started out to build prototypes. 
+We built two types of ce's. One wich tranlated EM to assembly, one
+wich translated EM to objectcode.
+.NH
+EM to assembly
+.PP
+We made one for the 8086 and one for the vax4. These ce's are instances of the
+EM_CODE(3L)-interface and produce for a single EM instruction a set 
+of assembly instruction wich are semantic equivalent.
+We implemented in the 8086-ce push/pop-optimalization.
+.NH
+EM to objectcode
+.PP
+Instead of producing assembly code we tried to produce vax4-objectcode.
+During execution of ce, ce builds in core a machine independent
+objectfile ( NEW A.OUT(5L)) and just before dumping the tables this
+objectfile is converted to a Berkly 4.2BSD a.out-file. We build two versions;
+One with static memory allocation and one with dynamic memory allocation.
+If the first one runs out of memory it will give an error message and stop,
+the second one will allocate more memory and proceed with producing 
+objectcode.
+.PP
+The C-frontend calls the EM_CODE-interface. So after linking the frontend
+and the ce we have a pipeline in a program saving a lot of i/o.
+It is interesting to compare this C-compiler ( called fcemcom) with "cc -c". 
+fcemcom1 (the dynamic variant of fcemcom) is tuned in such a way, that
+alloc() won't be called.
+.NH 2
+Compile time
+.PP
+fac.c is a small program that produces n! ( see below). foo.c is small program
+that loops a lot.
+.TS
+center, box, tab(:);
+c | c | c | c | c | c
+c | c | n | n | n | n.
+compiler : program : real : user : sys : object size
+=
+fcemcom : sort.c : 31.0 : 17.5 : 1.8 : 23824
+fcemcom1 : : 59.0 : 21.2 : 3.3 : 
+cc -c : : 50.0 : 38.0 : 3.5 : 6788
+_
+fcemcom : ed.c : 37.0 : 23.6 : 2.3 : 41744
+fcemcom1 : : 1.16.0 : 28.3 : 4.6 : 
+cc -c : : 1.19.0 : 54.8 : 4.3 : 11108
+_
+fcemcom : cp.c :  4.0 : 2.4 : 0.8 : 4652
+fcemcom1 : : 9.0 : 3.0 : 1.0 : 
+cc -c : :  8.0 : 5.2 : 1.6 : 1048
+_
+fcemcom : uniq.c : 5.0 : 2.5 : 0.8 : 5568
+fcemcom1 : : 9.0 : 2.9 : 0.8 : 
+cc -c : : 13.0 : 5.4 : 2.0 : 3008
+_
+fcemcom : btlgrep.c : 24.0 : 7.2 : 1.4 : 12968
+fcemcom1 : : 23.0 : 8.1 : 1.2 : 
+cc -c : : 1.20.0 : 15.3 : 3.8 : 2392
+_
+fcemcom : fac.c : 1.0 : 0.1 : 0.5 : 216
+fecmcom1 : : 2.0 : 0.2 : 0.5 : 
+cc -c : : 3.0 : 0.7 : 1.3 : 92
+_
+fcemcom : foo.c : 4.0 : 0.2 : 0.5 : 272
+fcemcom1 : : 11.0 : 0.3 : 0.5 : 
+cc -c : : 7.0 : 0.8 : 1.6 : 108
+.TE
+.NH 2
+Run time
+.LP
+Is the runtime very bad?
+.TS
+tab(:), box, center;
+c | c | c | c | c
+c | c | n | n | n.
+compiler : program : real : user : system
+=
+fcem : sort.c : 22.0 : 17.5 : 1.5
+cc : : 5.0 : 2.4 : 1.1
+_
+fcem : btlgrep.c : 1.58.0 : 27.2 : 4.2
+cc : : 12.0 : 3.6 : 1.1
+_
+fcem : foo.c : 1.0 : 0.7 : 0.1
+cc : : 1.0 : 0.4 : 0.1
+_
+fcem : uniq.c : 2.0 : 0.5 : 0.3
+cc : : 1.0 : 0.1 : 0.2
+.TE
+.NH 2
+quality object code
+.LP
+The runtime is very bad so its interesting to have look at the code which is
+produced by fcemcom and by cc -c. I took a program which computes recursively
+n!.
+.DS
+long fac();
+
+main()
+{
+	int n;
+
+	scanf( "%D", &n); 
+	printf( "fac is %D\\\\n", fac( n));
+}
+
+long fac( n)
+int n;
+{
+	if ( n == 0)
+		return( 1);
+	else
+		return( n * fac( n-1));
+}
+.DE
+.br
+.br
+.br
+.br
+.LP
+"cc -c fac.c" produces :
+.DS 
+fac:	tstl 4(ap)
+	bnequ 7f
+	movl $1, r0
+	ret
+7f:	subl3 $1, 4(ap), r0
+	pushl r0
+	call $1, fac
+	movl r0, -4(fp)
+	mull3 -4(fp), 4(ap), r0
+	ret
+.DE
+.br
+.br
+.LP
+"fcem fac.c fac.o" produces :
+.DS 
+_fac:		0
+42:		jmp	be
+48:		pushl	4(ap)
+4e:		pushl	$0
+54:		subl2	(sp)+,(sp)
+57:		tstl	(sp)+
+59:		bnequ	61
+5b:		jmp	67
+61:		jmp	79
+67:		pushl	$1
+6d:		jmp	ba
+73:		jmp	b9
+79:		pushl	4(ap)
+7f:		pushl	$1
+85:		subl2	(sp)+,(sp)
+88:		calls	$0,_fac
+8f:		addl2	$4,sp
+96:		pushl	r0
+98:		pushl	4(ap)
+9e:		pushl	$4
+a4:		pushl	$4
+aa:		jsb	.cii
+b0:		mull2	(sp)+,(sp)
+b3:		jmp	ba
+b9:		ret
+ba:		movl	(sp)+,r0
+bd:		ret
+be:		jmp	48
+.DE
+.NH 1
+Conclusions
+.PP
+comparing "cc -c" with "fcemcom"
+.LP
+.TS
+center, box, tab(:);
+c | c  s | c | c  s
+^ | c  s | ^ | c  s
+^ | c | c | ^ | c | c
+l | n | n | n | n | n.
+program : compile time : object size : runtime
+:_::_
+: user : sys :: user : sys
+=
+sort.c : 0.47 : 0.5 : 3.5 : 7.3 : 1.4
+_
+ed.c : 0.46 : 0.5 : 3.8 : : :
+_
+cp.c : 0.46 : 0.5 : 4.4 : : :
+_
+uniq.c : 0.46 : 0.4 : 1.8 : : :
+_
+btlgrep.c : 0.47 : 0.3 : 5.4 : 7.5 : 3.8
+_
+fac.c : 0.14 : 0.4 : 2.3 : 1.8 : 1.0
+_
+foo.c : 0.25 : 0.3 : 2.5 : 5.0 : 1.5
+.TE
+.PP
+The results for fcemcom1 are almost identical; The only thing that changes
+is that fcemcom1 is 1.2 slower than fcemcom. ( compile time) This is due to
+to an another datastructure . In the static version we use huge array's for 
+the text- and 
+data-segment, the relocation information, the symboltable and stringarea.
+In the dynamic version we use linked lists, wich makes it expensive to get
+and to put a byte on a abritrary memory location. So it is probably better
+to use realloc(), because in the most cases there will be enough memory. 
+.PP
+The quality of the objectcode is very bad. The reason is that the frontend
+generates bad code and expects the peephole-optimizer to improve the code.
+This is also one of the main reasons that the runtime is very bad.
+(e.g. the expensive "cii" with arguments 4 and 4 could be deleted.) 
+So its seems a good
+idea to put a new peephole-optimizer between the frontend and the ce.
+.PP
+Using the peephole optimizer the ce would produce :
+.DS
+_fac:	0
+	pushl	4(ap)
+	tstl	(sp)+
+	beqlu	1f
+	jmp	3f
+ 1 :	pushl	$1
+	jmp	2f
+ 3 :	pushl	4(ap)
+	decl	(sp)
+	calls	$0,_fac
+	addl2	$4,sp
+	pushl	r0
+	pushl	4(ap)
+	mull2	(sp)+,(sp)
+	movl	(sp)+,r0
+  2 :   ret
+.DE
+.PP
+Bruce McKenzy already implemented it and made some improvements in the
+source code of the ce. The compile-time is two to two and a half times better 
+and the
+size of the objectcode is two to three times bigger.(comparing with "cc -c")
+Still we could do better.
+.PP
+Using peephole- and push/pop-optimization ce could produce :
+.DS 
+_fac:		0
+	tstl	4(ap)
+	beqlu	1f
+	jmp	2f
+  1 :	pushl	$1
+	jmp	3f
+  2 :	decl	4(ap)
+	calls	$0,_fac
+	addl2	$4,sp
+	mull3	4(ap), r0, -(sp)
+	movl 	(sp)+, r0
+  3 : 	ret
+.DE
+.PP
+prof doesn't cooperate, so no profile information.
+.PP
-- 
2.34.1