From: ceriel
Date: Tue, 3 Mar 1987 10:59:52 +0000 (+0000)
Subject: Initial revision
X-Git-Tag: release-5-5~4526
X-Git-Url: https://git.ndcode.org/public/gitweb.cgi?a=commitdiff_plain;h=004f017550b333f2feacca1e3320939831ef1b7d;p=ack.git

Initial revision
---

diff --git a/doc/ego/ic/ic1 b/doc/ego/ic/ic1
new file mode 100644
index 000000000..6347bc73f
--- /dev/null
+++ b/doc/ego/ic/ic1
@@ -0,0 +1,57 @@
+.bp
+.NH
+The Intermediate Code and the IC phase
+.PP
+In this chapter the intermediate code of the EM global optimizer
+will be defined.
+The 'Intermediate Code construction' phase (IC),
+which builds the initial intermediate code from
+EM Compact Assembly Language,
+will be described.
+.NH 2
+Introduction
+.PP
+The EM global optimizer is a multi-pass program,
+hence there is a need for an intermediate code.
+Usually, programs in the Amsterdam Compiler Kit use the
+Compact Assembly Language format
+.[~[
+keizer architecture
+.], section 11.2]
+for this purpose.
+Although this code has some convenient features,
+such as being compact,
+it is quite unsuitable in our case,
+for a number of reasons.
+First, the code lacks global information
+about whole procedures or whole basic blocks.
+Second, it uses identifiers ('names') to bind
+defining and applied occurrences of
+procedures, data labels and instruction labels.
+Although this is usual in high level programming
+languages, it is awkward in an intermediate code
+that must be read many times.
+Each pass of the optimizer would have
+to incorporate an identifier look-up mechanism
+to associate a defining occurrence with each
+applied occurrence of an identifier.
+Finally, EM programs declare blocks of bytes,
+rather than variables. A 'hol 6' instruction may be used to
+declare three 2-byte variables.
+Clearly, the optimizer wants to deal with variables, and
+not with rows of bytes.
+.PP
+To overcome these problems, we have developed a new
+intermediate code.
+This code does not merely consist of the EM instructions,
+but also contains global information in the
+form of tables and graphs.
+Before describing the intermediate code we will
+first digress to outline
+the problems one generally encounters
+when trying to store complex data structures such as
+graphs outside the program, i.e. in a file.
+We trust this will enhance the
+comprehensibility of the
+intermediate code definition and the design and implementation
+of the IC phase.
diff --git a/doc/ego/ic/ic2 b/doc/ego/ic/ic2
new file mode 100644
index 000000000..13715626a
--- /dev/null
+++ b/doc/ego/ic/ic2
@@ -0,0 +1,146 @@
+.NH 2
+Representation of complex data structures in a sequential file
+.PP
+Most programmers are quite used to dealing with
+complex data structures, such as
+arrays, graphs and trees.
+There are some particular problems that occur
+when storing such a data structure
+in a sequential file.
+We call data that is kept in
+main memory
+.UL internal ,
+as opposed to
+.UL external
+data
+that is kept in a file outside the program.
+.sp
+We assume a simple data structure of a
+scalar type (integer, floating point number)
+has some known external representation.
+An
+.UL array
+having elements of a scalar type can easily be
+represented externally, by successively
+representing its elements.
+The external representation may be preceded by a
+number, giving the length of the array.
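+.sp
+As a small illustration, a minimal sketch in C of this scheme
+for an array of integers, assuming a decimal text representation
+for the scalars (the representations actually used by the
+optimizer are described below):
+.DS
+#include <stdio.h>
+
+/* one possible external representation of an integer */
+put_int(f, v)
+FILE *f;
+int v;
+{
+    fprintf(f, "%d\en", v);
+}
+
+/* an array: its length first, then its elements */
+put_array(f, a, n)
+FILE *f;
+int a[], n;
+{
+    int i;
+
+    put_int(f, n);
+    for (i = 0; i < n; i++)
+        put_int(f, a[i]);
+}
+.DE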
+Now, consider a linear, singly linked list,
+the elements of which look like:
+.DS
+record
+   data: scalar_type;
+   next: pointer_type;
+end;
+.DE
+It is important to note that the "next"
+fields of the elements only have a meaning within
+main memory.
+The field contains the address of some location in
+main memory.
+If a list element is written to a file in
+some program,
+and read by another program,
+the element will be allocated at a different
+address in main memory.
+Hence this address value is completely
+useless outside the program.
+.sp
+One may represent the list by ignoring these "next" fields
+and storing the data items in the order they are linked.
+The "next" fields are represented \fIimplicitly\fR.
+When the file is read again,
+the same list can be reconstructed.
+In order to know where the external representation of the
+list ends,
+it may be useful to put the length of
+the list in front of it.
+.sp
+Note that arrays and linear lists have the
+same external representation.
+.PP
+A doubly linked, linear list,
+with elements of the type:
+.DS
+record
+   data: scalar_type;
+   next,
+   previous: pointer_type;
+end
+.DE
+can be represented in precisely the same way.
+Both the "next" and the "previous" fields are represented
+implicitly.
+.PP
+Next, consider a binary tree,
+the nodes of which have type:
+.DS
+record
+   data: scalar_type;
+   left,
+   right: pointer_type;
+end
+.DE
+Such a tree can be represented sequentially,
+by storing its nodes in some fixed order, e.g. prefix order.
+A special null data item may be used to
+denote a missing left or right son.
+For example, let the scalar type be integer,
+and let the null item be 0.
+Then the tree of fig. 3.1(a)
+can be represented as in fig. 3.1(b).
+.DS
+              4
+            /   \e
+          9       12
+         / \e     /  \e
+       12   3   4    6
+           / \e   \e  /
+          8   1   5 1
+
+Fig. 3.1(a) A binary tree
+
+
+4 9 12 0 0 3 8 0 0 1 0 0 12 4 0 5 0 0 6 1 0 0 0
+
+Fig. 3.1(b) Its sequential representation
+.DE
+We are still able to represent the pointer fields ("left"
+and "right") implicitly.
+.PP
+Finally, consider a general
+.UL graph ,
+where each node has a "data" field and
+pointer fields,
+with no restriction on where they may point.
+Now we're at the end of our tale.
+There is no way to represent the pointers implicitly,
+as we did for lists and trees.
+In order to represent them explicitly,
+we use the following scheme.
+Every node gets an extra field,
+containing some unique number that identifies the node.
+We call this number its
+.UL id.
+A pointer is represented externally as the id of the node
+it points to.
+When reading the file we use a table that maps
+an id to the address of its node.
+In general this table will not be completely filled in
+until we have read the entire external representation of
+the graph and allocated internal memory locations for
+every node.
+Hence we cannot reconstruct the graph in one scan.
+That is, there may be a pointer from a node A to a node B,
+where B is placed later in the sequential file than A.
+When we read node A, we cannot map the id of B
+to the address of node B,
+as we have not yet allocated node B.
+We can overcome this problem if the size
+of every node is known in advance.
+In this case we can allocate memory for a node
+on first reference.
+Otherwise, the mapping from id to pointer
+cannot be done while reading nodes.
+The mapping can be done either in an extra scan
+or at every reference to the node.
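+.sp
+A minimal sketch in C of the allocate-on-first-reference scheme
+(the node layout and the input format are invented for this
+example; by analogy with the trees above, id 0 is taken to
+denote a missing pointer):
+.DS
+#include <stdio.h>
+
+#define MAXID 1000
+
+struct node {
+    int data;
+    struct node *left, *right;
+};
+
+struct node *map[MAXID];   /* maps an id to an address */
+
+/* return the node with the given id, allocating it on first
+ * reference; possible because all nodes have the same,
+ * known size
+ */
+struct node *
+get_node(id)
+int id;
+{
+    extern char *calloc();
+
+    if (map[id] == 0)
+        map[id] = (struct node *) calloc(1, sizeof(struct node));
+    return map[id];
+}
+
+/* read one node: its id, its data and the ids of its sons */
+read_node(f)
+FILE *f;
+{
+    int id, d, l, r;
+    struct node *n;
+
+    fscanf(f, "%d %d %d %d", &id, &d, &l, &r);
+    n = get_node(id);
+    n->data = d;
+    n->left = (l == 0) ? 0 : get_node(l);
+    n->right = (r == 0) ? 0 : get_node(r);
+}
+.DE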
diff --git a/doc/ego/ic/ic3 b/doc/ego/ic/ic3
new file mode 100644
index 000000000..d98e8233f
--- /dev/null
+++ b/doc/ego/ic/ic3
@@ -0,0 +1,414 @@
+.NH 2
+Definition of the intermediate code
+.PP
+The intermediate code of the optimizer consists
+of several components:
+.IP -
+the object table
+.IP -
+the procedure table
+.IP -
+the EM code
+.IP -
+the control flow graphs
+.IP -
+the loop table
+.LP
+.PP
+These components are described in
+the next sections.
+The syntactic structure of every component
+is described by a set of context free syntax rules,
+with the following conventions:
+.DS
+x            a non-terminal symbol
+A            a terminal symbol (in capitals)
+x: a b c;    a grammar rule
+a | b        a or b
+(a)+         1 or more occurrences of a
+{a}          0 or more occurrences of a
+.DE
+.NH 3
+The object table
+.PP
+EM programs declare blocks of bytes rather than (global) variables.
+A typical program may declare 'HOL 7780'
+to allocate space for 8 I/O buffers,
+2 large arrays and 10 scalar variables.
+The optimizer wants to deal with
+.UL objects
+like variables, buffers and arrays
+and certainly not with huge numbers of bytes.
+Therefore the intermediate code contains information
+about which global objects are used.
+This information can be obtained from an EM program
+by just looking at the operands of instructions
+such as LOE, LAE, LDE, STE, SDE, INE, DEE and ZRE.
+.PP
+The object table consists of a list of
+.UL datablock
+entries.
+Each such entry represents a declaration like HOL, BSS,
+CON or ROM.
+There are five kinds of datablock entries.
+The fifth kind,
+UNKNOWN, denotes a declaration in a
+separately compiled file that is not made
+available to the optimizer.
+Each datablock entry contains the type of the block,
+its size, and a description of the objects that
+belong to it.
+If it is a ROM,
+it also contains a list of values given
+as arguments to the ROM instruction,
+provided that this list contains only integer numbers.
+An object has an offset (within its datablock)
+and a size.
+The size need not always be determinable.
+Both datablock and object contain a unique
+identifying number
+(see previous section for their use).
+.DS
+.UL syntax
+   object_table:
+      {datablock} ;
+   datablock:
+      D_ID         -- unique identifying number
+      PSEUDO       -- one of ROM,CON,BSS,HOL,UNKNOWN
+      SIZE         -- # bytes declared
+      FLAGS
+      {value}      -- contents of ROM
+      {object} ;   -- objects of the datablock
+   object:
+      O_ID         -- unique identifying number
+      OFFSET       -- offset within the datablock
+      SIZE ;       -- size of the object in bytes
+   value:
+      argument ;
+.DE
+A datablock has only one flag: "external", indicating
+whether the data label is externally visible.
+The syntax for "argument" will be given later on
+(see em_text).
+.NH 3
+The procedure table
+.PP
+The procedure table contains global information
+about all procedures that are made available
+to the optimizer
+and that are needed by the EM program.
+(Library units may not be needed, see section 3.5.)
+The table has one entry for
+every procedure.
+.DS
+.UL syntax
+   procedure_table:
+      {procedure}
+   procedure:
+      P_ID         -- unique identifying number
+      #LABELS      -- number of instruction labels
+      #LOCALS      -- number of bytes for locals
+      #FORMALS     -- number of bytes for formals
+      FLAGS        -- flag bits
+      calling      -- procedures called by this one
+      change       -- info about global variables changed
+      use ;        -- info about global variables used
+   calling:
+      {P_ID} ;     -- procedures called
+   change:
+      ext          -- external variables changed
+      FLAGS ;
+   use:
+      FLAGS ;
+   ext:
+      {O_ID} ;     -- a set of objects
+.DE
+.PP
+The number of bytes of formal parameters accessed by
+a procedure is determined by the front ends and
+passed via a message (parameter message) to the optimizer.
+If the front end is not able to determine this number
+(e.g. the parameter may be an array of dynamic size or
+the procedure may have a variable number of arguments) the attribute
+contains the value 'UNKNOWN_SIZE'.
+.sp 0
+A procedure has the following flags:
+.IP -
+external: true if the procedure is externally visible
+.IP -
+bodyseen: true if its code is available as EM text
+.IP -
+calunknown: true if it calls a procedure that has its bodyseen
+flag not set
+.IP -
+environ: true if it uses or changes a (non-global) variable in
+a lexically enclosing procedure
+.IP -
+lpi: true if it is used as operand of an LPI instruction, so
+it may be called indirectly
+.LP
+The change and use attributes both have one flag: "indirect",
+indicating whether the procedure does a 'use indirect'
+or a 'store indirect' (indirect means through a pointer).
+.NH 3
+The EM text
+.PP
+The EM text contains the EM instructions.
+Every EM instruction has an operation code (opcode)
+and 0 or 1 operands.
+EM pseudo instructions can have more than
+1 operand.
+The opcode is just a small (8 bit) integer.
+.sp
+There are several kinds of operands, which we will
+refer to as
+.UL types.
+Many EM instructions can have more than one type of operand.
+The types and their encodings in Compact Assembly Language
+are discussed extensively in.
+.[~[
+keizer architecture
+.], section 11.2]
+Of special interest is the way numeric values
+are represented.
+Of prime importance is the machine independence of
+the representation.
+Ultimately, one could store every integer
+just as a string of the characters '0' to '9'.
+As doing arithmetic on strings is awkward,
+Compact Assembly Language allows several alternatives.
+The main idea is to look at the value of the integer.
+Integers that fit in 16, 32 or 64 bits are
+represented as a row of 2, 4 or 8 bytes respectively,
+preceded by an indication of how many bytes are used.
+Longer integers are represented as strings;
+this is only allowed within pseudo instructions, however.
+This concept works very well for target machines
+with reasonable word sizes.
+At present, most ACK software cannot be used for word sizes
+larger than 32 bits,
+although the handles for using larger word sizes are
+present in the design of the EM code.
+In the intermediate code we essentially use the
+same ideas.
+We allow three representations of integers:
+.IP -
+integers that fit in a short are represented as a short
+.IP -
+integers that fit in a long but not in a short are represented
+as longs
+.IP -
+all remaining integers are represented as strings
+(only allowed in pseudos).
+.LP
+The terms short and long are defined in
+.[~[
+ritchie reference manual programming language
+.], section 4]
+and depend only on the source machine
+(i.e. the machine on which ACK runs),
+not on the target machines.
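+.sp
+In C, the fits-in-a-short test can be written machine-independently;
+a minimal sketch (the type indications SHORT and OFFSET and the
+put_ routines are invented names):
+.DS
+put_cst(v)
+long v;
+{
+    if ((long) (short) v == v) {   /* fits in a short? */
+        put_type(SHORT);
+        put_short((short) v);
+    } else {
+        put_type(OFFSET);          /* i.e. a long */
+        put_offset(v);
+    }
+}
+/* integers that do not even fit in a long never reach this
+ * routine; they are written as strings (pseudos only)
+ */
+.DE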
+For historical reasons a long will often be called an
+.UL offset.
+.PP
+Operands can also be instruction labels,
+objects or procedures.
+Instruction labels are denoted by a
+.UL label
+.UL identifier,
+which can be distinguished from a normal identifier.
+.sp
+The operand of a pseudo instruction can be a list of
+.UL arguments.
+Arguments can have the same type as operands, except
+for the type short, which is not used for arguments.
+Furthermore, an argument can be a string or
+a string representation of a signed integer, unsigned integer
+or floating point number.
+If the number of arguments is not fully determined by
+the pseudo instruction (e.g. a ROM pseudo can have any number
+of arguments), then the list is terminated by a special
+argument of type CEND.
+.DS
+.UL syntax
+   em_text:
+      {line} ;
+   line:
+      INSTR          -- opcode
+      OPTYPE         -- operand type
+      operand ;
+   operand:
+      empty |        -- OPTYPE = NO
+      SHORT |        -- OPTYPE = SHORT
+      OFFSET |       -- OPTYPE = OFFSET
+      LAB_ID |       -- OPTYPE = INSTRLAB
+      O_ID |         -- OPTYPE = OBJECT
+      P_ID |         -- OPTYPE = PROCEDURE
+      {argument} ;   -- OPTYPE = LIST
+   argument:
+      ARGTYPE
+      arg ;
+   arg:
+      empty |        -- ARGTYPE = CEND
+      OFFSET |
+      LAB_ID |
+      O_ID |
+      P_ID |
+      string |       -- ARGTYPE = STRING
+      const ;        -- ARGTYPE = ICON,UCON or FCON
+   string:
+      LENGTH         -- number of characters
+      {CHARACTER} ;
+   const:
+      SIZE           -- number of bytes
+      string ;       -- string representation of (un)signed
+                     -- or floating point constant
+.DE
+.NH 3
+The control flow graphs
+.PP
+Each procedure can be divided
+into a number of basic blocks.
+A basic block is a piece of code with
+no jumps in, except at the beginning,
+and no jumps out, except at the end.
+.PP
+Every basic block has a set of
+.UL successors,
+which are basic blocks that can follow it immediately in
+the dynamic execution sequence.
+The
+.UL predecessors
+are the basic blocks of which this one
+is a successor.
+The successor and predecessor attributes
+of all basic blocks of a single procedure
+are said to form the
+.UL control
+.UL flow
+.UL graph
+of that procedure.
+.PP
+Another important attribute is the
+.UL immediate
+.UL dominator.
+A basic block B dominates a block C if
+every path in the graph from the procedure entry block
+to C goes through B.
+The immediate dominator of C is the closest dominator
+of C on any path from the entry block.
+(Note that the dominators of a block are linearly ordered
+by the dominance relation, so the immediate dominator
+is well defined.)
+.PP
+A basic block also has an attribute containing
+the identifiers of every
+.UL loop
+that the block belongs to (see next section for loops).
+.DS
+.UL syntax
+   control_flow_graph:
+      {basic_block} ;
+   basic_block:
+      B_ID         -- unique identifying number
+      #INSTR       -- number of EM instructions
+      succ
+      pred
+      idom         -- immediate dominator
+      loops        -- set of loops
+      FLAGS ;      -- flag bits
+   succ:
+      {B_ID} ;
+   pred:
+      {B_ID} ;
+   idom:
+      B_ID ;
+   loops:
+      {LP_ID} ;
+.DE
+The flag bits can have the values 'firm' and 'strong',
+which are explained below.
+.NH 3
+The loop tables
+.PP
+Every procedure has an associated
+.UL loop
+.UL table
+containing information about all the loops
+in the procedure.
+Loops can be detected by a close inspection of
+the control flow graph.
+The main idea is to look for two basic blocks,
+B and C, for which the following holds:
+.IP -
+B is a successor of C
+.IP -
+B is a dominator of C
+.LP
+B is called the loop
+.UL entry
+and C is called the loop
+.UL end.
+Intuitively, C contains a jump backwards to
+the beginning of the loop (B).
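+.sp
+The resulting test is simple.
+A sketch in C, assuming the control flow graph and the
+dominator relation have already been computed (the data
+types and routine names are invented for this example):
+.DS
+/* A back edge C -> B, where B dominates C, identifies
+ * a loop with entry B and end C.
+ */
+find_loops(p)
+struct proc *p;
+{
+    struct block *b, *c;
+    int i;
+
+    for (c = p->first_block; c != 0; c = c->next)
+        for (i = 0; i < c->nsucc; i++) {
+            b = c->succ[i];
+            if (dominates(b, c))
+                new_loop(b, c);   /* entry b, end c */
+        }
+}
+.DE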
+.PP
+A loop L1 is said to be
+.UL nested
+within loop L2 if all basic blocks of L1
+are also part of L2.
+It is important to note that loops could
+originally be written as a well-structured for- or
+while-loop or as a messy goto loop.
+Hence loops may partly overlap without one
+being nested inside the other.
+The
+.UL nesting
+.UL level
+of a loop is the number of loops in
+which it is nested (so it is 0 for
+an outermost loop).
+The details of loop detection will be discussed later.
+.PP
+It is often desirable to know whether a
+basic block gets executed during every iteration
+of a loop.
+This leads to the following definitions:
+.IP -
+A basic block B of a loop L is said to be a \fIfirm\fR block
+of L if B is executed on all successive iterations of L,
+with the only possible exception of the last iteration.
+.IP -
+A basic block B of a loop L is said to be a \fIstrong\fR block
+of L if B is executed on all successive iterations of L.
+.LP
+Note that a strong block is also a firm block.
+If a block is part of a conditional statement, it is neither
+strong nor firm, as it may be skipped during some iterations
+(see Fig. 3.2).
+.DS
+loop
+   if cond1 then
+      ...               -- this code will not
+                        -- result in a firm or strong block
+   end if;
+   ...                  -- strong (always executed)
+   exit when cond2;
+   ...                  -- firm (not executed on
+                        -- the last iteration)
+end loop;
+
+Fig. 3.2 Example of firm and strong blocks
+.DE
+.DS
+.UL syntax
+   looptable:
+      {loop} ;
+   loop:
+      LP_ID        -- unique identifying number
+      LEVEL        -- loop nesting level
+      entry        -- loop entry block
+      end ;        -- loop end block
+   entry:
+      B_ID ;
+   end:
+      B_ID ;
+.DE
diff --git a/doc/ego/ic/ic4 b/doc/ego/ic/ic4
new file mode 100644
index 000000000..950a660ed
--- /dev/null
+++ b/doc/ego/ic/ic4
@@ -0,0 +1,80 @@
+.NH 2
+External representation of the intermediate code
+.PP
+The syntax of the intermediate code was given
+in the previous section.
+In this section we will make some remarks about
+the representation of the code in sequential files.
+.sp
+We use sequential files in order to avoid
+the bookkeeping of complex file indices.
+As a consequence of this decision
+we cannot store all components
+of the intermediate code
+in one file.
+If a phase wishes to change some attribute
+of a procedure,
+or wants to add or delete entire procedures
+(inline substitution may do the latter),
+the procedure table will only be fully updated
+after the entire EM text has been scanned.
+Yet, the next phase undoubtedly wants
+to read the procedure table before it
+starts working on the EM text.
+Hence there is an ordering problem, which
+can be solved easily by putting the
+procedure table in a separate file.
+Similarly, the data block table is kept
+in a file of its own.
+.PP
+The control flow graphs (CFGs) could be mixed
+with the EM text.
+Rather, we have chosen to put them
+in a separate file too.
+The control flow graph file should be regarded as a
+file that imposes some structure on the EM-text file,
+just as an overhead sheet containing a picture
+of a flow chart may be put on an overhead sheet
+containing statements.
+The loop tables are also put in the CFG file.
+A loop imposes an extra structure on the
+CFGs and hence on the EM text.
+So there are four files:
+.IP -
+the EM-text file
+.IP -
+the procedure table file
+.IP -
+the object table file
+.IP -
+the CFG and loop tables file
+.LP
+Every table is preceded by its length, in order to
+tell where it ends.
+The CFG file also contains the number of instructions of
+every basic block,
+indicating which part of the EM text belongs
+to that block.
+.DS
+.UL syntax
+   intermediate_code:
+      object_table_file
+      proctable_file
+      em_text_file
+      cfg_file ;
+   object_table_file:
+      LENGTH         -- number of objects
+      object_table ;
+   proctable_file:
+      LENGTH         -- number of procedures
+      procedure_table ;
+   em_text_file:
+      em_text ;
+   cfg_file:
+      {per_proc} ;   -- one for every procedure
+   per_proc:
+      BLENGTH        -- number of basic blocks
+      LLENGTH        -- number of loops
+      control_flow_graph
+      looptable ;
+.DE
diff --git a/doc/ego/ic/ic5 b/doc/ego/ic/ic5
new file mode 100644
index 000000000..9dd5daae2
--- /dev/null
+++ b/doc/ego/ic/ic5
@@ -0,0 +1,163 @@
+.NH 2
+The Intermediate Code construction phase
+.PP
+The first phase of the global optimizer,
+called
+.UL IC,
+constructs a major part of the intermediate code.
+To be specific, it produces:
+.IP -
+the EM text
+.IP -
+the object table
+.IP -
+part of the procedure table
+.LP
+The calling, change and use attributes of a procedure
+and all its flags except the external and bodyseen flags
+are computed by the next phase (Control Flow phase).
+.PP
+As explained before,
+the intermediate code does not contain
+any names of variables or procedures.
+The normal identifiers are replaced by identifying
+numbers.
+Yet, the output of the global optimizer must
+contain normal identifiers, as this
+output is in Compact Assembly Language format.
+We certainly want all externally visible names
+to be the same in the input as in the output,
+because the optimized EM module may be a library unit,
+used by other modules.
+IC dumps the names of all procedures and data labels
+to two files:
+.IP -
+the procedure dump file, containing tuples (P_ID, procedure name)
+.IP -
+the data dump file, containing tuples (D_ID, data label name)
+.LP
+The names of instruction labels are not dumped,
+as they are not visible outside the procedure
+in which they are defined.
+.PP
+The input to IC consists of one or more files.
+Each file is either an EM module in Compact Assembly Language
+format, or a Unix archive file (library) containing such modules.
+IC only extracts those modules from a library that are
+needed somehow, just as a linker does.
+It is advisable to present as much code
+of the EM program as possible to the optimizer,
+although it is not required to present the whole program.
+If a procedure is called somewhere in the EM text,
+but its body (text) is not included in the input,
+its bodyseen flag in the procedure table will still
+be off.
+Whenever such a procedure is called,
+we assume the worst case for everything:
+it will change and use all variables it has access to,
+it will call every procedure, etc.
+.sp
+Similarly, if a data label is used
+but not defined, the PSEUDO attribute in its data block
+will be set to UNKNOWN.
+.NH 3
+Implementation
+.PP
+Part of the code for the EM Peephole Optimizer
+.[
+staveren peephole toplass
+.]
+has been used for IC.
+Especially the routines that read and unravel
+Compact Assembly Language and the identifier
+look-up mechanism have been used.
+New code was added to recognize objects,
+build the object and procedure tables and to
+output the intermediate code.
+.PP
+IC uses singly linked linear lists for both the
+procedure and object table.
+Hence there are no limits on the size of such
+a table (except for the trivial fact that it must fit
+in main memory).
+Both tables are output after all EM code has
+been processed.
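+.sp
+A sketch of the list representation (the field names are
+invented for this example; the real declarations are in ic.h):
+.DS
+struct proc {
+    int  p_id;           /* unique identifying number */
+    int  p_flags;        /* external, bodyseen, ... */
+    long p_localbytes;   /* # bytes for locals */
+    struct proc *p_next; /* next entry in the list */
+};
+
+struct proc *proctable;  /* head of the singly linked list */
+.DE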
+IC reads the EM text of one entire procedure
+at a time,
+processes it and appends the modified code to
+the EM text file.
+EM code is represented internally as a doubly linked linear
+list of EM instructions.
+.PP
+Objects are recognized by looking at the operands
+of instructions that reference global data.
+If we come across the instructions:
+.DS
+LDE X+6    -- Load Double External
+LAE X+20   -- Load Address External
+.DE
+we conclude that the data block
+preceded by the data label X contains an object
+at offset 6 of size twice the word size,
+and an object at offset 20 of unknown size.
+.sp
+A data block entry of the object table is allocated
+at the first reference to a data label.
+If this reference is a defining occurrence
+or an INA pseudo instruction,
+the label is not externally visible
+.[~[
+keizer architecture
+.], section 11.1.4.3]
+In this case, the external flag of the data block
+is turned off.
+If the first reference is an applied occurrence
+or an EXA pseudo instruction, the flag is set.
+We record this information, because the
+optimizer may change the order of defining and
+applied occurrences.
+The INA and EXA pseudos are removed from the EM text.
+They may be regenerated by the last phase
+of the optimizer.
+.sp
+Similar rules hold for the procedure table
+and the INP and EXP pseudos.
+.NH 3
+Source files of IC
+.PP
+The source files of IC consist
+of the files ic.c, ic.h and several packages.
+.UL ic.h
+contains type definitions, macros and
+variable declarations that may be used by
+ic.c and by every package.
+.UL ic.c
+contains the definitions of these variables,
+the procedure
+.UL main
+and some high level I/O routines used by main.
+.sp
+Every package xxx consists of two files.
+ic_xxx.h contains type definitions,
+macros, variable declarations and
+procedure declarations that may be used by
+every .c file that includes this .h file.
+The file ic_xxx.c provides the
+definitions of these variables and
+the implementation of the declared procedures.
+IC uses the following packages:
+.IP lookup: 18
+procedures that look up procedure, data label
+and instruction label names; procedures to dump
+the procedure and data label names.
+.IP lib:
+one procedure that gets the next useful input module;
+while scanning archives, it skips unnecessary modules.
+.IP aux:
+several auxiliary routines.
+.IP io:
+low-level I/O routines that unravel the Compact
+Assembly Language.
+.IP put:
+routines that output the intermediate code.
+.LP
diff --git a/doc/ego/il/il1 b/doc/ego/il/il1
new file mode 100644
index 000000000..5bc33e6af
--- /dev/null
+++ b/doc/ego/il/il1
@@ -0,0 +1,112 @@
+.bp
+.NH 1
+Inline substitution
+.NH 2
+Introduction
+.PP
+The Inline Substitution technique (IL)
+tries to decrease the overhead associated
+with procedure calls (invocations).
+During a procedure call, several actions
+must be undertaken to set up the right
+environment for the called procedure.
+.[
+johnson calling sequence
+.]
+On return from the procedure, most of these
+effects must be undone.
+This entire process introduces significant
+costs in execution time as well as
+in object code size.
+.PP
+The inline substitution technique replaces
+some of the calls by the modified body of
+the called procedure, hence eliminating
+the overhead.
+Furthermore, as the calling and called procedure
+are now integrated, they can be optimized
+together, using other techniques of the optimizer.
+This often leads to extra opportunities for
+optimization
+.[
+ball predicting effects
+.]
+.[
+carter code generation cacm
+.]
+.[
+scheifler inline cacm
+.]
+.PP
+An inline substitution of a call to a procedure P increases
+the size of the program, unless P is very small or P is
+called only once.
+In the latter case, P can be eliminated.
+In practice, procedures that are called only once occur
+quite frequently, due to the
+introduction of structured programming.
+(Carter
+.[
+carter umi ann arbor
+.]
+states that almost 50% of the Pascal procedures
+he analyzed were called just once).
+.PP
+Scheifler
+.[
+scheifler inline cacm
+.]
+has a more general view of inline substitution.
+In his model, the program under consideration is
+allowed to grow by a certain amount,
+i.e. code size is sacrificed to speed up the program.
+The above two cases are just special cases of
+his model, obtained by setting the size-change to
+(approximately) zero.
+He formulates the substitution problem as follows:
+.IP
+"Given a program, a subset of all invocations,
+a maximum program size, and a maximum procedure size,
+find a sequence of substitutions that minimizes
+the expected execution time."
+.LP
+Scheifler shows that this problem is NP-complete
+.[~[
+aho hopcroft ullman analysis algorithms
+.], chapter 10]
+by reduction from the Knapsack Problem.
+Heuristics will have to be used to find a near-optimal
+solution.
+.PP
+In the following chapters we will extend
+Scheifler's view and adapt it to the EM Global Optimizer.
+We will first describe the transformations that have
+to be applied to the EM text when a call is substituted
+in line.
+Next we will examine in which cases inline substitution
+is not possible or desirable.
+Heuristics will be developed for
+choosing a good sequence of substitutions.
+These heuristics make no demands on the user
+(such as making profiles
+.[
+scheifler inline cacm
+.]
+or giving pragmas
+.[~[
+ichbiah ada military standard
+.], section 6.3.2]),
+although the model could easily be extended
+to use such information.
+Finally, we will discuss the implementation
+of the IL phase of the optimizer.
+.PP
+We will often use the term inline expansion
+as a synonym of inline substitution.
+.sp 0
+The inverse technique of procedure abstraction
+(automatic subroutine generation)
+.[
+shaffer subroutine generation
+.]
+will not be discussed in this report.
diff --git a/doc/ego/il/il2 b/doc/ego/il/il2
new file mode 100644
index 000000000..ea69b35d7
--- /dev/null
+++ b/doc/ego/il/il2
@@ -0,0 +1,93 @@
+.NH 2
+Parameters and local variables
+.PP
+In the EM calling sequence, the calling procedure
+pushes its parameters on the stack
+before doing the CAL.
+The called routine first saves some
+status information on the stack and then
+allocates space for its own locals
+(also on the stack).
+Usually, one special purpose register,
+the Local Base (LB) register,
+is used to access both the locals and the
+parameters.
+If memory is highly segmented,
+the stack frames of the caller and the callee
+may be allocated in different fragments;
+an extra Argument Base (AB) register is used
+in this case to access the actual parameters.
+See section 4.2 of
+.[
+keizer architecture
+.]
+for further details.
+.PP
+If a procedure call is expanded in line,
+there are two problems:
+.IP 1. 3
+No stack frame will be allocated for the called procedure;
+we must find another place to put its locals.
+.IP 2.
+The LB register cannot be used to access the actual
+parameters;
+as the CAL instruction is deleted, the LB will
+still point to the local base of the \fIcalling\fR procedure.
+.LP
+The local variables of the called procedure will
+be put in the stack frame of the calling procedure,
+just after its own locals.
+The size of the stack frame of the
+calling procedure will be increased
+during its entire lifetime.
+Therefore our model will allow a
+limit to be set on the number of bytes
+for locals that the called procedure may have
+(see next section).
+.PP
+There are several alternatives to access the parameters.
+An actual parameter may be an arbitrary expression,
+which we will refer to as
+the \fIactual parameter expression\fR.
+The value of this expression is stored
+in a location on the stack (see above),
+the \fIparameter location\fR.
+.sp 0
+The alternatives for accessing parameters are:
+.IP -
+save the value of the stack pointer at the point of the CAL
+in a temporary variable X;
+this variable can be used to simulate the AB register, i.e.
+parameter locations are accessed via an offset to
+the value of X.
+.IP -
+create a new temporary local variable T for
+the parameter (in the stack frame of the caller);
+every access to the parameter location must be changed
+into an access to T.
+.IP -
+do not evaluate the actual parameter expression before the call;
+instead, substitute this expression for every use of the
+parameter location.
+.LP
+The first method may be expensive if X is not
+put in a register.
+We will not use this method.
+The time required to evaluate and access the
+parameters when the second method is used
+will not differ much from that of the normal
+calling sequence (i.e. a call that is not expanded in line).
+It is not expensive, but there are no
+extra savings either.
+The third method is essentially the 'by name'
+parameter mechanism of Algol60.
+If the actual parameter is just a numeric constant,
+it is advantageous to use it.
+Yet, there are several circumstances
+under which it cannot or should not be used.
+We will deal with this in the next section.
+.sp 0
+In general we will use the third method,
+if it is possible and desirable.
+Such parameters will be called \fIin line parameters\fR.
+In all other cases we will use the second method.
diff --git a/doc/ego/il/il3 b/doc/ego/il/il3
new file mode 100644
index 000000000..e8ec7ee85
--- /dev/null
+++ b/doc/ego/il/il3
@@ -0,0 +1,164 @@
+.NH 2
+Feasibility and desirability analysis
+.PP
+Feasibility and desirability analysis
+of in line substitution differ
+somewhat from most other techniques.
+Usually, much effort is needed to find
+a feasible opportunity for optimization
+(e.g. a redundant subexpression).
+Desirability analysis then checks
+if it is really advantageous to do
+the optimization.
+For IL, opportunities are easy to find.
+To see if an in line expansion is
+desirable will not be hard either.
+Yet, the main problem is to find the most
+desirable ones.
+We will deal with this problem later and
+will first attend to feasibility and
+desirability analysis.
+.PP
+There are several reasons why a procedure invocation
+cannot or should not be expanded in line.
+.sp
+A call to a procedure P cannot be expanded in line
+in any of the following cases:
+.IP 1. 3
+The body of P is not available as EM text.
+Clearly, there is no way to do the substitution.
+.IP 2.
+P, or any procedure called by P (transitively),
+follows the chain of statically enclosing
+procedures (via an LXL or LXA instruction)
+or follows the chain of dynamically enclosing
+procedures (via a DCH).
+If the call were expanded in line,
+one level would be removed from the chains,
+leading to total chaos.
+This chaos could be solved by patching up
+every LXL, LXA or DCH in all procedures
+that could be part of the chains,
+but this is hard to implement.
+.IP 3.
+P, or any procedure called by P (transitively),
+calls a procedure whose body is not
+available as EM text.
+The unknown procedure may use an LXL, LXA or DCH.
+However, in several languages a separately
+compiled procedure has no access to the
+static or dynamic chain.
+In this case
+this point does not apply.
+.IP 4.
+P, or any procedure called by P (transitively),
+uses the LPB instruction, which converts a
+local base to an argument base;
+as the locals and parameters are stored
+in a non-standard way (differing from the
+normal EM calling sequence) this instruction
+would yield incorrect results.
+.IP 5.
+The total number of bytes of the parameters
+of P is not known.
+P may be a procedure with a variable number
+of parameters or may have an array of dynamic size
+as value parameter.
+.LP
+It is undesirable to expand a call to a procedure P in line
+in any of the following cases:
+.IP 1. 3
+P is large, i.e. the number of EM instructions
+of P exceeds some threshold.
+The expanded code would be large too.
+Furthermore, several programs in ACK,
+including the global optimizer itself,
+may run out of memory if they have to run
+in a small address space and are presented with
+very large procedures.
+The threshold may be set to infinite,
+in which case this point does not apply.
+.IP 2.
+P has many local variables.
+All these variables would have to be allocated
+in the stack frame of the calling procedure.
+.PP
+If a call may be expanded in line, we have to
+decide how to access its parameters.
+In the previous section we stated that we would
+use in line parameters whenever possible and desirable.
+There are several reasons why a parameter
+cannot or should not be expanded in line.
+.sp
+No parameter of a procedure P can be expanded in line
+in any of the following cases:
+.IP 1. 3
+P, or any procedure called by P (transitively),
+does a store-indirect or a use-indirect (i.e. through
+a pointer).
+However, if the front-end has generated messages
+telling that certain parameters cannot be accessed
+indirectly, those parameters may be expanded in line.
+.IP 2.
+P, or any procedure called by P (transitively),
+calls a procedure whose body is not available as EM text.
+The unknown procedure may do a store-indirect
+or a use-indirect.
+However, the same remark about front-end messages
+as for 1. holds here.
+.IP 3.
+The address of a parameter location is taken (via an LAL).
+In the normal calling sequence, all parameters
+are stored sequentially.
+If the address of one
+parameter location is taken, the address of any
+other parameter location can be computed from it.
+Hence we must put every parameter in a temporary location;
+furthermore, all these locations must be in
+the same order as for the normal calling sequence.
+.IP 4.
+P has overlapping parameters; for example, it uses
+the parameter at offset 10 both as a 2-byte and as a 4-byte
+parameter.
+Such code may be produced by the front ends if
+the formal parameter is of some record type
+with variants.
+.PP
+Sometimes a specific parameter must not be expanded in line.
+.sp 0
+An actual parameter expression cannot be expanded in line
+in any of the following cases:
+.IP 1. 3
+P stores into the parameter location.
+Even if the actual parameter expression is a simple
+variable, it is incorrect to change the 'store into
+formal' into a 'store into actual', because of
+the parameter mechanism used.
+In Pascal, the following expansion is incorrect:
+.DS
+procedure p (x:integer);
+begin
+   x := 20;
+end;
+...
+a := 10;                a := 10;
+p(a);       --->        a := 20;
+write(a);               write(a);
+.DE
+.IP 2.
+P changes any of the operands of the
+actual parameter expression.
+If the expression is expanded and evaluated
+after the operand has been changed,
+the wrong value will be used.
+.IP 3.
+The actual parameter expression has side effects.
+It must be evaluated only once,
+at the place of the call.
+.LP
+It is undesirable to expand an actual parameter in line
+in the following case:
+.IP 1. 3
+The parameter is used more than once
+(dynamically) and the actual parameter expression
+is not just a simple variable or constant.
+.LP
diff --git a/doc/ego/il/il4 b/doc/ego/il/il4
new file mode 100644
index 000000000..fdc664b1b
--- /dev/null
+++ b/doc/ego/il/il4
@@ -0,0 +1,132 @@
+.NH 2
+Heuristic rules
+.PP
+Using the information described
+in the previous section,
+we can find all calls that can
+be expanded in line, and for which
+this expansion is desirable.
+In general, we cannot expand all these calls,
+so we have to choose the 'best' ones.
+With every CAL instruction
+that may be expanded, we associate
+a \fIpay-off\fR,
+which expresses how desirable it is
+to expand this specific CAL.
+.sp
+Let Tc denote the portion of EM text involved
+in a specific call, i.e. the pushing of the actual
+parameter expressions, the CAL itself,
+the popping of the parameters and the
+pushing of the result (if any, via an LFR).
+Let Te denote the EM text that would be obtained
+by expanding the call in line.
+Let Pc be the original program and Pe the program
+with Te substituted for Tc.
+The pay-off of the CAL depends on two factors:
+.IP -
+T = execution_time(Pe) - execution_time(Pc)
+.IP -
+S = code_size(Pe) - code_size(Pc)
+.LP
+The change in execution time (T) depends on:
+.IP -
+T1 = execution_time(Te) - execution_time(Tc)
+.IP -
+N = number of times Te or Tc gets executed.
+.LP
+We assume that T1 will be the same every
+time the code gets executed.
+This is a reasonable assumption.
+(Note that we are talking about one CAL,
+not about different calls to the same procedure.)
+Hence
+.DS
+T = N * T1
+.DE
+T1 can be estimated by a careful analysis
+of the transformations that are performed.
+Below, we list everything that will be
+different when a call is expanded in line:
+.IP -
+The CAL instruction is not executed.
+This saves a subroutine jump.
+.IP -
+The instructions in the procedure prolog
+are not executed.
+These instructions, generated from the PRO pseudo,
+save some machine registers
+(including the old LB), set the new LB and allocate space
+for the locals of the called routine.
+The savings may be less if there are no
+locals to allocate.
+.IP -
+In line parameters are not evaluated before the call
+and are not pushed on the stack.
+.IP -
+All remaining parameters are stored in local variables,
+instead of being pushed on the stack.
+.IP -
+If the number of parameters is nonzero,
+the ASP instruction after the CAL is not executed.
+.IP -
+Every reference to an in line parameter is
+substituted by the parameter expression.
+.IP -
+RET (return) instructions are replaced by
+BRA (branch) instructions.
+If the called procedure 'falls through'
+(i.e. it has only one RET, at the end of its code),
+even the BRA is not needed.
+.IP -
+The LFR (fetch function result) is not executed.
+.PP
+Besides these changes, which are caused directly by IL,
+other changes may occur as IL influences other optimization
+techniques, such as Register Allocation and Constant Propagation.
+Our heuristic rules do not take into account the quite
+unpredictable effects on Register Allocation.
+They do, however, favour calls that have numeric \fIconstants\fR
+as parameters; especially the constant "0" as an inline
+parameter gets high scores,
+as further optimizations may often be possible.
+.PP
+It cannot be determined statically how often a CAL instruction gets
+executed.
+We will use \fIloop nesting\fR information here.
+The nesting level of the loop in which
+the CAL appears (if any) will be used as an
+indication of the number of times it gets executed.
+.PP
+Based on all these facts,
+the pay-off of a call will be computed.
+The following model was developed empirically.
+Assume procedure P calls procedure Q.
+The call takes place in basic block B.
+.DS
+ZP = # zero parameters
+CP = # constant parameters - ZP
+LN = Loop Nesting level (0 if outside any loop)
+F  = \fIif\fR # formal parameters of Q > 0 \fIthen\fR 1 \fIelse\fR 0
+FT = \fIif\fR Q falls through \fIthen\fR 1 \fIelse\fR 0
+S  = size(Q) - 1 - # inline_parameters - F
+L  = \fIif\fR # local variables of P > 0 \fIthen\fR 0 \fIelse\fR -1
+A  = CP + 2 * ZP
+N  = \fIif\fR LN=0 and P is never called from a loop \fIthen\fR 0 \fIelse\fR (LN+1)**2
+FM = \fIif\fR B is a firm block \fIthen\fR 2 \fIelse\fR 1
+
+pay_off = (100/S + FT + F + L + A) * N * FM
+.DE
+S stands for the size increase of the program,
+which is slightly less than the size of Q.
+The size of a procedure is taken to be its number
+of (non-pseudo) EM instructions.
+The terms "loop nesting level" and "firm" were defined
+in the chapter on the Intermediate Code (section "loop tables").
+If a call is not inside a loop and the calling procedure
+is itself never called from a loop (transitively),
+then the call will probably be executed at most once.
+Such a call is never expanded in line (its pay-off is zero).
+If the calling procedure doesn't have local variables, a penalty (L)
+is introduced, as it will most likely get local variables if the
+call gets expanded.
diff --git a/doc/ego/il/il5 b/doc/ego/il/il5
new file mode 100644
index 000000000..6445ba7df
--- /dev/null
+++ b/doc/ego/il/il5
@@ -0,0 +1,440 @@
+.NH 2
+Implementation
+.PP
+A major factor in the implementation
+of Inline Substitution is the requirement
+not to use an excessive amount of memory.
+IL essentially analyzes the entire program;
+it makes decisions based on which procedure calls
+appear in the whole program.
+Yet, because of the memory restriction, it is
+not feasible to read the entire program
+into main memory.
+To solve this problem, the IL phase has been
+split up into three subphases that are executed sequentially:
+.IP 1.
+analyze every procedure; see how it accesses its parameters;
+simultaneously collect all calls
+appearing in the whole program and put them
+in a \fIcall-list\fR.
+.IP 2.
+use the call-list and decide which calls will be substituted
+in line.
+.IP 3.
+take the decisions of subphase 2 and modify the
+program accordingly.
+.LP
+Subphases 1 and 3 scan the input program; only
+subphase 3 modifies it.
+It is essential that the decisions can be made
+in subphase 2
+without using the input program,
+provided that subphase 1 puts enough information
+in the call-list.
+Subphase 2 keeps the entire call-list in main memory
+and repeatedly scans it, to
+find the next best candidate for expansion.
+.PP
+We will specify the
+data structures used by IL before
+describing the subphases.
+.NH 3
+Data structures
+.NH 4
+The procedure table
+.PP
+In subphase 1 information is gathered about every procedure
+and added to the procedure table.
+This information is used by the heuristic rules.
+A proctable entry for procedure p has
+the following extra information:
+.IP -
+is it allowed to substitute an invocation of p in line?
+.IP -
+is it allowed to put any parameter of such a call in line?
+.IP -
+the size of p (number of EM instructions)
+.IP -
+does p 'fall through'?
+.IP -
+a description of the formal parameters that p accesses;
+this information is obtained by looking at the code of p.
+For every parameter f, we record:
+.RS
+.IP -
+the offset of f
+.IP -
+the type of f (word, double word, pointer)
+.IP -
+may the corresponding actual parameter be put in line?
+.IP -
+is f ever accessed indirectly?
+.IP -
+is f used: never, once or more than once?
+.RE
+.IP -
+the number of times p is called (see below)
+.IP -
+the file address of its call-count information (see below).
+.LP
+.NH 4
+Call-count information
+.PP
+As a result of Inline Substitution, some procedures may
+become useless, because all their invocations have been
+substituted in line.
+One of the tasks of IL is to keep track of which
+procedures are no longer called.
+Note that IL is especially keen on procedures that are
+called only once
+(possibly as a result of expanding all other calls to them).
+So we want to know how many times a procedure
+is called \fIduring\fR Inline Substitution.
+It is not good enough to compute this
+information afterwards.
+The task is rather complex, because
+the number of times a procedure is called
+varies during the entire process:
+.IP 1.
+If a call to p is substituted in line,
+the number of calls to p gets decremented by 1.
+.IP 2.
+If a call to p is substituted in line,
+and p contains n calls to q, then the number of calls to q
+gets incremented by n.
+.IP 3.
+If a procedure p is removed (because it is no
+longer called) and p contains n calls to q,
+then the number of calls to q gets decremented by n.
+.LP
+(Note that p may be the same as q, if p is recursive.)
+.sp 0
+So we actually want to have the following information:
+.DS
+NRCALL(p,q) = number of calls to q appearing in p,
+
+for all procedures p and q that may be put in line.
+.DE
+This information, called \fIcall-count information\fR, is
+computed by the first subphase.
+It is stored in a file.
+It is represented as a number of lists, rather than as
+a (very sparse) matrix.
+Every procedure has a list of (proc,count) pairs,
+telling which procedures it calls, and how many times.
+The file address of its call-count list is stored
+in its proctable entry.
+Whenever this information is needed, it is fetched from
+the file, using direct access.
+The proctable entry also contains the number of times
+a procedure is called, at any moment.
+.NH 4
+The call-list
+.PP
+The call-list is the major data structure used by IL.
+Every item of the list describes one procedure call.
+It contains the following attributes:
+.IP -
+the calling procedure (caller)
+.IP -
+the called procedure (callee)
+.IP -
+identification of the CAL instruction (sequence number)
+.IP -
+the loop nesting level; our heuristic rules appreciate
+calls inside a loop (or even inside a loop nested inside
+another loop, etc.)
+more than other calls
+.IP -
+the actual parameter expressions involved in the call;
+for every actual, we record:
+.RS
+.IP -
+the EM code of the expression
+.IP -
+the number of bytes of its result (size)
+.IP -
+an indication if the actual may be put in line
+.RE
+.LP
+The structure of the call-list is rather complex.
+Whenever a call is expanded in line, new calls
+will suddenly appear in the program,
+that were not contained in the original body
+of the calling subroutine.
+These calls are inherited from the called procedure.
+We will refer to these invocations as \fInested calls\fR
+(see Fig. 5.1).
+.DS
+procedure p is
+begin
+   a();
+   b();
+end;
+
+procedure r is               procedure r is
+begin                        begin
+   x();                         x();
+   p();  -- in line             a();  -- nested call
+   y();                         b();  -- nested call
+end;                            y();
+                             end;
+
+Fig. 5.1 Example of nested procedure calls
+.DE
+Nested calls may subsequently be put in line too
+(probably resulting in a yet deeper nesting level, etc.).
+So the call-list does not always reflect the source program,
+but changes dynamically, as decisions are made.
+If a call to p is expanded, all calls appearing in p
+will be added to the call-list.
+.sp 0
+A convenient and elegant way to represent
+the call-list is to use a LISP-like list.
+.[
+poel lisp trac
+.]
+Calls that appear at the same level
+are linked in the CDR direction.
+If a call C to a procedure p is expanded,
+all calls appearing in p are put in a sub-list
+of C, i.e. in its CAR.
+In the example above, before the decision
+to expand the call to p is made, the
+call-list of procedure r looks like:
+.DS
+(call-to-x, call-to-p, call-to-y)
+.DE
+After the decision, it looks like:
+.DS
+(call-to-x, (call-to-p*, call-to-a, call-to-b), call-to-y)
+.DE
+The call to p is marked, because it has been
+substituted.
+Whenever IL wants to traverse the call-list of some procedure,
+it uses the well-known LISP technique of
+recursion in the CAR direction and
+iteration in the CDR direction
+(see page 1.19-2 of
+.[
+poel lisp trac
+.]
+).
+All list traversals look like:
+.DS
+traverse(list)
+{
+   for (c = first(list); c != 0; c = CDR(c)) {
+      if (c is marked) {
+         traverse(CAR(c));
+      } else {
+         do something with c
+      }
+   }
+}
+.DE
+The entire call-list consists of a number of LISP-like lists,
+one for every procedure.
+The proctable entry of a procedure contains a pointer
+to the beginning of the list.
+.NH 3
+The first subphase: procedure analysis
+.PP
+The tasks of the first subphase are to determine
+several attributes of every procedure
+and to construct the basic call-list,
+i.e. without nested calls.
+The size of a procedure is determined
+by simply counting its EM instructions.
+Pseudo instructions are skipped.
+A procedure does not 'fall through' if its CFG
+contains a basic block
+that is not the last block of the CFG and
+that ends on a RET instruction.
+The formal parameters of a procedure are determined
+by inspection of its code.
+.PP
+The call-list is constructed by looking at all CAL instructions
+appearing in the program.
+The call-list should only contain calls to procedures
+that may be put in line.
+This fact is only known if the procedure was
+analyzed earlier.
+If a call to a procedure p appears in the program
+before the body of p,
+the call will always be put in the call-list.
+If p is later found to be unsuitable,
+the call will be removed from the list by the
+second subphase.
+.PP
+An important issue is the recognition
+of the actual parameter expressions of the call.
+The front ends produce messages telling how many
+bytes of formal parameters every procedure accesses.
+(If there is no such message for a procedure, it
+cannot be put in line.)
+The actual parameters together must account for
+the same number of bytes.
+A recursive descent parser is used
+to parse side-effect free EM expressions.
+It uses a table and some
+auxiliary routines to determine
+how many bytes every EM instruction pops from the stack
+and how many bytes it pushes onto the stack.
+These numbers depend on the EM instruction, its argument,
+and the word size and pointer size of the target machine.
+Initially, the parser has to recognize the
+number of bytes specified in the formals-message,
+say N.
+Assume the first instruction before the CAL pops S bytes
+and pushes R bytes.
+If R > N, too many bytes are recognized
+and the parser fails.
+Otherwise, it calls itself recursively to recognize the
+S bytes used as operand of the instruction.
+If it succeeds in doing so, it continues with the next instruction,
+i.e. the first instruction before the code recognized by
+the recursive call, to recognize N-R more bytes.
+The result is a number of EM instructions that collectively push N bytes.
+If an instruction is encountered that has side effects
+(e.g. a store or a procedure call) or of which R and S cannot
+be computed statically (e.g. an LOS), the parser fails.
+.sp 0
+Note that the parser traverses the code backwards.
+As EM code is essentially postfix code, the parser works top down.
+.PP
+If the parser fails to recognize the parameters, the call will not
+be substituted in line.
+If the parameters can be determined, they still have to
+match the formal parameters of the called procedure.
+This check is performed by the second subphase; it cannot be
+done here, because it is possible that the called
+procedure has not been analyzed yet.
+.PP
+The entire call-list is written to a file,
+to be processed by the second subphase.
+.NH 3
+The second subphase: making decisions
+.PP
+The task of the second subphase is quite easy
+to understand.
+It reads the call-list file,
+builds an in-core call-list and deletes every
+call that may not be expanded in line (either because the called
+procedure may not be put in line, or because the actual parameters
+of the call do not match the formal parameters of the called procedure).
+It assigns a \fIpay-off\fR to every call,
+indicating how desirable it is to expand it.
+.PP
+The subphase repeatedly scans the call-list and takes
+the call with the highest pay-off.
+The chosen one gets marked,
+and the call-list is extended with the nested calls,
+as described above.
+These nested calls are also assigned a pay-off,
+and will be considered too during the next scans.
+.sp 0
+After every decision the number of times
+every procedure is called is updated, using
+the call-count information.
+Meanwhile, the subphase keeps track of the amount of space left
+available.
+If all space is used, or if there are no more calls left to
+be expanded, it exits this loop.
+Finally, calls to procedures that are called only
+once are also chosen.
+.PP
+The actual parameters of a call are only needed by
+this subphase to assign a pay-off to a call.
+To save some space, these actuals are not kept in main memory.
+They are removed after the call has been read and a pay-off
+has been assigned to it.
+So this subphase works with \fIabstracts\fR of calls.
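+.sp
+In outline, the decision loop looks like this (the routine
+names are invented; the actual code is in the 2_ packages):
+.DS
+while (space_left > 0) {
+   c = best_call();          /* unmarked call with highest
+                              * pay-off, 0 if none left
+                              */
+   if (c == 0)
+      break;
+   mark(c);                  /* will be expanded in line */
+   add_nested_calls(c);      /* inherited from the callee */
+   update_call_counts(c);
+   space_left -= size_increase(c);
+}
+choose_calls_to_procs_called_once();
+.DE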
+After all work has been done,
+the actual parameters of the chosen calls are retrieved
+from a file,
+as they are needed by the transformation subphase.
+.NH 3
+The third subphase: doing transformations
+.PP
+The third subphase makes the actual modifications to
+the EM text.
+It is directed by the decisions made in the previous subphase,
+as expressed via the call-list.
+The call-list read by this subphase contains
+only calls that were selected for expansion.
+The list is ordered in the same way as the EM text,
+i.e. if a call C1 appears before a call C2 in the call-list,
+C1 also appears before C2 in the EM text.
+So the EM text is traversed linearly;
+the calls that have to be substituted are determined
+and the modifications are made.
+If a procedure is encountered that is no longer needed,
+it is simply not written to the output EM file.
+The substitution of a call takes place in distinct steps:
+.IP "change the calling sequence" 7
+.sp 0
+The actual parameter expressions are changed.
+Parameters that are put in line are removed.
+All remaining ones must store their result in a
+temporary local variable, rather than
+push it on the stack.
+The CAL instruction and any ASP (to pop actual parameters)
+or LFR (to fetch the result of a function)
+are deleted.
+.IP "fetch the text of the called procedure"
+.sp 0
+Direct disk access is used to read the text of the
+called procedure.
+The file offset is obtained from the proctable entry.
+.IP "allocate bytes for locals and temporaries"
+.sp 0
+The local variables of the called procedure will be put in the
+stack frame of the calling procedure.
+The same applies to any temporary variables
+that hold the result of parameters
+that were not put in line.
+The proctable entry of the caller is updated.
+.IP "put a label after the CAL"
+.sp 0
+If the called procedure contains a RET (return) instruction
+somewhere in the middle of its text (i.e. it does
+not fall through), the RET must be changed into
+a BRA (branch), to jump over the
+remainder of the text.
+This label is not needed if the called
+procedure falls through.
+.IP "copy the text of the called procedure and modify it"
+.sp 0
+References to local variables of the called routine
+and to parameters that are not put in line
+are changed to refer to the
+new locals of the caller.
+References to in line parameters are replaced
+by the actual parameter expressions.
+Returns (RETs) are either deleted or
+replaced by a BRA.
+Messages containing information about local
+variables or parameters are changed.
+Global data declarations and the PRO and END pseudos
+are removed.
+Instruction labels and references to them are
+changed to make sure they do not have the
+same identifying numbers as
+labels in the calling procedure.
+.IP "insert the modified text"
+.sp 0
+The pseudos of the called procedure are put after the pseudos
+of the calling procedure.
+The real text of the callee is put at
+the place where the CAL was.
+.IP "take care of nested substitutions"
+.sp 0
+The expanded procedure may contain calls that
+have to be expanded too (nested calls).
+If the descriptor of such a call contains actual
+parameter expressions,
+the code of the expressions has to be changed
+in the same way as the code of the callee was changed.
+Next, the entire process of finding CALs and doing
+the substitutions is repeated recursively.
+.LP diff --git a/doc/ego/il/il6 b/doc/ego/il/il6 new file mode 100644 index 000000000..bf61cad5c --- /dev/null +++ b/doc/ego/il/il6 @@ -0,0 +1,27 @@ +.NH 2 +Source files of IL +.PP +The sources of IL are in the following files +and packages (the prefixes 1_, 2_ and 3_ refer to the three subphases): +.IP il.h: 14 +declarations of global variables and +data structures +.IP il.c: +the routine main; the driving routines of the three subphases +.IP 1_anal: +contains a subroutine that analyzes a procedure +.IP 1_cal: +contains a subroutine that analyzes a call +.IP 1_aux: +implements auxiliary procedures used by subphase 1 +.IP 2_aux: +implements auxiliary procedures used by subphase 2 +.IP 3_subst: +the driving routine for doing the substitution +.IP 3_change: +lower level routines that do certain modifications +.IP 3_aux: +implements auxiliary procedures used by subphase 3 +.IP aux +implements auxiliary procedures used by several subphases. +.LP diff --git a/doc/ego/intro/head b/doc/ego/intro/head new file mode 100644 index 000000000..0d015a9d8 --- /dev/null +++ b/doc/ego/intro/head @@ -0,0 +1,7 @@ +.ND +.ll 80m +.nr LL 80m +.nr tl 78m +.tr ~ +.ds >. . +.ds [. " \[ diff --git a/doc/ego/intro/intro1 b/doc/ego/intro/intro1 new file mode 100644 index 000000000..de7a5ae89 --- /dev/null +++ b/doc/ego/intro/intro1 @@ -0,0 +1,79 @@ +.TL +The design and implementation of +the EM Global Optimizer +.AU +H.E. Bal +.AI +Vrije Universiteit +Wiskundig Seminarium, Amsterdam +.AB +The EM Global Optimizer is part of the Amsterdam Compiler Kit, +a toolkit for making retargetable compilers. +It optimizes the intermediate code common to all compilers of +the toolkit (EM), +so it can be used for all programming languages and +all processors supported by the kit. +.PP +The optimizer is based on well-understood concepts like +control flow analysis and data flow analysis. +It performs the following optimizations: +Inline Substitution, Strength Reduction, Common Subexpression Elimination, +Stack Pollution, Cross Jumping, Branch Optimization, Copy Propagation, +Constant Propagation, Dead Code Elimination and Register Allocation. +.PP +This report describes the design of the optimizer and several +of its implementation issues. +.AE +.bp +.NH 1 +Introduction +.PP +.FS +This work was supported by the +Stichting Technische Wetenschappen (STW) +under grant VWI00.0001. +.FE +The EM Global Optimizer is part of a software toolkit +for making production-quality retargetable compilers. +This toolkit, +called the Amsterdam Compiler Kit +.[ +tanenbaum toolkit rapport +.] +.[ +tanenbaum toolkit cacm +.] +runs under the Unix* +.FS +*Unix is a Trademark of Bell Laboratories +.FE +operating system. +.sp 0 +The main design philosophy of the toolkit is to use +a language- and machine-independent +intermediate code, called EM. +.[ +keizer architecture +.] +The basic compilation process can be split up into +two parts. +A language-specific front end translates the source program into EM. +A machine-specific back end transforms EM to assembly code +of the target machine. +.PP +The global optimizer is an optional phase of the +compilation process, and can be used to obtain +machine code of a higher quality. +The optimizer transforms EM-code to better EM-code, +so it comes between the front end and the back end. +It can be used with any combination of languages +and machines, as far as they are supported by +the compiler kit. +.PP +This report describes the design of the +global optimizer and several of its +implementation issues. 
+Measurements can be found in.
+.[
+bal tanenbaum global
+.] diff --git a/doc/ego/intro/tail b/doc/ego/intro/tail new file mode 100644 index 000000000..6cd2d4867 --- /dev/null +++ b/doc/ego/intro/tail @@ -0,0 +1,3 @@ +.[ +$LIST$ +.] diff --git a/doc/ego/lv/lv1 b/doc/ego/lv/lv1 new file mode 100644 index 000000000..7574ca6f8 --- /dev/null +++ b/doc/ego/lv/lv1 @@ -0,0 +1,95 @@ +.bp
+.NH 1
+Live-Variable analysis
+.NH 2
+Introduction
+.PP
+The "Live-Variable analysis" optimization technique (LV)
+performs some code improvements and computes information that may be
+used by subsequent optimizations.
+The main task of this phase is the
+computation of \fIlive-variable information\fR.
+.[~[
+aho compiler design
+.] section 14.4]
+A variable A is said to be \fIdead\fR at some point p of the
+program text, if on no path in the control flow graph
+from p to a RET (return), A can be used before being changed;
+else A is said to be \fIlive\fR.
+.PP
+A statement of the form
+.DS
+VARIABLE := EXPRESSION
+.DE
+is said to be dead if the left hand side variable is dead just after
+the statement and the right hand side expression has no
+side effects (i.e. it doesn't change any variable).
+Such a statement can be eliminated entirely.
+Dead code will seldom be present in the original program,
+but it may be the result of earlier optimizations,
+such as copy propagation.
+.PP
+Live-variable information is passed to other phases via
+messages in the EM code.
+Live/dead messages are generated at points in the EM text where
+variables become dead or live.
+This information is especially useful for the Register
+Allocation phase.
+.NH 2
+Implementation
+.PP
+The implementation uses algorithm 14.6 of.
+.[
+aho compiler design
+.]
+First, two sets DEF and USE are computed for every basic block b:
+.IP DEF[b] 9
+the set of all variables that are assigned a value in b before
+being used
+.IP USE[b] 9
+the set of all variables that may be used in b before being changed.
+.LP
+(So a variable that may, but need not, be used via a procedure call
+or through a pointer is included in USE; a variable that may, but
+need not, be changed that way is not included in DEF).
+The next step is to compute the sets IN and OUT:
+.IP IN[b] 9
+the set of all variables that are live at the beginning of b
+.IP OUT[b] 9
+the set of all variables that are live at the end of b
+.LP
+IN and OUT can be computed for all blocks simultaneously by solving the
+data flow equations:
+.DS
+(1) IN[b]  = (OUT[b] - DEF[b]) + USE[b]
+(2) OUT[b] = IN[s1] + ... + IN[sn] ;
+      where SUCC[b] = {s1, ... , sn}
+.DE
+The equations are solved by an algorithm similar to the one used for
+the Use-Definition equations (see the previous chapter).
+.PP
+Finally, each basic block is visited in turn to remove its dead code
+and to emit the live/dead messages.
+Every basic block b is traversed from its last
+instruction backwards to the beginning of b.
+Initially, all variables that are dead at the end
+of b are marked dead. All others are marked live.
+If we come across an assignment to a variable X that
+was marked live, a live-message is put after the
+assignment and X is marked dead;
+if X was marked dead, the assignment may be removed, provided that
+the right hand side expression contains no side effects.
+If we come across a use of a variable X that
+was marked dead, a dead-message is put after the
+use and X is marked live.
+So at any point, the mark of X tells whether X is
+live or dead immediately before that point.
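+.PP
+This backward scan might be sketched in C as follows (all names are
+hypothetical; sets are bit vectors with one bit per local variable,
+and instructions involving several variables are ignored):
+.DS
+#define IN(s,v)		((s) &  (1L << (v)))
+#define ADD(s,v)	((s) |= (1L << (v)))
+#define DEL(s,v)	((s) &= ~(1L << (v)))
+
+void scan(struct block *b)
+{
+	struct instr *i;
+	long live = b->b_liveout;	/* variables live at the end of b */
+	int v;
+
+	for (i = b->b_last; i != 0; i = i->i_prev) {
+		if ((v = local_of(i)) < 0)
+			continue;		/* no local variable involved */
+		if (stores(i)) {
+			if (!IN(live,v) && no_side_effects(i))
+				remove_instr(b, i);	/* dead assignment */
+			else {
+				emit_live_mes(i, v);	/* live-message after i */
+				DEL(live, v);	/* dead above this point */
+			}
+		} else {
+			if (!IN(live,v))
+				emit_dead_mes(i, v);	/* dead-message after i */
+			ADD(live, v);		/* live above this point */
+		}
+	}
+}
+.DE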
+A message is also generated at the start of a basic block
+for every variable that was live at the end of the (textually)
+previous block, but dead at the entry of this block, or vice versa.
+.PP
+Only local variables are considered.
+This significantly reduces the memory needed by this phase,
+eases the implementation and is hardly less efficient than
+considering all variables.
+(Note that it is very hard to prove that an assignment to
+a global variable is dead). diff --git a/doc/ego/ov/ov1 b/doc/ego/ov/ov1 new file mode 100644 index 000000000..a447b0d06 --- /dev/null +++ b/doc/ego/ov/ov1 @@ -0,0 +1,371 @@ +.bp
+.NH 1
+Overview of the global optimizer
+.NH 2
+The ACK compilation process
+.PP
+The EM Global Optimizer is one of three optimizers that are
+part of the Amsterdam Compiler Kit (ACK).
+The phases of ACK are:
+.IP 1.
+A Front End translates a source program to EM
+.IP 2.
+The Peephole Optimizer
+.[
+tanenbaum staveren peephole toplass
+.]
+reads EM code and produces 'better' EM code.
+It performs a number of optimizations (mostly peephole
+optimizations)
+such as constant folding, strength reduction and unreachable code
+elimination.
+.IP 3.
+The Global Optimizer further improves the EM code.
+.IP 4.
+The Code Generator transforms EM to assembly code
+of the target computer.
+.IP 5.
+The Target Optimizer improves the assembly code.
+.IP 6.
+An Assembler/Loader generates an executable file.
+.LP
+For a more extensive overview of the ACK compilation process,
+we refer to.
+.[
+tanenbaum toolkit rapport
+.]
+.[
+tanenbaum toolkit cacm
+.]
+.PP
+The input of the Global Optimizer may consist of files and
+libraries.
+Every file or module in the library must contain EM code in
+Compact Assembly Language format.
+.[~[
+tanenbaum machine architecture
+.], section 11.2]
+The output consists of one such EM file.
+The input files and libraries together need not
+constitute an entire program,
+although as much of the program as possible should be supplied.
+The more information about the program the optimizer
+gets, the better its output code will be.
+.PP
+The Global Optimizer is language- and machine-independent,
+i.e. it can be used for all languages and machines supported by ACK.
+Yet, it puts some unavoidable restrictions on the EM code
+produced by the Front End (see below).
+It must have some knowledge of the target machine.
+This knowledge is expressed in a machine description table
+which is passed as an argument to the optimizer.
+This table does not contain very detailed information about the
+target (such as its instruction set and addressing modes).
+.NH 2
+The EM code
+.PP
+The definition of EM, the intermediate code of all ACK compilers,
+is given in a separate document.
+.[
+tanenbaum machine architecture
+.]
+We will only discuss some features of EM that are most relevant
+to the Global Optimizer.
+.PP
+EM is the assembly code of a virtual \fIstack machine\fR.
+All operations are performed on the top of the stack.
+For example, the statement "A := B + 3" may be expressed in EM as:
+.DS
+LOL -4   -- push local variable B
+LOC 3    -- push constant 3
+ADI 2    -- add two 2-byte items on top of
+         -- the stack and push the result
+STL -2   -- pop A
+.DE
+So EM is essentially a \fIpostfix\fR code.
+.PP
+EM has a rich instruction set, containing several arithmetic
+and logical operators.
+It also contains special-case instructions (such as INCrement).
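+.PP
+For instance, "B := B + 1" (with B again the local variable at
+offset -4) can be expressed with general instructions or, more
+compactly, with special-case instructions:
+.DS
+LOL -4   -- push local variable B
+INC      -- increment top of stack
+STL -4   -- pop B
+
+INL -4   -- the same effect in a single instruction
+.DE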
+.PP
+EM has \fIglobal\fR (\fIexternal\fR) variables, accessible
+by all procedures, and \fIlocal\fR variables, accessible by a few
+(nested) procedures.
+The local variables of a lexically enclosing procedure may
+be accessed via a \fIstatic link\fR.
+EM has instructions to follow the static chain.
+There are EM instructions that allow a procedure
+to access its local variables directly (such as LOL and STL above).
+Local variables are referenced via an offset in the stack frame
+of the procedure, rather than by their names (e.g. -2 and -4 above).
+The EM code does not contain the (source language) type
+of the variables.
+.PP
+All structured statements in the source program are expressed in
+terms of low-level jump instructions.
+Besides conditional and unconditional branch instructions, there are
+two case instructions (CSA and CSB),
+to allow efficient translation of case statements.
+.NH 2
+Requirements on the EM input
+.PP
+As the optimizer should be useful for all languages,
+it clearly should not put severe restrictions on the EM code
+of the input.
+There is, however, one immovable requirement:
+it must be possible to determine the \fIflow of control\fR of the
+input program.
+As virtually all global optimizations are based on control flow information,
+the optimizer would be totally powerless without it.
+For this reason we restrict the usage of the case jump instructions (CSA/CSB)
+of EM.
+Such an instruction is always called with the address of a case descriptor
+on top of the stack.
+.[~[
+tanenbaum machine architecture
+.] section 7.4]
+This descriptor contains the labels of all possible
+destinations of the jump.
+We demand that all case descriptors be allocated in a global
+data fragment of type ROM, i.e. the case descriptors
+must not be modifiable.
+Furthermore, any case instruction should be immediately preceded by
+a LAE (Load Address External) instruction that loads the
+address of the descriptor,
+so the descriptor can be uniquely identified.
+.PP
+The optimizer will work improperly if the user obscures the flow
+of control.
+We will give two examples of how this can happen.
+.PP
+In "C" the notorious library routines "setjmp" and "longjmp"
+.[
+unix programmer's manual
+.]
+may be used to jump out of a procedure,
+but can also be used for a number of other dubious purposes,
+for example, to create an extra entry point in a loop.
+.DS
+	while (condition) {
+		....
+		setjmp(buf);
+		...
+	}
+	...
+	longjmp(buf);
+.DE
+An invocation of longjmp is actually a jump to the place of
+the last call to setjmp with the same argument (buf).
+As the calls to setjmp and longjmp are indistinguishable from
+normal procedure calls, the optimizer will not see the danger.
+Needless to say, several loop optimizations will behave
+unexpectedly when presented with such pathological input.
+.PP
+Another way to obscure the flow of control is
+by using exception handling routines.
+Ada*
+.FS
+* Ada is a registered trademark of the U.S. Government
+(Ada Joint Program Office).
+.FE
+has clearly recognized the dangers of exception handling,
+but other languages (such as PL/I) have not.
+.[
+ada rationale
+.]
+.PP
+The optimizer will be more effective if the EM input contains
+some extra information about the source program.
+The \fIregister messages\fR are especially important.
+These messages indicate which local variables may never be
+accessed indirectly.
+Most optimizations benefit significantly from this information.
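+.PP
+As an illustration, consider the following "C" fragment
+(hypothetical):
+.DS
+int f(int n)
+{
+	int i;		/* i is never accessed indirectly, so the
+			   front end can emit a register message for it */
+	int a;
+	int *p = &a;	/* the address of a is taken, so a may be
+			   accessed through p: no register message */
+
+	for (i = 0; i < n; i++)
+		*p += i;
+	return a;
+}
+.DE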
+.PP
+The Inline Substitution technique needs to know how many bytes
+of formal parameters every procedure accesses.
+Only calls to procedures for which the EM code contains this information
+will be substituted in line.
+.NH 2
+Structure of the optimizer
+.PP
+The Global Optimizer is organized as a number of \fIphases\fR,
+each one performing some task.
+The main structure is as follows:
+.IP IC 6
+the Intermediate Code construction phase transforms EM into the
+intermediate code (ic) of the optimizer
+.IP CF
+the Control Flow phase extends the ic with control flow
+information and interprocedural information
+.IP OPTs
+zero or more optimization phases, each one performing one or
+more related optimizations
+.IP CA
+the Compact Assembly phase generates Compact Assembly Language EM code
+out of ic.
+.LP
+.PP
+An important issue in the design of a global optimizer is the
+interaction between optimization techniques.
+It is often advantageous to combine several techniques in
+one algorithm that takes into account all interactions between them.
+Ideally, one single algorithm should be developed that does
+all optimizations simultaneously and deals with all possible interactions.
+In practice, such an algorithm is still far out of reach.
+Instead, some rather ad hoc (albeit important) combinations are chosen,
+such as Common Subexpression Elimination and Register Allocation.
+.[
+prabhala sethi common subexpressions
+.]
+.[
+sethi ullman optimal code
+.]
+.PP
+In the EM Global Optimizer there is one separate algorithm for
+every technique.
+Note that this does not mean that all techniques are independent
+of each other.
+.PP
+In principle, the optimization phases can be run in any order;
+a phase may even be run more than once.
+However, the following rules should be obeyed:
+.IP -
+the Live Variable analysis phase (LV) must be run prior to
+Register Allocation (RA), as RA uses information output by LV.
+.IP -
+RA should be the last phase; this is a consequence of the way
+the interface between RA and the Code Generator is defined.
+.LP
+The ordering of the phases has significant impact on
+the quality of the produced code.
+In
+.[
+wulf overview production quality carnegie-mellon
+.]
+two kinds of phase ordering problems are distinguished.
+If two techniques A and B both take away opportunities of each other,
+there is a "negative" ordering problem.
+If, on the other hand, both A and B introduce new optimization
+opportunities for each other, the problem is called "positive".
+In the Global Optimizer the following interactions must be
+taken into account:
+.IP -
+Inline Substitution (IL) may create new opportunities for most
+other techniques, so it should be run as early as possible.
+.IP -
+Use Definition analysis (UD) may introduce opportunities for LV.
+.IP -
+Strength Reduction may create opportunities for UD.
+.LP
+The optimizer has a default phase ordering, which can
+be changed by the user.
+.NH 2
+Structure of this document
+.PP
+The remaining chapters of this document each describe one
+phase of the optimizer.
+For every phase, we describe its task, its design,
+its implementation, and its source files.
+The latter two sections are intended to aid the
+maintenance of the optimizer and
+can be skipped by the first-time reader.
+.NH 2
+References
+.PP
+There are very
+few modern textbooks on optimization.
+Chapters 12, 13, and 14 of
+.[
+aho compiler design
+.]
+are a good introduction to the subject.
+Wulf et al.
+.[
+wulf optimizing compiler
+.]
+describe one specific optimizing (Bliss) compiler.
+Anklam et al.
+.[
+anklam vax-11
+.]
+discuss code generation and optimization in
+compilers for one specific machine (a Vax-11).
+Kirchgaesner et al.
+.[
+optimizing ada compiler
+.]
+present a brief description of many
+optimizations; the report also contains a lengthy (over 60 pages)
+bibliography.
+.PP
+The number of articles on optimization is quite impressive.
+The Lowry and Medlock paper on the Fortran H compiler
+.[
+object code optimization
+.]
+is a classical one.
+Other papers on global optimization are.
+.[
+faiman optimizing pascal
+.]
+.[
+perkins sites
+.]
+.[
+harrison general purpose optimizing
+.]
+.[
+morel partial redundancies
+.]
+.[
+Mintz global optimizer
+.]
+Freudenberger
+.[
+freudenberger setl optimizer
+.]
+describes an optimizer for a Very High Level Language (SETL).
+The Production-Quality Compiler-Compiler (PQCC) project uses
+very sophisticated compiler techniques, as described in.
+.[
+wulf overview ieee
+.]
+.[
+wulf overview carnegie-mellon
+.]
+.[
+wulf machine-relative
+.]
+.PP
+Several Ph.D. theses are dedicated to optimization.
+Davidson
+.[
+davidson simplifying
+.]
+outlines a machine-independent peephole optimizer that
+improves assembly code.
+Katkus
+.[
+katkus
+.]
+describes how efficient programs can be obtained at little cost by
+optimizing only a small part of a program.
+Photopoulos
+.[
+photopoulos mixed code
+.]
+discusses the idea of generating interpreted intermediate code as well
+as assembly code, to obtain programs that are both small and fast.
+Shaffer
+.[
+shaffer automatic
+.]
+describes the theory of automatic subroutine generation.
+Leverett
+.[
+leverett register allocation compilers
+.]
+deals with register allocation in the PQCC compilers.
+.PP
+References to articles about specific optimization techniques
+will be given in later chapters. diff --git a/doc/ego/ra/ra1 b/doc/ego/ra/ra1 new file mode 100644 index 000000000..fb5343f93 --- /dev/null +++ b/doc/ego/ra/ra1 @@ -0,0 +1,33 @@ +.bp
+.NH 1
+Register Allocation
+.NH 2
+Introduction
+.PP
+The efficient usage of the general-purpose registers
+of the target machine plays a key role in any optimizing compiler.
+This subject, often referred to as \fIRegister Allocation\fR,
+has great impact on both the code generator and the
+optimizing part of such a compiler.
+The code generator needs registers for at least the evaluation of
+arithmetic expressions;
+the optimizer uses the registers to decrease the access costs
+of frequently used entities (such as variables).
+The design of an optimizing compiler must pay great
+attention to the cooperation of optimization, register allocation
+and code generation.
+.PP
+Register allocation has received much attention in the literature (see
+.[
+leverett register allocation compilers
+.]
+.[
+chaitin register coloring
+.]
+.[
+freiburghouse usage counts
+.]
+and
+.[~[
+sites register
+.]]). diff --git a/doc/ego/ra/ra2 b/doc/ego/ra/ra2 new file mode 100644 index 000000000..e6dfc138f --- /dev/null +++ b/doc/ego/ra/ra2 @@ -0,0 +1,139 @@ +.NH 2
+Usage of registers in ACK compilers
+.PP
+We will first describe the major design decisions
+of the Amsterdam Compiler Kit,
+as far as they concern register allocation.
+Subsequently we will outline
+the role of the Global Optimizer in the register
+allocation process and the interface
+between the code generator and the optimizer.
+.NH 3 +Usage of registers without the intervention of the Global Optimizer +.PP +Registers are used for two purposes: +.IP 1. +for the evaluation of arithmetic expressions +.IP 2. +to hold local variables, for the duration of the procedure they +are local to. +.LP +It is essential to note that no translation part of the compilers, +except for the code generator, knows anything at all +about the register set of the target computer. +Hence all decisions about registers are ultimately made by +the code generator. +Earlier phases of a compiler can only \fIadvise\fR the code generator. +.PP +The code generator splits the register set into two: +a fixed part for the evaluation of expressions (called \fIscratch\fR +registers) and a fixed part to store local variables. +This partitioning, which depends only on the target computer, significantly +reduces the complexity of register allocation, at the penalty +of some loss of code quality. +.PP +The code generator has some (machine-dependent) knowledge of the access costs +of memory locations and registers and of the costs of saving and +restoring registers. (Registers are always saved by the \fIcalled\fR +procedure). +This knowledge is expressed in a set of procedures for each target machine. +The code generator also knows how many registers there are and of +which type they are. +A register can be of type \fIpointer\fR, \fIfloating point\fR +or \fIgeneral\fR. +.PP +The front ends of the compilers determine which local variables may +be put in a register; +such a variable may never be accessed indirectly (i.e. through a pointer). +The front end also determines the types and sizes of these variables. +The type can be any of the register types or the type \fIloop variable\fR, +which denotes a general-typed variable that is used as loop variable +in a for-statement. +All this information is collected in a \fIregister message\fR in +the EM code. +Such a message is a pseudo EM instruction. +This message also contains a \fIscore\fR field, +indicating how desirable it is to put this variable in a register. +A front end may assign a high score to a variable if it +was declared as a register variable (which is only possible in +some languages, such as "C"). +Any compiler phase before the code generator may change this score field, +if it has reason to do so. +The code generator bases its decisions on the information contained +in the register message, most notably on the score. +.PP +If the global optimizer is not used, +the score fields are set by the Peephole Optimizer. +This optimizer simply counts the number of occurrences +of every local (register) variable and adds this count +to the score provided by the front end. +In this way a simple, yet quite effective +register allocation scheme is achieved. +.NH 3 +The role of the Global Optimizer +.PP +The Global Optimizer essentially tries to improve the scheme +outlined above. +It uses the following principles for this purpose: +.IP - +Entities are not always assigned a register for the duration +of an entire procedure; smaller regions of the program text +may be considered too. +.IP - +several variables may be put in the same register simultaneously, +provided at most one of them is live at any point. +.IP - +besides local variables, other entities (such as constants and addresses of +variables and procedures) may be put in a register. +.IP - +more accurate cost estimates are used. +.LP +To perform its task, the optimizer must have some +knowledge of the target machine. 
+.NH 3
+The interface between the register allocator and the code generator
+.PP
+The RA phase of the optimizer must somehow be able to express its
+decisions.
+Such decisions may look like: 'put constant 1283 in a register from
+line 12 to line 40'.
+To be precise, RA must be able to tell the code generator to:
+.IP -
+initialize a register with some value
+.IP -
+update an entity from a register
+.IP -
+replace all occurrences of an entity in a certain region
+of text by a reference to the register.
+.LP
+At least three problems occur here: the code generator only puts
+local variables in registers,
+it only assigns a register to a variable for the duration of an entire
+procedure, and it is not accustomed to having some earlier compiler phase
+make all the decisions.
+.PP
+All problems are solved by one mechanism that involves no changes
+to the code generator.
+With every (non-scratch) register R that will be used in
+a procedure P, we associate a new variable T, local to P.
+The size of T is the same as the size of R.
+A register message is generated for T with an exceptionally high score.
+The scores of all original register messages are set to zero.
+Consequently, the code generator will always assign precisely those new
+variables to a register.
+If the optimizer wants to put some entity, say the constant 1283, in
+a register, it emits the code "T := 1283" and replaces all occurrences
+of '1283' by T.
+Similarly, it can put the address of a procedure in T and replace all
+calls to that procedure by indirect calls.
+Furthermore, it can put several different entities in T (and thus in R)
+during the lifetime of P.
+.PP
+In principle, the code generated by the optimizer in this way would
+always be valid EM code, even if the optimizer were presented
+with a totally wrong description of the target computer register set.
+In practice, it would be a waste of data as well as text space to
+allocate memory for these new variables, as they will always be assigned
+a register (in the normal course of events).
+Hence, no memory locations are allocated for them.
+For this reason they are called pseudo local variables. diff --git a/doc/ego/ra/ra3 b/doc/ego/ra/ra3 new file mode 100644 index 000000000..6ba296bd3 --- /dev/null +++ b/doc/ego/ra/ra3 @@ -0,0 +1,383 @@ +.NH 2
+The register allocation phase
+.NH 3
+Overview
+.PP
+The RA phase deals with one procedure at a time.
+For every procedure, it first determines which entities
+may be put in a register. Such an entity
+is called an \fIitem\fR.
+For every item it decides during which parts of the procedure it
+might be assigned a register.
+Such a region is called a \fItimespan\fR.
+For any item, several (possibly overlapping) timespans may
+be considered.
+A pair (item,timespan) is called an \fIallocation\fR.
+If the items of two allocations are both live at some
+point of time in the intersection of their timespans,
+these allocations are said to be \fIrivals\fR of each other,
+as they cannot be assigned the same register.
+The rivals-set of every allocation is computed.
+Next, the gains of assigning a register to an allocation are estimated
+for every allocation.
+With all this information, decisions are made which allocations
+to store in which registers (\fIpacking\fR).
+Finally, the EM text is transformed to reflect these decisions.
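+.PP
+The allocation list, the central data structure of RA, might be
+pictured in C as follows (a sketch only; the real declarations differ
+and all names are hypothetical):
+.DS
+struct allocation {
+	struct item	*al_item;	/* the item involved */
+	struct timespan	*al_span;	/* part of the procedure considered */
+	struct alloc_set *al_rivals;	/* allocations that are ever busy
+					   simultaneously with this one */
+	long		al_profits;	/* estimated gain of granting it
+					   a register */
+	int		al_regtype;	/* register type to be used */
+	struct allocation *al_next;
+};
+.DE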
+.NH 3
+The item recognition subphase
+.PP
+RA tries to put the following entities in a register:
+.IP -
+a local variable for which a register message was found
+.IP -
+the address of a local variable for which no
+register message was found
+.IP -
+the address of a global variable
+.IP -
+the address of a procedure
+.IP -
+a numeric constant.
+.LP
+Only the \fIaddress\fR of a global variable
+may be put in a register, not the variable itself.
+This approach avoids the very complex problems that would be
+caused by procedure calls and indirect pointer references (see
+.[~[
+aho design compiler
+.] sections 14.7 and 14.8]
+and
+.[~[
+spillman side-effects
+.]]).
+Still, on most machines accessing a global variable using indirect
+addressing through a register is much cheaper than
+accessing it via its address.
+Similarly, if the address of a procedure is put in a register, the
+procedure can be called via an indirect call.
+.PP
+With every item we associate a register type.
+This type is
+.DS
+for local variables: the type contained in the register message
+for addresses of variables and procedures: the pointer type
+for constants: the general type
+.DE
+An entity other than a local variable is not taken to be an item
+if it is used only once within the current procedure.
+.PP
+An item is said to be \fIlive\fR at some point of the program text
+if its value may be used before it is changed.
+As addresses and constants are never changed, all items but local
+variables are always live.
+The region of text during which a local variable is live is
+determined via the live/dead messages generated by the
+Live Variable analysis phase of the Global Optimizer.
+.NH 3
+The allocation determination subphase
+.PP
+If a procedure has more items than registers,
+it may be advantageous to put an item in a register
+only during those parts of the procedure where it is most
+heavily used.
+Such a part will be called a timespan.
+With every item we may associate a set of timespans.
+If two timespans of an item overlap,
+at most one of them may be granted a register,
+as there is no use in putting the same item in two
+registers simultaneously.
+If two timespans of an item are disjoint,
+both may be chosen;
+the item will possibly be put in two
+different registers during different parts of the procedure.
+The timespan may also consist
+of the whole procedure.
+.PP
+A list of (item,timespan) pairs (allocations)
+is built, which will be the input to the decision-making
+subphase of RA (the packing subphase).
+This allocation list is the main data structure of RA.
+The description of the remainder of RA will be in terms
+of allocations rather than items.
+The phrase "to assign a register to an allocation" means "to assign
+a register to the item of the allocation for the duration of
+the timespan of the allocation".
+Subsequent subphases will add more information
+to this list.
+.PP
+Several factors must be taken into account when a
+timespan for an item is constructed:
+.IP 1.
+At any \fIentry point\fR of the timespan where the
+item is live,
+the register must be initialized with the item
+.IP 2.
+At any exit point of the timespan where the item is live,
+the item must be updated.
+.LP
+In order to decrease these costs, we will only consider timespans with
+one entry point
+and no live exit points.
+.NH 3
+The rivals computation subphase
+.PP
+As stated before, several different items may be put in the
+same register, provided they are not live simultaneously.
+For every allocation we determine the intersection
+of its timespan and the lifetime of its item (i.e. the part of the
+procedure during which the item is live).
+The allocation is said to be busy during this intersection.
+If two allocations are ever busy simultaneously, they are
+said to be rivals of each other.
+The rivals information is added to the allocation list.
+.NH 3
+The profits computation subphase
+.PP
+To make good decisions, the packing subphase needs to
+know which allocations can be assigned the same register
+(rivals information) and how much is gained by
+granting an allocation a register.
+.PP
+Besides the gains of using a register instead of an
+item,
+two kinds of overhead costs must be
+taken into account:
+.IP -
+the register must be initialized with the item
+.IP -
+the register must be saved at procedure entry
+and restored at procedure exit.
+.LP
+The latter costs should not be attributed to a single
+allocation, as several allocations can be assigned the same register.
+These costs are dealt with after packing has been done.
+They do not influence the decisions of the packing algorithm,
+they may only undo them.
+.PP
+The actual profits consist of improvements
+of execution time and code size.
+As the former is far more difficult to estimate, we will
+discuss code size improvements first.
+.PP
+The gains of putting a certain item in a register
+depend on how the item is used.
+Suppose the item is
+a pointer variable.
+On machines that do not have a
+double-indirect addressing mode,
+two instructions are needed to dereference the variable
+if it is not in a register, but only one if it is put in a register.
+If the variable is not dereferenced, but simply copied, one instruction
+may be sufficient in both cases.
+So the gains of putting a pointer variable in a register are higher
+if the variable is dereferenced often.
+.PP
+To make accurate estimates, detailed knowledge of
+the target machine and of the code generator
+would be needed.
+Therefore, a simplification has been made that substantially limits
+the amount of target machine information that is needed.
+The estimation of the number of bytes saved does
+not take into account how an item is used.
+Rather, an average number is used.
+So these gains are computed as follows:
+.DS
+#bytes_saved = #occurrences * gains_per_occurrence
+.DE
+The number of occurrences is derived from
+the EM code.
+Note that this is not exact either,
+as there is no one-to-one correspondence between occurrences in
+the EM code and in the assembler code.
+.PP
+The gains of one occurrence depend on:
+.IP 1.
+the type of the item
+.IP 2.
+the size of the item
+.IP 3.
+the type of the register
+.LP
+and for local variables and addresses of local variables:
+.IP 4.
+the type of the local variable
+.IP 5.
+the offset of the variable in the stackframe
+.LP
+For every allocation we try two types of registers: the register type
+of the item and the general register type.
+Only the type with the highest profits will subsequently be used.
+This type is added to the allocation information.
+.PP
+To compute the gains, RA uses a machine-dependent table
+that is read from a machine descriptor file.
+By means of this table the number of bytes saved can be computed
+as a function of the five properties.
+.PP
+The cost of initializing a register with an item
+is determined in a similar way.
+The cost of one initialization is also
+obtained from the descriptor file.
+Note that there can be at most one initialization for any
+allocation.
+.PP
+To summarize, the number of bytes a certain allocation would
+save is computed as follows:
+.DS
+net_bytes_saved = bytes_saved - init_cost
+bytes_saved     = #occurrences * gains_per_occ
+init_cost       = #initializations * costs_per_init
+.DE
+.PP
+It is inherently more difficult to estimate the execution
+time saved by putting an item in a register,
+because it is impossible to predict how
+many times an item will be used dynamically.
+If an occurrence is part of a loop,
+it may be executed many times.
+If it is part of a conditional statement,
+it may never be executed at all.
+In the latter case, the speed of the program may even get
+worse if an initialization is needed.
+As a clear example, consider the piece of "C" code in Fig. 13.1.
+.DS
+switch(expr) {
+	case 1:  p(); break;
+	case 2:  p(); p(); break;
+	case 3:  p(); break;
+	default: break;
+}
+
+Fig. 13.1 A "C" switch statement
+.DE
+Lots of bytes may be saved by putting the address of procedure p
+in a register, as p is called four times (statically).
+Dynamically, p will be called zero, one or two times,
+depending on the value of the expression.
+.PP
+The optimizer uses the following strategy for optimizing
+execution time:
+.IP 1.
+try to put items in registers during \fIloops\fR first
+.IP 2.
+always keep the initializing code outside the loop
+.IP 3.
+if an item is not used in a loop, do not put it in a register if
+the initialization costs may be higher than the gains
+.LP
+The latter condition can be checked by determining the
+minimal number of usages (dynamically) of the item during the procedure,
+via a shortest path algorithm.
+In the example above, this minimal number is zero, so the address of
+p is not put in a register.
+.PP
+The cost of one occurrence is estimated as described above for the
+code size.
+The number of dynamic occurrences is guessed by looking at the
+loop nesting level of every occurrence.
+If the item is never used in a loop,
+the minimal number of occurrences is used.
+From these facts, the execution time improvement is assessed
+for every allocation.
+.NH 3
+The packing subphase
+.PP
+The packing subphase takes as input the allocation
+list and outputs a
+description of which allocations should be put
+in which registers.
+So it is essentially the decision-making part of RA.
+.PP
+The packing system tries to assign a register to allocations one
+at a time, in some yet to be defined order.
+For every allocation A, it first checks if there is a register
+(of the right type)
+that is already assigned to one or more allocations,
+none of which are rivals of A.
+In this case A is assigned the same register.
+Else, A is assigned a new register, if one exists.
+A table containing the number of free registers for every type
+is maintained.
+It is initialized with the number of non-scratch registers of
+the target computer and updated whenever a
+new register is handed out.
+The packing algorithm stops when no more allocations can
+or need be assigned a register.
+.PP
+After an allocation A has been packed,
+all allocations of the same item whose timespans overlap that
+of A (including A itself) are removed from the allocation list.
+.PP
+In case the number of items exceeds the number of registers, it
+is important to choose the most profitable allocations.
+Due to the possibility of having several allocations
+occupying the same register,
+this problem is quite complex.
+Our packing algorithm uses simple heuristic rules
+and avoids any combinatorial search.
+It has distinct rules for different cost measures.
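+.PP
+Apart from these ordering rules, which are discussed below, the
+core of the packing algorithm can be sketched in C as follows
+(all names are hypothetical):
+.DS
+void pack(struct allocation *list)
+{
+	struct allocation *a;
+	struct reg *r;
+
+	/* next_best embodies the ordering rules */
+	while ((a = next_best(list)) != 0) {
+		for (r = first_reg(a->al_regtype); r != 0; r = r->r_next)
+			if (!has_rival_of(r, a))
+				break;	/* no rival of a occupies r */
+		if (r == 0)
+			r = new_reg(a->al_regtype);	/* may fail */
+		if (r == 0)
+			continue;	/* no register available */
+		assign(a, r);
+		/* remove allocations of the same item with
+		 * overlapping timespans, including a itself */
+		remove_overlapping(list, a);
+	}
+}
+.DE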
+.PP
+If object code size is the most important factor,
+the algorithm is greedy and chooses allocations in
+decreasing order of their profits attribute.
+It does not take into account the fact that
+other allocations may be passed over because of
+this decision.
+.PP
+If execution time is the prime concern, the algorithm
+first considers allocations whose timespans consist of loops.
+After all these have been packed, it considers the remaining
+allocations.
+Within the two subclasses, it considers allocations
+with the highest profits first.
+When assigning a register to an allocation with a loop
+as timespan, the algorithm checks if the item has
+already been put in a register during another loop.
+If so, it tries to use the same register for the
+new allocation.
+After all packing has been done,
+it checks if the item has always been assigned the same
+register (although not necessarily during all loops).
+If so, it tries to put the item in that register during
+the entire procedure. This is possible
+if the allocation (item,whole_procedure) is not a rival
+of any allocation with a different item that has been
+assigned to the same register.
+Note that this approach is essentially 'bottom up',
+as registers are first assigned over small regions
+of text which are later collapsed into larger regions.
+The advantage of this approach is the fact that
+the decisions for one loop can be made independently
+of all other loops.
+.PP
+After the entire packing process has been completed,
+we compute for each register how much is gained in using
+this register, by simply adding the net profits
+of all allocations assigned to it.
+This total yield should outweigh the costs of
+saving/restoring the register at procedure entry/exit.
+As most modern processors (e.g. 68000, Vax) have special
+instructions to save/restore several registers,
+the differential costs of saving one extra register are by
+no means constant.
+The costs are read from the machine descriptor file and
+compared to the total yields of the registers.
+As a consequence of this analysis, some allocations
+may have their registers taken away.
+.NH 3
+The transformation subphase
+.PP
+The final subphase of RA transforms the EM text according to the
+decisions made by the packing system.
+It traverses the text of the currently optimized procedure and
+changes all occurrences of items at points where
+they are assigned a register.
+It also clears the score field of the register messages for
+normal local variables and emits register messages with a very
+high score for the pseudo locals.
+At points where registers have to be initialized with items,
+it generates EM code to do so.
+Finally, it tries to decrease the size of the stackframe
+of the procedure by looking at which local variables need not
+be given memory locations. diff --git a/doc/ego/ra/ra4 b/doc/ego/ra/ra4 new file mode 100644 index 000000000..4bfeef74a --- /dev/null +++ b/doc/ego/ra/ra4 @@ -0,0 +1,28 @@ +.NH 2
+Source files of RA
+.PP
+The sources of RA are in the following files and packages:
+.IP ra.h: 14
+declarations of global variables and data structures
+.IP ra.c:
+the routine main; initialization of target machine-dependent tables
+.IP items:
+a routine to build the list of items of one procedure;
+routines to manipulate items
+.IP lifetime:
+contains a subroutine that determines when items are live/dead
+.IP alloclist:
+contains subroutines that build the initial allocations list
+and that compute the rivals sets.
+.IP profits:
+contains a subroutine that computes the profits of the allocations
+and a routine that determines the costs of saving/restoring registers
+.IP pack:
+contains the packing subphase
+.IP xform:
+contains the transformation subphase
+.IP interval:
+contains routines to manipulate intervals of time
+.IP aux:
+contains auxiliary routines
+.LP diff --git a/doc/ego/sp/sp1 b/doc/ego/sp/sp1 new file mode 100644 index 000000000..20c633f8a --- /dev/null +++ b/doc/ego/sp/sp1 @@ -0,0 +1,171 @@ +.bp
+.NH 1
+Stack pollution
+.NH 2
+Introduction
+.PP
+The "Stack Pollution" optimization technique (SP) decreases the costs
+(time as well as space) of procedure calls.
+In the EM calling sequence, the actual parameters are popped from
+the stack by the \fIcalling\fR procedure.
+The ASP (Adjust Stack Pointer) instruction is used for this purpose.
+A call in EM is shown in Fig. 8.1.
+.DS
+Pascal:     EM:
+
+f(a,2)      LOC 2
+            LOE A
+            CAL F
+            ASP 4   -- pop 4 bytes
+
+Fig. 8.1 An example procedure call in Pascal and EM
+.DE
+As procedure calls occur often in most programs,
+the ASP is one of the most frequently used EM instructions.
+.PP
+The main intention of removing the actual parameters after a procedure call
+is to keep the stack from growing rapidly.
+Yet, in some cases, it is possible to \fIdelay\fR or even \fIavoid\fR the
+removal of the parameters without letting the stack grow
+significantly.
+In this way, considerable savings in code size and execution time may
+be achieved, at the cost of a slightly increased stack size.
+.PP
+A stack adjustment may be delayed if there is some other stack adjustment
+later on in the same basic block.
+The two ASPs can be combined into one.
+.DS
+Pascal:     EM:        optimized EM:
+
+f(a,2)      LOC 2      LOC 2
+g(3,b,c)    LOE A      LOE A
+            CAL F      CAL F
+            ASP 4      LOE C
+            LOE C      LOE B
+            LOE B      LOC 3
+            LOC 3      CAL G
+            CAL G      ASP 10
+            ASP 6
+
+Fig. 8.2 An example of local Stack Pollution
+.DE
+The stack size will be increased only temporarily.
+If the basic block contains another ASP, the ASP 10 may subsequently be
+combined with that next ASP, and so on.
+.PP
+For some back ends, a stack adjustment also takes place
+at the point of a procedure return.
+There is no need to specify the number of bytes to be popped at a
+return.
+This provides an opportunity to remove ASPs more globally.
+If all ASPs outside any loop are removed, the increase of the
+stack size will still only be small, as no such ASP is executed more
+than once without an intervening return from the procedure it is part of.
+.PP
+This second approach is not generally applicable to all target machines,
+as some back ends require the stack to be cleaned up at the point of
+a procedure return.
+.NH 2
+Implementation
+.PP
+There is one main problem the implementation has to solve.
+In EM, the stack is not only used for passing parameters,
+but also for evaluating expressions.
+Hence, ASP instructions can only be combined or removed
+if certain conditions are satisfied.
+.PP
+Two consecutive ASPs of one basic block can only be combined
+(as described above) if:
+.IP 1.
+At no point of the text in between the two ASPs is any item popped from
+the stack that was pushed onto it before the first ASP.
+.IP 2.
+The number of bytes popped from the stack by the second ASP must equal
+the number of bytes pushed since the first ASP.
+.LP
+Condition 1. is not satisfied in Fig. 8.3.
+.DS
+Pascal:              EM:
+
+5 + f(10) + g(30)    LOC 5
+                     LOC 10
+                     CAL F
+                     ASP 2  -- cannot be removed
+                     LFR 2  -- push function result
+                     ADI 2
+                     LOC 30
+                     CAL G
+                     ASP 2
+                     LFR 2
+                     ADI 2
+
+Fig. 8.3 An illegal transformation
+.DE
+If the first ASP were removed (delayed), the first ADI would add
+10 and f(10), instead of 5 and f(10).
+.sp
+Condition 2. is not satisfied in Fig. 8.4.
+.DS
+Pascal:              EM:
+
+f(10) + 5 * g(30)    LOC 10
+                     CAL F
+                     ASP 2
+                     LFR 2
+                     LOC 5
+                     LOC 30
+                     CAL G
+                     ASP 2
+                     LFR 2
+                     MLI 2  -- 5 * g(30)
+                     ADI 2
+
+Fig. 8.4 A second illegal transformation
+.DE
+If the two ASPs were combined into one 'ASP 4', the constant 5 would
+have been popped, rather than the parameter 10 (so '10 + f(10)*g(30)'
+would have been computed).
+.PP
+The second approach to deleting ASPs (i.e. let the procedure return
+do the stack clean-up)
+is only applied to the last ASP of every basic block.
+Any preceding ASPs are dealt with by the first approach.
+The last ASP of a basic block B will only be removed if:
+.IP -
+on no path in the control flow graph from B to any block containing a
+RET (return) there is a basic block that, at some point of its text, pops
+items from the stack that it has not itself pushed earlier.
+.LP
+Clearly, if this condition is satisfied, no harm can be done; no
+other basic block will ever access items that were pushed
+on the stack before the ASP.
+.PP
+The number of bytes pushed onto or popped from the stack can be
+easily encoded in a so-called "pop-push table".
+The numbers in general depend on the target machine wordsize and
+pointersize and on the argument given to the instruction.
+For example, an ADS instruction is described by:
+.DS
+	-a-p+p
+.DE
+which means: an 'ADS n' first pops an n-byte value (n being the argument),
+next pops a pointer-size value and finally pushes a pointer-size value.
+For some infrequently used EM instructions the pop-push numbers
+cannot be computed statically.
+.PP
+The stack pollution algorithm first performs a depth-first search over
+the control flow graph and marks all blocks that do not satisfy
+the global condition.
+Next it visits all basic blocks in turn.
+For every pair of adjacent ASPs, it checks conditions 1. and 2. and
+combines the ASPs if they are satisfied.
+The new ASP may be used as first ASP in the next pair.
+If a condition fails, it simply continues with the next ASP.
+Finally, the last ASP is removed if:
+.IP -
+nothing has been popped from the stack after the last ASP that was
+pushed before it
+.IP -
+the block was not marked by the depth-first search
+.IP -
+the block is not in a loop
+.LP diff --git a/doc/ego/sr/sr1 b/doc/ego/sr/sr1 new file mode 100644 index 000000000..cc8f660e4 --- /dev/null +++ b/doc/ego/sr/sr1 @@ -0,0 +1,44 @@ +.bp
+.NH 1
+Strength reduction
+.NH 2
+Introduction
+.PP
+The Strength Reduction optimization technique (SR)
+tries to replace expensive operators
+by cheaper ones,
+in order to decrease the execution time
+of the program.
+A classical example is replacing a 'multiplication by 2'
+by an addition or a shift instruction.
+These kinds of local transformations are already
+done by the EM Peephole Optimizer.
+Strength reduction can also be applied
+more generally to operators used in a loop.
+.DS
+i := 1;                    i := 1;
+while i < 100 loop   -->   TMP := i * 118;
+  put(i * 118);            while i < 100 loop
+  i := i + 1;                put(TMP);
+end loop;                    i := i + 1;
+                             TMP := TMP + 118;
+                           end loop;
+
+Fig. 6.1 An example of Strength Reduction
+.DE
+In Fig.
6.1, a multiplication inside a loop is
+replaced by an addition inside the loop and a multiplication
+outside the loop.
+Clearly, this is a global optimization; it cannot
+be done by a peephole optimizer.
+.PP
+In some cases a related technique, \fItest replacement\fR,
+can be used to eliminate the
+loop variable i.
+This technique will not be discussed in this report.
+.sp 0
+In the example above, the resulting code
+can be further optimized by using
+constant propagation.
+Obviously, this is not the task of the
+Strength Reduction phase. diff --git a/doc/ego/sr/sr2 b/doc/ego/sr/sr2 new file mode 100644 index 000000000..c3000f93e --- /dev/null +++ b/doc/ego/sr/sr2 @@ -0,0 +1,217 @@ +.NH 2
+The model of strength reduction
+.PP
+In this section we will describe
+the transformations performed by
+Strength Reduction (SR).
+Before doing so, we will introduce the
+central notion of an induction variable.
+.NH 3
+Induction variables
+.PP
+SR looks for variables whose
+values form an arithmetic progression
+at the beginning of a loop.
+These variables are called induction variables.
+The most frequently occurring example of such
+a variable is a loop-variable in a high-level
+programming language.
+Several quite sophisticated models of strength
+reduction can be found in the literature.
+.[
+cocke reduction strength cacm
+.]
+.[
+allen cocke kennedy reduction strength
+.]
+.[
+lowry medlock cacm
+.]
+.[
+aho compiler design
+.]
+In these models the notion of an induction variable
+is far more general than the intuitive notion
+of a loop-variable.
+The definition of an induction variable we present here
+is more restricted,
+yielding a simpler model and simpler transformations.
+We think the principal source for strength reduction lies in
+expressions using a loop-variable,
+i.e. a variable that is incremented or decremented
+by the same amount after every loop iteration,
+and that cannot be changed in any other way.
+.PP
+Of course, the EM code does not contain high level constructs
+such as for-statements.
+We will define an induction variable in terms
+of the Intermediate Code of the optimizer.
+Note that the notions of a loop in the
+EM text and of a firm basic block
+were defined in section 3.3.5.
+.sp
+.UL definition
+.sp 0
+An induction variable i of a loop L is a local variable
+that is never accessed indirectly,
+whose size is the word size of the target machine, and
+that is assigned exactly once within L,
+the assignment:
+.IP -
+being of the form i := i + c or i := c + i, where c is a constant
+called the \fIstep value\fR of i.
+.IP -
+occurring in a firm block of L.
+.LP
+(Note that the first restriction on the assignment
+is not described in terms of the Intermediate Code;
+we will give such a description later; the current
+definition is easier to understand however).
+.NH 3
+Recognized expressions
+.PP
+SR recognizes certain expressions using
+an induction variable and replaces
+them by cheaper ones.
+Two kinds of expensive operations are recognized:
+multiplication and array address computations.
+The expressions that are simplified must
+use an induction variable
+as an operand of
+a multiplication or as index in an array expression.
+.PP
+Often a linear function of an induction variable is used,
+rather than the variable itself.
+In these cases optimization is still possible.
+We call such expressions \fIiv-expressions\fR.
+.sp
+.UL definition:
+.sp 0
+An iv-expression of an induction variable i of a loop L is
+an expression that:
+.IP -
+uses only the operators + and - (unary as well as binary)
+.IP -
+uses i as operand exactly once
+.IP -
+uses (besides i) only constants or variables that are
+never changed in L as operands.
+.LP
+.PP
+The expressions recognized by SR are of the following forms:
+.IP (1)
+iv-expression * constant
+.IP (2)
+constant * iv-expression
+.IP (3)
+A[iv-expression] := (assign to array element)
+.IP (4)
+A[iv-expression] (use array element)
+.IP (5)
+& A[iv-expression] (take address of array element)
+.LP
+(Note that EM has different instructions to use an array element,
+store into one, or take the address of one, resp. LAR, SAR, and AAR).
+.sp 0
+The size of the elements of A must
+be known statically.
+In cases (3) and (4) this size
+must equal the word size of the
+target machine.
+.NH 3
+Transformations
+.PP
+With every recognized expression we associate
+a new temporary local variable TMP,
+allocated in the stack frame of the
+procedure containing the expression.
+At any program point within the loop, TMP will
+contain the following value:
+.IP multiplication: 18
+the current value of iv-expression * constant
+.IP arrays:
+the current value of &A[iv-expression].
+.LP
+In the second case, TMP is essentially a pointer variable,
+pointing to the element of A that is currently in use.
+.sp 0
+If the same expression occurs several times in the loop,
+the same temporary local is used each time.
+.PP
+Three transformations are applied to the EM text:
+.IP (1)
+TMP is initialized with the right value.
+This initialization takes place just
+before the loop.
+.IP (2)
+The recognized expression is simplified.
+.IP (3)
+TMP is incremented; this takes place just
+after the induction variable is incremented.
+.LP
+For multiplication, the initial value of TMP
+is the value of the recognized expression at
+the program point immediately before the loop.
+For arrays, TMP is initialized with the address
+of the first array element that is accessed.
+So the initialization code is:
+.DS
+TMP := iv-expression * constant; or
+TMP := &A[iv-expression]
+.DE
+At the point immediately before the loop,
+the induction variable will already have been
+initialized,
+so the value used in the code above will be the
+value it has during the first iteration.
+.PP
+For multiplication, the recognized expression can simply be
+replaced by TMP.
+For array optimizations, the replacement
+depends on the form:
+.DS
+\fIform\fR              \fIreplacement\fR
+(3) A[iv-expr] :=       *TMP :=   (assign indirect)
+(4) A[iv-expr]          *TMP      (use indirect)
+(5) &A[iv-expr]         TMP
+.DE
+The '*' denotes the indirect operator. (Note that
+EM has different instructions to do
+an assign-indirect and a use-indirect).
+As the size of the array elements is restricted
+to be the word size in case (3) and (4),
+only one EM instruction needs to
+be generated in all cases.
+.PP
+The amount by which TMP is incremented is:
+.IP multiplication: 18
+step value * constant
+.IP arrays:
+step value * element size
+.LP
+Note that the step value (see definition of induction variable above),
+the constant, and the element size (see previous section) can all
+be determined statically.
+If the sign of the induction variable in the
+iv-expression is negative, the amount
+must be negated.
+.PP
+The transformations are demonstrated by an example.
+.DS
+i := 100;                    i := 100;
+while i > 1 loop             TMP := (6-i) * 5;
+  X := (6-i) * 5 + 2;        while i > 1 loop
+  Y := (6-i) * 5 - 8;  -->     X := TMP + 2;
+  i := i - 3;                  Y := TMP - 8;
+end loop;                      i := i - 3;
+                               TMP := TMP + 15;
+                             end loop;
+
+Fig. 6.2 Example of complex Strength Reduction transformations
+.DE
+The expression '(6-i)*5' is recognized twice. The constant
+is 5.
+The step value is -3.
+The sign of i in the recognized expression is '-'.
+So the increment value of TMP is -(-3*5) = +15. diff --git a/doc/ego/sr/sr3 b/doc/ego/sr/sr3 new file mode 100644 index 000000000..12dbcff7a --- /dev/null +++ b/doc/ego/sr/sr3 @@ -0,0 +1,232 @@ +.NH 2
+Implementation
+.PP
+Like most phases, SR deals with one procedure
+at a time.
+Within a procedure, SR works on one loop at a time.
+Loops are processed in textual order.
+If loops are nested inside each other,
+SR starts with the outermost loop and proceeds in the
+inwards direction.
+This order is chosen because it enables
+the optimization
+of multi-dimensional array address computations,
+if the elements are accessed in the usual way
+(i.e. row after row, rather than column after column).
+For every loop, SR first detects all induction variables
+and then tries to recognize
+expressions that can be optimized.
+.NH 3
+Finding induction variables
+.PP
+The process of finding induction variables
+can conveniently be split up
+into two parts.
+First, the EM text of the loop is scanned to find
+all \fIcandidate\fR induction variables,
+which are word-sized local variables
+that are assigned precisely once
+in the loop, within a firm block.
+Second, for every candidate, the single assignment
+is inspected, to see if it has the form
+required by the definition of an induction variable.
+.PP
+Candidates are found by scanning the EM code of the loop.
+During this scan, two sets are maintained.
+The set "cand" contains all variables that were
+assigned exactly once so far, within a firm block.
+The set "dismiss" contains all variables that
+should not be made a candidate.
+Initially, both sets are empty.
+If a variable is assigned to, it is put
+in the cand set if three conditions are met:
+.IP 1.
+the variable was not in cand or dismiss already
+.IP 2.
+the assignment takes place in a firm block
+.IP 3.
+the assignment is not a ZRL instruction (assignment of zero)
+or a SDL instruction (store double local).
+.LP
+If any condition fails, the variable is dismissed from cand
+(if it was there already) and put in dismiss
+(if it was not there already).
+.sp 0
+All variables for which no register message was generated (i.e. those
+variables that may be accessed indirectly) are assumed
+to be changed in the loop.
+.sp 0
+All variables that remain in cand are candidate induction variables.
+.PP
+From the set of candidates, the induction variables can
+be determined, by inspecting the single assignment.
+The assignment must match one of the EM patterns below.
+('x' is the candidate. 'ws' is the word size of the target machine.
+'n' is any number.)
+.DS
+\fIpattern\fR                                    \fIstep size\fR
+INL x                                            +1
+DEL x                                            -1
+LOL x ; (INC | DEC) ; STL x                      +1 | -1
+LOL x ; LOC n ; (ADI ws | SBI ws) ; STL x        +n | -n
+LOC n ; LOL x ; ADI ws ; STL x                   +n
+.DE
+From the patterns the step size of the induction variable
+can also be determined.
+These step sizes are displayed on the right hand side.
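+.PP
+The candidate scan described above might be sketched in C as follows
+(a sketch only; the set operations and the instruction fields are
+hypothetical):
+.DS
+struct instr *i;
+int v;
+
+for (i = loop->first_instr; i != 0; i = i->i_next) {
+	if ((v = stored_local(i)) < 0)
+		continue;	/* not a store to a local variable */
+	if (!in_set(cand, v) && !in_set(dismiss, v) &&
+	    is_firm(i->i_block) &&
+	    i->i_opcode != op_zrl && i->i_opcode != op_sdl) {
+		add_set(cand, v);	/* first assignment seen so far */
+	} else {
+		del_set(cand, v);	/* dismiss the variable */
+		add_set(dismiss, v);
+	}
+}
+.DE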
+.sp
+For every induction variable we maintain the following information:
+.IP -
+the offset of the variable in the stack frame of its procedure
+.IP -
+a pointer to the EM text of the assignment statement
+.IP -
+the step value
+.LP
+.NH 3
+Optimizing expressions
+.PP
+If any induction variables of the loop were found,
+the EM text of the loop is scanned again,
+to detect expressions that can be optimized.
+SR scans for multiplication and array instructions.
+Whenever it finds such an instruction, it analyses the
+code in front of it.
+If an expression is to be optimized, it must
+be generated by the following syntax rules.
+.DS
+        optimizable_expr:
+                iv_expr const mult |
+                const iv_expr mult |
+                address iv_expr address array_instr;
+        mult:
+                MLI ws |
+                MLU ws ;
+        array_instr:
+                LAR ws |
+                SAR ws |
+                AAR ws ;
+        const:
+                LOC n ;
+.DE
+An 'address' is an EM instruction that loads an
+address on the stack.
+An instruction like LOL may be an 'address' if
+the size of an address (pointer size, ps) is
+the same as the word size.
+If the pointer size is twice the word size,
+instructions like LDL are an 'address'.
+(The addresses in the third grammar rule
+denote the array address and the
+array descriptor address, respectively.)
+.DS
+        address:
+                LAE |
+                LAL |
+                LOL     if ps=ws |
+                LOE     ,, |
+                LIL     ,, |
+                LDL     if ps=2*ws |
+                LDE     ,, ;
+.DE
+The notion of an iv-expression was introduced earlier.
+.DS
+        iv_expr:
+                iv_expr unary_op |
+                iv_expr iv_expr binary_op |
+                loopconst |
+                iv ;
+        unary_op:
+                NGI ws |
+                INC |
+                DEC ;
+        binary_op:
+                ADI ws |
+                ADU ws |
+                SBI ws |
+                SBU ws ;
+        loopconst:
+                const |
+                LOL x   if x is not changed in loop ;
+        iv:
+                LOL x   if x is an induction variable ;
+.DE
+An iv-expression must satisfy one additional constraint:
+it must use exactly one operand that is an induction
+variable.
+A simple, hand-written, top-down parser is used
+to recognize an iv-expression.
+It scans the EM code from right to left
+(recall that EM is essentially postfix).
+It uses semantic attributes (inherited as well as
+derived) to check the additional constraint.
+.PP
+All information assembled during the recognition
+process is put in a 'code_info' structure.
+This structure contains the following information:
+.IP -
+the optimizable code itself
+.IP -
+the loop and basic block the code is part of
+.IP -
+the induction variable
+.IP -
+the iv-expression
+.IP -
+the sign of the induction variable in the
+iv-expression
+.IP -
+the offset and size of the temporary local variable
+.IP -
+the expensive operator (MLI, LAR, etc.)
+.IP -
+the instruction that loads the constant
+(for multiplication) or the array descriptor
+(for arrays).
+.LP
+The entire transformation process is driven
+by this information.
+As the EM text is represented internally
+as a list, this process consists
+mainly of straightforward list manipulations.
+.sp 0
+The initialization code must be put
+immediately before the loop entry.
+For this purpose a \fIheader block\fR is
+created that has the loop entry block as
+its only successor and that dominates the
+entry block.
+The CFG and all relations (SUCC, PRED, IDOM, LOOPS, etc.)
+are updated.
+.sp 0
+An EM instruction that will
+replace the optimizable code
+is created and put at the place of the old code.
+The list representing the old optimizable code
+is used to create a list for the initializing code,
+as they are similar.
+Only two modifications are required
+(a sketch is given below):
+.IP -
+if the expensive operator is a LAR or SAR,
+it must be replaced by an AAR, as the initial value
+of TMP is the \fIaddress\fR of the first
+array element that is accessed.
+.IP -
+code must be appended to store the result of the
+expression in TMP.
+.LP
+Finally, code to increment TMP is created and put after
+the code of the single assignment to the
+induction variable.
+The generated code uses either an integer addition
+(ADI) or an integer-to-pointer addition (ADS)
+to do the increment.
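+.PP
+The sketch below shows how the initializing code could be derived
+from the recognized code.
+It is illustrative only: the list representation and all names
+(make_init_code, copy_list, etc.) are invented here; the actual
+routines live in the xform package and differ in detail.
+.DS
+/* Sketch: build the initializing code from the optimizable code. */
+#include <stdlib.h>
+
+enum opcode { LAR, SAR, AAR, STL /* ... */ };
+
+struct instr {
+    enum opcode op;
+    int offset;                /* operand, e.g. a local's offset */
+    struct instr *next;
+};
+
+static struct instr *copy_list(struct instr *l)
+{
+    struct instr *head = NULL, **tail = &head;
+
+    for (; l != NULL; l = l->next) {
+        *tail = malloc(sizeof(**tail));
+        **tail = *l;
+        (*tail)->next = NULL;
+        tail = &(*tail)->next;
+    }
+    return head;
+}
+
+struct instr *make_init_code(struct instr *code, int tmp_offset)
+{
+    struct instr *init = copy_list(code), *p, *last = NULL, *store;
+
+    for (p = init; p != NULL; p = p->next) {
+        if (p->op == LAR || p->op == SAR)
+            p->op = AAR;         /* modification 1: take the address */
+        last = p;
+    }
+    store = malloc(sizeof(*store));  /* modification 2: store the    */
+    store->op = STL;                 /* result in TMP (SDL would be  */
+    store->offset = tmp_offset;      /* needed if TMP is pointer-    */
+    store->next = NULL;              /* sized and ps = 2*ws)         */
+    if (last != NULL)
+        last->next = store;
+    else
+        init = store;
+    return init;
+}
+.DE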
+.PP
+SR maintains a set of all expressions that have already
+been recognized in the present loop.
+Such expressions are said to be \fIavailable\fR.
+If an expression is recognized that is
+already available,
+no new temporary local variable is allocated for it,
+and the code to initialize and increment the local
+is not generated.
diff --git a/doc/ego/sr/sr4 b/doc/ego/sr/sr4
new file mode 100644
index 000000000..ae8764378
--- /dev/null
+++ b/doc/ego/sr/sr4
@@ -0,0 +1,28 @@
+.NH 2
+Source files of SR
+.PP
+The sources of SR are in the following files
+and packages:
+.IP sr.h: 14
+declarations of global variables and
+data structures
+.IP sr.c:
+the routine main; a driving routine to process
+(possibly nested) loops in the right order
+.IP iv:
+implements a procedure that finds the induction variables
+of a loop
+.IP reduce:
+implements a procedure that finds optimizable expressions
+and that does the transformations
+.IP cand:
+implements a procedure that finds the candidate induction
+variables; used to implement iv
+.IP xform:
+implements several useful routines that transform
+lists of EM text or a CFG; used to implement reduce
+.IP expr:
+implements a procedure that parses iv-expressions
+.IP aux:
+implements several auxiliary procedures.
+.LP
diff --git a/doc/ego/ud/ud1 b/doc/ego/ud/ud1
new file mode 100644
index 000000000..8f2a12f53
--- /dev/null
+++ b/doc/ego/ud/ud1
@@ -0,0 +1,58 @@
+.bp
+.NH 1
+Use-Definition analysis
+.NH 2
+Introduction
+.PP
+The "Use-Definition analysis" phase (UD) consists of two related optimization
+techniques that both depend on "Use-Definition" information.
+The techniques are Copy Propagation and Constant Propagation.
+They are best explained via an example (see Figs. 11.1 and 11.2).
+.DS
+(1)  A := B               A := B
+     ...          -->     ...
+(2)  use(A)               use(B)
+
+Fig. 11.1 An example of Copy Propagation
+.DE
+.DS
+(1)  A := 12              A := 12
+     ...          -->     ...
+(2)  use(A)               use(12)
+
+Fig. 11.2 An example of Constant Propagation
+.DE
+Both optimizations have to check that the value of A at line (2)
+can only be obtained at line (1).
+Copy Propagation also has to ensure that the value of B is
+the same at line (1) as at line (2).
+.PP
+One purpose of both transformations is to introduce
+opportunities for the Dead Code Elimination optimization.
+If the variable A is used nowhere else, the assignment A := B
+becomes useless and can be eliminated.
+.sp 0
+If B is less expensive to access than A (as is sometimes the case
+if A is a local variable and B is a global variable),
+Copy Propagation directly improves the code itself.
+If A is cheaper to access, the transformation is not performed.
+Likewise, a constant as operand may be cheaper than a variable.
+Having a constant as operand may also facilitate other optimizations.
+.PP
+The design of UD is based on the theory described in
+sections 14.1 and 14.3 of
+.[
+aho compiler design
+.]
+As a main departure from that theory,
+we do not demand that the statement A := B become redundant after
+Copy Propagation.
+If B is cheaper to access than A, the optimization is always performed;
+if B is more expensive than A, we never do the transformation.
+If A and B are equally expensive, UD uses the heuristic rule of
+replacing infrequently used variables by frequently used ones.
+This rule increases the chance that the assignment becomes useless.
+.PP
+In the next section we will give a brief outline of the data
+flow theory used
+for the implementation of UD.
diff --git a/doc/ego/ud/ud2 b/doc/ego/ud/ud2
new file mode 100644
index 000000000..21174f459
--- /dev/null
+++ b/doc/ego/ud/ud2
@@ -0,0 +1,64 @@
+.NH 2
+Data flow information
+.NH 3
+Use-Definition information
+.PP
+A \fIdefinition\fR of a variable A is an assignment to A.
+A definition is said to \fIreach\fR a point p if there is a
+path in the control flow graph from the definition to p, such that
+A is not redefined on that path.
+.PP
+For every basic block b, we define the following sets:
+.IP GEN[b] 9
+the set of definitions in b that reach the end of b.
+.IP KILL[b]
+the set of definitions outside b that define a variable that
+is changed in b.
+.IP IN[b]
+the set of all definitions reaching the beginning of b.
+.IP OUT[b]
+the set of all definitions reaching the end of b.
+.LP
+GEN and KILL can be determined by inspecting the code of the procedure.
+IN and OUT are computed by solving the following data flow equations:
+.DS
+(1) OUT[b] = IN[b] - KILL[b] + GEN[b]
+(2) IN[b]  = OUT[p1] + ... + OUT[pn],
+       where PRED(b) = {p1, ... , pn}
+.DE
+.NH 3
+Copy information
+.PP
+A \fIcopy\fR is a definition of the form "A := B".
+A copy is said to be \fIgenerated\fR in a basic block n if
+it occurs in n and there is no subsequent assignment to B in n.
+A copy is said to be \fIkilled\fR in n if:
+.IP (i)
+it occurs in n and there is a subsequent assignment to B within n, or
+.IP (ii)
+it occurs outside n, the definition A := B reaches the beginning of n,
+and B is changed in n (note that a copy is also a definition).
+.LP
+A copy \fIreaches\fR a point p if there are no assignments to B
+on any path in the control flow graph from the copy to p.
+.PP
+We define the following sets:
+.IP C_GEN[b] 11
+the set of all copies generated in b.
+.IP C_KILL[b]
+the set of all copies killed in b.
+.IP C_IN[b]
+the set of all copies reaching the beginning of b.
+.IP C_OUT[b]
+the set of all copies reaching the end of b.
+.LP
+C_IN and C_OUT are computed by solving the following equations
+(root is the entry node of the current procedure; '*' denotes
+set intersection):
+.DS
+(1) C_OUT[b] = C_IN[b] - C_KILL[b] + C_GEN[b]
+(2) C_IN[b]  = C_OUT[p1] * ... * C_OUT[pn],
+       where PRED(b) = {p1, ... , pn} and b /= root
+    C_IN[root] = {all copies}
+.DE
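+.PP
+Both systems of equations can be solved by the same straightforward
+iterative algorithm; the implementation (see below) represents all
+sets as bitvectors.
+The sketch below shows the idea for the use-definition equations.
+It is illustrative only: the names and the fixed table sizes are
+invented for this example.
+.DS
+/* Sketch: iterative solution of the use-definition equations. */
+#include <string.h>
+
+#define NBLOCKS 64
+#define NDEFS   256
+#define NWORDS  (NDEFS / 32)
+
+typedef unsigned int bitvec[NWORDS];
+
+bitvec gen[NBLOCKS], kill[NBLOCKS], in[NBLOCKS], out[NBLOCKS];
+int nblocks, npred[NBLOCKS], pred[NBLOCKS][8];  /* at most 8 preds here */
+
+void solve(void)
+{
+    int b, i, w, change = 1;
+
+    while (change) {                    /* iterate until a fixed point */
+        change = 0;
+        for (b = 0; b < nblocks; b++) {
+            bitvec newin, newout;
+
+            /* (2) IN[b] = OUT[p1] + ... + OUT[pn] */
+            memset(newin, 0, sizeof(newin));
+            for (i = 0; i < npred[b]; i++)
+                for (w = 0; w < NWORDS; w++)
+                    newin[w] |= out[pred[b][i]][w];
+
+            /* (1) OUT[b] = IN[b] - KILL[b] + GEN[b] */
+            for (w = 0; w < NWORDS; w++)
+                newout[w] = (newin[w] & ~kill[b][w]) | gen[b][w];
+
+            if (memcmp(newout, out[b], sizeof(newout)) != 0)
+                change = 1;
+            memcpy(in[b], newin, sizeof(newin));
+            memcpy(out[b], newout, sizeof(newout));
+        }
+    }
+}
+.DE
+For the copy equations the union over the predecessors becomes an
+intersection, and the iteration must start with 'full' C_OUT sets so
+that it converges to the largest solution.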
diff --git a/doc/ego/ud/ud3 b/doc/ego/ud/ud3
new file mode 100644
index 000000000..99bf2a036
--- /dev/null
+++ b/doc/ego/ud/ud3
@@ -0,0 +1,26 @@
+.NH 2
+Pointers and subroutine calls
+.PP
+The theory outlined above assumes that variables can
+only be changed by a direct assignment.
+This condition does not hold for EM.
+In case of an assignment through a pointer variable,
+it is in general impossible to see which variable is affected
+by the assignment.
+Similar problems occur in the presence of procedure calls.
+Therefore we distinguish two kinds of definitions:
+.IP -
+an \fIexplicit\fR definition is a direct assignment to one
+specific variable
+.IP -
+an \fIimplicit\fR definition is the potential alteration of
+a variable as a result of a procedure call or an indirect assignment.
+.LP
+An indirect assignment causes implicit definitions of
+all variables that may be accessed indirectly, i.e.
+all local variables for which no register message was generated
+and all global variables.
+If a called procedure contains an indirect assignment, a call to it
+may change the same set of variables; otherwise the call may only
+change global variables directly.
+The KILL, GEN, IN and OUT sets contain explicit as well
+as implicit definitions.
diff --git a/doc/ego/ud/ud4 b/doc/ego/ud/ud4
new file mode 100644
index 000000000..c31ad64b2
--- /dev/null
+++ b/doc/ego/ud/ud4
@@ -0,0 +1,78 @@
+.NH 2
+Implementation
+.PP
+UD first builds a number of tables:
+.IP locals: 9
+contains information about the local variables of the
+current procedure (offset, size, whether a register message was
+found for it and, if so, the score field of that message)
+.IP defs:
+a table of all explicit definitions appearing in the
+current procedure.
+.IP copies:
+a table of all copies appearing in the
+current procedure.
+.LP
+Every variable (local as well as global), definition and copy
+is identified by a unique number, which is its index
+in the table.
+All tables are constructed by traversing the EM code.
+A fourth table, "vardefs", indexed by variable number,
+contains for every variable the set of its explicit definitions.
+Also, for each basic block b, the set CHGVARS[b] of all variables
+changed by b is computed.
+.PP
+The GEN sets are obtained in one scan over the EM text,
+by analyzing every EM instruction.
+The KILL set of a basic block b is computed by looking at the
+set of variables
+changed by b (i.e. CHGVARS[b]).
+For every such variable v, all explicit definitions of v
+(i.e. vardefs[v]) that are not in GEN[b] are added to KILL[b].
+Also, the implicit definition of v is added to KILL[b].
+Next, the data flow equations for use-definition information
+are solved,
+using a straightforward, iterative algorithm.
+All sets are represented as bitvectors, so the operations
+on sets (union, difference) can be implemented efficiently.
+.PP
+The C_GEN and C_KILL sets are computed simultaneously in one scan
+over the EM text.
+For every copy A := B appearing in basic block b we do
+the following:
+.IP 1.
+for every basic block n /= b that changes B, see if the definition A := B
+reaches the beginning of n (i.e. check if the index number of A := B in
+the "defs" table is an element of IN[n]);
+if so, add the copy to C_KILL[n]
+.IP 2.
+if B is redefined later on in b, add the copy to C_KILL[b], else
+add it to C_GEN[b]
+.LP
+C_IN and C_OUT are computed from C_GEN and C_KILL via the second set of
+data flow equations.
+.PP
+Finally, in one last scan all opportunities for optimization are
+detected.
+For every use u of a variable A, we check if
+there is a unique explicit definition d reaching u
+(a sketch of this check is given below).
+.sp
+If the definition is a copy A := B and B has the same value at d as
+at u, then the use of A at u may be changed into B.
+The latter condition can be verified as follows:
+.IP -
+if u and d are in the same basic block, see if there is
+any assignment to B in between d and u
+.IP -
+if u and d are in different basic blocks, the condition is
+satisfied if there is no assignment to B in the block of u prior to u
+and d is in C_IN[b], where b is the block containing u.
+.LP
+Before the transformation is actually done, UD first makes sure the
+alteration is really desirable, as described before.
+The information needed for this purpose (access costs of local and
+global variables) is read from a machine descriptor file.
+.sp
+If the only definition reaching u has the form "A := constant", the use
+of A at u is replaced by the constant.
+
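+.PP
+With the bitvector representation the uniqueness check is cheap:
+the explicit definitions of A that reach u's block are the common
+elements of IN[b] and vardefs[A].
+The sketch below illustrates this; the names are invented, and a real
+implementation must also account for definitions of A occurring in b
+itself before u, as well as for implicit definitions.
+.DS
+/* Sketch: the unique explicit definition of variable v reaching
+ * the start of block b, or -1 if none or several reach it.
+ */
+#define NDEFS  256
+#define NWORDS (NDEFS / 32)
+
+typedef unsigned int bitvec[NWORDS];
+
+extern bitvec in[];        /* IN[b]: definitions reaching b      */
+extern bitvec vardefs[];   /* explicit definitions, per variable */
+
+int unique_reaching_def(int b, int v)
+{
+    int d, found = -1;
+
+    for (d = 0; d < NDEFS; d++) {
+        unsigned int bit = 1u << (d % 32);
+
+        if ((in[b][d / 32] & vardefs[v][d / 32] & bit) == 0)
+            continue;      /* d does not reach b, or is not a def of v */
+        if (found >= 0)
+            return -1;     /* a second definition reaches: not unique */
+        found = d;
+    }
+    return found;
+}
+.DE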
diff --git a/doc/ego/ud/ud5 b/doc/ego/ud/ud5
new file mode 100644
index 000000000..1d617e128
--- /dev/null
+++ b/doc/ego/ud/ud5
@@ -0,0 +1,19 @@
+
+.NH 2
+Source files of UD
+.PP
+The sources of UD are in the following files and packages:
+.IP ud.h: 14
+declarations of global variables and data structures
+.IP ud.c:
+the routine main; initialization of target-machine-dependent tables
+.IP defs:
+routines to compute the GEN and KILL sets and routines to analyse
+EM instructions
+.IP const:
+routines involved in constant propagation
+.IP copy:
+routines involved in copy propagation
+.IP aux:
+contains auxiliary routines
+.LP