doc/sparc/4

   1 .In
   2 .hw data-structures
   3 .nr H1 3
   4 .NH
   5 SOLUTIONS
   6 .NH 2
   7 Maintaining SPARC speed
   8 .PP
   9 In chapter 3 we wrote:
  10 .sp 0.3
  11 .nf
  12 >If we want to generate efficient code, we should at least try to reduce the number of
  13 >memory references and use registers wherever we can.
  14 .fi
  15 .sp 0.3
  16 In this chapter we will device a strategy to swiftly generate acceptable
  17 code by using push-pop optimization.
  18 Note that this is not the push-pop
  19 optimization already available in the EM-kit, since that is only present
  20 in the assembler-to-binary part which we do not use
  21 .[ [
  22 The Code Expander Generator
  23 .]].
  24 Our push-pop optimization
  25 works more like the fake-stack described in
  26 .[ [
  27 The table driven code generator
  28 .]].
  29 .NH 3
  30 Ad-hoc optimization
  31 .PP
  32 Before getting involved in any optimization let's have a look at some
  33 code generated with a straightforward EM to SPARC conversion of the
  34 C statement: \*(Sif(a[i]);\*(So Note that \*(Si%SP\*(So is an alias
  35 for a general purpose
  36 register and acts as the EM stack pointer. It has nothing to do with
  37 \*(Si%sp\*(So \(em the SPARC stack pointer.
  38 Analogous \*(Si%LB\*(So is EMs local base pointer.
  39 .br
  40 .IP
  41 .HS
  42 .TS
  43 ;
  44 l s l s l
  45 l1f6 lf6 l2f6 lf6 l.
  46 EM code SPARC code      Comment
  47
  48 lae     _a      set     _a, %g1 ! load address of external _a
  49                 dec     4, %SP
  50                 st      %g1, [%SP]
  51
  52 lol     -4      set     -4, %g1 ! load local -4 (i)
  53                 ld      [%g1+%LB], %g2
  54                 dec     4, %SP
  55                 st      %g2, [%SP]
  56
  57 loc     2       set     2, %g1  ! load constant 2
  58                 dec     4, %SP
  59                 st      %g1, [%SP]
  60
  61 sli     4       ld      [%SP], %g1      ! pop shift count
  62                 ld      [%SP+4], %g2    ! pop shiftee
  63                 sll     %g2, %g1, %g3
  64                 inc     4, %SP
  65                 st      %g3, [%SP]      ! push 4 * i
  66
  67 ads     4       ld      [%SP], %g1      ! add pointer and offset
  68                 ld      [%SP+4], %g2
  69                 add     %g1, %g2, %g3
  70                 inc     4, %SP
  71                 st      %g3, [%SP]      ! push address of _a + (4 * i)
  72
  73 loi     4       ld      [%SP], %g1      ! load indirect 4 bytes
  74                 ld      [%g1], %g2
  75                 st      %g2, [%SP]      ! push a[i]
  76 cal     _f
  77                 ...
  78 .TE
  79 .HS
  80 .LP
  81 Although the code is easy understand, it clearly is far from optimal.
  82 The above code uses approximately 60 clock-cycles\(dg
  83 .FS
  84 \(dg In general each instruction only takes one cycle,
  85 except for \*(Sild\*(So and
  86 \*(Sist\*(So which may both require additional clock cycles. The exact amount
  87 of extra cycles needed depends on the SPARC implementation and memory access
  88 time. Furthermore, the
  89 \*(Siset\*(So pseudo-instruction is a bit tricky. It takes one cycle when
  90 its argument lies between -4096 and 4095, and two cycles otherwise.
  91 .FE
  92 to push an array-element on the stack,
  93 something which a 68020 can do in a single instruction. The SPARC
  94 processor may be fast, but not fast enough to justify the above code.
  95 .PP
  96 The same statement can be translated much more efficiently:
  97 .DS
  98 .TS
  99 ;
 100 l2f6 lf6 l.
 101 sll     %i0, 2, %g2     ! multiply index by 4
 102 set     _a, g3
 103 ld      [%g2+%g3], %g1  ! get contents of a[i]
 104 dec     4, SP
 105 st      %g2, [SP]       ! push a[i] onto the stack
 106 .TE
 107 .DE
 108 which, instead of 60, uses only 5 clock cycles to retrieve the element
 109 from memory and 5 additional cycles when the result has to be pushed
 110 on the stack. Note that when the result is not a parameter it does not
 111 have to be pushed on the stack. By making efficient use of the SPARC
 112 registers we can fetch \*(Sia[i]\*(So in only 5 cycles!
 113 .NH 3
 114 Analyzing optimization
 115 .PP
 116 Instead of ad-hoc optimization we will need something more solid.
 117 When one tries to optimize the above code in an ad-hoc manner one will
 118 probably notice the large overhead due to stack access. Almost every EM
 119 instruction requires at least three SPARC instructions: one to carry out
 120 the EM instruction and two to pop and push the result from and onto the
 121 stack. This happens for every instruction, even though the data being pushed
 122 will probably be needed by the next instruction. To optimize this extensive
 123 pushing and popping of data we will use the appropriately named push-pop
 124 optimization.
 125 .PP
 126 The idea behind push-pop optimization is to delay the push operation until
 127 it is almost certain that the data actually has to be pushed.
 128 As is often the case, the data does not have to be pushed,
 129 but will be used as input to another EM instruction.
 130 If we can decide at compile time that this will indeed be
 131 the case we can save the time of first pushing the data and then popping it
 132 back again by temporarily storing the data (possibly only during compilation!)
 133 and using it no sooner than it is actually needed.
 134 .PP
 135 The \*(Sisli 4\*(So instruction, for instance, expects two inputs on top of the
 136 stack: on top a counter and right below that the shiftee (the number
 137 to be shifted). As a result \*(Sisli\*(So
 138 pushes 'shiftee << counter' back to the stack. Now consider the following
 139 sequence, which could be the result of the expression \*(Si4 * i\*(So
 140 .DS
 141 .TS
 142 ;
 143 l1f6 lf6 l.
 144 lol     -4
 145 loc     2
 146 sli     4
 147 .TE
 148 .DE
 149 In the non-optimized situation the \*(Silol\*(So would push
 150 a local variable (whose offset is -4) on the stack.
 151 Then the \*(Siloc\*(So pushes a 2 on the stack and finally \*(Sisli\*(So
 152 retrieves both these numbers to replace then with the result.
 153 On most machines it is not necessary to
 154 push the 2 on the stack, since it can be used in the shift instruction
 155 as an immediately operand. On a SPARC, for instance, one can write
 156 .DS
 157 .TS
 158 ;
 159 l2f6 lf6 l.
 160 ld      [%LB-4], %g1    ! load local variable into register g1
 161 sll     %g1, 2, %g2     ! perform the shift-left-by-2
 162 .TE
 163 .DE
 164 where the output of the \*(Silol\*(So, as well as the immediate operand 2 are used
 165 in the shift instruction. As suggested before, all of this can be
 166 achieved with push-pop optimization.
 167 .NH 3
 168 A mechanism for push-pop optimization
 169 .PP
 170 To implement the above optimization we need some mechanism to
 171 temporarily store information during compilation.
 172 We need to be able to store, compare and retrieve information from the
 173 temporary storage (cache) without any
 174 loss of information. Before describing all the routines used
 175 to implement our cache we will first describe how the cache works.
 176 .PP
 177 Items in the cache are structures containing an external (\*(Sichar *\*(So),
 178 two registers (\*(Sireg_t\*(So) and a constant (\*(Siarith\*(So),
 179 any of which may be 0.
 180 The value of such a structure is the sum of (the values of)
 181 its elements. To put a register in the cache, one has to be allocated either
 182 by calling \*(Sialloc_reg\*(So which returns a free register, by
 183 \*(Siforced_alloc_reg\*(So which allocates a specific register or any
 184 of the other routines available to allocate a register. The keep things
 185 simple, we will not discuss all of the available primitives here.
 186 When the register
 187 is then put in the cache by the \*(Sipush_reg\*(So routine, the ownership will
 188 be transferred from the user to the cache. Ownership is important, because
 189 only the owner of a register may (and must!) deallocate it. Registers can be
 190 owned by either an (imaginary) register manager, the cache or the user.
 191 When the user retrieves a register from the stack with \*(Sipop_reg\*(So for
 192 instance, ownership is back to the user.
 193 The user should then call \*(Sifree_reg\*(So
 194 to transfer ownership to the register manager or call \*(Sipush_reg\*(So
 195 to give it back to the cache.
 196 Since the cache behaves itself as a stack we will use the term pop resp. push
 197 to get items from, resp. put items in the cache.
 198 .PP
 199 We shall now present the sets of routines that implement the cache.
 200 .IP \(bu
 201 The routines
 202 .DS
 203 \*(Si
 204 reg_t alloc_reg(void)
 205 reg_t alloc_reg_var(void)
 206 reg_t alloc_float(void)
 207 reg_t alloc_float_var(void)
 208 reg_t alloc_double(void)
 209 reg_t alloc_double_var(void)
 210
 211 void forced_alloc_reg(reg_t)
 212 void soft_alloc_reg(reg_t)
 213
 214 void free_reg(reg_t)
 215 void free_double_reg(reg_t)
 216 \*(So
 217 .DE
 218 allocate and deallocate registers. If there are no more register left,
 219 i.e. they are owned by the cache,
 220 one or more registers will be freed by flushing part of the cache
 221 onto the real stack.
 222 The \*(Sialloc_xxx_var\*(So primitives try to allocate a register that
 223 can be used to store local variables. (In the current implementation
 224 only the input and local registers.) If none can be found \*(SiNULL\*(So
 225 is returned. \*(Siforced_alloc_reg\*(So forces the allocation of a certain
 226 register. If it was already in use, its contents are moved to another
 227 register. Finally \*(Sisoft_alloc_reg\*(So provides the possibility to
 228 push a register onto the cache and still keep a copy for later use.
 229 (Used to implement the \*(Sidup 4\*(So for example.)
 230 .IP \(bu
 231 The routines
 232 .DS
 233 \*(Si
 234 void push_const(arith)
 235 arith pop_const(void)
 236 \*(So
 237 .DE
 238 push or pop a constant onto or from the stack. Distinction between
 239 constants and other types is made so as not to loose any information; constants
 240 may be used later on as immediate operators, which is not the case
 241 for other types. If \*(Sipop_const\*(So is called, but the element on top of
 242 the cache has either one of the external or register fields non-zero a
 243 fatal error will be reported.
 244 .IP \(bu
 245 The routines
 246 .DS
 247 \*(Si
 248 reg_t pop_reg(void)
 249 reg_t pop_float(void)
 250 reg_t pop_double(void)
 251 reg_t pop_reg_c13(char *n)
 252
 253 void pop_reg_as(reg_t)
 254
 255 void push_reg(reg_t)
 256 \*(So
 257 .DE
 258 push or pop a register. These will be used most often since results from one
 259 EM instruction, which are computed in a register, are often used in the next.
 260 When the element on top of the cache is more
 261 than just a register the cache manager
 262 will generate code to compute the sum of its fields and put the result in a
 263 register. This register will then be given to the user.
 264 If the user wants the result is a special register, he should use the
 265 \*(Sipop_reg_as\*(So routine.
 266 The \*(Sipop_reg_c13\*(So gives an optional number (as character string) whose
 267 value can be represented in 13 bits. The constant can then be used as an
 268 offset for the SPARC \*(Sild\*(So and \*(Sist\*(So instructions.
 269 .IP \(bu
 270 The routine
 271 .DS
 272 \*(Si
 273 void push_ext(char *)
 274 \*(So
 275 .DE
 276 pushes an external onto the stack. There is no pop-variant of this one since
 277 there is no use in popping an external.
 278 .IP \(bu
 279 The routines
 280 .DS
 281 \*(Si
 282 void inc_tos(arith n)
 283 void inc_tos_reg(reg_t r)
 284 \*(So
 285 .DE
 286 increment the element on top of the cache by either the constant \*(Sin\*(So
 287 or by a register. The latter is useful for pointer addition when referencing
 288 external memory.
 289 .KS
 290 .IP \(bu
 291 The routine
 292 .DS
 293 \*(Si
 294 int type_of_tos(void)
 295 \*(So
 296 .DE
 297 .KE
 298 returns the type of the element on top of the cache. This is a combination
 299 (binary OR) of \*(SiT_ext\*(So, \*(SiT_reg\*(So or \*(SiT_float\*(So,
 300 \*(SiT_reg2\*(So or \*(SiT_float2\*(So, and \*(SiT_cst\*(So,
 301 and tells the
 302 user which of the three fields are non-zero. When the register-fields
 303 represent \*(Si%g0\*(So, it is considered zero.
 304 .IP \(bu
 305 Miscellaneous routines:
 306 .DS
 307 \*(Si
 308 void init_cache(void)
 309 void cache_need(int)
 310 void change_reg(void)
 311 void flush_cache(void)
 312 \*(So
 313 .DE
 314 \*(Siinit_cache\*(So should be called before any
 315 other cache routines, to initialize some internal datastructures.
 316 \*(Sicache_need\*(So is used to tell the cache that a certain number
 317 of register are needed for the next operation. This way the cache can
 318 load them efficiently in one fell swoop. \*(Sichange_reg\*(So is to be
 319 called when the user changes a register of which the cache (possibly) has
 320 co-ownership. Because the contents of registers in the cache are
 321 not allowed to change the user should call \*(Sichange_reg\*(So to
 322 instruct the cache to copy the contents to some other register.
 323 \*(Siflush_cache\*(So writes the cache to the stack and invalidates
 324 the cache. It should be used before branches,
 325 before labels and on other places where the stack has to be valid (i.e. where
 326 every item on the EM-stack should be stored on the real stack, not in some
 327 virtual cache).
 328 .NH 3
 329 Implementing push-pop optimization in the EM_table
 330 .PP
 331 As indicated above, there is no regular way to represent the described
 332 optimization in the EM_table. The only possible escapes from the EM_table
 333 are function calls, but that is clearly not enough to implement a good
 334 push-pop optimizer. Therefore we will use a modified version of the EM_table
 335 format, where the description of, say, the \*(Silol\*(So instruction might look
 336 like this\(dg:
 337 .FS
 338 \(dg This is not the way the \*(Silol\*(So actually looks in the EM_table;
 339 it only shows how it \fImight\fR look using the forementioned push/pop
 340 primitives.
 341 .FE
 342 .DS
 343 \*(Si
 344 reg_t A, B;
 345 const_str_t n;
 346
 347 alloc_reg(A);
 348 push_reg(LB);
 349 inc_tos($1);
 350 B = pop_reg_c13(n);
 351 "ld  [$B+$n], $A";
 352 push_reg(A);
 353 free_reg(B);
 354 \*(So
 355 .DE
 356 For more details about the exact implementation consult
 357 appendix B which contains some characteristic excerpts from the EM_table.
 358 .NH 2
 359 Stack management
 360 .PP
 361 When converting EM code to some executable code there is the problem of
 362 maintaining multiple stacks. The usual way to do this is described in
 363 .[ [
 364 Description of a Machine Architecture
 365 .]]
 366 and is shown in figure \*(SN1.
 367 .KE
 368 .PS
 369 copy "pics/EM_stack.orig"
 370 .PE
 371 .ce 1
 372 \fIFigure \*(SN1: usual stack management.
 373 .KE
 374 .sp
 375 .LP
 376 This means that the EM stack and the hardware stack (used
 377 for subroutine calls, etc.) are interleaved in memory. On the SPARC, however,
 378 this brings up a large problem: in the former model it is assumed that the
 379 resolution of the stack pointer is a word, but this is not the case on the
 380 SPARC processor. On the SPARC processor the stack-pointer as well as the
 381 frame-pointer have to be aligned on 8-byte boundaries, so one can not simply
 382 push a word on the stack and then lower the stack-pointer by 4 bytes!
 383 .NH 3
 384 Possible solutions
 385 .PP
 386 A simple idea might be to use a swiss-cheese stack; we could
 387 push a 4-byte word onto the stack and then lower the stack by 8.
 388 Unfortunately, this is not a very solid solution, because
 389 pointer-arithmetic involving pointers to objects on the stack would cause
 390 hard-to-predict anomalies.
 391 .PP
 392 Another try would be not to use the hardware stack at all. As long as we
 393 do not generate subroutine-calls everything will be all right. This
 394 approach, however, also has some disadvantages: first we would not be able
 395 to use any of the existing debuggers such as \fIadb\fR, because they all
 396 assume a regular stack format. Secondly, we would not be able to make use
 397 of the SPARC's register windows to keep local variables. Finally, doing all the
 398 administrative work necessary for subroutine calls ourselves instead of
 399 letting the hardware handle it for us,
 400 causes unnecessary procedure-call overhead.
 401 .PP
 402 Yet another alternative would be to emulate the EM-part of the stack,
 403 and to let the hardware handle the subroutine call. Since we will
 404 emulate our own stack, there are no alignment restrictions and because
 405 we will use the hardware procedure call we can still make use of
 406 the register windows.
 407 .NH 3
 408 Our implementation
 409 .PP
 410 To implement the hybrid stack we need two extra registers: one for the
 411 the EM stack pointer (the forementioned \*(Si%SP\*(So) and one for the
 412 EM local base pointer (\*(Si%LB\*(So). The most elegant solution would be to
 413 put both stacks in different segments, so they would not influence
 414 each other. Unfortunately
 415 .UX
 416 lacks the ability to add segments and
 417 since we will implement our backend under
 418 .UX,
 419 we will have to put
 420 both stacks in the same segment. Exactly how this can be done is shown
 421 in figure \*(SN2.
 422 .DS
 423 .PS
 424 copy "pics/mem_config"
 425 .PE
 426 .ce 1
 427 \fIFigure \*(SN2: our stack management.\fR
 428 .DE
 429 .sp
 430 During normal procedure execution, the SPARC stack pointer has to point to
 431 a memory location where the operating system can dump the active part of
 432 the register window. The rest of the
 433 register window will be dumped in the therefor pre-allocated (stack) space
 434 by following the frame
 435 pointer. When a signal occurs things get even more complicated and
 436 result in figure \*(SN3.
 437 .DS
 438 .PS
 439 copy "pics/signal_stack"
 440 .PE
 441 .ce 1
 442 \fIFigure \*(SN3: our signal stack.\fR
 443 .DE
 444 .PP
 445 The exact implementation of the stack is shown in figure \*(SN4.
 446 .KF
 447 .PS
 448 copy "pics/EM_stack.ours"
 449 .PE
 450 .ce 1
 451 \fIFigure \*(SN4: stack overview.\fR
 452 .KE
 453 .NH 2
 454 Miscellaneous
 455 .PP
 456 As mentioned in the previous chapter, the generated \fI.o\fR-files are
 457 not compatible with Sun's own object format. The primary reason for
 458 this is that Sun usually passes the first six parameters of a procedure call
 459 through registers. If we were to do that too, we would always have
 460 to fetch the top six words from the stack into registers, even when
 461 the procedure would not have any parameters at all. Apart from this,
 462 structure-passing is another exception in Sun's object format which
 463 makes is impossible to generate object-compatible code.\(dg
 464 .FS
 465 \(dg Exactly how Sun passes structures as parameters is described in
 466 Appendix D of the SPARC Architecture Manual (Software Considerations)
 467 .FE
 468 .bp