.In
.SH
A. MEASUREMENTS
.SH
A.1. \*(OQThe bottom line\*(CQ
.PP
Although examples are often the most illustrative, the cruel world out there is
usually more interested in everyday performance figures. To satisfy those
people too, we present a series of measurements on our code expander
taken from (close to) real-life situations. These include measurements
of the compile and run times of different programs,
compiled with different compilers.
.SH
A.2. Compile time measurements
.PP
Figure A.2.1 shows compile-time measurements for typical C code:
the dhrystone benchmark\(dg
.[ [
dhrystone
.]].
.FS
\(dg To be certain that we tested only the compiler and not the quality of
the code in the library, we have added our own versions of
\fIstrcmp\fR and \fIstrcpy\fR and have not used the ones present in the
library.
.FE
The numbers represent the duration of each separate pass of the compiler.
The numbers at the end of each bar represent the total duration of the
compilation process. As with all measurements in this chapter, the
quoted time or duration is the sum of user and system time in seconds.
.PS
copy "pics/compile_bars"
.PE
.DS
.IP cem: 6
C to EM frontend
.IP opt:
EM peep-hole optimizer
.IP be:
EM to assembler backend
.IP cpp:
Sun's C preprocessor
.IP ccom:
Sun's C compiler
.IP iropt:
Sun's optimizer
.IP cg:
Sun's code generator
.IP as:
Sun's assembler
.IP ld:
Sun's linker
.ce 1
\fIFigure A.2.1: compile-time measurements.\fR
.DE
.sp
.PP
A close examination of the first two bars in figure A.2.1 shows that the maximum
achievable compile-time
gain compared to \fIcc\fR is about 50% for medium-sized
programs.\(dd
.FS
\(dd (cpp+ccom+as+ld)/(cem+as+ld) = 1.53
.FE
For small programs the gain will be less, due to the almost constant
start-up time of each pass in the compilation process. Only a
built-in assembler could raise this number, to at most
180% in the ideal case in which the optimizer, backend and assembler
run in zero time. Speed-ups of 5 to 10 times as mentioned in
.[ [
fast portable compilers
.]]
are therefore not possible on the Sun-4 family. This is also due to
Sun's implementation of saving and restoring register windows. With
the current implementation, in which only a single window is saved
or restored on a register-window overflow, programs
with highly dynamic stack use
due to procedure calls (as is often the case with compilers)
become very time consuming to run.
.PP
Although we are currently a little slower than \fIcc\fR, it is hard to
blame this on our backend. Optimizing the backend so that it would run
twice as fast would reduce the total compilation time by
a mere 14%.
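.PP
The arithmetic behind this estimate can be sketched as follows. This is
only an illustration: the function name and the backend share of 0.28 are
assumptions, the latter inferred from the 14% quoted above rather than read
from figure A.2.1.
.DS
```python
def total_time_reduction(pass_share, speedup):
    """Fraction of total compile time saved when one pass,
    responsible for `pass_share` of the total, runs `speedup` times faster."""
    return pass_share * (1.0 - 1.0 / speedup)

# Assumption: the backend takes about 28% of the total compile time.
# This share is inferred from the quoted 14% (0.14 = share * (1 - 1/2)).
BACKEND_SHARE = 0.28

# A backend that runs twice as fast.
print(f"{total_time_reduction(BACKEND_SHARE, 2.0):.0%}")

# Under the same assumption, even a zero-time backend saves no more
# than the backend share itself.
print(f"{total_time_reduction(BACKEND_SHARE, float('inf')):.0%}")
```
.DE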
.PP
Finally, it is nice to see that our push/pop-optimization,
initially designed to generate faster code, has also increased the
compilation speed. (See also figures A.4.1 and A.4.2.)
.SH
A.3. Run time performance
.PP
Figure A.3.1 shows the run-time performance of different compilers.
All results are normalized so that the best available compiler (Sun's
compiler with full optimization) is represented by 1.0 on our scale.
.PS
copy "pics/run-time_bars"
.PE
.ce 1
\fIFigure A.3.1: run time performance.\fR
.sp 1
.PP
Our compiler behaves rather poorly compared to Sun's
compiler because the dhrystone benchmark makes
relatively many subroutine calls, all of which have to be 'emulated'
by our backend.
.SH
A.4. Overall performance
.LP
The next two figures show the combined run and compile time
performance of 'our' compiler (the ACK C frontend and our backend)
compared to Sun's C compiler. Figure A.4.1 shows the results from
measurements on the dhrystone benchmark.
.G1
frame invis left solid bot solid
label left "run time" "(in \(*msec/dhrystone)"
label bot "compile time (in sec)"
coord x 0,21 y 0,610
ticks left out from 0 to 600 by 200
ticks bot out from 0 to 20 by 5
"\(bu" at 3.5, 1000000/1700
"ack w/o opt" ljust at 3.5 + 1, 1000000/1700
"\(bu" at 2.8, 1000000/8770
"ack with opt" below at 2.8 + 0.1, 1000000/8770
"\(bu" at 16.0, 1000000/10434
"ack -O4" above at 16.0, 1000000/10434
"\(bu" at 2.3, 1000000/7270
"\fIcc\fR" above at 2.3, 1000000/7270
"\(bu" at 9.0, 1000000/12500
"\fIcc -O4\fR" above at 9.0, 1000000/12500
"\(bu" at 5.9, 1000000/15250
"\fIcc -O\fR" below at 5.9, 1000000/15250
.G2
.ce 1
\fIFigure A.4.1: overall performance on dhrystones.\fR
.sp 1
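.LP
The vertical positions in figure A.4.1 are written as 1000000 divided by
a rate. The sketch below reproduces that conversion. It assumes that the
denominators are dhrystones per second. The helper names are made up for
this illustration, and the relative figure is taken against the fastest
entry in this plot, which need not be the reference used in figure A.3.1.
.DS
```python
# Denominators copied from the grap source of figure A.4.1.
# Assumption: each value is a rate in dhrystones per second.
DHRYSTONES_PER_SEC = {
    "ack w/o opt": 1700,
    "ack with opt": 8770,
    "ack -O4": 10434,
    "cc": 7270,
    "cc -O4": 12500,
    "cc -O": 15250,
}

def usec_per_dhrystone(rate):
    """Convert a rate (dhrystones/sec) into microseconds per dhrystone."""
    return 1_000_000 / rate

def relative_run_time(rate, best_rate):
    """Run time relative to the fastest entry (1.0 means fastest)."""
    return best_rate / rate

best = max(DHRYSTONES_PER_SEC.values())
for name, rate in DHRYSTONES_PER_SEC.items():
    usec = usec_per_dhrystone(rate)
    rel = relative_run_time(rate, best)
    print(f"{name:13s} {usec:6.1f} usec/dhrystone  {rel:5.2f}")
```
.DE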
.LP
Fortunately for us, dhrystones are not all there is. The following
figure shows the same measurements as the previous one, except
this time we took a benchmark that uses no subroutines: an implementation
of Eratosthenes' sieve:
.G1
frame invis left solid bot solid
label left "run time" "for one run" "(in sec)" left .6
label bot "compile time (in sec)"
coord x 0,11 y 0,21
ticks bot out from 0 to 10 by 5
ticks left out from 0 to 20 by 5
"\(bu" at 2.5, 17.28
"ack w/o opt" above at 2.5, 17.28
"\(bu" at 1.6, 2.93
"ack with opt" above at 1.6, 2.93
"\(bu" at 9.4, 2.26
"ack -O4" above at 9.4, 2.26
"\(bu" at 1.5, 7.43
"\fIcc\fR" above at 1.5, 7.43
"\(bu" at 2.7, 2.02
"\fIcc -O4\fR" ljust at 1.9, 1.2
"\(bu" at 2.6, 2.10
"\fIcc -O\fR" ljust at 3.1,2.5
.G2
.ce 1
\fIFigure A.4.2: overall performance on Eratosthenes' sieve.\fR
.sp 1
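.PP
One possible way to read figure A.4.2 as a single number is to add the
compile time to the time for a given number of runs. The sketch below does
this for the bullet positions in the grap source. The summing rule, the
function name and the \fIruns\fR parameter are assumptions made for this
illustration, not a measure used in the measurements themselves.
.DS
```python
# (compile time, run time for one run), both in seconds, copied from
# the bullet positions in the grap source of figure A.4.2.
SIEVE = {
    "ack w/o opt": (2.5, 17.28),
    "ack with opt": (1.6, 2.93),
    "ack -O4": (9.4, 2.26),
    "cc": (1.5, 7.43),
    "cc -O4": (2.7, 2.02),
    "cc -O": (2.6, 2.10),
}

def combined_time(compile_sec, run_sec, runs=1):
    """Assumed cost model: one compilation plus `runs` executions."""
    return compile_sec + runs * run_sec

# Rank the compilers by combined time for a single run.
ranking = sorted(SIEVE, key=lambda name: combined_time(*SIEVE[name]))
for name in ranking:
    print(f"{name:13s} {combined_time(*SIEVE[name]):6.2f} sec")
```
.DE
Raising \fIruns\fR shifts the ranking toward the compilers that produce
the fastest code, since the compile time is then paid only once.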
.PP
Although the above figures speak for themselves, a small comment
may be in order. First, it is clear that our compiler is neither
faster than \fIcc\fR, nor produces faster code than \fIcc -O4\fR. It should
also be noted, however, that we do produce better code than \fIcc\fR
at only a very small additional cost in compile time.
It is also worth noting that push-pop optimization
increases run-time speed as well as compile speed.
The former seems rather obvious,
since optimized code is
faster code, but the increase in compile speed may come as a surprise.
The main reason is that the \fIas\fR+\fIld\fR time depends largely on the
amount of generated code, which in general
depends on the efficiency of the code.
Push-pop optimization removes a lot of useless instructions that
would otherwise
have found their way through to the assembler and the loader.
Useless instructions inserted at an early stage of the compilation
process slow down every following stage, so eliminating useless
instructions at an early stage, even when it requires a little computational
overhead, can often benefit the overall compilation speed.
.bp