DS20L Benchmarks

I've benchmarked my DS20L using nbench, experimented with CFLAGS, and tested Compaq's Alpha-optimized math library.

Test Setup


Benchmarks

First, I tried pretty standard CFLAGS. I use -O2 system wide, but figured -O3 might be a bit better.

gcc -O3 -mcpu=ev67 -mieee -lm
   MEMORY INDEX        : 6.937
   INTEGER INDEX       : 6.672
   FLOATING-POINT INDEX: 6.977

As far as integer performance goes, this was about as fast as an early Athlon. In terms of floating point performance, it was equivalent to a 950MHz Duron running gcc-2.8.1. Ugh. I decided to make use of Compaq's excellent Alpha-optimized math library, libcpml, and forced nbench to link against it instead of glibc's libm.

gcc -O3 -mcpu=ev67 -mieee -lcpml
   MEMORY INDEX        : 6.927
   INTEGER INDEX       : 6.676
   FLOATING-POINT INDEX: 11.157

Immediately, we see a huge jump in the floating point score. It zoomed from a 950MHz Duron running gcc from 1998 to a Xeon at 2.2GHz (still running an ancient version of gcc, no less). I decided to add a few more CFLAGS. First, I added -funroll-loops.

gcc -O3 -mcpu=ev67 -mieee -funroll-loops -lcpml
   MEMORY INDEX        : 7.935
   INTEGER INDEX       : 6.686
   FLOATING-POINT INDEX: 14.499

Another nice performance improvement in the FPU category. The memory index also went up by nearly 15% with the addition of -funroll-loops. Since adding this flag improved these scores substantially, let's add another: -ftree-vectorize.

gcc -O3 -mcpu=ev67 -mieee -funroll-loops -ftree-vectorize -lcpml
   MEMORY INDEX        : 8.185
   INTEGER INDEX       : 6.856
   FLOATING-POINT INDEX: 14.644

Here we see a slight improvement in all three categories. Why the heck not try our luck and add another CFLAG? I added -ftracer.

gcc -O3 -mcpu=ev67 -mieee -funroll-loops -ftree-vectorize -ftracer -lcpml
   MEMORY INDEX        : 7.874
   INTEGER INDEX       : 6.975
   FLOATING-POINT INDEX: 14.534

The results were mixed. Memory down, integer up, floating point down. Since it did not yield a significant improvement in any category, and actually hurt performance in two, I gave it the axe. Next, I tried -ffast-math.

gcc -O3 -mcpu=ev67 -mieee -funroll-loops -ftree-vectorize -ffast-math -lcpml
   MEMORY INDEX        : 8.182
   INTEGER INDEX       : 6.880
   FLOATING-POINT INDEX: 14.560

Overall, adding -ffast-math made essentially no difference, so I removed it.

A neat feature of Compaq's libcpml is that it provides not only standard math functions designed to be accurate while still highly optimized, but also functions that are fast at the expense of accuracy. In order to use the fast math functions, I added the line #include <cpml.h> to nmglobal.h and recompiled.
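
For reference, a minimal sketch of what the edit might look like; placing the include at the top of nmglobal.h is my assumption, the only requirement being that the header is pulled in before the math calls are compiled.

   /* Top of nmglobal.h (sketch; exact placement is a guess).
    * Per the text above, including Compaq's cpml header is what
    * enables the fast math functions in nbench. */
   #include <cpml.h>

   /* ... the rest of nmglobal.h is left unchanged ... */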

gcc -O3 -mcpu=ev67 -mieee -funroll-loops -ftree-vectorize -lcpml
   #include <cpml.h> in nmglobal.h
   MEMORY INDEX        : 8.182
   INTEGER INDEX       : 6.859
   FLOATING-POINT INDEX: 15.980

Once again, we see an improvement in the floating point index. Whether the loss of accuracy would matter outside of a benchmark is unknown. Lastly, I compiled the entire program statically to remove some of the overhead of external library calls (such as the math calls into libcpml).

gcc -O3 -mcpu=ev67 -mieee -funroll-loops -ftree-vectorize -static -lcpml
   #include <cpml.h>
   MEMORY INDEX        : 8.177
   INTEGER INDEX       : 6.858
   FLOATING-POINT INDEX: 16.326

Sure enough, there was another (although slight) improvement in the floating point score. In terms of floating point performance, I started with a budget AMD running at less than a gigahertz with a Clinton-era compiler and ended with a Pentium 4 at 2.6GHz with a relatively new compiler and libc.


Results and Implications

Overall, I cannot say whether adding -funroll-loops and -ftree-vectorize to the global CFLAGS will yield a performance increase across the board. I can, though, say that using Compaq's optimized math library provides a significant performance increase (nearly 100%) in math-intensive programs.
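
To check whether the libcpml gain carries over to your own code, a small sketch along these lines can be built once against libm and once against libcpml and then timed. The file name, loop count, and function choices here are arbitrary; the flags in the comments simply mirror the combination used above.

   /* mathtest.c -- rough libm vs. libcpml comparison (sketch).
    * Build one binary each way, for example:
    *   gcc -O3 -mcpu=ev67 -mieee mathtest.c -o mathtest-libm -lm
    *   gcc -O3 -mcpu=ev67 -mieee mathtest.c -o mathtest-cpml -lcpml
    * then time both binaries and compare.
    */
   #include <math.h>
   #include <stdio.h>

   int main(void)
   {
       double sum = 0.0;
       long i;

       /* hammer a few common math library entry points */
       for (i = 1; i <= 20000000; i++) {
           double x = (double)i / 1000.0;
           sum += sqrt(x) + sin(x) * exp(-x / 500.0);
       }

       /* print the sum so the compiler cannot drop the loop */
       printf("sum = %f\n", sum);
       return 0;
   }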

If you've got any suggestions for CFLAGS or other tricks to try, contact me. Note that I did try compiling with Compaq's C compiler, but nbench crashed during the first test with a null pointer error.

Addendum 1

By request, I reran the initial test using -O2 instead of -O3. Here are the results.

gcc -O2 -mcpu=ev67 -mieee -lm
   MEMORY INDEX        : 6.900
   INTEGER INDEX       : 4.757
   FLOATING-POINT INDEX: 7.244

You may think the increase in the floating point index is strange, but it is only about 3% different from the original -O3 result. The integer index, on the other hand, decreased by 29%. -O2 seems to be significantly slower than -O3 when it comes to integer operations.

Addendum 2

While looking through Alpha-specific gcc flags, I noticed the -msmall-text and -msmall-data flags. These flags reduce the number of instructions required to access data and code in small programs.

gcc -O3 -mcpu=ev67 -mieee -msmall-data -msmall-text -lm
   MEMORY INDEX        : 6.905
   INTEGER INDEX       : 6.679
   FLOATING-POINT INDEX: 7.612

There may have been a slight increase in the floating point index. I decided to try adding these two flags to the fastest combination I found previously. Note that the first run below uses the standard cpml math functions, not the fast ones; the second run uses the fast math functions.

gcc -O3 -mcpu=ev67 -mieee -pipe -msmall-data -msmall-text -funroll-loops -ftree-vectorize -static -lcpml
   MEMORY INDEX        : 8.163
   INTEGER INDEX       : 6.857
   FLOATING-POINT INDEX: 14.593
gcc -O3 -mcpu=ev67 -mieee -pipe -msmall-data -msmall-text -funroll-loops -ftree-vectorize -static -lcpml
   #include <cpml.h>
   MEMORY INDEX        : 8.157
   INTEGER INDEX       : 6.853
   FLOATING-POINT INDEX: 15.533

From these results, the two -msmall flags don't seem to help performance, and may actually hurt it slightly.