I've benchmarked my DS20L using nbench, played with CFLAGS, and tested Compaq's Alpha-optimized math library.
- HP Alphaserver DS20L
- Dual 833MHz EV68ALs with 4MB external L2 cache each
- 2 GB ECC Registered RAM
- Radeon X1550 256MB PCI
- 18 GB 15K RPM SCSI Drive
- nbench version 2.2.2
First, I tried pretty standard CFLAGS. I use -O2 system wide, but figured -O3 might be a bit better.
gcc -O3 -mcpu=ev67 -mieee -lm
MEMORY INDEX        : 6.937
INTEGER INDEX       : 6.672
FLOATING-POINT INDEX: 6.977
As far as integer performance goes, this was about as fast as an early Athlon. In terms of floating-point performance, it was equivalent to a 950MHz Duron using gcc-2.8.1. Ugh. I decided to make use of Compaq's excellent Alpha-optimized math library, libcpml, and forced nbench to link against it instead of glibc's libm.
gcc -O3 -mcpu=ev67 -mieee -lcpml
MEMORY INDEX        : 6.927
INTEGER INDEX       : 6.676
FLOATING-POINT INDEX: 11.157
Immediately, we see a huge jump in the floating-point score. It zoomed from a 950MHz Duron running gcc from 1998 to a 2.2GHz Xeon (still running an ancient version of gcc, no less). I decided to add a few more CFLAGS. First, I added -funroll-loops.
gcc -O3 -mcpu=ev67 -mieee -funroll-loops -lcpml
MEMORY INDEX        : 7.935
INTEGER INDEX       : 6.686
FLOATING-POINT INDEX: 14.499
Another nice performance improvement in the FPU category. The memory index also went up by nearly 15% with the addition of -funroll-loops. If adding this flag improved these scores substantially, then let's add another: -ftree-vectorize.
gcc -O3 -mcpu=ev67 -mieee -funroll-loops -ftree-vectorize -lcpml
MEMORY INDEX        : 8.185
INTEGER INDEX       : 6.856
FLOATING-POINT INDEX: 14.644
Here we see a slight improvement in all three categories. Why the heck not try our luck and add another CFLAG? I added -ftracer.
gcc -O3 -mcpu=ev67 -mieee -funroll-loops -ftree-vectorize -ftracer -lcpml
MEMORY INDEX        : 7.874
INTEGER INDEX       : 6.975
FLOATING-POINT INDEX: 14.534
The results were mixed. Memory down, integer up, floating point down. Since it did not yield a significant improvement in any category, and actually hurt performance in two, I gave it the axe. Next, I tried -ffast-math.
gcc -O3 -mcpu=ev67 -mieee -funroll-loops -ftree-vectorize -ffast-math -lcpml
MEMORY INDEX        : 8.182
INTEGER INDEX       : 6.880
FLOATING-POINT INDEX: 14.560
Overall, adding -ffast-math made no real difference, so I removed it.
A neat feature of Compaq's libcpml is that it provides not only standard math functions designed to be accurate while still highly optimized, but also functions that are fast at the expense of accuracy. In order to use the fast math functions, I added the line #include <cpml.h> to nmglobal.h and recompiled.
gcc -O3 -mcpu=ev67 -mieee -funroll-loops -ftree-vectorize -lcpml
(with #include <cpml.h> in nmglobal.h)
MEMORY INDEX        : 8.182
INTEGER INDEX       : 6.859
FLOATING-POINT INDEX: 15.980
Once again, we see an improvement in the floating-point index. Whether the loss of accuracy would matter outside of a benchmark is unknown. Lastly, I compiled the entire program statically to remove some of the indirection overhead on external library calls (such as math calls into libcpml).
gcc -O3 -mcpu=ev67 -mieee -funroll-loops -ftree-vectorize -static -lcpml
(with #include <cpml.h> in nmglobal.h)
MEMORY INDEX        : 8.177
INTEGER INDEX       : 6.858
FLOATING-POINT INDEX: 16.326
Sure enough, there was another (although slight) improvement in the floating-point score. In terms of floating-point performance, I started with a budget AMD running at less than a gigahertz with a Clinton-era compiler and ended with a Pentium 4 at 2.6GHz with a relatively new compiler and libc.
Results and Implications
Overall, I cannot say whether adding -funroll-loops and -ftree-vectorize to the global CFLAGS will yield a performance increase across the board. I can, though, say that using Compaq's optimized math library will provide a significant performance increase (nearly 100%) in math-intensive programs.
If you've got any suggestions for CFLAGS or other tricks to try, contact me. Note: I did try compiling with Compaq's C compiler, but nbench crashed during the first test with a null-pointer error.
By request, I reran the initial test using -O2 instead of -O3. Here are the results.
gcc -O2 -mcpu=ev67 -mieee -lm
MEMORY INDEX        : 6.900
INTEGER INDEX       : 4.757
FLOATING-POINT INDEX: 7.244
You may think the increase in the floating-point index is strange, but it is only about 4% above the original -O3 result. The integer index, on the other hand, decreased by 29%: -O2 seems to be significantly slower than -O3 when it comes to integer operations.
While looking through the Alpha-specific gcc flags, I noticed -msmall-text and -msmall-data. These flags reduce the number of instructions required to access memory in small programs.
gcc -O3 -mcpu=ev67 -mieee -msmall-data -msmall-text -lm
MEMORY INDEX        : 6.905
INTEGER INDEX       : 6.679
FLOATING-POINT INDEX: 7.612
There was possibly a slight increase in the floating-point index. I decided to add these two flags to the fastest combination I found previously. Note that the first set of results below uses the standard cpml math functions, not the fast ones; the second set uses the fast math functions.
gcc -O3 -mcpu=ev67 -mieee -pipe -msmall-data -msmall-text -funroll-loops -ftree-vectorize -static -lcpml
MEMORY INDEX        : 8.163
INTEGER INDEX       : 6.857
FLOATING-POINT INDEX: 14.593

gcc -O3 -mcpu=ev67 -mieee -pipe -msmall-data -msmall-text -funroll-loops -ftree-vectorize -static -lcpml
(with #include <cpml.h> in nmglobal.h)
MEMORY INDEX        : 8.157
INTEGER INDEX       : 6.853
FLOATING-POINT INDEX: 15.533
From these results, the two -msmall flags don't seem to help performance, and may actually hurt it slightly.