The Case for Open Sourcing Alpha-Optimized Libraries
As long as I've been interested in Alpha hardware, I've been intrigued by Compaq's Alpha-optimized compilers and libraries. In some cases, the compilers produce code several times faster than gcc's. The math library, libcpml, contains functions that execute in half the time of their glibc equivalents. Since the abandonment of the Alpha platform, this code has languished. In some cases, the performance gap between Compaq's tools and their open source counterparts has shrunk. In others, the benefits of hand-tuned assembly still shine. This prompted me to contact HP and request the release of the code. Unfortunately, they concluded that an old MIPS license prevented them from releasing the compilers. I've recently contacted HP once again to persuade them to release libcpml and libots as free software, since libraries containing nothing but hand-tuned Alpha assembly cannot be encumbered by that license. I also attached the following benchmarks as evidence of why this code is still valuable so many years after it was written.
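Using a test suite I wrote, I benchmarked glibc's implementations of the math functions against libcpml's. Lower numbers are better. The Average column is the mean of runs 2 through 10 (run 1 is excluded), and Speed Up is the glibc average divided by the cpml average, minus one.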
Function | Library | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Run 6 | Run 7 | Run 8 | Run 9 | Run 10 | Average | Speed Up (cpml over glibc) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
sin | glibc | 311 | 292 | 312 | 310 | 297 | 304 | 307 | 300 | 312 | 312 | 305.11 | |
sin | cpml | 156 | 155 | 148 | 150 | 152 | 154 | 152 | 151 | 151 | 163 | 152.89 | 99.56% |
cos | glibc | 251 | 246 | 245 | 243 | 392 | 251 | 240 | 240 | 252 | 240 | 261 | |
cos | cpml | 156 | 151 | 153 | 160 | 159 | 160 | 146 | 149 | 151 | 157 | 154 | 69.48% |
tan | glibc | 9384 | 9345 | 9351 | 9252 | 9195 | 9273 | 9237 | 9272 | 9213 | 9239 | 9264.11 | |
tan | cpml | 172 | 169 | 168 | 166 | 168 | 173 | 170 | 176 | 183 | 175 | 172 | 5286.11% |
sinh | glibc | 305 | 296 | 296 | 302 | 291 | 295 | 437 | 295 | 294 | 298 | 311.56 | |
sinh | cpml | 141 | 136 | 140 | 139 | 135 | 139 | 137 | 140 | 136 | 139 | 137.89 | 125.95% |
cosh | glibc | 352 | 327 | 316 | 363 | 351 | 338 | 352 | 358 | 329 | 362 | 344 | |
cosh | cpml | 138 | 139 | 137 | 141 | 140 | 142 | 138 | 138 | 144 | 138 | 139.67 | 146.29% |
tanh | glibc | 260 | 258 | 270 | 260 | 266 | 260 | 263 | 257 | 265 | 256 | 261.67 | |
tanh | cpml | 212 | 203 | 199 | 203 | 203 | 198 | 210 | 195 | 212 | 198 | 202.33 | 29.33% |
asin | glibc | 1434 | 1197 | 1227 | 1300 | 1346 | 1323 | 1390 | 1227 | 1274 | 1244 | 1280.89 | |
asin | cpml | 627 | 611 | 581 | 641 | 612 | 596 | 660 | 586 | 620 | 692 | 622.11 | 105.89% |
acos | glibc | 1034 | 1207 | 1054 | 1015 | 1015 | 1031 | 1068 | 994 | 1051 | 964 | 1044.33 | |
acos | cpml | 621 | 625 | 657 | 635 | 610 | 638 | 587 | 623 | 617 | 614 | 622.89 | 67.66% |
atan | glibc | 932 | 860 | 904 | 904 | 866 | 902 | 948 | 879 | 908 | 880 | 894.56 | |
atan | cpml | 566 | 536 | 538 | 519 | 538 | 513 | 521 | 533 | 526 | 497 | 524.56 | 70.54% |
asinh | glibc | 784 | 743 | 749 | 741 | 742 | 773 | 751 | 726 | 784 | 743 | 750.22 | |
asinh | cpml | 519 | 513 | 506 | 494 | 527 | 494 | 494 | 529 | 495 | 510 | 506.89 | 48.00% |
acosh | glibc | 912 | 866 | 855 | 785 | 990 | 823 | 865 | 820 | 845 | 836 | 853.89 | |
acosh | cpml | 954 | 912 | 898 | 946 | 905 | 904 | 898 | 920 | 928 | 885 | 910.67 | -6.23% |
atanh | glibc | 1125 | 1939 | 1071 | 1053 | 1079 | 1085 | 1068 | 996 | 1062 | 1143 | 1166.22 | |
atanh | cpml | 875 | 898 | 851 | 828 | 912 | 864 | 843 | 871 | 1355 | 851 | 919.22 | 26.87% |
floor | glibc | 88 | 82 | 82 | 79 | 76 | 91 | 87 | 84 | 84 | 86 | 83.44 | |
floor | cpml | 121 | 123 | 128 | 120 | 121 | 119 | 117 | 120 | 126 | 112 | 120.67 | -30.85% |
ceil | glibc | 89 | 90 | 85 | 85 | 86 | 79 | 81 | 80 | 81 | 80 | 83 | |
ceil | cpml | 123 | 117 | 115 | 114 | 131 | 114 | 119 | 114 | 122 | 118 | 118.22 | -29.79% |
round | glibc | 102 | 77 | 78 | 90 | 85 | 87 | 77 | 83 | 85 | 84 | 82.89 | |
round | cpml | 366 | 111 | 115 | 102 | 118 | 112 | 108 | 107 | 109 | 111 | 110.33 | -24.87% |
trunc | glibc | 84 | 83 | 87 | 89 | 77 | 84 | 82 | 85 | 85 | 84 | 84 | |
trunc | cpml | 118 | 120 | 118 | 117 | 116 | 119 | 112 | 109 | 114 | 110 | 115 | -26.96% |
log | glibc | 790 | 764 | 767 | 763 | 768 | 768 | 732 | 739 | 744 | 747 | 754.67 | |
log | cpml | 502 | 456 | 476 | 476 | 465 | 460 | 922 | 484 | 481 | 456 | 519.56 | 45.25% |
log10 | glibc | 840 | 803 | 808 | 802 | 784 | 856 | 785 | 782 | 774 | 826 | 802.22 | |
log10 | cpml | 527 | 534 | 549 | 551 | 536 | 578 | 535 | 530 | 540 | 537 | 543.33 | 47.65% |
log2 | glibc | 493 | 499 | 522 | 504 | 478 | 504 | 519 | 495 | 499 | 539 | 506.56 | |
log2 | cpml | 520 | 493 | 514 | 509 | 481 | 520 | 493 | 493 | 493 | 728 | 524.89 | -3.49% |
log1p | glibc | 233 | 240 | 235 | 224 | 234 | 230 | 227 | 228 | 229 | 233 | 231.11 | |
log1p | cpml | 304 | 279 | 276 | 297 | 269 | 299 | 291 | 291 | 305 | 286 | 288.11 | -19.78% |
exp | glibc | 401 | 357 | 357 | 403 | 344 | 407 | 413 | 475 | 372 | 401 | 392.11 | |
exp | cpml | 130 | 138 | 132 | 128 | 133 | 139 | 124 | 130 | 137 | 126 | 131.89 | 197.30% |
expm1 | glibc | 225 | 205 | 218 | 208 | 214 | 213 | 216 | 211 | 221 | 208 | 212.67 | |
expm1 | cpml | 160 | 165 | 169 | 157 | 166 | 162 | 157 | 164 | 165 | 157 | 162.44 | 30.92% |
exp2 | glibc | 1339 | 1314 | 1327 | 1305 | 1339 | 1310 | 1284 | 1284 | 1334 | 1309 | 1311.78 | |
exp2 | cpml | 151 | 149 | 136 | 140 | 148 | 137 | 139 | 149 | 138 | 138 | 141.56 | 826.66% |
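As the table shows, many functions, especially the trigonometric and hyperbolic ones, are roughly 50 to 200% faster in libcpml, and tan and exp2 are faster still. In other cases, notably the rounding functions (floor, ceil, round, and trunc), glibc is faster by about 25 to 31%; it also wins, by smaller margins, on acosh, log2, and log1p.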
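To make the Average and Speed Up columns easy to reproduce from your own numbers, here is a small C sketch of the arithmetic. It assumes, as the values in the table confirm, that the average is taken over runs 2 through 10 only, and that Speed Up is the glibc average divided by the cpml average, minus one. The helper names are hypothetical and not part of the cpml-benchmarks suite.

```c
#include <stddef.h>

/* Hypothetical helper: mean of runs[1] .. runs[n-1]. Run 1 (runs[0]) is
 * left out, which reproduces the Average column. Requires n >= 2. */
double average_without_first(const double *runs, size_t n)
{
    double sum = 0.0;
    for (size_t i = 1; i < n; i++)
        sum += runs[i];
    return sum / (double)(n - 1);
}

/* Hypothetical helper: percentage by which cpml beats glibc.
 * Positive means cpml is faster; negative means glibc is faster. */
double speedup_percent(double glibc_avg, double cpml_avg)
{
    return (glibc_avg / cpml_avg - 1.0) * 100.0;
}
```

Applied to the sin row, average_without_first yields 305.11 for glibc and 152.89 for cpml, and speedup_percent returns about 99.56, matching the table. A negative result, such as the roughly -30.85 for floor, means glibc is the faster of the two.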
A few notes:
- Testing was done on my UP1500 with an 800 MHz EV68AL, 8 MB L2 cache, and 4 GB RAM
- It may not be fair to benchmark ceil/floor, as their implementations in glibc are not correct
- I don't entirely trust the glibc tan results, as they show glibc's tan running more than 50 times slower than libcpml's
As further evidence of libcpml's superiority, simply linking nbench with -lcpml instead of -lm speeds up its fourier benchmark by 2.5x to 3.0x.
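The same link-time swap works for any program that calls libm functions. My test suite isn't reproduced here, so the following is only a hypothetical sketch of how a single function might be timed; time_func_ms, its parameters, and the argument sweep are illustrative assumptions, not the suite's actual code. Because the library is chosen when linking, the same source can be built once with -lm and once with -lcpml in its place, as with nbench above.

```c
#define _POSIX_C_SOURCE 200112L
#include <time.h>

/* Written on every iteration so the compiler cannot drop the calls. */
static volatile double sink;

/* Hypothetical helper: times `iters` calls of fn and returns elapsed
 * wall-clock milliseconds. Whether fn resolves to glibc's or libcpml's
 * implementation is decided at link time (-lm versus -lcpml). */
double time_func_ms(double (*fn)(double), long iters)
{
    struct timespec start, end;
    double x = 0.0;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iters; i++) {
        sink = fn(x);
        x += 1e-6; /* vary the argument slightly on each call (assumed range) */
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (double)(end.tv_sec - start.tv_sec) * 1e3
         + (double)(end.tv_nsec - start.tv_nsec) / 1e6;
}
```

A caller would pass a math function such as sin from a main() that includes math.h. Keep in mind that timings depend on the argument range, so both libraries should be fed the same inputs. On older glibc versions, clock_gettime also requires linking with -lrt.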
If you'd like to run this test yourself, here's how. (I assume you run Gentoo on your Alpha.)
- Download libcpml for your processor (EV5, or EV6 and later)
- Add the libcpml ebuild to your Portage overlay
svn co svn://mattst88.com/svn/compaq-c-overlay/
- Install libcpml (and its libots dependency), making sure to set the ev6 USE flag if your Alpha is an EV6
- Check out the test suite with
svn co svn://mattst88.com/svn/cpml-benchmarks
- Edit the CFLAGS variable in the Makefile for your processor
- Run
make test
from the cpml-benchmarks folder
- A results.csv file is generated in the cpml-benchmarks folder; analyze the results with a spreadsheet program
If you like, email the results to me. I'd like to see what these benchmarks look like on an EV5 machine.