The Case for Open Sourcing Alpha-Optimized Libraries
As long as I've been interested in Alpha hardware, I've been intrigued by Compaq's Alpha-optimized compilers and libraries. In some cases, the compilers produce code that runs multiple times faster than code generated by gcc. The math library, libcpml, contains functions that execute in half the time of their glibc equivalents. Since the abandonment of the Alpha platform, this code has languished. In some cases, the performance gap between Compaq's tools and their open source counterparts has shrunk. In others, the benefits of hand-tuned assembly still shine. This prompted me to contact HP and request the release of the code. They unfortunately concluded that an old MIPS license prevented them from releasing the compilers. I've recently contacted HP once again to persuade them to release libcpml and libots as free software, as libraries containing nothing but hand-tuned Alpha assembly could not be encumbered by this license. I also attached the following benchmarks as evidence of why this code is still valuable so many years after it was written.
Using a test suite I wrote, I benchmarked the implementations of the math functions in glibc against those in libcpml.
Function | Library | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Run 6 | Run 7 | Run 8 | Run 9 | Run 10 | Average (Runs 2-10) | Speed Up (cpml over glibc) |
sin | glibc | 311 | 292 | 312 | 310 | 297 | 304 | 307 | 300 | 312 | 312 | 305.11 | |
sin | cpml | 156 | 155 | 148 | 150 | 152 | 154 | 152 | 151 | 151 | 163 | 152.89 | 99.56% |
cos | glibc | 251 | 246 | 245 | 243 | 392 | 251 | 240 | 240 | 252 | 240 | 261 | |
cos | cpml | 156 | 151 | 153 | 160 | 159 | 160 | 146 | 149 | 151 | 157 | 154 | 69.48% |
tan | glibc | 9384 | 9345 | 9351 | 9252 | 9195 | 9273 | 9237 | 9272 | 9213 | 9239 | 9264.11 | |
tan | cpml | 172 | 169 | 168 | 166 | 168 | 173 | 170 | 176 | 183 | 175 | 172 | 5286.11% |
sinh | glibc | 305 | 296 | 296 | 302 | 291 | 295 | 437 | 295 | 294 | 298 | 311.56 | |
sinh | cpml | 141 | 136 | 140 | 139 | 135 | 139 | 137 | 140 | 136 | 139 | 137.89 | 125.95% |
cosh | glibc | 352 | 327 | 316 | 363 | 351 | 338 | 352 | 358 | 329 | 362 | 344 | |
cosh | cpml | 138 | 139 | 137 | 141 | 140 | 142 | 138 | 138 | 144 | 138 | 139.67 | 146.29% |
tanh | glibc | 260 | 258 | 270 | 260 | 266 | 260 | 263 | 257 | 265 | 256 | 261.67 | |
tanh | cpml | 212 | 203 | 199 | 203 | 203 | 198 | 210 | 195 | 212 | 198 | 202.33 | 29.33% |
asin | glibc | 1434 | 1197 | 1227 | 1300 | 1346 | 1323 | 1390 | 1227 | 1274 | 1244 | 1280.89 | |
asin | cpml | 627 | 611 | 581 | 641 | 612 | 596 | 660 | 586 | 620 | 692 | 622.11 | 105.89% |
acos | glibc | 1034 | 1207 | 1054 | 1015 | 1015 | 1031 | 1068 | 994 | 1051 | 964 | 1044.33 | |
acos | cpml | 621 | 625 | 657 | 635 | 610 | 638 | 587 | 623 | 617 | 614 | 622.89 | 67.66% |
atan | glibc | 932 | 860 | 904 | 904 | 866 | 902 | 948 | 879 | 908 | 880 | 894.56 | |
atan | cpml | 566 | 536 | 538 | 519 | 538 | 513 | 521 | 533 | 526 | 497 | 524.56 | 70.54% |
asinh | glibc | 784 | 743 | 749 | 741 | 742 | 773 | 751 | 726 | 784 | 743 | 750.22 | |
asinh | cpml | 519 | 513 | 506 | 494 | 527 | 494 | 494 | 529 | 495 | 510 | 506.89 | 48.00% |
acosh | glibc | 912 | 866 | 855 | 785 | 990 | 823 | 865 | 820 | 845 | 836 | 853.89 | |
acosh | cpml | 954 | 912 | 898 | 946 | 905 | 904 | 898 | 920 | 928 | 885 | 910.67 | -6.23% |
atanh | glibc | 1125 | 1939 | 1071 | 1053 | 1079 | 1085 | 1068 | 996 | 1062 | 1143 | 1166.22 | |
atanh | cpml | 875 | 898 | 851 | 828 | 912 | 864 | 843 | 871 | 1355 | 851 | 919.22 | 26.87% |
floor | glibc | 88 | 82 | 82 | 79 | 76 | 91 | 87 | 84 | 84 | 86 | 83.44 | |
floor | cpml | 121 | 123 | 128 | 120 | 121 | 119 | 117 | 120 | 126 | 112 | 120.67 | -30.85% |
ceil | glibc | 89 | 90 | 85 | 85 | 86 | 79 | 81 | 80 | 81 | 80 | 83 | |
ceil | cpml | 123 | 117 | 115 | 114 | 131 | 114 | 119 | 114 | 122 | 118 | 118.22 | -29.79% |
round | glibc | 102 | 77 | 78 | 90 | 85 | 87 | 77 | 83 | 85 | 84 | 82.89 | |
round | cpml | 366 | 111 | 115 | 102 | 118 | 112 | 108 | 107 | 109 | 111 | 110.33 | -24.87% |
trunc | glibc | 84 | 83 | 87 | 89 | 77 | 84 | 82 | 85 | 85 | 84 | 84 | |
trunc | cpml | 118 | 120 | 118 | 117 | 116 | 119 | 112 | 109 | 114 | 110 | 115 | -26.96% |
log | glibc | 790 | 764 | 767 | 763 | 768 | 768 | 732 | 739 | 744 | 747 | 754.67 | |
log | cpml | 502 | 456 | 476 | 476 | 465 | 460 | 922 | 484 | 481 | 456 | 519.56 | 45.25% |
log10 | glibc | 840 | 803 | 808 | 802 | 784 | 856 | 785 | 782 | 774 | 826 | 802.22 | |
log10 | cpml | 527 | 534 | 549 | 551 | 536 | 578 | 535 | 530 | 540 | 537 | 543.33 | 47.65% |
log2 | glibc | 493 | 499 | 522 | 504 | 478 | 504 | 519 | 495 | 499 | 539 | 506.56 | |
log2 | cpml | 520 | 493 | 514 | 509 | 481 | 520 | 493 | 493 | 493 | 728 | 524.89 | -3.49% |
log1p | glibc | 233 | 240 | 235 | 224 | 234 | 230 | 227 | 228 | 229 | 233 | 231.11 | |
log1p | cpml | 304 | 279 | 276 | 297 | 269 | 299 | 291 | 291 | 305 | 286 | 288.11 | -19.78% |
exp | glibc | 401 | 357 | 357 | 403 | 344 | 407 | 413 | 475 | 372 | 401 | 392.11 | |
exp | cpml | 130 | 138 | 132 | 128 | 133 | 139 | 124 | 130 | 137 | 126 | 131.89 | 197.30% |
expm1 | glibc | 225 | 205 | 218 | 208 | 214 | 213 | 216 | 211 | 221 | 208 | 212.67 | |
expm1 | cpml | 160 | 165 | 169 | 157 | 166 | 162 | 157 | 164 | 165 | 157 | 162.44 | 30.92% |
exp2 | glibc | 1339 | 1314 | 1327 | 1305 | 1339 | 1310 | 1284 | 1284 | 1334 | 1309 | 1311.78 | |
exp2 | cpml | 151 | 149 | 136 | 140 | 148 | 137 | 139 | 149 | 138 | 138 | 141.56 | 826.66% |
As can be seen, many math (especially trigonometric) functions are 50-200% faster in libcpml. In other cases, such as the rounding functions, glibc is faster.
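For anyone checking the table by hand, the two derived columns can be reproduced with a few lines of C. This is a sketch based on what the numbers imply, not on the test suite's source: the Average column only works out if Run 1 is skipped (for example, the sin/glibc row averages to 305.11 over runs 2 through 10), and the speed-up column matches the glibc average divided by the cpml average, minus one. The function names below are made up for illustration.

```c
#include <stddef.h>

/* Mean of runs[1] .. runs[n-1]. The first run is skipped, which is what the
 * Average column appears to do (inferred from the numbers, not documented). */
static inline double mean_skip_first(const double *runs, size_t n)
{
    double sum = 0.0;
    for (size_t i = 1; i < n; i++)
        sum += runs[i];
    return n > 1 ? sum / (double)(n - 1) : 0.0;
}

/* Speed-up of cpml over glibc in percent: (glibc / cpml - 1) * 100.
 * A negative result means glibc was the faster of the two. */
static inline double speedup_percent(double glibc_avg, double cpml_avg)
{
    return (glibc_avg / cpml_avg - 1.0) * 100.0;
}
```

Fed the two sin rows, these give 305.11 and 152.89 for the averages and 99.56% for the speed-up, matching the table.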
A few notes:
- Testing was done on my UP1500 with an 800 MHz EV68AL, 8MB L2 cache, and 4 GB RAM
- It may not be fair to benchmark ceil/floor as their implementations in glibc are not correct
- I don't entirely trust the glibc tan results, as they appear to be 50x slower than libcpml
As more evidence of libcpml's superiority, by simply linking nbench with -lcpml instead of -lm, the fourier benchmark gets a speed up of 2.5x to 3.0x.
If you'd like to run this test yourself, here's how. (I assume you run Gentoo on your Alpha.)
- Download libcpml for your processor, EV5, EV6 or later
- Add the libcpml ebuild to your portage overlay
svn co svn://mattst88.com/svn/compaq-c-overlay/
- Install libcpml (and its libots dependency) and make sure to set the ev6 USE flag if your Alpha is an EV6
- Check out the test suite with
svn co svn://mattst88.com/svn/cpml-benchmarks
- Edit the CFLAGS variable in Makefile for your processor
- Run
make test
from the cpml-benchmarks folder
- A results.csv file is generated in the cpml-benchmarks folder. Analyze the results using a spreadsheet program
If you like, email the results to me. I'd like to see what these benchmarks look like on an EV5 machine.
UP1500s and 833 MHz Alpha CPUs For Sale
One of the hardest things about using an alternative architecture like the Alpha is the small userbase. Since very few people have Alpha hardware, relative to other architectures, if one encounters a problem there are exceedingly few users able and willing to help. Even worse, if the problem is specific to your model, the chances of getting help are slimmer still. Another issue is the difficulty in finding replacement parts. Want replacement Slot B CPUs? How about the impossible-to-find UP1500? In most cases, you'd have a terrible time even finding the parts and when you do, watch out for the price tag. Fortunately for you, I've got both of these areas covered. I've got brand new, sealed, in the box, latest revision UP1500 motherboards and unused, in the box 833 MHz 4MB Slot B CPUs for sale! Edit: Sold out.
The Samsung UP1500 is the quintessential Alpha motherboard. It sports
- an 800 MHz EV68AL Alpha processor, with 8 MB of L2 cache
- the latest revision (Rev B4) AMD-761 (Irongate-2) chipset
- up to 4 GB of ECC Registered DDR PC2100 RAM
- an AGP 4x slot
- other niceties such as ATA-100, on-board Ethernet, Sound, USB, and 3 PCI slots
Unbelievably, these boards are brand new and still sealed in the box. The factory date is listed as 01/12/28. Someone packed these away in a warehouse seven years ago and forgot about them. Back then, they could have sold them at prices in excess of 2500 dollars. Bad for them. Good for you. Their loss is your gain.
This is the only Alpha to support DDR RAM, and outside of the outrageously expensive EV7 Marvel systems, the only Alpha to support AGP 4x!
At the time of this writing, I've got mine set up with 4 GB of CL2 DDR RAM, a 4 port USB 2.0 PCI card, and a Radeon X1550 PCI card. I can't pass up the AGP though, so I'm waiting to grab a Radeon 9800 Pro.
Maybe you've already got a nice Samsung Alpha motherboard, such as the UP2000[+]. Unfortunately, your really nice and expensive processors wore out after years of service, and you can't find replacements. Don't worry about replacements. I've got upgrades!
These processors are the fastest available for the UP2000! Upgrade from your old 667 MHz 2MB CPUs to a pair of 833 MHz 4MB EV68AL Slot B processors.
Disclaimer: All the parts are working to the best of my knowledge. All UP1500s are sealed in their original boxes. I've opened one for myself, and it operates beautifully. The Slot B CPUs are opened but unused.
All these parts are guaranteed not to be dead-on-arrival.
If you're an Alpha fan and would like to get your hands on the perfect Samsung motherboard or a pair of the fastest Slot B CPUs, contact me. Quantities are extremely limited. Customers are served on a first-come, first-served basis.
I sincerely hope that by putting some UP1500s and fast CPUs in the hands of Alpha users, we can band together to fix the problems we face.
The State of Alpha Linux
Software is never finished; it's forgotten. There is always one more enhancement to be made or one little quirk to work out. Sometimes there are even big problems. It happens from time to time. It's expected, and it's expected that the problems will be fixed. After spending quite a bit of time recently working with Linux on the Alpha platform, I've come to realize we face some very serious problems. And unfortunately, these may not ever be fixed, putting in jeopardy the future (hah!) of Alpha/Linux. I decided to articulate these problems in an email to the Linux on Alpha Processors mailing list in order to inform and ultimately find solutions and breathe a bit of life back into Alpha/Linux. I'd like to think that Alpha/Linux isn't a piece of forgotten software, not yet.
The State of Alpha Linux
We're all subscribed to this list because we use a dying platform. We do what we can to keep it going, but in recent months the State of Alpha Linux has been deteriorating at an accelerated rate.
Let me outline some issues facing us today:
- We have no glibc/Alpha maintainer [1]
- Kernel development for Alpha is comatose
- We can't run modern X.Org [2]
To make things worse, for such a small group of users, we're much too segregated and disorganized. For instance, how many (of the only four) Gentoo/Alpha maintainers are subscribed to this list? Debian/Alpha? How many realized we were without a glibc maintainer? That we can't use X.Org 7.4?
If this trend continues, we will first lose X.Org support completely. I even had an X.Org developer tell me he didn't care [about Alpha support] when I pinged him about an Alpha bug he had originally filed [3]!
We'll later lose glibc support. As it stands now, Alpha isn't even in the main tree [4]. I'm not sure what version Debian ships, but Gentoo is 3 versions behind at 2.6.1. Newer than that and the test suite causes a hard lock [5]. How much longer is it going to be before 2.6 is incompatible with the latest version and we begin to lose the ability to use other modern software?
While we may never lose kernel support, it will certainly begin to lag behind other platforms more and more. Bugs begin to take longer and longer to be fixed [6]. Release candidate kernels as late in the cycle as rc-8 of the 2.6.28 series fail to compile on Alpha [7]. This is definitely a worrying sign.
It is certainly expected that as a platform ages, it slowly loses its users and developers. In 1999, many average users knew or were interested in learning Alpha assembly language, were interested in support for Alpha among Free Software, and were interested in programming for the platform. Obviously this cannot be the case today. We don't expect that it should.
We, the ones who do wish to see our platform live on, even if only a little longer, should focus on fixing what we can and maintaining what we already have.
Whether Fedora adds Alpha as a Second Tier Architecture is trivial in comparison to these issues. We should focus on making sure we have working software for Fedora/Alpha before we consider how to properly market it.
We, the small band of Alpha users, need to work together. We have the same problems, so why should we work separately on them?
In order to facilitate better communication among Alpha users and developers, please use the Alpha IRC channel on Freenode, #alpha, and the Wiki [8]. If you have unused hardware that may be useful to developers, consider donating it.
From here, it's up to us to find solutions to these problems.
Ideas and Suggestions requested.
Matt Turner
- [1] http://sourceware.org/ml/libc-alpha/2008-12/msg00009.html
- [2] http://bugs.freedesktop.org/show_bug.cgi?id=17801 and http://bugzilla.kernel.org/show_bug.cgi?id=10893
- [3] http://bugs.freedesktop.org/show_bug.cgi?id=19026
- [4] http://sources.redhat.com/bugzilla/show_bug.cgi?id=6896
- [5] Actually a kernel problem, http://bugs.gentoo.org/show_bug.cgi?id=205099
- [6] http://bugzilla.kernel.org/show_bug.cgi?id=10893
- [7] http://lkml.org/lkml/2008/10/29/69
- [8] http://alphalinux.org/wiki/index.php/Main_Page
What can we do? I think there are a couple things we need to do, namely:
- Consolidate our efforts by consolidating distributions. With as few users as we have, we have fewer developers. There's no use in testing packages on Debian or Fedora when they're already tested in Gentoo.
- Demand that Alpha remain supported. Projects, including projects integral to the Linux desktop such as X.Org, need to know that we do still use Alpha hardware and that we want to be supported. Make yourself heard in #xorg-devel and appropriate mailing lists.
- Experienced developers need to take the lead. We understand that it's hard to justify time spent working on Alpha-related issues. We do not ask much. We just ask that you not abandon us.
If we can do these things, we will be on the road to recovery.
Does Anyone Care About Fixing Bugs?
As time goes on, alternative architectures like Alpha and PA-RISC slowly lose their userbase. Experienced developers move on to things that interest them more. Emphasis isn't put on fixing bugs for these aging platforms, and the level of support slowly erodes. Eventually a small hardcore userbase is all that is left. The Gentoo Bugzilla showed this effect on the Alpha platform. All nontrivial bugs were left to rot. What's worse, many bugs were so old that the software containing them wasn't even in Portage anymore, yet no one closed the bug report or asked if it was fixed. One, a two-and-a-half-year-old bug about a failing cipher algorithm in libmcrypt, caught my eye. I decided I'd give fixing it a shot.
The project's KNOWN-BUGS file stated
- cast-256 and rc6 do not work properly on Alpha (64 bit) machines
Fittingly, the bug was filed by a developer who has since retired. An automated test suite included with libmcrypt reported a failing cipher, CAST-256. Maybe it's a bug with gcc. Months pass. If it is, it's a bug across both 3.x and 4.x series. Years pass. Maybe we'll just mask the failure.
No one seemed to want to fire up vi and check the code.
I decided I'd compile the same version side by side on my AMD64 desktop and my UP1500 Alpha. Both compile cleanly, and I can reproduce the failing case quickly. The first thing I check is the test suite itself by adding print statements and comparing the output between the AMD64 and Alpha systems. All the start-up code looks fine. The problem has to be in the library itself, which is what I expect.
Finally, I find that the results begin to vary during a function call to _mcrypt_set_key. Continuing, I slowly isolate the failing code to the k_rnd macro, then the f1 macro, and finally to the rotl32 macro.
The rotl32 macro rotates bits left in a 32-bit memory cell. The macro and its siblings look like
#define rotl32(x,n) (((x) << ((word32)(n))) | ((x) >> (32 - (word32)(n))))
#define rotr32(x,n) (((x) >> ((word32)(n))) | ((x) << (32 - (word32)(n))))
#define rotl16(x,n) (((x) << ((word16)(n))) | ((x) >> (16 - (word16)(n))))
#define rotr16(x,n) (((x) >> ((word16)(n))) | ((x) << (16 - (word16)(n))))
I confirmed that the rotl32 macro did yield different results on AMD64 and Alpha by writing a small test program. Guessing, I figured that this implementation wasn't compatible with Alphas and that I could easily find another working implementation. The Linux Kernel's include/linux/bitops.h file had virtually the same implementation. No luck there.
After a few hours of scouring the internet for quick-fix solutions, I turned to the Alpha Architecture Handbook and looked up Alpha's shift instructions, sll and srl.
SxL Ra.rq,Rb.rq,Rc.wq
Rc <- LEFT_SHIFT (Rav, Rbv<5:0>) !SLL
Rc <- RIGHT_SHIFT(Rav, Rbv<5:0>) !SRL
Beyond the terse syntax, this means that only six bits of the shift argument matter. The designers did this because with the Alpha's 64-bit wide registers, it doesn't make sense to implement instructions (and circuitry) to shift more than 63 times. Just the same, the rotl32 macro is only supposed to operate on 32-bit numbers, so it doesn't make sense to rotate more than 31 times.
The result of rotating 32 times should be the same as the number input, since it would rotate the bits the entire width of the field. On Alpha though there are more than 32-bits in each register, so shifting 32 times doesn't leave the bits in place. It moves them into the upper part of the register.
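One way to see the effect without an Alpha at hand is to model the two shift behaviours in portable C. The sketch below is only a model under stated assumptions, not compiler output: the "5-bit" functions stand in for a 32-bit shift instruction that looks at the low five bits of the count (as x86's 32-bit shifts do), and the "6-bit" functions stand in for a 64-bit register shifted with six bits of count and then truncated to 32 bits. All function names are made up for illustration. In C itself, shifting a 32-bit value by 32 or more is undefined, which is why the original macro is free to behave differently on the two machines.

```c
#include <stdint.h>

/* Model A (assumption): 32-bit shift, only count bits <4:0> are used. */
static inline uint32_t shl_5bit(uint32_t x, uint32_t n) { return x << (n & 31); }
static inline uint32_t shr_5bit(uint32_t x, uint32_t n) { return x >> (n & 31); }

/* Model B (assumption): 64-bit register, count bits <5:0> are used,
 * and the result is truncated back to 32 bits afterwards. */
static inline uint32_t shl_6bit(uint32_t x, uint32_t n)
{
    return (uint32_t)((uint64_t)x << (n & 63));
}
static inline uint32_t shr_6bit(uint32_t x, uint32_t n)
{
    return (uint32_t)((uint64_t)x >> (n & 63));
}

/* The unmasked rotl32 expression, evaluated under each model.
 * 32 - n wraps around as an unsigned value when n > 32. */
static inline uint32_t rotl32_model_5bit(uint32_t x, uint32_t n)
{
    return shl_5bit(x, n) | shr_5bit(x, 32 - n);
}
static inline uint32_t rotl32_model_6bit(uint32_t x, uint32_t n)
{
    return shl_6bit(x, n) | shr_6bit(x, 32 - n);
}
```

With x = 0x80000001 and a count of 1, both models return 0x00000003. With a count of 33, the 5-bit model still returns 0x00000003, while the 6-bit model returns 0 because both halves are shifted out of the low 32 bits.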
By masking the shift argument and ignoring all but the low five bits (four for the 16-bit variants), I fixed the problem.
#define rotl32(x,n) (((x) << ((word32)(n & 31))) | ((x) >> (32 - (word32)(n & 31))))
#define rotr32(x,n) (((x) >> ((word32)(n & 31))) | ((x) << (32 - (word32)(n & 31))))
#define rotl16(x,n) (((x) << ((word16)(n & 15))) | ((x) >> (16 - (word16)(n & 15))))
#define rotr16(x,n) (((x) >> ((word16)(n & 15))) | ((x) << (16 - (word16)(n & 15))))
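One caveat with the masked macros: when the count is 0 (or a multiple of 32), the second shift is still by 32 - 0 = 32, which C leaves undefined for a 32-bit operand. A variant that also masks the complementary count avoids that. The following is a sketch written as inline functions rather than macros; it is not libmcrypt's code, and the names rotl32_safe and rotr32_safe are hypothetical.

```c
#include <stdint.h>

/* Rotate a 32-bit value left by n. Both shift counts are reduced to 0..31,
 * so neither shift is ever by the full width of the operand. */
static inline uint32_t rotl32_safe(uint32_t x, unsigned n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

/* Rotate a 32-bit value right by n, using the same masking. */
static inline uint32_t rotr32_safe(uint32_t x, unsigned n)
{
    n &= 31;
    return (x >> n) | (x << ((32 - n) & 31));
}
```

For instance, rotl32_safe(0x80000001, 1) and rotl32_safe(0x80000001, 33) both give 0x00000003, and a count of 0 or 32 returns the input unchanged. Using functions instead of macros also means the count is evaluated once, so an argument such as a | b cannot interact with the & 31 through operator precedence.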
This bug didn't affect AMD64, since it has 32-bit shift instructions as well as 64-bit. Undoubtedly though, had this been a problem on AMD64 instead of an obscure and aging architecture such as Alpha, it would have been fixed in a heartbeat.
It's amazing that such a simple fix was needed to squash a bug that (1) was reported by a Gentoo/Alpha developer, and (2) had been in the tracker for two-and-a-half years.
Now, I need to check on that Kernel code. Who knows how long it's contained this bug!
Status of X11 on Alpha
As mentioned yesterday, X.Org 7.4 (xserver-1.5 and newer) cannot operate on Alpha due to the way it accesses PCI resources such as ROM information and video memory. Kernel Bug 10893 was filed 6 months ago, but nothing has been fixed. A work-around is to implement a fallback in libpciaccess that would access /dev/mem directly, as previous Xservers do. Unfortunately, no one appears to care enough about X support on Alpha to implement it.
Julien Cristau (jcristau), an X.Org developer, originally reported the implications of providing no fallback to the Debian bug tracking system. After failing to find it reported anywhere in FreeDesktop.org's Bugzilla, I reported it. On the #xorg-devel IRC channel, I asked jcristau if he could add anything to the bug report.
<mattst88> jcristau, if you could add anything to bug 19026, I'd really appreciate it.
<jcristau> mattst88: honestly i don't really care..
Not a good sign. Discouraged, I worked on something else for a half hour. I came back to IRC to see that developers had been discussing the fallback. No one was particularly enthusiastic. I asked Adam Jackson (ajax) of Red Hat why he opposed the fallback.
<mattst88> do we not want this fallback just on principle or because no one really cares to write it?
<ajax> mattst88: can't it be both?
<mattst88> sure, but is a temporary fallback really unacceptable?
<ajax> it's distasteful. i'm not going to write it. if someone else did, i probably wouldn't stop them.
I figured at this point I'd bother Ian Romanick, the libpciaccess maintainer, a bit more to see if I could get anything done. Before I had the chance though, David Airlie, responsible for all sorts of X development, responded.
<airlied> mattst88: also kms doesn't use the sysfs files
<airlied> or pciaccess.
The obvious implication of this statement is that once KMS (kernel modesetting) is implemented, lacking PCI resource files won't matter!
Unfortunately, it's not as quick and easy as we'd hope.
<airlied> but I need to revisit the whole mapping VRAM into unpriv userspace on those bonghits platforms.
<mattst88> right, so it should allow people to use radeons without fbdev, but isn't the radeon driver going to use sysfs/libpciaccess?
<airlied> mattst88: not with kms.
<airlied> userspace drivers in kms don't get access to all of VRAM
<airlied> or to registers.
<mattst88> so with kms, all this business about fallbacks and sysfs won't matter?
<airlied> no, however we have a whole new set of worries.
<airlied> mattst88: things like alpha sparsemem means we can't map VRAM into userspace on those platforms nicely.
<airlied> I need to read up more on the drug induced haze that is alpha mmio
<mattst88> is it doable? that is, are you at all interested in doing it? :)
<airlied> mattst88: I'm probably having to figure out how it might all work for IA64.
<mattst88> is that a similar situation to alpha?
<airlied> well its bad in that you can crap out certain machines if you allow users to access the mmio space.
<airlied> so its a DoS.
As always, there's work to be done, but this time it looks like there's someone who is actually going to do the work.
If anyone is interested in testing kernel modesetting with an R300, R400, or R500, check out David Airlie's drm-rawhide branch of his drm-2.6 kernel tree on Kernel.org.
I'll attempt to test with my Radeon X1550 and UP1500 motherboard soon and will report what I find.