13 December 2008 - Does Anyone Care About Fixing Bugs?

As time goes on, alternative architectures like Alpha and PA-RISC slowly lose their userbase. Experienced developers move on to things that interest them more. Emphasis isn't put on fixing bugs for these aging platforms, and the level of support slowly erodes. Eventually a small hardcore userbase is all that is left. The Gentoo Bugzilla showed this effect on the Alpha platform. All nontrivial bugs were left to rot. What's worse, many bugs were so old that the software containing them wasn't even in Portage anymore, yet no one closed the bug report or asked if it was fixed. One, a two-and-a-half-year-old bug about a failing cipher algorithm in libmcrypt caught my eye. I decided I'd give fixing it a shot.

The project's KNOWN-BUGS file stated

- cast-256 and rc6 do not work properly on Alpha (64 bit) machines

Fittingly, the bug was filed by a developer who has since retired. An automated test suite included with libmcrypt reported a failing cipher, CAST-256. Maybe it's a bug with gcc. Months pass. If it is, it's a bug across both 3.x and 4.x series. Years pass. Maybe we'll just mask the failure.

No one seemed to want to fire up vi and check the code.

I decided I'd compile the same version side by side on my AMD64 desktop and my UP1500 Alpha. Both compile cleanly, and I can reproduce the failing case quickly. The first thing I check is the test suite itself by adding print statements and comparing the output between the AMD64 and Alpha systems. All the start-up code looks fine. The problem has to be in the library itself, which is what I expect.

Finally, I find that the results begin to vary during a function call to _mcrypt_set_key. Continuing, I slowly isolate the failing code to the k_rnd macro, then the f1 macro, and finally to the rotl32 macro.

The rotl32 macro rotates bits left in a 32-bit memory cell. The macro and its siblings look like

#define rotl32(x,n)   (((x) << ((word32)(n))) | ((x) >> (32 - (word32)(n))))
#define rotr32(x,n)   (((x) >> ((word32)(n))) | ((x) << (32 - (word32)(n))))
#define rotl16(x,n)   (((x) << ((word16)(n))) | ((x) >> (16 - (word16)(n))))
#define rotr16(x,n)   (((x) >> ((word16)(n))) | ((x) << (16 - (word16)(n))))

I confirmed that this function did yield different results on AMD64 and Alpha by writing a small test program. Guessing, I figured that this implementation wasn't compatible with Alphas and that I could easily find another working implementation. In the Linux Kernel's include/linux/bitops.h file, they had virtually the same implementation. No luck there.

After a few hours of scouring the internet for quick-fix solutions, I turned to the Alpha Architecture Handbook and look up Alpha's shift instructions, sll and srl.

SxL   Ra.rq,Rb.rq,Rc.wq
Rc <- LEFT_SHIFT (Rav, Rbv<5:0>)      !SLL
Rc <- RIGHT_SHIFT(Rav, Rbv<5:0>)      !SRL

Beyond the terse syntax, this means that only six bits of the shift argument matter. The designers did this because with the Alpha's 64-bit wide registers, it doesn't make sense to implement instructions (and circuitry) to shift more than 63 times. Just the same, the rotl32 macro is only supposed to operate on 32-bit numbers, so it doesn't make sense to rotate more than 31 times.

The result of rotating 32 times should be the same as the number input, since it would rotate the bits the entire width of the field. On Alpha though there are more than 32-bits in each register, so shifting 32 times doesn't leave the bits in place. It moves them into the upper part of the register.

By masking the shift argument and ignoring all but the first five bits, I fixed the problem.

#define rotl32(x,n)   (((x) << ((word32)(n & 31))) | ((x) >> (32 - (word32)(n & 31))))
#define rotr32(x,n)   (((x) >> ((word32)(n & 31))) | ((x) << (32 - (word32)(n & 31))))
#define rotl16(x,n)   (((x) << ((word16)(n & 15))) | ((x) >> (16 - (word16)(n & 15))))
#define rotr16(x,n)   (((x) >> ((word16)(n & 15))) | ((x) << (16 - (word16)(n & 15))))

This bug didn't affect AMD64, since it has 32-bit shift instructions as well as 64-bit. Undoubtedly though, had this been a problem on AMD64 instead of an obscure and aging architecture such as Alpha, it would have been fixed in a heartbeat.

It's amazing that such a simple fix was needed to squash a bug that (1) was reported by a Gentoo/Alpha developer, and (2) had been in the tracker for two-and-a-half years.

Now, I need to check on that Kernel code. Who knows how long it's contained this bug!

Tags: alpha gentoo linux