19 March 2014 - Laptop choices and aftermath

In November I was lamenting the lack of selection in credible Haswell-powered laptops for Mesa development. I chose the 15" MacBook Pro, while coworkers picked the 13" MBP and the System76 Galago Pro. After using the three laptops for a few months, I review our choices and whether they panned out like we expected.

                  CPU              RAM     Graphics          Screen            Storage        Battery
13" MacBook Pro   2.8 GHz 4558U    16 GiB  GT3 - 1200 MHz    13.3" 2560x1600   512 GiB PCIe   71.8 Wh
15" MacBook Pro   2.0 GHz 4750HQ   16 GiB  GT3e - 1200 MHz   15.4" 2880x1800   256 GiB PCIe   95 Wh
Galago Pro        2.0 GHz 4750HQ   16 GiB  GT3e - 1200 MHz   14.1" 1920x1080   many options   52 Wh

15" MacBook Pro

The installation procedure on the MacBook was very simple. I shrank the HFS+ partition from OS X and installed rEFInd, then followed the usual Gentoo installation process.

Quirks and Annoyances

Running Linux on the MacBook is a good experience overall, with some quirks:

Worst Thing: Insufficient cooling

The worst thing about the MacBook is the insufficient cooling. Even forcing the two fans to their maximum speeds isn't enough to prevent the CPU from thermal throttling in less than a minute of full load. Most worrying is that my CPU's core #1 seems to run significantly hotter under load than the other cores. It's always the first, and routinely the only, core to reach 100 C, causing the whole CPU package to be throttled until it cools slightly. The temperature gradient across a chip of only 177 square millimeters is also troubling: under load, core #1 is frequently 15 C hotter than core #3. The only plausible conclusion I've come to is that the thermal paste isn't applied evenly across the CPU die. And since Apple uses tamper-resistant screws, I couldn't reapply the thermal paste without special tools (and probably voiding the warranty).

Best Thing: Retina display

I didn't realize how much the Retina display would improve the experience. Having multiple windows (that would have been close to full screen at 1080p) open at once is really nice. Being able to have driver code open on the left half of the screen, and the PDF documentation open on the right makes patch review quicker and more efficient. I've attached other laptops I've used to larger monitors, but I've never even felt like trying with the 15" MBP.

13" MacBook Pro

I consider the 13" MacBook Pro to be strictly inferior (okay, lighter and smaller is nice, but...) to the 15". Other than the obvious differences in the hardware, the most disappointing thing I've discovered about it is that the 13" screen isn't really big enough to be comfortable for development. The coworker who owns it plugs it into his physically larger 1080p monitor when he gets to the office. For a screen that's arguably the laptop's biggest selling point, it's not getting much use.

As I mentioned, I'm perfectly satisfied with the 15" screen for everyday development.

System76 Galago Pro

I used the Galago Pro for about three weeks before switching to the 15" MacBook. All told, it's a really compelling system, except for a serious lack of attention to detail.

Quirks and Annoyances

Worst Thing: Keyboard

The keyboard is probably the worst part. The first time I booted the system, typing k while holding the shift key wouldn't register a key press. Lower case k typed fine, but with shift held – nothing. After about 25 presses, it began working without any indication as to what changed.

The key travel is very short, you get almost no feedback, and keys pressed slightly off center may not register. Typing on it can be a rather frustrating experience. Beyond it being a generally unpleasant keyboard, the function key placement confirms that the keyboard is a complete afterthought: Suspend is between Mute and Volume Down. Whoops!

Best Thing: Cooling

The Galago Pro has an excellent cooling system. Its fans are capable of moving a surprising amount of air and don't make too much noise doing it. Under full load, the CPU's temperature never passed 84 C – 16 C cooler than the 15" MBP (and the MBP doesn't break 100 C only because it starts throttling!). On top of not scorching your lap during compiles, the cooler temperatures mean the CPU and GPU are going to be able to stay in turbo mode longer and give better performance.

Final thoughts

Concerns about the keyboard and general build quality of the Galago Pro turned out to be well founded. I think it's possible to get used to the keyboard, and if you do, I feel confident that the system is really nice to use (well, I guess you have to get used to the other input device too).

I'm overall quite happy with the MacBook Pro. The Retina display is awesome, and the PCIe SSD is incredibly fast. I was most worried about the 15" MacBook overheating and triggering thermal throttling. Unfortunately, that worry was well founded. Other than the quirks, which are par for the course, the overheating issue is the one significant downside to this machine.

Tags: freedesktop intel linux mesa

17 November 2013 - Difficulty in Finding a Good Development Laptop

When I started working at Intel last year on the open source 3D driver, I was given a spare Lenovo T420s (Sandybridge) as my development machine. Almost everyone on my team had upgraded to Ivy Bridge by February, but I planned just to hold out a few months until Haswell was released. I then spent all summer wondering where the Haswell laptops were, and only now, five months later, has Lenovo released Thinkpads with Haswell. It's time for a new development machine, and after months of research the only conclusion I've come to is that it's really hard to find a good laptop for my (admittedly strange) use case.

I use my development laptop for undemanding tasks like text editing, reading documentation, email, patch review, but also things that benefit greatly by fast multicore hardware: compiling Mesa, running piglit, and compiling large sets of real world GLSL shaders. All of these are parallel tasks that see linear speed ups given additional CPU cores. Spending less time waiting for a compile or a run of the test suite to finish means I can test changes more quickly and do my work more efficiently.

Given these uses, my requirements are a quad-core Haswell laptop with a high-resolution screen (better than 1080p), GT3e graphics, and at least 8 GiB of RAM, on a budget of $2000 (less than that is obviously easier to justify to the people with the checkbook). I also have no use for dedicated graphics, and don't want it if it will cause any problems in using the Haswell GT.

I'm looking for a high end laptop with fast graphics, but without a discrete card. Sort of unsurprisingly, this is hard to find.

13" – 14" laptops

An early favorite was System76's rebranded Galago Pro. It looks amazing on paper: I can configure a 14" quad-core GT3e system with 16 GiB of RAM and two disks for around $1500. Unfortunately many owners have said the keyboard is awful. The "keyboard is literally the worst keyboard I've ever used in my life", said one reviewer, with other descriptions ranging from "atrocious" to "junk" to "a bit rubbish"; one found it so bad that they had to return the system. I consider such a bad keyboard a deal breaker. To add to the misery, the touchpad is apparently equally terrible. Who needs to use input devices anyway?!

The Haswell successor of my T420s, the T440s, ranges in price from $1419 to $1870. It comes with a lower-end i5 4200U by default, with the option to upgrade to a 4600U (for an additional $270). Those prices also get you only 4 GiB of RAM: adding an extra 4 GiB SODIMM costs an additional $80; adding an 8 GiB SODIMM costs an additional $210! 16 GiB of RAM isn't even an option.

An interesting option is Lenovo's Yoga 2 Pro. Its top selling point is its awesome 3200x1800 13.3" screen, but other than that it's not super impressive. Having only a 1.8 GHz CPU leaves me wondering how much extra time I'd spend over the life of the laptop waiting for piglit test runs to finish, compared with a faster CPU.

Apple products aren't usually compelling to the Linux user in me, but the 13" MacBook Pro is a strong option based on its specifications. It ranges in price from $1299 to $1799 (depending on the size of the SSD) and has a great 2560x1600 (16:10 aspect ratio!) Retina display. It offers a 2.8 GHz 4558U CPU with GT3 (no e) graphics and a PCIe-based SSD which, according to reviewers, has read and write speeds of 700 MB/sec! I've read that MacBooks often have cooling problems, but according to Notebookcheck.net's review the CPU didn't throttle after an hour at maximum load.

ASUS's unreleased Zenbook Infinity UX301LA seems compelling as well. Notebookcheck.net's review says that its price will be higher than anything else I've spec'd so far, and outside of my stated budget at $2450. It does have some really nice features to attempt to justify the price: 2.8 GHz 4558U CPU and GT3 graphics, 2560x1440 (16:9) screen, and strangely two 128 GB SSDs in RAID 0. The SSDs use some exotic connector I've never heard of, which worries me, and in general the potential for data loss in RAID 0 does too. According to the review, this laptop is designed to compete with the MacBook Air rather than the Pro, but offers higher performance than the Air. I'd be worried about throttling with this system (as I would with the Air) and the review confirms this – in testing the CPU throttled down to only 1.2 GHz after a few minutes under load.

13" – 14" dual-core laptops
                   CPU             RAM    Graphics         Screen            Storage            Battery           Price
MacBook Pro 13"    2.8 GHz 4558U   8 GiB  GT3 - 1200 MHz   13.3" 2560x1600   512 GiB PCIe       71.8 Wh           $1999
T440s              2.1 GHz 4600U   8 GiB  GT2 - 1100 MHz   14.0" 1920x1080   256 GiB SATA       23.2 Wh + 72 Wh   $2074
Yoga 2 Pro         1.8 GHz 4500U   8 GiB  GT2 - 1100 MHz   13.3" 3200x1800   512 GiB SATA       54? Wh            $1599
Zenbook UX301LA    2.8 GHz 4558U   8 GiB  GT3 - 1200 MHz   13.3" 2560x1440   256 GiB (RAID 0)   50 Wh             $2450

In comparison with the MacBook Pro, I can immediately remove the T440s and Zenbook from consideration based on price (and for the latter, that it's not yet available). I think the MacBook Pro is a better choice than the Yoga 2 Pro: the difference in price is more than worth the jump from a 1.8 GHz to a 2.8 GHz CPU, twice the graphics execution units, and the amazingly fast PCIe SSD.

I still haven't met my goals of a quad-core CPU or GT3e graphics (all GT3e parts come paired with quad-core CPUs). With the exception of the Galago Pro (why oh why don't you just have a decent keyboard?), these features seem to be available only on 15" laptops.

15" laptops

Apple's 15" MacBook Pro offers quad-core CPUs and GT3e graphics and has a 2880x1800 Retina display, satisfying all of my criteria. The price leaves no room for upgrades though, since it starts at $1999. But for that price you get a 2.0 GHz quad-core 4750HQ CPU, GT3e graphics, 8 GiB of RAM, and a 256 GiB PCIe SSD. Compared with the 13", you're trading a higher-frequency dual-core CPU for a lower-frequency (but still higher than the Yoga 2 Pro) quad-core CPU, gaining 128 MiB of graphics eDRAM, and increasing the dimensions and weight (by a pound). Parallel tasks like compiling code and piglit test runs will take less time on the quad-core CPU.

One potential concern is heat dissipation and the potential for thermal throttling. It's hard to find objective reviews of the MacBooks (Notebookcheck.net's reviews are really good, but there's no review of the 15" MBP), much less reviews that consider thermal throttling. Most reviews seem to be of the model with Nvidia graphics anyway. The battery size is also really impressive at 95 Wh.

An option suggested to me is the Toshiba S55-A5358. It's compelling on price: Newegg sells it for only $850. It has a quad-core 4700MQ CPU, GT2 graphics, a spinning 1 TiB disk, and a probably overly large 15.6" 1080p screen. Even though it lacks GT3(e) graphics, it still has the same thermal output as the 15" MacBook Pro with GT3e. Effectively, this means graphics workloads (e.g., piglit runs) will take longer while producing the same amount of heat per unit time, so the threat of thermal throttling is actually much worse. It's also a half-pound heavier than the 15" MacBook Pro, but perhaps more worrying is that its battery is less than half that of the MacBook, only 43 Wh, while its CPU and GPU still consume the same amount of power as those in the MacBook.

As far as I know, there aren't any other laptops with GT3e graphics which is largely the reason for me to consider 15" laptops.

15" quad-core laptops
                               CPU              RAM     Graphics          Screen            Storage          Battery   Price
15" MacBook Pro                2.0 GHz 4750HQ   8 GiB   GT3e - 1200 MHz   15.4" 2880x1800   256 GiB PCIe     95 Wh     $1999
15" MacBook Pro (faster CPU)   2.6 GHz 4960HQ   8 GiB   GT3e - 1300 MHz   15.4" 2880x1800   256 GiB PCIe     95 Wh     $2299
Toshiba S55-A5358              2.4 GHz 4700MQ   8 GiB   GT2 - 1150 MHz    15.6" 1920x1080   1 TiB spinning   43 Wh     $850 + SSD
Galago Pro                     2.0 GHz 4750HQ   16 GiB  GT3e - 1200 MHz   14.1" 1920x1080   many options     52 Wh     ~ $1500

The Galago Pro is included in the table just to make me sick.

I need to think hard about whether portability or speed is more important to me, but regardless of the decision both paths lead to a MacBook Pro.

Tags: freedesktop intel linux mesa

06 July 2012 - My time optimizing graphics performance on the OLPC XO 1.75 laptop

Last summer after a year of graduate school, I was looking for an interesting project to work on. After asking around, Chris Ball found me in the #xorg-devel IRC channel and set me up working with One Laptop per Child. I started working with Chris and Jon Nettleton on improving the graphics performance of the ARM-based XO 1.75 laptop. The graphics drivers were in a state of flux, and in a number of cases the Sugar interface felt noticeably slower than on the VIA-powered XO 1.5. We wanted to know why it was slower and how to quantitatively measure graphics performance of real-world applications.

XO 1.75 laptops

I suggested that we use cairo's trace tool to benchmark our hardware and to find performance bottlenecks. Using it to create traces of your own applications is very easy, so I captured a trace of me playing Sugar's Implode activity. The Implode activity's graphics consists only of moving solid-colored squares around the window but it still lagged during normal play.

Implode activity

Replaying the trace of Implode under a profiler revealed which compositing function I needed to focus on – over_n_8_8888. I made other traces too, although they weren't always useful. Five minutes of contorting my wrists to fit the tiny keyboard in order to complete the touch-typing lessons in the Typing Turtle activity created a trace that could be executed on the unoptimized graphics stack in 0.4 seconds. At least there was no performance issue there.

The Marvell CPU in the XO 1.75 is the successor to Intel's XScale line of ARM CPUs and as such has the iwMMXt SIMD instruction set. The neat thing about iwMMXt is that since it was designed by Intel to have the same features as x86's MMX, compilers can implement the same set of intrinsic functions and software can be written to take advantage of x86/MMX and ARM/iwMMXt with a single piece of code. pixman already had a set of MMX-optimized compositing functions written using the intrinsics, so the basic port of this code to ARM was relatively straightforward and consisted mostly of fixing unaligned accesses.

Unfortunately, the last time gcc's iwMMXt support was tested was the last time someone cared about XScale CPUs (i.e., a long time ago) and as a result gcc would crash with an internal compiler error when trying to compile some of the intrinsic functions (gcc PR35294). I submitted a small patch to fix the problem, but my school (NC State) took eight months to acknowledge that they don't own my work, and by that time Marvell had contributed a five-patch series to significantly improve iwMMXt scheduling and support. Marvell hadn't been successful in finding a gcc maintainer to commit their code, so Jon and I tested it, benchmarked pixman built with it, and resubmitted to the gcc mailing list. Nick Clifton at Red Hat took it upon himself to regression test it, fix some formatting issues, and finally commit it to gcc. Improved iwMMXt code generation support (and an addition to the test suite to ensure iwMMXt support doesn't bitrot again) will be available in gcc-4.8.

A year after beginning to work on the graphics stack of the XO 1.75 laptop I've now graduated and concluded my work, so I think now is a good time to show the results.

The image columns show the time in seconds to complete a cairo-perf-trace workload when using 32 bits per pixel and likewise image16 for 16 bits per pixel. The first column in both image and image16 groupings is the time to complete the workload without using the iwMMXt code. The second column is time to complete the workload when using the iwMMXt code.

                       image                              image16
cairo-trace            Before    After     Change         Before    After     Change
implode-sugarless      50.019    34.557    44.7% faster   51.871    25.874    100.5% faster
evolution              33.492    29.590    14.7% faster   30.334    24.751    22.6% faster
firefox-planet-gnome   191.465   173.835   10.1% faster   211.297   187.570   12.6% faster
gnome-system-monitor   51.956    44.549    16.6% faster   52.272    40.525    29.0% faster
gnome-terminal-vim     53.625    54.554    no change      47.593    47.341    no change
gvim                   35.321    50.018    29.4% slower   35.441    35.539    no change
midori-zoomed          38.033    28.500    33.4% faster   38.576    26.937    43.2% faster
poppler                41.096    31.949    28.6% faster   41.230    31.749    29.9% faster
swfdec-giant-steps     20.062    16.912    18.6% faster   28.294    17.286    63.7% faster
swfdec-youtube         42.281    37.335    13.2% faster   52.848    47.053    12.3% faster
xfce4-terminal-a1      64.311    51.011    26.1% faster   62.592    51.191    22.3% faster

Generally the iwMMXt code improves performance rather significantly. 32-bpp gvim is a bit of a mystery and requires some more investigation. The Implode activity, which started this adventure, has seen awesome improvements, namely a doubling of performance in 16-bpp mode.

More recently, I began working on bilinear scaling compositing functions. I implemented three of the most important ones (the same ones that the SSE2 code implements). Bilinear scaling is used a lot by web browsers, so I benchmarked a couple of Firefox traces.

                       image
cairo-trace            Before     After      Change due to bilinear   Total change
firefox-fishtank       2042.723   1363.913   49.7% faster             don't want to know
firefox-planet-gnome   173.835    144.939    19.9% faster             32.1% faster

The firefox-fishtank trace (of Firefox running an HTML5 demo) spends an enormous percentage of its time in the over_8888_8_8888 compositing function, so it came as little surprise that implementing a bilinear scaling version of it would yield large performance improvements. I just didn't expect it to cut more than 11 minutes out of the run time. The firefox-planet-gnome trace sees an additional 19.9% improvement, for more than a 30% improvement overall.

In looking through my old emails to write this, I came across some benchmarks I did last year before a lot of other awesome performance work was done at OLPC, like switching to a hard-float build. They show how much performance has improved in general and not due to the work I've done on pixman.

                       image
cairo-trace            Before    After     After iwMMXt   Change         Total change
implode-sugarless      56.178    50.019    34.557         12.3% faster   62.6% faster
firefox-planet-gnome   230.332   191.465   144.939        20.3% faster   58.9% faster
gnome-system-monitor   83.245    51.956    44.549         60.2% faster   86.9% faster

I have had a wonderful time working on pixman and working with the great group of people at OLPC. Special thanks goes to

Tags: arm freedesktop linux olpc pixman xorg

17 May 2012 - Optimizing pixman for Loongson: Process and Results

The Lemote Yeeloong is a small notebook that is often the computer of choice for Free Software advocates, including Richard Stallman. It's powered by an 800 MHz STMicroelectronics Loongson 2F processor and has an antiquated Silicon Motion 712 graphics chip. The SM712's acceleration features are pretty subpar for today's standards, and performance of the old XFree86 Acceleration Architecture (XAA) that supports the SM712 has slowly decayed as developers move to support newer hardware and newer acceleration architectures. In short, graphics performance of the SM712 isn't very good with new X servers, so how can we improve it?

If you don't care about how pixman was optimized and just want to see the results, you can skip ahead.

pixman, the pixel-manipulation library used by cairo and X, has MMX-accelerated compositing functions written with C-level intrinsic functions, which allow the programmer to write C but still have fine-grained control over performance-sensitive MMX code.

Last summer I began optimizing graphics performance of the OLPC XO-1.75 laptop. The Marvell processor it uses supports iwMMXt2, a 64-bit SIMD instruction set designed by Intel for their XScale ARM CPUs. The instruction set is predictably very similar to Intel's original MMX instruction set. By design, Intel's MMX intrinsics also support generating iwMMXt instructions, so that the same optimized C code will be easily portable to processors supporting iwMMXt. With a relatively small amount of work (as compared to writing compositing functions in ARM/iwMMXt assembly) I had pixman's MMX optimized code working on the XO-1.75 for some nice performance gains.

The Loongson 2F processor also includes a 64-bit SIMD instruction set, very similar to Intel's MMX. Its SIMD instructions use the 32 floating-point registers, and like iwMMXt it provides some useful instructions not found in x86 processors until AMD's Enhanced 3DNow! or Intel's SSE instruction sets.

So just like I did with the XO-1.75, I planned to use pixman's existing MMX code to optimize performance on my Yeeloong.

While Intel's MMX intrinsic functions are well designed, well tested, well supported, and widely used, the Loongson intrinsics are none of these. In fact, they're incomplete, badly designed, and used nowhere that I can find (indeed, all of the instances of Loongson-optimized SIMD code I have found are written in inline assembly, which is no surprise given the state of the intrinsics). Of course, the gcc manual doesn't tell me this, so I learned it only after trying to use them with pixman.

Aside: let me pretend that I'm designing and implementing Loongson's vector intrinsics, covering an instruction set very similar to MMX, which already has an excellent set of intrinsic functions. Why would I create my own incompatible set, instead of implementing the same interface that lots of software already use?!

Using the Loongson vector intrinsics, pixman passed the test suite, and objdump verified that gcc was successfully generating vector instructions, but the performance was terrible. gcc apparently was not privy to the knowledge that the integer data types returned by the intrinsics were actually stored in floating point registers, so in between any two vector instructions you might find three or four instructions that simply copied the same data back and forth between integer and floating-point registers.

punpcklwd	$f9,$f9,$f5
    dmtc1	v0,$f8
punpcklwd	$f19,$f19,$f5
    dmfc1	t9,$f9
    dmtc1	v0,$f9
    dmtc1	t9,$f20
    dmfc1	s0,$f19
punpcklbh	$f20,$f20,$f2

This path led nowhere, so I decided to take the hint from previous programmers and forget that the Loongson intrinsics exist. I still wanted to use pixman's MMX code, so I implemented Intel's MMX intrinsics myself using Loongson inline assembly. Object code size was significantly smaller and performance was better – in fact much better in some select functions – but overall it was still a net loss. There must have been optimization opportunities I was missing.

On the XO-1.75 the MMX code is faster than the generic code, so I didn't recognize inefficiencies in the MMX code the first time I worked with it, but with the Loongson it was necessary that I find and fix them. The great thing is that optimizations to this code benefit x86/MMX, ARM/iwMMXt, and Loongson.

I took a look at the book Dirty Pixels at the suggestion of pixman's maintainer, Søren Sandmann. In it, I discovered that the original MMX instruction set lacked an unsigned packed multiply-high instruction, which would be useful for the over compositing operation. To work around the lack of this instruction, an extra two shifts and an add had to be used. AMD recognized this inefficiency and added the instruction in Enhanced 3DNow!, and later Intel did the same with SSE. I modified the pix_multiply function to use the new instruction, and the resulting object code shrank by 5%.

I realized that the expand_alpha, expand_alpha_rev, and invert_colors functions that mix and duplicate pixel components could be reduced from a combined total of around 30 instructions to a single instruction each. This change further reduced object code size by another 9%.

After that, I focused on eliminating unnecessary copies to and from the vector registers. Consider this code:

__m64 vsrc = load8888 (*src);

The code loads *src into an integer register, and then load8888 loads and expands the value into a vector register. Instead, it's simpler and faster to load from memory into a vector register directly. By counting the number of dmfc1 (doubleword move from floating-point) and dmtc1 (doubleword move to floating-point) instructions I could determine which functions had room for improvement.

After reducing the number of unnecessary copies and adding a number of other optimizations (list available here) I was ready to see if the Yeeloong was more usable.

Results gathered from cairo's perf-trace tool confirm the real-world performance improvements given by the pixman optimizations. The image columns show the time in seconds to complete a cairo-perf-trace workload when using 32 bits per pixel and likewise image16 for 16 bits per pixel. The first column in both image and image16 groupings is the time to complete the workload without using Loongson MMI code. The second column is time to complete the workload after pixman commit c78e986085, the commit that turns on the Loongson MMI code. The third column is the time to complete the workload with pixman-0.25.6 which has many more optimizations.

                       image                                           image16
cairo-trace            Before    c78e986085  0.25.6    Change          Before    c78e986085  0.25.6    Change
evolution              32.985    29.667      28.752    14.7% faster    27.314    23.870      22.960    19.0% faster
firefox-planet-gnome   197.982   180.437     169.532   16.8% faster    220.986   205.057     199.077   11.0% faster
gnome-terminal-vim     60.799    50.528      50.792    19.7% faster    51.655    44.131      43.561    18.6% faster
gvim                   38.646    32.552      33.570    15.1% faster    38.126    34.453      35.457    7.5% faster
ocitysmap              23.065    18.057      17.516    31.7% faster    23.046    18.055      17.543    31.4% faster
poppler                43.676    36.077      35.498    23.0% faster    43.065    36.090      35.534    21.2% faster
swfdec-giant-steps     20.166    20.365      20.469    no change       22.354    16.578      14.473    54.4% faster
swfdec-youtube         31.502    28.118      24.168    30.3% faster    44.052    41.771      38.577    14.2% faster
xfce4-terminal-a1      69.517    51.288      50.838    36.7% faster    62.225    53.309      44.297    40.5% faster

May 29th edit: the % faster numbers were previously calculated as a percent difference between the initial workload times and the final workload times. I realized that this calculation's result is not strictly a metric of how much faster the code is. To calculate that, the new formula is (1/final - 1/initial) / (1/initial), which is the percent difference in terms of operations/second. This number is % faster. The table has been updated accordingly.

As the results show, real-world performance is improved by the Loongson MMI code. I can tell a difference when using GNOME 3 (in fallback mode) on my Yeeloong.

So far this has been very successful. I've optimized pixman on an interesting platform, learned a new instruction set, and in the process found many opportunities to optimize the MMX code on x86 and ARM. I still see a bunch of things to work on with just these compositing operations alone. Beyond that, there are many other things to do like bilinear and nearest scaling functions (which are extremely important for Firefox performance). And beyond that, I've improved my understanding of pixman's code and have a few ideas for improvements in general.

Thanks to

Tags: freedesktop gentoo linux loongson mips pixman xorg yeeloong

02 August 2011 - New multilib N32 Gentoo MIPS Stages

Gentoo/MIPS has been in, well, not great shape for quite some time. When I was going through Gentoo recruitment, there were no stages (used for installing Gentoo) newer than 2008, so this was one of the main things I wanted to improve, specifically by creating new N32 ABI stages. Even though the N32 (meaning New 32-bit) ABI was introduced in IRIX in 1996 to replace SGI's o32 (Old 32-bit) ABI, Linux support for N32 lagged behind until the last few years. Now, I'm pleased to unofficially announce new multilib N32 stages, and that we'll be supporting N32 as the preferred ABI.

MIPS has three main ABIs: o32 (32-bit integers and pointers), N32 (64-bit integers, 32-bit pointers), and N64 (64-bit integers and pointers). Compared with N32 and N64, o32 is very restrictive: very few function arguments are passed in registers; only half the number of floating-point registers are usable; there is no native 64-bit integer datatype and no distinct long double type (see SGI's MIPSpro N32 ABI Handbook for details). Offering N32 as the default ABI means better performance, sometimes 30% more, just by removing the unnecessary restrictions a 32-bit ABI imposes on 64-bit CPUs. Providing multilib stages (i.e., stages with glibc and gcc built for all three ABIs) gives users the flexibility to switch to another ABI relatively easily if desired, while also allowing them to reduce build times by switching to an N32-only profile.

The process of creating N32 (and especially multilib) stages wasn't straightforward. Our profiles were long unmaintained and in many cases totally broken. There were lots of keywording bugs open for mips, many where the MIPS team was the last to complete the request, by years. There were actually some real bugs discovered too, like 354877 and 358149, usually caused by the incorrect assumption that the lib directory is always a symlink to lib32. All in all, I've reduced the number of open bugs for MIPS down to ~20.

Work also needed to be done on catalyst, Gentoo's release-building tool. Since the end of June, I've made 15 commits cleaning, fixing, and adding to the MIPS support code in catalyst. Other developers like Sebastian Pipping have also resumed work on a project that had otherwise been minimally maintained since the beginning of the year.

The last major component in reviving Gentoo's MIPS support is creating installation media, preferably in an automated manner. I've acquired two Broadcom BCM91250A MIPS development boards (and should be receiving a third soon), but they need disks, controllers, RAM, and cases. For that, I wrote a funding proposal to build three MIPS development computers (pdf) and had it approved by the Gentoo Foundation. Things seem to be going well in acquisitions (track progress), so I hope to have the project completed in the next few months with the systems automatically building stages for a wide variety of MIPS systems.

Initially, I used a big-endian 2006.1 N32 stage and had to bootstrap my system with a series of at least 20 hacks (not a fun experience) until it was usable enough that I was able to build a clean N32 stage. From there, using crossdev I built a multilib toolchain, and with a few more hacks I was able to build a multilib stage.

With that in the past, I've been building stages that can be used to seed the automated stage creation system to come. At this point, my TODO list looks like this:

The final touches will be to create bootable media like CD, USB, and netboot images.

All stages are available in the experimental/mips/stages/ directory (as soon as the files propagate) of a Gentoo Mirror.

Hopefully by the time I'm able to convince Lemote (or, who?) to send me a Loongson 3A laptop, installing and using Gentoo/MIPS will be a fun and pleasant experience.

Tags: gentoo linux mips
