mattst88's blog

GNOME 40 available in Gentoo

GNOME 40 was released at the end of March, and yesterday I added the last bits of it to Gentoo. You may not think that's fast, and you'd be right, but it's a lot faster than any GNOME release has been added to Gentoo that I can recall. I wasn't looking to become Gentoo's GNOME maintainer when I joined the team 18 months ago. I only wanted to use a GNOME release that was a little less stale. So how did I get here?

I asked about the GNOME 3.26 status when 3.28 and 3.30 were already out. Repeat that story until I got tired of waiting and added myself to the Gentoo/GNOME team and started updating glib... then I started updating mutter and gnome-shell... then I started updating everything...

— Matt Turner (@mattst88) May 1, 2021

GNOME has two major releases per year (in March and September), so to be more than two major releases behind is significant. At least two of my coworkers on the Mesa team at Intel switched to Gentoo for one reason or another, but ultimately switched back to their old distribution because Gentoo's GNOME packages were so out of date. That was pretty disappointing to hear, but I sympathized with them.

I maintain the X11/Wayland stack in Gentoo, and I think I do a good job of keeping on top of it. I make upstream releases of X packages and contribute to Mesa professionally so I'm often able to make the upstream and downstream changes at the same time.

But for GNOME I was just a user who happened to be a Gentoo Developer, so I started by just poking and asking if there was anything I could do to help. Unfortunately the answer was "no" nearly every time.

So I just watched and occasionally asked how things were going. And occasionally GNOME updates happened, but the gap between Gentoo and upstream never really closed. GNOME 3.26 was added to Gentoo, and before significant progress was made on adding 3.28 or 3.30 a new major version 3.32 was released upstream. It looked like we were just treading water.

What's worse, there were multiple unofficial overlays often providing newer versions of GNOME than what the ::gentoo repository contained. For reasons that were never clear to me, it seemed that none of the external overlay contributors (one of whom was a full Gentoo Developer!) were willing or able to collaborate with the Gentoo GNOME team.

I started small by adding new versions of GNOME packages and making pull requests on GitHub for more experienced GNOME team members to review. Unfortunately by this time, the GNOME team had only one active member.

I joined the GNOME team in October 2019 and worked around the edges, doing small version bumps of non-critical packages.

Since most of the GNOME packages were behind, I began adding the next major GNOME's glib to the tree to get extra testing. I figured if that additional testing caught issues before they could block the rest of GNOME from being updated that I could save us some time.

That worked out pretty well, and I felt a little more confident so I added the next major GNOME's mutter and gnome-shell. Kind of scary.

But that worked out well too. Users tested, filed bugs, and I fixed them. And since the most critical GNOME packages entered the ::gentoo repo long before the ancillary applications we didn't have any big surprises when it was time to ask for stabilization.

Initially I had no idea which packages were related or if there were particular problems to look out for. This knowledge existed only in the head of one Gentoo Developer, so as I squeezed it out of him (as I made mistakes and he let me know!) I began documenting it on the Wiki.

As I updated packages, I encountered various build system bugs. Gentoo naturally uncovers problems binary distributions don't notice. Whenever possible, I made a merge request upstream so that the next time we added a new version we wouldn't have to carry a patch. So far I've had 13 merge requests accepted!

Starting on March 20 I added the first bits of GNOME 40 to the tree (glib and some other packages are often released before the official release date). I added glib first, and then I figured I couldn't break anything too badly if I just bumped the GNOME games. I added gnome-shell (behind package.mask), and then sort of forgot that's where I normally stopped. Less than 8 weeks later, all of GNOME is entirely up to date in Gentoo!

The bookends of adding GNOME 40 are commits 71e9245b05e6 and b93e3e581161. In that time I made 610 commits. The vast majority are GNOME-related (511 of them by my count). Categorized, they are:

2 reverted commits (both mine)
229 commits adding new package versions
152 commits dropping old package versions
3 commits adding new packages
7 commits adding support for Python 3.9
118 miscellaneous commits fixing, cleaning, masking, unmasking
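A count like this can be reproduced from a checkout of the ::gentoo repository. The following is a minimal sketch, not necessarily how the numbers above were produced: the helper names and the author pattern are hypothetical, and the categorization into version bumps, drops, and so on would still need to be done by reading commit subjects.

```python
import subprocess


def rev_list_count_cmd(first, last, author=None):
    """Build a `git rev-list --count` command for the inclusive range first..last."""
    cmd = ["git", "rev-list", "--count"]
    if author:
        # --author matches against the commit's author name/email.
        cmd.append(f"--author={author}")
    # "first^..last" starts at the parent of `first`, so `first` itself is counted.
    cmd.append(f"{first}^..{last}")
    return cmd


def count_commits(repo_dir, first, last, author=None):
    """Run the command inside repo_dir and return the count as an integer."""
    result = subprocess.run(
        rev_list_count_cmd(first, last, author),
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return int(result.stdout.strip())
```

For example, count_commits("/path/to/gentoo", "71e9245b05e6", "b93e3e581161", author="someone@example.org") would count one author's commits between the two bookends; the path and the address here are placeholders.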

Those commits closed 120 bugs (and referenced 21 more), which made a nice dent in the Gentoo GNOME team's bug backlog. At the time of this writing, there are 514 bugs assigned to the GNOME team or with the GNOME team in the Cc list. By default, Bugzilla only shows 500 bugs on a single page, so the GNOME bug list doesn't even fit. That was a bit of a psychological hurdle for me to get started. It'll be a nice moment when we get to the other side of 500.

I hope that, with the gap to upstream now closed, some other Gentoo Developers and users will be more willing to help contribute. GNOME fell behind in Gentoo because it was too much work for a single person to maintain sustainably. I've remedied the most glaring symptom of the situation but not the underlying problem. Reach out to me if you'd like to help!

Because it's fun to look at, here's the output of our gnome-bumpchecker.py tool, showing that we're indeed up-to-date on everything.

13 May 2021 – Tags: gentoo gnome linux

Combining constants in i965 fragment shaders

On Intel's Gen graphics, three-source instructions like MAD and LRP cannot have constants as arguments. When support for MAD instructions was introduced with Sandybridge, we assumed the choice between a MOV+MAD and a MUL+ADD sequence was inconsequential, so we chose to perform the multiply and add operations separately. Revisiting that assumption has uncovered some interesting things about the hardware and has led us to some pretty nice performance improvements.

On Gen 7 hardware (Ivybridge, Haswell, Baytrail), multiplies and adds without immediate value arguments can be co-issued, meaning that multiple instructions can be issued from the same execution unit in the same cycle. MADs, never having immediates as sources, can always be co-issued. Considering that, we should prefer MADs, but a typical vec4 * vec4 + vec4(constant) pattern would lead to three duplicate (four total) MOV imm instructions.

mov(8)  g10<1>F    1.0F
mov(8)  g11<1>F    1.0F
mov(8)  g12<1>F    1.0F
mov(8)  g13<1>F    1.0F
mad(8)  g40<1>F    g10<8,8,1>F   g20<8,8,1>F   g30<8,8,1>F
mad(8)  g41<1>F    g11<8,8,1>F   g21<8,8,1>F   g31<8,8,1>F
mad(8)  g42<1>F    g12<8,8,1>F   g22<8,8,1>F   g32<8,8,1>F
mad(8)  g43<1>F    g13<8,8,1>F   g23<8,8,1>F   g33<8,8,1>F

Should be easy to clean up, right? We should simply combine those 1.0F MOVs and modify the MAD instructions to access the same register. Well, conceptually yes, but in practice not quite.

Since the i965 driver's fragment shader backend doesn't use static single assignment form (it's on our TODO list), our common subexpression elimination pass has to emit a MOV instruction when combining instructions. As a result, performing common subexpression elimination on immediate MOVs would undo constant propagation and the compiler's optimizer would go into an infinite loop. Not what you wanted.

Instead, I wrote a pass that scans the instruction list after the main optimization loop and creates a list of immediate values that are used. If an immediate value is used by a 3-source instruction (a MAD or a LRP) or at least four times by an instruction that can co-issue (ADD, MUL, CMP, MOV) then it's put into a register and sourced from there.
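The selection rule can be sketched in a few lines of Python. This is an illustration of the heuristic as described above, not the Mesa implementation (which is written against the driver's own IR): the Inst tuple, the register names c0, c1, ... and the idea that MADs temporarily carry immediates before the pass runs are all assumptions made for the sketch.

```python
from collections import defaultdict, namedtuple

# Hypothetical toy IR: srcs is a tuple of register names (str) or float immediates.
Inst = namedtuple("Inst", "opcode dst srcs")

THREE_SRC = {"mad", "lrp"}                  # cannot take immediates at all
CAN_COISSUE = {"add", "mul", "cmp", "mov"}  # co-issue only without immediates


def is_imm(src):
    return isinstance(src, float)


def find_promotable_immediates(insts, min_coissue_uses=4):
    """Return the immediates that should live in a register."""
    required = set()               # used by a 3-source instruction
    coissue_uses = defaultdict(int)
    for inst in insts:
        for src in inst.srcs:
            if not is_imm(src):
                continue
            if inst.opcode in THREE_SRC:
                required.add(src)
            elif inst.opcode in CAN_COISSUE:
                coissue_uses[src] += 1
    frequent = {v for v, n in coissue_uses.items() if n >= min_coissue_uses}
    return sorted(required | frequent)


def combine_constants(insts):
    """Emit one MOV per promoted immediate and rewrite its users."""
    reg_for = {imm: f"c{i}" for i, imm in enumerate(find_promotable_immediates(insts))}
    out = [Inst("mov", reg, (imm,)) for imm, reg in reg_for.items()]
    for inst in insts:
        srcs = tuple(reg_for.get(s, s) if is_imm(s) else s for s in inst.srcs)
        out.append(Inst(inst.opcode, inst.dst, srcs))
    return out
```

Applied to four MADs that each use 1.0, the sketch produces a single MOV followed by four MADs reading the same register, which is the cleanup the earlier listing was missing.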

But there's still room for improvement. Each general register can store 8 floats, and instead of storing 8 separate constants in each, we're storing a single constant 8 times (and on SIMD16, 16 times!). Fixing that wasn't hard, and it significantly reduces register usage – we now only use one register for each 8 immediate values. Using a special vector-float immediate type we can even load four floating-point values in a single instruction.
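A small sketch of the packing idea, assuming 8 floats per general register as stated above. The function names and the starting register number are hypothetical, and the operand string only imitates the notation of the listing earlier in this post (a scalar region selecting one component).

```python
FLOATS_PER_REG = 8  # each general register stores 8 floats


def assign_slots(immediates, first_reg=10):
    """Map each distinct immediate to (register number, component index)."""
    slots = {}
    for i, imm in enumerate(dict.fromkeys(immediates)):  # dedupe, keep order
        slots[imm] = (first_reg + i // FLOATS_PER_REG, i % FLOATS_PER_REG)
    return slots


def registers_used(slots):
    """Number of distinct registers occupied by the packed immediates."""
    return len({reg for reg, _ in slots.values()})


def operand(slots, imm):
    """Format a scalar source operand that reads one packed component."""
    reg, component = slots[imm]
    return f"g{reg}.{component}<0,1,0>F"
```

With nine distinct immediates this uses two registers instead of nine, and every user reads its value from a single component of a shared register.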

With that in place, we can now always emit MAD instructions.

I'm pretty pleased with the results. Without using the New Intermediate Representation (NIR), the shader-db results are:

total instructions in shared programs: 5895414 -> 5747578 (-2.51%)
instructions in affected programs: 3618111 -> 3470275 (-4.09%)

And with NIR (that already unconditionally emits MAD instructions):

total instructions in shared programs: 7992936 -> 7772474 (-2.76%)
instructions in affected programs: 3738730 -> 3518268 (-5.90%)

Effects on a WebGL microbenchmark

In December, I checked what effect my constant combining pass would have on a WebGL procedural noise demo. The demo generates an effect ("noise") that looks like a ball of fire. Its fragment shader contains a ton of instructions but no texturing operations. We're currently able to compile the program in SIMD8 without spilling any registers, but at a cost of scheduling the instructions very badly.

The effects the constant combining pass has on this demo are really interesting, and it actually gives me evidence that some of the ideas I had for the pass are valid, namely that co-issuing instructions is worth a little extra register pressure.

1.00x FPS of baseline – 3123 instructions – baseline
1.09x FPS of baseline – 2841 instructions – after promoting constants only if used by more than 2 MADs

Going from no-constant-combining to restricted-constant-combining gives us a 9% increase in frames per second for a 9% instruction count reduction. We're totally limited by fragment shader performance.

1.46x FPS of baseline – 2841 instructions – after promoting any constant used by a MAD

Going from step 2 to 3 though is interesting. The instruction count doesn't change, but we reduced register pressure sufficiently that we can now schedule instructions better without spilling (SCHEDULE_PRE, instead of SCHEDULE_PRE_NON_LIFO) – a 33% speed up just by rearranging instructions.

1.62x FPS of baseline – 2852 instructions – after promoting constants used by at least 4 co-issueable instructions

I was worried that we weren't going to be able to measure any performance difference from pulling constants out of co-issueable instructions, but we can definitely get a nice improvement here, of about 10% increase in frames per second.

As an aside, I did an experiment to see what would happen if we used SCHEDULE_PRE and spilled registers anyway (I added a couple of extra instructions to increase register pressure over the threshold). I changed the window size to 2048x2048 and rendered a fixed number of frames.

SCHEDULE_PRE with no spills: 17.5 seconds
SCHEDULE_PRE with 4 spills (8 send instructions): 17.5 seconds
SCHEDULE_PRE_NON_LIFO with no spills: 28 seconds

So there's some good evidence that the cure is worse than the disease. Of course this demo doesn't do any texturing, so memory bandwidth is not at a premium.

1.76x FPS of baseline – 2609 instructions – ???

I ran the demo to see if we'd made any changes in the last two months and was pleasantly surprised to find that we'd cut another 9% of instructions. I have no idea what caused it, but I'll take it! Combined with everything else, we're up to a 76% performance improvement.

Where's the code

The Mesa patches that implement the constant combining pass were committed (commit bb33a31c) and will be in the next major release (presumably version 10.6).

If any of this sounds interesting enough that you'd like to do it for a living, feel free to contact me. My team at Intel is responsible for the open source 3D driver in Mesa and is looking for new talent.

07 April 2015 – Tags: freedesktop intel linux mesa xorg

Laptop choices and aftermath

In November I was lamenting the lack of selection in credible Haswell-powered laptops for Mesa development. I chose the 15" MacBook Pro, while coworkers picked the 13" MBP and the System76 Galago Pro. After using the three laptops for a few months, I review our choices and whether they panned out like we expected.

	CPU	RAM	Graphics	Screen	Storage	Battery
13" MacBook Pro	2.8 GHz 4558U	16 GiB	GT3 - 1200 MHz	13.3" 2560x1600	512 GiB PCIe	71.8 Wh
15" MacBook Pro	2.0 GHz 4750HQ	16 GiB	GT3e - 1200 MHz	15.4" 2880x1800	256 GiB PCIe	95 Wh
Galago Pro	2.0 GHz 4750HQ	16 GiB	GT3e - 1200 MHz	14.1" 1920x1080	many options	52 Wh

15" MacBook Pro

The installation procedure on the MacBook was very simple. I shrank the HFS partition from OS X and installed rEFInd, before following the usual Gentoo installation.

Quirks and Annoyances

Running Linux on the MacBook is a good experience overall, with some quirks:

the Broadcom BCM4360 wireless chip is supported by a proprietary driver (net-wireless/broadcom-sta in Gentoo)
the high DPI Retina display often necessitates 150~200% zoom (or lots of squinting)
the keyboard causes some annoyances:
- the function keys operate only as F* keys when the function key is held, making common key combinations awkward (behavior can be changed with the /sys/module/hid_apple/parameters/fnmode file).
- there's no Delete key, and Home/End/Page Up/Page Down are function+arrow key.
- the power button is a regular key immediately above backspace. It's easy to press accidentally.
the cooling fans don't speed up until the CPU temperature is near 100 C.
no built-in Ethernet. Seriously, we've reinvented how many mini and micro HDMI and DisplayPort form factors, but we can't come up with a way to rearrange eight copper wires to fit an Ethernet port into the laptop?
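The fnmode file mentioned in the keyboard item above can be read and written from a small script. This is a minimal sketch with hypothetical helper names; the value meanings are taken from the hid_apple module's parameter description as I understand it (0 disables the fn key, 1 makes the media keys the default, 2 makes the F* keys the default), and writing to the file requires root.

```python
from pathlib import Path

FNMODE_PATH = Path("/sys/module/hid_apple/parameters/fnmode")

# Assumed meanings, per the hid_apple module parameter description.
FNMODE_MEANINGS = {
    0: "fn key disabled",
    1: "media keys by default, F* keys with fn held",
    2: "F* keys by default, media keys with fn held",
}


def read_fnmode(path=FNMODE_PATH):
    """Return the current fnmode value as an integer."""
    return int(Path(path).read_text().strip())


def write_fnmode(mode, path=FNMODE_PATH):
    """Set fnmode; raises ValueError for values this sketch does not know."""
    if mode not in FNMODE_MEANINGS:
        raise ValueError(f"unsupported fnmode: {mode!r}")
    Path(path).write_text(f"{mode}\n")
```

Calling write_fnmode(2) as root would make the top row behave as plain F* keys until the module is reloaded or the machine reboots.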

Worst Thing: Insufficient cooling

The worst thing about the MacBook is the insufficient cooling. Even forcing the two fans to their maximum frequencies isn't enough to prevent the CPUs from thermal throttling in less than a minute of full load. Most worrying is that my CPU's core #1 seems to run significantly hotter under load than the other cores. It's always the first, and routinely the only, core to reach 100 C, causing the whole CPU package to be throttled until it cools slightly. The temperature gradient across a chip only 177 square millimeters is also troubling: frequently core #1 is 15 C hotter than core #3 under load. The only plausible conclusion I've come to is that the thermal paste isn't applied evenly across the CPU die. And since Apple uses tamper resistant screws I couldn't reapply the thermal paste without special tools (and probably voiding the warranty).

Best Thing: Retina display

I didn't realize how much the Retina display would improve the experience. Having multiple windows (that would have been close to full screen at 1080p) open at once is really nice. Being able to have driver code open on the left half of the screen, and the PDF documentation open on the right makes patch review quicker and more efficient. I've attached other laptops I've used to larger monitors, but I've never even felt like trying with the 15" MBP.

13" MacBook Pro

I consider the 13" MacBook Pro to be strictly inferior (okay, lighter and smaller is nice, but...) to the 15". Other than the obvious differences in the hardware, the most disappointing thing I've discovered about it is that the 13" screen isn't really big enough to be comfortable for development. The coworker that owns it plugs it into his physically larger 1080p monitor when he gets to the office. For a screen that's supposed to be probably the biggest selling point of the laptop, it's not getting a lot of use.

As I mentioned, I'm perfectly satisfied with the 15" screen for everyday development.

System76 Galago Pro

I used the Galago Pro for about three weeks before switching to the 15" MacBook. In total it's a really compelling system, except for the serious lack of attention to detail.

Quirks and Annoyances

although it has built-in Ethernet (yay!), the latch mechanism will drive you nuts. Two hands are necessary to unplug an Ethernet cable from it, and three are really recommended.
the single hinge attaching the screen feels like a failure point, and the screen itself flexes way too much when you open or close the laptop.
all three USB ports are on the right side, which can be annoying if you want to use a mouse, which you will, because...
the touchpad doesn't behave very well. In fairness, this is probably mostly the fault of the synaptics driver or the default configuration.

Worst Thing: Keyboard

The keyboard is probably the worst part. The first time I booted the system, typing k while holding the shift key wouldn't register a key press. Lower case k typed fine, but with shift held – nothing. After about 25 presses, it began working without any indication as to what changed.

The key stroke is very short, you get almost no feedback, and if you press the keys at an angle slightly off center they may not register. Typing on it can be a rather frustrating experience. Beyond it being a generally unpleasant keyboard, the function key placement confirms that the keyboard is a complete afterthought: Suspend is between Mute and Volume Down. Whoops!

Best Thing: Cooling

The Galago Pro has an excellent cooling system. Its fans are capable of moving a surprising amount of air and don't make too much noise doing it. Under full load, the CPU's temperature never passed 84 C – 16 C cooler than the 15" MBP (and the MBP doesn't break 100 C only because it starts throttling!). On top of not scorching your lap during compiles, the cooler temperatures mean the CPU and GPU are going to be able to stay in turbo mode longer and give better performance.

Final thoughts

Concerns about the keyboard and general build quality of the Galago Pro turned out to be true. I think it's possible to get used to the keyboard, and if you do I feel confident that the system is really nice to use (well, I guess you have to get used to the other input device too).

I'm overall quite happy with the MacBook Pro. The Retina display is awesome, and the PCIe SSD is incredibly fast. I was most worried about the 15" MacBook overheating and triggering thermal throttling. Unfortunately this was well founded. Other than the quirks, which are par for the course, the overheating issue is the one significant downside to this machine.

19 March 2014 – Tags: freedesktop intel linux mesa

Difficulty in Finding a Good Development Laptop

When I started working at Intel last year on the open source 3D driver I was given a spare Lenovo T420s (Sandybridge) as my development machine. Almost everyone on my team had upgraded to Ivy Bridge by February, but I planned just to hold out a few months until Haswell was released. I then spent all summer wondering where the Haswell laptops were, and only now, five months later, has Lenovo released Thinkpads with Haswell. It's time for a new development machine, and after months of research the only conclusion I've come to is that it's really hard to find a good laptop for my (admittedly strange) case.

I use my development laptop for undemanding tasks like text editing, reading documentation, email, patch review, but also things that benefit greatly from fast multicore hardware: compiling Mesa, running piglit, and compiling large sets of real world GLSL shaders. All of these are parallel tasks that see linear speed ups given additional CPU cores. Spending less time waiting for a compile or a run of the test suite to finish means I can test changes more quickly and do my work more efficiently.

Given these uses, my requirements are a quad-core Haswell laptop with a large resolution screen (greater than 1080p), GT3e graphics, and at least 8 GiB of RAM, on a budget of $2000 (less than this is obviously easier to justify to the people with the checkbook). I also have no use for dedicated graphics, and do not want it if it will cause any problems in using the Haswell GT.

I'm looking for a high end laptop with fast graphics, but without a discrete card. Sort of unsurprisingly, this is hard to find.

13" – 14" laptops

An early favorite was System76's rebranded Galago Pro. It looks amazing on paper. I can configure a 14" quad-core GT3e system with 16 GiB and two disks for around $1500. Unfortunately many owners have said the keyboard was awful. The "keyboard is literally the worst keyboard I've ever used in my life", said one reviewer, with other descriptions ranging from "attrocious", to "junk", to "a bit rubbish"; some said it was so bad that they had to return the system. I consider such a bad keyboard a deal breaker. To add to the misery, the touchpad is apparently equally terrible. Who needs to use input devices anyway?!

The Haswell successor of my T420s, the T440s, starts in price from $1419 to $1870. It contains a lower-end i5 4200U by default with the option to upgrade to a 4600U (for an additional $270). Those prices also get you only 4 GiB of RAM. Adding an extra 4 GiB SODIMM costs an additional $80; adding an 8 GiB SODIMM costs an additional $210! 16 GiB of RAM isn't even an option.

An interesting option is Lenovo's Yoga 2 Pro. Its top selling point is its awesome 3200x1800 13.3" screen. But other than that it's not super impressive. Having only a 1.8 GHz CPU leaves me wondering how much more time I'll spend over the life of the laptop waiting for piglit test runs to finish than if I'd gotten a faster CPU.

Apple products aren't usually compelling to the Linux user in me, but the 13" MacBook Pro is a strong option based on its specifications. It starts in price from $1299 to $1799 (depending on the size of the SSD) and has a great 2560x1600 (16:10 aspect ratio!) Retina display. It offers a 2.8 GHz 4558U CPU with GT3 (no e) graphics and a PCIe-based SSD which according to reviewers has read and write speeds of 700 MB/sec! I've read that MacBooks often have cooling problems, but according to Notebookcheck.net's review the CPU didn't throttle after an hour at maximum load.

ASUS's unreleased Zenbook Infinity UX301LA seems compelling as well. Notebookcheck.net's review says that its price will be higher than anything else I've spec'd so far, and outside of my stated budget at $2450. It does have some really nice features to attempt to justify the price: 2.8 GHz 4558U CPU and GT3 graphics, 2560x1440 (16:9) screen, and strangely two 128 GB SSDs in RAID 0. The SSDs use some exotic connector I've never heard of, which worries me, and in general the potential for data loss in RAID 0 does too. According to the review, this laptop is designed to compete with the MacBook Air rather than the Pro, but offers higher performance than the Air. I'd be worried about throttling with this system (as I would with the Air) and the review confirms this – in testing the CPU throttled down to only 1.2 GHz after a few minutes under load.

13" – 14" dual-core laptops
	CPU	RAM	Graphics	Screen	Storage	Battery	Price
MacBook Pro 13"	2.8 GHz 4558U	8 GiB	GT3 - 1200 MHz	13.3" 2560x1600	512 GiB PCIe	71.8 Wh	$1999
T440s	2.1 GHz 4600U	8 GiB	GT2 - 1100 MHz	14.0" 1920x1080	256 GiB SATA	23.2 Wh + 72 Wh	$2074
Yoga 2 Pro	1.8 GHz 4500U	8 GiB	GT2 - 1100 MHz	13.3" 3200x1800	512 GiB SATA	54? Wh	$1599
Zenbook UX301LA	2.8 GHz 4558U	8 GiB	GT3 - 1200 MHz	13.3" 2560x1440	256 GiB (RAID 0)	50 Wh	$2450

In comparison with the MacBook Pro I can immediately remove the T440s and Zenbook from consideration based on price (and for the latter, that it's not yet available). I think the MacBook Pro is a better choice over the Yoga 2 because I believe that the difference in price is more than worth the improvement of a 1.8 to a 2.8 GHz CPU, 2x the graphics execution units, and that the PCIe SSD is amazingly fast.

I still haven't met my goals of a quad-core CPU or GT3e graphics (all GT3e are with quad-core CPUs). With the exception of the Galago Pro (why oh why don't you just have a decent keyboard?) these features seem to only be available on 15" laptops.

15" laptops

Apple's 15" MacBook Pro offers quad-core CPUs and GT3e graphics and has a 2880x1800 Retina display, satisfying all of my criteria. The price leaves no room for upgrades though, since it starts at $1999. But for that price you get a 2.0 GHz quad-core 4750HQ CPU, GT3e, 8 GiB of RAM, and a 256 GiB PCIe SSD. Compared with the 13", you're trading a higher frequency dual-core CPU for a lower frequency (but still higher than the Yoga 2) quad-core CPU, gaining 128 MiB of graphics eDRAM, and increasing the dimensions and weight (by a pound). Parallel tasks like compiling code and piglit test runs will take less time on the quad-core CPU.

One potential concern is heat dissipation and the potential for thermal throttling. It's hard to find objective reviews of the MacBooks (Notebookcheck.net's reviews are really good, but there's no review of the 15" MBP), much less reviews that consider thermal throttling. Most reviews seem to be of the model with Nvidia graphics anyway. The battery size is also really impressive at 95 Wh.

An option suggested to me is the Toshiba S55-A5358. It's a compelling option based on price; Newegg sells it for only $850. It has a quad-core 4700MQ CPU, GT2 graphics, a spinning 1 TiB disk, and a probably overly large 15.6" 1080p screen. Even though it lacks GT3(e) graphics, it still has the same thermal output as the 15" MacBook Pro with GT3e. Effectively, this means that the time it will take to perform graphic workloads (e.g., piglit runs) will be longer while still producing the same amount of heat per time, so the threat of thermal throttling is actually much worse. It's also a half-pound heavier than the 15" MacBook Pro, but perhaps more worrying is that its battery is less than half that of the MacBook, only 43 Wh, while its CPU and GPU still consume the same amount of power as those in the MacBook.

As far as I know, there aren't any other laptops with GT3e graphics which is largely the reason for me to consider 15" laptops.

15" quad-core laptops
	CPU	RAM	Graphics	Screen	Storage	Battery	Price
15" MacBook Pro	2.0 GHz 4750HQ	8 GiB	GT3e - 1200 MHz	15.4" 2880x1800	256 GiB PCIe	95 Wh	$1999
15" MacBook Pro (faster CPU)	2.6 GHz 4960HQ	8 GiB	GT3e - 1300 MHz	15.4" 2880x1800	256 GiB PCIe	95 Wh	$2299
Toshiba S55-A5358	2.4 GHz 4700MQ	8 GiB	GT2 - 1150 MHz	15.6" 1920x1080	1 TiB spinning	43 Wh	$850 + SSD
Galago Pro	2.0 GHz 4750HQ	16 GiB	GT3e - 1200 MHz	14.1" 1920x1080	many options	52 Wh	~ $1500

The Galago Pro is included in the table just to make me sick.

I need to think hard about whether portability or speed is more important to me, but regardless of the decision both paths lead to a MacBook Pro.

17 November 2013 – Tags: freedesktop intel linux mesa

My time optimizing graphics performance on the OLPC XO 1.75 laptop

Last summer after a year of graduate school, I was looking for an interesting project to work on. After asking around, Chris Ball found me in the #xorg-devel IRC channel and set me up working with One Laptop per Child. I started working with Chris and Jon Nettleton on improving the graphics performance of the ARM-based XO 1.75 laptop. The graphics drivers were in a state of flux, and in a number of cases the Sugar interface felt noticeably slower than on the VIA-powered XO 1.5. We wanted to know why it was slower and how to quantitatively measure graphics performance of real-world applications.

I suggested that we use cairo's trace tool to benchmark our hardware and to find performance bottlenecks. Using it to create traces of your own applications is very easy, so I captured a trace of me playing Sugar's Implode activity. The Implode activity's graphics consists only of moving solid-colored squares around the window but it still lagged during normal play.

Replaying the trace of Implode under a profiler revealed which compositing function I needed to focus on – over_n_8_8888. I made other traces too, although they weren't always useful. Five minutes of contorting my wrists to fit the tiny keyboard in order to complete the touch-typing lessons in the Typing Turtle activity created a trace that could be executed on the unoptimized graphics stack in 0.4 seconds. At least there was no performance issue there.
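For readers unfamiliar with pixman's naming, over_n_8_8888 applies the OVER operator with a solid color source (n), an 8-bit alpha mask, and a 32-bit destination. Below is a rough per-pixel sketch in Python, assuming premultiplied a8r8g8b8 pixels and a rounding 8-bit multiply in the style pixman uses. It only illustrates the arithmetic; it is not the iwMMXt code, which handles several pixels per instruction.

```python
def mul_un8(a, b):
    """Multiply two 8-bit values as fractions of 255, with rounding."""
    t = a * b + 0x80
    return ((t >> 8) + t) >> 8


def unpack(pixel):
    """Split a 32-bit a8r8g8b8 pixel into [a, r, g, b]."""
    return [(pixel >> shift) & 0xFF for shift in (24, 16, 8, 0)]


def pack(channels):
    a, r, g, b = channels
    return (a << 24) | (r << 16) | (g << 8) | b


def over_n_8_8888(src, mask, dest):
    """Composite solid color `src` through 8-bit `mask` over `dest`."""
    # (src IN mask): scale every source channel by the mask value.
    s = [mul_un8(c, mask) for c in unpack(src)]
    # OVER: result = s + dest * (1 - alpha(s)), per channel, saturating at 255.
    inv_alpha = 255 - s[0]
    out = [min(255, sc + mul_un8(dc, inv_alpha)) for sc, dc in zip(s, unpack(dest))]
    return pack(out)
```

A full mask replaces the destination with the solid color, a zero mask leaves it untouched, and values in between blend the two, which is all Implode needs for moving solid squares around.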

The Marvell CPU in the XO 1.75 is the successor to Intel's XScale line of ARM CPUs and as such has the iwMMXt SIMD instruction set. The neat thing about iwMMXt is that since it was designed by Intel to have the same features as x86's MMX, compilers can implement the same set of intrinsic functions and software can be written to take advantage of x86/MMX and ARM/iwMMXt with a single piece of code. pixman already had a set of MMX-optimized compositing functions written using the intrinsics, so the basic port of this code to ARM was relatively straightforward and consisted mostly of fixing unaligned accesses.

Unfortunately, the last time gcc's iwMMXt support was tested was the last time someone cared about XScale CPUs (i.e., a long time ago) and as a result gcc would crash with an internal compiler error when trying to compile some of the intrinsic functions (gcc PR35294). I submitted a small patch to fix the problem, but my school (NC State) took eight months to acknowledge that they don't own my work, and by that time Marvell had contributed a five-patch series to significantly improve iwMMXt scheduling and support. Marvell hadn't been successful in finding a gcc maintainer to commit their code, so Jon and I tested it, benchmarked pixman built with it, and resubmitted to the gcc mailing list. Nick Clifton at Red Hat took it upon himself to regression test it, fix some formatting issues, and finally commit it to gcc. Improved iwMMXt code generation support (and an addition to the test suite to ensure iwMMXt support doesn't bitrot again) will be available in gcc-4.8.

A year after beginning to work on the graphics stack of the XO 1.75 laptop I've now graduated and concluded my work, so I think now is a good time to show the results.

The image columns show the time in seconds to complete a cairo-perf-trace workload when using 32 bits per pixel and likewise image16 for 16 bits per pixel. The first column in both image and image16 groupings is the time to complete the workload without using the iwMMXt code. The second column is time to complete the workload when using the iwMMXt code.

cairo-trace	Before	After	Change	Before	After	Change
	image			image16
implode-sugarless	50.019	34.557	44.7% faster	51.871	25.874	100.5% faster
evolution	33.492	29.590	14.7% faster	30.334	24.751	22.6% faster
firefox-planet-gnome	191.465	173.835	10.1% faster	211.297	187.570	12.6% faster
gnome-system-monitor	51.956	44.549	16.6% faster	52.272	40.525	29.0% faster
gnome-terminal-vim	53.625	54.554	no change	47.593	47.341	no change
gvim	35.321	50.018	29.4% slower	35.441	35.539	no change
midori-zoomed	38.033	28.500	33.4% faster	38.576	26.937	43.2% faster
poppler	41.096	31.949	28.6% faster	41.230	31.749	29.9% faster
swfdec-giant-steps	20.062	16.912	18.6% faster	28.294	17.286	63.7% faster
swfdec-youtube	42.281	37.335	13.2% faster	52.848	47.053	12.3% faster
xfce4-terminal-a1	64.311	51.011	26.1% faster	62.592	51.191	22.3% faster

Generally the iwMMXt code improves performance rather significantly. 32-bpp gvim is a bit of a mystery and requires some more investigation. The Implode activity, which initially started this adventure has seen awesome improvements, namely a doubling of performance in 16-bpp mode.
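The Change columns can be recomputed from the Before and After times. The formula below is inferred from the numbers rather than stated anywhere: it reproduces the implode-sugarless and gvim rows, and unlike the table it does not report small differences as "no change".

```python
def change(before, after):
    """Describe the change in run time between two measurements in seconds."""
    if after <= before:
        # Speedup expressed as how much more work fits in the same time.
        return f"{(before / after - 1) * 100:.1f}% faster"
    # Slowdown expressed as the fraction of throughput lost.
    return f"{(1 - before / after) * 100:.1f}% slower"


print(change(50.019, 34.557))  # implode-sugarless, image
print(change(51.871, 25.874))  # implode-sugarless, image16
print(change(35.321, 50.018))  # gvim, image
```

Under this definition, "100.5% faster" for 16-bpp Implode means the run time was cut almost exactly in half.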

More recently, I began working on bilinear scaling compositing functions. I implemented three of the most important ones (the same ones that the SSE2 code implements). Bilinear scaling is used a lot by web browsers, so I benchmarked a couple of Firefox traces.

cairo-trace	Before	After	Change due to bilinear	Total change
	image
firefox-fishtank	2042.723	1363.913	49.7% faster	don't want to know
firefox-planet-gnome	173.835	144.939	19.9% faster	32.1% faster

The firefox-fishtank (a trace of Firefox running an HTML5 demo) spends an enormous percentage of its time in the over_8888_8_8888 compositing function, so it came as little surprise that implementing a bilinear scaling version of it would yield large performance improvements. I just didn't expect it to cut more than 11 minutes out of the run time. The firefox-planet-gnome trace sees an additional 19.9% improvement, for a total improvement of more than 30%.

In looking through my old emails to write this, I came across some benchmarks I did last year before a lot of other awesome performance work was done at OLPC, like switching to a hard-float build. They show how much performance has improved in general and not due to the work I've done on pixman.

cairo-trace	Before	After	After iwMMXt	Change	Total change
	image
implode-sugarless	56.178	50.019	34.557	12.3% faster	62.6% faster
firefox-planet-gnome	230.332	191.465	144.939	20.3% faster	58.9% faster
gnome-system-monitor	83.245	51.956	44.549	60.2% faster	86.9% faster

I have had a wonderful time working on pixman and working with the great group of people at OLPC. Special thanks goes to

Chris Ball, who got me started with OLPC and continued to help and support me throughout my time with them.
Jon Nettleton, for helping with graphics hacking and benchmarking, and for helping me figure out how to use yum way too many times.
Martin Langhoff, for bringing me onto the project and giving me lots of freedom and flexibility to tackle graphics performance.

06 July 2012 – Tags: arm freedesktop linux olpc pixman xorg