mattst88's blog - My time optimizing graphics performance on the OLPC XO 1.75 laptop

My time optimizing graphics performance on the OLPC XO 1.75 laptop

Last summer, after a year of graduate school, I was looking for an interesting project to work on. After I asked around, Chris Ball found me in the #xorg-devel IRC channel and set me up working with One Laptop per Child. I started working with Chris and Jon Nettleton on improving the graphics performance of the ARM-based XO 1.75 laptop. The graphics drivers were in a state of flux, and in a number of cases the Sugar interface felt noticeably slower than on the VIA-powered XO 1.5. We wanted to know why it was slower and how to quantitatively measure graphics performance of real-world applications.

I suggested that we use cairo's trace tool to benchmark our hardware and to find performance bottlenecks. Using it to create traces of your own applications is very easy, so I captured a trace of me playing Sugar's Implode activity. The Implode activity's graphics consist only of moving solid-colored squares around the window, but it still lagged during normal play.

Replaying the trace of Implode under a profiler revealed which compositing function I needed to focus on – over_n_8_8888. I made other traces too, although they weren't always useful. Five minutes of contorting my wrists to fit the tiny keyboard in order to complete the touch-typing lessons in the Typing Turtle activity created a trace that could be executed on the unoptimized graphics stack in 0.4 seconds. At least there was no performance issue there.
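For readers unfamiliar with pixman's naming, over_n_8_8888 applies the OVER operator with a solid ("n") source color, an 8-bit alpha mask, and a 32-bit destination. The following is only a scalar sketch of that operation, assuming premultiplied a8r8g8b8 pixels; the helper names (mul_un8, scale_pixel, over_n_8_8888_row) are made up for illustration, and this is not pixman's actual code, which processes several pixels at a time with SIMD.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply two 8-bit values treated as fractions of 255, with rounding. */
static uint8_t mul_un8(uint8_t a, uint8_t b)
{
    uint32_t t = (uint32_t)a * b + 0x80;
    return (uint8_t)(((t >> 8) + t) >> 8);
}

/* Scale all four channels of an ARGB pixel by an 8-bit factor. */
static uint32_t scale_pixel(uint32_t p, uint8_t f)
{
    return ((uint32_t)mul_un8((uint8_t)(p >> 24), f) << 24) |
           ((uint32_t)mul_un8((uint8_t)(p >> 16), f) << 16) |
           ((uint32_t)mul_un8((uint8_t)(p >> 8), f) << 8) |
            (uint32_t)mul_un8((uint8_t)p, f);
}

/* Hypothetical scalar reference for one row:
 * dest = (src IN mask) OVER dest, with premultiplied alpha. */
static void over_n_8_8888_row(uint32_t src, const uint8_t *mask,
                              uint32_t *dest, size_t width)
{
    for (size_t i = 0; i < width; i++) {
        uint8_t m = mask[i];
        if (m == 0)
            continue; /* fully masked out: destination is unchanged */

        uint32_t s = scale_pixel(src, m);
        uint8_t inv_alpha = (uint8_t)(255 - (s >> 24));

        /* For valid premultiplied input, no channel sum exceeds 255,
         * so a plain 32-bit add cannot carry between channels. */
        dest[i] = s + scale_pixel(dest[i], inv_alpha);
    }
}
```

Every pixel needs several multiplies per channel, which is why a workload made of masked solid fills benefits from doing this work on packed data.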

The Marvell CPU in the XO 1.75 is the successor to Intel's XScale line of ARM CPUs and as such has the iwMMXt SIMD instruction set. The neat thing about iwMMXt is that since it was designed by Intel to have the same features as x86's MMX, compilers can implement the same set of intrinsic functions and software can be written to take advantage of x86/MMX and ARM/iwMMXt with a single piece of code. pixman already had a set of MMX-optimized compositing functions written using the intrinsics, so the basic port of this code to ARM was relatively straightforward and consisted mostly of fixing unaligned accesses.
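The post does not show how the unaligned accesses were fixed. One common portable technique, sketched here under that caveat, is to go through memcpy instead of dereferencing a pointer that may not be suitably aligned; compilers typically turn this into whatever load sequence the target permits. The helper names are hypothetical and do not come from pixman.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read 64 bits from an address that may not be
 * 8-byte aligned. Casting p to (const uint64_t *) and dereferencing it
 * would be undefined behavior and can fault on some ARM cores. */
static uint64_t load_unaligned_u64(const void *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Hypothetical helper: write 64 bits to a possibly unaligned address. */
static void store_unaligned_u64(void *p, uint64_t v)
{
    memcpy(p, &v, sizeof v);
}
```

Pixel rows in a compositing loop often start at addresses that are only 4-byte or 1-byte aligned, so 64-bit SIMD loads cannot assume 8-byte alignment.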

Unfortunately, the last time gcc's iwMMXt support was tested was the last time someone cared about XScale CPUs (i.e., a long time ago) and as a result gcc would crash with an internal compiler error when trying to compile some of the intrinsic functions (gcc PR35294). I submitted a small patch to fix the problem, but my school (NC State) took eight months to acknowledge that they don't own my work, and by that time Marvell had contributed a five-patch series to significantly improve iwMMXt scheduling and support. Marvell hadn't been successful in finding a gcc maintainer to commit their code, so Jon and I tested it, benchmarked pixman built with it, and resubmitted to the gcc mailing list. Nick Clifton at Red Hat took it upon himself to regression test it, fix some formatting issues, and finally commit it to gcc. Improved iwMMXt code generation support (and an addition to the test suite to ensure iwMMXt support doesn't bitrot again) will be available in gcc-4.8.

A year after beginning to work on the graphics stack of the XO 1.75 laptop I've now graduated and concluded my work, so I think now is a good time to show the results.

The image columns show the time in seconds to complete a cairo-perf-trace workload at 32 bits per pixel; the image16 columns show the same at 16 bits per pixel. In each group, the Before column is the time without the iwMMXt code and the After column is the time with it.

cairo-trace	Before	After	Change	Before	After	Change
	image			image16
implode-sugarless	50.019	34.557	44.7% faster	51.871	25.874	100.5% faster
evolution	33.492	29.590	13.2% faster	30.334	24.751	22.6% faster
firefox-planet-gnome	191.465	173.835	10.1% faster	211.297	187.570	12.6% faster
gnome-system-monitor	51.956	44.549	16.6% faster	52.272	40.525	29.0% faster
gnome-terminal-vim	53.625	54.554	no change	47.593	47.341	no change
gvim	35.321	50.018	29.4% slower	35.441	35.539	no change
midori-zoomed	38.033	28.500	33.4% faster	38.576	26.937	43.2% faster
poppler	41.096	31.949	28.6% faster	41.230	31.749	29.9% faster
swfdec-giant-steps	20.062	16.912	18.6% faster	28.294	17.286	63.7% faster
swfdec-youtube	42.281	37.335	13.2% faster	52.848	47.053	12.3% faster
xfce4-terminal-a1	64.311	51.011	26.1% faster	62.592	51.191	22.3% faster

Generally, the iwMMXt code improves performance significantly. 32-bpp gvim is a bit of a mystery and requires more investigation. The Implode activity, which started this adventure, has seen awesome improvements, namely a doubling of performance in 16-bpp mode.
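The percentages in the table appear to be derived from the ratio of the two run times. The sketch below states that arithmetic explicitly; the function names are hypothetical, and the formulas are inferred from the table's numbers rather than taken from the benchmarking scripts.

```c
/* "Percent faster": how much more work gets done per unit time,
 * computed from the run times in seconds before and after a change. */
static double percent_faster(double before_s, double after_s)
{
    return (before_s / after_s - 1.0) * 100.0;
}

/* "Percent slower": the fraction of throughput lost when the run time
 * goes up instead of down. */
static double percent_slower(double before_s, double after_s)
{
    return (1.0 - before_s / after_s) * 100.0;
}
```

For example, percent_faster(50.019, 34.557) is about 44.7, matching the 32-bpp implode-sugarless row, and percent_slower(35.321, 50.018) is about 29.4, matching 32-bpp gvim.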

More recently, I began working on bilinear scaling compositing functions. I implemented three of the most important ones (the same ones that the SSE2 code implements). Bilinear scaling is used a lot by web browsers, so I benchmarked a couple of Firefox traces.

cairo-trace	Before	After	Change due to bilinear	Total change
	image
firefox-fishtank	2042.723	1363.913	49.7% faster	don't want to know
firefox-planet-gnome	173.835	144.939	19.9% faster	32.1% faster

The firefox-fishtank trace (a trace of Firefox running an HTML5 demo) spends an enormous percentage of its time in the over_8888_8_8888 compositing function, so it came as little surprise that implementing a bilinear scaling version of it would yield large performance improvements. I just didn't expect it to cut more than 11 minutes out of the run time. The firefox-planet-gnome trace sees an additional 19.9% improvement, for a total improvement of more than 30%.
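As a reminder of what bilinear scaling computes per output pixel, here is a minimal scalar sketch. It assumes 8-bit channels and weights in the range 0..256; the function names and the weight precision are choices made for this illustration and are not pixman's implementation.

```c
#include <stdint.h>

/* Blend one 8-bit channel from its four neighbouring source samples.
 * wx is the weight toward the right-hand samples and wy the weight
 * toward the bottom samples, both in 0..256 (an assumed precision). */
static uint8_t bilinear_channel(uint8_t tl, uint8_t tr,
                                uint8_t bl, uint8_t br,
                                uint32_t wx, uint32_t wy)
{
    uint32_t top    = tl * (256 - wx) + tr * wx;      /* scaled by 256 */
    uint32_t bottom = bl * (256 - wx) + br * wx;      /* scaled by 256 */
    uint32_t v = top * (256 - wy) + bottom * wy;      /* scaled by 65536 */
    return (uint8_t)((v + 0x8000) >> 16);             /* round and rescale */
}

/* Interpolate a full 32-bit pixel, one channel at a time. */
static uint32_t bilinear_pixel(uint32_t tl, uint32_t tr,
                               uint32_t bl, uint32_t br,
                               uint32_t wx, uint32_t wy)
{
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint8_t c = bilinear_channel((uint8_t)(tl >> shift),
                                     (uint8_t)(tr >> shift),
                                     (uint8_t)(bl >> shift),
                                     (uint8_t)(br >> shift),
                                     wx, wy);
        out |= (uint32_t)c << shift;
    }
    return out;
}
```

Each output pixel reads four source pixels and performs a dozen or so multiplies before any compositing happens, so fusing the scaling step with the compositing operator in one SIMD loop removes a lot of per-pixel overhead.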

In looking through my old emails to write this, I came across some benchmarks I did last year before a lot of other awesome performance work was done at OLPC, like switching to a hard-float build. They show how much performance has improved in general, independent of the work I've done on pixman.

cairo-trace	Before	After	After iwMMXt	Change	Total change
	image
implode-sugarless	56.178	50.019	34.557	12.3% faster	62.6% faster
firefox-planet-gnome	230.332	191.465	144.939	20.3% faster	58.9% faster
gnome-system-monitor	83.245	51.956	44.549	60.2% faster	86.9% faster

I have had a wonderful time working on pixman and working with the great group of people at OLPC. Special thanks go to

Chris Ball, who got me started with OLPC and continued to help and support me throughout my time with them.
Jon Nettleton, for helping with graphics hacking and benchmarking, and for helping me figure out how to use yum way too many times.
Martin Langhoff, for bringing me onto the project and giving me lots of freedom and flexibility to tackle graphics performance.

06 July 2012 – Tags: arm freedesktop linux olpc pixman xorg