mattst88's blog - Optimizing pixman for Loongson: Process and Results
My time optimizing graphics performance on the OLPC XO 1.75 laptop
Last summer after a year of graduate school, I was looking for an interesting project to work on. After asking around, Chris Ball found me in the #xorg-devel IRC channel and set me up working with One Laptop per Child. I started working with Chris and Jon Nettleton on improving the graphics performance of the ARM-based XO 1.75 laptop. The graphics drivers were in a state of flux, and in a number of cases the Sugar interface felt noticeably slower than on the VIA-powered XO 1.5. We wanted to know why it was slower and how to quantitatively measure graphics performance of real-world applications.
I suggested that we use cairo's trace tool to benchmark our hardware and to find performance bottlenecks. Using it to create traces of your own applications is very easy, so I captured a trace of me playing Sugar's Implode activity. The Implode activity's graphics consist only of moving solid-colored squares around the window, but it still lagged during normal play.
Replaying the trace of Implode under a profiler revealed which compositing function I needed to focus on – over_n_8_8888. I made other traces too, although they weren't always useful. Five minutes of contorting my wrists to fit the tiny keyboard in order to complete the touch-typing lessons in the Typing Turtle activity created a trace that could be executed on the unoptimized graphics stack in 0.4 seconds. At least there was no performance issue there.
The Marvell CPU in the XO 1.75 is the successor to Intel's XScale line of ARM CPUs and as such has the iwMMXt SIMD instruction set. The neat thing about iwMMXt is that since it was designed by Intel to have the same features as x86's MMX, compilers can implement the same set of intrinsic functions and software can be written to take advantage of x86/MMX and ARM/iwMMXt with a single piece of code. pixman already had a set of MMX-optimized compositing functions written using the intrinsics, so the basic port of this code to ARM was relatively straightforward and consisted mostly of fixing unaligned accesses.
Unfortunately, the last time gcc's iwMMXt support was tested was the last time someone cared about XScale CPUs (i.e., a long time ago) and as a result gcc would crash with an internal compiler error when trying to compile some of the intrinsic functions (gcc PR35294). I submitted a small patch to fix the problem, but my school (NC State) took eight months to acknowledge that they don't own my work, and by that time Marvell had contributed a five-patch series to significantly improve iwMMXt scheduling and support. Marvell hadn't been successful in finding a gcc maintainer to commit their code, so Jon and I tested it, benchmarked pixman built with it, and resubmitted to the gcc mailing list. Nick Clifton at Red Hat took it upon himself to regression test it, fix some formatting issues, and finally commit it to gcc. Improved iwMMXt code generation support (and an addition to the test suite to ensure iwMMXt support doesn't bitrot again) will be available in gcc-4.8.
A year after beginning to work on the graphics stack of the XO 1.75 laptop I've now graduated and concluded my work, so I think now is a good time to show the results.
The image columns show the time in seconds to complete a cairo-perf-trace workload at 32 bits per pixel, and the image16 columns show the same at 16 bits per pixel. Within each group, the first column is the time to complete the workload without the iwMMXt code, and the second column is the time with the iwMMXt code.
| cairo-trace | image Before | image After | image Change | image16 Before | image16 After | image16 Change |
|---|---|---|---|---|---|---|
| implode-sugarless | 50.019 | 34.557 | 44.7% faster | 51.871 | 25.874 | 100.5% faster |
| evolution | 33.492 | 29.590 | 14.7% faster | 30.334 | 24.751 | 22.6% faster |
| firefox-planet-gnome | 191.465 | 173.835 | 10.1% faster | 211.297 | 187.570 | 12.6% faster |
| gnome-system-monitor | 51.956 | 44.549 | 16.6% faster | 52.272 | 40.525 | 29.0% faster |
| gnome-terminal-vim | 53.625 | 54.554 | no change | 47.593 | 47.341 | no change |
| gvim | 35.321 | 50.018 | 29.4% slower | 35.441 | 35.539 | no change |
| midori-zoomed | 38.033 | 28.500 | 33.4% faster | 38.576 | 26.937 | 43.2% faster |
| poppler | 41.096 | 31.949 | 28.6% faster | 41.230 | 31.749 | 29.9% faster |
| swfdec-giant-steps | 20.062 | 16.912 | 18.6% faster | 28.294 | 17.286 | 63.7% faster |
| swfdec-youtube | 42.281 | 37.335 | 13.2% faster | 52.848 | 47.053 | 12.3% faster |
| xfce4-terminal-a1 | 64.311 | 51.011 | 26.1% faster | 62.592 | 51.191 | 22.3% faster |
Generally, the iwMMXt code improves performance significantly. 32-bpp gvim is a bit of a mystery and requires some more investigation. The Implode activity, which started this adventure, has seen awesome improvements, namely a doubling of performance in 16-bpp mode.
More recently, I began working on bilinear scaling compositing functions. I implemented three of the most important ones (the same ones that the SSE2 code implements). Bilinear scaling is used a lot by web browsers, so I benchmarked a couple of Firefox traces.
| cairo-trace | image Before | image After | Change due to bilinear | Total change |
|---|---|---|---|---|
| firefox-fishtank | 2042.723 | 1363.913 | 49.7% faster | don't want to know |
| firefox-planet-gnome | 173.835 | 144.939 | 19.9% faster | 32.1% faster |
The firefox-fishtank trace (a trace of Firefox running an HTML5 demo) spends an enormous percentage of its time in the over_8888_8_8888 compositing function, so it came as little surprise that implementing a bilinear scaling version of it would yield large performance improvements. I just didn't expect it to cut more than 11 minutes out of the run time. The firefox-planet-gnome trace sees an additional 19.9% improvement, for a total improvement of more than 30%.
In looking through my old emails to write this, I came across some benchmarks I did last year before a lot of other awesome performance work was done at OLPC, like switching to a hard-float build. They show how much performance has improved in general and not due to the work I've done on pixman.
| cairo-trace | image Before | image After | image After iwMMXt | Change | Total change |
|---|---|---|---|---|---|
| implode-sugarless | 56.178 | 50.019 | 34.557 | 12.3% faster | 62.6% faster |
| firefox-planet-gnome | 230.332 | 191.465 | 144.939 | 20.3% faster | 58.9% faster |
| gnome-system-monitor | 83.245 | 51.956 | 44.549 | 60.2% faster | 86.9% faster |
I have had a wonderful time working on pixman and working with the great group of people at OLPC. Special thanks go to:
Chris Ball, who got me started with OLPC and continued to help and support me throughout my time with them.
Jon Nettleton, for helping with graphics hacking and benchmarking, and for helping me figure out how to use yum way too many times.
Martin Langhoff, for bringing me onto the project and giving me lots of freedom and flexibility to tackle graphics performance.
Optimizing pixman for Loongson: Process and Results
The Lemote Yeeloong is a small notebook that is often the computer of choice for Free Software advocates, including Richard Stallman. It's powered by an 800 MHz STMicroelectronics Loongson 2F processor and has an antiquated Silicon Motion 712 graphics chip. The SM712's acceleration features are pretty subpar by today's standards, and performance of the old XFree86 Acceleration Architecture (XAA) that supports the SM712 has slowly decayed as developers move to support newer hardware and newer acceleration architectures. In short, graphics performance of the SM712 isn't very good with new X servers, so how can we improve it?
If you don't care about how pixman was optimized and just want to see the results, you can skip ahead.
pixman, the pixel-manipulation library used by cairo and X, has MMX-accelerated compositing functions written using C-level intrinsic functions, which allow the programmer to write C but still have fine-grained control over performance-sensitive MMX code.
Last summer I began optimizing graphics performance of the OLPC XO-1.75 laptop. The Marvell processor it uses supports iwMMXt2, a 64-bit SIMD instruction set designed by Intel for their XScale ARM CPUs. The instruction set is predictably very similar to Intel's original MMX instruction set. By design, Intel's MMX intrinsics also support generating iwMMXt instructions, so that the same optimized C code will be easily portable to processors supporting iwMMXt. With a relatively small amount of work (as compared to writing compositing functions in ARM/iwMMXt assembly) I had pixman's MMX optimized code working on the XO-1.75 for some nice performance gains.
The Loongson 2F processor also includes a 64-bit SIMD instruction set, very similar to Intel's MMX. Its SIMD instructions use the 32 floating-point registers, and like iwMMXt it provides some useful instructions not found in x86 processors until AMD's Enhanced 3DNow! or Intel's SSE instruction sets.
So just like I did with the XO-1.75, I planned to use pixman's existing MMX code to optimize performance on my Yeeloong.
While Intel's MMX intrinsic functions are well designed, well tested, well supported, and widely used, the Loongson intrinsics are none of these. In fact, they're incomplete, badly designed, and used nowhere I can find (indeed, all of the instances of Loongson-optimized SIMD code I have found are written in inline assembly, which is no surprise given the state of the intrinsics). Of course, the gcc manual doesn't tell me this, so I learned it only after trying to use them with pixman.
Using the Loongson vector intrinsics, pixman passed the test suite, and objdump verified that gcc was successfully generating vector instructions, but the performance was terrible. gcc apparently was not privy to the knowledge that the integer data types returned by the intrinsics were actually stored in floating point registers, so in between any two vector instructions you might find three or four instructions that simply copied the same data back and forth between integer and floating-point registers.
This path led nowhere, so I decided to take the hint from previous programmers and forget that the Loongson intrinsics exist. I still wanted to use pixman's MMX code, so I implemented Intel's MMX intrinsics myself using Loongson inline assembly. Object code size was significantly smaller and performance was better, in fact much better in some select functions, but overall was still a net loss. There must have been optimization opportunities that I was missing.
On the XO-1.75 the MMX code is faster than the generic code, so I didn't recognize inefficiencies in the MMX code the first time I worked with it, but with the Loongson it was necessary that I find and fix them. The great thing is that optimizations to this code benefit x86/MMX, ARM/iwMMXt, and Loongson.
I took a look at the book Dirty Pixels at the suggestion of pixman's maintainer, Søren Sandmann. In it, I discovered that the original MMX instruction set lacked an unsigned packed-multiply high instruction, which would be useful for the over compositing operation. To work around the lack of this instruction, an extra two shifts and an add had to be used. AMD recognized this inefficiency and added the instruction in Enhanced 3DNow!, and later Intel did the same with SSE. I modified the pix_multiply function to use the new instruction, and the resulting object code size shrank by 5%.
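To make the trade concrete, here is a minimal scalar sketch of one 16-bit lane of such a multiply, assuming the operation is a rounded (a * b) / 255 on 8-bit components. The function names are hypothetical and this is not pixman's actual code; it only shows that a single unsigned multiply-high by 0x0101 (the kind of per-lane operation x86's pmulhuw performs) gives the same answer as the two-shifts-and-an-add sequence.

```c
#include <stdint.h>

/* Rounded (a * b) / 255 using the two shifts and an add that are needed
 * when no unsigned multiply-high instruction is available. */
uint8_t mul_un8_shift_add(uint8_t a, uint8_t b)
{
    uint16_t t = (uint16_t)(a * b + 0x80);  /* at most 65153, fits in 16 bits */
    t = (uint16_t)(t + (t >> 8));           /* first shift, plus the add */
    return (uint8_t)(t >> 8);               /* second shift */
}

/* The same value from one unsigned multiply-high: take the upper 16 bits
 * of the 32-bit product t * 0x0101. */
uint8_t mul_un8_mulhi(uint8_t a, uint8_t b)
{
    uint32_t t = (uint32_t)a * b + 0x80;
    return (uint8_t)((t * 0x0101u) >> 16);
}

/* Exhaustively compare the two forms; returns the number of (a, b) pairs
 * on which they disagree. */
int count_mismatches(void)
{
    int bad = 0;
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            if (mul_un8_shift_add((uint8_t)a, (uint8_t)b) !=
                mul_un8_mulhi((uint8_t)a, (uint8_t)b))
                bad++;
    return bad;
}
```

Calling count_mismatches() checks all 65,536 input pairs, so the substitution can be verified rather than taken on faith.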
I realized that the expand_alpha, expand_alpha_rev, and invert_colors functions that mix and duplicate pixel components could be reduced from a combined total of around 30 instructions to a single instruction each. This change reduced object code size by another 9%.
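As an illustration of why a single instruction can be enough, here is a scalar model on a 64-bit word. It assumes a pixel unpacked into four 16-bit lanes with alpha in the top lane, and it assumes expand_alpha_rev replicates the bottom lane instead. The helper shuffle16 is a hypothetical stand-in for a 16-bit lane shuffle instruction; none of this is pixman's actual code.

```c
#include <stdint.h>

/* Shift-and-OR form: bring alpha down to lane 0, then smear it upward.
 * Each step here corresponds to one vector instruction. */
uint64_t expand_alpha_shifts(uint64_t pixel)
{
    uint64_t t1 = pixel >> 48;   /* lanes: 0 0 0 A */
    uint64_t t2 = t1 << 16;      /* lanes: 0 0 A 0 */
    t1 |= t2;                    /* lanes: 0 0 A A */
    t2 = t1 << 32;               /* lanes: A A 0 0 */
    return t1 | t2;              /* lanes: A A A A */
}

/* Scalar model of a 16-bit lane shuffle: result lane i is source lane
 * sel[i], with lane 0 being the least significant 16 bits. */
uint64_t shuffle16(uint64_t v, const int sel[4])
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++)
        r |= ((v >> (16 * sel[i])) & 0xffffu) << (16 * i);
    return r;
}

/* Replicate the top lane (alpha) into all four lanes with one shuffle. */
uint64_t expand_alpha_shuffle(uint64_t pixel)
{
    static const int sel[4] = {3, 3, 3, 3};
    return shuffle16(pixel, sel);
}

/* Replicate the bottom lane into all four lanes with one shuffle. */
uint64_t expand_alpha_rev_shuffle(uint64_t pixel)
{
    static const int sel[4] = {0, 0, 0, 0};
    return shuffle16(pixel, sel);
}
```

On hardware with a lane shuffle, the loop in shuffle16 collapses into one instruction with the selector encoded as an immediate, which is where the savings come from.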
After that, I focused on eliminating unnecessary copies to and from the vector registers. Consider this code:
__m64 vsrc = load8888 (*src);
The code loads *src into an integer register, and then load8888 loads and expands the value into a vector register. Instead, it's simpler and faster to load from memory into a vector register directly. By counting the number of dmfc1 (doubleword move from floating-point) and dmtc1 (doubleword move to floating-point) instructions I could determine which functions had room for improvement.
After reducing the number of unnecessary copies and adding a number of other optimizations (list available here) I was ready to see if the Yeeloong was more usable.
Results gathered from cairo's perf-trace tool confirm the real-world performance improvements given by the pixman optimizations. The image columns show the time in seconds to complete a cairo-perf-trace workload at 32 bits per pixel, and the image16 columns show the same at 16 bits per pixel. Within each group, the first column is the time to complete the workload without the Loongson MMI code. The second column is the time after pixman commit c78e986085, the commit that turns on the Loongson MMI code. The third column is the time with pixman-0.25.6, which has many more optimizations.
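| cairo-trace | image: without MMI | image: c78e986085 | image: pixman-0.25.6 | image: Change | image16: without MMI | image16: c78e986085 | image16: pixman-0.25.6 | image16: Change |
|---|---|---|---|---|---|---|---|---|
| evolution | 32.985 | 29.667 | 28.752 | 14.7% faster | 27.314 | 23.870 | 22.960 | 19.0% faster |
| firefox-planet-gnome | 197.982 | 180.437 | 169.532 | 16.8% faster | 220.986 | 205.057 | 199.077 | 11.0% faster |
| gnome-terminal-vim | 60.799 | 50.528 | 50.792 | 19.7% faster | 51.655 | 44.131 | 43.561 | 18.6% faster |
| gvim | 38.646 | 32.552 | 33.570 | 15.1% faster | 38.126 | 34.453 | 35.457 | 7.5% faster |
| ocitysmap | 23.065 | 18.057 | 17.516 | 31.7% faster | 23.046 | 18.055 | 17.543 | 31.4% faster |
| poppler | 43.676 | 36.077 | 35.498 | 23.0% faster | 43.065 | 36.090 | 35.534 | 21.2% faster |
| swfdec-giant-steps | 20.166 | 20.365 | 20.469 | no change | 22.354 | 16.578 | 14.473 | 54.4% faster |
| swfdec-youtube | 31.502 | 28.118 | 24.168 | 30.3% faster | 44.052 | 41.771 | 38.577 | 14.2% faster |
| xfce4-terminal-a1 | 69.517 | 51.288 | 50.838 | 36.7% faster | 62.225 | 53.309 | 44.297 | 40.5% faster |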
May 29th edit: the % faster numbers were previously calculated as the percent difference between the initial and final workload times. I realized that this is not strictly a measure of how much faster the code is. The new formula is (1/final - 1/initial) / (1/initial), which gives the percent difference in terms of operations per second. This number is the % faster. The table has been updated accordingly.
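A minimal sketch of that calculation, assuming "% faster" means the percent change in workloads completed per second. The function name is hypothetical. With this definition, going from 50.019 s to 34.557 s comes out to roughly 44.7% faster, and going from 35.321 s to 50.018 s comes out to roughly 29.4% slower.

```c
/* Percent change in throughput (workloads per second) when a run that
 * took `initial` seconds now takes `final` seconds. Positive means
 * faster, negative means slower. Both arguments must be non-zero. */
double percent_faster(double initial, double final)
{
    double rate_before = 1.0 / initial;
    double rate_after = 1.0 / final;
    return (rate_after - rate_before) / rate_before * 100.0;
}
```

The expression simplifies to (initial / final - 1) * 100, which is why halving the run time reads as 100% faster rather than 50%.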
As the results show, real-world performance is improved by the Loongson MMI code. I can tell a difference when using GNOME 3 (in fallback mode) on my Yeeloong.
So far this has been very successful. I've optimized pixman on an interesting platform, learned a new instruction set, and in the process found many opportunities to optimize the MMX code on x86 and ARM. I still see a bunch of things to work on with just these compositing operations alone. Beyond that, there are many other things to do like bilinear and nearest scaling functions (which are extremely important for Firefox performance). And beyond that, I've improved my understanding of pixman's code and have a few ideas for improvements in general.
Thanks to
Danny Clark, who runs freedomincluded.com, for providing me with a Lemote Yeeloong laptop for my work on Gentoo's MIPS port.
Søren Sandmann and Siarhei Siamashka for reviewing and helping me improve my code.