17 May 2012 - Optimizing pixman for Loongson: Process and Results

The Lemote Yeeloong is a small notebook that is often the computer of choice for Free Software advocates, including Richard Stallman. It's powered by an 800 MHz STMicroelectronics Loongson 2F processor and has an antiquated Silicon Motion 712 graphics chip. The SM712's acceleration features are pretty subpar by today's standards, and performance of the old XFree86 Acceleration Architecture (XAA) that supports the SM712 has slowly decayed as developers move to support newer hardware and newer acceleration architectures. In short, graphics performance of the SM712 isn't very good with new X servers, so how can we improve it?

If you don't care about how pixman was optimized and just want to see the results, you can skip ahead.

pixman, the pixel-manipulation library used by cairo and X, has MMX-accelerated compositing functions. They are written with C-level intrinsic functions, which let the programmer write C while keeping fine-grained control over performance-sensitive MMX code.

Last summer I began optimizing graphics performance of the OLPC XO-1.75 laptop. The Marvell processor it uses supports iwMMXt2, a 64-bit SIMD instruction set designed by Intel for their XScale ARM CPUs. The instruction set is predictably very similar to Intel's original MMX instruction set. By design, Intel's MMX intrinsics also support generating iwMMXt instructions, so that the same optimized C code will be easily portable to processors supporting iwMMXt. With a relatively small amount of work (as compared to writing compositing functions in ARM/iwMMXt assembly) I had pixman's MMX optimized code working on the XO-1.75 for some nice performance gains.

The Loongson 2F processor also includes a 64-bit SIMD instruction set, very similar to Intel's MMX. Its SIMD instructions use the 32 floating-point registers, and like iwMMXt it provides some useful instructions not found in x86 processors until AMD's Enhanced 3DNow! or Intel's SSE instruction sets.

So just like I did with the XO-1.75, I planned to use pixman's existing MMX code to optimize performance on my Yeeloong.

While Intel's MMX intrinsic functions are well designed, well tested, well supported, and widely used, the Loongson intrinsics are none of these. In fact, they're incomplete, badly designed, and used nowhere that I can find (indeed, all of the instances of Loongson-optimized SIMD code I have found are written in inline assembly, which is no surprise given the state of the intrinsics). Of course, the gcc manual doesn't tell me this, so I learned it only after trying to use them with pixman.

Aside: let me pretend that I'm designing and implementing Loongson's vector intrinsics, covering an instruction set very similar to MMX, which already has an excellent set of intrinsic functions. Why would I create my own incompatible set, instead of implementing the same interface that lots of software already uses?!

Using the Loongson vector intrinsics, pixman passed the test suite, and objdump verified that gcc was successfully generating vector instructions, but the performance was terrible. gcc apparently was not privy to the knowledge that the integer data types returned by the intrinsics were actually stored in floating-point registers, so in between any two vector instructions you might find three or four instructions that simply copied the same data back and forth between integer and floating-point registers, as in this excerpt:

punpcklwd	$f9,$f9,$f5
    dmtc1	v0,$f8
punpcklwd	$f19,$f19,$f5
    dmfc1	t9,$f9
    dmtc1	v0,$f9
    dmtc1	t9,$f20
    dmfc1	s0,$f19
punpcklbh	$f20,$f20,$f2

This path led nowhere, so I decided to take the hint from previous programmers and forget that the Loongson intrinsics exist. I still wanted to use pixman's MMX code, so I implemented Intel's MMX intrinsics myself using Loongson inline assembly. Object code size was significantly smaller and performance was better, in a few functions much better, but overall it was still a net loss. There must have been optimization opportunities that I was missing.

On the XO-1.75 the MMX code is faster than the generic code, so I didn't recognize inefficiencies in the MMX code the first time I worked with it, but with the Loongson it was necessary that I find and fix them. The great thing is that optimizations to this code benefit x86/MMX, ARM/iwMMXt, and Loongson.

I took a look at the book Dirty Pixels at the suggestion of pixman's maintainer, Søren Sandmann. In it, I discovered that the original MMX instruction set lacked an unsigned packed multiply-high instruction, which would be useful for the over compositing operation. To work around the lack of this instruction, an extra two shifts and an add had to be used. AMD recognized this inefficiency and added the instruction in Enhanced 3DNow!, and later Intel did the same with SSE. I modified the pix_multiply function to use the newer instruction, and the resulting object code size shrank by 5%.
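As a rough illustration of what that change buys, here is a scalar C model of a single 16-bit lane. It is a sketch, not pixman's code: the function names are made up for this example, and the constants (0x80 for rounding, 0x0101 as the multiplier) are my assumption about how the divide-by-255 is expressed. It checks that the "two shifts and an add" form and the multiply-high form give the same rounded result for every pair of 8-bit inputs.

```c
#include <stdint.h>

/* Scalar model of one 16-bit lane; a and b are 8-bit channel values.
 * All names here are hypothetical and exist only for this sketch. */

/* Without an unsigned multiply-high: two shifts and an add after the
 * rounding step. */
static uint8_t mul_un8_shift(uint8_t a, uint8_t b)
{
    uint16_t t = (uint16_t)(a * b + 0x80);   /* at most 65153, fits in 16 bits */
    t = (uint16_t)(t + (t >> 8));            /* first shift, then the add */
    return (uint8_t)(t >> 8);                /* second shift */
}

/* With an unsigned multiply-high: keep the top 16 bits of t * 0x0101. */
static uint8_t mul_un8_mulhi(uint8_t a, uint8_t b)
{
    uint16_t t = (uint16_t)(a * b + 0x80);
    return (uint8_t)(((uint32_t)t * 0x0101u) >> 16);
}

/* Returns 1 if both forms equal a*b/255 rounded to nearest for all inputs. */
static int mul_un8_forms_agree(void)
{
    for (int a = 0; a < 256; a++) {
        for (int b = 0; b < 256; b++) {
            /* (2ab + 255) / 510 is a*b/255 rounded to nearest in integers. */
            uint8_t want = (uint8_t)((2 * a * b + 255) / 510);
            if (mul_un8_shift((uint8_t)a, (uint8_t)b) != want ||
                mul_un8_mulhi((uint8_t)a, (uint8_t)b) != want)
                return 0;
        }
    }
    return 1;
}
```

In the vector version the same arithmetic runs on four lanes at once, so the multiply-high form replaces three instructions with one per call.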

I realized that the expand_alpha, expand_alpha_rev, and invert_colors functions, which mix and duplicate pixel components, could be reduced from a combined total of around 30 instructions to a single instruction each. This change reduced object code size by another 9%.
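To make the idea concrete, here is a scalar C model of a four-lane 16-bit shuffle, the kind of instruction I assume does the job (pshufw on x86 is the one I'm sure of). The shuffle16 helper, the selector values, and the expanded pixel layout 0x00AA00RR00GG00BB are assumptions for this sketch rather than a copy of pixman's code.

```c
#include <stdint.h>

/* Scalar model of a four-lane 16-bit shuffle: result lane i takes the
 * source lane selected by bits [2i+1:2i] of sel. Hypothetical helper. */
static uint64_t shuffle16(uint64_t v, unsigned sel)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        unsigned src = (sel >> (2 * i)) & 3u;
        uint64_t lane = (v >> (16 * src)) & 0xFFFFu;
        r |= lane << (16 * i);
    }
    return r;
}

/* Assumed layout of an expanded pixel: 0x00AA00RR00GG00BB (alpha in lane 3). */

/* Copy the alpha lane (lane 3) into all four lanes. */
static uint64_t expand_alpha(uint64_t p)     { return shuffle16(p, 0xFF); }

/* Same, but for pixels whose alpha sits in lane 0. */
static uint64_t expand_alpha_rev(uint64_t p) { return shuffle16(p, 0x00); }

/* Swap lanes 0 and 2 (blue and red), leaving lanes 1 and 3 in place. */
static uint64_t invert_colors(uint64_t p)    { return shuffle16(p, 0xC6); }
```

Each of the three functions is one shuffle with a different constant selector, which is why a single instruction is enough.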

After that, I focused on eliminating unnecessary copies to and from the vector registers. Consider this code:

__m64 vsrc = load8888 (*src);

The code loads *src into an integer register, and then load8888 moves the value into a vector register and expands it. Instead, it's simpler and faster to load from memory into a vector register directly. By counting the number of dmfc1 (doubleword move from floating-point) and dmtc1 (doubleword move to floating-point) instructions I could determine which functions had room for improvement.
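The difference is mostly one of interface, which a portable sketch can at least show. The names load8888_by_value, load8888_by_ptr, and expand8888 below are hypothetical, and plain C cannot express which register file the compiler picks, so this only models the shape of the change: the helper receives the address instead of an already-loaded integer.

```c
#include <stdint.h>
#include <string.h>

/* Scalar stand-in for the vector unpack: 0xAARRGGBB -> 0x00AA00RR00GG00BB. */
static uint64_t expand8888(uint64_t v)
{
    uint64_t p = v & 0xFFFFFFFFu;
    return ((p & 0xFF000000u) << 24) | ((p & 0x00FF0000u) << 16) |
           ((p & 0x0000FF00u) << 8)  |  (p & 0x000000FFu);
}

/* Before: the caller dereferences, so the pixel arrives as an integer
 * argument and a SIMD build has to move it across register files. */
static uint64_t load8888_by_value(uint32_t pixel)
{
    return expand8888(pixel);
}

/* After: the helper gets the address, so a SIMD build is free to load
 * from memory directly into a vector register. */
static uint64_t load8888_by_ptr(const uint32_t *src)
{
    uint32_t pixel;
    memcpy(&pixel, src, sizeof pixel);   /* models the single memory load */
    return expand8888(pixel);
}
```

A call site would change from load8888_by_value (*src) to load8888_by_ptr (src); both return the same expanded pixel.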

After reducing the number of unnecessary copies and adding a number of other optimizations (list available here) I was ready to see if the Yeeloong was more usable.

Results gathered with cairo's cairo-perf-trace tool confirm the real-world performance improvements given by the pixman optimizations. The image columns show the time in seconds to complete a cairo-perf-trace workload at 32 bits per pixel; the image16 columns show the same at 16 bits per pixel. In both the image and image16 groupings, the first column is the time to complete the workload without the Loongson MMI code. The second column is the time after pixman commit c78e986085, the commit that turns on the Loongson MMI code. The third column is the time with pixman-0.25.6, which has many more optimizations.

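| workload | image, no MMI (s) | image, c78e986085 (s) | image, 0.25.6 (s) | image, change | image16, no MMI (s) | image16, c78e986085 (s) | image16, 0.25.6 (s) | image16, change |
|---|---|---|---|---|---|---|---|---|
| evolution | 32.985 | 29.667 | 28.752 | 14.7% faster | 27.314 | 23.870 | 22.960 | 19.0% faster |
| firefox-planet-gnome | 197.982 | 180.437 | 169.532 | 16.8% faster | 220.986 | 205.057 | 199.077 | 11.0% faster |
| gnome-terminal-vim | 60.799 | 50.528 | 50.792 | 19.7% faster | 51.655 | 44.131 | 43.561 | 18.6% faster |
| gvim | 38.646 | 32.552 | 33.570 | 15.1% faster | 38.126 | 34.453 | 35.457 | 7.5% faster |
| ocitysmap | 23.065 | 18.057 | 17.516 | 31.7% faster | 23.046 | 18.055 | 17.543 | 31.4% faster |
| poppler | 43.676 | 36.077 | 35.498 | 23.0% faster | 43.065 | 36.090 | 35.534 | 21.2% faster |
| swfdec-giant-steps | 20.166 | 20.365 | 20.469 | no change | 22.354 | 16.578 | 14.473 | 54.4% faster |
| swfdec-youtube | 31.502 | 28.118 | 24.168 | 30.3% faster | 44.052 | 41.771 | 38.577 | 14.2% faster |
| xfce4-terminal-a1 | 69.517 | 51.288 | 50.838 | 36.7% faster | 62.225 | 53.309 | 44.297 | 40.5% faster |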

May 29th edit: the % faster numbers were previously calculated as the percent difference between the initial and final workload times. I realized that this is not strictly a measure of how much faster the code is. The new formula is (1/final - 1/initial) / (1/initial), which is the percent difference in terms of operations/second. This number is % faster. The table has been updated accordingly.
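For anyone who wants to check the table, here is the calculation as a small C sketch. The function names are mine, and percent_time_saved is only my guess at what the earlier calculation looked like; percent_faster treats 1/time as workloads per second and reports the relative change in that rate.

```c
/* Relative change in throughput (workloads per second), in percent.
 * Algebraically the same as 100 * (initial / final - 1). */
static double percent_faster(double initial_s, double final_s)
{
    double rate_initial = 1.0 / initial_s;
    double rate_final = 1.0 / final_s;
    return 100.0 * (rate_final - rate_initial) / rate_initial;
}

/* Assumed form of the earlier metric: fraction of run time saved, in percent.
 * It is smaller than percent_faster whenever the code got faster. */
static double percent_time_saved(double initial_s, double final_s)
{
    return 100.0 * (initial_s - final_s) / initial_s;
}
```

For the evolution workload at 32 bits per pixel, percent_faster (32.985, 28.752) comes out at roughly 14.7, matching the table, while the time-saved figure for the same pair is closer to 12.8.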

As the results show, real-world performance is improved by the Loongson MMI code. I can tell a difference when using GNOME 3 (in fallback mode) on my Yeeloong.

So far this has been very successful. I've optimized pixman on an interesting platform, learned a new instruction set, and in the process found many opportunities to optimize the MMX code on x86 and ARM. I still see a bunch of things to work on with just these compositing operations alone. Beyond that, there are many other things to do like bilinear and nearest scaling functions (which are extremely important for Firefox performance). And beyond that, I've improved my understanding of pixman's code and have a few ideas for improvements in general.

Thanks to

Tags: freedesktop gentoo linux loongson mips pixman xorg yeeloong