mattst88's blog

Optimizing pixman for Loongson: Process and Results

The Lemote Yeeloong is a small notebook that is often the computer of choice for Free Software advocates, including Richard Stallman. It's powered by an 800 MHz STMicroelectronics Loongson 2F processor and has an antiquated Silicon Motion 712 graphics chip. The SM712's acceleration features are pretty subpar by today's standards, and performance of the old XFree86 Acceleration Architecture (XAA) that supports the SM712 has slowly decayed as developers have moved on to newer hardware and newer acceleration architectures. In short, graphics performance of the SM712 isn't very good with new X servers, so how can we improve it?

If you don't care about how pixman was optimized and just want to see the results, you can skip ahead.

pixman, the pixel-manipulation library used by cairo and X, has MMX-accelerated compositing functions. They are written with C-level intrinsic functions, which allow the programmer to write C but still have fine-grained control over performance-sensitive MMX code.

Last summer I began optimizing graphics performance of the OLPC XO-1.75 laptop. The Marvell processor it uses supports iwMMXt2, a 64-bit SIMD instruction set designed by Intel for their XScale ARM CPUs. The instruction set is predictably very similar to Intel's original MMX instruction set. By design, Intel's MMX intrinsics also support generating iwMMXt instructions, so that the same optimized C code will be easily portable to processors supporting iwMMXt. With a relatively small amount of work (as compared to writing compositing functions in ARM/iwMMXt assembly) I had pixman's MMX optimized code working on the XO-1.75 for some nice performance gains.

The Loongson 2F processor also includes a 64-bit SIMD instruction set, very similar to Intel's MMX. Its SIMD instructions use the 32 floating-point registers, and like iwMMXt it provides some useful instructions not found in x86 processors until AMD's Enhanced 3DNow! or Intel's SSE instruction sets.

So just like I did with the XO-1.75, I planned to use pixman's existing MMX code to optimize performance on my Yeeloong.

While Intel's MMX intrinsic functions are well designed, well tested, well supported, and widely used, the Loongson intrinsics are none of these. In fact, they're incomplete, badly designed, and used nowhere that I can find (indeed, all of the instances of Loongson-optimized SIMD code I have found are written in inline assembly, which is no surprise given the state of the intrinsics). Of course, the gcc manual doesn't tell me this, so I learned it only after trying to use them with pixman.

Using the Loongson vector intrinsics, pixman passed the test suite, and objdump verified that gcc was successfully generating vector instructions, but the performance was terrible. gcc apparently was not privy to the knowledge that the integer data types returned by the intrinsics were actually stored in floating point registers, so in between any two vector instructions you might find three or four instructions that simply copied the same data back and forth between integer and floating-point registers.

punpcklwd	$f9,$f9,$f5
    dmtc1	v0,$f8
punpcklwd	$f19,$f19,$f5
    dmfc1	t9,$f9
    dmtc1	v0,$f9
    dmtc1	t9,$f20
    dmfc1	s0,$f19
punpcklbh	$f20,$f20,$f2

This path led nowhere, so I decided to take the hint from previous programmers and forget that the Loongson intrinsics exist. I still wanted to use pixman's MMX code, so I implemented Intel's MMX intrinsics myself using Loongson inline assembly. Object code size was significantly smaller and performance was better, in fact much better in some select functions, but overall it was still a net loss. There must have been optimization opportunities that I was missing.

On the XO-1.75 the MMX code is faster than the generic code, so I didn't recognize inefficiencies in the MMX code the first time I worked with it, but with the Loongson it was necessary that I find and fix them. The great thing is that optimizations to this code benefit x86/MMX, ARM/iwMMXt, and Loongson.

I took a look at the book Dirty Pixels at the suggestion of pixman's maintainer, Søren Sandmann. In it, I discovered that the original MMX instruction set lacked an unsigned packed-multiply high instruction, which would be useful for the over compositing operation. To work around the lack of this instruction, an extra two shifts and an add had to be used. AMD recognized this inefficiency and added the instruction in Enhanced 3DNow!, and later Intel did the same with SSE. I modified the pix_multiply function to use the new instruction, and the resulting object code size shrank by 5%.
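To make the trade-off concrete, here is a scalar sketch in portable C of the arithmetic on a single 16-bit lane. It is not pixman's code: the function names are made up for illustration, and the constants 0x80 and 0x0101 are the usual ones for a rounded divide by 255, assumed here rather than quoted from the source. The first function uses two shifts and an add, the second keeps the high 16 bits of a single multiply, and the third checks that both agree with an exactly rounded a*b/255 for every pair of 8-bit inputs.

```c
#include <assert.h>
#include <stdint.h>

/* Rounded a*b/255 using only 16-bit operations: after the multiply,
 * two shifts and an add are needed. */
uint16_t mul_div255_shifts(unsigned a, unsigned b)
{
    assert(a <= 255 && b <= 255);
    uint16_t t = (uint16_t)(a * b + 0x80u);  /* at most 65153, fits in 16 bits */
    t = (uint16_t)(t + (t >> 8));            /* first shift, then the add */
    return (uint16_t)(t >> 8);               /* second shift */
}

/* Same result with an unsigned multiply-high: keep the upper 16 bits
 * of the 32-bit product t * 0x0101. */
uint16_t mul_div255_mulhi(unsigned a, unsigned b)
{
    assert(a <= 255 && b <= 255);
    uint16_t t = (uint16_t)(a * b + 0x80u);
    return (uint16_t)(((uint32_t)t * 0x0101u) >> 16);
}

/* Count the input pairs where the two variants disagree with each other
 * or with the exactly rounded quotient. */
int count_mismatches(void)
{
    int bad = 0;
    for (unsigned a = 0; a < 256; a++) {
        for (unsigned b = 0; b < 256; b++) {
            uint16_t exact = (uint16_t)((2u * a * b + 255u) / 510u);
            uint16_t s = mul_div255_shifts(a, b);
            uint16_t m = mul_div255_mulhi(a, b);
            if (s != m || s != exact)
                bad++;
        }
    }
    return bad;
}
```

Calling count_mismatches() from a small test program is a quick way to convince yourself that swapping one form for the other does not change any pixel values.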

I realized that the expand_alpha, expand_alpha_rev, and invert_colors functions that mix and duplicate pixel components could be reduced from a combined total of around 30 instructions to a single instruction each. This change reduced object code size by another 9%.
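The single instruction in question is a four-lane 16-bit shuffle. The sketch below is a scalar model of such a shuffle in portable C, not the Loongson or MMX instruction itself. The lane order (blue in the lowest lane, alpha in the highest) and the selector constants are assumptions made for illustration, and shuffle16 is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a four-lane 16-bit shuffle: result lane i takes the
 * source lane named by bits [2i+1:2i] of the 8-bit selector. */
uint64_t shuffle16(uint64_t v, unsigned selector)
{
    assert(selector <= 0xFFu);
    uint64_t out = 0;
    for (unsigned lane = 0; lane < 4; lane++) {
        unsigned src = (selector >> (2 * lane)) & 3u;
        uint64_t value = (v >> (16 * src)) & 0xFFFFu;
        out |= value << (16 * lane);
    }
    return out;
}

/* Assumed lane order, low to high: B, G, R, A. */

/* Copy lane 3 (alpha) into every lane. */
uint64_t expand_alpha(uint64_t pixel)     { return shuffle16(pixel, 0xFFu); }

/* Copy lane 0 into every lane. */
uint64_t expand_alpha_rev(uint64_t pixel) { return shuffle16(pixel, 0x00u); }

/* Swap lanes 0 and 2, leaving lanes 1 and 3 in place. */
uint64_t invert_colors(uint64_t pixel)    { return shuffle16(pixel, 0xC6u); }
```

Each helper is one shuffle with a fixed selector, which is why the unpack-and-shift sequences they replace can collapse to a single instruction on hardware that has one.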

After that, I focused on eliminating unnecessary copies to and from the vector registers. Consider this code:

__m64 vsrc = load8888 (*src);

The code loads *src into an integer register, and then load8888 loads and expands the value into a vector register. Instead, it's simpler and faster to load from memory into a vector register directly. By counting the number of dmfc1 (doubleword move from floating-point) and dmtc1 (doubleword move to floating-point) instructions I could determine which functions had room for improvement.
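As a rough illustration of what "expands" means here, the following portable C sketch models the unpacking step only. It assumes that load8888 widens each 8-bit channel of a 32-bit pixel into its own 16-bit lane. It says nothing about which register file the value travels through, which is the actual cost being counted above. expand8888 and pack8888 are hypothetical names, not pixman functions.

```c
#include <assert.h>
#include <stdint.h>

/* Widen the four 8-bit channels of a 32-bit pixel into four 16-bit
 * lanes, zero-extended. Assumed to mirror the "expand" step. */
uint64_t expand8888(uint32_t pixel)
{
    uint64_t out = 0;
    for (unsigned i = 0; i < 4; i++) {
        uint64_t channel = (pixel >> (8 * i)) & 0xFFu;
        out |= channel << (16 * i);
    }
    return out;
}

/* Inverse step: narrow four 16-bit lanes back to bytes. This simplified
 * model requires every lane to fit in 8 bits already. */
uint32_t pack8888(uint64_t lanes)
{
    assert((lanes & 0xFF00FF00FF00FF00ULL) == 0);
    uint32_t out = 0;
    for (unsigned i = 0; i < 4; i++) {
        uint32_t channel = (uint32_t)((lanes >> (16 * i)) & 0xFFu);
        out |= channel << (8 * i);
    }
    return out;
}
```

On real hardware the widening is an unpack against a zero register, so if the pixel is loaded straight into a vector register, no move between register files is needed at all.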

After reducing the number of unnecessary copies and adding a number of other optimizations (list available here) I was ready to see if the Yeeloong was more usable.

Results gathered from cairo's perf-trace tool confirm the real-world performance improvements given by the pixman optimizations. The image columns show the time in seconds to complete a cairo-perf-trace workload when using 32 bits per pixel and likewise image16 for 16 bits per pixel. The first column in both image and image16 groupings is the time to complete the workload without using Loongson MMI code. The second column is time to complete the workload after pixman commit c78e986085, the commit that turns on the Loongson MMI code. The third column is the time to complete the workload with pixman-0.25.6 which has many more optimizations.

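Workload	image (32 bpp), seconds				image16 (16 bpp), seconds
	no MMI	c78e986085	0.25.6	change vs. no MMI	no MMI	c78e986085	0.25.6	change vs. no MMI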
evolution	32.985	29.667	28.752	14.7% faster	27.314	23.870	22.960	19.0% faster
firefox-planet-gnome	197.982	180.437	169.532	16.8% faster	220.986	205.057	199.077	11.0% faster
gnome-terminal-vim	60.799	50.528	50.792	19.7% faster	51.655	44.131	43.561	18.6% faster
gvim	38.646	32.552	33.570	15.1% faster	38.126	34.453	35.457	7.5% faster
ocitysmap	23.065	18.057	17.516	31.7% faster	23.046	18.055	17.543	31.4% faster
poppler	43.676	36.077	35.498	23.0% faster	43.065	36.090	35.534	21.2% faster
swfdec-giant-steps	20.166	20.365	20.469	no change	22.354	16.578	14.473	54.4% faster
swfdec-youtube	31.502	28.118	24.168	30.3% faster	44.052	41.771	38.577	14.2% faster
xfce4-terminal-a1	69.517	51.288	50.838	36.7% faster	62.225	53.309	44.297	40.5% faster

May 29th edit: the % faster numbers were previously calculated as the percent difference between the initial and final workload times. I realized that this is not strictly a measure of how much faster the code is. To calculate that, the new formula is (1/final - 1/initial) / (1/initial), which gives the percent difference in terms of operations/second and simplifies to initial/final - 1. This number is % faster. The table has been updated accordingly.
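The two calculations are easy to confuse, so here is a small C sketch of both. The function names are invented for illustration. The throughput-based one is written so that a shorter final time gives a positive result.

```c
#include <assert.h>

/* Percent reduction in run time: the figure the table used originally. */
double percent_less_time(double initial_s, double final_s)
{
    assert(initial_s > 0.0 && final_s > 0.0);
    return (initial_s - final_s) / initial_s * 100.0;
}

/* Percent increase in throughput (workloads per second): "% faster".
 * Algebraically this equals (initial_s / final_s - 1) * 100. */
double percent_faster(double initial_s, double final_s)
{
    assert(initial_s > 0.0 && final_s > 0.0);
    return (1.0 / final_s - 1.0 / initial_s) / (1.0 / initial_s) * 100.0;
}
```

For the evolution trace on image (32.985 s down to 28.752 s), percent_less_time gives about 12.8 while percent_faster gives about 14.7, the value shown in the table.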

As the results show, real-world performance is improved by the Loongson MMI code. I can tell a difference when using GNOME 3 (in fallback mode) on my Yeeloong.

So far this has been very successful. I've optimized pixman on an interesting platform, learned a new instruction set, and in the process found many opportunities to optimize the MMX code on x86 and ARM. I still see a bunch of things to work on with just these compositing operations alone. Beyond that, there are many other things to do like bilinear and nearest scaling functions (which are extremely important for Firefox performance). And beyond that, I've improved my understanding of pixman's code and have a few ideas for improvements in general.

Thanks to

Danny Clark, who runs freedomincluded.com, for providing me with a Lemote Yeeloong laptop for my work on Gentoo's MIPS port.
Søren Sandmann and Siarhei Siamashka for reviewing and helping me improve my code.

17 May 2012 – Tags: freedesktop gentoo linux loongson mips pixman xorg yeeloong

New multilib N32 Gentoo MIPS Stages

Gentoo/MIPS has been in, well, not great shape for quite some time. When I was going through Gentoo recruitment, there were no stages (used for installing Gentoo) newer than 2008, so this was one of the main things I wanted to improve, specifically by creating new N32 ABI stages. Even though the N32 (meaning New 32-bit) ABI was introduced in IRIX in 1996 to replace SGI's o32 (Old 32-bit) ABI, Linux support for N32 lagged behind until the last few years. Now, I'm pleased to unofficially announce new multilib N32 stages and that we'll be supporting N32 as the preferred ABI.

MIPS has three main ABIs: o32 (32-bit integers and pointers), N32 (64-bit integers, 32-bit pointers), and N64 (64-bit integers and pointers). Compared with N32 and N64, o32 is very restrictive: very few function arguments are passed in registers, only half of the floating-point registers are usable, and there is no native 64-bit integer datatype and no long double type (see SGI's MIPSpro N32 ABI Handbook for details). Offering N32 as the default ABI means better performance, sometimes 30% better, just by removing the unnecessary restrictions a 32-bit ABI imposes on 64-bit CPUs. Providing multilib stages (i.e., stages with glibc and gcc built for all three ABIs) gives users the flexibility to switch to another ABI relatively easily if desired, while also allowing them to reduce build times by switching to an N32-only profile.
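On a multilib system it is handy to check which ABI a given build actually targets. The sketch below relies on the _MIPS_SIM, _ABIO32, _ABIN32, and _ABI64 macros that gcc predefines on MIPS. The function names are made up for illustration, and on any other architecture the first function simply reports "not MIPS".

```c
#include <stddef.h>

/* Name of the MIPS ABI this file was compiled for, based on the
 * compiler's predefined macros. */
const char *mips_abi_name(void)
{
#if defined(__mips__) && defined(_MIPS_SIM)
#  if defined(_ABIO32) && _MIPS_SIM == _ABIO32
    return "o32";
#  elif defined(_ABIN32) && _MIPS_SIM == _ABIN32
    return "n32";
#  elif defined(_ABI64) && _MIPS_SIM == _ABI64
    return "n64";
#  else
    return "unknown MIPS ABI";
#  endif
#else
    return "not MIPS";
#endif
}

/* Pointer width in bits for the current build: 32 under o32 and N32,
 * 64 under N64. */
size_t pointer_bits(void)
{
    return sizeof(void *) * 8;
}
```

Compiling the same file with gcc's -mabi=32, -mabi=n32, and -mabi=64 options and printing both values shows the difference directly.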

The process of creating N32 (and especially multilib) stages wasn't straightforward. Our profiles were long unmaintained and in many cases totally broken. There were lots of keywording bugs open for mips, many where MIPS was the last team to complete the request, by years. There were actually some real bugs discovered too, like 354877 and 358149, usually caused by the incorrect assumption that the lib directory is always a symlink to lib32. All in all, I've reduced the number of open bugs for MIPS down to ~20.

Work needed to be done to catalyst, Gentoo's release building tool. Since the end of June, I've made 15 commits cleaning, fixing, and adding to the mips support code in catalyst. Other developers like Sebastian Pipping have also resumed work on a project that had otherwise been minimally maintained since the beginning of the year.

The last major component in reviving Gentoo's MIPS support is to create installation media, preferably in an automated manner. I've acquired two Broadcom BCM91250A MIPS development boards (and should be receiving a third soon), but they need disks, controllers, RAM, and cases. For that, I wrote a funding Proposal to build three MIPS development computers (pdf) and had it approved by the Gentoo Foundation. Things seem to be going well in acquisitions (track progress) so I hope to have the project completed in the next few months with the systems automatically building stages for a wide variety of MIPS systems.

Initially, I used a big-endian 2006.1 N32 stage and had to bootstrap my system with a series of at least 20 hacks (not a fun experience) until it was usable enough that I was able to build a clean N32 stage. From there, using crossdev I built a multilib toolchain, and with a few more hacks I was able to build a multilib stage.

With that in the past, I've been building stages that can be used to seed the automated stage creation system to come. At this point, my TODO list looks like this:

Big Endian
- multilib
  - (done) mips3 -mfix-r4000 -mfix-r4400 (for SGI Indy and Indigo2)
  - (done) mips4 (for SGI Indy and O2)
  - (done) r10k (for SGI Indigo2, Octane)
  - (done) mips64 (for Broadcom Sibyte systems)
- o32
  - mips32 (for embedded mips systems)
  - mips1 (for everything else)
Little Endian
- multilib
  - mips3 -Wa,-mfix-loongson2f-nop (for Loongson 2 systems)
  - mips4 (for Cobalt systems)
  - mips64 (for Loongson 3, Broadcom Sibyte systems)
- o32
  - mips32 (for embedded mips systems)
  - mips1 (for everything else)

The final touches will be to create bootable media like CD, USB, and netboot images.

All stages are available in the experimental/mips/stages/ directory (as soon as the files propagate) of a Gentoo Mirror.

Hopefully by the time I'm able to convince Lemote (or, who?) to send me a Loongson 3A laptop, installing and using Gentoo/MIPS will be a fun and pleasant experience.

02 August 2011 – Tags: gentoo linux mips

The Loongson 3A (Godson 3A) looks nice!

The Loongson 3A (or Godson 3A) is the successor to the Loongson 2F used in systems like the Lemote Yeeloong and Gdium Liberty 1000. According to the Chinese review site EXPreview, the first production run of Godson 3A CPUs has completed. (Specs and motherboard pictures below.)

The Loongson 3A/Godson 3A is a quad-core processor built on a 65 nm process. Running at 1 GHz, it has a power consumption of only 15 Watts. Each core has 64kB of L1 instruction cache and 64kB of L1 data cache, and the four cores share a 4MB L2 cache. The Godson 3A implements the MIPS64 architecture, which is a nice improvement over the previous Loongson 2F, which only implemented MIPS III with extensions. Also, it has 200 new instructions for speeding up x86 binary execution.

The server motherboard appears to be a two-way (so, 8 total cores!) board with 8 DIMMs (update: DDR2), 2 PCI, and 3 PCI-Express x4 (correction: two x8, one x4), 6 Serial ATA II, along with what looks to be standard PS/2, Parallel, VGA, USB, and Ethernet ports on the back.

The desktop motherboard has 4 DIMMs (update: DDR3) to go with the Godson 3A's two memory controllers, 2 PCI, 2 PCI-Express x1, 1 PCI-Express x16, 6 Serial ATA II, and on the back: sound, USB, Ethernet, VGA and DVI, and PS/2.

Both motherboards have an AMD 780E chipset and a SB710 southbridge. With an RS780E chipset, the integrated graphics would be a RadeonHD 3200, though I have read elsewhere that the Loongson 3A will be using the AMD 690E chipset and a Radeon X1250. The RadeonHD 3200 has an RV610 core running at 500 MHz with 40 unified shaders. The Radeon X1250 has an older and less impressive RS690 (a mobile R400-series) core at 400 MHz with 2 vertex shaders and 4 pixel shaders. Using AMD's John Bridgman's conversion of 4 unified shaders per vertex shader and 5 unified shaders per pixel shader, the X1250 has the equivalent of 28 unified shaders.

In comparison with current Lemote and Gdium offerings, these boards look fantastic. Being rid of the ancient Silicon Motion SM502/SM712 graphics is especially nice.

The EXPreview article continues by mentioning that a Godson 2G (or Dragon Heart 2G) processor is also in successful production. The 2G is a Godson 3A but with a reduced number of cores meant for laptops.

12 November 2010 – Tags: hardware mips

How to enable SSL-IRC-access to Freenode and OFTC with XChat

The two main IRC networks I use are Freenode and OFTC. I've always liked the idea (though I don't currently make use of it) of things like HTTPS Everywhere, and I remember being disappointed the last time I checked if major IRC networks supported SSL. For some reason, I checked tonight and found out they do! Here's how to set it up with XChat on Linux.

In XChat, edit the connection information for Freenode and OFTC in the Network List. Check Use SSL for all servers on this network.
Modify the Servers lines to be
```
irc.freenode.net/7000
```
```
irc.oftc.net/9999
```
as SSL-enabled connections use different ports than non-SSL IRC.

When you reconnect to the network, you'll see something like this in your connection log. Enjoy!

* Looking up irc.freenode.net
* Connecting to chat.freenode.net (128.237.157.136) port 7000...
* * Certification info:
*   Subject:
*     OU=Domain Control Validated
*     OU=Gandi Standard Wildcard SSL
*     CN=*.freenode.net
*   Issuer:
*     C=FR
*     O=GANDI SAS
*     CN=Gandi Standard SSL CA
*   Public key algorithm: rsaEncryption (2048 bits)
*   Sign algorithm sha1WithRSAEncryption
*   Valid since Jan 13 00:00:00 2010 GMT to Jan 13 23:59:59 2011 GMT
* * Cipher info:
*   Version: TLSv1/SSLv3, cipher DHE-RSA-AES256-SHA (256 bits)
* Connected. Now logging in...
* *** Looking up your hostname...
* *** Checking Ident
* *** No Ident response
* *** Found your hostname
* Welcome to the freenode Internet Relay Chat Network mattst88

27 October 2010 – Tags: howto irc

Fix for Creative ZEN V Plus won't connect: Code-10

I just spent the better part of my afternoon figuring out what the hell was wrong with my girlfriend's Creative ZEN V Plus MP3 player. It froze yesterday and hasn't been able to connect to our computers since. Every time she turned it on, it said Rebuilding Library, but it normally only does that during the boot immediately after it has frozen and had to be manually reset. It wouldn't connect in Linux, and in Windows the obnoxious balloon tips would report "MTP device found" again and again. In the Device Manager, it reported "Code 10: Device cannot start." Well, we finally found a fix.

Searching for zen v plus code 10 turns up 287,000 results including countless forum threads with no helpful replies. The most common suggestion is to uninstall the device in Device Manager, but trying this numerous times did nothing to help.

Many threads from Microsoft's social.answers.microsoft.com show the exact same problem reported over and over again, but no helpful responses.

After wasting hours searching the internet, we discovered the ZEN's Recovery Mode. You enter it by turning on the ZEN while holding down the Play button.

Selecting "Clean up" performs some magic which I cannot explain. Then select Reboot. When your ZEN comes back to life, it'll happily connect to your computer.

What an obnoxious problem. You'd think Creative would have written this article a few years ago, but they'd probably rather you junk your ZEN V Plus and hand over another hundred dollars for whatever their newest model is.

26 October 2010 – Tags: hardware howto