mattst88's blog

Reverse Engineering the PROM for the SGI O2

Since the early 2000s, the potential for upgrading the CPU in the Silicon Graphics O2 with a 900 MHz RM7900 has been blocked by the inability to modify the PROM firmware. To that end, I built ip32prom-decompiler, a program that decompiles the PROM into sources that can be reassembled into a bit-identical image. The decompiler goes to great lengths to produce assembly that is understandable and modifiable by replacing known constants, recognizing and replacing memory addresses with labels, inserting comments and function descriptions, marking function bounds, and many other niceties. In this article I'll describe the process of reverse engineering the structure and contents of the PROM so that I could build the decompiler.

Background

The Silicon Graphics O2 is a Unix workstation with a MIPS CPU.

There are two families of CPUs available for the O2:

in-order R5000 / RM7000 CPUs, from 180-350 MHz
out-of-order R10000 / R12000 CPUs, from 150-400 MHz

In the early 2000s, members of the Nekochan community replaced the 300 MHz RM5200 and 350 MHz RM7000A CPUs with a faster 600 MHz RM7000C model. The 600 MHz CPU, though in-order, is faster than the out-of-order 400 MHz R12000 CPU in most cases.

This modification is documented by SGI Depot in the article Upgrading an O2 to 600MHz (and beyond!). While replacing a BGA-mounted CPU takes significant tooling and expertise, the modification does not require any firmware or software changes.

The Problem

As the title of the article ("Upgrading an O2 to 600MHz (and beyond!)") might suggest, there were hopes of further upgrades. The article notes:

Meanwhile, Joe unfortunately did not have any success with the PMC 866Mhz CPU - apparently it is not quite as compatible with R5200 as PMC thought. Meanwhile, any ideas about a 900 are somewhat hampered by the need to have a distinctly modified IP32 PROM image, which would need some assistance from SGI. Who knows if they would be willing to help; one can but ask!

Watch this space!!

The 900 MHz CPU referred to is the RM7900 from PMC-Sierra. The RM7900 uses a newer E9000 CPU core but in a 304-pin BGA package compatible with earlier RM7000 CPUs. It is not clear to me what the 866 MHz CPU is — I can find no evidence of an 866 MHz MIPS CPU, RM7000 or otherwise.

Presumably any attempts to use an RM7900 failed without support in the O2's PROM firmware.

At the time, Silicon Graphics still existed and there remained some faint hope for access to the source code of the PROM (the boot firmware), but today Silicon Graphics is long gone, and with it the source code for the PROM, as well as any concerns about legal issues from reverse engineering!

The (partial) Solution

I reverse engineered the PROM firmware and wrote a program to decompile it into modifiable assembly (.S) files. The assembly files can be reassembled into a bit-identical PROM image, thus verifying that the decompilation was accurate.

With the PROM firmware now decompiled into modifiable assembly, the "distinctly modified IP32 PROM image" needed for RM7900 support is possible — no assistance from SGI required.

External Annotations

The assembly files are made more comprehensible with various annotations and other improvements to readability.

Filename	Purpose
labels.json	Named addresses for branch targets and data
comments.json	Per-instruction documentation
functions.json	Function boundaries and descriptions
operands.json	Instruction operand replacements
relocations.json	Code that executes at different addresses than stored
bss.json	Named BSS (uninitialized data) symbols

The resulting assembly:

Without improvements (first listing below), then with improvements (second listing):

L_0xbfc019b0:
    lui     $t1, 0xbfc0
    lui     $t0, 0xa000
    addiu   $t1, $t1, 0x19c8
    or      $t0, $t0, $t1
    jr      $t0
    nop



L_0xbfc019c8:
    mtc0    $zero, 5
    mtc0    $zero, 29
    addiu   $t1, $zero, 0x23
    nop
    mfc0    $t0, $t7
    andi    $t0, $t0, 0xff00
    srl     $t0, $t0, 8
    beq     $t0, $t1, 0xbfc01ae4
    nop
    addiu   $t1, $zero, 0x28
    beq     $t0, $t1, 0xbfc01ae4
    nop
    addiu   $t1, $zero, 0x27
    bne     $t0, $t1, 0xbfc01bbc
    nop
    addiu   $t0, $zero, 0x2f
    lui     $t1, 0x1000
L_0xbfc01a0c:
    addiu   $at, $zero, 0x1fff
    not     $t2, $at
    and     $t2, $t2, $t1
    lui     $at, 0x8000
    mtc0    $t0, 0
    or      $t2, $t2, $at
    mtc0    $t2, 10
    srl     $at, $t1, 0xc
    sll     $at, $at, 6
    ori     $at, $at, 0x11
    mtc0    $at, 2
    addiu   $t2, $at, 0x40
    mtc0    $at, 3
    addi    $t0, $t0, -1
    addiu   $t1, $t1, -0x2000
    bgtz    $t0, 0xbfc01a0c
    tlbwi
    mfc0    $t0, $s0
    addiu   $at, $zero, -0x1001
    and     $t0, $t0, $at
    addiu   $at, $zero, -9
    and     $t0, $t0, $at
    mtc0    $t0, 16
    mfc0    $t0, $s0
    srl     $t0, $t0, 9
    addiu   $t1, $zero, 0x1000
    andi    $t0, $t0, 7
    sllv    $t0, $t1, $t0
    addi    $t0, $t0, -0x20
    lui     $t1, 0x8000
    addu    $t2, $t0, $t1
L_0xbfc01a88:
    cache   0, ($t2)
    addi    $t0, $t0, -0x20
    bgez    $t0, 0xbfc01a88
    addu    $t2, $t0, $t1
    mfc0    $t0, $s0
    srl     $t0, $t0, 6
    addiu   $t1, $zero, 0x1000
    andi    $t0, $t0, 7
    sllv    $t0, $t1, $t0
    addi    $t0, $t0, -0x20
    lui     $t1, 0x8000
    lui     $at, 0x1000
L_0xbfc01ab8:
    addu    $at, $at, $t0
    srl     $at, $at, 0xc
    sll     $at, $at, 8
    mtc0    $at, 29
    addu    $t2, $t0, $t1
    addi    $t0, $t0, -0x20
    cache   9, ($t2)
    bgez    $t0, 0xbfc01ab8
    lui     $at, 0x1000
    jr      $ra
    nop
[...]

/* Function tlb_init_uncached_trampoline [0xbfc019b0 - 0xbfc019c8)
 *
 * Jump to tlb_init through uncached KSEG1
 */
tlb_init_uncached_trampoline: /* 0xbfc019b0 */
    lui     $t1, %hi(tlb_init)
    lui     $t0, HI(KSEG1)
    addiu   $t1, $t1, %lo(tlb_init)
    or      $t0, $t0, $t1
    jr      $t0     # Jump to (KSEG1 | tlb_init)
     nop

/* Function tlb_init [0xbfc019c8 - 0xbfc01d98)
 */
tlb_init: /* 0xbfc019c8 */
    mtc0    $zero, $CP0_PAGEMASK
    mtc0    $zero, $CP0_TAGHI
    li      $t1, PRID_IMP_R5000
    nop
    mfc0    $t0, $CP0_PRID
    andi    $t0, $t0, PRID_IMP_MASK
    srl     $t0, $t0, PRID_IMP_SHIFT
    beq     $t0, $t1, tlb_r5k_init
     nop
    li      $t1, PRID_IMP_NEVADA
    beq     $t0, $t1, tlb_r5k_init
     nop
    li      $t1, PRID_IMP_RM7000
    bne     $t0, $t1, tlb_r10k_init
     nop
    li      $t0, RM7000_NUM_TLB_ENTRIES-1
    lui     $t1, HI(0x0fffe000)
tlb_rm7k_write_tlb_loop: /* 0xbfc01a0c */
    li      $at, PAGE_OFFSET_MASK
    not     $t2, $at
    and     $t2, $t2, $t1
    lui     $at, HI(KSEG0)
    mtc0    $t0, $CP0_INDEX
    or      $t2, $t2, $at
    mtc0    $t2, $CP0_ENTRYHI
    srl     $at, $t1, PAGE_SHIFT
    sll     $at, $at, ENTRYLO_PFN_SHIFT
    ori     $at, $at, (ENTRYLO_G|ENTRYLO_C_UNCACHED)
    mtc0    $at, $CP0_ENTRYLO0
    addiu   $t2, $at, 1 << ENTRYLO_PFN_SHIFT
    mtc0    $at, $CP0_ENTRYLO1
    addi    $t0, $t0, -1
    addiu   $t1, $t1, LO(0x0fffe000)
    bgtz    $t0, tlb_rm7k_write_tlb_loop
     tlbwi
    mfc0    $t0, $CP0_CONFIG
    li      $at, ~RM7K_CONF_TE
    and     $t0, $t0, $at
    li      $at, ~CONF_CU
    and     $t0, $t0, $at
    mtc0    $t0, $CP0_CONFIG
    mfc0    $t0, $CP0_CONFIG
    srl     $t0, $t0, CONF_IC_SHIFT
    li      $t1, 0x1000
    andi    $t0, $t0, CONF_CACHE_SIZE_MASK
    sllv    $t0, $t1, $t0
    addi    $t0, $t0, -CACHE_LINE_SIZE
    lui     $t1, HI(KSEG0)
    addu    $t2, $t0, $t1
tlb_rm7k_inv_l1i_loop: /* 0xbfc01a88 */
    cache   (CACHE_TYPE_L1I|INDEX_WRITEBACK_INV), 0($t2)
    addi    $t0, $t0, -CACHE_LINE_SIZE
    bgez    $t0, tlb_rm7k_inv_l1i_loop
     addu   $t2, $t0, $t1
    mfc0    $t0, $CP0_CONFIG
    srl     $t0, $t0, CONF_DC_SHIFT
    li      $t1, 0x1000
    andi    $t0, $t0, CONF_CACHE_SIZE_MASK
    sllv    $t0, $t1, $t0
    addi    $t0, $t0, -CACHE_LINE_SIZE
    lui     $t1, HI(KSEG0)
    lui     $at, 0x1000
tlb_rm7k_inv_l1d_loop: /* 0xbfc01ab8 */
    addu    $at, $at, $t0
    srl     $at, $at, PAGE_SHIFT
    sll     $at, $at, RM7K_TAGHI_PTAG_SHIFT
    mtc0    $at, $CP0_TAGHI
    addu    $t2, $t0, $t1
    addi    $t0, $t0, -CACHE_LINE_SIZE
    cache   (CACHE_TYPE_L1D|INDEX_STORE_TAG), 0($t2)
    bgez    $t0, tlb_rm7k_inv_l1d_loop
     lui    $at, 0x1000
    jr      $ra
     nop
[...]

Reverse engineering the IP32 PROM

When I began this process, I knew only a tiny bit about firmware or the MIPS instruction set. I knew even less about the initialization process for MIPS CPUs.

First steps

A mailing list post in 2004 about the topic dissuaded others from reverse engineering the IP32 PROM due to the difficulty.

Modifying the binary is most assuredly way more difficult than gaining access to ip32PROM source and modifying it directly (and solving license issues). The level of change to the binary needed to make the ip32PROM detect a new CPU would require extremely detailed knowledge of the binary format the ip32PROM is in, SGI O2 systems, and how the PROM even functions. I'd wager a guess that a super-skilled SGI engineer might possibly pull this off, given enough caffeine.

I read this and wondered, how difficult could it actually be? It didn't seem like firmware from 1996 would be terribly complex.

I found a 512 KiB binary dump of the last version of the O2's PROM:

$ md5sum ip32prom.rev4.18.bin
c9725e036052cf1f3e6258eb9bc687fa  ip32prom.rev4.18.bin

And disassembled it:

$ mips64-unknown-linux-gnu-objdump -D -b binary -m mips -EB ip32prom.rev4.18.bin | head

ip32prom.rev4.18.bin:     file format binary


Disassembly of section .data:

00000000 <.data>:
       0:       10000011        b       0x48
       4:       00000000        nop
       8:       53484452        beql    k0,t0,0x11154

The first two instructions looked legitimate, but the third looked unlikely to be a real instruction.

Further inspection of the disassembly indicated that there were real functions:

[...]
    152c:       03e00008        jr      ra
    1530:       00000000        nop
    1534:       90820000        lbu     v0,0(a0)
    1538:       00001825        move    v1,zero
    153c:       24840001        addiu   a0,a0,1
    1540:       10400006        beqz    v0,0x155c
    1544:       00000000        nop
    1548:       90820000        lbu     v0,0(a0)
    154c:       24840001        addiu   a0,a0,1
    1550:       24630001        addiu   v1,v1,1
    1554:       5440fffd        bnezl   v0,0x154c
    1558:       90820000        lbu     v0,0(a0)
    155c:       03e00008        jr      ra
    1560:       00601025        move    v0,v1
[...]

The jr and nop at 152c and 1530 end a function, and the lbu at 1534 starts a new function by loading from the a0 (argument 0) register. The jr and move at 155c and 1560 return from the function and copy a value into v0 which holds the return value. (This function is strlen).

strings showed meaningful data as well:

$ strings ip32prom.rev4.18.bin | head -n2
SHDR
sloader

I recognized that the first string ("SHDR") matched the odd looking instruction from the initial disassembly:

       8:       53484452        beql    k0,t0,0x11154

0x53484452 is "SHDR".
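
A hunch like this is quick to check by reinterpreting the 32-bit word as big-endian bytes. A minimal sketch in Python (the language is an assumption made for these examples; the article does not say what was used for this step):

```python
# Reinterpret the suspicious "instruction" word as four big-endian bytes.
word = 0x53484452
magic = word.to_bytes(4, "big")
print(magic.decode("ascii"))  # SHDR
```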

SHDR

Could this stand for section/segment header? What info was contained in the header?

       0:   10000011    b   0x48
       4:   00000000    nop
       8:   53484452    # "SHDR"
       c:   00004000    # unknown data
      10:   07030100    # unknown data
      14:   736c6f61    # unknown data
      18:   64657200    # unknown data
      1c:   00000000    # unknown data
      20:   00000000    # unknown data
      24:   00000000    # unknown data
      28:   00000000    # unknown data
      2c:   00000000    # unknown data
      30:   00000000    # unknown data
      34:   312e3000    # unknown data
      38:   00000000    # unknown data
      3c:   8cb4693c    # unknown data
      40:   00000000    # unknown data
      44:   00000000    # unknown data
      48:   100000d7    b   0x3a8
      4c:   00000000    nop
[...]
     3a8:   401a6000    mfc0    k0,$12
     3ac:   001ad402    srl     k0,k0,0x10
     3b0:   335a0018    andi    k0,k0,0x18
     3b4:   235affe8    addi    k0,k0,-24
[...]

SHDR size

The header appeared to be bounded by a branch+delay slot in [0x00, 0x08) and [0x48, 0x50). These two chained branches lead to valid-looking code at 0x3a8.

That meant that the SHDR size was 72 bytes (including 8 bytes for the branch and delay slot before the "SHDR" magic number).

Strings

Interpreting the unknown data as ASCII found some additional strings:

736c6f61646572 is "sloader". It's followed by 25 zero bytes (null terminator included), so this could be the name of the section in a 32-byte field.
312e30 is "1.0". It's followed by 5 zero bytes (null terminator included), so this could be the version of the section in an 8-byte field.

That left bytes [0x0c, 0x14), [0x3c, 0x48) unknown.

The four bytes in [0x0c, 0x10) looked like they might be a single element, but I didn't know what 0x00004000 (16384) was.

The four bytes in [0x10, 0x14) were 07 03 01 00. The length of the string "sloader" is 7, and the length of the string "1.0" is 3. These were probably the string lengths of the name and version fields. I didn't know what the 0x01 and 0x00 bytes meant.

I had no idea what the data in [0x3c, 0x48) was.

I found that there were 5 instances of "SHDR" in the binary dump. The table below lists the name and version in each header along with their length fields (which matched the actual strings!).

Offset	Name	Name Len	Version	Version Len
`0x00000000`	sloader	7	1.0	3
`0x00004000`	env	3	1.0	3
`0x00004400`	post1	5	1.0	3
`0x00009200`	firmware	8	4.18	4
`0x00069200`	version	7	4.18	4
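
The search that produces a table like this can be sketched as follows. This is an illustration rather than the decompiler's actual code: `find_shdrs` is a hypothetical helper, and the field offsets (magic at 0x08, lengths at 0x10 and 0x11, name at 0x14, version at 0x34) are the guesses made so far. The demo input is the sloader header rebuilt from the words in the dump above.

```python
import struct

MAGIC = b"SHDR"

def find_shdrs(image: bytes):
    """Return (offset, name, version) for every SHDR candidate in image."""
    found = []
    pos = image.find(MAGIC)
    while pos != -1:
        start = pos - 8  # the magic sits 8 bytes into the header
        if start >= 0 and start % 4 == 0 and len(image) >= start + 0x3C:
            name_len, version_len = struct.unpack_from("BB", image, start + 0x10)
            name = image[start + 0x14:start + 0x14 + name_len]
            version = image[start + 0x34:start + 0x34 + version_len]
            found.append((start,
                          name.decode("ascii", "replace"),
                          version.decode("ascii", "replace")))
        pos = image.find(MAGIC, pos + 1)
    return found

# The 18 words of the sloader header, as listed in the dump above.
sloader_words = [
    0x10000011, 0x00000000, 0x53484452, 0x00004000, 0x07030100,
    0x736C6F61, 0x64657200, 0, 0, 0, 0, 0, 0,
    0x312E3000, 0x00000000, 0x8CB4693C, 0x00000000, 0x00000000,
]
demo = struct.pack(">18I", *sloader_words)
print(find_shdrs(demo))  # [(0, 'sloader', '1.0')]
```

Run over the full 512 KiB dump, a scan like this should report the five offsets in the table, though that has not been verified here.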

Section length

I recognized that the env section started at 0x00004000 — the same as the unknown [0x0c, 0x10) bytes in the sloader header. Was it the offset of the next SHDR? Or maybe the length of the current section?

Adding the unknown value to the offset of the current SHDR:

Offset	Section	[0x0c, 0x10)	Offset + unknown
`0x00000000`	sloader	`0x00004000`	`0x00004000`
`0x00004000`	env	`0x00000400`	`0x00004400`
`0x00004400`	post1	`0x00004d44`	`0x00009144`
`0x00009200`	firmware	`0x0005fffc`	`0x000691fc`
`0x00069200`	version	`0x00000388`	`0x00069588`

These unknown values looked to be the length of the current section, but maybe needed to be rounded up to the next 0x100?
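
The rounding hypothesis is easy to test against the numbers in the two tables. A small sketch (the `align_up` helper and `SECTIONS` list are names chosen for this example):

```python
def align_up(value: int, alignment: int = 0x100) -> int:
    """Round value up to the next multiple of alignment (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)

# (name, SHDR offset, value found in bytes [0x0c, 0x10)), from the tables above.
SECTIONS = [
    ("sloader", 0x00000000, 0x00004000),
    ("env", 0x00004000, 0x00000400),
    ("post1", 0x00004400, 0x00004D44),
    ("firmware", 0x00009200, 0x0005FFFC),
    ("version", 0x00069200, 0x00000388),
]

for (name, start, length), (_, next_start, _) in zip(SECTIONS, SECTIONS[1:]):
    end = start + length
    print(f"{name:8s} ends at {end:#07x}, rounds up to {align_up(end):#07x}, "
          f"next SHDR at {next_start:#07x}")
```

For every section that has a successor, the rounded-up end lands exactly on the next SHDR offset, which supports reading the field as a length followed by padding to a 0x100 boundary.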

Checksum

During this part of the investigation, I noticed that at the end of each section there was a bogus instruction, often preceded by a lot of zeros that looked to be padding.

[start of "sloader" section]
       0:   10000011    b   0x48
       4:   00000000    nop
       8:   53484452    # "SHDR" for sloader
[...]
[a lot of zeros — padding]
    3ffc:   15d0fa4f    bne t6,s0,0x293c        # Bogus instruction
[end of "sloader" section]

[start of "env" section]
    4000:   00000000    nop
    4004:   00000000    nop
    4008:   53484452    # "SHDR" for env
[...]
    43fc:   eba16bb0    swc2    $1,27568(sp)    # Bogus instruction
[end of "env" section]

[start of "post1" section]
    4400:   10000011    b   0x4448
    4404:   00000000    nop
    4408:   53484452    # "SHDR" for post1
[...]
    9140:   6c91c641    ldr s1,-14783(a0)       # Bogus instruction
[end of "post1" section]
    9144:   00000000    nop
[a lot of zeros — padding]

[start of "firmware" section]
    9200:   10000011    b   0x9248
    9204:   00000000    nop
    9208:   53484452    # "SHDR" for firmware
[...]
   691f8:   d1c38847    lld v1,-30649(t6)       # Bogus instruction
[end of "firmware" section]
   691fc:   00000000    nop

[start of "version" section]
   69200:   7f454c46    .word   0x7f454c46 # WTF?
   69204:   01020100    .word   0x1020100  # WTF?
   69208:   53484452    # "SHDR" for version
[...]
[a lot of zeros — padding]
   69584:   108fedea    beq a0,t7,0x64d30       # Bogus instruction
[end of "version" section]
   69588:   00000000    nop
[a lot of zeros — padding]
   69600:
[a lot of ones — padding]

Each of those bogus instructions ends exactly at the SHDR's offset plus the section length, so it is the last word of its section. Could they be checksums for the section? If they're checksums, how are they calculated?

Section	Checksum
sloader	`0x15d0fa4f`
env	`0xeba16bb0`
post1	`0x6c91c641`
firmware	`0xd1c38847`
version	`0x108fedea`
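
The location of each candidate checksum follows from the section start and length alone: it is the word at start + length - 4. A short sketch that recomputes those offsets and reads such a word from an image (`trailing_word` is a hypothetical helper; how the checksum itself is computed is still unknown at this point, so nothing here tries to verify it):

```python
import struct

def trailing_word(image: bytes, start: int, length: int) -> int:
    """Return the big-endian word that ends the section [start, start + length)."""
    (word,) = struct.unpack_from(">I", image, start + length - 4)
    return word

# (name, SHDR offset, section length), from the earlier tables.
SECTIONS = [
    ("sloader", 0x00000, 0x04000),
    ("env", 0x04000, 0x00400),
    ("post1", 0x04400, 0x04D44),
    ("firmware", 0x09200, 0x5FFFC),
    ("version", 0x69200, 0x00388),
]
checksum_offsets = {name: start + length - 4 for name, start, length in SECTIONS}
for name, offset in checksum_offsets.items():
    print(f"{name:8s} {offset:#07x}")
```

The computed offsets (0x3ffc, 0x43fc, 0x9140, 0x691f8, 0x69584) match the addresses of the bogus instructions in the listing above.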

SHDR checksum

The SHDRs had some weird looking numbers towards the ends as well. Might they be checksums as well?

Section	SHDR checksum
sloader	`0x8cb4693c`
env	`0x131811ae`
post1	`0xc516c9e5`
firmware	`0x82b4a297`
version	`0x012d56b7`

Section type

The remaining unknown bytes in the SHDRs were [0x12, 0x14), [0x40, 0x48). Their values for each SHDR are:

Section	0x12	0x13	[0x40, 0x44)	[0x44, 0x48)
sloader	1	0	`0x00000000`	`0x00000000`
env	0	0	`0x4175746f`	`0x4c6f6164`
post1	1	0	`0x00000000`	`0x00000000`
firmware	3	0	`0x81000000`	`0x00048e70`
version	0	8	`0x00000000`	`0x00000000`

From the names and small sizes of env and version I guessed that they did not contain code. sloader, post1, and firmware definitely did include code, and their SHDRs' initial instructions branched over their SHDRs to more code.

Section	Entry instructions
sloader	branch over SHDR
env	nop
post1	branch over SHDR
firmware	branch over SHDR
version	unknown¹

strings confirmed that env and version were almost entirely ASCII data. In fact, 0x4175746f / 0x4c6f6164 are ASCII for Auto / Load.

I suspected that the value in byte 0x12 was the section type, with the lowest bit indicating whether the section was code (1) or data (0).

The sloader, post1, and firmware sections began with branch instructions that jump over the SHDR. The env section began with two nop instructions. The version section began with data that I only came to understand much later.

The byte at 0x13 is 0 in all SHDRs other than version. This is padding to a 4-byte boundary.

Trailing 8 bytes

I didn't figure out what the trailing 8 bytes were until much later in the process, but here's what I did know at this point.

env didn't seem to have these bytes — as stated before the bytes immediately following env's SHDR are actual data that fit with the rest of the data in the section.
version contained zeros for these bytes, but the next 12 bytes were as well so it wasn't certain whether they were metadata or actual data.
sloader and post1 contained zeros in these bytes, but their initial branch instructions jumped just past these fields. It seemed pretty clear that they were some metadata.
firmware was the only one that seemed to clearly contain some meaningful metadata here (0x81000000 / 0x00048e70). Like sloader and post1, the initial branch instruction jumped just past these fields.

These findings seemed to correlate with the values in byte 0x12.

Name	0x12	[0x40, 0x44)	[0x44, 0x48)
sloader	1	`0x00000000`	`0x00000000`
env	0	N/A	N/A
post1	1	`0x00000000`	`0x00000000`
firmware	3	`0x81000000`	`0x00048e70`
version	0	N/A	N/A

It seemed that the 8 bytes were present only when the lowest bit in byte 0x12 was set, and that the second bit indicated something about the metadata.

Summary

Bytes	Field	sloader	env	post1	firmware	version
[0x00-0x08)	Entry instructions	branch over SHDR	nop	branch over SHDR	branch over SHDR	unknown²
[0x08-0x0c)	Magic number	"SHDR"	"SHDR"	"SHDR"	"SHDR"	"SHDR"
[0x0c-0x10)	Section Length	16384	1024	19780	393212	904
[0x10-0x11)	Name Length	7	3	5	8	7
[0x11-0x12)	Version Length	3	3	3	4	4
[0x12-0x13)	Section Type	1	0	1	3	0
[0x13-0x14)	Padding	0	0	0	0	8³
[0x14-0x34)	Name String	"sloader"	"env"	"post1"	"firmware"	"version"
[0x34-0x3c)	Version String	"1.0"	"1.0"	"1.0"	"4.18"	"4.18"
[0x3c-0x40)	SHDR Checksum	`0x8cb4693c`	`0x131811ae`	`0xc516c9e5`	`0x82b4a297`	`0x012d56b7`
[0x40-0x44)	Metadata #1	`0x00000000`	N/A	`0x00000000`	`0x81000000`	N/A
[0x44-0x48)	Metadata #2	`0x00000000`	N/A	`0x00000000`	`0x00048e70`	N/A
[end]	Section Checksum	`0x15d0fa4f`	`0xeba16bb0`	`0x6c91c641`	`0xd1c38847`	`0x108fedea`

Identifying Code

How

With the SHDRs mostly understood, I moved on to trying to understand the code.

It was evident that the code sections also included strings and other data. How could I programmatically identify what was code and what was data?

I turned to the Capstone disassembler — a small library with a simple interface capable of disassembling a large number of architectures' instruction sets, including MIPS.

In short, the decompiler part of the project began here with a program that essentially performed a breadth-first search of the code. It processed instructions, beginning with the first branch instruction in a code section, discovering more code in the process. If an address was reachable by a branch then it must be code and the program could search it for further branch targets.

The results were promising but unimpressive. Only around 10% of the binary was identified as code.
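
To make the idea concrete, here is a self-contained sketch of such a worklist search. It is not the decompiler's code: instead of Capstone it hand-decodes only a few opcodes (j, jal, jr, and the beq/bne/blez/bgtz family with their "likely" variants), and the names `control_flow` and `discover_code` are invented for this example. A real tool needs a full disassembler.

```python
import struct
from collections import deque

def control_flow(word: int, pc: int):
    """Return (targets, falls_through) for one MIPS instruction word."""
    opcode = word >> 26
    if opcode in (0x02, 0x03):  # j, jal
        target = ((pc + 4) & 0xF0000000) | ((word & 0x03FFFFFF) << 2)
        return [target], opcode == 0x03  # a jal eventually returns
    if opcode in (0x04, 0x05, 0x06, 0x07, 0x14, 0x15, 0x16, 0x17):
        imm = word & 0xFFFF
        if imm & 0x8000:
            imm -= 0x10000  # sign-extend the 16-bit offset
        target = (pc + 4 + (imm << 2)) & 0xFFFFFFFF
        is_b = opcode == 0x04 and ((word >> 16) & 0x3FF) == 0  # beq $0, $0
        return [target], not is_b
    if opcode == 0 and (word & 0x3F) == 0x08:  # jr
        return [], False
    return [], True

def discover_code(image: bytes, base: int, entry: int) -> set:
    scanned, code = set(), set()
    queue = deque([entry])
    last = base + len(image) - 4
    while queue:
        pc = queue.popleft()
        while base <= pc <= last and pc not in scanned:
            scanned.add(pc)
            code.add(pc)
            (word,) = struct.unpack_from(">I", image, pc - base)
            targets, falls_through = control_flow(word, pc)
            queue.extend(targets)
            if not falls_through:
                if pc + 4 <= last:
                    code.add(pc + 4)  # the delay slot still executes
                break
            pc += 4
    return code

# Tiny synthetic image: b +0x10, nop, two data words, jr $ra, nop.
words = [0x10000003, 0x00000000, 0x53484452, 0x00004000, 0x03E00008, 0x00000000]
image = struct.pack(">6I", *words)
found = discover_code(image, 0xBFC00000, 0xBFC00000)
print(sorted(hex(a) for a in found))
# ['0xbfc00000', '0xbfc00004', '0xbfc00010', '0xbfc00014']
```

The two data words in the middle are never reached, which is exactly the property that separates code from data in this approach. Targets that fall outside the image are silently dropped by the bounds check.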

Relative jumps versus (nearly) absolute jumps

I discovered that there were a number of reasons for this, with the most salient being that I didn't understand jump instructions. While branch instructions are relative, jump instructions provide a (nearly) absolute jump target.

For example, the unconditional branch instruction (b) here jumps 0x48 / 72 bytes, regardless of its location in memory.

       0:       10000011        b       0x48

The jump-and-link (jal) instruction (used for function calls) provides the low 28 bits of the jump target (26 bits encoded, shifted left by 2), with the high 4 bits coming from its own address in memory (strictly, from the address of the instruction in its delay slot).

     6e0:       0ff0023c        jal     0xfc008f0

This meant that the 0xfc008f0 target was missing its high 4 bits, and without those I couldn't find the function it was calling.

I realized I didn't actually know where execution began.

I picked up a copy of See MIPS Run and found it to be an invaluable resource in this process. In it I found:

The CPU responds to reset by starting to fetch instructions from 0xBFC0.0000. This is physical address 0x1FC0.0000 in the uncached kseg1 region.

I'd answered an important question (and discovered that I didn't have any idea about MIPS' different memory regions — another thing I'd need to learn).

With this knowledge in hand, I disassembled the binary again but this time with the --adjust-vma=0xbfc00000 flag. The two instructions from earlier now disassembled as:

bfc00000:       10000011        b       0xbfc00048

bfc006e0:       0ff0023c        jal     0xbfc008f0

A small change to the decompiler to tell Capstone the starting address resulted in it finding a lot more code.
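
The difference between the two encodings can be worked through by hand. This sketch computes both kinds of target from the raw instruction words shown above; `branch_target` and `jump_target` are names chosen for the example.

```python
def branch_target(word: int, pc: int) -> int:
    """PC-relative: sign-extended 16-bit offset, in words, from the delay slot."""
    imm = word & 0xFFFF
    if imm & 0x8000:
        imm -= 0x10000
    return (pc + 4 + (imm << 2)) & 0xFFFFFFFF

def jump_target(word: int, pc: int) -> int:
    """Region-absolute: 26-bit field shifted left by 2, top 4 bits from pc + 4."""
    return ((pc + 4) & 0xF0000000) | ((word & 0x03FFFFFF) << 2)

# Disassembled at offset 0, as in the first objdump run.
print(hex(branch_target(0x10000011, 0x00000000)))  # 0x48
print(hex(jump_target(0x0FF0023C, 0x000006E0)))    # 0xfc008f0

# Disassembled at the reset vector, as with --adjust-vma=0xbfc00000.
print(hex(branch_target(0x10000011, 0xBFC00000)))  # 0xbfc00048
print(hex(jump_target(0x0FF0023C, 0xBFC006E0)))    # 0xbfc008f0
```

The branch target simply shifts along with the load address, while the jump target is only correct once the top 4 bits of the program counter are right.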

Visualizing binary structure

Around this time, I realized that visualizing the binary structure could be useful.

I added support for emitting images in the simplest format I could find — XPM.

Here's what the post1 structure looked like:

Each row contained 128 pixels, with each pixel representing a 4-byte chunk of the binary image. 4-byte chunks work well because MIPS instructions are 4 bytes and are always naturally aligned.

Red is code. Blue is header and checksum. Black is 0x00000000. White is 0xffffffff. Gray is unknown.
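
XPM is simple enough that an emitter fits in a few lines. The sketch below is an illustration, not the decompiler's code: it takes one classification character per 4-byte word and writes an XPM3 image 128 pixels wide. The character codes, the exact RGB values, and the names `classify_word` and `to_xpm` are assumptions.

```python
import struct

COLORS = {
    "c": "#ff0000",  # code
    "h": "#0000ff",  # header and checksum
    "0": "#000000",  # 0x00000000
    "f": "#ffffff",  # 0xffffffff
    "u": "#808080",  # unknown
}

def classify_word(word: int) -> str:
    """Classify a word by value alone; code and headers are marked elsewhere."""
    if word == 0x00000000:
        return "0"
    if word == 0xFFFFFFFF:
        return "f"
    return "u"

def to_xpm(kinds: str, width: int = 128, name: str = "prom_map") -> str:
    """Render one pixel per character of kinds; each must be a key of COLORS."""
    rows = [kinds[i:i + width].ljust(width, "u")
            for i in range(0, len(kinds), width)]
    strings = [f"{width} {len(rows)} {len(COLORS)} 1"]
    strings += [f"{ch} c {rgb}" for ch, rgb in COLORS.items()]
    strings += rows
    body = ",\n".join(f'"{s}"' for s in strings)
    return f"/* XPM */\nstatic char *{name}[] = {{\n{body}\n}};\n"

data = struct.pack(">4I", 0x00000000, 0xFFFFFFFF, 0x53484452, 0x00000000)
kinds = "".join(classify_word(w) for (w,) in struct.iter_unpack(">I", data))
print(to_xpm(kinds, width=2))
```

Because the format is plain text with one character per pixel, adding a new category later is a matter of adding one entry to the color table.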

What was in the unknown areas?

Identifying Strings

strings indicated there was plenty of ASCII data in the binary, so I wrote some code to find it. It wasn't hard, but there were lots of corner cases to discover one by one.

The string data in sloader, post1, and firmware is always aligned to a 4-byte boundary. This was very convenient for finding the starting points and fit well with the existing visualization support.

Green is ASCII data.
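
A first approximation of such a string finder, relying on the 4-byte alignment noted above, might look like this. It is a simplified sketch (the function name and the `min_len` threshold are assumptions) and it ignores the corner cases the real code had to handle; for example, an instruction word whose bytes happen to be printable and are followed by a zero byte would be reported as a string.

```python
def find_aligned_strings(data: bytes, min_len: int = 4):
    """Return (offset, text) for NUL-terminated ASCII runs starting on a 4-byte boundary."""
    def printable(byte: int) -> bool:
        return 0x20 <= byte < 0x7F or byte in (0x09, 0x0A, 0x0D)

    results = []
    offset = 0
    while offset < len(data):
        end = offset
        while end < len(data) and printable(data[end]):
            end += 1
        if end - offset >= min_len and end < len(data) and data[end] == 0:
            results.append((offset, data[offset:end].decode("ascii")))
            offset = (end + 4) & ~3  # skip the terminator and padding
        else:
            offset += 4
    return results

sample = b"\x10\x00\x00\x11" + b"sloader\x00" + b"\x00" * 4 + b"1.0\x00"
print(find_aligned_strings(sample))             # [(4, 'sloader')]
print(find_aligned_strings(sample, min_len=3))  # [(4, 'sloader'), (16, '1.0')]
```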

Statically-unreachable functions

I could see valid instructions in the remaining unknown data. Static discovery had missed these functions for one of three reasons:

called via a jump table
called via a constructed address
actually dead code

I added the annotation system for providing external information about the firmware and added the functions' addresses to functions.json. These labels would prepopulate the code discovery queue.

Virtual Subsection

The remaining large chunk of unknown data was code. But when I added annotations for the functions' addresses, code discovery failed because the functions contained jumps to addresses that were outside of the ROM. For example:

$ mips64-unknown-linux-gnu-objdump -b binary -m mips -EB --adjust-vma=0xbfc00000 -D -d ip32prom.rev4.18.bin \
    | grep 'jal.*0xb000' \
    | head
bfc072a0:   0c0010e1    jal 0xb0004384
bfc072c0:   0c0011f4    jal 0xb00047d0
bfc072d8:   0c001125    jal 0xb0004494
bfc072f0:   0c0011f4    jal 0xb00047d0
bfc07300:   0c001125    jal 0xb0004494
bfc0731c:   0c0011f4    jal 0xb00047d0
bfc07330:   0c0011f4    jal 0xb00047d0
bfc07554:   0c0011f4    jal 0xb00047d0
bfc075fc:   0c0011f4    jal 0xb00047d0
bfc0763c:   0c0011f4    jal 0xb00047d0

It turned out (after a lot of assembly reading) that the post1 section contains a blob of code that is copied to RAM and executed at a different address (0xa0004000). Adding support for dealing with this was a lot of work.

Unreachable Code

There were a few stray pixels in the middle of the code sections. The nop instruction on MIPS is 0x00000000, so I knew that the black pixels were nop instructions — typically padding between functions. The .int 0x00000000 in this snippet is an unreachable padding nop.

F_0xbfc05098: /* 0xbfc05098 */
    cache   (CACHE_TYPE_L1I|INDEX_WRITEBACK_INV), 0($a0)
    nop
    jr      $ra
     nop
    .int    0x00000000

But there were also bits of unknown data in the middle of code. Here's an example from post1.S:

    beql    $t6, $t8, L_0xbfc05824
     addiu  $v1, $s1, 2
    b       L_0xbfc05894
     ori    $v0, $v1, 0x100
    .int    0x26230002
L_0xbfc05824: /* 0xbfc05824 */
    lbu     $t2, 2($s0)

Whatever the instruction was, it was definitely unreachable since it occurred after an unconditional branch.

I added a pass that inspected unknown data in the middle of code sections and marked them as code with an unreachable comment.

    beql    $t6, $t8, L_0xbfc05824
     addiu  $v1, $s1, 2
    b       L_0xbfc05894
     ori    $v0, $v1, 0x100
    addiu   $v1, $s1, 2     # unreachable
L_0xbfc05824: /* 0xbfc05824 */
    lbu     $t2, 2($s0)

It seems pretty clear that these unreachable instructions were the result of a compiler optimization that filled branch delay slots. The same addiu $v1, $s1, 2 instruction can be seen a few lines above in the delay slot of the beql instruction. Leaving these dead instructions behind looks like a (minor) compiler bug to me.

Accessed memory

I added a pass that marked memory addresses that were accessed by load and store instructions in yellow.

Remaining mysteries

`firmware` section

The firmware section accounts for 91% of the used portion of the PROM image (384 KiB of 422 KiB), and despite the success of decompiling the code in sloader and post1, the firmware section was still looking very sad. Here are the first 8 rows of the structure, with the rest not looking much different.

Looking at the small amount of successfully discovered code showed a jal instruction to an unknown address.

L_0xbfc092a4: /* 0xbfc092a4 */
    move    $a0, $s0
    jal     0xb1000370
     move   $a1, $s1
    b       L_0xbfc092a4
     nop

Absolute jumps like jal compose their target using the top four bits of their own address, which until now I'd assumed was 0xb (from 0xbfc00000). Clearly this must not be the case.

The firmware section is the only section with the 0x2 bit set in the section type field. I'd previously identified that this bit seemed related to the presence of meaningful-looking data in the 8 bytes immediately following the SHDR.

The first four bytes were 0x81000000. Maybe it was the address the code was expected to execute from?

This theory had merit for a few reasons:

the low 28 bits of jal's jump target were 0x1000370, which would work when executing from 0x81000000.
a virtual address of 0x81000000 is within the kseg0 virtual address space. kseg0 is unmapped (virtual addresses are simply translated to physical addresses by dropping the high 3 bits), so it doesn't require initializing the TLB. It's also (configurably) cached, which is probably desirable for the core part of the firmware.
the physical memory location would therefore be 0x01000000, or 16 MiB — within the O2's minimum memory configuration of 32 MiB.
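
The address arithmetic behind the last two points is short enough to check directly. A sketch, using the standard 32-bit MIPS segment boundaries (kseg0 at 0x80000000, kseg1 at 0xa0000000); the function name is invented for the example:

```python
def kseg_to_physical(vaddr: int) -> int:
    """Translate a kseg0 or kseg1 virtual address by dropping the top 3 bits."""
    if not 0x80000000 <= vaddr < 0xC0000000:
        raise ValueError(f"{vaddr:#x} is not in kseg0 or kseg1")
    return vaddr & 0x1FFFFFFF

firmware_phys = kseg_to_physical(0x81000000)
print(hex(firmware_phys), firmware_phys // (1024 * 1024), "MiB")  # 0x1000000 16 MiB
print(hex(kseg_to_physical(0xBFC00000)))                          # 0x1fc00000
```

The second line reproduces the reset vector mapping quoted from See MIPS Run earlier: virtual 0xbfc00000 in kseg1 is physical 0x1fc00000.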

I tried disassembling the firmware section with --adjust-vma=0x81000000. The jal now looked like

810000a8:       0c4000dc        jal     0x81000370

and better yet, at 0x81000370 there appeared to be a function.

[...]
8100035c:       27bd0018        addiu   sp,sp,24
81000360:       03e00008        jr      ra
81000364:       00000000        nop
        ...
81000370:       27bdffd8        addiu   sp,sp,-40
81000374:       afb00018        sw      s0,24(sp)
81000378:       00808025        move    s0,a0
8100037c:       afbf001c        sw      ra,28(sp)
81000380:       afa5002c        sw      a1,44(sp)
81000384:       0c4013a6        jal     0x81004e98
[...]

The second four-byte value was 0x00048e70 / 298608. I didn't recognize that this was the length until I happened to notice something odd:

[...]
81048e70:       81048e70        lb      a0,-29072(t0)
81048e74:       0000b290        .word   0xb290
[...]

The data at location 0x81048e70 was its own address?

Spidey senses tingling, I looked at what was at 0x81048e70 + 0xb290 + 8 (the size of this header) = 0x81054108.

81054108:       81054100        lb      a1,16640(t0)
8105410c:       0000bee0        .word   0xbee0

And again at 0x81054100 + 0xbee0 + 16 (the size of two headers) = 0x8105fff0.

8105fff0:       81000000        lb      zero,0(t0)
8105fff4:       00000000        nop

This time, however, we were at the very end of the section. The remaining 8 bytes of the section were the checksum (0xd1c38847) and four bytes of zeros to pad to a 256-byte boundary.

8105fff8:       d1c38847        lld     v1,-30649(t6)
8105fffc:       00000000        nop

So these pairs appeared to be an address and length with the last pair as a sentinel value with a length of zero.

Inspecting the contents of each of these subsections showed clear differences. The first subsection was code. The second was primarily strings with what looked to be jump tables (sequences of pointers into the code's virtual memory area). The third was more difficult. It still had some strings. It still had some pointers to the code. But whereas all the memory accesses to the second subsection were loads, there were loads and stores to the third.

It became apparent that these were the .text, .rodata, and (read-write) .data sections.

Subsection	Load Address	Length	Content
`.text`	`0x81000000`	`0x00048e70`	Executable code
`.rodata`	`0x81048e70`	`0x0000b290`	Read-only data (strings, tables)
`.data`	`0x81054100`	`0x0000bee0`	Read-write initialized data
sentinel	`0x81000000`	`0x00000000`	Zero length terminates parsing
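Walking this structure takes very little code. Here is a minimal sketch in Python, assuming the layout described above: big-endian address and length pairs, each followed by that many bytes of data, ending at a pair whose length is zero. The function name parse_subsections and the tiny test blob are invented for illustration; this is not the decompiler's actual parser.

```python
import struct


def parse_subsections(blob: bytes):
    """Split a firmware-style blob into (load_address, data) pairs.

    Assumed layout: repeated [address:u32][length:u32][length bytes of data],
    big-endian, terminated by a header whose length is zero.
    Returns the list of subsections and the offset just past the sentinel.
    """
    subsections = []
    offset = 0
    while True:
        address, length = struct.unpack_from(">II", blob, offset)
        offset += 8                          # step over the 8-byte header
        if length == 0:
            break                            # sentinel: stop parsing
        subsections.append((address, blob[offset:offset + length]))
        offset += length
    return subsections, offset


# Tiny synthetic blob (not real PROM data): two subsections and a sentinel.
blob = (
    struct.pack(">II", 0x81000000, 8) + b"\x00" * 8
    + struct.pack(">II", 0x81000008, 4) + b"abcd"
    + struct.pack(">II", 0x81000000, 0)
)
subsections, end = parse_subsections(blob)
for address, data in subsections:
    print(f"{address:#010x} {len(data):#x}")
```

In the real section, whatever follows the returned offset would be the checksum and padding.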

Presumably the firmware section was written in C, compiled to a static ELF binary, and then had its sections extracted and repacked into a simple but custom format.

Checksum

If the ultimate goal of the project was to make modifying the PROM possible, I'd need to be able to recalculate the checksums.

Fortunately it wasn't too hard to find the function that verified the checksum in sloader.

is_section_checksum_valid: /* 0xbfc01874 */
    lw      $t6, SHDR_OFFSET_SECTION_LEN($a0)   # $t6 = Load the length of the section
    [...]

    addiu   $v1, $a0, SHDR_SIZE                 # $v1 = address of end of SHDR
    addu    $a1, $a0, $t6                       # $a1 = address of end of section

    [...]

    move    $v0, $v1                            # $v0 = address of data to be checksummed

    [...]

     move   $a2, $zero                          # $a2 = checksum

    [...]

checksum_main_loop: /* 0xbfc018c4 */
    lw      $t9, 0($v0)                         # $t9 = word[0]
    lw      $t0, 4($v0)                         # $t0 = word[1]
    lw      $t1, 8($v0)                         # $t1 = word[2]
    addu    $a2, $a2, $t9                       # checksum += word[0]
    lw      $t2, 0xc($v0)                       # $t2 = word[3]
    addu    $a2, $a2, $t0                       # checksum += word[1]
    addiu   $v0, $v0, 0x10                      # word += 16
    addu    $a2, $a2, $t1                       # checksum += word[2]
    bne     $v0, $a1, checksum_main_loop        # branch while not at end
     addu   $a2, $a2, $t2                       # checksum += word[3]
checksum_done: /* 0xbfc018ec */
    jr      $ra
     sltiu  $v0, $a2, 1                         # return checksum == 0

A plain old two's complement checksum — add all the 32-bit words and negate, such that when the stored checksum is added the result is zero.

I verified that the SHDR checksum is calculated the same way. A funny implication is that the section checksum calculation doesn't need to consider the contents of the SHDR, because a valid checksum for the SHDR necessarily means that its contribution would be 0. We see this taken advantage of in is_section_checksum_valid by skipping the SHDR.
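Recomputing the checksum after a modification then comes down to a few lines. This is a minimal sketch in Python rather than the decompiler's code; the helper names are hypothetical, the payload is synthetic, and it assumes the data length is a multiple of four bytes with the checksum stored as one more big-endian word.

```python
import struct

MASK32 = 0xFFFFFFFF


def word_sum(data: bytes) -> int:
    """Sum big-endian 32-bit words modulo 2**32, as the addu loop does."""
    words = struct.unpack(f">{len(data) // 4}I", data)
    return sum(words) & MASK32


def compute_checksum(data: bytes) -> int:
    """Return the word that makes the sum of data plus checksum zero."""
    return -word_sum(data) & MASK32


def is_checksum_valid(data_with_checksum: bytes) -> bool:
    """Mirror of the PROM's check: valid when all words sum to zero."""
    return word_sum(data_with_checksum) == 0


# Synthetic payload, not real PROM contents.
payload = struct.pack(">3I", 0xDEADBEEF, 0x00000001, 0x12345678)
checksum = compute_checksum(payload)
section = payload + struct.pack(">I", checksum)
print(hex(checksum), is_checksum_valid(section))
```

Because a block with a valid checksum sums to zero, prepending such a block to another leaves the total unchanged, which is the property that lets the section check skip the SHDR.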

`version` SHDR

The version section's SHDR had three oddities compared with the others.

the initial bytes looked like garbage
the padding byte contained 8
there was data after the "version" string in the 32-byte name field

Initial bytes

The section didn't seem important for my purposes, so it wasn't until I was implementing support for recognizing addresses constructed by lui + addiu/ori pairs that I discovered what the initial bytes were.

Some values constructed weren't addresses but other useful values:

133333000 — a clock frequency
31536000 — the number of seconds in 365 days
0x53484452 — the "SHDR" magic value
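To show how such values fall out of instruction pairs, here is a minimal sketch in Python. It assumes the usual MIPS idiom of lui for the upper half followed by ori (zero-extended) or addiu (sign-extended) for the lower half. The immediates below are worked out from the values themselves, not copied from the PROM, and the function names are made up.

```python
def sign_extend16(value: int) -> int:
    """Interpret a 16-bit immediate as a signed integer."""
    return value - 0x10000 if value & 0x8000 else value


def lui_ori(upper: int, lower: int) -> int:
    """Value built by lui with `upper`, then ori with `lower` (zero-extended)."""
    return ((upper << 16) | lower) & 0xFFFFFFFF


def lui_addiu(upper: int, lower: int) -> int:
    """Value built by lui with `upper`, then addiu with `lower` (sign-extended)."""
    return ((upper << 16) + sign_extend16(lower)) & 0xFFFFFFFF


print(lui_ori(0x07F2, 0x8008))      # the clock frequency
print(lui_addiu(0x07F3, 0x8008))    # same value via addiu: upper half is one higher
print(lui_ori(0x01E1, 0x3380))      # seconds in 365 days
print(lui_ori(0x5348, 0x4452).to_bytes(4, "big"))
```

The addiu case is the one that needs care: when the low half has its top bit set, the upper immediate is one larger than the high half of the final value.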

This made me wonder if the initial bytes (0x7f454c46, 0x01020100) could be magic numbers.

A quick search revealed that 0x7f454c46 was the magic number for ELF binaries ("\x7fELF"). Running file on the extracted version section confirmed it, and I felt a bit silly for not realizing this sooner.

$ file version.bin
version.bin: ELF 32-bit MSB MIPS, MIPS-II (SYSV)

I looked up the structure of the ELF header, and found that the initial 16 bytes were the e_ident field.

#define EI_NIDENT (16)

typedef struct
{
  unsigned char e_ident[EI_NIDENT]; /* Magic number and other info */
  [...]
} Elf32_Ehdr;

It contained the ELF magic number and the 0x01020100 value, which I decoded as:

Ehdr->e_ident[EI_CLASS]   = ELFCLASS32;
Ehdr->e_ident[EI_DATA]    = ELFDATA2MSB;
Ehdr->e_ident[EI_VERSION] = EV_CURRENT;
Ehdr->e_ident[EI_OSABI]   = ELFOSABI_NONE;

The remaining bytes in e_ident are ABI version (byte 8) and padding (9..15). These bytes contained the "SHDR" magic number and the section length.
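The overlay is easy to see by unpacking the same 16 bytes both ways. This is a small Python sketch under one loud assumption: the last word, the section length, is a made-up placeholder (0x400), since the real value isn't shown here. The other three words are the ones quoted above.

```python
import struct

# First 16 bytes of the version section. The final word (section length)
# is a placeholder for illustration, not the real value.
header = bytes.fromhex("7f454c46" "01020100" "53484452" "00000400")

# ELF view: magic, then EI_CLASS, EI_DATA, EI_VERSION, EI_OSABI.
magic, ei_class, ei_data, ei_version, ei_osabi = struct.unpack_from(">4sBBBB", header, 0)

# SHDR view of bytes 8..15, which ELF treats as ABI version and padding.
shdr_magic, section_len = struct.unpack_from(">4sI", header, 8)

print(magic)
print(ei_class, ei_data, ei_version, ei_osabi)
print(shdr_magic, hex(section_len))
```

ELF readers ignore the padding bytes of e_ident, so the SHDR fields can live there without confusing tools like file.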

Value in padding byte

With the recognition that the SHDR and ELF header were overlaid, I checked what was in the ELF header at this address.

	SHDR		ELF
Bytes	Field	Interpretation	Field	Interpretation
0x12	Section Type	`0`, `SECTION_TYPE_DATA`	`e_machine`	(`0x08`, `EM_MIPS`)
0x13	Padding	`8`	`e_machine`	(`0x08`, `EM_MIPS`)

A perfect fit.

Data after `"version"` name string

Decoding the stray data in the name string was trivial at this point.

Ehdr->e_phoff     = 0x00000000;
Ehdr->e_shoff     = 0x00000244;
Ehdr->e_flags     = EF_MIPS_ARCH_2 | EF_MIPS_NOREORDER | EF_MIPS_PIC;
Ehdr->e_ehsize    = 52;
Ehdr->e_phentsize = 0;
Ehdr->e_phnum     = 0;
Ehdr->e_shentsize = 40;
Ehdr->e_shnum     = 8;
Ehdr->e_shstrndx  = 7;
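To see which ELF fields land where, the field offsets can be computed from the Elf32_Ehdr layout. The sketch below is Python, with field sizes taken from the ELF specification. It assumes (this is my guess at the SHDR layout, not something confirmed above) that the 32-byte name field starts right after the padding byte, at offset 0x14.

```python
import struct

# Elf32_Ehdr fields in order, with struct format codes (big-endian).
FIELDS = [
    ("e_ident", "16s"), ("e_type", "H"), ("e_machine", "H"),
    ("e_version", "I"), ("e_entry", "I"), ("e_phoff", "I"),
    ("e_shoff", "I"), ("e_flags", "I"), ("e_ehsize", "H"),
    ("e_phentsize", "H"), ("e_phnum", "H"), ("e_shentsize", "H"),
    ("e_shnum", "H"), ("e_shstrndx", "H"),
]

offsets = {}
position = 0
for name, fmt in FIELDS:
    offsets[name] = position
    position += struct.calcsize(">" + fmt)

# Assumption: the SHDR name field occupies the 32 bytes from 0x14 to 0x33.
NAME_START, NAME_END = 0x14, 0x34

for name, offset in offsets.items():
    where = "name field" if NAME_START <= offset < NAME_END else "earlier SHDR fields"
    print(f"{offset:#04x} {name:<12} {where}")
print("header size:", position)
```

Under that assumption the eight bytes of "version" plus its terminator would sit on e_version and e_entry, and e_phoff at 0x1c would be the first of the stray bytes, which matches the list above starting at e_phoff. The computed header size also agrees with the e_ehsize of 52.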

Conclusion

Reverse engineering the IP32 PROM turned out to be more tractable than the author of the mailing list post thought.

The firmware's structure — SHDRs, subsection headers, checksums — was relatively straightforward (in hindsight, at least), but it took small incremental steps over a long period of time to fully unmask.

Visualization was particularly helpful, not just for understanding but also providing motivation and a progress bar of sorts.

For a 512 KiB firmware image from 1996, the main challenge wasn't complexity but instead the sheer number of small details to get right.

Next steps

With the structure of the PROM fully understood, work turned to improving the decompiler's output.

The decompiler now produces assembly source files that reassemble into a bit-identical copy of the original ROM image — a strong confirmation that the PROM has been correctly understood. Today, with BSS variable names, function labels, and comments annotating the output, the firmware is sufficiently readable to understand its hardware initialization and boot process.

My hope is that this work is an important step towards a new CPU upgrade in the Silicon Graphics O2.

Current PROM structure

Here's the full structure of the PROM image, as generated by the decompiler at the time of this writing.

Red is code. Blue is header and checksum. Green is ASCII data. Yellow is accessed memory. Black is 0x00000000. White is 0xffffffff. Gray is unknown.

Footnotes

Later determined to be the magic number for ELF binaries ("\x7fELF")
Later determined to be the magic number for ELF binaries ("\x7fELF")
Later determined to be the EM_MIPS value for Elf32_Ehdr::e_machine

08 February 2026 – Tags: mips reverse-engineering sgi

GNOME 40 available in Gentoo

GNOME 40 was released at the end of March, and yesterday I added the last bits of it to Gentoo. You may not think that's fast, and you'd be right, but it's a lot faster than any GNOME release has been added to Gentoo that I can recall. I wasn't looking to become Gentoo's GNOME maintainer when I joined the team 18 months ago. I only wanted to use a GNOME release that was a little less stale. So how did I get here?

I asked about the GNOME 3.26 status when 3.28 and 3.30 were already out. Repeat that story until I got tired of waiting and added myself to the Gentoo/GNOME team and started updating glib... then I started updating mutter and gnome-shell... then I started updating everything...

— Matt Turner (@mattst88) May 1, 2021

GNOME has two major releases per year (in March and September), so to be more than two major releases behind is significant. At least two of my coworkers on the Mesa team at Intel switched to Gentoo for one reason or another, but ultimately switched back to their old distribution because Gentoo's GNOME packages were so out of date. That was pretty disappointing to hear, but I sympathized with them.

I maintain the X11/Wayland stack in Gentoo, and I think I do a good job of keeping on top of it. I make upstream releases of X packages and contribute to Mesa professionally so I'm often able to make the upstream and downstream changes at the same time.

But for GNOME I was just a user who happened to be a Gentoo Developer, so I started by just poking and asking if there was anything I could do to help. Unfortunately the answer was "no" nearly every time.

So I just watched and occasionally asked how things were going. And occasionally GNOME updates happened, but the gap between Gentoo and upstream never really closed. GNOME 3.26 was added to Gentoo, and before significant progress was made on adding 3.28 or 3.30 a new major version 3.32 was released upstream. It looked like we were just treading water.

What's worse, there were multiple unofficial overlays often providing newer versions of GNOME than what the ::gentoo repository contained. For reasons that were never clear to me, it seemed that none of the external overlay contributors (one of whom was a full Gentoo Developer!) were willing or able to collaborate with the Gentoo GNOME team.

I started small by adding new versions of GNOME packages and making pull requests on GitHub for more experienced GNOME team members to review. Unfortunately by this time, the GNOME team had only one active member.

I joined the GNOME team in October 2019 and worked around the edges, doing small version bumps of non-critical packages.

Since most of the GNOME packages were behind, I began adding the next major GNOME's glib to the tree to get extra testing. I figured if that additional testing caught issues before they could block the rest of GNOME from being updated that I could save us some time.

That worked out pretty well, and I felt a little more confident so I added the next major GNOME's mutter and gnome-shell. Kind of scary.

But that worked out well too. Users tested, filed bugs, and I fixed them. And since the most critical GNOME packages entered the ::gentoo repo long before the ancillary applications we didn't have any big surprises when it was time to ask for stabilization.

Initially I had no idea which packages were related or if there were particular problems to look out for. This knowledge existed only in the head of one Gentoo Developer, so as I squeezed it out of him (as I made mistakes and he let me know!) I began documenting it on the Wiki.

As I updated packages, I encountered various build system bugs. Gentoo naturally uncovers problems binary distributions don't notice. Whenever possible, I made a merge request upstream so that the next time we added a new version we wouldn't have to carry a patch. So far I've had 13 merge requests accepted!

Starting on March 20 I added the first bits of GNOME 40 to the tree (glib and some other packages are often released before the official release date). I added glib first, and then I figured I couldn't break anything too badly if I just bumped the GNOME games. I added gnome-shell (behind package.mask), and then sort of forgot that's where I normally stopped. Less than 8 weeks later, all of GNOME is entirely up to date in Gentoo!

The bookends of adding GNOME 40 are commits 71e9245b05e6 and b93e3e581161. In that time I made 610 commits. The vast majority are GNOME-related (511 of them by my count). Categorized, they are:

2 reverted commits (both mine)
229 commits adding new package versions
152 commits dropping old package versions
3 commits adding new packages
7 commits adding support for Python 3.9
118 miscellaneous commits fixing, cleaning, masking, unmasking

Those commits closed 120 bugs (and referenced 21 more), which made a nice dent in the Gentoo GNOME team's bug backlog. At the time of this writing, there are 514 bugs assigned to the GNOME team or with the GNOME team in the Cc list. By default, Bugzilla only shows 500 bugs on a single page, so the GNOME bug list doesn't even fit. That was a bit of a psychological hurdle for me to get started. It'll be a nice moment when we get to the other side of 500.

I hope that with the gap to upstream now closed that some other Gentoo Developers and users will be more willing to help contribute. GNOME fell behind in Gentoo because it was too much work for a single person to maintain sustainably. I've remedied the most glaring symptom of the situation but not the underlying problem. Reach out to me if you'd like to help!

Because it's fun to look at, here's the output of our gnome-bumpchecker.py tool, showing that we're indeed up-to-date on everything.

13 May 2021 – Tags: gentoo gnome linux

Combining constants in i965 fragment shaders

On Intel's Gen graphics, three-source instructions like MAD and LRP cannot have constants as arguments. When support for MAD instructions was introduced with Sandybridge, we assumed the choice between a MOV+MAD and a MUL+ADD sequence was inconsequential, so we chose to perform the multiply and add operations separately. Revisiting that assumption has uncovered some interesting things about the hardware and has led us to some pretty nice performance improvements.

On Gen 7 hardware (Ivybridge, Haswell, Baytrail), multiplies and adds without immediate value arguments can be co-issued, meaning that multiple instructions can be issued from the same execution unit in the same cycle. MADs, never having immediates as sources, can always be co-issued. Considering that, we should prefer MADs, but a typical vec4 * vec4 + vec4(constant) pattern would lead to three duplicate (four total) MOV imm instructions.

mov(8)  g10<1>F    1.0F
mov(8)  g11<1>F    1.0F
mov(8)  g12<1>F    1.0F
mov(8)  g13<1>F    1.0F
mad(8)  g40<1>F    g10<8,8,1>F   g20<8,8,1>F   g30<8,8,1>F
mad(8)  g41<1>F    g11<8,8,1>F   g21<8,8,1>F   g31<8,8,1>F
mad(8)  g42<1>F    g12<8,8,1>F   g22<8,8,1>F   g32<8,8,1>F
mad(8)  g43<1>F    g13<8,8,1>F   g23<8,8,1>F   g33<8,8,1>F

Should be easy to clean up, right? We should simply combine those 1.0F MOVs and modify the MAD instructions to access the same register. Well, conceptually yes, but in practice not quite.

Since the i965 driver's fragment shader backend doesn't use static single assignment form (it's on our TODO list), our common subexpression elimination pass has to emit a MOV instruction when combining instructions. As a result, performing common subexpression elimination on immediate MOVs would undo constant propagation and the compiler's optimizer would go into an infinite loop. Not what you wanted.

Instead, I wrote a pass that scans the instruction list after the main optimization loop and creates a list of immediate values that are used. If an immediate value is used by a 3-source instruction (a MAD or a LRP) or at least four times by an instruction that can co-issue (ADD, MUL, CMP, MOV) then it's put into a register and sourced from there.
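The selection rule can be sketched in a few lines. The real pass lives in Mesa's C++ backend; this is only a simplified Python model, where an instruction is a tuple of an opcode name and its immediate operands, and every name is invented for illustration.

```python
from collections import Counter

THREE_SOURCE = {"mad", "lrp"}
CO_ISSUABLE = {"add", "mul", "cmp", "mov"}


def immediates_to_promote(instructions, min_coissue_uses=4):
    """Pick immediates worth loading into a register.

    `instructions` is a list of (opcode, [immediate operands]) tuples, a
    simplified stand-in for the backend's real instruction list.
    """
    promote = set()
    coissue_uses = Counter()
    for opcode, immediates in instructions:
        for imm in immediates:
            if opcode in THREE_SOURCE:
                promote.add(imm)             # 3-source ops cannot take immediates
            elif opcode in CO_ISSUABLE:
                coissue_uses[imm] += 1       # count uses by co-issuable ops
    promote.update(imm for imm, n in coissue_uses.items() if n >= min_coissue_uses)
    return promote


program = [
    ("mad", [1.0]),
    ("add", [2.0]), ("mul", [2.0]), ("add", [2.0]), ("mov", [2.0]),
    ("add", [3.0]),
    ("send", [4.0]),
]
print(sorted(immediates_to_promote(program)))
```

In this toy program, 1.0 qualifies because a MAD uses it and 2.0 because four co-issuable instructions use it; 3.0 and 4.0 stay as immediates.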

But there's still room for improvement. Each general register can store 8 floats, and instead of storing 8 separate constants in each, we're storing a single constant 8 times (and on SIMD16, 16 times!). Fixing that wasn't hard, and it significantly reduces register usage – we now only use one register for each 8 immediate values. Using a special vector-float immediate type we can even load four floating-point values in a single instruction.
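The packing itself is simple index arithmetic. Here is a rough Python sketch of the idea, not the Mesa implementation: each distinct immediate gets a (register, component) slot, eight floats per register. The starting register number 10 and the function name are arbitrary choices for the example.

```python
def assign_slots(immediates, floats_per_register=8, first_register=10):
    """Map each distinct immediate to a (register, component) slot."""
    slots = {}
    # dict.fromkeys de-duplicates while keeping first-seen order.
    for index, imm in enumerate(dict.fromkeys(immediates)):
        register = first_register + index // floats_per_register
        component = index % floats_per_register
        slots[imm] = (register, component)
    return slots


constants = [1.0, 0.5, 2.0, 1.0, 0.25, 4.0, 8.0, 16.0, 32.0, 64.0]
slots = assign_slots(constants)
registers_used = len({register for register, _ in slots.values()})
print(registers_used, slots[1.0], slots[64.0])
```

Nine distinct constants fit in two registers here, where storing each constant replicated across a whole register would have taken nine.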

With that in place, we can now always emit MAD instructions.

I'm pretty pleased with the results. Without using the New Intermediate Representation (NIR), the shader-db results are:

total instructions in shared programs: 5895414 -> 5747578 (-2.51%)
instructions in affected programs: 3618111 -> 3470275 (-4.09%)

And with NIR (that already unconditionally emits MAD instructions):

total instructions in shared programs: 7992936 -> 7772474 (-2.76%)
instructions in affected programs: 3738730 -> 3518268 (-5.90%)

Effects on a WebGL microbenchmark

In December, I checked what effect my constant combining pass would have on a WebGL procedural noise demo. The demo generates an effect ("noise") that looks like a ball of fire. Its fragment shader contains a ton of instructions but no texturing operations. We're currently able to compile the program in SIMD8 without spilling any registers, but at a cost of scheduling the instructions very badly.

The effects the constant combining pass has on this demo are really interesting, and it actually gives me evidence that some of the ideas I had for the pass are valid, namely that co-issuing instructions is worth a little extra register pressure.

1.00x FPS of baseline – 3123 instructions – baseline
1.09x FPS of baseline – 2841 instructions – after promoting constants only if used by more than 2 MADs

Going from no-constant-combining to restricted-constant-combining gives us a 9% increase in frames per second for a 9% instruction count reduction. We're totally limited by fragment shader performance.

1.46x FPS of baseline – 2841 instructions – after promoting any constant used by a MAD

Going from step 2 to 3 though is interesting. The instruction count doesn't change, but we reduced register pressure sufficiently that we can now schedule instructions better without spilling (SCHEDULE_PRE, instead of SCHEDULE_PRE_NON_LIFO) – a 33% speed up just by rearranging instructions.

1.62x FPS of baseline – 2852 instructions – after promoting constants used by at least 4 co-issueable instructions

I was worried that we weren't going to be able to measure any performance difference from pulling constants out of co-issueable instructions, but we can definitely get a nice improvement here, of about 10% increase in frames per second.

As an aside, I did an experiment to see what would happen if we used SCHEDULE_PRE and spilled registers anyway (I added a couple of extra instructions to increase register pressure over the threshold). I changed the window size to 2048x2048 and rendered a fixed number of frames.

SCHEDULE_PRE with no spills: 17.5 seconds
SCHEDULE_PRE with 4 spills (8 send instructions): 17.5 seconds
SCHEDULE_PRE_NON_LIFO with no spills: 28 seconds

So there's some good evidence that the cure is worse than the disease. Of course this demo doesn't do any texturing, so memory bandwidth is not at a premium.

1.76x FPS of baseline – 2609 instructions – ???

I ran the demo to see if we'd made any changes in the last two months and was pleasantly surprised to find that we'd cut another 9% of instructions. I have no idea what caused it, but I'll take it! Combined with everything else, we're up to a 76% performance improvement.

Where's the code

The Mesa patches that implement the constant combining pass were committed (commit bb33a31c) and will be in the next major release (presumably version 10.6).

If any of this sounds interesting enough that you'd like to do it for a living, feel free to contact me. My team at Intel is responsible for the open source 3D driver in Mesa and is looking for new talent.

07 April 2015 – Tags: freedesktop intel linux mesa xorg

Laptop choices and aftermath

In November I was lamenting the lack of selection in credible Haswell-powered laptops for Mesa development. I chose the 15" MacBook Pro, while coworkers picked the 13" MBP and the System76 Galago Pro. After using the three laptops for a few months, I review our choices and whether they panned out like we expected.

	CPU	RAM	Graphics	Screen	Storage	Battery
13" MacBook Pro	2.8 GHz 4558U	16 GiB	GT3 - 1200 MHz	13.3" 2560x1600	512 GiB PCIe	71.8 Wh
15" MacBook Pro	2.0 GHz 4750HQ	16 GiB	GT3e - 1200 MHz	15.4" 2880x1800	256 GiB PCIe	95 Wh
Galago Pro	2.0 GHz 4750HQ	16 GiB	GT3e - 1200 MHz	14.1" 1920x1080	many options	52 Wh

15" MacBook Pro

The installation procedure on the MacBook was very simple. I shrank the HFS partition from OS X and installed rEFInd, before following the usual Gentoo installation.

Quirks and Annoyances

Running Linux on the MacBook is a good experience overall, with some quirks:

the Broadcom BCM4360 wireless chip is supported by a proprietary driver (net-wireless/broadcom-sta in Gentoo)
the high DPI Retina display often necessitates 150~200% zoom (or lots of squinting)
the keyboard causes some annoyances:
- the function keys operate only as F* keys when the function key is held, making common key combinations awkward (behavior can be changed with the /sys/module/hid_apple/parameters/fnmode file).
- there's no Delete key, and Home/End/Page Up/Page Down are function+arrow key.
- the power button is a regular key immediately above backspace. It's easy to press accidentally.
the cooling fans don't speed up until the CPU temperature is near 100 C.
no built-in Ethernet. Seriously, we've reinvented how many mini and micro HDMI and DisplayPort form factors, but we can't come up with a way to rearrange eight copper wires to fit an Ethernet port into the laptop?

Worst Thing: Insufficient cooling

The worst thing about the MacBook is the insufficient cooling. Even forcing the two fans to their maximum frequencies isn't enough to prevent the CPUs from thermal throttling in less than a minute of full load. Most worrying is that my CPU's core #1 seems to run significantly hotter under load than the other cores. It's always the first, and routinely the only, core to reach 100 C, causing the whole CPU package to be throttled until it cools slightly. The temperature gradient across a chip only 177 square millimeters is also troubling: frequently core #1 is 15 C hotter than core #3 under load. The only plausible conclusion I've come to is that the thermal paste isn't applied evenly across the CPU die. And since Apple uses tamper resistant screws I couldn't reapply the thermal paste without special tools (and probably voiding the warranty).

Best Thing: Retina display

I didn't realize how much the Retina display would improve the experience. Having multiple windows (that would have been close to full screen at 1080p) open at once is really nice. Being able to have driver code open on the left half of the screen, and the PDF documentation open on the right makes patch review quicker and more efficient. I've attached other laptops I've used to larger monitors, but I've never even felt like trying with the 15" MBP.

13" MacBook Pro

I consider the 13" MacBook Pro to be strictly inferior (okay, lighter and smaller is nice, but...) to the 15". Other than the obvious differences in the hardware, the most disappointing thing I've discovered about it is that the 13" screen isn't really big enough to be comfortable for development. The coworker that owns it plugs it into his physically larger 1080p monitor when he gets to the office. For a screen that's supposed to be probably the biggest selling point of the laptop, it's not getting a lot of use.

As I mentioned, I'm perfectly satisfied with the 15" screen for everyday development.

System76 Galago Pro

I used the Galago Pro for about three weeks before switching to the 15" MacBook. In total it's a really compelling system, except for the serious lack of attention to detail.

Quirks and Annoyances

although it has built-in Ethernet (yay!), the latch mechanism will drive you nuts. Two hands are necessary to unplug an Ethernet cable from it, and three are really recommended.
the single hinge attaching the screen feels like a failure point, and the screen itself flexes way too much when you open or close the laptop.
all three USB ports are on the right side, which can be annoying if you want to use a mouse, which you will, because...
the touchpad doesn't behave very well. In fairness, this is probably mostly the fault of the synaptics driver or the default configuration.

Worst Thing: Keyboard

The keyboard is probably the worst part. The first time I booted the system, typing k while holding the shift key wouldn't register a key press. Lower case k typed fine, but with shift held – nothing. After about 25 presses, it began working without any indication as to what changed.

The key stroke is very short, you get almost no feedback, and if you press the keys at an angle slightly off center they may not register. Typing on it can be a rather frustrating experience. Beyond it being a generally unpleasant keyboard, the function key placement confirms that the keyboard is a complete afterthought: Suspend is between Mute and Volume Down. Whoops!

Best Thing: Cooling

The Galago Pro has an excellent cooling system. Its fans are capable of moving a surprising amount of air and don't make too much noise doing it. Under full load, the CPU's temperature never passed 84 C – 16 C cooler than the 15" MBP (and the MBP doesn't break 100 C only because it starts throttling!). On top of not scorching your lap during compiles, the cooler temperatures mean the CPU and GPU are going to be able to stay in turbo mode longer and give better performance.

Final thoughts

Concerns about the keyboard and general build quality of the Galago Pro turned out to be true. I think it's possible to get used to the keyboard, and if you do I feel confident that the system is really nice to use (well, I guess you have to get used to the other input device too).

I'm overall quite happy with the MacBook Pro. The Retina display is awesome, and the PCIe SSD is incredibly fast. I was most worried about the 15" MacBook overheating and triggering thermal throttling. Unfortunately this was well founded. Other than the quirks, which are par for the course, the overheating issue is the one significant downside to this machine.

19 March 2014 – Tags: freedesktop intel linux mesa

Difficulty in Finding a Good Development Laptop

When I started working at Intel last year on the open source 3D driver I was given a spare Lenovo T420s (Sandybridge) as my development machine. Almost everyone on my team had upgraded to Ivy Bridge by February, but I planned just to hold out a few months until Haswell was released. I then spent all summer wondering where the Haswell laptops were, and only now, five months later has Lenovo released Thinkpads with Haswell. It's time for a new development machine, and after months of research the only conclusion I've come to is that it's really hard to find a good laptop for my (admittedly strange) case.

I use my development laptop for undemanding tasks like text editing, reading documentation, email, patch review, but also things that benefit greatly by fast multicore hardware: compiling Mesa, running piglit, and compiling large sets of real world GLSL shaders. All of these are parallel tasks that see linear speed ups given additional CPU cores. Spending less time waiting for a compile or a run of the test suite to finish means I can test changes more quickly and do my work more efficiently.

Given these uses, my requirements are a quad-core Haswell laptop with a large resolution screen (greater than 1080p), GT3e graphics, and at least 8 GiB of RAM, on a budget of $2000 (less than this is obviously easier to justify to the people with the checkbook). I also have no use for dedicated graphics, and do not want it if it will cause any problems in using the Haswell GT.

I'm looking for a high end laptop with fast graphics, but without a discrete card. Sort of unsurprisingly, this is hard to find.

13" – 14" laptops

An early favorite was System76's rebranded Galago Pro. It looks amazing on paper. I can configure a 14" quad-core GT3e system with 16 GiB and two disks for around $1500. Unfortunately many owners have said the keyboard was awful. The "keyboard is literally the worst keyboard I've ever used in my life", said one reviewer, with other descriptions ranging from "attrocious", to "junk", to "a bit rubbish"; and that it was so bad that they had to return the system. I consider such a bad keyboard a deal breaker. To add to the misery, the touchpad is apparently equally terrible. Who needs to use input devices anyway?!

The Haswell successor of my T420s, the T440s, starts in price from $1419 to $1870. It contains a lower-end i5 4200U by default with the option to upgrade to a 4600U (for an additional $270). Those prices also get you only 4 GiB of RAM. Adding an extra 4 GiB SODIMM costs an additional $80; adding an 8 GiB SODIMM costs an additional $210! 16 GiB of RAM isn't even an option.

An interesting option is Lenovo's Yoga 2 Pro. Its top selling point is its awesome 3200x1800 13.3" screen. But other than that it's not super impressive. Having only a 1.8 GHz CPU leaves me wondering how much extra time I'll spend over the life of the laptop waiting for piglit test runs to finish than if I'd gotten a faster CPU.

Apple products aren't usually compelling to the Linux user in me, but the 13" MacBook Pro is a strong option based on its specifications. It starts in price from $1299 to $1799 (depending on the size of the SSD) and has a great 2560x1600 (16:10 aspect ratio!) Retina display. It offers a 2.8 GHz 4558U CPU with GT3 (no e) graphics and a PCIe-based SSD which according to reviewers has read and write speeds of 700 MB/sec! I've read that MacBooks often have cooling problems, but according to Notebookcheck.net's review the CPU didn't throttle after an hour at maximum load.

ASUS's unreleased Zenbook Infinity UX301LA seems compelling as well. Notebookcheck.net's review says that its price will be higher than anything else I've spec'd so far, and outside of my stated budget at $2450. It does have some really nice features to attempt to justify the price: 2.8 GHz 4558U CPU and GT3 graphics, 2560x1440 (16:9) screen, and strangely two 128 GB SSDs in RAID 0. The SSDs use some exotic connector I've never heard of, which worries me, and in general the potential for data loss in RAID 0 does too. According to the review, this laptop is designed to compete with the MacBook Air rather than the Pro, but offers higher performance than the Air. I'd be worried about throttling with this system (as I would with the Air) and the review confirms this – in testing the CPU throttled down to only 1.2 GHz after a few minutes under load.

13" – 14" dual-core laptops
	CPU	RAM	Graphics	Screen	Storage	Battery	Price
MacBook Pro 13"	2.8 GHz 4558U	8 GiB	GT3 - 1200 MHz	13.3" 2560x1600	512 GiB PCIe	71.8 Wh	$1999
T440s	2.1 GHz 4600U	8 GiB	GT2 - 1100 MHz	14.0" 1920x1080	256 GiB SATA	23.2 Wh + 72 Wh	$2074
Yoga 2 Pro	1.8 GHz 4500U	8 GiB	GT2 - 1100 MHz	13.3" 3200x1800	512 GiB SATA	54? Wh	$1599
Zenbook UX301LA	2.8 GHz 4558U	8 GiB	GT3 - 1200 MHz	13.3" 2560x1440	256 GiB (RAID 0)	50 Wh	$2450

In comparison with the MacBook Pro I can immediately remove the T440s and Zenbook from consideration based on price (and for the latter, that it's not yet available). I think the MacBook Pro is a better choice than the Yoga 2 because I believe that the difference in price is more than worth the jump from a 1.8 GHz to a 2.8 GHz CPU, twice the graphics execution units, and the amazingly fast PCIe SSD.

I still haven't met my goals of a quad-core CPU or GT3e graphics (GT3e is only paired with quad-core CPUs). With the exception of the Galago Pro (why oh why don't you just have a decent keyboard?) these features seem to only be available on 15" laptops.

15" laptops

Apple's 15" MacBook Pro offers quad-core CPUs and GT3e graphics and has a 2880x1800 Retina display, satisfying all of my criteria. The price leaves no room for upgrades though, since it starts at $1999. But for that price you get a 2.0 GHz quad-core 4750HQ CPU, GT3e, 8 GiB of RAM, and a 256 GiB PCIe SSD. Compared with the 13", you're trading a higher frequency dual-core CPU for a lower frequency (but still higher than the Yoga 2) quad-core CPU, gaining 128 MiB of graphics eDRAM, and increasing the dimensions and weight (by a pound). Parallel tasks like compiling code and piglit test runs will take less time on the quad-core CPU.

One potential concern is heat dissipation and the potential for thermal throttling. It's hard to find objective reviews of the MacBooks (Notebookcheck.net's reviews are really good, but there's no review of the 15" MBP), much less reviews that consider thermal throttling. Most reviews seem to be of the model with Nvidia graphics anyway. The battery size is also really impressive at 95 Wh.

An option suggested to me is the Toshiba S55-A5358. It's a compelling option based on price; Newegg sells it for only $850. It has a quad-core 4700MQ CPU, GT2 graphics, a spinning 1 TiB disk, and a probably overly large 15.6" 1080p screen. Even though it lacks GT3(e) graphics, it still has the same thermal output as the 15" MacBook Pro with GT3e. Effectively, this means that graphics workloads (e.g., piglit runs) will take longer while still producing the same amount of heat per unit time, so the threat of thermal throttling is actually much worse. It's also a half-pound heavier than the 15" MacBook Pro, but perhaps more worrying is that its battery capacity is less than half that of the MacBook, only 43 Wh, while its CPU and GPU still consume the same amount of power as those in the MacBook.
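To put rough numbers on the battery concern, here is a minimal sketch that estimates runtime under sustained load as capacity divided by power draw. The battery capacities come from the comparison table; the 47 W load figure and the helper names are my own assumptions for illustration, and since both machines are assumed to draw the same power, the ratio between them does not depend on the exact wattage chosen.

```python
# Battery capacities in watt-hours, taken from the comparison table.
BATTERY_WH = {
    '15" MacBook Pro': 95.0,
    "Toshiba S55-A5358": 43.0,
}

# Assumption: both CPU packages draw the same sustained power under load.
# 47 W is an illustrative figure only; the ratio between the two laptops
# is the same for any positive value.
ASSUMED_LOAD_WATTS = 47.0


def hours_at_load(capacity_wh: float, load_watts: float) -> float:
    """Return the runtime in hours for a battery drained at a constant load."""
    if load_watts <= 0:
        raise ValueError("load_watts must be positive")
    return capacity_wh / load_watts


def runtime_table(load_watts: float = ASSUMED_LOAD_WATTS) -> dict:
    """Map each laptop name to its estimated runtime in hours."""
    return {name: hours_at_load(wh, load_watts) for name, wh in BATTERY_WH.items()}


if __name__ == "__main__":
    for name, hours in runtime_table().items():
        print(f"{name}: {hours:.2f} h at {ASSUMED_LOAD_WATTS:.0f} W")
```

This ignores the screen, disk, and everything else that draws power, so real runtimes would be shorter on both machines. It is only meant to show that at equal draw the Toshiba lasts less than half as long.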

As far as I know, there aren't any other laptops with GT3e graphics, which is largely the reason I'm considering 15" laptops at all.

15" quad-core laptops
	CPU	RAM	Graphics	Screen	Storage	Battery	Price
15" MacBook Pro	2.0 GHz 4750HQ	8 GiB	GT3e - 1200 MHz	15.4" 2880x1800	256 GiB PCIe	95 Wh	$1999
15" MacBook Pro (faster CPU)	2.6 GHz 4960HQ	8 GiB	GT3e - 1300 MHz	15.4" 2880x1800	256 GiB PCIe	95 Wh	$2299
Toshiba S55-A5358	2.4 GHz 4700MQ	8 GiB	GT2 - 1150 MHz	15.6" 1920x1080	1 TiB spinning	43 Wh	$850 + SSD
Galago Pro	2.0 GHz 4750HQ	16 GiB	GT3e - 1200 MHz	14.1" 1920x1080	many options	52 Wh	~ $1500

The Galago Pro is included in the table just to make me sick.

I need to think hard about whether portability or speed is more important to me, but regardless of the decision both paths lead to a MacBook Pro.

17 November 2013 – Tags: freedesktop intel linux mesa
