Combining constants in i965 fragment shaders
On Intel's Gen graphics, three-source instructions like MAD and LRP cannot take constants (immediate values) as arguments. When support for MAD instructions was introduced with Sandybridge, we assumed the choice between a MOV+MAD and a MUL+ADD sequence was inconsequential, so we chose to perform the multiply and add operations separately. Revisiting that assumption has uncovered some interesting things about the hardware and has led us to some pretty nice performance improvements.
On Gen 7 hardware (Ivybridge, Haswell, Baytrail), multiplies and adds without immediate value arguments can be co-issued, meaning that multiple instructions can be issued from the same execution unit in the same cycle. MADs, never having immediates as sources, can always be co-issued. Considering that, we should prefer MADs, but a typical vec4 * vec4 + vec4(constant) pattern would lead to three duplicate (four total) MOV imm instructions.
mov(8) g10<1>F 1.0F
mov(8) g11<1>F 1.0F
mov(8) g12<1>F 1.0F
mov(8) g13<1>F 1.0F
mad(8) g40<1>F g10<8,8,1>F g20<8,8,1>F g30<8,8,1>F
mad(8) g41<1>F g11<8,8,1>F g21<8,8,1>F g31<8,8,1>F
mad(8) g42<1>F g12<8,8,1>F g22<8,8,1>F g32<8,8,1>F
mad(8) g43<1>F g13<8,8,1>F g23<8,8,1>F g33<8,8,1>F
Should be easy to clean up, right? We should simply combine those 1.0F MOVs and modify the MAD instructions to access the same register. Well, conceptually yes, but in practice not quite.
Since the i965 driver's fragment shader backend doesn't use static single assignment form (it's on our TODO list), our common subexpression elimination pass has to emit a MOV instruction when combining instructions. As a result, performing common subexpression elimination on immediate MOVs would undo constant propagation and the compiler's optimizer would go into an infinite loop. Not what you wanted.
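The ping-pong between the two passes can be reproduced in a toy model. The sketch below is purely illustrative and is not Mesa code: the tuple-based IR, the function names (`propagate_constants`, `cse_immediates`, `optimize`) and the iteration cap are all assumptions made for the example. It only shows the shape of the problem: one pass folds immediates into their users, the other hoists repeated immediates back out into a MOV, and each reports progress every time.

```python
from collections import Counter

# Toy IR (hypothetical): (opcode, destination, sources). Registers are
# strings, immediates are Python floats. This does not mirror Mesa's IR.

def propagate_constants(insts):
    """Fold `mov reg, imm` into every user of reg and drop the MOV."""
    consts = {dst: srcs[0] for op, dst, srcs in insts
              if op == "mov" and isinstance(srcs[0], float)}
    folded = [(op, dst, tuple(consts.get(s, s) for s in srcs))
              for op, dst, srcs in insts if dst not in consts]
    return folded, bool(consts)

def cse_immediates(insts):
    """Naive CSE on immediates: give each repeated value one shared MOV."""
    counts = Counter(s for _, _, srcs in insts for s in srcs
                     if isinstance(s, float))
    shared = {imm: f"tmp{i}" for i, imm in
              enumerate(v for v, n in counts.items() if n > 1)}
    movs = [("mov", reg, (imm,)) for imm, reg in shared.items()]
    rewritten = [(op, dst, tuple(shared.get(s, s) for s in srcs))
                 for op, dst, srcs in insts]
    return movs + rewritten, bool(shared)

def optimize(insts, max_iterations=10):
    """Run both passes until neither makes progress, or give up."""
    for iteration in range(max_iterations):
        insts, folded = propagate_constants(insts)
        insts, combined = cse_immediates(insts)
        if not (folded or combined):
            return insts, iteration
    return insts, None  # no fixed point within the cap

program = [("mul", "a", ("x", 1.0)), ("add", "b", ("y", 1.0))]
_, iterations = optimize(program)
print(iterations)  # None: each pass keeps undoing the other
```

Without the iteration cap, this loop would never exit, which is why the combining has to happen outside the main optimization loop.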
Instead, I wrote a pass that scans the instruction list after the main optimization loop and creates a list of immediate values that are used. If an immediate value is used by a 3-source instruction (a MAD or a LRP) or at least four times by an instruction that can co-issue (ADD, MUL, CMP, MOV) then it's put into a register and sourced from there.
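The selection rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the driver's implementation: the tuple-based instruction format, the set names and the function `immediates_to_promote` are hypothetical, and the toy IR allows a MAD to carry an immediate only so that the pass has something to legalize.

```python
from collections import defaultdict

# Hypothetical toy IR: (opcode, destination, sources); floats are immediates.
THREE_SOURCE = {"mad", "lrp"}              # cannot encode immediates at all
CAN_COISSUE = {"add", "mul", "cmp", "mov"}  # co-issue only without immediates
COISSUE_THRESHOLD = 4                       # "at least four times"

def immediates_to_promote(insts):
    """Return the immediates that should be loaded into a register."""
    must_promote = set()
    coissue_uses = defaultdict(int)
    for opcode, _dst, srcs in insts:
        for src in srcs:
            if not isinstance(src, float):
                continue  # register operand, nothing to do
            if opcode in THREE_SOURCE:
                must_promote.add(src)
            elif opcode in CAN_COISSUE:
                coissue_uses[src] += 1
    frequent = {imm for imm, uses in coissue_uses.items()
                if uses >= COISSUE_THRESHOLD}
    return sorted(must_promote | frequent)

program = (
    [("mad", "r0", ("a", "b", 1.0))]                       # forces 1.0 into a register
    + [("mul", f"m{i}", ("x", 0.5)) for i in range(4)]     # 0.5 used four times
    + [("add", f"s{i}", ("y", 2.0)) for i in range(3)]     # 2.0 used only three times
)
print(immediates_to_promote(program))  # [0.5, 1.0]
```

The threshold is kept as a named constant because it is the knob that trades extra register pressure against more co-issue opportunities.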
But there's still room for improvement. Each general register can store 8 floats, and instead of storing 8 separate constants in each, we're storing a single constant 8 times (and on SIMD16, 16 times!). Fixing that wasn't hard, and it significantly reduces register usage – we now only use one register for each 8 immediate values. Using a special vector-float immediate type we can even load four floating-point values in a single instruction.
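The packing itself is simple index arithmetic. The sketch below assumes a hypothetical starting register (g10) and hypothetical helper names (`assign_slots`, `scalar_source`); the `<0,1,0>` region is how a scalar read of a single register component is usually written in Gen assembly, but treat the exact operand text as illustrative.

```python
FLOATS_PER_REGISTER = 8  # one general register holds 8 floats

def assign_slots(immediates, first_register=10):
    """Map each immediate to a (register, component) slot, 8 per register."""
    slots = {}
    for index, imm in enumerate(immediates):
        reg_offset, component = divmod(index, FLOATS_PER_REGISTER)
        slots[imm] = (first_register + reg_offset, component)
    return slots

def scalar_source(slot):
    """Format a slot as a scalar source operand (illustrative syntax)."""
    reg, component = slot
    return f"g{reg}.{component}<0,1,0>F"

constants = [float(i) for i in range(9)]       # nine distinct immediates
slots = assign_slots(constants)
registers_used = len({reg for reg, _ in slots.values()})

print(scalar_source(slots[3.0]))  # g10.3<0,1,0>F
print(registers_used)             # 2, versus 9 with one constant per register
```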
With that in place, we can now always emit MAD instructions.
I'm pretty pleased with the results. Without using the New Intermediate Representation (NIR), the shader-db results are:
total instructions in shared programs: 5895414 -> 5747578 (-2.51%)
instructions in affected programs: 3618111 -> 3470275 (-4.09%)
And with NIR (that already unconditionally emits MAD instructions):
total instructions in shared programs: 7992936 -> 7772474 (-2.76%)
instructions in affected programs: 3738730 -> 3518268 (-5.90%)
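For anyone who wants to check the percentages, they follow directly from the before and after counts. The helper name `percent_change` is just for this example.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0

results = {
    "without NIR, total":    (5895414, 5747578),
    "without NIR, affected": (3618111, 3470275),
    "with NIR, total":       (7992936, 7772474),
    "with NIR, affected":    (3738730, 3518268),
}
for label, (before, after) in results.items():
    print(f"{label}: {before - after} fewer instructions "
          f"({percent_change(before, after):.2f}%)")
```

In each configuration the absolute saving is the same on both lines, as expected, since every removed instruction comes from an affected program.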
Effects on a WebGL microbenchmark
In December, I checked what effect my constant combining pass would have on a WebGL procedural noise demo. The demo generates an effect ("noise") that looks like a ball of fire. Its fragment shader contains a ton of instructions but no texturing operations. We're currently able to compile the program in SIMD8 without spilling any registers, but at a cost of scheduling the instructions very badly.
The effects the constant combining pass has on this demo are really interesting, and it actually gives me evidence that some of the ideas I had for the pass are valid, namely that co-issuing instructions is worth a little extra register pressure.
- 1.00x FPS of baseline – 3123 instructions – baseline
- 1.09x FPS of baseline – 2841 instructions – after promoting constants only if used by more than 2 MADs
Going from no-constant-combining to restricted-constant-combining gives us a 9% increase in frames per second for a 9% instruction count reduction. We're totally limited by fragment shader performance.
- 1.46x FPS of baseline – 2841 instructions – after promoting any constant used by a MAD
Going from step 2 to 3 though is interesting. The instruction count doesn't change, but we reduced register pressure sufficiently that we can now schedule instructions better without spilling (SCHEDULE_PRE, instead of SCHEDULE_PRE_NON_LIFO), a 33% speedup just by rearranging instructions.
- 1.62x FPS of baseline – 2852 instructions – after promoting constants used by at least 4 co-issueable instructions
I was worried that we weren't going to be able to measure any performance difference from pulling constants out of co-issueable instructions, but we can definitely get a nice improvement here, of about 10% increase in frames per second.
- SCHEDULE_PRE with no spills: 17.5 seconds
- SCHEDULE_PRE with 4 spills (8 send instructions): 17.5 seconds
- SCHEDULE_PRE_NON_LIFO with no spills: 28 seconds
So there's some good evidence that the cure (falling back to SCHEDULE_PRE_NON_LIFO to avoid spilling) is worse than the disease (a handful of spills). Of course this demo doesn't do any texturing, so memory bandwidth is not at a premium.
- 1.76x FPS of baseline – 2609 instructions – ???
I ran the demo to see if we'd made any changes in the last two months and was pleasantly surprised to find that we'd cut another 9% of instructions. I have no idea what caused it, but I'll take it! Combined with everything else, we're up to a 76% performance improvement.
Where's the code
The Mesa patches that implement the constant combining pass were committed (commit bb33a31c) and will be in the next major release (presumably version 10.6).
If any of this sounds interesting enough that you'd like to do it for a living, feel free to contact me. My team at Intel is responsible for the open source 3D driver in Mesa and is looking for new talent.