As promised, here are the assembly listings of the mat4 x mat4 testcase testvect_intrinsic.cpp. Apologies for the wonky syntax highlighting - apparently that is not among pastebin's strengths.
ppc750cl assembly - the timed innermost loop starts at L103
bobcat assembly - the timed innermost loop starts at L98
Thanks.
4x4 matrix multiply involves basically 32 loads, 16 stores, 16 MULs, and 48 MADDs. This is the instruction breakdown:
PPC750:
15 lfs (load single fp32)
1 lfsx (load single fp32 using reg index - compiler uses this w/constant 0 register instead of using an imm offset of 0)
16 ps_merge00 (used to create a paired single from two separate singles; compiler is using this like a 2x broadcast instruction)
8 ps_mul (2x multiply)
24 ps_madd (2x madd)
8 psq_lx (loads a paired single)
8 psq_stx (stores a paired single)
4 add (pointer arithmetic)
2 slwi (left shifts, pointer arithmetic)
1 bdnz (flow control, probably free)
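For reference, here's a plain scalar sketch of the operation being compiled (not the actual testcase source, just the math). Writing the inner sum with the first term separated makes the 16 MUL / 48 MADD count visible, and it's exactly what the paired-single code does two lanes at a time: ps_merge00 is the broadcast of a[i][k], ps_mul starts the sum, ps_madd accumulates.

```cpp
// C = A * B for row-major 4x4 matrices, scalar reference.
void mat4_mul(const float a[16], const float b[16], float c[16]) {
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float s = a[i * 4 + 0] * b[0 * 4 + j];  // 16 of these -> the MULs
            for (int k = 1; k < 4; ++k)
                s += a[i * 4 + k] * b[k * 4 + j];   // 48 of these -> the MADDs
            c[i * 4 + j] = s;
        }
}
```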
The x86 version is:
1 add (pointer arithmetic)
1 mov (pointer arithmetic)
1 lea (pointer arithmetic)
2 sal (left shift, pointer arithmetic)
1 sub (flow control)
1 jne (flow control)
12 addps (add 4x FP32)
16 movaps (move 4x FP32 - 4 of these are loads and 4 are stores, the other 8 are reg/reg)
16 movss (move 1x FP32 - these are all used as loads)
16 mulps (MUL 4x FP32)
16 shufps (re-organize lanes of 4x FP32)
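The broadcast-heavy pattern behind that instruction mix can be sketched with SSE intrinsics like this (a simplified illustration, not the compiler's exact output: _mm_set1_ps stands in for the movss + shufps broadcast pair, and I'm using unaligned loads/stores for simplicity where the real code uses movaps):

```cpp
#include <immintrin.h>

// C = A * B for row-major 4x4 matrices. Each row of C is built as a
// sum of B's rows, each scaled by one broadcast element of A's row --
// the source of the 16 broadcasts, 16 mulps, and 12 addps.
void mat4_mul_sse(const float* A, const float* B, float* C) {
    for (int i = 0; i < 4; ++i) {
        // broadcast A[i][0], start the accumulator with a plain mul
        __m128 acc = _mm_mul_ps(_mm_set1_ps(A[i * 4 + 0]), _mm_loadu_ps(B));
        for (int k = 1; k < 4; ++k) {
            // broadcast A[i][k], scale row k of B, accumulate
            __m128 bk = _mm_loadu_ps(B + k * 4);
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(A[i * 4 + k]), bk));
        }
        _mm_storeu_ps(C + i * 4, acc);
    }
}
```

With FMA (see below) each mul/add pair here would collapse into a single vfmadd, which is where the x86 side gives back the extra adds.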
I haven't made any attempt at timing analysis for the PPC750 version, so I don't yet know how many dependency stalls it's hitting (probably not terribly many). The FLOP count for the x86 version is the same, but the x86 version is doing some extra work moving data around, since it doesn't have FMADDs, and it makes no attempt to fold the loads into the adds or muls. Both versions use a bunch of scalar loads plus broadcasts to get the scalar multiplications, instead of using vector loads. But Bobcat gets artificially penalized: its shufps instructions take 2 cycles like addps/mulps do, and, alarmingly, so do the movss loads, so it's paying a lot more for its broadcasts.
Moving to Jaguar's improved ISA support could improve things. That gives you three-address arithmetic with AVX128, as well as broadcasts. FMA would improve things more.
A more heavily optimized version would probably consider some friendlier storage formats, if at all possible. I bet a lowly Cortex-A8 would fare nicely here with some hand-written ASM, since it has a vector * scalar FMADD.. that is, if you could hide the huge latency..
So now our expectation is around 170 GFLOPS? That's actually pretty interesting to me. Is it really possible that GPU architectures have advanced so much since Xenos was conceived that, with optimization, you can extract comparable results from a GPU with only a fraction of the raw processing power? With the same number of ROPs, slower main memory bandwidth, and a weaker CPU, to boot.
I guess it bodes well for 720 and Orbis's architecture efficiencies over their predecessors.
You can find some good examples of AMD vs nVidia GPUs where the difference in peak FLOP count painted a very different picture from the difference in typical game performance. That should answer your question, I hope.