C++ implementation of Larrabee new instructions

This is quite interesting because that's essentially what they did in AVX. Originally AVX was supposed to have a 4-operand FMA, but it was later revised to a 3-operand FMA with three variants, just like this. I don't know the rationale behind that decision, but I suspect it's related to x86 instruction encoding or some hardware issue.

My guess is that they have 3 variants because only the last argument in the instruction encoding can come from memory and/or have format conversion applied. The trio of instructions gives you flexibility as to which of the arguments that applies to.
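For concreteness, here's a rough scalar sketch of how the three orderings differ, using the operand roles of the 3-operand forms Intel eventually shipped (the 132/213/231 naming); this is just my own illustration, not anything from the Larrabee prototype library:

Code:
#include <cstdio>

// Scalar sketches of the three FMA operand orderings. In the shipped 3-operand
// encodings the last source (c here) is the one that can come from memory, so
// the three forms let you pick which logical role the memory operand plays.
static float fma132(float a, float b, float c) { return a * c + b; } // a = a*c + b
static float fma213(float a, float b, float c) { return b * a + c; } // a = b*a + c
static float fma231(float a, float b, float c) { return b * c + a; } // a = b*c + a

int main() {
    // Same three inputs, three different roles for the destination register.
    printf("%f %f %f\n", fma132(2, 3, 4), fma213(2, 3, 4), fma231(2, 3, 4));
    return 0;
}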
 
So it appears that LOG2, EXP2, RECIP and RSQRT are intrinsic transcendentals that operate on all 16 lanes of a vector. That appears to be quite a serious investment in die area! Though we can't tell what latency these instructions will have.

Jawed
 
That appears to be quite a serious investment in die area!
Not necessarily. Just like G80 they could use fewer execution units than vector components and process them partially sequentially. So it's a die area versus latency versus throughput trade-off. Four execution units seems like a reasonable choice. Note that Intel has done this before with SSE: before Core 2 it only had 64-bit wide execution units, and every 128-bit wide operation took an extra cycle to process the lower and upper halves.
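To illustrate the trade-off (purely a sketch, not how the hardware is actually organized), a 16-lane operation issued to a hypothetical 4-lane ALU would simply take four passes:

Code:
#include <array>

using vec16 = std::array<float, 16>;

// Illustrative only: a 16-lane add executed on a hypothetical 4-wide ALU.
// Each pass stands in for one cycle of work, so the full vector takes four
// passes -- less die area, more latency for the same logical instruction.
vec16 add16_on_4wide_alu(const vec16& a, const vec16& b) {
    vec16 r{};
    for (int pass = 0; pass < 4; ++pass) {        // one "cycle" per pass
        for (int lane = 0; lane < 4; ++lane) {    // 4 lanes in flight at once
            int i = pass * 4 + lane;
            r[i] = a[i] + b[i];
        }
    }
    return r;
}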
 
So it appears that LOG2, EXP2, RECIP and RSQRT are intrinsic transcendentals that operate on all 16 lanes of a vector. That appears to be quite a serious investment in die area! Though we can't tell what latency these instructions will have.

Jawed

"Math utility functions do not correspond directly to Larrabee new instructions, but are added for programming support."

I think these will likely be executed in a scalar fashion in h/w.
 
"Math utility functions do not correspond directly to Larrabee new instructions, but are added for programming support."

I think these will likely be executed in a scalar fashion in h/w.
The instructions I listed aren't in the set of "math utility operations", they're there right next to ADD, MAD132, etc.

Because the math utility operations section is specifically about operations that take multiple cycles, it seems to me a very strong hint that these four transcendentals are single-cycle intrinsic 16-wide vector operations.

What's bugging me is why the presentations at GDC haven't appeared yet :cry:

Jawed
 
Is GDC under NDA, à la Apple's WWDC? It doesn't look like that to me. I would have thought that Intel would want to popularize the techniques; that's why the vector ISA intrinsics were released. Or maybe not many people care about rasterizing in software, so there's no point in telling them about it.
 
From Formal Verification of IA-64 Division Algorithms:

http://www.cl.cam.ac.uk/~jrh13/slides/tphols-15aug00/slides.pdf

Slide 9 suggests that frcpa has a latency of 5 cycles. (IA-64 seems to have two floating-point units.) I know essentially nothing about IA-64, so I'm unclear whether the floating-point units themselves have a latency of 5 cycles, as that slide can be interpreted in multiple ways.

There's more background here:

http://download.intel.com/technology/itj/q41999/pdf/ia64fpbf.pdf
http://www.ncsu.edu/wcae/ISCA2003/submissions/cornea.pdf

This:

http://www.pdc.kth.se/~pek/ia64-profiling.txt

says that FPU latency is 4 cycles on Itanium 2, which seems to have run at around 1 GHz.

So, what's the latency of Larrabee's vector ALUs? It's 8 cycles in ATI and, er, I forget in NVidia, 12? That's purely the math, not fetching operands or storing.
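For anyone wondering why an approximate-reciprocal instruction like frcpa (or Larrabee's RECIP) is useful at all when the FPU is multi-cycle: the IA-64 division algorithms refine the estimate with FMA-based Newton-Raphson steps. A crude C++ sketch of the idea, my own illustration with a software initial guess rather than the actual IA-64 sequence:

Code:
#include <cmath>
#include <cstdio>

// Newton-Raphson refinement of an approximate reciprocal:
//   x_{n+1} = x_n * (2 - a * x_n), each step roughly doubling the accuracy.
// The initial guess is a crude power-of-two estimate standing in for a
// hardware lookup table; positive inputs only.
float recip_refined(float a) {
    float x = std::ldexp(1.0f, -std::ilogb(a));   // ~1/a to within a factor of 2
    for (int i = 0; i < 4; ++i)
        x = x * (2.0f - a * x);
    return x;
}

int main() {
    printf("1/3 ~ %.7f\n", recip_refined(3.0f));  // converges toward 0.3333333
    return 0;
}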

Jawed
 
A shader with 32 live registers at any point is pretty rare... note that on modern NVIDIA hardware 10+ registers is considered a lot (see the D3D tutorial slides from this year's GDC), so I don't think 32 is exactly starving anyone at the moment ;)

I'm also confused as to why people think loop unrolling would increase the live register count at all... if anything, it may actually optimize out the loop iterator, but by no means should it increase the register count of the program since registers used internal to the loop are necessarily only live for one iteration, and thus can be trivially reused on the next.
 
In a worst case scenario loop unrolling might slightly increase register pressure but it's not something I would particularly worry about. As Andrew says, in some cases it might actually decrease register pressure.
 
Is branching completely unimproved over desktop x86 on Larrabee? (i.e. no zero-overhead looping, no data-dependent branch hints.) Also, does it have register renaming?

Writing assembly with branches where you know the direction of the branch dozens of cycles ahead of time, and having no alternative but to let the processor butt its head into a wall of unnecessary branch misses, always irks me (even if after a while it learns to avoid the wall occasionally).
 
A shader with 32 live registers at any point is pretty rare... note that on modern NVIDIA hardware 10+ registers is considered a lot (see the D3D tutorial slides from this year's GDC), so I don't think 32 is exactly starving anyone at the moment ;)
At the shader level each register would correspond to one scalar value, not a 4-component vector. So with 10 registers you wouldn't even have enough to do a MAC operation on a whole vector, let alone execute a complex shader. So 32 hardware registers really isn't too many if each of them stores 16 scalars from 16 pixels.
I'm also confused as to why people think loop unrolling would increase the live register count at all... if anything, it may actually optimize out the loop iterator, but by no means should it increase the register count of the program since registers used internal to the loop are necessarily only live for one iteration, and thus can be trivially reused on the next.
There's loop unrolling to lower the loop test-and-branch overhead, and there's loop unrolling to hide latency. The latter form rapidly increases register usage.
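A quick sketch of the second kind (my own illustration): interleaving independent iterations so several multiply-adds are in flight at once, which is where the extra live registers come from.

Code:
// Illustrative: unrolling to hide latency keeps several iterations in flight,
// and each in-flight iteration needs its own live accumulator and operands.
float dot_unrolled(const float* a, const float* b, int n) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;   // four live accumulators instead of one
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i)                       // remainder loop
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);
}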
 
Is branching completely unimproved over desktop x86 on Larrabee? (i.e. no zero-overhead looping, no data-dependent branch hints.) Writing assembly with branches where you know the direction of the branch dozens of cycles ahead of time, and having no alternative but to let the processor butt its head into a wall of unnecessary branch misses, always irks me (even if after a while it learns to avoid the wall occasionally).
It has 4-way HyperThreading. By the time the result of the branch is needed, it's ready.
Also, does it have register renaming?
Why would you need that in an in-order processor?
 
It has 4-way HyperThreading. By the time the result of the branch is needed, it's ready.
From what they say in the papers/presentations it sounds like it switches on branch mispredicts (to avoid the bubble) but not when the branch enters the pipeline. It would be nice if they did this, but even then the vertical multithreading wouldn't always be enough (you need on the order of 10-20 cycles, it's not just about instruction latency since branches are done higher up in the pipeline).
 
From what they say in the papers/presentations it sounds like it switches on branch mispredicts (to avoid the bubble) but not when the branch enters the pipeline. It would be nice if they did this, but even then the vertical multithreading wouldn't always be enough (you need on the order of 10-20 cycles, it's not just about instruction latency since branches are done higher up in the pipeline).

Larrabee's design is supposed to closely match Pentium1's short execution pipeline. If I remember correctly, a mispredicted branch used to cost 4 cycles.
Of course Pentium1 had only static prediction. I don't know if Larrabee has anything beefier than that.
 
At the shader level each register would correspond to one scalar value, not a 4-component vector. So with 10 registers you wouldn't even have enough to do a MAC operation on a whole vector, let alone execute a complex shader. So 32 hardware registers really isn't too many if each of them stores 16 scalars from 16 pixels.
The typical shaders I see spit out of FXC are 1-8 "vector" temporary registers (xyzw - although it's not packing those at all, so after scalarizing it's often less than 4 * # registers), which fits fine into 32 registers. I just compiled a very complex shader with significant control flow and it uses 13 FXC "vector" registers, and many of those it appears to be using only one or two components, so I don't think the situation is as bad as you seem to imply. In fact I think 32/HW thread is a pretty reasonable number for most shaders. Time will tell though :)

There's loop unrolling to lower the loop test-and-branch overhead, and there's loop unrolling to hide latency. The latter form rapidly increases register usage.
Ah, I see... in this context I'd consider that more directly related to software fibering (i.e. "unrolling" statically or dynamically the implicit outer loop over pixels), rather than unrolling shader loops, which is how I interpreted your original comment. And yes, that can certainly increase register pressure, but it's not exactly hard to spill/restore the register context at a really low cost on typical fiber switches.
 
Because the math utility operations section is specifically about instructions that take multiple cycles it seems to me like it's a very strong hint that these four transcendentals are single-cycle intrinsic 16-wide vector operations.
From http://pc.watch.impress.co.jp/docs/2009/0330/kaigai498_p139.jpg: "Almost all vector instructions take one clock."
I think it's pretty clear that some of the ops, most likely the transcendentals, will be multi-cycle even if they're logically a single instruction.
 
Interesting thing that just hit me from their slides: none of the vector instructions can take immediate constants. If you want a constant (like a looped (Va+Vb)/2 or something) you have to use an existing x86 instruction to write the immediate value to memory, and then do a broadcast load from memory in the vector instruction that uses it.
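In plain C++ (just to illustrate the pattern; these aren't the real LRBni prototype intrinsics, whose names I don't have in front of me), the dance looks something like this:

Code:
#include <array>

using vec16 = std::array<float, 16>;

// The constant lives in memory, and a broadcast load replicates it across all
// 16 lanes; the vector operation itself never sees an immediate.
vec16 broadcast(const float* p) {
    vec16 v;
    v.fill(*p);
    return v;
}

vec16 average(const vec16& a, const vec16& b) {
    static const float half = 0.5f;      // written to memory first
    vec16 h = broadcast(&half);          // broadcast load from memory
    vec16 r;
    for (int i = 0; i < 16; ++i)
        r[i] = (a[i] + b[i]) * h[i];     // (Va+Vb)/2 using the broadcast value
    return r;
}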
 
There's loop unrolling to lower the loop test-and-branch overhead, and there's loop unrolling to hide latency. The latter form rapidly increases register usage.
Why would you unroll the latter loops? Unless they are incredibly short it doesn't make much sense.
 