Larrabee: Samples in Late 08, Products in 2H09/1H10

But, thus far they haven't come up with an ISA that is really "better" than a standard RISC ISA.
All else being equal, having branch preparation instructions is better than not having them (and ignoring patent rights they cost bugger all to implement).

PS. it's not just Cell, also SuperH and the ill-fated E2k.
 
All else being equal, having branch preparation instructions is better than not having them (and ignoring patent rights they cost bugger all to implement).

PS. it's not just Cell, also SuperH and the ill-fated E2k.

The problem is that the branch preparation instruction will have to finish (retire) at the very end of the pipeline, but its result is needed at the front. In a Core 2 you can end up having to issue the branch preparation 14 cycles (pipeline depth) * 4 instructions/cycle (width) = 56 instructions in advance. The wider and deeper your core is, the harder it will be to utilize branch preparation.

Branch prediction has the wonderful ability to break data dependencies on conditions (which is exactly what branches are).

I think you're right that it's cheap to add the capability, and it might be useful in some cases, but in a lot of cases the CPU core would end up predicting the branch in question anyway because the preparation instruction hasn't completed yet.
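
Here's a rough illustration in plain C of what using such a preparation instruction amounts to at the source level: the condition is resolved one iteration ahead, so a hint derived from it has (roughly) a loop body's worth of instructions in which to reach the front end. The function and the amount of work per iteration are made up for illustration; on something as wide and deep as a Core 2 the body would need dozens of independent instructions for this to actually pay off.
Code:
/* Illustrative only: resolve the branch condition one iteration ahead,
 * so a hypothetical branch-preparation/hint instruction derived from it
 * has roughly a loop body's worth of instructions to propagate to the
 * front of the pipeline before the branch itself is fetched. */
#include <stddef.h>

float process(const float *data, size_t n, float threshold)
{
    float acc = 0.0f;
    if (n == 0)
        return acc;

    int next_cond = data[0] > threshold;          /* condition known early */
    for (size_t i = 0; i < n; ++i) {
        int cond = next_cond;                     /* already resolved      */
        if (i + 1 < n)
            next_cond = data[i + 1] > threshold;  /* prepare the next one  */

        /* ...independent work here is what hides the hint latency... */
        if (cond)                                 /* the branch being hinted */
            acc += data[i] * 2.0f;
        else
            acc += data[i] * 0.5f;
    }
    return acc;
}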

Cheers
 
Try hiding the branch hint latency while traversing some irregular spatial subdivision structure. Sounds like it would be hard to find enough independent work to hide it.
Ray-tracers have been shown to run very well on Cell, so it seems that letting the SPEs traverse a kd-tree is a solved problem.
 
PS. it's not just Cell, also SuperH and the ill-fated E2k.
I don't know about E2k, but SuperH has branches with a delay slot (like MIPS and SPARC used to have, BTW). This was a popular choice back in the RISC glory days, but it turns out to be a very poor one, as it exposes part of the microarchitecture and becomes a burden with long pipelines, branch prediction, or out-of-order execution.
 
Ray-tracers have been shown to run very well on Cell, so it seems that letting the SPEs traverse a kd-tree is a solved problem.

You would probably want to use bounding volume hierarchies on Cell, since they have a higher compute-to-memory-access ratio as well as a shallower tree structure.

As for ray-tracers running well on Cell, have there been any that deal with secondary rays? All the papers I've read conclude that with coherent rays/ray packets you get fantastic primary-ray performance and so-so shading performance, but they all skip dealing with secondary rays.

Cheers
 
All the ATI and NVidia GPUs are VLIW.

I would have called the GPUs SIMD rather than VLIW. Can you say more about how these GPUs are VLIW? (Like RISC and CISC, the term is somewhat ambiguous and means different things to different people.) I don't actually know what the ISA looks like, so they very well could be VLIW (I just need more info to know for sure).
 
I would have called the GPUs SIMD rather than VLIW. Can you say more about how these GPUs are VLIW? (Like RISC and CISC, the term is somewhat ambiguous and means different things to different people.) I don't actually know what the ISA looks like, so they very well could be VLIW (I just need more info to know for sure).
R600 has 5 ALUs that can run independent instructions, 4x MAD + 1x transcendental (the latter also supports MAD - multiply-add). G80 has 2 ALUs, MAD + transcendental (the latter also has an attribute-interpolation instruction).

In all cases the code is statically compiled to issue across these ALUs.

Jawed
 
How will branch prediction work with a 16 wide vector machine like Larrabee? With GPUs each vector element is a vertex or pixel that can branch independently so predication is used unless all elements choose the same branch.

I can see the branch predictor working like normal here if it treats the entire vector as a single thread, but it might be impossible to predict a branch correctly with GPU like behavior as both branches could be taken.

Is Larrabee expected to work like a GPU here or will it truly be a wide SSE where all data takes the same branch? If the vector does branch together won't there be many times when part of the vector is empty because the workload doesn't need 16 elements?
 
With predication on GPUs you're running both sides of the branch too, and in those cases part of the SIMD hardware is idle/empty also.

If I understand it right, there won't be a vector branch instruction in Larrabee, so there's no chance of having both a taken and a not-taken result for the same branch. Instead, you'll have vector comparison instruction(s) that compute essentially a 1-bit result for each vector element, and a way to tell whether any of those results are true or any false.

Then a potentially-divergent branch on Larrabee would look like this:
Code:
vcond = <vector comparison>
push current vector predicate mask
if (any bits set in vcond)
    AND vcond into vector predicate mask
    run "true" side of branch
    restore vector predicate mask from the top of the stack
if (any bits set in NOT(vcond))
    AND NOT(vcond) into vector predicate mask
    run "false" side of branch
pop vector predicate mask
The branch predictor would predict the result of the two branches in that code, each of which is either taken or not taken.

I'd expect there to be specialized instructions for making this pattern (and similar patterns for loops) efficient.
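
For what it's worth, here's how I'd model that pattern in scalar C, with a 16-bit integer standing in for the vector predicate mask (one bit per lane). The helper names, the explicit mask stack, and the work done on each side are all made up for illustration; this is a model of the pattern, not Larrabee's actual instructions.
Code:
/* Scalar C model of the push / AND / run / pop pattern above.
 * exec_mask holds one bit per vector lane; each side's mask is derived
 * from the pushed mask so a divergent branch runs both sides. */
#include <stdint.h>

#define VLEN 16

static uint16_t mask_stack[64];
static int      mask_sp;
static uint16_t exec_mask = 0xFFFF;                /* all lanes active */

static void run_side(float *x, uint16_t mask, float scale)
{
    for (int lane = 0; lane < VLEN; ++lane)        /* predicated op */
        if (mask & (1u << lane))
            x[lane] *= scale;
}

void divergent_branch(float *x, const float *y)
{
    uint16_t vcond = 0;
    for (int lane = 0; lane < VLEN; ++lane)        /* vector comparison */
        if (y[lane] > 0.0f)
            vcond |= (uint16_t)(1u << lane);

    mask_stack[mask_sp++] = exec_mask;             /* push current mask */

    if (mask_stack[mask_sp - 1] & vcond) {         /* any "true" lanes? */
        exec_mask = mask_stack[mask_sp - 1] & vcond;
        run_side(x, exec_mask, 2.0f);              /* "true" side */
    }
    if (mask_stack[mask_sp - 1] & (uint16_t)~vcond) {   /* any "false" lanes? */
        exec_mask = mask_stack[mask_sp - 1] & (uint16_t)~vcond;
        run_side(x, exec_mask, 0.5f);              /* "false" side */
    }

    exec_mask = mask_stack[--mask_sp];             /* pop */
}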
 
How will branch prediction work with a 16 wide vector machine like Larrabee? With GPUs each vector element is a vertex or pixel that can branch independently so predication is used unless all elements choose the same branch.
If you are talking about branching granularity, Larrabee, G80, and R6xx have a granularity that is greater than one.
The question is how we go about comparing them.

Larrabee's vector is 16-wide, though if we were to pack a pixel quad's worth of data into each register, it is only 4 pixels. If the data is arranged differently, we could force the granularity up to 16.
G80 has a granularity of 16 for vertex threads, and 32 for pixels.
R600 has a granularity of 64.

If there is any divergent branch behavior, there is going to be idled hardware, duplicated work, or serialized run-through of both code paths.

Larrabee's best case is less than that of G80, which is less than that of R600.

With predication on GPUs you're running both sides of the branch too, and in those cases part of the SIMD hardware is idle/empty also.

If I understand it right, there won't be a vector branch instruction in Larrabee, so there's no chance of having both a taken and a not-taken result for the same branch. Instead, you'll have vector comparison instruction(s) that compute essentially a 1-bit result for each vector element, and a way to tell whether any of those results are true or any false.

Then a potentially-divergent branch on Larrabee would look like this:
Code:
vcond = <vector comparison>
push current vector predicate mask
if (any bits set in vcond)
    AND vcond into vector predicate mask
    run "true" side of branch
    restore vector predicate mask from the top of the stack
if (any bits set in NOT(vcond))
    AND NOT(vcond) into vector predicate mask
    run "false" side of branch
pop vector predicate mask
The branch predictor would predict the result of the two branches in that code, each of which is either taken or not taken.

I'd expect there to be specialized instructions for making this pattern (and similar patterns for loops) efficient.

That sounds more like value prediction that just happens to be used to change program behavior. There may be some theoretical advantage shown in scalar code, but I haven't seen much on wider values.

At 16 elements, a branch predictor storing a bit mask for each vector branch would need entries 8 times as large as a 2-bit saturating counter.
At 16 elements, there is a much wider space of possible outcomes than just taken/not-taken.
Given how much of this behavior is data-driven, the chance that the predictor is wrong somewhere in those 16 elements is pretty high.

I'm not convinced predicting a vector mask's value is better.

edit:

Never mind, I read too much into what you were saying.
The pseudo code there is a branch based on a compare to zero.
The problem of branching granularity is not helped, as the branch predictor only avoids work if the vcond value is all zeros. It even ignores the other coherent branching case, where vcond is all 1s.
GPUs at least are capable of handling both the all 0s and all 1s cases pretty well.
The branch predictor in this case adds nothing and potentially makes things worse.
 
The branch predictor would predict the result of the two branches in that code, each of which is either taken or not taken.
GPUs seem to do one of two things (this is my interpretation):
  1. (ATI) swap the thread out - the pipeline avoids a bubble by being filled with another thread and the branch evaluation is performed separately. When the branch has been evaluated the thread can continue.
  2. (NVidia) thresholded serialisation - if the number of instructions in at least one branch is below a certain threshold then the GPU always executes all the instructions in that branch even if the predicate mask for the instructions ends up as "0". If it's a loop, the branch will eventually be evaluated and can be exited. Above that threshold the GPU swaps the thread out to evaluate the branch, each time it is reached.
So, the NVidia technique minimises the number of threads that are required, but risks wasted ALU cycles on short branches.

Jawed
 
Last night I was talking to my brother (who is currently optimizing parallel code for an IBM Blue Gene supercomputer) about Larrabee, and he brought up some interesting points. Keep in mind, from his massively parallel perspective, cache coherence and shared memory are simply not scalable; managing a local store of memory and minimizing network latency are paramount.

One obvious thing he brought up which I completely forgot about is that while caches (and branch prediction, for that matter) make it easier for some programmers, when trying to extract peak performance per Watt (or per ALU capacity of the machine) you nearly always have to have an intimate knowledge of the cache design and modify the algorithm to suit the cache. The same goes for branch prediction (to a lesser extent), in optimizing by fanning out branch paths to suit the branch predictor. Both techniques require expert-level programming skills to take advantage of. At least with a program-managed local store (non-cache-coherent) you can also easily optimize the memory bandwidth (and memory access patterns) as well as the ALU side of the algorithm, resulting in better performance per Watt.

Texture caches probably are an exception to the comment above, in that GPUs run fragment programs grouped on output locality, so changing texture cache details probably need not require changes to fragment programs for optimal performance.

Somewhat on the topic of branch prediction,

If Larrabee as a GPU is going to run DX10 shaders, it seems as if Larrabee is going to need to run multiple instances of 16-wide fragment programs (NVidia style, 16 programs at once in one SIMD group) in the same thread in order to hide texture fetch latency. This will probably be done transparently by the compiler (interleaving; a rough sketch follows the list below). There are a few possible problems with this for Larrabee:

1.) Registers become a scarce resource. If you are interleaving 4 16-wide shader programs to hide texture fetch latency, you are down to 8 regs per shader program (32/4).

2.) You won't be able to use the standard x86 integer or double-precision SSE2 regs with this formula (with any hope of good performance). Double precision might not integrate well into standard unified vertex programs on Larrabee. The problem here is that if you add 8-wide DP support in the 16-wide vectors, you break the formula for 16 simultaneous programs. One option would be to split DP into register pairs (hi 32 bits, 16-wide, in one even register and lo 32 bits, 16-wide, in the next odd register), keeping the same ISA opcode structure. I still wonder what NVidia is going to do here as well... if done this way, NVidia's banked register file actually looks like an asset, compared to a MAC (mul+add) on double precision requiring an extra register-file cycle on Larrabee (stalling access for another ALU op) due to the limited number of register-file ports (3 vs. the 6 needed for split regs).

3.) Same as above, Larrabee will probably need integer support in the 16-wide reg extension to support unified shaders.

4.) Branch granularity will probably have to match the amount of shader program interleaving. So Larrabee's possible 16 wide branch granularity (for fragment programs) probably will effectively be 32, 48, or 64 wide depending on how the compiler interleaves. Branch direction is likely highly uncorrelated to anything other than data and best predicted with a simple program hint indicating which path is most common.
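
Here is a rough sketch of the kind of interleaving I mean, in plain C. The asynchronous texture-fetch calls and the batch type are stand-ins I made up for illustration, not a real Larrabee or DX10 API; the point is just that while one 16-wide instance waits on its fetch, the others supply ALU work, and each instance then has to fit in 32/4 = 8 vector registers.
Code:
/* Sketch of 4-way interleaving of a 16-wide fragment program to hide
 * texture fetch latency.  texture_fetch_async()/fetch_wait() and the
 * batch type are hypothetical placeholders, not a real API. */
#define INSTANCES 4                        /* interleaved program copies */

typedef struct { float r[16], g[16], b[16]; } batch;   /* 16 pixels */

/* Hypothetical stand-ins for an asynchronous texture fetch. */
static int  texture_fetch_async(batch *px) { (void)px; return 0; }
static void fetch_wait(int id)             { (void)id; }

void shade_interleaved(batch px[INSTANCES])
{
    int fetch_id[INSTANCES];

    /* Issue all the texture fetches up front... */
    for (int i = 0; i < INSTANCES; ++i)
        fetch_id[i] = texture_fetch_async(&px[i]);

    /* ...then do the dependent ALU work.  By the time instance 0 needs
     * its result, instances 1..3 have provided work to cover (part of)
     * the fetch latency.  Each instance lives in 32/4 = 8 registers. */
    for (int i = 0; i < INSTANCES; ++i) {
        fetch_wait(fetch_id[i]);
        for (int lane = 0; lane < 16; ++lane)
            px[i].r[lane] = px[i].r[lane] * px[i].g[lane] + px[i].b[lane];
    }
}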
 
Special vector instructions

When looking at the paper on the XuCPU (the tri-core PowerPC-based Xbox 360 CPU chip), I noticed the following two paragraphs about its special vector instructions:

The VMX128 implementation includes an added dot product instruction, common in graphics applications. The dot product implementation adds minimal latency to a multiply-add by simplifying the rounding of intermediate multiply results. The dot product instruction takes far less latency than discrete instructions.

Another addition we made to the VMX128 was direct 3D (D3D) compressed data formats, the same formats supported by the GPU. This allows graphics data to be generated in the CPU and then compressed before being stored in the L2 or memory. Typical use of the compressed formats allows an approximate 50 percent savings in required bandwidth and memory footprint.

Reading this made me think about what Larrabee's vectors might support. Of course, the XuCPU works in tandem with a discrete GPU whereas Larrabee is aiming to be the discrete GPU. Yet one could imagine that Larrabee could add similar instructions for handling compression/decompression of textures, texture sampling, etc.

One interesting thing about the dot product instruction is that it "crosses lanes". That is, it uses the whole vector in computing the final result; it doesn't just do the same operation in parallel to the entire vector.

Anyone care to speculate what sort of 512-bit vector operations might make sense in a GPU?
 
While a dot product instruction can be handy in some cases, on a 16-way SIMD processor it doesn't make much sense imho; MADs (and SOAs) are your friend in this case
 
While a dot product instruction can be handy in some cases, on a 16-way SIMD processor it doesn't make much sense imho; MADs (and SOAs) are your friend in this case

I'm not exactly sure where dot products are used in games and graphics (or how large the vectors are), but let's just consider doing a dot product of two large vectors. Just using 32-bit operations would require 16 MADs (multiply-adds) in consecutive cycles to perform the same computation as a single 16-way dot product done by one hardware vector instruction. Doing this as a single instruction might be faster, because all the multiplies would be done in parallel and an adder tree could be built using CSA (carry-save) adders.
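
Here is what I have in mind, in scalar C just to show the dependence structure (the function names are mine): the serial form is one long chain through the accumulator, while the tree form has 16 independent multiplies followed by log2(16) = 4 levels of pairwise adds, which is the shape a dedicated dot-product unit could exploit internally.
Code:
/* Serial form: 16 dependent multiply-adds, one chain through acc. */
float dot16_serial(const float *a, const float *b)
{
    float acc = 0.0f;
    for (int i = 0; i < 16; ++i)
        acc += a[i] * b[i];
    return acc;
}

/* Tree form: independent multiplies, then 4 levels of pairwise adds
 * (note the floating-point rounding can differ from the serial form). */
float dot16_tree(const float *a, const float *b)
{
    float p[16];
    for (int i = 0; i < 16; ++i)          /* independent multiplies */
        p[i] = a[i] * b[i];
    for (int stride = 8; stride >= 1; stride /= 2)
        for (int i = 0; i < stride; ++i)
            p[i] += p[i + stride];        /* pairwise reduction */
    return p[0];
}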

Are MADs allowed to be pipelined, or do they have a back-to-back latency penalty?

Maybe this highlights the difference of the multiple parallel SIMD lanes of the G80 with the vectors of CPUs? If I understand things correctly, in the G80 context, each thread would do a dot product one element at a time using 32-bit floating point operations. In a vector CPU, it could perform a vector x vector operation and get back a single 32-bit result (something that you wouldn't do that way on a G80). Is that close to right?

Is this making any sense? This is an honest question, as I'm trying to understand the difference between Larrabee's vectors and the G80s multiple SIMD computations.

(and SOAs) are your friend in this case

The Microsoft patent on this talks about why using SOA isn't a perfect solution (apparently, SOA is structure-of-arrays, rather than the more traditional array-of-structures, AOS, layout). There is an article that quotes from it. I'm not sure I understand this all exactly, but it sounds convincing to me. ;)
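
In case it helps anyone else with the terminology, the difference is just in how the components are laid out in memory; a quick sketch (the names and sizes are mine):
Code:
/* Array-of-structures (AOS): one vertex's components sit next to each
 * other, so a wide vector load grabs x,y,z,w of a single vertex. */
struct vec4_aos { float x, y, z, w; };
struct vec4_aos positions_aos[1024];

/* Structure-of-arrays (SOA): each component gets its own array, so a
 * 16-wide vector load grabs the same component of 16 vertices. */
struct positions_soa {
    float x[1024];
    float y[1024];
    float z[1024];
    float w[1024];
};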
 
I'm not exactly sure where dot products are used in games and graphics (or how large the vectors are),
Dot products are used all the time, mostly on 3- or 4-component vectors.
Sometimes even dot products with very wide vectors (10+) are needed, for example to reconstruct functions projected over some vector basis (spherical harmonics..) or to tessellate surfaces via direct evaluation (and much, much more..)

but let's just consider doing a dot product of two large vectors. Just using 32-bit operations would require 16 MAD (multiply-and-adds) in consecutive cycles to perform the same computation of a single 16-way dot product using a single hardware vector instruction. Doing this as a single instruction might be faster, because all the multiplies would be in parallel and an adder tree could be built using CSA adders.
It might be faster, but what about latency? :)
It's also not clear what vector size the hw should address, while with MADs + SOA you automatically support any vector size, with execution time and latency proportional to the size of your vectors.
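
A sketch of what I mean, in plain C standing in for 16-wide vector ops (the names are mine): with SOA data you compute 16 independent dot products at once, one MAD per component, so an N-component dot product simply costs N MAD issues no matter what the hardware vector width is.
Code:
/* SOA layout: component c of 16 different 4-vectors lives in a[c][lane].
 * One multiply plus (N-1) MADs across all 16 lanes yields 16 dot
 * products; an N-component vector just takes N of these steps. */
void dot4_soa_16lanes(const float a[4][16], const float b[4][16],
                      float out[16])
{
    for (int lane = 0; lane < 16; ++lane)
        out[lane] = a[0][lane] * b[0][lane];          /* MUL  */
    for (int c = 1; c < 4; ++c)
        for (int lane = 0; lane < 16; ++lane)
            out[lane] += a[c][lane] * b[c][lane];     /* MADs */
}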

Are MADs allowed to be pipelined, or do they have a back-to-back latency penalty?
I worked on architectures where MADs were/are fully pipelined with no back-to-back latency penalties. I guess it's a matter of implementation..

Maybe this highlights the difference of the multiple parallel SIMD lanes of the G80 with the vectors of CPUs? If I understand things correctly, in the G80 context, each thread would do a dot product one element at a time using 32-bit floating point operations. In a vector CPU, it could perform a vector x vector operation and get back a single 32-bit result (something that you wouldn't do that way on a G80). Is that close to right?
Yes.. on G80 each thread probably works on a single component at any given cycle; a CPU (or other GPUs) might use the approach you proposed.

Is this making any sense? This is an honest question, as I'm trying to understand the difference between Larrabee's vectors and the G80s multiple SIMD computations.
I see many similarities here; I wouldn't be surprised if Larrabee doesn't have any dot product instruction :)
 