Larrabee: Samples in Late 08, Products in 2H09/1H10

I thought warps had 16 "threads"? The G80's units are 8-wide SIMD cores (as you say), and I thought it double-pumped them (not quad-pumped). I guess it isn't a big deal either way.

Actually, you are half right: a half-warp is 16 threads, and the core architecture currently seems to be double-pumped, with 16 banks for memory access. In CUDA, though, the warp size is 32 threads, and fragment programs also go 32 pixels at a time. Also, according to the CUDA docs, it seems a move to 32 banks for memory access is in the works for future hardware.

I also agree that, with the type of GP programs Larrabee is designed to run well, there should be no problem keeping the ALUs busy.

One thing I'm wondering about with Larrabee is what they are going to do on the driver side for scheduling tasks on the chip's cores.

With CUDA, currently all programs (kernels) are serialized. I would assume that programs do overlap as one finishes and the next starts, but I haven't seen anything that verifies this (it just seems it would be stupid not to). Not being able to run multiple programs in parallel is a serious limitation for GPGPU stuff, and a serious advantage for Larrabee. Also I think, but could be very wrong, that the graphics drivers don't run CUDA and DX/GL in parallel, and have to switch between those two modes of computation.

For graphics, it seems as if the drivers overlap the vertex and fragment stages by starting the fragment stage when they can be sure that the fragment stage won't wait on vertex input. However, and perhaps someone who actually knows could correct me here, I'm thinking that, like CUDA, draw calls are serialized, with overlap between different programs only at the end and the beginning. And would this only be overlap between different stages of two different programs, or, for example, do two different fragment programs ever run in parallel?

Right now GPGPU stuff using DX/GL is limited by having to divide algorithms into very large batch chunks of fully parallel operations. Even if CUDA gets double precision and read/write access to a 2D surface cache, it seems like it's going to need to be able to run multiple kernels in parallel to be competitive with Larrabee in terms of GPGPU usefulness.

Larrabee's biggest strength in my eyes is being able to easily intermix lots of GPGPU and GPU stuff in parallel. Just the thought of being able to spawn new tasks from code running on the GPU is awesome.
 
CELL SPEs support dynamic branch hints, which give the experienced programmer the opportunity to never 'mispredict' a branch and to not pay any cost for it...

Or why not use a dynamic branch predictor and be right 99% of the time without worrying about it? ;)

...as long as you can have enough free instructions to cover the branch hint latency.

What about when you don't? Perhaps in GPU/GPGPU programming you can cover the branch hint latency, but what if once in a while you don't?

A hardware branch predictor does have a hardware area cost (and some power cost), but it can be really accurate. Most ISAs have some sort of branch hints and such, but experience has shown that in most cases, a good branch predictor can do a better job than the branch hints inserted by the programmer.

The best argument I can think of against a branch predictor is if you have some hard real-time constraints (and not just the soft real-time constraints of games). If you really want a deterministic execution time, forgoing a branch predictor and using scratchpad (non-cache) memory is exactly the way to go. This, of course, is exactly what Cell's SPEs do.
 
Or why not use a dynamic branch predictor and be right 99% of the time without worrying about it? ;)
1) It takes more area.
2) 99% is wishful thinking; try to traverse some irregular spatial subdivision structure and tell me what prediction rate you get. With an SPU you can get 100%.
It's not like the people who designed the SPUs are idiots or anything; they just made certain trade-offs given the timeframe in which CELL was developed. That little thing contains 8 + 1 cores.
 
2) 99% is wishful thinking; try to traverse some irregular spatial subdivision structure and tell me what prediction rate you get. With an SPU you can get 100%.

Try hiding the branch hint latency while traversing some irregular spatial subdivision structure. Sounds like it would be hard to find enough independent work to hide the branch hint latency. Finally, even hard-to-predict programs like a compiler (the SPEC benchmark GCC, for example) that do lots of graph structure manipulation have branch prediction accuracies in the high 90s.
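To make "irregular spatial subdivision structure" a bit more concrete, here is a minimal plain-C sketch (node layout and names are mine, purely for illustration) of the kind of data-dependent branch being argued about. Which way it goes depends entirely on the query point and the tree, so there is little for either a static hint or a simple predictor to latch onto; the question is just whether a predictor's misses end up cheaper than unhidden hint latency.

```c
#include <stdio.h>

/* Hypothetical kd-tree node: split axis, split position, child indices.
 * (Layout and names are made up for illustration.)                      */
struct kd_node {
    int   axis;       /* 0 = x, 1 = y, 2 = z                     */
    float split;      /* position of the splitting plane          */
    int   child[2];   /* children; both -1 marks a leaf           */
};

/* Walk down to the leaf containing point p.  The branch implied by
 * "p[axis] >= split" is data-dependent: its direction changes from node
 * to node and from query to query, which is exactly the case where both
 * static hints and simple predictors struggle.                          */
static int find_leaf(const struct kd_node *nodes, int root, const float p[3])
{
    int i = root;
    while (nodes[i].child[0] != -1) {                       /* internal node   */
        int go_right = p[nodes[i].axis] >= nodes[i].split;  /* hard to predict */
        i = nodes[i].child[go_right];
    }
    return i;
}

int main(void)
{
    /* Tiny three-node tree: the root splits on x at 0.5. */
    struct kd_node nodes[3] = {
        { 0, 0.5f, {  1,  2 } },   /* root       */
        { 0, 0.0f, { -1, -1 } },   /* left leaf  */
        { 0, 0.0f, { -1, -1 } },   /* right leaf */
    };
    float p[3] = { 0.7f, 0.1f, 0.2f };
    printf("leaf = %d\n", find_leaf(nodes, 0, p));          /* prints "leaf = 2" */
    return 0;
}
```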

Recall also that Larrabee does have four threads, so (unlike Cell) it can just fill in with another thread if there is a mispredicted branch once in a while.

It's not like the people who designed the SPUs are idiots or anything; they just made certain trade-offs given the timeframe in which CELL was developed. That little thing contains 8 + 1 cores.

OK, I wouldn't exactly call the CELL guys idiots. But I disagree with the choices they made. Even inside IBM there is disagreement on how to do things. Which would you call a better design: the three-core Xenon processor used in the Xbox 360 (which shipped earlier) or the Cell processor? They were both made by IBM. Given a choice, I know which design (and which design team) I would pick.

FYI: The Xenon does have a branch predictor. ;) From the article on Xenon: "The branch unit includes a 4KB two-way set-associative Branch History Table per thread".

Any Xbox 360 or PS3 developers out there care to comment?
 
Try hiding the branch hint latency while traversing some irregular spatial subdivision structure. Sounds like it would be hard to find enough independent work to hide the branch hint latency. Finally, even hard-to-predict programs like a compiler (the SPEC benchmark GCC, for example) that do lots of graph structure manipulation have branch prediction accuracies in the high 90s.
On what architectures?
I haven't seen Xenon being profiled on SPEC for branch accuracy, and the architectures that are commonly profiled have much more hardware devoted to multiple branch predictors with many entries.
It may be expecting too much to get an accuracy rate that high for a slimmer predictor.

Recall also that Larrabee does have four threads, so (unlike Cell) it can just fill in with another thread if there is a mispredicted branch once in a while.
I don't see how.
The chip doesn't know whether a predicted branch was predicted correctly until after the branch is resolved. The instructions issued can't be preempted by other threads without negating the point of having a branch predictor.


FYI: The Xenon does have a branch predictor. ;) From the article on Xenon: "The branch unit includes a 4KB two-way set-associative Branch History Table per thread".

If a similar setup is used for Larrabee, that's 32 KB for four threads: in the same league as the L1 data cache.
edit: whoops, misread the entry, that should be 16 KB.
 
If a similar setup is used for Larrabee, that's 32 KB for four threads: in the same league as the L1 data cache.

Yeah, I'm thinking branch prediction is a no-go. Besides, with x86 you have the 2Eh "branch not taken" hint and 3Eh "branch taken" hint instruction prefixes for Jcc (conditional branch) instructions...
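For what it's worth, the usual way to feed a static hint today is through the compiler rather than by writing the prefixes by hand. A minimal C sketch with GCC/Clang's __builtin_expect (whether the compiler actually emits the 2Eh/3Eh prefixes or just lays the code out so the expected path falls through is entirely up to it):

```c
#include <stdio.h>

/* Tell the compiler which way we expect the branch to go.  With GCC/Clang
 * this mostly affects code layout (the expected path falls through); any
 * use of the 2Eh/3Eh prefixes is entirely the compiler's choice.         */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int process(int value)
{
    if (unlikely(value < 0)) {      /* rare error path */
        fprintf(stderr, "bad value %d\n", value);
        return -1;
    }
    return value * 2;               /* common path, laid out to fall through */
}

int main(void)
{
    printf("%d\n", process(21));
    return 0;
}
```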
 
Scratch that, I misread the 2-way associative part as 2x4 KB per thread.

It's 16 KB for 4 threads, or half the data cache, if the latter is 32 KB as the Larrabee slides showed.
It's a notable hardware investment, but not as bad as I wrote previously.
 
Or why not use a dynamic branch predictor and be right 99% of the time without worrying about it? ;)
That's wildly optimistic ...

I don't see why all architectures don't have some way to allow programmers to get data-dependent branches right when they have the data available in time. Something like a software-loadable BTB is an almost zero-cost addition.

Speculative execution is nice, but real execution is always better.
 
Yeah, I'm thinking branch prediction is a no-go. Besides, with x86 you have the 2Eh "branch not taken" hint and 3Eh "branch taken" hint instruction prefixes for Jcc (conditional branch) instructions...

Branch hints are all nice and dandy when you have the programming resources to go optimize your loops and branches. The whole philosophy behind Larrabee is to have the hardware do the mundane stuff. Why bother programmers with branch prediction when the hardware is pretty good at it at minimal cost?
Intel has the best branch predictors in the industry; I don't see them not using them in Larrabee.
I think IBM made some great design choices with the SPU, but I personally don't agree with their total lack of branch prediction (or even a branch snoop as a bare minimum!). Come on, who likes to predict unconditional branches, please.
 
Predictors on the big cores can get pretty large.

K8 has something on the order of 16k saturating counters in its branch history table. The counters alone come to 4 KB.

I don't have any numbers on Conroe, but if it is descended from the P4 branch prediction hardware, it would have over four thousand entries in its branch target buffer.
Depending on how things sort out with the P4, the final iteration had in excess of 6 thousand entries in the various tables.

That is to the exclusion of a number of indirect predictors, loop detectors, and return stack entries.

The best Intel can do for branch predictors may be excessive.
 
I guess K8 was a bad example then.
Would it have mattered if AMD had just doubled the link width, if other constraints were ignored?

The K8 has a *really* chatty coherence protocol. Basically, a core sends a "probe" to all other cores, and then each of those cores responds with a probe response. There is no combining or anything, so every miss generates 2*(N-1) control messages. If you have N processors all generating misses, the traffic on the interconnect becomes O(N^2). Not good. This is fine for two cores or even four cores, but it really doesn't work well for eight (as they had originally hoped it would).
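To put toy numbers on that scaling (this just counts the 2*(N-1) probe/response messages described above; a real protocol has more message types than this):

```c
#include <stdio.h>

/* Toy model of broadcast-based coherence traffic: each miss probes all
 * other cores and each of them answers, i.e. 2*(N-1) control messages
 * per miss.  With all N cores missing at once, traffic grows as N^2.   */
int main(void)
{
    for (int n = 2; n <= 16; n *= 2) {
        long per_miss    = 2L * (n - 1);
        long all_missing = (long)n * per_miss;   /* every core misses at once */
        printf("%2d cores: %3ld msgs per miss, %4ld msgs if all cores miss\n",
               n, per_miss, all_missing);
    }
    return 0;
}
```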

From what I've heard, one variant of Intel's new CSI/Quickpath Interconnect does something similar for small systems (but CSI uses a directory for systems with more sockets).

Larrabee's uber flexibility advantage its proponents shout from the rooftops would also seem to run counter to the idea that it must only be task-oriented.

A badly written parallel program will run badly under any architecture. Larrabee supports a wide range of parallel models pretty well, but if a program does something stupid like rely on the OS scheduler to multiplex N software threads over T hardware threads (where N > T), then of course the program is doomed. (Note: I'm talking about parallel here, and not the use of threading to hide disk latency or wide-area network latency).
 
Larrabee's biggest strength in my eyes is being able to easily intermix lots of GPGPU and GPU stuff in parallel. Just the thought of being able to spawn new tasks from code running on the GPU is awesome.

Certainly spawning new tasks from an existing task is something that Larrabee's software/runtime system could easily support. It does seem to open up some interesting possibilities.
 
I don't see how.
The chip doesn't know whether a predicted branch was predicted correctly until after the branch is resolved. The instructions issued can't be preempted by other threads without negating the point of having a branch predictor.

A branch misprediction has two costs: (1) the work already performed in the pipeline that needs to be thrown away, and (2) the time it takes to re-direct fetch, access the instruction memory, decode the new instructions, etc.

You're totally right about #1. It is cost #2 that multithreading can help with. In many in-order pipelines the penalty is divided about evenly between these two costs, so multithreading can hide perhaps half the latency of a branch misprediction. But you're right in that I implied the multiple threads could hide all the branch misprediction latency, which isn't the case with "switch on event" style threading.

Something that has been proposed in the research community is "confidence prediction". The idea is that the processor can learn over time which branches it has a hard time predicting. If a specific branch is too hard to predict, the processor could just switch threads whenever it encounters it. It might not always help, but it could mitigate a poor branch predictor (at the cost of some bits to make the confidence prediction).
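A minimal sketch of the idea in C (my own toy encoding, not any shipping design): keep a small saturating counter per branch that tracks how often the predictor has been right lately, and prefer switching threads to speculating when confidence is low.

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy confidence estimator: one 2-bit saturating counter per table entry,
 * indexed by (hashed) branch PC.  High count = the predictor has been
 * right recently, so speculate; low count = this branch burns us often,
 * so prefer to switch to another ready thread instead.                   */
#define ENTRIES 256
static unsigned char conf[ENTRIES];            /* counters start at 0 (low) */

static unsigned idx(unsigned pc) { return (pc >> 2) % ENTRIES; }

/* Called when the branch outcome is known. */
void confidence_update(unsigned pc, bool prediction_was_correct)
{
    unsigned i = idx(pc);
    if (prediction_was_correct) {
        if (conf[i] < 3) conf[i]++;
    } else {
        conf[i] = 0;                           /* reset hard on a miss */
    }
}

/* Called at fetch: speculate past this branch, or switch threads? */
bool confident_enough_to_speculate(unsigned pc)
{
    return conf[idx(pc)] >= 2;
}

int main(void)
{
    unsigned pc = 0x1000;
    for (int k = 0; k < 4; k++) confidence_update(pc, true);
    printf("speculate? %d\n", confident_enough_to_speculate(pc));  /* 1 */
    confidence_update(pc, false);
    printf("speculate? %d\n", confident_enough_to_speculate(pc));  /* 0 */
    return 0;
}
```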

Of course, then you can have a predictor to predict the confidence of your confidence predictor. And then a predictor to predict the confidence of your prediction of your confidence predictor. ;) And so on. I think I'll stop now.
 
The best Intel can do for branch predictors may be excessive.

Yes, Intel, AMD, and others have gone wild with branch predictors (all the major desktop processors have done so). These machines are pretty sensitive to the branch predictor.

I agree that 4KBs per thread is pretty excessive (especially for a four-thread machine). Yet, you can do pretty well with less state as well.

So, why are branches so predictable, at least in CPU codes? First, if you guess randomly, you get 50% accuracy. Once you include unconditional branches, procedure calls, returns, and loop back edges, you're maybe at 75%. If you include additional "highly biased" branches, ones that almost always go one way or the other every time, you get even more. Examples include branches for error handling or based on runtime configuration that disables some program feature. I think Intel also has a loop count predictor, so if a loop always executes the same number of times, you don't get a mispredict on the last iteration. On top of that, you can start applying all the branch history and correlation stuff that really improves the prediction rate. I realize it is hard to believe, but CPUs are surprisingly good at branch prediction.
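As a toy illustration of how far even the simplest scheme gets you on biased branches and loop back edges, here is the classic two-bit saturating-counter (bimodal) predictor in C (table size and hashing are made up). Simulating a loop back edge that is taken 99 times out of 100 already lands around 99% accuracy on that one branch:

```c
#include <stdbool.h>
#include <stdio.h>

/* Classic bimodal predictor: a 2-bit saturating counter per entry.
 * Counter values 0,1 => predict not-taken; 2,3 => predict taken.   */
#define ENTRIES 1024
static unsigned char ctr[ENTRIES];             /* all start at 0 (not-taken) */

static unsigned idx(unsigned pc) { return (pc >> 2) % ENTRIES; }

static bool predict(unsigned pc) { return ctr[idx(pc)] >= 2; }

static void update(unsigned pc, bool taken)
{
    unsigned i = idx(pc);
    if (taken) { if (ctr[i] < 3) ctr[i]++; }
    else       { if (ctr[i] > 0) ctr[i]--; }
}

int main(void)
{
    /* Simulate the back edge of a 100-iteration loop entered 1000 times:
     * taken 99 times, then not-taken once, per loop instance.            */
    unsigned pc = 0x4000;
    long correct = 0, total = 0;
    for (int run = 0; run < 1000; run++) {
        for (int iter = 0; iter < 100; iter++) {
            bool taken = (iter != 99);          /* exit on the last iteration */
            if (predict(pc) == taken) correct++;
            update(pc, taken);
            total++;
        }
    }
    printf("accuracy: %.1f%%\n", 100.0 * correct / total);  /* about 99% */
    return 0;
}
```

A real predictor layers history bits and correlation on top of this, which is where the extra accuracy on the harder branches comes from.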

To be really concrete, the YAGS predictor from several years ago shows 92% prediction rate for a 1KB predictor on the SPECint95 benchmark suite. This paper puts GCC's prediction rate at about 90% for a 1KB predictor, so my comment on "high 90s% for GCC" above was wrong (or at least exaggerated). Since YAGS was published, the predictors have continued to improve, so things should have gotten even a bit better. Of course, I'll let you decide how relevant SPECint95 is to workloads you actually care about.

Now, GPU codes might be very different. But as we've been discussing, GPU code and CPU code are perhaps coming to look more similar all the time... but what would they look like today? I really have no data on which to base a conjecture. Actually, comparing the branch prediction rates of GPU and CPU workloads would make for a nice academic paper to write. Nothing like cold hard data to settle an argument. :)
 
I realize it is hard to believe, but CPUs are surprisingly good at branch prediction.
Better than they need to be ...

Being able to finally get zero-overhead looping working (well, at least from the second iteration through the loop, as long as the history doesn't get overwritten, of course) after throwing that much chip real estate at it is not exactly a shining endorsement of the x86 ISA (or all the other decades-old ISAs which Moore's law managed to keep whipping forward).
Having to fix a brain-dead ISA with transistors is why straight x86 just isn't very nice for this kind of chip ... as long as they are changing the ISA for vector processing they might as well change the branch handling a bit too, so it can make do with branch prediction that is a little less complex.
 
Come on, who likes to predict unconditional branches, please.

I just think the area used for conditional branch prediction would be better spent on something else. Besides, Larrabee is in-order, right, and probably not going to have a hugely deep pipeline?

Speaking of "UNconditional branches", still think a hardware return stack predictor would be a good idea (low area requirement right?). Also the PowerPC's dedicated loop counter is not a bad idea either, trivial to predict loops (but not found in x86). Add this to predicated instructions (or not as good, x86's conditional moves), branch hints (or not as good, just the default prediction of backwards is taken and forward is skipped), and you have most of the important performance critical branches covered (or avoided), especially for GPU style programs.
 
...as long as they are changing the ISA for vector processing they might as well change the branch handling a bit too, so it can make do with branch prediction that is a little less complex.

IPF (aka IA-64, aka Itanium) is one of the most recent major ISAs to be designed. It had the mantra that smart compilers can do all sorts of things. They designed the ISA for general predication. They added lots of compiler hints for cache locality. They expected the compiler to do full wide-issue static scheduling.

Even with all this advanced support, Itanium still has branch prediction.

It also really isn't clear if much of Itanium's advanced compiler-based speculation and such really helps much at all. Sure, x86 is a wart and hard to make go fast. But thus far nobody has come up with anything that is really a "better" ISA than a standard RISC ISA. With the exception of SIMD vector instructions and conditional move instructions, there hasn't been much real innovation in actual instruction set architecture since the 1980s. In fact, most of the "branch delay slots" and annulled branching in some of the earliest RISC ISAs just look like big warts at this point. The biggest advances in the last few decades have been VLIW-related, and they haven't been very successful (except in the embedded/DSP space).

And believe me, it isn't for lack of trying on the part of the academic (and industrial) research communities.
 
...branch hints (or, not as good, just the default prediction of backwards taken and forwards not taken), and you have most of the important performance-critical branches covered (or avoided), especially for GPU-style programs.

Branch prediction is really two things: (1) directional prediction and (2) next-fetch prediction. With a modern high-frequency design, the processor likely can't even decode the instruction before it needs to decide where to fetch the next instruction. So, to avoid the fetch redirection penalty of a taken branch, you have to know the target in advance.

So hints in the instructions themselves might not be enough to hide all the penalty of a taken branch; a branch target buffer (BTB) or a next-line predictor can help here. Even small-ish predictors (say, just 2% of the size of the L1 cache) can make a big difference.
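Here is a minimal sketch of what a direct-mapped BTB buys you (toy sizes and hashing, all names mine): at fetch time you look up the current PC and, on a hit, redirect fetch to the stored target right away, without waiting for decode:

```c
#include <stdio.h>

/* Toy direct-mapped branch target buffer: at fetch time, look up the
 * current PC; on a hit, start fetching from the stored target next
 * cycle, before the instruction has even been decoded.                */
#define BTB_ENTRIES 512

struct btb_entry {
    unsigned tag;        /* branch PC (0 = invalid)       */
    unsigned target;     /* where it jumped to last time  */
};
static struct btb_entry btb[BTB_ENTRIES];

static unsigned idx(unsigned pc) { return (pc >> 2) % BTB_ENTRIES; }

/* Fetch stage: where should the next fetch go? */
unsigned next_fetch(unsigned pc)
{
    const struct btb_entry *e = &btb[idx(pc)];
    if (e->tag == pc)
        return e->target;        /* predicted-taken branch: redirect now */
    return pc + 4;               /* default: fall through                */
}

/* Execute stage: once a taken branch resolves, remember its target. */
void btb_update(unsigned pc, unsigned target)
{
    btb[idx(pc)].tag = pc;
    btb[idx(pc)].target = target;
}

int main(void)
{
    unsigned branch_pc = 0x2000, loop_top = 0x1f00;
    printf("before: 0x%x\n", next_fetch(branch_pc));   /* 0x2004, falls through */
    btb_update(branch_pc, loop_top);
    printf("after:  0x%x\n", next_fetch(branch_pc));   /* 0x1f00, redirected    */
    return 0;
}
```

A next-line predictor does roughly the same job one step earlier, predicting the next fetch line rather than an exact target.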

Finally, I have to admit that the "prepare to branch" instructions in the Cell SPE can also address this issue. For branches that you really do know in advance, it does seem that a prepare to branch instruction makes good sense. Yet, if you're wrong (and don't prepare to branch correctly), I recall the penalty on the Cell was shockingly huge (but I don't have the actual number in front of me right now).
 