Larrabee delayed to 2011?

I wasn't saying that the store capability on the scalar pipe was uncertain (Intel's slides state it outright), rather that software can opt to use it or not.

To be clear, are you saying that Intel has confirmed the possibility to co-issue a vector store with a vector alu op?

Actually, that might be enough to achieve zero-overhead software thread switch. :idea: :smile:
 
IIUC, you are saying that game engine people complain about writing shaders that are performance portable across IHVs. Did I understand your point correctly? I was referring only to GPU ISAs in that comment of mine.
Obviously people don't complain about portable shaders, but even ComputeShader doesn't expose a lot of what these chips can do, and people do complain about having to go to IHV-specific, if not hardware-specific, languages and ISAs to do that. In a world where you had heterogeneous cores on the same die with a shared memory space, they would complain a lot more loudly. The necessity of the abstraction right now is what forgives the current model. FWIW though, yes, it's a pain that I have to compile code, query reflection data and bind parameters so manually at runtime even in the fairly streamlined DX10/11 APIs!

AFAICS, you are looking at a future where there will be say 6 sandy bridge cores and 32 lrb cores on a single die, and your os will be able to kick threads around freely. If that is what you are looking at, then I am afraid that is not going to happen for a very long time.
Regardless of the time frame, it is a desirable end-point and starting with an x86-based part is a step in that direction.

Now you see why a thread cannot be kicked from a sandy bridge core to a lrb core (and vice-versa) without risking SIGILL.
Right now that's obviously not the case unless you have some guarantees about the "modes" or common feature sets that a thread uses. I see no reason why it couldn't be the case in the future.

Even if it goes on die, asking lrb to become binary compatible with an intel cpu is too much.
Why is it too much to ask for some unification of these things in the long run? Do you really want to continue with N different vector ISAs?

cuda has 2 restrictions on pure C,

1) no function pointers
2) no recursion
Uhh and AFAIK a ton of rules to do with statically resolvable shared vs. global memory aliasing, unless something has changed recently.
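
Just to make that concrete (a toy example of my own, not anything from the CUDA docs): even trivial C idioms like dispatching through a function pointer or recursing fall outside what a pre-Fermi kernel can express, and have to be rewritten as switches or explicit loops with a manual stack.

Code:
#include <stddef.h>

/* Plain C: dispatch through a function pointer and a recursive reduce.
   Both are legal C, but neither maps onto a pre-Fermi CUDA kernel,
   which disallows function pointers and recursion in device code; you'd
   rewrite the dispatch as a switch and the recursion as a loop with an
   explicit stack. */

typedef float (*binop_t)(float, float);

static float add_op(float a, float b) { return a + b; }
static float max_op(float a, float b) { return a > b ? a : b; }

/* Recursive pairwise reduction over n elements. */
static float reduce(const float *x, size_t n, binop_t op)
{
    if (n == 1)
        return x[0];
    size_t half = n / 2;
    return op(reduce(x, half, op), reduce(x + half, n - half, op));
}

int main(void)
{
    float data[8] = { 3, 1, 4, 1, 5, 9, 2, 6 };
    float sum = reduce(data, 8, add_op);   /* function-pointer dispatch */
    float top = reduce(data, 8, max_op);
    return (sum > 0 && top == 9) ? 0 : 1;
}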

But this is exactly my point: saying it "supports C" implies you can take a C program and port it easily to CUDA. This is not the case for all but the most trivial C programs, due to the lack of a standard library, a completely different threading and synchronization model, explicitly-managed cache hierarchies, etc. The fact that it can compile C code that doesn't use anything but the basic keywords is, and let me emphasize, completely uninteresting. No programmer worth their salt gives a damn about the syntax of the language... it's the stuff that's completely different between typical C and CUDA that actually matters.

So yeah, it'll be awesome when I can use function pointers and "exception-like things" on Fermi, but let's not pretend that this has any relationship to the flexibility of a typical CPU. You still can't express the most basic producer/consumer or message passing thread models in CUDA, and while this might come with Fermi's "task parallelism" stuff and CUDA 3.0, that remains to be seen.
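
For reference, this is the kind of thing I mean (plain pthreads, nothing exotic, and obviously just a sketch of my own): a blocking single-slot producer/consumer queue that multi-core CPU code takes for granted and that simply has no expression inside a CUDA kernel today.

Code:
#include <pthread.h>
#include <stdio.h>

/* A single-slot blocking queue: the classic CPU producer/consumer pattern
   that has no CUDA-kernel equivalent today. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int slot, full;                     /* slot holds one item when full != 0 */

static void *producer(void *arg)
{
    for (int i = 0; i < 10; ++i) {
        pthread_mutex_lock(&lock);
        while (full)                       /* wait until consumer empties slot */
            pthread_cond_wait(&cond, &lock);
        slot = i;
        full = 1;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }
    return arg;
}

static void *consumer(void *arg)
{
    for (int i = 0; i < 10; ++i) {
        pthread_mutex_lock(&lock);
        while (!full)                      /* wait until producer fills slot */
            pthread_cond_wait(&cond, &lock);
        printf("consumed %d\n", slot);
        full = 0;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }
    return arg;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}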

Don't get me wrong, Fermi looks impressive on paper and a lot of the features that they are adding are desperately needed in the GPU computing space, but I've yet to see any indication that Fermi could - say - run a full OS on it or anything, which is typically associated with the flexibility of a CPU. Also I'm not saying that the typical CPU programming models are ideal going forward (inheriting the "all aliasing" pointer model of C/C++ would be insane, so let's not see these languages as the ideal!), but there are still fundamental differences in the functionality of programming a multi-core CPU vs. a GPU at the moment.
 
Anyone intending to write any significant portions of code for Larrabee using the intrinsics would be insane.

I found porting my rendering code to use the Larrabee intrinsics easier and more fun than trying to work around the global memory latency in CUDA.

Larrabee is best programmed using OpenCL or HLSL.

I don't see how OpenCL can always give you the best possible performance.
 
To be clear, are you saying that Intel has confirmed the possibility to co-issue a vector store with a vector alu op?

Actually, that might be enough to achieve zero-overhead software thread switch. :idea: :smile:

Larrabee is multithreaded, if that's what you mean about thread switch.


Tom Forsyth gave a presentation at GDC 09.

This is what indicated to me that the x86 side of Larrabee is single-issue, and where the ability to issue stores down the scalar pipe was confirmed.

Here is the text from slide 50:

Two pipelines
One x86 scalar pipe, one LNI vector
Every clock, you can run an instruction on each
Similar to Pentium U/V pairing rules
Mask operations count as scalar ops

Vector stores are special
They can run down the scalar pipe
Can co-issue with a vector math op
Since vector math instructions are all load-op, and vector stores co-issue, memory access is very cheap in this architecture
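
To spell out why that matters (my own illustrative mapping, not something from the slides; the mnemonics in the comments are made up, not real LRBni): a 16-wide SAXPY chunk needs one load-op vector multiply-add plus one vector store, so if the store really pairs down the scalar pipe the vector math pipe can in principle stay busy every clock.

Code:
/* Illustrative only: a plain C SAXPY loop, annotated with how it could map
   onto Larrabee's two pipes per the GDC09 slide. Instruction names below are
   descriptive, not real LRBni mnemonics. */
void saxpy(float *y, const float *x, float a, int n)
{
    for (int i = 0; i < n; i += 16) {
        /* Per 16-wide chunk, the vector pipe does one load-op multiply-add:
           roughly "vmadd y[i..i+15] += a * load(x+i)".
           The 16-wide store of y[i..i+15] can then issue down the scalar
           pipe, pairing with the next chunk's vector math op, so in the
           steady state the memory traffic costs no extra issue slots. */
        for (int j = 0; j < 16 && i + j < n; ++j)
            y[i + j] += a * x[i + j];
    }
}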
 
Don't get me wrong, Fermi looks impressive on paper and a lot of the features that they are adding are desperately needed in the GPU computing space, but I've yet to see any indication that Fermi could - say - run a full OS on it or anything, which is typically associated with the flexibility of a CPU.
why would you need to see that? as general-purpose as today's GPGPUs are, they are still designed to tackle a certain class of computational tasks. all the new GP features they've been gaining are just low-hanging fruit lying on the periphery of that domain. but running an OS does not belong there. and since nobody has retired the CPU yet, i don't see why it should. actually, any vendor who tries to sell me a vector cruncher on its merit of being OS-capable will ultimately get a cold shoulder from me.

do you really want to dub LRB as 'the GPU that never ran Word'?
 
why would you need to see that?
I'm not arguing that we need to see that as an end goal necessarily... I'm responding to the comment that Fermi is almost as flexible as a CPU/LRB-like model. It clearly isn't. Whether or not you value the places where it isn't is a separate question, unrelated to the fact that a CPU can still do a whole lot more than Fermi.

The previous discussion about how much you restrict a scheduler when your cores cannot run arbitrary operations is relevant down the road, too. Load balancing and non-hierarchical global schedulers are becoming an increasingly large problem and cores that are flexible enough to run any workload as needed - whether it be Word or graphics - are part of the solution here.
 
Programmability != x86.

Backward compatibility = x86.

Fermi is almost as programmable as Lrb, and it does not have x86.

While it is a variant of the backward compatibility argument, x86 has a huge software infrastructure built around it. I know it won't help you get the vector bits optimized on LRB, but it will allow you to use familiar tools to get started, and port code a lot more easily.

The only analogy I can think of is, umm, not a good one based on the end result, but the 68K in the Atari Jaguar. Once you understood the machine, you almost didn't use it, but it was very handy to run stuff on while you wrapped your head around the other things.

I see x86 in LRB as a way to get a rough bit of code running, and then optimize to the LRB-NI side as you are able with the tools you know and like. On top of that, it can also be seen as a generic low-latency x86 controller on die; use as many as you want for that purpose.

Having one thread per 'quad' running some code to keep things optimized/prefetched/whatever starts to look appealing when you start getting into the plus of graphics plus. For generalized graphics, it is far less useful.

All this said, the whole point of x86 in LRB is undoubtedly to prep for the future where it is on chip, then a core, then just another instruction set that is taken for granted. It was slated for integration on die with Haswell, but that has been pushed out, likely to Haswell+2, though it could be farther out. :(

-Charlie
 
Why is it too much to ask for some unification of these things in the long run? Do you really want to continue with N different vector ISAs?

I want to write kernels in ocl that will be performance portable across vector ISAs so that IHVs can actually innovate on hw without having to worry about binary compatibility.
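
That's the whole appeal, for what it's worth: a kernel like the sketch below (ordinary OpenCL C, nothing vendor-specific, just my own minimal example) says nothing about SIMD width, warps or registers, and each IHV's compiler is free to map it onto whatever vector ISA it ships that year.

Code:
/* Ordinary OpenCL C (a C99 dialect): no vector width, no ISA assumptions.
   NVIDIA, AMD and Intel runtimes each JIT this onto their own hardware. */
__kernel void saxpy(__global float *y,
                    __global const float *x,
                    const float a,
                    const int n)
{
    int i = get_global_id(0);      /* one work-item per element */
    if (i < n)
        y[i] = a * x[i] + y[i];
}

Whether every JIT actually does a good job of that mapping is, of course, exactly the performance-portability worry.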

So yeah, it'll be awesome when I can use function pointers and "exception-like things" on Fermi, but let's not pretend that this has any relationship to the flexibility of a typical CPU. You still can't express the most basic producer/consumer or message passing thread models in CUDA, and while this might come with Fermi's "task parallelism" stuff and CUDA 3.0, that remains to be seen.

Don't get me wrong, Fermi looks impressive on paper and a lot of the features that they are adding are desperately needed in the GPU computing space, but I've yet to see any indication that Fermi could - say - run a full OS on it or anything, which is typically associated with the flexibility of a CPU. Also I'm not saying that the typical CPU programming models are ideal going forward (inheriting the "all aliasing" pointer model of C/C++ would be insane, so let's not see these languages as the ideal!), but there are still fundamental differences in the functionality of programming a multi-core CPU vs. a GPU at the moment.

Alrighty, let's just agree to call fermi a lot closer to lrb than gt200 in terms of programmability.
 
Load balancing and non-hierarchical global schedulers are becoming an increasingly large problem and cores that are flexible enough to run any workload as needed - whether it be Word or graphics - are part of the solution here.
cores that are flexible enough to run everything are abundant on the market - those are usually referred to as CPUs : ) by stating that you want to have a homogeneous config of such cores (so you would not need to discriminate in your scheduler) you basically negate all the advantages an 'inherent knowledge of the task domain (aka specialization for a given task domain)' would give you - which is all (GP)GPUs have been doing for the past 15-20 years.

the thing is, i don't want to send the same workload, indiscriminately, to my VPU transistors as the one i send to my CPU transistors, or to my god-knows-what-domain-accelerator-we'll-come-up-with-next. or to put it as simply as possible, i want my vector-crunchers to be best at vector-crunching. massive, massive loads of that. now, whether there would be domain-specific schedulers in there - sure, why not. actually, there'd better be such. but not across the domains - there i don't need it. at all. i don't want it. and i won't pay for it ; )
 
Load balancing and non-hierarchical global schedulers are becoming an increasingly large problem and cores that are flexible enough to run any workload as needed - whether it be Word or graphics - are part of the solution here.

This is, I think, the key point of disagreement amongst us.

Sure, you can run a thread of Word 97 (even 2007 if it is written in a JITed/interpreted language) on a lrb core. But lrb is not meant to do that. Running Word 2007 on a throughput-optimized core is a cure worse than the disease. An in-order, single-issue x86 core with a tiny cache (accessing the L2 of a remote core has higher latency) makes for laughable performance if you are the competition. It makes for really long coffee breaks if you have to use it yourselves.

We already have cores that can run both Word and graphics. They are called CPUs. :p Then why do you think the world+dog is migrating their throughput-sensitive workloads to GPUs? The point of GPUs is to give up on the former and specialize for the latter. It is with this specialization that GPUs get an order of magnitude better perf/W/mm on graphics.

Because any core that is optimized to run Word will suck at graphics and vice versa, no two ways about it. Word and graphics are different workloads, and different kinds of hw architectures are needed to run them fast. This is a fundamental tradeoff and programmers don't have a vote in this. Familiarity with the old and gold doesn't count for squat, any more than familiarity with existing serial code bases helps when you have to run them efficiently on many-core servers. They'll have to adapt. The code will have to grow layers so that latency-sensitive and throughput-sensitive bits can live together.

Just because I can run a thread of mysql on my Intel GPU, doesn't mean that it makes any sense whatsoever for me to run it there. It is a luxury, which definitely has an impact on perf/mm/W, without any obvious benefits. There may be unanticipated benefits down the line, but at the moment I can see only marketing BS as the reason for putting x86 into LRB.
 
I found porting my rendering code to use the Larrabee intrinsics easier and more fun than trying to work around the global memory latency in CUDA.
That's a memory-architecture-related problem. Not ISA-specific.

I don't see how OpenCL can always give you the best possible performance.
I think Captain Obvious would agree with that. ;)
Nobody is arguing that there won't be ways to extract better performance than with OpenCL. It's a question of !/$.
 
Running Word 2007 on a throughput-optimized core is a cure worse than the disease. An in-order, single-issue x86 core with a tiny cache (accessing the L2 of a remote core has higher latency) makes for laughable performance if you are the competition.
I've been wondering about that myself: how would a single-threaded program using only x86 instructions on a single LRB core compare against, say, a single-core ATOM?

Is it unreasonable to assume that the ATOM will easily be 2x faster?
 
While it is a variant of the backward compatibility argument, x86 has a vastly huge software infrastructure built around it. I know it won't help you get the vector bits optimized on LRB, but it will allow you to use familiar tools to get started, and port code a lot more easily.

>99% of non-Intel devs will use *tools* like dx/ogl/ocl to program lrb. And they don't need x86 for that. I think this tools+environment+ISA fetish comes from a time when down-to-ISA compilers were in vogue. Those times are gone and they are not coming back. Vendor-supplied JIT compilers (helped in no small measure by constrained/implicitly functional languages and -ffast-math :smile:) are the tools of today. And they don't need x86.
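
To be concrete about what "vendor-supplied JIT" means in practice, here is a bare-bones host-side sketch of my own (error handling elided, and build_saxpy is just a made-up helper name): the kernel ships as source text, and the driver's compiler targets whatever ISA sits underneath, x86 or not.

Code:
#include <CL/cl.h>

/* Minimal host-side JIT flow, error checks elided for brevity:
   the kernel travels as source text and the vendor driver compiles it
   at runtime for whatever device (and ISA) happens to be present. */
static const char *src =
    "__kernel void saxpy(__global float *y, __global const float *x,\n"
    "                    const float a, const int n) {\n"
    "    int i = get_global_id(0);\n"
    "    if (i < n) y[i] = a * x[i] + y[i];\n"
    "}\n";

cl_kernel build_saxpy(cl_context ctx, cl_device_id dev)
{
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, "", NULL, NULL);   /* the vendor JIT runs here */
    return clCreateKernel(prog, "saxpy", NULL);
}

Nothing at this level ever asks whether the device speaks x86.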

You are pretty good at coming up with insider info, Charlie ;). I hope you will dig up numbers on the percentage of non-Intel devs using non-device-agnostic tools to write code, say 3 years after lrb launches. You might be surprised.

I see x86 in LRB as a way to get a rough bit of code running, and then optimize to the LRB-NI side as you are able with the tools you know and like. On top of that, it can also be seen as a generic low latency x86 controller on die, use as many as you want for that purpose.

Again, almost no one outside Intel is going to use lrb-ni for a long time.
 
I've been wondering about that myself: how would a single-threaded program using only x86 instructions on a single LRB core compare against, say, a single-core ATOM?

Is it unreasonable to assume that the ATOM will easily be 2x faster?

Dunno about the speedup, but due to its larger cache and dual issue :smile: it is definitely going to be faster.
 
That's a memory-architecture-related problem.

I know. I was just putting the difficulty of programming with intrinsics in context with the harder issues of gpu programming. I've spent far longer trying to tease out better performance because of the memory architecture than I did writing intrinsics code.

I think Captain Obvious would agree with that. ;)
Nobody is arguing that there won't be ways to extract better performance than with OpenCL. It's a question of !/$.

I had an issue with it being described as 'best' without qualification. For some people performance is everything.
 