Larrabee delayed to 2011?

Well, the 10% figure is right there in front of you.

But that's likely inaccurate. And even if it were true, I highly doubt it would have been enough to vastly increase LRB's performance.
 
How about an alternative approach: they run their apps on existing CPUs, profile which bits need perf, and rewrite those bits in OpenCL to target GPUs or whichever LRB is available?

Do you really need x86 in your GPU with this approach? Is this approach in any way worse off than your suggestion? If anything, existing CPUs will blow LRB out of the water when it comes to the serial bits, and those are important for overall perf.
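
To make that workflow concrete, here is a minimal sketch of porting one profiled hot loop. The post suggests OpenCL; the CUDA version below is used purely for brevity, the structure is the same either way, and the loop, names and sizes are all made up.

Code:
// Hypothetical hot loop found by profiling (names are made up):
//   for (int i = 0; i < n; ++i) out[i] = a * x[i] + y[i];
// Ported to a kernel; the serial bits keep running on the CPU.
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float* x, const float* y, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * x[i] + y[i];   // the profiled inner loop, one element per thread
}

void run_hot_loop(int n, float a, const float* d_x, const float* d_y, float* d_out)
{
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    saxpy<<<blocks, threads>>>(n, a, d_x, d_y, d_out);
    cudaDeviceSynchronize();        // serial code resumes on the CPU afterwards
}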

There is a world of difference between optimizing code and adding support for sub-data DMA to and from an accelerator board with a different programming model and requirements, a different ISA, running outside the OS via some driver interface, and a completely different set of development tools, debug tools, libraries, etc.
 
I've been asking that question, and the answers I've gotten indicate that Larrabee can issue one traditional x86 instruction and one VPU instruction per cycle (the exception being that vector stores can run in place of an x86 instruction).
In pure x86 code, it is apparently half the width of Atom.
Intel slides have also been released that seem to support this.

I believe various people have been quoted as saying the LRB1 core effectively follows the U/V issue rules of the Pentium, with the V pipe being fairly restricted.
 
Well, the 10% figure is right there in front of you. Who knows how much other pointless stuff is in there for which we don't have the numbers (cache coherency overhead, anyone?). If you look at the overall bogo-flops/mm2 and bogo-flops/W, LRB1 was hardly any good.

There you go assuming cache coherency is a bad thing. I wonder how long it will take everyone to figure out that memory bandwidth scaling is slowing down in a big, BIG way. The BW per pin is getting closer and closer to its maximum. Coherent caches enable a massive bandwidth reduction when used along with algorithms that partition execution to take advantage of them.

I won't even get into the whole issue of integration into mobile/desktop parts, which is well on its way; those parts have a small fraction of the bandwidth of discrete GPUs.
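
As a rough illustration of "partition execution so the working set stays on chip", here is the textbook tiled multiply, written with CUDA shared memory standing in for the on-chip storage (on an LRB-style part the same reuse would come out of the coherent caches). The tile size, the square-matrix assumption, and the requirement that n be a multiple of TILE are all just to keep the sketch short.

Code:
// Each block stages a TILE x TILE chunk of A and B on chip and reuses it
// TILE times, so DRAM traffic drops by roughly a factor of TILE compared to
// streaming every operand from memory.
#define TILE 16

__global__ void matmul_tiled(int n, const float* A, const float* B, float* C)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                                      // tile is now on chip
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];   // reused TILE times
        __syncthreads();
    }
    C[row * n + col] = acc;
}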
 
I believe various people have been quoted as saying the LRB1 core effectively follows the U/V issue rules of the Pentium, with the V pipe being fairly restricted.

It's restricted to VPU instructions, going by the GDC slide.
If asked to run non-VPU code, Atom would have an issue-width advantage.
 
There you go assuming cache coherency is a bad thing. I wonder how long it will take everyone to figure out that memory bandwidth scaling is slowing down in a big, BIG way. The BW per pin is getting closer and closer to its maximum. Coherent caches enable a massive bandwidth reduction when used along with algorithms that partition execution to take advantage of them.

I won't even get into the whole issue of integration into mobile/desktop parts, which is well on its way; those parts have a small fraction of the bandwidth of discrete GPUs.

You don't need LRB's full cache coherency in order to keep things on die. Fermi is a good example: the L1 caches are not coherent at all, but you can commit data to L2 with a fence and then use atomics to signal that the data is ready.
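
For what it's worth, here is roughly what that pattern looks like in CUDA, modelled on the memory-fence example in the CUDA programming guide. The kernel and variable names are made up, blockDim.x is assumed to be 256, partial needs one slot per block, and done must be zero before launch.

Code:
// Each block publishes a partial sum, __threadfence() pushes it past the
// (non-coherent) L1 to L2, and the last block to arrive -- detected with an
// atomic counter -- reads everyone's results.
__device__ unsigned int done = 0;

__global__ void sum_all(const float* in, int n, volatile float* partial, float* out)
{
    // Per-block reduction into shared memory (blockDim.x assumed to be 256).
    __shared__ float sm[256];
    float s = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        s += in[i];
    sm[threadIdx.x] = s;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) sm[threadIdx.x] += sm[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        partial[blockIdx.x] = sm[0];            // publish this block's result
        __threadfence();                        // commit it to L2 before signalling
        unsigned int ticket = atomicInc(&done, gridDim.x);
        if (ticket == gridDim.x - 1) {          // last block: everyone's data is ready
            float total = 0.0f;
            for (unsigned int b = 0; b < gridDim.x; ++b)
                total += partial[b];            // volatile loads, so nothing stale
            *out = total;
            done = 0;                           // reset for the next launch
        }
    }
}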

Full cache coherency has a host of scalability issues of its own, both in hardware, since LRB-style coherency is very expensive, and in software, since it facilitates the creation of "parallel" programs which execute serially: they just end up serialized in the cache coherency protocol instead of at the program counter.

So I agree with the earlier poster: LRB's cache coherence is expensive and unjustified. Additionally, I've heard several rumors that LRB3, if it ever sees the light of day, will abandon cache coherency in favor of something closer to Fermi's cache architecture.
 
There you go assuming cache coherency is a bad thing. I wonder how long it will take everyone to figure out that memory bandwidth scaling is slowing down in a big, BIG way. The BW per pin is getting closer and closer to its maximum. Coherent caches enable a massive bandwidth reduction when used along with algorithms that partition execution to take advantage of them.

I won't even get into the whole issue of integration into mobile/desktop parts, which is well on its way; those parts have a small fraction of the bandwidth of discrete GPUs.

Caches are nice. I am not sure about the coherency bit. It is ironic that people are trying to put stuff in hw (full coherency) to support things that people are trying to get rid of in sw (with partly or fully functional languages).

Once you partition execution, how are you any different from, say, Fermi?

Bottom line, I'll buy full coherency when I see O(1000) hw threads sharing data at commensurate scale in apps that actually scale. For consumer apps at least, the apps that need parallelism have plenty of it without having to share data at massive scale.
 
There is a world of difference between optimizing code and adding support for sub-data DMA to and from an accelerator board with a different programming model and requirements, a different ISA, running outside the OS via some driver interface, and a completely different set of development tools, debug tools, libraries, etc.

I'll grant you that. But if you are already messaging your app, the jump to OCL over intrinsics/pthreads/TBB isn't too big a deal.

Especially when you consider that you can be 10x more productive writing shader-like code instead of intrinsics.
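
To make the productivity gap concrete, here is the same made-up loop written both ways: once as scalar, shader-like kernel code, once with SSE intrinsics where packing, alignment and the tail loop are the programmer's problem (the tail is omitted to keep it short).

Code:
#include <xmmintrin.h>   // SSE, host side

// "Shader-like" version: scalar code per element, the hardware supplies the SIMD.
__global__ void length2d(int n, const float* x, const float* y, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = sqrtf(x[i] * x[i] + y[i] * y[i]);
}

// Intrinsics version of the same loop: explicit 4-wide packing; the programmer
// owns alignment and the leftover elements (scalar tail loop omitted here).
void length2d_sse(int n, const float* x, const float* y, float* out)
{
    for (int i = 0; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        __m128 v  = _mm_sqrt_ps(_mm_add_ps(_mm_mul_ps(vx, vx),
                                           _mm_mul_ps(vy, vy)));
        _mm_storeu_ps(out + i, v);
    }
}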
 
Fermi is a good example: the L1 caches are not coherent at all, but you can commit data to L2 with a fence and then use atomics to signal that the data is ready.
Sure, but it remains to be seen whether this is a particularly usable programming model. I know very few people (myself *not* included!) who fully understand the actual coherency guarantees made by DirectCompute, for instance, which is a similar model expressed in the SPMD style.

Full cache coherency has a host of scalability issues of its own, both in hardware, since LRB-style coherency is very expensive, and in software, since it facilitates the creation of "parallel" programs which execute serially: they just end up serialized in the cache coherency protocol instead of at the program counter.
Atomics have the same problems and do the same things (indeed they are even slower!), but they are still vitally important to expose. Cache coherence is the same: obviously you don't want to abuse it to fire data all over the chip and serialize with locks and such, but even CPU programmers know that... The ability to tank performance does not imply that coherent caches are not useful or desirable. There are already a million less-obvious ways to destroy performance on GPU architectures :)
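
A small made-up example of the difference between abusing a shared location and using it sparingly; the same logic applies whether the serialization happens in an atomic unit or in a cache coherency protocol. The kernel names are illustrative and counter is assumed to be zeroed before launch.

Code:
// The "tank performance" version: every thread funnels through one word.
__global__ void count_naive(const int* data, int n, int* counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0)
        atomicAdd(counter, 1);      // all threads serialize on a single location
}

// The sane version: combine within the block first, one global atomic per block.
__global__ void count_hierarchical(const int* data, int n, int* counter)
{
    __shared__ int block_count;
    if (threadIdx.x == 0) block_count = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0)
        atomicAdd(&block_count, 1); // contention confined to on-chip memory
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(counter, block_count);
}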

I've learned to accept that what we have now are extremely low level programming models suitable for experts. Thus all of the useful tools and hardware features have to be exposed to write efficient code.

So I agree with the earlier poster: LRB's cache coherence is expensive and unjustified.
Uhh, what are you basing that on? I'd hope that to make a statement like that you've had personal experience with the hardware.

rpg.314 said:
Nitpick, but I don't think this is precisely what you mean :) To address your point, I don't see why data sharing via coherent caches would be any less efficient than doing it explicitly, and it has the same advantage as atomics: it can be applied to data-dependent sharing with low-to-moderate collisions. I was the first to whine about people abusing atomics to write bad code (and they *are* already doing that... in this case it's largely a programming model problem), but there's no doubt that they are a useful feature when used properly.
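
A tiny sketch of that "data-dependent sharing with low-to-moderate collisions" case: a byte histogram, where the bin an element lands in is only known at run time, so the bins can't be statically partitioned among threads. Names are illustrative and bins is assumed to be zeroed before launch.

Code:
// 256 bins, one per byte value.  Atomics (or, on a coherent-cache machine,
// plain shared updates) handle the occasional collision fine; most updates
// hit different bins and proceed in parallel.
__global__ void histogram(const unsigned char* data, int n, unsigned int* bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);   // collides only when nearby values repeat
}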
 
I've learned to accept that what we have now are extremely low level programming models suitable for experts.
When it comes to performance, per se, who else is going to achieve it other than experts?...

I don't think that's being elitist or glib. One only has to look at the thread title to discern how hard veterans find implementing this kind of performance.
 
When it comes to performance, per se, who else is going to achieve it other than experts?...
Right, precisely. My point is you have to expect people to use coherent caches and other features well. "You can screw yourself with them" is not an argument against them, in the same way that it wasn't an argument against atomics (which can screw you even harder).
 
Caches are nice. I am not sure about the coherency bit. It is ironic that people are trying to put stuff in hw (full coherency) to support things that people are trying to get rid of in sw (with partly or fully functional languages).

See, people still don't understand: once you have coherence, it's easy to be incoherent. The opposite isn't true.

Bottom line, I'll buy full coherency when I see O(1000) hw threads sharing data at commensurate scale in apps that actually scale.

By the same token, I'll buy non-coherence when that happens too ;)

For consumer apps at least, the apps that need parallelism have plenty of it without having to share data at massive scale.

Unfortunately, they are quickly running out of parallelism.
 
Atomics have the same problems and do the same things (indeed they are even slower!), but they are still vitally important to expose. Cache coherence is the same: obviously you don't want to abuse it to fire data all over the chip and serialize with locks and such, but even CPU programmers know that... The ability to tank performance does not imply that coherent caches are not useful or desirable. There are already a million less-obvious ways to destroy performance on GPU architectures :)
The difference between atomics and cache coherence is that atomics are explicit and visible in the code you write, whereas cache coherence introduces invisible performance bugs. We need architectures and programming models that push people in the direction of scalable parallel code. Cache coherence is wrong because it facilitates bad code.

Too many programmers parallelize their code and forget to parallelize their data structures because of cache coherence. That's why I consider it a bug, not a feature.
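
A toy contrast of what "parallelizing the data structure" means here (names are made up): in the first kernel every thread grabs the next free slot from one shared cursor, so the whole grid funnels through a single memory location; in the second, each thread owns a fixed output slot decided up front, so no sharing is needed at all and the gaps can be compacted later with a prefix sum.

Code:
// Parallel code, serial data structure: one contended cursor for everyone.
__global__ void collect_shared_cursor(const float* in, int n,
                                      float* out, int* cursor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0.0f) {
        int slot = atomicAdd(cursor, 1);   // every insert bounces off this word
        out[slot] = in[i];
    }
}

// Parallel data structure: thread i owns slot i, writes never collide.
__global__ void collect_partitioned(const float* in, int n,
                                    float* out /* n slots, one per input */)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = (in[i] > 0.0f) ? in[i] : 0.0f;  // sentinel; compact in a later pass
}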
 
See, people still don't understand: once you have coherence, it's easy to be incoherent. The opposite isn't true.
From a software developer's point of view, the opposite is certainly true: once you have partitioned your data structures for parallel execution on a non-coherent processor, it's easy to get parallel scalability on a processor with coherent caches. You've already done the essential work to parallelize your application. On the other hand, once you've parallelized your execution on a coherent processor, you do not necessarily have scalable code, and you have to do more work to partition your data structures.

By the same token, I'll buy non-coherence when that happens too ;)
Non-coherence has already happened (Fermi). My experience programming with non-coherent caches is that they are practical, useful, and perform pretty well. You should give it a try sometime, you might be surprised.
 