Larrabee delayed to 2011?

Well, the 10% figure is right there in front of you.

But that's likely inaccurate. And even if it were true, I highly doubt it would have been enough to vastly increase LRB's performance.
 
How about an alternative approach: they run their apps on existing CPUs, profile which bits need perf, and rewrite those bits in OpenCL to target GPUs or whichever LRB is available?

Do you really need x86 in your GPU with this approach? Is this approach in any way worse off than your suggestion? If anything, existing CPUs will blow LRB out of the water when it comes to the serial bits, and those are important for overall perf.
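
To make that workflow concrete, here is a minimal sketch of porting one profiled hot loop. The post suggests OpenCL; the CUDA version below is used purely for brevity, the structure is the same either way, and the loop, names and sizes are all made up.

Code:
// Hypothetical hot loop found by profiling (names are made up):
//   for (int i = 0; i < n; ++i) out[i] = a * x[i] + y[i];
// Ported to a kernel; the serial bits keep running on the CPU.
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float* x, const float* y, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * x[i] + y[i];   // the profiled inner loop, one element per thread
}

void run_hot_loop(int n, float a, const float* d_x, const float* d_y, float* d_out)
{
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    saxpy<<<blocks, threads>>>(n, a, d_x, d_y, d_out);
    cudaDeviceSynchronize();        // serial code resumes on the CPU afterwards
}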

There is a world of difference between optimizing code and adding support for sub-data DMA to and from an accelerator board with a different programming model and requirements, a different ISA, running outside the OS via some driver interface, and a completely different set of development tools, debug tools, libraries, etc.
 
I've been asking that question, and the answers I've gotten indicate that Larrabee can issue one traditional x86 instruction and one VPU instruction per cycle (the exception being that vector stores can run in place of an x86 instruction).
In pure x86 code, it is apparently half the width of Atom.
Intel slides have also been released that seem to support this.

I believe various people have been quoted as saying the LRB1 core effectively follows the U/V issue rules of the Pentium, with the V pipe being fairly restricted.
 
Well, the 10% figure is right there in front of you. Who knows how much other pointless stuff is in there for which we don't have the numbers (cache coherency overhead, anyone?). If you look at the overall bogo-flops/mm2 and bogo-flops/W, LRB1 was hardly any good.

There you go assuming cache coherency is a bad thing. I wonder how long it will take everyone to figure out that memory bandwidth scaling is slowing down in a big, BIG way. The BW per pin is getting closer and closer to its maximum. Coherent caches enable a massive bandwidth reduction when used along with algorithms that partition execution to take advantage of them.

I won't even get into the whole issue of integration into mobile/desktop parts, which is well on its way; those parts have a small fraction of the bandwidth of discrete GPUs.
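
As a rough illustration of "partition execution so the working set stays on chip", here is the textbook tiled multiply, written with CUDA shared memory standing in for the on-chip storage (on an LRB-style part the same reuse would come out of the coherent caches). The tile size, the square-matrix assumption, and the requirement that n be a multiple of TILE are all just to keep the sketch short.

Code:
// Each block stages a TILE x TILE chunk of A and B on chip and reuses it
// TILE times, so DRAM traffic drops by roughly a factor of TILE compared to
// streaming every operand from memory.
#define TILE 16

__global__ void matmul_tiled(int n, const float* A, const float* B, float* C)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                                      // tile is now on chip
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];   // reused TILE times
        __syncthreads();
    }
    C[row * n + col] = acc;
}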
 
I believe various people have been quoted as saying the LRB1 core effectively follows the U/V issue rules of the Pentium, with the V pipe being fairly restricted.

It's restricted to VPU instructions, going by the GDC slide.
If asked to run non-VPU code, Atom would have an issue-width advantage.
 
There you go assuming cache coherency is a bad thing. I wonder how long it will take everyone to figure out that memory bandwidth scaling is slowing down in a big, BIG way. The BW per pin is getting closer and closer to its maximum. Coherent caches enable a massive bandwidth reduction when used along with algorithms that partition execution to take advantage of them.

I won't even get into the whole issue of integration into mobile/desktop parts, which is well on its way; those parts have a small fraction of the bandwidth of discrete GPUs.

You don't need LRB's full cache coherency in order to keep things on die. Fermi is a good example: the L1 caches are not coherent at all, but you can commit data to L2 with a fence and then use atomics to signal that the data is ready.
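
For what it's worth, here is roughly what that pattern looks like in CUDA, modelled on the memory-fence example in the CUDA programming guide. The kernel and variable names are made up, blockDim.x is assumed to be 256, partial needs one slot per block, and done must be zero before launch.

Code:
// Each block publishes a partial sum, __threadfence() pushes it past the
// (non-coherent) L1 to L2, and the last block to arrive -- detected with an
// atomic counter -- reads everyone's results.
__device__ unsigned int done = 0;

__global__ void sum_all(const float* in, int n, volatile float* partial, float* out)
{
    // Per-block reduction into shared memory (blockDim.x assumed to be 256).
    __shared__ float sm[256];
    float s = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        s += in[i];
    sm[threadIdx.x] = s;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) sm[threadIdx.x] += sm[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        partial[blockIdx.x] = sm[0];            // publish this block's result
        __threadfence();                        // commit it to L2 before signalling
        unsigned int ticket = atomicInc(&done, gridDim.x);
        if (ticket == gridDim.x - 1) {          // last block: everyone's data is ready
            float total = 0.0f;
            for (unsigned int b = 0; b < gridDim.x; ++b)
                total += partial[b];            // volatile loads, so nothing stale
            *out = total;
            done = 0;                           // reset for the next launch
        }
    }
}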

Full cache coherency has a host of scalability issues of its own, both in hardware, since LRB-style coherency is very expensive, and in software, since it facilitates the creation of "parallel" programs which execute serially: they just end up serialized in the cache coherency protocol instead of at the program counter.

So I agree with the earlier poster: LRB's cache coherence is expensive and unjustified. Additionally, I've heard several rumors that LRB3, if it ever sees the light of day, will abandon cache coherency in favor of something closer to Fermi's cache architecture.
 
There you go assuming cache coherency is a bad thing. I wonder how long it will take everyone to figure out that memory bandwidth scaling is slowing down in a big, BIG way. The BW per pin is getting closer and closer to its maximum. Coherent caches enable a massive bandwidth reduction when used along with algorithms that partition execution to take advantage of them.

I won't even get into the whole issue of integration into mobile/desktop parts, which is well on its way; those parts have a small fraction of the bandwidth of discrete GPUs.

Caches are nice. I am not sure about the coherency bit. It is ironic that people are trying to put stuff in hw (full coherency) to support things that people are trying to get rid of in sw (with partly or fully functional languages).

Once you partition execution, how are you any different from, say, Fermi?

Bottom line, I'll buy full coherency when I see O(1000) hw threads sharing data at commensurate scale in apps that actually scale. For consumer apps at least, the apps that need parallelism have plenty of it without having to share data at massive scale.
 
There is a world of difference between optimizing code and adding support for sub-data DMA to and from an accelerator board with a different programming model and requirements, a different ISA, running outside the OS via some driver interface, and a completely different set of development tools, debug tools, libraries, etc.

I'll grant you that. But if you are already messaging your app, the jump to OCL over intrinsics/pthreads/TBB isn't too big a deal.

Especially when you consider that you can be 10x more productive writing shader-like code instead of intrinsics.
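
To make the productivity gap concrete, here is the same made-up loop written both ways: once as scalar, shader-like kernel code, once with SSE intrinsics where packing, alignment and the tail loop are the programmer's problem (the tail is omitted to keep it short).

Code:
#include <xmmintrin.h>   // SSE, host side

// "Shader-like" version: scalar code per element, the hardware supplies the SIMD.
__global__ void length2d(int n, const float* x, const float* y, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = sqrtf(x[i] * x[i] + y[i] * y[i]);
}

// Intrinsics version of the same loop: explicit 4-wide packing; the programmer
// owns alignment and the leftover elements (scalar tail loop omitted here).
void length2d_sse(int n, const float* x, const float* y, float* out)
{
    for (int i = 0; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        __m128 v  = _mm_sqrt_ps(_mm_add_ps(_mm_mul_ps(vx, vx),
                                           _mm_mul_ps(vy, vy)));
        _mm_storeu_ps(out + i, v);
    }
}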
 
Fermi is a good example: the L1 caches are not coherent at all, but you can commit data to L2 with a fence and then use atomics to signal that the data is ready.
Sure, but it remains to be seen whether this is a particularly usable programming model. I know very few people (myself *not* included!) who fully understand the actual coherency guarantees made by DirectCompute, for instance, which is a similar model expressed in the SPMD style.

Full cache coherency has a host of scalability issues of its own, both in hardware, since LRB-style coherency is very expensive, and in software, since it facilitates the creation of "parallel" programs which execute serially: they just end up serialized in the cache coherency protocol instead of at the program counter.
Atomics have the same problems and do the same things (indeed they are even slower!), but they are still vitally important to expose. Cache coherence is the same: obviously you don't want to abuse it to fire data all over the chip and serialize with locks and such, but even CPU programmers know that... The ability to tank performance does not imply that coherent caches are not useful or desirable. There are already a million less-obvious ways to destroy performance on GPU architectures :)
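
A small made-up example of the difference between abusing a shared location and using it sparingly; the same logic applies whether the serialization happens in an atomic unit or in a cache coherency protocol. The kernel names are illustrative and counter is assumed to be zeroed before launch.

Code:
// The "tank performance" version: every thread funnels through one word.
__global__ void count_naive(const int* data, int n, int* counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0)
        atomicAdd(counter, 1);      // all threads serialize on a single location
}

// The sane version: combine within the block first, one global atomic per block.
__global__ void count_hierarchical(const int* data, int n, int* counter)
{
    __shared__ int block_count;
    if (threadIdx.x == 0) block_count = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0)
        atomicAdd(&block_count, 1); // contention confined to on-chip memory
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(counter, block_count);
}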

I've learned to accept that what we have now are extremely low level programming models suitable for experts. Thus all of the useful tools and hardware features have to be exposed to write efficient code.

So I agree with the earlier poster: LRB's cache coherence is expensive and unjustified.
Uhh, what are you basing that on? I'd hope that to make a statement like that you've had personal experience with the hardware.

rpg.314 said:
Nitpick, but I don't think this is precisely what you mean :) To address your point, I don't see why data sharing via coherent caches would be any less efficient than doing it explicitly, and it has the same advantage as atomics: it can be applied to data-dependent sharing with low-to-moderate collisions. I was the first to whine about people abusing atomics to write bad code (and they *are* already doing that... in this case it's largely a programming model problem), but there's no doubt that they are a useful feature when used properly.
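
A tiny sketch of that "data-dependent sharing with low-to-moderate collisions" case: a byte histogram, where the bin an element lands in is only known at run time, so the bins can't be statically partitioned among threads. Names are illustrative and bins is assumed to be zeroed before launch.

Code:
// 256 bins, one per byte value.  Atomics (or, on a coherent-cache machine,
// plain shared updates) handle the occasional collision fine; most updates
// hit different bins and proceed in parallel.
__global__ void histogram(const unsigned char* data, int n, unsigned int* bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);   // collides only when nearby values repeat
}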
 
I've learned to accept that what we have now are extremely low level programming models suitable for experts.
When it comes to performance, per se, who else is going to achieve it other than experts?...

I don't think that's being elitist or glib. One only has to look at the thread title to discern how hard veterans find implementing this kind of performance.
 
When it comes to performance, per se, who else is going to achieve it other than experts?...
Right, precisely. My point is you have to expect people to use coherent caches and other features well. "You can screw yourself with them" is not an argument against them, in the same way that it wasn't an argument against atomics (which can screw you even harder).
 
Caches are nice. I am not sure about the coherency bit. It is ironic that people are trying to put stuff in hw (full coherency) to support things that people are trying to get rid of in sw (with partly or fully functional languages).

See, people still don't understand: once you have coherence, it's easy to be incoherent. The opposite isn't true.

Bottom line, I'll buy full coherency when I see O(1000) hw threads sharing data at commensurate scale in apps that actually scale.

By the same token, I'll buy non-coherence when that happens too ;)

For consumer apps at least, the apps that need parallelism have plenty of it without having to share data at massive scale.

Unfortunately, they are quickly running out of parallelism.
 
Atomics have the same problems and do the same things (indeed they are even slower!), but they are still vitally important to expose. Cache coherence is the same: obviously you don't want to abuse it to fire data all over the chip and serialize with locks and such, but even CPU programmers know that... The ability to tank performance does not imply that coherent caches are not useful or desirable. There are already a million less-obvious ways to destroy performance on GPU architectures :)
The difference between atomics and cache coherence is that atomics are explicit and visible in the code you write, whereas cache coherence introduces invisible performance bugs. We need architectures and programming models that push people in the direction of scalable parallel code. Cache coherence is wrong because it facilitates bad code.

Too many programmers parallelize their code and forget to parallelize their data structures because of cache coherence. That's why I consider it a bug, not a feature.
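
A toy contrast of what "parallelizing the data structure" means here (names are made up): in the first kernel every thread grabs the next free slot from one shared cursor, so the whole grid funnels through a single memory location; in the second, each thread owns a fixed output slot decided up front, so no sharing is needed at all and the gaps can be compacted later with a prefix sum.

Code:
// Parallel code, serial data structure: one contended cursor for everyone.
__global__ void collect_shared_cursor(const float* in, int n,
                                      float* out, int* cursor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0.0f) {
        int slot = atomicAdd(cursor, 1);   // every insert bounces off this word
        out[slot] = in[i];
    }
}

// Parallel data structure: thread i owns slot i, writes never collide.
__global__ void collect_partitioned(const float* in, int n,
                                    float* out /* n slots, one per input */)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = (in[i] > 0.0f) ? in[i] : 0.0f;  // sentinel; compact in a later pass
}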
 
See, people still don't understand: once you have coherence, it's easy to be incoherent. The opposite isn't true.
From a software developer's point of view, the opposite is certainly true: once you have partitioned your data structures for parallel execution on a non-coherent processor, it's easy to get parallel scalability on a processor with coherent caches. You've already done the essential work to parallelize your application. On the other hand, once you've parallelized your execution on a coherent processor, you do not necessarily have scalable code, and you have to do more work to partition your data structures.

By the same token, I'll buy non-coherence when that happens too ;)
Non-coherence has already happened (Fermi). My experience programming with non-coherent caches is that they are practical, useful, and perform pretty well. You should give it a try sometime, you might be surprised.
 