I guess they didn't want to bother with dual-issuing x86 at the same time. Must have simplified the x86-specific core.

Forsyth's GDC 09 Larrabee presentation, slide 50, describes Larrabee as this:
Well, the 10% figure is right there in front of you.
But that's likely inaccurate.
How about an alternative approach? They run their apps on existing CPUs, profile which bits need performance, and rewrite those bits in OpenCL to target GPUs or whichever LRB is available.
Do you really need x86 in your GPU with this approach? Is this approach in any way worse off than your suggestion? If anything, existing CPUs will blow LRB out of the water when it comes to the serial bits, and those are important for overall perf.
In a GPU, ALL of x86 is legacy. Who needs mmap() and ioctl() and cousins in a shader? Existing apps run fine on existing CPUs.
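To make the profile-then-offload flow above concrete: the poster says OpenCL, but here is a minimal sketch of the same pattern in CUDA for concreteness. The saxpy hotspot, the function names, and the launch shape are all hypothetical, not from the post:

    // Hedged sketch: only the profiled hot loop moves to the GPU;
    // everything serial stays ordinary CPU code.
    #include <cuda_runtime.h>

    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];   // the profiled hotspot
    }

    void run_hotspot_on_gpu(int n, float a, const float* x_h, float* y_h) {
        float *x_d, *y_d;
        cudaMalloc((void**)&x_d, n * sizeof(float));
        cudaMalloc((void**)&y_d, n * sizeof(float));
        cudaMemcpy(x_d, x_h, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(y_d, y_h, n * sizeof(float), cudaMemcpyHostToDevice);
        saxpy<<<(n + 255) / 256, 256>>>(n, a, x_d, y_d);
        cudaMemcpy(y_h, y_d, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(x_d);
        cudaFree(y_d);
    }

The point is that only the hot loop changes; the serial bits stay on the CPU, which is exactly where this approach wants them.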
I've been asking that question, and the answers I've gotten indicate that to hit peak throughput Larrabee has to pair one traditional x86 instruction with one VPU instruction each cycle (the exception being that a vector store can run in place of the x86 instruction).
In pure x86 code, it is apparently half the issue width of Atom.
Intel slides have also been released that seem to support this.
Well, the 10% figure is right there in front of you. Who knows how much other pointless stuff is in there for which we don't have the numbers (cache-coherency overhead, anyone?). If you look at the overall bogo-flops/mm2 and bogo-flops/W, LRB1 was hardly any good.
I believe various people have been quoted as saying the LRB1 core effectively follows the U/V pipe pairing rules of the original Pentium, with the V pipe being fairly restricted.
3D integration for the win

I wonder how long it will take everyone to figure out that memory bandwidth scaling is slowing down in a big, BIG way. The BW per pin is getting closer and closer to its maximum.
There you go assuming cache coherency is a bad thing. Coherent caches enable a massive reduction in bandwidth if used along with algorithms that partition execution to take advantage of them.
I won't even get into the whole issue of integration into mobile/desktop cores, which is well on its way; those parts have a small fraction of the bandwidth of discrete GPUs.
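A hedged illustration of the partition-for-bandwidth argument: on a GPU the on-chip store is explicitly managed shared memory rather than a coherent cache, but the effect is the same. This hypothetical 3-point stencil stages each block's tile once, so every neighbour read comes from on-chip memory instead of DRAM (launch with blockDim.x == TILE):

    #define TILE 256

    __global__ void stencil3(const float* in, float* out, int n) {
        __shared__ float tile[TILE + 2];          // tile plus one-element halos
        int g = blockIdx.x * blockDim.x + threadIdx.x;
        int l = threadIdx.x + 1;                  // local index, offset past left halo
        if (g < n) tile[l] = in[g];               // one DRAM read per element
        if (threadIdx.x == 0 && g > 0)
            tile[0] = in[g - 1];                  // left halo
        if (threadIdx.x == blockDim.x - 1 && g < n - 1)
            tile[l + 1] = in[g + 1];              // right halo
        __syncthreads();
        if (g > 0 && g < n - 1)
            out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0f;  // 3 on-chip reads
    }

Without the staging, each output element would pull three values from DRAM; with it, roughly one. A coherent cache gets a CPU the same reduction automatically, which is the poster's point about partitioning execution to exploit it.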
There you go assuming cache coherency is a bad thing. I wonder how long it will take everyone to figure out that memory bandwidth scaling is slowing down in a big BIG way. The BW per pin is getting closer and closer to maximums. coherent caches enable massive bandwidth reduction if used along with algorithms that partition execution to take advantage of them.
I won't even get into the whole issue of integration, which is well on its way, into mobile/desktop cores that have a small fraction of the bandwidth of discrete GPUs.
There is a world of difference between optimizing code and adding support for sub-data DMA functions to and from an accelerator board with different programming models and requirements, on a different ISA, running outside the OS via some driver interface, with a completely different set of development tools, debug tools, libraries, etc.
Neither of those is the x86 instruction set.
3D integration for the win
Er, soon, real soon now...
Sure, but it remains to be seen whether this is a particularly usable programming model. I know very few people (myself *not* included!) who fully understand the actual coherency guarantees made by DirectCompute, for instance, which is a similar model expressed in the SPMD style.

Fermi is a good example: the L1 caches are not coherent at all, but you can commit data to L2 with a fence and then use atomics to signal that the data is ready.
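For reference, the Fermi pattern described above looks roughly like this in CUDA; the names and the two-block handshake are illustrative, not from the post:

    __device__ volatile int ready = 0;
    __device__ volatile float payload;

    __global__ void handoff(float value, float* out) {
        // Block 0 produces, block 1 consumes. Both blocks of a two-block
        // grid are resident simultaneously, so the spin cannot deadlock.
        if (blockIdx.x == 0 && threadIdx.x == 0) {
            payload = value;                 // write lands in L2, not peer L1s
            __threadfence();                 // commit device-wide before the flag
            atomicExch((int*)&ready, 1);     // signal: data is ready
        } else if (blockIdx.x == 1 && threadIdx.x == 0) {
            while (atomicAdd((int*)&ready, 0) == 0) { /* spin on the flag */ }
            __threadfence();                 // order the flag read before the data read
            *out = payload;                  // safe after the fence + atomic handshake
        }
    }

atomicAdd(..., 0) is the usual idiom for an atomic read, and the fences are what "commit to L2" amounts to in practice.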
Atomics have the same problems and do the same things (indeed they are even slower!), but they are still vitally important to expose. Cache coherence is the same thing: obviously you don't want to be abusing it to fire data all over the chip and serialize with locks and such, but even CPU programmers know that. The ability to tank performance does not imply that coherent caches are not useful or desirable; there are already a million less obvious ways to destroy performance on GPU architectures.

Full cache coherency has a host of scalability issues of its own, both in hardware, since LRB-style coherency is very expensive, and in software, since it facilitates the creation of "parallel" programs which execute serially, because they just end up serialized in the cache coherency protocol instead of at the program counter.
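A sketch of the "atomics serialize too" point, with a hypothetical sum: every thread hitting one global counter collides at a single address, whereas accumulating per block first touches the global atomic only once per block:

    __global__ void sum_naive(const int* v, int n, int* total) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) atomicAdd(total, v[i]);        // all threads collide on one address
    }

    __global__ void sum_partitioned(const int* v, int n, int* total) {
        __shared__ int block_sum;
        if (threadIdx.x == 0) block_sum = 0;
        __syncthreads();
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) atomicAdd(&block_sum, v[i]);   // contention stays on-chip
        __syncthreads();
        if (threadIdx.x == 0) atomicAdd(total, block_sum);  // one global op per block
    }

Both kernels are correct; only the second scales, which is the "used properly" caveat in a nutshell.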
Uhh, what are you basing that on? I'd hope that to make a statement like that you've had personal experience with the hardware.

So I agree with the earlier poster: LRB's cache coherence is expensive and unjustified.
Nitpick, but I don't think this is precisely what you mean. To address your point, I don't see why data sharing via coherent caches would be any less efficient than doing it explicitly, and it has the same advantage as atomics: it can be applied to data-dependent sharing with low-to-moderate collisions. I was the first to whine about people abusing atomics to write bad code (and they *are* already doing that... in this case it's largely a programming model problem), but there's no doubt that they are a useful feature when used properly.
When it comes to performance, per se, who else is going to achieve it other than experts?... I've learned to accept that what we have now are extremely low-level programming models suitable for experts.
Right, precisely. My point is you have to expect people to be using coherent caches and other features well. "You can screw yourself with them" is not an argument against them, in the same way that it wasn't an argument against atomics (which can screw you even harder).
Caches are nice. I am not sure about the coherency bit. It is ironic that people are trying to put stuff in hardware (full coherency) to support things that people are trying to get rid of in software (with partly or fully functional languages).
Bottom line: I'll buy full coherency when I see O(1000) hardware threads sharing data at a commensurate scale in apps that actually scale.
For consumer apps at least, the apps that need parallelism have plenty of parallelism without having to share data at massive scale.
The difference between atomics and cache coherence is that atomics are explicit and visible in the code you write, whereas cache coherence introduces invisible performance bugs. We need architectures and programming models that push people in the direction of scalable parallel code. Cache coherence is wrong because it facilitates bad code.
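The classic invisible bug meant here is false sharing. A minimal host-side sketch (plain C++ as it would sit in a .cu file; the struct, names, and counts are made up): nothing in the source mentions a cache line, yet the coherency protocol ping-pongs one between the two cores for the entire run:

    #include <thread>

    struct Counters {
        long a;   // written constantly by thread 0
        long b;   // written by thread 1; shares a's 64-byte cache line
        // The fix (e.g. alignas(64) padding between the members) only
        // occurs to you if you already know the cache line is there.
    };

    static void bump(volatile long* c) {          // volatile: keep the stores visible
        for (long i = 0; i < 100000000L; ++i) ++*c;
    }

    int main() {
        Counters k{0, 0};
        std::thread t0(bump, &k.a);
        std::thread t1(bump, &k.b);
        t0.join();
        t1.join();
        return (int)(k.a + k.b);
    }

Each thread touches only its own variable, so the code looks embarrassingly parallel; the slowdown lives entirely in the coherency hardware, which is exactly the "invisible" part of the complaint.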
From a software developer's point of view, the opposite is certainly true: once you have partitioned your data structures for parallel execution on a non-coherent processor, it's easy to get parallel scalability on a processor with coherent caches; you've already done the essential work to parallelize your application. On the other hand, once you've parallelized your execution on a coherent processor, you do not necessarily have scalable code, and you have to do more work to partition your data structures.

See, people still don't understand: once you have coherence, it's easy to be incoherent. The opposite isn't true.
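A small sketch of that partitioning discipline, under the assumption of an owner-computes split: each block owns a disjoint slice of the data and never touches another block's slice, so the kernel's correctness does not depend on coherence at all:

    __global__ void scale_owned_slices(float* data, int n, float s) {
        int per_block = (n + gridDim.x - 1) / gridDim.x;
        int begin = blockIdx.x * per_block;               // this block's slice
        int end   = min(begin + per_block, n);
        for (int i = begin + threadIdx.x; i < end; i += blockDim.x)
            data[i] *= s;                                 // no cross-slice traffic
    }

Run it on a coherent machine and nothing changes except that the hardware has less to do, which is the poster's point about the direction of the porting effort.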
Non-coherence has already happened (Fermi). My experience programming with non-coherent caches is that they are practical, useful, and perform pretty well. You should give it a try sometime; you might be surprised.

By the same token, I'll buy non-coherence when that happens too.