22 nm Larrabee

For matrix operations there isn't, if you compare it with Fermi; as I said, LOEWE-CSC is claiming 70% efficiency for that HPL benchmark on their website.

Edit:
I may have stumbled over at least one source of the discrepancy. In the TOP500 list, the theoretical peak is for the full machine, which probably also includes the 40 quad-CPU (48-core) nodes and the 24 dual-CPU (24-core) nodes, both types without a GPU (reserved for jobs with no use for GPUs). The actual benchmark run and the efficiency numbers I cited were only done on the part with GPUs.
As I've said: I am aware that the system build contributes a good deal to the efficiency. And of course the TOP500 counts all the available computing resources. If a certain number of nodes exists only for synchronization, that's exactly what this is all about.

You have to have them in order to reach the system-wide level of performance you want to achieve, even if they only organize the data stream or provide user interfaces. You have to pay for them, and you have to provide power for them (thus paying again). And if your system is particularly hard to optimize for, that's also a problem you'll have to deal with over and over again during its lifetime, because this optimization has to be done for each kind of workload, probably for each and every algorithm.

LOEWE-CSC said:
[Translated:] In order to fully exploit the GPU performance, a special method was implemented to hide the transfer times almost completely. A parallel Linpack run on several hundred GPU nodes [achieves 70% of the theoretical peak,] far exceeding [comparable systems,] which often reach only 50%.
So they developed a method to hide DMA transfer time almost completely and ran HPL on a portion of LOEWE-CSC with several hundred nodes, with their specifically optimized code achieving 70% of theoretical peak. They don't say, though, whether this is purely for the GPUs or for the full computing resources of the nodes. LOEWE-CSC's compute power consists of roughly 39% Opteron CPUs; only 61% is provided by Cypress GPUs.

So, if those 70% efficiency are system-wide, the GPU portion of it must be significantly lower, because the CPUs provide roughly 39% of the resources and those are able to reach at least mid-to-high-eighties efficiency. If it's GPU-only, they are still a far cry from homogeneous HPC systems with over 90%. Better than Fermi, yes, but I'd still consider the lower efficiency a common architectural trait.
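To make that concrete, here's a quick back-of-envelope in C. The 39/61 split is the one quoted above; the 85% CPU-only HPL efficiency is my assumption:

```c
#include <stdio.h>

int main(void) {
    /* If the 70% HPL efficiency is system-wide, what does it imply for the
     * GPU portion alone? Shares as quoted above; the CPU-only efficiency is
     * an assumed, typical mid-eighties figure. */
    double cpu_share = 0.39;   /* fraction of peak FLOPS from the Opterons */
    double gpu_share = 0.61;   /* fraction of peak FLOPS from Cypress GPUs */
    double e_system  = 0.70;   /* claimed system-wide HPL efficiency       */
    double e_cpu     = 0.85;   /* assumed CPU-only HPL efficiency          */

    /* e_system = cpu_share * e_cpu + gpu_share * e_gpu, solved for e_gpu */
    double e_gpu = (e_system - cpu_share * e_cpu) / gpu_share;
    printf("implied GPU-only efficiency: %.1f%%\n", e_gpu * 100.0);  /* ~60% */
    return 0;
}
```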
 
Then name the issue for what it is: 2-way, 3-way or whatever simultaneous issue!

I think we agree there's no standard terminology in use today to describe n-way SIMD issue. At least I haven't seen it. Being overly pedantic about it isn't any more useful than using a slightly incorrect term. What matters is the message not the words.

Simultaneous issue is transparent, the vector nature of the architecture is not (even when the programming model hides it a bit).

The data parallel nature of the programming model is explicitly exposed, not the hardware configuration or execution width. Code for G80 runs just fine on Fermi even though SIMD widths and memory transaction sizes differ. You could run CUDA code just fine on a true scalar architecture.
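A minimal sketch of what I mean, in plain C (the names are mine): the per-element body never mentions a SIMD width, so the same source maps equally well onto 8-wide, 32-wide or purely scalar execution.

```c
#include <stddef.h>

/* The "kernel": written per element, with no notion of SIMD width.
 * Whether this gets mapped to G80-style 8-wide units, Fermi's 32-wide
 * warps, or a scalar CPU loop is invisible to the code itself. */
static float saxpy_element(float a, float x, float y) {
    return a * x + y;
}

/* One possible "launch": a plain scalar loop. A vectorizing compiler or
 * a GPU runtime could just as well run the same per-element body 4, 8
 * or 32 elements at a time without the source changing. */
void saxpy(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; ++i)
        y[i] = saxpy_element(a, x[i], y[i]);
}
```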
 
Ok so in the 5 years since SSE3 we're going to get 4x CPU FP throughput. Is that supposed to be impressive when you consider how much faster GPUs have gotten and how much more demanding graphics workloads have become in the same time frame?
It's an eightfold increase since Prescott, and a fourfold increase since Nehalem, per core. The number of cores is going up as well, and by ditching the IGP there would be room for even more cores.
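Back-of-envelope with the commonly cited peak single-precision FLOPs per clock per core (treat the exact figures as my assumption):

```c
#include <stdio.h>

int main(void) {
    /* Assumed peak single-precision FLOPs per clock per core: */
    int prescott = 4;   /* SSE3: one 4-wide op per clock          */
    int nehalem  = 8;   /* SSE: 4-wide add + 4-wide mul per clock */
    int haswell  = 32;  /* AVX2: two 8-wide FMAs per clock        */

    printf("vs. Prescott: %dx, vs. Nehalem: %dx\n",
           haswell / prescott, haswell / nehalem);   /* 8x and 4x */
    return 0;
}
```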

Keep in mind that high-end GPU dies became a lot bigger, and a lot hotter. This means the performance/area and performance/Watt didn't increase as spectacularly as one might assume. Also, the earliest floating-point GPUs only spent a tiny fraction of their die space on ALUs. Nowadays they can't cram any larger ratio of shading units onto a die. So while GPUs did rapidly increase their computing power, those days are now over (unless of course they get rid of the remaining fixed-function hardware).
Yes, and it would be deficient in memory bandwidth, texture filtering, texture decompression, rasterization and a whole slew of things that would chew into those 435 GFLOPS that GPUs currently get for free. CPU flops != GPU flops.
As I've said before, the average texture sampling rate is far lower than the peak texture rate GPUs waste their transistors on. And this is also true for all other fixed-function hardware. Furthermore, GPUs can get bottlenecked by numerous things, stalling the ALUs and thus wasting FLOPS. CPUs indeed have to spend some cycles 'emulating' fixed-function hardware, but it's less than you might think and they're not bottlenecked by any of it.
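For instance, here's roughly what one bilinear texture fetch costs when done 'in software' (a minimal single-channel sketch, assuming u and v are already in [0,1]): four loads and a handful of multiply-adds, not hundreds of instructions.

```c
/* Minimal bilinear sample of a single-channel float texture.
 * Hypothetical helper, just to make the 'emulation' cost concrete. */
float sample_bilinear(const float *tex, int w, int h, float u, float v) {
    float x = u * (float)(w - 1), y = v * (float)(h - 1);
    int x0 = (int)x, y0 = (int)y;
    int x1 = (x0 + 1 < w) ? x0 + 1 : x0;   /* clamp at the edge */
    int y1 = (y0 + 1 < h) ? y0 + 1 : y0;
    float fx = x - (float)x0, fy = y - (float)y0;

    float t00 = tex[y0 * w + x0], t10 = tex[y0 * w + x1];
    float t01 = tex[y1 * w + x0], t11 = tex[y1 * w + x1];

    float top    = t00 + (t10 - t00) * fx;   /* lerp along x, top row    */
    float bottom = t01 + (t11 - t01) * fx;   /* lerp along x, bottom row */
    return top + (bottom - top) * fy;        /* lerp along y             */
}
```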

Last but not least, brains can win from muscle. A GPU doesn't bother to skip processing invisible geometry. It just uses raw computing power to process massive batches as fast as possible, wasting part of it on things that never end up on the screen. With a CPU you can make more fine-grained decisions, using the available FLOPS more wisely.
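A toy sketch of what those 'more fine-grained decisions' can look like (hypothetical structures, just to illustrate the idea): reject a whole batch against the view frustum before spending any shading FLOPS on it.

```c
/* Bounding-sphere vs. frustum test: if the sphere is entirely behind
 * any plane, the whole batch is invisible and can be skipped. */
typedef struct { float x, y, z; } vec3;
typedef struct { vec3 n; float d; } plane;            /* inside if dot(n,p) + d >= 0 */
typedef struct { vec3 center; float radius; } sphere;

static float dot(vec3 a, vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

int batch_possibly_visible(const sphere *bounds, const plane *frustum, int nplanes) {
    for (int i = 0; i < nplanes; ++i)
        if (dot(frustum[i].n, bounds->center) + frustum[i].d < -bounds->radius)
            return 0;   /* entirely outside this plane: skip the batch */
    return 1;           /* might be visible: worth shading             */
}
```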
Ah, so Intel is going to storm the mainstream graphics market by increasing the thermal and power requirements of even the highest end CPU configurations today? Does that really make sense to you? The main issue with your claims is that a 200w CPU with 4x the throughput will still be slow at graphics. Current CPUs are just that slow.
A 200 Watt CPU with AVX-1024 would have way more than 4x the throughput.
...unlike GPUs, CPUs will not get much wider each generation.
Why not? If OpenCL has any future at all, then multi-core homogeneous CPUs have an even brighter future since they can handle a wider range of applications. Only a year ago people still claimed quad-core was overkill, but nowadays they've gone mainstream and applications actually start to benefit from it. Even NVIDIA's mobile Kal-El chip will feature four CPU cores.

You also have to realize that going multi-core is pretty much a one-time investment for software developers. It can be a heavy investment, but once you have a software architecture that scales up to four cores, it usually automatically takes advantage of more cores as well, or requires just minimal effort. Also, we're starting to see ever more tools to assist in writing multi-threaded software. So it's only a matter of time before you'll see serious benefit from having a multi-core CPU in a wide variety of software.

Heck, even for a heterogeneous architecture the CPU has to continue to scale to keep supporting the otherwise helpless GPU.
When Haswell arrives in 2013 it will be facing Maxwell and 2nd generation GCN. How much faster do you think they will be than today's Fermi and Cayman parts? I would wager at least 4x (nVidia claims 8x).
Maxwell won't appear on the shelves in 2013, and only 1st generation GCN is slated for 2013.
Do we even know how fast gather will be or if Intel's schedulers and memory subsystem can actually feed 16 FMAs/clk? How efficiently will they emulate fixed function hardware?
There is no confirmation of how fast the gather implementation will be, but since Knights Corner is rumored to support gathering any number of elements from one cache line every cycle, there's no reason to expect anything less from an architecture with narrower vector units and already two load units per core. Most people also expect them to double the width of the load and store units. That actually creates an interesting opportunity for the gather implementation: the first load unit has minimal latency and handles most regular loads, while the second load unit supports gather instructions at a slightly higher latency, and also pitches in when there's a high number of regular loads.
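For anyone unfamiliar with what a gather actually has to do, here it is written out scalar (just a sketch of the semantics, not Intel's implementation): the closer the hardware gets to servicing all the indices that hit one cache line in a single cycle, the cheaper this loop effectively becomes.

```c
#include <stdint.h>

/* Gather: collect n elements from arbitrary indices into a contiguous
 * destination, the way a vector gather fills a vector register. */
static void gather_f32(float *dst, const float *base,
                       const int32_t *idx, int n) {
    for (int i = 0; i < n; ++i)
        dst[i] = base[idx[i]];
}
```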
 
Unless your entire working set fits in cache, you will be streaming lots of stuff from RAM as well.
You don't have to fit the "entire" working set in cache memory to get substantial benefit from having a large amount of on-die storage per thread.

RAM bandwidth is increasing more slowly than ALU throughput, so heavy caching becomes a necessity for GPUs. But to have sufficient cache space per thread, the number of threads has to be lowered. Eventually only out-of-order execution can guarantee that there's enough ILP every cycle with a minimal number of threads.
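Rough illustration with made-up but representative numbers: as GFLOPS grow faster than GB/s, the DRAM bytes available per FLOP shrink, which is exactly why on-die storage per thread has to grow.

```c
#include <stdio.h>

int main(void) {
    /* Assumed, purely illustrative figures for successive generations. */
    double gflops[] = {  500.0, 1500.0, 4000.0 };   /* ALU throughput      */
    double gbps[]   = {  100.0,  180.0,  250.0 };   /* DRAM bandwidth GB/s */

    for (int i = 0; i < 3; ++i)                     /* ~0.20, 0.12, 0.06   */
        printf("gen %d: %.3f bytes per FLOP from DRAM\n", i, gbps[i] / gflops[i]);
    return 0;
}
```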
 
You also have to realize that going multi-core is pretty much a one-time investment for software developers. It can be a heavy investment, but once you have a software architecture that scales up to four cores, it usually automatically takes advantage of more cores as well, or requires just minimal effort. Also, we're starting to see ever more tools to assist in writing multi-threaded software. So it's only a matter of time before you'll see serious benefit from having a multi-core CPU in a wide variety of software.
Where are these apps that scale with CPU cores but not with GPU cores?

Maxwell won't appear on the shelves in 2013, and only 1st generation GCN is slated for 2013.
GCN is set for late 2011.
 
Eventually only out-of-order execution can guarantee that there's enough ILP every cycle with a minimal number of threads.

All the benefits of OoO are already there. What is needed is a more flexible memory hierarchy, not more ILP.
 
Where are these apps that scale with CPU cores but not with GPU cores?
[Attached image: RelativePerf_i7vsGTX280.JPG, relative performance of Core i7 vs. GTX 280]


In each of these, the CPU reaches high utilization (except when limited by the lack of gather/scatter). Now imagine what that graph looks like with twice the cores (to match die size and power), four times the computing power per core, and gather/scatter.
GCN is set for late 2011.
"Because of this need to inform developers of the hardware well in advance, while we’ve had a chance to see the fundamentals of GCN products using it are still some time off. At no point has AMD specified when a GPU will appear using GCN will appear, so it’s very much a guessing game. What we know for a fact is that Trinity – the 2012 Bulldozer APU – will not use GCN, it will be based on Cayman’s VLIW4 architecture. Because Trinity will be VLIW4, it’s likely-to-certain that AMD will have midrange and low-end video cards using VLIW4 because of the importance they place on being able to Crossfire with the APU. Does this mean AMD will do another split launch, with high-end parts using one architecture while everything else is a generation behind? It’s possible, but we wouldn’t make at bets at this point in time. Certainly it looks like it will be 2013 before GCN has a chance to become a top-to-bottom architecture, so the question is what the top discrete GPU will be for AMD by the start of 2012." - AnandTech
 
All the benefits of OoO are already there.
Really? So we can use a 1 MB call stack now and have the GPU do stuff like compiling code?

That's what a CPU can do (with each core), and soon enough they'll match the GPU's effective throughput. So I'm very curious what sort of technology GPUs will utilize to be faster at anything other than rasterization graphics.
What is needed is a more flexible memory hierarchy, not more ILP.
What's your idea of a more flexible memory hierarchy, and what do you expect will be the cost / gain?
 
In each of these, the CPU reaches high utilization (except when limited by the lack of gather/scatter). Now imagine what that graph looks like with twice the cores (to match die size and power), four times the computing power per core, and gather/scatter.
Any chance we could see the same things run on 1 vs. N cores on a CPU, to get a rough idea how it scales there? Without knowing that, it could simply boil down to the CPU running at a much higher clock speed.
 
[Attached image: RelativePerf_i7vsGTX280.JPG, relative performance of Core i7 vs. GTX 280]


In each of these, the CPU reaches high utilization (except when limited by the lack of gather/scatter). Now imagine what that graph looks like with twice the cores (to match die size and power), four times the computing power per core, and gather/scatter.

a) Didn't that paper include PCIe transfer times as well?

b) There's really no point in comparing if you are going to indulge in fantasizing about only one side, while ignoring the improvement of the other side.

"Because of this need to inform developers of the hardware well in advance, while we’ve had a chance to see the fundamentals of GCN products using it are still some time off. At no point has AMD specified when a GPU will appear using GCN will appear, so it’s very much a guessing game. What we know for a fact is that Trinity – the 2012 Bulldozer APU – will not use GCN, it will be based on Cayman’s VLIW4 architecture. Because Trinity will be VLIW4, it’s likely-to-certain that AMD will have midrange and low-end video cards using VLIW4 because of the importance they place on being able to Crossfire with the APU. Does this mean AMD will do another split launch, with high-end parts using one architecture while everything else is a generation behind? It’s possible, but we wouldn’t make at bets at this point in time. Certainly it looks like it will be 2013 before GCN has a chance to become a top-to-bottom architecture, so the question is what the top discrete GPU will be for AMD by the start of 2012." - AnandTech

http://semiaccurate.com/2011/06/29/amd-southern-islands-possible-for-september/
 
Nick can clearly defend himself, but as he's clearly been speaking about APUs vs. multi-core CPUs, I believe he's right about the time frame: GCN is not likely to get integrated into AMD APUs before late 2012 / early 2013.
I wonder about the CPU memory interface. I find that x86 architectures are lagging in this regard versus, say, IBM's high-end parts (POWER7 and PowerA2 use four memory channels). Do you guys expect Intel to make advances in this regard with Haswell?
 
["Because of this need to inform developers of the hardware well in advance, while we’ve had a chance to see the fundamentals of GCN products using it are still some time off. At no point has AMD specified when a GPU will appear using GCN will appear, so it’s very much a guessing game. What we know for a fact is that Trinity – the 2012 Bulldozer APU – will not use GCN, it will be based on Cayman’s VLIW4 architecture. Because Trinity will be VLIW4, it’s likely-to-certain that AMD will have midrange and low-end video cards using VLIW4 because of the importance they place on being able to Crossfire with the APU. Does this mean AMD will do another split launch, with high-end parts using one architecture while everything else is a generation behind? It’s possible, but we wouldn’t make at bets at this point in time. Certainly it looks like it will be 2013 before GCN has a chance to become a top-to-bottom architecture, so the question is what the top discrete GPU will be for AMD by the start of 2012." - AnandTech

AMD themselves stated during their dev conference that GCN is the next GPU and that it will be out before the end of the year.
 
I wonder about the CPU memory interface. I find that x86 architectures are lagging in this regard versus, say, IBM's high-end parts (POWER7 and PowerA2 use four memory channels). Do you guys expect Intel to make advances in this regard with Haswell?

IBM uses 4 channels because their market can afford it. I think on-package DRAM is more likely for Intel than 4 off-socket channels.
 
I wonder about the CPU memory interface. I find that x86 architectures are lagging in this regard versus, say, IBM's high-end parts (POWER7 and PowerA2 use four memory channels). Do you guys expect Intel to make advances in this regard with Haswell?
Intel is using 4 memory channels in some of their Xeons, e.g. this one.
 
There's really no point in comparing if you are going to indulge in fantasizing about only one side, while ignoring the improvement of the other side.
I'm not ignoring GPU evolution. I'm observing that they need more registers, larger caches, and/or more dynamic scheduling. And although efficiency will increase, each of these things costs computing density. So the convergence is happening from both sides.

If there's anything that might widen the gap again, which CPUs can't implement, I haven't heard about it yet and I'm open to learning all about it.
Fair enough, but Haswell still won't be up against a 4x faster GCN.
 
I wonder about the CPU memory interface. I find that x86 architectures are lagging in this regard versus, say, IBM's high-end parts (POWER7 and PowerA2 use four memory channels). Do you guys expect Intel to make advances in this regard with Haswell?
Sandy Bridge-E will feature a quad-channel memory controller. And DDR4 guarantees bandwidth to scale for years to come.
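Quick peak-bandwidth arithmetic for a quad-channel controller (DDR3-1600 is just an assumed example, not a product spec):

```c
#include <stdio.h>

int main(void) {
    double transfers_per_s = 1600e6;  /* per channel (DDR3-1600) */
    double bytes_per_xfer  = 8.0;     /* 64-bit channel          */
    int    channels        = 4;

    double gb_per_s = transfers_per_s * bytes_per_xfer * channels / 1e9;
    printf("peak: %.1f GB/s\n", gb_per_s);   /* 51.2 GB/s         */
    return 0;
}
```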
 
Really? So we can use a 1 MB call stack now and have the GPU do stuff like compiling code?
Call stack size != OoO.

What's your idea of a more flexible memory hierarchy, and what do you expect will be the cost / gain?
Something like LRB1, i.e. allocating registers and L1 from the same pool.
 
I'm not ignoring GPU evolution. I'm observing that they need more registers, larger caches, and/or more dynamic scheduling. And although efficiency will increase, each of these things costs computing density. So the convergence is happening from both sides.
GPUs don't need any of that. A hypothetical GCN with LRB1-like per-core storage will do just fine without needing substantially more hardware.

Fair enough, but Haswell still won't be up against a 4x faster GCN.
It will face a 2x faster GCN.
 