22 nm Larrabee

Everything else being equal, more ILP means more warps in flight to hide the same memory latency.
Wrong. Higher ILP doesn't increase throughput by itself. You just switch warps less frequently. In the worst case you need the same number of warps to hide the same memory latency.

But there isn't just memory latency. Fermi has an 18-cycle arithmetic pipeline. With a modest number of memory accesses and a good cache hit ratio, arithmetic latency dominates. And that's when higher ILP allows you to use fewer warps, which in turn reduces the number of cache accesses for spilled registers and reduces storage contention, both further lowering the memory access latency that has to be hidden.

So higher ILP never makes things worse.
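To make that concrete, here's a rough CUDA sketch (illustrative kernels of my own, not measured code): the first has a single dependency chain, so covering the ~18-cycle ALU latency has to come entirely from other resident warps; the second exposes four independent chains per thread, so each warp covers more of that latency itself and fewer resident warps are needed.

```cuda
// Illustrative only: two hypothetical kernels contrasting ILP=1 and ILP=4.
#include <cstdio>

__global__ void fma_chain_ilp1(float* out, float a, float b) {
    float x = threadIdx.x;
    for (int i = 0; i < 256; ++i)
        x = x * a + b;                    // each FMA depends on the previous one
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

__global__ void fma_chain_ilp4(float* out, float a, float b) {
    float x0 = threadIdx.x, x1 = x0 + 1.f, x2 = x0 + 2.f, x3 = x0 + 3.f;
    for (int i = 0; i < 64; ++i) {        // four independent chains per thread
        x0 = x0 * a + b;
        x1 = x1 * a + b;
        x2 = x2 * a + b;
        x3 = x3 * a + b;
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = x0 + x1 + x2 + x3;
}

int main() {
    float* d_out = nullptr;
    cudaMalloc(&d_out, 64 * 256 * sizeof(float));
    // Same total FMA count in both kernels; only the per-warp ILP differs.
    fma_chain_ilp1<<<64, 256>>>(d_out, 1.000001f, 0.5f);
    fma_chain_ilp4<<<64, 256>>>(d_out, 1.000001f, 0.5f);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```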
 
Higher ILP doesn't increase throughput by itself. You just switch warps less frequently. In the worst case you need the same number of warps to hide the same memory latency.
Higher ILP means an ALU clause lasts for a shorter period of time, needing more warps to hide the same latency.
 
I beg to differ, at least with regard to the Top500 - I'm aware of some specific MM (matrix multiplication) tests which show pretty good utilization rates on Radeon hardware.

In the Top 500 however, Radeon-based computers are between 59% efficiency (LOEWE-CSC) and 47% (the first iteration of Tianhe). The current top Fermi cluster reaches 55% and thus does not differ fundamentally. But it sure is a far cry from the efficiency of the top supercomputer, which is at 93%.
You have to see that the large-scale matrix operations are also limited by the interconnect and communication within the cluster, and of course by the software libraries used in that specific benchmark.
Even on a machine where a single node reaches 90%+ efficiency, the cluster as a whole can score just 50% or so on average. Coming back to the question of Fermi vs. Cypress/Cayman in matrix operations, there is simply no way a single Fermi GPU reaches significantly more than 65% efficiency or so (there is a paper by Volkov on that, I think), which means that even with the best interconnect in your cluster, you cannot reach more than that. I don't know exactly why the AMD GPU clusters aren't significantly better. It could be a non-optimal interconnect, lacking optimization (if matrix operations on the full machine aren't the sole purpose of the cluster [they almost never are], the last ounce of effort is often not put into that), or just not using the best-performing available library for matrix operations on the GPU in question.

Hmm, just looked at the website of the LOEWE-CSC (it's in Frankfurt/Main) and they claim higher efficiency numbers there. Maybe they did some further optimizations after the submission of the benchmark data but just didn't run a second benchmark on the full machine?

LOEWE-CSC-Website said:
To demonstrate that it is indeed possible to exploit the performance of the heterogeneous architecture, a DGEMM (generalized matrix multiplication) was written that can fully load both GPU and CPU. The GPU kernel itself reaches over 90% of the theoretical peak performance. At the system level, well over 80% of the theoretical combined performance of GPU and CPU is available, DMA transfers included. (CALDGEMM source code and documentation)

Based on this DGEMM, the HPL benchmark was adapted. HPL is an implementation of Linpack, which is the standard benchmark for high-performance computers. To fully exploit the GPU performance, a special method was implemented to hide the transfer times almost completely. A parallel Linpack run on several hundred GPU nodes reaches about 70% of the theoretical maximum performance. This far surpasses conventional heterogeneous high-performance computers, which often reach only 50%.
 
But there isn't just memory latency. Fermi has an 18-cycle arithmetic pipeline. With a modest number of memory accesses and a good cache hit ratio, arithmetic latency dominates. And that's when higher ILP allows you to use fewer warps, which in turn reduces the number of cache accesses for spilled registers and reduces storage contention, both further lowering the memory access latency that has to be hidden.

ALU latency is 18 cycles. Mem latency is >600 cycles and slowly increasing. Which should you be focussing on?
 
But there isn't just memory latency. Fermi has an 18-cycle arithmetic pipeline. With a modest number of memory accesses and a good cache hit ratio, arithmetic latency dominates. And that's when higher ILP allows you to use fewer warps, which in turn reduces the number of cache accesses for spilled registers and reduces storage contention, both further lowering the memory access latency that has to be hidden.

So higher ILP never makes things worse.
It still doesn't work that way. You are confusing two things: that Fermi gets away with fewer warps when the code contains more ILP, and the fact that when you use that ILP by issuing independent operations of a warp/thread within the same cycle (as GF104 can do), you actually need more warps than if you didn't (and instead used the ILP to sequentially issue single independent instructions per warp). The reason is simple: the issue rate rises, so you need more instructions in flight to cover the arithmetic latencies, and those eventually have to come from more warps. So GF104 needs more warps than GF100 to hide the arithmetic latencies. It really is that way. So ILP in the code doesn't make things worse; using it for parallel issue of operations does ;)
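A back-of-the-envelope sketch of that arithmetic (illustrative numbers, not exact per-scheduler hardware figures): by Little's law you need roughly issue rate × latency independent instructions in flight, and each resident warp can only contribute as many as its ILP allows.

```cuda
// Rough latency-hiding arithmetic (host-only code, illustrative numbers).
#include <cmath>
#include <cstdio>

// Warps needed = ceil(issue rate * latency / independent instructions per warp).
static int warps_needed(double issue_per_clk, double alu_latency, double ilp_per_warp) {
    return (int)std::ceil(issue_per_clk * alu_latency / ilp_per_warp);
}

int main() {
    const double lat = 18.0;  // approximate Fermi ALU latency in cycles
    // Same code ILP of 2 in both cases; only the sustainable issue rate differs.
    printf("GF100-like (1 instr/clk): %d warps\n", warps_needed(1.0, lat, 2.0));
    printf("GF104-like (2 instr/clk): %d warps\n", warps_needed(2.0, lat, 2.0));
    // The other scenario: keep the issue rate fixed and raise the ILP instead.
    printf("fixed issue, ILP 1 vs 4 : %d vs %d warps\n",
           warps_needed(1.0, lat, 1.0), warps_needed(1.0, lat, 4.0));
    return 0;
}
```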
 
You have to see that the large-scale matrix operations are also limited by the interconnect and communication within the cluster, and of course by the software libraries used in that specific benchmark.
[Sorry to have cut out the rest of your posting]
I am aware of the system- and cluster-level dependencies. But since we are in the supercomputing space with this part of the discussion, you on the other hand have to be aware of the actual real-life results you can get out of the architectures in place. And that involves their respective host-system interconnects and their ability to generate results in combination with their environments.

In my testing there seems to be a tendency that the further you step away from purely synthetic benchmarks, over more specialized ones which do actual work, to really general applications, the less a pure GigaFLOPS rating means. And in this environment, the Radeon cards seem to have implementation details that prevent the HPC people from extracting more performance from them.

After all, if you take part in the top500 benchmarking exercise, you will try your best to extract as much performance as you can - because that'll also help your optimization strategies on other applications as long as cluster-wide dependencies or synchronisation are an issue.
 
In my testing there seems to be a tendency that the further you step away from purely synthetic benchmarks, over more specialized ones which do actual work, to really general applications, the less a pure GigaFLOPS rating means. And in this environment, the Radeon cards seem to have implementation details that prevent the HPC people from extracting more performance from them.
For matrix operations there aren't such limitations if you compare it with Fermi; as I said, LOEWE-CSC is claiming 70% efficiency for that HPL benchmark on their website.

Edit:
I may have stumbled over at least one source of the discrepancy. In the top500 list, the theoretical peak is for the full machine, which probably also includes the 40 quad-CPU/48-core nodes and 24 dual-CPU/24-core nodes, both types without a GPU (reserved for jobs with no use for GPUs). The actual benchmark run and the efficiency numbers I cited were only done on the part with GPUs.
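Just to illustrate the effect of that denominator (the TFLOPS values below are made-up placeholders, not actual LOEWE-CSC numbers):

```cuda
// Illustrative only: how counting GPU-less nodes in the theoretical peak
// deflates the reported efficiency. Placeholder numbers, not real figures.
#include <cstdio>

int main() {
    const double rpeak_gpu_partition = 600.0;  // hypothetical peak of the GPU nodes
    const double rpeak_cpu_only      = 120.0;  // hypothetical peak of the CPU-only nodes
    const double rmax = 0.70 * rpeak_gpu_partition;  // HPL run on the GPU partition at 70%

    printf("Efficiency vs. GPU partition peak: %.0f%%\n",
           100.0 * rmax / rpeak_gpu_partition);
    printf("Efficiency vs. full-machine peak : %.0f%%\n",
           100.0 * rmax / (rpeak_gpu_partition + rpeak_cpu_only));
    return 0;
}
```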

After all, if you take part in the top500 benchmarking exercise, you will try your best to extract as much performance as you can - because that'll also help your optimization strategies on other applications as long as cluster-wide dependencies or synchronisation are an issue.
Later optimizations are one of the reasons why some systems are re-benchmarked half a year later to submit a higher result. But if it is a production system, you may not have the time to optimize for a case which is basically never relevant, and to kick users off the machine for a while just to optimize and run a stupid benchmark. So some systems (which would be in the top500) never get benchmarked at all, because nobody there sees any sense in putting effort into such an endeavour (systems installed at some company, for instance). And even systems which were benchmarked in the course of installation rarely get benchmarked again after software optimizations just to move up a few places in the list for bragging rights.

And cluster systems at universities, or ones like the HLRN and HLRS (to name some German examples), are basically never used as a full system besides those benchmark runs. Normally they run hundreds of concurrent jobs, each using between 16 and 256 CPUs, depending on what the user applied for. That's the reality for such systems, not running HPL on the full machine.
 
Higher ILP means an ALU clause lasts for a shorter period of time, needing more warps to hide the same latency.
Instruction latency remains the same. You just issue more instructions from the same warp instead of from different warps.

Let me take a simplified example. Assume you have two ALUs with a latency of 1 cycle, and two threads. Now assume one thread encounters a memory instruction which takes multiple cycles. Without superscalar issue, arithmetic throughput would only be half. With superscalar issue, the second thread can continue to use all ALUs.

Of course at some point the second thread also encounters a memory operation, but if the density and latency of memory operations is low enough then the first thread can take over. In any case, it's never worse than without superscalar issue!
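Here's a toy cycle-by-cycle sketch of that example (my own illustrative model; the traces, the two issue slots and the memory latency are arbitrary assumptions). With the static thread-to-ALU mapping one ALU idles while its thread waits on memory; with flexible issue the other thread can fill both slots.

```cuda
// Toy model of the two-ALU / two-thread example above (host-only code).
#include <cstdio>
#include <string>
#include <vector>

const int MEM_LAT = 8;   // cycles a thread is blocked after issuing a memory op

struct Thread {
    std::string trace;   // 'A' = 1-cycle ALU op, 'M' = memory op
    size_t pc;
    int ready_at;        // cycle at which the thread may issue again
    explicit Thread(const char* s) : trace(s), pc(0), ready_at(0) {}
    bool done() const { return pc >= trace.size(); }
};

// If 'flexible' is true, one ready thread may fill both ALU slots in a cycle
// (the "superscalar" case above); otherwise each thread is pinned to one ALU.
void run(bool flexible) {
    std::vector<Thread> t;
    t.push_back(Thread("AAAAMAAAA"));
    t.push_back(Thread("AAAAAAAAM"));
    int cycle = 0, alu_ops = 0;
    while (!(t[0].done() && t[1].done())) {
        int slots = 2;                             // two ALU issue slots per cycle
        for (size_t i = 0; i < t.size(); ++i) {
            Thread& th = t[i];
            int budget = flexible ? slots : 1;     // static mapping: one slot per thread
            while (budget > 0 && slots > 0 && !th.done() && th.ready_at <= cycle) {
                char op = th.trace[th.pc++];
                if (op == 'A') { ++alu_ops; --slots; --budget; }
                else           { th.ready_at = cycle + MEM_LAT; break; }
            }
        }
        ++cycle;
    }
    printf("%s issue: %d ALU ops in %2d cycles -> %.0f%% ALU utilization\n",
           flexible ? "flexible" : "static  ", alu_ops, cycle,
           100.0 * alu_ops / (2.0 * cycle));
}

int main() {
    run(false);   // thread i may only use ALU i
    run(true);    // any ready thread may use any free ALU slot
    return 0;
}
```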
 
Let me take a simplified example. Assume you have two ALUs with a latency of 1 cycle, and two threads. Now assume one thread encounters a memory instruction which takes multiple cycles. Without superscalar issue, arithmetic throughput would only be half. With superscalar issue, the second thread can continue to use all ALUs.

You are assuming a CPU model, where an ALU can accept instructions from any thread that is not stalled.

In GPUs, workitem to ALU mapping is static. And with good reason.
 
ALU latency is 18 cycles. Mem latency is >600 cycles and slowly increasing. Which should you be focussing on?
That's RAM latency. Most memory accesses should hit a cache and have a far lower latency.

This also reminds me just how stupid a GPU is at handling misses. CPUs with out-of-order execution can overlap multiple misses. This allows them to use far fewer threads and have very little cache contention, ensuring a high hit rate and lowering the RAM bandwidth requirements. AVX-1024 should ensure close to 100% utilization even for very complex tasks.

In particular the dependency on RAM bandwidth also means APUs won't scale well.
 
The reason is simple: the issue rate rises, so you need more instructions in flight to cover the arithmetic latencies, and those eventually have to come from more warps.
The issue rate doesn't increase. You just issue multiple instructions from the same warp whenever you can, instead of from different warps.
 
Throughput will continue to increase, even with a fully homogeneous architecture. What you're claiming is that this won't suffice, and we'll get workloads which will require heterogeneous dedicated hardware again. Could you give me an example of a task which requires much more throughput than graphics but less programmability, and would be worth the dedicated silicon?

No, I can't as I'm no graphics researcher but given historical trends and the limitations of today's real-time techniques it's not exactly a leap of faith to expect new accelerators in the future.

Compared to SSE, AVX2 increases the throughput fourfold, adds non-destructive instructions, and features gather. That's way more than "just a few more flops", and then some.

Ok so in the 5 years since SSE3 we're going to get 4x CPU FP throughput. Is that supposed to be impressive when you consider how much faster GPUs have gotten and how much more demanding graphics workloads have become in the same time frame?

That's sentiment, not fact. GPUs are losing this advantage too. If Sandy Bridge had FMA support and no IGP, it would be less than 200 mm² and deliver 435 GFLOPS. GF116 is 238 mm² and delivers 691 GFLOPS.

Yes, and it would be deficient in memory bandwidth, texture filtering, texture decompression, rasterization and a whole slew of things that would chew into those 435 GFLOPS that GPUs currently get for free. CPU flops != GPU flops.

That's merely a 33% advantage in computing density. But let's not forget that a CPU is still a CPU. The GPU is worthless on its own. And the convergence continues...

Yes, and in graphics workloads the CPU is equally worthless without a GPU....

What's not to understand? You said CPUs are constrained by the thermal and power limits of the CPU socket. So I'm asking, what would keep them from increasing these thermal and power limits?

Ah, so Intel is going to storm the mainstream graphics market by increasing the thermal and power requirements of even the highest end CPU configurations today? Does that really make sense to you? The main issue with your claims is that a 200w CPU with 4x the throughput will still be slow at graphics. Current CPUs are just that slow.

Ivy Bridge doesn't have AVX2 nor AVX-1024. IGPs will stay around as long as these haven't been implemented. Software rendering being slow is not a cosmic constant. Software renderers are currently limited to using 100 GFLOPS, and emulating gather takes 3 uops per element. With four times higher throughput per core, non-destructive instructions, hardware gather, and more cores, software rendering is about to take a quantum leap.

A 4x improvement in theoretical throughput in 5 years is not a quantum leap and, unlike GPUs, CPUs will not get much wider each generation. When Haswell arrives in 2013 it will be facing Maxwell and 2nd generation GCN. How much faster do you think they will be than today's Fermi and Cayman parts? I would wager at least 4x (nVidia claims 8x).

Do we even know how fast gather will be or if Intel's schedulers and memory subsystem can actually feed 16 FMAs/clk? How efficiently will they emulate fixed function hardware?
 
The issue rate doesn't increase. You just issue multiple instructions from the same warp whenever you can, instead of from different warps.

I think you guys may be talking about two different things. If you're talking about GF104 with superscalar issue then more warps are required. If you just want to issue independent instructions sequentially from the same warp to the execution units then you can probably reduce the number of warps required and there's no net increase in instruction issue rate. They're already doing that though as GF100 can support up to 4 instructions in flight from the same warp.
 
In GPUs, workitem to ALU mapping is static. And with good reason.
In Fermi, that isn't true anymore. Just look at the GF104-type SMs. Nevertheless, GF104 still needs more warps in flight than GF100 (which doesn't have the dual-issue capability per thread) to cover the latencies. The number of warps needed basically scales with the sustainable instructions/clock: more throughput means more warps needed.

Btw, everybody should forget the word "superscalar" with respect to GPUs. It simply doesn't apply, as one doesn't have scalar but vector ALUs. If one doesn't want to call it supervector issue, just call it dual issue. GCN has a real scalar issue port; current GPUs don't.
 
This also reminds me just how stupid a GPU is at handling misses. CPUs with out-of-order execution can overlap multiple misses. This allows them to use far fewer threads and have very little cache contention, ensuring a high hit rate and lowering the RAM bandwidth requirements. AVX-1024 should ensure close to 100% utilization even for very complex tasks.
Unless your entire working set fits in cache, you will be streaming lots of stuff from RAM as well.
 
Btw, everybody should forget the word "superscalar" with respect to GPUs. It simply doesn't apply, as one doesn't have scalar but vector ALUs. If one doesn't want to call it supervector issue, just call it dual issue. GCN has a real scalar issue port; current GPUs don't.

What do you call it when a CPU issues multiple SIMD instructions in parallel?
 
What do you call it when a CPU issues multiple SIMD instructions in parallel?
The term superscalar was meant to describe simultaneous issue to several scalar ALUs/pipelines. So while the SIMD extensions of CPUs blur the line a bit of course, a normal CPU is a superscalar CPU anyway (look at the integer core!), just with some shallow vector unit(s) attached to it, which may (Intel) or may not (AMD) use the same scheduler. So they effectively unify both concepts, with scalar and vector pipelines both allowing the simultaneous issue of instructions.

The fundamental thing is that "scalar" describes a value (not an instruction!) representable by a single number, while a vector is represented by several numbers. That's where the fundamental distinction between scalar processors and vector processors originated: does an instruction operate on scalars (single values as operands) or on vectors (the operands are vectors)? The evolution to superscalar processors didn't change that fundamental difference; it added simultaneous issue to scalar ALUs. Of course you can add the same to vector processors. But starting to call vector processors superscalar now because of their capability for simultaneous issue of instructions appears a bit ridiculous to me. Scalar vector units, seriously? :LOL:
But that tells me that nvidia's marketing works after all :rolleyes:
 
You will always have that problem when trying to force fit old terminology to new architectures. There isn't accepted terminology for "dual-issue" of SIMD for CPUs so why bother with CPU terminology at all?

Btw, I think instruction issue is more useful than data widths when describing an architecture. The "super" in superscalar is definitely referring to multiple instruction issue. Why complain about the use of existing terms when there isn't a better, widely accepted alternative?

supervector or supersimd makes sense but good luck getting people to use that! In any case, from a software standpoint GF104 is superscalar. Arguments over whether hardware configuration or software perspective is more important can now ensue :)
 
You will always have that problem when trying to force fit old terminology to new architectures. There isn't accepted terminology for "dual-issue" of SIMD for CPUs so why bother with CPU terminology at all?
It is very simple. Don't you remember that the term "3-way superscalar CPU", for instance, was quite well accepted? It simply means there are 3 parallel pipelines. You can use the same terminology for vector processing, too (3-way vector issue). And you also have no problems if you want to apply that scheme to mixed scalar/vector architectures like current CPUs. I don't see the problem. ;)
Btw, I think instruction issue is more useful than data widths when describing an architecture.
Then name the issue for what it is: 2-way, 3-way or whatever simultaneous issue!
The "super" in superscalar is definitely referring to multiple instruction issue.
That's what I said above. And the "scalar" part applies to the nature of the ALUs. Why would one use a term where only half of it describes reality correctly and the other half is completely misleading and flat-out wrong?
Why complain about the use of existing terms when there isn't a better, widely accepted alternative?
There is, as you can just specify the instruction issue as x-way simultaneous. And it is even general and works for both scalar and vector processors.
supervector or supersimd makes sense but good luck getting people to use that!
That's why it would be my preferred term. I think it was clear.
In any case, from a software standpoint GF104 is superscalar.
Not at all. Simultaneous issue is transparent, the vector nature of the architecture is not (even when the programming model hides it a bit).
 