22 nm Larrabee

Call stack size != OoO.
I know, but to enable running complex code you need a substantial stack size, which means the thread count has to be minimal, which in turn can only be achieved with out-of-order execution.
Something like LRB1, i.e. allocating registers and L1 from the same pool.
Since LRB1 didn't achieve the peak performance and power efficiency to compete with GPUs, what makes you confident it can be implemented cheaply and efficiently? Still sounds like more convergence to me.
 
which in turn can only be achieved with out-of-order execution.
Like I said, all the advantages of OoO are already there.

Since LRB1 didn't achieve the peak performance and power efficiency to compete with GPUs, what makes you confident it can be implemented cheaply and efficiently? Still sounds like more convergence to me.
There's lots of fixed-function hardware still there. The compute density of GPUs is higher than LRB1's. The driver teams have much more experience than the LRB1 team had at the time. And there's no x86 penalty.
 
It's an eightfold increase since Prescott, and a fourfold increase since Nehalem, per core. The number of cores is going up as well, and by ditching the IGP there would be room for even more cores.

Ok let's entertain the idea of a super-CPU. An 8-core AVX2 chip @ 4 GHz gives you a single-precision teraflop. Sounds impressive until you look at perf/flop or perf/$. That CPU on 22 nm will be $400+ unless Intel becomes a charity. The equivalent 28 nm GPU will be $150, earlier to market and also offer higher performance. Not so impressive after all - a $120 CPU and $150 GPU will be cheaper and faster than your homogeneous setup.
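
(For reference, the teraflop figure checks out if you assume two 256-bit FMA units per core - an assumption about the 22 nm part, nothing announced: 8 SP lanes × 2 flops per FMA × 2 units = 32 flops/clock per core, × 8 cores × 4 GHz ≈ 1 TFLOP single precision.)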

As I've said before, the average texture sampling rate is far lower than the peak texture rate GPUs waste their transistors on.

And as mentioned before, you have to spec for peak, not average. Texturing workloads aren't spread evenly across a frame and no amount of CPU magic will change that. If you spec your CPU for average texture throughput it will be severely bottlenecked during texture-heavy portions of the frame.

GPUs have a lot of FF silicon but much of it is active in parallel and not bottlenecking anything. CPUs are more flexible but also slow - it's a simple tradeoff. You say that CPUs don't need to burn a lot of flops emulating FF hardware, so where are the software renderers that prove that out?

A GPU doesn't bother to skip processing invisible geometry. ... With a CPU you can make more fine-grained decisions, using the available FLOPS more wisely.

Never heard of culling or early-Z, eh? There's definitely a brute-force aspect to GPUs, but smarter geometry processing like that proposed for Larrabee is still just a theory. It's real easy to make big claims on paper.

Heck, even for a heterogeneous architecture the CPU has to continue to scale to keep supporting the otherwise helpless GPU.

Is that why my Q9550 is enough to drive a GTX 580? CPUs need to scale but not by much.

In general you're being a little too optimistic about CPUs and too dismissive of GPU performance. All current evidence points in the opposite direction. Today's fastest and most expensive CPUs can't render 10-year-old games, yet we are to believe they are going to catch up with acceptable performance in modern games a year or two from now? Sorry, it's just not going to happen :)
 
I know, but to enable running complex code you need a substantial stack size, which means the thread count has to be minimal, which in turn can only be achieved with out-of-order execution.

A GPU core currently maintains thousands of active thread contexts. A CPU core has at most two. Why would GPUs need to reduce their thread counts to CPU levels in order to support deeper stacks or more registers per thread? They're starting from a three-orders-of-magnitude advantage! Even a small reduction in thread count would be a huge benefit and still not require the immense burden of ILP extraction that CPUs bear.
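
(To put numbers on it, taking Fermi's published figures - which may not carry over to other parts: a GF100 SM has 32768 32-bit registers shared by up to 1536 resident threads, i.e. about 21 registers per thread. Halve the residency to 768 threads and you already get ~42 registers per thread, with room for deeper stacks, while still keeping hundreds of threads around to hide latency with.)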

For OoO execution, nVidia is already most of the way there. I wouldn't be surprised to learn that arithmetic instructions can issue out of order in a narrow window. The scoreboarding patent didn't put any restrictions on instruction ordering.
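
As a toy illustration of what "issue out of order in a narrow window" means (plain C, with a made-up four-instruction window, invented latencies and register numbers - not derived from the patent or from any real hardware), a simple per-instruction dependence check is enough to let a younger, independent instruction go ahead while an older one waits on a long-latency producer:
Code:
#include <stdbool.h>
#include <stdio.h>

#define NUM_REGS 8
#define WINDOW   4

typedef struct {
    int dst, src0, src1;   /* register numbers                 */
    int latency;           /* cycles until the result is ready */
    bool issued, done;
} Insn;

int main(void)
{
    /* Invented window: insn 1 depends on the long-latency insn 0,
       insn 2 is independent, insn 3 depends on both adds.        */
    Insn win[WINDOW] = {
        { .dst = 1, .src0 = 0, .src1 = 0, .latency = 5 }, /* "load" r1      */
        { .dst = 3, .src0 = 1, .src1 = 2, .latency = 1 }, /* add r3, r1, r2 */
        { .dst = 6, .src0 = 4, .src1 = 5, .latency = 1 }, /* add r6, r4, r5 */
        { .dst = 7, .src0 = 6, .src1 = 3, .latency = 1 }, /* add r7, r6, r3 */
    };
    int pending[NUM_REGS] = { 0 };  /* cycles left until a register is written */
    int completed = 0;

    for (int cycle = 1; completed < WINDOW; cycle++) {
        /* Complete results whose latency has elapsed. */
        for (int r = 0; r < NUM_REGS; r++)
            if (pending[r] > 0)
                pending[r]--;
        for (int i = 0; i < WINDOW; i++)
            if (win[i].issued && !win[i].done && pending[win[i].dst] == 0) {
                win[i].done = true;
                completed++;
            }

        /* Issue anything whose hazards against OLDER, unfinished
           instructions are clear - regardless of program order.  */
        for (int i = 0; i < WINDOW; i++) {
            Insn *in = &win[i];
            if (in->issued)
                continue;
            bool ready = true;
            for (int j = 0; j < i; j++) {
                const Insn *old = &win[j];
                if (old->done)
                    continue;
                if (old->dst == in->src0 || old->dst == in->src1 ||  /* RAW */
                    old->dst == in->dst  ||                          /* WAW */
                    old->src0 == in->dst || old->src1 == in->dst)    /* WAR */
                    ready = false;
            }
            if (ready) {
                in->issued = true;
                pending[in->dst] = in->latency;
                printf("cycle %2d: issued insn %d\n", cycle, i);
            }
        }
    }
    return 0;
}
Running it, the third instruction issues in cycle 1 alongside the first, while the second sits until its pretend load completes - which is all the reordering being claimed here.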
 
Ok let's entertain the idea of a super-CPU. An 8-core AVX2 chip @ 4 GHz gives you a single-precision teraflop. Sounds impressive until you look at perf/flop or perf/$. That CPU on 22 nm will be $400+ unless Intel becomes a charity. The equivalent 28 nm GPU will be $150, earlier to market and also offer higher performance. Not so impressive after all - a $120 CPU and $150 GPU will be cheaper and faster than your homogeneous setup.
Hell, even a $200 APU will trash such a monster.
 
For OoO execution, nVidia is already most of the way there. I wouldn't be surprised to learn that arithmetic instructions can issue out of order in a narrow window. The scoreboarding patent didn't put any restrictions on instruction ordering.

You don't need a scoreboard for that. Just a better compiler will do.
 
If there's anything that might widen the gap again, which CPUs can't implement, I haven't heard about it yet and I'm open to learning all about it.
The real issue is that to bridge that gap, CPUs will have to grow their compute density by an order of magnitude BEYOND Moore's law in a couple of years, or else the window of opportunity will close. And that is just plain impossible when the first and last design target is serial IPC/W.
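
(Rough numbers for the size of that gap, using parts I believe are current - treat them as ballpark: a 4-core Sandy Bridge at 3.4 GHz with AVX peaks around 8 lanes × 2 flops × 4 cores × 3.4 GHz ≈ 218 SP GFLOPS, while a Radeon HD 6970 is rated at about 2.7 SP TFLOPS. That's roughly a 12× gap before process scaling even enters the picture.)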
 
One question up front: does that image depict double-precision numbers or single-precision ones? Edit: Ah, SGEMM gave it away; single precision it is!
[Attached image: RelativePerf_i7vsGTX280.JPG]


In each of these, the CPU reaches high utilization (except when limited by the lack of gather/scatter). Now imagine what that graph looks like with twice the cores (to match die size and power), four times the computing power per core, and gather/scatter.
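
To make the gather half of that concrete (AVX2 syntax, which of course postdates the i7 in that graph, so this is purely illustrative - the intrinsic _mm256_i32gather_ps is real, the function names around it are just for the example), eight indexed loads collapse into a single instruction instead of a scalar loop:
Code:
#include <immintrin.h>   /* compile with -mavx2 */

/* Eight table lookups with a hardware gather... */
__m256 lookup8_gather(const float *table, __m256i idx)
{
    return _mm256_i32gather_ps(table, idx, 4);   /* scale 4 = sizeof(float) */
}

/* ...versus the scalar fallback a gather-less CPU is stuck with. */
__m256 lookup8_scalar(const float *table, const int idx[8])
{
    float tmp[8];
    for (int i = 0; i < 8; i++)      /* eight separate loads */
        tmp[i] = table[idx[i]];
    return _mm256_loadu_ps(tmp);
}
(Scatter is still missing from that instruction set, so the store side would remain scalar.)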

So, my above question notwithstanding, you're comparing a GPU architecture from 1H2008, which was basically Nvidia's first useful attempt at GPU computing, against a hypothetical CPU architecture for a timeframe of what, 1H2014?

I fail to see the point[strike], especially if those were DP numbers[/strike].
 
Really? Then what you said isn't really true. Of the top 5 supercomputers in your list, 3 are using nVidia GPUs.
And except for the Japanese one in 1st place, the nVidia-powered supercomputers are more recent too.

And they are mostly useless for just about everything except Linpack!



I meant "demanding" applications for domestic computers, like playing 1080p (later stereo-3D) videos, 3d games, video+image editing software, opening several tabs with "heavy" web pages, eventually complex WebGL games, etc.

So you didn't mean what you said but instead meant something that had no relation to what you said! Got it.

And BTW, most of the things you just listed have nothing to do with GPUs, but with simple fixed-function hardware...
 
But that is "just" a distinct weakness of Fermi in this respect. They don't handle the matrix operations (which are the base for the top500 list) with very high efficiency. AMD GPUs are actually currently better in this and nvidia promised to improve that considerably with Kepler too (i.e. they aim for parity with CPUs). And you always get a better power efficiency when choosing low voltage parts, that is true for GPUs too.

So they can't handle efficiently pretty much the simplest operation they will ever see, and that operation is a best-case setting for them?

The more general problem is actually what the scores tell us about the code those computers will actually encounter in reality. It's almost nothing. Just solving huge systems of linear equations isn't what most of these systems do as their daily work. :rolleyes:

So far, all the data points to GPUs having even worse efficiency ratios compared to CPUs when doing things that are not Linpack. Solving huge systems of linear equations is basically a best case for GPUs: dense matrix multiplication does on the order of 2n^3 flops on only 3n^2 values, so it's about as compute-dense and regular as workloads get.
 
Btw, everybody should forget the word "superscalar" with respect to GPUs. It simply doesn't apply, as one doesn't have scalar ALUs, but vector ALUs. If one doesn't want to name it supervector issue, just name it dual issue. GCN has a real scalar issue port; all current GPUs don't.

No, not really. Superscalar and uniscalar refer to ISSUE, not data. By your own argument, superscalar and uniscalar never applied to any CPU that has shipped in the last 20+ years either...
 
The term superscalar was meant to describe the simultaneous issue to several scalar ALUs/pipelines. So while the SIMD extensions of CPUs blur the line a bit, of course, a normal CPU is a superscalar CPU anyway (look at the integer core!), just with some shallow vector unit(s) attached to it, which may (Intel) or may not (AMD) use the same scheduler. So they effectively unify both concepts, scalar and vector pipelines both allowing the simultaneous issue of instructions.

The fundamental thing is that "scalar" describes a value (and not an instruction!) representable by a single number; a vector is represented by several numbers. That's where the fundamental distinction between scalar processors and vector processors originated: does an instruction operate on scalars (single values as operands) or on vectors (operands are vectors)? The evolution to superscalar processors didn't change that fundamental difference; it just added simultaneous issue to scalar ALUs. Of course you can add the same to vector processors. But starting to call vector processors superscalar now, because of their capability for simultaneous issue of instructions, appears a bit ridiculous to me. Scalar vector units, seriously? :LOL:
But that tells me that Nvidia's marketing works after all :rolleyes:

Superscalar has nothing to do with data. Your entire argument comes from a perspective of ignorance and is invalid. I would suggest some reading on actual computer architecture and terminology.
 
You will always have that problem when trying to force-fit old terminology onto new architectures. There isn't accepted terminology for "dual-issue" of SIMD on CPUs, so why bother with CPU terminology at all?

Dual-issue superscalar works just fine, actually. That applies to dual-issue vector, matrix, SIMD, VLIW, etc. Superscalar is a well-defined term within general and computer science/architecture use and has no relation to the actual operations being issued; it refers to the issuing itself.

Supervector or supersimd makes sense, but good luck getting people to use that! In any case, from a software standpoint GF104 is superscalar. Arguments over whether the hardware configuration or the software perspective is more important can now ensue :)

Actually in context, they make NO BLOODY SENSE.

And from a software perspective: GF104 is scalar ;) See, you are looking at the wrong thing again. I.e., you must look at the contract points. Converting from an HLL to an LLL doesn't tell you anything about the software interface of a device.
 
It is very simple. Don't you remember that the term "3-way superscalar CPU", for instance, was quite well accepted? It simply means there are 3 parallel pipelines. You can use the same terminology for vector processing, too (3-way vector issue).

Really? You don't understand the raw basics, do you?

Then name the issue as what it is: 2-way, 3-way or whatever simultaneous issue!

An understanding of a field and its terminology is crucial to not making a fool of oneself. I would highly recommend that you get such an understanding before continuing.

And the "scalar" part applies to the nature of the ALUs.

And NO! NOPE! INCORRECT! WRONG!

And it is even general and works for scalar and vector processors.

Instructions are not data.
 
Ok let's entertain the idea of a super-CPU. An 8-core AVX2 chip @ 4 GHz gives you a single-precision teraflop. Sounds impressive until you look at perf/flop or perf/$. That CPU on 22 nm will be $400+ unless Intel becomes a charity. The equivalent 28 nm GPU will be $150, earlier to market and also offer higher performance. Not so impressive after all - a $120 CPU and $150 GPU will be cheaper and faster than your homogeneous setup.

And that "cheaper" solution will take 2-40x the programming effort and deliver 1/10th to 1/5th the effective performance if current trends hold.

Or to put it more simply: CPU flops >> GPU flops.
 
Like I said, all the advantages of OoO are already there.
No, they're not. Lots of workloads look something like this:
Code:
sequential code

repeat N times
{
    independent iterations
}

sequential code

repeat M times
{
    dependent iterations
}

...
The only case in which a GPU's thread scheduling can potentially approach out-of-order execution is the loop with independent iterations. But even then the data is accessed 'horizontally' instead of 'vertically', making the caches less effective. When the loop is short, out-of-order execution will start executing multiple iterations simultaneously, but only for as many instructions as necessary to keep the execution units busy. By keeping the data accesses as local as possible, the benefit of the precious small caches is maximized.

So even in the best-case scenario for the GPU, with laboriously tuned code, out-of-order execution fares better. For sequential code or loops with dependencies, GPUs slow down to a crawl, and no number of in-order cores can turn that around. Only out-of-order execution can.
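
To make those two loop shapes concrete, here's a generic C sketch (not taken from any particular workload - the function names and operations are just placeholders):
Code:
/* Independent iterations: each i stands on its own, so a GPU
   (or an out-of-order core) can overlap as many as it likes. */
void independent_loop(float *out, const float *a, const float *b,
                      const float *c, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = a[i] * b[i] + c[i];
}

/* Dependent iterations: every step needs the previous result, so
   wide in-order hardware has nothing to overlap; only the latency
   of the chain matters. */
float dependent_loop(const float *a, const float *b, int m, float seed)
{
    float x = seed;
    for (int i = 0; i < m; i++)
        x = x * a[i] + b[i];
    return x;
}
The first fans out over thousands of GPU threads (or out-of-order iterations) without trouble; the second is a serial recurrence, where piling on more in-order threads buys nothing and only fast execution of the dependency chain helps.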
There's lots of fixed-function hardware still there. The compute density of GPUs is higher than LRB1's. The driver teams have much more experience than the LRB1 team had at the time. And there's no x86 penalty.
Larrabee has less fixed-function hardware, so its compute density should have been higher than that of a GPU, unless having a "flexible memory hierarchy" is to blame. I really don't think the x86 overhead is of any significance, since classic Pentium cores are tiny and other ISAs aren't free to implement either. I estimate the overhead at a low single-digit percentage.

So if, in your opinion, GPUs should get a more flexible memory hierarchy like Larrabee's, it looks like the compute density will go down.
 