22 nm Larrabee

Gipsel · Jul 9, 2011

aaronspink said:
So far, all the data points to the GPUs having even worst efficiency ratios compared to CPU when doing things that are not linpack. solving huge systems of linear equations is basically best case for GPUs.

Linpack isn't the best case for GPUs, specifically nvidia GPUs. More GPU friendly algorithms definitely exist. And there are even problems, which run with higher efficiency on GPUs than on CPUs.

aaronspink said:
Superscalar has nothing to do with data. your whole entire argument is from a perspective of ignorance and invalid. I would suggest some reading on actual computer architecture and terminology.

If you start that way, I would recommend you to first getting a dictionary to look up the meaning of the word "scalar" and after that have a look at the history how the whole thing of scalar vs. vector vs. superscalar vs. VLIW vs. whatever developed.

Or I will shut up if you explain, your next statement a bit more in detail:

aaronspink said:
No, not really. Superscalar and uniscalar refer to ISSUE not data. By your own argument, superscalar and uniscalar never applied to any CPU that has shipped in the last 20+ years as well...

How didn't the individual instructions operate on scalar data for the CPUs in the last 20+ years besides the SIMD extensions I specifically mentioned (and which appeared later in mainstream CPUs)?

Or to make it clearer, uniscalar (single issue) and superscalar (multi issue) refer of course to the instruction issue, but the scalar part originates from the scalar vs. vector processor question (clearly referring to the data the instructions operate on). As superscalar processors are normally seen as an evolution of (uni)scalar processors, they simply inherited this property. "Super" as a word simply means above or beyond and designates the issue of multiple instructions in an otherwise scalar ISA (opposed to VLIW, where the ISA and the compiler provide the means for issuing multiple parallel operations, superscalar is an µarchitectural implementation detail of a scalar ISA which is transparent). And in complete analogy there are also single issue, multi issue, and also VLIW vector processors as well as there are multi issue CPUs with a mixed scalar/vector instruction set (the majority right now). Those things are basically orthogonal properties of a processor.

aaronspink · Jul 9, 2011

Gipsel said:
How didn't the individual instructions operate on scalar data for the CPUs in the last 20+ years besides the SIMD extensions I specifically mentioned (and which appeared later in mainstream CPUs)?

you do realize that bit vector operations have been pretty standard since pretty much the beginning of time, right? 4004 had them. 8008 had them. 8080 had them. Etc.

And FYI, vector is dead, has been for quite some time.

Gipsel · Jul 10, 2011

aaronspink said:
you do realize that bit vector operations have been pretty standard since pretty much the beginning of time, right? 4004 had them. 8008 had them. 8080 had them. Etc.

And FYI, vector is dead, has been for quite some time.

You really want to consider a bitfield as a vector? You are kidding, right?
I see no coherent illustration of your take on this issue. I delivered one, you didn't.
And by the way, vector is far from dead. Every major ISA has a vector extension of some sort. And do I really have to mention GPUs, which are basically the current stronghold of vector processors (at least the big ones, there are a slew of vector DSPs out there)? So what is your argument?

rpg.314 · Jul 10, 2011

Gipsel said:
You really want to consider a bitfield as a vector? You are kidding, right?
I see no coherent illustration of your take on this issue. I delivered one, you didn't.
And by the way, vector is far from dead. Every major ISA has a vector extension of some sort. And do I really have to mention GPUs, which are basically the current stronghold of vector processors (at least the big ones, there are a slew of vector DSPs out there)? So what is your argument?

Well, I wouldn't call an ISA without scatter/gather and predication a vector ISA. At best, it is glorified VLIW.

Gipsel · Jul 10, 2011

rpg.314 said:
Well, I wouldn't call an ISA without scatter/gather and predication a vector ISA. At best, it is glorified VLIW.

One can probably argue about that, but my point was actually that common CPUs have traditionally a scalar ISA. So thanks for supporting that.

rpg.314 · Jul 10, 2011

And FYI, vector is dead, has been for quite some time.

I'd say vector has had a rebirth in the last couple of years.

Nick · Jul 10, 2011

rpg.314 said:
If Intel can implement a single uop gather, then great. It's just that there are real issues that need to be tackled. Making the execution of uops interruptible seems to add some complexity that is, afaik, not present at the moment.

With 512-bit vector processing, Larrabee demands a high-performance gather implementation. Micro-coding each individual element access is absolutely not an option (neither for graphics nor HPC). We know they've implemented a "clean scatter/gather architecture" and Larrabee supports "context switching & pre-emptive multi-tasking".

So whatever real issues you had in mind, Intel has a solution for them. And I've got a pretty solid idea what Haswell's implementation might look like...

CarstenS · Jul 10, 2011

Nick said:
That's still a lower computing density when corrected for the process shrink.

I've factored that in already:
1581 GFLOPS, 520 sqmm., 40nm ->3,04 GFLOPS/sqmm ->
708 GFLOPS, 470 sqmm, 55nm -> 1,51 GFLOPS/sqmm /[(40*40)/(55*55)] ->2,848 GFLOPS/sqmm.
Or is that calculation wrong? GT200b-die would, ideally shrunk, be 249 sqmm in 40nm.

Nick said:
That would be fine if it did software rendering and ran every other application plus the operating system...

Which is what APUs do, for example, as you are providing them as an example below.

Nick said:
This thread evolved into a discussion about homogeneous versus heterogeneous architectures. That means a CPU without IGP, versus an APU or CPU + discrete GPU.

Nick said:
What's the area for 8 logarithmic right shift units for 64 bytes at byte granularity?

I don't know, please help me out here. I'd wager it is >0 if you want to have specialized circuits for it, no?

--
edit:
Please, don't get me wrong! I really want to understand your line of arguments.

CarstenS · Jul 10, 2011

Nick said:
Nonsense. Even a GMA 950 can render Aero smoothly, and they use lots of optimizations to ensure the graphics load is as low as possible. Heck the windows are pixel aligned most of the time so with a software renderer you can just copy the texture data directly.

I don't understand - sorry: First you started by software rendering on a CPU whose IGP has been ditched (1]). Of course, the graphics load wouldn't be very high, but you would keep at least once core, the memory controller and probably L2/LLC away from deeper sleep states, wouldn't you? So using the CPU for things ff-hardware can do very efficiently, is not a very good option energy wise - which was the starting point.

aaronspink · Jul 10, 2011

Gipsel said:
You really want to consider a bitfield as a vector? You are kidding, right?

fits easily into the definition of vector you've been using.

Every major ISA has a vector extension of some sort.

Name 1 current in use ISA with a vector extension? x86? Nope. Sparc? Nope. Power? Nope. ARM? nope.

In fact the last two proposed designs I know were the Alpha Tarantula and the Cray X1E system.

And do I really have to mention GPUs, which are basically the current stronghold of vector processors (at least the big ones, there are a slew of vector DSPs out there)? So what is your argument?

GPUs are threaded SIMD based processors. To call them real vectors is to stretch the truth to the extreme.

aaronspink · Jul 10, 2011

rpg.314 said:
I'd say vector has had a rebirth in the last couple of years.

SIMD had a second eruption, vector, not so much.

rpg.314 · Jul 10, 2011

aaronspink said:
GPUs are threaded SIMD based processors. To call them real vectors is to stretch the truth to the extreme.

So the GPU's are SIMD but LRB is vector because LRB has scalar instructions as well, but not GPU's. Is that so?

If yes, then imho, that's more hair splitting than a meaningful distinction. And even then, GCN tips the vector scales.

Gipsel · Jul 10, 2011

aaronspink said:
fits easily into the definition of vector you've been using.

No? Or do you want to define an addition of two dwords as a succession of 32 horizontal bitwise carry adds within a vector?

aaronspink said:
Name 1 current in use ISA with a vector extension? x86? Nope. Sparc? Nope. Power? Nope. ARM? nope.

Tell that intel! They even named their later efforts "Advanced Vector eXtension".

The most fundamental property of a vector instruction is that it operates on vectors. It's that simple. If you have instructions for common data types as elements of those vectors, you have a vector extension (yes, SIMD is only a subset, but who cares about that) to an otherwise scalar ISA.

3dilettante · Jul 11, 2011

Nick said:
With 512-bit vector processing, Larrabee demands a high-performance gather implementation. Micro-coding each individual element access is absolutely not an option (neither for graphics nor HPC).

The scheme I was thinking of wouldn't create a separate op for each element, rather it would issue uops per cache access.
A high amount of locality would create a handful of uops, which is much better than dozens of separate instructions in the L1 and dozens in the pipeline.

The updating of the mask and destination registers while the instruction was in progress seemed to be consistent.

We know they've implemented a "clean scatter/gather architecture"

What does it mean to be "clean"? And why only Larrabee 3?

So whatever real issues you had in mind, Intel has a solution for them. And I've got a pretty solid idea what Haswell's implementation might look like...

Interesting. What's your idea?
A uop that can replay itself when the mask is not zero and update the mask and destination register?
It would be a change if the replay can be done after writing to the two registers, instead of rolling things back.

I thought a microcoded solution would be a reasonable compromise that would allow area sharing between the new permute hardware and the gather hardware.

Nick · Jul 11, 2011

CarstenS said:
I've factored that in already:
1581 GFLOPS, 520 sqmm., 40nm ->3,04 GFLOPS/sqmm ->
708 GFLOPS, 470 sqmm, 55nm -> 1,51 GFLOPS/sqmm /[(40*40)/(55*55)] ->2,848 GFLOPS/sqmm.
Or is that calculation wrong? GT200b-die would, ideally shrunk, be 249 sqmm in 40nm.

Seems correct to me now. Not counting in the SFU's MUL made all the difference. Note that the process shrink may also have allowed the clock frequency increase, but I'll give that the benefit of the doubt.

So that's a 7% increase in computing density for the GPU, versus an 80% increase in computing density for the CPU. And with AVX2 the CPU will nearly double its computing density once again. So I sincerely doubt that GPUs can prevent the throughput performance gap from getting a whole lot smaller.

Which is what APUs do, for example, as you are providing them as an example below.

Yes, comparing an APU against a homogeneous CPU with software rendering is valid since they both support all the functionality of a complete system. Comparing a GPU against a CPU is not a valid comparison though since the latter is demanded to run a whole lot more applications than the former. The GPU can not function without the CPU.

So when evaluating heterogeneous versus homogeneous architectures, you need to add some area to the GPU (which decreases the computing density), and subtract the area of the IGP from the CPU (which increases computing density).

I don't know, please help me out here. I'd wager it is >0 if you want to have specialized circuits for it, no?

I don't know exactly either. Since you mentioned that to support gather some more area will be needed, I expected you knew just how significant it would be...

I believe it's negligible though. Caches have existed for several decades, so right shifting a cache line at byte granularity must be really cheap. Many architectures have multiple load/store units too. So having eight shifters, to support gathering up to eight elements from a cache line each cycle, doesn't seem very significant. Also, Larrabee is supposed to be capable of gathering up to 16 elements, and has much smaller cores...

A 64-byte logarithmic right shifter requires 6 stages of 2:1 multiplexers. These can be implemented with essentially 6,144 n-type transistors, and probably a buffer in the middle and at the output. So roughly 80,000 transistors for all eight shifters, most of which small n-type transistors. That won't affect the compute density in any significant way. The logic to test which indices address the same cache line is also tiny: you only need seven 25-bit comparators and seven 7-bit adders. Everything else is already largely in place; a word which straddles two cache lines will queue up two fetches, so it shouldn't be hard to allow a gather uop to queue up all of its cache line fetches over multiple cycles.

Nick · Jul 11, 2011

CarstenS said:
I don't understand - sorry: First you started by software rendering on a CPU whose IGP has been ditched (1]). Of course, the graphics load wouldn't be very high, but you would keep at least once core, the memory controller and probably L2/LLC away from deeper sleep states, wouldn't you? So using the CPU for things ff-hardware can do very efficiently, is not a very good option energy wise - which was the starting point.

Even with an IGP, you still need to keep one CPU core active for running the O.S. and graphics driver, and you need the L3 cache and memory controller as well. So it won't make that much of a difference to perform the IGP's tasks on the CPU core(s) as well, especially when you have AVX-1024 which allows clock gating the out-of-order execution logic 3/4 of the time. FinFET also saves a lot of power at low frequency/voltage.

Nick · Jul 11, 2011

3dilettante said:
The scheme I was thinking of wouldn't create a separate op for each element, rather it would issue uops per cache access.

You can't do that at the decoder stage since it depends on the addresses.

Note that loads can already be decomposed into multiple fetches when they straddle a cache line boundary. This happens in the load unit itself, and it's still one uop.

What does it mean to be "clean"? And why only Larrabee 3?

"This is a simplified representation of what is currently a hardware-assisted multi-instruction sequence, but will become a single instruction in the future."

Interesting. What's your idea?
A uop that can replay itself when the mask is not zero and update the mask and destination register?
It would be a change if the replay can be done after writing to the two registers, instead of rolling things back.

An interrupt can just trigger the load unit to write the partial mask and destination results. Interrupts are serviced after the next retirement boundary, so it's as if the instruction retired, but replay still starts at the gather.

Note that an uninterrupted gather does not require writing two 256-bit registers. Zeroing the mask register merely requires flagging it in the renamer.

I thought a microcoded solution would be a reasonable compromise that would allow area sharing between the new permute hardware and the gather hardware.

That would still require 24 uops and thus offer very little advantage.

rpg.314 · Jul 11, 2011

Nick said:
Seems correct to me now. Not counting in the SFU's MUL made all the difference. Note that the process shrink may also have allowed the clock frequency increase, but I'll give that the benefit of the doubt.

So that's a 7% increase in computing density for the GPU, versus an 80% increase in computing density for the CPU. And with AVX2 the CPU will nearly double its computing density once again. So I sincerely doubt that GPUs can prevent the throughput performance gap from getting a whole lot smaller.

That's a comparison over a very short period of time, just one year for CPUs. If you look at longer term trends, the growth of compute density is pretty matched, even in the best case for CPU's. GT200->Fermi increased a lot of programmability, added lots of ff hw, and still managed to eke out a smaller increase. Besides, there's lots of stuff in CPUs too which will make perfect 2x scaling impossible.

CarstenS · Jul 11, 2011

Nick said:
Seems correct to me now. Not counting in the SFU's MUL made all the difference. Note that the process shrink may also have allowed the clock frequency increase, but I'll give that the benefit of the doubt.

Thanks, but with differently optimized architectures clock and ,in case of GPUs more importantly, power potentials do not seem to be a function of process tech; or you could argue that since 65nm, there has been no improvement on the clock front for Intel and they're still stuck at 3,8 GHz. (Which is not what I'd consider something we will have to debate over).

And yes, the SFU's MUL is and was from the beginning marketing bogus, even though you could detect it's presence to some degree by directed tests. Thanks for not using Wikipedia-repeated nonsense any more.

Nick said:
So that's a 7% increase in computing density for the GPU, versus an 80% increase in computing density for the CPU. And with AVX2 the CPU will nearly double its computing density once again. So I sincerely doubt that GPUs can prevent the throughput performance gap from getting a whole lot smaller.

Just came to my mind: The analogue part of chips have been said to scale really poor by process tech and is a burden each GPU must bear. And on top of that you subtract that crucial part from CPUs...

Maybe some GPU-vendor goes back to separating 2D and Video in an additional chip like Nvidia did in the past? That'd increase compute density a bit I'd say.

Anyway, there are two (or threee) more important things which could benefit GPUs rather than CPUs in the future:
Number one is related to the point above: Removal of ff hardware in order to cram more compute units into a given die.

Number two: contrary to GPUs, CPUs do not automatically profit from wider designs, i.e. higher counts (or width) of their execution units. AVX is available for half a year now and has been documented well for a longer time before. Except for SiSoft Sandra I don't even know another synthetic tool which makes use of them. GPUs can put more execution capabilities to good use immediately in their traditional domain: graphics (given that CPUs until now don't use their Vector/SIMD Extensions for that and it is yet to be seen if they will).

And finally number three: CPUs have to maintain hardware backward compatibility with their ancestors at least for a while in order for their extension to be of any real-world relevance. GPUs have the benefit (or curse) of being able (or forced, depending on your point of view) to mask that behind the driver. This, obviously, does not completely disable any kind of progress in the same line of extensions, but makes it at least more costly and/or inflexible compared to being able to do something completely different in your next generation. At least that's what I'd think from my layman's perspective.

Nick said:
Even with an IGP, you still need to keep one CPU core active for running the O.S. and graphics driver, and you need the L3 cache and memory controller as well. So it won't make that much of a difference to perform the IGP's tasks on the CPU core(s) as well, especially when you have AVX-1024 which allows clock gating the out-of-order execution logic 3/4 of the time. FinFET also saves a lot of power at low frequency/voltage.

While i see your point, i also suspect that there must be a reason, why Intel has decided to go with more ff hardware in SB IGP (most of which related to video playback/transcoding) than in previous generations instead of using the x86 cores.

hoho · Jul 11, 2011

CarstenS said:
or you could argue that since 65nm, there has been no improvement on the clock front for Intel and they're still stuck at 3,8 GHz.

"Downgrading" from 3.8GHz Netburst to 3GHz Core2 isn't really all that bad. At what clock would i7 run if it was allowed for it to have 150W TDP as some of highest end Netbursts did?

CarstenS said:
Number two: contrary to GPUs, CPUs do not automatically profit from wider designs, i.e. higher counts (or width) of their execution units. AVX is available for half a year now and has been documented well for a longer time before. Except for SiSoft Sandra I don't even know another synthetic tool which makes use of them. GPUs can put more execution capabilities to good use immediately in their traditional domain: graphics (given that CPUs until now don't use their Vector/SIMD Extensions for that and it is yet to be seen if they will).

that depends directly on the software you are running. I would imagine a CPU-based OpenCL or something similar would scale automagically with wider configuration

CarstenS said:
And finally number three: CPUs have to maintain hardware backward compatibility with their ancestors

Pretty much the only thing they need is a "virtual machine" in form of x86->internal microcode translator. The underlying hardware has seen rather huge changes over past 30 years while keeping backwards compability without too much of a transistor-burden just fine.

22 nm Larrabee

Gipsel

aaronspink

Gipsel

rpg.314

Gipsel

rpg.314

Nick

CarstenS

Moderator

CarstenS

Moderator

aaronspink

aaronspink

rpg.314

Gipsel

3dilettante

Nick

Nick

Nick

rpg.314

CarstenS

Moderator

hoho

Similar threads