22 nm Larrabee

Why not? You seem to have no issue with underestimating the capabilities of current and future GPUs :)
Because I haven't heard of any technology which would increase the GPU's computing density per transistor again. It has gone down both for Fermi and Cayman. There's no room to increase the number of shading cores without sacrificing vital throughput elsewhere, and either the amount of storage has to go up or the scheduling has to become more dynamic to cope with increasing workload complexity. In other words, they'll only converge closer to the CPU. The CPU on the other hand will add FMA to double throughput at only a relatively minor transistor cost.
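To illustrate what that FMA buys (just a sketch; the function and array names are made up, and it assumes the FMA3 intrinsics Haswell is expected to support):

```c
#include <immintrin.h>

/* Multiply-accumulate over n floats, n assumed to be a multiple of 8. */
void madd(float *acc, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_loadu_ps(acc + i);
        /* Plain AVX today: two arithmetic instructions per 8 elements:  */
        /*   __m256 vr = _mm256_add_ps(_mm256_mul_ps(va, vb), vc);       */
        /* With FMA it's one fused instruction, doubling peak flops:     */
        __m256 vr = _mm256_fmadd_ps(va, vb, vc);
        _mm256_storeu_ps(acc + i, vr);
    }
}
```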
 
The problem with storing your data set in registers is that it's not always feasible, or even possible. Registers are a good choice for storing scalars, but anything more complex than that is HARD.
It enables you to keep more of your intermediate values around for reuse without touching the cache, thus freeing it up for more useful fetches.
When ray tracing, are programmers supposed to hold - let's say the top one or two levels of the acceleration structure - in registers?
RT generally exhibits a strong locality of memory accesses for neighboring pixels/samples. The caches should do quite okay.
 
Isn't that the case only for primary rays? I'm asking because all the rage around raytracing seems to concentrate on accurate reflections (and shadowing)...

I know, that's only part of it and probably the most easily sellable advantage, if you want to explain raytracing to dummies like me.

edit:
Ok, you could go with Always-on-4x-Antialiasing.
 
The difference may not be as large as you think, as GPUs can hold a lot more data in their registers.

But let's make up some numbers and say a GCN CU runs 12 heavy threads. That is 3 threads per 512-bit wide SIMD unit (the logical width is 2048 bit in ATI's case). A CPU with SMT also already runs 2 threads for, let's say, two 256-bit SIMD units. Each thread (aka wavefront) on a CU has significantly more vector registers (in this example 85 instead of 16) available to store its active data before there is any need to resort to the caches for reuse of data.
With a logical width of 2048-bit for GCN there are only 21 registers from the code's point of view.

With AVX-1024, the CPU would still have 16 logical registers, but quadruple the latency hiding ability.

But registers don't increase the cache hit rate! CPUs can also store a much larger working set in memory and thus handle more complex workloads. Whereas when GCN runs out of addressable registers, performance plummets. 21 registers isn't a whole lot, especially since there isn't a viable alternative. CPUs gracefully spill data to cache memory. So the limited number of 16 registers per thread is backed by a huge amount of additional storage that can be used for additional variables as well as the actual data set.

So future CPUs will be capable of handling tasks as complex as compilation (where a 1 MB stack is no luxury), as well as high throughput tasks, and anything in between. Do not for a second underestimate just how powerful this will make every CPU.
 
With a logical width of 2048-bit for GCN there are only 21 registers from the code's point of view.
Nope, I factored that in already.
16k-float register file / 64 vector width / 3 threads = 85 registers per thread (4 times that per physical ALU).
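Spelling that arithmetic out (a minimal sketch; the 64 KB register file per SIMD and the 64-wide wavefronts are my assumptions about GCN here):

```c
#include <stdio.h>

int main(void)
{
    int file_bytes   = 64 * 1024;        /* assumed vector register file per SIMD */
    int file_floats  = file_bytes / 4;   /* 16384 32-bit values                   */
    int vector_width = 64;               /* wavefront width                       */
    int wavefronts   = 3;                /* heavy threads per SIMD, as above      */

    int regs_per_simd = file_floats / vector_width;   /* 256 vector registers     */
    printf("vector registers per wavefront: %d\n",
           regs_per_simd / wavefronts);                /* 85 */
    return 0;
}
```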
With AVX-1024, the CPU would still have 16 logical registers, but quadruple the latency hiding ability.
Which would only mimic what GPUs do already on top of having more registers. :LOL:
Besides that, AVX-1024 is quite hypothetical at the moment.
But registers don't increase the cache hit rate!
But they decrease the number of memory accesses in the first place! CPUs also have high hit rates because they are often storing and reloading data that GPUs can simply keep in their registers. 20% misses out of x accesses don't have to be more misses in absolute terms than 10% out of y accesses.
CPUs can also store a much larger working set in memory and thus handle more complex workloads.
Are you referring to global memory?
Whereas when GCN runs out of addressable registers, performance plummets.
So it does on CPUs. Not as sharply, I agree, but you also don't feel it that much because you are already working in that regime all the time. ;)
21 registers isn't a whole lot, especially since there isn't a viable alternative.
As said, it's actually 85 for the example given above.
CPUs gracefully spill data to cache memory.
So do Fermi and GCN. As said, maybe not as gracefully, but still. And it is less often a problem.
So the limited number of 16 registers per thread is backed by a huge amount of additional storage that can be used for additional variables as well as the actual data set.
When it comes down to streaming some values from memory and doing only a little arithmetic on them, CPUs as well as GPUs are limited by bandwidth anyway. The "window of opportunity" where the relatively larger caches of a CPU may be decisive is not that large.
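A rough way to put numbers on that window (the bandwidth and peak-flop figures below are only ballpark assumptions, not measured values):

```c
#include <stdio.h>

int main(void)
{
    /* Ballpark assumptions for a current quad-core CPU and a high-end GPU. */
    double cpu_gflops = 200.0,  cpu_gbps = 25.0;
    double gpu_gflops = 2700.0, gpu_gbps = 175.0;

    /* Below this many flops per byte streamed, the memory bus and not the  */
    /* ALUs is the limit on either architecture.                            */
    printf("CPU bandwidth-bound below %.0f flops/byte\n", cpu_gflops / cpu_gbps);
    printf("GPU bandwidth-bound below %.0f flops/byte\n", gpu_gflops / gpu_gbps);
    return 0;
}
```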
So future CPUs will be capable of handling tasks as complex as compilation (where a 1 MB stack is no luxury)
That has no relevance to throughput computing or wide vector units. They won't help for this task anyway, not even on the CPU.
, as well as high throughput tasks, and anything in between. Do not for a second underestimate just how powerful this will make every CPU.
First, we have to see the two 256-bit FMACs in Haswell and their effect on power consumption and on the transistors/space needed not only for the units themselves, but also for the L/S units, the caches and so on. All of that needs a widening by at least a factor of 2 for the units to work without stalling. Maybe then we can talk about extending that to wider vectors again.

By the way, have you ever thought of Intel providing a basically common instruction set for both the 256-bit AVX units closely coupled to the integer cores and for more throughput-oriented, Larrabee-like in-order units with wider vectors, larger register files and so on, for more power-efficient execution of those workloads? You know, there are a few projections out there that it isn't possible to use all transistors at the same time at 14nm and below for power reasons anyway.
 
How do you count that? A quad-core Sandy Bridge already has more than 900 million transistors for ~200 GFlop/s in single precision. Do you think two 256-bit FMA units per core, and all the data paths necessary so they don't just idle, come for free?
Sandy Bridge's 995 million transistors include the IGP and System Agent. The actual cores only occupy 1/3 of the die size. And extending the SSE units to AVX didn't make them a whole lot larger: Westmere versus Sandy Bridge.

AVX2 and extra bandwidth obviously won't be free but for all intents and purposes it's still a doubling of the computing density.
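For what it's worth, here's the back-of-the-envelope behind that doubling (the 3.5 GHz clock is just an assumed placeholder):

```c
#include <stdio.h>

int main(void)
{
    int    cores = 4, lanes = 8;   /* 256-bit AVX = 8 single-precision lanes */
    double ghz   = 3.5;            /* assumed clock                          */

    /* Sandy Bridge: one 256-bit add plus one 256-bit mul per cycle.         */
    double avx_peak = cores * lanes * 2 * ghz;
    /* With two 256-bit FMA units: 2 ops * 2 flops per lane per cycle.       */
    double fma_peak = cores * lanes * 2 * 2 * ghz;

    printf("AVX peak: ~%.0f SP GFLOPS\n", avx_peak);  /* ~224 */
    printf("FMA peak: ~%.0f SP GFLOPS\n", fma_peak);  /* ~448 */
    return 0;
}
```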
 
Sandy Bridge's 995 million transistors include the IGP and System Agent. The actual cores only occupy 1/3 of the die size.
So what? I already noted that the actual shader units of an RV770 (capable of 1.2 TFlop/s a few years ago) also consumed only 29% of the total die size (41% including TMUs and L1 caches).
Historically speaking, the flops/transistor over the evolution of an architecture have hardly ever gone up (there are very few exceptions, I know), as the "supportive circuitry" needed to really use the improved performance is nothing you should underestimate.
And extending the SSE units to AVX didn't make them a whole lot larger: Westmere versus Sandy Bridge.
Intel got this thanks to an optimized layout (the caches got smaller, for instance) and mainly because they didn't increase the width of most datapaths. That won't be possible again; they exploited some inherent redundancy in their prior Nehalem design. It is also a reason why Sandy Bridge with AVX is much more bandwidth constrained than Nehalem with SSE: the cache bandwidth is identical despite the 2x higher peak throughput, which makes the flops harder to use.
AVX2 and extra bandwidth obviously won't be free but for all intents and purposes it's still a doubling of the computing density.
That sentence contradicts itself. ;)
 
Nope, I factored that in already.
16k-float register file / 64 vector width / 3 threads = 85 registers per thread (4 times that per physical ALU).
Fair enough, but do 3 wavefronts / 12 cycles really suffice to cover the average latency?

Also note that CPUs with more logical registers never succeeded in fighting back x86, even back when it basically had just 6 GPRs. So I'm not convinced that 256 addressable registers shared between several wavefronts offer any sizable benefit over 16 registers per thread and a fast cache hierarchy.
Besides that, AVX-1024 is quite hypothetical at the moment.
That's what people said last month about FMA and gather...

AVX2 illustrates Intel's strong commitment to exceeding Moore's law by offering ever higher performance/Watt. They've also been touting AVX's extensibility to 1024-bit ever since its earliest announcement in 2008. It's also perfectly feasible technically, and they even obtained access to related patents.

So it's really just a question of "when" rather than "if". Frankly I don't even think it matters all that much. CPUs and GPUs continue to converge so whether homogeneous architectures become superior in 2015 or 2017 is hardly relevant. What's relevant is that Intel has no reason not to implement AVX-1024 to gain dominance over the throughput computing market.

I'm convinced AVX-1024 will arrive in 2015 though. The AVX2 announcement exceeded everyone's expectations, and nobody can deny that AVX-1024 would complete the evolution to turn the CPU into a highly competitive throughput computing device. In particular Intel has to be well aware of this and with AVX2 fully published they must now be focusing on AVX3.
But they decrease the number of memory accesses in the first place!
Yes, but not those you care about for running complex workloads.
When it comes down to streaming some values from memory and doing only a little arithmetic on them, CPUs as well as GPUs are limited by bandwidth anyway. The "window of opportunity" where the relatively larger caches of a CPU may be decisive is not that large.
I beg to differ. It's a huge win if you can "stream" data that fits in the L3 cache. It's no coincidence that Itanium 2 processors (which are targeted at the HPC market) feature humongous caches. It matters to have lots of on-chip storage, because most workloads have a significant amount of reuse within working sets of such size.
That has no relevance to throughput computing or wide vector units. They won't help for this task anyway, not even on the CPU.
Wrong. Dynamic code generation is the key to efficiently executing state-dependent kernels (cf. über-shaders).
First, we have to see the two 256-bit FMACs in Haswell and their effect on power consumption and on the transistors/space needed not only for the units themselves, but also for the L/S units, the caches and so on. All of that needs a widening by at least a factor of 2 for the units to work without stalling.
You could compare Yonah and Conroe to get an idea of what it took to go from dual 64-bit to dual 128-bit vector execution units and double the cache bandwidth, although the latter also added x64 and macro-op fusion...
You know, there are a few projections out there that it isn't possible to use all transistors at the same time at 14nm and below for power reasons anyway.
Which is why executing AVX-1024 on 256-bit execution units makes a lot more sense than extending the execution units to 512-bit. The instruction rate is lowered by a factor of four, and large portions of the core can be clock gated during AVX-1024 execution. Scaling throughput would then be done by increasing the core count. I don't expect they'll invest more transistors per core in throughput computing alone.
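Conceptually (purely an illustration, nothing official): software would see a single 1024-bit register and instruction, while the hardware sequences it over the existing 256-bit datapath in four steps, which is why the front end only has to issue at a quarter of the rate:

```c
#include <immintrin.h>

/* A hypothetical 1024-bit register, modelled as four 256-bit quarters. */
typedef struct { __m256 q[4]; } v1024;

/* What one AVX-1024 fused multiply-add would amount to: the 256-bit    */
/* unit stays busy for four cycles per instruction, so the decoders,    */
/* schedulers etc. can idle (and be clock gated) most of the time.      */
static inline v1024 fmadd1024(v1024 a, v1024 b, v1024 c)
{
    v1024 r;
    for (int i = 0; i < 4; ++i)
        r.q[i] = _mm256_fmadd_ps(a.q[i], b.q[i], c.q[i]);
    return r;
}
```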
 
So what? I already noted that the actual shader units of an RV770 (capable of 1.2 TFlop/s a few years ago) also consumed only 29% of the total die size (41% including TMUs and L1 caches).
Historically speaking, the flops/transistor over the evolution of an architecture have hardly ever gone up (there are very few exceptions, I know), as the "supportive circuitry" needed to really use the improved performance is nothing you should underestimate.
RV770's shader units are helpless without that other 71%. The IGP and System Agent on the other hand are not relevant when comparing the CPU microarchitecture against the GPU.
Intel got this thanks to an optimized layout (the caches got smaller, for instance) and mainly because they didn't increase the width of most datapaths. That won't be possible again; they exploited some inherent redundancy in their prior Nehalem design. It is also a reason why Sandy Bridge with AVX is much more bandwidth constrained than Nehalem with SSE: the cache bandwidth is identical despite the 2x higher peak throughput, which makes the flops harder to use.
Instead of having a separate 256-bit floating-point ALU and 128-bit integer ALU, they succeeded in combining them into a single ALU. Widening the integer portion can also benefit from such unification, and FMA won't double the area either. Even if the AVX2 ALUs and increased bandwidth take a massive 10% of additional core area, that's only 5% when including the L3 cache and memory controller.
That sentence contradicts itself. ;)
5% to double throughput is not free, but negligible. No contradiction there.
 
AVX units are neither humongous nor power hungry. Haswell should reach 500 GFLOPS for 1 billion transistors. That's roughly the same computing density as NVIDIA's latest GPUs, and the TDP wouldn't be too far off either (even at today's 32 nm process).
Can you clarify this?
Are you saying that if Haswell were 32nm, it could reach 500 GFLOPS within the same TDP limits?

And I will not reiterate my standing argument about focusing on transistor quantity.
 
Fair enough, but do 3 wavefronts / 12 cycles really suffice to cover the average latency?
It really depends on the workload. Actually, the allocation of wavefronts to a SIMD works such that the number of registers needed determines how many wavefronts run simultaneously. So it is basically up to the developer (and to some extent the compiler) to find the optimal compromise. But it's not the case that 3 wavefronts can only cover 12 cycles of latency. That may be true in the absolute worst case (where a CPU covers nothing), but what happens is that the wavefronts on a SIMD basically get out of sync, so the latency is often just a one-time penalty for one SIMD, not a penalty for each wavefront running on it (because it rarely happens that all of them wait for a memory access at the same time). It basically gets the same benefits as SMT, just with more threads.
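To put some (entirely assumed) numbers on that worst case versus the common case:

```c
#include <stdio.h>

int main(void)
{
    /* Assumed: a wavefront instruction occupies the SIMD for 4 cycles, a   */
    /* wavefront issues ~10 ALU instructions between memory requests, and   */
    /* a miss costs ~400 cycles.                                            */
    int issue_cycles = 4, alu_per_mem = 10, latency = 400;

    int work_per_wavefront = issue_cycles * alu_per_mem;   /* 40 cycles */
    printf("wavefronts needed if all miss at once: %d\n",
           (latency + work_per_wavefront - 1) / work_per_wavefront);  /* 10 */
    /* In practice the wavefronts drift out of sync, so the 3 resident ones */
    /* rarely all wait at the same time and far fewer are needed.           */
    return 0;
}
```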
Also note that CPUs with more logical registers never succeeded in fighting back x86, even back when it basically had just 6 GPRs. So I'm not convinced that 256 addressable registers shared between several wavefronts offer any sizable benefit over 16 registers per thread and a fast cache hierarchy.
I thought it was already discussed how power hungry that eventually is.
That's what people said last month about FMA and gather...
I remember the FMA specification first turning up years ago. AMD announced more than two years ago that Bulldozer would support FMA(4) (and before that, the SSE5 proposal from 2007 also included [differently encoded, 3-operand] FMA instructions). The confusion about FMA4 versus FMA3 is also already 2 years old. So someone who didn't know that Intel would implement FMA(3) must have lived under a stone in a cave somewhere in the middle of nowhere for the last few years. :LOL:
AVX2 illustrates Intel's strong commitment to exceeding Moore's law by offering ever higher performance/Watt. They've also been touting AVX's extensibility to 1024-bit ever since its earliest announcement in 2008. It's also perfectly feasible technically, and they even obtained access to related patents.

So it's really just a question of "when" rather than "if".
As mentioned, it is also a question of "where". ;)
Yes, but not those you care about for running complex workloads.
I would care about all accesses to the memory hierarchy. :rolleyes:
I beg to differ. It's a huge win if you can "stream" data that fits in the L3 cache. It's no coincidence that Itanium 2 processors (which are targeted at the HPC market) feature humongous caches. It matters to have lots of on-chip storage, because most workloads have a significant amount of reuse within working sets of such size.
For conventional GPUs it does not matter, as you can stream with higher bandwidth directly from memory, so a huge LLC is basically wasted. But look at Intel's iGPU in Sandy Bridge! It already shares the L3. Why do you think it will be any different in future versions of it (or AMD's)?
Wrong. Dynamic code generation is the key to efficiently executing state-dependent kernels (cf. über-shaders).
And how does the compilation itself profit from wide vector units in that process?
You could compare Yonah and Conroe to get an idea of what it took to go from dual 64-bit to dual 128-bit vector execution units and double the cache bandwidth, although the latter also added x64 and macro-op fusion...
Yes, in the same 65nm process the core area (i.e. just the core including L1) grew from 19.7mm² to 31.5mm². And the complete die was 57% larger. Doesn't look like a doubling of compute density to me. :rolleyes:
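Doing the division on those figures (area numbers from above, and assuming the 64-bit to 128-bit widening doubled per-core peak throughput):

```c
#include <stdio.h>

int main(void)
{
    double yonah_core  = 19.7;   /* mm^2, core incl. L1, 65 nm */
    double conroe_core = 31.5;   /* mm^2, core incl. L1, 65 nm */

    double area_growth = conroe_core / yonah_core;   /* ~1.6x */
    double flop_growth = 2.0;                        /* dual 64-bit -> dual 128-bit */

    printf("core area grew %.2fx, flops/area grew only %.2fx\n",
           area_growth, flop_growth / area_growth);  /* ~1.25x */
    return 0;
}
```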
 
RV770's shader units are helpless without that other 71%. The IGP and System Agent on the other hand are not relevant when comparing the CPU microarchitecture against the GPU.
That's why I provided the second number of 41%, which includes the texture units (besides the L1 they contain quite some graphics-oriented logic, but so what, it's not a good comparison anyway). And it's not as if the ROPs/caches, the GPU "frontend" with work distribution and rasterization among other things, the memory interface, the PCI-Express interface and such are free. A CPU also can't work without its system agent and a memory controller.
Widening the integer portion can also benefit from such unification, and FMA won't double the area either. Even if the AVX2 ALUs and increased bandwidth take a massive 10% of additional core area, that's only 5% when including the L3 cache and memory controller.
5% to double throughput is not free, but negligible. No contradiction there.
Let's look at the numbers when Haswell gets out! ;)
 
Are you saying that if Haswell were 32nm, it could reach 500 GFLOPS within the same TDP limits?
Not the same TDP, but close enough. Sandy Bridge can (and will) Turbo Boost the CPU cores as long as it's sufficiently cooled and stays within TDP (not using the IGP reportedly helps). In any case FMA would significantly increase performance/Watt and bring it into the same ballpark as GPUs on competing processes. GPUs are about to move to 28 nm and (Intel's) CPUs to 22 nm + FinFET, so I'm looking forward to seeing how that affects the comparison.

It's obvious though that AVX-1024 is required to deal the final blow.
 
Sweet, I was looking for a good shot of SB and its AVX block.

Some fun with mspaint shows the AVX block to be about 1/5 of the core area of SB.
Llano's poor CPU performance is strongly hinted at by its diminutive core area, 56% of SB's.

Similar mspaint fun with a die shot of Llano shows that a SIMD, TMU, and sequencer take up about 75% of the area of a Llano core.
This makes a SIMD+TMU+SEQ unit 2x the size of an AVX block, meaning roughly 2.5 of them would fit in an SB core's area.
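Chaining those mspaint ratios together (everything normalized to the SB core; only the percentages quoted above are used, no absolute die sizes):

```c
#include <stdio.h>

int main(void)
{
    double sb_core      = 1.00;                 /* normalized          */
    double avx_block    = sb_core / 5.0;        /* ~1/5 of the SB core */
    double llano_core   = 0.56 * sb_core;       /* 56% of the SB core  */
    double simd_tmu_seq = 0.75 * llano_core;    /* 75% of a Llano core */

    printf("SIMD+TMU+SEQ vs AVX block: %.1fx\n", simd_tmu_seq / avx_block);  /* ~2.1x */
    printf("SIMD+TMU+SEQ per SB core:  %.1f\n",  sb_core / simd_tmu_seq);    /* ~2.4  */
    return 0;
}
```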
 
Nick said:
5% to double throughput is not free, but negligible. No contradiction there.
So, if widening AVX to 1024 bits will bring such large performance improvements at such low cost, why stop at 1024? Why not 2048 bits? Or even more?
 
Not the same TDP, but close enough. Sandy Bridge can (and will) Turbo Boost the CPU cores as long as it's sufficiently cooled and stays within TDP (not using the IGP reportedly helps).
Gating the IGP does free up TDP room. Four SB cores are already capable of hitting 95 W with their current AVX units, though Intel's Turbo can exceed TDP for a partially adjustable time period, something like 30 seconds.
Sustained GFLOP rates may be constrained by TDP.
 
That's why I provided the second number of 41%, which includes the texture units (besides the L1 they contain quite some graphics-oriented logic, but so what, it's not a good comparison anyway). And it's not as if the ROPs/caches, the GPU "frontend" with work distribution and rasterization among other things, the memory interface, the PCI-Express interface and such are free. A CPU also can't work without its system agent and a memory controller.
You don't want to go that route. The entire GPU is completely worthless without a CPU...

So let's just look at the commercial relevance. You need to pay for that System Agent anyway, so there's no point in considering it to be part of the CPU microarchitecture. Likewise Sandy Bridge's IGP should be taken out of the equation.

GPUs can either focus on graphics and serve as a generic computing device on the side, which means you do have to take the samplers and ROPs and such into account, or they can opt to become more computing oriented and perform graphics in software.

Anyhow, Haswell will have highly competitive computing capabilities.
Let's look at the numbers when Haswell gets out! ;)
Absolutely. And let's look at the numbers when GCN gets out. Those new features don't come for free either...
 
Because I haven't heard of any technology which would increase the GPU's computing density per transistor again. It has gone down both for Fermi and Cayman. There's no room to increase the number of shading cores without sacrificing vital throughput elsewhere, and either the amount of storage has to go up or the scheduling has to become more dynamic to cope with increasing workload complexity. In other words, they'll only converge closer to the CPU. The CPU on the other hand will add FMA to double throughput at only a relatively minor transistor cost.

Yes, everyone knows they will converge, but the argument is whether the resulting architecture will look more like a GPU or a CPU. I don't understand why you're so convinced that today's CPUs are more representative of future many-core architectures than today's GPUs.

There is additional compute density to be had on GPUs as well. NVIDIA, at least, is predicting up to 3 GHz shader clocks on GPU parts in the next few years.

GPUs can either focus on graphics and serve as a generic computing device on the side, which means you do have to take the samplers and ROPs and such into account, or they can opt to become more computing oriented and perform graphics in software.

What are we expecting from Haswell in terms of fixed function hardware?
 
Intel said that FD-SOI (and planar transistors) can bring comparable advantages to FinFET. Intel dismissed it because they think FinFET has a cost advantage. What makes you sure we are not going to see 20nm FD-SOI processes from the members of the SOI consortium?

FD-SOI has significant cost and control issues in getting from PD to FD. PD already carries significant cost over bulk, and FD will increase that further. I doubt we'll see a general-purpose, high-volume FD process out of that camp except for boutique uses from IBM.
 