22 nm Larrabee

Two 256-bit FMA pipes per core could be an overkill...
Why would a Hyper-Threaded core with two FMA units be overkill, while a Bulldozer module with two FMA units is not?
...but we still don't know anything about Haswell's load/store pipeline capabilities. Intel could probably settle for an asymmetric ALU design with an FMA + MUL "co-issue" organization, or FMA + ADD.
There's no need for Intel to settle for anything less. 22 nm Tri-Gate gives them plenty of power efficient transistors. What else would they use them for anyway? More cores would mean more threads and thus worse scaling, as well as more of the other power-hungry components. Compared to the alternatives, 2 x 256-bit FMA offers the best performance / Watt even if they have to widen a few things.

An FMA + ADD configuration doesn't make sense since multiplications are more prevalent. And if you got FMA + MUL you may as well make it a second FMA since it's not much bigger and you already got three source operands for other operations.
 
How is that dodging the question? The GF116-400 has 192 FMA ALUs clocked at 1800 MHz, on 40 nm. So why would it be such a big deal to have a quad-core Haswell with a mere 64 FMA ALUs clocked at 4 GHz, on 22 nm Tri-Gate? And since Sandy Bridge has two 256-bit ports, we're over halfway there already!
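For concreteness, here is the back-of-the-envelope peak-throughput comparison being made, counting each FMA as two FLOPs. This is only a sketch: the 64-lane Haswell figure assumes 4 cores x 2 ports x 8 single-precision lanes, which is the hypothetical configuration under discussion, not a confirmed spec.

```python
# Peak single-precision GFLOPS, counting an FMA as 2 FLOPs.
def peak_gflops(fma_lanes, clock_ghz):
    return fma_lanes * 2 * clock_ghz

# GF116-400: 192 FMA ALUs at 1.8 GHz (40 nm)
gf116 = peak_gflops(192, 1.8)

# Hypothetical quad-core Haswell: 4 cores x 2 ports x 8 SP lanes at 4 GHz
haswell = peak_gflops(4 * 2 * 8, 4.0)

print(gf116, haswell)  # roughly 691.2 vs 512.0 GFLOPS
```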

It's not about the transistor budget. It makes little sense to add ALUs that you cannot feed, and on typical workloads 256-bit AVX is already *heavily* bandwidth limited. Doubling the peak ALU throughput would have costs, and would have almost no effect on real performance unless they very seriously boost the memory subsystem too.
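To put rough numbers on the feeding problem: in the worst case, where all three source operands of each FMA are streamed from memory, two 256-bit FMA ports would demand far more L1 bandwidth per cycle than Sandy Bridge's data cache ports supply. A sketch, assuming SNB's 2 x 128-bit load + 1 x 128-bit store configuration; in practice most operands come from registers, so this is an upper bound:

```python
# Worst-case L1 demand of two 256-bit FMA ports vs. Sandy Bridge's supply.
VEC_BYTES = 32                                   # one 256-bit vector

fma_ports = 2
worst_case_demand = fma_ports * 3 * VEC_BYTES    # 3 memory operands each: 192 B/cycle

snb_l1_load  = 2 * 16                            # two 128-bit load ports: 32 B/cycle
snb_l1_store = 1 * 16                            # one 128-bit store port: 16 B/cycle

print(worst_case_demand, snb_l1_load + snb_l1_store)  # 192 vs 48
```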

They might do it, but frankly, I can still think of a lot of better ways to use that transistor budget.
 
Intel also tends to make $$$, not lose it.
The argument was that AMD is no threat to Intel. But just because Intel makes more money doesn't mean they're immune and can make do with one FMA unit. Sure, it means they can survive a mistake like that, but it doesn't mean they should be making it. The expectations of investors and clients in various markets are very high. So it would be plain stupid to cripple a winning architecture.

Again, Bulldozer already has two 128-bit FMA units per module. All it takes is to widen them to 256-bit to trump Intel. And widening is cheap; they can easily do it at the shrink to 22 nm.

Haswell will already be at 22 nm, plus Tri-Gate! So if Intel wants to keep making $$$, they should not take any chances and go for 2 x 256-bit FMA. Heck, they've got nothing to lose. Even if power efficiency for legacy workloads is a bit lower than it could have been, it would still be better than anything AMD can deliver. So either way it's a winning scenario, while going with one FMA unit is a huge risk.
 
And since Sandy Bridge has two 256-bit ports, we're over halfway there already!
SNB has two 128-bit load and one 128-bit store ports to the data cache. Unless something major is changed in this regard for Haswell, extending the two AVX ALUs to full FMA capability will be futile.
 
That does not in any way address the cost of adding the 2nd FP unit and the business advantage/engineering constraints trade-offs involved in a CPU whose primary design goal is low power.
Its design goal is not low power. Its design goal is high performance / Watt. Want low absolute power? Use fewer cores and/or lower frequency. It's that simple. Everyone knows that wide vector operations are the key to better performance / Watt.

With 2 x 256-bit FMA, Haswell can not be beaten on any front by AMD, so why risk doing something else?
A CPU can address DLP, at best, as an afterthought when it is designed with serial integer IPC/W above all else in the first place. It needs to delegate that bit to ... ahem ... specialists.
GPUs are using ever more complicated scheduling to keep their ALUs busy, keep register file sizes reasonable, and improve cache hit rates. CPUs are still inherently better at things like ray tracing, but they lack the computing power. So ILP is not a bad thing, and DLP is perfectly orthogonal to it.

Last but not least, AVX-1024 would bring us the latency hiding and power efficiency properties of these "specialists" you talk of, but combines them with high ILP into a superior homogeneous architecture.
 
Last but not least, AVX-1024 would bring us the latency hiding and power efficiency properties of these "specialists" you talk of, but combines them with high ILP into a superior homogeneous architecture.
Aren't CPUs already good enough at reducing latency with their tons of on-die cache and elaborate data pre-fetching algorithms, not to mention the out-of-order memory read/write op's? But yes, power efficiency is still a valid reason for such wider vector ISA.
 
Its design goal is not low power. Its design goal is high performance / Watt. Want low absolute power? Use fewer cores and/or lower frequency. It's that simple. Everyone knows that wide vector operations are the key to better performance / Watt.

With 2 x 256-bit FMA, Haswell can not be beaten on any front by AMD, so why risk doing something else?
Sacrificing the power efficiency of existing workloads to improve power efficiency of 256-bit FMA workloads that won't be widespread for at least a few years doesn't seem like a good trade-off to me, given that Intel refreshes their architecture every 2 years. And it's incredibly unlikely that AMD beats them at FP performance in the Haswell timeframe.

Anyway my opinion is that the optimal design choice is probably FMA+ADD. This is not just for power/cost reasons but also performance: an FMA unit will have very similar latency for MUL but significantly higher latency for ADD than a standalone unit. AMD had to compromise ADD latency on Bulldozer, which doesn't seem ideal to me at this point in time.
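The ADD-latency concern matters most for serially dependent chains, e.g. a naive reduction where each addition waits for the previous result. A toy model with illustrative latencies (not vendor-published figures):

```python
# Toy model: cycles for a serial reduction where every ADD depends on the
# previous one, so latency (not throughput) dominates.
def reduction_cycles(n_elements, add_latency):
    return n_elements * add_latency

n = 1000
print(reduction_cycles(n, 3))  # standalone 3-cycle adder: 3000 cycles
print(reduction_cycles(n, 5))  # ADD routed through a 5-cycle FMA pipe: 5000 cycles
```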
 
Nick said:
The argument was that AMD is no threat to Intel. But just because Intel makes more money doesn't mean they're immune and can make due with one FMA unit. Sure, it means they can survive a mistake like that, but it doesn't mean they should be doing it. The expectations of investors and clients in various markets are very high. So it would be plain stupid to cripple a winning architecture.
The issue wasn't how much money Intel makes, rather the opposite....
 
Its design goal is not low power. Its design goal is high performance / Watt. Want low absolute power? Use fewer cores and/or lower frequency. It's that simple. Everyone knows that wide vector operations are the key to better performance / Watt.
The primary design target had its power budget cut in half. One of their biggest customers threatened to cut them out, and their executives acknowledged it as a real wake-up call.

GPUs are using ever more complicated scheduling to keep their ALUs busy, register file sizes reasonable and improve cache hit rates
That's a lot of smoke and mirrors in one sentence.

What changes have happened on the scheduling side in the last ~5 years?

The register file size has been the same from (R600?) R770 through GCN, over 4 years, and has had just one bump over 5 years in Nvidia's case.

GPUs have had caches for exactly one generation in Nvidia's case (zero in AMD's). What "ever more complicated" changes in scheduling, pray tell, have happened to "increase cache hit rates"?
 
Its design goal is not low power. Its design goal is high performance / Watt. Want low absolute power? Use fewer cores and/or lower frequency. It's that simple. Everyone knows that wide vector operations are the key to better FP performance / Watt.

With 2 x 256-bit FMA, Haswell can not be beaten on any FP front by AMD, so why risk doing something else?

GPUs are using ever more complicated scheduling to keep their ALUs busy, keep register file sizes reasonable, and improve cache hit rates. CPUs are still inherently better at things like ray tracing, but they lack the FP computing power. So ILP is not a bad thing, and DLP is perfectly orthogonal to it.

Last but not least, AVX-1024 would bring us the latency hiding and power efficiency properties of these "specialists" you talk of, but combines them with high ILP into a superior homogeneous architecture.
IMHO nobody cares about FP performance. It's not decisive for almost any benchmark (who really cares about Cinebench?), much less for any real non-niche application. Yet you argue as if FP is what is making billions for Intel. Bulldozer could have 4x the FP performance of SB and lower IPC than Phenom II, and nobody would buy it except for niche vertical applications.
 
How is that dodging the question?

When you argued about...
Having one execution port that can take a 256-bit MAD operation is nowhere near the same as having two ports for a 256-bit ADD and a 256-bit MUL. The latter can be turned into two 256-bit MAD ports by adding a third source operand, while the former would require an additional execution port, scheduler, bypass network, higher decoding rate, etc. Sandy Bridge already has the two 256-bit execution ports, so it's not a huge step to make it capable of two 256-bit MAD operations.

I mentioned the cost of the associated structures that have to be beefed up as well, which will take a lot of area and power when you have already lost half of your power budget.
What about the actual area/power of the multipliers and adders? What about the 2x throughput of L1 and L2? The effort involved is more than the encoding space for the third operand.

At this point you launched an entirely pointless exercise of reaching <some arbitrary bogoflops number> by using <arbitrary vector size, clocks, issue width and cores>.

The core argument is that there is no room for luxuries like extra FP units when power budgets are shrinking, and there are no applications out there that could use them in that timeframe while the cost involved is more than the encoding scheme to do all that.

Counterarguing by showing that you know how to multiply is dodging the question at best. At worst, it's trolling.

Everything! All applications using floating-point calculations would suffer.
SB effectively has just one 256-bit MAD unit's worth of throughput (it can do one ADD and one MUL per clock). I don't see any FP app suffering. Quite the contrary.
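The "effectively one 256-bit MAD" point can be checked with simple per-cycle FLOP counting (a sketch, assuming both ports issue every cycle):

```python
# Per-cycle single-precision FLOPs: separate 256-bit ADD and MUL ports
# match one 256-bit FMA port when both issue every cycle.
SP_LANES = 8                             # 256 bits / 32-bit floats

snb_add_plus_mul = SP_LANES + SP_LANES   # 8 ADDs + 8 MULs = 16 FLOPs/cycle
one_fma_port     = SP_LANES * 2          # 8 FMAs          = 16 FLOPs/cycle

print(snb_add_plus_mul, one_fma_port)  # 16 16
```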
 
It's not about the transistor budget. It makes little sense to add ALUs that you cannot feed, and on typical workloads 256-bit AVX is already *heavily* bandwidth limited.
Indeed, with its total of three 256-bit execution units, Sandy Bridge is bandwidth limited. But do you honestly think Intel isn't aware of that? When AVX was first presented they mentioned scalability as a main feature. So they've clearly got long-term plans for it, and increasing bandwidth must be part of the roadmap. Haswell features both gather and FMA, and they would be rather pointless without extra bandwidth.
They might do it, but frankly, I can still think of a lot of better ways to use that transistor budget.
Would you mind summing them up?
 
What changes have happened on the scheduling side in the last ~5 years?

The register file size has been the same since (R600?) R770->GCN over 4 years and just one bump over 5 in case of nvidia.

GPU's have had caches for exactly one (zero) generations in case of nv (amd). What "ever more complicated" changes in scheduling, pray tell, have happened to "increase cache hit rates".
I don't know if Nick was referring to anything specific, but GPU scheduling could change every generation and you wouldn't know it. Big changes are coming for AMD and I assume things changed for Nvidia with Fermi.

Also, every GPU has a cache, and as they've gotten larger in size it's possible hit rates have increased.
 
SNB has two 128-bit load and one 128-bit store ports to the data cache. Unless something major is changed in this regard for Haswell, extending the two AVX ALUs to full FMA capability will be futile.
As I've detailed before there are obviously already some significant changes required to support gather, and gather itself increases bandwidth needs even more. So if they're redesigning it anyway they should also tackle the glaring bandwidth problem simultaneously.

Also, gather most likely adds some latency, which is unacceptable for regular loads. So the most logical configuration would be one ordinary 256-bit load port, and one 256-bit gather port with slightly higher latency, which can also service regular loads.
 
I don't know if Nick was referring to anything specific, but GPU scheduling could change every generation and you wouldn't know it. Big changes are coming for AMD and I assume things changed for Nvidia with Fermi.
It very well could. But why claim it has without evidence?

Also, every GPU has a cache, and as they've gotten larger in size it's possible hit rates have increased.
I think he was referring to general purpose caches.
 
Aren't CPUs already good enough at reducing latency with their tons of on-die cache and elaborate data pre-fetching algorithms, not to mention the out-of-order memory read/write op's? But yes, power efficiency is still a valid reason for such wider vector ISA.
High throughput computing often deals with working set sizes where no realistic amount of on-die cache helps. And prefetching is something you actually want to avoid. It uses up collateral bandwidth and increases power consumption.

So AVX-1024's ability to hide latency would allow comparatively smaller caches and less aggressive prefetching.
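The latency-hiding argument can be sketched numerically: if a 1024-bit instruction is executed over four cycles on 256-bit datapaths, each instruction occupies its port four times longer, so far fewer independent instructions need to be in flight to cover a given miss latency. The 200-cycle figure below is an assumption for illustration only:

```python
# Independent ops needed to keep a pipe busy across a fixed latency.
def instructions_to_hide(latency_cycles, cycles_per_instruction):
    return -(-latency_cycles // cycles_per_instruction)  # ceiling division

latency = 200                              # assumed cache-miss latency (cycles)
print(instructions_to_hide(latency, 1))    # 256-bit ops, 1 cycle each: 200
print(instructions_to_hide(latency, 4))    # 1024-bit ops over 4 cycles: 50
```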
 
Sacrificing the power efficiency of existing workloads to improve power efficiency of 256-bit FMA workoads that won't be widespread for at least a few years doesn't seem like a good trade-off to me given that Intel refreshes their architecture every 2 years.
There's no way around that, ever. Innovation takes time. You have to invest in capabilities for future applications years before they become mainstream.

Anyhow, having 2 x 256-bit FMA would not sacrifice power efficiency of existing workloads. It can be heavily clock gated or even power gated. But if you know of anything that would provide higher future gains at a lower "sacrifice", feel free to share.
And it's incredibly unlikely that AMD beats them at FP performance in the Haswell timeframe.
Because... ?
Anyway my opinion is that the optimal design choice is probably FMA+ADD. This is not just for power/cost reasons but also performance: an FMA unit will have very similar latency for MUL but significantly higher latency for ADD than a standalone unit. AMD had to compromise ADD latency on Bulldozer, which doesn't seem ideal to me at this point in time.
Floating-Point Fused Multiply-Add: Reduced Latency for Floating-Point Addition.
Latency Sensitive FMA Design.

Note though that AVX-1024 would offer the ultimate solution by allowing compact higher latency power-efficient execution units.
 
The issue wasn't how much money Intel makes, rather the opposite....
Please elaborate.

Correct me if I'm wrong, but AMD seems to already have done the hardest part by equipping Bulldozer's FlexFP unit with two 128-bit units. All that remains to be done to pose a serious threat to Intel is widen things, and this doesn't appear to require a massive budget. They already have an advantage in core count for scalar workloads, and wider SIMD units would give them the throughput advantage.

But that's not even the scenario that was being discussed. The argument was that Intel could make do with just a single 256-bit FMA unit. But no matter how little profit AMD makes (if any at all), that seems like an incredibly risky thing to do. Two 128-bit FMA units is far better than one 256-bit FMA unit for legacy workloads, let alone the risk of having to compete against two 256-bit units.
 
It very well could. But why claim it has without evidence?

I think he was referring to general purpose caches.
Well, doesn't Fermi claim it has? Just ignore RV770 and NI; they weren't really designed with an uncore system for general computing.

SNB has two 128-bit load and one 128-bit store ports to the data cache. Unless something major is changed in this regard for Haswell, extending the two AVX ALUs to full FMA capability will be futile.
If they intend to extend the execution units, they'll extend the memory subsystem too, won't they? But it does seem to cost a lot.

Two 128-bit FMA units is far better than one 256-bit FMA unit for legacy workloads, let alone the risk of having to compete against two 256-bit units.
According to Intel's IDF slides, SNB's AVX implementation can also be seen as a combination of two SSE units.
 
Nick said:
Please elaborate.
Really... I shouldn't need to, but you can start by looking at the gross margin of the two companies, followed by the net profit margin and the debt/equity ratio. Intel can lower their prices considerably and still make a profit; AMD cannot. There is a reason AMD just fired 10% of their workforce.
 