22 nm Larrabee

What about 1 256-bit ADD and 1 256-bit FMA/MUL only?
Yes that would avoid crippling legacy code, but that's all. It doesn't make sense to add FMA support without increasing performance / Watt. Dual FMA units would improve performance / Watt for both legacy workloads and AVX2 workloads. Furthermore, we know Sandy Bridge is slightly bandwidth limited, but if Haswell doubles the bandwidth that would be overkill unless the arithmetic throughput is increased as well. Integer throughput is increased by doubling the width of integer SIMD instructions in AVX2, so floating-point intensive workloads need dual FMA to balance things out. Gather support also removes bottlenecks, which in turn increases the demand for processing power.

Given that GF104 has triple FMA ports I don't see why we should assume Haswell to have any fewer than two. Also keep in mind that implementing transcendental functions can make good use of FMA and gather (for performing table lookups and interpolation).
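
For illustration, here is a minimal sketch of the kind of table-based approximation I mean, using AVX2/FMA intrinsics. The table layout, names, and index math are my own assumptions, not anything Intel has described:

```c
#include <immintrin.h>

/* Sketch of a table-based transcendental approximation: AVX2 gather fetches
 * eight table entries, then a single FMA does the linear interpolation.
 * Assumes x*scale stays within the table bounds. Compile with -mavx2 -mfma. */
static __m256 table_approx(__m256 x, const float *slope, const float *base, float scale)
{
    __m256  xs  = _mm256_mul_ps(x, _mm256_set1_ps(scale));
    __m256  xi  = _mm256_floor_ps(xs);
    __m256i idx = _mm256_cvtps_epi32(xi);
    __m256  f   = _mm256_sub_ps(xs, xi);            /* fraction within the segment */

    __m256 a = _mm256_i32gather_ps(slope, idx, 4);  /* eight independent lookups   */
    __m256 b = _mm256_i32gather_ps(base,  idx, 4);

    return _mm256_fmadd_ps(a, f, b);                /* a*f + b in one instruction  */
}
```
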
Bulldozer does something similar on the Vector Integer side.
With all due respect Bulldozer is hardly an example of good design choices...
 
Yes that would avoid crippling legacy code, but that's all. It doesn't make sense to add FMA support without increasing performance / Watt. Dual FMA units would improve performance / Watt for both legacy workloads and AVX2 workloads.
The perf/watt for FMA is highly dependent on the mix when it comes to legacy code. Some loads show a power benefit from FMA, but other legacy apps show a degradation because the peak performance of FMA is lost without recompilation, but the units themselves draw more power and have at least a little longer latency.
Other improvements to the design would need to compensate for this.

Given that GF104 has triple FMA ports I don't see why we should assume Haswell to have any fewer than two.
The design considerations for a relatively low-clocked last-gen (relative to Haswell) in-order ASIC and a high speed OoO CPU are so different that I think it would be safer to just refer back to what Intel has done before.

With all due respect Bulldozer is hardly an example of good design choices...
It's an example of making an OoO core into a throughput machine.
 
The perf/watt for FMA is highly dependent on the mix when it comes to legacy code. Some loads show a power benefit from FMA, but other legacy apps show a degradation because the peak performance of FMA is lost without recompilation, but the units themselves draw more power and have at least a little longer latency.
Other improvements to the design would need to compensate for this.
I agree that for legacy code there would still be a compromise, but let's face it, we're not going to get CPUs with better performance / Watt without a few changes to the software. In this light I believe dual FMA offers the maximum long-term benefit at minimal short-term impact. And I'm not sure other design improvements are needed to compensate for it, when you have 22 nm + Tri-Gate.

ADD + FMA would increase power consumption for legacy workloads and not offer significant gains for AVX2 workloads. So while it's the most reasonable intermediate step towards dual FMA, I see no reason to hold back when AVX2 already demands more profound changes. As I mentioned before, the ability to provide three 256-bit source operands is already in place for two execution ports on Sandy Bridge. So we're just talking about a few more multipliers. Looking at how many are crammed onto GPUs with inferior process technology, I see no reason why it would be a big deal for Haswell.

Since doing more work with fewer instructions is the key to power efficiency, the switch to all FMA can't happen soon enough. Things like having two execution ports capable of blend instructions in Sandy Bridge, and shift instructions with independent counts for AVX2, also illustrate how focused Intel is on achieving higher performance / Watt out of SIMD.
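
As a trivial illustration of what "independent counts" buys you, a sketch of the AVX2 per-element shift (the function name is mine; the intrinsic maps to VPSLLVD):

```c
#include <immintrin.h>

/* One instruction shifts each 32-bit lane by its own count, where pre-AVX2
 * code needed a scalar loop or a shared count for all lanes. */
static __m256i shift_each(__m256i values, __m256i counts)
{
    return _mm256_sllv_epi32(values, counts);   /* values[i] << counts[i], per lane */
}
```
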
The design considerations for a relatively low-clocked last-gen (relative to Haswell) in-order ASIC and a high speed OoO CPU are so different that I think it would be safer to just refer back to what Intel has done before.
That sort of reasoning could hardly have predicted full 256-bit, FMA, and gather support for AVX2. So instead we have to look at both the evolution of GPU and CPU architectures to see where things are heading. Despite the fact that there's still a significant gap for things like clock speed and instruction scheduling, they have been converging for many years now. And if you throw architectures like Larrabee into the mix it becomes clear that it's all one connected design space. Looking at what Larrabee must have taught Intel about what works and what doesn't, I really don't think we can rule out dual FMA for Haswell.
It's an example of making an OoO core into a throughput machine.
(image)
 
And to further emphasize how focused Intel seems on SIMD, increasing the likelihood of 2 x 256-bit FMA per core:

Intel: "Floating Point Multiply Accumulate – Our floating-point multiply accumulate significantly increases peak flops and provides improved precision to further improve transcendental mathematics. They are broadly usable in high performance computing, professional quality imaging, and face detection"
 
That sort of reasoning could hardly have predicted full 256-bit, FMA, and gather support for AVX2.
Intel laid out the progression to 256 bit for all vector types well in advance, as well as FMA.
(edit: By the way, Haswell is not the first Intel chip with FMA, and it may be argued this x86 variant is inferior)
I did not expect scatter/gather support to be considered until the next core after Haswell, but it seems Intel split the difference by only doing half of that for AVX2.


And if you throw architectures like Larrabee into the mix it becomes clear that it's all one connected design space. Looking at what Larrabee must have taught Intel about what works and what doesn't, I really don't think we can rule out dual FMA for Haswell.
I already noted that dual FMA seems possible, since Haswell's desktop chips max out at 4 cores, which hints at very significant increases in per-core hardware in order to fill up the space and TDP that would otherwise be available for more cores.

It was a bit of a joke.
 
Actually, the point some people argue about is not performance per watt but INT performance per watt. The cluster design of Bulldozer creates a good environment for server software that is hungry for simple INT throughput as well as IOPS. Although Bulldozer doesn't perform well, perhaps because of the bad uncore design and the high latency caused by the complex scheduling, it is still essential for Intel to pay attention to it, since it does provide more INT units per square millimeter. I once wondered whether Intel could offer different designs for high-end desktop/workstation/HPC and low-end desktop/servers. Now that Haswell is a uniform, modularized design, Intel has to think about how to balance both.
 
Actually, the point some people argue about is not performance per watt but INT performance per watt. The cluster design of Bulldozer creates a good environment for server software that is hungry for simple INT throughput as well as IOPS. Although Bulldozer doesn't perform well, perhaps because of the bad uncore design and the high latency caused by the complex scheduling, it is still essential for Intel to pay attention to it, since it does provide more INT units per square millimeter. I once wondered whether Intel could offer different designs for high-end desktop/workstation/HPC and low-end desktop/servers. Now that Haswell is a uniform, modularized design, Intel has to think about how to balance both.

I'm really not following. Beyond the mystical SW that BD was supposedly designed for (it seems to have little to no trouble failing everywhere, but let's ignore that), is there any case where it has created a good environment for anything? It can't even beat Mangy-Cours all that decisively, and that wasn't quite a speed demon itself. And what does it mean to provide more INT units per square millimeter? It's debatable to what extent it actually does that versus something like SB, or even K8L. It does almost double up Bobcat, no doubt about that, so that part had better look out....
 
BD's integer performance per mm2 is inferior in most cases to Intel chips in the same segment. Let's note that it takes two 315 mm2 chips in Interlagos to come somewhat short of an Intel EX server processor.

In some of the areas where it is competitive, there seem to have been changes or will be changes in licensing terms that may shift the debate, but I have limited experience with virtualization and large database licensing. Microsoft is changing some of its licensing to per-core for certain apps.

On licensing alone for those apps, BD's central design decision has been an egregious misstep.
 
If so, it's good. Most common applications can't use both a 256-bit MUL and a 256-bit ADD simultaneously
That's nonsense. If you do math (otherwise there isn't much advantage in using SIMD) you very very often have independent multiplies and adds in the code.
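
A trivial sketch of what I mean (plain C, no FMA assumed, names illustrative): even the most ordinary loop exposes a MUL and an ADD that are independent across iterations, so an OoO core can keep both 256-bit ports busy at once.

```c
/* Compiled without FMA, the MUL of iteration i+1 is independent of the ADD of
 * iteration i, so a 256-bit MUL port and a 256-bit ADD port can both issue in
 * the same cycle once the loop is vectorized and unrolled. */
void mul_add(float *d, const float *a, const float *b, const float *c, int n)
{
    for (int i = 0; i < n; ++i)
        d[i] = a[i] * b[i] + c[i];
}
```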
 
By the way, Haswell is not the first Intel chip with FMA, and it may be argued this x86 variant is inferior
In what sense is x86's FMA inferior to Itanium's? The IEEE 754-2008 standard puts bounds on what is required, and AVX2 complies with it.
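
What the standard pins down is the single rounding. A minimal C sketch, with values chosen purely to make the difference visible (compile with FP contraction disabled, e.g. -ffp-contract=off, so the compiler doesn't fuse the first expression itself; link with -lm):

```c
#include <math.h>
#include <stdio.h>

/* IEEE 754-2008 requires fma(a, b, c) to compute a*b + c with one rounding. */
int main(void)
{
    double a = 1.0 + 0x1p-27, b = 1.0 - 0x1p-27, c = -1.0;
    printf("mul then add: %a\n", a * b + c);      /* product rounds to 1.0, result 0 */
    printf("fused:        %a\n", fma(a, b, c));   /* single rounding: exact -0x1p-54 */
    return 0;
}
```
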
I did not expect scatter/gather support being considered until the next core after Haswell, but it seems Intel split the difference by only doing half of that for AVX2.
Scatter is far less important than gather. I'm not even expecting it for Skylake. Note that it would require blocking the gather port if more than one cache line is addressed, to guarantee consistency. This means there's very little we can gain from a scatter instruction versus individual extract instructions. Also more often than not a scatter can be turned into a gather with relatively minor algorithmic changes.
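
A trivial sketch of the scatter-to-gather transformation (the names are mine, and it assumes the index array is a permutation whose inverse is available or cheap to precompute):

```c
/* The scatter form can't be vectorized directly without a scatter instruction;
 * with the inverse permutation, the same work becomes a gather on the load side. */
void scatter_form(float *out, const float *in, const int *idx, int n)
{
    for (int i = 0; i < n; ++i)
        out[idx[i]] = in[i];        /* store side is irregular */
}

void gather_form(float *out, const float *in, const int *inv_idx, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = in[inv_idx[i]];    /* load side is irregular: maps to AVX2 gather */
}
```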
 
Bulldozer could be saved by doing what it was claimed to do: "reverse" Hyper-Threading.

What I mean is something like one ALU dedicated to each thread (instead of two), and two ALUs shared between them (instead of none). The total number of ALUs doesn't go up and the schedulers get slightly more complex, but what you get in return is fewer bottlenecks. It is also potentially less bottlenecked than Intel's three ALUs shared between two threads, since there are four ALUs in total. It could be considered reverse Hyper-Threading because if you think of each core having two ALUs then one more ALU can be borrowed from the other core.

And obviously AMD has to implement AVX2 with 2 x 256-bit paths sooner rather than later as well. That plus reverse Hyper-Threading would make Bulldozer 2 capable of competing with Haswell.
 
In what sense is x86's FMA inferior to Itanium's? The IEEE 754-2008 standard puts bounds on what is required, and AVX2 complies with it.
Intel's x86 FMA is FMA3, which makes it arguably inferior to 4 operand FMAs used elsewhere.
Also, the FP side of current Itaniums is able to fire off 2 FP loads, 2 FMA, and 2 FP stores per cycle.
This will not be the case in the upcoming ones, however.

At least with scalar FMA, Itanium's FP unit can sustain higher operand throughput to memory and also in terms of register operands.
 
What I mean is something like one ALU dedicated to each thread (instead of two), and two ALUs shared between them (instead of none). The total number of ALUs doesn't go up and the schedulers get slightly more complex, but what you get in return is fewer bottlenecks. It is also potentially less bottlenecked than Intel's three ALUs shared between two threads, since there are four ALUs in total. It could be considered reverse Hyper-Threading because if you think of each core having two ALUs then one more ALU can be borrowed from the other core.

Sharing the ALUs like that makes no sense at all. The issue is the length of data paths, and generally shuffling the data around. BD has two separate ALU clusters that are connected to two separate register files. Adding shared ALUs that are capable of sending the results to both register files and forwarding to all other ALU clusters would be a routing nightmare. Also, sharing integer ALUs in general makes no sense -- they are small enough that if you want more of them, you can just add more. The reason AMD went with only 2 int ALUs is probably to keep routing paths short to run at higher clocks. It's certainly not to save space.

And obviously AMD has to implement AVX2 with 2 x 256-bit paths sooner rather than later as well. That plus reverse Hyper-Threading would make Bulldozer 2 capable of competing with Haswell.

Why? What would they gain? In real FPU-intensive XOP code BD is already *totally* bandwidth starved. Doubling the width of the units would not add *any* performance.

What BD desperately needs is better caches. Everything else is a rounding error.
 
BD's integer performance per mm2 is inferior in most cases to Intel chips in the same segment. Let's note that it takes two 315 mm2 chips in Interlagos to come somewhat short of an Intel EX server processor.

In some of the areas where it is competitive, there seem to have been changes or will be changes in licensing terms that may shift the debate, but I have limited experience with virtualization and large database licensing. Microsoft is changing some of its licensing to per-core for certain apps.

On licensing alone for those apps, BD's central design decision has been an egregious misstep.
It sounds like you misunderstood what I mean... BD delivers poor performance, but the cluster design itself is still useful; imagine a CPU with this cluster design as an Intel product. The thread in the PC forum of B3D shows that Bulldozer has an extremely poor cache system, and some other tests show that the bandwidth of BD's L3 cache is no better than the bandwidth between the MC and memory. That's partly what we should blame, not really the clustering.

You can see there is less performance difference in the server products.
 
Intel's x86 FMA is FMA3, which makes it arguably inferior to 4 operand FMAs used elsewhere.
Ah, yes, I assumed you were referring to something more fundamental instead.

Since every other operation is non-destructive, and there are "132", "213" and "231" variants of each FMA3 instruction, I wonder if it's ever really a limitation. And obviously it avoids having to make the uop format longer for just this one instruction. So overall it's probably a good thing.
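
As a rough sketch of why it rarely bites, the usual accumulator pattern maps straight onto the "231" ordering, which only overwrites the accumulator, so no extra register moves are needed (assuming a compiler targeting FMA3 and emitting the obvious code; the loop assumes n is a multiple of 8):

```c
#include <immintrin.h>

/* Accumulator loop: acc = va*vb + acc overwrites only the dead value (the old
 * accumulator). Compile with -mfma; the caller sums the eight lanes at the end. */
static __m256 dot8(const float *a, const float *b, int n)
{
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(va, vb, acc);   /* typically emitted as vfmadd231ps */
    }
    return acc;
}
```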
 
Sharing the ALUs like that makes no sense at all. The issue is the length of data paths, and generally shuffling the data around. BD has two separate ALU clusters that are connected to two separate register files. Adding shared ALUs that are capable of sending the results to both register files and forwarding to all other ALU clusters would be a routing nightmare.
Well obviously the register files should be unified or at least close together for reverse Hyper-Threading to work. Note that the FlexFP unit is fully shared and Intel's Hyper-Threading shares everything so it's clearly feasible.
Also, sharing integer ALUs in general makes no sense -- they are small enough that if you want more of them, you can just add more. The reason AMD went with only 2 int ALUs is probably to keep routing paths short to run at higher clocks. It's certainly not to save space.
That seems like a contradiction to me. If they're small then why would the routing paths suddenly be too long when you have 3 instead of 2 of them? Especially considering that AMD has had 3 ALUs for over a decade and process technology is shrinking faster than the clock rates go up, that doesn't seem very problematic to me. Also, fetching operands from the forwarding network should be more power efficient than accessing the register file, no?

It seems to me that AMD has underestimated the importance of IPC, and tried to compensate for it with higher clock frequencies and more cores. But the cure seems worse than the disease. People don't accept worse single-threaded performance, and SIMD is a cheaper way to achieve more parallelism. Of course, just like with Phenom, the new process will mature, but I doubt that will ever fully make up for it.
Why? What would they gain? In real FPU-intensive XOP code BD is already *totally* bandwidth starved. Doubling the width of the units would not add *any* performance.
They'd have to increase the bandwidth as well, just like Haswell.
What BD desperately needs is better caches. Everything else is a rounding error.
There were some rumors about the use of T-RAM technology. Perhaps that could save Bulldozer's cache hierarchy. Anyway I don't think having just two ALUs per core is merely a "rounding error". They've had three ALUs since the first Athlon. Even if the average use of the third ALU is low, it's of value during "burst" execution.

That's why I propose to make one ALU from each core shared between the two. It would be something in the middle between having fully separate execution cores like Bulldozer, and fully shared ALUs like with Hyper-Threading.
 
Nick, have you ever wondered why Intel or AMD haven't added more integer ALUs to their CPUs to increase single-threaded performance, if you think it actually works that way? It would be FAR easier and more effective than trying to glue together two separate cores to do the same thing.
 
Well obviously the register files should be unified or at least close together for reverse Hyper-Threading to work. Note that the FlexFP unit is fully shared and Intel's Hyper-Threading shares everything so it's clearly feasible.
The schedulers can ill afford to step on each other like that. The FP unit has its own scheduler along with the unified FP reg file.
If the integer register files were unified, it would almost follow that the scheduling and issue logic would be unified into a single scheduler.
Reverse-hyperthreading would be better characterized as reversing AMD's design decision and making it a one-core module.

That seems like a contradiction to me. If they're small then why would the routing paths suddenly be too long when you have 3 instead of 2 of them? Especially considering that AMD has had 3 ALUs for over a decade and process technology is shrinking faster than the clock rates go up, that doesn't seem very problematic to me.
The design targets higher clocks, but its designers also highlight that it emphasizes streamlining and reducing the amount of logic and complexity per stage (beyond the streamlining needed for clocks alone). The delay added by extra forwarding paths and the additional burden on the high-speed scheduler may have been beyond AMD's ability to manage. One more ALU would mean 4 forwarding paths from the new ALU to the other ALUs and AGUs in the core, and 4 new paths back, one from each of them to the new unit.
Sharing 2 ALUs between cores would mean the shared ALUs would have more than double the burden than if we only considered adding one ALU to one core.

As far as saying AMD shouldn't have problems since it had 3 ALUs 10 years ago, let's note that it failed to consistently improve over a design it made 10 years ago and has run ragged ever since.

It seems to me that AMD has underestimated the importance of IPC, and tried to compensate for it with higher clock frequencies and more cores. But the cure seems worse than the disease.
They're smarter than that. The problem is that high-performance designs are hard, competitive ones with Intel are harder, and doing so when they knew they wouldn't have the resources or a competitive manufacturing process even more so.
BD's design makes sense from the viewpoint that the designers wanted to get as much performance as they could knowing that they could not optimize it like Intel and they could not count on a good process from GF.


There were some rumors about the use of T-RAM technology. Perhaps that could save Bulldozer's cache hierarchy.
I'm not sure it would. The L3 is a waste for non-server loads, and the access speed of T-RAM makes it too slow for the L1 and L2.
AMD's cache hierarchy and interconnect just isn't all that much better than what preceded it, which has for years not been all that good.
 
Nick, have you ever wondered why Intel or AMD haven't added more integer ALUs to their CPUs to increase single-threaded performance, if you think it actually works that way? It would be FAR easier and more effective than trying to glue together two separate cores to do the same thing.
I know it's not as simple as just adding more ALUs. There's additional complexity for the schedulers, the forwarding networks, the register files, cache bandwidth, instruction decoding rate, etc.

However, as far as I can tell my suggestion is somewhere in between already existing designs. It could bring single-threaded performance back on par, without the cost of two full-blown cores. So I don't see why it should be dismissed that easily.
 