Bulldozers are faster in reverse

Nick

It appears that the biggest concern about unifying the CPU and (integrated) GPU is that sequential scalar workloads demand designing for ~4 GHz operation, while having wide SIMD units for graphics and compute workloads operate at such a frequency is not power efficient. So I've been thinking about an architecture that has its SIMD units running at half the base frequency, while still being homogeneous and offering plenty of throughput...

I think a key part of the solution is to do the reverse of what Bulldozer does: Have one scalar execution cluster shared between two threads, and two vector execution clusters each dedicated to one of the threads. The vector clusters would run at half the frequency of the scalar cluster, and as a better alternative to Hyper-Threading they could support AVX-1024 instructions which are executed on 512-bit SIMD units to help hide latency.

One possible implementation for the vector cluster is to have two identical FMA-capable SIMD units that can start an operation on odd/even cycles, and one SIMD unit for simple logic operations which runs at full frequency. This still corresponds relatively closely to Intel's current three SIMD units and thus minimizes the impact on legacy vector workloads.

Just four of these modules would deliver 1 TFLOPS of power efficient homogeneous throughput computing bliss. You can master it using any programming language you desire, without any quirky abstraction layers or unexpected overhead. This architecture would also fully retain legacy scalar performance.
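
As a rough sanity check of that figure (a sketch of my own, assuming a ~4 GHz base clock, single-precision FMAs, and the unit counts described above):

Code:
#include <cstdio>

int main() {
    // Assumed parameters: ~4 GHz scalar clock, vector clusters at half that,
    // 4 modules, 2 clusters per module (one per thread), 2 FMA-capable
    // 512-bit SIMD units per cluster, FP32 lanes, 2 FLOPs per FMA.
    const double vector_clock_hz    = 4.0e9 / 2.0;
    const int modules               = 4;
    const int clusters_per_module   = 2;
    const int fma_units_per_cluster = 2;
    const int lanes_per_unit        = 512 / 32;
    const int flops_per_fma         = 2;

    const double peak_flops = modules * clusters_per_module * fma_units_per_cluster
                            * lanes_per_unit * flops_per_fma * vector_clock_hz;
    std::printf("Peak throughput: %.3f TFLOPS\n", peak_flops / 1e12); // ~1.024
    return 0;
}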

Thoughts?
 
From what I understand from others, it's either difficult to write code for, or unlikely that you have an operation that can make use of that kind of width.
 
If what you want to make is a vector computer CPU, then of course :) Unfortunately, vector CPUs fell out of the spotlight even in the HPC market quite some time ago.
 
From what I understand from others, it's either difficult to write code for, or unlikely that you have an operation that can make use of that kind of width.

If you have the software written with a data parallel model (like with OpenCL Kernels), the compiler can easily generate the code to keep the vector unit busy.
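
For instance, a kernel as trivial as this (purely an illustrative sketch) is easy for a vectorizing compiler to map onto wide SIMD units, because every iteration is independent:

Code:
// Each element is computed independently, so the compiler can pack
// 16 FP32 iterations into one 512-bit FMA (or however wide the unit is).
void saxpy(float a, const float* x, const float* y, float* out, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = a * x[i] + y[i];
}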
 
It's the same width as NVIDIA's warp size.
That's not all too confidence inspiring. Programming for GPUs is notoriously unforgiving.
If you have the software written with a data parallel model (like with OpenCL Kernels), the compiler can easily generate the code to keep the vector unit busy.
Is that code any more efficient, though?
 
The problem with this kind of design is that the utilization of the FPUs would necessarily be very low.

In reality, vector loads are bursty. You have full throughput for a while, and then nothing at all for a time, repeat. Integer loads are much more consistent. The idea in BD is to take advantage of this by sharing the vector unit and using dedicated resources for integer. In your design, the vector units would simply be idle for most of the time. Why spend the silicon?

I honestly think that the computational side of BD is a pretty good design. The problems of the chip are much more centered on the lackluster cache subsystem.
 
It appears that the biggest concern about unifying the CPU and (integrated) GPU is that sequential scalar workloads demand designing for ~4 GHz operation, while having wide SIMD units for graphics and compute workloads operate at such a frequency is not power efficient. So I've been thinking about an architecture that has its SIMD units running at half the base frequency, while still being homogeneous and offering plenty of throughput...

I think a key part of the solution is to do the reverse of what Bulldozer does: Have one scalar execution cluster shared between two threads, and two vector execution clusters each dedicated to one of the threads. The vector clusters would run at half the frequency of the scalar cluster, and as a better alternative to Hyper-Threading they could support AVX-1024 instructions which are executed on 512-bit SIMD units to help hide latency.

Here is your first fault: you are afraid of multi-threading. Multi-threading is a very good thing in throughput computing. It's very cheap and increases utilization well.

One possible implementation for the vector cluster is to have two identical FMA-capable SIMD units that can start an operation on odd/even cycles, and one SIMD unit for simple logic operations which runs at full frequency. This still corresponds relatively closely to Intel's current three SIMD units and thus minimizes the impact on legacy vector workloads.

You totally forget the hard part: How to feed your SIMD units.

And you get lots of extra latency for those legacy operations from crossing different clock domains.

So considerably slower performance for "legacy code".

Just four of these modules would deliver 1 TFLOPS of power efficient homogeneous throughput computing bliss. You can master it using any programming language you desire, without any quirky abstraction layers or unexpected overhead. This architecture would also fully retain legacy scalar performance.

Thoughts?

No, you get 1 MARKETING TFLOP which is not available to any real-world code.

Your utilization will be worse than a GPU's because of the long latencies of FP operations: you need lots of independent operations to keep those units utilized. And there is lots of code that does not have enough ILP, so you need to either put multiple work items into a single lane (which opens many other problems) or do multi-threading. OOE does not help here; it only allows _using most of the available ILP_, but it cannot "increase ILP" in code that does not have it.
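
As a small illustration of the ILP point: in a plain dot product with a single accumulator, every FMA waits on the previous one and no out-of-order window can change that; only rewriting it with independent accumulators (or running more threads/work items) actually creates the parallelism. A sketch:

Code:
// One accumulator: a loop-carried dependency on 'sum' serializes the FMAs,
// so each one stalls for the full FP latency.
float dot_serial(const float* a, const float* b, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

// Four independent accumulators: four FMA chains can be in flight at once,
// which is the ILP the hardware needs to hide its latency.
float dot_unrolled(const float* a, const float* b, int n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i)
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);
}

And under strict FP semantics the compiler is not allowed to do that reassociation for you, which is exactly why the extra parallelism has to come from somewhere: more work items per lane, or more threads.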


GPUs get good utilization by using multi-threading, or vectors 4 times longer than their units, which makes all latencies look like 1 cycle to the software.


And when the thread is doing something other than calculating with those SIMD units, your SIMD units are idle. With multi-threading they would be in use; on GPUs there are other threads that can use them.


So you are really trying to put lots of those big and expensive FP ALUs on the chip while practically making sure they get very bad utilization on real-world code, because of your illogical fear of multi-threading.



And yes, you keep "legacy scalar integer performance" but with a core that is like 5 times bigger. And you lose some scalar FP performance.

A separate latency-optimized scalar core plus dedicated throughput cores will consume less power and get much better throughput performance.



The first place you went wrong when you started thinking about your idea was assuming that you have to unify the microarchitecture in order to unify the ISA, but you do not. It makes sense to unify the ISA, but the way you try to unify your microarchitecture is bad for both.





Bulldozer's shared FPU is the best point of the Bulldozer design, which is not a very good design because of some other things. I consider the topic of this thread an insult towards AMD's architects.


And if you still want a microarchitecture that fits both throughput and latency workloads, the University of Texas guys have a much better idea of how to do it, and they've actually simulated it and run many benchmarks on their simulator:

http://hps.ece.utexas.edu/people/khubaib/pub/morphcore_micro2012.pdf
 
In reality, vector loads are bursty. You have full throughput for a while, and then nothing at all for a time, repeat. Integer loads are much more consistent. The idea in BD is to take advantage of this by sharing the vector unit and using dedicated resources for integer.
Yes, and what happens then when you have two threads that want to use the vector unit at the same time...? Silicon isn't SO expensive that we need to shoot ourselves in the foot in that manner, performance-wise, and with clock and/or power gating, it's not a huge drain on resources to have a vector unit for each core.

Bulldog is a half-assed, bad idea. We hit the "more than enough" ceiling where integer performance is concerned quite some time ago, but for vector performance - where, arguably, the future of computing lies - the sky's the limit as far as performance is concerned. AMD's bizarre refusal to implement hyperthreading in their desktop CPUs seems to have played a role leading them down the path towards bulldog; it's really weird.
 
Yes, and what happens then when you have two threads that want to use the vector unit at the same time...?

The other uses it one cycle later. Good utilization of the unit, good throughput.

Silicon isn't SO expensive that we need to shoot ourselves in the foot in that manner, performance-wise, and with clock and/or power gating, it's not a huge drain on resources to have a vector unit for each core.

There is nothing "shooting yourself in the foot" about arbitrating units between two threads.

And those floating-point vector units are big, expensive and power hungry.

When you have that unit, why would you want to limit yourself to executing instructions from only one thread in it, when you can execute instructions from multiple threads and get much better throughput?

Bulldog is a half-assed, bad idea

The bad ideas in Bulldozer are elsewhere, like the very small write-through L1D cache combined with a slow L2 cache, and messing up the implementation so that it does not clock as high as it should.

Sharing the FPU between the two threads is the BEST part of the Bulldozer design, though the FPU is getting too narrow to compete with Haswell - they should widen it to 256 bits, but they are not doing that in Steamroller.
We hit the "more than enough" ceiling where integer performance is concerned quite some time ago, but for vector performance - where, arguably, the future of computing lies - the sky's the limit as far as performance is concerned. AMD's bizarre refusal to implement hyperthreading in their desktop CPUs seems to have played a role leading them down the path towards bulldog; it's really weird.

They wanted to have separate L1D caches for the integer cores so that one thread could not disturb the other by thrashing the cache. And in order to have two caches, you need two sets of LSUs. And the LSUs should be close to the integer datapaths, so they had to put two integer datapaths there as well.

They were optimizing for integer throughput.

The hardest thing to improve is the performance of single-threaded, non-parallelizable, control- and pointer-chasing-intensive integer code. And that's the weak point of Bulldozer; when something is hard, you have to attack it extra hard, but AMD decided to give up on it and concentrate on other things.
 
AMD's bizarre refusal to implement hyperthreading in their desktop CPUs seems to have played a role leading them down the path towards bulldog; it's really weird.

Umm, are you familiar with what it takes to validate SMT? Pretty sure AMD's refusal was based on that, and it was a reasonable bet at the time, given their resources.
 
The other uses it one cycle later. Good utilization of the unit, good throughput.
...And half the performance of an equivalent CPU with a dedicated vector unit per core. ...Which is partly why Intel is stomping all over bulldog, performance-wise.

There is nothing "shooting yourself in the foot" about arbitrating units between two threads.
It is, if you take performance into account.

And those floating-point vector units are big, expensive and power hungry.
Doesn't bother Intel. Their Core chips have - as you know - a (big, expensive, power-hungry) vector unit per core, and the chip still draws less than bulldog. Amazing, I know. :)

When you have that unit, why would you want to limit yourself to executing instructions from only one thread in it, when you can execute instructions from multiple threads and get much better throughput?
No offense, but throughput means fuck-all to me as an end user. Any end user, really. Maybe throughput makes silicon engineers jizz their pants, I really don't know, but it has no real bearing for virtually anyone else. Actual performance matters to me, and dedicated units win there.

...And yes, Intel chips are more expensive. But not because of the dedicated vector units; rather because Intel is Intel. It's a dysfunction of the free market that basically lets them set whatever prices they like. This dysfunction is also part of the reason why AMD is doing so poorly, incidentally, but that's maybe a topic for another thread...

They were optimizing for integer throughput.
...And Intel has them beaten there as well. Fat lot of good that optimization did them, eh. Anyway, what software are you running that is integer-limited? Anything at all? For me, I can't think of a single thing, really. I had to drop the overclock on my CPU from 3.4 GHz down to the 2.66 GHz stock clock, since the system was getting too unstable now that it's over 4 years old, and I'm barely noticing the difference. World of Warcraft, a notoriously poorly optimized application, runs a bit more sluggishly in cluttered areas, and that's mostly it.
 
And those floating-point vector units are big, expensive and power hungry.
The front end is usually the most power-hungry part of any x86 architecture, with a lot of data movement back and forth at all times. It's why Intel persisted in re-introducing a uop cache in SNB, while prior to that, in Nehalem, the loop buffer was moved after the decoding stage for the same power/performance reason.
 
...And half the performance of an equivalent CPU with a dedicated vector unit per core. ...Which is partly why Intel is stomping all over bulldog, performance-wise.
If execution throughput is such an issue, why is AMD's largest focus for Steamroller on instruction throughput?

A question I have wondered about: what would be better, 4x 128-bit FMA or 2x 256-bit FMA?
 
It's the same width as NVIDIA's warp size.
That's not all too confidence inspiring. Programming for GPUs is notoriously unforgiving.
This is the unification of a CPU and GPU. So having the same logical SIMD width as a GPU is not a bad thing. Most importantly, the unification makes it much easier to program for than a heterogeneous architecture. Compilers can vectorize loops with independent iterations without having to worry about data migration overhead or synchronization overhead.
 
The problem with this kind of design is that the utilization of the FPUs would necessarily be very low.

In reality, vector loads are bursty. You have full throughput for a while, and then nothing at all for a time, repeat. Integer loads are much more consistent. The idea in BD is to take advantage of this by sharing the vector unit and using dedicated resources for integer. In your design, the vector units would simply be idle for most of the time. Why spend the silicon?
This is the unification of a CPU and integrated GPU. The silicon that used to be spent on the GPU is distributed between the modules, so there's no extra cost. Also, the utilization of the SIMD units would be no less than that of the legacy integrated GPU.

Also keep in mind that GPGPU is kind of a failure in the consumer market, despite plenty of unexploited DLP in most applications. That's because heterogeneous programming is too hard and too unpredictable. By fully unifying the CPU and GPU, developers can rely on tools to auto-vectorize their code without bad surprises. So the utilization of the SIMD units will go up for uses beyond 3D.
 
I think a key part of the solution is to do the reverse of what Bulldozer does: Have one scalar execution cluster shared between two threads, and two vector execution clusters each dedicated to one of the threads. The vector clusters would run at half the frequency of the scalar cluster, and as a better alternative to Hyper-Threading they could support AVX-1024 instructions which are executed on 512-bit SIMD units to help hide latency.
Can you clarify some elements of this design?
Does this design, like Bulldozer, keep the retire buffer, exception handling, and memory pipeline in the integer domain?
Is this a unified or non-unified scheduler, since the separate large vector domains at a clock divisor seem more applicable to AMD's coprocessor model?
Are there some assumed changes to the integer side, since Bulldozer's current integer cluster format would be rather constraining?
Are there some other assumed changes, such as cache port width, number of accesses per clock (which clock?), cache arrangement/size, and so on?
Is this assuming 512-bit or 1024-bit registers?

One possible implementation for the vector cluster is to have two identical FMA-capable SIMD units that can start an operation on odd/even cycles, and one SIMD unit for simple logic operations which runs at full frequency.
This logic SIMD unit runs at integer core frequency? Is it still in the same domain as the half-clock units?
 
It appears that the biggest concern about unifying the CPU and (integrated) GPU is that sequential scalar workloads demand designing for ~4 GHz operation, while having wide SIMD units for graphics and compute workloads operate at such a frequency is not power efficient. So I've been thinking about an architecture that has its SIMD units running at half the base frequency, while still being homogeneous and offering plenty of throughput...

I think a key part of the solution is to do the reverse of what Bulldozer does: Have one scalar execution cluster shared between two threads, and two vector execution clusters each dedicated to one of the threads. The vector clusters would run at half the frequency of the scalar cluster, and as a better alternative to Hyper-Threading they could support AVX-1024 instructions which are executed on 512-bit SIMD units to help hide latency.
Here is your first fault: you are afraid of multi-threading. Multi-threading is a very good thing in throughput computing. It's very cheap and increases utilization well.
I am not "afraid" of multi-threading at all. I use it in my renderer and I love the 30% speedup from Hyper-Threading.

The problem is that I had to look for an alternative to Hyper-Threading because running the SIMD units at half the frequency also cuts performance in half, unless you have two SIMD clusters each dedicated to one thread. Executing AVX-1024 instructions in two cycles achieves the same goal of increasing utilization, and is even cheaper.
You totally forget the hard part: How to feed your SIMD units.
Please pinpoint what you believe to be the issues.
And you get lots of extra latency for those legacy operations from crossing different clock domains. So considerably slower performance for "legacy code".
There wouldn't really be different clock domains to cross. One SIMD unit starts a new operation on an odd cycle, while the other starts on an even cycle. To the rest of the architecture they appear as one unit operating at full frequency. A third unit for simple logic operations can run at true full frequency without consuming much power. The impact on legacy code should be minimal, because either the use of SIMD instructions is sparse, so the lower throughput doesn't matter much, or they're used intensely, but those applications are definitely multi-threaded, so they take advantage of the two clusters.
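
A toy model of the timing (my own sketch; the "full-rate" cycle here is the scalar clock):

Code:
#include <cstdio>

int main() {
    // Each half-frequency FMA unit accepts a new 512-bit operation every
    // 2 scalar-clock cycles; unit A starts on even cycles, unit B on odd ones.
    for (int cycle = 0; cycle < 8; ++cycle) {
        const char* unit = (cycle % 2 == 0) ? "FMA unit A" : "FMA unit B";
        std::printf("scalar cycle %d: issue op %d to %s\n", cycle, cycle, unit);
    }
    // Net effect: one 512-bit operation issued per scalar-clock cycle,
    // even though each physical unit only starts work every other cycle.
    return 0;
}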
No, you get 1 MARKETING TFLOP which is not available to any real-world code.
Which is perfectly fine. I was merely pointing out that it would have a sufficiently high peak throughput for this low number of modules. GPUs also advertise their theoretical throughput but get nowhere near it in real-world code (see "Performance Upperbound Analysis on Kepler GPUs"). So let's try to use the same weights and measures if you want to compare things.
Your utilization will be worse than a GPU's because of the long latencies of FP operations: you need lots of independent operations to keep those units utilized.
Again, that's where AVX-1024 comes in. It's like having two independent AVX-512 instructions, except that it would only occupy one uop. It increases utilization similarly to Hyper-Threading, but without the overhead of duplicated code, with less cache contention, and with no extra thread synchronization.
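A rough sketch of the idea, modeled with today's AVX-512 intrinsics (the v1024 type and fma1024 helper below are purely hypothetical; only _mm512_fmadd_ps is a real intrinsic):

Code:
#include <immintrin.h>

// Hypothetical 1024-bit register, modeled as two 512-bit halves.
struct v1024 { __m512 lo, hi; };

// A hypothetical AVX-1024 FMA: the hardware would execute the two halves on
// consecutive cycles of the 512-bit unit, but to the front end and scheduler
// it is a single uop. The halves are independent of each other, which is
// what provides the extra latency hiding.
static inline v1024 fma1024(v1024 a, v1024 b, v1024 c) {
    return { _mm512_fmadd_ps(a.lo, b.lo, c.lo),
             _mm512_fmadd_ps(a.hi, b.hi, c.hi) };
}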
And there is lots of code that does not have enough ILP, so you need to either put multiple work items into a single lane (which opens many other problems) or do multi-threading. OOE does not help here; it only allows _using most of the available ILP_, but it cannot "increase ILP" in code that does not have it.
There is plenty of ILP that can be exploited with out-of-order execution. Software renderers and compute kernels running on the CPU achieve very good utilization. Hyper-Threading helps by 30%, and although that isn't negligible, it proves that out-of-order execution doesn't leave much underutilized either. AVX-1024 effectively doubles the available ILP if you think of it as two AVX-512 instructions. And since it occupies only one uop for two cycles of work, the scheduling window becomes effectively twice as large.
GPUs get good utilization by using multi-threading, or vectors 4 times longer than their units, which makes all latencies look like 1 cycle to the software.
But they don't have out-of-order execution. My unified architecture would combine SIMD instructions that are twice as wide as the units with out-of-order execution to cover the rest of the latency. This reduces cache contention and improves locality of reference, which reduces power consumption. The GPU is very wasteful with bandwidth since it can't achieve high cache hit ratios due to storing many thread contexts.
 