22 nm Larrabee

I know it's not as simple as just adding more ALUs. There's additional complexity for the schedulers, the forwarding networks, the register files, cache bandwidth, instruction decoding rate, etc.

However, as far as I can tell my suggestion is somewhere in between already existing designs. It could bring single-threaded performance back on par, without the cost of two full-blown cores. So I don't see why it should be dismissed that easily.

In theory, it would have the benefit of an additional ALU. In practice, it would have a cost. Not in die area, but in clock speed.

The number of ALUs is not dictated by cost or area. The size of a simple integer ALU is ridiculously small -- we are talking about a few tens or hundreds of thousands of transistors in a chip made of billions of them. If they wanted more, they wouldn't share. They'd just add more.

The forwarding network is responsible for getting the result of one ALU to another "between cycles", so that one ALU can start an operation on cycle n+1 using a result produced by another on cycle n. Since the time it takes for a signal to propagate through the forwarding network is added to every clock, making it take longer directly hurts the clock speed of the chip. Going from "4 places to forward results from/to" to "5 places to forward results from/to" means at minimum one additional gate with a high fanout on the critical path -- that alone is roughly a 6% hit in clock speed before even looking at the routing.

Then comes the problem that the signal needs to travel from every unit that is part of the forwarding network to every other unit in that time "between clocks". This means that the physical distance between the units is very strictly limited. If the fifth unit were sandwiched between the two clusters, the additional routing length would be, at minimum, the width of a single unit. That's another few percent of speed gone.

Finally, since the units would have to sit right next to each other, the arrangement would burn more than a quarter of the space available within a very short distance of the register file. That space is at a very high premium -- everything that needs fast access to the reg file needs to be there, so packing the clusters together would force something else near the reg file a little further away. That's probably a few percent again.
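
As a rough back-of-envelope check on those percentages (a toy Python model; the FO4 figures are assumptions for illustration, not measurements of any real chip), assume a pipeline stage holds about 16 gate delays of useful logic and see what one more high-fanout mux level plus a bit of extra wire does:

Code:
# Toy model of how the bypass/forwarding path limits cycle time.
# All numbers are illustrative assumptions, not measured values.
STAGE_DELAY_FO4 = 16.0   # assumed useful logic depth per pipeline stage, in FO4 units
EXTRA_MUX_FO4   = 1.0    # assumed cost of one extra (high-fanout) mux level on the bypass
EXTRA_WIRE_FO4  = 0.5    # assumed extra wire delay from pushing the units further apart

def relative_clock(extra_fo4):
    """Clock speed relative to baseline when the extra delay lands on the critical path."""
    return STAGE_DELAY_FO4 / (STAGE_DELAY_FO4 + extra_fo4)

print(f"one extra mux level : {relative_clock(EXTRA_MUX_FO4):.1%} of baseline clock")
print(f"mux + longer wires  : {relative_clock(EXTRA_MUX_FO4 + EXTRA_WIRE_FO4):.1%} of baseline clock")

That prints roughly 94% and 91% -- the "6% hit" and the "another few percent" above.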

You are thinking as if the units are the primary design detail and constraint, and the scheduler/reg file/forwarding are just minor details that someone else can think about. The exact opposite is true. The speed and design of modern high-performance CPUs are mainly determined by getting data and ops to where they are needed. The design, including the number of units, is dictated by how complex a reg file and forwarding network they can get away with. Within the actual execution core, the simple units are such a small portion that, if they can get ops and data to them, they can just slap on as many as they feel like. That's why SIMD exists.

Well obviously the register files should be unified or at least close together for reverse Hyper-Threading to work. Note that the FlexFP unit is fully shared and Intel's Hyper-Threading shares everything so it's clearly feasible.

The FlexFP is 4-wide, like the individual integer clusters. Putting the reg files close together I already mentioned.

If the reg files are unified, there would be no sense to have separate clusters at all. This gets you the Intel design -- one beefy reg file feeding one beefy execution cluster with 6 units (of which one is memory data write). The only problem is, AMD isn't Intel. AMD doesn't have the process tech or the design resources Intel has -- if AMD made a direct copy of Intel design, it would run much slower. Intel has a huge manufacturing lead, even on the same process, and it spends most of it on having that wider execution cluster.


I'm not sure it would. The L3 is a waste for non-server loads, and the access speeds of T-RAM make it too slow for the L1 and L2.
AMD's cache hierarchy and interconnect just aren't all that much better than what preceded them -- and what preceded them hasn't been all that good for years.

The problem with present AMD designs isn't the L3. It's the L2/L1. On write-heavy loads the BD is just sad.

That's one thing I wish they would try to copy Intel on. What they need is small and fast L1 and L2, backed by a large, partitioned L3 with a lot of bandwidth.

But can they actually build that?
 
You are thinking as if the units are the primary design detail and constraint, and the scheduler/reg file/forwarding are just minor details that someone else can think about. The exact opposite is true.
Please don't tell me what I think. I'm well aware there's far more to it than just adding more ALUs. But it's blatantly obvious that Bulldozer lacks single-threaded issue width. The solution is sharing ALUs between threads, not because the ALUs themselves are expensive, but because everything else is and you'd be sharing that too. Yes it's certainly a huge challenge, but it's not as if making up for the issue width with clock speed is a better solution.
If the reg files are unified, there would be no sense to have separate clusters at all. This gets you the Intel design -- one beefy reg file feeding one beefy execution cluster with 6 units (of which one is memory data write). The only problem is, AMD isn't Intel. AMD doesn't have the process tech or the design resources Intel has -- if AMD made a direct copy of Intel design, it would run much slower. Intel has a huge manufacturing lead, even on the same process, and it spends most of it on having that wider execution cluster.
Perhaps if they had wasted less time on Fusion they would have had the resources to make Bulldozer less horrible.

And they'd better also be working on hardware transactional memory support.
 
The solution is sharing ALUs between threads, not because the ALUs themselves are expensive, but because everything else is and you'd be sharing that too. Yes it's certainly a huge challenge, but it's not as if making up for the issue width with clock speed is a better solution.
Adding more ALUs to the same execution unit is MASSIVELY easier than making ALUs from different execution units shared between the two.
 
Adding more ALUs to the same execution unit is MASSIVELY easier than making ALUs from different execution units shared between the two.
Sure, but that doesn't increase issue width. Also, this is a very competitive industry and shying away from something just because it's harder will make you lose billions.
 
Sure, but that doesn't increase issue width.
How would that be any different in BD, where the instruction decoding-issuing part is shared within a module?
Also, this is a very competitive industry and shying away from something just because it's harder will make you lose billions.
True but from the little I know it's nearly impossible to make the ALU sharing fast. At best the latency will be about as good as between registers and L2 cache.
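
As a rough sketch of why that would hurt (Python; the latencies are assumed values for illustration, not BD measurements), consider a chain of dependent single-cycle ops where every few results have to come back across a slow forwarding path:

Code:
# Toy model: throughput of n dependent single-cycle ops when every
# cross_every-th result has to cross a slow forwarding path.
# Latency values below are assumptions, not measurements.

def chain_cycles(n_ops, cross_every, cross_latency):
    crossings = n_ops // cross_every
    return n_ops + crossings * (cross_latency - 1)

N = 1000
for lat in (1, 5, 20):   # 1 = same-cluster bypass, 20 = roughly L2-like (assumed)
    cycles = chain_cycles(N, cross_every=4, cross_latency=lat)
    print(f"forwarding latency {lat:2d} cycles -> {N / cycles:.2f} ops/cycle on a dependent chain")

With L2-like latency the dependent chain drops to a fraction of an op per cycle, so a shared ALU would only ever pay off on code with lots of independent work.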
 
I'm well aware there's far more to it than just adding more ALUs. But it's blatantly obvious that Bulldozer lacks single-threaded issue width. The solution is sharing ALUs between threads, not because the ALUs themselves are expensive, but because everything else is and you'd be sharing that too.
It's the sharing part that would be expensive. BD's partitioned design has amortized the cost of the front end and FPU, but the physical separation has made the cost of forwarding and register reads to shared units far more prohibitive.
The distance between neighboring ALUs within a core already contributes to the clocks and power consumption we see. The distance between cores is much greater, and sharing ALUs between cores would make the shared units disproportionately more costly. The pipeline would have to be longer and the core even hotter.

The costs of sharing can be seen in the delays involved in sharing the front end and FPU. There is an indeterminate amount of buffering between the decoders and the cores, with instruction pick, bundle generation, and buffering steps on the front end, and a totally separate FP scheduler and reg file that interacts with the integer side more like a load/store unit than a fellow ALU.
Adding that burden to basic integer math and address generation would have butchered AMD's performance further.

In essence, you are stating that AMD should have made an SMT core. They couldn't, which is why we have BD.

Perhaps if they had wasted less time on Fusion they would have had the resources to make Bulldozer less horrible.
Possibly somewhat better, but there seem to be such fundamental problems with the CPU side that it might just have been throwing more money down a pit.

And they'd better also be working on hardware transactional memory support.
The advanced synchronization facility was mentioned in that article. I'd looked at it earlier in the context of the BD release, and I've been wondering if a good chunk of AMD's woes are due to it getting too cute with the write pipeline. I haven't had the time to really analyze it in depth.
BD has some massive glass jaws when it comes to streamout, and the WCC can be a bottleneck. The idea behind some of AMD's work on hardware transactions was using current write-combining buffers to shoulder the load, and AMD has elaborated significantly on that portion.
Then it flubbed something.

Perhaps there is no fire to that smoke, but it may have been the case that early on some design features were implemented with a path left open to transaction support, or had possible hooks in place. It may have led to the suboptimal write capabilities we see now.
 
That's how 256-bit AVX works, isn't it? And if there's only 1 thread on the core it can co-issue to the two 128-bit SIMDs, can't it?
 
Why?

The SIMDs are SMT, aren't they?

It goes back to what is expected of an x86 CPU, and what is expected of a subunit of a gaming graphics chip.

Notably, the FP unit is SMT in BD. It is a separate scheduler, and some of the notably thorny issues like the load/store pipeline and instruction commit are offloaded to the separate integer cores.
 
That's how 256-bit AVX works, isn't it?
From what I've understood 256bit AVX will use the whole FPU at once. What I don't know is what happens if you try running a ton of 128bit instructions on it from a single thread. Will only half the FPU get used or both 128bit parts?
 
It should be able to issue 4 uops to the FPU. Whatever units they map to can be used.
Certain things, like the data path between the FPU and the integer cores, are sized to sustain only a single thread's issue capabilities.
 
That's how 256-bit AVX works, isn't it? And if there's only 1 thread on the core it can co-issue to the two 128-bit SIMDs, can't it?
Both halves of a 256bit instruction can be dispatched together, but they don't have to be. The 256bit AVX instructions are fastpath doubles (most arithmetic instructions at least), i.e. two individual internal µOps. They may issue together (because they operate independently on the two halves of the 256bit register) or sequentially, depending on the state of the other instructions in the scheduler (coming from both threads). This is in no way different from two independent 128bit AVX instructions (fastpath singles), which can also propagate down both of the FMA pipes.
 
Being fastpath doubles, two AVX-256 instructions take up 4 op slots, which is the peak the FPU can take.
This may lead to some underutilization if there are other operations that could have been done on the other pipes in the FPU.
If they were 128-bit ops, up to four separate operations could be performed, which could matter more if they don't need pipes 0 or 1.

This and other factors may lead to the small performance degradation that BD experiences with AVX.
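
A tiny bookkeeping sketch of the slot math above (Python; the uop costs follow the fastpath single/double description, the rest is illustrative):

Code:
# BD's shared FPU can accept up to 4 ops per cycle.
# A 256-bit AVX instruction is a fastpath double -> 2 internal uops,
# a 128-bit instruction is a fastpath single  -> 1 internal uop.
FPU_ISSUE_SLOTS = 4

def slots_used(insns):
    cost = {'avx256': 2, 'avx128': 1}
    return sum(cost[i] for i in insns)

print(slots_used(['avx256', 'avx256']))                      # 4: two 256-bit ops fill the FPU
print(slots_used(['avx128', 'avx128', 'avx128', 'avx128']))  # 4: four independent 128-bit ops
# Both hit the 4-slot peak, but the four 128-bit ops are independent operations,
# which leaves the scheduler more freedom to fill whichever pipes are actually free.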
 
Please don't tell me what I think. I'm well aware there's far more to it than just adding more ALUs.

Every sentence you write screams you still don't get it.

But it's blatantly obvious that Bulldozer lacks single-threaded issue width.

I'd be willing to contest that. Every test I've done says that the problem is not in executing instructions, it's in getting instructions and data to the core and out of it. A 2ALU + 2AGU design should run faster than this. In single-threaded mixed sse/int code, BD should have an advantage over SNB, if execution width was what counted. It (really) doesn't.

The solution is sharing ALUs between threads, not because the ALUs themselves are expensive, but because everything else is and you'd be sharing that too.

No. That "everything else" gets geometrically more expensive the more things it needs to talk with. Sharing it doesn't make it cheaper, it makes it more expensive.

I'll put it one more time: 2x(3ALU+2AGU) would be cheaper to make than the 2x(1ALU+2AGU+1 shared ALU) you proposed. If they wanted more execution width, they wouldn't share, they'd just add more units. It's that simple. But adding units isn't free. It costs you clock speed. They determined that the few % of the time when the third ALU would be used is not worth the small loss that adding it would cost. Why would it be worth the much bigger loss that sharing it would cost?
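
To put a number on the "geometrically more expensive" part (Python; this assumes a naive fully-connected bypass network, which real designs mitigate but can't escape), count the source-to-destination forwarding paths:

Code:
# Rough count of point-to-point paths in a fully-connected forwarding network.
def bypass_paths(n_units):
    return n_units * (n_units - 1)   # every unit forwards to every other unit

for n in (4, 5, 6, 8):
    print(f"{n} units -> {bypass_paths(n)} forwarding paths")
# 4 -> 12, 5 -> 20, 6 -> 30, 8 -> 56: each unit you add widens the muxes in front of
# *all* the existing units, which is why "just share one more ALU" is never cheap.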

And they'd better also be working on hardware transactional memory support.

AMD has an (ancient) proposal for HTM out there somewhere -- my google-fu was too bad to find it.

Haswell brings two important new technologies -- HTM and Gather. Both are very hard to bolt onto a memory unit after the fact, the way AVX was bolted onto BD. Very likely, AMD will have to suffer for quite a while before they can match Intel on them.
 
You mean the FP core? Can one thread actually use both of the 128bit FPUs there simultaneously or is it limited to at most one of them?
Both
But actually I'm wondering whether it is more like CMT/FMT.

"The FPU can receive up to four ops per cycle. These ops can only be from one thread, but the thread may change every cycle" By the Optimization gudie of BD
 
How would that be any different in BD, where the instruction decoding-issuing part is shared within a module?
Instruction issue is not shared in a Bulldozer module.
True but from the little I know it's nearly impossible to make the ALU sharing fast. At best the latency will be about as good as between registers and L2 cache.
Intel has been sharing ALUs between two threads for almost ten years now, in the form of Hyper-Threading. As long as the ALUs are close together I don't see why the latency would be horrible. There's a 1 or 2 cycle penalty for data crossing domains but that has little impact.
 
In essence, you are stating that AMD should have made an SMT core. They couldn't, which is why we have BD.
I sincerely doubt they couldn't. I think they simply expected software to easily scale to many threads by now and not suffer from Amdahl's Law. In such a world Bulldozer would have made a lot of sense, if it was well executed.

Unfortunately they miscalculated how hard it is to exploit TLP. SIMD is far more effective, which is why Intel created AVX. For everything less DLP oriented, high ILP is still critical.
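
The Amdahl's Law point in numbers (a quick Python sketch; the parallel fractions are just example values):

Code:
# Amdahl's Law: speedup(n) = 1 / ((1 - p) + p / n), p = parallel fraction of the work.
def speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

for p in (0.50, 0.90, 0.99):
    print(f"p = {p:.0%}: 8 threads -> {speedup(p, 8):.2f}x")
# p = 50%: 1.78x   p = 90%: 4.71x   p = 99%: 7.48x
# Unless nearly everything parallelizes, piling on threads buys little,
# which is why per-thread performance still matters.
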
The advanced synchronization facility was mentioned in that article. I'd looked at it earlier in the context of the BD release, and I've been wondering if a good chunk of AMD's woes are due to it getting too cute with the write pipeline. I haven't had the time to really analyze it in depth.
BD has some massive glass jaws when it comes to streamout, and the WCC can be a bottleneck. The idea behind some of AMD's work on hardware transactions was using current write-combining buffers to shoulder the load, and AMD has elaborated significantly on that portion.
Then it flubbed something.

Perhaps there is no fire to that smoke, but it may have been the case that early on some design features were implemented with a path left open to transaction support, or had possible hooks in place. It may have led to the suboptimal write capabilities we see now.
That's one interesting theory. I have little doubt that somewhere along the development of Bulldozer they realized they were moving in the wrong direction, resulting in resources being cut and valuable time lost figuring out what to do.
 