AMD Bulldozer Core Patent Diagrams

I wish they would implement hardware multithreading in it.

I still struggle to understand why they haven't gone down this road, seriously. In my opinion, their integrated-memory-controller (IMC) architectures gave them far more room for success with SMT than anything Intel had before the i7.

They've had excellent main memory bandwidth, excellent inter-core communication, and plenty of execution resources on the die. All the basics are there; why haven't they gone those last few steps and really opened this up?
 
I still struggle to understand why they haven't gone down this road, seriously. In my opinion, their integrated-memory-controller (IMC) architectures gave them far more room for success with SMT than anything Intel had before the i7.

They would have to increase caches, ROBs and register files, all time-critical structures. And all just to get higher utilization of the execution units.

AMD's cores are relatively small; each core in a quad-core Phenom II is around 10% of the entire die. The reasoning behind not developing SMT is probably that they might as well just double the number of cores and get double the performance in multithreaded scenarios. Unfortunately they do not enjoy the excess fab capacity Intel does.

AMD's cores are descendants of the original K7; they have had 2 (or 3?) internal new-architecture projects cancelled since the K7 came out, and I'm sure one of those contemplated SMT.

Cheers
 
They would have to increase caches, ROBs and register files, all time-critical structures. And all just to get higher utilization of the execution units.
Register files, yes. The others are a strong maybe. Intel is doing Hyper-Threading with L1/L2 caches smaller than those on quite a few current AMD chips. And while I take your point that cores are only ~10% of the total die space, a doubled register file would be considerably smaller still. The net effect would be, I dunno, half the performance of an additional core for a quarter (or less) of the die-space increase?
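To make that hand-waving concrete, here's a back-of-envelope sketch. The ~10% core figure comes from the post above; the 5% register-file share of a core and the 25% SMT throughput gain are made-up placeholder assumptions, not measured numbers for any real part.

```python
# Back-of-envelope SMT area/performance trade-off.
# ASSUMPTIONS (illustrative only): register file is ~5% of a core,
# SMT adds ~25% multithreaded throughput.

core_area_pct = 10.0        # one core as a % of total die (from the post)
rf_share_of_core = 0.05     # assumed register-file share of a core
smt_rf_overhead = rf_share_of_core * core_area_pct  # extra die % to double it

extra_core_perf = 1.0       # a whole extra core: +100% MT throughput
smt_perf_gain = 0.25        # assumed SMT gain

perf_per_area_core = extra_core_perf / core_area_pct   # 0.10 per die-%
perf_per_area_smt = smt_perf_gain / smt_rf_overhead    # 0.50 per die-%

print(f"extra die area for a doubled register file: {smt_rf_overhead:.2f}%")
print(f"perf per die-% (extra core): {perf_per_area_core:.2f}")
print(f"perf per die-% (SMT):        {perf_per_area_smt:.2f}")
```

Under these (made-up) numbers SMT delivers several times more throughput per unit of die area than an extra core, which is exactly the intuition the post is gesturing at.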

Again, they're (in my opinion) in a far better position than Intel in terms of SMT's ability to make a difference: they've had far better IPC (inter-processor communication) and memory-subsystem technology for a very long time, at least up until the i7 finally hit the street. I think this should've given them far more opportunity to deliver a highly successful SMT implementation.

AMD's cores are descendants of the original K7; they have had 2 (or 3?) internal new-architecture projects cancelled since the K7 came out, and I'm sure one of those contemplated SMT.

This is probably the worst part, and I don't disagree. I think this SMT thing is just another smaller, less obvious example of how AMD's processor innovation seems to have stagnated over the last several years. Which makes me :(
 
One of the pitfalls of SMT on a complex OoO processor is that it expands the engineering resources needed to properly design and verify the chip.

If the diagram and patent applications are indicative of Bulldozer--and here we need to be cautious, as patents do not always make it to implementation--we see that AMD has made the decision to simplify a number of things at the core unit level to make way for more complexity in speculation.

The aggressive speculation is also an argument against SMT, since slots consumed by speculation cannot be doled out to other threads.
 
Really, my post only serves as a note of sadness and despair for AMD's current processor lineup. They had so many opportunities and so much time to make something awesome, and yet here we are with K7 part 22. :(

Let's do something new AMD, seriously. Let's get back into the game; let's do something with those R&D resources that have seemingly been idle for the last half decade. Come on guys, bring the pain or something!
 
AMD did try to bring the pain, repeatedly.

It had at least one false start with K8, and several false starts (something like six months' delay each) prior to Barcelona, which itself was a faceplant.

Complex designs need significant resources and time, and we can see the disparity between AMD and Intel in the amount of resources they have in reserve for such efforts.

The multiyear gap before the long-delayed Bulldozer (whichever design they've settled on after however many they've scrapped) points to a significant limitation of means.

I'm curious how much the layoffs have hit the engineering and design groups. It's also not clear to me whether the engineering executives whose tenure best matches the abortive attempts at a K8 successor have been culled, or whether, like the current AMD CEO, they just got promoted.
 
Hm, that clustered approach is intriguing to me. It looks like each cluster is a scaled-down version of the integer block found in K10. The L1D cache will probably stay the same dual-ported, banked array with high throughput for arbitrary accesses, but some details could change, like the size (probably halved, per cluster) and a doubled/quadrupled associativity to compensate.
Everything here points to heavy "modularization" down to the lowest architectural level.
 
Register files, yes. The others are a strong maybe. Intel is doing Hyper-Threading with L1/L2 caches smaller than those on quite a few current AMD chips. And while I take your point that cores are only ~10% of the total die space, a doubled register file would be considerably smaller still. The net effect would be, I dunno, half the performance of an additional core for a quarter (or less) of the die-space increase?

My point is that SMT is not just something you bolt on the side of your processor. We saw with P4 what that would do.

Northwood's 8 KB 2-way D$, the halving of the per-process ROB entries, and the thrashing of the trace cache made SMT an almost sure loss for most workloads.

Prescott improved the D$ to 16 KB 4-way and doubled the trace cache. It had better SMT performance as a result.

Core i7 has a 32 KB 8-way D$ and similar instruction caches. Ci7 increases the ROB to 128 entries from the 96 of the Core 2 architecture; in SMT mode each context gets 64 entries (that is why you see lower performance for single-threaded workloads in SMT mode). Ci7 also has a per-core L2 cache that functions as a victim cache for the D$ and I$. So while you have the same first-level caches as C2, the per-core cache system is greatly improved.
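A trivial sketch of the ROB partitioning described above, using the entry counts cited in this post:

```python
# ROB sizes as cited above (entries); in SMT mode the i7 statically
# partitions the ROB between the two hardware contexts.
rob_entries = {"Core 2": 96, "Core i7": 128}
per_thread_smt = rob_entries["Core i7"] // 2

print(per_thread_smt)  # 64 entries per context
# A single thread running with SMT enabled sees fewer ROB entries (64)
# than even a Core 2 thread (96), hence the single-thread penalty.
```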

The active register file is doubled, the architected register file is doubled.

All the critical structures are made with SMT in mind, and it shows performance-wise in multithreaded workloads.

Cheers
 
Northwood's 8 KB 2-way D$,...
The L1D array in 180 and 130nm P4's was 4-way associative.
Prescott improved the D$ to 16 KB 4-way and doubled the trace cache.
Similar here -- Prescott and later models doubled associativity to 8-way, for the L1D, and the trace-cache size remained unchanged throughout the entire NetBurst family.
 
The L1D array in 180 and 130nm P4's was 4-way associative.

Similar here -- Prescott and later models doubled associativity to 8-way, for the L1D, and the trace-cache size remained unchanged throughout the entire NetBurst family.

Right, my bad.

I keep making that mistake and assuming that associativity is cache size divided by page size (4 KB), like in most virtually indexed, physically tagged caches. The other notable exception is the K7/K8's 2-way associative 64 KB caches.

Cheers
 
So I've been trying to work out what is so special about this setup: why put a 2nd int pipeline into a core when we're in the multi-core era?
Why not just make another core?

Finally I think I get it.
It's about making much better use of the FPU/SIMD unit.

Currently (and presumably for the foreseeable future, according to AMD?) the typical int:FP/SIMD instruction ratio must be at least 2:1?
i.e. at least half the time, that big FPU on a modern x86 CPU is sitting idle.

So to make better use of that silicon, you share one FPU between two int pipelines in a 'cluster'.

You get 2 full speed int threads.
The scheduler can reorder FPU ops to prevent conflicts where both threads are trying to use the FPU at once.
(Could you even schedule some FPU ops to do work for both threads at once? i.e. a 64-bit op from thread A plus one from thread B, or 2 × 32-bit ops from thread B?!)
Scheduling should be easier than Intel's macro-op fusion etc.

When the per-thread int:FPU ratio is 2:1, the FPU will sit at 100% utilisation, and both threads will run at basically the same speed as if they were on 2 separate cores.
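That claim can be checked with a toy contention model: two integer threads each try to issue one FP op every `period` cycles into a single shared FPU that accepts one op per cycle. The model is purely illustrative (no real scheduler works this simply), but it shows the shared FPU saturating exactly when each thread demands it half the time.

```python
# Toy model: two threads share one FPU (one op accepted per cycle).
# Each thread generates one FP op every `period_a` / `period_b` cycles.

def shared_fpu(period_a, period_b, cycles=10_000):
    """Return FPU utilization and per-thread queued-op stall counts."""
    backlog = [0, 0]          # FP ops waiting, per thread
    busy = 0
    stalls = [0, 0]
    for t in range(cycles):
        if t % period_a == 0:
            backlog[0] += 1
        if t % period_b == 0:
            backlog[1] += 1
        # round-robin arbitration: serve one pending op per cycle
        for i in (t % 2, (t + 1) % 2):
            if backlog[i]:
                backlog[i] -= 1
                busy += 1
                break
        for i in (0, 1):
            stalls[i] += backlog[i] > 0
    return busy / cycles, stalls

util, _ = shared_fpu(2, 2)   # each thread: one FP op every 2 cycles
print(f"FPU utilization at 2:1 per thread: {util:.2f}")  # 1.00, fully busy

util, _ = shared_fpu(4, 4)   # lighter FP load: one op every 4 cycles
print(f"FPU utilization at 4:1 per thread: {util:.2f}")  # 0.50, half idle
```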

A 3 GHz quad core of that would be pretty impressive, I think.
Shame it's not coming later this year but in 2011 :-/
 
I still struggle to understand why they haven't gone down this road, seriously. In my opinion, their integrated-memory-controller (IMC) architectures gave them far more room for success with SMT than anything Intel had before the i7.

Makes you wonder. Quite a few people were convinced that HT only worked because the Pentium 4 was so inefficient to begin with. A more efficient architecture (e.g. K7/K8) would not have enough spare resources for a second thread to take advantage of (many people thought Intel was crazy when they first heard that HT was making a comeback on Nehalem).
Perhaps even AMD's engineers believed this?

Or perhaps an efficient SMT-implementation is so hard to do that AMD simply didn't have the resources.
All I know is that they will need it if they want any chance of competing at all in the future. Currently Intel's new Nehalem dual-socket servers are a threat to AMD's quad-socket systems, and SMT plays an important role in that (especially when it comes to capacity for virtual machines).
 
The description of the FP unit shows support for FMAC and a 64-128 bit maximum operand width.
From a silicon point of view, we have a total of 4 INT units per core, up a third from the 3 in a current Opteron core.
The FP unit is going to support operations that could force its size up at least as much. An FMAC would require at least 50% more operand bandwidth, and the bit width could be enough to bloat the FP unit as well.
The proportion of idling silicon isn't massively changed, or it could be even more slanted in favor of the FP unit.

I think it could be that the design isn't sharing a deemphasized FP unit, but instead it is balanced around several critical resources, some of which might be more related to a much more powerful FP unit than they are for integer execution.

Clustering points to a certain amount of deemphasis of peak integer execution.
Highest peak would be a big expensive 4-way scheduler and a big expensive crossbar servicing all 4 integer lanes.
AMD has cut these into two half-sized entities. This is actually a net savings, as a lot of the common circuits for superscalar issue scale quadratically with peak width.
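The quadratic-scaling claim, made concrete: bypass networks and wakeup/select logic in an issue window grow roughly with the square of issue width, so splitting one 4-wide core into two 2-wide clusters halves that cost. The width-squared model below is a deliberate simplification; only the ratio matters.

```python
# Crude width^2 cost model for crossbar/wakeup logic in superscalar issue.
def issue_logic_cost(width):
    return width ** 2

one_4wide = issue_logic_cost(4)        # 16 cost units
two_2wide = 2 * issue_logic_cost(2)    # 8 cost units: a 2x saving

print(f"one 4-wide cluster: {one_4wide}, two 2-wide clusters: {two_2wide}")
```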

The front end has been increased significantly. It's 4-wide, but if AMD uses the same symmetric decoder, it is significantly more expensive to implement than the complex-simple-simple-simple scheme used by Intel.
The rename stage works in terms of 4 instructions, which is also expensive.
It then feeds, however, integer clusters that are physically incapable of that kind of throughput.

As a result a very expensive front end is amortized over more threads.
The slimmer integer clusters with private schedulers can also do more speculation, since they do not speculate over as wide an integer pipeline.
Other patents hint at attempts to reduce the complexity of the integer register file.

The cache bandwidth is also much higher. 4 data cache loads in total doubles what Opteron can do.
However, each integer cluster has access to only one L1 capable of two loads.
The FPU, however, can hit both, which is something that data-hungry FP code really needs.
The FPU, being separate, can also get by with less speculation, which doesn't benefit it as much, and may also have more register ports to support FMAC.

Peak single-threaded integer performance would be increased, if clocks and other things oblige, but the FP unit looks like it might be the big winner.
 

Looks like AMD is adopting Intel's AVX instructions. Seems like a reasonable move, and it sounds like it's not too big a deal to share floating-point resources between register files that aren't orthogonal to x86 (except they'll have to double the width to 256 bits in this case). We've seen this since 3DNow! Professional, which also handled SSE on the Athlon XP.
 