AMD Bulldozer Core Patent Diagrams

It is interesting that it has not been given any particular name like SSEx or AVX. Or are they just introducing new instructions now and will implement them when they see fit? :)
 
I'm pretty sure that documentation is just describing their implementation of AVX, albeit a superset of it with a few additional instructions from SSE5, like FMAC, that don't yet have an Intel equivalent.
 
What was pretty interesting was that a) a lot of integer MAC and MADD instructions were implemented. Integer multiplication rarely gets any love from SSEx, but here those seem to make up almost half of all the instructions.

b) Also the conversions from float to half float and vice versa. There seems to be no equivalent in AVX.

c) The extraction of the fractional part of a floating-point number. That will help with a lot of the transcendental functions required for an OpenCL implementation, and since OpenCL code is JIT-compiled at runtime, it's not a big problem if it is not adopted by Intel (as with 3DNow!).
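To illustrate why a hardware frac-extract matters: most transcendental implementations start by splitting the argument into integer and fractional parts, then evaluate a polynomial on the fraction. A rough C sketch of exp2 along those lines (my own illustration with throwaway coefficients, not anything from AMD's docs):

```c
#include <math.h>

/* Illustration only: exp2(x) via an integer/fraction split. A single
 * frac-extract instruction would replace the floor+subtract pair. */
double exp2_sketch(double x)
{
    double i = floor(x);   /* integer part */
    double f = x - i;      /* fractional part: the op in question */

    /* short polynomial approximating 2^f on [0,1);
     * coefficients are rough, for illustration only */
    double p = 1.0 + f * (0.6932 + f * (0.2402 + f * 0.0555));

    return ldexp(p, (int)i);   /* scale by 2^i */
}
```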
 
SMT is extremely difficult to verify. It was not 'bolted on' to the P4; it was designed in from day one, and turned off for over a year and a half. And remember, Intel has more engineering resources than AMD.

The upside is big, though, as average IPC is typically 0.5-1 for most CPUs without SMT.

Another issue is that AMD's architecture would need some pretty serious modifications. Their L1D cache associativity is really way too low, and that's with a single thread running. With 2 threads it would be practically intolerable. And that L1D is definitely on the critical path, so once you try and increase associativity, latency probably increases, which means you need to resize the TLBs, buffering and pipeline depth, etc. etc.
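To make the associativity problem concrete: with K10's 64KB 2-way L1D, addresses 32KB apart land in the same set, so touching just three such lines thrashes it. A hypothetical C micro-benchmark sketch (sizes assume the K8/K10 organization; a second thread sharing the cache would only make this pattern more common):

```c
#include <stdio.h>
#include <stdlib.h>

#define WAY_STRIDE (32 * 1024)  /* 64KB L1D / 2 ways */
#define LINES      3            /* one more than the associativity */

int main(void)
{
    volatile char *buf = malloc(WAY_STRIDE * LINES);
    long sum = 0;

    /* all three accesses map to the same set; with only 2 ways,
     * every access evicts a line and misses on the next pass */
    for (long iter = 0; iter < 10000000; iter++)
        for (int i = 0; i < LINES; i++)
            sum += buf[(long)i * WAY_STRIDE];

    printf("%ld\n", sum);
    free((void *)buf);
    return 0;
}
```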

SMT is a huge win though...it's a pretty obvious sign when everyone else (Sun, IBM, Intel, NV, ATI) has already gotten on the bandwagon. Hell, even some embedded CPUs like RMI have multi-threading.

That being said, it's pretty clear to me that it would be a huge undertaking for AMD to verify SMT. Definitely the kind of thing that would have made sense for Barcelona, but it's tantamount to doing an entirely new uarch.

DK
 
Yup, it's like a balanced trade-off -- poor associativity in exchange for larger size, low access latency and sheer multi-banked R/W throughput (the K10's L1D implementation is the fastest solution in this regard to this day). It has been a main philosophy for AMD's architectures since the very first K7.
 
Uhh, you mean AMD has high throughput and low latency?
Intel generally has far higher throughput than AMD, and has a much better balanced trade-off between latency and hit-ratio than AMD does.
There doesn't really seem to be a philosophy as such with AMD, rather an inability to match Intel in caching technology.
 
By what metric do you mean fastest?
Itanium's L1 latency is 1 cycle.
Opteron's is 3.
 
That being said, it's pretty clear to me that it would be a huge undertaking for AMD to verify SMT. Definitely the kind of thing that would have made sense for Barcelona, but it's tantamount to doing an entirely new uarch.

Well, Bulldozer is supposed to be a brand new uarch.
 
By what metric do you mean fastest?
Itanium's L1 latency is 1 cycle.
Opteron's is 3.
My comment was intended more about the throughput, not so much emphasizing the access latency. And, yes - I know there are lower-latency cache designs in a few named architectures (Willamette's and Northwood's L1D was 2-cycle, but a tiny size). Again - I'm more about the trade-off balance. AMD was keen enough to keep the L1D size, access time, bank organization, etc. pretty much intact throughout the entire K7~K10 run of architecture refreshes and manufacturing technology updates, while only now, with K10, did they have to double the throughput from 2*64-bit R/W up to 2*128-bit to accommodate the extended FP/SIMD paths.
Looking at the larger picture, it seems AMD kept developing their last three uarch generations around the very same general cache design (quite unlike Intel) -- [relatively] large L1 arrays with low associativity but high throughput and latency tolerance, compensated by a slow but highly (16-way) associative and exclusive L2 array. In fact, only the very first Slot-A K7 had an inclusive L2 relation, being an off-chip implementation, of course. It's all a kind of heritage from the old days of scarce silicon die real estate, where an inclusive relation with such a large L1 (2*64KB) would have been a waste with a 256K L2, and even with the larger sizes of the time.
It is still arguable why AMD initially jumped at the decision to implement an architecture (K7) with such large L1s -- some say it was to compensate for the rather poor FP register file organization (I'm not sure), but it is a fact that they milked this "trade-off" model for too long.
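For the record, the exclusive relation works roughly like this: on an L1 miss the line is *moved* out of L2 rather than copied, and the L1 victim drops down into L2, so the effective capacity is L1 + L2. A toy C sketch (structures invented by me for illustration, not any actual AMD mechanism):

```c
typedef struct { int valid; long tag; /* data omitted */ } line_t;

/* Toy model of an exclusive L1/L2 fill: the requested line moves from
 * L2 into L1 (L2 gives up its copy), and the evicted L1 victim is
 * written back down into L2. Net effect: no line lives in both levels,
 * so capacity is additive -- which is why large 64KB L1s made sense
 * next to a modest 256K L2. */
void l1_fill_exclusive(line_t *l1_victim_slot, line_t *l2_hit_slot)
{
    line_t victim = *l1_victim_slot;  /* line being evicted from L1 */

    *l1_victim_slot = *l2_hit_slot;   /* move the line up into L1 */
    l2_hit_slot->valid = 0;           /* exclusive: L2 drops its copy */

    if (victim.valid)
        *l2_hit_slot = victim;        /* victim goes down into L2 */
}
```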
 

Maybe you missed it, but the whole point of a cache is latency.

Adding more bandwidth is easy, especially when you go from 64-bit to 128-bit accesses. Supporting more accesses at lower latency is much, much harder.

David
 
INQ has a story about AMD having taken up AVX.
Apparently 'SSE5' is no more, and the SSE5 instructions that aren't part of AVX now go by 'XOP'.

Anyway, 3dilettante I think you misinterpreted what I was saying.
I think we are mostly suggesting the same thing.

I'm not arguing that it's a de-emphasised FPU; I'm saying that by sharing one between two INT cores, it will be getting more ops coming to the FPU = higher utilisation than having one FPU per INT core.

The current case of having one FPU per INT core is presumably leaving that FPU idle a lot (inefficient).
The FPU is only going to get bigger over time so it needs to be running at high utilisation (efficient).
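The utilisation argument in toy numbers (mine, purely illustrative): if each INT core has an FP op ready on, say, 30% of cycles, a private FPU idles 70% of the time, while a shared one is busy whenever either core has work:

```c
#include <stdio.h>

/* Toy utilisation model, purely illustrative: each INT core
 * independently has an FP op ready with probability p per cycle,
 * ignoring queueing and contention effects. */
int main(void)
{
    double p = 0.30;                              /* assumed per-core FP demand */
    double private_util = p;                      /* one FPU per core */
    double shared_util = 1 - (1 - p) * (1 - p);   /* busy if either core has work */

    printf("private FPU utilisation: %.0f%%\n", private_util * 100);  /* 30% */
    printf("shared  FPU utilisation: %.0f%%\n", shared_util * 100);   /* 51% */
    return 0;
}
```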

The 2nd INT thread is not about delivering massive INT throughput, but it will help since there are more INT ALUs.
(though I see now they won't be full speed, they'll each be slower per-clock due to being only 2 wide)

I don't know anything about frontend/decoder stuff, but I'd have thought 2 * 2-wide should be easier to get high utilisation out of (and/or simpler) than Intel's single 4-wide with SMT.
That should then allow them to engineer it better with the same resources, or to re-allocate resources to other areas of the chip.
 
AMD's implementation of AVX could be a half-assed two-pass approach for the sake of compatibility, since it was initially built around the 128-bit SSE5 instructions. I hope that's not the case, but I doubt it will matter very much during the first generation of AVX.
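What I mean by "two pass": a datapath built around 128-bit units would crack each 256-bit AVX op into two 128-bit halves, something like this SSE sketch (my own emulation of the idea, obviously not AMD's actual hardware):

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* One 256-bit AVX-style add done as two 128-bit passes: functionally
 * identical to a native 256-bit unit, but half the peak throughput. */
void add8f_two_pass(const float *a, const float *b, float *out)
{
    __m128 lo = _mm_add_ps(_mm_loadu_ps(a),     _mm_loadu_ps(b));
    __m128 hi = _mm_add_ps(_mm_loadu_ps(a + 4), _mm_loadu_ps(b + 4));
    _mm_storeu_ps(out,     lo);
    _mm_storeu_ps(out + 4, hi);
}
```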
 
Mr. Dresdenboy has a new version of his diagram up

Damn, I wish this was coming next year rather than the year after next :(

I'm nearly definitely going to be finally replacing my 2006 C2D E6600 with an i7 860, and I'm not likely to replace that for probably at least as long again.
I'd have waited to see how Bulldozer turns out if it were coming next year.

Also, over at EETimes they have Pat Conway confirming some sort of multi-threading in Bulldozer:
The new core expands what has been the single-threaded nature of the AMD cores "in a different fashion than Hyperthreading," said Conway

Xbit mis-quoted this as 'SMT confirmed' & got a smackdown from AMD Marketing :LOL:

Edit: Updated link because he's updated the diagram.
 
I find the 4 microcode blocks in that diagram curious.

It would be much more robust than current cores that have one microcode engine that halts the front end and is capable of issuing multiple macro ops per cycle.

The idea of having 4 engines leaves open a lot of considerations.

1: Why do this? It seems that this architecture is going to lean much more heavily on microcode than previously, going by the patents.
Does this mean a regression of sorts, are the direct-path encoders weaker than they have been in the past, or is it that there are going to be a ton of new microcode entries?
There are patents that are interesting, if vaguely reminiscent of failed CISC architectures that went for extremely programmable microcode engines.

2: How will this be implemented?
A brutish 4-way microcode engine that generates 4 macro ops a clock, or 4 1-way microcode engines, presumably each making 1 op a clock?
The same goes with the microcode ROM.
4-ported, or will we instead see 4 copies of the ROM?
The implications for each are interesting from a timing and space concern.

3: What will this accomplish?
Microcode engines currently are designed to fit an occasional-use model. The scheme here would be expanded beyond what is generally needed.
This amount of resources dedicated to this would mean that AMD intends something far more intensive, perhaps related to the other patents on microcoded control of new processing units.
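To make consideration 2 concrete, here's the single-engine model in toy C form (all structures invented by me for illustration): one sequencer walks the ROM emitting one macro-op per cycle, and the direct-path decoders sit idle for the duration. Four engines, or a 4-ported ROM, would let four such walks overlap.

```c
#include <stdio.h>

typedef struct { int op; int is_last; } macro_op_t;

/* toy microcode ROM: one sequence of three macro-ops */
static const macro_op_t ucode_rom[] = {
    { 10, 0 }, { 11, 0 }, { 12, 1 },
};

static void issue(macro_op_t m) { printf("issue op %d\n", m.op); }

/* single-engine model: stream macro-ops from the ROM, one per cycle,
 * until the sequence ends; the front end stalls the whole time */
static int run_ucode(int entry)
{
    int cycles = 0;
    const macro_op_t *mop = &ucode_rom[entry];

    do {
        issue(*mop);    /* one macro-op per cycle */
        cycles++;
    } while (!(mop++)->is_last);

    return cycles;
}

int main(void)
{
    printf("front end stalled for %d cycles\n", run_ucode(0));
    return 0;
}
```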
 
From the wiki (http://en.wikipedia.org/wiki/Microcode), a few random snippets about microcode:

A similar approach was used by Digital Equipment Corporation in their VAX family of computers. Initially a 32-bit TTL processor in conjunction with supporting microcode implemented the programmer-visible architecture. Later VAX versions used different microarchitectures, yet the programmer-visible architecture didn't change.

Many RISC and VLIW processors are designed to execute every instruction (as long as it is in the cache) in a single cycle. This is very similar to the way CPUs with microcode execute one microinstruction per cycle. VLIW processors have instructions that behave similarly to very wide horizontal microcode, although typically without such fine-grained control over the hardware as provided by microcode. RISC instructions are sometimes similar to the narrow vertical microcode.

Wild guesses from a layman:

AMD tries to use different backends for different markets (Bobcat, Bulldozer), and/or they move to some form of VLIW system with a microcode frontend (something like Transmeta, but with better hardware support) to be able to use the graphics hardware in Fusion within normal x86 programs.
 
Should this Bulldozer core be considered a good design for a desktop processor that has to score well in game benchmarks? Two 2-unit integer groups sharing one float unit... games tend to use more floating-point operations, don't they?
 
The FPUs are 256 bits wide ;)
 
There is probably more than one issue port for the FPU.

Cheers.
 