AMD Bulldozer Core Patent Diagrams

Albuquerque · Aug 14, 2012

Well, it appears that I'm incorrect with regards to power; I went looking earlier and didn't find anything, but I looked again this morning and did indeed find someone who tested. I'm happy to be wrong, mind you

http://www.tomshardware.com/reviews/windows-7-hotfix-bulldozer-performance,3119-7.html

Rootax · Aug 16, 2012

Do we already know if, on paper, Piledriver will reduce power consumption under load?

My old Q6600 @ 2.4 GHz is still holding up, but it gets slow when I encode a lot of stuff with HandBrake. BD is not that bad for that kind of task, but energy-wise, not so much...

(For the record, my CPU handles 3 GHz pretty well, but my crappy motherboard doesn't; it crashes at random times when an overclock is applied...)

itsmydamnation · Aug 16, 2012

relative to bulldozer
Short answer yes,
Long answer yyyyyyyyyyyyyyyyyeeeeeeeeeeeeeeeeeeeeeeeeeeeesssssssssssssss

It also depends heavily on clocks: going from the mid-3 GHz range up to 4 GHz has a nasty effect on power consumption. Piledriver offers no big improvements on the FPU, but int performance is up a fair margin across the board.

Power consumption in the limited reviews so far has been much better, with two main reasons given:

1. implementation of a resonant clock mesh
2. moving from all soft flip-flops to hard flip-flops in high-power-consumption areas.

Rootax · Aug 16, 2012

itsmydamnation said:
relative to bulldozer
Short answer yes,
Long answer yyyyyyyyyyyyyyyyyeeeeeeeeeeeeeeeeeeeeeeeeeeeesssssssssssssss

It also depends heavily on clocks: going from the mid-3 GHz range up to 4 GHz has a nasty effect on power consumption. Piledriver offers no big improvements on the FPU, but int performance is up a fair margin across the board.

Power consumption in the limited reviews so far has been much better, with two main reasons given:

1. implementation of a resonant clock mesh
2. moving from all soft flip-flops to hard flip-flops in high-power-consumption areas.

Thank you, sir.

"1. implementation of a resonant clock mesh
2. moving from all soft flip-flops to hard flip-flops in high-power-consumption areas."

Do you have any links about that? (EDIT: never mind, my google-fu helped me.)

hkultala · Aug 17, 2012

itsmydamnation said:
Hopefully they make the modules wider for both int and FP and then add SMT to each core; especially when we start talking about really wide vectors, this seems like a really good use of power-hungry resources. It also seems to me that things like their cache design suit this kind of throughput-oriented design over an absolute-minimum-latency one, and multithreaded application design seems to lean that way as well.

the added side effect of this over just having more cores is that light threaded applications benefit from the work as well.

I will likely be disappointed but one can always hope

Adding SMT to each core would make it quite hard for the OS thread scheduler to use them effectively, and all current OS thread schedulers would use them suboptimally, causing slowdowns in many programs.

And the cache design... This would really kill the L1I cache, which is already having trouble with just 2 threads.

This might be a thing for the future, but first:
1) they would have to coordinate well with Microsoft and the Linux developers to make sure the Windows and Linux thread schedulers support it well on release day.
2) they would need a big redesign of their L1I cache (well, they need that anyway).

hkultala · Aug 17, 2012

itsmydamnation said:
piledriver offers no big improvements on the FPU but int performance is up a fair margin across the board.

Actually, in the future, the biggest performance improvements from Bulldozer to Piledriver might be seen in FP-intensive code.

There is lots of software that has not been recompiled to use Bulldozer's FMA4 instructions but will get recompiled to use Haswell's (and Piledriver's) FMA3.

So in many programs Bulldozer will be using separate fadd and fmul instructions (halving its theoretical FP performance), while Piledriver will use FMA instructions.
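To make the "halving" concrete, here is a minimal Python sketch of the peak-throughput arithmetic. The configuration (two 128-bit FMA pipes, so four single-precision lanes each) is an assumption for illustration, and the function names are made up. It counts one flop per lane for a separate fadd or fmul and two flops per lane for a fused multiply-add.

```python
def peak_flops_per_cycle(pipes: int, lanes: int, fused: bool) -> int:
    """Theoretical peak flops per cycle for a set of identical FP pipes."""
    flops_per_lane = 2 if fused else 1  # an FMA does a multiply and an add
    return pipes * lanes * flops_per_lane


def dot_product_instructions(n: int, fused: bool) -> int:
    """Scalar instruction count for an n-element dot product,
    accumulating into a zero-initialized sum."""
    return n if fused else 2 * n  # n fma, or n fmul plus n fadd


# Assumed configuration: 2 pipes x 4 single-precision lanes (128-bit).
with_fma = peak_flops_per_cycle(pipes=2, lanes=4, fused=True)
without_fma = peak_flops_per_cycle(pipes=2, lanes=4, fused=False)
```

Under these assumptions the unfused code tops out at half the fused peak, and the same dot product needs twice as many FP instructions.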

itsmydamnation · Aug 17, 2012

hkultala said:
Actually, in the future, the biggest performance improvements from Bulldozer to Piledriver might be seen in FP-intensive code.

There is lots of software that has not been recompiled to use Bulldozer's FMA4 instructions but will get recompiled to use Haswell's (and Piledriver's) FMA3.

So in many programs Bulldozer will be using separate fadd and fmul instructions (halving its theoretical FP performance), while Piledriver will use FMA instructions.

First, I don't know how the first post of mine that you quoted ended up here (someone moved it :smile:).

To me (a layman) this shouldn't be that hard a problem to solve: don't move threads 0 through 3 to a hardware thread outside 0-3 (repeat for N modules). Use odd-numbered threads before even ones. If there is a conflict between data locality and issuing on an unused odd thread, fall back to the first rule.

The devil is always in the detail, but how hard could it be /Clarkson
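As a rough illustration of the rule described above, here is a minimal Python sketch. It assumes a hypothetical module with four hardware threads (two cores with two SMT threads each), numbered consecutively; `pick_cpu` and `THREADS_PER_MODULE` are made-up names, not any real scheduler's API.

```python
THREADS_PER_MODULE = 4  # hypothetical: 2 cores x 2 SMT threads per module


def pick_cpu(current, idle, n_cpus):
    """Choose a hardware thread for a software thread that must be (re)placed.

    current: hardware thread it last ran on, or None for a new thread
    idle:    set of hardware thread ids that are currently idle
    n_cpus:  total number of hardware threads
    """
    if current is None:
        candidates = range(n_cpus)  # new thread: any module is fine
    else:
        # Rule 1 (locality): never leave the module the thread already runs in.
        base = (current // THREADS_PER_MODULE) * THREADS_PER_MODULE
        candidates = range(base, base + THREADS_PER_MODULE)

    free = [c for c in candidates if c in idle]
    odd = [c for c in free if c % 2 == 1]
    if odd:
        return odd[0]   # Rule 2: odd-numbered threads before even ones
    if free:
        return free[0]
    return current      # nothing idle in range: stay put (None for a new thread)
```

For example, `pick_cpu(1, {2, 3, 5}, 8)` picks thread 3 rather than the idle odd thread 5, because 5 sits in another module and locality wins.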

But I don't see any advantage to the module system unless they make it wider; otherwise they are always going to be stuck with too-low single/light-threaded performance. Something like a 4-thread, 3-ALU-int-per-core, 3x 256-bit AVX2 module would be a floating-point monster, but as I said, I like wishful thinking.

edit: this kind of slide gives me hope......lol
http://xtreview.com/images/opteron AMD Excavator architecture 01.gif
steamroller to add SMT, excavator to widen the module.

On the L1I, I just don't get what the logic behind it was. Did they run into trouble with a new cache design, have to make a judgement call, and take the less risky path? But yes, they need to fix the L1I. They should also add either separate decoders (which has been hinted at) or more decode width and the ability to decode for both cores on the same clock. Given the deepish nature of Bulldozer, an L0I/op cache/trace cache would likely help reduce required decode and power, and maybe even help performance.

Second, I was thinking more of current and legacy workloads; I agree that FMA4 is going to go nowhere. It's kinda funny how AMD didn't even want it in the first place.

tunafish · Aug 29, 2012

itsmydamnation said:
edit: this kind of slide gives me hope......lol
http://xtreview.com/images/opteron AMD Excavator architecture 01.gif
steamroller to add SMT, excavator to widen the module.

On the L1I, I just don't get what the logic behind it was. Did they run into trouble with a new cache design, have to make a judgement call, and take the less risky path? But yes, they need to fix the L1I. They should also add either separate decoders (which has been hinted at) or more decode width and the ability to decode for both cores on the same clock. Given the deepish nature of Bulldozer, an L0I/op cache/trace cache would likely help reduce required decode and power, and maybe even help performance.

No SMT, but everything else seems to be spot on: http://www.anandtech.com/show/6201/amd-details-its-3rd-gen-steamroller-architecture

mczak · Aug 29, 2012

There's no word about a "fixed" L1I though. Yes, it's bigger, but that doesn't say anything about whether the associativity increased...

fellix · Aug 29, 2012

Well, there are basically two ways to reduce the cache miss rate: higher associativity or larger size. While the benefit of the first option diminishes sharply beyond 4 ways, the size can be bumped up as long as it fits the die-size constraints and doesn't impact the target access latency.
The L1d cache issues are no less critical in this case. Bulldozer is hampered by a higher L1d miss rate than K10, and I hope Steamroller alleviates this in some manner.
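A toy model can show both knobs at work. The sketch below is a minimal LRU set-associative cache simulator in Python; it is not a model of Bulldozer's actual caches, and the trace and the function name `miss_rate` are made up for illustration.

```python
from collections import OrderedDict


def miss_rate(addresses, size_bytes, ways, line_bytes=64):
    """Miss rate of a toy LRU set-associative cache over a byte-address trace."""
    num_sets = size_bytes // (ways * line_bytes)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        lru = sets[line % num_sets]
        if line in lru:
            lru.move_to_end(line)        # hit: mark as most recently used
        else:
            misses += 1
            if len(lru) >= ways:
                lru.popitem(last=False)  # evict the least recently used line
            lru[line] = True
    return misses / len(addresses)


# Made-up trace: three lines 32 KiB apart, touched round-robin 100 times.
trace = [0, 32 * 1024, 64 * 1024] * 100

base = miss_rate(trace, 64 * 1024, ways=2)       # three lines fight over one 2-way set
more_ways = miss_rate(trace, 64 * 1024, ways=4)  # same size, higher associativity
bigger = miss_rate(trace, 128 * 1024, ways=2)    # same associativity, double the size
```

On this particular trace the 64 KiB 2-way cache misses on every access, while either change leaves only the three cold misses. Real workloads will not be this clean.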

mczak · Aug 29, 2012

fellix said:
Well, there are basically two ways to reduce the cache miss rate: higher associativity or larger size. While the benefit of the first option diminishes sharply beyond 4 ways, the size can be bumped up as long as it fits the die-size constraints and doesn't impact the target access latency.

Yes, but certain forms of cache aliasing cannot be fixed with increased cache size at all. In fact, Bulldozer's problem with the Linux kernel, ASLR and shared libraries would get (very slightly) worse, not better, with a doubled cache size (because now you'd need bits 12-15 to stay the same, not just bits 12-14).
So I really wouldn't understand it if AMD simply doubled the cache size (well, it doesn't say how much the size was increased either, but that's the only sensible number I can think of).
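The bit ranges above follow from the size of one cache way. Here is a small Python sketch of that arithmetic, assuming 4 KiB pages and the commonly cited 64 KB, 2-way organization of Bulldozer's L1I; the helper name `alias_bits` is made up.

```python
def alias_bits(cache_bytes: int, ways: int, page_bytes: int = 4096) -> list:
    """Address bits above the page offset that still select the cache set.

    Two addresses only land in the same set if they agree on these bits.
    Sizes are assumed to be powers of two.
    """
    way_bytes = cache_bytes // ways
    first_untouched = way_bytes.bit_length() - 1  # log2(way size)
    first_above_page = page_bytes.bit_length() - 1  # 12 for 4 KiB pages
    return list(range(first_above_page, first_untouched))


current = alias_bits(64 * 1024, ways=2)    # 32 KiB per way
doubled = alias_bits(128 * 1024, ways=2)   # 64 KiB per way
more_ways = alias_bits(64 * 1024, ways=4)  # 16 KiB per way
```

Doubling the size at the same associativity adds one more bit that has to match, whereas raising the associativity at the same size removes one.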

The L1d cache issues are no less critical in this case. Bulldozer is hampered by a higher L1d miss rate than K10, and I hope Steamroller alleviates this in some manner.

L1D improvements are notably absent in that presentation (apart from store-load forwarding if you want to count that in there). But I guess there's always hope...

itsmydamnation · Aug 29, 2012

Also, having more decode throughput will increase pressure on the L1D as well. So what would be an ideal target: 32 KB a core? 48? 64?

I am surprised at no 256-bit FP ALUs.

fehu · Aug 29, 2012

This is expected to be the real Fusion architecture, in which the CPU and GPU can share computational resources. Any news on this?

liolio · Aug 29, 2012

I have one question: now that AMD has already split the instruction decoder, how much rework would it be for them to completely split the module (and thus drop CMT)?

They could still use most of the innards of BD/PD/SR, right?

I wonder because with Steamroller they won't address the issue with the L3, nor will they dual-thread the FP/SIMD scheduler. The native SIMD width still lags its Intel counterpart, and Haswell will make things worse. The L2 is still suboptimal and slow.

For AMD the L3 must not be a priority, as I guess they already acknowledge that winning big server contracts with their current products is unlikely.
I wonder if they could go further; they have already said that they no longer want to fight Intel head to head (they can't anyway; it's unclear if they have a choice, but that's not the point).

As it seems they are no longer in a position to go for the high end (high-performance and server parts), would it be that terrible for them to deliver a real mid-range CPU (looking at the whole scale from embedded to server parts)? To give an idea, something like "OK, we fight the Core i3 (dual core) with quad cores, but our quad core is actually not twice as big as Intel's dual core".

So I wonder if they could split the BD module and at the same time redesign the cache hierarchy with four cores in mind, a bit like Jaguar, which is supposed to scale up to 4 cores. That should be their "module". The cache hierarchy would look pretty much like the one in Jaguar: a shared L2 (bigger though; looking at Intel's i3 and i5, 4 MB would do fine). We still don't have data, but I would not be surprised if the L2 in Jaguar (it would not be running at half speed) offered better overall characteristics than the one in the BD module. Just make a bigger one with a more robust L2 interface; they already have something to build upon (Jaguar). Starting at 4 cores sounds sane. They could move forward later on (once everything is OK, for example once the CPUs are no longer sucking bandwidth from the memory controllers through a straw).

So half a BD module would be pretty tiny (especially once they have two decoders; there are already two L1 data caches). I think that for a while they should not try to match Intel on SIMD width. They should do what the Jaguar cores do and run instructions on 8-wide vectors at half speed. A lot of code and legacy apps won't use that for quite a while. (Either way, it's unclear whether even Excavator will fix that vs. Intel's offerings, not even taking Haswell into account.)

I feel like with Bulldozer, Piledriver and Steamroller they will already have made plenty of improvements over Phenom II.
In a sane setup (with regard to the cache hierarchy and memory), I believe those cores would prove to be better than they look when stomping on each other's feet within a module.

Exophase · Aug 29, 2012

liolio said:
I have one question: now that AMD has already split the instruction decoder, how much rework would it be for them to completely split the module (and thus drop CMT)?

Why should they? The shared decoder was probably the biggest bottleneck for module sharing code. The post-decode buffer will also alleviate fetch bandwidth contention, although that shouldn't be nearly as big of an issue.

Separating everything else shared would be a ton more work, because they're big and deeply buffered in comparison to the decoder which alternated between cores every cycle. Imagine what would go into duplicating the instruction cache and big fetch buffers, or the FPU with huge execution window and triple-issue with two FMA pipelines.. They'd have to seriously rebalance everything to fit a similar transistor budget.

liolio said:
I wonder because with Steamroller they won't address the issue with the L3, nor will they dual-thread the FP/SIMD scheduler. The native SIMD width still lags its Intel counterpart, and Haswell will make things worse. The L2 is still suboptimal and slow.

But duplicating/splitting the stuff that's still shared doesn't change any of that.. well maybe dedicated L2 caches could be faster, I don't know.

What they need is a better L1D cache but astonishingly there's no indication on that! Or have two L1.5D caches sitting between the L1D and the L2? Like at around 64 or 128KB each, with latency in between the L1D and L2? I dunno..

But you can't just say they should take the L2 from Jaguar, or anything else for that matter, because until Jaguar is designed for > 4.2GHz speeds it isn't going to work.

liolio · Aug 29, 2012

Exophase said:
Why should they? The shared decoder was probably the biggest bottleneck for module sharing code. The post-decode buffer will also alleviate fetch bandwidth contention, although that shouldn't be nearly as big of an issue.

Separating everything else shared would be a ton more work, because they're big and deeply buffered in comparison to the decoder which alternated between cores every cycle. Imagine what would go into duplicating the instruction cache and big fetch buffers, or the FPU with huge execution window and triple-issue with two FMA pipelines.. They'd have to seriously rebalance everything to fit a similar transistor budget.

Oops, my bad, I was confused. I'm so eager to see that thing come together that I forgot that a lot of the front end is sized for the two "cores" (and is also amortized over two cores).
Indeed, you would have to scale down almost everything, basically rebuilding from scratch.

It's still disheartening, because with all the improvements AMD has made across the board, had they passed on CMT they might have had something pretty sexy from the start.

Going with something more standard, they might also have had more time to scale up something more akin to Jaguar's cache hierarchy. It's sad that they are stuck with this and that it won't fly anytime soon (I mean, it's not like Steamroller is shipping tomorrow or even the day after, and I don't expect it to look that sexy vs. Haswell either).

Raqia · Aug 29, 2012

My guess is that Haswell's core will not be a dramatic change pipeline and structure-wise over the SB core, though the new FMAC instructions will give a big FP boost. It seems more focused on the un-core with its new cache structure which is rumored to have four levels and be accessible to its GPU as well. I doubt Intel would risk or want to take the time to significantly overhaul the structure of the core with so many un-core modifications on deck.

Also, Intel alternates between its Oregon and Israeli design teams, which seem to have recently been trading off on tweaking un-core and core respectively, and I think their Oregon team is up. Their last CPU was Nehalem, and they left the Core2 guts designed by the Israeli team intact, focusing on overhauling the un-core by adding the on-die memory subsystem. The Israeli team then did a major overhaul of the core with Sandy Bridge, and my guess based on their history is that the Oregon team will largely design around the pipeline flow structure present in SB.

Rumor has it that Haswell will have ~10% better performance at the same clocks over IvyBridge; since AMD claims 10-15% over each of its next few iterations, and I expect Steamroller to be closer to 15%, AMD might make up a small bit of lost ground in the next round. Excavator will give AMD some extra die-space (even at the same process) to play with from the size benefits of automation on their FPU that they claimed, so they might be able to add some goodies like higher associativity caches the round after that.

3dilettante · Aug 30, 2012

Haswell should bring an L1 that can support AVX without the bandwidth constraint experienced with SB and IB. That's on top of promoting integer SIMD to 256 bits, adding gather, and adding FMA.
Steamroller promises to make its FPU a little smaller.
For concurrent workloads, the lock elision and transactional memory support provide an opportunity for Haswell to scale its performance in heavily multithreaded integer.
AMD thus far has promised that it is not going to improve its cache or memory architecture much.

The increase in L1 Icache size is a change whose magnitude is not yet given. I'm not sure how many workloads were complaining about the 64KB on Bulldozer, however.
Doubling the decoders suggests that an increase in the width of the instruction prefetch, or in the number of prefetches, is in order.
The problem with expanding the L1 size is that the aliasing problem would worsen.

Idle thought:
They could increase the associativity and cache block size to chip away at the index bits.
Perhaps split the L1 internally with a pseudo-associative cache? That would be two 32-64KB halves/banks of 4 ways and 128-byte blocks. The associativity and block length would take down 2 bits of aliasing, then a rule that synonyms on the last bit are split between the halves.
It sounds complicated, but I'm not certain a 128KB 64-byte line 2-way cache is going to do much for this architecture.
The decoupled prediction and fetch logic would help buffer the hiccups inherent to having larger lines and the variability of having a pseudo-associative cache.

The plus side to having two decoders, if they are fed, is that Steamroller might be able to brute force through one known soft spot for Sandy Bridge, where its fetch bandwidth becomes constraining if the uop cache misses. The core that's consuming this instruction bandwidth really isn't that big a bruiser, though.

The loop buffers and expanded predictor for Steamroller would take on significance because it sounds like no matter what the front end of the pipeline is going to be longer, and anything that keeps the rest of the core from feeling it is an improvement. Sandy Bridge probably has a decently long pipeline, part of which is mitigated by the uop cache. The uop cache was hard to engineer, so that probably means AMD can't manage it yet. Loop buffers in the tradition of Nehalem would be consistent with AMD's multiyear lag behind the leader.

Exophase · Aug 30, 2012

3dilettante said:
The plus side to having two decoders, if they are fed, is that Steamroller might be able to brute force through one known soft spot for Sandy Bridge, where its fetch bandwidth becomes constraining if the uop cache misses. The core that's consuming this instruction bandwidth really isn't that big a bruiser, though.

Doubling the decoders doesn't reduce fetch stalls, in fact the opposite problem occurs since the demand on fetch is now higher. The post-decode buffer, on the other hand, does decrease demand.

My guess is that the fetch still wouldn't be a bottleneck that often, if it can really sustain 32B/cycle (for some reason, Agner Fog's tests show it as much less; maybe they've fixed this?). This would give an average of 16B/cycle/core, and between core switching plus really deep buffers you'd think it could maintain this. So it could eat quite a few larger instructions so long as it eventually balances out with smaller ones. And most of the bigger instructions would be executed on the shared FlexFP, which would probably be execution limited before fetch limited. Of course there's still a fair bit of waste in fetch bandwidth due to branches entering after the start of, and exiting before the end of, 32B blocks.
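The back-of-the-envelope numbers here can be sketched in a few lines of Python. The 32 bytes/cycle fetch figure and the two cores per module come from the discussion; the average instruction lengths are illustrative assumptions, and the function names are made up.

```python
def per_core_fetch(fetch_bytes_per_cycle: float = 32, cores: int = 2) -> float:
    """Average fetch bandwidth per core when a module's fetch is shared evenly."""
    return fetch_bytes_per_cycle / cores


def sustainable_ipc(bytes_per_cycle: float, avg_instr_bytes: float) -> float:
    """Instructions per cycle that a given fetch rate can feed, ignoring
    waste from branches entering or leaving the middle of a fetch block."""
    return bytes_per_cycle / avg_instr_bytes


share = per_core_fetch()                     # bytes/cycle available to each core
short_code = sustainable_ipc(share, 4)       # assumed 4-byte average instruction
long_code = sustainable_ipc(share, 8)        # assumed 8-byte average (e.g. heavy SIMD)
```

With an assumed 4-byte average, the per-core share feeds four instructions per cycle; at 8 bytes it drops to two, which is where deep buffers and a mix of shorter instructions would have to make up the difference.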

fellix · Aug 30, 2012

The 16B instruction fetch is not that much of an issue for Intel. Since Nehalem, there's an additional 4-entry buffer for storing fetched bytes from the i-cache that apparently is sufficient to keep the decoders busy. On the other hand, the 32B fetch in AMD's K10 was obviously overkill, but that doesn't mean it shouldn't be carried forward to an architecture that will finally benefit from it.
