AMD Bulldozer Core Patent Diagrams

Because this is where their bread and butter is for the time being, and they can make a bang (Trinity looks quite good given its neighbourhood!). Vishera or whatever won't look so good... no GPU superiority to speak of, no TDP headroom to boast about. They also probably want people considering back-to-school laptops and the like to take Trinity into account for their purchase!

You're right, Trinity would be really nice with Windows 8 on ultrabooks compared to Intel's offerings, especially when price is factored in.


I have an 8150, and if AMD can up performance 15-20% while cutting TDP in half I will ditch this chip in a second.
 
There may be progress on the TDP, with improvements both in the process and in the automatic design tools, but cutting it in half? No way, unless you underclock.

Does the stock 8150 undervolt?
 
There may be progress on the TDP, with improvements both in the process and in the automatic design tools, but cutting it in half? No way, unless you underclock.

Does the stock 8150 undervolt?

TBH I haven't checked. It just runs really hot. Trinity looks to have drastically cut down the TDP, so I'm hoping we see the same with Piledriver.
 
There may be progress on the TDP, with improvements both in the process and in the automatic design tools, but cutting it in half? No way, unless you underclock.

Does the stock 8150 undervolt?

Yes, it does, sometimes quite well, as long as you're running stock or close to it!
When you want 4.4 GHz+ then you can't, really.
 
I wouldn't expect a similar improvement for Piledriver over Bulldozer as Trinity had over Llano.
Llano had a number of problems that Bulldozer didn't, like manufacturability issues and power management that was primitive compared to Bulldozer's.

Some of the biggest improvements over Llano are in operating ranges where Llano was pretty bad, and BD isn't as bad at those points.
 
I remember that Trinity was the first AMD chip to use this "resonant clock mesh" thing:

http://techreport.com/discussions.x/22520

Given the huge quantities of power thrown at Bulldozer to make it run at ~4 GHz, couldn't this alone have the potential to help Piledriver raise clocks somewhat, while (hopefully) lowering TDP?
 
The idea sounds beautiful: you replace a harsh and rigid hierarchy with an intimate fabric that acts as an electrical pendulum, only needing a gentle "nudge". It's like transitioning from Stalinism to a step closer to actual communism.

To lower power use at very high clocks, you'll have to lower the voltage needed to reach those clocks (especially when overclocking). Whether the new clocking technique helps with that or not, I have no idea.
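
For what it's worth, the way these resonant meshes are usually described, on-die inductors form an LC tank with the mesh's capacitance, so the clock "rings" at the tank's natural frequency and the drivers only have to top up resistive losses. A back-of-envelope sketch, with made-up component values rather than AMD's actual mesh parameters, just to show why the benefit is tied to a target frequency band:

/* Toy LC-tank model of a resonant clock mesh. The mesh capacitance C and
   the added on-die inductance L ring at f0 = 1 / (2*pi*sqrt(L*C)); run the
   clock near f0 and the drivers mostly replace resistive losses, run it far
   away and the energy recycling fades. L and C below are illustrative
   guesses, not AMD's numbers. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double PI = 3.14159265358979323846;
    double C = 2.0e-9;    /* assumed total mesh capacitance: 2 nF       */
    double L = 0.8e-12;   /* assumed effective tank inductance: 0.8 pH  */
    double f0 = 1.0 / (2.0 * PI * sqrt(L * C));
    printf("tank resonance ~ %.2f GHz\n", f0 / 1e9);   /* ~3.98 GHz here */
    return 0;
}

If the real mesh is tuned anywhere near that, it would also explain why the savings are quoted for a target clock range rather than across the board.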
 
The described power savings are about 10%, which, considering the typically steeper W/MHz curve at those high speeds, might yield an extra speed step or a step and a half.
It's a bit better than the traditional tree, and better than a standard mesh that would have increased power consumption.
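
Roughly the arithmetic behind that, assuming dynamic power goes as f*V^2 and voltage has to climb more or less linearly with frequency near the top of the curve, so power goes roughly as f^3 there; the exponents are a textbook approximation, not measured Bulldozer data:

/* Back-of-envelope: spend a ~10% power saving on frequency, assuming
   P ~ f^3 near the top of the V/f curve. Purely illustrative numbers. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double f_base  = 4.0e9;   /* nominal clock, ~4 GHz                       */
    double savings = 0.10;    /* ~10% total power saved by the mesh          */
    double f_gain  = pow(1.0 + savings, 1.0 / 3.0) - 1.0;  /* extra headroom */
    printf("headroom ~ %.1f%% -> ~%.0f MHz, about one 100 MHz bin\n",
           f_gain * 100.0, f_gain * f_base / 1e6);
    return 0;
}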

I wonder if AMD likes the resonant mesh more because it allows a mesh at all for the speedy CPU. It has benefits for reducing engineering effort and increasing resistance to process variation, which have been bigger problems for AMD than the want of an extra 100 MHz.

edit: One thing I'm curious about is if this mesh winds up affecting overclocking or possibly the theoretical clock ceiling of AMD's architecture, since the mesh targets a desired range in order to achieve savings and forcing the clock higher might make things worse.
 
My curiosity is about the relation between implementing it and the automatic design tools. Was the trick easier, or possible at all, to implement because the CPU is synthesized? Can AMD adopt a few more, lesser-known tricks?

But this allows Piledriver to compete better against... Sandy Bridge. Intel's Ivy Bridge is both one generation ahead and came out months earlier.
 
My interpretation of some of the statements in the Cyclos pdf is that one of the advantages of the clock mesh is that it isn't as sensitive to the particulars of what is arranged below it, which means it can be applied to a fully synthesized or less than fully synthesized design.

The description of the reduced engineering effort and reduction in skew may allow for some additional clawback in performance by an automated design flow versus a more manual approach. It probably would remain somewhat inferior from a performance standpoint, but eliminating a fatter guardband on a synthesized design is lower-hanging fruit than squeezing even better timings out of a tree that is already close to the practical optimum.

Anandtech mentioned AMD credited the careful replacement of less efficient but skew-resistant soft-edged flip-flops with hard-edged variants where it could find timing margins. Possibly, those timing margins were easier to find with a clock mesh that puts skew reduction as a big bullet point.
 
The section on SAP is interesting in that it relays information profiled by Intel as to why its architecture has improved.
In the context of this article on Bulldozer, it does no comparison and gives no color on how those lessons apply to Bulldozer's showing, so it's a nifty but extraneous portion of the article.
It's too bad, since the Intel side has some very specific data points that could have been analyzed.
Perhaps this is a limitation of their resources, but the end result is that it's almost a non sequitur.

The article also dances a bit around the cache sub-system for BD, such as the odd write path and some last-last gen numbers for some of the writeback capabilities that could have been a factor in some of the performance regressions. Perhaps this didn't turn out to be a problem, but a lack of commentary could equal a failure to check as much as it could be a lack of a penalty.
 
The article also dances a bit around the cache sub-system for BD, such as the odd write path and some last-last gen numbers for some of the writeback capabilities that could have been a factor in some of the performance regressions. Perhaps this didn't turn out to be a problem, but a lack of commentary could equal a failure to check as much as it could be a lack of a penalty.

It would have been interesting to see how many stores result in stalls. Tunafish pointed out that workloads with stores of low spatial coherence kill performance because of the write-through policy, with the worst case being storing one byte per cache line, even if the workload is entirely contained in L1.
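
Something like the sparse-vs-dense pattern below would be the test case; the sizes and loop counts are just illustrative, not a tuned benchmark. On a write-through L1 like BD's, the sparse version gives the write coalescing cache almost nothing to merge, so every store heads toward L2 even though the whole buffer fits in the 16 KB L1D.

/* Sparse vs. dense stores over a buffer that fits in a 16 KB L1D.
   Both loops execute the same number of stores; on a write-through L1
   the sparse one (one byte per 64-byte line) cannot be coalesced, while
   the dense one merges 64 stores per line. Illustrative sketch only. */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define LINE 64
#define BUF  (16 * 1024)
#define REPS 200000L

static volatile uint8_t buf[BUF];

int main(void)
{
    clock_t t0 = clock();
    for (long r = 0; r < REPS; r++)           /* sparse: 1 byte per line  */
        for (size_t i = 0; i < BUF; i += LINE)
            buf[i] = (uint8_t)r;
    clock_t t1 = clock();
    for (long r = 0; r < REPS / LINE; r++)    /* dense: same store count  */
        for (size_t i = 0; i < BUF; i++)
            buf[i] = (uint8_t)r;
    clock_t t2 = clock();
    printf("sparse %.2fs vs dense %.2fs for the same number of stores\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}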

The only other problem seems to be thrashing of the I-cache, which we already knew.

Cheers
 
The article didn't mention that Intel's I-cache, with its higher associativity and smaller size, also avoids the aliasing problem that AMD's instruction cache has, a problem whose fix involved mucking about with memory allocation and weakening or deactivating address space layout randomization.
Whether it's a significant demerit or not depends on whether a given arrangement of libraries and code in memory hits the problem frequently, and whether compromised system security is acceptable.

Even though the performance impact is mostly tiny, that a server chip would have a performance fix that has reduced system security as one of its requirements is a black mark. There were some text-based rolleyes by some notable people when that fix was submitted.
 
Yeah. A two-way set associative instruction cache is just barely enough for a single core (it handles linear code and all single branch/jump cases well), but they share it between two cores. With two cores it will only work well (not thrash) if both cores are running perfectly linear code with no branches/jumps at all (branches to nearby code that is already cached of course work well, tight loops for example). The branch penalty is high, but the L1 instruction cache design seems to additionally degrade performance in branch/jump-heavy code.
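
To put numbers on it: the shared L1I is 64 KB, two-way, with 64-byte lines, so there are 512 sets and each way covers 32 KB. Any code addresses that are equal mod 32 KB land in the same set, and only two of them can stay resident at once; a third hot block at the same index (the other core's code, say) evicts one. A quick sketch of the set-index math, with made-up addresses:

/* Set-index math for Bulldozer's shared L1I: 64 KB, 2-way, 64 B lines
   => 512 sets, each way spanning 32 KB. Three hot code blocks whose
   addresses are congruent mod 32 KB fight over two ways, so one of them
   keeps getting evicted. Addresses below are invented examples. */
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES 64
#define SETS       512              /* 64 KB / 2 ways / 64 B lines */

static unsigned l1i_set(uint64_t addr)
{
    return (unsigned)((addr / LINE_BYTES) % SETS);
}

int main(void)
{
    uint64_t a = 0x400000;          /* hot loop on core 0               */
    uint64_t b = a + 32 * 1024;     /* core 1 code, 32 KB away          */
    uint64_t c = a + 64 * 1024;     /* a third target, another 32 KB on */
    printf("set %u, %u, %u: three blocks, two ways, one always thrashes\n",
           l1i_set(a), l1i_set(b), l1i_set(c));
    return 0;
}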
 
There's also a lack of a loop detector, although Sandy Bridge doesn't seem to have one either. Neither does as well on loops that are nested or have branches within them.
SB does have the uop cache and loop buffer, while BD has some redirect capability thanks to the buffering after the decoders.
A uop cache and loop detection would seem to be nice to have for a shared-decoder design, since they would allow one core to get out of the other's way more frequently and help limit the impact of the extra pipeline steps that the more complex decode, instruction buffering, and pick stages brought in.
Some of the additional alignment and packing restrictions AMD's decoder introduced might be partly mitigated as well.

Perhaps there were obstacles in Bulldozer's tighter timing requirements thanks to its speed-racer design, AMD's more limited engineering resources, or a chicken-and-egg problem where the predictive capability needed to tell a core to fetch from a loop buffer, or the address translation needed to reach the uop cache space, would be stuck in that decoupled front-end predictor.
 
The micro-op cache in SB is de facto a superset-type replacement for the loop detector in Nehalem. Instead of caching only a narrow window of micro-ops, it now acts like a complete I-cache structure, so the front end of the pipeline can be throttled more often, resulting in greater power savings and performance gains.
 
That's sufficient to cache the uops, but it doesn't make the prediction more accurate. Agner's profiling indicates it doesn't do as well if it encounters nested loops or loops with branches.
Nested loops are handled better with a loop detector, but some tradeoff was made, perhaps in favor of a more general increase in the size of the branch history table.
Also, according to Intel, there are allegedly loop buffers that Agner couldn't tease out with his testing.
 
Well, it's only ~6 KB in size. Not much to find there. But it's a mile better than the old trace cache, for instance, with its huge redundancy overhead and complete lack of immunity to branch mispredictions. I think Intel's reasoning was/is still much more about the power savings mantra, for the mobile SKUs, mostly. Performance is hit or miss, anyway.
 
I think Intel's reasoning was/is still much more about the power savings mantra, for the mobile SKUs, mostly.
For the P4 trace cache, I'm pretty sure it really was about performance, not power savings (after all, the whole chip was almost designed to do as little work with as many transistors as possible; ok, not quite, but close...). Though apparently in this form it was a big failure. Anyway, I'm not fully convinced the uop cache is that much better than the trace cache was (though it should have lower overhead, indeed); the Pentium 4's problem with the trace cache wasn't so much the trace cache itself but simply that it relied too much on the code being in the trace cache, because the decoder was slow as molasses.

And I'm also baffled by the low L1I associativity; it seems so obvious that it's not enough. Yet AMD didn't even fix it with Piledriver.
 