AMD Bulldozer Core Patent Diagrams

K.I.L.E.R · Oct 12, 2011

The performance of BD seems nice overall. I think I will upgrade to BD. I cannot wait until it is in my local PC store. I was going to get an Intel 6-core, but BD is much better. Those benchmarks are staggering, but having only 2 fp units per inner-core is a huge mistake. There are only 4 cores within BD, each having two inner cores. I want BD with 6-8 large cores, which would make 12-16 total cores, and then compare performance, once the memory bandwidth limit is also lifted. Unfortunately the number of fp units is holding it back in gaming, but then again, since when has PC gaming been big these last few years?

hoho · Oct 12, 2011

K.I.L.E.R said:
I want BD with 6-8 large cores, which would make 12-16 total cores, and then compare performance, once the memory bandwidth limit is also lifted.

You do realize that it took AMD ~2B transistors for that 4 module/8 core CPU vs Intel's ~1.2B transistors for a CPU with 6 real cores?

mczak · Oct 12, 2011

hoho said:
You do realize that it took AMD ~2B transistors for that 4 module/8 core CPU vs Intel's ~1.2B transistors for a CPU with 6 real cores?

A big part of the difference in transistors (and die size too) is due to BD having more cache (and Intel still seems to pack it quite a bit more densely). Given the latency differences, the smaller cache of Intel's chip is most likely helping performance overall too, for basically any workload...
btw I think I slightly misunderstood what exactly is going on with the L1I problem. In any case, more associativity would help there.

Rootax · Oct 12, 2011

hoho said:
You do realize that it took AMD ~2B transistors for that 4 module/8 core CPU vs Intel's ~1.2B transistors for a CPU with 6 real cores?

K.I.L.E.R's post was not sarcasm / irony?

mczak · Oct 12, 2011

Hmm, interesting: Anand mentions future consumer CPUs might not have an L3 cache.
I previously thought Piledriver might have either smaller L2 caches and an L3 cache shared with the GPU, or no L3 cache at all - looks like AMD is going for the second option. That is certainly simpler but will probably allow Intel to catch up further on the IGP side.

Gubbi · Oct 12, 2011

mczak said:
Hmm, interesting: Anand mentions future consumer CPUs might not have an L3 cache.

The L3 is so slow that I have a hard time seeing it making a difference in single-socket scenarios.

It will make a difference in multiple socket systems where a bloom filter can be used to lower effective latency of coherence queries.

BD looks completely unsuited for desktops atm.

Cheers

3dilettante · Oct 12, 2011

I think low-level analysis may help discern whether BD's branch prediction is improved any over previous chips, but not enough to hide the penalties of the long pipeline. I've not read all the articles yet, so I may have missed more analysis. Anand's N-queens benchmark is branch-heavy, but it is possible other factors could be impinging on the performance.

The multithreaded case is the strong suit for BD, usually.
Single-threaded is significantly worse than I expected. I was expecting at least parity there with the previous gen per-clock.

The improvement is pretty spotty. I expected something below SB, but not some of the below K10 results.
For a lot of users, Thuban is a better purchase, and for all but a few workloads, BD at best keeps up with a Sandy Bridge that has one hand tied behind its back.

It's a good thing Intel hasn't put out an extra speed grade or two for SB, and the SB-E line is a little late in coming.
I think the server benchmarks should be more interesting, since so much of this design is meant for that market.

The power numbers and clocks listed do not reflect well on the GF process, and it's going to be a long time until 22nm, if that works any better.

Alexko · Oct 12, 2011

trinibwoy said:
I genuinely feel sorry for AMD and the guys who worked on Bulldozer. These reviews are painful to read. So many engineers worked for so many years on this.....they must feel terrible.

Yeah, that must be a very unpleasant feeling. But Piledriver is already pretty much done, and Steamroller must be well underway. Hopefully, results from those two are cheering them up.

At least I fraking hope so.

fellix · Oct 12, 2011

http://forums.anandtech.com/showpost.php?p=32393225&postcount=77

based on these observations I'll say that Bulldozer supports AVX-256 just for compatibility's sake but it is probably better (TBC) to not enable AVX-256 for Bulldozer targets. It gives a refreshing new perspective on the issue of the Intel compiler enabling SSEx optimization only on Intel CPUs, since in this case it may well be a *legit optimization to disable AVX-256 for Bulldozer*, i.e. not only rely on the features flag but to look at the manufacturer string ("GenuineIntel", "AuthenticAMD")

Apparently, BD will require a different compile target for ordinary AVX code to perform well.
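
To make the quoted suggestion concrete, here is a minimal sketch of a dispatcher that looks at both the AVX feature flag and the CPUID vendor string before picking a vector width. The function name, the path labels and the policy itself are hypothetical and only for illustration; the one hard fact used is that CPUID reports the vendor as a 12-character string with no space in it.

```python
# Hypothetical dispatch policy, not taken from any real compiler or library.
INTEL = "GenuineIntel"  # CPUID vendor string reported by Intel CPUs
AMD = "AuthenticAMD"    # CPUID vendor string reported by AMD CPUs


def choose_simd_path(vendor: str, has_avx: bool) -> str:
    """Pick a code path label from the vendor string and the AVX flag."""
    if not has_avx:
        return "sse2"    # no AVX at all: fall back to legacy SSE code
    if vendor == AMD:
        return "avx128"  # assumption: 128-bit AVX suits Bulldozer better
    return "avx256"      # any other CPU with AVX: full-width path


print(choose_simd_path(AMD, has_avx=True))
print(choose_simd_path(INTEL, has_avx=True))
```

Keying on the vendor alone is the crudest possible version of the idea. A real dispatcher would presumably also look at family/model, so that a later AMD core with full-width units is not stuck on the 128-bit path.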

Ninjaprime · Oct 12, 2011

How does this even happen? Thuban is 907 million transistors, BD is 2 billion, and it is slower in some cases? You'd think they could have just macro'd 2 more cores onto Thuban with the improvements from the A-series core and hit around 1.2 billion transistors and a die size of roughly 240mm2, with higher clocks than Thuban, and it would blow BD away most of the time.

fellix · Oct 12, 2011

Considering BD is stuffed with mostly useless slow cache, a good chunk of those 2B transistors are just heating the air for the winter season.

3dilettante · Oct 12, 2011

fellix said:
http://forums.anandtech.com/showpost.php?p=32393225&postcount=77

Apparently, BD will require a different compile target for ordinary AVX code to perform well.

This was indicated by the GCC patches that were found earlier in the year, and BD's mediocre AVX support came up in this thread as well.

digitalwanderer · Oct 12, 2011

fellix said:
Considering BD is stuffed with mostly useless slow cache, a good chunk of those 2B transistors are just heating the air for the winter season.

Well hey, winter is coming...

mczak · Oct 12, 2011

fellix said:
http://forums.anandtech.com/showpost.php?p=32393225&postcount=77

Apparently, BD will require a different compile target for ordinary AVX code to perform well.

I think that's not too surprising - after all the data paths are just 128bit. So, if your code does not quite scale to 8-wide execution perfectly (say it needs 0.55 times the instructions it would with AVX-128) then this will be nearly twice as fast on SNB, but definitely slower on BD. Even if it does scale perfectly, it could be slower - while it saves half the decode bandwidth and the decoded operations will take up only half the slots in the scheduler queue, BD needs to execute AVX-256 instructions on both FMAC pipes simultaneously (*), whereas if you had AVX-128 instructions it has more leeway to reshuffle instructions to hide latencies.

(*) Not sure it operates like that. Another possibility is that AVX-256 instructions are broken down into 128-bit wide instructions - in which case it wouldn't save any slots in the scheduler queue, but it should not be slower either for "perfect parallel scaling" code vs. AVX-128 code.
Of course, AVX-128 itself isn't really much different to just using SSE2 - you get much nicer instruction encoding (which means you can avoid some movs) but overall the throughput shouldn't vary that much.
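
The arithmetic behind the 0.55 example can be sketched as below. This is a deliberately crude model that counts only execution slots: it assumes a 256-bit op costs one slot where the units are full width (SNB) and two slots where the data paths are 128-bit (BD), and it ignores decode, scheduling and latency effects entirely. The function and parameter names are made up for illustration.

```python
def relative_time(instr_ratio: float, cost_per_256_op: float) -> float:
    """Run time of AVX-256 code relative to the AVX-128 version (1.0 = equal).

    instr_ratio: AVX-256 instruction count divided by the AVX-128 count.
    cost_per_256_op: execution slots one 256-bit op needs, measured in
    slots of a 128-bit op.
    """
    return instr_ratio * cost_per_256_op


snb = relative_time(0.55, 1.0)  # assumed: full-width units, cost 1
bd = relative_time(0.55, 2.0)   # assumed: 128-bit data paths, cost 2

print(f"SNB: {snb:.2f}x the run time of the AVX-128 version")
print(f"BD:  {bd:.2f}x the run time of the AVX-128 version")
```

Under these assumptions the 256-bit code takes 0.55 of the time on SNB (nearly twice as fast) and 1.10 of the time on BD (about 10% slower). Only at an instruction ratio of exactly 0.5 does BD break even in this model.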

3dilettante · Oct 12, 2011

It's likely that the op is split into two 128-bit uops.
One negative is that the registers are not 256 bits, so the rename capacity and latency hiding of the OoO engine are reduced because it can only rename half as many registers.

There can be more contention for issue ports as well, since shuffle instructions have to fight with arithmetic ops for one of the FMAC ports. Since two ports are occupied by AVX-256, the arithmetic op takes longer to process. Various instructions that shuffle or insert values within and between YMM registers may take longer, reducing throughput further.

mczak · Oct 13, 2011

3dilettante said:
It's likely that the op is split into two 128-bit uops.

That would be the more traditional approach (the same way SSE2 was handled on pre-Barcelona A64). dkanter's article actually suggested both methods might be a possibility.

One negative is that the registers are not 256 bits, so the rename capacity and latency hiding of the OoO engine are reduced because it can only rename half as many registers.

Oh yes I forgot about that. Though if you had "naturally 8-wide" code and just executed it as AVX-128 you'd end up with the same number of registers really as you'd just need more XMM regs instead.

There can be more contention for issue ports as well, since shuffle instructions have to fight with arithmetic ops for one of the FMAC ports. Since two ports are occupied by AVX-256, the arithmetic op takes longer to process. Various instructions that shuffle or insert values within and between YMM registers may take longer, reducing throughput further.

So all in all AVX-128 probably will always be a (slight) win over AVX-256 on this chip. The increased decoder bandwidth this needs probably is of little consequence.

btw one area where BD excels (at least!) is the SIMD INT scores. SNB (and AVX) cheaped out on these, so they are only 128 bits wide, and BD has improved its SIMD int units. Despite having to make do with only 4 FP units, it still manages to turn in a score 3 times higher than the 6-core Phenom II (well, in SiSoft Sandra that is). I actually do not quite understand why, since Barcelona (and up) should have been able to execute 2 SIMD int instructions per clock just as well. I guess being able to do two muls or adds simultaneously helps, but the increase over Phenom II is much larger even taking this into account. I don't think the test was actually using the IMAC (as the big increase was there both with and without AVX code), though that could definitely lead to a big increase (I was looking at techreport numbers, http://techreport.com/articles.x/21813/17). Though maybe it was using the IMAC, in which case the score isn't all that impressive. Interestingly, AIDA64, despite specifically mentioning that it uses XOP, does not show any such gain at all over Phenom II, though the score is good regardless.

edit: actually I think I've got that wrong, it's confusing in the BD optimization manual (that there are obvious copy/paste bugs around the important areas doesn't help matters either). The chip has a mysterious "MMA" execution unit (mapped to pipe 0) which is completely missing from the overview (mislabeled as another FMA I think, though in reality it's probably the same unit anyway) and seems to handle all SIMD int mul related operations (muls and macs). If I interpret that right, this means you could actually issue 3 SIMD int ops per cycle: 2 adds (or ands, ors, masks, other simple stuff) and 1 mul/mac. Not 2 SIMD int muls per clock, however, but in any case it would be quite an improvement with the right mix of instructions (there seems to be no way to issue a 3rd int add to that MMA unit though, despite it obviously having an adder). Still does not explain the Sandra Multimedia Int score though.
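
What "the right mix of instructions" means under that reading can be sketched with a toy issue model. It assumes exactly the interpretation given in the post (two pipes that take simple SIMD int ops, one separate unit that takes muls/macs, and no way to send a simple op to the mul/mac unit); none of this is confirmed by the manual, it ignores dependencies and latencies, and the function name is hypothetical.

```python
import math


def simd_int_cycles(simple_ops: int, mul_ops: int) -> int:
    """Lower bound on issue cycles under the assumed 2 simple + 1 mul/mac model."""
    # Assumption: simple ops (add/and/or/...) share two pipes, muls and macs
    # go to one separate unit, and simple ops cannot spill onto that unit.
    return max(math.ceil(simple_ops / 2), mul_ops)


for simple, mul in [(200, 100), (300, 0), (0, 300)]:
    cycles = simd_int_cycles(simple, mul)
    print(f"{simple} simple + {mul} mul: {cycles} cycles, "
          f"{(simple + mul) / cycles:.1f} ops/cycle")
```

In this model only a 2:1 mix of simple ops to muls reaches 3 ops per cycle. Pure adds are capped at 2 per cycle and pure muls at 1 per cycle, which matches the "not 2 SIMD int muls per clock" caveat above.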

3dilettante · Oct 13, 2011

mczak said:
actually I think I've got that wrong, it's confusing in the BD optimization manual (that there are obvious copy/paste bugs around the important areas doesn't help matters either). The chip has a mysterious "MMA" execution unit (mapped to pipe 0) which is completely missing from the overview (mislabeled as another FMA I think, though in reality it's probably the same unit anyway) and seems to handle all SIMD int mul related operations (muls and macs).

Parts of the guide do indicate that FPU pipe 0 has a 128-bit IMAC. It makes sense that AMD built the FMAC in that pipe more robustly to function as an IMAC when needed.
The copy/paste errors in AMD's documentation are pretty galling. While it is true that such optimization guides are not the rule, paying someone who knows the architecture to proofread seems like such a low-cost measure, especially for an architecture that needs more handholding than others.

I look forward to some in-depth analysis of this design, and why it does seem to underachieve despite some numerical advantages. I had thought it would do better against Westmere in single-threaded than it did, but it doesn't seem to press any of the advantages it has. Why?

The granularity of data could be much better.
One thing not commented on in the reviews I've seen is the improved event monitoring for this architecture. Some of the discussion about the (discarded) x264 XOP branch indicated that the performance monitoring was very helpful.

Hopefully this does lead to better tools. Realworldtech did run some performance monitoring tools in a CPU comparison, and at the time AMD's tools and performance counters were considered to be less reliable.

fellix · Oct 13, 2011

Source: Xbit Labs

A useful ALU performance comparison on an equal clock and thread basis. Integer SIMD throughput in BD is far from stellar in this case, even against a single SNB core.

mczak · Oct 13, 2011

fellix said:
Source: Xbit Labs

A useful ALU performance comparison on an equal clock and thread basis. Integer SIMD throughput in BD is far from stellar in this case, even against a single SNB core.

The SIMD int results do not correspond at all with the results from techreport (for all CPUs except Thuban, and of course taking core count and frequency into account), hence I can only assume it's due to disabling everything beyond SSE3 (techreport only disabled AVX). So maybe the techreport results are using the IMAC after all. Even then the difference is gigantic and cannot be solely due to the IMAC (even SNB gets roughly twice the performance in the techreport results).

swaaye · Oct 13, 2011

I want to see a per clock comparison against Kentsfield.
