AMD Bulldozer Core Patent Diagrams

To me this whole article sounds like: they didn't invite us to Dresden, so let's write something bad about them.

All the pictures and quotes are taken from a YouTube video published two days ago by AMD.
People there were shown some perf. numbers.
 
Well, they're discarding any hint of credibility they had left with respect to CPU reviews. More power to them if they want to be treated as a tabloid!


Where did the 3-sources rule go? Lazy sources, lazy writing, rehashed info that sounds more akin to a forum post than credible analysis...
 
Ugh, why is everybody selectively missing that the piece is speculative in nature? And I'm having serious trouble seeing where all the oozing AMD CPU goodness that Xbit apparently ignores is hiding. They missed their schedule (no, the lame excuse about prioritizing Llano does not hold). They'll still be in the position of competing for the cheap sockets Intel doesn't really care about all that much. Interlagos performance targets just got downgraded. Having Ruby in a comic doesn't really inspire that much confidence by itself, does it?
 
Years ago, when Bulldozer was first announced, I was sceptical about how AMD was going to succeed with an architecture that runs just a single thread per core (the shared floating point coprocessor isn't really SMT). All the other high performance CPU manufacturers have been running two hardware threads per core for some time: IBM since Power5, Intel since the Pentium 4, Sun since SPARC T1/Niagara. Now Power7 runs four threads per core and SPARC T3 runs eight. AMD has the only server/HPC processor that doesn't do SMT. It's hard to keep pipeline efficiency high without any kind of SMT.

After reading the 300+ page AMD Bulldozer optimization guide, I feel that they have made many good choices in their architecture. Several sources have criticized the shared floating point coprocessor design, stating things like a "module is just 1.5 cores" and "it will not perform well in games and multimedia applications". The missing ingredient is this: it's really hard to keep the FPU/SIMD pipeline full all the time. Floating point and especially SIMD instructions tend to have higher latencies than basic (integer and logic) instructions, so there are longer dependency chains and more unfilled pipeline slots, even in code that is doing "all maths". Also, no program will have 100% FPU/SIMD instructions, since you still have to calculate addresses (array indices, pointers, etc.), do logic operations, load/store data and branch. Sharing the floating point coprocessor (which only executes a subset of the instructions) between two threads will improve floating point pipeline efficiency drastically (there are more empty slots available to fill than in a fully shared SMT pipeline).
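To put the "unfilled slots" point in concrete form, here's a trivial made-up example (plain scalar C++, nothing Bulldozer specific): a serially dependent FP accumulation can only have one add in flight at a time, so most FP pipeline slots stay empty, while splitting the work across independent accumulators fills more of them. Those are exactly the slots a second thread's FP instructions could occupy on a shared unit.

```cpp
#include <cstddef>

// Latency bound: every add depends on the previous result, so the FP pipe
// mostly sits idle waiting for one add to finish before the next can start.
float sum_serial(const float* v, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        s += v[i];                      // one long dependency chain
    return s;
}

// Four independent accumulators keep several adds in flight at once, filling
// more FP pipeline slots per cycle (note: this changes rounding slightly,
// since floating point addition is not associative).
float sum_interleaved(const float* v, std::size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i + 0];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; ++i)
        s0 += v[i];                     // remainder
    return (s0 + s1) + (s2 + s3);
}
```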

According to some of our profiler captures, the thread that executes the most FPU/SIMD code in our game is the visibility culler thread. It has around 15% FPU instructions and around 20% SIMD instructions, so 35% of its instructions go to the FPU/SIMD pipeline. We have three threads with similar FPU/SIMD usage (particle processing, physics, graphics rendering), while the remaining two threads have only a few percent of FPU/SIMD instructions. Assuming other games have similar FPU/SIMD usage patterns, it would be safe to say that sharing the floating point coprocessor between two cores shouldn't reduce Bulldozer's gaming performance at all compared to having a separate floating point unit in each core...

Actually, it might be the opposite. The Bulldozer FPU/SIMD coprocessor has improved a lot compared to AMD's previous designs, and those weren't that bad either. The optimization guide states that the Bulldozer FPU/SIMD coprocessor has four times the throughput of the previous AMD architecture. So instead of having a smaller FPU/SIMD unit on each core that is badly utilized (dependency stalls, cache stalls, LHS stalls), they chose to have a single larger unit and feed it with the FP/SIMD instructions of two cores. This yields a much better utilization rate, and should also yield better single core FP/SIMD performance (the shared FPU/SIMD coprocessor has higher performance than the old per-core dedicated units).
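A quick back-of-the-envelope check, using my own profile numbers from above and the guide's 4x throughput figure (so the arithmetic is mine, not AMD's):

```cpp
#include <cstdio>

int main() {
    // Assumed per-thread instruction mix from our profile: ~15% FPU + ~20%
    // SIMD, i.e. about 35% of instructions go to the FP/SIMD pipeline.
    const double fp_fraction = 0.35;

    // Old style: a dedicated FP unit per core sees only its own thread.
    const double dedicated_demand = fp_fraction;        // 0.35 of issue rate

    // Bulldozer style: one shared FP unit per module sees both threads,
    // but the guide claims roughly 4x the throughput of the old unit.
    const double shared_demand = 2.0 * fp_fraction;     // 0.70 of issue rate
    const double shared_throughput = 4.0;               // relative, per guide

    std::printf("dedicated unit: demand %.2f, throughput 1.0\n",
                dedicated_demand);
    std::printf("shared unit:    demand %.2f, throughput %.1f\n",
                shared_demand, shared_throughput);
    // With these (rough, assumed) numbers the shared unit has more headroom
    // per FP instruction than the old per-core units did, which is why I
    // don't expect the sharing to hurt this kind of workload.
    return 0;
}
```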

Many other optimizations in the Bulldozer architecture seem to be targeted at reducing integer/logic pipeline stalls. The processor has improved LHS (load-hit-store) stall reduction, improved memory/cache systems (to reduce cache stalls), wider out-of-order windows (to reduce dependency stalls), etc. AMD has clearly chosen to reduce pipeline stalls instead of populating the pipeline with instructions from two threads. The more efficient the architecture, the less SMT gains you. Simple in-order CPUs (Atom and the console CPUs) gain the most from SMT, since they have more dependency stalls (no runtime instruction reordering), more LHS stalls (less sophisticated data/register management) and more cache stalls (no sophisticated data prefetchers). Sandy Bridge doesn't gain that much from SMT, since it's designed to keep the pipeline full for high IPC (lots of transistors for management). AMD chose to go with simpler integer/logic cores, have more of them, clock them higher and share a beefy FPU/SIMD unit between each pair of cost efficient cores. I was hoping they would go as high as Power6 in clocks (5 GHz), since one of the original goals of the Bulldozer architecture was to allow higher clock frequencies.
 
Honestly I'm hoping it's got Nehalem level performance at least. If it doesn't... Man that's just sad. From everything I've read, though, it seems to hit that level.
 
Thanks for the insights sebbbi, that actually makes Bulldozer sound quite exciting. I really hope it's playing in the same league as Sandy/Ivy Bridge when it launches. Given the clock speed headroom on those chips, it probably needs to be 20% faster than today's fastest Sandy Bridge just to compete at the high end. Which is probably asking way too much!
 
I only really have two issues with the BD design:

1. The large and slow L2 cache. Load-to-use latency of 18-20 cycles for the 2MB L2 cache seems very server centric. For desk-/laptops I think the 12 cycle 256KB L2 of Sandy Bridge is a better choice (I have no data, so just a hunch).

2. The L2/L3 implementation. AFAICT when a core misses in its own L2 cache it probes the L3 for data. If it misses in L3 then the other L2s are probed. The optimization guide states that if you have producer-consumers in different 2-core modules, the queue/ring buffer should be big enough for the producer to spill data from its L2 to L3 so that the consumer doesn't incur the extra latency penalty when reading from the queue. However the queue/ring buffer shouldn't be so big that data is spilt from L3 to main memory. The guide specifically says that a ring buffer should be 2-6MB in size for a CPU with 16MB L3 cache.

WTF !?!? There is no way that programmers are going to get that right. First of all, AMD is the trailing platform, which means programs are primarily optimized for Intel platforms. Secondly, if you go through the effort of parallelizing your program with lots of threads, you are likely going to have multiple queues, not a single big fat one. Especially true if you use third party software (and who doesn't these days ?)

The obviously right thing to do is to probe both the L3 and the L2s of the other modules, in parallel, on an L2 miss.
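Just to spell out what the guide is actually asking application programmers to do, here is a rough sketch of a "correctly sized" cross-module queue (entirely my own made-up code and names; only the 2-6 MB sizing window for a 16 MB L3 comes from the guide):

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Single-producer / single-consumer ring buffer meant to be shared between
// two threads pinned to DIFFERENT modules. Capacity is deliberately chosen
// inside the 2-6 MB window the guide recommends for a 16 MB L3: large enough
// that the producer's writes get evicted from its L2 into the shared L3,
// small enough that they don't get pushed out to main memory.
struct CrossModuleQueue {
    static constexpr std::size_t kCapacityBytes = 4 * 1024 * 1024;  // ~4 MB
    struct Item { unsigned char payload[64]; };                     // 64 B each
    static constexpr std::size_t kSlots = kCapacityBytes / sizeof(Item);

    std::vector<Item> buf = std::vector<Item>(kSlots);
    std::atomic<std::size_t> head{0};   // advanced by the consumer
    std::atomic<std::size_t> tail{0};   // advanced by the producer

    bool push(const Item& it) {         // producer thread only
        std::size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == kSlots)
            return false;               // full
        buf[t % kSlots] = it;
        tail.store(t + 1, std::memory_order_release);
        return true;
    }

    bool pop(Item& out) {               // consumer thread only
        std::size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire))
            return false;               // empty
        out = buf[h % kSlots];
        head.store(h + 1, std::memory_order_release);
        return true;
    }
};
```

And that capacity constant only works for one particular L3 size and one queue, which is exactly my point.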

Cheers
 
Yep, the BD cache subsystem is definitely a long way from the elegant and robust solution in Nehalem and post-Nehalem architectures from Intel.
 
Also, no program will have 100% FPU/SIMD instructions, since you still have to calculate addresses (array indices, pointers, etc.), do logic operations, load/store data and branch.
BD has tradeoffs in this area as well. Address calculation in the multithreaded case is better, but a single thread is only equivalent to one SB core and has a lower peak than a K8 (various other inflexibilities in that core aside).
Branching, permutation, and data load/store are a mixed bag.
There's one front-end, so branching is not better than a hypothetical 2 core solution. There is only one pipe for shuffle operations, so this is a step down, although the pipe does have a more capable permute instruction. In code that uses shuffles and blends, particularly if it targets the multiple shuffle units of Intel architectures, this could penalize BD.
The load/store bandwidth is that of a single core.
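As a made-up illustration of the kind of shuffle-heavy code I mean (not from any real codebase), a 4x4 float transpose (the usual AoS-to-SoA step) is essentially nothing but shuffles/unpacks, so its throughput tracks the number of shuffle pipes:

```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_loadu_ps, _MM_TRANSPOSE4_PS, ...

// Transpose a 4x4 block of floats stored row-major. _MM_TRANSPOSE4_PS
// expands into a sequence of shufps/unpck operations, so this function is
// dominated by shuffle-port throughput rather than arithmetic.
static void transpose4x4(float* m) {
    __m128 r0 = _mm_loadu_ps(m + 0);
    __m128 r1 = _mm_loadu_ps(m + 4);
    __m128 r2 = _mm_loadu_ps(m + 8);
    __m128 r3 = _mm_loadu_ps(m + 12);

    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);  // all shuffles/unpacks, no math

    _mm_storeu_ps(m + 0,  r0);
    _mm_storeu_ps(m + 4,  r1);
    _mm_storeu_ps(m + 8,  r2);
    _mm_storeu_ps(m + 12, r3);
}
```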

Assuming other games have similar FPU/SIMD usage patterns, it would be safe to say that sharing the floating point coprocessor between two cores shouldn't reduce Bulldozer's gaming performance at all compared to having a separate floating point unit in each core...
Two separate cores could branch better, shuffle better, and read/write better, depending on the mix.

Actually, it might be the opposite. The Bulldozer FPU/SIMD coprocessor has improved a lot compared to AMD's previous designs, and those weren't that bad either. The optimization guide states that the Bulldozer FPU/SIMD coprocessor has four times the throughput of the previous AMD architecture.
Is half of that based on using FMA?

The processor has improved LHS stall reduction systems, improved memory/cache systems (to reduce cache stalls),
Some things are better, some less so.
The cache is more scalable with regards to clock and power, which is likely why the L1 latency went up even after quartering its size.
The L2 is bigger, but slower.
The L3 appears like it could be partitioned, but we'll have to wait and see if this helps with the latency numbers, which are at best not very good for AMD's current chips.

There are some very nasty corner cases for write combining, so much so that AMD's optimization guide flat-out states there's going to be a new version that won't be so horrible.

edit:
On the plus side, there are some serious bandwidth and utilization problems with the current AMD northbridges and memory controllers, which BD can't help but improve.

AMD chose to go with simpler integer/logic cores, have more of them, clock them higher and share a beefy FPU/SIMD unit between each pair of cost efficient cores. I was hoping they would go as high as Power6 in clocks (5 GHz), since one of the original goals of the Bulldozer architecture was to allow higher clock frequencies.
The clocks promised so far are higher, but not massively higher than competitors and predecessors.
The more likely goal was not higher clock speeds, but a pipeline that could maintain mostly equivalent IPC and modest clock gain without having as good a process nor the same level of custom circuit design as other high-performance x86 chips.

We'll need to wait on the final clocks and performance figures, particularly for the desktop market for which it seems less suited.
The top FX SKU seems to be priced around the level of a 2600K.
The AMD chip would come with a 30% larger die and 30% higher TDP.
This is somewhat higher than I had originally thought. I thought it would be between Westmere and Sandy Bridge, though closer to Westmere.
It might be a shade higher.

One thing I did not anticipate was the lack of clock scaling for Intel from the release clocks.
Aside from the Xeon model that shuts off the GPU for an extra speed grade, the clocks have not advanced where I had expected 1 or 2 bumps in the better part of a year. Perhaps there is something that makes this problematic, or Intel hasn't seen the need yet.
 
There is only one pipe for shuffle operations, so this is a step down, although the pipe does have a more capable permute instruction. In code that uses shuffles and blends, particularly if it targets the multiple shuffle units of Intel architectures, this could penalize BD.

From 1.6.6 in the optimization guide:
....
Current AMD Family 15h processors support two SIMD logical/shuffle units, one in the FMUL pipe and another in the FADD pipe, while previous AMD64 processors have only one SIMD logical/shuffle unit in the FMUL pipe. As a result, the SIMD shuffle instructions can be processed at twice the previous bandwidth on AMD Family 15h processors. Furthermore, the PSHUFD and SHUFPx shuffle instructions are now DirectPath instructions instead of VectorPath instructions on AMD Family 15h processors and take advantage of the 128-bit floating point execution units. Hence, these instructions get a further 2X boost in bandwidth, resulting in an overall improvement of 4X in bandwidth compared to the previous generation of AMD processors.

Isn't this similar to Intel ?

Cheers
 
That part of the guide makes no sense, since there are no separate FADD and FMUL pipes, just two FMA pipes. The text likely came from the previous generation's guide.

What is also irksome is the shoddy workmanship of AMD's documents. This is one of many places where they screwed up copy/paste and whatever proofreading they did missed it.

Descriptions elsewhere said the XBAR unit handled it and any ops that cross from the low to high sides of the register file, and that is in only one pipe.
 
I only really have two issues with the BD design:

1. The large and slow L2 cache. Load-to-use latency of 18-20 cycles for the 2MB L2 cache seems very server centric. For desk-/laptops I think the 12 cycle 256KB L2 of Sandy Bridge is a better choice (I have no data, so just a hunch).

2. The L2/L3 implementation. AFAICT when a core misses in its own L2 cache it probes the L3 for data. If it misses in L3 then the other L2s are probed. The optimization guide states that if you have producer-consumers in different 2-core modules, the queue/ring buffer should be big enough for the producer to spill data from its L2 to L3 so that the consumer doesn't incur the extra latency penalty when reading from the queue. However the queue/ring buffer shouldn't be so big that data is spilt from L3 to main memory. The guide specifically says that a ring buffer should be 2-6MB in size for a CPU with 16MB L3 cache.

WTF !?!? There is no way that programmers are going to get that right. First of all, AMD is the trailing platform, which means programs are primarily optimized for Intel platforms. Secondly, if you go through the effort of parallelizing your program with lots of threads, you are likely going to have multiple queues, not a single big fat one. Especially true if you use third party software (and who doesn't these days ?)

The obviously right thing to do is to probe both the L3 and the L2s of the other modules, in parallel, on an L2 miss.

Cheers

Since only the L3 is shared on Intel, the queue will be sized to it. And BD's L3 matches it in size.
 
[Benchmark screenshots omitted; see the blog linked below.]


http://obrovsky.blogspot.com/

Dunno how real they are, but if they are, then AMD has serious problems. The FX-8150 is slower than the Phenom II X6 at 3.3 GHz; that certainly can't be right. I'm talking about STALKER, and about how in Lost Planet, a game that the designers themselves say can use up to 8 cores, it's only 1 fps faster despite being clocked 300 MHz higher than the 6-core Phenom.

There are more tests at this guy's blog.
 
Game testing in heavily CPU-limited scenarios (such as those low resolutions) is purely academic; I wouldn't worry too much about that. However, those numbers look a bit low for a part that was supposed to turbo to 4.2+ GHz; maybe it easily gets TDP limited.
 
That part of the guide makes no sense, since there are no separate FADD and FMUL pipes, just two FMA pipes. The text likely came from the previous generation's guide.
Holy crap, you're right.

Section 2.11 in the same document states that permute, pack and shuffle operations are handled by FP unit 1.

Cheers
 
Game testing in heavily CPU-limited scenarios (such as those low resolutions) is purely academic; I wouldn't worry too much about that
Considering we want to compare CPU speeds, we are dealing with academic stuff anyway, and we don't care that games are mostly GPU dependent :)
Though wasn't that the same guy who said he faked some of the BD results on his blog before?
 