AMD Bulldozer Core Patent Diagrams

So, Sandy Bridge should still have an edge in SIMD performance with its native 256-bit AVX implementation?!
In K10, AMD went for two 64-bit (64-bit and 80-bit x87 legacy one, to be exact) FP/SIMD units, acting as one for full speed 128-bit SSEx processing, and now in Bulldozer we see similar tactics.

It's possible that AMD will have to use 2 cycles for those instructions. But I don't think that's such a big deal for a first-gen AVX product; by the time programmers pick it up, a better design will be available.

Also, Bulldozer has FMAC, which won't make an appearance on an Intel chip until Haswell, and this one instruction turns out to be quite a major advantage for floating-point throughput: http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/61121/
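
To put a rough number on the FMAC point: evaluating a polynomial by Horner's rule is a chain of multiply-then-add steps, so each step costs two instructions without a fused multiply-add and one with it. A minimal sketch in Python; the function names are made up for illustration, and it only counts operations, it says nothing about latency or any real chip.

```python
def horner(coeffs, x):
    """Evaluate a polynomial by Horner's rule (coeffs: highest degree first).

    Returns (value, steps), where steps is the number of multiply-add steps.
    """
    acc = coeffs[0]
    steps = 0
    for c in coeffs[1:]:
        acc = acc * x + c  # one multiply followed by one dependent add
        steps += 1
    return acc, steps


def instruction_counts(steps):
    """Instructions needed for the multiply-add steps, with and without FMA."""
    return {"separate_mul_add": 2 * steps, "fma": steps}


# 2x^2 - 3x + 1 at x = 4
value, steps = horner([2.0, -3.0, 1.0], 4.0)
print(value, instruction_counts(steps))
```

Halving the instruction count on this kind of dependent chain is where the throughput advantage would come from, assuming the code is actually compiled to use the fused instruction.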

AMD's STARS is way behind right now, but if Bulldozer does well, Intel won't have a major micro-architectural response until Haswell. IIRC, Sandy Bridge is supposed to be a power-efficiency-oriented part and a revision of the Core microarchitecture.
 
AMD is definitely filling a gap with the Bulldozer arch in terms of integer throughput, since it currently lags behind Intel in this regard.
I wonder what those four int pipelines in each "core" are: simple one-macro-op pipes, or something more complex. I was left with the impression that Bulldozer would dial down the count of int pipes from the K7~K10 architectural baseline of three symmetrical complex decoders (a waste in most cases) to just two. After all, a 2x2 configuration with parallel thread synchronization is viewed as much more optimal than the old one with its three almost-impossible-to-fill pipes.
 
Why would Sandy Bridge be faster? The two halves of a SIMD op in AVX would just be muxed over 2 FP units. So both AVX and SSE instructions would run normally; it's just that AVX would be executed over 2 different SSE units.

EDIT: Ah, so there is only one 256-bit FP unit per core, unlike Sandy Bridge, which will have a 256-bit unit per core. On FP benchmarks, it'll likely be murdered then. But Sandy Bridge won't have FMA, so maybe they'll catch up there.
 
Sorry, I don't think I understand that diagram.

Is that a single Bulldozer core?
Is every Bulldozer core something like a dual core? (for lack of a better term)

Explanation here,

This is a single Bulldozer core, but notice that it has two independent integer clusters, each with its own L1 data cache. The single FP cluster shares the L1 cache of the two integer clusters.

Within each integer “core” are four pipelines, presumably half for ALUs and half for memory ops. That’s a narrower width than a single Phenom II core, but there are two integer clusters on a single Bulldozer core.

Bulldozer will also support AVX, hinted at by the two 128-bit FMAC units behind the FP scheduler. AMD is keeping the three level cache hierarchy of the current Phenom II architecture.

A single Bulldozer core will appear to the OS as two cores, just like a Hyper Threaded Core i7. The difference is that AMD is duplicating more hardware in enabling per-core multithreading. The integer resources are all doubled, including the schedulers and d-caches. It’s only the FP resources that are shared between the threads. The benefit is you get much better multithreaded integer performance, the downside is a larger core.

So instead of just doubling the register file like in Nehalem, AMD has actually doubled the integer units and the L1D cache, without doubling the fetch/decode logic. That should really help with integer processing. This also puts to rest the possible issues regarding FP SIMD performance. It'll match Sandy Bridge, at least in raw throughput.
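
The "raw throughput" comparison can be sketched as back-of-the-envelope arithmetic. This is an illustration in Python under stated assumptions, not measured data: the Bulldozer figure uses the two 128-bit FMAC units from the diagram, while the Sandy Bridge figure assumes one 256-bit add pipe plus one 256-bit multiply pipe per core. The function name is hypothetical.

```python
def peak_sp_flops_per_cycle(units, width_bits, fused):
    """Peak single-precision FLOPs per cycle for a set of identical SIMD units.

    units      -- number of FP pipes that can issue each cycle
    width_bits -- SIMD width of each pipe
    fused      -- True if each pipe does a multiply and an add per lane (FMA)
    """
    lanes = width_bits // 32          # 32-bit floats per SIMD register
    flops_per_lane = 2 if fused else 1
    return units * lanes * flops_per_lane


# One Bulldozer core (both integer clusters sharing the FP cluster):
# two 128-bit FMAC units.
bulldozer_core = peak_sp_flops_per_cycle(units=2, width_bits=128, fused=True)

# ASSUMPTION: one Sandy Bridge core with a 256-bit add pipe and a
# 256-bit multiply pipe, no FMA.
sandy_bridge_core = peak_sp_flops_per_cycle(units=2, width_bits=256, fused=False)

# Same Bulldozer units running code that never uses FMA.
bulldozer_no_fma = peak_sp_flops_per_cycle(units=2, width_bits=128, fused=False)

print(bulldozer_core, sandy_bridge_core, bulldozer_no_fma)  # 16 16 8
```

Under these assumptions the two match only when the code uses FMA and only when one thread has the whole FP cluster to itself; two FP-heavy threads on the same Bulldozer core would split that peak.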

This also clears the way for slapping a dual-core Bobcat into a 7870 and making an 8P server out of it. That'll kick the GPU-GPU transfer rates (over HT, at least within a single node) into the stratosphere from the pathetic present-day levels.

Having said all that, don't get too excited about Bulldozer just yet. AMD bungled Barcelona as well in the past. And Bulldozer really is a matter of life or death for AMD. They just have to deliver more than what Intel has on offer with Sandy Bridge.
 
So Magny-Cours will launch in Q1. I wonder on what basis they are projecting such huge increases in server perf/watt.

I would imagine the increase in cores and memory bandwidth. It doesn't look terribly interesting until 2011 though. Hopefully we'll get a nice look at early numbers from the Chinese.

shared L2?... :S

I'd imagine that L2 is shared per-core and L3 is shared across all cores.
 
BTW, does anyone here have any idea why the top third (?) of the bars in the 2011 server performance projections is blurred/different color/odd? What do you think it means?
 
They probably haven't finalized clockspeeds yet.

Why would it take them >1 year to launch if Bulldozer is taped out? Also, is their 32nm SOI process far enough along to be running the chips now?

http://www.fudzilla.com/content/view/16320/35/ If you believe that, it should take at least a year, plus 32nm HKMG is a new process. You never know though, AMD could pull in the launch again just like they did before. I do think their current roadmap is artificially stretched out so they can pull off some execution "surprises."
 
So the OS sees one Bulldozer core as two processors... Would Windows XP and later recognize the difference between a full core and a "half core"? Let's say there are two FP-heavy threads. Is Windows smart enough to send one of the threads to a real core rather than both threads to the same cluster? Actually, same question for Intel's Hyper-Threading.
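
The placement policy being asked about can be sketched as follows. This is only an illustration in Python of "fill one logical CPU per physical core before doubling up"; it is not how Windows or any real scheduler is implemented, and the function name and the sibling-list input format are hypothetical.

```python
def spread_fp_threads(siblings, n_threads):
    """Pick logical CPUs for FP-heavy threads, one per physical core first.

    siblings  -- list of tuples; each tuple holds the logical CPU ids that
                 share one physical core (and therefore one FP unit)
    n_threads -- how many FP-heavy threads need a CPU
    """
    picks = []
    depth = max((len(core) for core in siblings), default=0)
    # Round-robin over cores: first logical CPU of every core, then the
    # second of every core, and so on, so sharing starts as late as possible.
    for slot in range(depth):
        for core in siblings:
            if slot < len(core) and len(picks) < n_threads:
                picks.append(core[slot])
    return picks


# Two physical cores, each exposed to the OS as two logical CPUs.
topology = [(0, 1), (2, 3)]
print(spread_fp_threads(topology, 2))  # [0, 2]
print(spread_fp_threads(topology, 3))  # [0, 2, 1]
```

With two FP-heavy threads, this policy lands them on different cores, so neither has to share an FP cluster; only the third thread is forced onto a sibling.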
 
I'm a little confused by the AMD diagrams vs Dresdenboy's diagrams. In the AMD diagram, is it 1 core with 2 int clusters and one 2x128-bit FP cluster, or is it 2 cores with 4 int clusters and two 128-bit FP clusters?

I guess everyone hopes it's the first one :LOL:
 
The four pipelines in each "cluster" are actually two INT units and two load/store units, all bolted together.
 
You hinted at it, and this is also a concern: everyone is de-emphasizing single-thread performance, yet so far both Intel and AMD have still increased single-thread performance (OK, not by much, and only through slight tweaks or through changes not directly related to it that affect the performance of the whole chip). This time, however, AMD will likely actually decrease single-thread performance (at least as far as integer performance is concerned); I doubt they'll make up for the loss of the 3rd int pipe per half-core with increased clocks. AFAIK Sandy Bridge will make no such changes, and AMD is already behind in single-threaded int performance, so it will likely be pretty pathetic compared to Sandy Bridge.
But maybe the time is right for such a change...
 
The way I see it, the two INT clusters could actually be in constant sync with each other when running instructions from the same thread. If a conditional branch or latency event comes up in one of the clusters, it would flag the whole "context", break execution and switch to another thread, while the other cluster, still running the same thread, would be "notified" of an upcoming possible pipe stall and reorder the instruction stream and data prefetch accordingly...
I guess some of the pipeline stages concerning the out-of-order hardware might be interlinked or outright shared.
A coarse-grained multithreaded environment would especially benefit from such an architecture, streamlining the performance characteristics.
 
Think of it this way. Earlier, AMD had 3 complex int units per thread. Now they have 4 (somewhat simpler) units spread across 2 threads. This will lead to higher efficiency, as the 3rd int unit saw much less utilization earlier. Now, even if 1 thread stalls, the other can continue running and hide the branch/cache miss latency. It is better than Nehalem's approach in the sense that Nehalem doubles the threads without doubling the actual execution units. But the thing is that Nehalem already had higher efficiency, so one can't say a priori whether it is a loss or a gain.
 
I completely agree it should be more efficient, and at least on paper it could potentially beat the current Nehalem for multithreaded workloads. I'm just saying that if you really have completely single-threaded (int) workloads (quite rare nowadays but still not an unrealistic scenario), it's probably not going to be a really fast chip (though even in this case it could at least be quite power-efficient, since I'd guess the inactive int half-core shouldn't use much power).
 
It wouldn't make any sense for AMD to release a microarchitecture with lower performance than K8. I'm pretty confident that AMD will be competitive. The fact that AMD has two distinct "pipelines" for loads and stores indicates they will finally have speculative load after store, like Intel has had since Core 2.

Cheers
 
Even with improved load/store and other tweaks, I have a hard time believing they can make up for the loss of the third pipe; it might be close though.
Certainly, however, multithreaded performance should be much better than K8 without making the chip too big: multithreaded performance per transistor (or per die area) should be higher, hence more cores at the same die size (or in the same power envelope).
 
Dresdenboy has quoted papers about executing a single thread on multiple clusters (the long rumoured Reverse HyperThreading?!).

Also patents about power management that can make better use of TDP by much more fine-grained redistribution of power usage (clocks/voltage) through the chip, which should help with single-threaded performance.

On the other hand, one of the diagrams from the recent AMD slideshows had their projection of single-threaded performance potentially decreasing in future.

Lol, Fudo Fails to understand the diagram :oops:
 