AMD Bulldozer Core Patent Diagrams

The first generation of Bulldozer will still be on the same platform as Magny Cours which has a faster platform than Istanbul but less performance per core. You can see the effects of cutting per-core bandwidth on integer performance going from Istanbul to MC:

http://techreport.com/r.x/2009_4_22...res_in_early_2010_16_cores_in_2011/slide3.jpg

2x the cores but w/ only 4/3x the memory channels seems to translate into less performance per core when considering full multi-threaded performance. I think pure single threaded performance won't be so terrible since there'll be less bandwidth sharing and turbo-boost.

what are you talking about? MC has TWICE the number of memory channels compared to Istanbul.

And the reason for the "not 100% scaling" in this case doesn't have anything to do with memory bandwidth; it's because of memory latency, clock speed, etc.

And no performance analysis should be based on marketing ppt slides.
 
Consumer: no idea.
Server: I believe John Fruehe has said Q1 '11 (might have been H1).

AFAIK, AMD hasn't said anything about BD timing in 2011. Also, in the last conf call, they said that 32nm SOI process had been delayed. Based on that, I think Q311 is a reasonable target.
 
So when is Bulldozer due?

Consider when they supposedly taped out, and then consider the typical time from tape-out to commercial availability, whilst taking into account that it's a new design from them on a new, and likely troublesome, process.
 
I think Q311 is a reasonable target.
Oh man I hope not :cry:

Some interesting stuff here:
http://blogs.amd.com/work/2010/08/12/the-parallel-universe/
Most likely single threaded workloads will fare much better, they will have all the shared resources in a module dedicated to them, besides the spacious 2MB L2 cache.
Interesting point in the comments that brings up something that's been nagging at me.
I'm thinking that chunky shared L2 could be an overlooked key factor in Bulldozer performance. (at least as long as the Windows scheduler plays ball)
 
AFAIK, AMD hasn't said anything about BD timing in 2011. Also, in the last conf call, they said that 32nm SOI process had been delayed. Based on that, I think Q311 is a reasonable target.

They didn't say it had been delayed; they said that the yield curve hadn't progressed to where they had hoped, and based on that they moved the launches of Llano and Ontario around. This has nothing to do with when Bulldozer will come, for several reasons:

1. Llano was going to come first.
2. Server chips have much higher margins than consumer chips, so lower yields on Llano could make a consumer product unviable but not a server product.
3. Bulldozer is being sold as 6/8 and 12/16 MCM, so there are salvage parts to help with overall yields.
4. Product launch and product availability can be two very different things :p
 
Oh man I hope not :cry:

Some interesting stuff here:
http://blogs.amd.com/work/2010/08/12/the-parallel-universe/
Interesting point in the comments that brings up something that's been nagging at me.
I'm thinking that chunky shared L2 could be an overlooked key factor in Bulldozer performance. (at least as long as the Windows scheduler plays ball)

Ooh, nice find. That should mean it's a great gaming chip, as even in games that are heavily multi-threaded you still usually find there's one dominant thread which can often bottleneck performance.

I wonder how a load like that would translate to a Bulldozer chip now come to think of it. Ideally you'd want that one dominant thread to have a whole module dedicated to it, and the other three modules to handle the extra threads, though I'm unsure if that's how things would be scheduled in reality.
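To make that ideal placement concrete, here is a minimal sketch in Python. It is a toy model, not how Windows or any real scheduler works: the function name `place_threads`, the thread names, and the load figures are all made up for illustration, and the 4-module, 2-cores-per-module layout is just the configuration discussed above. The heaviest thread gets module 0 to itself and the rest are spread greedily over the remaining modules.

```python
def place_threads(loads, num_modules=4, cores_per_module=2):
    """Toy placement: dominant thread alone on module 0, others spread out.

    loads maps a (hypothetical) thread name to a relative load.
    Returns a dict: module index -> list of thread names placed there.
    """
    modules = {m: [] for m in range(num_modules)}
    ordered = sorted(loads, key=loads.get, reverse=True)
    if not ordered:
        return modules

    # The dominant thread keeps a whole module (and its shared L2) to itself.
    modules[0].append(ordered[0])

    # Every other thread goes to the least-loaded remaining module
    # that still has a free core.
    totals = {m: 0.0 for m in range(1, num_modules)}
    for name in ordered[1:]:
        candidates = [m for m in totals if len(modules[m]) < cores_per_module]
        if not candidates:
            raise ValueError("more threads than free cores in this toy model")
        target = min(candidates, key=lambda m: (totals[m], m))
        modules[target].append(name)
        totals[target] += loads[name]
    return modules


placement = place_threads(
    {"render": 0.9, "physics": 0.4, "ai": 0.3, "audio": 0.2, "net": 0.1}
)
for module, names in placement.items():
    print(module, names)
```

Whether a real OS would leave the second core of module 0 idle like this is exactly the open question; the sketch only shows what the "ideal" case would look like.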
 
This raises the question of how CPUs are detected and supported by schedulers from NT 5.1, NT 5.2, 6.0 etc. or Linux 2.6.26, 2.6.32 etc., regarding SMT/hyperthreading for instance, or > 2, > 4 cores or other aspects.

are there rules depending on particular CPU models, more generic rules?
I don't know much about that and most people probably don't either.
 
I've been pondering that myself lately, mainly because I got a notebook with an i7 last spring. I've read that XP knows about SMT, but 2K doesn't. You wouldn't want to run 2K with an SMT-equipped CPU with multiple real cores because it would dish out threads inefficiently.

While playing SupCom the other day, I watched how Win7 loaded up my i7. It put threads on every other logical core, skipping the hyperthread cores. Load was on CPU 0,2,4,6 instead of going 0,1,2,3. I read that the core numbering pairs up each core's logical CPUs. I believe I've also read that SupCom tends to peak out on 4 cores, but it could also be that Win7 or the game doesn't use HT cores.
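A rough sketch of what that 0,2,4,6 pattern would look like if the scheduler simply fills one logical CPU per physical core before touching the hyperthread siblings. This assumes the sibling numbering described above (logical CPUs 0 and 1 share a core, 2 and 3 share the next, and so on); the function names are invented for illustration and this is not a claim about what Win7 actually does internally.

```python
def primary_cpus(num_logical, threads_per_core=2):
    """First logical CPU of each physical core, assuming adjacent sibling numbering."""
    return list(range(0, num_logical, threads_per_core))


def spread_threads(num_threads, num_logical, threads_per_core=2):
    """Assign threads to logical CPUs: one per physical core first, siblings last."""
    primaries = primary_cpus(num_logical, threads_per_core)
    siblings = [cpu for cpu in range(num_logical) if cpu not in primaries]
    order = primaries + siblings
    # Wrap around if there are more threads than logical CPUs.
    return [order[i % len(order)] for i in range(num_threads)]


print(spread_threads(4, 8))  # four busy threads on a 4-core/8-thread CPU
print(spread_threads(6, 8))  # siblings only get used once every core is busy
```

With four busy threads this lands on 0, 2, 4, 6, which matches what I saw, but a game that hard-codes affinity to those CPUs would produce the same picture.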
 

A better-than-dumb "core"-affinity assignment by the developer is more likely than super smart-ass WI (Windows intelligence). ;)
 
The good news is that some future AMD core will only need to support x86, MMX, SSE, SSE2, SSE3, SSE4, AVX, XOP, FMA3, FMA4 and probably some other three to four letter words.

I wish the post had more on the rationale, other than not even AMD cares about its 3DNow! instructions these days.
Possible reasons: OpenCL may offer an abstraction layer that could hide the change, AMD may want to pare back on its hefty decoder (even if it is shared between two cores in Bulldozer, some of the descriptions of its capability hint it is still going to be big), or AMD may covet that opcode space.

The last reason may make sense: there is discussion of things like lightweight profiling and a scheme for implementing primitives to accelerate an approximation of atomic memory transactions in later chips. It might be nice if there didn't need to be another couple of prefix bytes to use them.
 
Any chance one day MMX will go too?

Yes, it's a bit weird. MMX+3DNow is one register space, mirrored by SSE+SSE2 (double-wide) and then mirrored by AVX (quadruple-wide). In their shoes I would probably map MMX+3DNow onto the lower half of the SSE registers, if, yes, if there weren't Intel ... so maybe one could just alias the MMX registers onto the SSE rename registers (i.e. treat them internally as xmm16-xmm23), adding an additional line to the register address.

The entire "state" is so borked. This could be so streamlined with nicely ordered opcodes; in the end these are all just different views onto the register file: "do me 32x adds on that block" can be 4 non-dependent MMX adds [Bulldozer could do 4, K10.5 only 3], 2 non-dependent SSE2 adds, or one AVX add; to the core they're the same resource utilization. But no, we have this mess. :devilish:
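The "different views" point can be illustrated with a small Python sketch, under some loud assumptions: it models a 32-byte block of 8-bit lanes with wraparound adds, and treats 8, 16 and 32 bytes purely as chunk sizes standing in for the MMX, SSE2 and AVX register widths. It says nothing about which integer instructions each extension really provides (256-bit integer adds, for instance, only arrived with AVX2), and `add_block` is an invented helper.

```python
def add_block(a, b, width):
    """Add two byte blocks lane by lane, `width` bytes per simulated instruction.

    Lanes are 8 bits wide and wrap around on overflow.
    Returns (result bytes, number of simulated instructions).
    """
    if len(a) != len(b) or len(a) % width != 0:
        raise ValueError("blocks must be equal length and a multiple of width")
    out = bytearray()
    ops = 0
    for off in range(0, len(a), width):
        chunk_a = a[off:off + width]
        chunk_b = b[off:off + width]
        out += bytes((x + y) & 0xFF for x, y in zip(chunk_a, chunk_b))
        ops += 1
    return bytes(out), ops


a = bytes(range(32))
b = bytes([200] * 32)

as_mmx, mmx_ops = add_block(a, b, 8)    # 64-bit view
as_sse, sse_ops = add_block(a, b, 16)   # 128-bit view
as_avx, avx_ops = add_block(a, b, 32)   # 256-bit view

print(mmx_ops, sse_ops, avx_ops)        # 4 2 1
print(as_mmx == as_sse == as_avx)       # True
```

Same 32 lane adds, same result, only the instruction count changes with the width of the view.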

And I don't stop saying 3DNow is [still] higher quality than SSE: see reciprocal functions. :D I'll miss it, and hatch my surviving K10s.
 