AMD Bulldozer review thread.

Didn't each core in a module have its own private L1 caches, so there shouldn't be any thrashing there, especially in the instruction cache, which shouldn't even contain any data that can be changed by other cores?

There is just one L1I cache per module; there are two separate L1D caches. And that shared L1I cache is only 2-way set associative.
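
To illustrate why low associativity hurts a shared instruction cache, here is a minimal sketch of an LRU set-associative cache. Only the 2-way associativity comes from the post above; the 64 KB capacity, 64-byte lines, LRU replacement, and the three example addresses are assumptions chosen for illustration. Three hot code lines that map to the same set (say, two from one thread and one from the other) evict each other on every fetch in a 2-way cache, while a 4-way cache of the same size would hold all three.

```python
from collections import OrderedDict


def count_misses(addresses, cache_size=64 * 1024, ways=2, line_size=64):
    """Count misses for a trace of fetch addresses in an LRU set-associative cache."""
    num_sets = cache_size // (line_size * ways)
    sets = {}  # set index -> OrderedDict of resident tags, oldest first
    misses = 0
    for addr in addresses:
        line = addr // line_size
        index = line % num_sets
        tag = line // num_sets
        resident = sets.setdefault(index, OrderedDict())
        if tag in resident:
            resident.move_to_end(tag)  # hit: mark as most recently used
        else:
            misses += 1
            if len(resident) >= ways:
                resident.popitem(last=False)  # evict the least recently used tag
            resident[tag] = True
    return misses


# Hypothetical hot code addresses, 32 KB apart, so they all share one set.
hot_lines = [0x10000, 0x18000, 0x20000]
trace = hot_lines * 100  # round-robin fetches, 300 in total

print(count_misses(trace, ways=2))  # 300: every fetch misses
print(count_misses(trace, ways=4))  # 3: only the first touch of each line misses
```

This is a toy model (no prefetching, no real fetch pattern), but it shows the mechanism: with only two ways, a third conflicting line is enough to turn every fetch into a miss.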
 
Phenom II is faster for gaming. AMD have been behind for years now and they're actually moving backwards.

A 32nm shrink of Phenom II would presumably have been much easier to get up and running on a new process and also been a smaller, cooler chip to sell to consumers (even without higher clocks or turbo mode or basically any changes).

AMD's decision not to do anything new on 45nm for years, then jump in with a brand new architecture on a completely unproven process, sure paid off. :(
 
I wonder what is next for AMD in the CPU space now... yeah, a 10-15% improvement... still not enough. Plus this thing seems pretty hot. All the engineers at Intel working on IB must be taking a day off and having a good laugh by now.
 
I just find it hard to accept that it struggles against the X6 in gaming. I guess when they started talking about bundling a liquid cooling solution, it was a sign of trouble.

Would a 32nm Phenom II X8 be faster and smaller compared to a 4-module BD?

Anyway, I am glad about the 2600K purchase. I'll skip this and go with SNB-E, and drop the 2600K into the gaming rig. I'm quite disappointed; I wasn't expecting much, but not this.
 
Something interesting I found, from hardware.fr's review:

img00338281.gif


"Latence" = Latency
"Debit" = Throughput

Nonetheless, Debit is marked red for higher values, which cannot mean throughput, since that would be "higher is better" (or is the diagram simply colored wrong?).

Possibly debit is better translated as CPI, as opposed to IPC? After initial latency of course?
 
Possibly debit is better translated as CPI, opposed to IPC? After initial latency of course?
Yes, it probably means how many cycles it takes to complete the operation (on average) for a test run.
The <1 cycle values are most likely a result of superscalar issue (more than one instruction started per cycle).
 
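"Le débit mesure la vitesse entre chaque instruction traitée lorsqu'on en traite plusieurs" (roughly: "throughput measures the speed between each processed instruction when several are processed"). So here, "debit" is like "time between instructions".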
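
To make the latency versus "time between instructions" distinction concrete, here is a small sketch. It assumes an idealized, fully pipelined unit; the function names and the example figures (latency 5, reciprocal throughput 0.5) are illustrative assumptions, not values read from the hardware.fr table.

```python
def cycles_dependent(n_ops, latency):
    """Each op needs the previous result, so the full latency is paid every time."""
    return n_ops * latency


def cycles_independent(n_ops, latency, recip_throughput):
    """Independent ops: a new one starts every `recip_throughput` cycles,
    and the last one still needs `latency` cycles to finish."""
    if n_ops == 0:
        return 0
    return latency + (n_ops - 1) * recip_throughput


# Illustrative numbers only: 100 ops, latency 5 cycles, one op starting
# every 0.5 cycles (e.g. two pipes each accepting one op per cycle).
print(cycles_dependent(100, 5))         # 500
print(cycles_independent(100, 5, 0.5))  # 54.5
```

A value below 1 in the "debit" column then just means more than one such instruction can start per cycle, which is why lower is better there.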
 
Phenom II is faster for gaming. AMD have been behind for years now and they're actually moving backwards.
Depends on the game, looked overall quite similar to me - lose some, win some. But that is indeed disappointing.
[edit: actually it indeed seems it's slower in games more often than not - the games would really need to have more threads for it to be faster for the most part.]

AMD's decision not to do anything new on 45nm for years, then jump in with a brand new architecture on a completely unproven process, sure paid off. :(
They didn't quite do that; forgot about Llano? If the process isn't up to the task (I really have no idea whether the chip not reaching higher frequencies is due to 32nm trouble), there's not much you can do about it at this point anyway.


Would a 32nm Phenom II X8 be faster and smaller compared to a 4-module BD?
An interesting question. With the same 512KB L2 per core and 8MB L3 it should be smaller. Which is counter to the idea that 1 BD module should be smaller than 2 older cores because some resources are shared...
It is hard to say, though, what the achievable frequency would be (and hence whether it would be faster), as we don't know how well (or not) 32nm fares against 45nm. Also, Phenom II might have some trouble scaling to more cores; the memory throughput wasn't all that great, and I'm not sure about L3 cache throughput either.

Something interesting I found, from hardware.fr's review:
"Latence" = Latency
"Debit" = Throughput
Yes, this confirms some information which was already available: since it has two FMAC pipes, some FPU/SSE2 instructions which can only run in one pipe on other archs can run in both (muls at least, and interestingly it looks like DIVs too? Or maybe they are more pipelined now, though at first glance it really looks like almost anything could run in either pipe). The latency of complex x87 operations in particular is horrific, though, so it's probably not really a win for old software, larger queues or not...
 
The latency of complex x87 operations in particular is horrific, though, so it's probably not really a win for old software, larger queues or not...
Intel did exactly the same with P4 ten years ago, to make room for SSE2 and to get a longer pipeline to pump up the clock rate. Sadly, AMD is quiet on the number of pipeline stages for BD.
Looks like AMD is (un)willingly following the old steps to Intel's NetBurst "hell". :LOL:
 
Intel did exactly the same with P4 ten years ago, to make room for SSE2 and to get a longer pipeline to pump up the clock rate. Sadly, AMD is quiet on the number of pipeline stages for BD.
Looks like AMD is (un)willingly following the old steps to Intel's NetBurst "hell". :LOL:

Yep, high clock speed, crappy performance... The only problem is that Intel has deep pockets and very good engineering teams, while AMD has seemed to be in difficulty for years now. I read somewhere that a lot of AMD engineers left some time ago because of crappy management. I hope AMD will rebound, but I doubt it (on the desktop side anyway).
 
I can only hope for AMD that they've got the connection between the Cores and the Co-processors right and that GCN-like cores will be part of future BDs rather sooner than later. Otherwise

edit: BTW, where's Charlie's rant complaining about AMD forcing an HPC/server part down the mainstreamers' throats? *SCNR*
 
Intel did exactly the same with P4 ten years ago, to make room for SSE2 and to get a longer pipeline to pump up the clock rate. Sadly, AMD is quiet on the number of pipeline stages for BD.
Looks like AMD is (un)willingly following the old steps to Intel's NetBurst "hell". :LOL:
Well, the horrific latencies in the FMACs are quite specific to complex x87 ops; while the simple ones have increased too, it's "only" from 4 to 5 clocks. So I don't think pipeline length has increased massively.
It seems to me that the horrific latencies for the complex x87 ops are just because they can now be executed in both pipes, but AMD implemented them a bit more cheaply, hence individual ops take longer to execute (but of course two of them can execute simultaneously instead). Those ops might be rare enough that they are "fast enough", and making both FMAC pipes (nearly) symmetric probably made things easier for scheduling etc., hence AMD doing this instead of going the more traditional route of executing these ops in only one pipe but a bit faster (of course, more traditional FPU execution ports are asymmetric anyway).
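
The trade-off described here (one faster pipe versus two slower symmetric pipes) can be sketched with a toy model. It assumes the complex ops are not pipelined at all, so a pipe is busy for the whole latency of an op; the 20- and 30-cycle latencies are made-up numbers for illustration, not measured values for any CPU.

```python
import math


def cycles_unpipelined(n_ops, latency, pipes, dependent):
    """Cycles to finish n_ops on `pipes` identical, non-pipelined units."""
    if dependent:
        # A dependency chain can only use one pipe at a time.
        return n_ops * latency
    # Independent ops are spread evenly across the pipes.
    return math.ceil(n_ops / pipes) * latency


# Hypothetical designs: one pipe at 20 cycles vs. two pipes at 30 cycles.
for dependent in (False, True):
    one_fast = cycles_unpipelined(100, latency=20, pipes=1, dependent=dependent)
    two_slow = cycles_unpipelined(100, latency=30, pipes=2, dependent=dependent)
    print(dependent, one_fast, two_slow)
# False 2000 1500
# True 2000 3000
```

Under these assumptions the two slower pipes win when the ops are independent and lose on dependency chains, which is the kind of code where the higher latency would show up.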
 
I can only hope for AMD that they've got the connection between the Cores and the Co-processors right and that GCN-like cores will be part of future BDs rather sooner than later. Otherwise
If you're talking about Piledriver, it doesn't look like that, with Anand stating the L3 cache will likely be gone, which would suggest a totally separate IGP like in Llano (of course the connection COULD be faster, but without a common L3 cache I don't think this is really very useful for that use case). Though well integrated GCN is still one more generation ahead...
 
No, I wasn't talking about a specific iteration of BD. But from what I'm seeing in the reviews, AMD needs to get their performance per watt up as fast as possible, also for the HPC space, and efficient, wider vector units would greatly help to at least make/keep (depends on how you assess the current situation) them competitive in that market.
But in any case I think you're looking at 2013 at the earliest for any such thing. The competition will be Haswell at that time, and I've no idea how it will compare to today's chips :).
 
Ok here's my summary of what I'm thinking after reading some reviews:

The good:
- TurboCore which actually does something. Of course for perf/power this is not good but it's nice to finally see this truly working.
- lower idle power. In some reviews it made a much larger difference (almost to the point of being similar to Intel idle power consumption), in some not so much, but in any case it looks like an improvement, most likely thanks to power-gating the cores.
- higher memory bandwidth than Phenom II (not quite as good as SNB but definitely an improvement).
- AES instructions for catching up with SNB, AVX, FMA4 etc.
- shared FPU actually looks ok. Only some rare synthetics seem to show this causing a performance hit; otherwise scaling to multiple threads seems largely independent of whether it's an integer or float workload (of course this is most likely a result of the beefed-up FPU too; it will probably show bad scaling with code using AVX-256, where the FPU suddenly doesn't look all that beefy anymore).
- CMT is a neat idea and scaling isn't too bad (hardware.fr has some numbers, 4->8 threads scaling is better than going from 4 to 6 Phenom II cores and of course better than HT), so might be a win on a perf/area scale for multithreaded workloads, if just the singlethreaded baseline would be a bit higher...

The bad:
- nowhere near the once promised "30% higher" clockspeed. If that's due to cpu design or manufacturing trouble I don't know.
- high load power consumption. Efficiency actually didn't increase compared to an X6 1100T, which is worrisome.
- low single-thread performance: even with some clock increase it is just barely at Phenom II levels, and not in the same ballpark as SNB CPUs at all (of course this was expected, but the difference is bigger than it should be).

The ugly:
- AMD promised roughly same IPC for single threaded performance and they clearly missed it for typical workloads by about 10-15% or so. I think that's the biggest problem actually. Why did they miss it? Is it just the removal of the 3rd ALU in an integer core? It looked like it could have been possible to "compensate" for that with other improvements (like memory disambiguation, better branch prediction, larger scheduling queues) given that it's hard to schedule 3 alu ops in the first place but it didn't happen.
- L1I thrashing issues with only iffy OS bandaids for a problem which should have been avoided by better cache design (higher associativity).
- not convinced of the whole cache design. Requires large area and the latency is just bad both compared to Phenom II and more so the competition, so the large size might not be worth it. Moreover, I wonder if the very low L1D write bandwidth (not even 1/5 of read bandwidth while traditionally it "should" be roughly half of the read bandwidth, it is a result of the L1D write through design together with the low L2 write bandwidth) isn't a real problem reducing throughput quite heavily in some cases (there is a "Write Coalescing Cache" to help with that, but at least in hardware.fr numbers it didn't turn up).
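
As a rough back-of-the-envelope check on the IPC and clock speed points above, single-thread performance can be approximated as IPC times clock. This first-order model is an assumption (it ignores memory effects and turbo behaviour); the 10-15% IPC deficit and the "30% higher" clock figure are the ones quoted in the summary, and the function names are just for illustration.

```python
def relative_performance(ipc_ratio, clock_ratio):
    """First-order model: performance scales with IPC times clock frequency."""
    return ipc_ratio * clock_ratio


def break_even_clock_gain(ipc_deficit):
    """Fractional clock increase needed to cancel a fractional IPC deficit."""
    return 1.0 / (1.0 - ipc_deficit) - 1.0


for deficit in (0.10, 0.15):
    print(f"{deficit:.0%} IPC deficit needs "
          f"{break_even_clock_gain(deficit):.1%} more clock to break even")
# 10% IPC deficit needs 11.1% more clock to break even
# 15% IPC deficit needs 17.6% more clock to break even

# With the once promised 30% clock increase and a 15% IPC deficit:
print(round(relative_performance(0.85, 1.30), 3))  # 1.105
```

So under this simple model, the promised clocks would have more than covered the IPC loss; without them, the deficit shows up directly in single-threaded results.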
 