AMD Bulldozer review thread.

DavidGraham · Oct 15, 2011

I.S.T. said:
Read the whole thread and the arch thread (or at least the last five pages). You'll find out why.

Thanks for the tip, I just did that now. I came away with three prominent possible causes: slow L3 cache, old prediction algorithms, and a buggy front end.

Still, nothing seems conclusive. Maybe it is just a combination of all of them? I guess more testing is needed.

swaaye · Oct 15, 2011

L1 cache associativity is apparently too low as well and inefficient for SMT. Also, FMA4 looks very slow for floating point. It seems that it has a lot of problems.

Alexko · Oct 15, 2011

V3 said:
I mean, some of the reviews say AMD designed BD so it can be clocked much higher by sacrificing some performance per clock, but missed the target anyway.

I thought that since the P4, designing chips for clock speed was a dead end. Why did they attempt it for BD? What's their reasoning? I wasn't aware they were going the Pentium 4 route with BD till I read the reviews.

Bulldozer was originally supposed to have higher IPC than K10. Something, somewhere, went horribly wrong. Just what that was is the big question.

I.S.T. · Oct 16, 2011

Hell, if it had K10 IPC it'd be nice. As it is right now, in some workloads it doesn't even match that.

I.S.T. · Oct 16, 2011

DavidGraham said:
Thanks for the tip, I just did that now. I came away with three prominent possible causes: slow L3 cache, old prediction algorithms, and a buggy front end.

Still, nothing seems conclusive. Maybe it is just a combination of all of them? I guess more testing is needed.

Slow L2 as well. Latency is about 20 clocks or so according to the realworldtech forums.

swaaye · Oct 16, 2011

When has AMD had a competitive cache architecture? I suppose Duron had a neat advantage over Celeron....

Raqia · Oct 16, 2011

Alexko said:
Bulldozer was originally supposed to have higher IPC than K10. Something, somewhere, went horribly wrong. Just what that was is the big question.

I wonder what all the delays from June were specifically attributed to; it could explain a lot.

fellix · Oct 16, 2011

I.S.T. said:
Slow L2 as well. Latency is about 20 clocks or so according to the realworldtech forums.

It's funny that the L2 latency in the Prescott and Cedar Mill cores is exactly 20 cycles (both had up to 2MB). Apparently deeply pipelined architectures don't go well with fast caches, even at moderate sizes. Conroe (65nm) managed a steady 14 cycles for its 4MB L2.

Exophase · Oct 16, 2011

fellix said:
It's funny that the L2 latency in the Prescott and Cedar Mill cores is exactly 20 cycles (both had up to 2MB). Apparently deeply pipelined architectures don't go well with fast caches, even at moderate sizes. Conroe (65nm) managed a steady 14 cycles for its 4MB L2.

Or at least, platforms that are trying to target much higher clocks than they realistically have the power budget for go for higher latency L2 caches.

There's some speculation on RWT right now that BD's L2 latency may be hurt by layout constraints due to being shared between two cores. Seems like splitting the L1 dcache, execution, and scheduling paths into two could have bad effects on latency back where they're supposed to join again. Or maybe they just didn't do a good enough job with layout.

fellix · Oct 16, 2011

Actually, the measured L2 latency in Sandra is 27 cycles to be exact:

I.S.T. · Oct 16, 2011

fellix said:
Actually, the measured L2 latency in Sandra is 27 cycles to be exact:

There are some disagreements with that if you read the RWT forums...

fellix · Oct 16, 2011

A pointer-chasing test would definitely reveal more details. Any sources?
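For context, a pointer-chasing test walks a chain of dependent loads: each element holds the index of the next one, so every load has to finish before the next can start, and the time per step approximates the load-to-use latency of whichever cache level the working set fits in. Below is a minimal Python sketch of the structure only. It is not a usable benchmark, since interpreter overhead swamps cache latency; a real test would be written in C or assembly. The function names are hypothetical, and the chain is built with Sattolo's algorithm so that it forms one cycle through every element.

```python
import random
import time


def single_cycle_permutation(n, rng):
    """Return a permutation of range(n) that forms one cycle of length n.

    Uses Sattolo's algorithm: like Fisher-Yates, but j is always < i,
    which guarantees a single cycle visiting every element.
    """
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i
        perm[i], perm[j] = perm[j], perm[i]
    return perm


def chase(perm, steps, start=0):
    """Follow the chain for `steps` dependent lookups and return the final index."""
    idx = start
    for _ in range(steps):
        idx = perm[idx]  # the next lookup depends on this result
    return idx


if __name__ == "__main__":
    rng = random.Random(0)
    n = 1 << 16  # working-set size in elements (arbitrary choice)
    perm = single_cycle_permutation(n, rng)
    steps = 10 * n
    t0 = time.perf_counter()
    chase(perm, steps)
    elapsed = time.perf_counter() - t0
    # In Python this mostly measures interpreter overhead, not cache latency.
    print(f"{elapsed / steps * 1e9:.1f} ns per step")
```

Sweeping the working-set size and plotting time per step is what normally exposes the L1, L2, and L3 plateaus.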

Exophase · Oct 16, 2011

Before doing a comparison, you have to know whether the source is talking about L2 latency in isolation or the effective latency on an L1 miss, which is usually the L1 latency plus the L2 latency, since L2 is not typically queried in parallel. Technical descriptions will often quote the L2-only latency, while measurements will often report the combined figure.
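To make the distinction concrete, here is a small Python sketch. The function name and the cycle counts are hypothetical placeholders, not measured values for any particular CPU.

```python
def effective_l2_latency(l1_cycles, l2_only_cycles, parallel_lookup=False):
    """Load-to-use latency, in cycles, seen by a load that misses L1 and hits L2.

    With the usual serial lookup, the L1 access must miss before L2 is
    queried, so the two latencies add. If L2 were probed in parallel with
    L1 (an assumption, not typical), only the longer of the two would count.
    """
    if parallel_lookup:
        return max(l1_cycles, l2_only_cycles)
    return l1_cycles + l2_only_cycles


# Hypothetical numbers: a 4-cycle L1 and an L2 quoted as 20 cycles "in isolation".
print(effective_l2_latency(4, 20))  # 24
```

So a spec sheet quoting the L2-only figure and a benchmark reporting the combined figure can disagree by the L1 latency without either being wrong.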

fellix · Oct 16, 2011

Well, for that one, I know the L2 latency in K8 was officially stated in combination with the L1 miss latency (20 cycles total). Sadly, such information is rare for most of the other architectures around.

mczak · Oct 17, 2011

The L2 cache latency certainly isn't good, but I'm not sure it makes that much of a performance difference.
Also, don't forget Llano manages to have a similarly bad L2 latency with half the L2 size and without having to worry about core sharing; the reasons could be similar (mostly power-draw related, I guess).

AlexV · Oct 17, 2011

fellix said:
A pointer-chasing test would definitely reveal more details. Any sources?

http://valid.canardpc.com/show_oc.php?id=2050581

rapso · Oct 18, 2011

hoho said:
Higher MHz for RAM not only gives higher bandwidth but also lower latency. Though obviously it can't be the only reason for a 40% difference.

Only if the RAS/CAS settings stay equal. Usually, when the frequency is higher, the cycle count increases and the latency in ns stays about the same.

You can see this nicely on the Kingston page, for example: http://www.kingston.com/hyperx/products/khx_ddr3.asp

DDR3-1866 9-11-9-27
DDR3-1600 - DDR3-1800 9-9-9-27
DDR3-1600 7-8-7-20
DDR3-1333 7-7-7-20
Of course, there is some noise in this comparison, but in general, if you look at DDR memory over the years: DDR1 (400) ~3 cycles, DDR2 (800) ~6 cycles, DDR3 (1333) ~9 cycles.
There are of course even DDR3-2500 modules, and they have similar timings to those DDR3-1866 modules, but they are technically the same modules, just hand-selected.
Meanwhile, 1333 vs. 1866 seems to be a different design (maybe a smaller process?).
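As a rough check of that trend, the CAS cycle counts can be converted to nanoseconds. This is a minimal Python sketch: the function name is hypothetical, it looks only at CAS latency (the first timing number) and ignores the other timings, and it assumes the usual DDR relationship that the I/O clock is half the data rate.

```python
def cas_latency_ns(data_rate_mts, cas_cycles):
    """CAS latency in nanoseconds for a DDR module.

    DDR transfers data twice per I/O clock, so the clock in MHz is half the
    data rate in MT/s, and one clock period is 2000 / data_rate nanoseconds.
    """
    clock_period_ns = 2000.0 / data_rate_mts
    return cas_cycles * clock_period_ns


# (label, data rate in MT/s, CAS cycles), taken from the figures quoted above.
modules = [
    ("DDR1-400  CL3", 400, 3),
    ("DDR2-800  CL6", 800, 6),
    ("DDR3-1333 CL9", 1333, 9),
    ("DDR3-1333 CL7", 1333, 7),
    ("DDR3-1600 CL7", 1600, 7),
    ("DDR3-1600 CL9", 1600, 9),
    ("DDR3-1866 CL9", 1866, 9),
]

for label, rate, cl in modules:
    print(f"{label:<16}{cas_latency_ns(rate, cl):6.2f} ns")
```

DDR1-400 at CL3 and DDR2-800 at CL6 both come out at 15 ns and DDR3-1333 at CL9 at about 13.5 ns, which is the "same latency in ns, more cycles" pattern. The tighter DDR3 kits in the list land a few ns lower.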

The bad Bulldozer performance makes me especially sad, as Intel seems to be under no pressure to progress any further with their CPUs. The newest leaks/rumors say that Ivy Bridge will just have the performance of Sandy Bridge at a lower power level (95W -> 77W), and again just 6 cores for consumers.
If AMD continues to focus on APUs, a weak CPU might even be to their advantage in the long term, as at some point they'll have enough benchmarks to show off the APU advantage. I wish Intel would view IGPs as competition for their CPUs and hurry up speeding up the vector units.

Has anyone found benchmarks of the AVX units, especially FMA4, from an independent source/reviewer? I'm very curious how the single-thread performance is and how well it scales with two threads running it.

hoho · Oct 18, 2011

rapso said:
The bad Bulldozer performance makes me especially sad, as Intel seems to be under no pressure to progress any further with their CPUs

It won't push Intel to rush out their products, but they still have to come up with newer, faster and generally better CPUs, because they need people to upgrade their existing stuff instead of just replacing broken parts. Otherwise they won't be making enough money to get anywhere. Basically, even if AMD ceased to exist, they would still compete with their own older offerings.

rapso · Oct 18, 2011

hoho said:
It won't push Intel to rush out their products, but they still have to come up with newer, faster and generally better CPUs, because they need people to upgrade their existing stuff instead of just replacing broken parts. Otherwise they won't be making enough money to get anywhere. Basically, even if AMD ceased to exist, they would still compete with their own older offerings.

That's only true to some degree. They don't need to compete with anyone but themselves, so they can decide how fast they drive it. They could release an 8-core consumer CPU at 125W, but instead they release 6 cores at 77W.
They still give it a new model number, everyone who wants to sell it to you will say "it's newer and faster", and the average consumer will upgrade their PC.

I really wonder if that strategy also works for Bulldozer; getting a 6-core Phenom or an A8 Llano seems to be a far better offer at the moment. Only AVX with good FMA4 results would be a reason to go for Bulldozer.
I was hoping Bulldozer would establish 8 cores on desktops. I've been waiting ages for those (and 8-core @ 2GHz Xeons, which are the low end and still pricey, are not really a good alternative).
I've had a 4-core for more than 5 years, and all I could get now is a 6-core Sandy Bridge-E or a 6-core Phenom. It feels like the technology improvement is stagnating.

I'm really wondering what went wrong with Bulldozer. On paper it seems really smart to share resources for new features (AVX) while everything old could run at nearly full speed (80%), trading some execution units for more cores, if that had worked out.

I also don't think 20 cycles for L2 is that slow. On an out-of-order architecture with prefetch units it could be hidden, so that really shouldn't cause an 8-core @ 4.2GHz to be slower than the old 6-core @ ~3GHz.

fellix · Oct 18, 2011

The problem BD has with AVX is that the FP/SIMD portion of the architecture was designed primarily as a dual 128-bit issue design with [strike]SSE5[/strike]/XOP/FMA4 in mind, not for the native 256-bit implementation by Intel. AMD probably hoped FMA would be adopted as the prime booster of raw performance in the new ISA, but Intel took the "wide" approach first, with the new register format.
