AMD Bulldozer review thread.

Has anyone seen any multitasking benchmarks for the dozer? If it needs those threads so badly :rolleyes:.


I also guess this could be a problem in the future if software can't catch up to the core counts. 8 or 12 cores will just sit idle and won't bring much improvement if you don't use them for rendering or encoding.
 
http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/5

The classic Queens bench shows the sad state of the single-threaded performance, which is being hammered by the ill-fated front-end and wimpy L1i cache.
The L1i isn't the problem here as it's largely the same as in Phenom II (it's only a real problem with multiple threads).
This benchmark MIGHT show the worst case as it's possible it is indeed optimized and written in a way which allows the cpus to schedule as many (int) instructions per clock as they can - 4 for SNB, 3 for Phenom II, 2 for BD. The scores would align nearly perfectly with that. [edit: of course SNB can't actually execute 4 alu ops per clock - not sure what I was thinking there...]
Thankfully it's not quite _that_ bad in general...
Makes you wonder if removing the third ALU pipe was a good idea though I guess it's all part of the compromise - e.g. third ALU pipe might mean instruction decode becomes the bottleneck too often (at least for multithreaded workloads), and of course it runs counter to the idea of CMT where you streamline the int cores a bit but have more of them instead.
 
But in any case I think you're looking at 2013 at the earliest for any such thing. The competition will be Haswell at that time, and I've no idea how it will compare to today's chips :).

Yes, probably 2013. But if AMD could manage to integrate a winning Vector Co-Processor and GPU-Compute has gained more traction by then, maybe they can be very competitive with Haswell, which I doubt will be more than an evolutionary step, whereas pursuing GCN as a CPU Co-Processor could well be revolutionary.

We'll see.

What I found the most interesting insight so far is that CMT seems to decelerate heavily threaded games.
http://www.hardware.fr/articles/842-9/efficacite-cmt.html
More exactly this graph:
IMG0033832.gif


SC2 uses only two threads at most and relies heavily on high IPC, and I am not exactly sure about ArmA II, but F1 2011 and RoF are definitely heavily threaded, as is Anno 1404 (aka Dawn of Discovery), and all are running slower with CMT enabled, i.e. 8C/4CU vs. 4C/4CU.
 
This kinda reminds me of Phenom 1.. it came out with a couple of dumb mistakes that made it underperform badly, but they were corrected with Phenom II shortly after.


Nonetheless, it is an impressive failure as a consumer desktop chip.
How come it has twice the transistor count, no iGPU, and it's still badly beaten by Sandy Bridge clock-for-clock?


How come they launch a completely new architecture that apparently shows an equal number of disadvantages and advantages over the previous one?
Seems like they should've just stuck with the Phenom II X6 architecture instead, improved the core throttling and power management, and just taken advantage of the new process to raise the clocks.



About the weaker FP processing (6 units in PII-X6 vs. 4 units in BD), could it be that AMD intends to offload all FP calculations to the iGPU cores in future Fusion chips?
 
Seriously, what's with the cache splurging?
2MB per module and then another 8 for L3- oh wow seriously.

That, and if you compare this to Gulftown for perf/transistors too...

I'm not sure if AMD's 15% a year will help much if they're always going to be stuck on dice that big. If BD was aimed at being lean and mean, this is surely not the chip to prove it.


I'm gonna say for one though, that once you look to the lower frequencies (oh the irony), there's this 2.5GHz Opteron Octa-core at 32W ACP. It'll fare better on mobile, where Trinities should clock in at about 2.5GHz+ or so and have 2 modules. The current 35W Llano maxes out at 1.5GHz, the 45W variant at 1.9GHz.

That's where the 30% comes in, but not how AMD wished it would be :LOL:
 
The amount of cache seems more appropriate for server loads. I haven't gotten to a review that looks at those.
The latencies are definitely not client-friendly.

Hopefully it looks better for server. The cache structure seems like it has gotten worse compared to the previous chips, and AMD's setup there isn't exactly the greatest.

The measured L2 latencies in some of the reviews are worse than I expected (one has 25-27 cycles, but there's a lot of variability there). In cycle terms, a miss from the puny L1 that hits in the L2 is in the same league as an SB L3 hit.
 
What I found the most interesting insight so far is that CMT seems to decelerate heavily threaded games.
http://www.hardware.fr/articles/842-9/efficacite-cmt.html
SC2 uses only two threads at most and relies heavily on high IPC, and I am not exactly sure about ArmA II, but F1 2011 and RoF are definitely heavily threaded, as is Anno 1404 (aka Dawn of Discovery), and all are running slower with CMT enabled, i.e. 8C/4CU vs. 4C/4CU.
I think that's quite easy to explain. Performance is mostly the same when using only 2 modules too, so while those games might be heavily threaded, it looks like (at least on this chip) performance is more or less determined by how fast one particular thread (or maybe 2) is running. Adding more threads just takes resources away from that thread. It might behave the same on SNB with/without HT (or maybe not, as BD apparently has issues with its L1I cache that SNB does not).
 
I wonder how well a 16 core Bobcat CPU at 3 GHz would do vs. an Interlagos 16 core.

The problem is that I don't think the Bobcat design can actually do anywhere close to 3GHz under any real conditions.

BD might be a decent server chip, but I'm really surprised how much AMD dropped the ball with BD as a desktop chip.

Things to improve:
1. Fix the false aliasing L1 caches (increase associativity to pad up the bits in the index/tags)

I don't think turning a 2-way set associative 64KB L1 icache into 16-way set associative is something that can be described as a "fix." Especially not with this being the same L1 cache design AMD has used since K7, hell, even K6 had a 32KB 2-way associative cache. There's usually no really good reason for software to have 4KB->32KB aliasing in the first place, even between two threads.. it's just a side effect of aggressive address space randomization and has already been changed for Linux. The impact is minor, as you'd expect, because I doubt two threads are always spending all of their active cycles in the same shared library code. It might not be a problem in Windows to begin with, I mean, it isn't a problem where large pages are used..
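To make the index arithmetic behind this concrete, here is a rough sketch (not anything from AMD's documentation). It assumes a virtually indexed cache with 64-byte lines, and the function names and the two sample addresses are made up for illustration. It counts how many set-index bits sit above the page offset, which is where two mappings of the same code can disagree:

```python
def log2_exact(n):
    """Return log2(n) for a power-of-two n."""
    assert n > 0 and n & (n - 1) == 0, "expected a power of two"
    return n.bit_length() - 1


def alias_bits(cache_bytes, ways, line_bytes=64, page_bytes=4096):
    """Number of set-index bits that lie above the page offset.

    Assumption: the cache is indexed with virtual address bits and
    line_bytes is 64. Zero means the index is fully inside the page offset.
    """
    sets = cache_bytes // (ways * line_bytes)
    top_index_bit = log2_exact(line_bytes) + log2_exact(sets)  # exclusive
    return max(0, top_index_bit - log2_exact(page_bytes))


def set_index(addr, cache_bytes, ways, line_bytes=64):
    """Set selected by a (virtual) address under the same assumptions."""
    sets = cache_bytes // (ways * line_bytes)
    return (addr >> log2_exact(line_bytes)) & (sets - 1)


# Two hypothetical virtual addresses with the same 4KB page offset
# that differ only in bit 14.
addr_a = 0x7F0000001040
addr_b = 0x7F0000005040

for ways in (2, 4, 16):
    same_set = set_index(addr_a, 64 * 1024, ways) == set_index(addr_b, 64 * 1024, ways)
    print(f"64KB {ways}-way: {alias_bits(64 * 1024, ways)} index bits above "
          f"a 4KB page offset, sample addresses share a set: {same_set}")

# Same 2-way cache, but with 2MB pages the index fits inside the page offset.
print("2-way with 2MB pages:", alias_bits(64 * 1024, 2, page_bytes=2 * 1024 * 1024))
```

Under these assumptions the 2-way 64KB case leaves three index bits (12 through 14) above a 4KB page offset, which is the 4KB->32KB window; 16 ways or 2MB pages bring that down to zero.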
 
3dilettante said:
Hopefully it looks better for server.
I'm not sure that it will. Even if Intel CPUs have a greater initial cost, they will be running cooler and consuming less electricity, paying for themselves over time. Not a good day for AMD :/
 
I'm not sure that it will. Even if Intel CPUs have a greater initial cost, they will be running cooler and consuming less electricity, paying for themselves over time. Not a good day for AMD :/

The Opteron models won't be driven to 4 GHz.
Reduced voltages can bring power numbers down very quickly, more than the commensurate drop in clock.
An Opteron in the 2-2.5 GHz range should be much cooler.
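As a back-of-the-envelope illustration of that point, the usual first-order model for switching power is P ~ C * V^2 * f. The sketch below only applies that proportionality. It ignores leakage, and the clock and voltage figures are made-up placeholders, not Opteron or FX specs:

```python
def relative_dynamic_power(freq_ratio, voltage_ratio):
    """Dynamic power relative to a baseline, using P ~ C * V^2 * f.

    Assumes the switched capacitance C is unchanged and ignores leakage.
    """
    return freq_ratio * voltage_ratio ** 2


# Hypothetical operating points, chosen only for illustration.
base_ghz, base_volts = 4.0, 1.30
low_ghz, low_volts = 2.5, 1.00

f_ratio = low_ghz / base_ghz
v_ratio = low_volts / base_volts

print(f"clock: {f_ratio:.0%} of baseline")
print(f"dynamic power: {relative_dynamic_power(f_ratio, v_ratio):.0%} of baseline")
```

Because voltage enters squared, any voltage reduction that comes with the lower clock makes power fall faster than the clock does.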
 
Yes, but when Intel's IPC is higher and their process more advanced, it doesn't really matter that AMD's server CPUs are much cooler than their consumer counterparts because that isn't what they are competing against. An Intel CPU in the 2-2.5 GHz range will still be cooler and faster and use less electricity...
 
I don't think turning a 2-way set associative 64KB L1 icache into 16-way set associative is something that can be described as a "fix." Especially not with this being the same L1 cache design AMD has used since K7, hell, even K6 had a 32KB 2-way associative cache.
Viewed from another side I think it's about time they improve it :).

There's usually no really good reason for software to have 4KB->32KB aliasing in the first place, even between two threads.. it's just a side effect of aggressive address space randomization and has already been changed for Linux.
But the fact remains that 2-way set associativity does not look all that good on paper. Everybody but AMD seems to agree that 4-way (but possibly smaller) instruction caches are just to be preferred. Heck, the Pentium III had a 4-way set associative L1 instruction cache. ARM Cortex A8 has 4-way L1I. Atom has 8-way L1I. Only AMD still uses such low associativity for L1 caches (and they did at least finally move away from it for L1D with BD, though at the cost of a huge size decrease). A 2-way 64KB cache just looks like an idea from last century, and it's not really all that surprising that aliasing issues finally hit BD - of course thanks to the 2 threads.
The impact is minor, as you'd expect, because I doubt two threads are always spending all of their active cycles in the same shared library code. It might not be a problem in Windows to begin with, I mean, it isn't a problem where large pages are used..
Large pages are used quite rarely unless I'm mistaken. I also thought the problem was the same for Windows too, but maybe I remember that wrong. In any case, 3% might not sound like that much, but for a single "small" optimization this is quite a lot. I doubt you'd really need 16-way associativity to fix it; most likely you'd get about 80% or so of the performance back by just increasing from 2-way to 4-way (diminishing returns and all that - in fact a 4-way 64KB L1I might even be faster overall without OS band-aids than a 2-way 64KB L1I with OS band-aids, as of course it would have a higher hit rate in the general case).
 
Yes, but when Intel's IPC is higher and their process more advanced, it doesn't really matter that AMD's server CPUs are much cooler than their consumer counterparts because that isn't what they are competing against. An Intel CPU in the 2-2.5 GHz range will still be cooler and faster and use less electricity...

I have not seen server benchmarks yet, so I can't say either way. The server market is what this chip is primarily for, so I'm waiting to see some analysis on whether there is (or was, as of earlier this year) an opening for BD to be competitive against server chips based on previous-gen cores.

The margin for a large so-so server chip is much better than that of a large uncompelling desktop chip anyway.
 
I am so frustrated. I will have to build a new machine before Christmas and was hoping AMD could manage something decent. I have a 1066T and from the reviews it doesn't look like it would even matter much if I "upgraded". I guess I finally have to give up and go over to Intel (who isn't bothering to release anything since AMD stinks).
 
Viewed from another side I think it's about time they improve it :).

But the fact remains that 2-way set associativity does not look all that good on paper. Everybody but AMD seems to agree that 4-way (but possibly smaller) instruction caches are just to be preferred. Heck, the Pentium III had a 4-way set associative L1 instruction cache. ARM Cortex A8 has 4-way L1I. Atom has 8-way L1I. Only AMD still uses such low associativity for L1 caches (and they did at least finally move away from it for L1D with BD, though at the cost of a huge size decrease). A 2-way 64KB cache just looks like an idea from last century, and it's not really all that surprising that aliasing issues finally hit BD - of course thanks to the 2 threads.

Oh yes.. 2-way set associativity for a cache shared by two threads is a real head-scratcher. With the much higher latency L2 AMD couldn't afford to make their L1s weaker in comparison, but that's exactly what they did, on both counts.. Absolutely, increase the associativity if this is remotely possible, but not merely to try to hide the aliasing. I don't think the aliasing is going to be an issue in the real world once the OSes don't introduce it. But it's true that higher associativity will result in less of it even if it doesn't completely remove it. It'd also result in less logic for removing the aliasing lines..

Large pages are used quite rarely unless I'm mistaken. I also thought the problem was the same for Windows too, but maybe I remember that wrong. In any case, 3% might not sound like that much, but for a single "small" optimization this is quite a lot. I doubt you'd really need 16-way associativity to fix it; most likely you'd get about 80% or so of the performance back by just increasing from 2-way to 4-way (diminishing returns and all that - in fact a 4-way 64KB L1I might even be faster overall without OS band-aids than a 2-way 64KB L1I with OS band-aids, as of course it would have a higher hit rate in the general case).

Yeah I seriously have no idea what Windows uses, I only know that Linux isn't big on large pages (Linus seems pretty opposed to it in normal cases).. but it seems like with big enough shared libraries they'd start to make more sense. Too bad there isn't a page size between 4KB and 2MB..

I wonder how distributed that 3% is. I get the feeling it's way more erratic than that, with some programs getting much more than 3% and a lot getting close to nothing. I hope someone does a comparison.
 
I am so frustrated. I will have to build a new machine before Christmas and was hoping AMD could manage something decent. I have a 1066T and from the reviews it doesn't look like it would even matter much if I "upgraded". I guess I finally have to give up and go over to Intel (who isn't bothering to release anything since AMD stinks).
Which software's poor performance is coercing you into an upgrade before Christmas?

Or is this an "upgrade while donating old PC hardware to a family member" thing?
 
Which software's poor performance is coercing you into an upgrade before Christmas?

Or is this an "upgrade while donating old PC hardware to a family member" thing?

It is the second option. And it is going to happen; I just wish the timing had been better. I was hoping Bulldozer would either bring the 2600K price down, be a big jump, or spur Intel to release a 6 core on a newer process. None of those things happened though :(
 