AMD Bulldozer review thread.

It is the second option. And it is going to happen; I just wish the timing had been better. I was hoping Bulldozer would either bring the 2600K price down, be a big jump, or spur Intel to release a 6-core on a newer process. None of those things happened, though :(
Well, a 2700K might still appear soon (a whole 3% faster than the 2600K at most; how pathetic), though if the rumors are correct, Intel doesn't even feel like replacing the top-end model and moving the others down the price ladder, and will just introduce it at a higher price instead. Can't say I fault them without any competition in that space, but no luck for you :(.
It's quite telling that Intel sells something like 95% (pure guess) of their chips not only at lower frequencies than possible (which is, after all, a very old standard tactic) but also with quite a bit of functionality disabled.
 
I wonder if disabling a core also disables the instruction-issue interleaving in the shared x86 decoder? Interleaving was pointed out as another potential bottleneck for heavy single-threaded code, besides a thrashed L1i.
 
So AMD's CMT has some serious quirks, then. It's reminiscent of the mixed results with P4 HT.
 
I wonder if disabling a core also disables the instruction-issue interleaving in the shared x86 decoder? Interleaving was pointed out as another potential bottleneck for heavy single-threaded code, besides a thrashed L1i.

The decoding isn't supposed to be statically interleaved; that would defeat the point of making it shared instead of having two 2-wide blocks. According to AMD, code tends to be bursty in nature, so they want to be able to push more than 2 decodes/cycle through to one thread when they can.

Note that disabling cores isn't shown to improve single-threaded code, which probably performs the same regardless of what you disable. What is shown is that four threads perform better on 4 modules than on 2 modules.

This is perfectly in line with what AMD claimed: that the second thread on a module performs, on average, at 80% of the first thread's level. This is due to having to share the fetch, decode, L2, and FPU resources which a thread otherwise has to itself. AMD's estimate, which would yield 90% performance for two threads on one module vs. two threads on two modules, looks pretty dead-on based on these results (the average is 88%).
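
As a quick sanity check of that arithmetic (my own Python sketch, not from the review; the 0.80 factor is AMD's claim):

[code]
# AMD's claim: the second thread on a module runs at ~80% of the speed
# of a thread that has the module to itself.
second_thread_factor = 0.80

one_module = 1.0 + second_thread_factor   # two threads sharing one module
two_modules = 1.0 + 1.0                   # two threads on separate modules

print(f"shared-module throughput: {one_module / two_modules:.0%}")  # -> 90%
[/code]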

The examples where 4M/8C performs worse than 4M/4C are probably cases where <= 4 threads are utilized and the OS schedules some of the threads on the same module.
 
The examples where 4M/8C performs worse than 4M/4C are probably cases where <= 4 threads are utilized and the OS schedules some of the threads on the same module.
StarCraft II is known to tax only two cores, yet it is one of the two games in hardware.fr's review that remain unchanged when CMT is disabled. I'm not sure about the other one, ArmA II; the results in that test are not conclusive.
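For what it's worth, one way to take the scheduler out of the equation when testing this kind of thing is to pin one worker per module. A minimal sketch of mine (assuming Linux and the common enumeration where logical CPUs (0,1)/(2,3)/(4,5)/(6,7) are the paired cores of modules 0-3; check /proc/cpuinfo before trusting that layout):

[code]
import os
from multiprocessing import Process

def worker(cpu):
    # Bind this process to a single logical CPU (Linux-only call).
    os.sched_setaffinity(0, {cpu})
    total = 0
    for i in range(10_000_000):   # dummy integer-heavy load
        total += i * i
    print(f"cpu {cpu}: done")

if __name__ == "__main__":
    # Assumed layout: first core of each module; verify on your own box.
    first_core_of_each_module = [0, 2, 4, 6]
    procs = [Process(target=worker, args=(c,)) for c in first_core_of_each_module]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
[/code]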
 
Indeed. I think that's a good thing, though. Phenom II seemed to be a bit lacking there (or rather, designed for DDR2 speeds); overclocking the NB helped a bit (not THAT much, though). But this chip now delivers much larger memory bandwidth without overclocking. The lower latencies might still help a bit in theory, but the cores have even larger L2 caches now. It is surprising to see no improvement at all, though I wouldn't base that conclusion solely on one benchmark.
 
The Sandra inter-core bandwidth test shows great improvements over the Phenoms; see the last two tests on the page: http://extrahardware.cnews.cz/amd-fx-8150-6100-bulldozer-zambezi-recenze-test?page=0,11
Not sure what exactly this measures, but if it measures inter-core bandwidth for 2 cores in the same module then that doesn't tell us anything compared to the old Thuban.

Maybe the server SKUs, with the cache probe filter on and lots of sockets, will scale much better than the older Opterons. That oversized uncore die area could probably have some meaning, just not in desktops :rolleyes:
Well, the uncore area is large, but if you discard the unused area it's not THAT huge (OK, the L3 cache is huge too). The faster HT links (unless I'm mistaken, they got a 20% clock boost) should help as well, but I doubt they will change the overall picture: loads which have most of their memory accesses local will still do great, but others (like those seen in the recent AnandTech numbers) will most likely remain uncompetitive. There might be improvements in that area, but BD will continue to have more hops for memory accesses (because one socket will really house two BD CPUs) compared to similarly fast Intel systems.
 
Sadly, AMD hasn't been very creative in BD for its allegedly server-oriented architecture. If you look at what Intel has done in its high-end Xeon line since Nehalem-EX, you find all sorts of exotic dedicated bits of hardware for advanced coherency routing and directory-assisted snooping, which makes node scaling very smooth no matter how the data is spread and accessed across the memory space. The only significant concessions AMD has made to server and HPC clients are the incremental upgrades of the HyperTransport interface clock rate and the quasi-software hack "HT Assist".
 
[Image: dosboxscore1.png (DOSBox benchmark scores)]


LOL, the thing barely handles an ancient DOS game.
 
It seems more that it's not that good at running an old game through DOSBox.
Perhaps the working set for the emulation and the application is big enough to thrash the L1.
 
The problem is that I don't think the Bobcat design can actually do anywhere close to 3GHz under any real conditions.

You're right; I did some cursory research and it seems the max OC I've seen online is about 2.4 GHz. Design choices such as the fairly short pipeline would probably prevent it from being a speed demon in the future too, so the max single-threaded performance of any Bobcat design is likely to lag behind BD rather significantly.

My main issue is that they didn't seem to use the die space in their current BD implementation very efficiently for the performance they're achieving, so I was toying with what alternatives AMD had for this generation. For server / high-throughput workloads, it might have been a good decision for AMD to stick lots of Bobcat cores onto a die running at their stock 1.6 GHz, and they could fit in a lot:

http://www.chip-architect.com/news/AMD_Ontario_Bobcat_vs_Intel_Pineview_Atom.jpg

The 8 mm^2 CPU core size at 40nm is very impressive! They could easily fit 64 cores onto a BD-sized die at 32nm and have room for other odds and ends. (Not that it's a trivial thing to execute.) As for desktop performance, they probably would have gotten better results by fabbing a die with 6-8 Husky cores (< 300 mm^2 at 32nm) and clocking it as high as possible.
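
Rough back-of-the-envelope for that, with my own assumed numbers (ideal (32/40)^2 area scaling, ~315 mm^2 for the Zambezi die, and ignoring the L2/uncore each core would need):

[code]
bobcat_core_40nm = 8.0            # mm^2, per the chip-architect comparison
shrink = (32 / 40) ** 2           # ideal area scaling, ~0.64
core_32nm = bobcat_core_40nm * shrink

bd_die = 315.0                    # mm^2, approximate Zambezi die size
print(f"one core at 32nm: {core_32nm:.1f} mm^2")   # ~5.1 mm^2
print(f"64 cores: {64 * core_32nm:.0f} mm^2 "
      f"vs. a {bd_die:.0f} mm^2 BD-sized budget")  # ~330 mm^2, same ballpark
[/code]

That's the optimistic end (real shrinks scale worse than ideal, and L2/uncore isn't counted), but the ballpark holds.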

BD in its current iteration seems to be an uncomfortable compromise that uses way too much die space for what it accomplishes. Many indications seem to point to process issues and much higher initial target clocks. There are cache and software optimization problems too, but judging from AMD's future roadmap, most of the 10-15% yearly improvements will come from process refinements rather than architectural change; see paragraph 3:

http://www.tomshardware.com/reviews/fx-8150-zambezi-bulldozer-990fx,3043-9.html

In the long run, K10.5 doesn't have much clock speed / threading headroom and Bobcat doesn't have server features like virtualization built in, but the BD design in its current iteration still seems like it was released just a touch too early.

Edit: It seems like Bobcat does come with a large part of the hardware support for Hyper-V built in, and there has been talk of Bobcat cores in servers, so maybe this is a route they will explore in the future. Another possible explanation for AMD's poor die-space usage with BD is their increased reliance on synthesized logic in the design. From a business standpoint, it might be better for AMD to switch to a more synthesizable core design, which lets them push something out the door quickly and reap some revenue. Intel's vastly larger R&D budget allows them to take on some diminishing returns for a much more polished result.
 
It seems more that it's not that good at running an old game through DOSBox.
Perhaps the working set for the emulation and the application is big enough to thrash the L1.

What's interesting is how much better the 2600K does over the 2500K. There's no way that DOSBox knows what to do with > 4 threads, probably barely pushing > 1. It's gotta be the extra L3 that's making the difference here. Which makes it kind of sad that BD is doing as poorly as it is given it has the most cache of anything here, but that just goes to show how bad the single-threaded performance is.

Dynarecs often have pretty big L1 icache footprints, and may hurt associativity by calling small handlers a lot. There's also the question of whether DOSBox generates x87 code to emulate x87, since I don't think BD is any good at x87.

.. gotta say though, it is nice seeing an emulator on a CPU review for a change.
 
Years ago the DOSBox developers wrote an assembly FPU implementation for the DOSBox x86 dynamic core to speed up FPU-heavy games like Quake. I'm sure they're using x87 in there.

But there is a lot more to what DOSBOX is doing than just the CPU emulation. Audio and video emulation isn't trivial.
 
What's interesting is how much better the 2600K does over the 2500K. There's no way that DOSBox knows what to do with > 4 threads, probably barely pushing > 1. It's gotta be the extra L3 that's making the difference here. Which makes it kind of sad that BD is doing as poorly as it is given it has the most cache of anything here, but that just goes to show how bad the single-threaded performance is.

Dynarecs often have pretty big L1 icache footprints, and may hurt associativity by calling small handlers a lot. There's also the question of whether DOSBox generates x87 code to emulate x87, since I don't think BD is any good at x87.

.. gotta say though, it is nice seeing an emulator on a CPU review for a change.

I second that; I'd love to see more emulator / recompiler benchmarks in CPU reviews. The most relevant one might be JavaScript performance, since most modern browsers now use JIT techniques. It's sad to see these benchmarks get lumped with a whole bunch of others into all-in-one productivity-suite metrics these days, since they really can tell you a lot about the processor.
 

Better covered here: http://www.xtremesystems.org/forums...nce-Scaling-Charts-max-OCs)LN2-Results-coming!
There seem to be some gains from FSB clocking and from aligning the NB and HT speeds.

...
Is turbo disabled in the hardware.fr 4M/4T tests? Because otherwise it kind of contradicts these interesting Win8 results:
[Image: win 8 wow 1680.png (Windows 8 WoW results at 1680)]

(That's the best-case example, but for once Tom's actually got an interesting page in the review.)
http://www.tomshardware.com/reviews/fx-8150-zambezi-bulldozer-990fx,3043-23.html
 
Oh man, all I wanted BD to do was lower Sandy Bridge prices. Sheesh, AMD can't even give us that. Failure, and AMD will probably only get further behind each generation. Sell, sell, sell your stock.
 
Actually, some people believe GF has a bad 32nm SOI process that has leakage problems.

Details here (it's a Chinese website): http://bbs.pceva.com.cn/thread-27622-1-1.html

The main part of the idea is that when an AMD 32nm processor's core temperature reaches 62°C, it automatically reduces its frequency as well as its voltage.
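
If anyone wants to check that on their own chip, here's a minimal sketch (mine, not from the linked post; assumes Linux, and the sysfs paths and thermal-zone numbering vary by board and driver, so treat them as placeholders):

[code]
import time

TEMP = "/sys/class/thermal/thermal_zone0/temp"                  # millidegrees C
FREQ = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq"  # kHz

def read_int(path):
    with open(path) as f:
        return int(f.read().strip())

while True:  # Ctrl-C to stop
    temp_c = read_int(TEMP) / 1000.0
    freq_mhz = read_int(FREQ) / 1000.0
    # If the claim holds, frequency should step down under load
    # right as the reading crosses ~62 C.
    print(f"{temp_c:5.1f} C  {freq_mhz:7.0f} MHz")
    time.sleep(1)
[/code]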

I think this idea can explain why the clock rate of the Athlon 631 is only 2.6 GHz.
 