AMD Bulldozer Core Patent Diagrams

I thought it meant that running two cores in one module achieves 90% of the performance compared to running only one core in a module. I could be completely wrong :)
 

You are right. Mister Golden was simply misquoted. Here's what he said originally:

"...As Michael Golden, an AMD engineer, explained during its presentation, each dual-core module, when fully loaded, is capable of delivering 90% of the speed of a similar native dual-core processor, while featuring a lower power consumption and utilizing less die space.
This enables AMD to pack more cores inside the same die space and power budget..."
 
[Image: 86778230.jpg]


Pretty much an "undoctored" die shot, methinks. Very modular design... too much, if you ask me. :p
 
What's to the left of the northbridge (across the crossbar)? It seems way too big to be simple filler space or traces. You could almost fit another 2MB of cache there.
 
Leaked benchmarks of Interlagos in F@H:
F@H Benchmarks Interlagos (Without(!) Turbo core 2.0):
ubuntu 10.10 server x64
512G DDR3-1333
P6901
Average time/frame: 00:03:52

[09:15:07] Completed 0 out of 250000 steps (0%)
[09:18:59] Completed 2500 out of 250000 steps (1%)
[09:22:42] Completed 5000 out of 250000 steps (2%)
[09:26:16] Completed 7500 out of 250000 steps (3%)
[09:30:08] Completed 10000 out of 250000 steps (4%)
[09:34:06] Completed 12500 out of 250000 steps (5%)
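
A quick sanity check on the log excerpt above, as a minimal sketch: it parses the bracketed timestamps and prints the time between consecutive 1% checkpoints. The helper name `frame_times` is made up for illustration, and the code assumes all timestamps fall on the same day (no midnight wrap). Only five frames are shown here, so their mean need not match the quoted average, which presumably covers a longer run.

```python
from datetime import datetime

# Log excerpt copied from the post above.
LOG = """\
[09:15:07] Completed 0 out of 250000 steps (0%)
[09:18:59] Completed 2500 out of 250000 steps (1%)
[09:22:42] Completed 5000 out of 250000 steps (2%)
[09:26:16] Completed 7500 out of 250000 steps (3%)
[09:30:08] Completed 10000 out of 250000 steps (4%)
[09:34:06] Completed 12500 out of 250000 steps (5%)
"""

def frame_times(log):
    """Return the seconds elapsed between consecutive checkpoints."""
    stamps = [
        datetime.strptime(line[1:9], "%H:%M:%S")  # text between '[' and ']'
        for line in log.splitlines()
        if line.startswith("[")
    ]
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]

times = frame_times(LOG)
mean = sum(times) / len(times)
print("seconds per frame:", times)
print(f"mean over {len(times)} frames: {mean:.1f} s")
```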


For comparison:
Bulldozer "Interlagos" 16x4@ 1.8GHz* = 00:03:52
Opteron "Magny Cours" 12x4@ 2.2GHz = 00:06:40
Source

~58% higher single-threaded performance compared to K10, clock for clock.
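
For anyone wondering where the ~58% comes from, here is a rough sketch of the arithmetic. It assumes "16x4" and "12x4" mean cores per socket times four sockets, and that the client scales perfectly with both core count and clock speed, so what it really gives is throughput per core per clock rather than a measured single-threaded result. The function and variable names are just for illustration.

```python
def to_seconds(hms):
    """Convert an 'HH:MM:SS' string to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

# Figures taken from the comparison above.
interlagos = {"cores": 16 * 4, "ghz": 1.8, "frame": to_seconds("00:03:52")}
magny_cours = {"cores": 12 * 4, "ghz": 2.2, "frame": to_seconds("00:06:40")}

def per_core_per_clock(cpu):
    """Frames per second divided by (cores * GHz).

    Assumes perfect scaling with core count and clock, which real
    workloads rarely achieve.
    """
    return (1.0 / cpu["frame"]) / (cpu["cores"] * cpu["ghz"])

ratio = per_core_per_clock(interlagos) / per_core_per_clock(magny_cours)
print(f"{(ratio - 1) * 100:.0f}% more work per core per clock")
```

Put another way: the frame time ratio is 400 s / 232 s, scaled by 48/64 for the core count and by 2.2/1.8 for the clock.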
 
It's highly unlikely that it is serial; it's a pretty old client, and I am sure that it is well threaded. It scores more with lower clocks, although AVX may have tipped the balance if it was serial.

1.8 GHz is pretty low for a speed racer. I was expecting almost the same clocks as MC at launch, although speeds might increase with a more mature process.
 
That page also has Llano benchmarks. I would have liked to see a comparison of Llano with a discrete GPU, especially the power benchmarks.
 
Nice to see that some software vendors are already taking advantage of the new features provided by Bulldozer (and Sandy Bridge). From the Visual Studio 2010 Service Pack 1 readme:

Visual Studio 2010 SP1 adds intrinsic functions or intrinsics to enable the extensions on the AMD and Intel new microprocessors that will be released next year. The intrinsic functions allow highly efficient computing without the overhead of a function call. For more information about the intrinsics function, visit the following website:
Compiler Intrinsics

For more information about the extensions, visit the following third-party websites:
Intel AVX
AMD Bulldozer instruction sets
 
I'm fairly certain GCC has included AVX support for at least a couple of years now; not sure how good it is, though.
 
An interesting couple of posts from JFAMD on AnandTech:


We have a 256b FP datapath (pipes 0 and 1) AND a 256b INT datapath (pipes 2 and 3), so

2 128b FP + 2 128b INT
or
1 256b FP + 2 128b INT
or
1 256b FP + 1 256b INT
or
2 128b FP + 1 256b INT

The INT here is an integer unit for doing the integer portion of math inside an SSE instruction, that is not the integer clusters that you would commonly call cores.

Plus there is a really cool feature around moves. Technically, we can do 4 128b SSE moves per cycle with a ZERO cycle latency. This is known as “MOVE ELIMINATION”.

And to further clarify directly:

Also there are some features AMD downplayed so far in my opinion. It is because obviously AMD has not only 2 FPU pipes and 2 MMX pipes. Those MMX pipes don't do MMX they are full 128 Bit integer SSE pipelines
(true).

So all register moves and load/stores can be executed also in those two pipelines
(not really, reg-reg moves for SSE and AVX-128 can be done with mov-elimination

Load – doesn’t actually require an execution pipe in the FP at all – but is limited to 2 128b loads/cycle max throughput.
Store – does take an execution pipe, but can only execute down 1 of the pipes. That & LS restrictions limit it to 1 128b store/c throughput)

I recently read a source that those two don't do 64 Bit MMX but 128 Bit SSE! Really don't know why AMD was so quiet about that so far and obfuscated that by using the wrong term "MMX". Therefore AMD can do 4 * 128 Bit SSE/cycle!

(yes, “MMX” is likely a bad name to use in describing the BullDozer micro-architecture and is somewhat misleading. Yes, we can do 4 128b arithmetic operations/cycle: 2 “floating-point” and 2 “SSE/AVX-128 integer”. Or/instead/in-combination we can also do 2 x87 “floating point” and 2 mmx “integer” per cycle – and by mmx I really mean the architected “mmx”).



And that is the sound of me clapping my hands like a blackjack dealer and saying "all done", can't get any further into this topic.
 