AMD Bulldozer Core Patent Diagrams

rpg.314 · Apr 25, 2010

ShaidarHaran said:
If I'm reading the diagram correctly it looks like each BD module has the same FMAC throughput as FADD/FMUL. Great for future workloads, but not so hot for today's software if I'm not mistaken.

It is the same as present day hw, so no better or worse for today's sw. However, it has less fp/core than Sandy Bridge, so not so good for future sw.

However, I can't see much use for AVX in consumer apps, even going forward. There's going to be dedicated video decode hw on every CPU in that time range.

hoom · Apr 25, 2010

Great for future workloads, but not so hot for today's software if I'm not mistaken.

Not so good for FP heavy stuff perhaps.
BD architecture can only make sense if the FP|int ratio is around/below 1|2.
If AMD was seeing/expecting BD to be exposed to much code that is 1|1 or 2|1 they wouldn't be building BD this way.

V3 · Apr 25, 2010

Is AMD really expecting software to offload FP stuff into GPU ?

Blazkowicz · Apr 25, 2010

It will run database servers, web servers, multi-user desktops and all that other stuff as well.

V3 · May 3, 2010

With Intel going to quad channel, does AMD have any plans for something similar in their higher-end desktop parts? I know they are promising AM3 compatibility, but after that, AM4 with quad channel?

itsmydamnation · May 3, 2010

i wouldn't assume that intel are going quad channel for consumer boards. look how much X58 based boards cost when they came out. Having enough room for 8 dimm slots is also an issue. will they have PCI lanes to the north bridge or to the CPU.

i would hope (as an X58 owner) that sandy bridge is still lga 1366 and tri channel; having to buy a new and even more expensive motherboard would stop adoption.

Silent_Buddha · May 4, 2010

In the consumer/enthusiast gamer space is more than 2 channels even needed? Bloomfield with 3 channels doesn't show an advantage over Lynnfield with 2 channels in the greater majority of consumer/enthusiast applications.

Heck there isn't even much of a performance advantage moving from DDR2 to DDR3 on AMD platforms.

In the professional space there might be a need, but even then I can see BD possibly retaining compatibility with Socket-F, AM3, etc...

Regards,
SB

Squilliam · May 4, 2010

itsmydamnation said:
i wouldn't assume that intel are going quad channel for consumer boards. look how much X58 based boards cost when they came out. Having enough room for 8 dimm slots is also an issue. will they have PCI lanes to the north bridge or to the CPU.

It would simplify things actually.

One dimm = single.
Two dimms = dual.
Three dimms = triple.
Four dimms = quad.

They can save the 8 dimm boards for the enthusiast and workstation markets, right? The only issue would be the number of CPU pins required to implement it really and the expense of the board. It also plays in nicely with rumours that Microsoft is going 128bit for Windows 8. I wonder if that'd tie in with going for quad channel ram?

aaronspink · May 4, 2010

Squilliam said:
They can save the 8 dimm boards for the enthusiast and workstation markets, right? The only issue would be the number of CPU pins required to implement it really and the expense of the board. It also plays in nicely with rumours that Microsoft is going 128bit for Windows 8. I wonder if that'd tie in with going for quad channel ram?

Eh, no one is going to 128b anytime soon. The largest-scale systems available only have ~52b physical addressing, and those are LARGE-scale systems (standard SGI Altix 4700 maxes at 128 TB / SGI Altix UV maxes at 16 TB, though custom orders scale larger). Even if they upped the DIMM slots and went to the absolute bleeding-edge memory available 4-8 years from now, you are only looking at 256-512 TB of memory, or 48/49b of physical address. Realistically, OSes don't need to start thinking about >64b base addressing until we have common machines in the 512 PB range, which is quite a bit out there (15 years assuming a super Altix UV as the starting point and fab transitions every 18 months). And we'll likely go to a 96b segmented address first as well for practical reasons. So the need for 128b addressing is in the range of 20-30 years out.

Also, address width has little to nothing to do with the number of DRAM channels. The number of DRAM channels has more to do with bandwidth requirements, board feasibility, and data block sizes.
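
As a rough check of the address-width arithmetic above, here is a minimal Python sketch. The helper names are made up for illustration, and the 512 TB starting point for the "super Altix UV" is an assumption: the post does not state a capacity, but that figure happens to reproduce the 15-year estimate at one doubling every 18 months.

```python
import math

TB = 2**40  # bytes in a tebibyte
PB = 2**50  # bytes in a pebibyte

def address_bits(num_bytes: int) -> int:
    """Smallest address width (in bits) that can index num_bytes bytes."""
    return (num_bytes - 1).bit_length()

def years_to_grow(start_bytes: int, target_bytes: int,
                  months_per_doubling: float = 18.0) -> float:
    """Years for capacity to grow from start to target, doubling once per period."""
    doublings = math.log2(target_bytes / start_bytes)
    return doublings * months_per_doubling / 12.0

# Memory sizes mentioned in the post and the address bits they need.
for label, size in [("16 TB", 16 * TB), ("128 TB", 128 * TB),
                    ("256 TB", 256 * TB), ("512 TB", 512 * TB),
                    ("512 PB", 512 * PB)]:
    print(f"{label:>7}: {address_bits(size)} address bits")

# ASSUMPTION: a hypothetical 512 TB "super Altix UV" as the starting point.
print(years_to_grow(512 * TB, 512 * PB), "years from 512 TB to 512 PB")
```

Under these assumptions, 256-512 TB lands at 48/49 bits as stated; starting from the standard 16 TB Altix UV instead would stretch the timeline.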

Squilliam · May 4, 2010

aaronspink said:
Also, address width has little to nothing to do with the number of DRAM channels. The number of DRAM channels has more to do with bandwidth requirements, board feasibility, and data block sizes.

Sorry if I wasn't making myself clear. I was talking about up to 128-bit-wide memory interfaces. I'm a little ignorant, sorry, so I thought that having a 128-bit-wide memory interface on a 128-bit operating system might have some practical benefit in terms of efficiency etc.

Btw are you still at Intel?

Erinyes · May 4, 2010

itsmydamnation said:
i wouldn't assume that intel are going quad channel for consumer boards. look how much X58 based boards cost when they came out. Having enough room for 8 dimm slots is also an issue. will they have PCI lanes to the north bridge or to the CPU.

i would hope (as an X58 owner) that sandy bridge is still lga 1366 and tri channel; having to buy a new and even more expensive motherboard would stop adoption.

According to the info here http://vr-zone.com/articles/a-look-...tform--sandy-bridge-e--waimea-bay/8877-1.html, Intel is moving to a new socket to replace X58. It's gonna have 2011 pads/pins, integrated PCIe like Lynnfield, and a quad channel memory controller.

The X58's life of three years (Nov 2008 to Q3 2011) is actually on the higher side for Intel, which usually has a habit of forcing motherboard upgrades on users. The Lynnfield/Clarkdale platform has an even shorter life: P55 and H55 will be replaced in Q1 2011, with P55 having been introduced in Q3 2009 and H55 in Q1 2010!

swaaye · May 10, 2010

Why don't we have 128-bit memory sticks yet? We've had 64-bit sticks since 1997 or something. Is it really difficult to do 128-bit sticks? That would get rid of this triple DIMM nonsense that we're up to now. Almost every motherboard out there is dual channel now, unlike a few years ago, so I think we're past the need to keep 64-bit per slot around.

hoom · May 10, 2010

Some interesting posts by John Fruehe over here.
He seems at first to be confirming the '4 ALUs per core' vs '2 ALUs & 2 AGUs'
(first post is referring to a diagram with 2ALU & 2AGU & he replies with)

Bulldozer has more pipelines than our existing products.

and later

Integer resources are not 2 wide.

but then does seem to be defending slower per-thread but more cores, which would indicate the latter interpretation :???:

Single thread performance should only be considered if you are going to also consider single thread price and single thread power.

Price/performance/watt will be excellent.

Also on the 5% thing

Take a bulldozer die with 8 cores. Pull 4 integer cores out and that silicon represents ~5% of the total silicon of the die.

Raqia · May 10, 2010

hoom said:
Some interesting posts by John Fruehe over here.
He seems at first to be confirming the '4 ALUs per core' vs '2 ALUs & 2 AGUs'
(first post is referring to a diagram with 2ALU & 2AGU & he replies with)

and later

but then does seem to be defending slower per-thread but more cores, which would indicate the latter interpretation

Also on the 5% thing

According to JF on the last post of the page:

http://www.amdzone.com/phpbb3/viewtopic.php?f=52&t=137432&start=550

Single threaded performance will be better. If 33% more cores gets you more than 33% better throughput, how can single thread be lower????

Throughput will be better.

All of that has been said before. ANY rumor that bulldozer will be disadvantaged relative to current products just aren't true. In asking "by how much", you will just have to wait for launch. We don't disclose that now.

I'm expecting about equal or lower IPC judging from the lower per core width but the CPU sporting much higher clocks to compensate. The diagrams they showed at their analyst's day had fuzzy performance bars presumably because they hadn't finalized their clocks at that time.

http://www.geek.com/wp-content/uploads/2009/04/amd_webcast_9.jpg

hoom · May 13, 2010

Ah good, that's a direct reference to single threaded speed being better

mboeller · May 13, 2010

I'm not sure if the single thread performance is really so great at all.

If you use page 4 of the PDF "HC21.24.110.Conway-AMD-Magny-Cours.pdf" as a reference, then you can see that Interlagos with 8 modules should have, according to AMD, an INT performance of ~37.7 at the most. The 4-core reference from 2008 has an INT performance of ~10.3 according to this slide.

So the per-module performance of a Bulldozer CPU could end up being up to 1.83 times higher than the per-core performance of the Phenom X4 used as a reference. Since, according to AMD, CMP could gain a ~80% performance advantage for only a 5% larger CPU, the per-core single-thread performance could be the same in the end!

Loophole: no mention of the clock speeds (MHz/GHz) of the CPUs.

Therefore Interlagos could end up as fast as the 3GHz Phenom (which came to the market at the end of 2008 AFAIR) in single threaded applications. It is still possible that the perMHz-performance is higher if Interlagos is clocked slower than the 3GHz Phenom but also worse if the Interlagos is clocked higher than the 3GHz Phenom.

Also, AFAIK Phenom II and especially Llano have a higher perMHz, per-core performance compared to the original Phenom. Therefore Bulldozer could still disappoint in single threaded applications.
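
The per-module estimate above can be reproduced with a short Python sketch. The two scores are the slide values quoted in this post; reading the "~80%" CMP gain as "a two-core module delivers about 1.8x one core" is an interpretation, and the variable names are only illustrative.

```python
# Slide values quoted above (AMD's projected INT throughput scores).
interlagos_score = 37.7   # Interlagos, 8 modules
phenom_score = 10.3       # 4-core 2008 reference

interlagos_modules = 8
phenom_cores = 4

# Throughput per module vs. throughput per reference core.
per_module = interlagos_score / interlagos_modules
per_ref_core = phenom_score / phenom_cores
module_vs_core = per_module / per_ref_core

# ASSUMPTION: one module (two integer cores) is worth ~1.8x a single core,
# which is how the "~80% CMP gain" is interpreted here.
cmp_factor = 1.8
single_thread_vs_ref = module_vs_core / cmp_factor

print(f"per-module vs. per-core ratio: {module_vs_core:.2f}")
print(f"implied single-thread ratio:   {single_thread_vs_ref:.2f}")
```

With these inputs the module-to-core ratio comes out near 1.83 and the implied single-thread ratio near 1.0, which is the "about the same" conclusion; clock speeds are not part of the calculation, so the loophole noted above still applies.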

swaaye · May 13, 2010

Phenom II is faster per clock than the original. The original had that TLB bug and the L3 cache was way too small. Maybe there was even more to it. I think the L3 is also faster in PII.

But PII is still slower per core per clock than C2Q Yorkfield outside of servers (that would be Xeon vs. Opteron though). Sometimes an old Conroe core will beat it. The model numbers and pricing tend to keep the awareness of that down.

Raqia · May 14, 2010

mboeller said:
I'm not sure if the single thread performance is really so great at all.

If you use page 4 of the PDF "HC21.24.110.Conway-AMD-Magny-Cours.pdf" as a reference, then you can see that Interlagos with 8 modules should have, according to AMD, an INT performance of ~37.7 at the most. The 4-core reference from 2008 has an INT performance of ~10.3 according to this slide.

So the per-module performance of a Bulldozer CPU could end up being up to 1.83 times higher than the per-core performance of the Phenom X4 used as a reference. Since, according to AMD, CMP could gain a ~80% performance advantage for only a 5% larger CPU, the per-core single-thread performance could be the same in the end!

Loophole: no mention of the clock speeds (MHz/GHz) of the CPUs.

Therefore Interlagos could end up as fast as the 3GHz Phenom (which came to the market at the end of 2008 AFAIR) in single threaded applications. It is still possible that the perMHz-performance is higher if Interlagos is clocked slower than the 3GHz Phenom but also worse if the Interlagos is clocked higher than the 3GHz Phenom.

Also, AFAIK Phenom II and especially Llano have a higher perMHz, per-core performance compared to the original Phenom. Therefore Bulldozer could still disappoint in single threaded applications.

Don't forget some version of turbo-boost before you draw any conclusions about single threaded performance. This might put it well within striking distance of more single thread oriented designs, though I doubt it'll actually beat the Sandy Bridge in single threaded apps.

Plus 16 Interlagos cores will continue to share 4 channels of memory, which is at a disadvantage to the Barcelona set-up where 4 processors fed off of 2 channels. This probably accounts for a sizable decrease per-core when counting peak multi-threaded performance.
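
A quick Python sketch of the channel-sharing point, using only the core and channel counts from this post. It assumes every channel delivers the same bandwidth, which ignores any faster memory on the newer platform; the function name is hypothetical.

```python
def cores_per_channel(cores: int, channels: int) -> float:
    """Average number of cores contending for each memory channel."""
    return cores / channels

interlagos = cores_per_channel(cores=16, channels=4)
barcelona = cores_per_channel(cores=4, channels=2)

# ASSUMPTION: equal bandwidth per channel on both platforms, so the
# per-core share scales inversely with cores per channel.
relative_share = barcelona / interlagos

print(f"Interlagos: {interlagos:.0f} cores per channel")
print(f"Barcelona:  {barcelona:.0f} cores per channel")
print(f"Interlagos per-core share relative to Barcelona: {relative_share:.2f}")
```

Under that assumption each Interlagos core gets half the channel share of a Barcelona core when all cores are busy, which is the per-core penalty described above.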

itsmydamnation · May 15, 2010

but if your RAM and HT links are running much faster

Raqia · May 15, 2010

itsmydamnation said:
but if your RAM and HT links are running much faster

The first generation of Bulldozer will still be on the same platform as Magny Cours which has a faster platform than Istanbul but less performance per core. You can see the effects of cutting per-core bandwidth on integer performance going from Istanbul to MC:

http://techreport.com/r.x/2009_4_22...res_in_early_2010_16_cores_in_2011/slide3.jpg

2x the cores but w/ only 4/3x the memory channels seems to translate into less performance per core when considering full multi-threaded performance. I think pure single threaded performance won't be so terrible since there'll be less bandwidth sharing and turbo-boost.
