Old 16-Apr-2009, 19:56   #1
Raqia
Member
 
Join Date: Oct 2003
Posts: 324
Default AMD Bulldozer Core Patent Diagrams

It looks substantially like SUN's Rock processor:

http://citavia.blog.de/

and these specs seem to gel well w/ some previous reports about Bulldozer.

The above was originally pulled from:

http://www.planet3dnow.de/vbulletin/...42#post3849342

Also some other rumors to take w/ a grain of salt:

http://brightsideofnews.com/news/200....aspx?pageid=0
Old 16-Apr-2009, 20:29   #2
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,079
Default

I wish they would implement hardware multithreading in it.
__________________
The views presented here are my own and not my employer's.
Quote:
Originally Posted by Alexko View Post
So in a nutshell, model [BLANK] will have [BLANK], up to [BLANK], and even [BLANK] for a power consumption of just [BLANK]. Impressive.
Old 16-Apr-2009, 20:41   #3
Albuquerque
Red-headed step child
 
Join Date: Jun 2004
Location: Guess ;)
Posts: 3,089
Default

Quote:
Originally Posted by rpg.314 View Post
I wish they would implement hardware multithreading in it.
I still struggle to understand why they haven't gone down this road, seriously. In my own opinion, they had far more room for success in SMT with all IMC architectures versus everything and anything Intel had before the i7.

They've had excellent main memory bandwidth, excellent inter-core communications, and plenty of execution resources on the die. All the basics are there, so why haven't they gone those last few steps and really opened this up?
__________________
"...twisting my words"
Quote:
Originally Posted by _xxx_ 1/25 View Post
Get some supplies <...> Within the next couple of months, you'll need it.
Quote:
Originally Posted by _xxx_ 6/9 View Post
And riots are about to begin too.
Quote:
Originally Posted by _xxx_8/5 View Post
food shortages and huge price jumps I predicted recently are becoming very real now.
Quote:
Originally Posted by _xxx_ View Post
If it turns out I was wrong, I'll admit being stupid
Old 20-Apr-2009, 11:51   #4
Gubbi
Senior Member
 
Join Date: Feb 2002
Posts: 2,646
Default

Quote:
Originally Posted by Albuquerque View Post
I still struggle to understand why they haven't gone down this road, seriously. In my own opinion, they had far more room for success in SMT with all IMC architectures versus everything and anything Intel had before the i7.
They would have to increase caches, ROBs and register files. All time critical structures. And all for having higher utilization of the execution units.

AMD's cores are relatively small, each core in a quad core Phenom 2 is around 10% of the entire die. The reasoning behind not developing SMT is probably that they might as well just double the number of cores and get double the performance in multithreaded scenarios. Unfortunately they do not enjoy the excessive fab capacity Intel does.

AMD cores are descendants of the original K7, they have had 2 (or 3?) internal new architecture projects cancelled since the K7 came out, I'm sure one of those contemplated SMT.

Cheers
__________________
I'm pink, therefore I'm spam
Old 20-Apr-2009, 18:44   #5
Albuquerque
Red-headed step child
 
Join Date: Jun 2004
Location: Guess ;)
Posts: 3,089
Default

Quote:
Originally Posted by Gubbi View Post
They would have to increase caches, ROBs and register files. All time critical structures. And all for having higher utilization of the execution units.
Register files, yes. The others are a strong maybe. Intel is doing hyperthreading with L1/L2 caches that are smaller than quite a few current AMD chips. And while I understand what you're saying about "cores are only ~10% of the total die space", I also understand that a 2x increased register file would be considerably smaller. The net effect of which would be, I dunno, half the performance of an additional core for one quarter (or less) increase in die space?

Again, they're (my opinion) in a far better position than Intel in terms of SMT's ability to make a difference - they've had far better IPC (inter-processor communication) and memory subsystem technology for a very long time, at least up until the i7 finally hit the street. I think this should've given them far more opportunity to deliver a highly successful SMT implementation.

Quote:
AMD cores are descendants of the original K7, they have had 2 (or 3?) internal new architecture projects cancelled since the K7 came out, I'm sure one of those contemplated SMT.
This is probably the worst part, and I don't disagree. I think this SMT thing is just another smaller and less obvious example of how AMD's processor innovation really seems to have stagnated over the last multiple years. Which makes me
Old 20-Apr-2009, 19:18   #6
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,281
Default

One of the pitfalls to SMT on a complex OoO processor is that it does expand the engineering resources needed to properly design and verify.

If the diagram and patent applications are indicative of Bulldozer--and here we need to be cautious, as patents do not always make it to implementation--we see that AMD has made the decision to simplify a number of things at the core unit level to make way for more complexity in speculation.

The aggressive speculation is also an argument against SMT, since slots consumed by speculation cannot be doled out to other threads.
__________________
Dreaming of a .065 micron etch-a-sketch.
Old 21-Apr-2009, 03:03   #7
Albuquerque
Red-headed step child
 
Join Date: Jun 2004
Location: Guess ;)
Posts: 3,089
Default

Really, my post only serves as a note of sadness and despair for AMD's current processor lineup. They had so many opportunities and so much time to make something awesome, and yet here we are with K7 part 22.

Let's do something new AMD, seriously. Let's get back into the game; let's do something with those R&D resources that have seemingly been idle for the last half decade. Come on guys, bring the pain or something!
Old 21-Apr-2009, 14:16   #8
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,281
Default

AMD did try to bring the pain, repeatedly.

It had at least one false start with K8, and it had several false starts (something like six months' delay each) prior to Barcelona, which itself was a faceplant.

Complex designs need significant resources and time, and we can see the disparity between AMD and Intel in the amount of resources they have in reserve for such efforts.

The multiyear gap before the long-delayed Bulldozer (whichever design they've settled on, after however many they've scrapped) points to a significant limitation of means.

I'm curious about how much the layoffs have hit the engineering and design groups, and it's also not clear to me whether the engineering executives whose tenure most matches the abortive attempts at a K8 successor have been culled, or whether, like the current AMD CEO, they just got promoted.
Old 21-Apr-2009, 18:22   #9
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 2,832
Default

Hm, that clustered approach is teasing to me. It looks like each cluster is a scaled-down version of the integer block found in K10. The L1D cache will probably stay the same dual-ported, multi-banked array with high throughput for arbitrary access, but some details could be touched, like the size (probably halved, per cluster) and doubled/quadrupled associativity, for compensation.
Everything here points to heavy "modularization" down to the lowest architectural level.
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic.
Microsoft: Russia -- Big and bloated.
Linux: EU -- Diverse and broke.
Old 23-Apr-2009, 11:07   #10
Gubbi
Senior Member
 
Join Date: Feb 2002
Posts: 2,646
Default

Quote:
Originally Posted by Albuquerque View Post
Register files, yes. The others are a strong maybe. Intel is doing hyperthreading with L1/L2 caches that are smaller than quite a few current AMD chips. And while I understand what you're saying about "cores are only ~10% of the total die space", I also understand that a 2x increased register file would be considerably smaller. The net effect of which would be, I dunno, half the performance of an additional core for one quarter (or less) increase in die space?
My point is that SMT is not just something you bolt on the side of your processor. We saw with P4 what that would do.

Northwood's 8 KB 2-way D$, the halving of the per-process ROB entries and the thrashing of the trace cache made SMT an almost sure loss for most workloads.

Prescott improved the D$ to 16 KB 4-way and doubled the trace caches. It had better SMT performance as a result.

Core i7 has a 32 KB 8-way D$ and similar instruction caches. Ci7 increases the ROB entries to 128, from 96 for the Core 2 architecture; in SMT mode each context has 64 entries (that is why you see lower performance for single-thread workloads in SMT mode). Ci7 also has a per-core L2 cache that functions as a victim cache for the D$ and I$. So while you have the same first-level caches as C2, the per-core cache system is greatly improved.
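
To make the ROB arithmetic in this post concrete, here is a minimal sketch that assumes the reorder buffer is simply split evenly between active threads. The function name is hypothetical, and the 96 and 128 entry figures are the ones quoted in the post, not checked against vendor documentation.

```python
def rob_entries_per_thread(total_entries: int, active_threads: int) -> int:
    """Entries each hardware thread gets if the ROB is split evenly (assumed policy)."""
    if active_threads < 1:
        raise ValueError("need at least one active thread")
    return total_entries // active_threads


CORE2_ROB = 96     # Core 2 figure as quoted in the post
COREI7_ROB = 128   # Core i7 figure as quoted in the post

single_thread = rob_entries_per_thread(COREI7_ROB, 1)
smt_mode = rob_entries_per_thread(COREI7_ROB, 2)

print(f"Core i7, one thread:  {single_thread} entries")
print(f"Core i7, SMT mode:    {smt_mode} entries per thread")
print(f"Core 2 for reference: {CORE2_ROB} entries")
```

Under that even-split assumption, a thread running in SMT mode sees a smaller window than a Core 2 thread did, which fits the post's remark about single-thread performance in SMT mode.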

The active register file is doubled, the architected register file is doubled.

All the critical structures are made with SMT in mind, and it shows performance-wise in multithreaded workloads.

Cheers

Last edited by Gubbi; 23-Apr-2009 at 11:44.
Old 23-Apr-2009, 12:36   #11
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 2,832
Default

Quote:
Northwood's 8 KB 2-way D$,...
The L1D array in 180 and 130nm P4s was 4-way associative.
Quote:
Prescott improved the D$ to 16 KB 4-way and doubled the trace caches.
Similar here -- Prescott and later models doubled associativity to 8-way, for the L1D, and the trace-cache size remained unchanged throughout the entire NetBurst family.

Last edited by fellix; 23-Apr-2009 at 15:04.
Old 24-Apr-2009, 11:01   #12
Gubbi
Senior Member
 
Join Date: Feb 2002
Posts: 2,646
Default

Quote:
Originally Posted by fellix View Post
The L1D array in 180 and 130nm P4's was 4-way associative.

Similar here -- Prescott and later models doubled associativity to 8-way, for the L1D, and the trace-cache size remained unchanged throughout the entire NetBurst family.
Right, my bad.

I keep making that mistake, assuming that associativity is the size of the cache divided by the page size (4 KB), as in most virtually addressed, physically tagged caches. The other notable exception is K7/K8's 2-way associative 64 KB caches.
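
The rule of thumb in this post can be sketched as follows. In a virtually indexed, physically tagged cache, the index bits fit inside the page offset only if each way is no larger than a page, so cache size divided by page size gives the smallest associativity that avoids aliasing without extra hardware. It is a lower bound, not the actual associativity. The helper name is hypothetical, and the cache sizes and way counts are the ones quoted in this thread.

```python
PAGE_SIZE = 4 * 1024  # 4 KB pages, as in the post


def min_alias_free_ways(cache_bytes: int, page_bytes: int = PAGE_SIZE) -> int:
    """Smallest associativity for which one way is no larger than a page."""
    return max(1, -(-cache_bytes // page_bytes))  # ceiling division


# (size in bytes, associativity) as stated by posters in this thread
caches = {
    "Northwood L1D": (8 * 1024, 4),
    "Prescott L1D": (16 * 1024, 8),
    "Core i7 L1D": (32 * 1024, 8),
    "K7/K8 L1D": (64 * 1024, 2),
}

for name, (size, ways) in caches.items():
    bound = min_alias_free_ways(size)
    verdict = "meets" if ways >= bound else "falls below"
    print(f"{name}: {ways}-way {verdict} the size/page bound of {bound}")
```

On these figures the P4 caches sit above the bound, while the K7/K8 cache falls below it, which is why the post calls it an exception: a design like that has to handle aliasing by some other means.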

Cheers
Old 29-Apr-2009, 14:39   #13
hoom
Senior Member
 
Join Date: Sep 2003
Posts: 2,078
Default

So I've been trying to work out what is so special about this setup: Why put a 2nd int pipeline into a core when we're in the multi-core period?
Why not just make another core???

Finally I think I get it.
It's about making much better use of the FPU/SIMD unit.

Currently (& presumably for the foreseeable future according to AMD?) CPU instruction ratio int:FP/SIMD must be typically below 2:1?
ie at least half the time that big FPU unit on a modern x86 CPU is sitting idle.

So to make better use of that silicon, you share one FPU between two int pipelines in a 'cluster'.

You get 2 full speed int threads.
The scheduler can reorder FPU ops to prevent conflicts where both threads are trying to use the FPU at once.
(Could you even schedule some FPU ops to do work on both threads at once? ie a 64bit op from thread A + one from thread B? or 2 * 32bit from thread B?!)
Scheduling should be easier than Intel's macro fusion ops etc.

When per thread int:FPU is 2:1, the FPU will be sitting at 100% utilisation & both threads running basically the same speed as if they were on 2 separate cores.
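
A toy model of the sharing argument above, under a loudly stated assumption: the post's 2:1 break-even point is read as each thread keeping the FPU busy for about half of its cycles. That reading is an interpretation, not something the patent diagrams state, and both function names are hypothetical.

```python
def shared_fpu_utilisation(fp_busy_fraction: float, threads: int = 2) -> float:
    """Fraction of cycles one shared FPU is busy, capped at fully busy."""
    return min(1.0, fp_busy_fraction * threads)


def fp_stretch_factor(fp_busy_fraction: float, threads: int = 2) -> float:
    """How much FP work is stretched once combined demand exceeds one FPU."""
    return max(1.0, fp_busy_fraction * threads)


for busy in (0.25, 0.5, 0.75):
    util = shared_fpu_utilisation(busy)
    stretch = fp_stretch_factor(busy)
    print(f"per-thread FPU demand {busy:.2f}: "
          f"utilisation {util:.2f}, FP stretch x{stretch:.2f}")
```

In this simplified picture, two threads that each need the FPU half the time fill it completely without slowing each other down; anything heavier on FP starts to contend, and anything lighter leaves the unit partly idle.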

A 3ghz quad core of that would be pretty impressive I think.
Shame it's not coming later this year but 2011 :-/
__________________
But it's DOUBLE CONFIRMED
Old 29-Apr-2009, 15:08   #14
Scali
Naughty Boy!
 
Join Date: Nov 2003
Posts: 2,127
Default

Quote:
Originally Posted by Albuquerque View Post
I still struggle to understand why they haven't gone down this road, seriously. In my own opinion, they had far more room for success in SMT with all IMC architectures versus everything and anything Intel had before the i7.
Makes you wonder. Quite a few people were convinced that HT only worked because Pentium 4 was so inefficient to begin with. A more efficient architecture (eg K7/K8) would not have enough spare resources for a second thread to take advantage of (many people thought that Intel was crazy when they first heard of HT making a comeback on Nehalem).
Perhaps even AMD's engineers believed this?

Or perhaps an efficient SMT-implementation is so hard to do that AMD simply didn't have the resources.
All I know is that they will need it if they want any chance of competing at all in the future. Currently Intel's new Nehalem dual-socket servers are a threat to AMD's quad-socket systems, and SMT plays an important role in that (especially when it comes to capacity for virtual machines).
Old 29-Apr-2009, 17:45   #15
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,281
Default

The description of the FP unit shows support for FMAC and 64-128 bit maximum operand width.
From a silicon point of view, we have a total of 4 INT units per core, up 25% from the 3 in a current Opteron core.
The FP unit is going to support operations that could force its size up at least by that much. An FMAC would require at least 50% more operand bandwidth, and the bit width could be enough to bloat the FP unit up as well.
The proportion of idling silicon isn't massively changed or it could be even more slanted in favor of the FP unit.

I think it could be that the design isn't sharing a deemphasized FP unit, but instead it is balanced around several critical resources, some of which might be more related to a much more powerful FP unit than they are for integer execution.

Clustering points to a certain amount of deemphasis of peak integer execution.
Highest peak would be a big expensive 4-way scheduler and a big expensive crossbar servicing all 4 integer lanes.
AMD has cut these into two half-sized entries. This is actually a net savings, as a lot of common circuits for superscalar issue scale quadratically with peak width.
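
The "net savings" claim can be illustrated with a back-of-the-envelope sketch. It assumes the cost of the issue logic grows exactly with the square of the issue width, which is a simplification of "scale quadratically"; the function name and the unit-free cost figures are hypothetical.

```python
def relative_issue_cost(width: int, exponent: float = 2.0) -> float:
    """Unit-free cost of issue logic for a given width (assumed width**2)."""
    return float(width) ** exponent


monolithic = relative_issue_cost(4)      # one 4-wide scheduler and crossbar
clustered = 2 * relative_issue_cost(2)   # two independent 2-wide clusters

print(f"one 4-wide block:   {monolithic:.0f} units")
print(f"two 2-wide blocks:  {clustered:.0f} units")
print(f"clustered/monolithic ratio: {clustered / monolithic:.2f}")
```

Under that assumption the two half-width clusters cost about half of what a single full-width block would, at the price of neither cluster being able to issue 4-wide on its own.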

The front end has been increased significantly. It's 4-wide, but if AMD uses the same symmetric decoder, it is significantly more expensive to implement than the complex-simple-simple-simple scheme used by Intel.
The rename stage works in terms of 4 instructions, which is also expensive.
It then feeds, however, integer clusters that are physically incapable of that kind of throughput.

As a result a very expensive front end is amortized over more threads.
The slimmer integer clusters with private schedulers can also do more speculation, since they do not speculate over as wide an integer pipeline.
Other patents hint at attempts to reduce the complexity of the integer register file.

The cache bandwidth is also much higher. 4 data cache loads in total doubles what Opteron can do.
However, each integer cluster has access to only one L1 capable of two loads.
The FPU, however can hit both, which is something that data-hungry FP really needs.
The FPU, being separate, can also go with less speculation, which doesn't benefit it as much, and may also have more register ports to support FMAC.

Peak single-threaded integer performance would be increased, if clocks and other things oblige, but the FP unit looks like it might be the big winner.
Old 29-Apr-2009, 18:36   #16
I.S.T.
Senior Member
 
Join Date: Feb 2004
Posts: 2,482
Default

Quote:
Originally Posted by 3dilettante View Post
From a silicon point of view, we have a total of 4 INT units per core, up 25% from the 3 in a current Opteron core.
Don't you mean 33.3%?
Old 29-Apr-2009, 18:45   #17
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,281
Default

Sure, non-distracted math works too.
Old 02-May-2009, 09:09   #18
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 2,832
Default

AMD64 Architecture Programmer’s Manual:
128-Bit and 256-Bit XOP, FMA4 and CVT16 Instructions
Old 02-May-2009, 19:29   #19
Raqia
Member
 
Join Date: Oct 2003
Posts: 324
Default

Quote:
Originally Posted by fellix View Post
AMD64 Architecture Programmer’s Manual:
128-Bit and 256-Bit XOP, FMA4 and CVT16 Instructions
Looks like AMD is adopting Intel's AVX instructions. Seems like a reasonable move, and it sounds like it's not too big a deal to share floating point resources between non-orthogonal to x86 registers (except they'll have to double the width to 256 bit in this case). We've seen this since 3dnow! Pro which also handled SSE on the Athlon XP.
Old 03-May-2009, 08:46   #20
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,079
Default

Sounds cool, I wonder which chip of theirs will have it.
Old 03-May-2009, 15:26   #21
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,079
Default

It is interesting that it has not been given any particular name like SSEx or AVX. Or are they just introducing new instructions now and will implement them when they see fit?
Old 04-May-2009, 03:19   #22
Raqia
Member
 
Join Date: Oct 2003
Posts: 324
Default

Quote:
Originally Posted by rpg.314 View Post
It is interesting that it has not been given any particular name like SSEx or AVX. Or are they just introducing new instructions now and will implement them when they see fit?
I'm pretty sure that documentation is just describing their implementation of AVX, or rather a superset of it, with a few additional instructions from SSE5 like FMAC that don't yet have an Intel equivalent.
Old 04-May-2009, 06:16   #23
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,079
Default

What was pretty interesting was a) that a lot of integer MAC and MADD instructions were implemented. Integer multiplication rarely gets any love from SSEx, but here they seem to be almost half of all instructions.

b) Also the cast to half float from float and vice versa. There seems to be no equivalent in AVX.

c) The extraction of the fractional part of a floating point number. That will help in a lot of the transcendental functions required for an OpenCL implementation. And since OpenCL code is JIT compiled at runtime, it's not a big problem if it is not adopted by Intel (like with 3DNow!).
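
To show why point c) matters, here is a minimal sketch of range reduction for a base-2 exponential: split the argument into an integer part and a fractional part, evaluate the hard bit only on [0, 1), and rescale by a power of two. The floor-based `frac` helper is an assumption standing in for a hardware fraction-extract operation (the real instruction's rounding behaviour is defined in AMD's manual, not here), and `2.0 ** f` stands in for the polynomial a real math library would use.

```python
import math


def frac(x: float) -> float:
    """Fractional part in [0, 1), using floor (assumed semantics)."""
    return x - math.floor(x)


def exp2_via_range_reduction(x: float) -> float:
    """Compute 2**x as 2**floor(x) * 2**frac(x)."""
    n = math.floor(x)          # integer part, handled by exponent scaling
    f = frac(x)                # reduced argument in [0, 1)
    core = 2.0 ** f            # placeholder for a short polynomial on [0, 1)
    return math.ldexp(core, n)  # multiply by 2**n exactly


for value in (10.5, -3.25, 0.0):
    print(value, exp2_via_range_reduction(value), 2.0 ** value)
```

The same split-and-rescale pattern shows up in periodic functions too, where the fractional part of the argument divided by the period is all that has to be evaluated.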
Old 06-May-2009, 09:09   #24
dkanter
Regular
 
Join Date: Jan 2008
Posts: 354
Default

SMT is extremely difficult to verify. It was not 'bolted on' to the P4, it was designed in from day one, and turned off for over a year and a half. And remember Intel has more engineering resources than AMD.

The upside is big though, as average IPC is typically 0.5-1 for most CPUs w/o SMT.

Another issue is that AMD's architecture would need some pretty serious modifications. Their L1D cache associativity is really way too low, and that's with a single thread running. With 2 threads it would be practically intolerable. And that L1D is definitely on the critical path, so once you try and increase associativity, latency probably increases, which means you need to resize the TLBs, buffering and pipeline depth, etc. etc.

SMT is a huge win though...it's a pretty obvious sign when everyone else (Sun, IBM, Intel, NV, ATI) has already gotten on the bandwagon. Hell, even some embedded CPUs like RMI have multi-threading.

That being said, it's pretty clear to me that it would be a huge undertaking for AMD to verify SMT. Definitely the kind of thing that would have made sense for Barcelona, but it's tantamount to doing an entirely new uarch.

DK
__________________
www.realworldtech.com
Old 06-May-2009, 11:25   #25
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 2,832
Default

Yup, it's like a balanced trade off -- poor associativity for larger size, low access latency and sheer R/W multi-banked throughput (the K10's L1D impl is the fastest solution in this regard up to this day). It has been a main philosophy for AMD's architectures since the very first K7.


All times are GMT +1.

