|
|
#1 |
|
Member
Join Date: Oct 2003
Posts: 324
|
It looks substantially like Sun's Rock processor:
http://citavia.blog.de/ and these specs seem to gel well with some previous reports about Bulldozer. The above was originally pulled from: http://www.planet3dnow.de/vbulletin/...42#post3849342 Also some other rumors to take with a grain of salt: http://brightsideofnews.com/news/200....aspx?pageid=0 |
|
|
|
|
|
#2 |
|
Senior Member
|
I wish they would implement hardware multithreading in it.
|
|
|
|
|
|
#3 |
|
Red-headed step child
Join Date: Jun 2004
Location: Guess ;)
Posts: 3,089
|
I still struggle to understand why they haven't gone down this road, seriously. In my own opinion, they had far more room for success in SMT with all IMC architectures versus everything and anything Intel had before the i7.
They've had excellent main memory bandwidth, excellent inter-core communications, and plenty of execution resources on the die. All the basics are there, so why haven't they gone those last few steps and really opened this up?
__________________
"...twisting my words" |
|
|
|
|
|
#4 | |
|
Senior Member
Join Date: Feb 2002
Posts: 2,646
|
Quote:
AMD's cores are relatively small; each core in a quad-core Phenom 2 is around 10% of the entire die. The reasoning behind not developing SMT is probably that they might as well just double the number of cores and get double the performance in multithreaded scenarios. Unfortunately they do not enjoy the excessive fab capacity Intel does. AMD cores are descendants of the original K7. They have had 2 (or 3?) internal new-architecture projects cancelled since the K7 came out, and I'm sure one of those contemplated SMT. Cheers
__________________
I'm pink, therefore I'm spam |
|
|
|
|
|
|
#5 | ||
|
Red-headed step child
Join Date: Jun 2004
Location: Guess ;)
Posts: 3,089
|
Quote:
Again, they're (my opinion) in a far better position than Intel in terms of SMT's ability to make a difference - they've had far better IPC (inter-processor communication) and memory subsystem technology for a very long time, at least up until the i7 finally hit the street. I think this should've given them far more opportunity to deliver a highly successful SMT implementation. Quote:
__________________
"...twisting my words" |
||
|
|
|
|
|
#6 |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,281
|
One of the pitfalls to SMT on a complex OoO processor is that it does expand the engineering resources needed to properly design and verify.
If the diagram and patent applications are indicative of Bulldozer--and here we need to be cautious, as patents do not always make it to implementation--we see that AMD has made the decision to simplify a number of things at the core unit level to make way for more complexity in speculation. The aggressive speculation is also an argument against SMT, since slots consumed by speculation cannot be doled out to other threads.
__________________
Dreaming of a .065 micron etch-a-sketch. |
|
|
|
|
|
#7 |
|
Red-headed step child
Join Date: Jun 2004
Location: Guess ;)
Posts: 3,089
|
Really, my post only serves as a note of sadness and despair for AMD's current processor lineup. They had so many opportunities and so much time to make something awesome, and yet here we are with K7 part 22.
Let's do something new AMD, seriously. Let's get back into the game; let's do something with those R&D resources that have seemingly been idle for the last half decade. Come on guys, bring the pain or something!
__________________
"...twisting my words" |
|
|
|
|
|
#8 |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,281
|
AMD did try to bring the pain, repeatedly.
It had at least one false start with K8, and it had several false starts (something like six months' delay each) prior to Barcelona, which itself was a faceplant. Complex designs need significant resources and time, and we can see the disparity between AMD and Intel in the amount of resources they have in reserve for such efforts. The multiyear gap before the long-delayed Bulldozer (whichever design they've settled on, after however many they've scrapped) points to a significant limitation of means. I'm curious about how much the layoffs have hit the engineering and design groups, and it's also not clear to me whether the engineering executives whose tenure most matches the abortive attempts at a K8 successor have been culled or, like the current AMD CEO, just got promoted.
__________________
Dreaming of a .065 micron etch-a-sketch. |
|
|
|
|
|
#9 |
|
Senior Member
|
Hm, that clustered approach is intriguing to me. It looks like each cluster is a scaled-down version of the integer block found in K10. The L1D cache will probably stay the same dual-ported, bank-differential array with high throughput for arbitrary access, but some details could be touched, like the size (probably halved, per cluster) and the associativity (doubled or quadrupled, to compensate).
Everything here points to heavy "modularization" down to the lowest architectural level.
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic. Microsoft: Russia -- Big and bloated. Linux: EU -- Diverse and broke. |
|
|
|
|
|
#10 | |
|
Senior Member
Join Date: Feb 2002
Posts: 2,646
|
Quote:
Northwood's 8 KB 2-way D$, the halving of the per-process ROB entries and the thrashing of the trace cache made SMT an almost sure loss for most workloads. Prescott improved the D$ to 16 KB 4-way and doubled the trace cache. It had better SMT performance as a result. Core i7 has a 32 KB 8-way D$, with similar instruction caches. Ci7 increases the ROB entries to 128, from 96 for the Core 2 architecture; in SMT mode each context has 64 entries (that is why you see lower performance for single-thread workloads in SMT mode). Ci7 also has a per-core L2 cache that functions as a victim cache for the D$ and I$. So while you have the same first-level caches as C2, the per-core cache system is greatly improved. The active register file is doubled, the architected register file is doubled. All the critical structures are made with SMT in mind, and it shows performance-wise in multithreaded workloads. Cheers
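To put the ROB numbers above in perspective, here is a minimal sketch that assumes a purely static, even split of the reorder buffer between hardware threads. The function name and the even-split model are assumptions made for illustration; only the 128- and 96-entry figures come from the post.

```python
def rob_entries_per_thread(total_entries: int, active_threads: int) -> int:
    """Entries each thread sees if the ROB is split evenly (assumed model)."""
    if active_threads < 1:
        raise ValueError("need at least one active thread")
    return total_entries // active_threads


CORE2_ROB = 96   # figure quoted in the post
CI7_ROB = 128    # figure quoted in the post

smt_window = rob_entries_per_thread(CI7_ROB, 2)
print(f"Ci7, SMT off: {rob_entries_per_thread(CI7_ROB, 1)} entries per thread")
print(f"Ci7, SMT on:  {smt_window} entries per thread")
# With SMT on, a single thread gets a smaller window than Core 2 offered.
print(f"SMT-on window relative to Core 2: {smt_window / CORE2_ROB:.0%}")
```

Under that assumption a lone thread running in SMT mode works with roughly two thirds of the Core 2 window, which fits the observation about lower single-thread performance.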
__________________
I'm pink, therefore I'm spam Last edited by Gubbi; 23-Apr-2009 at 11:44. |
|
|
|
|
|
|
#11 | ||
|
Senior Member
|
Quote:
Quote:
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic. Microsoft: Russia -- Big and bloated. Linux: EU -- Diverse and broke. Last edited by fellix; 23-Apr-2009 at 15:04. |
||
|
|
|
|
|
#12 | |
|
Senior Member
Join Date: Feb 2002
Posts: 2,646
|
Quote:
I keep making that mistake, assuming that associativity is the size of the cache divided by the page size (4KB), as in most virtually indexed, physically tagged caches. Another notable exception is K7/8's 2-way associative 64KB caches. Cheers
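The rule of thumb mentioned above can be written down directly. This is a sketch of the rule only, not a statement about any particular chip: it assumes the cache is indexed purely with page-offset bits, so each way can be no larger than one page. The function name is made up for the example.

```python
def min_ways_for_vipt(cache_bytes: int, page_bytes: int = 4096) -> int:
    """Smallest associativity for which one way fits inside a single page.

    Assumes the index uses only page-offset bits (the rule of thumb above).
    """
    if cache_bytes <= 0 or page_bytes <= 0:
        raise ValueError("sizes must be positive")
    return max(1, -(-cache_bytes // page_bytes))  # ceiling division


for size_kb in (8, 16, 32, 64):
    ways = min_ways_for_vipt(size_kb * 1024)
    print(f"{size_kb:>2} KB cache, 4 KB pages -> rule of thumb says {ways}-way")
```

A cache that ships with fewer ways than this, such as the 64KB 2-way K7/8 L1 mentioned above, simply does not follow the rule and has to deal with aliasing by other means.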
__________________
I'm pink, therefore I'm spam |
|
|
|
|
|
|
#13 |
|
Senior Member
Join Date: Sep 2003
Posts: 2,078
|
So I've been trying to work out what is so special about this setup: Why put a 2nd int pipeline into a core when we're in the multi-core period?
Why not just make another core??? Finally I think I get it. It's about making much better use of the FPU/SIMD unit. Currently (& presumably for the foreseeable future according to AMD?) the CPU instruction ratio int:FP/SIMD must typically be below 2:1? i.e. at least half the time that big FPU unit on a modern x86 CPU is sitting idle. So to make better use of that silicon, you share one FPU between two int pipelines in a 'cluster'. You get 2 full-speed int threads. The scheduler can reorder FPU ops to prevent conflicts where both threads are trying to use the FPU at once. (Could you even schedule some FPU ops to do work on both threads at once? i.e. a 64-bit op from thread A + one from thread B? or 2 * 32-bit from thread B?!) Scheduling should be easier than Intel's macro fusion ops etc. When per-thread int:FPU is 2:1, the FPU will be sitting at 100% utilisation & both threads will be running at basically the same speed as if they were on 2 separate cores. A 3 GHz quad core of that would be pretty impressive I think. Shame it's not coming later this year but 2011 :-/
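The utilisation argument above can be turned into a toy throughput model. Everything numeric here is an assumption: two threads, two integer ops per cycle per thread, and a shared FPU that retires two ops per cycle, chosen so that a 2:1 int:FP mix just saturates the FPU as the post suggests. The function name is hypothetical and the model ignores latency, dependencies and scheduling conflicts.

```python
def shared_fpu_model(int_to_fp_ratio, threads=2,
                     int_ops_per_cycle=2.0, fpu_ops_per_cycle=2.0):
    """Return (FPU utilisation, fraction of peak integer rate per thread).

    Each thread issues one FP op per `int_to_fp_ratio` integer ops. If the
    combined FP demand exceeds what the shared FPU can retire, every thread
    is slowed down by the same factor.
    """
    if int_to_fp_ratio <= 0:
        raise ValueError("int:FP ratio must be positive")
    fp_demand = threads * int_ops_per_cycle / int_to_fp_ratio
    utilisation = min(1.0, fp_demand / fpu_ops_per_cycle)
    relative_speed = min(1.0, fpu_ops_per_cycle / fp_demand)
    return utilisation, relative_speed


for ratio in (4.0, 2.0, 1.0):
    util, speed = shared_fpu_model(ratio)
    print(f"int:FP = {ratio:.0f}:1 -> FPU utilisation {util:.0%}, "
          f"per-thread speed {speed:.0%} of peak")
```

With these made-up widths, a 4:1 mix leaves the FPU half idle, 2:1 keeps it fully busy without slowing either thread, and 1:1 makes the shared FPU the bottleneck.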
__________________
But it's DOUBLE CONFIRMED |
|
|
|
|
|
#14 | |
|
Naughty Boy!
|
Quote:
Perhaps even AMD's engineers believed this? Or perhaps an efficient SMT-implementation is so hard to do that AMD simply didn't have the resources. All I know is that they will need it if they want any chance of competing at all in the future. Currently Intel's new Nehalem dual-socket servers are a threat to AMD's quad-socket systems, and SMT plays an important role in that (especially when it comes to capacity for virtual machines). |
|
|
|
|
|
|
#15 |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,281
|
The description of the FP unit shows support for FMAC and 64-128 bit maximum operand width.
From a silicon point of view, we have a total of 4 INT units per core, up by a third from the 3 in a current Opteron core. The FP unit is going to support operations that could force its size up at least by that much. An FMAC would require at least 50% more operand bandwidth, and the bit width could be enough to bloat the FP unit up as well. The proportion of idling silicon isn't massively changed or it could be even more slanted in favor of the FP unit. I think it could be that the design isn't sharing a deemphasized FP unit, but instead it is balanced around several critical resources, some of which might be more related to a much more powerful FP unit than they are for integer execution. Clustering points to a certain amount of deemphasis of peak integer execution. Highest peak would be a big expensive 4-way scheduler and a big expensive crossbar servicing all 4 integer lanes. AMD has cut these into two half-sized entries. This is actually a net savings, as a lot of common circuits for superscalar issue scale quadratically with peak width. The front end has been increased significantly. It's 4-wide, but if AMD uses the same symmetric decoder, it is significantly more expensive to implement than the complex-simple-simple-simple scheme used by Intel. The rename stage works in terms of 4 instructions, which is also expensive. It then feeds, however, integer clusters that are physically incapable of that kind of throughput. As a result a very expensive front end is amortized over more threads. The slimmer integer clusters with private schedulers can also do more speculation, since they do not speculate over as wide an integer pipeline. Other patents hint at attempts to reduce the complexity of the integer register file. The cache bandwidth is also much higher. 4 data cache loads in total doubles what Opteron can do. However, each integer cluster has access to only one L1 capable of two loads. The FPU, however, can hit both, which is something that data-hungry FP really needs.
The FPU, being separate, can also go with less speculation that doesn't benefit it as much, and may also have more register ports to support FMAC. Peak single-threaded integer performance would be increased, if clocks and other things oblige, but the FP unit looks like it might be the big winner.
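A quick back-of-the-envelope check of two claims in this post, as a sketch only. The cost function is a stand-in: it assumes issue logic cost grows exactly with the square of the width, which is an idealisation of "scales quadratically", and the units are arbitrary. The operand counts assume a fused multiply-add reads three sources against two for a plain multiply or add.

```python
def relative_issue_cost(width: int, exponent: float = 2.0) -> float:
    """Idealised cost of issue/bypass logic for a given width (arbitrary units)."""
    if width < 1:
        raise ValueError("width must be at least 1")
    return float(width) ** exponent


monolithic = relative_issue_cost(4)      # one 4-wide integer scheduler
clustered = 2 * relative_issue_cost(2)   # two private 2-wide clusters
print(f"4-wide: {monolithic:.0f} units, 2 x 2-wide: {clustered:.0f} units")
print(f"Saving from clustering: {1 - clustered / monolithic:.0%}")

# Operand bandwidth: a*b+c reads three sources, a*b or a+b reads two.
FMA_SOURCES, TWO_OPERAND_SOURCES = 3, 2
extra_bandwidth = FMA_SOURCES / TWO_OPERAND_SOURCES - 1
print(f"Extra operand bandwidth for FMAC: {extra_bandwidth:.0%}")
```

Under the pure quadratic assumption, splitting one 4-wide block into two 2-wide blocks halves the cost of that logic, and the three-source FMAC accounts for the 50% figure.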
__________________
Dreaming of a .065 micron etch-a-sketch. |
|
|
|
|
|
#16 |
|
Senior Member
Join Date: Feb 2004
Posts: 2,482
|
|
|
|
|
|
|
#17 |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,281
|
Sure, non-distracted math works too.
__________________
Dreaming of a .065 micron etch-a-sketch. |
|
|
|
|
|
#18 |
|
Senior Member
|
AMD64 Architecture Programmer’s Manual:
128-Bit and 256-Bit XOP, FMA4 and CVT16 Instructions
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic. Microsoft: Russia -- Big and bloated. Linux: EU -- Diverse and broke. |
|
|
|
|
|
#19 | |
|
Member
Join Date: Oct 2003
Posts: 324
|
Quote:
|
|
|
|
|
|
|
#20 |
|
Senior Member
|
Sounds cool, I wonder which chip of theirs will have it.
|
|
|
|
|
|
#21 |
|
Senior Member
|
It is interesting that it has not been given any particular name like SSEx or AVX. Or are they just introducing new instructions now and will implement them when they see fit?
|
|
|
|
|
|
#22 |
|
Member
Join Date: Oct 2003
Posts: 324
|
I'm pretty sure that documentation is just describing their implementation of AVX, or rather a superset of it, with a few additional instructions from SSE5, like FMAC, that don't yet have an Intel equivalent.
|
|
|
|
|
|
#23 |
|
Senior Member
|
What was pretty interesting:
a) A lot of integer MAC and MADD instructions were implemented. Integer multiplication rarely gets any love from SSEx, but here they seem to be almost half of all instructions.
b) Also the cast to half float from float and vice versa. There seems to be no equivalent in AVX.
c) The extraction of the fractional part of a floating-point number. That will help in a lot of the transcendental functions required for an OpenCL implementation. And since OpenCL code is JIT-compiled at runtime, it's not a big problem if it is not adopted by Intel (like with 3DNow!). |
|
|
|
|
|
#24 |
|
Regular
Join Date: Jan 2008
Posts: 354
|
SMT is extremely difficult to verify. It was not 'bolted on' to the P4, it was designed in from day one, and turned off for over a year and a half. And remember Intel has more engineering resources than AMD.
The upside is big though, as average IPC is typically 0.5-1 for most CPUs w/o SMT. Another issue is that AMD's architecture would need some pretty serious modifications. Their L1D cache associativity is really way too low, and that's with a single thread running. With 2 threads it would be practically intolerable. And that L1D is definitely on the critical path, so once you try to increase associativity, latency probably increases, which means you need to resize the TLBs, buffering and pipeline depth, etc. etc. SMT is a huge win though...it's a pretty obvious sign when everyone else (Sun, IBM, Intel, NV, ATI) has already gotten on the bandwagon. Hell, even some embedded CPUs like RMI have multi-threading. That being said, it's pretty clear to me that it would be a huge undertaking for AMD to verify SMT. Definitely the kind of thing that would have made sense for Barcelona, but it's tantamount to doing an entirely new uarch. DK
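To illustrate the associativity worry, here is a toy LRU cache simulation. It is a deliberately worst-case sketch, not a model of any real L1D: the 64 KB size, 64-byte lines, strict LRU and the access pattern (every stream hammering one line that maps to the same set) are all assumptions chosen to show conflict misses. The class and function names are made up for the example.

```python
from collections import OrderedDict


class SetAssocCache:
    """Tiny set-associative cache with strict LRU replacement."""

    def __init__(self, size_bytes, ways, line_bytes=64):
        self.ways = ways
        self.line_bytes = line_bytes
        self.num_sets = size_bytes // (ways * line_bytes)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        line = addr // self.line_bytes
        index, tag = line % self.num_sets, line // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)          # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            if len(lines) >= self.ways:
                lines.popitem(last=False)   # evict least recently used
            lines[tag] = True


def miss_rate(ways, streams, rounds=1000, size_bytes=64 * 1024):
    """Round-robin over `streams` lines that all map to the same set."""
    cache = SetAssocCache(size_bytes, ways)
    stride = cache.num_sets * cache.line_bytes  # this far apart = same set
    addrs = [i * stride for i in range(streams)]
    for _ in range(rounds):
        for addr in addrs:
            cache.access(addr)
    return cache.misses / (cache.hits + cache.misses)


print(f"2-way, 2 conflicting lines (one thread):  {miss_rate(2, 2):.1%} misses")
print(f"2-way, 4 conflicting lines (two threads): {miss_rate(2, 4):.1%} misses")
print(f"8-way, 4 conflicting lines (two threads): {miss_rate(8, 4):.1%} misses")
```

In this contrived pattern the 2-way cache goes from almost no misses to missing on every access once a second thread's lines land in the same set, while an 8-way cache of the same size absorbs them. Real workloads are far less pathological, so treat it as an illustration of the mechanism only.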
__________________
www.realworldtech.com |
|
|
|
|
|
#25 |
|
Senior Member
|
Yup, it's like a balanced trade off -- poor associativity for larger size, low access latency and sheer R/W multi-banked throughput (the K10's L1D impl is the fastest solution in this regard up to this day). It has been a main philosophy for AMD's architectures since the very first K7.
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic. Microsoft: Russia -- Big and bloated. Linux: EU -- Diverse and broke. |
|
|
|
| Tags |
| amd, blewdozer, oh well, patents |
|
|