AMD Bulldozer Core Patent Diagrams

Discussion in 'PC Industry' started by Raqia, Apr 16, 2009.

  1. Raqia

    Regular

    Joined:
    Oct 31, 2003
    Messages:
    494
  2. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,294
    Location:
    /
I wish they would implement hardware multithreading in it.
     
  3. Albuquerque

    Albuquerque Red-headed step child
    Veteran

    Joined:
    Jun 17, 2004
    Messages:
    3,744
    Location:
    Guess ;)
I still struggle to understand why they haven't gone down this road, seriously. In my opinion, they had far more room for success with SMT in their IMC architectures than anything Intel had before the i7.

They've had excellent main memory bandwidth, excellent inter-core communications, and plenty of execution resources on the die. All the basics are there, so why haven't they taken those last few steps and really opened this up?
     
  4. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,077
They would have to increase caches, ROBs and register files -- all time-critical structures. And all just to get higher utilization of the execution units.

AMD's cores are relatively small; each core in a quad-core Phenom II is around 10% of the entire die. The reasoning behind not developing SMT is probably that they might as well just double the number of cores and get double the performance in multithreaded scenarios. Unfortunately, they do not enjoy the excess fab capacity Intel does.

    AMD cores are descendants of the original K7, they have had 2 (or 3?) internal new architecture projects cancelled since the K7 came out, I'm sure one of those contemplated SMT.

    Cheers
     
  5. Albuquerque

    Albuquerque Red-headed step child
    Veteran

    Joined:
    Jun 17, 2004
    Messages:
    3,744
    Location:
    Guess ;)
Register files, yes. The others are a strong maybe. Intel is doing hyperthreading with L1/L2 caches that are smaller than those of quite a few current AMD chips. And while I understand what you're saying about cores being only ~10% of the total die space, I also understand that a 2x larger register file would be considerably smaller than that. The net effect would be, I dunno, half the performance of an additional core for a one-quarter (or less) increase in die space?

    Again, they're (my opinion) in a far better position than Intel in terms of SMT's ability to make a difference - they've had far better IPC (inter-processor communication) and memory subsystem technology for a very long time, at least up until the i7 finally hit the street. I think this should've given them far more opportunity to deliver a highly successful SMT implementation.

    This is probably the worst part, and I don't disagree. I think this SMT thing is just another smaller and less obvious example of how AMD's processor innovation really seems to have stagnated over the last multiple years. Which makes me :(
     
  6. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    6,721
    Location:
    Well within 3d
    One of the pitfalls to SMT on a complex OoO processor is that it does expand the engineering resources needed to properly design and verify.

    If the diagram and patent applications are indicative of Bulldozer--and here we need to be cautious, as patents do not always make it to implementation--we see that AMD has made the decision to simplify a number of things at the core unit level to make way for more complexity in speculation.

    The aggressive speculation is also an argument against SMT, since slots consumed by speculation cannot be doled out to other threads.
     
  7. Albuquerque

    Albuquerque Red-headed step child
    Veteran

    Joined:
    Jun 17, 2004
    Messages:
    3,744
    Location:
    Guess ;)
    Really, my post only serves as a note of sadness and despair for AMD's current processor lineup. They had so many opportunities and so much time to make something awesome, and yet here we are with K7 part 22. :(

    Let's do something new AMD, seriously. Let's get back into the game; let's do something with those R&D resources that have seemingly been idle for the last half decade. Come on guys, bring the pain or something!
     
  8. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    6,721
    Location:
    Well within 3d
    AMD did try to bring the pain, repeatedly.

    It had at least one false start with K8, it had several false starts (something like six months delay each) prior to Barcelona, which itself was a faceplant.

Complex designs need significant resources and time, and we can see the disparity between AMD and Intel in the amount of resources they have in reserve for such efforts.

The multiyear gap before the long-delayed Bulldozer (whichever design they've settled on after however many they've scrapped) points to a significant limitation of means.

I'm curious how much the layoffs have hit the engineering and design groups, and it's also not clear to me whether the engineering executives whose tenure most closely matches the abortive attempts at a K8 successor have been culled, or whether, like the current AMD CEO, they just got promoted.
     
  9. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,330
    Location:
    Varna, Bulgaria
Hm, that clustered approach is intriguing. It looks like each cluster is a scaled-down version of the integer block found in K10. The L1D cache will probably stay the same dual-ported bank-differential array with high throughput for arbitrary access, but some details could be touched, like the size (probably halved, per cluster) with doubled/quadrupled associativity to compensate.
Everything here points to heavy "modularization" down to the lowest architectural level.
     
  10. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,077
My point is that SMT is not just something you bolt onto the side of your processor. We saw with the P4 what that would do.

Northwood's 8 KB 2-way D$, the halving of per-process ROB entries, and the thrashing of the trace cache made SMT an almost sure loss for most workloads.

Prescott improved the D$ to 16KB 4-way and doubled the trace cache. It had better SMT performance as a result.

Core i7 has a 32KB 8-way D$ and similar instruction caches. Ci7 increases the ROB to 128 entries, up from 96 in the Core 2 architecture; in SMT mode each context gets 64 entries (that is why you see lower performance for single-thread workloads in SMT mode). Ci7 also has a per-core L2 cache that functions as a victim cache for the D$ and I$. So while you have the same first-level caches as C2, the per-core cache system is greatly improved.

    The active register file is doubled, the architected register file is doubled.

    All the critical structures are made with SMT in mind, and it shows performance-wise in multithreaded workloads.
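The cost of halving the per-context ROB can be sketched with Little's law (a toy model of my own, not from the thread; the 20-cycle average latency is an assumed, illustrative number for cache-missy code):

```python
# Back-of-the-envelope sketch: Little's law bounds sustainable IPC by
# window size / average completion latency, so statically halving the
# ROB per SMT context halves a lone thread's ceiling.
def max_ipc(rob_entries: int, avg_latency_cycles: int) -> float:
    """Upper bound on instructions retired per cycle."""
    return rob_entries / avg_latency_cycles

# Assumed illustrative 20-cycle average completion latency:
print(max_ipc(128, 20))  # 6.4 with the full 128-entry ROB
print(max_ipc(64, 20))   # 3.2 when SMT halves it per context
```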

    Cheers
     
    #10 Gubbi, Apr 23, 2009
    Last edited by a moderator: Apr 23, 2009
  11. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,330
    Location:
    Varna, Bulgaria
The L1D array in the 180 and 130nm P4s was 4-way associative.
Prescott and later models doubled that associativity to 8-way for the L1D, and the trace-cache size remained unchanged throughout the entire NetBurst family.
     
    #11 fellix, Apr 23, 2009
    Last edited by a moderator: Apr 23, 2009
  12. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,077
Right, my bad.

I keep making that mistake, assuming that associativity is the size of the cache divided by the page size (4KB), as in most virtually indexed, physically tagged caches. Another notable exception is the K7/K8's 2-way associative 64KB caches.
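That rule of thumb is the standard VIPT constraint -- the index bits must fit inside the page offset, i.e. cache size divided by ways must not exceed the page size -- and it can be checked quickly (a minimal sketch using the figures from this thread):

```python
# VIPT aliasing check: a virtually indexed, physically tagged cache is
# alias-free when the bytes indexed per way fit within one page.
PAGE_SIZE = 4096  # 4 KB x86 pages

def way_size(cache_bytes: int, ways: int) -> int:
    """Bytes indexed per way of a set-associative cache."""
    return cache_bytes // ways

# Nehalem L1D: 32 KB 8-way -> exactly one page per way, alias-free
assert way_size(32 * 1024, 8) == PAGE_SIZE
# K7/K8 L1: 64 KB 2-way -> 32 KB per way; the index bits spill past
# the page offset, so aliasing must be handled by other means
assert way_size(64 * 1024, 2) > PAGE_SIZE
print("both checks pass")
```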

    Cheers
     
  13. hoom

    Veteran

    Joined:
    Sep 23, 2003
    Messages:
    2,374
So I've been trying to work out what is so special about this setup: why put a 2nd int pipeline into a core when we're in the multi-core era?
Why not just make another core???

Finally I think I get it.
It's about making much better use of the FPU/SIMD unit.

Currently (& presumably for the foreseeable future, according to AMD?) the typical CPU instruction ratio of int to FP/SIMD must be below 2:1?
i.e. at least half the time, that big FPU on a modern x86 CPU is sitting idle.

So to make better use of that silicon, you share one FPU between two int pipelines in a 'cluster'.

You get 2 full-speed int threads.
The scheduler can reorder FPU ops to prevent conflicts where both threads are trying to use the FPU at once.
(Could you even schedule some FPU ops to do work on both threads at once? i.e. a 64bit op from thread A + one from thread B? Or 2 * 32bit ops from thread B?!)
Scheduling should be easier than Intel's macro-fusion of ops, etc.

When the per-thread int:FPU ratio is 2:1, the FPU will be sitting at 100% utilisation & both threads will run at basically the same speed as if they were on 2 separate cores.

A 3GHz quad core of that would be pretty impressive, I think.
Shame it's not coming later this year but in 2011 :-/
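The break-even mix described above falls out of a toy utilisation model (my own numbers, not AMD's: one shared FPU starting 1 op/cycle, each thread issuing 1 op/cycle of which some fraction needs the FPU):

```python
# Toy model of two integer threads sharing one FPU in a cluster.
def fpu_utilisation(fp_fraction: float, threads: int = 2) -> float:
    """fp_fraction: share of each thread's ops that need the FPU."""
    demand = threads * fp_fraction  # FP ops requested per cycle
    return min(demand, 1.0)         # capped by the single shared FPU

print(fpu_utilisation(0.25))  # 0.5 -> FPU half idle, threads unhindered
print(fpu_utilisation(0.50))  # 1.0 -> saturated: the break-even mix
print(fpu_utilisation(0.75))  # 1.0 -> saturated; FP ops now queue up
```

Past the break-even point the threads start stalling on the shared unit, which is exactly the conflict the scheduler would have to reorder around.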
     
  14. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
Makes you wonder. Quite a few people were convinced that HT only worked because the Pentium 4 was so inefficient to begin with. A more efficient architecture (e.g. K7/K8) would not have enough spare resources for a second thread to take advantage of (many people thought Intel was crazy when they first heard that HT was making a comeback on Nehalem).
    Perhaps even AMD's engineers believed this?

    Or perhaps an efficient SMT-implementation is so hard to do that AMD simply didn't have the resources.
    All I know is that they will need it if they want any chance of competing at all in the future. Currently Intel's new Nehalem dual-socket servers are a threat to AMD's quad-socket systems, and SMT plays an important role in that (especially when it comes to capacity for virtual machines).
     
  15. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    6,721
    Location:
    Well within 3d
    The description of the FP unit shows support for FMAC and 64-128 bit maximum operand width.
From a silicon point of view, we have a total of 4 INT units per core, up 25% from the 3 in a current Opteron core.
    The FP unit is going to support operations that could force its size up at least by that much. An FMAC would require at least 50% more operand bandwidth, and the bit width could be enough to bloat the FP unit up as well.
    The proportion of idling silicon isn't massively changed or it could be even more slanted in favor of the FP unit.

    I think it could be that the design isn't sharing a deemphasized FP unit, but instead it is balanced around several critical resources, some of which might be more related to a much more powerful FP unit than they are for integer execution.

    Clustering points to a certain amount of deemphasis of peak integer execution.
    Highest peak would be a big expensive 4-way scheduler and a big expensive crossbar servicing all 4 integer lanes.
AMD has cut these into two half-sized entities. This is actually a net savings, as a lot of common circuits for superscalar issue scale quadratically with peak width.

The front end has grown significantly. It's 4-wide, and if AMD uses the same symmetric decoders as before, it is significantly more expensive to implement than the complex-simple-simple-simple scheme used by Intel.
The rename stage works in terms of 4 instructions, which is also expensive.
It then feeds, however, integer clusters that are physically incapable of that kind of throughput.

    As a result a very expensive front end is amortized over more threads.
    The slimmer integer clusters with private schedulers can also do more speculation, since they do not speculate over as wide an integer pipeline.
    Other patents hint at attempts to reduce the complexity of the integer register file.

    The cache bandwidth is also much higher. 4 data cache loads in total doubles what Opteron can do.
    However, each integer cluster has access to only one L1 capable of two loads.
The FPU, however, can hit both, which is something data-hungry FP code really needs.
The FPU, being separate, can also get by with less speculation, which doesn't benefit it as much, and may also have more register ports to support FMAC.

Peak single-threaded integer performance would be increased, if clocks and other things oblige, but the FP unit looks like it might be the big winner.
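The "scales quadratically with peak width" point can be made concrete with a rough cost model (my own assumption: wakeup/select and bypass logic grow roughly as width squared, since each of W results may need forwarding to each of W consumers):

```python
# Illustrative cost model for superscalar issue/bypass circuits.
def issue_cost(width: int) -> int:
    """Rough area/delay units for a cluster of the given issue width."""
    return width ** 2

monolithic_4_wide = issue_cost(4)        # 16 units
two_2_wide_clusters = 2 * issue_cost(2)  # 8 units: half the cost
print(monolithic_4_wide, two_2_wide_clusters)
```

Under this model, splitting one 4-wide scheduler and crossbar into two 2-wide clusters halves the cost, which is the net savings described above.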
     
  16. I.S.T.

    Veteran

    Joined:
    Feb 21, 2004
    Messages:
    2,894
    Don't you mean 33.3%?
     
  17. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    6,721
    Location:
    Well within 3d
    Sure, non-distracted math works too.
     
  18. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,330
    Location:
    Varna, Bulgaria
  19. Raqia

    Regular

    Joined:
    Oct 31, 2003
    Messages:
    494
Looks like AMD is adopting Intel's AVX instructions. Seems like a reasonable move, and it sounds like it's not too big a deal to share floating-point resources between register sets that aren't orthogonal to x86 (except they'll have to double the width to 256 bits in this case). We've seen this since 3DNow! Professional, which also handled SSE on the Athlon XP.
     
  20. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,294
    Location:
    /
    Sounds cool, I wonder which chip of theirs will have it.
     
