AMD Bulldozer Core Patent Diagrams

Discussion in 'PC Industry' started by Raqia, Apr 16, 2009.

  1. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    6,833
    Location:
    Well within 3d
The uop cache is definitely simpler, and it doesn't have many of the glass jaws the trace cache introduced. The mapping to the L1 is straightforward, its contents aren't dependent on the sequence of execution and aren't subject to duplication, and various events that flushed the entire trace cache in the P4 don't do the same for the uop cache.

    That this could be snuck in as a parallel process to the decoder block probably took some engineering work, and it does buffer the more limited 16B decoder path.

    The memory pipeline is pretty fundamental to the functioning of the front end. Changing that probably means changing the front end, which is more significant than what a tweaked Bulldozer variant is going to get.
     
  2. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,347
    Location:
    Varna, Bulgaria
I still think AMD went a bit too far with their automated design approach for Bulldozer, especially regarding the front-end. If you look carefully at the die-shot, the front-end block is very similar to what they have been using since K8. Now, of course, the thing has been patched up somewhat for dual-threaded workloads, but I can't help thinking there are too many legacy leftovers in there, the "copy-pasted" L1i being just one of them.
     
  3. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,095
The uop cache is effectively a cache for one trace. If anything, it validates the trace cache concept: massive issue width while consuming less power.

    The problem with the P4 was that it was limited to decoding a single instruction per cycle when it missed the trace cache.

Intel picked the low-hanging fruit by exploiting the very predictable nature of loops, but I wouldn't be surprised to see more aggressive "uop" caches in the future with support for more traces. It won't be called a trace cache; that name is forever stigmatized.

    Cheers
     
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    6,833
    Location:
    Well within 3d
The more automated design flow doesn't automatically lead to the decision to have a low-associativity L1 Icache.
    The contents of the cache have changed, since branch prediction no longer has bits in the L1 like in K8.

The cache would have been reimplemented in BD, so it's not a simple matter of copy/paste. The designers looked at the major design parameters of the previous gen's L1, and they kept them.
     
  5. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    2,914
    The l1 instruction cache might be quite linked to the whole frontend, but I don't think this poses a fundamental problem for increasing cache size or associativity. Maybe they weren't able to increase associativity while not increasing latency though.
That said, are you suggesting AMD is going to stick with 2-way L1 instruction cache associativity for the next 5 years or so? Because tweaked Bulldozer is all that's on the roadmap (well, apart from the low-power designs).
FWIW, L1I associativity was 8 for Core 2, 4 on Nehalem, and now back to 8 for Sandy/Ivy Bridge. Clearly these things can be redesigned.
AMD OTOH stuck with 2-way 64KB L1 instruction/data caches forever (since the K7 days; K6 also had two-way caches, but only 2x32KB). Only BD now has a different L1D (Bobcat is also "blessed" with a 2-way 32KB L1I, though it probably makes sense there, and its 8-way 32KB L1D is actually more than what you get with BD...)
     
  6. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    6,833
    Location:
    Well within 3d
    I don't see it as having any traces. The contents of a trace cache reflects the execution path taken by the processor. If code is stored contiguously in memory as AA if BB else CC, the trace cache may contain AABB and AACC or various combinations of trace fragments.
    The uop cache is a post-decode cache that has a fixed relationship to the linearly addressed Icache, and it primarily validates a benefit for a chip running a complex ISA. To a limited extent, it gathers some low-hanging fruit for the original purpose of a trace cache--solving the problem of discontinuities in the fetch stream compromising superscalar issue.

    The memory closest to the execution pipeline has less leeway in terms of how it impacts cycle time and the stages in the pipeline that deal with fetch and prediction.
    The TLB and tag check logic would be altered if the ratio of tag and index bits changes.
One possible, if unlikely, change would be to significantly change the associativity or reduce capacity so that it matches the size/associativity ratio of Sandy Bridge.
    This would eliminate the aliasing problem entirely, and discard a portion of the cache fill pipeline used to invalidate synonyms.
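The aliasing point above can be made concrete with a little arithmetic. A minimal sketch, assuming 4KB pages and power-of-two cache parameters; `alias_bits` is a hypothetical helper name, not anything from an AMD or Intel document:

```python
# Why way size vs. page size determines virtual-index aliasing (synonyms):
# a virtually indexed cache uses index bits above the page offset only if
# each way is larger than a page, and those bits can differ between two
# virtual mappings of the same physical line.
PAGE = 4096  # assumed 4KB page size

def alias_bits(capacity_bytes, ways):
    """Count virtual index bits above the page offset for one way.
    If > 0, the same physical line can land in multiple sets."""
    way_size = capacity_bytes // ways
    bits = 0
    while way_size > PAGE:
        way_size //= 2
        bits += 1
    return bits

print(alias_bits(64 * 1024, 2))  # Bulldozer-style 64KB 2-way L1I -> 3 alias bits
print(alias_bits(32 * 1024, 8))  # Sandy Bridge-style 32KB 8-way -> 0, no synonyms
```

With a 32KB 8-way arrangement each way fits exactly in a page, so the index is effectively physical and the synonym-invalidation machinery in the fill pipeline becomes unnecessary, which is the elimination described above.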

    AMD promised little more than increases in some buffers and minor changes like new instructions for Piledriver coupled with improved clocks at a given power level, and that's what we got.
    Some reports say that more change is in store with Steamroller, more so than was promised with Piledriver.
    Since Steamroller is also meant to be on a new non-SOI node, more changes could be in the air because various parts of the pipeline will need to be adjusted anyway.

    Sandy Bridge is a bit more than a tweaked Nehalem.
    Bulldozer to Piledriver is something like the SB to IVB transition, without the node jump.

    The size and associativity haven't changed, but the BD L1 is physically different and it no longer serves as part of the branch predictor.
     
  7. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,251
    Location:
    Cleveland, OH
    The cache holds 1536 uops. According to David Kanter's Sandy Bridge article (http://www.realworldtech.com/page.cfm?ArticleID=RWT091810191937&p=4) it is accessed in 32 byte windows of 4 uops each, meaning that each uop is 8 bytes. This is actually kind of surprising since Prescott also used 64-bit uops which were generally much more limited than SB fused uops.

    That would make it 12KB, but it probably takes up significantly more space than a normal 12KB 6-way set-associative instruction cache would due to metadata mapping instruction addresses between it and the L1 instruction cache.
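The 12KB figure follows directly from the numbers quoted above. A back-of-envelope check, counting op data only (the metadata mapping back to the L1I, which the post notes would add real area, is deliberately left out):

```python
# Sizing the Sandy Bridge uop cache from the quoted parameters.
UOPS = 1536       # stated capacity of the uop cache
UOP_BYTES = 8     # 64-bit uops, per the 4-uop/32-byte window description

total_bytes = UOPS * UOP_BYTES
print(total_bytes, total_bytes // 1024)  # 12288 bytes = 12KB of op data
```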
     
  8. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,347
    Location:
    Varna, Bulgaria
    Thanks for the head up. ;)
     
  9. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    6,833
    Location:
    Well within 3d
    My interpretation is that the 32 byte window represents the 32 byte aligned chunks as represented in memory and the standard ICache.
    Any given window can be represented by up to 18 uops that can take up to 3 lines in the uop cache.
    There isn't a 1:1 correspondence between the external representation and the uop cache in terms of size or op count. I'm not sure how to arrive at 8 bytes per uop in a general case.

    There is a restriction that instructions with 64-bit immediates take up two slots, which may give a granularity to the uop cache where a 32 bit immediate can fit comfortably, so perhaps each slot is between 64 and 128 bits in length.

    The amplification of a 32 byte chunk of instructions would be variable.
    If slots are 64 bits and each way can have up to 6 uops, that's at least 48 bytes of uop, not counting metadata. A 32 byte window mapping to a fully occupied way would be amplified by a factor of 1.5.
    96 bits per uop would give a factor of 2.25.

    Mapping seems like it could be derived with a pointer and enough length counters.
There are at least 48 bits for the IP of the first instruction in the window. Then there would be a theoretical max of 18 length counters per window. The max byte length for an x86 instruction is 15, which naively makes me think 4 bits per counter. This assumes there is a valid way to pad an instruction out to 15 bytes while translating to one uop. It may be unnecessary to be that naive, because an instruction that long wouldn't leave enough room in the window for 17 additional instructions.
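The amplification figures in this post are simple ratios and can be checked mechanically. A sketch under the post's own assumptions (6 uops per way, a 32-byte aligned fetch window, and hypothetical slot widths; none of this is confirmed by Intel):

```python
# Amplification: bytes of uop-cache storage per byte of x86 code,
# for a fully occupied way mapped to one 32-byte window.
WINDOW_BYTES = 32   # aligned x86 fetch window
UOPS_PER_WAY = 6    # uops per uop-cache way

def amplification(slot_bits):
    """Ratio of op-data bytes in one full way to the 32B window it covers."""
    way_bytes = UOPS_PER_WAY * slot_bits // 8
    return way_bytes / WINDOW_BYTES

print(amplification(64))  # 48B / 32B = 1.5x
print(amplification(96))  # 72B / 32B = 2.25x
```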
     
  10. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,251
    Location:
    Cleveland, OH
I'm not talking about the 32-byte x86 instruction window that the uop cache scans from; I'm talking about the interface coming out of the uop cache, which is being labelled as 32 bytes. I may have incorrectly read a description of the former as one for the latter, though, since I was just skimming for some textual reference to the label. On second read, I think he's using 32 bytes to refer to what x86 instructions it could be "representing." The 32-byte x86 windows don't correspond to the 4 uops that can come out of the cache, though, but to the full 6-uop lines, and I think you'd be hard pressed to get any 4 uops to fit 32 bytes of x86 code. But the correlation between x86 instructions and uops is irrelevant to what I'm saying; I'm referring strictly to the size of uops here (and thus the size of the uop cache). Of course I understand/agree with what you're saying. I'm strictly interested in the uop size.

64 bits might fit as the uop size. Despite being the same size as Prescott's (generally simpler) uops, the uop fusion rules don't really add a lot of extra data per uop. If this number is incorrect, then I suspect it's not that much larger, for example 80 bits; 128 bits would really surprise me. We both agree they're fixed width though, right?

    The confusion behind fellix's original comment is probably because of Intel's claim that the uop cache performs "like a 6KB instruction cache." Going just by capacity that'd imply that a uop in the cache is worth about 4 bytes of x86 code - the average bytes/instruction in typical programs is probably well under 4, but the average uops/instruction is also going to be a bit higher than 1. Then the uop cache will have some unused parts in lines. So it's a pretty reasonable sounding estimate.
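Intel's "like a 6KB instruction cache" claim reduces to one division. A quick check of the estimate above (the 6KB figure is Intel's marketing comparison, not a measured capacity):

```python
# How many bytes of x86 code each cached uop "stands in for" under
# Intel's claim that the uop cache performs like a 6KB instruction cache.
UOPS = 1536
EFFECTIVE_ICACHE_BYTES = 6 * 1024  # Intel's stated equivalent capacity

x86_bytes_per_uop = EFFECTIVE_ICACHE_BYTES / UOPS
print(x86_bytes_per_uop)  # 4.0 bytes of x86 code per uop
```

As the post says, average x86 instruction length is typically under 4 bytes but uops/instruction runs a bit over 1, and lines have unused slots, so ~4 bytes per uop is a plausible net figure.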
     
    #1210 Exophase, Jun 8, 2012
    Last edited by a moderator: Jun 8, 2012
  11. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    6,833
    Location:
    Well within 3d
    I'm trying to think through the description of the process. The term "window" is used multiple times, but I wasn't sure what exactly is being referenced in each instance.
    I reread the description of the uop cache hit process a few times, and after thinking it through the initial guess of 64 bits fits. Upon a hit the uop cache sends the 1-3 lines to an intermediate buffer, and that buffer can take 6 uops a cycle but it has an output limit of 4 uops/32 bytes, which gives each one 64 bits. Since each way of the uop cache has 6 uops, that would make mean each way of the uop cache contains 48 bytes worth of op data plus additiona meta data. 32 sets * 8 ways *48 bytes per way / 8 bytes in 64bits gives 1.5K uops.

Fixed or mostly fixed, since ops with 64-bit immediates apparently span more than one slot according to Intel's documentation.
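The sets-times-ways decomposition in the post above checks out arithmetically. A sketch using the organization as described (32 sets, 8 ways, 6 uop slots per way, 64-bit slots; metadata excluded):

```python
# Reconstructing the 1.5K-uop / 12KB figures from the cache organization.
SETS = 32
WAYS = 8
UOPS_PER_WAY = 6
UOP_BYTES = 8  # 64-bit slots

uop_capacity = SETS * WAYS * UOPS_PER_WAY   # 1536 uops = 1.5K
op_data_bytes = uop_capacity * UOP_BYTES    # 12288 bytes of op data
print(uop_capacity, op_data_bytes)
```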
     
    #1211 3dilettante, Jun 9, 2012
    Last edited by a moderator: Jun 9, 2012
  12. itsmydamnation

    Regular

    Joined:
    Apr 29, 2007
    Messages:
    935
    Location:
    Australia
  13. Commenter

    Newcomer

    Joined:
    Jan 9, 2010
    Messages:
    219
How does ~10-15% better performance for Piledriver stack up against Intel's bridges made of sand and ivy, though?
     
  14. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    6,833
    Location:
    Well within 3d
    I will need to go back and review numbers from years back.
    On a multithreaded basis, there may be more areas where on a module to core basis Piledriver is more competitive.
    In terms of single thread performance, I will need to check where it is in relation to Westmere before worrying about SB and IB.
     
  15. itsmydamnation

    Regular

    Joined:
    Apr 29, 2007
    Messages:
    935
    Location:
    Australia
Perf per clock, it's only just up to Llano (and still behind in quite a few areas), which is about 5% better on average than Phenom II. So nowhere near Intel, but it's a big turnaround considering it's a minor revision, and the fact that at the same TDP you have almost 1000MHz over Llano, with turbo not used in that review, means a significant performance increase over Llano.

There have been statements from AMD saying Steamroller will bring their single-thread performance much closer to Intel's. I wonder what they are going to change? Decode, load/store to the FPU, number of ALUs, a trace cache?

It will be very interesting to see what Steamroller is; it looks like it's good enough for at least Sony, maybe even Microsoft. I guess we will see whether Bulldozer was a Yonah or a Northwood :lol:
     
  16. fehu

    Veteran Regular

    Joined:
    Nov 15, 2006
    Messages:
    1,115
    Location:
    Somewhere over the ocean
Are there any benchmarks around with the last Win8 preview?
     
  17. Albuquerque

    Albuquerque Red-headed step child
    Veteran

    Joined:
    Jun 17, 2004
    Messages:
    3,787
    Location:
    35.1415,-90.056
    Is this question in relation to the updated scheduling that was touted in Win8 to provide an incremental performance improvement on Bulldozer? My own speculation: I doubt it's significant enough to write about.
     
  18. hkultala

    Regular

    Joined:
    May 22, 2002
    Messages:
    276
    Location:
    Herwood, Tampere, Finland
The most important scheduler changes were already released for Windows 7 during the winter, so I don't expect big improvements over them with W8.
     
  19. fehu

    Veteran Regular

    Joined:
    Nov 15, 2006
    Messages:
    1,115
    Location:
    Somewhere over the ocean
Really? I read that the new scheduler is a Windows 8 exclusive.
     
  20. I.S.T.

    Veteran

    Joined:
    Feb 21, 2004
    Messages:
    2,938
    Incorrect, it's been in Windows 7 for months now.
     
