AMD Bulldozer Core Patent Diagrams

Discussion in 'PC Industry' started by Raqia, Apr 16, 2009.

  1. Raqia

    Regular

    Joined:
    Oct 31, 2003
    Messages:
    508
    Likes Received:
    18
    The transactional memory instructions are a big change, but these instructions could probably be added to cores without affecting the portions of the pipeline aimed at single threaded performance improvements. They could affect how Hyperthreading is implemented, and I eagerly await more details.

    For Haswell, the new FMAC and the associated framework to keep it fed could be a huge boost. AMD does seem to be rebalancing its CPU cores toward relatively more integer than floating-point performance, but it's not fabbing a massive GPU onto the same die for no reason. I think they want people to use the GPU when really heavy streams of floating-point calculations arise; the existing FPUs are sufficient to address legacy code and any sporadic floating-point math that might come up.

    It does sound like AMD is steering away slightly from its cluster-based approach, mainly to give each core more single-threaded performance. Giving a decoded-uops cache to each core could take a significant chunk of silicon per core to implement, and it sounds like it would naturally lead to splitting the decoder in two to better service each core's uop cache individually.

    Separating the usual ICache into two higher-associativity but smaller pieces seems like it might entail more complication in the Ifetcher, or even a split, which might go far enough to defeat the point of the cluster-based approach. It also probably does less for single-threaded performance and power savings than a uop cache, since the latter's contents are much "closer", pipeline-wise, to the execution units than the ICache's. From Agner Fog's tests, having only 2-way associativity in its ICache hurts BD, especially since it's servicing two threads; addressing the poor associativity sounds like a better first step than splitting it up. Whatever AMD did, reducing L1 misses by 30% is a lot...
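    The associativity point can be made concrete with a toy model. Below is a minimal LRU set-associative cache simulator; the sizes, addresses, and access pattern are all made up for illustration and are not BD's real configuration:

    ```python
    # Minimal set-associative cache with LRU replacement (illustrative only).
    def misses(accesses, ways, sets=256, line=64):
        cache = [[] for _ in range(sets)]  # each set holds tags in LRU order
        count = 0
        for addr in accesses:
            ln = addr // line
            tag, s = ln // sets, ln % sets
            way = cache[s]
            if tag in way:
                way.remove(tag)            # hit: refresh to most-recently-used
            else:
                count += 1                 # miss
                if len(way) == ways:
                    way.pop(0)             # evict least-recently-used
            way.append(tag)
        return count

    # Two threads' worth of hot code: three regions that alias to the
    # same cache sets, touched round-robin.
    stream = []
    for _ in range(1000):
        for base in (0, 1 << 20, 2 << 20):
            stream += [base + i * 64 for i in range(4)]

    print(misses(stream, ways=2))  # 12000: three aliasing regions thrash 2 ways
    print(misses(stream, ways=4))  # 12: everything fits, only cold misses
    ```

    The same total footprint behaves completely differently once the number of conflicting hot regions exceeds the associativity, which is exactly the situation two threads sharing a 2-way ICache can create.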
     
    #1261 Raqia, Aug 31, 2012
    Last edited by a moderator: Aug 31, 2012
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,555
    Likes Received:
    4,725
    Location:
    Well within 3d
    The proviso about feeding the decoders is basically a "what-if" where AMD doubles down on the decode duplication and gives them more raw bandwidth than what is currently available.
    The single-threaded case wouldn't change as much, but the dual threaded case would give an advantage relative to an SB that misses the uop cache.
    It sounds brute-force and rather aggressive, so I may have been too charitable in entertaining the thought.

    However, if the decoders are truly duplicated without degrading them, then the aggregate throughput in the case of doubles and microcode would be significantly better, since BD can't do two doubles at once and microcode blocks the front end for the other thread.
     
  3. itsmydamnation

    Veteran Regular

    Joined:
    Apr 29, 2007
    Messages:
    1,330
    Likes Received:
    444
    Location:
    Australia
    From a time-to-market, and therefore effort, perspective, wouldn't reusing the existing decode be quicker than designing a "new" 2-wide decoder? Would a 4-wide decoder make any meaningful difference to "regular consumer code"?
     
  4. Raqia

    Regular

    Joined:
    Oct 31, 2003
    Messages:
    508
    Likes Received:
    18
    My guess is that the new split decoders might have something to do with the uop cache that was mentioned. The caches are meant to improve single-threaded performance by caching the results of the decode stage rather than the instructions before it, and the contents are tied to each core separately, so I'm guessing they saw a benefit in separating the decoders for its implementation.

    The total lack of any mention also implies there isn't going to be AVX2 support until Excavator. It does seem like they revised BD to support AVX at the last minute and it wasn't an ideal implementation; hopefully their implementation of scatter/gather in AVX2 is up to snuff. Wild guess, but maybe they'll alias some of the GPU's units for the CPU's FPU needs eventually; not sure if that's realistic or how the context switching would work...
     
  5. hkultala

    Regular

    Joined:
    May 22, 2002
    Messages:
    296
    Likes Received:
    37
    Location:
    Herwood, Tampere, Finland
    They are not going to use a 2-wide decoder; it would decrease single-thread performance too much, and would usually not be any better than the single 4-wide decoder for 2 threads.

    So the decoders will be either 3- or 4-wide. 3 might be the best compromise.
     
  6. hkultala

    Regular

    Joined:
    May 22, 2002
    Messages:
    296
    Likes Received:
    37
    Location:
    Herwood, Tampere, Finland
    The demand is the number of instructions that need to be fetched to execute the program; doubling the decoders doesn't really increase it. The decoders are not asking for more instructions: the fetcher gives the decoders a code stream, and they decode what they get.

    Better branch prediction will also decrease demand, as fewer instructions will be fetched.

    With a single decoder, if the buffer between ifetch and decode got full, ifetch had to wait/stall.
    And when the buffer held quite a few instructions and there was a branch prediction miss, all of those instructions for the thread had to be flushed out.

    With Steamroller there are on average fewer instructions waiting to be decoded, so when a branch prediction miss occurs, fewer instructions are flushed. So with more decoders, the total number of instructions that have to be fetched is actually lower.
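    The buffer argument above can be sketched with a toy model of the fetch-to-decode queue; the widths, buffer capacity, and mispredict rate below are invented for illustration, not Steamroller's real numbers:

    ```python
    import random

    # Toy fetch->decode buffer: fetch inserts up to fetch_bw instructions per
    # cycle, decode drains up to decode_bw; a mispredict flushes the buffer.
    def flushed_per_mispredict(decode_bw, fetch_bw=8, cap=32,
                               mispredict_p=0.05, cycles=100_000, seed=1):
        random.seed(seed)
        buf = flushed = events = 0
        for _ in range(cycles):
            buf = min(cap, buf + fetch_bw)      # fetch refills the buffer
            buf = max(0, buf - decode_bw)       # decode drains it
            if random.random() < mispredict_p:  # mispredict: flush what's left
                flushed += buf
                events += 1
                buf = 0
        return flushed / events

    # Narrower decode lets instructions pile up, so each mispredict throws
    # away (and forces a refetch of) more of them.
    print(flushed_per_mispredict(decode_bw=4))  # buffer runs full
    print(flushed_per_mispredict(decode_bw=8))  # decode keeps pace with fetch
    ```

    The model only captures the queueing effect, but it shows the direction: the wider the decode relative to fetch, the emptier the buffer sits when a mispredict hits, and the less work is thrown away.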
     
    #1266 hkultala, Sep 1, 2012
    Last edited by a moderator: Sep 1, 2012
  7. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    430
    Location:
    Cleveland, OH
    This is just a matter of semantics.. nothing increases the number of instructions that "need" to be fetched, but increasing decoder width (or execution width, and so on) can increase the amount of fetch bandwidth the core could utilize. In other words, it could move the bottleneck onto the fetch units more often.

    But no, I didn't mean that more fetch bandwidth would be needed to get the same amount of work done, if that's what you thought I was saying.

    But the cost of a branch mispredict is pretty much uniform across the pipeline, so the relative demand stays the same..

    Well yeah. And when the execution units can't find enough to execute in the OoO window, the decode units stall, and so on. There's no question that more decoder width is better if the bandwidth wasn't good enough (that's kind of a trivial statement); the question is how often there wasn't enough bandwidth. But I don't think anyone is really questioning that in the dual-threaded scenario the decoders were a bottleneck some of the time.

    It doesn't matter if a fetched instruction was waiting to be decoded or if it was further along in the pipeline.. a branch misprediction must flush all instructions that were ever fetched after that branch, regardless of whether or not they've been decoded. So the branch misprediction penalty is not reduced by having more decode/execution/whatever resources. And you still have to fetch the same amount to get back to an equivalent amount of work done.

    If you want to look at it as energy wasted instead of time it probably wastes less if the data never got to leave the fetch buffers.

    You could say that wider decoders mean the fetch buffers don't need to be as large, but since there are already separate ones for each thread, you'd lose in the single-threaded case by making them smaller, since the single-threaded decode bandwidth isn't increasing. You could say the same thing for the post-decode buffer, depending on how robust it is: if it were a real cache it'd be worth relying on, but a loop buffer either works or doesn't, and pretty easily becomes completely useless if it's not executing a small enough loop.. so I don't know if AMD will want to rely on it to guarantee a performance baseline.
     
  8. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,724
    Likes Received:
    194
    Location:
    Stateless
    There are quite a few sharp minds here, so I will dare a few questions.
    AMD went for CMT on the premise that it would deliver 80% of the performance of 2 cores for 50% of the cost. I haven't made measurements (and comparing a Bulldozer module to previous AMD cores may not be an optimal comparison), but looking at a Trinity die and then at Llano seems to tell another story.

    AMD might improve their modules' performance with Steamroller and then Excavator, but I would not bet on a significant reduction in the size of the module (their high-density libraries may do that, but those also apply to their other processor lines).

    So what do you think of CMT?
    The premise was +80% performance for 50% more silicon, and it looks like what AMD will pull off is +100% performance for a 100% increase in silicon (which makes the point a bit moot).
    Do you think the approach is a failure? If so, would you expect them to abandon it for their next brand-new architecture?

    Do you think that if CMT were pushed further it could actually get closer to its premise? By pushed further, I mean a module consisting of more than 2 cores.
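    For a rough sense of scale, the premise and the apparent outcome can be compared in performance-per-area terms; the numbers below are just the hypotheticals from the question, not measurements:

    ```python
    # Throughput per unit of silicon when adding a second thread/core,
    # relative to one baseline core (= 1.0). Inputs are hypothetical.
    def perf_per_area(perf_gain, area_gain):
        return (1 + perf_gain) / (1 + area_gain)

    print(perf_per_area(0.80, 0.50))  # CMT as promised: 1.8 / 1.5 ~= 1.2x
    print(perf_per_area(1.00, 1.00))  # plain core duplication: 2.0 / 2.0 = 1.0x
    ```

    On the promised numbers, CMT buys roughly 20% more throughput per unit of area than duplicating full cores; on the observed ones, the advantage evaporates, which is the crux of the question.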
     
  9. AlexV

    AlexV Heteroscedasticitate
    Moderator Veteran

    Joined:
    Mar 15, 2005
    Messages:
    2,533
    Likes Received:
    139
    This comparison would make sense if there existed a full, old-school, BD-based dual core with no shared elements. Versus K8L they added quite a few things, so it's not necessarily surprising that the module looks fat compared to that.
     
  10. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,724
    Likes Received:
    194
    Location:
    Stateless
    Agreed.
    I made some quick (and gross) measurements of Star cores and Piledriver modules.
    I found that a PD module is ~93% of the area two Star cores would cover. That's without L2; the L2 sizes are a wash. By eye I would say there is a bit less "glue" in a 2-module part than in a 4-core part, which may push the advantage further in favor of BD/PD.

    It isn't that bad: in power-constrained environments at least, Trinity offers anywhere between 110% and 120% of the performance of Llano. There are cases where Llano wins, but also cases where Trinity wins by a greater margin.

    So let's say CMT is a good idea: do you think that after Excavator AMD could go further and increase the number of cores within a module (like having 4 cores in a module)?
    Or are they constrained in how far they can scale the front end?
     
  11. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,533
    Likes Received:
    492
    Location:
    Varna, Bulgaria
    AMD already refers to the new Jaguar quad-core architecture as a "Compute Unit", together with a shared L2. If they push Jaguar a bit beyond the pure mobile concept, it would fit very well in a server/WS envelope, with some sort of scalable interconnect cache/memory infrastructure.
     
  12. Raqia

    Regular

    Joined:
    Oct 31, 2003
    Messages:
    508
    Likes Received:
    18
    An interesting hearsay update about AMD's CPU development:

    VR-Zone Article

    The rumored gains are nice, but AMD's main problem has always been execution. Keller being back in charge is a very hopeful change though.
     
    #1272 Raqia, Sep 7, 2012
    Last edited by a moderator: Sep 7, 2012
  13. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    430
    Location:
    Cleveland, OH
    The content of the article is reasonable (if you ignore the assertion that there could be other magic fixes for relevant performance issues) but the title is a total farce.
     
  14. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,532
    Likes Received:
    957
    Does anyone have any idea what this NRAC thing might be? My guess would be a workaround for some x87-specific bug.
     
  15. Blazkowicz

    Legend Veteran

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
    This is a nice hack, but the gains are in a benchmark from the mid-nineties.
    Anything less than 10 years old can probably use SSE2 or newer, and x87 is completely unavailable to any 64-bit application. This is good if you had a performance problem in Quake 1 and aren't bothered by any potential bug; the article doesn't explain too much why, but says the gain goes away with multi-threading or perhaps multi-tasking.

    It's largely irrelevant: it's only slightly more useful than unlocking better 80286-mode performance, since you would have to do number crunching all day in a single legacy single-threaded program you can't recompile or update.
     
  16. Lightman

    Veteran Subscriber

    Joined:
    Jun 9, 2008
    Messages:
    1,935
    Likes Received:
    914
    Location:
    Torquay, UK
    Best to go to the source and check the XS thread about it. The original Bulldozer was affected by this and hangs after applying the patch. Later iterations (Piledriver, Richland) work fine and show a speedup in SuperPi (so far only there, so it might be down to the specific instruction mix it uses).
     
  17. Raqia

    Regular

    Joined:
    Oct 31, 2003
    Messages:
    508
    Likes Received:
    18
    Some purported early benchmarks of steamroller:

    http://www.chinadiy.com.cn/html/21/n-11921.html

    A 34% improvement in integer IPC is nothing to sneeze at if true and could put it neck and neck with Haswell. The FPU takes a step back, and I seem to remember reading somewhere that Steamroller was going to pare down redundancies in its FPU somewhat, so this drop is consistent with what we've heard.

    EDIT: It was this article:

    http://techreport.com/review/23485/amd-cto-reveals-first-steamroller-details

    Of course PR would say that. ;) Steamroller is a much more APU-oriented CPU. With consumers using GPU functionality more and more for their needs, a zippy CPU-based FPU isn't so important for the masses. They weren't in a hurry to fix BD's L3 latency issues either, as mentioned in the AnandTech article, since presumably most of their consumer parts won't have an L3.

    I guess scientists doing simulations will learn to make do with GPUs for their floating-point needs, but GPUs are much less predictable and clean to use than a CPU's instruction set and unified memory. AMD's memory unification starting with Kaveri is a nice start toward properly unleashing the GPU.
     
    #1278 Raqia, Nov 2, 2013
    Last edited by a moderator: Nov 2, 2013
  18. itsmydamnation

    Veteran Regular

    Joined:
    Apr 29, 2007
    Messages:
    1,330
    Likes Received:
    444
    Location:
    Australia
    L3 latency isn't the issue: it's not great, but it's still way faster than main memory. It's L3 throughput (especially write throughput) that's the issue. The L2s are large and the L3 is an eviction cache; things are fetched into L1 and L2. But the L3 throughput........

    http://www.vmodtech.com/main/wp-con...2133c9d-16gxh-with-amd-fx-8350/aida64-mem.jpg
    http://cdn.overclock.net/e/e4/350x700px-LL-e4eb580f_cachememtest.png


    What I want to know is whether the L1 is "broken": a 6:1 ratio of read to write bandwidth seems kinda pointless to me.

    edit: here is my FX-8350... it's running ESXi and was doing a fair bit while the benchmark was going........

     
    #1279 itsmydamnation, Nov 2, 2013
    Last edited by a moderator: Nov 2, 2013
  19. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,533
    Likes Received:
    492
    Location:
    Varna, Bulgaria
    Try the latest version of AIDA (3.0). The memory benchmark suite is now fully multi-threaded and latency readings are much more accurate.
     