Trinity vs Ivy Bridge

Discussion in 'Architecture and Products' started by rpg.314, Jun 29, 2011.

  1. Well yes, my comment was directed at the Piledriver architecture's performance.

    Nonetheless, it seems that Trinity's performance advantages come mainly from higher clocks in both the CPU and GPU and an actually working Turbo function.

    Still, it's a good upgrade from Llano, and if there are fewer fab constraints this time, it'll probably sell lots and lots (outside China).
    Let's not forget that even the A10-4500M will probably be priced below dual-core Ivy-Bridges, and offer a similar overall experience (plus much better gaming performance).


    What worries me is if laptop OEMs keep linking AMD to "low-cost laptops" and forever place their chips into machines with 40WH batteries, low-quality plastics, poor TN panels, low-speed HD drives and god-awful cooling systems that force the APUs to underclock.
     
  2. yuri

    Regular

    Joined:
    Jun 2, 2010
    Messages:
    283
    Likes Received:
    296
    Like you said, AMD itself promoted a '10-15% improvement in performance/watt every year' at the Bulldozer launch. Expecting more than this from AMD in the following 2-3 years is not wise.
     
  3. Right now, I only expect AMD to stay competitive at a given TDP or price point. I fear that may not happen in the following 2-3 years if they stick with Bulldozer as the base architecture for their CPU cores.
     
  4. kalelovil

    Regular

    Joined:
    Sep 8, 2011
    Messages:
    568
    Likes Received:
    104
    That was the plan last year, but VR-Zone has since reported that AMD will make more substantial changes in Steamroller onwards in an attempt to compete with Intel in the top end of the market again.
    http://vr-zone.com/articles/amd-to-survive-and-thrive-still-/15564.html
    http://www.xbitlabs.com/news/cpu/di...vements_with_Steamroller_Microprocessors.html
     
  5. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    Do they have a choice? I doubt there's a huge market out there willing to pay for an expensive chassis without the best internals. Anand mentions a probable $600 ceiling for Trinity-based laptops.
     
  6. yuri

    Regular

    Joined:
    Jun 2, 2010
    Messages:
    283
    Likes Received:
    296
    The slides with the 10-15% improvement figures were shown in Oct 2011, and Steamroller-based products are scheduled for 2013, so you would have ~2 years for a massive architecture change. This simply can't happen :)

    On the other hand, sure, the Steamroller architecture is meant to be a much bigger departure from Bulldozer than Piledriver, which is more of a new silicon revision. At least Steamroller is rumored (A. Stiller, c't magazine) to split the decoder into two independent ones. This bottleneck has been pointed out by many.
     
  7. imaxx

    Newcomer

    Joined:
    Mar 9, 2012
    Messages:
    131
    Likes Received:
    1
    Location:
    cracks
    yeah, the BD architecture seems to have outrageous IPC potential kept dormant by the pesky front end, even worse than K8/K10's (which was front-end and retirement limited).

    About the BD architecture... it seems not that bad, at least on paper. It surely has good margins for improvement, whereas that's probably harder for Intel's, which has already undergone several iterations since the old Pentium M design.

    Just a side note, reading the AnandTech review... a perceptron?! No, really, does it mean AMD engineers really added a perceptron branch predictor - a true one?!
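    For what it's worth, the perceptron predictor from the branch-prediction literature (Jiménez & Lin) is simple enough to sketch. Whether Piledriver's version resembles this is unknown; the table size, history length and hashing below are illustrative assumptions, not AMD's actual parameters.

```python
# Minimal sketch of a perceptron branch predictor (after Jiménez & Lin, 2001).
# All sizing parameters here are illustrative, not from any real CPU.

HISTORY_LEN = 8
THRESHOLD = 1.93 * HISTORY_LEN + 14   # training threshold from the paper
TABLE_SIZE = 64

# one weight vector (bias + one weight per history bit) per table entry
weights = [[0] * (HISTORY_LEN + 1) for _ in range(TABLE_SIZE)]
history = [1] * HISTORY_LEN           # global history: +1 taken, -1 not taken

def predict(pc):
    """Return (dot product, taken/not-taken guess) for a branch address."""
    w = weights[pc % TABLE_SIZE]
    y = w[0] + sum(wi * hi for wi, hi in zip(w[1:], history))
    return y, y >= 0

def train(pc, taken):
    y, pred = predict(pc)
    t = 1 if taken else -1
    # update weights only on a misprediction or a low-confidence output
    if pred != taken or abs(y) <= THRESHOLD:
        w = weights[pc % TABLE_SIZE]
        w[0] += t
        for i, hi in enumerate(history):
            w[i + 1] += t * hi
    history.pop(0)
    history.append(t)

# a branch that is always taken quickly trains to a confident "taken"
for _ in range(32):
    train(0x400123, True)
print(predict(0x400123)[1])   # → True
```

    The appeal of the scheme is that it can exploit much longer histories than a saturating-counter table of the same storage budget, which is presumably why it keeps showing up in hardware.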
     
  8. jimbo75

    Veteran

    Joined:
    Jan 17, 2010
    Messages:
    1,211
    Likes Received:
    0
    I find it a bit strange that the games where it loses to Ivy Bridge are the same games where AMD loses heavily to Nvidia. Batman, Dirt 3, Skyrim, etc. - that seems a bit too much of a coincidence to me.
     
  9. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    It's AMD's word that this is the case.
    We'll have to wait and see evidence that the same design pipeline that spat out BD (minus a slew of those it fired or has lost) can do significantly better.
    There is no expectation that the task they face will be any easier, and there are significant headwinds coming up that they refuse to address.

    Where would this outrageous IPC be hiding? The front end is a limiter, and as it turns out even weaker than expected, but there's not enough behind it to say there's a world of untapped performance per clock. That's not what the design targeted.

    On paper, it was described as purposefully trading off against performance for the sake of power and design simplification. The latter seems to be a higher priority. Its philosophical underpinnings promise underperformance.

    I'm trying to find references, but a perceptron may have been added to Bobcat first.
     
  10. Blazkowicz

    Legend

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
    Piledriver, in the form of a follow-up to the current FX-4100, would be good enough for me. Not that I need it much for now; these days I run a dual core with a web browser constantly wasting half of it.

    100€ would be more than expensive enough for a CPU, considering that for it to make sense I'd need to upgrade memory, mobo, storage and graphics, plus learn yet more dirty command-line things, etc.
    A low-end Piledriver standalone CPU will be better value than a Core i3 for me, with the drawback that it eats more power.
     
  11. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    They could be winning for different reasons.
    One possible reason in common is that AMD's designs at the high end seem to need a high pixel load to amortize higher overheads. Being limited at the front end seems to come up relatively frequently compared to Nvidia.
    Intel's IGP might be better balanced, or its superior CPU and memory architecture is able to churn through the overhead better.

    Intel's plans for reducing driver overhead further have been mentioned, so IVB may improve relative to Trinity over the course of the year.
     
  12. imaxx

    Newcomer

    Joined:
    Mar 9, 2012
    Messages:
    131
    Likes Received:
    1
    Location:
    cracks
    It is outrageous not in an absolute sense but in a relative one, if you consider that the decoders average ~2 instructions/cycle in the long run, and such a decoder alternates between two cores, starving the CPU. In an absolute sense, K10 was the best, given its 9 inst/cycle potential. Anyway, BD's LSU is 2R1W per core (so 4R2W/cycle per module, even if the LSU seems to have some problem; maybe the WCC slows it down?), so for a module you have a theoretical 4 ALUs (6 with fused jcc) + 4 AGLUs (which only handle mov r,r / mov r,[m], unlike the K10 design, unfortunately) + 2 FPUs, if you were able to feed the module properly.
    Intel's processor is way behind that (4/5 ALU+"AGU", 2R1W/cycle), yet it uses all its resources far better.
    BD also pays with a longer pipeline, some instructions that are slower clock-for-clock than Intel's counterparts, shared L2 access, a smaller L1D with correspondingly higher thrashing, etc.

    It promises a single-thread performance drop, and a drop for very FPU-intensive applications that require sharing the FlexFPU.
    Yet it promises better overall IPC module-wise.

    Thank you - it would be very interesting to me, I couldn't believe my eyes when I read it.
     
  13. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    The integer back end is rather constrained. There are situations where the front end can leave it stalled, but it's generally balanced for that anemic 2 ops per cycle.

    Theoretically. Agner Fog's tests show that BD's read bandwidth suffered from some other penalty, because it couldn't sustain the bandwidth 2 reads per cycle should have garnered, and the write bandwidth to the L2 (required due to write-through) is very bad. Trinity should have doubled the latter, which is still pretty tight considering two cores have to share it.

    Why are you comparing BD two cores versus one of the others?

    The common usage of IPC, at least prior to AMD's marketing efforts, referred to instructions issued per cycle for a thread. It was more about straightline speed, not the aggregate number of instructions on the silicon. This is probably due to the term being commonplace prior to multicore solutions.
    Weakening the language as AMD did is not an argument from a position of strength.
    In general, the design is rife with special cases and crippling glass jaws. It needs code to cater to its many weaknesses, and so whatever utilization it supposedly garners is eclipsed by the loss of peak performance and a world that refuses to cater to a weak architecture.

    It's a second-hand quote from a presentation.

    http://www.realworldtech.com/beta/forums/index.cfm?action=detail&id=126504&threadid=126503&roomid=2
     
  14. Albuquerque

    Albuquerque Red-headed step child
    Veteran

    Joined:
    Jun 17, 2004
    Messages:
    4,309
    Likes Received:
    1,107
    Location:
    35.1415,-90.056
    A6-4455M
    Modules: 1
    Cores: 2
    TDP: 17W


    Edit: Back to the topic at hand: Not surprising, and basically what I expected. Low clock, low CPU performance, more GPU than it can use. Sounds great for keeping themselves in the lowest cost bracket, complete with the lowest margins of the entire consumer-grade processor stack. This is a perfect way to keep themselves in just enough business to stay afloat, and not much else.

    Sad :(
     
    #514 Albuquerque, May 15, 2012
    Last edited by a moderator: May 15, 2012
  15. imaxx

    Newcomer

    Joined:
    Mar 9, 2012
    Messages:
    131
    Likes Received:
    1
    Location:
    cracks
    in order to get that 2 ops/cycle you'd need to decode 4 instructions (assuming single-path) every cycle, but since the decoder *alternates* between the cores every cycle, you decode (up to) 4 instructions every TWO cycles. Let me highly doubt you can get them: say you decode 3 mops/cycle, which is quite high. You then have two full cycles to execute those 3 mops, filling 1.5 ALUs, and that's supposing no AGLU usage, not even for mov r,r / mov r,[m]. Averaging decode between 2 and 3 mops/cycle, your 2+2 back end sits quite underused (between 1 and 1.5 mops/cycle!).
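    The arithmetic above can be sketched as a back-of-envelope model; the decode rates fed in are the figures from this discussion, not measurements.

```python
# Toy model of Bulldozer's shared front end: the decoder serves each core of
# a module on alternate cycles, so whatever it decodes on a core's turn must
# stretch over two execution cycles for that core.

def per_core_throughput(decode_rate_mops):
    """Sustained mops/cycle per core when the decoder alternates cores."""
    return decode_rate_mops / 2.0

# If the shared decoder averages 2-3 mops on a core's turn, each core sustains
# only 1-1.5 mops/cycle against its 2 ALUs + 2 AGLUs; even a perfect 4-wide
# decode turn caps a core at 2 mops/cycle.
for rate in (2.0, 3.0, 4.0):
    print(rate, "->", per_core_throughput(rate))   # → 1.0, 1.5, 2.0
```

    The point the model makes is that even an ideal decode burst only just matches the 2 ALU pipes per core, leaving no slack for the AGLUs.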

    Because of Hyper-Threading. A module runs two threads, and an Intel core runs two threads (granted, using very different architectures). The Intel core uses all its resources as best it can and delivers high single-thread performance, whereas an AMD module gives slower single-thread performance in exchange for bigger two-thread IPC.
    Unless you consider AMD's 4M/8C a full 8-core processor - but with a shared front end, I think it is not.


    Well, if you consider mostly simple instructions for IPC, they compare fine, since the MOP output from the decoders is somewhat similar between AMD and Intel (not identical, though; even a push gets decoded differently...).

    yeah, this has been a weak point of AMD all along. If you need to optimize assembly, you go for Intel's processors.
     
    #515 imaxx, May 15, 2012
    Last edited by a moderator: May 15, 2012
  16. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    ... and there are single-uop instructions and double-uop instructions. The decoder can only decode one double-uop instruction per cycle. Microcoded instructions are even worse, and block the decoder (for both threads) for several cycles (automatically causing pipeline underutilization in both cores, since the max decode rate equals the execution rate). The same is true for instructions that have long prefixes (4-7 prefixes = 14-15 extra decoding cycles, again blocking both cores). The CPU also has instruction fusion, but fusion doesn't increase the decoding rate (so fused instructions do not help with the decoding bottleneck). And the 2-way shared L1 instruction cache doesn't help much either...
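    A crude way to see how badly a few pathological instructions dominate: charge each instruction class a decoder-cycle cost. The costs below are illustrative, loosely following the figures in this post, not measured values.

```python
# Toy accumulator for shared-decoder cost on Bulldozer-like hardware.
# Costs are illustrative: up to 4 single-uop instructions per cycle, one
# double-uop instruction per cycle, several cycles for microcoded ones,
# ~15 extra cycles for a 4-7 prefix instruction (stalling both cores).

COST = {
    "single": 0.25,
    "double": 1.0,
    "micro":  4.0,
    "prefix": 15.0,
}

def decode_cycles(stream):
    """Total decoder cycles consumed by a list of instruction kinds."""
    return sum(COST[kind] for kind in stream)

fast = ["single"] * 16
slow = ["single"] * 12 + ["double", "double", "micro", "prefix"]
print(decode_cycles(fast), decode_cycles(slow))   # → 4.0 24.0
```

    Sixteen plain instructions cost 4 decode cycles, but swapping just four of them for awkward ones balloons that to 24 - and since the decoder is shared, the other core eats the stall too.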

    But there are only 2 integer/logic ALU pipes per core (the 2 other pipes can only do memory operations), so it wouldn't be able to execute more than 2 integer/logic ALU instructions per clock even if the decoder were more capable. It would definitely get closer to 2 ALU ops per cycle, though...
     
  17. imaxx

    Newcomer

    Joined:
    Mar 9, 2012
    Messages:
    131
    Likes Received:
    1
    Location:
    cracks
    all true, I just cut the lengthy details ...just one more: the presence of double-uop vector instructions outside a perfect 2-1-1 sequence adds even more stalls to the shared decoder, giving it even lower throughput when it's shared. Intel has the same problem, of course, yet it suffers less since it does not have a whole second core to feed.

    With this front end? Hmmm, I doubt it. AMD says it was getting 2/cycle using K10's 3-instruction front end, hence its third, unused ALU. SB can constantly sustain 4-5/cycle only when it runs relatively optimized code out of the uop cache... maybe it will happen when they fix/split the decoder somehow...
     
  18. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Masking the other core off showed performance improvement, although it was pretty modest.
    There are just so many other ways to lose utilization that the decoupled front end is one weakness of many.
    There are issue restrictions for which EXE pipeline can do what, such as MUL and DIV, and branches can only use one pipeline. In branchy code the core can look 50% thinner on a given cycle.
    edit: Also a lack of move elimination, which is more noticeable with the claustrophobic 2 issue slots. Later iterations of the architecture will give the AGU ports the ability to handle moves, though. Intel's design does better.

    For all but the most friendly apps, Bulldozer doesn't provide superior aggregate throughput.

    The cores have separate memory pipelines, issue, and control hardware.


    IPC as traditionally used involves the number of instructions a design can execute for a thread in a cycle. In more general terms, it is what a core can generally manage when given a non-toy workload.
    Regardless, actual benchmarks show that those cores wind up stalling more, so even in aggregate terms the number of instructions issued per clock is weak.

    The bigger problem is that general-purpose processors have trended towards being resilient enough to not require so much handholding.
    Sandy Bridge has a few core optimizations, such as trying to keep instruction counts low enough to fit in the uop cache and the complex and simple decoder arrangement. It's still very strong in non-ideal situations.
    Bulldozer has a raft of other problems on top of that and it drops off ideal very quickly.
     
    #518 3dilettante, May 15, 2012
    Last edited by a moderator: May 15, 2012
  19. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    I meant that no matter how much they improved the decoder, with only two ALU execution units (the two others doing only memory ops), the integer IPC wouldn't dramatically improve. Of course, if we assume they get somewhere around 1.0 per cycle right now and could reach 1.5 with a (vastly) improved front end, that's 50% extra performance... but that would be overly optimistic... unless they drastically improved the cache system, branch predictors, store forwarding, etc. But it seems those are actually the things Piledriver improved on, so instruction decode might be an even bigger bottleneck this time, unless they also improved it over Bulldozer. We need more detailed architecture analysis than the current reviews provide.
     
  20. Ernestds

    Newcomer

    Joined:
    Dec 10, 2011
    Messages:
    19
    Likes Received:
    0