New AMD low power X86 core, enter the Jaguar

Discussion in 'PC Industry' started by liolio, Aug 28, 2012.

  1. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
    K8/K10 also had dedicated ECC protection for the L1D, whereas BD omitted this feature, so now on every single-bit error the corresponding cache line must be reloaded from the L2.
     
  2. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,022
    Likes Received:
    122
    Unless you're living very close to the sun I don't think you'd notice it's slower due to that happening once every other year...
     
  3. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    430
    Location:
    Cleveland, OH
    I don't think it's so low. The susceptibility goes up as voltage and feature size go down. Check out the soft-error rate models used in this publication: http://www.cse.psu.edu/~mdl/paper/lin-islped04.pdf

    For "normal" voltage they had 1 every 10 million cycles, and that was way back on 130nm. So I'd expect the number to be in the seconds, not years.

    Nonetheless, write-through caches have much lower error rates.
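
    As a rough unit conversion only (a sketch, not a claim about real hardware): a rate of one error per N cycles at clock f gives a mean interval of N / f seconds. The clock values below are illustrative assumptions, not figures from the paper.

```python
def mean_seconds_between_errors(cycles_per_error: float, clock_hz: float) -> float:
    """Mean time between errors for a fixed per-cycle error rate."""
    return cycles_per_error / clock_hz


CYCLES_PER_ERROR = 10_000_000  # "1 every 10 million cycles", as quoted above

# Assumed clock speeds, chosen only to show the scaling.
for clock_ghz in (1.0, 2.0, 4.0):
    interval = mean_seconds_between_errors(CYCLES_PER_ERROR, clock_ghz * 1e9)
    print(f"{clock_ghz:.1f} GHz: one error every {interval * 1e3:.2f} ms")
```

    Taken at face value the quoted rate works out to milliseconds at GHz clocks, though a simulation parameter need not match what a shipping part sees in the field.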
     
  4. hkultala

    Regular

    Joined:
    May 22, 2002
    Messages:
    296
    Likes Received:
    38
    Location:
    Herwood, Tampere, Finland
    Ok, you have small vectors in 3D graphics. I was thinking more in the direction of signal processing and scientific workloads.

    But even in that 4-wide case where there are 3 FMA + 1 MUL, 2 FMA units can give a throughput of 1 iteration/2 cycles(*), but one adder + one multiplier can only give a throughput of one iteration/4 cycles.
    So FMA still gives twice the throughput.

    (*) (assuming we can parallelize the code so that the serialization of the FMAs does not become a bottleneck, for example by running "multiple totally independent work items" in parallel in the same SIMD lane)
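
    The counting above can be sketched as a simple issue-bound model. This is a minimal sketch that assumes one operation per pipe per cycle and ignores latency and dependencies, as the footnote does; the function and pipe names are made up for illustration.

```python
from math import ceil


def cycles_per_iteration(ops: dict, units: dict) -> int:
    """Throughput-bound cycles for one iteration.

    ops:   operations of each kind in one iteration
    units: pipes able to execute each kind (one op per pipe per cycle)
    """
    return max(ceil(count / units[kind]) for kind, count in ops.items())


# 3 FMA + 1 MUL: with FMA hardware the lone MUL also issues to an FMA pipe,
# so there are 4 ops competing for 2 pipes.
fma_machine = cycles_per_iteration({"fma_pipe": 4}, {"fma_pipe": 2})

# Without FMA each fused op splits into a multiply and an add:
# 4 multiplies and 3 adds on one multiplier and one adder.
split_machine = cycles_per_iteration({"mul": 4, "add": 3}, {"mul": 1, "add": 1})

print(fma_machine, split_machine)
```

    In this model the multiplier is the limiting pipe in the non-FMA case, which is where the 4 cycles come from.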
     
    #84 hkultala, Sep 25, 2012
    Last edited by a moderator: Sep 25, 2012
  5. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
    I am intrigued by the fact that AMD chose to implement twice as large and twice as associative L1D caches in the (low end) Bobcat than they did in Bulldozer. Did they have to compromise the cache design in Bulldozer to reach high clocks? Current max BD turbo is at 4.5 GHz (and without heat limits BD can overclock up to 8 GHz with LN2). They seem to have lots of extra clock headroom that they cannot use (even on desktops) because of TDP/heat constraints.

    Intel seems to be using its extra clock headroom to improve IPC (clocks haven't improved lately but IPC has). Same seems to be true for Haswell. This kind of development seems to better fit the idea of using down clocked high end parts in 10W/17W ultra portables. BD/PD at 1.6 GHz (AMD A8-4555M) isn't in any way an optimal use of hardware (it has so many needless extra transistors dedicated to reaching higher clocks).
    Yes. But BD has only two FMA units per module (2 cores), while Bobcat/Jaguar have two adders and two multipliers. So both reach the same throughput.
     
  6. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    430
    Location:
    Cleveland, OH
    That has to be the case; BD's L1D cache is really small by all comparable standards. Even most ARM chips have been using 32KB L1 for a while now, although Cortex-A15 is going to 2-way associativity, regrettably. Still, you can see that Bobcat's decisions are not out of place; rather, it's BD that looks odd.

    I'm guessing AMD chose the quite wide associativity in Bobcat's L1D at least partially so they could use VIPT w/o aliasing problems. The L1I could be PIPT or maybe they don't mind aliasing flushes as much there (at least they didn't on BD)
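
    The aliasing condition can be checked with one division: a VIPT cache avoids aliasing when a single way is no larger than a page, so that the index bits fall entirely within the page offset. A minimal sketch, assuming 4 KiB pages and the commonly published cache configurations; it says nothing about which indexing scheme each design actually uses.

```python
PAGE_SIZE = 4 * 1024  # assumption: 4 KiB base pages


def vipt_alias_free(size_bytes: int, ways: int, page_size: int = PAGE_SIZE) -> bool:
    """True if one way fits in a page, i.e. index + offset bits are untranslated."""
    return size_bytes // ways <= page_size


# Cache configurations are assumptions taken from commonly published specs.
caches = {
    "Bobcat L1D (32 KiB, 8-way)": (32 * 1024, 8),
    "Bulldozer L1D (16 KiB, 4-way)": (16 * 1024, 4),
    "Cortex-A15 L1D (32 KiB, 2-way)": (32 * 1024, 2),
}

for name, (size, ways) in caches.items():
    print(f"{name}: alias-free if VIPT = {vipt_alias_free(size, ways)}")
```

    Both AMD L1D configurations come out at exactly 4 KiB per way, while a 32 KiB 2-way cache would need some other way of handling aliases if it were virtually indexed.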
     
  7. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,022
    Likes Received:
    122
    Yeah, the only other "modern-ish" x86 CPU featuring such a small L1D cache is the P4. Actually, up to Northwood it was only 8kB (4-way), but Prescott bumped it to 16kB (8-way).
    I dunno though, sacrificing cache size/associativity so you could reach higher clock speeds which you can't actually reach in practice anyway sounds like a colossal mistake to me. The P4 at least could actually hit higher clock speeds in practice (not that it helped, mind you, but the small L1D cache was probably the smallest of its problems).
    Though Atom isn't that far ahead of BD there with its 24kB/6-way L1D cache :).

    That Bobcat paper mentions that the L1 ITLB and cache are accessed in parallel, which would imply virtual indexing. It also mentions, though, that the ITLB isn't actually accessed if the fetch is in the same page as the previous one, hence it shouldn't really matter for performance.
     
  8. I.S.T.

    Veteran

    Joined:
    Feb 21, 2004
    Messages:
    3,174
    Likes Received:
    389
    Forgive me, but what do VIPT and PIPT stand for? Once I get what they stand for, I can just look it up, so y'all don't need to explain the whole thing.

    Thanks in advance.
     
  9. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Virtually Indexed Physically Tagged
    Physically Indexed Physically Tagged
     
  10. I.S.T.

    Veteran

    Joined:
    Feb 21, 2004
    Messages:
    3,174
    Likes Received:
    389
    Thank you!
     
  11. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,022
    Likes Received:
    122
    FWIW looks like the 2-module ULV Trinity part (A8-4555M) has been released. Unlike the 1-module version (A6-4455M, released ages ago) it didn't quite make it to 17W though; instead it's now a 19W part. Clocks are 1.6GHz/2.4GHz, so the turbo clock should be higher than Jaguar's, but I don't know how often it's actually able to clock up that much. In any case the clocks are quite a bit lower compared to the 25W part (A10-4655M, 2GHz/2.8GHz), though a 1.8GHz quad-core Kabini might also need 25W. I couldn't find information about the A8-4555M GPU other than it's called 7600G (it could be either a 4 SIMD part or a 6 SIMD part with low clocks). In any case a 4 CU GCN part should look quite favorable next to that, though Trinity ULV should still be faster there because of dual channel memory.
    But probably a better comparison of quad-core Kabini would be against ULV 2-module Kaveri.
     
  12. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
  13. Albuquerque

    Albuquerque Red-headed step child
    Veteran

    Joined:
    Jun 17, 2004
    Messages:
    4,309
    Likes Received:
    1,105
    Location:
    35.1415,-90.056
    Damn! That's pretty swanky...

    I'm looking to buy a long-lifed Win8 Pro dockable tablet for my wife this year; I'd love to get one of those Jaguar cores over the Atom options that would otherwise fit the bill.
     
  14. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    The L2 interface takes up about as much space as an entire core.
    There's a whitepaper out there talking about the power optimization tech used for Jaguar, and the addition of that interface added a large amount of active flops to the design relative to other components.

    I wonder what other scalability measures it has besides allowing for a shared 4-bank L2.
     
  15. Albuquerque

    Albuquerque Red-headed step child
    Veteran

    Joined:
    Jun 17, 2004
    Messages:
    4,309
    Likes Received:
    1,105
    Location:
    35.1415,-90.056
    I think I found the whitepaper you are referring to over on Calypto. I'll give it a read later tonight...
     
  16. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
    It's more like a fifth "dedicated" core among the rest, with a lot of active logic and power control functions besides facilitating the interface arbitration.

    Well, the bus interface unit in a BD module is not small either. ;)
     
  17. I.S.T.

    Veteran

    Joined:
    Feb 21, 2004
    Messages:
    3,174
    Likes Received:
    389
    When is Jaguar due out? I don't recall any news about that...
     
  18. AlexV

    AlexV Heteroscedasticitate
    Moderator Veteran

    Joined:
    Mar 15, 2005
    Messages:
    2,535
    Likes Received:
    144
    As far as I know, and subject to change, we'll probably see Kabini (desktop Jag) around May-ish. Lisa Su had a tentative roadmap in her CES2013 talk. Take with adequate grain of salt though, AMD's roadmaps are fluid.
     
  19. I.S.T.

    Veteran

    Joined:
    Feb 21, 2004
    Messages:
    3,174
    Likes Received:
    389
    To say the least.

    Thanks, AlexV. :)
     
  20. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,541
    Likes Received:
    964
    There are rumors that the next MS Surface Pro will be Kabini-based, in addition to a higher-end Haswell model if I recall correctly. For whatever that's worth.
     

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.