New AMD low power X86 core, enter the Jaguar

Discussion in 'PC Industry' started by liolio, Aug 28, 2012.

  1. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,484
    Likes Received:
    396
    Location:
    Varna, Bulgaria
    A server SKU would need some additional RAS features in the micro-architecture and the memory pipeline, but otherwise Jaguar is very well suited to a Niagara-type high-throughput server SoC with a large variety of misc. I/O and dedicated HW blocks.
     
  2. Lightman

    Veteran Subscriber

    Joined:
    Jun 9, 2008
    Messages:
    1,802
    Likes Received:
    473
    Location:
    Torquay, UK
    At least Jaguar now supports ECC, so they have made the first step towards bringing it into the server world.
     
  3. hkultala

    Regular

    Joined:
    May 22, 2002
    Messages:
    284
    Likes Received:
    6
    Location:
    Herwood, Tampere, Finland
    I would not say "very well suited for high-throughput" for a core that has absolutely no multi-threading support. OOE only hides relatively short stalls; the cores would sit idle whenever a longer stall occurs.

    And when there are _lots_ of threads, OOE is just not needed; multi-threading gets better performance for less cost and less power.
     
  4. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,509
    Likes Received:
    839
    It would have a similar issue rate: 4x2 instructions/cycle @1.6GHz vs 1x4 instructions/cycle @3.2GHz. Similar OOOe capabilities too: 4x48 ROB slots vs 2x96 ROB slots for Haswell in SMT mode (1x192 in non-SMT mode). Similar LS bandwidth per nanosecond. It would have larger internal execution width and larger branch resolve capabilities.
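For anyone who wants to check the arithmetic, here is a minimal sketch of the comparison, assuming peak issue rate is simply cores x issue width x clock. The function name is made up for illustration, and this is peak throughput only; it says nothing about what real code sustains.

```python
def peak_issue_rate(cores: int, width: int, clock_ghz: float) -> float:
    """Peak instructions per nanosecond: cores x issue width x clock in GHz."""
    return cores * width * clock_ghz

# Figures quoted in the post: four 2-wide cores at 1.6 GHz
# versus one 4-wide core at 3.2 GHz.
jaguar_quad = peak_issue_rate(cores=4, width=2, clock_ghz=1.6)
big_core = peak_issue_rate(cores=1, width=4, clock_ghz=3.2)

# Total reorder-buffer slots: 4 x 48 versus 2 x 96 (Haswell in SMT mode).
jaguar_rob = 4 * 48
haswell_rob = 2 * 96

print(f"4x Jaguar : {jaguar_quad:.1f} instr/ns peak, {jaguar_rob} ROB slots")
print(f"1x 4-wide : {big_core:.1f} instr/ns peak, {haswell_rob} ROB slots")
```

Both sides come out the same on paper, which is the point of the comparison.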

    If you can stuff four times the cores into the same die area as a single Ivy Bridge or Piledriver core and keep the power envelope lower, I'd say it is well suited.

    Oracle (was: Sun) added OOOe to their T4 design, increasing per socket performance compared to T3 while halving the number of cores.

    Cheers
     
  5. Blazkowicz

    Legend Veteran

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
    And they're going back to 16 cores with the T5, keeping the T4 core design. I read an article on it that I stumbled on by chance. There are eight threads per core, so they're really selling a 128 thread OOOe CPU. They've stuffed in lots of NUMA links so that 8 socket builds are 1st class citizens too, giving you a computer with up to 1024 threads :runaway:
    It will be a tad more expensive than a console, PC or tablet though (and you have to buy it from Oracle?)


    I've seen it mentioned more than once on this forum, though, that not all OOOe is equal: the kind found in e.g. the PowerPC G3/GameCube/Wii was said to be much simpler than what's in the Pentium Pro/2/3, for instance.
     
  6. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    That's true until you start to thrash your caches. If, for a given IPC improvement, you pay a similar cost for OoOE or multi-threading, you always want to go with the former.
     
  7. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    Acer Aspire V5 benchmarks (ultraportable with Temash A6-1450 tablet chip):
    http://ultrabooknews.com/2013/05/10/live-now-acer-aspire-v5-and-amd-temash-testing/

    For the CPU alone they measured a 3W TDP (four cores, 1.4 GHz).

    A6-1450 scores 1.23 in multithreaded Cinebench (11.5). In comparison, the 17W ULV Sandy Bridge i5-2537M (1.4/2.3 GHz) scores 1.34 (Samsung Series 9), and the 17W ULV Ivy Bridge i7-3517U (1.8/2.9 GHz) scores 2.79 (ASUS Zenbook Prime UX21A).

    If you directly extrapolate the Cinebench result to 2.0 GHz (Kabini at max clock) the score would be 1.76. That's not enough to compete with Ivy Bridge (or Haswell), but it should beat all old Sandy Bridge based ULV models (in multithreaded SIMD code - it of course loses badly in single threaded code).
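A minimal sketch of that extrapolation, assuming the score scales linearly with clock (optimistic, since it ignores memory bandwidth and thermal limits). The helper name is hypothetical; the scores are the ones quoted above.

```python
def scale_score(score: float, from_ghz: float, to_ghz: float) -> float:
    """Scale a benchmark score linearly with clock speed."""
    return score * to_ghz / from_ghz

a6_1450_score = 1.23  # Cinebench 11.5 multithreaded at 1.4 GHz
kabini_2ghz = scale_score(a6_1450_score, from_ghz=1.4, to_ghz=2.0)

# Scores quoted for the Intel ULV parts.
rivals = {"i5-2537M (Sandy Bridge)": 1.34, "i7-3517U (Ivy Bridge)": 2.79}

print(f"Extrapolated 2.0 GHz score: {kabini_2ghz:.2f}")
for name, score in rivals.items():
    verdict = "ahead of" if kabini_2ghz > score else "behind"
    print(f"  {verdict} {name} at {score:.2f}")
```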
     
  8. DavidC

    Regular

    Joined:
    Sep 26, 2006
    Messages:
    347
    Likes Received:
    24
    That's not TDP; the actual TDP should be somewhat higher.

    So let's compare it to Silvermont.

    A 1.8GHz Atom Z2760 "Clover Trail" gets 0.5 points in 3DMark11.

    If we assume double the score for quad core and a 1.5x perf/clock increase, that would mean a 1.8GHz Silvermont should get 1.5 points. It probably won't scale linearly, so it may need 2GHz to do so.

    Looks like Jaguar may have about a 15% advantage per clock over Silvermont, and over Bobcat as well.
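A rough sketch of the estimate above, using only the assumptions stated in this post (2x for four cores, 1.5x per clock) plus the A6-1450 result quoted earlier in the thread. None of the Silvermont numbers are measurements.

```python
clover_trail_score = 0.5  # Atom Z2760 at 1.8 GHz, as quoted
core_scaling = 2.0        # assumed gain from going quad core
per_clock_gain = 1.5      # assumed Silvermont perf/clock increase

silvermont_score = clover_trail_score * core_scaling * per_clock_gain

# Per-GHz score for Jaguar (A6-1450: 1.23 at 1.4 GHz).
jaguar_per_ghz = 1.23 / 1.4

# Two cases: perfect scaling (1.5 points at 1.8 GHz) and the
# "may need 2GHz" case (1.5 points at 2.0 GHz).
advantage_at_1_8 = jaguar_per_ghz / (silvermont_score / 1.8)
advantage_at_2_0 = jaguar_per_ghz / (silvermont_score / 2.0)

print(f"Estimated Silvermont score: {silvermont_score:.1f}")
print(f"Jaguar per-clock advantage if 1.8 GHz suffices: {advantage_at_1_8 - 1:.0%}")
print(f"Jaguar per-clock advantage if 2.0 GHz is needed: {advantage_at_2_0 - 1:.0%}")
```

Under these assumptions the roughly 15% figure lines up with the 2GHz case rather than the perfectly linear one.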
     
  9. jimbo75

    Veteran

    Joined:
    Jan 17, 2010
    Messages:
    1,211
    Likes Received:
    0
    You mean 0.5 points in Cinebench 11.5 yes?

    >15% IPC is what AMD has said all along and they appear to have hit that target at least.

    Sebbbi - remember none of Intel's big core chips have the southbridge on die yet, so that'll need more power. Based on the leaks of Haswell going around I'm expecting a step backwards in x86 performance in order to pay for the GPU increase and integrated southbridge on the ULV parts.
     
  10. DavidC

    Regular

    Joined:
    Sep 26, 2006
    Messages:
    347
    Likes Received:
    24
    Yes, thanks for pointing that out. :grin:

    ULT version of Haswell is said to be 15% faster than the predecessor, which indicates somewhat higher clock speeds.
     
  11. HMBR

    Regular

    Joined:
    Mar 24, 2009
    Messages:
    416
    Likes Received:
    105
    Location:
    Brazil
    Overall it looks like a great improvement over Bobcat: higher ST performance and much higher MT performance....

    But still, looking at some of the tests, it makes me wonder whether it's really valid to go with such low ST performance. There are occasions where you are limited mostly by ST performance in basic usage, so I don't think the MT benchmarks represent the difference between using an i3 ULV and this CPU very well.

    looking at this

    [three images attached]

    Am I wrong in thinking a single Ivy/Sandy Bridge core (with HT) at 3GHz with 2-3MB of L3 would be able to compete with the 1.5GHz quad core Jaguar for MT? For ST it would be on a totally different level. I wonder if it would be possible to get a single strong core at 15W TDP.
     
  12. itsmydamnation

    Veteran Regular

    Joined:
    Apr 29, 2007
    Messages:
    1,294
    Likes Received:
    394
    Location:
    Australia
    Remember Kabini is a complete SoC; your IVB isn't. At low power, ignoring this when comparing performance at a given TDP skews the results.

    As to a 3GHz IVB in 15 watts, it would all come down to voltage. Small bumps in voltage have a large effect on power consumption.
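To put a number on the voltage point, here is a minimal sketch assuming the classic CMOS dynamic power model P = C x V^2 x f, with leakage ignored. The operating points are hypothetical and are not taken from any Intel datasheet.

```python
def dynamic_power(voltage: float, freq_ghz: float, c_eff: float = 1.0) -> float:
    """Relative dynamic power from P = C * V^2 * f (leakage ignored)."""
    return c_eff * voltage ** 2 * freq_ghz

# Hypothetical operating points: same 3.0 GHz clock, 10% more voltage.
base = dynamic_power(voltage=1.00, freq_ghz=3.0)
bumped = dynamic_power(voltage=1.10, freq_ghz=3.0)

increase = bumped / base - 1
print(f"+10% voltage at the same clock: {increase:.0%} more dynamic power")
```

Because voltage enters squared, a 10% bump costs about 21% more dynamic power before leakage is even counted, and higher clocks usually need that extra voltage.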
     
  13. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,716
    Likes Received:
    2,450
    So what is the IPC for a single Jaguar core? I read it's less than 1!!
    Also, for comparison purposes, I would like to know the IPC for a single Sandy Bridge Core i7 core and a Bulldozer core.
     
  14. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,012
    Likes Received:
    112
    An IVB core in 15W at 3GHz should be quite doable. After all, Intel has 17W TDP parts out there which reach 3.3GHz with Turbo, though I don't know if they can hold that clock with one core at full load. Maybe they can, but almost certainly the IGP must be idle to do it. And the clock goes down pretty rapidly with smaller TDPs: the 13W parts (granted, the chip was never meant to get that low) do not exceed 2.6GHz.
    So if you have a multithreaded load you really want 2 cores at 1.5GHz rather than 1 at 3GHz, even for IVB, as that will be much better for power consumption. Don't forget those Kabini power measurements show the 4 cores consuming something like 6W at full load (the GPU taking up the rest of the 15W TDP), whereas that single IVB core would probably draw roughly twice that for the same (multithreaded) performance.

    The theoretical max sustained IPC is 2 for Jaguar, same as it was for Bobcat (it can decode/dispatch/retire 2 ops per cycle, after all). SNB/IVB would be 4; BD is also only 2 (per int core, though it can dispatch/retire more, but decode is 4 shared by 2 int cores). OK, that's x86 IPC; the picture gets more complicated if you look at executed ops in the core (which is what's used past the decode stage).
    Now the IPC you get in practice is something else entirely and will depend on a LOT more factors, though it will definitely be lower, and increasing it for a CPU implementation is HARD (and the higher your theoretical IPC, the more trouble you have achieving any real-world IPC improvement, which is after all the reason lots of low-power cores are restricted to two-wide). It will also vary wildly depending on the code. Last time I did some profiling for some code I had trouble getting it to exceed 1.0 (on a K8 though, which could in theory sustain 3 but in practice is probably very close to Jaguar). Looking at the benchmarks, I guess IVB has about 1.6 times higher IPC in the real world (for a single thread) compared to Jaguar, which given the two times higher theoretical performance is a very good result.
     
    #214 mczak, May 24, 2013
    Last edited by a moderator: May 24, 2013
  15. RedVi

    Regular

    Joined:
    Sep 12, 2010
    Messages:
    387
    Likes Received:
    39
    Location:
    Australia
    I really think they needed a turbo core feature in all models, not just the top end tablet chip. It really sounds like they have turbo sorted out in that one chip though, I guess tests will show for sure.
     
  16. NotTarts

    Regular

    Joined:
    Jun 16, 2010
    Messages:
    278
    Likes Received:
    0
    Vs. Brazos:

    [three images attached]
     
  17. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,716
    Likes Received:
    2,450
    But I remember reading that the theoretical IPC of Core 2 (Conroe) is 4 too! Does that mean the theoretical maximum has not improved since then?

    Also, did you count the supposed fused micro-ops? Or is that just a theoretical nicety too, unattainable in practice?
     
  18. jimbo75

    Veteran

    Joined:
    Jan 17, 2010
    Messages:
    1,211
    Likes Received:
    0
    The perf/Watt improvement over Brazos is astounding.
     
  19. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    I remember reading some IPC comparison article/benchmark years ago, but I can't find it anymore (it might have been from Realworldtech or Anandtech). If I remember correctly, in the general purpose integer test Bulldozer average IPC was 1.1, Sandy was 1.7, and Bobcat was 0.8. According to the Jaguar (Kabini A4-5000) benchmarks, 1.5 GHz Jaguar beats 1.6 GHz Bobcat by 22% (23.46% IPC increase). Extrapolating from the Bobcat IPC score, Jaguar average IPC should be very close to 1.0 in that general purpose integer test. However average IPC of each architecture might be completely different in another test case. If I remember correctly, Sandy's IPC in a mixed SIMD+integer code test was 2.9 (in the same benchmark article). But unfortunately I can't remember how the other chips performed in that test (and if I remember correctly they also used hyperthreading in that test to fill the Sandy Bridge core better = gain better IPC).
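A quick sketch of the clock normalization behind that estimate; the helper name is hypothetical and the inputs are the figures recalled above. Two readings are shown: strict ratio normalization (1.22 x 1.6 / 1.5), and scaling the percentage itself by the clock ratio (22% x 1.6 / 1.5), which is what reproduces roughly the 23.46% quoted. Either way the Jaguar estimate lands near 1.0.

```python
def per_clock_ratio(speedup: float, new_ghz: float, old_ghz: float) -> float:
    """Per-clock performance ratio implied by an overall speedup at different clocks."""
    return speedup * old_ghz / new_ghz

bobcat_ipc = 0.8  # recalled average IPC in the general purpose integer test

# Strict normalization: 22% faster overall despite a lower clock.
strict_ratio = per_clock_ratio(speedup=1.22, new_ghz=1.5, old_ghz=1.6)

# Alternative: scale the 22% gain itself by the clock ratio.
scaled_gain = 0.22 * 1.6 / 1.5

jaguar_ipc_strict = bobcat_ipc * strict_ratio
jaguar_ipc_scaled = bobcat_ipc * (1 + scaled_gain)

print(f"Strict ratio: +{strict_ratio - 1:.1%} per clock, Jaguar IPC ~{jaguar_ipc_strict:.2f}")
print(f"Scaled gain : +{scaled_gain:.1%} per clock, Jaguar IPC ~{jaguar_ipc_scaled:.2f}")
```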

    1.0 IPC is actually quite good if you compare Jaguar to the chips it's going to replace and compete against. In-order PPC CPUs (current gen consoles) have average IPC of around 0.2, ATOM has average IPC of around 0.5 and Bobcat around 0.8. In recent benchmarks 1.5 GHz Jaguar beat 1.7 GHz Cortex A15, indicating it has higher IPC than the top of the line ARM CPU. I don't think the IPC is a problem for Jaguar.

    Jaguar would badly need dynamic CPU clocking (turbo). Intel radically improved their dynamic clocking for Ivy Bridge, and thus the 17W parts can clock up to 3.0 GHz (single threaded tasks / boost burst performance). Haswell improved this further by shortening the idle<->alive transition time to around 1 ms. ARM of course has focused on dynamic clocking / turning off chip parts since day one (as mobile/integrated chips are their main business area).

    One thing I do not understand in AMD's Kabini/Temash SoC configurations is the GPU. It has only 2 CUs. They could have included 4 CUs instead, clocked them at half speed, and had exactly the same performance at a lower TDP. Or even better... they could have created a dynamic GPU clocking system similar to Intel's, and had both lower power consumption in normal usage and much higher performance on demand. 17W Ivy Bridge parts have a 350 MHz nominal GPU frequency and turbo up to 1200 MHz (roughly a 3.4x boost). This is something AMD badly needs if they want to conquer the tablet/ultraportable market.

    Yeah, it surpasses even P4->Core2 :)
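On the 4-CUs-at-half-clock idea above, a minimal sketch of why wide-and-slow can come out ahead, assuming throughput scales with CUs x clock and dynamic power with CUs x V^2 x f. The clocks and voltages are hypothetical, not real Kabini operating points, and the model ignores leakage and the extra die area.

```python
def throughput(cus: int, freq_mhz: float) -> float:
    """Relative GPU throughput: compute units x clock."""
    return cus * freq_mhz

def dynamic_power(cus: int, freq_mhz: float, voltage: float) -> float:
    """Relative dynamic power: units x V^2 x f (leakage ignored)."""
    return cus * voltage ** 2 * freq_mhz

# Hypothetical configurations. The lower clock is assumed to allow a lower voltage.
narrow = {"cus": 2, "freq_mhz": 500.0, "voltage": 1.00}
wide = {"cus": 4, "freq_mhz": 250.0, "voltage": 0.85}

narrow_perf = throughput(narrow["cus"], narrow["freq_mhz"])
wide_perf = throughput(wide["cus"], wide["freq_mhz"])
narrow_power = dynamic_power(**narrow)
wide_power = dynamic_power(**wide)

print(f"Throughput: narrow {narrow_perf:.0f}, wide {wide_perf:.0f}")
print(f"Wide config dynamic power relative to narrow: {wide_power / narrow_power:.0%}")
```

The saving comes entirely from the assumed voltage drop; at equal voltage the two configurations draw the same dynamic power in this model.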
     
  20. RedVi

    Regular

    Joined:
    Sep 12, 2010
    Messages:
    387
    Likes Received:
    39
    Location:
    Australia
    They could make some laptops with simply amazing battery life - whether anyone actually makes a quality AMD laptop is another matter. They'll probably instead see it as a good opportunity to save money on the battery and/or overvolt the thing for no reason. *sigh*
     