New AMD low power X86 core, enter the Jaguar

Discussion in 'PC Industry' started by liolio, Aug 28, 2012.

  1. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,724
    Likes Received:
    194
    Location:
    Stateless
    #1 liolio, Aug 28, 2012
    Last edited by a moderator: Aug 31, 2012
  2. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,593
    Likes Received:
    998
    Jaguar: Double the FPU width (and datapaths), larger ROP and higher frequency (with same power consumption). I'd like to know how far down they can scale this power-wise.

    Wrt. Steamroller: It seems they've looked at all the issues we've discussed here. Reduced I$ misses, wider dispatch to cores, better branch prediction and "major improvements to store handling". It'll be interesting to see what that adds up to.

    Cheers
     
  3. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,724
    Likes Received:
    194
    Location:
    Stateless
    Well, it seems that they also came up with a new cache hierarchy. The L2 is inclusive and shared.

    They haven't disclosed any TDP for those CPUs, but I would assume that they aim as low as with their predecessors (and the scrapped ones), so 4.5 Watts for the slowest SKU.
    They definitely can't go into the phone realm as Intel is doing with its Atom. They may do well in the tablet market and above, but I feel they might still be too power hungry for low-end tablets.
    The problem is they have to get those parts out fast; the roadmap states 2013 and they haven't provided more details. That's bothersome, as Intel is to launch their OoO Atom, most likely with way better power characteristics, in Q4 2013. AMD's execution problems are going to push them off the cliff for good at this rate. They are missing the Windows 8 launch, and that's bad. They would have moved a wagonload of those cores.

    By the way, what's with the blackout on those CPU cores in the SMS? I can't find a proper article in English, be it AnandTech, Tom's Hardware or Tech Report.

    Sadly, when Steamroller is released, so will be Haswell... They may just get past their Phenom II parts in single-thread performance with those cores. It's bad. There are still plenty of issues unfixed (as Anand and hardware.fr point out in their articles: L3 cache, FP/SIMD scheduler).
    With further tweaking, fixing and their new high density library, Excavator may be when Bulldozer comes together. The whole issue is that that's in 2014.
    If Haswell is a strike like Conroe and Nehalem were, they are in a shitty situation no matter what.

    I wonder if AMD sort of acknowledges the intrinsic issues in BD and they are fixing plenty of things so they look committed to this architecture (they don't have a choice anyway, as I guess that even reusing the innards of BD to make a "standard" core it would take 2 years or more to come up with something new... 2 years without new products, when their previous architecture has already given everything it could, is not an option).
    If they survive, I would not be that surprised if at that point they split their module into 2 plain cores and forget about the many headaches the module introduces. They may at that point also come up with a new cache hierarchy.
     
  4. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,293
    Location:
    Helsinki, Finland
    Jaguar seems to be a really big improvement over Bobcat. It has slightly reduced power consumption... and still AMD has managed to double the core count, double the total L2 cache size, improve clocks by 10%, improve general x86 IPC by 15% and vector IPC by 100%.

    If my math is correct, we should see around 2.53x (2.0*1.10*1.15) performance in generic (four thread) x86 software and around 5.06x (previous * 2.0) performance in vector processing (integer/float) compared to Bobcat. Not bad at all :)
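
A minimal sketch of the arithmetic in the post above, assuming (as the post does) that the quoted scaling factors are independent and simply multiply. The constant and function names are made up for illustration; the numbers are the marketing figures quoted in the post, not measurements.

```python
# Scaling factors quoted in the post (Jaguar relative to Bobcat).
CORE_COUNT = 2.0    # double the core count
CLOCK = 1.10        # clocks improved by 10%
X86_IPC = 1.15      # general x86 IPC improved by 15%
VECTOR_FACTOR = 2.0  # the extra 2.0 the post applies for vector work


def generic_speedup() -> float:
    """Four-thread x86 throughput vs. Bobcat, assuming the factors multiply."""
    return CORE_COUNT * CLOCK * X86_IPC


def vector_speedup() -> float:
    """Vector throughput as computed in the post: generic speedup times 2.0."""
    return generic_speedup() * VECTOR_FACTOR


if __name__ == "__main__":
    print(f"generic x86: {generic_speedup():.2f}x")
    print(f"vector:      {vector_speedup():.2f}x")
```

This only reproduces the multiplication; whether the factors really compose this way is a separate question.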
     
  5. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,593
    Likes Received:
    998
    The wider SIMD paths will help in some general purpose workloads too, SSE is widely used for copying, string search etc.

    Seems like a good fit for a W8 tablet SOC.

    Cheers
     
  6. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,293
    Location:
    Helsinki, Finland
    Jaguar IPC shouldn't be far away from Bulldozer's.

    The original Bobcat introduction marketing slides stated that Bobcat's goal was to reach 90% of the IPC of their high-end desktop CPUs. If I remember correctly, Bobcat was compared to Phenom at that time. If we just calculate IPC based on their marketing department's numbers, Jaguar should have on average pretty much identical IPC compared to Phenom (0.9*1.15 = 1.035). Phenom II has around 5%-25% higher IPC than Phenom (benchmarks: http://www.anandtech.com/show/2702/9). Bulldozer, on the other hand, has slightly worse IPC than Phenom II. All things combined, the Jaguar IPC shouldn't be that far away from Bulldozer's. Of course Bulldozer scales to much higher clocks, and both Piledriver and Steamroller improve the IPC further. Still, the IPC is very respectable for such a low power CPU.
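
The chain of ratios above can be sketched as follows, with Phenom's IPC normalized to 1.0. All inputs are the figures quoted in the post (marketing claims and a rough benchmark range), and the names are hypothetical. Bulldozer is left out because the post only says "slightly worse" without giving a number.

```python
PHENOM_IPC = 1.0                    # baseline, by definition
BOBCAT_VS_PHENOM = 0.90             # marketing goal: 90% of the desktop core's IPC
JAGUAR_VS_BOBCAT = 1.15             # claimed 15% IPC gain over Bobcat
PHENOM2_VS_PHENOM = (1.05, 1.25)    # Phenom II is 5%-25% ahead of Phenom


def jaguar_vs_phenom() -> float:
    """Jaguar IPC relative to Phenom, chaining the two marketing figures."""
    return PHENOM_IPC * BOBCAT_VS_PHENOM * JAGUAR_VS_BOBCAT


def jaguar_vs_phenom2() -> tuple[float, float]:
    """(worst case, best case) Jaguar IPC relative to Phenom II."""
    low_gain, high_gain = PHENOM2_VS_PHENOM
    jaguar = jaguar_vs_phenom()
    return jaguar / high_gain, jaguar / low_gain


if __name__ == "__main__":
    worst, best = jaguar_vs_phenom2()
    print(f"Jaguar vs Phenom:    {jaguar_vs_phenom():.3f}")
    print(f"Jaguar vs Phenom II: {worst:.3f} to {best:.3f}")
```

Under these assumptions Jaguar lands somewhat below Phenom II's IPC, which is where the post places Bulldozer as well.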
     
  7. AlexV

    AlexV Heteroscedasticitate
    Moderator Veteran

    Joined:
    Mar 15, 2005
    Messages:
    2,531
    Likes Received:
    127
    I think that they never solidified the claim, and my hunch (based on how Bobcat fared in practice) was that they were actually comparing to an older K8 or something. I don't think there's any workload where Bobcat gets reasonably close to Phenom (at equal clocks).
     
  8. itsmydamnation

    Veteran Regular

    Joined:
    Apr 29, 2007
    Messages:
    1,318
    Likes Received:
    416
    Location:
    Australia
    http://www.xtremesystems.org/forums...Core-performance-analysis-Bobcat-vs-K10-vs-K8

    [IMG]


    Things could look quite bad for AMD when Jaguar beats Trinity in non-FMA FP... lol.
     
    #8 itsmydamnation, Aug 31, 2012
    Last edited by a moderator: Aug 31, 2012
  9. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,724
    Likes Received:
    194
    Location:
    Stateless
    Thanks for that data :)

    I remember reading an article (AnandTech) stating that Bobcat was in between Atom and K10. Now that I see your results, I realize that I didn't pay attention to something important: the results are not normalized for clock speed. AMD came close to their 90% claim.

    EDIT: you posted that chart in the "predict next gen etc" forum. There seem to be some concerns about the perf of those parts.
     
  10. Raqia

    Regular

    Joined:
    Oct 31, 2003
    Messages:
    508
    Likes Received:
    18
    A lot more detail from the actual talk here:

    http://www.theregister.co.uk/2012/08/29/amd_jaguar_core_design/

    It's interesting that the layout tools from the ATI side are now in full deployment on CPU core designs. They don't mention if Jaguar will have a unified CPU/GPU memory space; it seems like something they could do since they're pairing off the CPU w/ a GCN based GPU. Also, I guess when each core is that small, it doesn't make much sense to invest in sizable circuitry to power them each down individually.
     
  11. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,724
    Likes Received:
    194
    Location:
    Stateless
    Does anybody know if those "layout tools" are the same as what AMD called "high density libraries" and plans to use on Excavator cores?
     
  12. swaaye

    swaaye Entirely Suboptimal
    Legend

    Joined:
    Mar 15, 2003
    Messages:
    8,619
    Likes Received:
    689
    Location:
    WI, USA
    So slightly better budget notebooks, tablets-with-fans, and curious little ITX boards are on the way? ;)
     
  13. Raqia

    Regular

    Joined:
    Oct 31, 2003
    Messages:
    508
    Likes Received:
    18
    They probably are one and the same, and it's likely that the same libraries are responsible for the dramatic increase in density from ATI's 38XX series to the 48XX series; I certainly wasn't expecting a leap from 320 shaders to 800 in one generation!

    It's interesting that on the BD fpu, there's a large regular striped area on the left that doesn't look like cache:

    http://images.anandtech.com/doci/6201/Screen Shot 2012-08-28 at 4.38.31 PM.png

    Maybe those are ALUs? I need Hans De Vries. :) The density is very nice for die space and power but Anand says these denser units don't clock as well. I'll settle for two of these denser FPUs per module over a single fast clocked FPU.
     
  14. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,724
    Likes Received:
    194
    Location:
    Stateless
    I guess it's a reasonable guess to make.
    I guess both pictures are simulations; the blank part is just that, blank space. I guess it helps to highlight the gain in space.
    Wrt clock speed, AMD itself stated that those libraries best serve power-constrained designs. Everything is power constrained nowadays... lol.
    I think it's going to have an impact on OC, but it could prove a far lesser evil than falling even further behind Intel with regard to transistor density and power characteristics.

    ------------------
    By the way, there are inaccuracies in The Register's article: they stated that each core has 512 KB of L2, which is incorrect; the 2 MB of cache is shared by the 4 cores.
    Overall I like the hardware.fr article better (more info/slides and fewer inaccuracies), go Frenchies / Belgians :)
     
  15. Raqia

    Regular

    Joined:
    Oct 31, 2003
    Messages:
    508
    Likes Received:
    18
    I meant that region w/ the multi-colored, vertical striped pieces on the left of the connected, non white-space region present in both panels. It's pretty clear what features are homologous across the two panels, and I'm pretty sure the unmoved boxes are cache or memory cells of some sort. The striped region is a mystery; it looks much more regular than the blobby regions, and its topology seems less affected by the optimization.

    I'm wondering what difference in power having a full-speed cache would have made. The shared cache is definitely a nice single-thread feature that probably consolidates some circuitry across the cores and possibly the GPU as well.

    Jaguar looks like a pretty good upgrade over Bobcat, but it's not enough reason for me to get a new netbook. I'd be much more convinced if they dramatically improved the displays they put in those things, and I'm glad Apple's pushing the industry toward high density IPS screens.
     
  16. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,515
    Likes Received:
    441
    Location:
    Varna, Bulgaria
    Looking at the BD die, I think the sample above is part of the Integer SIMD (MMX) pipeline.
     
  17. hkultala

    Regular

    Joined:
    May 22, 2002
    Messages:
    284
    Likes Received:
    6
    Location:
    Herwood, Tampere, Finland
    Your math is not correct.

    You can't just multiply the "overall IPC increase" and SIMD width numbers.

    When the vector width doubles, in order for the performance to really double, the performance of all other parts of the chip that are needed for those FP calculations (instruction fetch, decode, memory bandwidth etc.) should also double. But those other parts are only getting a 15% increase in performance. So the real performance increase is somewhere between 1.15 and 2, not 1.15*2.

    Though they also widened the L1D<->FP datapaths, so that L1D bandwidth to the FPU is also doubled, so that won't become a worse bottleneck.
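
One way to put numbers on the "somewhere between 1.15 and 2" argument is an Amdahl's-law style model. This is only a sketch under a stated assumption: a hypothetical fraction `vector_bound` of the original run time is limited purely by vector execution width (and gets the 2x), while the rest of the time is limited by everything else (and gets the 1.15x). That fraction is not a figure from this thread.

```python
GENERAL_GAIN = 1.15  # fetch, decode, etc.: the quoted 15% IPC increase
VECTOR_GAIN = 2.0    # doubled vector width


def per_core_speedup(vector_bound: float) -> float:
    """Per-core, per-clock speedup for a given vector-bound time fraction."""
    if not 0.0 <= vector_bound <= 1.0:
        raise ValueError("vector_bound must be between 0 and 1")
    # New run time is the sum of the two parts, each shrunk by its own gain.
    new_time = (1.0 - vector_bound) / GENERAL_GAIN + vector_bound / VECTOR_GAIN
    return 1.0 / new_time


if __name__ == "__main__":
    for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"vector-bound {fraction:.2f}: {per_core_speedup(fraction):.2f}x")
```

In this model the result runs from 1.15x (no time spent vector-bound) to 2.0x (all of it), and never reaches 1.15*2 = 2.3x.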
     
  18. itsmydamnation

    Veteran Regular

    Joined:
    Apr 29, 2007
    Messages:
    1,318
    Likes Received:
    416
    Location:
    Australia
  19. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,293
    Location:
    Helsinki, Finland
    That wasn't intentional :oops:. The correct peak vector (int/float) performance should be 1.1*2.0*2.0 = 4.4x. Fortunately they also doubled the datapaths and cache sizes (the double-perf part will go through the caches twice as fast). Unfortunately memory bandwidth is still unknown, as is the GPU paired with the core. You need more memory bandwidth to be able to reach 4.4x performance. Hopefully AMD reveals the other parts of the APU soon. If they intend to change the GPU to GCN (as most reviewers seem to believe), I expect a big bump in GPU performance as well, and a better GPU would make them bandwidth constrained (all Trinity APUs are also bandwidth constrained; memory overclocks give large benefits to performance).
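
A rough sketch of why the 4.4x peak depends on memory bandwidth, in the spirit of a roofline model. The peak comes from the figures above (clock 1.1, cores 2.0, vector width 2.0). The two bandwidth parameters are hypothetical placeholders, since the real memory bandwidth is unknown: `bandwidth_scaling` is how much memory bandwidth grows over Bobcat, and `baseline_bw_utilization` is the share of Bobcat's bandwidth the workload already used.

```python
def peak_vector_scaling(clock: float = 1.1, cores: float = 2.0, width: float = 2.0) -> float:
    """Peak vector throughput vs. Bobcat: clock gain x core count x vector width."""
    return clock * cores * width


def attainable_scaling(peak: float, bandwidth_scaling: float,
                       baseline_bw_utilization: float) -> float:
    """Speedup once the memory bandwidth ceiling is taken into account.

    A workload that used a fraction u of the old bandwidth needs s * u of it
    to run s times faster, so with m times the bandwidth, s is capped at m / u.
    """
    if not 0.0 < baseline_bw_utilization <= 1.0:
        raise ValueError("baseline_bw_utilization must be in (0, 1]")
    bandwidth_ceiling = bandwidth_scaling / baseline_bw_utilization
    return min(peak, bandwidth_ceiling)


if __name__ == "__main__":
    peak = peak_vector_scaling()
    print(f"peak: {peak:.1f}x")
    # Hypothetical case: unchanged bandwidth, workload used half of it on Bobcat.
    print(f"capped: {attainable_scaling(peak, 1.0, 0.5):.1f}x")
```

With unchanged bandwidth, a workload that already used half of it can only double in speed, well short of the peak.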

    Another thing to discuss is how they have implemented the 256 bit vector (AVX) support for the chip, and how Bobcat handled 128 bit SSE instructions in its half width 64 bit vector pipelines. Did Bobcat split 128 bit vector ops in the decoders into two 64 bit uops? Does Jaguar do the same for 256 bit ops (split into 128 bit uops)? If both architectures split the large vectors into two uops, the Jaguar frontend/icaches/etc. should be able to process the same number of 128/256 bit instructions as Bobcat processed 64/128 bit instructions. This would mean that the frontend/icaches/etc. are also able to sustain the 2x performance. If the new architecture can populate the pipelines better (better frontend/etc.), we might even see slightly more than 2x vector processing performance (real performance, not peak). But this of course assumes we are not bottlenecked elsewhere (and as you said, we likely are).
     
  20. DavidC

    Regular

    Joined:
    Sep 26, 2006
    Messages:
    348
    Likes Received:
    27
    It is true: http://www.anandtech.com/bench/Product/328?vs=116

    K10 based on Deneb I think is about 15-20% faster, so comparing to Bobcat the difference is probably 25-35%. That's not small.
     

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.