Comparing embedded GPUs in Mobiles to Laptops

Discussion in 'Mobile Graphics Architectures and IP' started by Nano, Jun 5, 2011.

  1. Nano

    Regular

    Joined:
    Dec 7, 2007
    Messages:
    288
    Likes Received:
    0
    Location:
    London, England
    Hi all,

    I have a bit of a question about the recent hand-held GPUs that have been a huge feature in smart-phones, from the PowerVR SGX to the Tegra and Adreno.

    I have a very good idea of the architectural differences that come into play, but from a performance perspective, how would we compare their computing power and fill-rates to popular integrated chipsets found in laptops, e.g. ATI Mobility Radeon HD 3200, Intel GMA 4500, NVIDIA ION?

    How does a Tegra 2, PowerVR SGX 540 or 543 MP2 compare?

    Given the power and heat constraints, and the clever, efficient designs that have evolved greatly over the years, it would be very interesting to see where mobile phones stand today relative to their integrated laptop equivalents.


    Thanks,

    Nano

    http://www.nvidia.com/object/tegra-2.html

    http://www.imgtec.com/PowerVR/sgx.asp
     
  2. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    Their raw theoretical performance is a complete joke compared to even the lowliest of current IGPs. But at least it's improving VERY quickly, especially in terms of flops. The effective ALU-TMU ratio will soon exceed that of many PC parts - memory bandwidth is scarce on handhelds and some rather hilariously lame reasons (not sure I can get into those?) mean Anisotropic Filtering often isn't even available.

    Just to give you an idea... Tegra2 and Mali-400MP1 have 1 TMU, SGX535/SGX540 in the Apple A4 or Samsung Hummingbird have 2 TMUs, Mali-400MP4 in Exynos and SGX543MP2 in the Apple A5 have 4 TMUs, SGX543MP4 in NGP has 8 TMUs. And we're talking clock speeds between 200MHz and 400MHz, so also much lower than IGPs. NVIDIA's original Ion had 8 TMUs at 450MHz, so more than twice the NGP with 8 TMUs at (apparently) only 200MHz!

    In practice, it's better than that, they're certainly more efficient, and TMUs are the worst offender as I said. A real comparison in a real complex game with a high quality port would be very interesting. But expect no miracles here.

    The next generation will be a lot more interesting... ST-Ericsson A9600 with Rogue/Series6 will have 5+ GPixels fillrate (versus 3.6 for Ion) and a 4x higher flops-to-TMU ratio. And likely other efficiency advantages as well.
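The fill-rate figures in this post can be reproduced with simple arithmetic. The following is a minimal sketch, assuming the usual peak-rate convention of one texel per TMU per clock (that convention and the function name are illustrative assumptions, not something stated in the post); the TMU counts and clocks are the ones quoted above.

```python
def texel_rate_gtexels(tmus: int, clock_mhz: float) -> float:
    """Peak texel rate in GTexels/s, assuming one texel per TMU per clock."""
    return tmus * clock_mhz / 1000.0


# (TMU count, clock in MHz) as quoted in the post above.
parts = {
    "NVIDIA Ion": (8, 450),
    "SGX543MP4 in NGP": (8, 200),
}

for name, (tmus, mhz) in parts.items():
    print(f"{name}: {texel_rate_gtexels(tmus, mhz):.1f} GTexels/s")

# Ratio between the two parts; the post describes it as "more than twice".
ratio = texel_rate_gtexels(8, 450) / texel_rate_gtexels(8, 200)
print(f"Ion / NGP ratio: {ratio:.2f}x")
```

These are theoretical peaks only; as the post notes, real-world efficiency differs between architectures.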
     
  3. Nano

    Regular

    Joined:
    Dec 7, 2007
    Messages:
    288
    Likes Received:
    0
    Location:
    London, England
    What about in comparison to the Intel GMA 4500? Or lower end chips (than the ION) like the Mobility Radeon 3200?

    I expect these embedded GPUs are very modest in their form, but still it's quite fascinating to see.

    Do you have any statistics on the flops ratings between these devices?

    Thanks for your responses so far :)
     
    #3 Nano, Jun 5, 2011
    Last edited by a moderator: Jun 5, 2011
  4. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,022
    Likes Received:
    122
    GMA4500 has quite high flops actually - it just doesn't make very efficient use of them :) 8 TMUs too I believe. Current high-end SOCs might have comparable performance (though much lower theoretical numbers).
    It's got 10 shader units which are physically 4-wide - 40 flops (80 if you count the very limited MAC) per clock at quite a high clock.
    The Radeon Mobility HD 3200 also has 40 shader units (80 flops/clock / 8 TMUs) - overall performance is comparable to ION, flops (as usual) quite a bit higher.
     
  5. Nano

    Regular

    Joined:
    Dec 7, 2007
    Messages:
    288
    Likes Received:
    0
    Location:
    London, England
    So are we talking around 40 Gflops for Mobility Radeon 3200? I think ION is about 52 Gflops

    In raw terms how many Gflops are we looking at in a Tegra 2, SGX 543 MP2 or SGX 543 MP4 like PSvita?

    How would the considerations to the efficiency of these embedded mobile chips factor into real-world graphics performance?
     
    #5 Nano, Jun 7, 2011
    Last edited by a moderator: Jun 7, 2011
  6. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    Theoretical performance numbers like FLOPs or triangle rates are as meaningless as always, since e.g. FLOP != FLOP between architectures.

    SoCs like Tegra2, Apple A5 or the one in Sony's NGP haven't been developed for something like notebooks or low end PCs, and none of them is even DX10 compliant. In that regard I don't see that any comparison would be worth anything. Embedded SoCs are tailored primarily for ultra low power consumption, and for the time being the necessary >DX9.0 overhead in terms of die area would be a waste.

    For the next technology generation of embedded GPUs, like the IMG Series6 Rogue in the ST-Ericsson A9600 (a smart-phone SoC that is at least DX10 compliant), its >210 GFLOPS, >5 GTexels/s and >350M Tris/s make for a somewhat better comparison, but then it would be up against notebook/PC SoCs from 2012 onwards, which will be DX11 and a lot faster than what is available today.

    In the rather unlikely case where SONY would opt for a DX11 Rogue MP mega-SoC for a console design, things would definitely get interesting. As for Intel, for anything above embedded I have to see it first and then believe it.
     
  7. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,022
    Likes Received:
    122
    Only with very generous counting. Otherwise more like 35 GFlops max. Some Ions only have 8 instead of 16 "cuda cores" so only half.

    That's not easy to answer, especially if you're looking at TBDR vs. IMR.

    I'm not quite sure what the chips are missing for DX10 compliance, and if that really would cost that much. Some though definitely don't have the required precision (for the ALUs for instance, and also z-buffer) which makes them not even really DX9 compliant.

    210Gflops for Rogue? Where did you get that number? IIRC SGX543 has got something like 16 (fp32) flops / clock which at usual clocks is about 2 orders of magnitude lower, so that would be more than a drastic increase.
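The "two orders of magnitude" estimate can be checked numerically. This is a rough sketch using only figures from the thread: 16 FP32 flops/clock as assumed in this post, the 200MHz to 400MHz handheld clock range mentioned earlier, and the >210 GFLOPS claim for Rogue. The function names are illustrative.

```python
import math


def gflops(flops_per_clock: int, clock_mhz: float) -> float:
    """Peak GFLOPS from flops per clock and clock speed in MHz."""
    return flops_per_clock * clock_mhz / 1000.0


def orders_of_magnitude_gap(high: float, low: float) -> float:
    """Base-10 logarithm of the ratio between two throughput figures."""
    return math.log10(high / low)


ROGUE_CLAIM_GFLOPS = 210.0   # figure quoted for the A9600
SGX543_FLOPS_PER_CLOCK = 16  # this post's assumption, disputed later in the thread

for mhz in (200, 400):
    current = gflops(SGX543_FLOPS_PER_CLOCK, mhz)
    gap = orders_of_magnitude_gap(ROGUE_CLAIM_GFLOPS, current)
    print(f"{mhz} MHz: {current:.1f} GFLOPS, gap of {gap:.2f} orders of magnitude")
```

Under these assumptions the gap works out to somewhere between one and a half and two orders of magnitude, depending on the clock.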
     
  8. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    I'm a bit generous today ;)

    When was the last time you've checked B3D's frontpage? LOL http://www.beyond3d.com/content/articles/112/

    http://www.stericsson.com/press_releases/NovaThor.jsp

    As Arun suggests, though, it is most likely either a dual or triple core. A single core seems completely unlikely for such high performance metrics. Although I'm wrong in the majority of cases, I'd follow Arun's 667MHz theory, with each core having something like 16 Vec5 ALUs and 4 TMUs. Baseline DX10; maximum DX11 according to IMG's latest announcements.
     
  9. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,022
    Likes Received:
    122
    Well I actually read that at some point. I think I dismissed it cause the numbers looked too good...
    The 210gflops number could be lower (fp16) precision in theory though it would still be very high. GFlops in the neighborhood of a radeon HD6450 (twice that of Brazos) just doesn't sound right for a SOC, especially not for a TBDR.
     
  10. Nano

    Regular

    Joined:
    Dec 7, 2007
    Messages:
    288
    Likes Received:
    0
    Location:
    London, England
    True, the thought did cross my mind, but it's interesting from a numbers perspective.

    If I'm not mistaken, ImgTec have made the PowerVR SGX chips DX10 & DX10.1 compliant for any application in Laptops or such..
     
  11. JohnH

    Regular

    Joined:
    Mar 18, 2002
    Messages:
    595
    Likes Received:
    18
    Location:
    UK
    All SGX parts meet shader precision requirements for Dx9 SM3.0

    SGX543/544 are actually 36 flops/clock for a single core, 554 is 72 flops/clock also for a single core. So not sure how you get to two orders of magnitude.

    John.
     
  12. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,022
    Likes Received:
    122
    Haven't seen any yet. SGX535 is fully d3d9 compliant (might be true for SGX543 and up too, not sure). SGX545 is rumored to be d3d10.1 compliant, but it's not in any shipping product yet.
     
  13. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,022
    Likes Received:
    122
    As far as I know they aren't necessarily fast at it though (not all of them at least). But I was speaking generally for SOCs, and nvidia ones for instance don't.

    But that's not fp32 flops is it? I thought the 543 has 4 USSE2 pipes, each capable of maybe 2 fp32 fmads (hence 16 flops).
    How do you arrive at 36?
     
  14. Simon F

    Simon F Tea maker
    Moderator Veteran

    Joined:
    Feb 8, 2002
    Messages:
    4,563
    Likes Received:
    171
    Location:
    In the Island of Sodor, where the steam trains lie
    Perhaps because he's on the team that specified it.
     
  15. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    I'd believe John if I were you ;) 543 has 4 USSE2 pipes, each capable of 4 FP32 MADDs and one additional FP32 operation which IMG has never really talked about (best case it's an extra ADD, worst case it only does stupid stuff like format conversions, I don't know). FP16 and INT8 performance is higher but not massively so.
     
  16. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,022
    Likes Received:
    122
    Hmm 9 per USSE2 pipe is an odd number though, so I've got some feeling that's not just straight MADs :).
    And you didn't answer the fp32 part :). All I've heard so far is flops without any precision.

    Is there a (trustworthy) source somewhere for the 4 FP32 MADs per pipe?
     
  17. JohnH

    Regular

    Joined:
    Mar 18, 2002
    Messages:
    595
    Likes Received:
    18
    Location:
    UK
    Well they're not actually slow either, there's just not that many of them in early variants, however on the variants that are appearing in devices now....
    Yes, these are FP32 ops. As Arun indicated, each pipeline is a VEC4 F32 FMAD; in addition we can run a parallel (real) floating-point op, giving 9 F32 flops per pipe or 36 per core.

    The key point is that per clock these cores are in the same ballpark (or higher) as the desktop derived mobile parts mentioned above, series 6 raises this bar a lot further.

    Edit: I have no idea why you wouldn't consider me to be trustworthy on this.
     
  18. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,022
    Likes Received:
    122
    Hmm ok. That's quite a lot. I thought the earlier SGX parts (52x, 53x) couldn't do that much (per pipe).

    Ok series 6 looking good then (of course I'd expect 28nm fusion to include Caicos-like IGP but still looks like those SOCs will close the gap). I guess we'll see slower ones than that A9600 though, I wonder how series 6 compare in terms of efficiency (both power and die area) to series 5, considering the better feature set.

    Oh I certainly do - you just didn't confirm it was really fp32 before (not in this thread, at least, maybe elsewhere I missed).
     
    #18 mczak, Jun 7, 2011
    Last edited by a moderator: Jun 7, 2011
  19. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    430
    Location:
    Cleveland, OH
    They can't. Series5XT can dual issue FMADDs. I've been presuming that the same limitations as with Series5 apply to the FMADDs, which is that they can be four fully independent fixed-point 8 or 10-bit operations, two fully independent FP16, one fully independent FP32, or two FP32 sharing a parameter. I think the shared parameter was a common multiplier. That information is just relayed from what JohnH has previously posted, not barring places where I may have broken it.

    That it can do some other operations means that that operation has to be encoded somewhere and that it can issue more than two instructions somehow, unless there's an FMADD + something else sort of instruction.
     
  20. JohnH

    Regular

    Joined:
    Mar 18, 2002
    Messages:
    595
    Likes Received:
    18
    Location:
    UK
    The original series 5 cores were 2x F32 MAD per pipe per clock so 4 flops/clock/pipe, so a little less than half S5 XT per pipe performance.
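The per-pipe counting used in the last few posts can be written out explicitly. This is a small sketch, counting each MAD as two flops (multiply plus add) and taking the pipe count of 4 for SGX543 from Arun's post; the constant and function names are illustrative only.

```python
FLOPS_PER_MAD = 2  # one multiply plus one add

# Series 5: 2x F32 MAD per pipe per clock.
SERIES5_FLOPS_PER_PIPE = 2 * FLOPS_PER_MAD

# Series 5XT: Vec4 F32 FMAD plus one parallel floating-point op per pipe.
SERIES5XT_FLOPS_PER_PIPE = 4 * FLOPS_PER_MAD + 1


def flops_per_core(flops_per_pipe: int, pipes: int) -> int:
    """Peak FP32 flops per clock for one core."""
    return flops_per_pipe * pipes


print("Series 5 per pipe:  ", SERIES5_FLOPS_PER_PIPE)
print("Series 5XT per pipe:", SERIES5XT_FLOPS_PER_PIPE)
print("SGX543 per core:    ", flops_per_core(SERIES5XT_FLOPS_PER_PIPE, 4))
print("Series 5 / 5XT per-pipe ratio:",
      round(SERIES5_FLOPS_PER_PIPE / SERIES5XT_FLOPS_PER_PIPE, 2))
```

The per-pipe ratio of 4 to 9 is what "a little less than half" refers to, and 4 pipes at 9 flops each give the 36 flops/clock quoted for a single SGX543 core.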
     