Samsung Orion SoC - dual-core A9 + "5 times the 3D graphics performance"

Discussion in 'Mobile Devices and SoCs' started by Mike11, Sep 7, 2010.

  1. Wishmaster

    Newcomer

    Joined:
    Nov 16, 2008
    Messages:
    238
    Location:
    Warsaw, Poland
    According to those scores it is clocked at 1400MHz, so it should offer roughly similar performance to dual-core Kraits.

    Probably because offscreen scores aren't limited to 60fps, so they can test full performance.
     
  2. tangey

    Veteran

    Joined:
    Jul 28, 2006
    Messages:
    1,218
    Location:
    0x5FF6BC
    A 35% jump in Egypt offscreen compared to the magiclego4212, but only 15% up in Pro offscreen, seems hard to rationalise.
    Magiclego was also showing a clock of 1.4GHz.

    Because it is offscreen and not frame rate limited of course.
     
  3. french toast

    Veteran

    Joined:
    Jan 5, 2012
    Messages:
    1,648
    Location:
    Leicestershire - England
    Cheers... I should have realised that.

    No, I meant: what do you think the shipping clock speed of the GS3 will be? Rumours range from 1.5 to 1.8GHz.
     
  4. Wellington

    Newcomer

    Joined:
    Apr 5, 2012
    Messages:
    10
    Pro is probably CPU bound, so it won't see a huge increase with a GPU uplift.
     
  5. tangey

    Veteran

    Joined:
    Jul 28, 2006
    Messages:
    1,218
    Location:
    0x5FF6BC
    Dunno.

    Where the 1GHz iPad 2 is 3% slower in Egypt, it is actually 16% quicker in Pro. One would assume a CPU-bound benchmark would show a difference in favour of the 1.4GHz S3, but it is the opposite.

    Additionally, iPad 2 -> iPad 3 showed a 60% increase in Pro (and around a 56% increase in Egypt), with no change in CPU speed. Although there are major changes to the memory bandwidth on the iPad 3, if Anandtech is to be believed these are really only exposed to the GPU.
     
    #305 tangey, Apr 5, 2012
    Last edited by a moderator: Apr 5, 2012
  6. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    6,082


    My guess is that the Pro test really is bandwidth-limited.
    That test was already available in GLBenchmark 1.1, which tests only OpenGL ES 1.1 functionality.
    That said, I guess what limits that test may be memory bandwidth alone, since everything else has pretty much skyrocketed since 2006.
     
  7. Lazy8s

    Veteran

    Joined:
    Oct 3, 2002
    Messages:
    3,081
    The bandwidth should get a bit of a pop with TSV approaches and Wide I/O mobile DRAM becoming standard within a few years. LPDDR3 should fit in there sometime, too.
     
  8. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,208
    Location:
    Chania
    Or the other former devices simply had the GPU clocked lower. In any case I expected a Mali400MP4 to be quite a bit ahead even of Tegra3 tablet SoCs. Too bad for Samsung they couldn't ship that MP4 at 400MHz under 45nm.
     
  9. french toast

    Veteran

    Joined:
    Jan 5, 2012
    Messages:
    1,648
    Location:
    Leicestershire - England
    Could well be bandwidth. ONE X scores show a closer ratio between Egypt and Pro, and Tegra 3 has worse bandwidth, so if it is bandwidth, that is what you would expect to happen.
    http://www.glbenchmark.com/phonedetails.jsp

    Is it LPDDR2 800 that gets that 6.4GB/s? Is there a chance we could see an increase with LPDDR2 1066?
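A rough sanity check on the 6.4GB/s figure, as a sketch only. It assumes a dual-channel interface with 32 bits per channel, which is not stated anywhere in the thread, and `peak_bandwidth_gbs` is a hypothetical helper name:

```python
def peak_bandwidth_gbs(transfers_mt_s, bus_bits=32, channels=2):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes).

    bus_bits and channels are assumptions (dual-channel, 32-bit LPDDR2),
    not figures taken from the thread.
    """
    bytes_per_transfer = bus_bits // 8
    return transfers_mt_s * 1e6 * bytes_per_transfer * channels / 1e9


for speed in (800, 1066):
    print(f"LPDDR2-{speed}: {peak_bandwidth_gbs(speed):.2f} GB/s")
```

Under those assumptions LPDDR2 800 does land on 6.4GB/s, and LPDDR2 1066 would raise the theoretical peak by about a third.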

    EDIT: Actually, on second look there isn't a closer ratio. Maybe, but you would expect the difference to be bigger considering the lower bandwidth of Tegra 3.

    EDIT 2: If this is anything to go by, we are looking at a redesign for the GS3, a clever move by Sammy to persuade people to hold off buying the HTC ONE X:
    [IMG]
     
    #309 french toast, Apr 5, 2012
    Last edited by a moderator: Apr 5, 2012
  10. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,208
    Location:
    Chania
    It's not bandwidth alone when comparing the Mali400MP4@400MHz vs. Tegra3 ULP GeForce@520MHz.

    I'm not clear on whether the ULP GF has 1 or 2 TMUs after all, but let's be generous and assume 2:
    ULP GeForce
    2 TMUs * 520MHz = 1040 MTexels/s
    Mali400MP4
    4 TMUs * 400MHz = 1600 MTexels/s

    PS ALUs:

    ULP GeForce
    2 Vec4 = 16 FLOPs/clock * 0.52GHz = 8.32 GFLOPs
    Mali400MP4
    4 Vec4 = 32 FLOPs/clock * 0.40GHz = 12.8 GFLOPs

    VS ALUs go into the ULP GF's favor and I don't have the slightest idea how many z/stencil units the Mali400 has but it still sounds like another sizeable advantage.
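The paper numbers above can be reproduced with a small sketch. The unit counts are the ones assumed in this post (including the generous 2 TMUs for the ULP GeForce), each ALU lane is counted as one multiply-add (2 FLOPs) per clock, and the function names are made up for illustration:

```python
def peak_mtexels(tmus, clock_mhz):
    # Assumes one texel per TMU per clock.
    return tmus * clock_mhz


def peak_ps_gflops(vec4_alus, clock_mhz):
    # Each Vec4 ALU has 4 lanes; each lane is counted as 1 MADD = 2 FLOPs per clock.
    flops_per_clock = vec4_alus * 4 * 2
    return flops_per_clock * clock_mhz / 1000.0


# Unit counts as assumed in the post, not confirmed vendor figures.
gpus = {
    "ULP GeForce (Tegra 3)": {"tmus": 2, "ps_vec4": 2, "clock_mhz": 520},
    "Mali400MP4": {"tmus": 4, "ps_vec4": 4, "clock_mhz": 400},
}

for name, g in gpus.items():
    texels = peak_mtexels(g["tmus"], g["clock_mhz"])
    gflops = peak_ps_gflops(g["ps_vec4"], g["clock_mhz"])
    print(f"{name}: {texels} MTexels/s, {gflops:.2f} PS GFLOPs")
```

On these assumptions both ratios come out at roughly 1.54x in the Mali's favour, on paper.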
     
  11. french toast

    Veteran

    Joined:
    Jan 5, 2012
    Messages:
    1,648
    Location:
    Leicestershire - England
    Spot on. So the Mali has twice the pixel shaders (4), but strangely only a single vertex shader? That seems very weak; you would have thought it would have affected it in certain scenarios/games, but the Mali 400 has been a mobile monster.

    I thought Tegra 3 was a '12 core beast' :wink:. Seriously though, as that obviously relates to VLIW4 P/V shaders (and not 'cores'), are you sure there are only 2 of them? Or have I read that wrong?

    (Unless for P/S it's 2*4 ALUs = 8, then for V/S it's 1*4, to make 12 'cores'? :???:)

    To be honest I don't understand the Mali architecture (not that I have a great deal of understanding of any architecture, mind! :grin:), but the Mali one is baffling. So it has 4 TMUs, which does seem a lot and warrants its 'quad core' status, but Tegra 3 has only 2 if we're generous. How many ROPs are included in that? And do 'rasterizers' fit into this equation?

    EDIT: Ha, I've just done a quick wiki and now know that a 'rasteriser' is a ROP (doh!), and also learned that a ROP/TMU/P/S usually go in tandem, thus answering my own question regarding Mali: 4 TMUs, 4 ROPs, 4 (VLIW4) pixel shaders, and only a single vertex shader. Phew!

    Now I've answered one of my questions, I need to add another one: what does 'MADs' refer to? Anand uses that term. At a quick think, 'multiply/add/divide'? But that's only 3 components of a VLIW4? lol.
     
    #311 french toast, Apr 5, 2012
    Last edited by a moderator: Apr 5, 2012
  12. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,208
    Location:
    Chania
    Mali GPU IP scales only fragment cores and unfortunately not vertex shaders; in other words, whether you have 1 or 4 fragment "cores" you will always have just one vertex shader.

    I think it's 4 Vec4 (FP16) PS ALUs + 1 Vec2 (FP32) VS ALU, but don't quote me on the VS ALU since my memory is weak on that one.

    Oh that's easy: just count each ALU lane as a core and you get twelve:

    2 Vec4 PS ALUs (2*4 = 8 "cores") + 1 Vec4 VS ALU (4 "cores") = 8+4 = 12 cores.

    Tegra1 and 2 ULP GeForces had 8 cores only = 1 Vec4 PS + 1 Vec4 VS ALUs.
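The "core" counting is simple enough to write down as a sketch. It assumes the third Vec4 ALU on Tegra 3 is the vertex shader ALU, and `marketing_cores` is a hypothetical name:

```python
LANES_PER_VEC4 = 4


def marketing_cores(ps_vec4_alus, vs_vec4_alus):
    # Every ALU lane is counted as one "core", pixel and vertex alike.
    return (ps_vec4_alus + vs_vec4_alus) * LANES_PER_VEC4


print("Tegra 3:  ", marketing_cores(ps_vec4_alus=2, vs_vec4_alus=1))
print("Tegra 1/2:", marketing_cores(ps_vec4_alus=1, vs_vec4_alus=1))
```

That gives 12 for Tegra 3 and 8 for Tegra 1 and 2, matching the advertised figures.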

    Mali has 1 TMU per fragment core; in other words, one TMU for each Vec4 PS ALU. MP4 = 4 fragment cores = 4 TMUs.

    Tegra GPUs should be 8 z/stencil, while if Mali400MP4 also scales z/stencil with each fragment core it could have 32 z/stencil. Rasterizers? Errr, one on each of the aforementioned, probably? No idea, to be honest.

    Note that blending is, at least on SGX and ULP GeForce, carried out in the ALUs (PS ALUs for ULP GF); I don't see why Mali would be different in that regard.

    A rasterizer is NOT a render output. The two sit at fairly different ends of a GPU.

    It's MADD actually, and it stands for multiply (MUL) + add (ADD) = MADD, i.e. two floating point operations. Each ALU lane (or stream processor in desktop marketing parlance) is capable of 1 MADD per clock, in other words 2 floating point operations or 2 FLOPs.

    Mali400MP4 has 4 Vec4 PS ALUs, in other words 16 SPs * 2 FLOPs * 0.4GHz = 12.8 GFLOPs.
     
  13. french toast

    Veteran

    Joined:
    Jan 5, 2012
    Messages:
    1,648
    Location:
    Leicestershire - England
    Thanks, I see. So the vertex shader (VS) is only 2 wide, but is FP32 (floating point), so that is obviously 2x FP16; hence why you described the VS as '4 cores/ALUs' instead of the 2 it would have been at FP16, like on the PS (pixel shader).
    Got that. :smile:
    Don't get that! lol. I know Mali/Adreno/ULP GeForce are IMRs (immediate mode renderers) with early z rejection; that's as far as I know. :???:
    Ha, I didn't look that up well then :oops:. Well, I know that ROPs scale linearly with TMUs & PS in non-unified shader designs, so Mali must have 4 ROPs? I haven't got a clue about rasterisers. :???:
    Cheers. (Just to be pedantic, you would have thought it would have been MULADD. :grin:)
    Right, so adding to that the VS, which if it is FP32 would be VS = 1*4 (FP32) = 4 ALUs/MADs: 4 * 2 FLOPs * 0.4GHz = 3.2 GFLOPS
    (3.2 + 12.8 = 16 GFLOPS?)

    Unless the vertex shader on both is FP16? That is what Anand suggests when I looked at his example, although he is taking a wild guess...
     
  14. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,208
    Location:
    Chania
    Nope, it's not 2*FP16; it's FP32 as you read it, and it would be a very idiotic idea, even for a small form factor, to handle vertex shading with less than FP32 precision.

    z/stencil fillrates have nothing to do with architectures per se. When you're counting desktop GPU ROP capabilities there are also z rates amongst others. ULP GF in T3 should be capable of 8 z/stencil per clock unless it has changed since T2. In other words, 8 * 520MHz = 4.16 GPixels/s z/stencil.
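As a sketch, the z/stencil rate is just units per clock times clock speed. The Mali line is purely speculative: it uses the "could have 32" guess from earlier in the thread, not a known figure, and the function name is invented for illustration:

```python
def zstencil_gpixels_per_s(units_per_clock, clock_mhz):
    # Peak z/stencil rate, assuming every unit retires one sample per clock.
    return units_per_clock * clock_mhz / 1000.0


# ULP GeForce in Tegra 3: 8 z/stencil per clock at 520MHz, as stated in the post.
print(f"ULP GeForce: {zstencil_gpixels_per_s(8, 520):.2f} GPixels/s")

# Speculative: only holds IF Mali400MP4 scales z/stencil to 32 per clock.
print(f"Mali400MP4 (if 32/clock): {zstencil_gpixels_per_s(32, 400):.2f} GPixels/s")
```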

    The majority of those GPUs don't have dedicated blending units, they're capable of programmable blending in the ALUs.

    No, ROPs don't scale linearly with TMUs and PS in any sort of desktop GPU. If anything, it's rather a memory controller affair. Radeons have ROPs decoupled from the memory controller (Tahiti for instance has 32 ROPs while on a 384bit bus), while on GeForces the number of ROPs scales with the bus width. On recent GeForce GPUs you have one ROP partition for each 64bit block, with 8 ROPs per partition (hence 256bit bus = 4*64bit = 4*8 ROPs = 32 ROPs, or 384bit = 6*64bit = 6*8 ROPs = 48 ROPs etc.).
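The GeForce rule described here (one ROP partition of 8 ROPs per 64bit block of bus width) can be sketched as follows. It only models the rule as stated in this post; `geforce_rops` is a hypothetical helper, not anything from a driver or vendor tool:

```python
def geforce_rops(bus_width_bits, partition_bits=64, rops_per_partition=8):
    # One ROP partition per 64bit memory block, 8 ROPs per partition.
    if bus_width_bits % partition_bits != 0:
        raise ValueError("bus width must be a multiple of the partition width")
    partitions = bus_width_bits // partition_bits
    return partitions * rops_per_partition


for bus in (256, 384):
    print(f"{bus}bit bus -> {geforce_rops(bus)} ROPs")
```

The Radeon case does not fit this function by design: Tahiti pairs 32 ROPs with a 384bit bus because its ROPs are decoupled from the memory controller.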

    Raster and trisetup units up to DX10 GPUs used to be one of each per GPU. With the advent of DX11/tessellation the amount of both raster and trisetup units started to scale; no idea if something like that is also necessary for a DX11 small form factor SoC GPU.

    Yep.

    I don't think Anand made such a mistake. I don't think there's even one small form factor GPU out there that doesn't have FP32 vertex shaders. The vast majority of those integrated GPUs have USC ALUs anyway, so there FP32 is a given. For fragment processing however, and for non-USC cores, it's a totally different story; Mali is FP16 and ULP GF should be FP24 (like in Tegra2).

    Vivante, Adreno, SGX have all unified shader cores.
     
  15. french toast

    Veteran

    Joined:
    Jan 5, 2012
    Messages:
    1,648
    Location:
    Leicestershire - England
    OK, here is what I read off Wiki:
    That suggests that they used to be equal; however, Wiki is not always accurate.
    Yeah, unified is the way forward. Just to clarify, this is what Anand wrote in his Galaxy S2 review:
    So he does seem to suggest FP16 for both PS/VS. This plays out in his projected Mali400 @ 400MHz in his table: http://www.anandtech.com/show/4686/samsung-galaxy-s-2-international-review-the-best-redefined/16

    So looking at that table, he calculates 18 MADs, which works out at 10.8 GFLOPS @ 300MHz, so 10.8/3 = 3.6 and 10.8 + 3.6 = 14.4 GFLOPS @ 400MHz.
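The same scaling can be checked directly from the MAD count, as a minimal sketch that takes the 18 MADs figure from the table at face value and counts each MAD as 2 FLOPs per clock (`gflops_from_mads` is a made-up name):

```python
def gflops_from_mads(mads, clock_mhz):
    # Each MAD (multiply-add) counts as 2 FLOPs per clock.
    return mads * 2 * clock_mhz / 1000.0


for clock in (300, 400):
    print(f"18 MADs @ {clock}MHz = {gflops_from_mads(18, clock):.1f} GFLOPS")
```

This agrees with the shortcut of adding a third to the 300MHz number.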
     
  16. Nebuchadnezzar

    Legend

    Joined:
    Feb 10, 2002
    Messages:
    867
    Location:
    Luxembourg
    I don't get what you mean by that, please explain. If you mean that it's possibly higher than 400MHz, then yes, maybe.


    I ran some CPU-relative benches again for comparison; I wanted to see how CPU bound GLBenchmark is:

    Code:
    Exynos 4210, Mali400 @ 400MHz
    
                        Egypt           Pro
    1600MHz Dual    8391 / 74fps    5725 / 114fps
    1600MHz Single  8262 / 73fps    5467 / 111fps

    1400MHz Dual    8303 / 74fps    5774 / 116fps
    1400MHz Single  8376 / 74fps    5794 / 116fps

    1200MHz Dual    8342 / 74fps    5760 / 116fps
    1200MHz Single  8394 / 74fps    5581 / 112fps

    1000MHz Dual    8209 / 73fps    5690 / 114fps
    1000MHz Single  8363 / 74fps    5705 / 114fps

    800MHz Dual     8130 / 72fps    5536 / 111fps
    800MHz Single   8218 / 73fps    5755 / 115fps
    Conclusion is that it's as CPU bound as a lame duck. Only under/at 500MHz does CPU freq make any difference. Makes even less sense for those i9300 results. I'm looking through the driver diffs now to see if there really is some kind of magic, but I doubt it. There must be more to it.
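One quick way to put a number on that, as a rough sketch rather than a proper statistical test: compare how far the scores move against how far the CPU clock moves. The data is copied from the table above; `relative_spread` is a hypothetical helper:

```python
# (cpu_mhz, cores, egypt_score, pro_score), copied from the table above.
RUNS = [
    (1600, 2, 8391, 5725),
    (1600, 1, 8262, 5467),
    (1400, 2, 8303, 5774),
    (1400, 1, 8376, 5794),
    (1200, 2, 8342, 5760),
    (1200, 1, 8394, 5581),
    (1000, 2, 8209, 5690),
    (1000, 1, 8363, 5705),
    (800, 2, 8130, 5536),
    (800, 1, 8218, 5755),
]


def relative_spread(scores):
    """(max - min) / mean: a crude measure of how much the scores move."""
    scores = list(scores)
    mean = sum(scores) / len(scores)
    return (max(scores) - min(scores)) / mean


egypt_spread = relative_spread(run[2] for run in RUNS)
pro_spread = relative_spread(run[3] for run in RUNS)
cpu_range = max(run[0] for run in RUNS) / min(run[0] for run in RUNS)

print(f"CPU clock range: {cpu_range:.1f}x")
print(f"Egypt spread: {egypt_spread:.1%}, Pro spread: {pro_spread:.1%}")
```

The CPU clock doubles across the table while the scores stay within roughly 3% (Egypt) and 6% (Pro) of each other, and the fastest results are not even at the highest clocks.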

    Edit: There we have it!
    Code:
    mali_dvfs_table mali_dvfs_all[MAX_MALI_DVFS_STEPS]={
    	{160   ,1000000   ,  875000},
    	{266   ,1000000   ,  900000},
    	{350   ,1000000   ,  950000},
    	{440   ,1000000   , 1025000} };
    From \kernel\drivers\media\video\samsung\mali\platform\pegasus-m400\mali_platform_dvfs.c
    So the 4412 is running at 440MHz or higher, if they haven't upped it even more since the source drop, which would certainly explain the benchmarks.
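For reference, the same table transcribed into Python to pull out the top step. This is a sketch only: reading the first column as the GPU clock in MHz follows the interpretation in this post, while treating the third column as a voltage in microvolts is my assumption, and the meaning of the second column is left open:

```python
# Transcribed from mali_dvfs_all in mali_platform_dvfs.c as quoted above.
# Column 0: clock in MHz (as read in the post).
# Column 2: assumed to be a voltage in microvolts; column 1 is not interpreted.
MALI_DVFS_ALL = [
    (160, 1_000_000, 875_000),
    (266, 1_000_000, 900_000),
    (350, 1_000_000, 950_000),
    (440, 1_000_000, 1_025_000),
]

top_clock_mhz = max(step[0] for step in MALI_DVFS_ALL)
uplift_vs_400 = top_clock_mhz / 400 - 1

print(f"Top DVFS step: {top_clock_mhz}MHz ({uplift_vs_400:.0%} over 400MHz)")
```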
     
    #316 Nebuchadnezzar, Apr 6, 2012
    Last edited by a moderator: Apr 6, 2012
  17. french toast

    Veteran

    Joined:
    Jan 5, 2012
    Messages:
    1,648
    Location:
    Leicestershire - England
    Genius! ;)
     
  18. Wellington

    Newcomer

    Joined:
    Apr 5, 2012
    Messages:
    10
    Pro is not the best benchmark; it's probably hitting other system limitations before the GPU or CPU. Most likely bandwidth, as has already been suggested.

    Which drivers are these from? I didn't realize ARM released all the driver source.
     
  19. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,208
    Location:
    Chania
    Well, 350MHz for the former results fits rather well with my initial estimate: http://forum.beyond3d.com/showpost.php?p=1632652&postcount=282

    Granted, I never expected 100% linear scaling, but those early results smelled suspiciously like <400MHz and the newer GalaxySIII results sounded a tad too high for "just" 400MHz.

    So you probably found the missing pieces of the puzzle with the above kernel entries. Those initial 7k points fit rather well with 350MHz and the 10k points equally well with 440MHz.
     
  20. ltcommander.data

    Regular

    Joined:
    Apr 4, 2010
    Messages:
    613
    http://www.nvidia.com/content/PDF/t...ing_High-End_Graphics_to_Handheld_Devices.pdf

    Tegra 2 PS are actually FP20 (bottom of page 7 in the above white paper). No idea about Tegra 3, but seeing it is mainly an expansion of Tegra 2 rather than a redesign, it's probably still at FP20. Which is why I've been curious how they meet the DX9 compliance necessary for the Windows 8 support they've been demoing.
     
