Samsung Orion SoC - dual-core A9 + "5 times the 3D graphics performance"

Either way it's going to be the fastest Android smartphone of its generation... what does everyone think those quad A9s are going to be clocked at?
According to those scores it is clocked at 1400MHz, so it should offer roughly similar performance to dual-core Kraits.

Also, how come the offscreen score is twice the frame rate of the standard on-screen test when the device runs at 720p? Is it a different test?
Probably because offscreen scores aren't limited to 60fps, so they can show full performance.
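
To make the 60fps point concrete, here is a minimal sketch, assuming the on-screen run is simply capped at the display's refresh rate. The function names and the 60Hz default are illustrative only, not anything GLBenchmark exposes:

```python
def onscreen_fps(raw_fps: float, refresh_hz: float = 60.0) -> float:
    """Frame rate reported when frames are presented in sync with the display."""
    # The GPU may be able to render faster, but the display only takes
    # refresh_hz frames per second, so the reported figure is capped.
    return min(raw_fps, refresh_hz)


def offscreen_fps(raw_fps: float) -> float:
    """Offscreen rendering never waits for the display, so nothing is capped."""
    return raw_fps


raw = 92.0  # Egypt offscreen figure quoted later in this thread
print(f"on-screen: {onscreen_fps(raw):.0f}fps, offscreen: {offscreen_fps(raw):.0f}fps")
```

Real on-screen results usually land a little under the cap, so treat this as the idealised case.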
 
A 35% jump in Egypt offscreen compared to the magiclego4212, but only 15% up in Pro offscreen; seems hard to rationalise.
Magiclego was also showing a clock of 1.4GHz.

Also, how come the offscreen score is twice the frame rate of the standard on-screen test when the device runs at 720p? Is it a different test?

Because it is offscreen and not frame rate limited of course.
 
Cheers... I should have realised that.

No, I meant: what do you think the shipping clock speed of the GS3 will be? Rumours range from 1.5 to 1.8GHz.
 
Pro is probably CPU bound, so won't see a huge increase with GPU uplift.

Dunno.

Where the 1GHz iPad 2 is 3% slower in Egypt, it is actually 16% quicker in Pro. One would assume a CPU-bound benchmark would show a difference in favour of the 1.4GHz S3, but it is the opposite.

Additionally, iPad 2 -> iPad 3 showed a 60% increase in Pro (and around a 56% increase in Egypt), with no change in CPU speed. Although there are major changes to the memory bandwidth on the iPad 3, if Anandtech is to be believed these are really only exposed to the GPU.
 
My guess is that the Pro test really is bandwidth-limited.
That test was already available in GLBenchmark 1.1, which tests only OpenGL ES 1.1 functionality.
That said, I guess what limits that test may be memory bandwidth alone, since everything else has pretty much skyrocketed since 2006.
 
The bandwidth should get a bit of a pop with TSV approaches and Wide I/O mobile DRAM becoming standard within a few years. LPDDR3 should fit in there sometime, too.
 
That's a huge jump on the Egypt score even from my overclocked device... 74fps to 92fps. Either the quadcore brings a lot, the GPU is running faster than 400MHz, or there's some new driver magic going on. The strange thing is that the Pro scores aren't that off from what I'm getting.

Or the other former devices simply had the GPU clocked lower. In any case I expected a Mali400MP4 to be quite a bit ahead even of Tegra3 tablet SoCs. Too bad for Samsung they couldn't ship that MP4 at 400MHz under 45nm.
 
The bandwidth should get a bit of a pop with TSV approaches and Wide I/O mobile DRAM becoming standard within a few years. LPDDR3 should fit in there sometime, too.

Could well be bandwidth. The One X scores show a closer ratio between Egypt and Pro, and Tegra 3 has worse bandwidth, so if it is bandwidth, that is what you would expect to happen.
http://www.glbenchmark.com/phonedetails.jsp

Is it LPDDR2-800 that gets that 6.4GB/s? Is there a chance we could see an increase with LPDDR2-1066?
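
For what it's worth, the 6.4GB/s figure is consistent with a simple peak-bandwidth calculation. A rough sketch, assuming a 64-bit total interface (two 32-bit channels); that bus width is an assumption on my part, not something confirmed in this thread:

```python
def peak_bandwidth_gb_s(transfer_rate_mt_s: float, bus_width_bits: int) -> float:
    """Theoretical peak bandwidth in GB/s: transfers per second * bytes per transfer."""
    bytes_per_transfer = bus_width_bits / 8
    return transfer_rate_mt_s * 1e6 * bytes_per_transfer / 1e9


BUS_WIDTH_BITS = 64  # ASSUMPTION: dual-channel, 32 bits per channel

for rate in (800, 1066):
    print(f"LPDDR2-{rate}: {peak_bandwidth_gb_s(rate, BUS_WIDTH_BITS):.2f} GB/s peak")
```

Under that assumption LPDDR2-1066 would lift the theoretical peak by about a third; how much of that a benchmark actually sees is a separate question.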

EDIT: Actually, on second look there isn't a closer ratio... maybe, but you would expect one to be bigger considering the lower bandwidth of Tegra 3.

EDIT 2: If this is anything to go by, we are looking at a redesign for the GS3... a clever move by Sammy to persuade people to hold off buying the HTC One X:
article-1333638307544-127B3D2D000005DC-6849_466x285.jpg
 
It's not bandwidth alone when comparing the Mali400MP4@400MHz vs. the Tegra3 ULP GeForce@520MHz.

I'm not clear on whether the ULP GF has 1 or 2 TMUs after all, but let's be generous and assume 2:
ULP GeForce
2 TMUs * 520MHz = 1040 MTexels/s
Mali400MP4
4 TMUs * 400MHz = 1600 MTexels/s

PS ALUs:

ULP GeForce
2 Vec4 = 16 FLOPs * 0.52GHz = 8.32 GFLOPs
Mali400MP4
4 Vec4 = 32 FLOPs * 0.40GHz = 12.8 GFLOPs

VS ALUs go into the ULP GF's favor and I don't have the slightest idea how many z/stencil units the Mali400 has but it still sounds like another sizeable advantage.
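
The arithmetic above can be reproduced with a short sketch. The unit counts are the ones assumed in this post (2 TMUs for the ULP GeForce is the generous guess, not a confirmed spec), and the function names are just illustrative:

```python
def texel_rate_mtexels(tmus: int, clock_mhz: float) -> float:
    """Peak texture fill rate: one texel per TMU per clock."""
    return tmus * clock_mhz


def ps_gflops(vec4_alus: int, clock_mhz: float) -> float:
    """Peak pixel-shader GFLOPs: 4 lanes per Vec4 ALU, 2 FLOPs (one MADD) per lane per clock."""
    flops_per_clock = vec4_alus * 4 * 2
    return flops_per_clock * clock_mhz / 1000.0


# (name, TMUs, Vec4 PS ALUs, clock in MHz), as assumed in the post
gpus = [
    ("ULP GeForce (Tegra 3)", 2, 2, 520),
    ("Mali400MP4", 4, 4, 400),
]

for name, tmus, alus, mhz in gpus:
    print(f"{name}: {texel_rate_mtexels(tmus, mhz):.0f} MTexels/s, "
          f"{ps_gflops(alus, mhz):.2f} GFLOPs (PS only)")
```

These are theoretical peaks only; they say nothing about how well either GPU keeps its units fed.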
 
Spot on. So the Mali has twice the pixel shaders (4), but strangely only a single vertex shader? That seems very weak; you would have thought it would hurt in certain scenarios/games, but the Mali 400 has been a mobile monster.

I thought Tegra 3 was a '12 core beast' ;) Seriously though, as that obviously relates to VLIW4 P/V shaders (and not 'cores'), are you sure there are only 2 of them? Or have I read that wrong?

(Unless for PS it's 2*4 ALUs = 8, then for VS it's 1*4, to make 12 'cores'? :???:)

To be honest I don't understand the Mali architecture (not that I have a great deal of understanding of any architecture, mind! :D), but the Mali one is baffling. So it has 4 TMUs, which does seem a lot and warrants its 'quad core' status, but Tegra 3 has only 2 if we're generous. How many ROPs are included in that? And do 'rasterizers' fit into this equation?

EDIT: Ha, I've just done a quick wiki and now know that a 'rasteriser' is a ROP (doh!), and also learned that a ROP/TMU/PS usually go in tandem, thus answering my own question regarding Mali: 4 TMUs, 4 ROPs, 4 (VLIW4) pixel shaders, and only a single vertex shader. Phew!

Now I've answered one of my questions, I need to add another one: what does 'MADs' refer to? Anand uses that term. A quick think: multiply/add/divide? But that's only 3 components of a VLIW4? lol.
 
Spot on. So the Mali has twice the pixel shaders (4), but strangely only a single vertex shader? That seems very weak; you would have thought it would hurt in certain scenarios/games, but the Mali 400 has been a mobile monster.

Mali GPU IP scales only the fragment cores and unfortunately not the vertex shaders; whether you have 1 or 4 fragment "cores", you will always have just one vertex shader.

I think it's 4 Vec4 (FP16) PS ALUs + 1 Vec2 (FP32) VS ALU, but don't quote me on the VS ALU since my memory is weak on that one.

I thought Tegra 3 was a '12 core beast' ;) Seriously though, as that obviously relates to VLIW4 P/V shaders (and not 'cores'), are you sure there are only 2 of them? Or have I read that wrong?

(Unless for PS it's 2*4 ALUs = 8, then for VS it's 1*4, to make 12 'cores'? :???:)
Oh that's easy: just count each ALU lane as a core and you get twelve:

2 Vec4 PS ALUs (2*4 = 8 "cores") + 1 Vec4 VS ALU (4 "cores") = 8+4 = 12 cores.

Tegra1 and 2 ULP GeForces had 8 cores only = 1 Vec4 PS + 1 Vec4 VS ALUs.
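
The "core" counting is simple enough to write down. A small sketch of the lane-counting described here (the function name is made up for illustration):

```python
def marketing_cores(ps_vec4_alus: int, vs_vec4_alus: int, lanes_per_alu: int = 4) -> int:
    """Count every ALU lane as one 'core', pixel and vertex shaders alike."""
    return (ps_vec4_alus + vs_vec4_alus) * lanes_per_alu


print("Tegra 3 ULP GeForce:", marketing_cores(ps_vec4_alus=2, vs_vec4_alus=1))    # 2 PS + 1 VS Vec4 ALUs
print("Tegra 1/2 ULP GeForce:", marketing_cores(ps_vec4_alus=1, vs_vec4_alus=1))  # 1 PS + 1 VS Vec4 ALUs
```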

To be honest I don't understand the Mali architecture (not that I have a great deal of understanding of any architecture, mind! :D), but the Mali one is baffling. So it has 4 TMUs, which does seem a lot and warrants its 'quad core' status, but Tegra 3 has only 2 if we're generous. How many ROPs are included in that? And do 'rasterizers' fit into this equation?
Mali has 1 TMU per fragment core, i.e. one TMU for each Vec4 PS ALU. MP4 = 4 fragment cores = 4 TMUs.

Tegra GPUs should be 8 z/stencil, while if Mali400MP4 also scales z/stencil with each fragment core it could have 32 z/stencil. Rasterizers? Errr, one on each of the aforementioned, probably? No idea to be honest.

Note that, at least on SGX and ULP GeForce, blending is carried out in the ALUs (PS ALUs for ULP GF); I don't see why Mali would be different in that regard.

EDIT: Ha, I've just done a quick wiki and now know that a 'rasteriser' is a ROP (doh!), and also learned that a ROP/TMU/PS usually go in tandem, thus answering my own question regarding Mali: 4 TMUs, 4 ROPs, 4 (VLIW4) pixel shaders, and only a single vertex shader. Phew!
A rasterizer is NOT a render output unit. The two sit at pretty much opposite ends of a GPU.

Now I've answered one of my questions, I need to add another one: what does 'MADs' refer to? Anand uses that term. A quick think: multiply/add/divide? But that's only 3 components of a VLIW4? lol.
It's MADD actually: multiply (MUL) + add (ADD) = MADD, which counts as two floating point operations. Each ALU lane (or "stream processor" in desktop marketing parlance) is capable of 1 MADD per clock, i.e. 2 FLOPs.

Mali400MP4 has 4 Vec4 PS ALUs, i.e. 16 SPs * 2 FLOPs * 0.4GHz = 12.8 GFLOPs.
 
I think it's 4 Vec4 (FP16) PS ALUs + 1 Vec2 (FP32) VS ALU, but don't quote me on the VS ALU since my memory is weak on that one.

Oh that's easy: just count each ALU lane as a core and you get twelve:

2 Vec4 PS ALUs (2*4 = 8 "cores") + 1 Vec4 VS ALU (4 "cores") = 8+4 = 12 cores.
Thanks, I see. So the vertex shader (VS) is only 2 wide, but it is FP32 (floating point), so that is obviously 2x FP16... hence why you described the VS as '4 cores/ALUs' instead of the 2 it would have been at FP16, like on the PS (pixel shader).
Mali has 1 TMU per fragment core, i.e. one TMU for each Vec4 PS ALU. MP4 = 4 fragment cores = 4 TMUs.
Got that. :smile:
Tegra GPUs should be 8 z/stencil, while if Mali400MP4 also scales z/stencil with each fragment core it could have 32 z/stencil.
Don't get that! lol... I know Mali/Adreno/ULP GeForce are IMRs (immediate mode renderers) with early z rejection... that's as far as I know. :???:
A rasterizer is NOT a render output unit. The two sit at pretty much opposite ends of a GPU.
Ha, I didn't look that up well then :oops:. Well, I know that ROPs scale linearly with TMUs & PS in non-unified shader designs, so Mali must have 4 ROPs? Haven't got a clue about rasterisers. :???:
It's MADD actually and stands for multiply (MUL) + add (ADD) = MADD
Cheers (just to be pedantic, you would have thought it would have been MULADD :D).
Mali400MP4 has 4 Vec4 PS ALUs, i.e. 16 SPs * 2 FLOPs * 0.4GHz = 12.8 GFLOPs.
Right, so adding to that the VS, which if it is FP32 would be 1*4 (FP32) = 4 ALUs/MADs: 4 * 2 FLOPs * 0.4GHz = 3.2 GFLOPs
(3.2 + 12.8 = 16 GFLOPs?)

Unless the vertex shaders on both are FP16? Which Anand suggests when I looked at his example, although he is taking a wild guess...
 
Thanks, I see. So the vertex shader (VS) is only 2 wide, but it is FP32 (floating point), so that is obviously 2x FP16... hence why you described the VS as '4 cores/ALUs' instead of the 2 it would have been at FP16, like on the PS (pixel shader).

Nope, it's not 2*FP16; it's FP32 as you read it, and it would be a very idiotic idea, even for a small form factor, to handle vertex shading with less than FP32 precision.

Don't get that! lol... I know Mali/Adreno/ULP GeForce are IMRs (immediate mode renderers) with early z rejection... that's as far as I know. :???:
z/stencil fillrates have nothing to do with architectures per se. When you count a desktop GPU's ROP capabilities, z rates are among them too. The ULP GF in T3 should be capable of 8 z/stencil per clock unless it has changed since T2, i.e. 8 * 520MHz = 4.16 GPixels/s z/stencil.
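
As a quick check of that figure, here is a minimal sketch. The 8 per clock for the ULP GF is the "should be" number from this post, and the 32 for Mali is only the speculative "if it scales per fragment core" case raised earlier in the thread, not a known spec:

```python
def z_stencil_rate_gpixels(units_per_clock: int, clock_mhz: float) -> float:
    """Peak z/stencil rate in GPixels/s: units per clock * clock."""
    return units_per_clock * clock_mhz / 1000.0


print(f"ULP GeForce @ 520MHz: {z_stencil_rate_gpixels(8, 520):.2f} GPixels/s")
# SPECULATIVE: only holds if Mali400MP4 really has 8 z/stencil per fragment core
print(f"Mali400MP4 @ 400MHz (if 32/clock): {z_stencil_rate_gpixels(32, 400):.2f} GPixels/s")
```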

The majority of those GPUs don't have dedicated blending units, they're capable of programmable blending in the ALUs.

Ha, I didn't look that up well then :oops:. Well, I know that ROPs scale linearly with TMUs & PS in non-unified shader designs, so Mali must have 4 ROPs? Haven't got a clue about rasterisers. :???:
No, ROPs don't scale linearly with TMUs and PS in any sort of desktop GPU. If anything, it's rather a memory controller affair. Radeons have ROPs decoupled from the memory controller (Tahiti for instance has 32 ROPs on a 384bit bus), while on GeForces the number of ROPs scales with the bus width. On recent GeForce GPUs you have one ROP partition for each 64bit block, with 8 ROPs per partition (hence 256bit bus = 4*64bit = 4*8 ROPs = 32 ROPs, or 384bit = 6*64bit = 6*8 ROPs = 48 ROPs etc.).
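
The GeForce rule of thumb described here fits in a few lines. A sketch with the 64bit partition width and 8 ROPs per partition taken straight from the post (the function name is illustrative):

```python
def geforce_rops(bus_width_bits: int, partition_bits: int = 64, rops_per_partition: int = 8) -> int:
    """ROP count when each 64bit memory partition carries its own block of ROPs."""
    if bus_width_bits % partition_bits != 0:
        raise ValueError("bus width must be a whole number of partitions")
    return (bus_width_bits // partition_bits) * rops_per_partition


for bus in (256, 384):
    print(f"{bus}bit bus -> {geforce_rops(bus)} ROPs")
```

This only models the coupled GeForce case; as noted, Radeons such as Tahiti decouple the ROP count from the bus width.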

Raster and trisetup units up to DX10 GPUs used to be one of each per GPU. With the advent of DX11/tessellation the amount of both raster and trisetup units started to scale; no idea if something like that is also necessary for a DX11 small form factor SoC GPU.

Right, so adding to that the VS, which if it is FP32 would be 1*4 (FP32) = 4 ALUs/MADs: 4 * 2 FLOPs * 0.4GHz = 3.2 GFLOPs
(3.2 + 12.8 = 16 GFLOPs?)
Yep.

Unless the vertex shaders on both are FP16? Which Anand suggests when I looked at his example, although he is taking a wild guess...
I don't think Anand made such a mistake. I don't think there's even one small form factor GPU out there that doesn't have FP32 vertex shaders. The vast majority of those integrated GPUs have USC ALUs anyway, so there FP32 is a given. For fragment processing and non-USC cores, however, it's a totally different story; Mali is FP16 and ULP GF should be FP24 (like in Tegra2).

Vivante, Adreno and SGX all have unified shader cores.
 
No ROPs don't scale linearly with TMUs and PS in any sort of desktop GPUs. If then it's rather a memory controller affair. Radeons have ROPs decoupled from the memory controller (Tahiti for instance has 32 ROPs while on a 384bit bus), while on GeForces the amount of ROPs scales with the buswidth. On recent GeForce GPUs you have for each 64bit block one ROP partition with 8 ROPs per partition (hence 256bit bus = 4*64bit = 4*8 ROPs = 32 ROPs, or 384bit = 6*64bit = 6*8 ROPs = 48 ROPs etc.).

Raster and trisetup units up to DX10 GPUs used to be one of each per GPU. With the advent of DX11/tessellation the amount of both raster and trisetup units started to scale; no idea if something like that is also necessary for a DX11 small form factor SoC GPU.
OK, here is what I read on Wikipedia:
''Historically the number of ROPs, TMUs, and pixel shaders have been equal. However, as of 2004, several GPUs have decoupled these areas to allow optimum transistor allocation for application workload and available memory performance''
That suggests that they used to be equal; however, Wikipedia is not always accurate.
I don't think Anand made such a mistake. I don't think there's even one small form factor GPU out there that doesn't have FP32 vertex shaders. The vast majority of those integrated GPUs have USC ALUs anyway, so there FP32 is a given. For fragment processing and non-USC cores, however, it's a totally different story; Mali is FP16 and ULP GF should be FP24 (like in Tegra2).

Vivante, Adreno and SGX all have unified shader cores.
Yeah, unified is the way forward. Just to clarify, this is what Anand wrote in his Galaxy S2 review:
Based on this as well as some internal information we can assume that a single Mali fragment shader is a 4-wide VLIW processor. The vertex shader is a big unknown as well, but knowing that vertex processing happens on two coordinate elements (U & V) Mali's vertex shader is likely a 2-wide unit.
Thus far every architecture we've looked at has been able to process one FP16 MAD (multiply+add) per execution unit per clock. If we make another assumption about the Mali-400 and say it can do the same, we get the following table:
So he does seem to suggest FP16 for both PS and VS. This plays out in his projected Mali400 figures in his table: http://www.anandtech.com/show/4686/samsung-galaxy-s-2-international-review-the-best-redefined/16

So looking at that table, he calculates 18 MADs, which works out at 10.8 GFLOPS @ 300MHz, so 10.8/3 = 3.6 and 10.8 + 3.6 = 14.4 GFLOPS @ 400MHz.
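
That scaling can be sanity-checked with a small sketch. It assumes, as the quoted review does, one MAD (2 FLOPs) per execution unit per clock; the 18 MADs and the 300MHz baseline are the figures cited from the table, and the function names are illustrative:

```python
def gflops_from_mads(mads_per_clock: int, clock_mhz: float) -> float:
    """Peak GFLOPs given MADs per clock, counting each MAD as 2 FLOPs."""
    return mads_per_clock * 2 * clock_mhz / 1000.0


def scale_by_clock(gflops: float, old_mhz: float, new_mhz: float) -> float:
    """Peak throughput scales linearly with clock when the unit count is unchanged."""
    return gflops * new_mhz / old_mhz


at_300 = gflops_from_mads(18, 300)
at_400 = scale_by_clock(at_300, 300, 400)
print(f"{at_300:.1f} GFLOPS @ 300MHz -> {at_400:.1f} GFLOPS @ 400MHz")
```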
 
Or the other former devices simply had the GPU clocked lower.
I don't get what you mean by that, please explain. If you mean that it's possibly higher than 400MHz, then yes, maybe.


I ran some CPU-relative benches again for comparison; I wanted to see how CPU-bound GLBenchmark is:

Code:
Exynos 4210, Mali400 @ 400MHz (score / fps)

                    Egypt           Pro
1600MHz Dual    8391 / 74fps    5725 / 114fps
1600MHz Single  8262 / 73fps    5467 / 111fps

1400MHz Dual    8303 / 74fps    5774 / 116fps
1400MHz Single  8376 / 74fps    5794 / 116fps

1200MHz Dual    8342 / 74fps    5760 / 116fps
1200MHz Single  8394 / 74fps    5581 / 112fps

1000MHz Dual    8209 / 73fps    5690 / 114fps
1000MHz Single  8363 / 74fps    5705 / 114fps

800MHz Dual     8130 / 72fps    5536 / 111fps
800MHz Single   8218 / 73fps    5755 / 115fps

Conclusion is that it's as CPU bound as a lame duck. Only under/at 500MHz does CPU freq make any difference. Makes even less sense for those i9300 results. I'm looking through the driver diffs now to see if there really is some kind of magic, but I doubt it. There must be more to it.
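
To put a number on "not CPU bound", here is a quick sketch that measures the spread of the scores in the table above across all CPU settings. The data are copied from the table; the `spread` helper is just an illustrative name:

```python
# (CPU MHz, cores, Egypt score, Pro score), copied from the table above
results = [
    (1600, "Dual", 8391, 5725), (1600, "Single", 8262, 5467),
    (1400, "Dual", 8303, 5774), (1400, "Single", 8376, 5794),
    (1200, "Dual", 8342, 5760), (1200, "Single", 8394, 5581),
    (1000, "Dual", 8209, 5690), (1000, "Single", 8363, 5705),
    (800, "Dual", 8130, 5536), (800, "Single", 8218, 5755),
]


def spread(values):
    """Relative gap between the best and worst score."""
    lo, hi = min(values), max(values)
    return (hi - lo) / lo


egypt = [row[2] for row in results]
pro = [row[3] for row in results]

print(f"Egypt: {min(egypt)}..{max(egypt)} ({spread(egypt):.1%} spread)")
print(f"Pro:   {min(pro)}..{max(pro)} ({spread(pro):.1%} spread)")
```

The CPU clock doubles from 800MHz to 1600MHz while the scores move by only a few percent, with no consistent ordering by clock or core count, which looks more like run-to-run noise than CPU scaling.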

Edit: There we have it!
Code:
mali_dvfs_table mali_dvfs_all[MAX_MALI_DVFS_STEPS]={
	{160   ,1000000   ,  875000},
	{266   ,1000000   ,  900000},
	{350   ,1000000   ,  950000},
	{440   ,1000000   , 1025000} };
From \kernel\drivers\media\video\samsung\mali\platform\pegasus-m400\mali_platform_dvfs.c
So the 4412 is running at at least 440MHz, if they haven't upped it even more since the source drop, and that would certainly explain the benchmarks.
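
For anyone wanting to poke at the table outside the kernel tree, here is a hedged Python transcription. The column meanings (clock in MHz, a frequency unit multiplier, voltage in microvolts) are my reading of the values and are an assumption; the snippet above does not show the struct definition:

```python
# Rows copied from mali_dvfs_all above.
# ASSUMED columns: (GPU clock in MHz, frequency unit multiplier, voltage in microvolts)
MALI_DVFS_STEPS = [
    (160, 1000000, 875000),
    (266, 1000000, 900000),
    (350, 1000000, 950000),
    (440, 1000000, 1025000),
]


def top_step(table):
    """Return the DVFS step with the highest GPU clock."""
    return max(table, key=lambda step: step[0])


clock_mhz, _unit, microvolts = top_step(MALI_DVFS_STEPS)
print(f"Highest step: {clock_mhz}MHz at {microvolts / 1e6:.3f}V (if the last column is microvolts)")
```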
 
Genius! ;)
 
Pro is not the best benchmark; it's probably hitting other system limitations before the GPU or CPU. Most likely bandwidth, as has already been suggested.

Which drivers are these from? I didn't realize ARM released all the driver source.
 
I don't get what you mean by that, please explain. If you mean that it's possibly higher than 400MHz, then yes, maybe.


I ran some CPU-relative benches again for comparison, I wanted to see how much CPU bound GLBenchmark is:

Edit: There we have it!
Code:
mali_dvfs_table mali_dvfs_all[MAX_MALI_DVFS_STEPS]={
    {160   ,1000000   ,  875000},
    {266   ,1000000   ,  900000},
    {350   ,1000000   ,  950000},
    {440   ,1000000   , 1025000} };
From \kernel\drivers\media\video\samsung\mali\platform\pegasus-m400\mali_platform_dvfs.c
So the 4412 is running at at least 440MHz, if they haven't upped it even more since the source drop, and certainly would explain the benchmarks.

Well, 350MHz for the former results fits rather well with my initial estimate: http://forum.beyond3d.com/showpost.php?p=1632652&postcount=282

Granted, I never expected 100% linear scaling, but those early results smelled suspiciously like <400MHz and the newer Galaxy S III results sounded a tad too high for "just" 400MHz.

So you probably found the missing pieces of the puzzle with the above kernel entries. Those initial 7k points fit rather well with 350MHz and the 10k points equally well with 440MHz.
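
A rough way to eyeball that fit is Egypt points per MHz of GPU clock. This sketch uses the rounded 7k and 10k scores mentioned here plus the 8391 from the Exynos 4210 run at 400MHz earlier in the thread; pairing 7k with 350MHz and 10k with 440MHz is the inference being discussed, not a confirmed clock:

```python
def points_per_mhz(score: float, gpu_mhz: float) -> float:
    """Benchmark points delivered per MHz of GPU clock."""
    return score / gpu_mhz


# (label, Egypt score, GPU MHz); the 350 and 440 pairings are INFERRED, not confirmed
samples = [
    ("early S3 leak (approx. 7k)", 7000, 350),
    ("Exynos 4210, 1600MHz dual", 8391, 400),
    ("newer S3 result (approx. 10k)", 10000, 440),
]

for label, score, mhz in samples:
    print(f"{label}: {points_per_mhz(score, mhz):.1f} points/MHz")
```

The ratio stays in the same ballpark across all three, which is consistent with GPU clock being the main lever; the small upward drift would have to come from elsewhere (quad core, drivers, memory), which the thread does not settle.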
 
I don't think Anand made such a mistake. I don't think there's even one small form factor GPU out there that doesn't have FP32 vertex shaders. The vast majority of those integrated GPUs have USC ALUs anyway, so there FP32 is a given. For fragment processing and non-USC cores, however, it's a totally different story; Mali is FP16 and ULP GF should be FP24 (like in Tegra2).
http://www.nvidia.com/content/PDF/t...ing_High-End_Graphics_to_Handheld_Devices.pdf

Tegra 2 PS are actually FP20 (bottom of page 7 in the above white paper). No idea about Tegra 3, but seeing it is mainly an expansion of Tegra 2 rather than a redesign, it's probably still at FP20. Which is why I've been curious how they meet the DX9 compliance necessary for the Windows 8 support they've been demoing.
 