Comparing embedded GPUs in Mobiles to Laptops

Nano

Hi all,

I have a bit of a question about the recent hand-held GPUs that have been a huge feature in smart-phones, from the PowerVR SGX to the Tegra and Adreno.

I have a very good idea about the architectural differences that come into play, but from a performance perspective, how would we compare their computing power and fill-rates to popular integrated chipsets found in laptops, e.g. ATI Mobility Radeon HD 3200, Intel GMA 4500, nVIDIA ION?

How does a Tegra 2, PowerVR SGX 540 or 543 MP2 compare?

Given the power and heat constraints, and the clever, efficient designs that have evolved so much over the years, it would be very interesting to see where mobile phone GPUs stand today relative to their integrated laptop equivalents.


Thanks,

Nano

http://www.nvidia.com/object/tegra-2.html

http://www.imgtec.com/PowerVR/sgx.asp
 
Their raw theoretical performance is a complete joke compared to even the lowliest of current IGPs. But at least it's improving VERY quickly, especially in terms of flops. The effective ALU-TMU ratio will soon exceed that of many PC parts - memory bandwidth is scarce on handhelds and some rather hilariously lame reasons (not sure I can get into those?) mean Anisotropic Filtering often isn't even available.

Just to give you an idea... Tegra2 and Mali-400MP1 have 1 TMU, SGX535/SGX540 in the Apple A4 or Samsung Hummingbird have 2 TMUs, Mali-400MP4 in Exynos and SGX543MP2 in the Apple A5 have 4 TMUs, SGX543MP4 in NGP has 8 TMUs. And we're talking clock speeds between 200MHz and 400MHz, so also much lower than IGPs. NVIDIA's original Ion had 8 TMUs at 450MHz, so more than twice the NGP with 8 TMUs at (apparently) only 200MHz!
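
To make the Ion versus NGP comparison concrete, here is a minimal sketch of the arithmetic, assuming peak texel rate is simply TMU count times clock (one texel per TMU per clock). The function name is made up for illustration; the TMU counts and clocks are the ones quoted above.

```python
def texel_rate_gtexels(tmus: int, clock_mhz: float) -> float:
    """Peak texel rate in GTexels/s, assuming one texel per TMU per clock."""
    return tmus * clock_mhz / 1000.0

ion = texel_rate_gtexels(8, 450)  # original Ion: 8 TMUs at 450MHz
ngp = texel_rate_gtexels(8, 200)  # NGP (SGX543MP4): 8 TMUs at (apparently) 200MHz

print(f"Ion: {ion:.1f} GTexels/s, NGP: {ngp:.1f} GTexels/s, ratio: {ion / ngp:.2f}x")
```

This is theoretical peak only; it says nothing about how efficiently each design uses its TMUs.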

In practice it's better than that: they're certainly more efficient, and TMUs are the worst offender, as I said. A real comparison in a real complex game with a high quality port would be very interesting. But expect no miracles here.

The next generation will be a lot more interesting... ST-Ericsson A9600 with Rogue/Series6 will have 5+ GPixels fillrate (versus 3.6 for Ion) and a 4x higher flops-to-TMU ratio. And likely other efficiency advantages as well.
 
What about in comparison to the Intel GMA 4500? Or lower end chips (than the ION) like the Mobility Radeon 3200?

I expect these embedded GPUs are very modest in their form, but still it's quite fascinating to see.

Do you have any statistics on the flops ratings between these devices?

Thanks for your responses so far :)
 
What about in comparison to the Intel GMA 4500? Or lower end chips (than the ION) like the Mobility Radeon 3200?
GMA4500 has quite high flops actually - it just doesn't make very efficient use of them :) 8 TMUs too I believe. Current high-end SOCs might have comparable performance (though much lower theoretical numbers).
It's got 10 shader units which are physically 4-wide - 40 flops (80 if you count the very limited MAC) per clock at quite a high clock.
The Radeon Mobility HD 3200 has also 40 shader units (80 flops/clock / 8 TMUs) - overall performance is comparable to ION, flops (as usual) quite a bit higher.
 
So are we talking around 40 Gflops for Mobility Radeon 3200? I think ION is about 52 Gflops

In raw terms how many Gflops are we looking at in a Tegra 2, SGX 543 MP2 or SGX 543 MP4 like PSvita?

How would the considerations to the efficiency of these embedded mobile chips factor into real-world graphics performance?
 
Theoretical performance numbers like FLOPs or triangle rates are as meaningless as always, since f.e. FLOP != FLOP between architectures.

SoCs like Tegra2, Apple A5 or the one in Sony's NGP haven't been developed for something like notebooks or low end PCs and none of them is even DX10 compliant. In that regard I don't see how any comparison would be worth anything. Embedded SoCs are tailored primarily for ultra low power consumption and for the time being the necessary >DX9.0 overhead in terms of die area would be a waste.

For the next technology generation of embedded GPUs, like the IMG Series6 Rogue in the ST-Ericsson A9600 (a smart-phone SoC that is at least DX10 compliant), its >210 GFLOPS, >5 GTexels/s and >350M Tris/s make for a somewhat better comparison, but then it would have to be compared against notebook/PC SoCs from 2012 onwards, which will be DX11 and a lot faster than what is available today.

In the rather unlikely case where SONY would opt for a DX11 Rogue MP mega-SoC for a console design things would get definitely interesting, since for Intel for anything above embedded I have to see it first and then believe it.
 
So are we talking around 40 Gflops for Mobility Radeon 3200? I think ION is about 52 Gflops
Only with very generous counting. Otherwise more like 35 GFlops max. Some Ions only have 8 instead of 16 "cuda cores" so only half.
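
One possible reading of the 35 versus 52 GFlops gap is a counting question: whether each "cuda core" is credited with a MAD only (2 flops per clock) or with a MAD plus a co-issued MUL (3 flops per clock). The sketch below works that out. The 1100MHz shader clock and the MAD+MUL interpretation are assumptions of this example, not figures stated in the thread.

```python
SHADER_CLOCK_MHZ = 1100  # assumption: Ion shader clock, not given in the thread

def gflops(cores: int, flops_per_core_per_clock: int, clock_mhz: float) -> float:
    """Theoretical peak GFLOPS = cores * flops per core per clock * clock."""
    return cores * flops_per_core_per_clock * clock_mhz / 1000.0

mad_only = gflops(16, 2, SHADER_CLOCK_MHZ)      # MAD = multiply + add = 2 flops
mad_plus_mul = gflops(16, 3, SHADER_CLOCK_MHZ)  # generous: also count an extra MUL
half_ion = gflops(8, 2, SHADER_CLOCK_MHZ)       # 8-core Ion variant, MAD only

print(f"MAD only: {mad_only:.1f}, MAD+MUL: {mad_plus_mul:.1f}, 8-core: {half_ion:.1f}")
```

Under those assumptions the two totals land near the 35 and 52 figures mentioned here.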

How would the considerations to the efficiency of these embedded mobile chips factor into real-world graphics performance?
That's not easy to answer, especially if you're looking at TBDR vs. IMR.

Theoretical performance numbers like FLOPs or triangle rates are as meaningless as always, since f.e. FLOP != FLOP between architectures.

SoCs like Tegra2, Apple A5 or the one in Sony's NGP haven't been developed for something like notebooks or low end PCs and none of them is even DX10 compliant. In that regard I don't see how any comparison would be worth anything. Embedded SoCs are tailored primarily for ultra low power consumption and for the time being the necessary >DX9.0 overhead in terms of die area would be a waste.
I'm not quite sure what the chips are missing for DX10 compliance, and if that really would cost that much. Some though definitely don't have the required precision (for the ALUs for instance, and also z-buffer) which makes them not even really DX9 compliant.

For the next technology generation of embedded GPUs, like the IMG Series6 Rogue in the ST-Ericsson A9600 (a smart-phone SoC that is at least DX10 compliant), its >210 GFLOPS, >5 GTexels/s and >350M Tris/s make for a somewhat better comparison, but then it would have to be compared against notebook/PC SoCs from 2012 onwards, which will be DX11 and a lot faster than what is available today.
210Gflops for Rogue? Where did you get that number? IIRC SGX543 has got something like 16 (fp32) flops / clock which at usual clocks is about 2 orders of magnitude lower, so that would be more than a drastic increase.
 
I'm not quite sure what the chips are missing for DX10 compliance, and if that really would cost that much. Some though definitely don't have the required precision (for the ALUs for instance, and also z-buffer) which makes them not even really DX9 compliant.

I'm a bit generous today ;)

210Gflops for Rogue? Where did you get that number? IIRC SGX543 has got something like 16 (fp32) flops / clock which at usual clocks is about 2 orders of magnitude lower, so that would be more than a drastic increase.
When was the last time you've checked B3D's frontpage? LOL http://www.beyond3d.com/content/articles/112/

http://www.stericsson.com/press_releases/NovaThor.jsp

As Arun suggests though, it is most likely either a dual or triple core. Single core seems completely unlikely for such high performance metrics. Although I'm wrong in the majority of cases, I'd follow Arun's 667MHz theory, with each core having something like 16 Vec5 ALUs and 4 TMUs. Baseline DX10; maximum DX11 according to IMG's latest announcements.
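
As a rough sanity check on that guess, the sketch below multiplies out the speculated per-core configuration (667MHz, 16 Vec5 ALUs, 4 TMUs) for one, two and three cores. Counting each ALU lane as one FMAD (2 flops) per clock is an assumption of this example, and all names are made up for illustration.

```python
CLOCK_GHZ = 0.667     # speculated clock from the post above
ALUS_PER_CORE = 16    # speculated
LANES_PER_ALU = 5     # "Vec5"
FLOPS_PER_LANE = 2    # assumption: one FMAD per lane per clock = 2 flops
TMUS_PER_CORE = 4     # speculated

def peak(cores: int) -> tuple[float, float]:
    """Return (GFLOPS, GTexels/s) for a given core count under the guesses above."""
    gflops = cores * ALUS_PER_CORE * LANES_PER_ALU * FLOPS_PER_LANE * CLOCK_GHZ
    gtexels = cores * TMUS_PER_CORE * CLOCK_GHZ
    return gflops, gtexels

for cores in (1, 2, 3):
    gflops, gtexels = peak(cores)
    print(f"{cores} core(s): {gflops:.1f} GFLOPS, {gtexels:.2f} GTexels/s")
```

With these guesses a single core falls short of the >210 GFLOPS and >5 GTexels/s claims, while two cores clear both, which fits the multi-core reading.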
 
I'm a bit generous today ;)

When was the last time you've checked B3D's frontpage? LOL http://www.beyond3d.com/content/articles/112/
Well I actually read that at some point. I think I dismissed it cause the numbers looked too good...
The 210gflops number could be lower (fp16) precision in theory though it would still be very high. GFlops in the neighborhood of a radeon HD6450 (twice that of Brazos) just doesn't sound right for a SOC, especially not for a TBDR.
 
Theoretical performance numbers like FLOPs or triangle rates are as meaningless as always, since f.e. FLOP != FLOP between architectures.
True, the thought did cross my mind, but it's interesting from a numbers perspective.

Ailuros said:
SoCs like Tegra2, Apple A5 or the one in Sony's NGP haven't been developed for something like notebooks or low end PCs and none of them is even DX10 compliant.
If I'm not mistaken, ImgTec have made the PowerVR SGX chips DX10 & DX10.1 compliant for any application in Laptops or such..
 
I'm not quite sure what the chips are missing for DX10 compliance, and if that really would cost that much. Some though definitely don't have the required precision (for the ALUs for instance, and also z-buffer) which makes them not even really DX9 compliant.
All SGX parts meet shader precision requirements for Dx9 SM3.0

210Gflops for Rogue? Where did you get that number? IIRC SGX543 has got something like 16 (fp32) flops / clock which at usual clocks is about 2 orders of magnitude lower, so that would be more than a drastic increase.
SGX543/544 are actually 36 flops/clock for a single core, 554 is 72 flops/clock also for a single core. So not sure how you get to two orders of magnitude.

John.
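
To put the "two orders of magnitude" question in numbers, here is a small sketch comparing both per-clock figures (16 and 36 flops/clock) against the 210 GFLOPS claim. It assumes a single core and the 200MHz to 400MHz clock range mentioned earlier in the thread; the function names are made up for illustration.

```python
import math

ROGUE_GFLOPS = 210.0  # claimed figure for the A9600's GPU, from the thread

def gflops(flops_per_clock: int, clock_mhz: float) -> float:
    """Theoretical peak GFLOPS for one core."""
    return flops_per_clock * clock_mhz / 1000.0

def orders_below_rogue(flops_per_clock: int, clock_mhz: float) -> float:
    """Gap to the 210 GFLOPS claim, expressed in orders of magnitude (log10)."""
    return math.log10(ROGUE_GFLOPS / gflops(flops_per_clock, clock_mhz))

for flops_per_clock in (16, 36):
    for clock_mhz in (200, 400):
        g = gflops(flops_per_clock, clock_mhz)
        gap = orders_below_rogue(flops_per_clock, clock_mhz)
        print(f"{flops_per_clock} flops/clock @ {clock_mhz}MHz: "
              f"{g:.1f} GFLOPS, {gap:.2f} orders of magnitude below 210")
```

In all four cases the gap comes out between one and two orders of magnitude, and it shrinks further for multi-core parts.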
 
If I'm not mistaken, ImgTec have made the PowerVR SGX chips DX10 & DX10.1 compliant for any application in Laptops or such..
Haven't seen any yet. SGX535 is fully d3d9 compliant (might be true for SGX543 and up too, not sure). SGX545 is rumored to be d3d10.1 compliant, but it's not in any shipping product yet.
 
All SGX parts meet shader precision requirements for Dx9 SM3.0
As far as I know they aren't necessarily fast at it though (not all of them at least). But I was speaking generally for SOCs, and nvidia ones for instance don't.

SGX543/544 are actually 36 flops/clock for a single core, 554 is 72 flops/clock also for a single core. So not sure how you get to two orders of magnitude.
But that's not fp32 flops is it? I thought the 543 has 4 USSE2 pipes, each capable of maybe 2 fp32 fmads (hence 16 flops).
How do you arrive at 36?
 
But that's not fp32 flops is it? I thought the 543 has 4 USSE2 pipes, each capable of maybe 2 fp32 fmads (hence 16 flops).
How do you arrive at 36?
Perhaps because he's on the team that specified it.
 
As far as I know they aren't necessarily fast at it though (not all of them at least). But I was speaking generally for SOCs, and nvidia ones for instance don't.
[...]
But that's not fp32 flops is it? I thought the 543 has 4 USSE2 pipes, each capable of maybe 2 fp32 fmads (hence 16 flops).
How do you arrive at 36?
I'd believe John if I were you ;) 543 has 4 USSE2 pipes, each capable of 4 FP32 MADDs and one additional FP32 operation which IMG has never really talked about (best case it's an extra ADD, worst case it only does stupid stuff like format conversions, I don't know). FP16 and INT8 performance is higher but not massively so.
 
Perhaps because he's on the team that specified it.
Hmm 9 per USSE2 pipe is an odd number though, so I've got some feeling that's not just straight MADs :).
And you didn't answer the fp32 part :). All I've heard so far is flops without any precision.

I'd believe John if I were you ;) 543 has 4 USSE2 pipes, each capable of 4 FP32 MADDs and one additional FP32 operation which IMG has never really talked about (best case it's an extra ADD, worst case it only does stupid stuff like format conversions, I don't know). FP16 and INT8 performance is higher but not massively so.
Is there a (trustworthy) source somewhere for the 4 FP32 MADs per pipe?
 
As far as I know they aren't necessarily fast at it though (not all of them at least). But I was speaking generally for SOCs, and nvidia ones for instance don't.
Well they're not actually slow either, there's just not that many of them in early variants, however on the variants that are appearing in devices now....
But that's not fp32 flops is it? I thought the 543 has 4 USSE2 pipes, each capable of maybe 2 fp32 fmads (hence 16 flops).
How do you arrive at 36?

Yes, it is FP32 ops. As Arun indicated, each pipeline is VEC4 F32 FMAD; in addition we can run a parallel (real) floating-point op, giving 9 F32 flops per pipe or 36 per core.

The key point is that per clock these cores are in the same ballpark (or higher) as the desktop derived mobile parts mentioned above, series 6 raises this bar a lot further.

Edit, I have no idea why you wouldn't consider me to be trustworthy on this.
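
The per-core count described in this post can be sketched as follows: a VEC4 FMAD is 4 lanes times 2 flops, plus one parallel op, times 4 pipes. The function name is made up for illustration, and the 8-pipe case is only an inference from the 72 flops/clock figure quoted for the 554, not something stated in the thread.

```python
def flops_per_clock(pipes: int, vector_width: int = 4, extra_ops: int = 1) -> int:
    """FP32 flops per clock: each pipe does a vector FMAD (2 flops per lane) plus extra ops."""
    fmad_flops = vector_width * 2  # multiply + add on every lane
    return pipes * (fmad_flops + extra_ops)

per_pipe = flops_per_clock(pipes=1)      # 4 * 2 + 1 = 9
sgx543_core = flops_per_clock(pipes=4)   # 4 pipes, as stated in the thread
eight_pipes = flops_per_clock(pipes=8)   # inference: would match the 554's 72 flops/clock

print(per_pipe, sgx543_core, eight_pipes)
```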
 
Yes, it is FP32 ops. As Arun indicated, each pipeline is VEC4 F32 FMAD; in addition we can run a parallel (real) floating-point op, giving 9 F32 flops per pipe or 36 per core.
Hmm ok. That's quite a lot. I thought the earlier SGX parts (52x, 53x) couldn't do that much (per pipe).

The key point is that per clock these cores are in the same ballpark (or higher) as the desktop derived mobile parts mentioned above, series 6 raises this bar a lot further.
Ok series 6 looking good then (of course I'd expect 28nm fusion to include Caicos-like IGP but still looks like those SOCs will close the gap). I guess we'll see slower ones than that A9600 though, I wonder how series 6 compare in terms of efficiency (both power and die area) to series 5, considering the better feature set.

Edit, I have no idea why you wouldn't consider me to be trustworthy on this.
Oh I certainly do - you just didn't confirm it was really fp32 before (not in this thread, at least, maybe elsewhere I missed).
 
Hmm ok. That's quite a lot. I thought the earlier SGX parts (52x, 53x) couldn't do that much (per pipe).

They can't. Series5XT can dual issue FMADDs. I've been presuming that the same limitations as with Series5 apply to the FMADDs, which is that they can be four fully independent fixed-point 8 or 10-bit operations, two fully independent FP16, one fully independent FP32, or two FP32 sharing a parameter. I think the shared parameter was a common multiplier. That information is just relayed from what JohnH has previously posted, not barring places where I may have broken it.

That it can do some other operations means that that operation has to be encoded somewhere and that it can issue more than two instructions somehow, unless there's an FMADD + something else sort of instruction.
 
Hmm ok. That's quite a lot. I thought the earlier SGX parts (52x, 53x) couldn't do that much (per pipe).

The original series 5 cores were 2x F32 MAD per pipe per clock so 4 flops/clock/pipe, so a little less than half S5 XT per pipe performance.
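
The "a little less than half" figure follows directly from the per-pipe counts given in the thread, as this short sketch shows (constant names are made up for illustration; a MAD is counted as 2 flops).

```python
SERIES5_FLOPS_PER_PIPE = 2 * 2        # 2x F32 MAD per pipe per clock, 2 flops each
SERIES5XT_FLOPS_PER_PIPE = 4 * 2 + 1  # VEC4 FMAD plus one parallel op

ratio = SERIES5_FLOPS_PER_PIPE / SERIES5XT_FLOPS_PER_PIPE
print(f"Series5 per-pipe throughput is {ratio:.0%} of Series5XT")
```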
 