Qualcomm Krait & MSM8960 @ AnandTech

Synthetic benchmarks like GLBenchmark attempt, in their own way, to predict what future games might look like. No mobile ISV would be insane enough to create a game today as demanding as GLBenchmark 2.5 and have the majority of users/devices looking at single-digit framerates.

The real point here is that Adrenos in general do extremely well with highly complex shaders and start to wind back as complexity shrinks. That is probably due to their still-shaky driver/compiler not letting the GPUs reach the potential the actual hardware should have.

If you asked me as a user to measure two competing GPUs of any kind, I would use the most torturous synthetic stress tests along with as many real 3D games as possible (and definitely not the best-case scenarios in those), and from the entire crop of results I'd attempt to reach a conclusion. Each result has its own merit; it just comes down to how well you're able to interpret them.

When a mobile device without vsync runs any sort of 3D application way above the vsync limit (typically 60Hz), that performance headroom would be better invested in IQ-improving features like multisampling and/or AF.

The unfortunate thing here is that the small-form-factor market, and especially 3D for it, is still too young. Games unfortunately don't have any benchmarking functions, even though their results would be far more representative than a handful of synthetic benchmarks. Assuming there were a healthy collection of game benchmarks available, would you rather look at cases where GPUs reach average framerates way beyond 100 fps, or at something that drives the tested GPUs to their edge with average framerates of 20-30 fps or even less? Or better, how would you suggest measuring and comparing different GPUs in such cases?
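One conventional way to fold a whole crop of results like that into a single comparison (not something this thread prescribes, just a common aggregation method) is the geometric mean of per-test ratios between two GPUs. A minimal sketch in Python; the framerates below are purely illustrative placeholders, not real benchmark data:

[code]
# Geometric mean of per-test ratios between two hypothetical GPUs.
# The framerates below are made-up placeholders, purely for illustration.
from math import prod

gpu_a = {"game_1": 42.0, "game_2": 18.5, "heavy_synthetic": 9.3}   # avg fps
gpu_b = {"game_1": 38.0, "game_2": 21.0, "heavy_synthetic": 12.1}  # avg fps

ratios = [gpu_a[test] / gpu_b[test] for test in gpu_a]
geomean = prod(ratios) ** (1 / len(ratios))
print(f"GPU A vs GPU B overall: {geomean:.2f}x (geometric mean of per-test ratios)")
[/code]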
 
Has the market for high-quality 3D games on Android and iOS grown?

Apple showcases 3D games at every keynote, and NVIDIA obviously evangelizes development for devices using its SoCs.

In PC gaming, there used to be one or two games which were used as benchmarks for all the video cards. Is there a counterpart for that in mobile games?
 
One or two games isn't nearly good enough, but for the time being it would be a pleasant change. I'd expect that Epic might do something in that direction for its Unreal mobile engine, unless they already have something relevant and I've missed it.
 
One or two games isn't nearly good enough, but for the time being it would be a pleasant change. I'd expect that Epic might do something in that direction for its Unreal mobile engine, unless they already have something relevant and I've missed it.

I've not been bothered to learn how to multi-quote posts on my mobile as of yet, so....

Right, you correctly point out that the Exynos 5250 would likely beat the Adreno 320...fine, but like I said, when is Samsung likely to stick that in a phone? Only a high-end one, and the Galaxy Note 2 has just been launched without one, while the Galaxy S3 is only a few months old...so good luck waiting for that to show up...the earliest is MWC...which is just what I said.

I don't know die areas, you're correct, but when I mentioned redundancy (hope I've got the correct terminology for this remark :) ) I meant that Qualcomm with the Adreno 320, not unlike IMG Tech with its Rogue, has formed what appears to be some sort of clusters with one TMU for each, whereas the current-gen SGX 543MP4 is literally four GPUs bolted together. That is not as efficient in die area relative to performance/power consumption, at a guess; IMG Tech's move to a new uarch in this area points to that IMO.

So coming back to what I said: if you shrank the A5X down to 28nm and clocked it @ 400MHz (do we even know Adreno clocks??) to match the Adreno 320, then whilst performance would be better for the SGX, I'm guessing that due to the added execution resources of the full 4x SGX GPU cores (and let's not forget the quad-channel memory controller) the APQ Snapdragon's Adreno 320 GPU would still be smaller and, I'm guessing, slightly more power efficient...tests not taking into consideration the four Kraits and 2MB of L2 cache.
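Just to put rough numbers on that hypothetical shrink, here is a back-of-the-envelope sketch. The ~165 mm^2 A5X die size, the ~250MHz SGX543MP4 clock, and ideal area scaling with the square of the feature-size ratio are all assumptions; real shrinks never scale this cleanly:

[code]
# Idealized sketch of the "A5X shrunk to 28nm, clocked at 400MHz" hypothetical.
# All inputs are assumptions, not measured data.

a5x_die_mm2_45nm = 165.0            # assumed A5X die size at 45nm
ideal_area_scale = (28.0 / 45.0) ** 2
a5x_die_mm2_28nm = a5x_die_mm2_45nm * ideal_area_scale

clock_old_mhz, clock_new_mhz = 250.0, 400.0    # assumed SGX543MP4 clocks
naive_speedup = clock_new_mhz / clock_old_mhz  # assumes performance scales with clock

print(f"Idealized A5X die at 28nm: ~{a5x_die_mm2_28nm:.0f} mm^2")
print(f"Naive GPU speedup from the clock bump alone: ~{naive_speedup:.2f}x")
[/code]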

Besides..none of that matters for the here and now...it's speculative..the Snapdragon S4 Pro is launching in a smartphone soon, and part of processor design/decision making/planning is obviously the manufacturing process and judging benefits/yields/supply...and this is where Qualcomm gambled correctly..and reaped the performance benefits, good luck to them.

Like I said, the Snapdragon S4 Pro is showing up in phones around October...retail...all round it is a VERY good chip...the best IMO, and I don't see a SMARTPHONE with the backbone and efficiency to rival it for at least 4-6 months....coming from behind as Qualcomm was a while ago, that's some achievement, as there are some big players about with big resources...6 months is a long time :).
 
My understanding is that Qualcomm has presented the APQ8064 as a tablet chip. OTOH I agree, and forgot to say in my previous post, that we should wait for battery life results. But performance always has a price (especially when using the same process).

You're correct of course, APQ denotes a tablet chip in Qualcomm speak, but I'd like to point out that if it launches in a smartphone before any tablets...then that makes it a smartphone SoC.

Of course, like you say, we await power consumption numbers...but if it's a success (and I fully expect it to be), then it's a massive win for Qualcomm. :)
 
I said at least twice that the iPhone5 will be highly competitive; now that the first 2.5 results are out you may bother to check them. It's most likely an MP3 at 300MHz or above, so that's no longer any sort of hypothesis, and it lands quite a bit closer to a hypothetical MP4@400MHz.

The difference in Egypt 2.5 offscreen is less than 2 fps, while the iPhone5 has the highest performance per pixel right now.
 
The iPhone 5 is the fastest mobile device available, and its GPU is trading wins with upcoming Adreno 320 phones depending on the specific test in GLBench and other benchmarks. Ultimately, Adreno 320 devices and newer devices shouldn't have too much difficulty overtaking it by a good margin in the near future, but remaining even halfway competitive in the lesser year of its two-year cycle is pretty incredible. The A6X in the next iPad should lead the market by a margin approached only by the introductions of the 3GS and the iPad 2.
 
I said at least twice that the iPhone5 will be highly competitive; now that the first 2.5 results are out you may bother to check them. It's most likely an MP3 at 300MHz or above, so that's no longer any sort of hypothesis, and it lands quite a bit closer to a hypothetical MP4@400MHz.

The difference in Egypt 2.5 offscreen is less than 2 fps, while the iPhone5 has the highest performance per pixel right now.

We agree, the Apple chip is awesome...but my point about the Adreno 320 being the fastest, most advanced smartphone GPU for 4-6 months still stands.

The SGX is on very mature drivers..and is still slightly behind, doesn't include Halti support, and also doesn't have access to four Krait cores and 2MB of cache :).

I don't fully believe that the 1.2GHz dual-core Apple chip would smack around the Snapdragon S4 Pro in a REAL-world scenario...using the same software, I expect Apple's superior memory subsystem would mean it pulled out some wins, likely including power consumption, but I would rather have the S4 Pro on my Android device.
 
We agree, the Apple chip is awesome...but my point about the Adreno 320 being the fastest, most advanced smartphone GPU for 4-6 months still stands.

No one said that it isn't an advanced GPU, rather the contrary.

The SGX is on very mature drivers..and is still slightly behind, doesn't include Halti support, and also doesn't have access to four Krait cores and 2MB of cache :).
The point is GPU efficiency, not how each SoC manufacturer integrates N other units into a SoC. The Adreno 320, exactly like its predecessors, needs additional driver/compiler homework, and no, today's trend didn't start today: Adreno 2xx GPUs were also always excellent with high-complexity shaders, while efficiency weakened quite a bit with lower shader complexity.

The iPhone5 gets 91.6 fps in Egypt 2.1 offscreen, while the Xiaomi M2 gets 77.8 fps.

The MP3 in the A6 is clocked at 300MHz or slightly above, with fewer ALUs than the 320, which is clocked at 400MHz. If you still can't comprehend efficiency per clock and per unit, it's not my fault really.
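To put the per-clock claim in numbers, here is a minimal sketch using the Egypt 2.1 offscreen figures above; the 300MHz and 400MHz clocks are this thread's estimates, not confirmed specifications:

[code]
# Per-clock comparison from the figures quoted above. Clocks are estimates
# from this discussion (MP3 ~300MHz, Adreno 320 ~400MHz), not confirmed specs.

results = {
    "iPhone 5 (SGX543MP3, est. 300MHz)":   (91.6, 300.0),
    "Xiaomi M2 (Adreno 320, est. 400MHz)": (77.8, 400.0),
}

for device, (fps, mhz) in results.items():
    print(f"{device}: {fps / mhz:.3f} fps per MHz in Egypt 2.1 offscreen")
[/code]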

I don't fully believe that the 1.2GHz dual-core Apple chip would smack around the Snapdragon S4 Pro in a REAL-world scenario...using the same software, I expect Apple's superior memory subsystem would mean it pulled out some wins, likely including power consumption, but I would rather have the S4 Pro on my Android device.
Who cares? Qualcomm is still the king of all Android SoC manufacturers when it comes to smartphone SoC design wins and market share, and Apple serves its own market as before. It's rather Intel, Texas Instruments, Samsung etc. that should be seriously worried about Qualcomm's execution. They're the ones that are behind compared to Qualcomm, while Apple is give or take on schedule for its own roadmap.
 
No one said that it isn't an advanced GPU, rather the contrary.

The point is GPU efficiency, not how each SoC manufacturer integrates N other units into a SoC. The Adreno 320, exactly like its predecessors, needs additional driver/compiler homework, and no, today's trend didn't start today: Adreno 2xx GPUs were also always excellent with high-complexity shaders, while efficiency weakened quite a bit with lower shader complexity.

The iPhone5 gets 91.6 fps in Egypt 2.1 offscreen, while the Xiaomi M2 gets 77.8 fps.

The MP3 in the A6 is clocked at 300MHz or slightly above, with fewer ALUs than the 320, which is clocked at 400MHz. If you still can't comprehend efficiency per clock and per unit, it's not my fault really.

Who cares? Qualcomm is still the king of all Android SoC manufacturers when it comes to smartphone SoC design wins and market share, and Apple serves its own market as before. It's rather Intel, Texas Instruments, Samsung etc. that should be seriously worried about Qualcomm's execution. They're the ones that are behind compared to Qualcomm, while Apple is give or take on schedule for its own roadmap.

Ha ha, who said anything about efficiency-per-clock comprehension??.. Do we even know the clock speed of the Adreno 320?? Do we know the execution unit count of that chip??...

The answer, from public information, is that we don't...so all we can go on are early benchmarks and guesstimates...efficiency per clock has nothing to do with me buying a smartphone...performance does..which the Adreno has the most of in more future-oriented gaming scenarios.

Like I said 4-6 months :)
 
but remaining even halfway competitive in the lesser year of its two-year cycle is pretty incredible.

Nothing too incredible about it. The normalized SoC die size of the A6 is as large as the A5X's. That translates to roughly 2x the SoC die area of something like the Tegra 3, and the die area dedicated to the GPU is much more than 2x that of the Tegra 3. Since Apple dedicated so much silicon area to the GPU, it should not be surprising in the least that they have competitive GPU performance even if they haven't yet moved to a newer architectural design.

The A6X in the next iPad should lead the market by a margin approached only by the introductions of the 3GS and the iPad 2.

Considering that ipad 3 was noticeably thicker, heavier, and warmer to the touch than ipad 2, Apple will make a goal of reducing thickness, weight, and power consumption for ipad 4 (relative to ipad 3). So just as A6 SoC die size (@ 32nm LP fabrication process) is smaller than A5 SoC die size (@ 45nm LP fabrication process), the expectation is that A6X SoC die size (@ 32nm LP fabrication process) will be smaller than A5X SoC die size (@ 45nm LP fabrication process). Assuming that the difference in SoC die size area between A6 and A6X will be roughly similar to the difference in SoC die size area between A5 and A5X, then A6X SoC die size (@ 32nm LP fabrication process) will be ~ 126 mm^2, and a normalized A6X SoC die size (using 45nm LP fabrication process as a baseline) would be ~ 222 mm^2.

Apple has three options available to them with this increase in effective SoC die size area: 1) Add 2 more CPU cores, or 2) Add 2 more GPU cores based on the existing SGX 543 architecture, or 3) Move to a newer GPU cluster-based architecture.

Option 1 is unlikely given past history, so Apple will surely be at a significant deficit with respect to CPU performance vs. top competing SoC's in 2013, especially in multi-threaded benchmarks or any application that can take advantage of more than two CPU cores.

Option 2 is very doable, and since execution units for the GPU would increase by 50% relative to A5X SoC in ipad 3, Apple would need to increase GPU operating frequency by ~ 33% in order to double GPU performance relative to ipad 3 (this would be a very similar approach to what was used when moving from iphone 4s to iphone 5).

Option 3 is doable too, but may take more time to implement than option 2. Since fabrication process and SoC die size area devoted to the GPU is the same as option 2, and since a unified shader architecture is still being used as in option 2, then realistically how much more GPU performance can we expect to see than option 2? Maybe overall GPU performance will improve by ~ 2.5-3x vs. ipad 3 (rather than 2x as in option 2) due to new architectural efficiencies, but anything more than that is certainly questionable and debatable.
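For anyone who wants to follow the arithmetic in that estimate, here is a small sketch. The die sizes are the rough public figures this post leans on, and the A5-to-A5X growth ratio carrying over to A6-to-A6X is the post's own assumption, not Apple data:

[code]
# Sketch of the die-size and option-2 arithmetic above. Die sizes are rough
# public figures and the scaling assumptions come from the post, not Apple.

a6_die_mm2  = 97.0    # approx. A6 die size at 32nm LP
a5_die_mm2  = 122.0   # approx. A5 die size at 45nm LP
a5x_die_mm2 = 165.0   # approx. A5X die size at 45nm LP

# Assume A6 -> A6X grows by roughly the same ratio as A5 -> A5X did.
a6x_die_mm2 = a6_die_mm2 * (a5x_die_mm2 / a5_die_mm2)
print(f"Estimated A6X die at 32nm: ~{a6x_die_mm2:.0f} mm^2 "
      f"(the post arrives at ~126 mm^2 with its own inputs)")

# Option 2: going from 4 to 6 SGX543 cores is +50% execution resources,
# so doubling GPU performance needs roughly a 2 / 1.5 = ~1.33x clock bump.
core_scaling = 6 / 4
clock_increase = 2.0 / core_scaling - 1.0
print(f"Clock increase needed to double GPU performance: ~{clock_increase * 100:.0f}%")
[/code]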

So if overall GPU performance improves by ~ 2.5x vs. ipad 3, it is quite a stretch to say that other competing SoC's for tablets/clamshells will not be competitive. The Tegra 4 variant for tablets/clamshells is rumored to have up to 64 CUDA "cores". Assuming that operating frequency has not gone down from current gen Tegra, then that would give 8x (!) more pixel fillrate than Tegra 3!

Just to give an example, if we used an 8x multiplier for fillrate in GLBenchmark 2.5 for Google Nexus 7 and Asus Transformer Pad Infinity to estimate the performance of next gen high end tablets/clamshell devices available ~ 6-9 months from now, we would get between 3787-4352 MTexels/s fillrate. If we used a 2-2.5x multiplier for fillrate in GLBenchmark 2.5 for ipad 3 to estimate the performance of next gen ipad 4 available ~ 6-9 months from now, we would get between 3714-4642 MTexels/s fillrate. And who knows what differences we may see in geometry performance and other metrics.

So yes, we do expect Apple to continue to devote a relatively large portion of their SoC die size area towards the GPU, and we do expect Qualcomm to have relatively strong GPU performance in smartphones rather than high end tablets, but you may be seriously underestimating competing solutions from other vendors (including next gen ULP Geforce, Mali T604, etc).
 
Power consumption (with an eye on heat dissipation) tends to limit mobile processor designs before area. Apple gains some performance by trading against area in their implementations; see TI or Renesas's more area-optimized SGX MP2 solutions for comparative reference.

Simply giving the current generation Tegra, Adreno, and Mali (the latter two not being small GPUs themselves already) more silicon won't prevent them from hitting the wall in power consumption before they can match PowerVR, though the extra silicon should let them find a better balance than they currently have for some increased performance.
 
Yes, but the current gen Tegra, Adreno, Mali SoC's all have an avg. power consumption that is suitable for use in smartphones. As we have discussed before, the avg. power consumption requirements are much less strict for tablets/clamshell devices, so a larger and more power hungry SoC [such as A5X, upcoming A6X, and other competing upcoming SoC's] is suitable for use in these devices. The A6 SoC is useable in a smartphone in large part due to use of the smaller 32nm LP fabrication process with significantly lower leakage and significantly smaller die size relative to the 45nm LP fabrication process.
 
Apple has three options available to them with this increase in effective SoC die size area: 1) Add 2 more CPU cores, or 2) Add 2 more GPU cores based on the existing SGX 543 architecture, or 3) Move to a newer GPU cluster-based architecture. Option 2 is very doable, and since execution units for the GPU would increase by 50% relative to A5X SoC in ipad 3, Apple would need to increase GPU operating frequency by ~ 33% in order to double GPU performance relative to ipad 3 (this would be a very similar approach to what was used when moving from iphone 4s to iphone 5). Option 3 is doable too, but may take more time to implement than option 2. Since fabrication process and SoC die size area devoted to the GPU is the same as option 2, and since a unified shader architecture is still being used as in option 2, then realistically how much more GPU performance can we expect to see than option 2?

The iPhone5/iPad3 are getting about 30 GFLOPS of shader throughput and a theoretical raw fillrate of 2 GPixels per second, so a doubling of that performance would get you, say, 60 GFLOPS and 4 GPixels per second. Way back when ST-Ericsson announced the A9600 at the start of 2011, they cited 210 GFLOPS and 5 GPixels for their Rogue implementation, and that is on a 28nm process. One might assume that the quoted ST Rogue implementation would not be given as generous a proportion of the die area as Apple gives to graphics, and hence it's likely to be smaller than the 543MP6 Apple would have to implement to double performance in the next-gen iPad. So on that, albeit limited, data, a Rogue implementation brings significant performance and size benefits over a bigger 5XT implementation, never mind the added compliance with full OpenCL and GLES 3.0.

http://www.stericsson.com/press_releases/NovaThor.jsp
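A quick side-by-side of those quoted figures; all of them are approximate or vendor-claimed numbers taken from the post and the press release above, not measurements:

[code]
# Side-by-side of the rough figures quoted above (approximate / vendor claims).
a5x_class   = {"GFLOPS": 30.0,  "GPixels/s": 2.0}       # ~iPad 3 / iPhone 5 class SGX543
doubled_a5x = {k: 2 * v for k, v in a5x_class.items()}  # hypothetical 2x next-gen iPad
a9600_rogue = {"GFLOPS": 210.0, "GPixels/s": 5.0}       # ST-Ericsson A9600 claims (28nm)

for metric in a5x_class:
    ratio = a9600_rogue[metric] / doubled_a5x[metric]
    print(f"{metric}: doubled A5X = {doubled_a5x[metric]:.0f}, "
          f"A9600 claim = {a9600_rogue[metric]:.0f} ({ratio:.1f}x)")
[/code]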
 
The intent communicated by Apple's designs speaks to them freeing up even more headroom, through a modern CPU design that prioritizes efficiency and the forthcoming power savings from second-generation "Retina" panels, to push the GPU even further relative to the current iPhone and iPad.

I wouldn't expect less from a company that proposed a spec like OpenCL; they clearly have a mindset established.
 
So if overall GPU performance improves by ~ 2.5x vs. ipad 3, it is quite a stretch to say that other competing SoC's for tablets/clamshells will not be competitive. The Tegra 4 variant for tablets/clamshells is rumored to have up to 64 CUDA "cores". Assuming that operating frequency has not gone down from current gen Tegra, then that would give 8x (!) more pixel fillrate than Tegra 3! Just to give an example, if we used an 8x multiplier for pixel fillrate in GLBenchmark 2.5 for Google Nexus 7 and Asus Transformer Pad Infinity to estimate the performance of next gen high end tablets/clamshell devices available ~ 6-9 months from now, we would get between 3787-4352 MTexels/s pixel fillrate. If we used a 2-2.5x multiplier for pixel fillrate in GLBenchmark 2.5 for ipad 3 to estimate the performance of next gen ipad 4 available ~ 6-9 months from now, we would get between 3714-4642 MTexels/s pixel fillrate. And who knows what differences we may see in geometry performance and other metrics. So yes, we do expect Apple to continue to devote a relatively large portion of their SoC die size area towards the GPU, and we do expect Qualcomm to have relatively strong GPU performance in smartphones rather than high end tablets, but you may be seriously underestimating competing solutions from other vendors (including next gen ULP Geforce, Mali T604, etc).

If you're truly referring to texel fillrates: if the ULP GF in T3 has 2 TMUs, as I assume, that's 1040 MTexels/s on paper. Lord knows what the ULP GF will look like in Wayne, but so far Mali T6xx and Adreno 3xx look like single-TMU-per-cluster designs, meaning that T604 with 4 SIMDs has 4 TMUs in total with a projected frequency of 500MHz for a start, at 2.0 GTexels/s. If Wayne goes a similar route, it would end up with 4 TMUs for 64 ALU lanes just like the other two designs; if they went for 2 TMUs/cluster, it would be on the same level as Series5XT and Series6 from IMG.

The bigger battle for next-generation small-form-factor GPUs will be in terms of GFLOPs per clock or per mm2, whichever sounds more convenient, and far less about any sort of fillrate. If there's anything NVIDIA's competition should worry about, it's NVIDIA's driver/compiler efficiency and its developer relations incentives.

The question mark with Apple is which GPU generation they'll integrate in their next tablet; whether Series5XT or Series6, however, it won't change anything in terms of on-paper texel fillrates per clock and per cluster/core.

Maybe overall GPU performance will improve by ~ 2.5-3x vs. ipad 3 (rather than 2x as in option 2) due to new architectural efficiencies, but anything more than that is certainly questionable and debatable.

Apple so far counts GPU performance increases in floating-point throughput; in that regard, the up-to-3x estimate is a huge understatement. For the rest, let's just wait and see what the cat drags in.
 
Who is to say Apple will be on 32nm for the iPad 4? Reuters linked them to TSMC over a year ago, and Samsung has 28nm ready for production.

And the only official word on Tegra 4 is that it will be 2x Tegra 3. Maybe a higher-end version will come later, but why would NVIDIA not use that one as the basis for comparison over Tegra 3? It's not like they have problems with bending the truth a little... see Tegra 3 being "5x Tegra 2", and their vague branding of laptop chips that has people confusing Fermi with Kepler.
 
If you're truly referring to texel fillrates

I was referring to the Fill Test in GLBenchmark 2.5: http://images.anandtech.com/graphs/graph6126/48879.png . And I just noticed that I accidentally used the On-screen test data rather than the Off-screen test data which is here: http://images.anandtech.com/graphs/graph6126/48880.png . If we used an 8x multiplier in GLBenchmark 2.5 for Google Nexus 7 and Asus Transformer Pad Infinity to estimate the performance of next gen high end tablets/clamshell devices available ~ 6-9 months from now, we would get between 3973-4440 MTexels/s fillrate (Off-screen 1080p). If we used a 2-2.5x multiplier for fillrate in GLBenchmark 2.5 for ipad 3 to estimate the performance of next gen ipad 4 available ~ 6-9 months from now, we would get between 3568-4460 MTexels/s fillrate (Off-screen 1080p).
 
I was referring to the [Pixel] Fill Test in GLBenchmark 2.5: http://images.anandtech.com/graphs/graph6126/48879.png . And I just noticed that I accidentally used the On-screen test data rather than the Off-screen test data which is here: http://images.anandtech.com/graphs/graph6126/48880.png .

It's in fact a texel fillrate synthetic, but that's for the hairsplitting realm. TBDRs typically come closer to their theoretical maximum fillrates than IMRs do (whether tile-based or not).

If we used an 8x multiplier in GLBenchmark 2.5 for Google Nexus 7 and Asus Transformer Pad Infinity to estimate the performance of next gen high end tablets/clamshell devices available ~ 6-9 months from now, we would get between 3973-4440 MTexels/s pixel fillrate (Off-screen 1080p). If we used a 2-2.5x multiplier for pixel fillrate in GLBenchmark 2.5 for ipad 3 to estimate the performance of next gen ipad 4 available ~ 6-9 months from now, we would get between 3568-4460 MTexels/s pixel fillrate (Off-screen 1080p).

My question is still the same: where exactly does that 8x multiplier come from? As I said, one feasible speculative dilemma is whether NV implements 1 or 2 TMUs per compute cluster. The latter brings it up in theory to 8 TMUs, which with a feasible frequency of, say, 700MHz under 28nm gives 5.6 GTexels/s, and with 4 TMUs = 2.8 GTexels/s. TMUs aren't exactly cheap, and they can't just throw them around like there's no tomorrow with all the power restrictions.
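For reference, the on-paper texel fillrate math behind both the T604 figure earlier and these Wayne scenarios is just TMUs times clock; a minimal sketch, where the TMU counts and clocks are the speculative figures from this discussion rather than confirmed specs:

[code]
# On-paper texel fillrate = TMUs * clock, assuming 1 bilinear texel per TMU
# per clock. TMU counts and clocks are speculative figures from this thread.

def texel_fillrate_gtexels(tmus: int, clock_mhz: float) -> float:
    return tmus * clock_mhz / 1000.0

scenarios = {
    "Tegra 3 ULP GF (2 TMUs @ ~520MHz)":   (2, 520.0),
    "Mali T604 (4 TMUs @ ~500MHz)":        (4, 500.0),
    "Wayne, 1 TMU/cluster (4 @ ~700MHz)":  (4, 700.0),
    "Wayne, 2 TMUs/cluster (8 @ ~700MHz)": (8, 700.0),
}

for name, (tmus, mhz) in scenarios.items():
    print(f"{name}: ~{texel_fillrate_gtexels(tmus, mhz):.2f} GTexels/s on paper")
[/code]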

By the way, a clamshell can also be deemed one product category higher than a tablet, always depending on the target device's power consumption.
 
A next generation ULP Geforce with unified shader architecture and up to 64 CUDA "cores" (clocked at ~ 500MHz) would have up to ~ 8x more peak pixel shader performance relative to ULP Geforce in Tegra 3 (based on having a maximum of 64 pixel shader units on Tegra 4 vs. 8 pixel shader units on Tegra 3). The ULP Geforce in Tegra 3 has ~ 3-3.5x more peak pixel shader performance relative to ULP Geforce in Tegra 2 (based on having 2x more pixel shader units and ~ 66% higher clock speed on Tegra 3 vs. Tegra 2), and that performance delta seems to be reasonably well reflected in the GLBenchmark Fill Test results. So just to reiterate, I was looking at the performance delta between Tegra 2 and Tegra 3 on GLBenchmark Fill Test to extrapolate results for Tegra 4.
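As a sketch of that extrapolation, the scaling argument is just pixel-shader units times clock; the unit counts are the rumored/approximate figures from the post, and holding the clock flat for Tegra 4 is an assumption:

[code]
# Peak pixel-shader throughput scaling, per the argument above. Unit counts
# and clocks are approximate/rumored figures, and Tegra 4's clock is simply
# assumed not to drop below Tegra 3's.

def peak_ps_throughput(pixel_shader_units: int, clock_mhz: float) -> float:
    return pixel_shader_units * clock_mhz

tegra2 = peak_ps_throughput(4, 300.0)    # 4 pixel shader units @ ~300MHz
tegra3 = peak_ps_throughput(8, 500.0)    # 8 pixel shader units @ ~500MHz (~66% higher clock)
tegra4 = peak_ps_throughput(64, 500.0)   # rumored 64 unified "cores", clock assumed flat

print(f"Tegra 3 vs Tegra 2: ~{tegra3 / tegra2:.1f}x peak pixel shader throughput")
print(f"Tegra 4 (rumored) vs Tegra 3: ~{tegra4 / tegra3:.1f}x")
[/code]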
 