Qualcomm Krait & MSM8960 @ AnandTech

A next generation ULP Geforce with unified shader architecture and up to 64 CUDA "cores" (clocked at ~ 500MHz) would have up to ~ 8x more peak pixel shader performance relative to ULP Geforce in Tegra 3 (based on having a maximum of 64 pixel shader units on Tegra 4 vs. 8 pixel shader units on Tegra 3).
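A quick back-of-envelope sketch of that scaling claim (the 64-unit count and ~500MHz clock are the speculation above, not confirmed specs; Tegra 3's ULP GeForce is taken at its 520MHz peak):

```python
# Peak pixel-shader scaling, assuming identical per-unit throughput per clock.
# Unit counts and clocks are the speculative figures from the post above.

def peak_shader_ratio(units_new, clock_new_mhz, units_old, clock_old_mhz):
    """Ratio of peak pixel-shader throughput (units x clock)."""
    return (units_new * clock_new_mhz) / (units_old * clock_old_mhz)

# Speculated Tegra 4 (64 unified "cores" @ ~500MHz) vs. Tegra 3 (8 PS units @ 520MHz)
print(peak_shader_ratio(64, 500, 8, 520))  # ~7.7, i.e. the "up to ~8x" figure
```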

Well, I figured out how you came to that assumption; I just wanted you to confirm it. So given that awkward speculative math, according to it the chip would have either 7 or 9 TMUs.

The ULP Geforce in Tegra 3 has ~ 3-3.5x more peak pixel shader performance relative to ULP Geforce in Tegra 2 (based on having 2x more pixel shader units and ~ 66% higher clock speed on Tegra 3 vs. Tegra 2),
That's an exact 3.2x increase in pixel shader performance.

...and that performance delta seems to be reasonably well reflected in the GLBenchmark Fill Test results.
Does it? The fastest T3 device with quite a bit higher bandwidth than typical T3 based devices gives an offscreen result of almost 568MTexels/s while the fastest Tegra2 device from the Kishonti database itself gives 220MTexels/s, which gives a 2.58x difference; if you use a TF300 as a reference instead, the difference shrinks to 2.3x. I'd rather say that both Tegra2 and Tegra3 SoCs have the same amount of TMUs, with different frequencies and higher bandwidth on T3 allowing higher fillrate efficiency.

If yes, it breaks your theory quite quickly, since there are 2 vs. 1 Vec4 PS ALUs between T3 and T2. On top of that, I can't know for sure, but I'd be willing to believe that texturing is decoupled from the ALUs on the ULP GFs (unlike the NV3x/NV4x desktop trends), albeit you'll see it on a current-generation ARM Mali and Lord knows which other architectures.

Wayne in all likelihood will have a revamped GPU architecture, for which I expect a departure from vector ALUs (as for pretty much every next-generation SFF GPU); it could very well be SIMD8 or SIMD16, where nothing speaks against the notion that two SIMD8s could share a texture block. At the very least these are going to be, with utmost certainty, USC ALUs, so it's more like "12 SPs" on the T3 ULP GF in total, where a 5x increase in TMU count is equally senseless.

So just to reiterate, I was looking at the performance delta between Tegra 2 and Tegra 3 on GLBenchmark Fill Test to extrapolate results for Tegra 4.
Floating point performance will explode with the next generation GPUs; fillrates rather not. By the way, didn't it ever strike you that despite NV's sw/compiler efficiency, 64 GFLOPs of maximum theoretical floating point power is rather pitiful for a design that aims to reach up to clamshells? :rolleyes:
 
Does it? The fastest T3 device with quite a bit higher bandwidth than typical T3 based devices gives an offscreen result of almost 568MTexels/s while the fastest Tegra2 device from the Kishonti database itself gives 220MTexels/s, which gives a 2.58x difference

The fastest Tegra 3 GPU has an operating frequency of 520MHz, while the fastest Tegra 2 GPU has an operating frequency of 400MHz. So the difference in peak pixel shader performance will be 2*(520/400) = 2.60x difference. Kishonti's GLBench Fill Rate data referenced above gives 2.58x difference. So there does appear to be a reasonably good correlation. Not perfect, but reasonably good as an approximation. You are right though that the benchmark is measuring MTexels/s and not MPixels/s.
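For reference, the two ratios being compared here, spelled out (the 568 and 220 MTexels/s figures are the GLBenchmark results quoted above):

```python
# Theoretical peak scaling (2x the pixel shader units, 520MHz vs. 400MHz)
# compared against the GLBenchmark offscreen fill results quoted earlier.
theoretical = 2 * (520 / 400)   # 2.60x
measured = 568 / 220            # ~2.58x
print(f"theoretical {theoretical:.2f}x vs. measured {measured:.2f}x")
```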

By the way, didn't it ever strike you that despite NV's sw/compiler efficiency, 64 GFLOPs of maximum theoretical floating point power is rather pitiful for a design that aims to reach up to clamshells? :rolleyes:

Why would anyone care about theoretical GFLOP throughput for a GPU used in a handheld (smartphone, tablet, clamshell) device? These products are not geared towards High Performance Computing. And besides, as we have discussed before, differences in GFLOP throughput typically do not correlate well with differences in gaming performance, particularly when comparing totally different GPU architectures ;)
 
Err... if the goal is to match PowerVR in the area of pixel/texel/Z fill rates, the game is truly lost for the competition. A PowerVR TBDR should always outperform a comparable IMR in that respect, most especially in real-world application.
 
The goal should never be to beat someone at a peak theoretical performance metric, but rather to beat someone in real world gaming experience. Anyway, Apple's GPU advantage in the tablet space (note that Apple really no longer has a significant GPU advantage in the smartphone space with Adreno 320 starting to ship this month and next month in the LG Optimus G) is temporary and is bound to significantly diminish next year as the competition moves to a totally different GPU architecture with unified shaders, increases SoC die size dedicated to the GPU, relaxes avg. power consumption limits for use in tablets/clamshell type devices, and utilizes new and improved Windows and Android operating systems.
 
The fastest Tegra 3 GPU has an operating frequency of 520MHz, while the fastest Tegra 2 GPU has an operating frequency of 400MHz.

It was 333MHz last time I checked. If there's a 400MHz T2 I stand corrected for the PS floating point difference.

So the difference in peak pixel shader performance will be 2*(520/400) = 2.60x difference. Kishonti's GLBench Fill Rate data referenced above gives 2.58x difference. So there does appear to be a reasonably good correlation. Not perfect, but reasonably good as an approximation. You are right though that the benchmark is measuring MTexels/s and not MPixels/s.
I've said it before: it would be high time for Kishonti to record GPU frequencies, just as it does CPU frequencies, in its database wherever possible for all SoCs.

Why would anyone care about theoretical GFLOP throughput for a GPU used in a handheld (smartphone, tablet, clamshell) device? These products are not geared towards High Performance Computing. And besides, as we have discussed before, differences in GFLOP throughput typically do not correlate well with differences in gaming performance, particularly when comparing totally different GPU architectures ;)
Because the majority is counting in GFLOPs these days; otherwise the claimed performance increases will never materialize. Have a second look at NV's own future Tegra roadmap and tell me where the huge (granted, per-SoC) increases are supposed to come from.

It still stands, though, that you can't that easily guess at fillrates without knowing, or at least having a reasonable guess at, the possible architectural layout.

As a current case example, the Mali400MP4 in the 32nm Exynos4 runs in current products at 440MHz, which gives it (pixel and vertex shader ALUs combined) ~15.8 GFLOPs with a texel fillrate of 1.76 GTexels/s. The upcoming next-generation T604MP4@500MHz is at 72 GFLOPs with a texel fillrate of 2.0 GTexels/s. ARM itself claims on its own website a 5x performance increase for the T604 compared to the former generation. I'm sure you wouldn't suggest that that increase is supposed to mean texel fillrates, would you?

IMG claims a >20x performance increase, but the tricky part of the creative marketing here is that they most likely compare a DX10.1 SGX545 with 4 Vec2 ALUs against a DX10.1 four-cluster GC6400 for a comparable die area. Consider that a SGX545@640MHz is at barely 10.24 GFLOPs and the GC6400 at >210 GFLOPs; it remains to be seen whether die area ends up being comparable under 28nm for the latter, which is a tough cookie for me to swallow.
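A minimal sketch of the peak-throughput arithmetic used in the last two paragraphs; the per-lane MADD (2 FLOPs) and one-texel-per-TMU-per-clock assumptions are mine, and the TMU counts are the estimates above rather than vendor-confirmed figures:

```python
def peak_gflops(alu_lanes, clock_mhz, flops_per_lane=2):
    """Peak GFLOPs assuming every lane issues a MADD (2 FLOPs) per clock."""
    return alu_lanes * flops_per_lane * clock_mhz / 1000

def peak_gtexels(tmus, clock_mhz):
    """Peak texel fillrate assuming one texel per TMU per clock."""
    return tmus * clock_mhz / 1000

print(peak_gflops(8, 640))    # SGX545 @ 640MHz, 4 Vec2 ALUs = 8 lanes -> 10.24 GFLOPs
print(peak_gtexels(4, 440))   # Mali-400MP4 @ 440MHz, 4 TMUs (assumed) -> 1.76 GTexels/s
print(peak_gtexels(4, 500))   # T604MP4 @ 500MHz, 4 TMUs (assumed) -> 2.0 GTexels/s
```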

Back to Adreno320: I believe, and would like to stand corrected, that it consists of 4*SIMD16 (or 64 SPs in desktop marketing parlance), where each cluster has a single TMU, giving 4 TMUs for a 4-cluster 320. Compared to Adreno225 (ignoring architectural differences and based only on sterile unit counts), its 8 Vec4 ALUs account for 32 SPs accompanied by 2 TMUs. Both should be clocked at 400MHz based on paper specs, so at least on paper there's nothing more to claim than a factor of 2x. Higher performance comes IMHO from the architectural advancements of the Adreno3xx family.

The only other problem is that NV manages to stay highly competitive with Adreno225 GPUs (especially for the lower-complexity shader stuff) with barely 12 SPs@520MHz vs. 32 SPs@400MHz. NV doesn't have any magic wand, nor is it designing its hw with pixie dust; it just shows how damn efficient their driver/compiler really is.
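To put rough numbers on that efficiency argument, a sketch using the unit counts above (and again assuming one MADD, i.e. 2 FLOPs, per SP per clock, which is my simplification):

```python
# Raw peak ALU throughput: Tegra 3 ULP GeForce (12 SPs @ 520MHz)
# vs. Adreno 225 (32 SPs @ 400MHz), assuming 2 FLOPs per SP per clock.
tegra3_gflops = 12 * 2 * 0.520      # ~12.5 GFLOPs
adreno225_gflops = 32 * 2 * 0.400   # ~25.6 GFLOPs
print(f"Adreno 225 raw advantage: {adreno225_gflops / tegra3_gflops:.1f}x")
# ~2x on paper, yet the two stay competitive in lower-complexity shader tests,
# which is what makes the driver/compiler efficiency point.
```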

In hindsight I'm not surprised to read the ever-repeating idiotic rumors that Qualcomm has supposedly licensed GPU IP from IMG, or that they are interested in buying AMD's graphics department, or whatever else, since various internet rumor mongers try to find a feasible "solution" for Qualcomm's GPU roadmap for >DX9L3 and stuff like WindowsRT drivers. I wouldn't suggest that Qualcomm needs solutions as radical as those proposed left and right. IMHO if Qualcomm should truly be interested in, e.g., AMD's GDP, it would rather be because they have far more ambitious future plans than just small form factor SoCs, and not to solve some silly driver/compiler issue, which is (always IMHO) a resource problem money can buy. In that regard Qualcomm is anything but stingy when it comes to constantly hiring engineering talent of all kinds; au contraire.
 
The goal should never be to beat someone at a peak theoretical performance metric, but rather to beat someone in real world gaming experience. Anyway, Apple's GPU advantage in the tablet space (note that Apple really no longer has a significant GPU advantage in the smartphone space with Adreno 320 starting to ship this month and next month in the LG Optimus G) is temporary and is bound to significantly diminish next year as the competition moves to a totally different GPU architecture with unified shaders, increases SoC die size dedicated to the GPU, relaxes avg. power consumption limits for use in tablets/clamshell type devices, and utilizes new and improved Windows and Android operating systems.

We know what Apple did with the A6/iPhone5 smartphone for now, and it'll have to stick until next year's fall (unless Apple speeds up development cycles, which I consider unlikely). Their next tablet SoC for early 2013 is still an unknown. The only safe bet at the moment would be that it'll carry custom-designed CPU cores as in the A6. IF it has something like a Series5XT MP6, that would equate to 96 USC SPs; IF a Series6 Rogue-based GPU, it'll come down to the amount of clusters, frequency and whatnot.
 
Back to Adreno320: I believe, and would like to stand corrected, that it consists of 4*SIMD16 (or 64 SPs in desktop marketing parlance), where each cluster has a single TMU, giving 4 TMUs for a 4-cluster 320. Compared to Adreno225 (ignoring architectural differences and based only on sterile unit counts), its 8 Vec4 ALUs account for 32 SPs accompanied by 2 TMUs. Both should be clocked at 400MHz based on paper specs, so at least on paper there's nothing more to claim than a factor of 2x. Higher performance comes IMHO from the architectural advancements of the Adreno3xx family.

Not completely related to the topic at hand, but an Adreno 305 is a quarter of what an Adreno 320 consists of right? In other words an Adreno 305 GPU has 1 SIMD16 cluster along with a single TMU running at the same clock as the 320. Something that should give the 305 comparable performance to an Adreno 220 I believe.
 
Not completely related to the topic at hand, but an Adreno 305 is a quarter of what an Adreno 320 consists of right? In other words an Adreno 305 GPU has 1 SIMD16 cluster along with a single TMU running at the same clock as the 320. Something that should give the 305 comparable performance to an Adreno 220 I believe.

Yes, that would be my guess. However I'm not sure whether the 305s will actually be clocked at 400MHz. If yes, I'd still think it should win by quite a healthy margin against a 220, unless the latter is clocked at 400MHz, which I don't think is the case.
 
Some 220s do have 400MHz available to the vendor. Not all Adrenos are equal when it comes to frequency and some have turbo, some don't.
 
Yeah, this backs up what I speculated some time ago: that, similar to the Adreno 205/220, the 320 would be roughly double the 305... same clock, just double the execution units.

Could still be wrong, with early drivers and no in-depth knowledge of the uarch yet, but it would make more sense than the Adreno 320 being 4x the Adreno 305, which some had speculated earlier.

It also backs up the Qualcomm slide which puts the Snapdragon S4 Plus as going up to Adreno 305... instead of the currently shipping 225. I wonder then whether the 305 is being used for Windows Phone 8 instead of the 225?
 
Snapdragon 800 (MSM8974)
  • 28nm HPm ("High Performance for mobile") fabrication process
  • Upgraded Hexagon V5 digital signal processor
  • 800MHz LPDDR3 memory, 12.8 GB/s memory bandwidth
  • Krait 400 (Quad-core) architecture running at higher clock speeds (up to 2.30 GHz)
  • Adreno 330 GPU
  • 802.11ac WiFi

Snapdragon 600 (APQ8064T, so new name for Snapdragon S4 Pro)
  • Krait 300 (Dual-core) architecture running at higher clock speeds (up to 1.90 GHz)
  • Adreno 320 GPU
  • LPDDR3 RAM

As well as Snapdragon 400 and 200, although nothing is known about them at this time.

Keynote ongoing.

AnandTech already has a bit more information about the new architectures, like L2 cache prefetchers.
 
Adreno 330 = 2X compute performance of Adreno 320 with 50% less power draw according to Qualcomm

What the AnandTech article says is that it has 2x more compute performance and 50% higher graphics performance.
 
Snapdragon 800 (MSM8974)
  • 28nm HPm ("High Performance for mobile") fabrication process
  • Upgraded Hexagon V5 digital signal processor
  • 800MHz LPDDR3 memory, 12.8 GB/s memory bandwidth
  • Krait 400 (Quad-core) architecture running at higher clock speeds (up to 2.30 GHz)
  • Adreno 330 GPU
  • 802.11ac WiFi
Sounds like an absolute beast. It also gives us some expectations for Swift's successor: I would definitely expect LPDDR3 support and TSMC's 28nm HPm process. With two fully custom designs out there now, it'll be interesting to see if one can outdo the other. I expect Rogue will still give the 330 a good thrashing.

What the AnandTech article says is that it has 2x more compute performance and 50% higher graphics performance.

Could be a 554 vs 543 situation where ALU count doubles but not TMUs*.
 
What the AnandTech article says is that it has 2x more compute performance and 50% higher graphics performance.

I never referenced the AnandTech article; I was following the keynote.

And while I'm not saying it doesn't have 50% higher graphics performance, it should be mentioned that Anand is the source behind this; Qualcomm's official press release only mentions 2X compute performance over the 320.

http://www.qualcomm.com/media/relea...neration-snapdragon-premium-mobile-processors
 
Could be a 554 vs 543 situation where ALU count doubles but not shaders.

How can you double ALU count on any architecture and not shaders, especially with USC ALUs like in Adrenos or SGX? If you meant TMUs, for instance, then yes, it could very well be the case here also.
 