Qualcomm Krait & MSM8960 @ AnandTech

Yes, that's mighty impressive for an Adreno 205 replacement, remembering once again that these are very early drivers.

Also, the Adreno 320 is likely clocked much lower than the 225, with it being a quad (4x305?). I think they have at least more than doubled the shaders and sorted out that dodgy compiler.
 
Yes, that's mighty impressive for an Adreno 205 replacement, remembering once again that these are very early drivers.

Also, the Adreno 320 is likely clocked much lower than the 225, with it being a quad (4x305?). I think they have at least more than doubled the shaders and sorted out that dodgy compiler.

If they have one single compiler for all those variants (most likely, unless 3xx ALUs are "scalar" or fundamentally different from 2xx), then any compiler and/or driver changes will most likely affect all variants, as long as Qualcomm's partners update the software accordingly.
 
Yes, that's mighty impressive for an Adreno 205 replacement, remembering once again that these are very early drivers.

Also, the Adreno 320 is likely clocked much lower than the 225, with it being a quad (4x305?). I think they have at least more than doubled the shaders and sorted out that dodgy compiler.

If 320 were nothing more than a quad 305, it would mean that 305 is a GPU with 4 Vec4 ALUs and 1 TMU that still manages to almost match the performance of 225, which has twice the number of units. Magic?
It seems unlikely that 320 is simply quad 305.
 
Well, I don't have the answers obviously, but that's seriously flawed logic.

It's a new architecture with, very likely, a much improved compiler and more efficient shaders. Anand, Arun, Ailuros et al. all pointed to the compiler as a serious bottleneck in previous designs. Plus, we don't have clocks, do we?

My take is that it is a quad 305 just clocked at half speed for smartphones. I don't know that; I'm just throwing it out there.
 
My take is that it is a quad 305 just clocked at half speed for smartphones. I don't know that; I'm just throwing it out there.

Ok, even if we assume that 305 is 1/4th of the full 320 and has half the units 225 has, how can it almost match its performance? It would mean that with the same number of units as 205 they achieved the performance of their current best GPU - 225.
Unless they drastically redesigned their compiler and/or drivers I can't see how something like that is possible.
 
For a start, we don't know 305 performance in comparison to Adreno 225, as 225 was vsync limited?

How do you know how many TMUs and ROPs 305 has? Bus speed? Clock speed? Shaders? It's impossible to tell from what I have seen.

Anyway, there are other ways to increase GPU performance beyond just raw execution units, many of which are beyond my understanding. Cache coherency, dedicated VRAM, and the improved design of said execution units have to be examples; just look at the improvement AMD got by moving to GCN...

Companies usually think things through, such as coherent naming schemes and marketing.
Qualcomm has already stated that 305 is a single core and 320 is a quad core, so there is obviously a duplication of some kind of execution units from 305 to arrive at that 'quad core' naming. I'm taking a guesstimate and saying 320 = 4x305, lower clocked. After all, 320 has been benched in smartphone guise nearly matching Apple A5X tablet performance on early drivers, and the A5X, as you know, is a moderately clocked quad core.

I don't think what I have suggested is that far-fetched...
 
For a start, we don't know 305 performance in comparison to Adreno 225, as 225 was vsync limited?
At the same resolution as the One S it almost matched its results, so even in this state it can be considered almost as fast, if not as fast with proper drivers.

How do you know how many TMUs and ROPs 305 has? Bus speed? Clock speed? Shaders? It's impossible to tell from what I have seen.
If 320 is really a quad 305, and 320 has 4 TMUs and 16 Vec4 ALUs as was suggested by Ailuros (and confirmed by metafor), then 305 should have 1/4th of its theoretical performance and half of the theoretical performance of 225.
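
To put rough numbers on that reasoning, here is a minimal sketch. The unit counts are the assumptions floated in this thread (320 as 16 Vec4 ALUs + 4 TMUs, 225 as 8 Vec4 ALUs + 2 TMUs), not confirmed specs; the "quad 305" split and the helper names are hypothetical, and clocks, drivers and compiler quality are ignored entirely.

```python
# Unit counts as assumed in this thread, not confirmed by Qualcomm.
ADRENO_320 = {"vec4_alus": 16, "tmus": 4}
ADRENO_225 = {"vec4_alus": 8, "tmus": 2}


def divide_units(gpu, parts):
    """Hypothetical: split a GPU's units evenly into `parts` identical cores."""
    return {name: count // parts for name, count in gpu.items()}


def unit_ratio(a, b):
    """Per-unit-type ratio a/b (says nothing about clocks or drivers)."""
    return {name: a[name] / b[name] for name in a}


# What a 305 would look like if 320 really were "4x305".
adreno_305_guess = divide_units(ADRENO_320, 4)

print(adreno_305_guess)                          # {'vec4_alus': 4, 'tmus': 1}
print(unit_ratio(adreno_305_guess, ADRENO_320))  # 0.25 for both unit types
print(unit_ratio(adreno_305_guess, ADRENO_225))  # 0.5 for both unit types
```

On paper that is a quarter of the 320 and half of the 225, which is why near-225 results would need higher clocks and/or a much better compiler to explain.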

Qualcomm has already stated 305 is a single core and 320 is a quad core, so there is obviously a duplication of some kind of execution units from 305 to come to that 'quad core' naming,(...)
And nvidia's GPU is 12 core as they call it, all depends on the naming scheme.
I think that the difference between 305 and 320 is the number of TMUs. With 1 TMU, higher clocks than 320 and, of course, improved drivers and compilers, that kind of performance is achievable. Although I don't know if the 3xx architecture allows decoupling ALUs from TMUs, or if it would even make sense. But it would allow the 305 to be nothing more than a scaled-down 320 with 3 TMUs 'cut off'.

It's nothing more than my assumptions and loose thoughts but it's just that I can't agree with what you're saying :)
 
Well, it seems I'm not up to date, or perhaps it was some time ago and I have forgotten? Anyway, your conclusions do make sense.

It would be more efficient for them to do as you have suggested rather than bolt together 4 GPUs à la A5X...
 
If 320 is really a quad 305, and 320 has 4 TMUs and 16 Vec4 ALUs as was suggested by Ailuros (and confirmed by metafor), then 305 should have 1/4th of its theoretical performance and half of the theoretical performance of 225.


And nvidia's GPU is 12 core as they call it, all depends on the naming scheme.

Here comes the obvious objection and for NV's marketing, not for the above. Assuming the 320 truly has 16 Vec4 (USC) then that's in NV's marketing parlance equal to "64 cores".

It would be more efficient for them to do as you have suggested rather than bolt together 4 GPUs à la A5X...

That comes down to implementation and application workload; scaling clusters instead of entire cores obviously saves quite a bit of die area (redundancy), but without knowing what exactly has been scaled and how, I'd say that the assumption above is a tad premature. Adrenos, having USC ALUs, obviously don't have the theoretical (for today's applications) shortcoming of ARM's Malis, which scale fragment cores and not vertex shader units with MPs, but, for example, a somewhat weak triangle setup unit (or units) could limit quite a few things. Note that I honestly don't know what 320 really looks like, but the Series5XT cores scale geometry between cores at 95%. That's just one example; another would be that the MP4 in A5X has a total of 64 z/stencil units, which helps quite a bit with multisampling performance, for instance. I don't expect IMG themselves to have left as much relative "redundancy" in Rogue, since they're scaling clusters instead of cores this time, but it remains that Series5XT MPs have the downside of added die area redundancy due to full core scaling and the possible advantage of additional units here and there. What it then comes down to is whether any of that redundancy can turn into an advantage in today's mobile applications.

In any case, applications like GLBenchmark 2.1 are obviously not stressful enough for GPUs like the 320 or the A5X MP4 to show their real strengths, and yes, as always, IMHO.
 
Here comes the obvious objection and for NV's marketing, not for the above. Assuming the 320 truly has 16 Vec4 (USC) then that's in NV's marketing parlance equal to "64 cores".

Correct. And that's something I hinted at in my naming scheme comment: the Adreno 220 and 225 were never called '32' cores, they were called single core.

Although it is entirely possible that Qualcomm has changed strategy and is going another route - IMG Tech 'clusters', anyone?
 
Correct. And that's something I hinted at in my naming scheme comment: the Adreno 220 and 225 were never called '32' cores, they were called single core.

Although it is entirely possible that Qualcomm has changed strategy and is going another route - IMG Tech 'clusters', anyone?

Adreno 2xx variants were already scaling clusters afaik; that might be an indication of what 320 might be, but it's no guarantee either until further details appear. What you can go by for now is that in 2xx cores there's 1 TMU available for each 4 Vec4 cluster. Adreno 225 has 8 Vec4 + 2 TMUs, which is exactly why I assume that 320 might be 16 Vec4/4 TMUs. We "know" for now that the latter has 4 TMUs, but there's no indication so far of what the ALUs really look like. Heck, for efficiency reasons (both on the hw and the sw/compiler level) the 320 might contain 64 "scalar" ALUs, which would be a reasonable explanation for why it excels as much as it does compared to its 2xx predecessors. My current bet is that there are fundamental changes in Qualcomm's 3xx family of Adreno GPUs; otherwise it would be another silly marketing stunt to label it a new generation.

Another thing would be that if Adreno 3xx is Halti compliant, Qualcomm (irrespective of its possible efficiency ratio against Rogue, T6xx, Wayne etc.) is again at the forefront of execution against its competition.
 
Yeah, technically they started this 'cluster' idea in mobiles but didn't use dodgy marketing, which I applaud them for.

Something has changed dramatically since the 2xx series; it may be more than marketing names.

Adreno 3xx is Haiti.
 
I thought all Adrenos from 200 onwards had a Vec4+Scalar shader arrangement, similar to Xenos.
If so, wouldn't it be correct to say that Adreno 225 has "40 cores" (using nvidia's marketing nomenclature)?
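
For reference, the "cores" arithmetic being thrown around can be sketched like this. It assumes the marketing count simply treats every ALU lane as a "core", which is how the numbers in this thread are being derived, not an official definition; `marketing_cores` is a made-up helper name.

```python
def marketing_cores(vec4_units, scalar_per_unit=0):
    """Count every ALU lane as a 'core': 4 lanes per Vec4 unit,
    plus any scalar lanes paired with each unit (an assumption)."""
    return vec4_units * (4 + scalar_per_unit)


print(marketing_cores(8, scalar_per_unit=1))  # 40: Adreno 225 as 8x (Vec4 + scalar)
print(marketing_cores(8))                     # 32: Adreno 225 counting Vec4 lanes only
print(marketing_cores(16))                    # 64: a 16x Vec4 Adreno 320, if that guess holds
```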
 
I know, that is what I meant. However, are you sure it's not Haiti??

I'm sure. There was a moment when Wikipedia falsely called it Haiti before someone corrected it. Even now some websites call it Haiti, but that probably happens because they don't see the dot on the i :p

Even when you check presentations and documents by different GPU vendors you'll see that they call it Halti, and not Haiti.
 
I thought all Adrenos from 200 onwards had a Vec4+Scalar shader arrangement, similar to Xenos.
If so, wouldn't it be correct to say that Adreno 225 has "40 cores" (using nvidia's marketing nomenclature)?

Depends on what exactly you mean by "scalar"; and no, SFU units aren't capable of any floating point operations in this particular case ;)

***edit:

http://www.anandtech.com/show/4940/qualcomm-new-snapdragon-s4-msm8960-krait-architecture/3

[strike] Each 4-wide vector unit is capable of a maximum of 8 MADs per clock, while each scalar unit is similarly capable of 2 MADs per clock. That works out to 160 floating point operations per clock, or 32 GFLOPS at 200MHz.[/strike] Update: Qualcomm has clarified the capabilities of its 4-wide Vector ALUs. Similar to the PowerVR SGX 543, each 4-wide vector ALU is capable of four MADs (one per component). The scalar units cannot be combined to do any MADs, although they are helpful we haven't really been tracking those in this table (IMG has something similar) so we've excluded them for now.
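
The difference between the struck-out figure and the corrected one is easy to check by hand. A minimal sketch, assuming 8 vector units (the 8 Vec4 count given for Adreno 225 earlier in the thread) and, for the struck-out case, 8 scalar units, which is inferred from the quoted 160 flops/clock rather than stated anywhere; a MAD is counted as 2 floating point operations, and the function names are mine.

```python
def flops_per_clock(vec_units, mads_per_vec, scalar_units=0, mads_per_scalar=0):
    """Peak flops per clock, counting each MAD as 2 operations (multiply + add)."""
    mads = vec_units * mads_per_vec + scalar_units * mads_per_scalar
    return mads * 2


def gflops(flops_clk, clock_mhz):
    """Peak GFLOPS at the given clock in MHz."""
    return flops_clk * clock_mhz / 1000


# Struck-out claim: 8 MADs per vector unit, 2 MADs per scalar unit.
original = flops_per_clock(8, 8, scalar_units=8, mads_per_scalar=2)
print(original, gflops(original, 200))    # 160 flops/clock, 32 GFLOPS at 200MHz

# After the update: 4 MADs per vector unit, scalar units excluded.
corrected = flops_per_clock(8, 4)
print(corrected, gflops(corrected, 200))  # 64 flops/clock, 12.8 GFLOPS at 200MHz
```

So by the corrected numbers the theoretical peak is 12.8 GFLOPS at 200MHz instead of 32, under the assumptions above.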
 