Qualcomm Krait & MSM8960 @ AnandTech

metafor said:
Interestingly enough, the HTC One S's benchmark results are worse than those of a 1.5GHz A33 Tegra, although Anand didn't have time to run Basemark (which Adreno performs great in), nor Vellamo and Linpack, which Krait excels in.

But I do find it disturbing that there is quite a difference between the MDP and OEM software. Hopefully other manufacturers will improve on this.

About Linpack: the version Anand uses is not indicative of the computational power of an SoC. It is written in Java, so performance depends strongly on the VM. Secondly, it's not cache-blocked, even though all practically used linear algebra libraries are. This means that it's a memory bandwidth test rather than a computational throughput test. This is why Tegra 3 single-threaded to multi-threaded performance doesn't scale as well as you would expect. This is also why the GFLOPS numbers they get are 10-50x lower than the SoC's peak GFLOPS, which is unexpected for people used to seeing Linpack run at 80-95% of peak.
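To make the cache-blocking point concrete, here's a minimal C sketch (not the benchmark's actual code, which is Java): the naive triple loop streams whole rows and columns from DRAM for every output, while the blocked version reuses a small tile that fits in cache and can actually approach the core's FP throughput.

```c
/* Minimal sketch contrasting a naive triple loop with a cache-blocked
 * version of C += A * B. The naive loop re-reads the matrices from memory
 * constantly and is bandwidth-bound; the blocked loop works on tiles that
 * stay resident in L1/L2. Tile size is a placeholder to tune per SoC. */
#include <stddef.h>

#define BLOCK 64

static inline size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

void matmul_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double acc = 0.0;
            for (size_t k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j]; /* strided, cache-hostile */
            C[i * n + j] += acc;
        }
}

void matmul_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t kk = 0; kk < n; kk += BLOCK)
            for (size_t jj = 0; jj < n; jj += BLOCK)  /* BLOCK x BLOCK tiles */
                for (size_t i = ii; i < min_sz(ii + BLOCK, n); i++)
                    for (size_t k = kk; k < min_sz(kk + BLOCK, n); k++) {
                        double a = A[i * n + k];      /* reused across the j loop */
                        for (size_t j = jj; j < min_sz(jj + BLOCK, n); j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```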

Memory bandwidth and Java VM benchmarks have their place, but I'm really tired of people claiming this Linpack benchmark has any relevance to computational throughput, and I'm disappointed that Anand hasn't yet figured out what this benchmark is really measuring.
 
Interestingly enough, the HTC One S's benchmark results are worse than those of a 1.5GHz A33 Tegra, although Anand didn't have time to run Basemark (which Adreno performs great in), nor Vellamo and Linpack, which Krait excels in.

But I do find it disturbing that there is quite a difference between the MDP and OEM software. Hopefully other manufacturers will improve on this.

HTC has always been known for phones that are slower than they could be due to heavy customization, and past benchmarks have proven this.
Just look at the Coolpad 9900 on GLBenchmark; it has the highest score of any Adreno 220 device.
I would suspect that companies like Asus do a better job of releasing optimized devices.

And to tell you the truth, I prefer devices that try to be competitive through technical advancements rather than UI skins like HTC Sense. Although that was slightly OT ;)
 
About Linpack: the version Anand uses is not indicative of the computational power of an SoC. It is written in Java, so performance depends strongly on the VM. Secondly, it's not cache-blocked, even though all practically used linear algebra libraries are. This means that it's a memory bandwidth test rather than a computational throughput test. This is why Tegra 3 single-threaded to multi-threaded performance doesn't scale as well as you would expect. This is also why the GFLOPS numbers they get are 10-50x lower than the SoC's peak GFLOPS, which is unexpected for people used to seeing Linpack run at 80-95% of peak.

Memory bandwidth and Java VM benchmarks have their place, but I'm really tired of people claiming this Linpack benchmark has any relevance to computational throughput, and I'm disappointed that Anand hasn't yet figured out what this benchmark is really measuring.

I agree with what you're saying, but you also have to realize that most Android apps run on that Java-based VM. In other words, an SoC that performs better in Linpack on Android will also perform better in certain Android apps. That said, I have no idea how many apps this will matter for, or how much difference it will make for them.

Edit: In other words, Linpack does tell you a little about how an SoC will perform, in certain respects, compared to other SoCs on the Android platform.
 
Memory bandwidth and Java VM benchmarks have their place, but I'm really tired of people claiming this Linpack benchmark has any relevance to computational throughput,
Couldn't agree more. Even the home page of the app says so. It looks like reviewers don't understand what they are running; they just report numbers without thinking any further :rolleyes:

and I'm disappointed that Anand hasn't yet figured out what this benchmark is really measuring.
I've lost all faith in Anand since the Medfield presentation.
 
About Linpack: the version Anand uses is not indicative of the computational power of an SoC. It is written in Java, so performance depends strongly on the VM. Secondly, it's not cache-blocked, even though all practically used linear algebra libraries are. This means that it's a memory bandwidth test rather than a computational throughput test. This is why Tegra 3 single-threaded to multi-threaded performance doesn't scale as well as you would expect. This is also why the GFLOPS numbers they get are 10-50x lower than the SoC's peak GFLOPS, which is unexpected for people used to seeing Linpack run at 80-95% of peak.

Memory bandwidth and Java VM benchmarks have their place, but I'm really tired of people claiming this Linpack benchmark has any relevance to computational throughput, and I'm disappointed that Anand hasn't yet figured out what this benchmark is really measuring.

Based on profiling, there are a few things that affect the Linpack build currently on the Android Market. One, as you mentioned, is memory bandwidth. But the other is multiply->divide->add latency. Not throughput, latency. The VM seems to love chaining and reusing registers, go figure. This is partially why Krait/Scorpion score so well in it, and also why Krait has leaped significantly even though its peak FP throughput isn't really higher than Scorpion's.

So while it's not completely computationally bound, nor anywhere close to throughput bound, it is certainly representative of the kind of FP performance people who slap together an Android program will get without optimization.
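As a rough illustration of the latency-vs-throughput distinction, here's a C sketch of the dependency pattern (not the VM's actual output): when every multiply/divide/add feeds the next through the same register, the loop runs at the speed of the chained latencies; split the work into independent chains and the same operations overlap in the pipeline.

```c
/* Sketch of latency-bound vs throughput-bound FP code; purely illustrative. */
#include <stddef.h>

/* Every iteration's multiply -> divide -> add feeds the next one through
 * 'acc', so the core mostly waits on result latency. */
double chained(const double *x, size_t n)
{
    double acc = 1.0;
    for (size_t i = 0; i < n; i++)
        acc = acc * x[i] / (x[i] + 1.0) + 0.5;
    return acc;
}

/* Two independent chains let the same mix of operations overlap in the
 * pipeline, so issue throughput rather than result latency sets the pace. */
double interleaved(const double *x, size_t n)
{
    double a0 = 1.0, a1 = 1.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        a0 = a0 * x[i]     / (x[i]     + 1.0) + 0.5;
        a1 = a1 * x[i + 1] / (x[i + 1] + 1.0) + 0.5;
    }
    for (; i < n; i++)                      /* odd-length tail */
        a0 = a0 * x[i] / (x[i] + 1.0) + 0.5;
    return a0 + a1;
}
```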
 
So while it's not completely computationally bound, nor anywhere close to throughput bound, it is certainly representative of the kind of FP performance people who slap together an Android program will get without optimization.
Good point, but OTOH wouldn't one expect that a computationally bound program would use native code through NDK?
 
Good point, but OTOH wouldn't one expect that a computationally bound program would use native code through NDK?

Honestly, I'm quite surprised ARM doesn't offer a math library with standard matrix operations that are NEON optimized (maybe they do?). I'm surprised none of Google's Android partners have pushed that as a standard API call in Android either.

It's such a no-brainer even if we're just talking about using one single NEON lane vs VFP unless you really really need denormal/rounding support.
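For what it's worth, a routine like this is about all it would take on the NDK side. This is just a hypothetical sketch with NEON intrinsics, not an existing ARM or Android API:

```c
/* Hypothetical NDK-side helper using NEON intrinsics (not an existing ARM
 * or Android library call): y[i] += a * x[i], four lanes at a time, with a
 * scalar tail loop. Build for armeabi-v7a with -mfpu=neon. */
#include <arm_neon.h>
#include <stddef.h>

void saxpy_neon(size_t n, float a, const float *x, float *y)
{
    float32x4_t va = vdupq_n_f32(a);        /* broadcast the scalar */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        float32x4_t vx = vld1q_f32(x + i);
        float32x4_t vy = vld1q_f32(y + i);
        vy = vmlaq_f32(vy, va, vx);         /* vy += va * vx, 4 lanes */
        vst1q_f32(y + i, vy);
    }
    for (; i < n; i++)                      /* scalar tail */
        y[i] += a * x[i];
}
```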
 
It's such a no-brainer even if we're just talking about using one single NEON lane vs VFP unless you really really need denormal/rounding support.
As you wrote, the main issue is non-IEEE compliance. Also, if you use a single NEON lane, on A9 the VFP would be just as fast, I guess (A8 is a different story, but in that case, if you don't use the hardfp ABI, you'd be killed by very slow transfers between NEON and integer registers).
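To make the ABI point concrete, here's a trivial sketch (the flag names are the standard GCC ones; the transfer cost is the A8 behaviour described above):

```c
/* Trivial sketch of where the calling convention matters. With
 * -mfloat-abi=softfp the float argument arrives in r0 and the result must
 * be returned in r0, so every call forces a transfer between the NEON/VFP
 * register file and the integer registers -- very slow on Cortex-A8.
 * With -mfloat-abi=hard both stay in s0 and no transfer is needed. */
float scale(float x)
{
    return x * 1.5f;
}
```

In a tight loop of tiny calls like this, the transfer penalty can easily dominate on A8, which is why the ABI choice matters so much there.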
 
As you wrote, the main issue is non-IEEE compliance. Also, if you use a single NEON lane, on A9 the VFP would be just as fast, I guess (A8 is a different story, but in that case, if you don't use the hardfp ABI, you'd be killed by very slow transfers between NEON and integer registers).

I haven't read up on the optimization manual but don't VFP variants have higher latency due to normalization/rounding stages compared to their NEON counterparts? Or do NEON ops go through and exit at the same stages?

Are most of ARM's software developers concerned at all about IEEE denormal/round/NaN handling? Is anyone? Wouldn't that effectively mean GPGPU is a no-go?

Edit: Just looked and it appears instruction timing is the same on A9. Couldn't find A15 info.
 
I haven't read up on the optimization manual but don't VFP variants have higher latency due to normalization/rounding stages compared to their NEON counterparts? Or do NEON ops go through and exit at the same stages?
On A9 VFP ADD has a latency of 4, while NEON FP ADD latency is 5. FP MUL is 5 cycles in both cases.
VFP instruction timing
Advanced SIMD floating-point instructions

Are most of ARM's software developers concerned at all about IEEE denormal/round/NaN handling? Is anyone? Wouldn't that effectively mean GPGPU is a no-go?
Didn't GPGPU start to be really used when it got IEEE compliance? At least that's the case for HPC.

I agree that the lack of some parts of IEEE isn't critical for all apps, but I've had to deal with FP computations that went wrong, and I'm glad I had some of IEEE's more advanced features.

Edit: Just looked and it appears instruction timing is the same on A9. Couldn't find A15 info.
Didn't find the same it seems :smile:
 
About Linpack: the version Anand uses is not indicative of the computational power of an SoC. It is written in Java, so performance depends strongly on the VM. Secondly, it's not cache-blocked, even though all practically used linear algebra libraries are. This means that it's a memory bandwidth test rather than a computational throughput test. This is why Tegra 3 single-threaded to multi-threaded performance doesn't scale as well as you would expect. This is also why the GFLOPS numbers they get are 10-50x lower than the SoC's peak GFLOPS, which is unexpected for people used to seeing Linpack run at 80-95% of peak.

Memory bandwidth and Java VM benchmarks have their place, but I'm really tired of people claiming this Linpack benchmark has any relevance to computational throughput, and I'm disappointed that Anand hasn't yet figured out what this benchmark is really measuring.

Personally, I've always found running LinPack with successively larger matrices to be a useful way to assess the memory subsystem of a processor. Now, for most of the stuff we do, the main-memory-bound limit is what is actually indicative, but then again, that is not always the case, and the behavior as the matrix size grows is not always as straightforward as one might guess.
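A minimal sketch of that methodology (a naive O(n³) kernel stands in for LinPack's factorization here; the shape of the GFLOPS curve as n grows, not the absolute numbers, is what's interesting):

```c
/* Time the same naive kernel at successively larger n and watch effective
 * GFLOPS fall off as the working set spills out of L1, then L2, then DRAM. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static void matmul(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            double a = A[i * n + k];
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j];
        }
}

int main(void)
{
    for (size_t n = 64; n <= 1024; n *= 2) {
        double *A = calloc(n * n, sizeof *A);
        double *B = calloc(n * n, sizeof *B);
        double *C = calloc(n * n, sizeof *C);
        if (!A || !B || !C) return 1;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        matmul(n, A, B, C);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("n=%4zu  %.2f GFLOPS\n", n, 2.0 * n * n * n / secs / 1e9);

        free(A); free(B); free(C);
    }
    return 0;
}
```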
 
On A9 VFP ADD has a latency of 4, while NEON FP ADD latency is 5. FP MUL is 5 cycles in both cases.
VFP instruction timing
Advanced SIMD floating-point instructions

Interesting. Why is the NEON latency higher? Staggered register write/read for multiple lanes? Or a renaming issue?

Didn't GPGPU start to be really used when it got IEEE compliancy? At least that's the case for HPC.

IIRC, no GPU today is truly IEEE compliant. They began being used heavily when they supported rounding and 32-bit SP throughout the entire pipeline. But I don't think denormal support was ever important or used very much. Denormals also happen to be the most costly to support in hardware.
 
metafor said:
IIRC, no GPU today is truly IEEE compliant. They began being used heavily when they supported rounding and 32-bit SP throughout the entire pipeline. But I don't think denormal support was ever important or used very much. Denormals also happen to be the most costly to support in hardware.

Fermi fully supports denormals. As does Cypress.
 
Interesting. Why is the NEON latency higher? Staggered register write/read for multiple lanes? Or a renaming issue?

NEON and VFP don't have renaming on Cortex-A9. Outside of load/store NEON is the same as on Cortex-A8. VFP doesn't share any logic with it.

We can see both listed as explicitly needing four pipeline stages in Cortex-A8 pipeline diagrams, and its cycle timings list a latency of four cycles for all NEON F32x2 instructions except vmla/vmls and whatever needs stage N1 to broadcast a scalar operand, and another full cycle for F32x4. Apparently on Cortex-A9 that turned into a 5 cycle base with no penalty for needing N1, so I guess all inputs are needed at N1 (could have something to do with the way it reorganized the NEON interface).

I assume VFP's add being one cycle shorter than its multiply is just an optimization that didn't make it into NEON, on account of them not changing NEON's internal design.
 
Interesting. Why is the NEON latency higher? Staggered register write/read for multiple lanes? Or a renaming issue?
As Exophase pointed out this can't be a renaming issue. There are two possibilities: more aggressive data forwarding in the VFP or a documentation typo ;)
 
HTC has always been known for phones that are slower than they could be due to heavy customization, and past benchmarks have proven this.
Just look at the Coolpad 9900 on GLBenchmark; it has the highest score of any Adreno 220 device.
I would suspect that companies like Asus do a better job of releasing optimized devices.

And to tell you the truth, I prefer devices that try to be competitive through technical advancements rather than UI skins like HTC Sense. Although that was slightly OT ;)

Well, if a benchmark is a good one, it only properly stresses the hardware and doesn't really bring UI customization into play. Looking at any of these results from GLBenchmark or Basemark, you'll see that HTC phones aren't really doing badly, even with Sense UI on them.
 
Well, if a benchmark is a good one, it only properly stresses the hardware and doesn't really bring UI customization into play. Looking at any of these results from GLBenchmark or Basemark, you'll see that HTC phones aren't really doing badly, even with Sense UI on them.

I looked at different HTC devices with Adreno 220 and compared them to devices made by other companies, and HTC was always slower. Of course, I'm only talking about GLBenchmark; I haven't done a similar comparison with Basemark.
Have a look for yourself here and you'll see what I mean.
If it's not the GUI, then it's something with the drivers HTC is using.
 
I looked at different HTC devices with Adreno 220 and compared them to devices made by other companies, and HTC was always slower. Of course, I'm only talking about GLBenchmark; I haven't done a similar comparison with Basemark.
Have a look for yourself here and you'll see what I mean.
If it's not the GUI, then it's something with the drivers HTC is using.

The results seem aligned with the devices' screen resolutions. Given that GLBenchmark can't run its offscreen test on those devices, there isn't much way of comparing.
 
The results seem aligned with the devices' screen resolutions. Given that GLBenchmark can't run its offscreen test on those devices, there isn't much way of comparing.

The Coolpad 9900 is almost 20% faster than the HTC x710a Raider (currently the highest-scoring HTC handset) at a very similar resolution.
And if you compare only the scores of HTC devices running at 960x540, you'll see almost a 30% difference between devices on the same platform, so there has to be something wrong with the drivers HTC is using.
Either they don't update them on old devices, or they don't optimize their system configuration.

As Anand said in his MSM8960 preview, there was a massive difference between the scores from the MSM8660 and final devices.
So even Qualcomm acknowledged that there was a problem with drivers and the final performance of shipping products.
 