Adreno 430 performance preview at Anandtech

It was mainly Josh's work; I just helped finish the article and contributed on Qualcomm's EAS and energy idle drivers.

Anyway, connect the dots between the resulting memory performance and the rumours:

http://browser.primatelabs.com/geekbench3/compare/1874253?baseline=1887333
http://www.androidauthority.com/snapdragon-810-overheating-issues-579284/

Tom's calls it outright:
Digging a little deeper we discovered that the device was actually a pre-production unit, and then confirmed with Qualcomm that pre-production samples were running the memory bus at half-speed. While this sounds alarming, it’s actually quite common when working with a new type of memory—LPDDR4 in the case of the 810.
 
So far so good, but where's the battery life/power consumption/throttling part of the review?

***edit: don't answer that...I was just told that you had around an hour with the device? Gosh....
 
So any result could be bandwidth-limited...
In the sense that it could perform better in bandwidth-limited scenarios, sure, although LPDDR4 ensures that in an absolute sense bandwidth doesn't show much change versus its predecessor. But the more general issue is that main memory latency is awful.
This doesn't show up all that much in a benchmarking environment dominated by largely cache-resident core tests and relatively latency-insensitive graphics benches. It would have a larger impact in everyday use cases, in an environment where there is a lot more going on than in controlled benchmarking, and with more realistic data sets.
 
It's a preview. Josh/the media only had a few hours with the device.

Any idea if this is the original or the supposedly "fixed" revision of the S810?

We saw the same thing with the Exynos 5433 vs 5430 (A57 vs A15): higher latency but also higher bandwidth (both have 64-bit, 825 MHz LPDDR3 though). Since we're seeing a similar situation here with A57 vs Krait, I wonder if this has something to do with the architecture of the A57 itself?

Higher bandwidth does not seem to be helping performance all that much even in benchmarks though. Looking at the Exynos 7420 vs 5433 (the 7420 uses LPDDR4), the Geekbench scores are ~15% and ~10% higher for single-core and multi-core respectively. If you normalize for clocks (2.1 vs 1.9 GHz) this reduces to ~5% and ~0%. Link - 5433 vs 7420 on Geekbench.
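
For anyone wanting to redo the clock normalization, here is a minimal sketch. It assumes performance scales linearly with CPU clock, which is a simplification, and the function name is made up for illustration. The inputs are the rounded percentages and clocks quoted above, so the results are only approximate.

```python
def clock_normalized_gain(raw_gain, new_clock_ghz, old_clock_ghz):
    """Return the score gain left over after dividing out the clock ratio.

    raw_gain is a fraction (0.15 means 15% higher score).
    Assumes score scales linearly with clock, which is a simplification.
    """
    clock_ratio = new_clock_ghz / old_clock_ghz
    return (1.0 + raw_gain) / clock_ratio - 1.0


# Rounded figures from the post: Exynos 7420 (2.1 GHz) vs 5433 (1.9 GHz)
single_core = clock_normalized_gain(0.15, 2.1, 1.9)
multi_core = clock_normalized_gain(0.10, 2.1, 1.9)

print(f"single-core, clock-normalized: {single_core:+.1%}")
print(f"multi-core, clock-normalized:  {multi_core:+.1%}")
```

With these rounded inputs the single-core gain comes out at roughly 4% and the multi-core gain at roughly zero, in line with the ~5% and ~0% estimate.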
 
Given the memory performance, the initial reports pointing to a broken memory controller, and the continued overheating reports from the media on the Flex2, I doubt it's the fixed version.
 
Is Adreno 3xx/4xx a scalar architecture? What is its detailed architecture? Thx very much.

I'm not the best person to answer for Adrenos, but yes, it has so-called "scalar" ALUs. The Adreno 330 has 8*SIMD16 and after that, with 4xx, I lost track; the 420 could be a 12*SIMD16 config at a slightly lower clock than the peak Adreno 330 frequency, but that's just my own speculation.
 
I thought the Adrenos were Vec4+Scalar like the X360, hence the former Imageon nickname being mini-xenos?
 
Afaik up to Adreno 2xx yes; starting from Adreno 3xx though they moved to SIMD. ARM Mali and Vivante GPU IP still have vector ALUs.
 
Thx Ailuros, so it's more like AMD GCN, i.e. the SIMD16 in the CU? And as far as I know the Mali Midgard ALU pipeline is "vec4 + MADD scalar ALU with a big scalar ALU (MADD and SFU)"; is that correct? Thx very much.
 
If you oversimplify things, yes, you could say that Adreno >=3xx ALUs are closer to today's desktop architectures.

Adreno 330 is afaik 8*SIMD16, meaning at 600MHz: 8 SIMDs * 16 lanes * 2 FLOPs/lane * 0.6 GHz = 153.60 GFLOPs FP32

A recent Mali is a wee bit more complicated than even past Vec4 ALUs in other GPUs. For each cluster you have 2 pipelines (and 1 TMU); in each pipeline you have 2 Vec4 + SFU (special function unit), in other words 17 FLOPs theoretical peak per pipeline or 34 FLOPs per cluster. Or, to be a bit more realistic, since the SFUs are obviously busy with special function ops rather than sitting around idle: 16 FLOPs/pipeline or 32 FLOPs/cluster.

For a Mali T760 MP6 @ 700MHz you have:

6 * [2 * (4*4)] * 0.7 GHz = 134.40 GFLOPs FP32

Other GPUs have SFUs too so it's rather silly to count those.
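
The two peak numbers above are simple products, so here is a small sketch that reproduces them. The unit counts and clocks are the ones given in this post (8*SIMD16 for Adreno 330, 32 FLOPs per cluster for the T760 with SFUs left out); the function and variable names are just for illustration.

```python
def peak_gflops(flops_per_clock, clock_ghz):
    """Theoretical FP32 peak: FLOPs issued per clock times clock in GHz."""
    return flops_per_clock * clock_ghz


# Adreno 330 as described in this post: 8 SIMDs * 16 lanes * 2 FLOPs (MADD)
adreno_330_flops_per_clock = 8 * 16 * 2

# Mali T760 MP6 as described in this post:
# 6 clusters * 2 pipelines * (4 * 4) FLOPs, SFUs not counted
mali_t760_mp6_flops_per_clock = 6 * 2 * (4 * 4)

adreno_330 = peak_gflops(adreno_330_flops_per_clock, 0.6)
mali_t760_mp6 = peak_gflops(mali_t760_mp6_flops_per_clock, 0.7)

print(f"Adreno 330 @ 600MHz:    {adreno_330:.2f} GFLOPs FP32")
print(f"Mali T760 MP6 @ 700MHz: {mali_t760_mp6:.2f} GFLOPs FP32")
```

These are paper peaks only; they say nothing about how close real shaders get to them.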
 
Those aren't good descriptions of the Mali or Adreno shader cores, unfortunately. For FP32:

Adreno 330 is 4*SIMD32 multiply-add.
Midgard in T760 is vec4 MADD + scalar ADD, plus a 4-wide dot product and another scalar flop. 9 flops in the first part of the pipe, 8 in the second.

Peak is reasonably easy to get close to on Adreno. Only Cthulhu himself knows how to get the Midgard shader compiler to emit something at peak utilisation.
 
*keeps notes* thank you :)
Malis still have 1 TMU per cluster; for Adreno330 it's 2 TMUs/SIMD or am I wrong again?
 
Yep, that's right. I should point out for those new to embedded GPUs that are trying to follow what's going on, that the pipeline I describe for Midgard is present twice in a T760 core and there's 6 of those cores in a T760MP6.
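
For those following along, a rough sketch of how the per-clock counts add up under this description. The split of the 4-wide dot product into 4 multiplies plus 3 adds is an assumption made here to arrive at the 8 flops quoted for the second part of the pipe; the variable names are illustrative only.

```python
# Adreno 330, FP32: 4 SIMDs, 32 lanes each, multiply-add = 2 flops per lane
adreno_lanes = 4 * 32
adreno_flops_per_clock = adreno_lanes * 2

# Midgard pipeline in T760, FP32
first_part = 4 * 2 + 1       # vec4 MADD (8 flops) + scalar ADD (1 flop)
second_part = (4 + 3) + 1    # dot4 assumed as 4 mul + 3 add, plus 1 scalar flop
flops_per_pipe = first_part + second_part

pipes_per_core = 2           # the pipeline is present twice in a T760 core
cores_in_mp6 = 6             # T760MP6

t760_core_flops_per_clock = pipes_per_core * flops_per_pipe
t760_mp6_flops_per_clock = cores_in_mp6 * t760_core_flops_per_clock

print("Adreno 330 flops/clock:", adreno_flops_per_clock)
print("Midgard flops/pipe:", flops_per_pipe)
print("T760 core flops/clock:", t760_core_flops_per_clock)
print("T760MP6 flops/clock:", t760_mp6_flops_per_clock)
```

Note that 4*SIMD32 and 8*SIMD16 give the same total lane count, so the Adreno 330 peak figure does not change; only the organisation differs.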
 
I'll never manage to remember that config myself :p As long as I know that they get 32 FLOPs FP32 (SFUs aside) per clock per cluster it's good enough for me. That said, apart from architectural differences, it seems that Adreno 330, Mali Midgard and PowerVR Rogue all have roughly 1 TMU for every 32 FLOPs FP32.

Funny coincidence (?) would be that GK20A in K1 is at 48 FLOPs/TMU, while the Maxwell grandchild in X1 goes down to 32 FP32 FLOPs/TMU. I'm not even sure if such ratios exist on a technical level, but I'll skip the FP16 FLOPs/TMU ratio as they're the same in >=Series6XT and the X1 GPU :runaway:
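
A quick sketch of where those ratios come from. The Adreno 330 and Mali entries use the counts discussed in this thread (4*SIMD32 with 2 TMUs per SIMD, and 32 FLOPs plus 1 TMU per cluster). The Tegra unit counts (192 FP32 lanes and 8 TMUs for GK20A, 256 lanes and 16 TMUs for the X1 GPU) are not stated in the thread and are assumed here.

```python
def flops_per_tmu(flops_per_clock, tmus):
    """FP32 FLOPs per clock available for each texture unit."""
    return flops_per_clock / tmus


# name: (FP32 FLOPs per clock, TMUs)
configs = {
    "Adreno 330": (4 * 32 * 2, 4 * 2),      # 2 TMUs per SIMD, per this thread
    "Mali T760 cluster": (32, 1),           # SFUs not counted
    "Tegra K1 (GK20A)": (192 * 2, 8),       # assumed unit counts
    "Tegra X1 (Maxwell)": (256 * 2, 16),    # assumed unit counts
}

ratios = {name: flops_per_tmu(f, t) for name, (f, t) in configs.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.0f} FP32 FLOPs/TMU")
```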
 