Next-Gen iPhone & iPhone Nano Speculation

AlNom · Jul 16, 2014

anexanhume said:
Real question is if there's any weight to the LPDDR4 rumors or we can expect LPDDR3 again.

Hm... on the one hand Micron does apparently have 16Gbit LPDDR3 available.

On the other hand, the first wave of LPDDR4 may still only be 8Gbit, so if board space isn't an issue, they could do two chips there to hit 2GB.

...maybe?
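
For what it's worth, the density arithmetic behind the "two chips to hit 2GB" idea can be sketched as below. This is only a back-of-the-envelope helper; the function name `chips_needed` is made up for illustration and nothing here reflects an actual Apple or Micron configuration.

```python
def chips_needed(target_gbytes: int, density_gbits: int) -> int:
    """Number of DRAM devices of a given density (in Gbit) needed to reach
    a target capacity (in GB). Hypothetical helper for illustration only."""
    target_gbits = target_gbytes * 8            # 1 GB = 8 Gbit
    return -(-target_gbits // density_gbits)    # ceiling division


one_16gbit = chips_needed(2, 16)   # a single 16Gbit LPDDR3 device
two_8gbit = chips_needed(2, 8)     # 8Gbit LPDDR4 devices
print(one_16gbit, two_8gbit)
```

So a 16Gbit part reaches 2GB on its own, while 8Gbit parts would need two.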

Alexko · Jul 16, 2014

Aren't mobile RAM chips meant to be stackable, package-on-package? (Honest question, I really don't know but that would make sense to me.)

anexanhume · Jul 17, 2014

Alexko said:
Aren't mobile RAM chips meant to be stackable, package-on-package? (Honest question, I really don't know but that would make sense to me.)

Yes, Apple uses package-on-package (PoP) for iPhone SoC packages. iPad DRAM is off-chip though, and they use metal covers on those.

Grall · Jul 17, 2014

The A6X had off-chip DRAM due to the 128-bit bus, but I don't think current A7 iPads have it. I'm not sure, I haven't studied any teardowns intensely, but if it's off-chip again it might be for improved thermal dissipation, in order to clock the SoC higher than in iPhone.

Deleted member 13524 · Jul 17, 2014

The A7 uses 800MHz/1600MT/s LPDDR3 in a dual-channel configuration.
It has less overall bandwidth than the A6X's quad-channel 1066MT/s LPDDR2, so the iPad 4 is technically less bandwidth-limited than the iPad Air (17.06GB/s vs. 12.8GB/s), though the more recent GPU probably compensates for that somehow (more cache?).
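
A quick sketch of where the peak bandwidth figures come from, assuming each channel is 32 bits wide (so quad-channel matches the 128-bit bus mentioned for the A6X earlier in the thread). The function name and the channel width default are assumptions for illustration, not confirmed specs.

```python
def peak_bandwidth_gbs(channels: int, mt_per_s: int, channel_bits: int = 32) -> float:
    """Theoretical peak DRAM bandwidth in GB/s.

    channel_bits=32 is an assumption; transfers/s * bytes per transfer.
    """
    bytes_per_transfer = channels * channel_bits // 8
    return bytes_per_transfer * mt_per_s / 1000   # MB/s -> GB/s


a7_bw = peak_bandwidth_gbs(channels=2, mt_per_s=1600)    # dual-channel LPDDR3-1600
a6x_bw = peak_bandwidth_gbs(channels=4, mt_per_s=1066)   # quad-channel LPDDR2-1066
print(f"A7:  {a7_bw:.2f} GB/s")
print(f"A6X: {a6x_bw:.2f} GB/s")
```

Under those assumptions the A7 works out to 12.8 GB/s and the A6X to roughly 17.06 GB/s.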

anexanhume · Jul 17, 2014

Grall said:
The A6X had off-chip DRAM due to the 128-bit bus, but I don't think current A7 iPads have it. I'm not sure, I haven't studied any teardowns intensely, but if it's off-chip again it might be for improved thermal dissipation, in order to clock the SoC higher than in iPhone.

The retina mini does not have it (it has PoP like the 5S). The ipad air does have DRAM off chip and a metal cover/heat spreader. Prior to ipad air, I would have thought they'd do away with that and do all PoP. I have to think eventually they want to work towards wideIO with SIP or TSV.

ToTTenTranz said:
The A7 uses 800MHz/1600MT/s LPDDR3 in a dual-channel configuration.

Sorry, where was that speed confirmed? I had been assuming 1333MT/s.

ltcommander.data · Jul 17, 2014

anexanhume said:
Sorry, where was that speed confirmed? I had been assuming 1333MT/s.

https://d3nevzfk7ii3be.cloudfront.net/igi/XR6jJPopqjgw4LfW
http://www.micron.com/products/dram/mobile-lpdram#fullPart&306=2

The part number of the LPDDR3 is Elpida F8164A1MD-GD-F with the GD corresponding to LPDDR3-1600. JD is LPDDR3-1866. No idea what LPDDR3-1333 would have been.

On the A8, is there a reason why core counts tend to be powers of two or even numbers? Are there technical advantages or is it just because a 2x increase is easier to market and is enabled by doubled transistor counts each full node? The Metal documentation uses 3 threads in their multithreading examples. There probably isn't any deeper meaning there, but it seems sensible for them to use all approaches to increase CPU performance: bump the core count to 3 cores, continued microarchitecture improvements, modest clock speed increases. Apple seems to like using transistors rather than clock speeds to improve performance since they can eat the cost and save the power so the rumours that they would jump from 1.3 GHz to 2 GHz or more seem extreme.

anexanhume · Jul 17, 2014

ltcommander.data said:
https://d3nevzfk7ii3be.cloudfront.net/igi/XR6jJPopqjgw4LfW
http://www.micron.com/products/dram/mobile-lpdram#fullPart&306=2

The part number of the LPDDR3 is Elpida F8164A1MD-GD-F with the GD corresponding to LPDDR3-1600. JD is LPDDR3-1866. No idea what LPDDR3-1333 would have been.

Thanks, I was using an outdated Elpida PDF that didn't have those codes listed (or even LPDDR3).

On the A8, is there a reason why core counts tend to be powers of two or even numbers? Are there technical advantages or is it just because a 2x increase is easier to market and is enabled by doubled transistor counts each full node? The Metal documentation uses 3 threads in their multithreading examples. There probably isn't any deeper meaning there, but it seems sensible for them to use all approaches to increase CPU performance: bump the core count to 3 cores, continued microarchitecture improvements, modest clock speed increases. Apple seems to like using transistors rather than clock speeds to improve performance since they can eat the cost and save the power so the rumours that they would jump from 1.3 GHz to 2 GHz or more seem extreme.

Well, they've been able to advance ISA/microarchitecture in each of the last 3 iPhone iterations (Cortex-A8 -> Cortex-A9 -> ARMv7s -> ARMv8-A). There's no new ISA or reference design to jump to this time around, so they'll have to stick to microarchitecture improvements to their own design. They've also claimed a 2x speedup for the last three SoC iterations, and I think they'll have a tough time getting there without increasing core count or drastically increasing clock speed. However, here we are nearly a full year later and theirs is still the only mobile device on the market with an ARMv8-A design, so I'm done underestimating them.

On the 3 core (or more) idea, I'm guessing they'd really want to bleed the stone on microarchitecture improvements now that they've added a big L3 cache. Defining the bus interconnects for all of that is likely a pain and uses a fair amount of die space. I'm sure they have an informed trade-off of clock speed/core count/core complexity that's driving their decisions though.

Ailuros · Jul 18, 2014

ltcommander.data said:
Apple seems to like using transistors rather than clock speeds to improve performance since they can eat the cost and save the power so the rumours that they would jump from 1.3 GHz to 2 GHz or more seem extreme.

I don't know what they've planned, but let's assume they have a G6630@=/>600MHz in the A8. That doesn't follow your reasoning above either, as it's 50% more area (6 clusters vs. 4 in A7) and =/>35% higher frequency. And yes, that theory makes more sense than, say, a G6430@770MHz, albeit that isn't technically impossible at all.

The point here is that it depends on whether N architecture is laid out for high frequencies or not. Rogue as an architecture is far more tolerant of them afaik compared to its predecessors; after all, unless I'm misreading the Allwinner A80 Manhattan results, it's clocked at ~780MHz. Yes, a dual cluster G6230 is far easier to clock higher than a four or more cluster config, but all it should mean is that just because an IHV has a track record of N hw strategy, it doesn't follow that any change to that would be absolutely taboo.

anexanhume · Jul 19, 2014

After their aggressive ARMv8-A adoption, I'm half expecting a GX6450 or GX6650. I'm guessing the finer power gating is extremely attractive and Metal can potentially compensate for the lack of big clock boosts or USC growth.

mavere · Jul 19, 2014

TSMC's 20nm is much more of a density story than a perf/power story. Between that and the suspected roomier 4.7" housing and the expected wafer allotment (up to 40k/month in Q3, 50+k wpm throughout Q4), I think Apple will have a lot of new transistors to spend on something.

Maybe it'll be for the largest possible PowerVR design, a 3rd CPU core, a larger cache, or even a fancy on-die voltage regulator to turbo to the rumored clockspeeds without sacrificing battery life. Or maybe they'll just stick to the original A7 architecture and draw happy faces on the spare die area.

Ailuros · Jul 19, 2014

anexanhume said:
After their aggressive ARMv8-A adoption, I'm half expecting a GX6450 or GX6650. I'm guessing the finer power gating is extremely attractive and Metal can potentially compensate for the lack of big clock boosts or USC growth.

No one has seen yet how 6XT cores perform in real time, however from the layman's observer corner I'm standing here, I'm not all that optimistic that the claimed up to 50% compared to 6430/6630 stands up to be an average also.

The only hw change they've revealed for 6XT are additional FP16 ALUs; so far the 6200 vs. 6230 & 6400 vs. 6430 (whereby on 6200/6400 you don't have any FP16 ALUs and no framebuffer compression) haven't shown any groundbreaking performance differences, so I'm obviously missing something.

anexanhume · Jul 19, 2014

Ailuros said:
No one has seen yet how 6XT cores perform in real time, however from the layman's observer corner I'm standing here, I'm not all that optimistic that the claimed up to 50% compared to 6430/6630 stands up to be an average also.

The only hw change they've revealed for 6XT are additional FP16 ALUs; so far the 6200 vs. 6230 & 6400 vs. 6430 (whereby on 6200/6400 you don't have any FP16 ALUs and no framebuffer compression) haven't shown any groundbreaking performance differences, so I'm obviously missing something.

That's why I'm guessing it's not about peak performance improvement. Better idle power and improved memory utilization via compression techniques seem desirable in and of themselves.

Grall · Jul 19, 2014

How well would framebuffer compression work anyway? Even non-realtime lossless compression schemes don't get you all that far really.

Ailuros · Jul 19, 2014

anexanhume said:
That's why I'm guessing it's not about peak performance improvement. Better idle power and improved memory utilization via compression techniques seem desirable in and of themselves.

I don't recall how they call the function where you can power gate 2 clusters at a time but it's also present in the 6630 as well as framebuffer compression and what not. 6XT cores also support ASTC in hw but that rather means more transistors dedicated to it. In terms of FP16 ALUs it's 288 SPs in the 6630 and 384 in the 6650.

Grall said:
How well would framebuffer compression work anyway? Even non-realtime lossless compression schemes don't get you all that far really.

I'd say that, as with any singled-out aspect, it just further contributes to efficiency, like a multitude of other factors. You know marketeers would kill for even a, say, 5% performance difference.

silent_guy · Jul 19, 2014

Grall said:
How well would framebuffer compression work anyway? Even non-realtime lossless compression schemes don't get you all that far really.

I can't find the link that refers to the compression, but even if it's pure lossless compression without any additional context bits, I think it could do really quite well (say, 40% reduction?) in a lot of cases. And for cases where it doesn't, it's still not a big deal: I expect the power cost of having it enabled to be less than the high cost of transferring data to DRAM.

And you could imagine it being adaptive: when frames don't compress well enough to be worth it, one could turn it off and only try it once every 100 frames or so to see if the workload has changed enough for it to be useful.
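
A rough software sketch of that adaptive idea, purely for illustration. Everything here is an assumption: `zlib` stands in for whatever lossless scheme the hardware would actually use, and the class name, the 10% saving threshold and the 100-frame probe interval are made up.

```python
import hashlib
import zlib


class AdaptiveCompressor:
    """Hypothetical gate: compress while it pays off, otherwise probe rarely."""

    def __init__(self, min_saving: float = 0.10, probe_interval: int = 100):
        self.min_saving = min_saving          # minimum size reduction to stay enabled
        self.probe_interval = probe_interval  # frames between probes while disabled
        self.enabled = True
        self.frames_since_probe = 0

    def _saving(self, frame: bytes) -> float:
        # Fraction of bytes saved; zlib is only a stand-in compressor.
        return 1.0 - len(zlib.compress(frame)) / len(frame)

    def submit(self, frame: bytes) -> bool:
        """Return True if this frame went through the compressor."""
        if not self.enabled:
            self.frames_since_probe += 1
            if self.frames_since_probe < self.probe_interval:
                return False                  # written out uncompressed
            self.frames_since_probe = 0       # time to probe the workload again
        self.enabled = self._saving(frame) >= self.min_saving
        return True


# Example data: a flat frame compresses well, hash output does not.
flat = bytes(4096)
noisy = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(128))

gate = AdaptiveCompressor()
gate.submit(noisy)                            # first noisy frame disables compression
probed = [gate.submit(noisy) for _ in range(100)]
print(probed.count(True))                     # how many of 100 frames were probed
```

While the workload stays incompressible, only one frame per probe interval pays the compression cost; a frame that compresses well during a probe switches it back on.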

anexanhume · Jul 20, 2014

Ailuros said:
I don't recall how they call the function where you can power gate 2 clusters at a time but it's also present in the 6630 as well as framebuffer compression and what not. 6XT cores also support ASTC in hw but that rather means more transistors dedicated to it. In terms of FP16 ALUs it's 288 SPs in the 6630 and 384 in the 6650.

I was under the impression that the power gating has a finer granularity with the GX series. Is that not the case?

From Anandtech preview:

Taking a look at what changes have been made, Series6XT will be gaining finer grained power gating through what Imagination is calling “PowerGearing G6XT” technology. Mobile class GPUs have featured power gating for some time – it being necessary in order to achieve the low power consumption today’s mobile devices shoot for at idle and in light workloads – with higher grained solutions offering even more efficiency gains by being able to shut off a larger percentage of the GPU when those resources aren’t required. For Series6XT, Imagination gains the ability to shut off individual USCs and other processing blocks within their GPUs, which should be especially beneficial in light workloads where the GPU can’t idle, but it doesn’t need to allocate all of its resources either.

http://www.anandtech.com/show/7629/...rchitecture-available-for-immediate-licensing

Ailuros · Jul 20, 2014

anexanhume said:
I was under the impression that the power gating has a finer granularity with the GX series. Is that not the case?

From Anandtech preview:

http://www.anandtech.com/show/7629/...rchitecture-available-for-immediate-licensing

Why do I have the impression that it's already present on 6630?

http://www.imgtec.com/news/detail.asp?ID=706

G6630 delivers advanced power scaling technologies including enabling USC cluster pairs to be shut down depending on the App load, allowing X2, X4 and X6 modes of operation to provide multiple power/performance operating points with the same on-chip GPU.

Else if you're adequate with just 2 clusters to run the device GUI and what not, the other 4 clusters can be power gated (besides obviously clock gating since you don't necessarily need to run at full frequency for the GUI either). Whether they've now gained the ability to shut off single clusters as of 6XT, I have no idea, but it doesn't make all that much sense to me either, since you have 1 quad TMU coupled with 2 USCs at a time.

anexanhume · Jul 20, 2014

Ailuros said:
Why do I have the impression that it's already present on 6630?

http://www.imgtec.com/news/detail.asp?ID=706

Else if you're adequate with just 2 clusters to run the device GUI and what not, the other 4 clusters can be power gated (besides obviously clock gating since you don't necessarily need to run at full frequency for the GUI either).

Yes, but still finer grain (individual USCs), which was my question. Though I doubt the benefits of 1 USC vs 2 with others power gated is all that dramatic.

Ailuros · Jul 20, 2014

anexanhume said:
Yes, but still finer grain (individual USCs), which was my question. Though I doubt the benefits of 1 USC vs 2 with others power gated is all that dramatic.

Hmmm, well IMO if you have 6 clusters in total and power gate 5 of them, I'd still figure that at least one quad TMU would have to be active, which makes the gain against a 2 USC + 4 TMU active scenario rather negligible, as you say. Unless of course you can also turn off part of each quad TMU, but then they wouldn't state that 2 USCs share a texture pipeline at a time.
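
To put rough numbers on that, here is a toy model of pair-wise versus single-USC gating. It assumes only what was quoted in the thread (6 USCs, one quad TMU shared by every 2 USCs); the function and its parameters are hypothetical and say nothing about how the real hardware is organized.

```python
import math

USCS_PER_QUAD_TMU = 2   # per the quoted statement that 2 USCs share a quad TMU


def powered_units(needed_uscs: int, total_uscs: int = 6, granularity: int = 2):
    """Return (powered USCs, powered quad TMUs) for a gating granularity.

    granularity=2 models cluster-pair gating (X2/X4/X6 modes),
    granularity=1 models gating individual USCs. Toy model only.
    """
    needed = max(1, min(needed_uscs, total_uscs))
    uscs = min(total_uscs, math.ceil(needed / granularity) * granularity)
    tmus = math.ceil(uscs / USCS_PER_QUAD_TMU)   # a quad TMU stays on if any of its USCs is on
    return uscs, tmus


pair_gated = powered_units(1, granularity=2)     # light GUI load, pair gating
single_gated = powered_units(1, granularity=1)   # same load, per-USC gating
print(pair_gated, single_gated)
```

In this model the finer granularity only saves one USC for a light load, and the quad TMU count stays the same.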
