Apple A9X SoC

What if the L3 is there, but is distributed asymmetrically across the die rather than as one monolithic rectangular area?

What would make the die grow by over 40% if there is no L3? There are no additional CPU resources, and the two extra GPU clusters can't account for the difference by themselves.
 
There are no additional CPU resources, and the two extra GPU clusters can't account for the difference by themselves.
You mean 6 extra clusters. The 3 extra cluster pairs and the doubled memory interface seem to account for most of the extra area. Also, a pity about being beaten to the L3 news.
 
I wonder if it was more a case of die area / production concerns, or power consumption issues, that made them leave out an L3 cache. Does anybody know if the L3 in the A9 is power-gated? I imagine it must be.
 
Despite the sizable clock speed increases to the CPUs of the A9 and A9X, Apple apparently could've left the GPU clocks almost the same as last gen, just relying on the new graphics architecture and the 50% increase in cluster count in each case.

Interesting to see opposing approaches taken between the CPU and GPU this gen, although wider for graphics and faster for central processing are the more typical methods of generational improvement in either case.
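As a rough back-of-envelope of the "wider vs. faster" trade (all numbers below are illustrative assumptions, not measurements): with imperfect scaling from extra clusters, only a small clock bump is needed to hit a given gen-over-gen target.

```python
# Back-of-envelope: clock multiplier needed on top of a cluster-count increase
# to hit a target speedup. All inputs below are illustrative assumptions.

def required_clock_ratio(target_speedup, cluster_ratio, scaling_efficiency):
    """target_speedup: desired overall performance vs. last gen
    cluster_ratio: new cluster count / old cluster count
    scaling_efficiency: fraction of ideal scaling the extra clusters achieve"""
    effective_width_gain = 1 + (cluster_ratio - 1) * scaling_efficiency
    return target_speedup / effective_width_gain

# e.g. 50% more clusters, aiming for ~1.6x overall, assuming ~90% scaling:
print(required_clock_ratio(1.6, 1.5, 0.9))  # ~1.10, i.e. roughly a 10% clock bump
# With near-linear scaling plus per-clock gains from the new architecture,
# the required clock bump shrinks toward zero.
```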
 
You mean 6 extra clusters. The 3 extra cluster pairs and the doubled memory interface seem to account for most of the extra area. Also, a pity about being beaten to the L3 news.
Sorry you lost your scoop; on the other hand, there is an interesting story hiding there. Why did Apple scrap the L3 on the A9X? The ratio of GPU resources to bandwidth is similar between the A9 and A9X. If the bandwidth to the L3 had stayed the same, its benefit versus going straight to main memory would have been reduced, favouring removing it. Latency of main memory, reckoned in CPU cycles, is presumably a touch worse on the A9X due to the higher CPU clocks (unless ditching the L3 compensates for that, of course), which should otherwise favour keeping the L3. I don't really buy the size/yield argument too much, since the L3 is rather small in terms of die area, and it's a very regular structure, so it shouldn't affect yield much. In terms of power it draws a bit, but the L3 is not particularly fast, and it reduces off-die traffic, which saves a bit of power.
It all seems rather finely balanced; it is not obvious why they would reckon the L3 to be a win on the A9 and not on the A9X. If you can figure this out, it may also hint at where Apple might go on 10nm.
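To illustrate the bandwidth-ratio point with a toy sketch (the DRAM figures assume LPDDR4 with a 64-bit interface on the A9 and 128-bit on the A9X; the L3 bandwidth is a purely hypothetical placeholder, not a known number):

```python
# If the L3's bandwidth had stayed fixed while DRAM bandwidth and GPU width
# doubled, the L3's relative advantage over main memory would roughly halve.
# All figures are assumptions for illustration only.

dram_bw = {"A9": 25.6, "A9X": 51.2}     # GB/s, assumed 64-bit vs 128-bit LPDDR4
gpu_clusters = {"A9": 6, "A9X": 12}
l3_bw = 50.0                             # GB/s, hypothetical fixed L3 bandwidth

for soc in ("A9", "A9X"):
    per_cluster = dram_bw[soc] / gpu_clusters[soc]
    advantage = l3_bw / dram_bw[soc]
    print(f"{soc}: {per_cluster:.2f} GB/s DRAM per cluster, "
          f"L3 is {advantage:.2f}x DRAM bandwidth")
# A9:  4.27 GB/s per cluster, L3 ~1.95x DRAM bandwidth
# A9X: 4.27 GB/s per cluster, L3 ~0.98x DRAM bandwidth -> much weaker case for the L3
```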
 
In terms of power it draws a bit, but the L3 is not particularly fast, and it reduces off-die traffic, which saves a bit of power.
I think power is the sole argument here. Same rationale as for why vendors don't bother with PoP memory in tablet and bigger form factors.
 
The issue I see with power being the reason for dropping the L3 from the A9x is that the phone SoC A9 has it, even though it has a tighter power envelope. Since the power draw of the L3 has to be relatively modest in the A9, its proportional contribution to the whole in the A9X should be quite a bit lower still with its twice-as-wide GPU. Also, if the memory bus is the same speed but twice as wide, energy savings due to reduced bus traffic should be worth more on the A9X.
Hmm.
 
The issue I see with power being the reason for dropping the L3 from the A9x is that the phone SoC A9 has it, even though it has a tighter power envelope. Since the power draw of the L3 has to be relatively modest in the A9, its proportional contribution to the whole in the A9X should be quite a bit lower still with its twice-as-wide GPU. Also, if the memory bus is the same speed but twice as wide, energy savings due to reduced bus traffic should be worth more on the A9X.
Hmm.
Except if it is cheaper to include a slightly larger battery than to increase the die size. Maybe the die is so large that they needed to cut something, and this was easy to cut (easy to compensate for, while costing a lot of die area).
 
The cache should be a saver of power overall if anything. Since A9X's power budget isn't quite as tight as A9's, Apple perhaps just went for the more straightforward approach (easier to implement/balance into the design) of the wider memory interface because that SoC could afford it.
 
Except if it is cheaper to include a slightly larger battery than to increase the die size. Maybe the die is so large that they needed to cut something, and this was easy to cut (easy to compensate for, while costing a lot of die area).
Yes, but again this is a line of reasoning that is actually stronger for cutting the L3 on the A9 rather than the A9X. The L3 is a larger percentage of the A9 die area, ergo removing it would represent a larger increase in usable dies per wafer, and that would also be particularly welcome on their higher-volume part!
My feeling is that there may be an interesting story here. Unfortunately, since their SoC engineers have no link to the outside world, we may never hear it. But some inspired detective work might turn something up.
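For what it's worth, a crude dies-per-wafer sketch supports the direction of that argument. The die sizes and L3 area below are rough assumptions based on published die-shot estimates, not confirmed figures.

```python
import math

# Gross dies per 300mm wafer using the common approximation,
# ignoring defect yield. All areas below are assumptions.

def gross_dies(wafer_diameter_mm, die_area_mm2):
    r = wafer_diameter_mm / 2.0
    return (math.pi * r ** 2 / die_area_mm2
            - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2))

WAFER = 300.0
die_area = {"A9": 104.5, "A9X": 147.0}   # mm^2, assumed
l3_area = 4.5                             # mm^2, assumed size of the (reportedly 4MB) L3 block

for soc, area in die_area.items():
    gain = gross_dies(WAFER, area - l3_area) / gross_dies(WAFER, area) - 1
    print(f"{soc}: ~{gain * 100:.1f}% more gross dies without the L3")
# The smaller A9 die gains proportionally more from dropping the same-sized L3,
# which is why the yield argument cuts the "wrong" way here.
```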
 
The issue I see with power being the reason for dropping the L3 from the A9x is that the phone SoC A9 has it, even though it has a tighter power envelope.
I think you misunderstood my statement: the A9X dropped it because it has a higher power envelope. I.e., the L3 is there just for power savings and would only be worth it on the smaller devices.
 
I think you misunderstood my statement: the A9X dropped it because it has a higher power envelope. I.e., the L3 is there just for power savings and would only be worth it on the smaller devices.
Oh. Thanks for the clarification. That makes logical sense, although I can't really see that the hit rate of L3-and-not-3MB-of-L1-or-L2 would reduce bus traffic enough to more than offset the power draw of the cache system and in itself justify having the L3. A few percent of the memory bus traffic, which is also counterbalanced by cache power draw, just seems like a very marginal effect in the overall picture.
Hmm.
 
That makes logical sense, although I can't really see that the hit rate of L3-and-not-3MB-of-L1-or-L2 would reduce bus traffic enough
You're not thinking about non-CPU blocks. The GPU, display pipeline, and a lot of other things in there would probably benefit greatly from reduced memory controller and DRAM activity. Main memory is just ridiculously more expensive in terms of power; I can see them min-maxing as much as possible out of the SoC architecture via such a cache. Of course, this is just my theory...
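A toy energy model for that point (the pJ/bit figures are rough, literature-style assumptions, and the traffic number is invented purely for illustration):

```python
# Every byte served from an on-die cache instead of DRAM saves roughly the
# difference between off-chip DRAM access energy and on-die SRAM access
# energy. Numbers below are assumptions, not measured values.

DRAM_PJ_PER_BIT = 20.0   # assumed off-chip LPDDR access energy
SRAM_PJ_PER_BIT = 2.0    # assumed large on-die SRAM access energy

def power_saved_watts(gbytes_per_s_absorbed):
    bits_per_s = gbytes_per_s_absorbed * 8e9
    return bits_per_s * (DRAM_PJ_PER_BIT - SRAM_PJ_PER_BIT) * 1e-12

# If the L3 absorbed, say, 5 GB/s of traffic that would otherwise go to DRAM:
print(f"~{power_saved_watts(5.0):.2f} W saved")  # ~0.72 W under these assumptions
# Against this you still have to charge the cache's own leakage and dynamic
# power, which is why the trade-off can flip with the power envelope.
```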
 

Nice. However:

We don’t know the clockspeed of the GPU – this being somewhat problematic to determine within the iOS sandbox – but based on our earlier performance results it’s likely that A9X’s GPU is only clocked slightly higher than A9’s. I say slightly higher because no GPU gets 100% performance scaling with additional cores, and with our GFXBench Manhattan scores being almost perfectly double that of A9’s, it stands to reason that Apple had to add a bit more to the GPU clockspeed to get there.

Unless my memory fails me, aren't Mali GPUs in the latest Exynos SoCs scaling somewhat linearly with cluster increases in Manhattan? I don't know where the A8 & A8X GPU frequencies lie, but I always figured somewhere in the 533MHz ballpark for both. The latter is "just" a mirror of the A8's GX6450, so I don't see why its performance results shouldn't be as expected. While scaling clusters it's not always necessary to scale all GPU units outside the ALUs/TMUs, but in the given case it's the smartphone SoC GPU times 2 for both generations.

If I take both the 3.0 and 3.1 fillrate tests (which don't help much with the alpha blending they contain) and compare them between the iPad Pro and the iPhone 6S, the difference between the two is 2.1x and 2.2x respectively, but then again the A9X has a shitload more bandwidth, which should allow the GPU to stretch its legs quite a bit more. It's Apple's marketing that claimed a 360x increase in GPU performance compared to the initial iPad. Since they obviously counted FP16 OPs on purpose, and the SGX535 is a 2 Vec2 GPU clocked at 250MHz, it's at 2 GFLOPs. Times 360 = 720 GFLOPs FP16. Reverse the math, and for 1536 FP16 OPs/clock you'd need 469MHz. No guarantees that this is correct either, but the maximum frequency I'd expect from Apple even in this generation wouldn't exceed the ~550MHz mark either way.
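Reproducing that reverse math explicitly (the 2 GFLOPs SGX535 figure and the 1536 FP16 OPs/clock rate are taken as stated above; the 360x multiplier is Apple's marketing claim):

```python
# Back out the implied A9X GPU clock from Apple's "360x the original iPad" claim.

sgx535_fp16_gflops = 2.0          # as stated above: 2 Vec2 ALUs @ 250MHz
claimed_multiplier = 360
a9x_fp16_ops_per_clock = 1536     # 12-cluster Series7XT at FP16 rate, as stated

target_gflops = sgx535_fp16_gflops * claimed_multiplier        # 720 GFLOPs FP16
implied_clock_mhz = target_gflops * 1e9 / a9x_fp16_ops_per_clock / 1e6
print(round(implied_clock_mhz))   # ~469 MHz, matching the estimate above
```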
 