Apple A8 and A8X

Ever since that explanation for the A7's physics score in 3DMark turned up (I think I first read it in the comments section of AnandTech's iPhone 5s or iPad Air review), I've tried to determine whether real cross-platform games using Bullet physics show any relative weakness in iOS performance. As I mentioned when this was last discussed months ago, I do believe the test highlights an actual characteristic of the Cyclone architecture that isn't a particular strength, but its impact is hard to isolate when looking at overall performance in a cross-platform comparison, and I've never been able to identify a real-world performance impact.

As others have pointed out, accounting for the specifics of the target hardware is precisely what a software developer does, so you of course might not expect to see much of a real world impact.

Interestingly, the Air 2 is close to the advertised 2.5x over the Air in a whole bunch of GFXBench tests and sub-tests: Manhattan, T-Rex, ALU, Alpha Blending, and, surprisingly, fill. That fill rate result suggests an iOS device is finally achieving somewhat higher effectiveness in the 3.0 version of GFXBench's fill rate sub-test, like they used to achieve in the 2.7 version, on top of other fill-improving factors like having more TMUs and a higher clock. Still, between the possible TMU count, the clock rate that would be required, and what was assumed about the configuration of the announced PowerVR cores, explaining the Air 2's fill result is challenging.

Getting back to the discussion about Apple Pay and whether the 5s would truly be prevented from making payments in third-party apps, Ars Technica's explanation matches what ltcommander.data speculated:
The second piece of hardware is what Apple calls the "Secure Element," a dedicated chip in supported devices. This is what's responsible for storing and protecting your credit card data—it can apparently be paired with any of Apple's SoCs, which is why the iPad Mini 3 can use Apple Pay in apps while the iPhone 5S can't. If you have those two things, the Apple Pay button can be used in apps that support it.
http://arstechnica.com/apple/2014/10/ios-8-1-mini-review-testing-apple-pay-sms-forwarding-and-more/
 
Two 1GB memory chips. Hello again, 128-bit bus.
Hmmm, so 128-bit RAM bus, TBDR renderer, FP16 precision, 20nm process, 3 billion transistors, and it doesn't do better than TK1 in GFXBench? Really?
What's the reason for such poor efficiency? Serious question, not starting a flame war...
 
Hmmm, so 128-bit RAM bus, TBDR renderer, FP16 precision, 20nm process, 3 billion transistors, and it doesn't do better than TK1 in GFXBench? Really?
What's the reason for such poor efficiency? Serious question, not starting a flame war...
What if it's doing all that with less energy drain and better power efficiency, at a lower MHz rating, and pumping out less heat?
 
A8X better be efficient running inside a 6 mm profile. And heating to Shield tablet levels of burn probably wouldn't go over so well in a device as mainstream as an iPad.

Appears to me Apple may have done it again: push the envelope in form factor yet still take the crown in performance.
 
I was hoping that I could make a reasonable frequency estimate from the detailed A8X tests, but no, it doesn't help. With the alpha blending used in the fill rate tests, and the A8X having that much more bandwidth, it doesn't get me anywhere (and I think there's 2 pixels/cluster of output for alpha blending).

All things being equal (which isn't the case), the GPU would clock at an estimated ~650MHz. Considering the 50% more TMUs and 50% more pixel output, along with the added SoC bandwidth, the frequency shouldn't be that far above the GX6450 frequency in the A8. On a purely speculative basis (which is most likely wrong again), I'd go for something slightly below 600MHz.
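
For what it's worth, here's a minimal sketch (in Python) of the kind of back-of-the-envelope math I mean, purely as an illustration: it assumes the offscreen fill sub-test roughly tracks TMU count times clock, uses the Air 2 fill score quoted further down the thread, and the efficiency factor is a made-up knob rather than a measured number.

Code:
# Rough implied-clock estimate from a GFXBench fill score.
# Assumptions (speculative, not confirmed anywhere): the offscreen fill
# sub-test scales roughly linearly with TMU count * clock, and 'efficiency'
# lumps together bandwidth and alpha blending effects.

def implied_clock_mhz(fill_mtexels_per_s, tmus, efficiency=1.0):
    """Return the GPU clock (MHz) implied by a measured fill rate."""
    return fill_mtexels_per_s / (tmus * efficiency)

# Air 2 offscreen fill score quoted later in the thread: ~7606 MTexels/s.
print(implied_clock_mhz(7606, tmus=12))                   # ~634 MHz, all things equal
print(implied_clock_mhz(7606, tmus=12, efficiency=0.95))  # ~667 MHz with a 95% fudge factor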
 
Well.. I was pretty sure they'd be going for a split SoC strategy this time, and it turns out I was right. But I was quite surprised to see a tri-core.. I really expected it to be quad. Then again, Apple has never shied away from seemingly unconventional configs (didn't they do an MP3 config for the GPU in the A6?)

The GFXBench results are quite impressive too. Are we still thinking GX6650, or could they have gone for an 8-cluster configuration? (The extra core and the extra 1 MB of L2 explain part of the transistor count increase from the A8, but there are still a LOT of transistors unaccounted for.)
 
The GFXBench results are quite impressive too. Are we still thinking GX6650, or could they have gone for an 8-cluster configuration? (The extra core and the extra 1 MB of L2 explain part of the transistor count increase from the A8, but there are still a LOT of transistors unaccounted for.)

I'd be surprised if it turned out to be an 8-cluster. Although I'm sure it's doable, from the very first announcements of Rogue, the maximum cluster count given in examples, and the maximum IMG have stated as a licensable product, has been 6 clusters. That's not to say that, Apple being Apple, the rules can't change at any time... but it would be a bolt out of the blue.

Regarding that transistor count, you must include a 128-bit memory controller in there too. If the block of memory in the A8 is something related to graphics, it's possible that this has been increased too in the A8X.
 
I'd be surprised if it turned out to be an 8-cluster. Although I'm sure it's doable, from the very first announcements of Rogue, the maximum cluster count given in examples, and the maximum IMG have stated as a licensable product, has been 6 clusters. That's not to say that, Apple being Apple, the rules can't change at any time... but it would be a bolt out of the blue.

Although there are no 8-cluster products listed, IMG have stated that the Rogue architecture scales up to 8 clusters:

http://www.imgtec.com/powervr/powervr-architecture.asp
The PowerVR Series6 ‘Rogue’ GPUs deploy a pipeline cluster approach, the Unified Scalable Cluster (or USC), with each cluster containing up to 16 pipelines and GPUs containing from 1 up to 8 clusters. Each cluster in a PowerVR Series6 GPU includes 32 ALU cores, providing optimal performance for FP32 and FP16 computing. PowerVR Series6XT and Series6XE GPUs deliver a dramatic improvement in performance, by doubling the number of FP16 ALUs.

http://withimagination.imgtec.com/powervr/powervr-g6630-go-fast-or-go-home
PowerVR Series6 'Rogue' cores deploy a pipeline cluster approach, with each GPU core scaling up to 8 clusters, and each cluster containing up to 16 pipelines.
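
As a quick reference for the scaling discussion, here's a tiny Python sketch of peak FP32 throughput per clock as the cluster count grows, assuming 64 FP32 FLOPs per cluster per clock (which matches the 256 FLOPs/clock figure used for the 4-cluster GX6450 further down the thread).

Code:
# Peak FP32 FLOPs/clock for a Rogue-style GPU, assuming 64 FP32 FLOPs per
# cluster per clock (32 ALU cores * 2 ops for a multiply-add). This matches
# the 256 FLOPs/clock used for GX6450 later in the thread; treat it as an
# assumption, not a datasheet number.

FP32_FLOPS_PER_CLUSTER_PER_CLOCK = 64

def peak_gflops_fp32(clusters, clock_mhz):
    return clusters * FP32_FLOPS_PER_CLUSTER_PER_CLOCK * clock_mhz / 1000.0

for clusters in (4, 6, 8):
    print(clusters, "clusters @ 600 MHz:", peak_gflops_fp32(clusters, 600), "GFLOPs FP32")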
 
I'm pretty confident it's a GX6650. Obviously marketing has announced that clusters can go up to 8 for the current designs (albeit I could have sworn that it's actually 12), but either way that's definitely not the maximum the design allows.

So far Rogues scale pretty accurately in Manhattan according to their FLOP rates; whereby 6XT cores go a wee bit higher than Series6.

GX6450
estimated frequency 520MHz (iPhone 6 Plus)
256 FP32 FLOPs/clock * 0.52 = 133.12 GFLOPs FP32
Manhattan offscreen 18.80 fps

GX6650
Manhattan offscreen 32.70 fps
reverse speculative math according to the above: 231.54 GFLOPs FP32
231.54 / 384 FP32 FLOPs/clock = 603MHz

Now if of course the GX6450 frequency is wrong, then the whole thing is worthless.
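
Here's the same reverse math as a few lines of Python, so it's easy to re-run with a different assumed GX6450 frequency; it also shows what the same logic would give for a hypothetical 8-cluster part.

Code:
# Reverse-speculative frequency estimate: assume Manhattan offscreen fps
# scales linearly with peak FP32 GFLOPs across Rogue parts.

GX6450_FLOPS_PER_CLOCK = 256   # FP32 FLOPs/clock, 4 clusters
GX6450_CLOCK_GHZ = 0.52        # assumed iPhone 6 Plus frequency
GX6450_MANHATTAN_FPS = 18.80
A8X_MANHATTAN_FPS = 32.70

gx6450_gflops = GX6450_FLOPS_PER_CLOCK * GX6450_CLOCK_GHZ              # 133.12
a8x_gflops = gx6450_gflops * A8X_MANHATTAN_FPS / GX6450_MANHATTAN_FPS  # ~231.5

print(a8x_gflops / 384 * 1000)  # ~603 MHz if it's a 6-cluster GX6650
print(a8x_gflops / 512 * 1000)  # ~452 MHz if it were an 8-cluster part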
 
I wonder what the chances are that series 7 will include some licensable products with 8 clusters. Although it starts to become an avenue with diminishing returns: going from 4-6 gets you 50%-ish, going from 6-8 gets you 25%-ish.

series 7 should be initially revealed within the next 2-3 months.

Did I just indirectly start an A9 speculation discussion ? :)
 
The bigger headache for scaling clusters beyond 6 is bandwidth IMO. If I/O doesn't get increasingly wider on ULP SoCs also, it'll become another bottleneck sooner than later.
 
Looking at GFXBench alone, we see
ALU, Offscreen: (11054/5142)=2.15
Alpha Blend, Offscreen: (17791/8041)=2.21
Fill, Offscreen: (7606/3420)=2.22

Hmm.
The onscreen results weren't different enough to matter, but getting rid of that variable seemed prudent.
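
For anyone who wants to re-check or extend those ratios, the same arithmetic as a trivial Python snippet:

Code:
# Air 2 vs Air GFXBench offscreen sub-test scores quoted above.
scores = {
    "ALU":         (11054, 5142),
    "Alpha Blend": (17791, 8041),
    "Fill":        (7606, 3420),
}
for name, (air2, air) in scores.items():
    print("%s: %.2fx" % (name, air2 / air))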
 
These benchmarks are meant to try to mimic real-world games and go as far as to use tools also used by game developers (Unity, that physics engine, etc.). I don't think the point is to benchmark optimized performance on each platform; for that we have more synthetic suites such as GFXBench.

The problem is - does 3DMark actually succeed in its attempt to mimic real-world games, both in terms of what it is doing, and how it goes about it? At what level of confidence? That is, to what extent can it be used for predictive purposes?

This is not a criticism leveled at 3DMark specifically. It is simply that what it purports to do is nigh on impossible, particularly cross platform. And that remains true if you exchange 3DMark for any other cross platform 3D application/benchmark.

Grall said:
To no small amount the difficulty of scaling DRAM performance, no doubt? CPU performance goes up 1000-fold in a decade more or less, while DRAM performance increases like 10x, if that much...
Historically, one fundamental problem has been that you want a benchmark to be able to run on a wide selection of hardware in order to have a wider base for comparisons. Already there, you are going to have to limit yourself to what can be comfortably run, in terms of memory footprint, on the lowest level of hardware you want to test. That quickly led to problems, because you also want to run the same benchmark across generations, in order to estimate the value of upgrading. Thus most benchmarks tended to have pitifully small memory footprints compared to the actual programs run on systems, which led to benchmark results getting ridiculously skewed once they started becoming cache resident, so the better benches (such as SPEC) got continuous updates in order to yield useful results, at the cost of making wide-ranging comparisons difficult.
But benchmarks have always been used for marketing, and the growing gap you mention between ALU capabilities and the ability of the memory subsystem to feed them created strong pressure from the industry to have tests that de-emphasized memory access in favor of ALU. And once single-thread performance started getting difficult to come by, and hardware manufacturers thus turned to multiprocessors, the benchmarks that stressed the memory subsystem got almost completely weeded out, because not only did they not help sell the new kit, they sometimes actually showed the new shiny to perform worse due to contention issues and the then-new complexities in how memory accesses had to be handled.

Today we are at a point where most benchmark codes you see on hardware websites are "filter like" in that they iterate a lot over very small data sets that mostly fit in core-local cache. If I were to be cynical, and I am, I'd say that this is because those are the only codes that have served their purpose of driving sales well enough.
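
As a toy illustration of the "filter like" point (a sketch, not a rigorous benchmark): the same simple reduction run over a working set that fits in cache versus one that doesn't shows a very different effective bandwidth, which is exactly the behaviour a cache-resident benchmark never exposes.

Code:
# Toy demonstration: effective GB/s of a simple sum over a small (cache-
# resident) array vs a large (DRAM-resident) one. Absolute numbers will vary
# wildly by machine; only the qualitative gap matters.
import time
import numpy as np

def effective_gbps(n_floats, repeats):
    data = np.ones(n_floats, dtype=np.float64)
    start = time.perf_counter()
    total = 0.0
    for _ in range(repeats):
        total += data.sum()  # streams n_floats * 8 bytes per pass
    elapsed = time.perf_counter() - start
    return (n_floats * 8 * repeats) / elapsed / 1e9

print("cache-resident:", effective_gbps(32 * 1024, repeats=20000), "GB/s")      # ~256 KB working set
print("DRAM-resident :", effective_gbps(64 * 1024 * 1024, repeats=10), "GB/s")  # ~512 MB working set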
 
A8X better be efficient running inside a 6 mm profile. And heating to Shield tablet levels of burn probably wouldn't go over so well in a device as mainstream as an iPad.

The iPad Air has a peak skin temperature under load of 42 degrees Celsius (which is two degrees Celsius hotter than the peak temperature of the iPad 4), while the Shield tablet has a peak skin temperature under load of 45 degrees Celsius (see the notebookcheck reviews). So the difference is not as big as you think.

Appears to me Apple may have done it again: push the envelope in form factor yet still take the crown in performance.

Let's not get carried away here. In comparison to Apple's iPad Air 2, Google's Nexus 9 should have a higher 3DMark Ice Storm Unlimited Graphics score, a higher CPU Physics score, and a higher Geekbench 3 single-threaded CPU score, while being at most 2 fps behind in GFXBench Manhattan 3.0 Offscreen. So both are very good.
 
There aren't even reviews out yet for either the Nexus 9 or the iPad Air 2, nor are the two directly comparable per se. We've spent enough time poring over GK20A's prospects, and here and now I care more about the technical aspects of the A8X.
 
I suspect the A8X will have a very healthy performance advantage over the A7 regardless of third core usage, on account of doubled L2, lower latency to L3 (size?) and main memory, and improved main memory bandwidth. Even without knowing the amount of L3, Apple beefed up the memory subsystem performance of the A8X vs the A7 considerably. In the real world, this is likely to be more important than in benchmarking.
This is indeed shown by AnandTech's SPECint2000 run: 181.mcf puts a lot of pressure on the memory subsystem (the article wrongly hypothesizes that integer multiplication is the source of the speedup).

Today we are at a point where most benchmark codes you see on hardware websites are "filter like" in that they iterate a lot over very small data sets that mostly fit in core-local cache. If I were to be cynical, and I am, I'd say that this is because those are the only codes that have served their purpose of driving sales well enough.
Having studied some non-benchmark code (e.g., browsers displaying web pages, as opposed to running some JS benchmark), the data cache hit ratio is quite high, much higher than what 181.mcf shows (about 10 misses/kinst vs 100 misses/kinst). So even SPEC shows some distortion (the SPEC CPU 2006 version of mcf has slightly lower D$ misses, but TLB misses are higher).
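
To make the magnitude concrete, here's a simple first-order sketch of what 10 vs 100 D$ misses per thousand instructions does to CPI; the base CPI and the 200-cycle miss penalty are made-up illustrative values, not measurements from any of these chips.

Code:
# First-order model: CPI = base_CPI + misses_per_instruction * miss_penalty.
# Both base_cpi and miss_penalty_cycles are illustrative assumptions.

def cpi(misses_per_kinst, base_cpi=1.0, miss_penalty_cycles=200):
    return base_cpi + (misses_per_kinst / 1000.0) * miss_penalty_cycles

print("browser-like, 10 misses/kinst :", cpi(10))    # 3.0
print("181.mcf-like, 100 misses/kinst:", cpi(100))   # 21.0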
 
Let's not get carried away here. In comparison to Apple's iPad Air 2, Google's Nexus 9 should have a higher 3DMark Ice Storm Unlimited Graphics score, a higher CPU Physics score, and a higher Geekbench 3 single-threaded CPU score, while being at most 2 fps behind in GFXBench Manhattan 3.0 Offscreen. So both are very good.

It will be interesting to see how the Nexus 9's battery life compares as well. Also, do we know what resolution the Nexus 9 runs at... is it 1080p, or higher? That will make a difference in the GFXBench scores.

Isn't the Nexus 9 releasing in a week, with so much still unknown about it?
 
I'd be surprised if it turned out to be an 8-cluster. Although I'm sure it's doable, from the very first announcements of Rogue, the maximum cluster count given in examples, and the maximum IMG have stated as a licensable product, has been 6 clusters. That's not to say that, Apple being Apple, the rules can't change at any time... but it would be a bolt out of the blue.

Regarding that transistor count, you must include a 128-bit memory controller in there too. If the block of memory in the A8 is something related to graphics, it's possible that this has been increased too in the A8X.

Well, I wouldn't have been surprised if Apple had a custom part.. but as per Ailuros, it seems like it is 6 clusters.

Yes, the MC would account for a bit, but it is not too significant (for a 3 billion transistor chip, at least). The A8 carried over the same 4 MB L3 from the A7. It is possible that the A8X increased it further, like they did with the L2.. let's wait for a die shot.
I wonder what the chances are that series 7 will include some licensable products with 8 clusters. Although it starts to become an avenue with diminishing returns: going from 4-6 gets you 50%-ish, going from 6-8 gets you 25%-ish.

series 7 should be initially revealed within the next 2-3 months.

Did I just indirectly start an A9 speculation discussion ? :)

Well that would depend on what configuration a series 7 cluster has. If it is similar to Series 6 then I think the chances should be good. Err going from 6-8 is 33% actually ;)

If past reveals are anything to go by, launch should be at CES.

I think we need a separate thread for that :smile: They should be able to move to 16/14 FF for the A9, and while there would be some performance improvements, there is barely any increase in density (though AFAIK Samsung claims their process is denser than TSMC's). So unless they go for larger dies, the configurations would probably be similar. And with the significant increase in wafer costs for FF, I don't know how much larger they'd be willing to go. Perhaps this time they may have to optimize more for area.

The bigger headache for scaling clusters beyond 6 is bandwidth IMO. If I/O doesn't get increasingly wider on ULP SoCs also, it'll become another bottleneck sooner than later.

True.. though LPDDR4 is almost upon us, so it should help mitigate the problem for another generation or two. Not to mention on-chip caches are getting larger with each generation.
 
You can keep the 8-cluster theory if you want, but at around 450 MHz then. My former layman's math shouldn't be completely worthless.
 