Apple A8 and A8X

I need to teach people how to write benchmarks and how to analyse die shots.
 
I think they want to say it's 6 clusters because that was AnandTech's assumption on the day of the iPhone 6 announcement. The fact that they obviously could not identify 6 similar-looking areas does not seem to have dissuaded them from marking 6 equal-sized blocks and calling it a GX6650!

Is it possible that A8's GPU is actually 6 clusters oriented horizontally in the diagram (with 3 clusters next to each other on the top portion of the GPU and 3 clusters next to each other on the bottom portion of the GPU)? To be honest, the die shot of the GPU area is not particularly clear, especially compared to the CPU area.

Rys, are you allowed to confirm what GPU model and what GPU cluster/core config is used here?
 
http://arstechnica.com/apple/2014/09/iphone-6-and-6-plus-in-deep-with-apples-thinnest-phones/3/

Ars Technica has throttling results from a beta Geekbench build. The A8 can still only sustain max clock speeds for a minute, which is about 20 seconds better than the A7. The A8 then drops to 1.2 GHz on the 6 and 1.15 GHz on the 6 Plus, compared to 1 GHz for the A7, for 10 minutes. Finally, after 30 minutes the A8 in the 6 is at 950 MHz, the 6 Plus is at 1.1 GHz, and the A7 is at 750 MHz. There's some weird recovery behaviour in the iPhone 6 too. So A8 throttling is better than the A7's, but not the straight line Apple showed in their keynote.
 
Rys, are you allowed to confirm what GPU model and what GPU cluster/core config is used here?
I can do some die shot analysis like everyone else.

Just from looking at the zoomed shot, there are 4 repeated large blocks of digital logic with a layout that definitely says "processor" to me (because of the SRAM layouts), and they appear to operate in pairs.

Jason's analysis of the rest of the chip also needs some work.
 
Rys, are you allowed to confirm what GPU model and what GPU cluster/core config is used here?

When someone makes a correct guess, he's allowed to shout BINGO! in front of his screen. He's not allowed to accompany it with a drink however.... :)
 
I can do some die shot analysis like everyone else.

Just from looking at the zoomed shot, there are 4 repeated large blocks of digital logic with a layout that definitely says "processor" to me (because of the SRAM layouts), and they appear to operate in pairs.

Jason's analysis of the rest of the chip also needs some work.

I'm awful with die shots; if someone had told me, though, that the GPU is the bottom-left rectangle, I would also have seen only 4 blocks repeating themselves. On a side note, what is it with hw designers constantly changing the spots in SoCs where they put things? It's quite confusing for poor laymen like me.... :LOL:

Is it possible that A8's GPU is actually 6 clusters oriented horizontally in the diagram (with 3 clusters next to each other on the top portion of the GPU and 3 clusters next to each other on the bottom portion of the GPU)? To be honest, the die shot of the GPU area is not particularly clear, especially compared to the CPU area.

Apart from the above, if it were a 6-cluster 6650 it would probably have had to be clocked even lower than the initial 400MHz I estimated to reach the current scores. I still haven't made up my mind what the GPU is clocked at in the 6 Plus, but it definitely doesn't look like less than 500MHz.

If the GPU block is truly around 18mm2 then it could mean a transistor increase somewhere in the 35-40% region compared to the G6430. The GX6450 has, amongst other things:

* 25% more FP16 ALUs compared to the 6430
* PowerGearing
* ASTC
* other architectural improvements

If you consider that ASTC alone isn't exactly cheap to integrate into hw, it would take a magic wand to also squeeze another 2 whole clusters into that hypothetical area/transistor difference.
 
I can do some die shot analysis like everyone else.

Just from looking at the zoomed shot, there are 4 repeated large blocks of digital logic with a layout that definitely says "processor" to me (because of the SRAM layouts), and they appear to operate in pairs.

Jason's analysis of the rest of the chip also needs some work.

Good info Rys. It appears to me that A8 is using a 4-cluster Series 6XT GPU that has 50% more FP16 ALUs than the 4-cluster Series 6 GPU in A7. This would explain the increased transistor count, the increased size of the GPU relative to the CPU, and the 50% increased GPU performance in A8 vs. A7.
 
G6430 = 192 FP16 SPs
GX6450 = 256 FP16 SPs
---------------------------
Increase = 25%
 
G6430 = 192 FP16 SPs
GX6450 = 256 FP16 SPs
---------------------------
Increase = 25%

You are correct, although that is technically a 33.3% increase. So the "up to 50% GPU perf. improvement" per cluster and per clock for Series 6XT (GX6450) vs Series 6 (G6430) comes from a 33.3% improvement in the number of FP16 ALUs and a >10% improvement in throughput efficiency.
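For what it's worth, the arithmetic behind that, written out (my own back-of-the-envelope combination of the two factors, not an official IMG breakdown):

(256 - 192) / 192 = 64 / 192 ≈ 33.3% more FP16 ALUs
1.333 × 1.10 ≈ 1.47, i.e. roughly the "up to 50%" per-cluster, per-clock figure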

So it is pretty safe to say that A8 is using the 4 cluster Series 6XT GX6450 with a GPU clock operating frequency similar to what was used in A7.
 
Going back to the CPU - I really don't understand why it's so hard to figure out the frequency on SoCs of *any* vendor?

Every modern CPU has a 1-cycle INT32 ADD latency. So just run a loop of, say, 1000 dependent ADDs (or whatever fits in the instruction cache) and there's no way for multi-issue or OoOE to optimise that. So that should give you the exact frequency very easily. Am I missing something? (yes, you might need a debug build to prevent optimisation, or just write the assembly manually - but again that doesn't seem like a huge deal?)
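Something like this is what I have in mind; a rough sketch in C (assumptions on my part: a 1-cycle dependent integer ADD, clock_gettime() as the timer, and an empty inline-asm barrier to stop the compiler collapsing the chain; swap the timer for whatever the target platform actually provides):

#include <stdio.h>
#include <stdint.h>
#include <time.h>

/* Time a long chain of dependent integer adds.
 * If each add has 1-cycle latency, adds per second ~= core clock. */
int main(void)
{
    const uint64_t adds = 400000000ULL;   /* total dependent adds */
    uint64_t x = 0;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < adds; i += 4) {
        /* each add depends on the previous value of x; the empty asm
           keeps the compiler from folding the chain into one expression */
        x += 1; __asm__ volatile("" : "+r"(x));
        x += 1; __asm__ volatile("" : "+r"(x));
        x += 1; __asm__ volatile("" : "+r"(x));
        x += 1; __asm__ volatile("" : "+r"(x));
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("x = %llu, estimated clock ~%.0f MHz\n",
           (unsigned long long)x, adds / secs / 1e6);
    return 0;
}

On a wide out-of-order core the loop bookkeeping should hide behind the dependent chain, so the printed figure ought to track the real clock fairly closely.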
 
You are correct, although that is technically a 33.3% increase. So the "up to 50% GPU perf. improvement" per cluster and per clock for Series 6XT (GX6450) vs Series 6 (G6430) comes from a 33.3% improvement in the number of FP16 ALUs and a >10% improvement in throughput efficiency.

Series6 has 1.5x the FP16 ALUs relative to FP32 ALUs and 6XT has 2.0x. How you want to count percentages is beside the point, but an N% increase in ALUs does NOT necessarily mean an N% increase in performance or efficiency.

T-Rex offscreen:

iPhone 6: 42.9 fps
iPhone 5S: 28.8 fps

Given that T-Rex is mostly alpha test bound, where would you suggest that increase comes from?

So it is pretty safe to say that A8 is using the 4 cluster Series 6XT GX6450 with a GPU clock operating frequency similar to what was used in A7.

For the iPhone 6 I'd estimate 450-500MHz, and for the 6 Plus <10% higher, i.e. 500-550MHz.
 
http://arstechnica.com/apple/2014/09/iphone-6-and-6-plus-in-deep-with-apples-thinnest-phones/3/
Ars Technica has throttling results from a beta Geekbench build. The A8 can still only sustain max clock speeds for a minute, which is about 20 seconds better than the A7. The A8 then drops to 1.2 GHz on the 6 and 1.15 GHz on the 6 Plus, compared to 1 GHz for the A7, for 10 minutes. Finally, after 30 minutes the A8 in the 6 is at 950 MHz, the 6 Plus is at 1.1 GHz, and the A7 is at 750 MHz. There's some weird recovery behaviour in the iPhone 6 too. So A8 throttling is better than the A7's, but not the straight line Apple showed in their keynote.

There are straight lines and there are straight lines.

Apple's graph says sustained performance, not sustained maximum performance. It looks like they have taken a performance metric that the A8 can sustain for 20 mins and compared it to other typical SoCs that they say cannot sustain that same level of performance.

As the AT piece itself says: "We don’t know what kind of workload this slide represents, but the implied message is that the A8 didn’t need to throttle down at all."

I'd argue the slide implies no such thing.

It looks, from the Ars Technica piece, like it can indefinitely (well, at least 30+ mins) sustain 67% of max CPU performance. And in itself that doesn't tell us much in relation to the Apple graph. The AT piece refers to A8 throttling, but the test performed doesn't stress a very significant part of the A8, i.e. the GPU. One imagines sustained MAX performance of the entire A8 would see more overall throttling.

What they have proved is that the A8 throttles less under maximum CPU stress than the A7.


G6430 = 192 FP16 SPs
GX6450 = 256 FP16 SPs
---------------------------
Increase = 25%

What the increase in performance might be is open to debate. However adding 64 ALUs to 192 ALUs is definitely increasing the ALUs by 33% :)
 
Going back to the CPU - I really don't understand why it's so hard to figure out the frequency on SoCs of *any* vendor?

Every modern CPU has a 1-cycle INT32 ADD latency. So just run a loop of, say, 1000 dependent ADDs (or whatever fits in the instruction cache) and there's no way for multi-issue or OoOE to optimise that. So that should give you the exact frequency very easily. Am I missing something? (yes, you might need a debug build to prevent optimisation, or just write the assembly manually - but again that doesn't seem like a huge deal?)

I think this is basically what BogoMIPS does (though BogoMIPS always subtracts by 1), but a lot of CPUs report BogoMIPS results at 2x (or even more) the actual clock rate.
 
@tangey: I think the simple answer is that Apple was trying to show a lack of GPU (rather than CPU) performance throttling over time, but since they didn't spell that out, it was taken the wrong way. Previous iOS products never had any issues in that area either, so it is nothing new per se other than being a new way to market the product vs the competition.
 
What the increase in performance might be is open to debate. However adding 64 ALUs to 192 ALUs is definitely increasing the ALUs by 33% :)

If I had used a darned calculator it would have shown me that 64/192 gives 0.33... but I have to be stubborn enough to calculate without one.... :rolleyes:
 
Not sure I'm allowed to use the image, but over on SemiAccurate there is a Chipworks image with a side-by-side comparison of the A8 and A7. The poster is calling the graphics 4-core.

Update: looks like the guy has just scaled and glued together the current images released by Chipworks of the A7 and A8... nothing new.

http://semiaccurate.com/forums/showthread.php?t=8233&page=3

Interesting that over 50% of the chip is not graphics/CPU/identified cache. That's a lot of real estate taken up by what?
 
Interesting that over 50% of the chip is not graphics/CPU/identified cache. That's a lot of real estate taken up by what?
eMMC controller, USB controllers, PCIe controllers, image signal processor(s), hardware video decoders/encoders, scalers, jpeg units (maybe), PLLs, chip network interconnect, memory interfaces, low power audio blocks, ....

Let me know if I missed something that hasn't been integrated into a SoC. And FYI, that's just an example of what you find in various current SoCs, not what I say is in the A8.
 
eMMC controller, USB controllers, PCIe controllers, image signal processor(s), hardware video decoders/encoders, scalers, jpeg units (maybe), PLLs, chip network interconnect, memory interfaces, low power audio blocks, ....

Let me know if I missed something that hasn't been integrated into a SoC. And FYI, that's just an example of what you find in various current SoCs, not what I say is in the A8.

I'm sure there are many things you missed in that list, but my indirect point is that the non-GPU, non-CPU stuff is taking up an increasingly large proportion of the die.

Most of what you mentioned (obviously not the PCIe controller) would be in the A6X in some form, as an example. And yet the GPU and CPU took up about 60% of that die.
http://cdn.iphonehacks.com/wp-content/uploads/2012/11/a6x-899x1024.jpg

Even that comparison between the A7 and A8 would suggest a slight lessening of the real estate being given to the CPU/GPU (more pronounced on the CPU side).

It looks to me that Apple has only given 40% of the total die area to the GPU & CPU in the A8, with "the rest" taking 60%.

As an aside, the 4MB cache doesn't seem to have shrunk much in absolute measurements
 
I'm sure there are many things you missed in that list, but my indirect point is that the non-GPU, non-CPU stuff is taking up an increasingly large proportion of the die.

Most of what you mentioned (obviously not the PCIe controller) would be in the A6X in some form, as an example. And yet the GPU and CPU took up about 60% of that die.
http://cdn.iphonehacks.com/wp-content/uploads/2012/11/a6x-899x1024.jpg

It looks to me that Apple has only given 40% of the total die area to the GPU & CPU in the A8.
http://forum.beyond3d.com/showpost.php?p=1875389&postcount=136

My previous rough calculation was that the CPU, GPU, and L3 cache now take up only 38% of the A8 (~34 mm2 of an 89 mm2 chip) compared to 52% in the A7 (~53 mm2 of a 102 mm2 chip).

The on-chip secure enclave/secure element seems pretty unique to Apple SoCs, but it would take a lot of transistors devoted to that function to account for the increase in non-CPU/GPU/L3-cache area.

As an aside, the 4MB cache doesn't seem to have shrunk much in absolute measurements
I was estimating the L3 cache shrank to about 60% of its previous area, in line with the CPU shrinking to about 59%. The GPU only shrank to about 70% of its old size, so there's clearly new functionality there.
 
Since the phones are out and about, could someone please run for instance LinPack with different matrix sizes (or anything else really that allows changing the size of the data set), so we can determine the size of the L3?
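If nobody has Linpack handy, a pointer-chase over a growing working set would answer the same question; here's a rough sketch in C (the buffer sizes, access count and clock_gettime() timer are all placeholder choices on my part). The latency step as the array stops fitting in the last cache level should mark the L3 size:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

/* Average access latency of a random pointer chase over a buffer of
 * 'bytes' bytes. The random cycle defeats the prefetchers, so each
 * access pays the latency of whichever level the buffer fits in. */
static double chase_ns(size_t bytes, size_t accesses)
{
    size_t n = bytes / sizeof(void *);
    void **buf = malloc(n * sizeof(void *));
    size_t *idx = malloc(n * sizeof(size_t));

    /* random cyclic permutation linking every slot exactly once */
    for (size_t i = 0; i < n; ++i) idx[i] = i;
    for (size_t i = n - 1; i > 0; --i) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < n; ++i)
        buf[idx[i]] = &buf[idx[(i + 1) % n]];

    void **p = &buf[idx[0]];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < accesses; ++i)
        p = (void **)*p;                 /* fully dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (p == NULL) puts("");             /* keep the chase live */
    free(idx);
    free(buf);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / accesses;
}

int main(void)
{
    for (size_t kb = 16; kb <= 16384; kb *= 2)
        printf("%6zu KB : %5.2f ns/access\n",
               kb, chase_ns(kb * 1024, 10000000));
    return 0;
}

The L1/L2 steps show up the same way, so the sweep doubles as a sanity check that the method works before trusting whatever it says about the L3.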
 