Analysis Concludes Apple Now Using Custom GPU Design

Discussion in 'Mobile Graphics Architectures and IP' started by ltcommander.data, Oct 25, 2016.

  1. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    9,550
    Likes Received:
    4,214
    Well, this may be going over my head, but in my limited experience, while promoting lower precision to higher precision is okay (just add a bunch of zeros to the left), the opposite opens a can of worms.
    If they're ignoring precision tags, what will happen if a wild FP64 appears and they're just defaulting everything to FP32?
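
    For what it's worth, the widening direction really is lossless for floats too (exponent re-biased, mantissa zero-padded), while narrowing rounds and can underflow. A quick numpy sketch of both directions, with numpy's scalar types only standing in for GPU behaviour:

    Code:
    import numpy as np

    x = np.float16(0.1)        # 0.1 isn't exactly representable; stored as ~0.0999755859375
    assert np.float32(x) == x  # widening FP16 -> FP32 is exact, the value is unchanged

    y = np.float32(0.1)        # ~0.100000001490116
    d = np.float16(y)          # narrowing rounds bits away
    print(np.float32(d) == y)  # False: precision was lost on the way down

    z = np.float64(1e-50)      # fine in FP64...
    print(np.float32(z))       # ...but underflows to 0.0 when forced into FP32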
     
  2. tangey

    Veteran

    Joined:
    Jul 28, 2006
    Messages:
    1,406
    Likes Received:
    149
    Location:
    0x5FF6BC
    I see this thread is now an Nvidia thread. You'd assume two guys with a combined Beyond3D post count approaching 9000 would know better, but I guess there mustn't be any Nvidia-related threads.
     
    TomK likes this.
  3. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    9,550
    Likes Received:
    4,214
    tangey likes this.
  4. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    429
    Location:
    Cleveland, OH
    You don't need 2xFP16 SIMD instructions to achieve 2x FP16 throughput in a GPU. They could just have a wavefront length large enough that FP32 ops take twice as many cycles as FP16 ops.
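
    A toy calculation of that mechanism (all numbers hypothetical, picked only to make the arithmetic obvious):

    Code:
    WAVEFRONT = 64    # work items per wavefront (hypothetical)
    FP32_LANES = 16   # physical FP32 ALUs (hypothetical)
    FP16_LANES = 32   # physical FP16 ALUs, the 2:1 ratio of Series 6XT/7XT

    fp32_cycles = WAVEFRONT // FP32_LANES  # 4 cycles to issue one FP32 op wavefront-wide
    fp16_cycles = WAVEFRONT // FP16_LANES  # 2 cycles for the same op at FP16

    print(fp32_cycles / fp16_cycles)       # 2.0: 2x FP16 rate, each work item still scalar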

    It would be very strange for IMG to simultaneously repeatedly refer to Series 6+ as scalar (in explicit contrast to Series 5/5XT) while also repeatedly mentioning the benefits of FP16, if their FP16 implementation were not scalar.

    Of course, by scalar it's understood to mean no more than one lane per "thread"/work item; SIMD is naturally still being used to process many "threads" in parallel.

    I think the G6x00 Series 6 cores don't even have FP16 ALUs.

    G6x30 Series 6 adds the FP16 ALUs at a 1:1 ratio to the FP32 ALUs. But the FP16 ALUs can somehow perform 3 FLOPs, vs. the standard 2-FLOP FMA in the FP32 ALUs, so that's where the 1.5:1 ratio comes from. The actual ops haven't been disclosed as far as I know, but may be (A*B +/- C*D). They wouldn't be the first GPU to support operations like this either.
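
    The FLOP counting behind that ratio, with (A*B +/- C*D) taken as the speculation it is:

    Code:
    fma_flops = 2           # A*B + C: one multiply + one add
    dual_mul_add_flops = 3  # A*B +/- C*D (speculated): two multiplies + one add

    # 1:1 FP16:FP32 unit count, but 3 vs. 2 FLOPs per issued op:
    print(dual_mul_add_flops / fma_flops)  # 1.5, matching the 1.5:1 FLOP ratio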

    Series 6XT and 7XT move to an actual 2:1 FP16 to FP32 ALU ratio, and away from 3 FLOP FP16 ALUs.

    Apple's A7 and A8 were, at least until now, alleged to be using the G6430 and GX6450, i.e. Series 6 w/FP16 and Series 6XT respectively.

    The document was last updated in 2016, so while it doesn't explicitly address Series 6XT, it could easily incorporate the FP16 ALUs from G6x30.
     
  5. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,408
    Likes Received:
    172
    Location:
    Chania
    That's what I meant; it was just illustrated/phrased wrong. Note that, as I've repeatedly said in the past, there's no public benchmark where I could ever find any efficiency difference between a G6200 and a G6230 core; obviously, in order to show any difference the underlying code needs to be optimized for FP16. Sebbi stating, among other things, that they'll start optimising for it is a good sign for wherever those cores get used.

    As noted above, the ratios were for throughput, not physical units.

    The material is dated in any case, and even if they changed anything within the year, they've left a lot of crucial changes out of the document.
     
  6. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    HLSL minimum precision types (10- and 16-bit) are specifically designed to let the compiler decide whether to apply them or not. It is 100% legal to treat all these types as 32-bit types.

    See the documentation here:
    https://msdn.microsoft.com/en-us/library/windows/desktop/hh968108(v=vs.85).aspx
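
    To make the rule concrete, here is a small numpy simulation of min-precision semantics (not actual HLSL; the two dtypes just stand in for the two code paths a driver may legally pick):

    Code:
    import numpy as np

    def shade(x, dtype):
        # Evaluate the same expression at a driver-chosen precision. Under HLSL
        # min-precision rules, min16float guarantees *at least* 16 bits, so both
        # results below are legal outcomes of the same shader.
        x = dtype(x)
        return dtype(x * dtype(0.3)) + dtype(0.1)

    print(shade(0.5, np.float16))  # a true 16-bit path
    print(shade(0.5, np.float32))  # the compiler legally promoted everything to 32-bit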

    64-bit float is a completely different case. It is not a minimum precision type. A shader containing 64-bit types can only run on a GPU that supports them.
     
  7. tangey

    Veteran

    Joined:
    Jul 28, 2006
    Messages:
    1,406
    Likes Received:
    149
    Location:
    0x5FF6BC
    Linley have put up a piece stating their belief that A10 is throttling a lot more than previous Ax chips, citing increased GPU frequency as the fundamental reason.

    Apple Turbocharges PowerVR GPU
    http://www.linleygroup.com/newsletters/newsletter_detail.php?num=5619

    To me it looks poorly researched.

    They assume that the A10 is fundamentally using similar but customised graphics IP compared to the A9, and that most of the GPU performance gain comes from an increase in clock. Apple haven't historically kept the same GPU IP across new Ax generations.

    They cite the poor increase in Futuremark's physics test, and also selective GFXBench tests, as evidence of poor GPU performance/thermal throttling.

    But physics is a CPU test, and Ax chips have always struggled with it. Futuremark put out a PR several years ago explaining why the iPhone 5s didn't improve much over the iPhone 5. It isn't a GPU issue.

    https://www.futuremark.com/pressrel...results-from-the-apple-iphone-5s-and-ipad-air
    "In the Physics test, which measures CPU performance, there is little difference between the iPhone 5s and the iPhone 5"

    Finally, although they do cite some GFXBench tests, they don't mention the tests in that suite that might actually expose throttling, i.e. the sustained-FPS and battery tests. According to the data in the AnandTech review, the A10 in the iPhone 7 does drop from 60fps to 50fps after 5 minutes, but sustains around 50fps until the battery dies. Its terminal fps is 50% higher than the iPhone 6s's.

    The slightly bigger battery also lasts slightly longer. Assuming those last two things roughly cancel out, the overall package appears to be getting significantly more performance in that test from the same input power. Hardly indicative of the A10 having thermal issues relative to previous Ax generations.

    I guess that ultimately, if the chip has higher CPU performance and higher GPU performance, it has the potential to generate more heat, and in fundamentally the same package, throttling of that higher performance has to happen. But Linley's argument is that much of the theoretical improvement isn't being seen, and they blame it on the GPU frequency increase. And throwing Futuremark's physics test into a GPU discussion doesn't seem relevant.

    Thoughts?
     
    #47 tangey, Nov 10, 2016
    Last edited: Nov 10, 2016
    roninja likes this.
  8. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,408
    Likes Received:
    172
    Location:
    Chania
    A frequency of 900MHz? Is there even any sound indication to back up their gut feeling?

    Here are the average results for the iPhone 7 Plus, where it's easy to see how much the GPU throttles on average in Manhattan 3.1 (onscreen): https://gfxbench.com/device.jsp?D=A...al&testgroup=graphics&benchmark=gfx40&var=avg

    As for the frequency, it has been pointed out over and over again that the fillrate tests in the latest Kishonti benchmark suite are an awful indicator of possible GPU frequencies. In the offscreen fillrate test the A10 GPU gets 9752 MTexels/s vs. 6074 MTexels/s for the A9: https://gfxbench.com/compare.jsp?be...S&api2=metal&hwtype2=GPU&hwname2=Apple+A9+GPU

    For the same number of TMUs (12 in both GPUs), the difference is 60%. So where does the author expect the A9 to clock, exactly? And yes, I doubt he has even the slightest clue.
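
    A quick sanity check on what the 900MHz figure would imply, assuming fillrate scaled linearly with clock at a fixed TMU count (the very assumption the fillrate-as-frequency argument rests on):

    Code:
    a10_fill = 9752                 # MTexels/s offscreen (GFXBench, A10)
    a9_fill = 6074                  # MTexels/s (A9), both GPUs with 12 TMUs

    ratio = a10_fill / a9_fill      # ~1.61, the ~60% gap from above
    claimed_a10_mhz = 900           # Linley's figure
    print(claimed_a10_mhz / ratio)  # ~561 MHz implied for the A9's clock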
     