Analysis Concludes Apple Now Using Custom GPU Design

The problem is that IMG has announced only Series8XE (which is low-end GPU IP) so far: https://imgtec.com/powervr/graphics/series8xe/ There hasn't been any announcement of Series8XT, although I recall them mentioning it in public presentations.
Apple is PowerVR's #1 premium customer. Of course they would get the new chip (the big model) first if they wanted it. Apple has called its GPU the "Apple GPU" for as long as I can remember, even when everybody knew it was a pure PowerVR design. Every company that uses semi-custom chips (such as console manufacturers) has some in-house hardware designers. Apple recruiting some GPU people doesn't automatically mean that they are planning to design their own GPU. I personally need a bit more proof to be 100% convinced.
 
Although the article refers to Series 6 and Series 7 similarly (albeit stating that Series 7 has more 16-bit capability than Series 6), the IMG document the author uses to highlight the suggested register-file differences is for Series 6 only. The article itself says it is using some basic compiler and optimisation manuals for Series 6 as the basis of its comparison, and accepts that IMG IP is poorly documented in the public realm. It's probably dangerous, therefore, to extrapolate from those documents how the changes in Series 7 have been implemented from a register-file viewpoint. But I'm just looking at this from a layman's point of view.

IMG Series8XT was mentioned as being available and licensed at the AGM some months ago. It's not, however, mentioned on the website, nor has any information been made public as to how it differs from Series 7. So, at best, the article is drawing comparisons based on IMG technology that is two generations old.

IMG significantly expanded the capability of 16-bit operations in Series 7. One assumes this was at the request of existing/future licensees. Apple is IMG's biggest licensee. It is logical, therefore, to consider that Apple was behind this request.

If 8XT was again designed based on expected demand from licensees (and on what other basis would it be designed?), and it has been demonstrated that Apple's GPU has much enhanced 16-bit capability, is it not reasonably obvious to suggest that 8XT does too?
 
FP16 was important for mobile before desktop and now everyone has it, so it's rather safe to assume PowerVR was tuned for that workload a good while ago.
[Unfortunately I can't say much more since I don't remember whether my knowledge of it comes from working at IMG or from public documents ^^ ; and I don't have time to search]
 
But through consumer Pascal, Nvidia seems to be trying to destroy the chances of FP16 adoption becoming widespread in the PC space.
In Anandtech's GTX 1080 article they said the FP32 units cannot process FP16 and that there's a dedicated FP16 unit for every 64 FP32 units. Unlike previous architectures, where FP16 would just be converted to FP32 so throughput was the same, if you try to run FP16 on any consumer Pascal, performance will tank pretty hard.

I wonder if this will have any consequences for Nvidia's relations with Apple.
 
Apologies for redundancy, but I feel like my earlier questions/comments might have gotten buried in the quote I placed them in.

I still don't understand how the sources in the article support the claims David is making:

1) I can't find anything that states SIMD FP16 capability for Series 6, 6XT or 7. IMG repeatedly refers to the design as scalar (https://imgtec.com/blog/graphics-cores-trying-compare-apples-apples/), and the optimization guide states that this allows unencumbered swizzling.
2) The Series 6 instruction set manual, while pretty blatantly incomplete and with some inconsistencies, does specify that 16-bit sub-register addressing can be performed for source and destination (along with broadcast to a full-register destination) for "supported" operations. There's no list of what these operations are, but rcp is given as an example. These modifiers are specified as free, and are given separately from the .lp modifier that David references in the article (which zeroes out the non-FP16 LSBs of the source and destination, but says nothing about the exponent range). There are also pack and unpack instructions that appear to be able to convert two FP32 values into two FP16 values (in a single register) and vice versa (see the CUDA analog sketched after this list). Nothing about this manual suggests it fully applies to Series 6XT (which is attributed to Apple's A8), much less 7 and beyond.
3) Nothing that I can find in Apple's Metal optimization slide deck suggests that there are free FP16 to FP32 conversions for anything other than texture sampling, rather than between ALU ops.
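For what it's worth, here's a rough CUDA analog of what such pack/unpack operations amount to (this is not IMG's ISA; the kernel and its names are mine, purely for illustration): two FP32 values get packed into a single 32-bit register holding 2xFP16, and unpacked back out.

```
#include <cuda_fp16.h>

// Illustrative analog only: CUDA's packed-half conversion intrinsics used to
// mimic the pack/unpack behaviour described in the Series 6 manual.
__global__ void pack_unpack_demo(const float2* in, float2* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float2 v = in[i];
        __half2 h = __floats2half2_rn(v.x, v.y);  // two FP32 -> one 32-bit 2xFP16 value
        out[i] = __half22float2(h);               // and back to two FP32
    }
}
```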

So based on all of this, how is it apparent that Apple's shader core is doing anything fundamentally different from IMG's, much less to the extent that it must be a completely different design?

Apple's GPU hires have to have been going somewhere, but they'd need a lot of people for driver development (which many of the job titles do suggest), and for higher-level integration into the rest of the SoC as well as validation (which many of the other job titles suggest). There could also be more work in the pipeline that simply hasn't shown up in their GPUs yet, much less starting with the A8.

I get the feeling that there's some level of insider knowledge that everyone else knows or accepts, or I'm just not seeing something bigger.
 
I don't have any insider information about the mobile GPU industry, so I am wondering: why do you believe this is a custom Apple GPU instead of simply PowerVR Series 8? Series 8 was already announced a while ago. Wouldn't it be reasonable to believe Apple is the first to deploy a full-sized Series 8 GPU?
The Apple presentation the article cites covers GPUs from the A8 onward. The timing would seem to be backward: the A8 predates that announcement, so it could hardly be a standard implementation of the most recently announced line.

3) Nothing that I can find in Apple's Metal optimization slide deck suggests that there are free FP16 to FP32 conversions for anything other than texture sampling, rather than between ALU ops.
The slide is terse, but its structure seems to be a hierarchy: a topic, bulleted details about the topic, and potentially a dashed elaboration about a detail.
The optimization for using half for texture, interpolate, and math has a sub-detail about focusing on the output of the texture operation rather than the format. The free-conversion point is not subsidiary to the texture detail, so there's at least the possibility that the two points apply to more than just the texture portion of the optimization line.

The presentation also did not mention a performance pitfall for converting data types outside of texture operations, but claiming that this absence is significant would require knowing the threshold for something being important enough to make the slides dedicated to pitfalls.
 

Another reason why I don't think the slide bullet applies to more than texture sampling is that it says even FP16 to FP32 conversion is supported, suggesting that at least some set of integer-to-FP conversions is available in this context. I can't envision free conversion from int to FP being very desirable outside of texture sampling. If it costs an instruction it's probably not going to be free, so they'd have to be operand modifiers or different opcodes, and that doesn't seem like a worthwhile use of encoding space.

At any rate, even if the slides don't clearly say that Apple doesn't support free FP16 to FP32 ALU conversions (in a way IMG clearly doesn't), that's hardly evidence that they do.
 
According to @Rys, in Series 6 (and on?) each fp16 "unit" is 2xfp16, and those units are separate from the fp32 units. They use the term scalar probably because (and this is just a guess) 1) not all operations could utilize both of the fp16 "flops" within a unit, and 2) unlike amd/intel/etc. they don't have wide vectors (just a collection of "units") and probably wanted to drive that point home.

Wrt nvidia and fp16, they artificially limit it in cuda (don't use the fp32 units and convert) so you buy the top dog (gp100). There's nothing stopping them from promoting fp16 operations to fp32 (and/or doing some conversion). In fact dx/ogles have always allowed promotion; from the dx docs:

https://msdn.microsoft.com/en-us/library/windows/desktop/hh968108(v=vs.85).aspx said:
To use minimum precision in HLSL shader code, declare individual variables with types like min16float (min16float4 for a vector), min16int, min10float, and so on. With these variables, your shader code indicates that it doesn’t require more precision than what the variables indicate. But hardware can ignore the minimum precision indicators and run at full 32-bit precision.
 
Another reason why I don't think the slide bullet applies to more than texture sampling is that it says even FP16 to FP32 conversion is supported.
If I was unclear, my argument was that the texture item and the conversion item are both bullet points, meaning they may be independent of each other in their coverage of the texture/interpolate/math set.
My reading of the conversion claim is that conversions are typically free, with the intimation that FP32 and FP16 conversions are normally expected to be more costly.
 
In Anandtech's GTX 1080 article they said the FP32 units cannot process FP16 and that there's a dedicated FP16 unit for every 64 FP32 units. Unlike previous architectures, where FP16 would just be converted to FP32 so throughput was the same, if you try to run FP16 on any consumer Pascal, performance will tank pretty hard.
fp16 will become more popular in PC games. Double-rate fp16 in PS4 Pro means that AAA game devs and AAA engine devs will optimize portions of their code for fp16. My code will certainly be int16/fp16 optimized in the future.

Fp16 code already benefits many PC GPUs. Intel (Broadwell, Skylake) and AMD (GCN3, GCN4) save register space when fp16 is used -> increased latency hiding -> speed up. Broadwell and Skylake also have double-rate fp16 ALUs, further improving performance. AMD Vega most likely introduces double-rate fp16 ALUs to AMD PC GPUs (as PS4 Pro already has them). It would make no sense to disable fp16 on PC, as everybody except Nvidia gains performance.

If some Nvidia GPUs lose performance from fp16, they will certainly issue a driver to fix the problem (convert fp16 instructions to fp32 on those GPUs affected, possibly by a game/shader specific profile).
 
If some Nvidia GPUs lose performance from fp16, they will certainly issue a driver to fix the problem (convert fp16 instructions to fp32 on those GPUs affected, possibly by a game/shader specific profile).

If that is/was possible, why would they include only a measly number of FP16-specific ALUs?
 
ISA compatibility for cuda.

CUDA specifies that there must be dedicated FP16 ALUs? Consumer Pascal chips are the only Nvidia chips with them, AFAIK. Heck, I doubt GP100 has dedicated FP16 ALUs, since it seems to be using ALUs with the same capabilities as Tegra X1 (2*FP16 throughput).
 
And you don't think there are extra instructions to pack and unpack a 2xfp16 unit (among many other things)? :)
 
If that is/was possible, why would they include only a measly number of FP16-specific ALUs?
min16float is a precision hint in DirectX. The GPU is allowed to use higher precision. If you look at the generated dxasm code, it is exactly the same, except with some precision-modifier tags. It's trivial to ignore these tags when compiling dxasm to IHV-specific microcode.

https://msdn.microsoft.com/en-us/library/windows/desktop/hh968108(v=vs.85).aspx

If GeForce Pascal (consumer) has some fp16 ALUs, they also need register addressing by 16-bit halves. I would assume they are handling 16-bit int processing by splitting a 32-bit int (SIMD style) even if they added separate fp16 ALUs. Somebody needs to do a microbenchmark on Pascal to confirm the performance of fp16/int16 on consumer Pascal. I have a GTX 980, so I can't do it.
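Not a real answer, but here's a minimal sketch of what such a microbenchmark could look like, assuming CUDA and a build for sm_53 or newer (the kernel names, timing scaffold and constants are all mine, purely illustrative): run a long dependent chain of packed __half2 FMAs and compare it against the same chain in FP32.

```
#include <cstdio>
#include <cuda_fp16.h>

// Each thread runs a long dependent chain of packed 2xFP16 FMAs. With this
// many threads in flight the measurement approximates throughput, not latency.
__global__ void fp16x2_fma(__half2* out, int iters)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    __half2 a = __floats2half2_rn(1.0f + tid * 1e-7f, 1.0f);
    __half2 b = __floats2half2_rn(0.5f, 0.5f);
    __half2 c = __floats2half2_rn(0.0f, 0.0f);
    for (int i = 0; i < iters; ++i)
        c = __hfma2(a, b, c);          // native packed fp16 FMA, needs sm_53+
    out[tid] = c;
}

// The same chain in FP32 as a baseline (1 FMA per iteration vs 2 above).
__global__ void fp32_fma(float* out, int iters)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float a = 1.0f + tid * 1e-7f, b = 0.5f, c = 0.0f;
    for (int i = 0; i < iters; ++i)
        c = fmaf(a, b, c);
    out[tid] = c;
}

int main()
{
    const int blocks = 512, threads = 256, iters = 1 << 16;
    __half2* d_h; float* d_f;
    cudaMalloc(&d_h, blocks * threads * sizeof(__half2));
    cudaMalloc(&d_f, blocks * threads * sizeof(float));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    cudaEventRecord(t0);
    fp16x2_fma<<<blocks, threads>>>(d_h, iters);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms16; cudaEventElapsedTime(&ms16, t0, t1);

    cudaEventRecord(t0);
    fp32_fma<<<blocks, threads>>>(d_f, iters);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms32; cudaEventElapsedTime(&ms32, t0, t1);

    // On full-rate 2xFP16 hardware the fp16 kernel should take roughly the
    // same time as the fp32 one; on a 1:64 design it should be far slower.
    printf("fp16x2: %.2f ms  fp32: %.2f ms  time ratio: %.1fx\n",
           ms16, ms32, ms16 / ms32);

    cudaFree(d_h);
    cudaFree(d_f);
    return 0;
}
```

Built with something like nvcc -arch=sm_61 on a consumer Pascal card, the reported time ratio should show directly whether packed fp16 runs at full rate or on the slow path.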
 
Or you could bounce back on topic for a change.

According to @Rys, in Series 6 (and on?) each fp16 "unit" is 2xfp16, and those units are separate from the fp32 units. They use the term scalar probably because (and this is just a guess) 1) not all operations could utilize both of the fp16 "flops" within a unit, and 2) unlike amd/intel/etc. they don't have wide vectors (just a collection of "units") and probably wanted to drive that point home.

It's my understanding that they have SIMD16 where each ALU lane is capable of 2xFP32/clock. While they most likely have dedicated FP16 units, I don't see why they wouldn't be capable of 2xFP16 at all times also. If so, why wouldn't 2xFP32 also be possible under conditionals? What I'm sure of is that you can't mix FP16 with FP32 within a SIMD, but I guess that's almost self-explanatory. However, it's my understanding that the latter occurs because the FP32 and FP16 SPs use the same surrounding logic, making only either/or scenarios possible.

For the record:

G6x00 Series6 cores: 1:1 FP16/FP32
G6x30 Series6 cores: 1.5:1 FP16/FP32
Series6XT/7XT cores: 2:1 FP16/FP32

(7XT Plus increases INT hw)

Their design choice to use dedicated FP16 units probably saves more power than alternative solutions, at a relatively low additional hardware cost.

---------------------------------------------------------------------------------------------------------------------------------

As for that Series6 developer guide, the document is as old as the initial Rogue launch. On another note, if someone wants leads on where the true advantages of Metal currently lie, they should rather look in this direction IMHO: https://gfxbench.com/result.jsp?benchmark=gfx40&test=639&order=score&base=gpu&ff-check-desktop=0
It's not particularly difficult to see what the Driver Overhead 2 low-level test actually does, and it's not a pure GPU-only affair in this case.

Parker seems to be fairly low in that test: https://gfxbench.com/resultdetails.jsp?resultid=hqNQqQ6fR0yzfqdvNKT97w ... one of the reasons why I believe there might be some performance still lurking in NV's Tegra drivers for Parker.

One interesting low-level test would be FP16 throughput per solution, and another would measure power consumption at that same throughput.
 
If GeForce Pascal (consumer) has some fp16 ALUs, they also need register addressing by 16-bit halves. I would assume they are handling 16-bit int processing by splitting a 32-bit int (SIMD style) even if they added separate fp16 ALUs. Somebody needs to do a microbenchmark on Pascal to confirm the performance of fp16/int16 on consumer Pascal. I have a GTX 980, so I can't do it.

Anandtech did, and they claim these consumer Pascal cards cannot "promote" FP16 to FP32 like Maxwell and Kepler before it.

[image: SiSoft FP16 benchmark result]


I'm guessing this conversion willardjuice mentions would maybe have to be done at the software/driver level, which would probably introduce an uncertain amount of latency.
 
Anandtech did, and they claim these consumer Pascal cards cannot "promote" FP16 to FP32 like Maxwell and Kepler before it.

I'm guessing this conversion willardjuice mentions would maybe have to be done at the software/driver level, which would probably introduce an uncertain amount of latency.
They don't need to convert any shaders. Just ignore the precision tags when reading DxASM in their internal shader compiler, and simply use the same code path as their existing GPUs with no native fp16 instructions. Not a problem at all. All GPUs are abstracted behind a driver, so this issue can simply be fixed by a driver patch. I am pretty certain Nvidia will do it immediately when the first big AAA game suffers a performance downgrade because of this.
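As a hedged sketch of what that "same code path" amounts to (CUDA-flavoured, my own toy kernel, nothing from Nvidia's driver): fp16 stays a storage format, and every actual operation goes through the ordinary FP32 path.

```
#include <cuda_fp16.h>

// Toy illustration of promotion: half is used only for storage, while all
// arithmetic runs on the regular FP32 units, just as a driver could lower
// min-precision shader code on hardware without fast native fp16 math.
__global__ void promoted_fp16_madd(const __half* a, const __half* b,
                                   __half* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = __half2float(a[i]);   // widen on load
        float y = __half2float(b[i]);
        float r = fmaf(x, y, 1.0f);     // plain FP32 math
        out[i] = __float2half(r);       // narrow on store
    }
}
```

Since min16float only promises a lower bound on precision, computing at FP32 and narrowing on store is always a legal implementation.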

Update: the Sisoft test runs on CUDA 7.5. It might be that CUDA doesn't allow promotion. DirectX 16-bit types are specifically designed so that promotion is possible: the small float/int types min16float and min16int promise a minimum of 16 bits of precision, but don't guarantee an exact width, and the compiler is allowed to use 32-bit register storage for them. CUDA might require packing.
 