The PC gaming industry moved away from mixed-precision rendering many years ago. Mixed precision was a mess for game developers, who had to maintain custom mixed-precision code paths for different architectures.
There is more than one half-float extension in OGL_ES, and they get used by everyone in the mobile space, without exception, precisely to save power. If you think that NV uses FP32 values for everything, even for INT cases, you're quite mistaken.
For the record, the idea of NVIDIA using dedicated FP64 units in desktop GPUs actually originates from the ULP SoC market, where dedicating specialized units at the cost of added die area in order to save power is quite common.
Power efficiency, power consumption, heat, noise, etc. matter in all areas now, not just mobile. GPU performance and power efficiency are good enough nowadays that there should not be an overwhelming need to render pixels at reduced precision, IMHO.
I could have sworn it was NVIDIA that claimed, up until their Aurora GPU (Tegra4), that FP20 is perfectly sufficient for pixel shader ALUs. What changed so dramatically overnight that FP20 was suddenly rendered insufficient and FP32 across everything became a necessity?
There isn't a single developer I've talked to recently (and no, not just those that deal exclusively with mobile games) who hasn't told me that FP16 is more than just commonplace in the ULP mobile space.
Again, Rogues have FP32 ALUs for wherever FP32 is a necessity, and yes, they could have saved the wee bit of extra die area they spent on FP16 ALUs, but they wouldn't have saved as much power. Last time I enquired about FP64 units, synthesis alone for one of those came to 0.025mm² @ 28LP for 1GHz. Now imagine how much lower that would be for an FP16 unit at roughly half the frequency and under 20SoC.
Also, don't take AMD's and Intel's decisions to add special FP16 instructions too lightly. It's likely that NVIDIA will follow suit in due time, and beyond that it's just a matter of how you lay things out in hardware; as long as you provide the precision each application requests, it is not an issue.
On the former generation Series5 GPU IP you could get from each ALU lane:
2*FP32
2*FP16
Vec3 or Vec4 INT8
There was no performance gain from using FP16 in that case either; what you're arguing about here is purely a matter of hardware implementation. As I said, that does not mean FP16 gets used in spots where higher precision (up to OGL_ES2.x "highp") is a requirement.
Last but not least, there's an ungodly number of Mali200/4x0 GPUs out there with a maximum pixel shader precision of FP16, and they're the lowest common denominator these days.