PowerVR Series 6 now official

The F16 ALUs can't be combined to do higher precision. Hopefully we'll talk about exactly what happens inside each pipe soon; I just couldn't swing that by the powers that be before this stuff needed to be released.

What happens inside a pipe? Minions having a party? :LOL:
 

Thanks :)
 
The F32 and F16 minions party separately in a given cycle.
 
If I remember rightly, Ryan asked me the same question the other day and I was vague in my response, so he was necessarily vague in the article as a result. My fault; in hindsight I could have been clearer so he could be too.
 
Hmm, I wonder what these 4 float16 ops really can do. If Series 5 is any indication, my guess would be that they can't be just any ops (so more like the EFOs, where yes, you can technically get double the flops, but with quite severe limits on register choices and not really independent instructions).
Maybe that's why there's confusion about whether there are 4 fp16 units with 2 ops each or 2 fp16 units with 4 ops each. In any case, some more insight into what these minions can do would be welcome by me too :).
 
Well, here's what the blog post says:
That seems to slightly contradict the diagram (i.e. 2x4 flops rather than 4x2 flops). Also, the issue with sharing resources between FP16 and FP32 is that it's a ~4x difference in multiplier size (like FP32->FP64), not 2x.
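Rough back-of-envelope for the ~4x figure, assuming multiplier area scales roughly with the square of the mantissa width (my own approximation, not anything from the blog post): with mantissa widths of 11, 24 and 53 bits (including the implicit bit), FP16 -> FP32 is (24/11)^2 ≈ 4.8x the multiplier area, and FP32 -> FP64 is (53/24)^2 ≈ 4.9x.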

I'm glad I wasn't the only one confused. The diagram and narrative didn't match for me either: the Series 6 USC diagram clearly shows each FP16 ALU having 3 flops, whilst the Series 6XT USC clearly shows 2 (but the narrative says 4).

I did ask this in the comments section of the blog.

It's OK not wanting to say things, but having a narrative and an associated diagram completely at odds with one another frankly just seems like poor proofreading.
 
I wanted the diagram to match the max "ALU core" count we want to put across for marketing. The text is closer to what actually happens in the pipe. Both add up to the same ops throughput.

One is for marketing, the other is for those that actually care about how the hardware works.
 

I see a lot of veiled comments about feature bloat from Imagination Tech's side, so I take it you believe that perf/watt is superior to Kepler?
 
Oh, and could someone explain the difference between 6200/6230 and 6400/6430? All the announcements essentially just said that the x30 parts are "optimized for performance", but on paper they all look the same...
I thought once upon a time this meant you could reach higher clocks with the x30 parts, but Intel is saying the G6400 in Merrifield reaches the same clock as the G6430 in Moorefield, yet the latter is a good deal faster, so it must be something else. More visibility tests, or what?
 
I can't blame IMG for playing that game given their competition, but it's a little sad that the "SIMD Lane == Core" terminology is now pretty much recognized as the standard.
 
What implications does the inclusion of significant numbers of ALU16s have for GPU compute capability? Are ALU16s usable as ALU32s, or half as usable? Does GPGPU/OpenCL only see the ALU32s? I vaguely understand that the fact they are 16-bit is going to limit their application for maths calculation.

What I am basically asking is whether describing a Rogue core's GPU compute in terms of only its ALU32 count is fair, given that it has far more (albeit less useful) ALU16 cores.
 
We can issue instructions to the F16 pipe via compute APIs just as well as we can with graphics APIs. CL supports half precision floats.
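For completeness, a minimal OpenCL C sketch of what that looks like from the API side (my own illustration, not IMG sample code; cl_khr_fp16 is the standard half-precision extension, and whether the arithmetic actually lands on the F16 pipe is up to the compiler/driver):

Code:
// Enable half-precision arithmetic where the device exposes cl_khr_fp16.
#pragma OPENCL EXTENSION cl_khr_fp16 : enable

// Scale a buffer of half-precision values: storage and arithmetic are both 16-bit.
__kernel void scale_half(__global const half *in,
                         __global half *out,
                         const float gain)
{
    size_t i = get_global_id(0);
    out[i] = in[i] * (half)gain;   // half * half multiply
}

Without the extension you can still store data as half and convert on load/store with vload_half/vstore_half, but the arithmetic then happens at FP32.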
 

The question touches on a subject which is difficult but significant - just how much precision do you need?
Well, it depends on your problem and the algorithms you choose to attack it with. Personally, I gnash my teeth in frustration every time I see the "64-bit FP for scientific computation" trope.

Generalizing broadly based on the cumulative error propagation behaviour, you can group algorithms as convergent, neutral (stochastically accumulated error) and divergent.
If your algorithm is convergent, you don't need more precision than is required to represent your data.
If your algorithm features stochastically accumulated error, then the precision you need depends on the number of iterations you run and the desired numerical precision of your answer; a quick toy illustration of this follows below. (In my field, chemistry, this typically means 32-bit FP is perfectly OK, although 64-bit FP is often used by tradition anyway.)
If your algorithm is divergent, you're in trouble, and you will have to keep a close watch on your code behaviour under all circumstances. Having more precision helps, obviously, but is only a band-aid, and there is nothing really saying that 64, 128 or any other number is going to be enough - ideally you should go back and try to reformulate your problem in order to be able to avoid the problematic algorithm.
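As that toy illustration of the stochastically accumulating case (my own example, nothing GPU-specific): naively summing the same inputs with FP32 and FP64 accumulators shows how the error depends on the iteration count.

Code:
/* Naive summation: the FP32 running sum drifts visibly from the exact
 * n * 0.1 = 1,000,000 once the accumulator gets large, while the FP64
 * accumulator stays close. */
#include <stdio.h>

int main(void)
{
    const int n = 10 * 1000 * 1000;
    float  sum32 = 0.0f;
    double sum64 = 0.0;

    for (int i = 0; i < n; ++i) {
        sum32 += 0.1f;   /* rounding error accumulates in the 24-bit mantissa */
        sum64 += 0.1f;   /* same inputs, wider accumulator */
    }

    printf("FP32 accumulator: %f\n", sum32);
    printf("FP64 accumulator: %f\n", sum64);
    return 0;
}

(Compensated or pairwise summation would of course mitigate this at FP32 too; the point is just that iteration count matters.)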

Anandtech's decision to only count 32-bit FP in their BogoFLOP chart seems strange to me. For the most part the GPU will process graphics, and the bulk of graphics operations seem as if they could be done in 16-bit (limited precision needed, minimal iteration), so why focus on 32-bit performance alone?
 
Two reasons for focusing on FP32 alone:

  1. We've traditionally only focused on FP32 performance in both mobile and desktop.
  2. I honestly didn't have a ton of time to work on this article. I don't have verified FP16 perf data handy for most other architectures, and while I have a pretty good idea of what it should be I didn't want to publish anything I wasn't sure of. And there wasn't enough time to get that data verified on a weekend.
 
Would be nice to analyse the precision mix of current workloads and do a pro-rata across the ALUs to get an idea of FLOPS in "usual" cases...
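Something along those lines could be sketched like this (all numbers below are placeholders I made up, not IMG data), assuming the F16 and F32 pipes issue on separate cycles as described earlier in the thread, so the blended rate is a harmonic rather than arithmetic mix:

Code:
#include <stdio.h>

int main(void)
{
    /* assumed per-clock FLOP rates for one cluster -- placeholders only */
    const double fp32_flops_per_clk = 128.0;
    const double fp16_flops_per_clk = 256.0;

    /* assumed fraction of the shader FLOPs that are happy at FP16 */
    const double fp16_fraction = 0.6;

    /* issue cycles add up, so average the time per FLOP, then invert */
    const double time_per_flop = fp16_fraction / fp16_flops_per_clk
                               + (1.0 - fp16_fraction) / fp32_flops_per_clk;
    const double effective = 1.0 / time_per_flop;

    printf("effective FLOPs/clock for this mix: %.1f\n", effective);
    return 0;
}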
 