PowerVR Series 6 now official

Rys · Feb 24, 2014

Entropy said:
What process node for the beastly GX6650?

Up to the customer but sensibly it's 28nm and lower.

Ailuros · Feb 24, 2014

Rys said:
The F16 ALUs can't be combined to do higher precision. Hopefully we talk about exactly what happens inside each pipe soon, I just couldn't swing that by the powers that be before this stuff needed to be released.

What happens inside a pipe? Minions having a party?

Thanks

Alexko · Feb 24, 2014

Is it possible to co-issue FP16 and FP32 instructions in Series 6 or 6XT?

Rys · Feb 24, 2014

The F32 and F16 minions party separately in a given cycle.

Alexko · Feb 24, 2014

Rys said:
The F32 and F16 minions party separately in a given cycle.

Thanks, I thought so. Tiny mistake in Ryan's article, then.

Rys · Feb 24, 2014

If I remember rightly, Ryan asked me the same question the other day and I was vague in my response, so he was necessarily vague in the article as a result. My fault, in hindsight I could have been clearer so he could be too.

mczak · Feb 24, 2014

Hmm, I wonder what these 4 float16 ops really can do. If Series 5 is any indication, my guess would be that they can't be just any ops (so more like the EFOs, where yes, you can technically get double the flops, but with quite severe limits on register choices and not really independent instructions).
Maybe that's why there's confusion over whether there are 4 fp16 units with 2 ops each or 2 fp16 units with 4 ops each. In any case, some more insight into what these minions can do would be welcome to me too.

tangey · Feb 24, 2014

Arun said:
Well, here's what the blog post says:
That seems to slightly contradict the diagram (i.e. 2x4 flops rather than 4x2 flops). Also the issue with sharing resources between FP16 and FP32 is that it's a ~4x difference for the multiplier's size (like FP32->FP64) not 2x.

I'm glad I wasn't the only one confused. The diagram and narrative didn't match for me either. The Series 6 USC diagram clearly shows each FP16 having 3 flops, whilst the Series 6XT USC clearly shows 2 (but the narrative says 4).

I did ask this in the comments section of the blog.

It's OK not to want to say things, but to have a narrative and an associated diagram completely at odds with one another frankly just seems like poor proofreading.

Rys · Feb 24, 2014

tangey said:
I'm glad I wasn't the only one confused. The diagram and narrative didn't match for me either. The Series 6 USC diagram clearly shows each FP16 having 3 flops, whilst the Series 6XT USC clearly shows 2 (but the narrative says 4).

I did ask this in the comments section of the blog.

It's OK not to want to say things, but to have a narrative and an associated diagram completely at odds with one another frankly just seems like poor proofreading.

I wanted the diagram to match the max "ALU core" count we want to put across for marketing. The text is closer to what actually happens in the pipe. Both add up to the same ops throughput.

One is for marketing, the other is for those that actually care about how the hardware works.

Turbotab · Feb 25, 2014

Rys said:
I wanted the diagram to match the max "ALU core" count we want to put across for marketing. The text is closer to what actually happens in the pipe. Both add up to the same ops throughput.

One is for marketing, the other is for those that actually care about how the hardware works.

I see a lot of veiled comments to feature bloat on Imagination Tech's, so I take it you believe that perf/watt is superior to Kepler?

Rys · Feb 25, 2014

mczak · Feb 25, 2014

Oh, and could someone explain the difference between 6200/6230 and 6400/6430? All the announcements essentially just said that the x30 parts are "optimized for performance", but on paper they all look the same...
I thought once upon a time this meant you could reach higher clocks with the x30 parts, but Intel is saying the G6400 in Merrifield reaches the same clock as the G6430 in Moorefield, yet the latter is a good deal faster, so it must be something else. More visibility tests or what?

Alexko · Feb 25, 2014

I can't blame IMG for playing that game given their competition, but it's a little sad that the "SIMD Lane == Core" terminology is now pretty much recognized as the standard.

Entropy · Feb 25, 2014

Rys said:
One is for marketing, the other is for those that actually care about how the hardware works.

I love that you guys can actually, at least in low profile places like this, tell it like it is.

tangey · Feb 25, 2014

What implications does the inclusion of significant numbers of ALU16s have for GPU compute capability? Are ALU16s usable as ALU32s, or half as usable? Does GPGPU / OpenCL only see ALU32s? I vaguely understand that the fact they are 16-bit is going to limit their application for maths calculations.

What I am basically asking is whether describing a rogue core's GPU compute in terms of only its ALU32 count is fair, given that it has far more (albeit, less useful) ALU16 cores.

Rys · Feb 25, 2014

We can issue instructions to the F16 pipe via compute APIs just as well as we can with graphics APIs. CL supports half precision floats.

Entropy · Feb 25, 2014

tangey said:
What I am basically asking is whether describing a rogue core's GPU compute in terms of only its ALU32 count is fair, given that it has far more (albeit, less useful) ALU16 cores.

The question touches on a subject which is difficult but significant - just how much precision do you need?
Well, it depends on your problem and the algorithms you choose to attack it with. Personally, I gnash my teeth in frustration every time I see the "64-bit FP for scientific computation" trope.

Generalizing broadly based on the cumulative error propagation behaviour, you can group algorithms as convergent, neutral (stochastically accumulated error) and divergent.
If your algorithm is convergent, you don't need more precision than is required to represent your data.
If your algorithm features stochastically accumulated error, then the precision you need depends on the number of iterations you run and the desired numerical precision of your answer. (In my field, chemistry, this typically means 32-bit FP is perfectly OK, although 64-bit FP is often used by tradition anyway.)
If your algorithm is divergent, you're in trouble, and you will have to keep a close watch on your code behaviour under all circumstances. Having more precision helps, obviously, but is only a band-aid, and there is nothing really saying that 64, 128 or any other number is going to be enough - ideally you should go back and try to reformulate your problem in order to be able to avoid the problematic algorithm.
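
As a rough illustration of how iteration count and precision interact, here is a minimal sketch that repeats one addition many times and rounds the running total to half, single or double precision after every step. Everything in it is an assumption made for illustration: the helper names `round_to` and `accumulate`, the step of 0.1 and the count of 10,000 are invented, and the rounding is emulated on the CPU with Python's `struct` formats (`e`, `f`, `d`), not measured on any GPU.

```python
import struct

def round_to(fmt, x):
    """Round a Python float (binary64) to the nearest value of a struct format."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

def accumulate(fmt, step, n):
    """Add `step` to a running total n times, rounding after every addition."""
    step = round_to(fmt, step)      # the step itself is stored at low precision
    total = 0.0
    for _ in range(n):
        total = round_to(fmt, total + step)
    return total

N = 10_000          # illustrative iteration count
STEP = 0.1          # illustrative increment
EXACT = 1000.0      # N * STEP in exact arithmetic

# "<e" = IEEE half, "<f" = IEEE single, "<d" = IEEE double
errors = {
    name: abs(accumulate(fmt, STEP, N) - EXACT)
    for name, fmt in (("fp16", "<e"), ("fp32", "<f"), ("fp64", "<d"))
}
for name, err in errors.items():
    print(f"{name}: absolute error after {N} additions = {err:.3g}")
```

In this toy run the half-precision total stops growing at 256, where neighbouring representable values are 0.25 apart and adding 0.1 rounds straight back down. The same loop with only a few hundred iterations would stay much closer to the exact answer, which is the point about iteration count above.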

AnandTech's decision to only count 32-bit FP in their BogoFLOP chart seems strange to me. For the most part the GPU will process graphics, and the bulk of graphics operations seem as if they could be done in 16-bit (limited precision needed, minimal iteration), so why focus on 32-bit performance alone?

Ryan Smith · Feb 26, 2014

Entropy said:
AnandTech's decision to only count 32-bit FP in their BogoFLOP chart seems strange to me. For the most part the GPU will process graphics, and the bulk of graphics operations seem as if they could be done in 16-bit (limited precision needed, minimal iteration), so why focus on 32-bit performance alone?

Two reasons:

1. We've traditionally only focused on FP32 performance in both mobile and desktop.
2. I honestly didn't have a ton of time to work on this article. I don't have verified FP16 perf data handy for most other architectures, and while I have a pretty good idea of what it should be, I didn't want to publish anything I wasn't sure of. And there wasn't enough time to get that data verified on a weekend.

Rodéric · Feb 26, 2014

Would be nice to analyse current precision workload and do a pro-rata of the ALU to get an idea of FLOPS in "usual" cases...
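
One way such a pro-rata could be sketched, assuming (as stated earlier in the thread) that F16 and F32 work does not overlap within a cycle, is to weight the time spent at each precision and take the resulting blended rate. The function name `effective_rate`, the per-cycle rates and the workload fractions below are all hypothetical placeholders, not figures for any real PowerVR part.

```python
def effective_rate(fp16_rate, fp32_rate, fp16_fraction):
    """Blended ops per cycle when FP16 and FP32 ops cannot share a cycle.

    fp16_fraction is the share of all operations that can run at FP16.
    """
    if not 0.0 <= fp16_fraction <= 1.0:
        raise ValueError("fp16_fraction must be between 0 and 1")
    # Average cycles spent per operation, weighted by the precision mix.
    cycles_per_op = fp16_fraction / fp16_rate + (1.0 - fp16_fraction) / fp32_rate
    return 1.0 / cycles_per_op

# Hypothetical per-cycle rates for a single pipe (placeholders only).
FP16_RATE = 4.0
FP32_RATE = 2.0

for fraction in (0.0, 0.5, 1.0):
    rate = effective_rate(FP16_RATE, FP32_RATE, fraction)
    print(f"{fraction:.0%} FP16 work -> {rate:.2f} ops/cycle")
```

Because the two pipes take turns, the blend is a time-weighted (harmonic) average and not a simple sum of the two peak rates, so a workload that is half FP16 lands well below the FP16 peak.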
