Creative ZMS-20 and ZMS-40

french toast · Jan 7, 2012

darkblu said:
Both units are well-kept trade secrets. One of them is advertised as SIMD, while the other is supposedly vec4 + "complex" scalar op. Also, one can do single- and half-precision floats plus 32-bit ints, the other is strictly single-precision floats (at least in older Adreno 20x parts). But I'm not so sure the ALUs are so interesting on their own, without their respective entourages.

So correct me if I'm wrong, but does that mean that Adreno is a VLIW5 architecture, and that ZMS-xx is more like multiple SIMD units, like NEON?
So if that makes sense, would it be easy for ZiiLabs to include the NEON units of the A9s to make 26 GFLOPS?

french toast · Jan 7, 2012

And by being useful without entourages, do you mean the rest of the makeup of the VLIW5 shader? How many ALUs would a shader comprise?

Sorry for all the nooby questions, but I'm trying to get a better understanding.

mboeller · Jan 8, 2012

french toast said:
Yes, it looks like they have just increased the throughput from 05-08 and, as someone pointed out above, maybe they have included the SIMD/NEON from the Cortex-A9s?

Didn't you read the press release?

58 GFlops StemCell compute power

For me this says clearly that the 58 Gflops are only from the Stemcell core.

french toast · Jan 8, 2012

mboeller said:
Didn't you read the press release?

For me this says clearly that the 58 Gflops are only from the Stemcell core.

I missed that. OK, with that in mind and with the calculations above, it looks like they have added some fixed-function hardware... although those benchmark numbers don't look special to me.

Exophase · Jan 8, 2012

french toast said:
And by being useful without entourages, do you mean the rest of the makeup of the VLIW5 shader? How many ALUs would a shader comprise?

Sorry for all the nooby questions, but I'm trying to get a better understanding.

This is what i.MX51's reference manual says about its shading capabilities:

2 Vector (4 component) and 1 Scalar (single component) ALU instruction per clock
1 Control Flow Instructions (Jumps, loops) per clock

i.MX51 and i.MX53 licensed z430 from AMD, before the line was sold to Qualcomm. This should be the same GPU as Adreno 200.

That implies VLIW4, but a very different kind than what was in AMD's Cypress for instance. That also implies 9 ALUs, with one of them possibly having different capabilities from the other 8.

Less is known about the updates to Adreno 200, except that Adreno 205 is roughly 2x as powerful as 200 and has 4 ALU arrays, Adreno 220 is roughly 5x as powerful and has 8 ALU arrays, and Adreno 225 is 220 with 50% higher clock. Here's a good reference:

http://www.anandtech.com/show/4940/qualcomm-new-snapdragon-s4-msm8960-krait-architecture/3
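
As a rough illustration of the scaling described above, the relative figures can be tabulated. This is only a sketch: the 2x and 5x factors are the approximate estimates from the post, and the Adreno 225 value assumes throughput scales linearly with clock, which ignores bandwidth and other limits.

```python
# Relative shader throughput, normalised to Adreno 200 = 1.0.
# The 2x and 5x factors are the rough estimates quoted in the post.
relative_throughput = {
    "Adreno 200": 1.0,
    "Adreno 205": 2.0,
    "Adreno 220": 5.0,
}

# Assumption: Adreno 225 is an Adreno 220 at a 50% higher clock, and
# throughput scales linearly with clock (memory bandwidth is ignored).
CLOCK_SCALE_225 = 1.5
relative_throughput["Adreno 225"] = (
    relative_throughput["Adreno 220"] * CLOCK_SCALE_225
)

for gpu, factor in relative_throughput.items():
    print(f"{gpu}: ~{factor:.1f}x Adreno 200")
```

Under those assumptions Adreno 225 lands at roughly 7.5x an Adreno 200.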

It looks like the vector-to-scalar ratio changes from 8:1 to 4:1, and the scalar units don't do MADs. No idea what the TMU and ROP ratios are like beyond 1 TMU/ROP for 200, but if I had to take a stab I'd say 2 TMUs for 205 and 4 for 220.

More on topic: I don't really like ZiiLabs' marketing. I strongly doubt that all these Stem Cell processors are actually cores, where I consider a "core" to be capable of running an independent instruction stream. It's probably another big SIMD array like you see in most GPUs.

Deleted member 13524 · Jan 8, 2012

I thought Adreno 200's shader arrangement was the same as Xenos', hence being called "Mini-Xenos" during its development period.

darkblu · Jan 9, 2012

Exophase said:
That implies VLIW4, but a very different kind than what was in AMD's Cypress for instance. That also implies 9 ALUs, with one of them possibly having different capabilities from the other 8.

Shouldn't vec4 + scalar be VLIW5 according to AMD's own naming conventions?

It looks like the vector-to-scalar ratio changes from 8:1 to 4:1, and the scalar units don't do MADs.

Where's the 4:1 ratio coming from? 225 is still rated at 4 MADs per SIMD, AFAICS.
edit: Ah, I assume you were referring to the article correction where they originally claimed 8 MADs per SIMD, which later changed to 4 MADs per SIMD? If so then yes, the ratio had always been 4:1.

Exophase · Jan 9, 2012

darkblu said:
Shouldn't vec4 + scalar be VLIW5 according to AMD's own naming conventions?

Honestly I don't really know, I guess I thought "VLIW5" meant 5 different operations. The term VLIW has to refer to unique instructions, not just SIMD, but I really don't know what arrangement was being used.

But the Freescale documentation does clearly say two 4-way SIMD instructions and one scalar instruction. It isn't clear if the branch can be executed in addition to those other three instructions, so it could be 3-way VLIW or 4-way. Of course, it could just mean a single FMADD but that'd be 2 4-way vector operations, not 2 4-way instructions. So in that case the documentation would be wrong, not my interpretation.

darkblu said:
Where's the 4:1 ratio coming from? 225 is still rated at 4 MADs per SIMD, AFAICS.
edit: Ah, I assume you were referring to the article correction where they originally claimed 8 MADs per SIMD, which later changed to 4 MADs per SIMD? If so then yes, the ratio had always been 4:1.

4:1 if a MAD is considered one operation. 8 to 1 in regard to what the Freescale documentation calls two instructions.
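
The two ratios are the same hardware counted two ways. A minimal sketch, assuming one vec4 MAD unit and one non-MAD scalar unit per SIMD (the names and the `ops_per_mad` parameter are illustrative, not from any vendor documentation):

```python
from fractions import Fraction

VECTOR_LANES = 4  # assumed: one vec4 MAD issued per clock
SCALAR_LANES = 1  # assumed: one scalar op per clock, no MAD on this unit


def vector_to_scalar_ratio(ops_per_mad: int) -> Fraction:
    """Vector ops per scalar op, given how many ops one MAD counts as."""
    return Fraction(VECTOR_LANES * ops_per_mad, SCALAR_LANES)


mad_as_one = vector_to_scalar_ratio(1)  # MAD counted as a single operation
mad_as_two = vector_to_scalar_ratio(2)  # multiply and add counted separately

print(f"MAD = 1 op:  {mad_as_one.numerator}:{mad_as_one.denominator}")
print(f"MAD = 2 ops: {mad_as_two.numerator}:{mad_as_two.denominator}")
```

Counting the multiply and the add separately doubles the vector side only, because the scalar unit is assumed not to do MADs.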

darkblu · Jan 9, 2012

Exophase said:
Honestly I don't really know, I guess I thought "VLIW5" meant 5 different operations. The term VLIW has to refer to unique instructions, not just SIMD, but I really don't know what arrangement was being used.

Ah, naturally, if we took VLIW# to be the op width, Adreno would be VLIW2, Xenos-style. But in terms of functional correspondence to later AMD ALUs, the best match would be VLIW5.

Exophase said:
But the Freescale documentation does clearly say two 4-way SIMD instructions and one scalar instruction. It isn't clear if the branch can be executed in addition to those other three instructions, so it could be 3-way VLIW or 4-way. Of course, it could just mean a single FMADD but that'd be 2 4-way vector operations, not 2 4-way instructions. So in that case the documentation would be wrong, not my interpretation.

Indeed. But they corrected that in i.MX53's RM.

slyd · Jan 9, 2012

But the Freescale documentation does clearly say two 4-way SIMD instructions and one scalar instruction.

Yes that manual is really a bit misleading (they mean 2x2 component which is 1x4 components).
The Adreno 20x all do 8 + 1 FLOPs / Clock and in certain cases can issue 3 different instructions in parallel (2x Vec2 + 1 scalar) as far as I know.

darkblu · Jan 9, 2012

slyd said:
The Adreno 20x all do 8 + 1 FLOPs / Clock and in certain cases can issue 3 different instructions in parallel (2x Vec2 + 1 scalar) as far as I know.

Do you have an example of that?

Exophase · Jan 9, 2012

darkblu said:
Ah, naturally, if we took VLIW# to be the op width, Adreno would be VLIW2, Xenos-style. But in terms of functional correspondence to later AMD ALUs, the best match would be VLIW5.

Oh okay, that makes sense. 5-op VLIW is kind of wide, but there's wider out there. I guess this means one vector op and one scalar OR control flow op.

darkblu said:
Indeed. But they corrected that in i.MX53's RM.

Actually it says the same thing. Do you have word from Freescale that this is a mistake?

darkblu · Jan 9, 2012

Exophase said:
Actually it says the same thing. Do you have word from Freescale that this is a mistake?

Perhaps we're looking at different revisions of the document. Check rev 2 (pg. 575).

french toast · Jan 9, 2012

Thanks for the explanations.

Exophase · Jan 9, 2012

darkblu said:
Perhaps we're looking at different revisions of the document. Check rev 2 (pg. 575).

Ah, okay, I was looking at page 1778 which says in both documents: "2 Vector (4 component) and 1 Scalar (single component) ALU instruction per clock." Looks like they had it right in one place and not the other. This section also says TWO control flow ops per cycle - surely that's not correct either?? What did they do, trade one error for another?

It also says 1 4 component interpolation per cycle, which makes a lot more sense than a 14 component interpolation per cycle as it says in the other section. I take it that this happens in parallel with the projective reciprocal and barycentric coordinate calculations, and that it actually is a dedicated MADD and doesn't take a shader operation.

darkblu · Jan 10, 2012

Exophase said:
Ah, okay, I was looking at page 1778 which says in both documents: "2 Vector (4 component) and 1 Scalar (single component) ALU instruction per clock." Looks like they had it right in one place and not the other.

It seems like they carried over the original section from i.MX51's RM in addition to the new section. Frankly, I was blissfully unaware of that 'carry over'. And I intend to just continue using the new section : )

This section also says TWO control flow ops per cycle - surely that's not correct either?? What did they do, trade one error for another?

No, it's correct ; ) But it's not just jumps and loops.

It also says 1 4 component interpolation per cycle, which makes a lot more sense than a 14 component interpolation per cycle as it says in the other section. I take it that this happens in parallel with the projective reciprocal and barycentric coordinate calculations, and that it actually is a dedicated MADD and doesn't take a shader operation.

Yep, interpolants are free at the rate of one per clock. I.e. if you had a fragment shader that used no interpolants and took 1 clock to execute, adding one interpolant to that would not penalize you.
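
One way to picture "free at the rate of one per clock" is a simple overlap model. This is a sketch under an assumption, not a description of the actual hardware scheduling: interpolation runs fully in parallel with ALU work, so the slower of the two sets the cost. The function name and the behaviour for more than one interpolant per clock are illustrative extrapolations.

```python
def fragment_clocks(alu_clocks: int, interpolants: int) -> int:
    """Estimated clocks per fragment under a simple overlap model.

    Assumption: dedicated hardware delivers one interpolant per clock,
    fully in parallel with ALU work, so the slower of the two wins.
    """
    if alu_clocks < 1 or interpolants < 0:
        raise ValueError("need alu_clocks >= 1 and interpolants >= 0")
    return max(alu_clocks, interpolants)


# The case from the post: a 1-clock shader gains one interpolant.
print(fragment_clocks(alu_clocks=1, interpolants=0))
print(fragment_clocks(alu_clocks=1, interpolants=1))

# Extrapolation: interpolants only cost extra once they outnumber ALU clocks.
print(fragment_clocks(alu_clocks=1, interpolants=3))
print(fragment_clocks(alu_clocks=4, interpolants=3))
```

In this model the first two calls give the same result, which is the "no penalty" case described above.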

french toast said:
Thanks for the explanations.

Not at all ; )
