instructions/operations per clk

Ante P · May 6, 2004

ATi claim that they can execute 6 pixel shader instructions per clock.
nVidia claim they can execute 8 pixel shader operations per clock.

Help me decipher.

Going by Dave's X800 preview, it looked like the X800 can only execute 5 ops per clk.

Operations/instructions? 5 or 6 for X800?

Me is a bit confused...

It's my birthday today so be nice.

Tahir2 · May 6, 2004

My b'day too...happy b'day to us!

NV40 can do about 1/3 more work per clock cycle (ish).
I wanna know why this is: is it specifically that NV40 supports more features and has a higher IPC, or is it that the 'type' of instructions fed into each pipeline by the same application differs?

Apples to apples doesn't seem possible this generation, even though everything looks similar on a high level.

Dave Baumann · May 6, 2004

NVIDIA were counting components - two shaders with 4 components each.

Ostsol · May 7, 2004

I'm kinda confused, too. The section on NVidia's pixel shaders in Dave's review says there are two ALUs per pixel pipeline, the first one being able to execute a texture operation or an arithmetic op. As such it appears that there's a maximum of two Vec4 instructions per clock. There's also the co-issuing capability which will allow one to split a Vec4 into two segments (Vec3 and scalar, or two Vec2s), making for an effective maximum of four instructions per clock.

Everything on ATI's pixel pipeline seems to indicate that the maximum is five, including co-issuing. One tex, and two Vec3 + scalar pairs.

EDIT: Bah, I post too slowly. . . Thanks, Dave. . .

Tahir2 · May 7, 2004

Right, speaking as a relative idiot: no one has stated in a clear-cut, easy-to-understand manner why the IPC on X800 varies so much from the IPC on NV40 in an apples to apples comparison.

People were saying the opposite would be true this generation.

Dave Baumann · May 7, 2004

With co-issue and dual issue NV40 has a 4 instruction capability straight away; they might be able to additionally execute other instructions simultaneously for free (FP16 normalise), so the possibility for 6 may be there.

I don't recall ATI's presentations stating 6 instructions, only 5.

Ante P · May 7, 2004

DaveBaumann said:
With co-issue and dual issue NV40 has a 4 instruction capability straight away; they might be able to additionally execute other instructions simultaneously for free (FP16 normalise), so the possibility for 6 may be there.

I don't recall ATI's presentations stating 6 instructions, only 5.

They claim 49.92 billion instructions per second.
49.92 billion / 16 pipelines / 520 MHz = 6 pixel shader instructions per pipeline per clock (or else it just doesn't add up)
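
The division above can be checked with a quick sketch. It assumes the 16 is the pixel pipeline count and the 520 is the core clock in MHz; the function name is made up for illustration.

```python
# Figures as quoted in the post. Reading 16 as pixel pipelines and 520 as
# the core clock in MHz is an assumption about what the divisors mean.
claimed_instructions_per_second = 49.92e9
pixel_pipelines = 16
core_clock_hz = 520e6

def instructions_per_pipe_per_clock(per_second, pipes, clock_hz):
    """Back out the per-pipeline, per-clock figure implied by a peak claim."""
    return per_second / (pipes * clock_hz)

implied = instructions_per_pipe_per_clock(
    claimed_instructions_per_second, pixel_pipelines, core_clock_hz
)
print(round(implied, 2))  # 6.0
```

The same function works in reverse as a sanity check on any quoted per-second figure, as long as the pipeline count and clock used are the ones the vendor assumed.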

As for the 9800, it's even more confusing: they claim something like 5.3 there.

In any case, if anything I'm more confused.
Someone please clarify: operations, components and instructions. How do I tell them apart?
(Gaah, I hate feeling like an idiot but if I never ask I'll never know.)

The objective is quite simple: I want to calculate the max (theoretical) number of pixel shader instructions per second for these four boards:
5950 Ultra, 9800 XT, 6800 Ultra Extreme and X800 XT Platinum Edition.

DemoCoder · May 7, 2004

The NV40 has two vector ALUs. In 2+2 co-issue mode, they can issue two *different* instructions on each ALU. So technically, it can perform 4 different vec2 ops. In addition, each ALU can perform a "complex" scalar op, yielding 6. Finally, there is the "free NRM16" which gives 7 ALU ops.

If you want to count only "full" vectors, it can do 2 vec4s, 2 complex, and 1 free nrm = 5 ops, I believe.

If you're texturing, it can do 1 vec4, 1 tex, 2 complex, 1 free nrm = 5 also.

Difference is, R420 can do 2 vec4 + texturing at the same time. The ops that are possible are not equivalent.

Technically, you should count vec and scalar/complex separately, not together in a total op count, if you want to get a clearer picture.
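
To make the "count them separately" point concrete, here is a rough sketch that tallies the three NV40 cases from this post by kind of op. The numbers are the ones given above; the `Mix` type and scenario labels are made up for illustration.

```python
from collections import namedtuple

# Per-clock op mix for one NV40 pipeline, using the counts from the post.
# Mix and the scenario labels are illustrative names only.
Mix = namedtuple("Mix", "vector tex complex free_nrm")

scenarios = {
    "2+2 co-issue (vec2 ops)": Mix(vector=4, tex=0, complex=2, free_nrm=1),
    "full vec4 only": Mix(vector=2, tex=0, complex=2, free_nrm=1),
    "vec4 plus texturing": Mix(vector=1, tex=1, complex=2, free_nrm=1),
}

totals = {}
for name, mix in scenarios.items():
    # Show each kind separately, then the headline total.
    totals[name] = sum(mix)
    print(name, dict(mix._asdict()), "total:", totals[name])
```

The totals come out as 7, 5 and 5, but the per-kind breakdown is the more useful part: two mixes with the same total are not doing the same work.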

Dave Baumann · May 7, 2004

Where are those complex instructions coming from?

Ostsol · May 7, 2004

One of NVidia's technical briefs on the NV40 says: "During each cycle, the dual shader units can execute up to four instructions per cycle and up to eight operations per pixel." Those eight operations are one operation per component. Two Vec4s == 8 components; one Vec4 instruction == 4 scalar ops. The instructions can be made on one, two, three, or four components at once -- still sticking to the maximum of four instructions per cycle (two per ALU), of course. It doesn't seem to say anything about a free FP16 nrm. . .
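
A small sketch of the two counting conventions, under the reading above that an "operation" is one component of one instruction. The helper is hypothetical and not taken from any NVIDIA document.

```python
# Each number is how many components (1 to 4) an instruction works on.
# count_issue is a made-up helper for illustration.
def count_issue(widths_per_alu):
    """Return (instructions, component_ops) for one clock.

    widths_per_alu: one list per ALU, holding the component widths of
    the instructions co-issued on that ALU.
    """
    instructions = sum(len(alu) for alu in widths_per_alu)
    component_ops = sum(sum(alu) for alu in widths_per_alu)
    return instructions, component_ops

# Two full Vec4 instructions, one per ALU.
print(count_issue([[4], [4]]))        # (2, 8)
# Co-issue on both ALUs: Vec3 + scalar on one, 2 + 2 on the other.
print(count_issue([[3, 1], [2, 2]]))  # (4, 8)
```

Either way the component count tops out at 8, which is where the "eight operations" figure comes from; only the instruction count changes with co-issue.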

Tridam · May 7, 2004

It's pretty clear that basically R3x0/R420 can do 5 instructions per clock: 1 tex, 1 vec3, 1 scalar, 1 modifier on vec3, 1 modifier on scalar.

It isn't as simple with NV40. Basically, its pipelines can do: 2 instructions (vec2 + vec2, or (vec3 or tex) + scalar) + 2 modifiers (2+2 or 3+1) + 1 NRM FP16 + 2 instructions (2+2 or 3+1) + 2 modifiers (2+2 or 3+1) = 9 instructions.

My results showed that the "first unit" can do MUL/TEX and complex instructions (but no ADD/MAD/DP3) while the "second unit" can do MUL/ADD/MAD/DP3 only.
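
The two breakdowns in this post can be tallied like so. The slot counts are as listed above; the dictionary labels (including "first group" and "second group") are just paraphrased names for the terms in the order they appear, not official unit names.

```python
# Per-clock slots as listed in the post; labels are paraphrased.
r420_slots = {
    "tex": 1,
    "vec3": 1,
    "scalar": 1,
    "modifier on vec3": 1,
    "modifier on scalar": 1,
}
nv40_slots = {
    "first group: co-issued instructions": 2,
    "first group: modifiers": 2,
    "FP16 NRM": 1,
    "second group: co-issued instructions": 2,
    "second group: modifiers": 2,
}

print(sum(r420_slots.values()))  # 5
print(sum(nv40_slots.values()))  # 9
```

Note that these totals count modifiers as instructions, which is why the NV40 figure here is higher than the 4 or 8 quoted elsewhere in the thread.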

Ostsol · May 7, 2004

Any chance that people will post their sources so we can all have a chance to check it out?

Tridam · May 7, 2004

Ostsol said:
Any chance that people will post their sources so we can all have a chance to check it out?

Tests done by myself and phone calls / emails with David Kirk. The official documents don't talk about everything. They just describe a basic/marketing picture of the things that the NV40's pipelines can do.

DemoCoder · May 7, 2004

NRM_PP (FP16 nrm) is a pre-ALU operation on shader unit 1.

As for the complex helper instructions, they are SIN/COS, RSQ, etc. There is conflicting information as to whether a SINCOS or RSQ operation "borrows" a scalar unit or runs in addition to it. One source code example from NV seems to indicate that they do not "borrow", but that could be a documentation error.

Even Nvidia says the FP16 norm is "free"

Tridam · May 7, 2004

DemoCoder said:
NRM_PP (FP16 nrm) is a pre-ALU operation on shader unit 1.

As for the complex helper instructions, they are SIN/COS, RSQ, etc. There is conflicting information as to whether a SINCOS or RSQ operation "borrows" a scalar unit or runs in addition to it. One source code example from NV seems to indicate that they do not "borrow", but that could be a documentation error.

Even Nvidia says the FP16 norm is "free"

Are you talking about THE shader example?

The shader example in the official docs is BS.

Ostsol · May 7, 2004

Tridam said:
Ostsol said:
Any chance that people will post their sources so we can all have a chance to check it out?

Tests done by myself and phone calls / emails with David Kirk. The official documents don't talk about everything. They just describe a basic/marketing picture of the things that the NV40's pipelines can do.

Ah. . . that's what I thought. The first document on CineFX 3.0 really didn't say anything useful.

DemoCoder · May 7, 2004

Yes, "THE" shader example.

Nappe1 · May 7, 2004

Happy Instructions, Ante P and Tahir!

And when it comes to birthdays, does it really matter how many of those you have per clock? There are some other limiting factors that slow things down a bit in real life.

EDIT: umh... it seems that I really need more coffee.

I never should have eaten that much on my lunch break...

Ante P · May 7, 2004

Nappe1 said:
Happy Instructions, Ante P and Tahir!

And when it comes to birthdays, does it really matter how many of those you have per clock? There are some other limiting factors that slow things down a bit in real life.

thanks

hehe the first person to congratulate me was the boss of Creative's Swedish offices... says a lot about my social life, doesn't it

well, as for the instructions per sec, I just wanted to make a nice chart of theoretical max specs for the new boards compared to the old generations, but at the same time I wanted to make sure the specs are comparable between ATi and nVidia

but anyways, I gave up on that idea

Hanners · May 7, 2004

Nappe1 said:
And when it comes to birthdays, does it really matter how many of those you have per clock? There are some other limiting factors that slow things down a bit in real life.

Personally, I've found that I'm totally agerate limited due to my birthdays per clock capabilities.

Makes me envious of the Queen with her dual-issue birthday feature, it must make a big difference.
