NV30 fragment processor test results

KimB · Apr 12, 2003

But why is Nvidia so stuck on int and not fp!!

Well, think about it. Four FP ALUs should take roughly the same number of transistors as eight integer ALUs, so much higher peak performance is possible if some calculations run at integer precision. Depending on what sorts of programs shaders actually need, going this way could mean significantly higher performance.
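A rough way to put numbers on that tradeoff, as a sketch only: the model below assumes integer and FP ALUs issue in parallel, that ops split cleanly into "can be integer" and "must be FP", and it compares against a hypothetical all-FP design with 8 FP ALUs (the same transistor budget under the 4 FP = 8 int estimate above). None of these numbers are measured.

```python
def ops_per_clock(int_fraction, fp_alus, int_alus):
    """Sustained shader ops per clock for a given mix of integer and FP work.

    int_fraction: share of ops (0.0 to 1.0) that can run at integer precision.
    Assumes integer and FP ALUs issue in parallel, and that a design with
    no integer ALUs runs every op on its FP ALUs.
    """
    if int_alus == 0:
        return float(fp_alus)
    fp_time = (1.0 - int_fraction) / fp_alus   # clocks of FP work per shader op
    int_time = int_fraction / int_alus         # clocks of int work per shader op
    return 1.0 / max(fp_time, int_time)        # the slower side sets the pace


if __name__ == "__main__":
    for f in (0.0, 0.25, 0.5, 2 / 3, 1.0):
        mixed = ops_per_clock(f, fp_alus=4, int_alus=8)
        all_fp = ops_per_clock(f, fp_alus=8, int_alus=0)
        print(f"int share {f:.2f}: 4 FP + 8 int = {mixed:5.2f}, 8 FP = {all_fp:5.2f}")
```

In this toy model the mixed design only pulls ahead of the all-FP one when more than half of the ops can be integer, and it tops out at 12 ops per clock when two thirds of them are.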

demalion · Apr 12, 2003

Uttar said:
The NV30 is superior to the R300 if the following is true:

What is "superior"? Offering more features, or more speed? It doesn't seem to do both at the same time.

1. Both INT & FP are used in the same program
2. Few registers are used
3. There's little scalar

1. It is superior if you remove dynamic range and precision data during operations? Isn't that a bit contradictory to the label "superior" and to proposing shader length advantage?
2. With "few registers used", you are either ignoring integer processing (which reintroduces the performance issues and eliminates the assertion that it competes well in speed), or are just repeating 1.
3. "little scalar" precludes claiming quality advantage without drastic performance deficit, and doesn't facilitate claiming performance advantage without the above items being included as well.

It's not THAT hard, now, is it?

You usually make sense, Uttar, but I don't see it here. "not THAT hard" after listing a set of criteria that seems to contradict the premise of "the NV30 is superior to the R300"?

2 is Cg's job.
3 is true in many cases.

It is? What are these cases? Again, I see it ranging from being "ps 1.3" functional and faster, or "ps 2.0 extended" and much slower. Note that the extended features keep being proposed as being significant, but are proposed without discussing R300 instruction and functionality advantages.

1 is the big problem, because DX9 doesn't support INT.

It is also a problem because the results, both measured and theoretical, are inferior to the R300's. 3DMark03, shader benchmarks, and John Carmack's discussions all seem to support this (a 500 MHz part using a custom-tailored path at reduced quality barely edging out a 325 MHz part using a generic path does not establish superiority).
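To make the clock-speed point concrete, here is a small sketch that normalizes a benchmark score by clock. The 500 MHz and 325 MHz figures come from the post; the scores passed in are hypothetical placeholders, not benchmark results.

```python
def per_clock_ratio(score_a, clock_a_mhz, score_b, clock_b_mhz):
    """Work done per clock by part A relative to part B."""
    return (score_a / clock_a_mhz) / (score_b / clock_b_mhz)


if __name__ == "__main__":
    # Hypothetical scores: identical, then part A ahead by 5%.
    for score_a in (1.00, 1.05):
        ratio = per_clock_ratio(score_a, 500, 1.00, 325)
        print(f"score ratio {score_a:.2f} -> per-clock ratio {ratio:.3f}")
```

With equal scores, the 500 MHz part is doing about 65% of the work per clock of the 325 MHz part, so a narrow win at that clock gap still means a sizeable per-clock deficit.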

Where did this jump from "nv30 is competitive when using integer" to "nv30 is superior" come from all of a sudden? The support seems to be predicated on a theoretical situation and ignoring factors outside of that case.

Not even through extensions. I'm sure nVidia would gladly pay them $25M "under the table", or maybe even more, to get full integer support in DX9.1 extensions and the right to use FP16 registers for most operations...

Using fp16, it is still slower than the R300. If it is using intermixed integer ops, and thereby freely dropping the advantages of fp16 in intermixing ops, that performance parity can indeed be somewhat addressed in a realistic workload. However, PS 1.3 functionality is not clearly superior to PS 2.0 at fp24 with a performance lead, nor is "extended" functionality with integer precision (what we just mentioned) clearly superior to PS 2.0 at fp24 with intermittent performance parity, nor "extended functionality" with fp16 precision compared to PS 2.0 at fp24 and significantly slower performance. And all these situations are with a significant clock speed advantage.

It looks to me like a series of tradeoffs where a good case can also be made for the nv30 being inferior, and these are the best case situations for what you seem to be trying to propose as "superior".
Well, unless you are disregarding performance completely and want to discuss fp32 alone (which seems odd in a discussion about the performance offered by the integer pipeline), but that leaves the door open for CPUs to compete. :-?

KimB · Apr 12, 2003

Uttar said:
Another possibility is that the NV30 architecture is fundamentally different from what we think ( and that's what I'd bet on )
32 units, which can do either:
1FP in 8 cycles
1TEX in 4 cycles ( or 8 cycles, if dependent )
1FX ADD in 4 cycles
1FX MUL in 2 cycles
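For reference, the peak rates implied by that list can be worked out directly. This is only a sketch of one reading of it: it assumes each of the 32 units is fully occupied for the stated number of cycles and that all units work independently.

```python
UNITS = 32

# Cycles each unit is occupied per operation, as listed in the quoted post.
CYCLES_PER_OP = {
    "FP": 8,
    "TEX": 4,
    "TEX (dependent)": 8,
    "FX ADD": 4,
    "FX MUL": 2,
}


def peak_ops_per_clock(units, cycles_per_op):
    """Peak ops completed per clock if every unit runs the same op type."""
    return units / cycles_per_op


if __name__ == "__main__":
    for op, cycles in CYCLES_PER_OP.items():
        print(f"{op:16s} {peak_ops_per_clock(UNITS, cycles):5.1f} per clock")
```

Under these assumptions the list works out to 4 FP, 8 TEX (4 if dependent), 8 FX ADD, or 16 FX MUL per clock.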

Except if each unit was totally flexible like this, then it couldn't work on FP and Int together so well.

But we have known for some time that graphics cards are highly-pipelined. This is the only way to reach higher clock speeds, and is also necessary for hiding memory latencies (which is possible with smart caching and prediction). This is nothing new.

KimB · Apr 12, 2003

demalion said:
1. It is superior if you remove dynamic range and precision data during operations? Isn't that a bit contradictory to the label "superior" and to proposing shader length advantage?

Sure, but only for calculations that don't need them. Why run every calculation at full precision if not every calculation needs it? Whether or not the FX is superior then depends on the nature of the shader being calculated (except in DirectX, where Microsoft has screwed nVidia). If you want to point to specific shaders that would be used on significant portions of a game scene and require FP precision throughout for maximum quality, then go ahead. But asking for more precision for the sake of more precision is meaningless.
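A quick way to see which values survive reduced precision, as a sketch: this assumes the 12-bit integer format is fixed point with 10 fractional bits clamped to [-2, 2), which is my understanding of the NV30 fixed format and should be treated as an assumption here. fp16 and fp32 are emulated by round-tripping through Python's struct module; the sample values are arbitrary.

```python
import struct


def to_fx12(x):
    """Round to 12-bit fixed point: 10 fractional bits, clamped to [-2, 2)."""
    q = round(x * 1024)
    q = max(-2048, min(2047, q))
    return q / 1024


def to_fp16(x):
    """Round-trip through IEEE half precision (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round-trip through IEEE single precision (struct format 'f')."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


if __name__ == "__main__":
    # A colour-like value, a small coordinate, and a value outside [-2, 2).
    for value in (0.3, 1.2345, 100.0):
        print(f"{value:9.4f} fx12={to_fx12(value):.6f} "
              f"fp16={to_fp16(value):.6f} fp32={to_fp32(value):.6f}")
```

Values in a colour-like range lose at most about 1/2048 in the fixed format, while anything outside [-2, 2) is simply clamped, which is the kind of calculation that does need the float path.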

Where did this jump from "nv30 is competitive when using integer" to "nv30 is superior" come from all of a sudden? The support seems to be predicated on a theoretical situation and ignoring factors outside of that case.

Rather, it is superior for a select class of shaders. That means that it cannot be said that it is absolutely inferior. Exactly how well it will match up to the R3xx architecture depends hugely upon the application. Since most games will likely use similar shaders, it seems likely that one company made the right decision, and the other made the wrong one. What I don't see in this thread is any evidence of which company that is.

LeStoffer · Apr 12, 2003

MDolenc said:
You don't even have to jump out of NV_fragment_program at all to use ints.

But of course you can't jump from NV_fragment_program to register combiners and back.

MDolenc, thanks a lot for clearing this up. I then assume that using int12 within NV_fragment_program will be as fast as doing the same instructions in NV_register_combiners.

demalion · Apr 12, 2003

Moved to this thread.

demalion · Apr 12, 2003

Hey, Chalnoth and Uttar, why don't we take this to another thread where detailed technical analysis isn't occurring?

EDIT: OK, did. I do think my post is based on this analysis in a factual way, but I don't view the discussion being generated as anything to do with the thread topic (except for things I've already said).
Just click on the link in the post above to see if you agree.

OpenGL guy · Apr 12, 2003

LeStoffer said:
Chalnoth said:

The only problem is, you just need to make use of integer calculations whenever possible for the NV30 architecture to have high performance.
...
Now, this certainly makes the NV30 a little bit harder to program for, but this is probably why nVidia made Cg. With optimized compiling, the vast majority of the quirkiness of the NV30 architecture need not be made visible to the programmer.

Yes, this very heavy dependence on int12 instructions is obviously key to the NV30's shader performance. I would also agree that this is in part why they created Cg, but it seems increasingly clear to me that the hardware architecture guys were too heavily influenced by those within nVidia who primarily wanted a strong NV30GL (Quadro FX) for the professional GL apps.

That market probably loves that the good old register combiners stayed in there, and they should like the fact that they can still work with the NV extensions they are used to. A more capable FP16/32 shader glued on top is just the way a relatively conservative market segment likes it.

A tenuous argument at best. You can emulate all the legacy features with floating point pixel shaders, just like on the R300, and the application need never know.

LeStoffer · Apr 13, 2003

OpenGL guy said:
A tenuous argument at best. You can emulate all the legacy features with floating point pixel shaders, just like on the R300, and the application need never know.

Tenuous, probably yes, but it's still the best I can think of in terms of a reason for why nVidia chose to design the shader as they did. What's your private (not ATI's) take on this BTW?

OpenGL guy · Apr 13, 2003

LeStoffer said:
OpenGL guy said:

A tenuous argument at best. You can emulate all the legacy features with floating point pixel shaders, just like on the R300, and the application need never know.

Tenuous, probably yes, but it's still the best I can think of in terms of a reason for why nVidia chose to design the shader as they did. What's your private (not ATI's) take on this BTW?

My take is that nVidia's driver philosophy (any driver should work on any card) has finally failed them. I mean, it doesn't matter if old driver XX.XX doesn't work on new card Y: that's what driver developers get paid for. Hardware bring-up is a fact of life for a driver engineer, and new hardware will ship with new drivers, so the fact that old drivers don't work is meaningless.

Also, all this unneeded complexity in the chip must make validation very difficult because of all the extra vectors that need to be tested.

Just my opinion.

KimB · Apr 14, 2003

I'm not so sure, Dave. First of all, weren't the NV2x chips only capable of one register combiner op per pixel pipeline per clock? This would give the NV3x twice the integer processing power of the NV2x chips. Even if the NV2x did have 2 register combiner ops per pixel pipeline per clock, the FP portion of the NV3x could easily have picked up the slack.

In short, I don't really see backwards-compatibility being the entire reason.

I think that perhaps the best thing nVidia could have done would have been to go for 8 float ops + 4 int ops (instead of 4 float + 8 int). But, given the problems nVidia had manufacturing the NV30, I seriously doubt this was feasible.

Arun · Apr 14, 2003

Chalnoth said:
I think that perhaps the best thing nVidia could have done would have been to go for 8 float ops + 4 int ops (instead of 4 float + 8 int). But, given the problems nVidia had manufacturing the NV30, I seriously doubt this was feasible.

Yeah, but I'd guess the reason for doing 4 float + 8 int is more because of transistor cost. Then again, nVidia is traditionally VERY good at optimizing their cores for refreshes, which lets them add a second unit of something. Wanna bet the NV35 will have 8 float, and at least 4 int ops/cycle?

Uttar

Xmas · Apr 14, 2003

Chalnoth said:
I'm not so sure, Dave. First of all, weren't the NV2x chips only capable of one register combiner op per pixel pipeline per clock? This would give the NV3x twice the integer processing power of the NV2x chips. Even if the NV2x did have 2 register combiner ops per pixel pipeline per clock, the FP portion of the NV3x could easily have picked up the slack.

NV2x has two register combiner units per pipe, which means it can perform at least 2 and at most 6 integer ops per clock per pipe.
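Putting those per-pipe figures next to the NV3x number from earlier in the thread, as a sketch only: the 4-pipeline count for NV2x and the reading of "8 int" as a whole-chip, per-clock figure for NV3x are assumptions made for illustration, not something established in this thread.

```python
def int_ops_per_clock(pipes, ops_per_pipe):
    """Chip-wide integer ops per clock."""
    return pipes * ops_per_pipe


NV2X_PIPES = 4      # assumption: four pixel pipelines on NV2x
NV3X_INT_OPS = 8    # assumption: the "8 int" figure is for the whole chip

nv2x_min = int_ops_per_clock(NV2X_PIPES, 2)  # two combiner units, one op each
nv2x_max = int_ops_per_clock(NV2X_PIPES, 6)  # upper bound of 6 ops per pipe

if __name__ == "__main__":
    print(f"NV2x: {nv2x_min} to {nv2x_max} integer ops per clock")
    print(f"NV3x: {NV3X_INT_OPS} integer ops per clock")
```

On these assumptions, NV3x at 8 integer ops per clock would match the NV2x minimum per clock rather than double it, before any clock-speed difference is counted.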
