Let me sum up what appears to be the general consensus, based on data pieced together from several sources around the 'net and from this thread:
1) GeForceFX can only write 4 "color values" per clock, irrespective of whether z or stencil values are being read/written as well.
2) GeForceFX can write 8 z/stencil values per clock when no color values are being written. (BTW, is this 16-bit? 32-bit? Floating-point z?)
3) GeForceFX can calculate 4 fp32 shader ops per clock. (Though it might actually be 8 per clock, with bandwidth limitations reducing this to an effective 4.)
4) GeForceFX can calculate 8 fp16 shader ops per clock.
I must say, if all this info ends up being more or less correct, it's all quite disturbing. :? For nVidia to claim "8 pixels/clock rendering pipeline" on their spec sheets would be a travesty.
Most people don't know what 8x1 means anyway, and those who do know enough to check the benchmarks... Compared to the GF4 Ti4800SE, this is really small potatoes, IMO.
Yes and no. As you can see from pretty much ALL of the initial reviews (and even from Carmack's .plan), many of the "unexpected" performance shortcomings of the FX are being blamed on "bad drivers", when the performance is actually readily explained by the hardware implementation. So consumers / readers are given the impression that the FX is severely underperforming relative to its spec, when in reality it's performing much closer to spec than expected.
Everyone is blaming "bad drivers" because everyone is EXPECTING the FX to output 8 "real pixels" (color, not just z) per clock. I wonder where they got that idea.....