Do you believe NV30 fragment shading hardware is capable of

Is NV30 capable of single cycle FP32 in its FS?

  • Yes, with no foreseeable performance penalty

    Votes: 0 0.0%
  • No, only 1 fp16 component per clock

    Votes: 0 0.0%
  • At this moment, any option is viable

    Votes: 0 0.0%

  • Total voters
    284
What do you guys think in light of Carmack's comments and the GeForce FX's ShaderMark benchmark results (the 8500 beats the FX in some tests, which seems to indicate unoptimized drivers)? Are the NV30 fragment-color-shading fp execution units capable of producing an fp32 result/component in 1 clock cycle?

Is it 1 fp32 component per clock and 2 fp16s, or 1 fp16 component per clock and 1/2 an fp32? It seems to me that whether it is integer or floating point, the NV30 is seriously underperforming.
 
Well, except in Quality Aniso it seems that R300 doesn't lose much from two-cycle Trilinear... so two-cycle FP32 probably won't affect NV30 much except in really long fragment programmes.
 
What seems fishy to me is the fact that the NV30's and even NV20's vertex pipeline is capable of single cycle fp32. They both, natively, contained 32-bit fp hardware. What keeps the same from being true in the NV30 fragment pipeline? It just doesn't add up. The units are 32-bit and can be split to 16-bit. It is not as if there are only 4 16-bit floating point units in the pipelines, which would require another pass. To me, it seems higher latency/bandwidth requirements increase NV30's time between computations; however, that does not mean the hardware cannot execute the 32-bit operation in one cycle.
 
Why would the NV30's and even NV20's vertex hardware be capable of single cycle fp32, while the fragment pipeline is not?

Hypothetically: the fragment shader's FP32 units may be more complex than those required by the vertex shaders.

Or perhaps because there are fewer vertex shader units, they can be bigger (and hence actually operate in a single instruction cycle, or be pipelined to support a throughput of 1 per cycle), but the multitude of fragment shader FP units requires that they be smaller, and hence they need multiple cycles to complete their operations, rather than being extra wide single cycle units.

But that's all hypothetical.
 
Luminescent said:
Why would the NV30's and even NV20's vertex hardware be capable of single cycle fp32, while the fragment pipeline is not? It just doesn't add up. To me, it seems higher latency/bandwidth requirements increase NV30's time between computations, but it does not mean the hardware cannot execute the 32-bit operation in one cycle.

The hardware savings might be significant and if Nvidia doesn't think they'll be able to feed the pipeline enough data to process 32bit FP per clock they might as well not try.
 
The hardware savings might be significant and if Nvidia doesn't think they'll be able to feed the pipeline enough data to process 32bit FP per clock they might as well not try.

Exactly.

With the Radeon 9700 having twice the bandwidth per pixel per clock, it makes sense that the 9700 would be designed to handle the higher precision in one cycle.

On the other hand, it stands to reason that 9700 could have built in the hardware to handle 2 fp16 (or fp12?) pixel shader calcs, to increase performance at lower precision.

It's a matter of picking a trade-off. Both companies' approaches appear valid given the bandwidth / pixel ratio of their respective architectures.
 
<speculation mode>Yes, I think the NV30 is capable (full FP/clock) but is crippled on purpose through drivers for performance reasons (vis-a-vis the R300).</spec mode>
 
Well, I just remembered that Nvidia stated the pixel shader alone was capable of 51 gflops (@400MHz). Assuming this is half-float precision, it would mean that with full floats, the NV30 is capable of around 25.5 gflops, or approximately 8 floating-point operations/clock per pipeline. Given that there are 8 virtual pipes with 4 fp units each, that would indicate 4 fmads/clock per pipe. So, according to Nvidia's theoretical numbers, the NV30 is capable of one fp32 component calculation per clock.

I believe all the speculation about the 2 cycle fp32 execution came from Carmack, but he seems to indicate there are great levels of inefficiency in NV30's ARB2 path compiler.
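
To make the arithmetic above easy to check, here is a rough sketch in Python. It takes the 51 gflops and 400 MHz figures from this post at face value and assumes (as the post does) that the quoted peak is for fp16, that fp32 runs at half that rate, and that one fmad counts as 2 floating-point operations. The function and constant names are made up for illustration.

```python
# Back-of-the-envelope check of the quoted peak figure.
# All inputs are the numbers from the post; the fp16/fp32 split and the
# "1 fmad = 2 flops" convention are assumptions, not confirmed specs.
PEAK_GFLOPS_FP16 = 51.0    # Nvidia's quoted pixel shader peak (assumed fp16)
CLOCK_MHZ = 400.0          # clock used in the quoted figure
PIPELINES = 8              # "8 virtual pipes"
FP_UNITS_PER_PIPE = 4      # one fp unit per color component
FLOPS_PER_FMAD = 2         # multiply + add


def per_clock_breakdown(peak_gflops, clock_mhz,
                        pipelines=PIPELINES,
                        units_per_pipe=FP_UNITS_PER_PIPE):
    """Return (flops/clock per pipe, fmads/clock per pipe, fmads/clock per unit)."""
    flops_per_clock = (peak_gflops * 1e9) / (clock_mhz * 1e6)
    flops_per_pipe = flops_per_clock / pipelines
    fmads_per_pipe = flops_per_pipe / FLOPS_PER_FMAD
    fmads_per_unit = fmads_per_pipe / units_per_pipe
    return flops_per_pipe, fmads_per_pipe, fmads_per_unit


# fp32 case: half the assumed fp16 peak
fp32_gflops = PEAK_GFLOPS_FP16 / 2
flops_pipe, fmads_pipe, fmads_unit = per_clock_breakdown(fp32_gflops, CLOCK_MHZ)
print(f"fp32: {fp32_gflops} gflops, {flops_pipe:.2f} flops/clock/pipe, "
      f"{fmads_pipe:.2f} fmads/clock/pipe, {fmads_unit:.2f} fmads/clock/unit")
```

Under those assumptions the result comes out at roughly one fmad per fp unit per clock at fp32, which is all the "theoretical numbers" argument claims. It says nothing about what the drivers or compiler actually achieve.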
 
Lower bandwidth isn't important at all when it comes to how fast you can execute shaders at fp32, since shaders should be more about math than simple texture combining.

Based on what I heard (though this was quite some time ago so it might have changed) GeForce FX should be able to do one fp32 instruction per cycle, but reality might be slightly off from that, so fp16 usually helps. Even so, shader (both vertex and pixel) performance on GeForce FX is probably VERY hard to predict.
 
From the Digit-life article
(http://www.digit-life.com/articles2/gffx/index.html):

The pixel processor of the GeForce FX can execute up to two integer and one floating-point command or two texture access operations per clock, i.e. it acts as a superscalar processor in case of integer operations and reception of sampled texture values from texture units.
Contrary to a vertex processor which always works with the F32 data format, a pixel processor (like in R300 and in NV30) supports three formats: F32, F16 and integer I16 (R300) / I12 (NV30). The latter two formats are not just useful for compatibility with old shaders 1.x but also provide speed gains in calculations.

Isn't the I12 format computed in the fragment program processor of the NV30 and not the register combiners (10-bit maximum)? This leads me to believe that fp16 also acquires the superscalar performance benefits of the I12 format.
 
Lower bandwidth isn't important at all when it comes to how fast you can execute shaders at fp32, since shaders should be more about math than simple texture combining.

Well, that all depends on what you're doing with the shaders. In any of the tests we've seen (Doom3, 3D Mark shaders...), has it actually been about complex calcs? Or more like lots of "relatively" simple calcs, not much more complex than traditional texture combining?
 
How ridiculous is the idea that each pixel pipeline has fp16 capabilities, and one pipeline's fp16 processing capabilities stalls when the other is doing a fp32 op?

If feasible, it seems to me the advantage of this is that if the other pipeline isn't doing an fp16 op when this occurs, this stall could conceivably be hidden, whereas if the pixel pipeline stalled itself (taking 2 cycles to do fp32), there would be no opportunity for hiding such a delay.

You could still say that your pipelines were capable of fp32 in one cycle, especially if integer processing capability were not precluded.

Are there any flaws to this concept that I'm missing? Has this been discussed before?
 
How ridiculous is the idea that each pixel pipeline has fp16 capabilities, and one pipeline's fp16 processing capabilities stalls when the other is doing a fp32 op?
I do not believe the pipelines can function independently as you mention.
If a pipeline is working on a certain texture section (2*2) with fp32 precision and the texture contains multiple pixel blocks, each requiring the same level of precision, why would the pipelines vary precision level independently within the clock cycle? Given the lack of flow control in the fragment shaders, it seems more likely the workload would be partitioned equally at a given precision level (either fp16 or fp32).
 
No benchmarks have indicated it can dispatch even 1 FP16 per pipe per clock, much less 2. If it could do 1 FP16 per clock, you'd expect the NV30 path in Doom 3 to be ~50% faster than the 9700 (based on clock speed difference) instead of "slightly" faster. Of course, I suppose Doom 3 could be running into bandwidth constraints on the NV30 path, but that would show just how unbalanced the chip is when you hit bandwidth constraints on what you would imagine would be a shader-execution-speed limited FP16 (or fixed-point 12) execution path.
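
For reference, the "~50%" figure can be sketched from clock speeds alone. The post does not state which clocks it has in mind; the snippet below assumes 500 MHz for the NV30 and 325 MHz for the 9700, and assumes both chips retire the same number of shader ops per clock, which is exactly the point under dispute. The names are made up for illustration.

```python
# Expected speedup if throughput per clock were equal and only the core
# clocks differed. Both clock values are assumptions, not taken from the post.
NV30_CORE_MHZ = 500.0   # assumed NV30 core clock
R300_CORE_MHZ = 325.0   # assumed Radeon 9700 core clock


def clock_speedup_percent(fast_mhz, slow_mhz):
    """Percentage advantage of the faster clock, all else being equal."""
    return (fast_mhz / slow_mhz - 1.0) * 100.0


expected = clock_speedup_percent(NV30_CORE_MHZ, R300_CORE_MHZ)
print(f"Expected clock-only advantage: about {expected:.0f}%")
```

If the NV30 path is only "slightly" faster in practice, the gap between that and the clock-only estimate is what needs explaining, whether by per-clock throughput, bandwidth, or the compiler.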
 
Joe,

What Doom 3 probably does on ARB2 pipeline is either:
5 texture lookups (base, normal map, 2x normalization cube map, exponent lookup) and 5 math instructions.
Or:
2 texture lookups (base, normal map) and 13 math instructions.
Based on his comment about future hardware optimisations I'd say he's doing the second.

demalion,

Pixel pipelines still don't have branching support in GeForce FX, thus on every clock they execute the same instruction, but with different inputs.
 
Antlers, what you say holds true for the ARB2 path (as of now), but how about in general? There is more evidence pointing to 1 fp32 op per color component, per cycle for the NV30. Whether we will ever effectively measure this in the real world or through the ARB2 path is a whole other story.

I guess I kind of misled the thread by stating "in light of Carmack's comments", or at least his comments on the ARB2 path.
 
It would also be a good idea to ask John Carmack if (and how much) he uses *x, *h or *r instructions in NV30 profile ;).
 