Do you believe NV30 fragment shading hardware is capable of

Discussion in 'Architecture and Products' started by Luminescent, Feb 6, 2003.

?

Is NV30 capable of single cycle FP32 in its FS?

  1. Yes, but with a comparatively higher latency overhead than fp16

    100.0%
  2. Yes, with no foreseeable performance penalty

    0 vote(s)
    0.0%
  3. No, only 1 fp16 component per clock

    0 vote(s)
    0.0%
  4. At this moment, any option is viable

    0 vote(s)
    0.0%
  1. Luminescent

    Veteran

    Joined:
    Aug 4, 2002
    Messages:
    1,036
    Likes Received:
    0
    Location:
    Miami, Fl
    What do you guys think in light of Carmack's comments and the GeForce FX's ShaderMark benchmark results (the 8500 beats the FX in some tests, which seems to point to unoptimized drivers)? Are the NV30 fragment-color-shading fp execution units capable of producing an fp32 result/component in 1 clock cycle?

    Is it 1 fp32 component per clock and 2 fp16's, or 1 fp16 component per clock and 1/2 an fp32? It seems to me that whether it is integer or floating point, the NV30 is seriously underperforming.
     
  2. Tagrineth

    Tagrineth SNAKES... ON A PLANE
    Veteran

    Joined:
    Feb 14, 2002
    Messages:
    2,512
    Likes Received:
    9
    Location:
    Sunny (boring) Florida
    Well, except in Quality Aniso it seems that R300 doesn't lose much from two-cycle Trilinear... so two-cycle FP32 probably won't affect NV30 much except in really long fragment programmes.
     
  3. Luminescent

    Veteran

    Joined:
    Aug 4, 2002
    Messages:
    1,036
    Likes Received:
    0
    Location:
    Miami, Fl
    What seems fishy to me is the fact that the NV30's and even NV20's vertex pipelines are capable of single-cycle fp32. They both, natively, contained 32-bit fp hardware. What keeps the same from being true in the NV30 fragment pipeline? It just doesn't add up. The units are 32-bit and can be split to 16-bit. It is not as if there are only 4 16-bit floating point units in the pipelines, which would require another pass. To me, it seems higher latency/bandwidth requirements increase NV30's time in between computations; however, that does not mean the hardware cannot execute the 32-bit operation in one cycle.
     
  4. RussSchultz

    RussSchultz Professional Malcontent
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,855
    Likes Received:
    55
    Location:
    HTTP 404
    Hypothetically: The fragment shader's FP32 units may be more complex than required by the vertex shaders.

    Or perhaps because there are fewer vertex shader units, they can be bigger (and hence actually operate in a single instruction cycle, or be pipelined to support a throughput of 1 per cycle), but the multitude of fragment shader FP units requires that they be smaller, and hence they need to rely on multiple cycles to complete their operations, rather than being extra-wide single-cycle units.

    But that's all hypothetical.
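
The distinction drawn above between a wide pipelined unit and a smaller multi-cycle unit can be sketched with a toy timing model. This is only an illustration under assumed numbers: the function name, the 2-cycle latency, and the batch of 100 operations are hypothetical and do not describe any real NV30 unit.

```python
def cycles_to_finish(num_ops: int, latency: int, initiation_interval: int) -> int:
    """Cycles until the last of num_ops independent operations completes.

    latency: cycles from issue to result for a single operation.
    initiation_interval: cycles between accepting consecutive operations.
    """
    if num_ops == 0:
        return 0
    return latency + (num_ops - 1) * initiation_interval


# Hypothetical wide, fully pipelined unit: accepts a new op every cycle.
pipelined = cycles_to_finish(100, latency=2, initiation_interval=1)

# Hypothetical narrow unit reused for two passes per FP32 op.
iterative = cycles_to_finish(100, latency=2, initiation_interval=2)

print(f"pipelined: {pipelined} cycles, iterative: {iterative} cycles")
```

Both units have the same latency in this sketch; only the throughput differs, which is the property the "single cycle" question in this thread is really about.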
     
  5. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,436
    Likes Received:
    264
    The hardware savings might be significant and if Nvidia doesn't think they'll be able to feed the pipeline enough data to process 32bit FP per clock they might as well not try.
     
  6. Joe DeFuria

    Legend

    Joined:
    Feb 6, 2002
    Messages:
    5,994
    Likes Received:
    70
    Exactly.

    With the Radeon 9700 having twice the bandwidth per pixel per clock, it makes sense that the 9700 would be designed to handle the higher precision in one cycle.

    On the other hand, it stands to reason that 9700 could have built in the hardware to handle 2 fp16 (or fp12?) pixel shader calcs, to increase performance at lower precision.

    It's a matter of picking a trade-off. Both companies' approaches appear valid given the bandwidth / pixel ratio of their respective architectures.
     
  7. Reverend

    Banned

    Joined:
    Jan 31, 2002
    Messages:
    3,266
    Likes Received:
    24
    <speculation mode>Yes, I think the NV30 is capable (full FP/clock) but is crippled on purpose through drivers for performance reasons (vis-a-vis the R300).</spec mode>
     
  8. Joe DeFuria

    Legend

    Joined:
    Feb 6, 2002
    Messages:
    5,994
    Likes Received:
    70
    ??

    I don't follow...what benefit would there be to crippling the performance on purpose?
     
  9. Luminescent

    Veteran

    Joined:
    Aug 4, 2002
    Messages:
    1,036
    Likes Received:
    0
    Location:
    Miami, Fl
    What do you mean, Reverend?
     
  10. LeStoffer

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,253
    Likes Received:
    13
    Location:
    Land of the 25% VAT
    At this moment, any option is viable

    Rev, what are you trying to suggest :?: Please throw me a soft ball...
     
  11. Luminescent

    Veteran

    Joined:
    Aug 4, 2002
    Messages:
    1,036
    Likes Received:
    0
    Location:
    Miami, Fl
    Well, I just remembered that Nvidia stated the pixel shader alone was capable of 51 gflops (@400MHz). Assuming this is half-float precision, it would mean that with full floats, the NV30 is capable of around 25.5 gflops, or approximately 8 float ops/clock per pipeline. Being that there are 8 virtual pipes with 4 fp units, it would indicate 4 fmads/clock. So, according to Nvidia's theoretical numbers, the NV30 is capable of a per-component fp32 calculation per clock.
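
The arithmetic in the paragraph above can be checked with a few lines. This is a back-of-the-envelope sketch, not a statement about the hardware: the 51 gflops figure, the 400 MHz clock, and the 8 pipes come from the post, while counting one fmad as 2 flops and halving the rate for fp32 are assumptions made to reproduce the reasoning.

```python
CLAIMED_GFLOPS = 51.0   # Nvidia's quoted pixel shader figure, per the post
CLOCK_MHZ = 400.0       # clock speed used in the post
PIPES = 8               # "8 virtual pipes", per the post
FLOPS_PER_FMAD = 2      # assumption: one multiply plus one add


def fmads_per_pipe_per_clock(gflops: float, clock_mhz: float, pipes: int) -> float:
    """Convert a peak GFLOPS figure into fmads per pipe per clock."""
    flops_per_clock = gflops * 1e9 / (clock_mhz * 1e6)
    return flops_per_clock / pipes / FLOPS_PER_FMAD


# Assumption from the post: the quoted figure is fp16, and fp32 runs at half of it.
fp32_gflops = CLAIMED_GFLOPS / 2

fp16_fmads = fmads_per_pipe_per_clock(CLAIMED_GFLOPS, CLOCK_MHZ, PIPES)
fp32_fmads = fmads_per_pipe_per_clock(fp32_gflops, CLOCK_MHZ, PIPES)

print(f"fp32 peak: {fp32_gflops} gflops")
print(f"fp16 fmads/pipe/clock: {fp16_fmads:.2f}")
print(f"fp32 fmads/pipe/clock: {fp32_fmads:.2f}")
```

Under these assumptions the fp32 figure comes out just under 4 fmads per pipe per clock, i.e. roughly one per color component.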

    I believe all the speculation about the 2 cycle fp32 execution came from Carmack, but he seems to indicate there are great levels of inefficiency in NV30's ARB2 path compiler.
     
  12. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    690
    Likes Received:
    425
    Location:
    Slovenia
    Lower bandwidth isn't important at all when it comes to how fast you can execute shaders at fp32, since shaders should be more about math than simple texture combining.

    Based on what I heard (though this was quite some time ago so it might have changed) GeForce FX should be able to do one fp32 instruction per cycle, but reality might be slightly off from that, so fp16 usually helps. Even so, shader (both vertex and pixel) performance on GeForce FX is probably VERY hard to predict.
     
  13. Luminescent

    Veteran

    Joined:
    Aug 4, 2002
    Messages:
    1,036
    Likes Received:
    0
    Location:
    Miami, Fl
    From the Digit-life article
    (http://www.digit-life.com/articles2/gffx/index.html):

    Isn't the I12 format computed in the fragment program processor of the NV30 and not the register combiners (10-bit maximum)? This leads me to believe that fp16 also acquires the superscalar performance benefits of the I12 format.
     
  14. Joe DeFuria

    Legend

    Joined:
    Feb 6, 2002
    Messages:
    5,994
    Likes Received:
    70
    Well, that all depends on what you're doing with the shaders. In any of the tests we've seen (Doom3, 3D Mark shaders...), has it actually been about complex calcs? Or more like lots of "relatively" simple calcs, not much more complex than traditional texture combining?
     
  15. demalion

    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    2,024
    Likes Received:
    1
    Location:
    CT
    How ridiculous is the idea that each pixel pipeline has fp16 capabilities, and one pipeline's fp16 processing capabilities stall when the other is doing an fp32 op?

    If feasible, it seems to me the advantage of this is that if the other pipeline isn't doing an fp16 op when this occurs, this stall could conceivably be hidden, whereas if the pixel pipeline stalled itself (taking 2 cycles to do fp32), there would be no opportunity for hiding such a delay.

    You could still say that your pipelines were capable of fp32 in one cycle, especially if integer processing capability were not precluded.

    Are there any flaws to this concept that I'm missing? Has this been discussed before?
     
  16. Luminescent

    Veteran

    Joined:
    Aug 4, 2002
    Messages:
    1,036
    Likes Received:
    0
    Location:
    Miami, Fl
    I do not believe the pipelines can function independently as you mention.
    If a pipeline is working on a certain texture section (2*2) with fp32 precision and the texture contains multiple pixel blocks, each requiring the same level of precision, why would the pipelines vary precision level independently within the clock cycle? Given the lack of flow control in the fragment shaders, it seems more likely the workload would be partitioned equally at a given precision level (either fp16 or fp32).
     
  17. antlers

    Regular

    Joined:
    Aug 14, 2002
    Messages:
    457
    Likes Received:
    0
    No benchmarks have indicated it can dispatch even 1 FP16 per pipe per clock, much less 2. If it could do 1 FP16 per clock, you'd expect the NV30 path in Doom 3 to be ~50% faster than the 9700 (based on clock speed difference) instead of "slightly" faster. Of course, I suppose Doom 3 could be running into bandwidth constraints on the NV30 path, but that would show just how unbalanced the chip is when you hit bandwidth constraints on what you would imagine would be a shader-execution-speed limited FP16 (or fixed-point 12) execution path.
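
The "~50% faster" expectation above follows from the clock ratio alone. A minimal sketch, assuming core clocks of 500 MHz for the NV30 (GeForce FX 5800 Ultra) and 325 MHz for the Radeon 9700 Pro, and the same number of pipes on both chips; neither clock is stated in the post, so treat both as assumptions.

```python
NV30_CLOCK_MHZ = 500.0  # assumption: GeForce FX 5800 Ultra core clock
R300_CLOCK_MHZ = 325.0  # assumption: Radeon 9700 Pro core clock


def expected_speedup(clock_a: float, clock_b: float,
                     ops_per_clock_a: float = 1.0,
                     ops_per_clock_b: float = 1.0) -> float:
    """Fractional speedup of chip A over chip B in a purely shader-bound case."""
    return (clock_a * ops_per_clock_a) / (clock_b * ops_per_clock_b) - 1.0


# One shader op per pipe per clock on both chips.
full_rate = expected_speedup(NV30_CLOCK_MHZ, R300_CLOCK_MHZ)

# Hypothetical case: NV30 needs two clocks per op.
half_rate = expected_speedup(NV30_CLOCK_MHZ, R300_CLOCK_MHZ, ops_per_clock_a=0.5)

print(f"full rate: {full_rate:+.0%}, half rate: {half_rate:+.0%}")
```

A measured result of only "slightly" faster sits between these two bounds, which is consistent with either sub-peak issue rates or a bottleneck outside the shader units.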
     
  18. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    690
    Likes Received:
    425
    Location:
    Slovenia
    Joe,

    What Doom 3 probably does on ARB2 pipeline is either:
    5 texture lookups (base, normal map, 2x normalization cube map, exponent lookup) and 5 math instructions.
    Or:
    2 texture lookups (base, normal map) and 13 math instructions.
    Based on his comment about future hardware optimisations I'd say he's doing the second.
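
To see why the choice between those two shader variants matters for the fp32 question, here is a rough cost sketch. The instruction counts are the ones guessed above; the cost model itself (one cycle per texture lookup, no latency hiding, no dual issue) is a deliberately crude assumption and not a description of how either chip schedules work.

```python
def shader_cycles(tex_lookups: int, math_ops: int,
                  cycles_per_math: int, cycles_per_tex: int = 1) -> int:
    """Total issue cycles per pixel under a simple serial cost model."""
    return tex_lookups * cycles_per_tex + math_ops * cycles_per_math


# (texture lookups, math instructions) for the two guessed variants.
variants = {
    "lookup-heavy": (5, 5),
    "math-heavy": (2, 13),
}

slowdowns = {}
for name, (tex, math_ops) in variants.items():
    one_cycle = shader_cycles(tex, math_ops, cycles_per_math=1)
    two_cycle = shader_cycles(tex, math_ops, cycles_per_math=2)
    slowdowns[name] = two_cycle / one_cycle
    print(f"{name}: {one_cycle} -> {two_cycle} cycles ({slowdowns[name]:.2f}x)")
```

In this model the math-heavy variant is hurt noticeably more by two-cycle math than the lookup-heavy one, so which variant Doom 3 uses changes how much a two-cycle fp32 unit would show up in its numbers.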

    demalion,

    Pixel pipelines still don't have branching support in GeForce FX, thus on every clock they execute the same instruction, but with different inputs.
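
The lockstep behaviour described above can be pictured with a small sketch. It is purely conceptual: the function name and the two-instruction program are hypothetical, and real hardware obviously does not run Python lambdas.

```python
def run_lockstep(program, pixel_inputs):
    """Apply the same instruction sequence to every pixel, one instruction per step."""
    registers = list(pixel_inputs)
    for instruction in program:
        # Every pipeline executes the same instruction on its own input.
        registers = [instruction(value) for value in registers]
    return registers


# Hypothetical two-instruction program: a multiply followed by an add.
program = [lambda x: x * 2.0, lambda x: x + 1.0]

result = run_lockstep(program, [0.0, 0.5, 1.0, 1.5])
print(result)
```

Because there is no per-pixel branch, no pipeline can choose a different instruction (or a different precision) from its neighbours within the same step.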
     
  19. Luminescent

    Veteran

    Joined:
    Aug 4, 2002
    Messages:
    1,036
    Likes Received:
    0
    Location:
    Miami, Fl
    Antlers, what you say holds true for the ARB2 path (as of now), but how about in general? There is more evidence pointing to 1 fp32 op per color component, per cycle for the NV30. Whether we will ever effectively measure this in the real world or through the ARB2 path is a whole other story.

    I guess I kind of misled the thread by stating "in light of Carmack's comments", or at least his comments on the ARB2 path.
     
  20. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    690
    Likes Received:
    425
    Location:
    Slovenia
    It would also be a good idea to ask John Carmack if (and how much) he uses *x, *h or *r instructions in NV30 profile :wink:.
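
For context on those suffixes: in NV_fragment_program the R, H and X instruction variants select fp32, fp16 and 12-bit fixed-point precision respectively. The sketch below only illustrates how much rounding each format introduces for a single value. The helper names are hypothetical, and the fixed-point layout (range [-2, 2) in steps of 1/1024) is an assumption about the fx12 format.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float to IEEE single precision (the *R variants)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_fp16(x: float) -> float:
    """Round a Python float to IEEE half precision (the *H variants)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fx12(x: float) -> float:
    """Quantize to 12-bit fixed point (the *X variants).

    Assumption: range [-2, 2) with a step of 1/1024.
    """
    step = 1.0 / 1024
    clamped = max(-2.0, min(2.0 - step, x))
    return round(clamped / step) * step


value = 0.1
for name, convert in (("fp32", to_fp32), ("fp16", to_fp16), ("fx12", to_fx12)):
    stored = convert(value)
    print(f"{name}: {stored!r} (error {abs(stored - value):.2e})")
```

How many of a shader's instructions use the lower-precision variants would directly affect how much of any fp16 or fixed-point speed advantage a benchmark actually sees.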
     