Do you believe NV30 fragment shading hardware is capable of single-cycle FP32?

Is NV30 capable of single cycle FP32 in its FS?

  • Yes, with no foreseeable performance penalty

    Votes: 0 0.0%
  • No, only 1 fp16 component per clock

    Votes: 0 0.0%
  • At this moment, any option is viable

    Votes: 0 0.0%

  • Total voters
    284
Luminescent said:
How ridiculous is the idea that each pixel pipeline has fp16 capabilities, and one pipeline's fp16 processing capability stalls when the other is doing an fp32 op?
I do not believe the pipelines can function independently as you mention.
If a pipeline is working on a certain texture section (2*2) with fp32 precision, and the texture contains multiple pixel blocks requiring the same level of precision, how could pipelines vary their precision level independently within the clock cycle?

Hmm? Well, a mathematical operation has data, processes it, then outputs data. What I was suggesting was that two units capable of fp16 precision operation be used when operating on fp32 data (perhaps even only some operations, not all). I am suggesting this as opposed to having one unit capable of an operation on fp16 data in one cycle taking two cycles to perform an operation on fp32 data.

I was under the impression data was stored in a commonly accessible area (instruction slot storage space), and this struck me as one thing this might allow.

Given the lack of flow control in the fragment shaders, it seems more likely that the work-load is partitioned equally at the given precision (either fp16 or fp32).

I guess what I'm missing is an understanding of how each pipeline would always be performing all the same operations in each clock cycle. I thought there might be opportunities for idle operational units to be put to use this way. Especially if units were operating on the same data (how unlikely is that?), or in cases of data dependency (how common could this be?), this seemed a simple opportunity for optimization.

The comments so far, to my understanding, seem to preclude the possibility of anything other than 1 fp32 op per cycle per pipeline, since this idea is just another method of implementing the alternative (AFAICS).
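The pair-two-low-precision-units idea has a well-known software analogue: representing one higher-precision value as a pair of lower-precision values (the "two-sum" trick behind double-single arithmetic). A minimal Python/NumPy sketch of that idea, purely illustrative and not a claim about NV30's actual datapath:

```python
import numpy as np

def two_sum(a, b):
    """Knuth's error-free two-sum, entirely in float32.

    Returns (s, e) such that s + e exactly equals a + b: the pair of
    float32 values together carries more precision than either alone,
    loosely analogous to ganging two fp16 units for one wider result.
    """
    s = np.float32(a + b)
    v = np.float32(s - a)
    e = np.float32((a - (s - v)) + (b - v))
    return s, e

a = np.float32(1.0)
b = np.float32(1e-8)          # too small to survive a plain float32 add
s, e = two_sum(a, b)
print(float(s))               # the float32 sum alone loses b entirely
print(float(np.float64(s) + np.float64(e)))  # the pair recovers it
```

A plain float32 add of these two values yields exactly 1.0; the (s, e) pair preserves the lost low-order bits. Whether any GPU of that era actually chained units this way is, of course, exactly what the thread is speculating about.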
 
Luminescent said:
Antlers, what you say holds true for the ARB2 path (as of now), but how about in general? There is more evidence given which points to 1 fp32 op per color component, per cycle, for the NV30. Whether we will ever effectively measure this in the real world or through the ARB2 path is a whole other story.

I guess I kind of misled the thread by stating "in light of Carmack's comments", or at least his comments on the ARB2 path.

What evidence is there? Which of Carmack's comments make you think FP32 performance is going to get faster? AFAIK, Carmack said that Nvidia had given him assurances about the ARB2 path, but he also sounded like he was sure there was an intrinsic performance penalty (compared with the R300's FP24 implementation) with FP32 on the NV30.

Besides the benchmarks, I just don't think the NV30 has enough transistors to run FP32 vector ops in the shader at one per clock per pipe.
 
From http://www.megagames.com/news/html/hardware/nvidiagainsmarketshare-nv302xgf4.shtml :
At a presentation in Seoul, South Korea, called Nvidia Mania Day, to promote the NV30, which will be officially launched at the Comdex trade show during the week of Nov. 18 in Las Vegas, David Kirk, Nvidia's chief scientist, claimed that NV30 will have more than two times the performance of GeForce4, with 51 billion floating point operations per second (51 gigaflops) in the pixel shader alone. The chip has 125 million transistors, which is three times the number on the Pentium 4. Nvidia, through its manufacturing partner Taiwan Semiconductor Manufacturing Co., is using a 0.13-micron manufacturing process to reduce the physical size of the chip and achieve better efficiency. By comparison, the top-end Radeon chip has 110 million transistors, and the GeForce4 Ti 4600 has 63 million.
This seems to support my previous claim:
Assuming this is half-float precision (@400MHz), it would mean that with full floats the NV30 is capable of around 25.5 Gflops, or approximately 8 flops/clock per pipeline. Given that there are 8 virtual pipes with 4 fp units each, it would indicate 4 fmads/clock. So, according to Nvidia's theoretical numbers, the NV30 is capable of a component fp32 calculation per clock.
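The back-of-envelope math above can be reproduced directly, under the same assumptions: the claimed 51 Gflops is fp16, the clock is 400 MHz, there are 8 pipes, and fp32 throughput is half of fp16 (all assumptions from the posts, not measured figures):

```python
# Assumptions taken from the thread, not measured NV30 specs
claimed_fp16_gflops = 51.0   # Nvidia's quoted pixel-shader figure
clock_mhz = 400.0            # clock assumed in the post above
pipes = 8                    # "8 virtual pipes"

fp16_flops_per_clock = claimed_fp16_gflops * 1e9 / (clock_mhz * 1e6)  # 127.5
fp32_flops_per_clock = fp16_flops_per_clock / 2                       # 63.75 (half rate)
per_pipe = fp32_flops_per_clock / pipes                               # ~8 flops/clock/pipe
fmads_per_pipe = per_pipe / 2                                         # an FMAD counts as 2 flops

print(fp16_flops_per_clock, fp32_flops_per_clock, per_pipe, fmads_per_pipe)
```

This lands on roughly 8 fp32 flops, i.e. about 4 fmads, per pipe per clock, which is where the "one fp32 component op per clock" reading of Nvidia's theoretical number comes from.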
 
Funny you should say that

Reverend said:
<speculation mode>Yes, I think the NV30 is capable (full FP/clock) but is crippled on purpose through drivers for performance reasons (vis-a-vis the R300).</spec mode>

Well, without going down the conspiracy-theory route, I was thinking of exactly the same thing. It would be a great hand for Nvidia to play to suddenly unleash the full power of the GFX after the R350 release. I would not have given much credit to this idea, but given the R350's suspected limited performance lead over the GFX, I honestly believe it would be easy for the GFX to reclaim the performance lead. I only hope ATI is genuinely releasing the R350 for higher margins as opposed to capturing the performance crown back. If not, ATI might have spent quite a lot of cash and time in a futile attempt to win mind share and bragging rights, not to mention losing focus on the R400/R500 projects.
I know some people think the GFX is simply a failure, but just looking at the clock speed differences between the GFX and R300 makes my spidey sense tingle. The gap between the two is simply too big, and yet the GFX's performance just so happens to be only slightly faster. What better way to lure ATI into releasing another chip? OK, sorry, I'm now making this sound like a conspiracy theory, but if ATI does release the R350 and the GFX unleashes its full clock speed potential and matches or surpasses the R350, where will that leave ATI if the margins on the R350 are not any better than the R300's?

Then again, the GFX might just be the cold turkey ppl suggest ;)
 
Well, as Carmack just stated:

Apparently, the R300 architecture is a better target for a straightforward assembler / compiler, while the NV30 is twitchy enough to require more serious analysis and scheduling, which is why they expect significant improvements with later drivers.

Besides explaining their need for Cg, this quote together with nVidia dodging the question seem to suggest to me that some odd compromises over this whole FP16/32 thing might have been made, e.g. optimizations of caches, registers etc.
 
Purposefully crippling the drivers is the only scenario I can imagine that yields the performance we are seeing with the potential performance some here envision...

However, that kind of ploy just doesn't happen. The first impression of the card means too much.
 
You have perfect reason to be skeptical, antlers, but this thread only offers speculation (to shed some more light on the possibilities). None of us will really, truly know (blasted Nvidia fp numbers aside) until later in the FX's life cycle, maybe even the next driver release. Personally I sway more toward the first two options, but, then again, option 4 seems to be the most reliable. We can never be 100% sure unless an Nvidia employee/developer answers the question or a credible benchmark indicates this.

I have a feeling Nvidia is preparing to make a strong showing in Futuremark 2003. This may be where they unleash what the FX gravely wants.
 
I have a feeling Nvidia is preparing to make a strong showing in Futuremark 2003. This may be where they unleash what the FX gravely wants.

One would think that should be the case, since 3D Mark 2003 should attempt to push the shading capabilities of these cards. (At least in some tests; the min spec is a DX7 card.) But then, we've been assuming all along that "DX9 shader speed" is where NV30 would have its best opportunity to shine, and yet we haven't seen it materialize. Putting the NV30 in the best possible scenario so far, using NV-specific extensions in GL, it's merely on par with the 9700.

The synthetic DirectX shading performance thus far has been dismal on the FX.
 
At this moment, any option is viable

Possible scenario 1:

Fragment shader calculations have a throughput of 1 calculation/cycle, but a latency of 3 cycles.
This requires scheduling to move dependent instructions far enough away from each other.

Possible scenario 2:

Some complex calculations (rsq, exp, log) take multiple cycles, but other calculations can be executed in parallel with them as long as the output is not reused.

Possible scenario 3:

Instructions are register-fetch limited: if you use multiple "wide" register inputs, it might take extra cycles to fetch the data.

There can be others...
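The first scenario can be made concrete with a toy cycle counter: each instruction issues at 1/clock, but its result only becomes readable a fixed number of cycles later, so a consumer that follows its producer too closely stalls. A hypothetical Python sketch; the 3-cycle latency is the scenario's assumption, not a measured NV30 figure:

```python
LATENCY = 3  # assumed result latency from scenario 1

def cycles(program):
    """program: list of (dest_reg, [source_regs]). Returns total cycles,
    modeling 1/clock issue plus stalls on not-yet-ready source operands."""
    ready = {}   # register -> cycle at which its value becomes readable
    cycle = 0
    for dest, sources in program:
        # stall until every source's result latency has elapsed
        start = max([cycle] + [ready.get(s, 0) for s in sources])
        cycle = start + 1                # issue takes one cycle (throughput)
        ready[dest] = start + LATENCY    # result usable LATENCY cycles later
    return cycle

# Three back-to-back dependent ops stall on every step...
dependent = [("r0", ["r4"]), ("r1", ["r0"]), ("r2", ["r1"])]
# ...while interleaving independent work hides the latency entirely
scheduled = [("r0", ["r4"]), ("r5", ["r6"]), ("r7", ["r8"]),
             ("r1", ["r0"]), ("r9", ["r6"]), ("r10", ["r8"]),
             ("r2", ["r1"])]
print(cycles(dependent), cycles(scheduled))  # both finish in 7 cycles
```

Both sequences take 7 cycles, but the scheduled one retires 7 instructions in that time versus 3, which is exactly the kind of gain a "more serious analysis and scheduling" driver compiler could deliver under this scenario.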
 
No, only 1 fp16 component per clock: 61% [ 22 ]

This forum reveals quite an anti-nvidia bias lately, with the majority publicly stating that nV's high-end pixel pipeline "sucks" 4 cycles for a single 64-bit fp pixel... ;)

Sorry, had to bite...
 
What makes you call the situation anti-nvidia bias instead of reflecting reality bias?
 
BRiT said:
What makes you call the situation <b>anti-nvidia bias</b> instead of reflecting <b>reality</b> bias?
It wasn't really that hard to find the joke in PiNkY's post :D
 