Industrial Light and Magic support GFFX features!

OpenGL guy said:
So when/where was it stated that the GeForce FX could do 1 32-bit FLOP per cycle and 2 16-bit FLOPs per cycle?

I can't guarantee that's the case. It could be the opposite: 1 16-bit FLOP per cycle, or 1 32-bit FLOP every two cycles. But that seems unlikely.
But what I can guarantee is that the GFFX is twice as fast (memory bandwidth aside, of course) when using FP16 as when using FP32.

And I can also guarantee that the GFFX benchmarks we've received so far were run in 32-bit.

Here's a quote from Brian Burke, of nVidia PR:
The GeForce FX is 128-bit throughout the pipeline. The GeForce FX runs 128-bit, 64-bit and 32-bit natively. Going from 128 to 64, for example, will result in a performance doubling. This will not be the case for a GPU that does not run these modes natively. To run 64-bit on one of those GPUs, they will still incur the performance hit associated with 128-bit.

We selected the 32-bit benchmarks because that would give people the best frame of reference, as that has been the standard for some time.
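
To put rough numbers on that doubling claim, here's a quick sketch of the arithmetic (my own illustration, assuming a 128-bit wide pipeline and 4-component RGBA vectors, not anything from nVidia):

```python
# Quick sketch of the doubling claim above (my own arithmetic, not vendor data):
# a pipeline that is 128 bits wide can issue one 4 x FP32 result per clock,
# or split into two 64-bit halves and issue two 4 x FP16 results per clock.

PIPELINE_WIDTH_BITS = 128      # "128-bit throughout the pipeline"
COMPONENTS = 4                 # RGBA

def vector_ops_per_clock(bits_per_component: int) -> int:
    return PIPELINE_WIDTH_BITS // (bits_per_component * COMPONENTS)

print(vector_ops_per_clock(32))   # 1 -> FP32 ("128-bit") rate
print(vector_ops_per_clock(16))   # 2 -> FP16 ("64-bit") rate, i.e. the claimed doubling
print(vector_ops_per_clock(8))    # 4 -> 8-bit integer ("32-bit") rate
```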


Uttar
 
Uttar said:
And I can also guarantee that the GFFX benchmarks we've received so far were run in 32-bit.

Brian Burke is talking about 32-bit color, not 32-bit per component. Is that what you meant?
 
OpenGL guy said:
So when/where was it stated that the GeForce FX could do 1 32-bit FLOP per cycle and 2 16-bit FLOPs per cycle?

The impression stems from the interview here at beyond3d at the GeForce FX launch:

There was talk that FP16 (64-bit floating point rendering) could run twice the speed of FP32 (128-bit floating point rendering), is that the case?

Yes it is. Because we have native support in our hardware for FP16 and FP32. So, every pipeline is wide enough to accommodate the full 128-bit through the entire thing -- in the Vertex Shader, in the Pixel Shader and out to the frame buffer. Because we support 128-bit throughout the entire pipeline we added some extra control line and we can split those 128-bit channels into 64-bit channels. Now, that's only in the shading architecture, so we don't get twice as many pixels, but you get twice as many 64-bit in instructions. Also, if you want to use FP16 you'll have a smaller frame buffer so it has a lower footprint in memory as well.
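
To give a rough sense of the "lower footprint in memory" point from that quote, here's a small sketch with an example resolution of my own choosing:

```python
# Rough framebuffer-footprint arithmetic for the "lower footprint in memory"
# remark (the resolution is just an example of mine).

def framebuffer_bytes(width: int, height: int, bits_per_component: int,
                      components: int = 4) -> int:
    return width * height * components * bits_per_component // 8

w, h = 1024, 768
fp32 = framebuffer_bytes(w, h, 32)   # 128-bit pixels
fp16 = framebuffer_bytes(w, h, 16)   # 64-bit pixels
print(fp32 // 2**20, "MiB vs", fp16 // 2**20, "MiB")   # 12 MiB vs 6 MiB
```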

That's why I, and many others, am still somewhat in the dark about what kind of performance boost, both in theory and in the real world, they are claiming for FP16 over ATI's 'native' FP24.
 
In the Digit-Life FX article, 1 floating-point and 2 integer ops per cycle are mentioned (assuming they are credible; their information seems to coincide with Nvidia's, apart from the 3 vertex units). Given that all the pipelines are based on 32-bit units (Kirk mentioned 32 128-bit units in an ExtremeTech interview), single-cycle execution is only natural. Why would the R300 color shading unit take more than 1 cycle at 24-bit when its FMADs are naturally 24 bits wide and issue 24-bit operands? I could understand a latency overhead, but the actual computations should take no more than 1 cycle. This is aside from the fact that the NV30 can split float operands and execute 2 half-floats in the space of a full float. Lestoffer's Beyond3D quote makes it clear that NV30 allows for twice the execution rate when working with FP16.
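
To make the "two half-floats in the space of a full float" point concrete, here's a small packing demo (purely illustrative; NV30's actual register layout isn't public, and numpy is just standing in for the hardware formats):

```python
# Two IEEE half floats fit in the same 32 bits as one single float; numpy is
# only standing in here for whatever NV30 does internally.
import numpy as np

pair = np.array([1.5, -0.25], dtype=np.float16)   # two FP16 operands
packed = int(pair.view(np.uint32)[0])             # same 4 bytes viewed as one 32-bit word
print(f"packed word: 0x{packed:08X}")             # 0xB4003E00 on little-endian machines

unpacked = np.array([packed], dtype=np.uint32).view(np.float16)
print(unpacked)                                   # [ 1.5  -0.25] recovered
```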

If we read the highlighted material in Lestoffer's Beyond3D quote closely (too lazy to quote it again), the implication is that with 32-bit floats a certain number of pixels can be interpolated, fetched, and shaded per clock (1, according to Digit-Life, with the option of 2 texture address commands per clock at the expense of a shader instruction op). In FP16 mode the same number of pixels are interpolated, fetched, etc. (they execute no differently running FP32 or FP16 code), but twice as many shader instructions can be run in the same amount of time.
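
Put as a back-of-the-envelope model (my framing of the above, with a made-up shader length, not measured data):

```python
# Back-of-the-envelope: pixel rate unchanged, but twice the shader instruction
# issue rate in FP16, so a long (hypothetical) shader takes half the clocks.

def clocks_per_pixel(shader_instructions: int, instructions_per_clock: int) -> int:
    return -(-shader_instructions // instructions_per_clock)   # ceiling division

SHADER_LENGTH = 20                                   # made-up shader length
print("FP32:", clocks_per_pixel(SHADER_LENGTH, 1))   # 20 clocks
print("FP16:", clocks_per_pixel(SHADER_LENGTH, 2))   # 10 clocks
```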

I say this with no authority; I am no professional like OpenGL guy, only an electrical engineering student and enthusiast who spends a good amount of time reading posts and articles on this material.
 
OpenGL guy said:
Brian Burke is talking about 32-bit color not 32-bit per component. Is that what you meant?

Sorry for not being sufficiently clear. I did mean 32-bit color, not 32-bit per component. So it's 8 bits per component.
So, yes, the GFFX benchmarks ran at 32-bit precision, compared to the Radeon 9700 Pro's 96-bit precision. Anything else you want to know? :rolleyes:

The Radeon 9700 Pro *does* run at 96-bit precision by default, right?
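
For reference, the per-component arithmetic behind those figures (my own tally, RGBA assumed):

```python
# The per-component arithmetic behind the figures above (RGBA assumed).
formats = {
    "32-bit color (int8)": 8,     # 4 x 8  = 32
    "64-bit FP16":         16,    # 4 x 16 = 64
    "96-bit FP24 (R300)":  24,    # 4 x 24 = 96
    "128-bit FP32 (NV30)": 32,    # 4 x 32 = 128
}
for name, bits in formats.items():
    print(f"{name:21s} -> {bits:2d} bits per component, {4 * bits:3d} bits per 4-component value")
```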


Uttar
 
Uttar said:
Sorry for not being sufficiently clear. I did mean 32-bit color, not 32-bit per component. So it's 8 bits per component.
So, yes, the GFFX benchmarks ran at 32-bit precision, compared to the Radeon 9700 Pro's 96-bit precision. Anything else you want to know? :rolleyes:

The Radeon 9700 Pro *does* run at 96-bit precision by default, right?

Uttar

But in what part of the pipeline? It is almost certain that the texture filtering is generally not done at 96-bit precision by default on the Radeon 9700. It makes more sense to say that blending and pixel shader instructions are done at the higher precision.

The GeForce FX probably does all shader ops, by default, at 64-bit precision, and definitely does the texture filtering at plain 32-bit precision.
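
Restating the guesses in this post as a quick lookup (these are assumptions, not confirmed specs):

```python
# The stage-by-stage guesses from this post, as a lookup; none of this is
# confirmed spec, it is just the speculation above written down.
precision_guess = {
    ("Radeon 9700", "texture filtering"): "not FP24 by default",
    ("Radeon 9700", "pixel shader math"): "FP24 (96-bit)",
    ("GeForce FX",  "texture filtering"): "8-bit integer (32-bit color)",
    ("GeForce FX",  "default shader ops"): "FP16 (64-bit), probably",
}
for (chip, stage), guess in precision_guess.items():
    print(f"{chip:12s} {stage:18s} -> {guess}")
```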
 