On FP16 and FP32 performance...

OpenGL guy said:
But there's only one texture unit per pixel, so what's the point? Old games use textures, not shaders.

Well you could still execute an 8 color op shader in as little as 2 cycles. But the point would be, if I didn't need the increased precision, then even in a DX9 title, I could switch on 8-bit precision and increase my performance further.


When you write a C program, do you always use doubles instead of floats, or 64-bit long words instead of 32-bit? No: you pick the precision you need and no more, if you can get away with it and it yields higher performance.


If I wrote a 100-op shader, but knew that I didn't need 16-bit FP and could deal with the error, and if running it in 8-bit mode allowed it to execute in 25 cycles, I would do it.


This is a case where NVidia optimized their design for one axis (shader execution speed) at the cost of other things (AA quality, etc.). They made a tradeoff and gave programmers more programmability, more control over the precision in the pipeline. ATI made different tradeoffs. The fact that 128-bit runs "slower" than 32-bit or 64-bit is a plus for Nvidia, and not a bonus for ATI, no matter which way you slice it. Because, as I said, the real way to look at it is that NVidia runs 128-bit as fast as ATI, but up to 2x faster in 64-bit. It's not a question of "slow down" but one of "I can choose to speed it up", the same way you can choose to reduce your screen resolution from 1280x1024 to 1024x768 or from 32-bit to 16-bit.
 
DemoCoder said:
Well you could still execute an 8 color op shader in as little as 2 cycles. But the point would be, if I didn't need the increased precision, then even in a DX9 title, I could switch on 8-bit precision and increase my performance further.


When you write a C program, do you always use doubles instead of floats, or 64-bit long words instead of 32-bit? No: you pick the precision you need and no more, if you can get away with it and it yields higher performance.


If I wrote a 100-op shader, but knew that I didn't need 16-bit FP and could deal with the error, and if running it in 8-bit mode allowed it to execute in 25 cycles, I would do it.
And how would you tell the driver to do that? The precision modifier in DX9 can only be used to indicate that 16-bit FP precision is enough, AFAIK.
Does the NV30 Cg profile support an integer type at all?
 
It's all based on speculation, due to the comments by NVidia that they have special hardware for legacy 32-bit integer operations.

Presumably, something like vector<int, 4> in DX9 HLSL would resolve to potentially executing at 8-bit precision in the pipeline. Or a new pragma could be added, or a new type. Who knows.

NVidia says they have special support for the old pipeline, instead of just extending 8-bit into 32-bit FP precision. If they wasted transistors on this, presumably it was to gain performance, and hence it needs to be exposed somehow to the developer.
 
I suppose it's possible as well that 32-bit integer shaders dispatch at the same per-cycle rate as the 9700's, FP16 shaders at half the rate, and FP32 shaders at 1/4 the rate.

If, on the other hand, the NV30 can dispatch shader instructions at four times the per-cycle rate of the R300, and it has a higher clock speed, the practical limit to shader length would be much, much higher on the NV30 (provided the shaders weren't texture limited).

We've got to get Humus' Mandelbrot demo running on a GeForceFX...
 
DemoCoder said:
The fact that 128-bit runs "slower" than 32-bit or 64-bit is a plus for Nvidia, and not a bonus for ATI, no matter which way you slice it. Because, as I said, the real way to look at it is that NVidia runs 128-bit as fast as ATI, but up to 2x faster in 64-bit. It's not a question of "slow down" but one of "I can choose to speed it up", the same way you can choose to reduce your screen resolution from 1280x1024 to 1024x768 or from 32-bit to 16-bit.

How are you making the assumption that NV30's pipes operate twice as fast in 64-bit mode as R300's in 96-bit? In Dave B.'s interview with Geoff Ballew, he stated that the 128-bit pipes could be split to perform two 64-bit instructions in parallel. However, ATI's R300 documentation states that its pixel shaders can perform up to three ops in parallel (a texture read, a texture address op, and a color op). Nvidia made no mention of this capability for NV30.

So depending on how the shader instructions are distributed, R300 may be faster in some situations with 64-bit ops, and NV30 may be faster in others (all clock-for-clock, of course). However, NV30's 128-bit ops should never be faster than R300's 96-bit.
 
I very much doubt they can execute four 32-bit ops at the speed of one 128-bit FP op, otherwise they would have touted it as such. They are only talking about doubling the speed of FP16, not 32-bit integer.
 
Well, Nvidia never stated the format or form in which its pixel processor dispatches instructions. However, looking at the FP32 figure of 12.75 flops per clock per pixel pipe, it seems the NV30 executes more floating-point operations per pipeline per cycle than the R300 (12.75 vs. 10 when counting the floating-point address op in FP32). This leads me to believe it can dispatch something along the lines of 1 vector and 2 scalar ops per cycle, or 2 vectors and 1 scalar along with a texture address op (e.g. 4 parallel fmads, 2 frcp or frsq, and 1 fmov). Also, it seems the NV30 has two integer units in the form of register combiners (http://www.beyond3d.com/articles/nv30r300/index.php), thus the ability to execute two integer ops per cycle.
 
DemoCoder said:
No Bigus, you are reading it backwards.

No, just different ways of stating the same thing.

Bigus Dickus said:
If correct, then regardless of whether the NV30 can address two 64-bit floats in the same time it can one 128-bit float, it will appear that the performance (of the shaders?) reduces as the mode becomes higher precision. 32-bit integer might fly... but you lose speed from there.

There is plenty of room in my statement for the NV30 to be as many times faster as the R300 in 32 bit integer format as you wish for it to be.

Whether it "slows down" in higher precision, or "speeds up" in lower precision is a matter of semantics... the meaning is the same. It could "slow down" to the point where its F32 performance "only" matched the R300, or it could "speed up" to the point where its int 32 was much faster.

It's all a matter of what the starting point is, and the bottom line is that right now we just don't know. The only thing I've seen that gives any indication of this is the reported NV30 "Nature" scores, which I would assume use int32 format. The NV30 doesn't seem to be especially powerful there, so something else must be the limiting factor. Whether that is always the case... who knows.

For once, I am looking forward to seeing some synthetic benchmarks. ;)
 
Firstly, I think the integer ops thing is a bit of a red herring.
Looking at NVidia's papers, it looks like the NV30 has a complete set of NV2X-style register combiners at the end of the pipeline. This was probably required as much to maintain compatibility with older titles using NV OpenGL extensions as to increase performance. I'd be surprised if the integer support was much more than this.

That leaves the choice between 32-bit FP and 16-bit FP, the latter being twice the speed of the former. The question here is how fast it is clock for clock, and I have yet to see an answer to this.
 
DaveBaumann said:
I very much doubt they can execute four 32-bit ops at the speed of one 128-bit FP op, otherwise they would have touted it as such. They are only talking about doubling the speed of FP16, not 32-bit integer.

One press guy asked about this at the launch, and the answer was that a 32-bit integer op would need only half a cycle (so two 32-bit ops per clock per pipeline), while FP16 needs one.
 
ERP said:
That leaves the choice between 32 bit fp and 16 bit fp, the latter being twice the speed of the first. The question here is how fast is it clock for clock, and I have yet to see an answer to this.

I agree, since I'm still a bit confused regarding nVidia's claim. R300 can do three ops per clock (a texture look-up, a texture address op, and a colour op) in 24-bit FP, whereas I understand that NV30 can do two (colour?) ops per clock in 16-bit FP and one in 32-bit FP.

Beyond that uncertainty, the question also remains whether NV30 can do as many ops per clock as R300 if we look at texture look-ups, texture address ops, and colour ops at the same time.
 
LeStoffer said:
Beyond that uncertainty, the question also remains whether NV30 can do as many ops per clock as R300 if we look at texture look-ups, texture address ops, and colour ops at the same time.

I'm not sure that this is a useful comparison either...
Complex real shaders are going to have an unbalanced mix of texture look-ups, texture address ops, and colour ops, so they are not equivalent, and in fact the balance will likely change dramatically from shader to shader.
At this point it's nothing but speculation until we have more info. If your numbers are correct, then we know that NV30, clock for clock, executes the same number of texture lookups per cycle, and either the same number or twice as many colour ops as R300 (depending on 16/32-bit FP), but we know nothing about texture address ops.
 