MaxPC on R420/NV40 Precision

Geo

January 2004 Issue

Article: "Hardcore Hardware 2004"

"Neither ATI nor nVidia is talking much about its next-generation parts because they don't want to cannibalize sales of existing products. We have learned, however, that both the NV40 and R420 will support higher precision than existing cards, and we've confirmed that the NV40 will include full 64-bit precision."

So, the whisper campaign for increased precision on R420 builds... first Orton's hint, now this.
 
... and we've confirmed that the NV40 will include full 64-bit precision...
Did they clarify this?

What exactly do they mean by "full 64-bit precision"? I can think of instances where "64-bit precision" really is 64-bit, precision-wise... and I can think of instances where it isn't (yes, as stupid as that sounds!).
 
*cough* FP32 = 128-bit *cough* FP16 = 64-bit *cough*
NVIDIA's marketing is getting worse by the minute...
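
For anyone who missed the cough: a minimal sketch of the arithmetic, assuming the marketing figures simply add up the four colour channels (RGBA) of a pixel. The function name is made up for illustration.

```python
# Marketing "N-bit precision" figures count all four colour channels together.
CHANNELS = 4  # assumption: red, green, blue, alpha


def marketing_bits(bits_per_channel: int, channels: int = CHANNELS) -> int:
    """Total bits per pixel for a given per-channel precision."""
    return bits_per_channel * channels


for name, bits in (("FP16", 16), ("FP24", 24), ("FP32", 32)):
    print(f"{name}: {bits} bits/channel -> {marketing_bits(bits)}-bit")
```

Under that assumption, "full 64-bit precision" would be no more than FP16 per channel.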

---

I wish I still had my source's exact words: a source once gave me the EXACT, official description of the three precision formats in the NV40, as it should appear in developer documents whose release date is unknown to me. I had expected those at COMDEX or earlier, but I guess they'll only be handed out at the NV40's launch, eh...

Anyway, basically, the NV40 supports FP32, FP16 and FX16. FX16 is simply FX12 plus 4 bits of mantissa, while both FP32 and FP16 are unchanged.

This information is current as of early September, thus it should not have been made outdated by the 25M additional transistors added around that timeframe.


Uttar
 
Reverend said:
... and we've confirmed that the NV40 will include full 64-bit precision...
Did they clarify this?

What exactly do they mean by "full 64-bit precision"? I can think of instances where "64-bit precision" really is 64-bit, precision-wise... and I can think of instances where it isn't (yes, as stupid as that sounds!).

Not so much. :LOL: The sentence does end "allowing for even more impressive high-dynamic range effects in future games." That's it for detail on the subject.

Then the article goes on to talk about how GPUs are becoming more CPU-like. I had a moment of hoping that their "secret" ATI source isn't that same badly translated Orton interview that we all saw, as he was going on about the same two subjects there. Although, generally, I have a good opinion of the reliability of MaxPC.
 
Uttar said:
*cough* FP32 = 128-bit *cough* FP16 = 64-bit *cough*
NVIDIA's marketing is getting worse by the minute...

---

I wish I still had my source's exact words: a source once gave me the EXACT, official description of the three precision formats in the NV40, as it should appear in developer documents whose release date is unknown to me. I had expected those at COMDEX or earlier, but I guess they'll only be handed out at the NV40's launch, eh...

Anyway, basically, the NV40 supports FP32, FP16 and FX16. FX16 is simply FX12 plus 4 bits of mantissa, while both FP32 and FP16 are unchanged.

This information is current as of early September, thus it should not have been made outdated by the 25M additional transistors added around that timeframe.


Uttar

"higher precision than existing cards". In context, clearly within each vendor's product line, so that they are saying that NV40 has higher precision than NV35, and R420 has higher precision than R360. Of course, that still doesn't make it impossible that they misunderstood what they were told by someone who wanted them to misunderstand.
 
Trying for a scoop = gathering little information.

(Sounds like The Register or The Inquirer)

Uttar: You should be taken out back and shot for losing those emails and such like you did...
;)

They would be very useful right now.
 
You know what would be cool?

If ATI made it possible to combine units to get higher precision, like fp24->fp48, then they could run at the DX9-spec fp24 and still say "hey, we have higher precision than NV".
 
I believe the 64-bit FP refers to the precision of the framebuffer. AFAIK, framebuffers (the ones read by the RAMDAC) are still 32-bit INT on R3xx/NV3x. To quote John Carmack...
The future is in floating point framebuffers. One of the most noticeable
thing this will get you without fundamental algorithm changes is the ability
to use a correct display gamma ramp without destroying the dark color
precision. Unfortunately, using a floating point framebuffer on the current
generation of cards is pretty difficult, because no blending operations are
supported, and the primary thing we need to do is add light contributions
together in the framebuffer. The workaround is to copy the part of the
framebuffer you are going to reference to a texture, and have your fragment
program explicitly add that texture, instead of having the separate blend unit
do it. This is intrusive enough that I probably won't hack up the current
codebase, instead playing around on a forked version.

Floating point framebuffers and complex fragment shaders will also allow much
better volumetric effects, like volumetric illumination of fogged areas with
shadows and additive/subtractive eddy currents.

John Carmack

Source : http://www.bluesnews.com/cgi-bin/finger.pl?id=1&time=20030129210315
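
To put a rough number on the "dark color precision" point in that quote, here is a minimal sketch. It assumes a display gamma of 2.2, an 8-bit-per-channel integer framebuffer holding linear values, and IEEE half precision as the floating point format; all function names are made up for illustration and nothing here is taken from a real driver or game.

```python
import struct

GAMMA = 2.2  # assumed display gamma, for illustration only


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_int8(x: float) -> float:
    """Round x in [0, 1] to the nearest of 256 evenly spaced levels."""
    return round(x * 255) / 255


def display_codes(quantize, samples: int = 10000, dark_limit: float = 0.01) -> int:
    """Count the distinct 8-bit output codes left after gamma-correcting
    linear framebuffer values in the dark range [0, dark_limit]."""
    codes = set()
    for i in range(samples + 1):
        linear = quantize(dark_limit * i / samples)
        codes.add(round(255 * linear ** (1 / GAMMA)))
    return len(codes)


print("8-bit integer framebuffer:", display_codes(to_int8), "dark shades")
print("FP16 framebuffer:", display_codes(to_fp16), "dark shades")
```

The integer framebuffer collapses the darkest 1% of the linear range into a handful of shades once the gamma ramp is applied, while the FP16 one keeps many more. The blending workaround Carmack describes is a separate issue and is not modelled here.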
 
For the NV3X architecture Nvidia tied two FP16 units with control logic creating a single FP32 so there is nothing stopping them putting four FP16 units and creating a single FP64 unit.

This is the reason FP32 is half the speed of FP16. I would imagine that FP64 would be 1/4 the speed of FP16 on the NV40, but they would be able to say they have 64 bit precision.
 
rwolf said:
For the NV3X architecture Nvidia tied two FP16 units with control logic creating a single FP32 so there is nothing stopping them putting four FP16 units and creating a single FP64 unit.

This is the reason FP32 is half the speed of FP16.
There is no indication that the NV3X architecture is anything like this. The engine is FP32; FP16 is faster because of its reduced register footprint, which is a critical issue on the NV3X.

-FUDie
 
Uttar said:
This information is current as of early September, thus it should not have been made outdated by the 25M additional transistors added around that timeframe.

Do you know, then, why the additional transistors are there?
 
Toasty said:
I believe the 64-bit FP refers to the precision of the framebuffer. AFAIK, framebuffers (the ones read by the RAMDAC) are still 32-bit INT on R3xx/NV3x. To quote John Carmack...
The future is in floating point framebuffers. One of the most noticeable
thing this will get you without fundamental algorithm changes is the ability
to use a correct display gamma ramp without destroying the dark color
precision. Unfortunately, using a floating point framebuffer on the current
generation of cards is pretty difficult, because no blending operations are
supported, and the primary thing we need to do is add light contributions
together in the framebuffer. The workaround is to copy the part of the
framebuffer you are going to reference to a texture, and have your fragment
program explicitly add that texture, instead of having the separate blend unit
do it. This is intrusive enough that I probably won't hack up the current
codebase, instead playing around on a forked version.

Floating point framebuffers and complex fragment shaders will also allow much
better volumetric effects, like volumetric illumination of fogged areas with
shadows and additive/subtractive eddy currents.

John Carmack

Source : http://www.bluesnews.com/cgi-bin/finger.pl?id=1&time=20030129210315

mmmmm 8)
 
For the NV3X architecture Nvidia tied two FP16 units with control logic creating a single FP32 so there is nothing stopping them putting four FP16 units and creating a single FP64 unit.

This is the reason FP32 is half the speed of FP16. I would imagine that FP64 would be 1/4 the speed of FP16 on the NV40, but they would be able to say they have 64 bit precision.
If only things were that easy. First of all, the 16-bit and 32-bit formats don't match at all. Secondly, I don't believe for one second that you can just combine two FP16 units to create an FP32-compatible one. It simply isn't possible, in my opinion.
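
A minimal sketch of why the formats don't line up, assuming the usual IEEE-style layouts (1 sign / 5 exponent / 10 mantissa bits for FP16, 1 / 8 / 23 for FP32). Whether NV3x follows these layouts exactly is an assumption here, and the helper names are made up.

```python
import struct


def fp32_fields(x: float):
    """Split a single-precision value into (sign, 8-bit exponent, 23-bit mantissa)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def fp16_fields(x: float):
    """Split a half-precision value into (sign, 5-bit exponent, 10-bit mantissa)."""
    bits = struct.unpack("<H", struct.pack("<e", x))[0]
    return bits >> 15, (bits >> 10) & 0x1F, bits & 0x3FF


print(fp32_fields(1.0))  # exponent field uses a bias of 127
print(fp16_fields(1.0))  # exponent field uses a bias of 15

# Gluing the bit patterns of two FP16 ones together does not give an FP32 one.
glued = struct.unpack("<f", struct.pack("<HH", 0x3C00, 0x3C00))[0]
print(glued)
```

The field widths and exponent biases differ, so the exponent and mantissa datapaths of two FP16 units do not simply concatenate into those of one FP32 unit.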
 
FUDie said:
rwolf said:
For the NV3X architecture Nvidia tied two FP16 units with control logic creating a single FP32 so there is nothing stopping them putting four FP16 units and creating a single FP64 unit.

This is the reason FP32 is half the speed of FP16.
There is no indication that the NV3X architecture is anything like this. The engine is FP32; FP16 is faster because of its reduced register footprint, which is a critical issue on the NV3X.

-FUDie

I recall reading this somewhere; unfortunately, I don't have a link. A fellow from Nvidia was saying that they took an FP32 unit and added some control logic so that it could support dual FP16 instructions. They even said it was twice as fast. I don't think all the NV3X problems are just registers. I am pretty sure that there are twice as many FP16 units as FP32.
 
That'd make for a hell of a lot of wasted transistors. If there were separate floating point units for FP16, and twice as many of them as there are for FP32 (and don't forget the FX12 units in the NV30), just think how much better it would be to spend those transistors on more FP32 units. It's only one precision mode, but it'd certainly be a lot faster than it is now.

Waste or not, a separate set of units for FP16 should mean that operations that only use that precision could be executed in parallel with FP32 instructions. Going even further, one should be able to execute FX12 instructions on the NV30 along with those two floating point instructions. There's no indication of any of this. (EDIT: FX12 instructions can be executed in parallel with FP instructions on the NV30, but only one FP instruction regardless of precision per clock, per pipe.)

There have been a few good threads about NV3x architecture by some people who've done a real heck of a lot of analysis. From all of this one can immediately conclude that the only two sets of arithmetic units are the fixed point ones and the floating point ones -- and fixed point was apparently dropped in the NV35 in favour of "mini" floating point units capable of executing relatively simple instructions. The entirety of the GeforceFX's FP16 performance seems to come from register usage. Just check out the fillrate tester results. Pixel shader programs that use only a couple of registers run at the same speed regardless of partial precision. Check out the last post in the following link:

Link!
 
http://www.beyond3d.com/previews/nvidia/nv30gfx/index.php?p=3

Large instruction counts need plenty of storage for intermediate results. GeForce FX will have 32 temporary pixel shader registers in FP32 mode, though this is doubled if all the instructions are in FP16.

http://www.beyond3d.com/previews/nvidia/nv30gfx/index.php?p=5

As for shader performance it remains to be seen if the full capabilities of GeForce FX can be taken advantage of within games at playable frame rates, though the ability to execute two FP16 instructions at the speed of one FP32 instruction will certainly be a boon.

http://www.beyond3d.com/previews/nvidia/nv30launch/index.php?p=2

There was talk that FP16 (64-bit floating point rendering) could run twice the speed of FP32 (128-bit floating point rendering), is that the case?

Yes it is. Because we have native support in our hardware for FP16 and FP32. So, every pipeline is wide enough to accommodate the full 128-bit through the entire thing -- in the Vertex Shader, in the Pixel Shader and out to the frame buffer. Because we support 128-bit throughout the entire pipeline we added some extra control line and we can split those 128-bit channels into 64-bit channels. Now, that's only in the shading architecture, so we don't get twice as many pixels, but you get twice as many 64-bit in instructions. Also, if you want to use FP16 you'll have a smaller frame buffer so it has a lower footprint in memory as well.

I might not be exactly right on the implementation, but I think I am close. Wish I could find the other links.....

It could be the pipeline that has the extra control logic and not the FP units.
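
As a rough sketch of the register figure in the first Beyond3D quote above (32 temporaries in FP32, doubled in FP16): this assumes each temporary is a 4-component vector and that the storage is a fixed number of bytes. The 512-byte figure is simply derived from the quoted 32 registers and is not a published spec; all names are hypothetical.

```python
COMPONENTS = 4  # assumption: each temporary holds a 4-component vector


def temp_registers(file_bytes: int, bits_per_component: int) -> int:
    """How many vector temporaries fit in a fixed-size register store."""
    bytes_per_register = COMPONENTS * bits_per_component // 8
    return file_bytes // bytes_per_register


# Hypothetical storage sized to hold exactly the 32 FP32 temporaries quoted above.
FILE_BYTES = 32 * COMPONENTS * 32 // 8  # 512 bytes

print("FP32 temporaries:", temp_registers(FILE_BYTES, 32))
print("FP16 temporaries:", temp_registers(FILE_BYTES, 16))
```

The same storage holds twice as many FP16 temporaries, which would fit a register-footprint explanation for the speed difference without needing any extra arithmetic units.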
 
You can't merge two ALUs together to get an ALU with double the precision. It's like saying if you merged 6 Mini Coopers together you could build a Mustang; it just doesn't work like that. One of the facts to keep in mind is that multipliers scale sizewise quadratically [ O(n^2) ] with growth in precision. In other words, a 2x gain in precision requires around a 4x increase in die area. For fixed point, you can sort of merge 4 "X"-bit ALUs to get a "2X"-bit ALU, but even then you will run into problems with signal timings and the like. For floating point - forget it.
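
A minimal sketch of the fixed point case, using plain unsigned integers: a 32x32 multiply built from four 16x16 multiplies, plus the n*n partial-product count behind the O(n^2) remark. The function names are made up, and this is textbook schoolbook decomposition rather than a description of any real GPU.

```python
def mul16(a: int, b: int) -> int:
    """Stand-in for a 16x16 -> 32-bit hardware multiplier."""
    assert 0 <= a < 1 << 16 and 0 <= b < 1 << 16
    return a * b


def mul32_from_mul16(a: int, b: int) -> int:
    """32x32 -> 64-bit unsigned product built from four 16-bit multiplies."""
    a_hi, a_lo = a >> 16, a & 0xFFFF
    b_hi, b_lo = b >> 16, b & 0xFFFF
    return (
        (mul16(a_hi, b_hi) << 32)
        + ((mul16(a_hi, b_lo) + mul16(a_lo, b_hi)) << 16)
        + mul16(a_lo, b_lo)
    )


def partial_product_bits(n: int) -> int:
    """An n x n array multiplier ANDs every bit of one operand with every bit of the other."""
    return n * n


print(mul32_from_mul16(123456789, 987654321) == 123456789 * 987654321)
print(partial_product_bits(32) // partial_product_bits(16))
```

Note that doubling the width takes four of the narrow multipliers, not two, and the extra shifts and wide adds are where the timing trouble comes from. Floating point adds exponent handling, normalisation and rounding on top, none of which composes this way.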

The reason NV3X runs better with FP16 than FP32 has to do with register thrashing issues rather than computational issues. FP16 runs on the same units as does FP32, at the same rate of throughput.
 
rwolf said:
http://www.beyond3d.com/previews/nvidia/nv30gfx/index.php?p=3
Large instruction counts need plenty of storage for intermediate results. GeForce FX will have 32 temporary pixel shader registers in FP32 mode, though this is doubled if all the instructions are in FP16.
As I said.
http://www.beyond3d.com/previews/nvidia/nv30gfx/index.php?p=5
As for shader performance it remains to be seen if the full capabilities of GeForce FX can be taken advantage of within games at playable frame rates, though the ability to execute two FP16 instructions at the speed of one FP32 instruction will certainly be a boon.
I don't buy it. Also, not one test has shown this to be the case.
http://www.beyond3d.com/previews/nvidia/nv30launch/index.php?p=2
There was talk that FP16 (64-bit floating point rendering) could run twice the speed of FP32 (128-bit floating point rendering), is that the case?

Yes it is. Because we have native support in our hardware for FP16 and FP32. So, every pipeline is wide enough to accommodate the full 128-bit through the entire thing -- in the Vertex Shader, in the Pixel Shader and out to the frame buffer. Because we support 128-bit throughout the entire pipeline we added some extra control line and we can split those 128-bit channels into 64-bit channels. Now, that's only in the shading architecture, so we don't get twice as many pixels, but you get twice as many 64-bit in instructions. Also, if you want to use FP16 you'll have a smaller frame buffer so it has a lower footprint in memory as well.
Sounds like marketing-speak to me. The extra control logic could just be referring to being able to write two 16-bit values to a 32-bit register.

-FUDie
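
For what "two 16-bit values in a 32-bit register" could look like, a minimal sketch assuming IEEE half precision and a made-up low/high packing order. This only illustrates the storage idea and says nothing about how the NV3X actually lays out its registers.

```python
import struct


def pack_half2(lo: float, hi: float) -> int:
    """Store two half-precision values in one 32-bit word (lo in bits 0-15)."""
    lo_bits, hi_bits = struct.unpack("<HH", struct.pack("<ee", lo, hi))
    return (hi_bits << 16) | lo_bits


def unpack_half2(word: int):
    """Recover the two half-precision values from a 32-bit word."""
    return struct.unpack("<ee", struct.pack("<HH", word & 0xFFFF, word >> 16))


word = pack_half2(1.0, -2.0)
print(hex(word), unpack_half2(word))
```

Packing like this halves the storage per value but involves no additional arithmetic units, which is the distinction being argued over.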
 