FP16 and market support

OpenGL guy said:
Refrast does the same. Also, 32 steps is a lot; it takes some really extreme examples to make them visible.
Doesn't it just require a color difference of more than 32 and high magnification? Seems like lightmaps could suffer from it. Also, very high detail textures could suffer from aliasing. Like the article says: "The most simple of all texture filters is bilinear. We'd expect it to be implemented without compromise".
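To put the 32-step point in numbers (a toy C++ sketch of my own, not refrast output or any particular card's filter), here's a bilinear blend whose fraction is truncated to 1/32 steps:

Code:
#include <cstdio>
#include <cmath>

int main()
{
    // Blend between two 8-bit texel values with the lerp fraction
    // truncated to 5 bits (32 steps). With a texel difference of 255 the
    // output jumps in steps of ~8 levels; with a difference of 32 or less
    // adjacent steps differ by at most one level, so the banding is
    // invisible without high magnification.
    const int a = 0, b = 255;                     // worst-case neighbouring texels
    int last = -1;
    for (int i = 0; i <= 256; ++i) {
        double f = i / 256.0;                     // "ideal" blend fraction
        double fq = std::floor(f * 32.0) / 32.0;  // fraction truncated to 5 bits
        int out = static_cast<int>(std::lround(a + (b - a) * fq));
        if (out != last) {
            std::printf("fraction %.4f -> %3d\n", f, out);
            last = out;
        }
    }
    return 0;
}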
Exactly, and these don't occur in games.
Not in current generation of games. They all still use textures with unmodified texture coordinates and are not really DirectX 9.
Where's this "great loss in precision" coming from? Not from the LOD fraction, certainly. Not from FP24 certainly.
24-bit is already on the edge of sufficient precision. As soon as you lose some more bits it is bound to become visible, and the loss of precision is exponential with the number of operations.
Prove it.
Remind me of it after my last exam, that's on February 5, and I'll make an attempt.
DX9 does specify. It specifies a minimum of 24-bit precision in the pixel shader, or 16-bit precision when _pp is specified. Not very complicated.
Well, it seems to me that in some places 24-bit is too low and in others it's too high even though no _pp is used. So, yes, DirectX exposes some precision control, but not enough to be ideal on all hardware. Like I said before, what would have happened to ATI if DirectX had specified that 32-bit is required for certain operations? Probably long discussions like this would have started...
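To illustrate the kind of error growth I mean (just a toy C++ sketch that quantizes the mantissa after every operation; not an emulation of any real shader unit), compare an accumulation at a few mantissa widths:

Code:
#include <cstdio>
#include <cmath>

// Crude model of a float with 'bits' significant mantissa bits:
// round the frexp mantissa to that many bits and rebuild the value.
static double quantize(double x, int bits)
{
    if (x == 0.0) return 0.0;
    int e = 0;
    double m = std::frexp(x, &e);              // x = m * 2^e, 0.5 <= |m| < 1
    double scale = std::ldexp(1.0, bits);      // 2^bits
    return std::ldexp(std::round(m * scale) / scale, e);
}

int main()
{
    // Accumulate a value that is not exactly representable and compare
    // against a double-precision reference. The error grows with the
    // number of dependent operations, and grows faster the fewer
    // mantissa bits are available (23 ~ FP32, 16 ~ FP24, 10 ~ FP16).
    const double step = 1.0 / 3.0;
    const int widths[] = {23, 16, 10};
    for (int bits : widths) {
        double ref = 0.0, low = 0.0;
        for (int ops = 1; ops <= 256; ++ops) {
            ref += step;
            low = quantize(low + quantize(step, bits), bits);
            if (ops == 16 || ops == 64 || ops == 256)
                std::printf("%2d mantissa bits, %3d ops: error %.3g\n",
                            bits, ops, std::fabs(ref - low));
        }
    }
    return 0;
}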
 
maven said:
I think it would be in the DDK, couldn't find anything in the SDK...
Oh, but the DDK isn't publicly available, is it? So many developers would not be aware of the limited precision on certain hardware. Imagine all the debugging and hunting for workarounds when 32-bit precision was assumed.
But they (limited fractional bits for texture-interpolators and dependent reads) are different sources of error...
...that contribute to the same artifacts. If you perform a dependent texture lookup that requires high fractional precision at high range, ATI won't be able to provide either. I still agree it's a rare case, but what to do if it does occur?
This is a notion I disagree with. To quote one of my Numerical Analysis lecturers, it's the same as proclaiming the patient dead while the operation hasn't even started yet.
Well no, but if you do start the operation with the wrong tools, I fear for your patient's life anyway. Most shaders used nowadays do operations on colors but seldom on texture coordinates. Once the real potential is explored, how can you be sure no problems will show up? Who knows, maybe some developers already had to drop certain shaders because of precision issues without realizing the cause.
There are certain types of operations (in conjunction with particular data) that can have catastrophic effects on accuracy, but you need to be aware of those (as a programmer) anyway; and they do not necessarily occur.
Again, the programmer can't be aware of it if it isn't in the SDK. It seems to me the 24-bit limitation was added later on to suit ATI. Lucky for them, it doesn't show many problems in current 'flat' games, but once operations are used that are sensitive to low mantissa precision, catastrophe can occur. In that case, ATI is pretty much stuck or needs measures that probably lead to decimated performance, while Nvidia just walks past it.

Note that I haven't addressed at all, whether FP16/24/32 is enough for anything or not...
Yeah sure, you just wanted to prove my ignorance... ;) I appreciate that though, as long as you can just agree with me if I do make a point.
 
Chalnoth said:
This is why I have always opposed people using in-game screenshots to test texture quality and AA quality. For these things, synthetic screenshots are vastly more telling. If you want to use games to test such things, the only way to do it properly would be to use video. That never happens, however, and I don't think it ever will.
Yeah, it's like the people who claim the reference rasterizer shows things the way they should look. But nobody has actually seen its temporal effects because we can't wait all day for one second of animation. :rolleyes:
 
Hyp-X said:
This doesn't make much sense.

Say you have 128 slots (because of high register usage), a PS program with 1 TEX instruction and some arithmetic instructions, a texture latency of 176 but an arithmetic latency of <=128.

<snip>

I don't say the FX works this way.
I'm just saying that your argument that the bypass has to be long because the quad order cannot change makes no sense.

I am not sure what you are trying to tell me, but I am sure we both read my post in different ways. If that is true, it is my fault.

The NV3X pipe looks like this:
Code:
   Primitive Setup
         |
 --- Gatekeeper
 |       |
 |   Shader Core
 |    |       |
 |   TMU   Bypass
 |    |       |
 |   Shader Backend
 |       |
 |   Combiners
 |       |
 --- Loopback
         |
       ROP

For each instruction you need the same number of cycles to go from the start (gatekeeper) to the end (loopback). If an instruction does a texture access, the TMU is used; all other instructions use the bypass. If the bypass used fewer slots/stages than the TMU, one quad could overtake another. In that case you are not in sync anymore and you have two quads that reach the backend at the same time.
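As a toy illustration of why (made-up slot counts, not real NV3x numbers):

Code:
#include <cstdio>
#include <vector>

struct Quad { int id; bool usesTexture; };

int main()
{
    // One quad is issued per cycle. A texture quad spends TMU_SLOTS
    // cycles in the TMU path, an arithmetic quad BYPASS_SLOTS cycles in
    // the bypass. The slot counts are invented for illustration.
    const int TMU_SLOTS = 8;
    const int BYPASS_SLOTS = 3;        // shorter than the TMU path
    const std::vector<Quad> quads = {
        {0, true}, {1, false}, {2, false}, {3, true}, {4, false}
    };

    for (std::size_t i = 0; i < quads.size(); ++i) {
        int issueCycle = static_cast<int>(i);
        int arrival = issueCycle + (quads[i].usesTexture ? TMU_SLOTS
                                                         : BYPASS_SLOTS);
        std::printf("quad %d reaches the backend on cycle %d\n",
                    quads[i].id, arrival);
    }
    // Output: quads 1 and 2 (cycles 4 and 5) arrive before quad 0
    // (cycle 8) -- the quads are out of order, which is exactly what a
    // bypass padded to the same slot count as the TMU path prevents.
    return 0;
}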
 
Nick, I am not sure if it is in the DDK but during the DX9 Beta there was a document that contained this:

[from ps_2_0 section]
---Begin Paste---
Internal Precision
- All hardware that support PS2.0 needs to set D3DPTEXTURECAPS_TEXREPEATNOTSCALEDBYSIZE.
- MaxTextureRepeat is required to be at least (-128, +128).
- Implementations vary precision automatically based on precision of inputs to a given op for optimal performance.
- For ps_2_0 compliance, the minimum level of internal precision for temporary registers (r#) is s16e7** (this was incorrectly s10e5 in spec)
- The minimum internal precision level for constants (c#) is s10e5.
- The minimum internal precision level for input texture coordinates (t#) is s16e7.
- Diffuse and specular (v#) are only required to support [0-1] range, and high-precision is not required.
---End Paste ---

Somebody from nVidia (David Kirk?) said that the switch from fp32 to fp24 in the DX9 spec was made after they had already done their design. But even if that is not correct, nVidia had no choice at this point. Anything less than fp32 would have been a step backwards; they already used fp32 in the NV2X.
 
Radar1200gs: Stop trying to look stupid, because your primary argument is flawed. You say the 5200 is the most popular DX9 card, and that FP16 is a gimmick on it.
The only usable format on the 5200 is FX12. FP16 is nearly exactly three times as slow as FX12 on the 5200, and FX12 already isn't too fast. DX9 PS2.0 doesn't expose this functionality, but I believe NVIDIA forces FX12 on the 5200 anyway.
However, I do believe that developers should put "whole program in FP16" hints on not-too-complex DX9 shaders, because not only the NV3x and S3 benefit from it, but also the NV4x(!)
Regarding the 5200/5600/5800, I say we should let NVIDIA use FX12 everywhere on them. It would be a disservice to the poor users of these cards not to let them do that, as it's the only way for them to have playable framerates, although with obviously lower IQ.

---

Demirug: That makes sense to me, although I find it extremely stupid from an engineering point of view. But then again, so are the whole register usage penalties, so it does make sense, if you see what I mean ;)

Regarding the NV40, what I meant is that we would move more towards an ILDP (Instruction Level Distributed Processor), and that the gatekeeper would drastically evolve. Problem is, perhaps I'm just being too ambitious and thinking too much about what NVIDIA will have to do in the NV50 if they want to be efficient, because they certainly don't HAVE to do that in the NV40...

Very basically speaking:
- The gatekeeper can now dispatch and receive several quads at the same time (fixed number though, of course).
- The gatekeeper can send the quads to a few different units(!same number of units as number of quads it can send/get!)
- All of these units have a loopback mechanism to one of the input paths of the gatekeeper.

In the most basic implementation, there's just one path for arithmetic and one for texturing. In the most complex implementation, A.K.A. a true ILDP, each unit has such a path, resulting in an optimal usage of all units at all times.

This is risk-free IMO. For example, let us say the arithmetic path takes 100 slots and texturing one 250 slots. Even if the gatekeeper can send multiple quads at a time, it can never send more than one to a specific path, or get more than one from the same path, in a single cycle.

The idea here is not to reduce the maximum number of slots used. It's to reduce the *average*.
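A quick back-of-the-envelope with those made-up 100/250 slot numbers, just to show how much the average drops for a typical instruction mix:

Code:
#include <cstdio>

int main()
{
    // With one unified path every quad pays the texturing worst case;
    // with separate loopback paths the cost is weighted by the
    // instruction mix. Slot counts and ratios are hypothetical.
    const double arithSlots = 100.0, texSlots = 250.0;
    const double texRatios[] = {0.10, 0.25, 0.50};
    for (double r : texRatios) {
        double avg = r * texSlots + (1.0 - r) * arithSlots;
        std::printf("%2.0f%% texture ops: average %.0f slots (worst case %.0f)\n",
                    r * 100.0, avg, texSlots);
    }
    return 0;
}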


Uttar
 
Uttar said:
Demirug: That makes sense to me, although I find it extremely stupid from an engineering point of view. But then again, so are the whole register usage penalties, so it does make sense, if you see what I mean ;)

We would not be having this discussion now if nVidia had put in a larger register file. IMHO the file was larger in the original design, but it was not possible to build that design. Maybe nVidia wanted to use 1T-RAM but TSMC was not ready.

Uttar said:
Regarding the NV40, what I meant is that we would move more towards an ILDP (Instruction Level Distributed Processor), and that the gatekeeper would drastically evolve. Problem is, perhaps I'm just being too ambitious and thinking too much about what NVIDIA will have to do in the NV50 if they want to be efficient, because they certainly don't HAVE to do that in the NV40...

Very basically speaking:
- The gatekeeper can now dispatch and receive several quads at the same time (fixed number though, of course).
- The gatekeeper can send the quads to a few different units(!same number of units as number of quads it can send/get!)
- All of these units have a loopback mechanism to one of the input paths of the gatekeeper.

In the most basic implementation, there's just one path for arithmetic and one for texturing. In the most complex implementation, A.K.A. a true ILDP, each unit has such a path, resulting in an optimal usage of all units at all times.

This is risk-free IMO. For example, let us say the arithmetic path takes 100 slots and texturing one 250 slots. Even if the gatekeeper can send multiple quads at a time, it can never send more than one to a specific path, or get more than one from the same path, in a single cycle.

The idea here is not to reduce the maximum number of slots used. It's to reduce the *average*.

Uttar

Sure, ILDP can help. If they change the TMU from a static to a dynamic slot-count model it will help even more. But I think the gatekeeper is not the right unit for the management of this.

IMHO this is the right design for the job:

Code:
                  |----<--------------|
Primitive Setup   | |--<------------| |
      |           | |               | |
  ********Gatekeeper*****           | |
        |               |           | | 
      Instruction  Instruction      | |
      Decoder 1    Decoder 2        | |
        |               |           | |
  *********Scheduler*****           | | 
  |          |          |           | |
 To ROP     TMU      FPU/ALU        | | 
             |          |           | | 
             |          |--->-------| |  
             |                        |
             |-------------->---------|

But this is absolutely off topic in this thread.
 
Demirug said:
We would not be having this discussion now if nVidia had put in a larger register file. IMHO the file was larger in the original design, but it was not possible to build that design. Maybe nVidia wanted to use 1T-RAM but TSMC was not ready.

I'm not too familiar with 1T-RAM; is that on-chip RAM, or extremely fast external RAM, or something else?

Sure, ILDP can help. If they change the TMU from a static to a dynamic slot-count model it will help even more. But I think the gatekeeper is not the right unit for the management of this.

True. The word gatekeeper isn't appropriate here; I was overextending the term, and it's much more logical to see it as a few added units. But on the general idea, we seem to agree :)
I'm not so sure how easy it would be to make a dynamic slot-count TMU, though. And I'd say that since ILDP helps with a ton more stuff than just reducing register usage penalties, it's a better area to concentrate on - although that's just my opinion.
But then again, as said oh so many times before, ILDP is an extremely vague term. By itself, it doesn't even really mean anything.

But this is absolutely off topic in this thread.

Agreed, but moving the thread in another direction most likely wouldn't make anyone cry :)


Uttar
 
Demirug said:
The NV3X pipe looks like this:
Code:
   Primitive Setup
         |
 --- Gatekeeper
 |       |
 |   Shader Core
 |    |       |
 |   TMU   Bypass
 |    |       |
 |   Shader Backend
 |       |
 |   Combiners
 |       |
 --- Loopback
         |
       ROP

Well you are right it's not possible with this design.
This design sucks. (Not that it's news...)

The better design would be to stall the quads in the loopback/gatekeeper instead of the bypass and have a 0-stage bypass.
This would allow the gatekeeper to insert instructions only when it wouldn't cause an out-of-order problem (a simple thing to solve).

This would allow higher efficiency on a series of arithmetic instructions - which is what you have when you use a lot of registers anyway. So the register usage cost would have been barely noticeable.

And since the cost of stalling quads is almost entirely in the register file, it wouldn't have needed many more transistors (maybe no more transistors at all).
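As a sketch of the "simple thing to solve" part (a toy C++ model with made-up latencies, not a claim about how any chip actually schedules): the gatekeeper just delays issue into the short path until the quad can no longer arrive before an earlier one in flight.

Code:
#include <cstdio>
#include <vector>

struct Quad { int id; bool usesTexture; };

int main()
{
    // Texture quads take TMU_LATENCY cycles, arithmetic quads take
    // ALU_LATENCY cycles (a 0-stage bypass would make this even smaller).
    // The gatekeeper stalls issue whenever completing earlier than the
    // previously issued quad would reorder the stream.
    const int TMU_LATENCY = 8, ALU_LATENCY = 1;
    const std::vector<Quad> quads = {
        {0, true}, {1, false}, {2, false}, {3, true}, {4, false}
    };

    int cycle = 0;           // next free issue cycle
    int lastArrival = -1;    // arrival cycle of the previously issued quad
    for (const Quad& q : quads) {
        int latency = q.usesTexture ? TMU_LATENCY : ALU_LATENCY;
        int issue = cycle;
        if (issue + latency <= lastArrival)      // would overtake or tie
            issue = lastArrival - latency + 1;   // stall at the gatekeeper
        int arrival = issue + latency;
        std::printf("quad %d: issued cycle %d, done cycle %d\n",
                    q.id, issue, arrival);
        lastArrival = arrival;
        cycle = issue + 1;                       // one quad issued per cycle
    }
    return 0;
}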
 
BTW, I think it couldn't hurt to get even a tad more offtopic...
How come NVIDIA doesn't seem to have any register problem in their Vertex Shader architecture? I know this has never been tested, but if it were significant, we'd have heard about it by now...

The big difference between PS and VS in the NV3x, besides a few obvious architectural differences due to the use of vertices instead of quads, is the existence of branching and, more interestingly, the lack of texturing.

So, could it be that with texturing, VS will also have register usage penalties? If so, the original NV30 designs called for VS texturing using the PS' lookup units...
Or is it a completely different architecture? I would most sincerely doubt that; considering the rather identical feature sets, it sounds just plain stupid to me.
Another possibility is that the design is just plain smarter there, and that NVIDIA decided to operate that way because branching made it cheaper in transistors to implement the solution they wanted.

ExtremeTech's preview did hint at certain not-so-minor problems NVIDIA had with their Pixel shadercore...
http://www.extremetech.com/article2/0,3973,1153457,00.asp
And contrary to the rumors mill that the 0.13 micron manufacturing process delayed taping out the GeForceFX, Kirk blamed implementing these 32 processing units

I'd say that's marketing speech, but seeing just how problematic their pixel architecture already is even in the NV30/NV35, claiming they had unforeseen problems during development which, among other things, caused the delays makes sense IMO.
Having scrapped some important techniques and replaced them with the most easy-to-implement-but-slow ones is a possibility.

Or we could start another topic for this - for once I seriously feel compelled to talk about this GPU stuff that I'm not supposed to care about anymore ;)


Uttar
 
Uttar said:
I'm not too familiar with 1T-RAM; is that on-chip RAM, or extremely fast external RAM, or something else?

Usually 1T-RAM refers to MoSys's proprietary eDRAM design, where SRAM caches are used to mask the higher latency usually associated with DRAM relative to SRAM. It was used in the ArtX Flipper GameCube GPU for both a texture cache and a framebuffer cache.
 
Probably the texture address processor - what was the texture address processor in NV2x was extended for the texture address / FP ALU in NV30. AFAIK, the texture address processor in R300 is FP32 as well.
 
Well, I call it as I see it, and based upon the NV30 and R300, ATi was definitely smarter this time around... Nvidia continues to prove my point by the actions they keep taking. They hid the 5800 Ultra, and the cheats they have to use in the drivers indicate that the hardware was poorly designed at best. Nvidia chose to try and dictate to the rest of the industry and they failed. I am quite sure the engineering team had a lot of input on the design of the NV3X, right?

Nick said:
YeuEmMaiMai said:
...and I am quite sure that all of their engineers are smarter than the ones over at nVidia, based upon the current generation's performance and adherence to the DX9 spec...
I hope you realize that is a very serious insult to any Nvidia employee. Making a tiny misjudgement and having a bit less luck doesn't make them all morons. After all, Nvidia cards are still high quality and beat the corresponding ATI cards on several points other than ps 2.0 performance. I'd like to see you make the design decisions for the next generation of graphics cards and we'll see if they work out well on all fronts...
 
DaveBaumann said:
Probably the texture address processor - what was the texture address processor in NV2x was extended for the texture address / FP ALU in NV30. AFAIK, the texture address processor in R300 is FP32 as well.
It definitely is. I wrote a little test program using a 2048x2048 texture to compare dependent texture reads and normal texture addressing. Right away the quality is lower on the dependent texture read. It's not bad at all at the lower end of the texture coordinates, but approaching 10.0 it gets much more noticeably worse. Each of the texture's pixels remains distinct, though. The question I ask is: how much texture wrapping do you think one would need in a dependent texture read?
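For a rough feel of the numbers (a back-of-the-envelope C++ calculation assuming the s16e7 format keeps about 16 fraction bits; the exact hardware format may differ):

Code:
#include <cstdio>
#include <cmath>

int main()
{
    // Spacing of an FP24-style value (assumed ~16 fraction bits) near a
    // texture coordinate of 10.0, compared with the size of one texel of
    // a wrapped 2048-texel texture.
    const int fractionBits = 16;
    const double coord = 10.0;              // lies in [8, 16)
    const double texel = 1.0 / 2048.0;

    int e = 0;
    std::frexp(coord, &e);                  // coord = m * 2^e, m in [0.5, 1)
    double ulp = std::ldexp(1.0, e - 1 - fractionBits);  // spacing near coord

    std::printf("spacing near %.1f : %.3g\n", coord, ulp);
    std::printf("one texel        : %.3g\n", texel);
    std::printf("steps per texel  : %.1f\n", texel / ulp);
    // Roughly four representable coordinate steps per texel near 10.0,
    // i.e. only about two bits of bilinear fraction left -- consistent
    // with the banding visible in the dependent-read screenshot.
    return 0;
}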
 
DaveBaumann said:
Probably the texture address processor - what was the texture address processor in NV2x was extended for the texture address / FP ALU in NV30.

Yes, and the texture shader works with FP32, too.

NV2X:

Code:
   Primitivesetup 
         | 
|---->---| 
|        |
| Textureshader(FP32) 
|    |       | 
|  TMU  Bypass 
|    |       | 
|--<--Loopback
        |
  Shaderbackend 
        |
-->-----|
|       | 
| Combiners (FX9)
|      | 
-<-Loopback 
       | 
     ROP
 
radar1200gs said:
What would be a good compromise is a screenshot showing the overall scene, then a 10 to 30 frame animation using animgif format or similar (can .png do anim sequences?) of a small selected portion of the screen designed to show the effect in motion.
That would help. But there's still the other problem I forgot to mention: The amount of visible aliasing also depends heavily upon the scene selected. In fact, if two different video cards have a similar form of AA, the choice of scene may well favor one or the other video card unfairly.

And with texture filtering quality, the choice of scene may similarly unfairly favor one or the other (enter the endless arguments on the off-angle deficiencies of ATI's video cards).

So even a single video comparing the video cards' image quality is not going to be sufficient. I still say that synthetic tests are much better, at least for those who understand them. Let those of us who understand them observe the synthetic tests, look at the tradeoffs the two companies have made, and decide for themselves.
 
Demirug said:
DaveBaumann said:
Probably the texture address processor - what was the texture address processor in NV2x was extended for the texture address / FP ALU in NV30.

Yes, and the texture shader works with FP32, too.

NV2X:

Code:
   Primitivesetup 
         | 
|---->---| 
|        |
| Textureshader(FP32) 
|    |       | 
|  TMU  Bypass 
|    |       | 
|--<--Loopback
        |
  Shaderbackend 
        |
-->-----|
|       | 
| Combiners (FX9)
|      | 
-<-Loopback 
       | 
     ROP
I don't buy it. There's absolutely no need for NV2x to have FP32 texture lookups.
 
OpenGL guy said:
Demirug said:
DaveBaumann said:
Probably the texture address processor - what was the texture address processor in NV2x was extended for the texture address / FP ALU in NV30.
Yes, and the texture shader works with FP32, too.
I don't buy it. There's absolutely no need for NV2x to have FP32 texture lookups.
Well, I can't test the texture shader, but the fixed function pipeline is easy enough to check. . .

Here's my little program:
http://members.shaw.ca/dwkjo/Programs/TextureAddressing.zip

Up zooms in some more
Down zooms out
Left moves the perspective left
Right moves the perspective right
1 - 4 toggles between different texture address ranges
F1 turns off dependent texture reads
F2 turns on dependent texture reads (if ARB_vertex_program and ARB_fragment_program are supported)

Here's normal rendering (FP32) on a Radeon 9700 Pro at the first texture address range:
http://members.shaw.ca/dwkjo/screenshots/normal.png

Here's dependent texture reads (FP24) on the same card at the same range:
http://members.shaw.ca/dwkjo/screenshots/dependant.png
 
Ostsol said:
OpenGL guy said:
Demirug said:
DaveBaumann said:
Probably the texture address processor - what was the texture address processor in NV2x was extended for the texture address / FP ALU in NV30.
Yes, and the texture shader works with FP32, too.
I don't buy it. There's absolutely no need for NV2x to have FP32 texture lookups.
Well, I can't test the texture shader, but the fixed function pipeline is easy enough to check. . .

Here's my little program:
http://members.shaw.ca/dwkjo/Programs/TextureAddressing.zip

Up zooms in some more
Down zooms out
Left moves the perspective left
Right moves the perspective right
1 - 4 toggles between different texture address ranges
F1 turns off dependent texture reads
F2 turns on dependent texture reads (if ARB_vertex_program and ARB_fragment_program are supported)

Here's normal rendering (FP32) on a Radeon 9700 Pro at the first texture address range:
http://members.shaw.ca/dwkjo/screenshots/normal.png

Here's dependent texture reads (FP24) on the same card at the same range:
http://members.shaw.ca/dwkjo/screenshots/dependant.png
What's your point? Is anyone interested in staring at four texels stretched across the whole screen?
 