FP16 and market support

Althornin said:
translation:
If nVidia hadn't bothered to waste transistors on FP16/integer units, the NV3x would be much better.

I mean, what's your point? Why do you fail to ever lay blame on nVidia? Why must it always be someone else's fault - ATI, MS, game developers, etc.?
You just don't pay attention to what I say, Althornin. You just hear what you like to hear: that I like nVidia, so everything nVidia does is good, that everything ATI does is bad. That's not what I think. I am responding to people who are overwhelmingly biased towards ATI, and have unwarranted opinions towards nVidia's hardware. I am not going to write huge posts filled with lots of conditions to try and clear things up. I typically try to make a specific point with each post. It's your fault you're misreading me.

I also find it funny that out of one side of your mouth you proclaim FP16 as useful, and out of the other side say that FP24 is not.

FP16 is a low precision worthless format, and should be on its way out.
This is the exact problem I'm talking about.

FP16's precision is more than enough for specific circumstances.

FP24's precision is more than enough for those same circumstances, plus more.

FP32's precision is more than enough for still more specific circumstances.

Each precision has cases where it will not work properly. Does that mean that none of these precisions should be used? No. It means that they should be used where they are useful. The way you're arguing, you might as well ask for FP64 all the time.

FP24 is on the way out because it appears that the pipelines are going to be unified. FP24 cannot be used alongside FP32 in the same pipeline to any reasonable advantage. FP16, however, can, as nVidia has shown already. It's not about the precision available with FP16 vs. FP24. It's about the fact that only one of these modes would be useful alongside FP32.
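To put rough numbers on "more than enough for specific circumstances", here's a minimal C sketch, assuming the usual bit layouts (FP16 = s10e5, ATI's FP24 = s16e7, FP32 = s23e8). It just prints the spacing of representable values around 1.0 for each format, which is one way to see where each precision stops being enough (e.g. for addressing a large texture with a [0,1] coordinate):

#include <math.h>
#include <stdio.h>

int main(void)
{
    const char *name[]      = { "FP16", "FP24", "FP32" };
    const int   mant_bits[] = { 10, 16, 23 };   /* explicit mantissa bits */

    for (int i = 0; i < 3; ++i) {
        double eps   = ldexp(1.0, -mant_bits[i]); /* gap between 1.0 and the next value */
        double steps = 1.0 / eps;                 /* distinct values per unit near 1.0 */
        printf("%s: spacing near 1.0 = %g  (~%.0f distinguishable steps in [0,1])\n",
               name[i], eps, steps);
    }
    return 0;
}

With those layouts the spacing works out to roughly 1/1024, 1/65536 and 1/8388608 respectively, which is the sense in which each precision is "enough" only up to a point.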
 
Tridam said:
What would you add in DX9.1 to help NVIDIA's performance?
That's not my decision; they know their own design best, so let them decide. I just find it hard to believe that all these extra transistors and memory bandwidth can't be put to good use.
FP16 support? Already in DX9.
Is it? Where? All I see is that some instructions don't require full precision in the mantissa. That's something different than using another format throughout the whole pipeline. And switching between the two formats must cost something as well. So they probably don't gain as much from it as they could.
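As I understand it, the partial-precision hint amounts to something like rounding the result of a single instruction to an FP16-sized mantissa while the rest of the pipeline stays at full precision; a rough C sketch of that idea (round_to_fp16 is a hypothetical helper that ignores FP16's exponent range and denormals):

#include <math.h>
#include <stdio.h>

/* Hypothetical helper: quantize a float's mantissa to FP16's 10 explicit
 * bits (11 significant bits total).  Exponent range and denormals ignored. */
static float round_to_fp16(float x)
{
    int e;
    float m = frexpf(x, &e);            /* x = m * 2^e, m in [0.5, 1) */
    m = rintf(m * 2048.0f) / 2048.0f;   /* keep 11 significant bits */
    return ldexpf(m, e);
}

int main(void)
{
    float a = 1.2345678f, b = 3.1415927f, c = 0.5f;

    float full = a * b + c;                 /* one op at full precision */
    float pp   = round_to_fp16(a * b + c);  /* same op with a "_pp"-style hint */

    printf("full precision: %.7f\n", full);
    printf("partial  (_pp): %.7f\n", pp);
    return 0;
}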
Better compiler? No need for a new DX revision.
Well, I'm no expert on the design of the hardware's instruction sets, but I believe ATI is totally parallel with ps 2.0 while Nvidia has more RISC-like instructions? Either way, it's always faster to program in the assembly language directly than to write a complicated compiler that still has to make compromises.

I wrote an 'emulator' myself for x86 which translates SSE instructions into the equivalent FPU instructions. As expected it was much slower, but the main cause was actually that it required several emms instructions. I even optimized it to use as few emms instructions as possible. However, it's quite easy to write all the FPU code manually and get far superior code with just one emms instruction per iteration or even per loop.

I'd really love to see ATI emulate Nvidia's instructions... :rolleyes:
 
Chalnoth said:
Um, that is what happens. Modern architectures have their designs pretty much set somewhere between 18-24 months prior to release. That's quite a bit before the API is designed.

Do you really think that ATI designed the R300 after DirectX 9, which wasn't released until nearly 6 months after the R300 was released?
Thanks for confirming that!

So, don't some people realize what a huge impact Microsoft's decisions have on hardware performance? You have two big companies battling each other with what are essentially hardware design differences. But to make optimal use of their specific design they have to convince Microsoft to adapt to their design. Obviously, Microsoft's decision depends on more than what could be considered the best compromise for developers...

Please correct me if I'm wrong but... just the fact that Nvidia had to write an optimizing compiler (and it does optimize things), and ATI did not, kind of makes me believe they totally picked ATI's side this time. :?
 
chalnoth,

please take your head out of your rear long enough to listen. MS stated that they used the 8500 VPU as the reference for DX9. Nvidia and ATi had access to the same specs and feature set ideas for DX9, but, that's right, Nvidia acted like a little brat and stormed out because MS would not change the specs for them.


When are you going to realize that Nvidia has to be held accountable for Nvidia's actions? It is not the fault of

MS
ATi
S3
Matrox
SiS
ST

or any other company involved that nVidia screwed up; it is Nvidia's fault.

Everyone here agrees (except for you) that NV3X would have been a much better product if they had focused solely on FP32 performance rather than trying to make an excellent DX8 card with some DX9 features to get their foot in the door.

And DX9 was not 6 months after the R300, it was 3, as it came out in December 2002, so once again you are wrong.

You constantly state that ATi does things they do not do and try to make them look bad, but I've got news for you: ATi does not make Nvidia look bad, Nvidia makes themselves look bad.

BTW, have you seen the preview of the new S3 card? Oh my, looks like Nvidia will be knocked down to third place in the value segment of DX9 cards... I never thought S3 would be faster than Nvidia, but I guess I was wrong... Go to www.firingsquad.com and look for the S3 preview.
 
Bouncing Zabaglione Bros. said:
"Embrace and extend" eh? Problem is that this is devisive at at time when developers want a consitent interface. Historically, extras are ignored until they make it into an API.
So what you're saying is that ATI's shader design is not an extra and Nvidia's design is, just because the former made it into the API without compromises and the latter was forced to write an optimizing compiler?
GFFX is a good example because Nvidia tried (and failed spectacularly) to force this "going beyond the API" because (a) performance sucks when you do, and (b) they neglected to supply things that are useful, basic parts of the API.
(a) New features can always be categorized as either new capabilities or optimizations of other techniques. So I don't think everything is inherently slower.
(b) But the API decisions were made afterwards. So what you consider 'basic' features is what ATI considered optimal for their architecture. But I don't see that they actually listened much to Nvidia's demands, did they?

And since OpenGL 2.0 isn't expected any time soon, Nvidia is quite stuck with specifications that are sub-optimal for their architecture.
I was being fatuous, but PS 1.4 was made part of the API - and little supported either then or now because Nvidia refused to support it when they were the 900 lb gorilla. It's this sort of thing that shows why API decisions should not be up to one IHV.
Exactly. That's why I think just as much that ATI shouldn't be making all the decisions now. I could be wrong about all this so please correct me if necessary...
 
YeuEmMaiMai said:
BTW, have you seen the preview of the new S3 card? Oh my, looks like Nvidia will be knocked down to third place in the value segment of DX9 cards... I never thought S3 would be faster than Nvidia, but I guess I was wrong... Go to www.firingsquad.com and look for the S3 preview.

That, my friend, is laughable. We will see, though; I personally don't believe it at all, since S3 is getting completely owned at the moment and the XGI solution is terrible. S3 can't even do OpenGL yet; I mean, it is like a GeForce3 running it.
 
Nick said:
Please correct me if I'm wrong but... just the fact that Nvidia had to write an optimizing compiler (and it does optimize things), and ATI did not, kind of makes me believe they totally picked ATI's side this time. :?
You're wrong. ATI has had an optimizing compiler since the R300 launch. It has gone through some major revisions as well and will continue to do so.
 
OpenGL guy said:
Nick said:
Please correct me if I'm wrong but... just the fact that Nvidia had to write an optimizing compiler (and it does optimize things), and ATI did not, kind of makes me believe they totally picked ATI's side this time. :?
You're wrong. ATI has had an optimizing compiler since the R300 launch. It has gone through some major revisions as well and will continue to do so.
I was about to say the same thing. I wonder where the myth that ATI doesn't have an optimizing compiler came from? Uninformed reviews spewing nVidia PR, mayhaps?
 
Chalnoth said:
You just don't pay attention to what I say, Althornin. You just hear what you like to hear: that I like nVidia, so everything nVidia does is good, that everything ATI does is bad. That's not what I think. I am responding to people who are overwhelmingly biased towards ATI, and have unwarranted opinions towards nVidia's hardware. I am not going to write huge posts filled with lots of conditions to try and clear things up. I typically try to make a specific point with each post. It's your fault you're misreading me.
I'm sorry, but that's tripe.
I listen to what you say - you just hardly ever say anything new.
You have yet to say that nVidia should take some blame for their crappy performance - in ANY thread. But you always manage to blame it on someone else, just like you did in this thread. It's not a matter of not listening to you, Chalnoth, it's a matter of you not listening to yourself.

It's your fault you can't open your mouth without letting your enormous bias slip out.
 
What I don't understand is: if your entire pipeline is running at FP32 precision, why would you ever want to run at anything lower? I built both fixed-point and floating-point adders and multipliers (in an Altera simulator), and while in fixed point you can, with some compromises, do multiple lower-precision ops simultaneously on one higher-precision ALU, you just can't in floating point; the architecture is just too complex to fathom doing something like that. FP16 would confer no advantages over FP32 in an architecture with a proper number of internal registers. Intel's and AMD's FPUs always execute and store internally at 80-bit precision, for example, no matter what precision the programmer/assembler loads into and stores from the FP stack.
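For the fixed-point case, the kind of compromise I mean looks roughly like this in C: two 16-bit adds done with one 32-bit add, at the cost of a little masking to keep the carry from crossing between the lanes (a sketch, not tied to any particular hardware):

#include <stdint.h>
#include <stdio.h>

/* Two 16-bit unsigned adds performed with one 32-bit add (SWAR).
 * The compromise: the carry out of the low lane must be suppressed,
 * which costs a few masking operations around the single wide add. */
static uint32_t add_two_u16(uint32_t a, uint32_t b)
{
    uint32_t sum_no_carry = (a & 0x7FFF7FFFu) + (b & 0x7FFF7FFFu); /* add without letting bit 15 carry across */
    uint32_t top_bits     = (a ^ b) & 0x80008000u;                 /* restore each lane's own top bit */
    return sum_no_carry ^ top_bits;
}

int main(void)
{
    /* low lane: 0x1234 + 0x0F0F, high lane: 0xFF00 + 0x00FF */
    uint32_t a = 0xFF001234u, b = 0x00FF0F0Fu;
    uint32_t r = add_two_u16(a, b);
    printf("packed result: 0x%08X (lanes: 0x%04X, 0x%04X)\n",
           r, r >> 16, r & 0xFFFFu);
    return 0;
}

Nothing like this trick carries over cleanly to a floating-point datapath, which is the point of the comparison.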
 
Tridam said:
What would you add in DX9.1 to help NVIDIA's performance?
Actually, the best thing that Microsoft could do is change HLSL. The intermediate assembly format is the biggest problem with HLSL. Essentially, the intermediate assembly means that information is lost when the assembly is generated. This puts more strain on driver developers to develop more optimal compilers for their respective architectures.

I see two possible ways of modifying HLSL:
1. Creating a new intermediate format that retains all of the information of the HLSL.
2. Going the GLSL route of just having IHVs do all of the compiling.

You have to remember that the differences between the architectures are much more extensive than just the precision differences. We've seen examples posted previously on this message board of algorithms you'd want to implement completely differently on the different hardware.
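As a toy illustration of the information loss (using a made-up three-address form whose opcode names only loosely follow ps_2_0): if the HLSL said normalize(v), the intermediate assembly only contains the dp3/rsq/mul expansion, so a back-end with a cheaper native normalize has to pattern-match the expansion to recover the intent, and anything that didn't survive the lowering is simply gone:

#include <stdio.h>
#include <string.h>

/* Toy lowered instruction stream, as a driver back-end might see it. */
typedef struct { const char *op; int dst, src0, src1; } Ins;

/* Try to recover a normalize() from its dp3/rsq/mul expansion. */
static int find_normalize(const Ins *code, int n)
{
    for (int i = 0; i + 2 < n; ++i) {
        if (strcmp(code[i].op, "dp3") == 0 &&
            strcmp(code[i + 1].op, "rsq") == 0 &&
            strcmp(code[i + 2].op, "mul") == 0 &&
            code[i + 1].src0 == code[i].dst &&      /* rsq consumes the dp3 */
            code[i + 2].src1 == code[i + 1].dst &&  /* mul consumes the rsq */
            code[i + 2].src0 == code[i].src0)       /* applied to the same vector */
            return i;
    }
    return -1;
}

int main(void)
{
    Ins lowered[] = {
        { "dp3", 1, 0, 0 },   /* r1 = dot(r0.xyz, r0.xyz) */
        { "rsq", 2, 1, -1 },  /* r2 = 1 / sqrt(r1)        */
        { "mul", 3, 0, 2 },   /* r3 = r0 * r2 == normalize(r0) */
    };
    int at = find_normalize(lowered, 3);
    printf(at >= 0 ? "recovered normalize() at instruction %d\n"
                   : "pattern not found (%d)\n", at);
    return 0;
}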
 
akira888 said:
What I don't understand is: if your entire pipeline is running at FP32 precision, why would you ever want to run at anything lower?
Modern x86 processors do all non-SSE FP operations in 80-bit precision. Does that mean that we should all be using long doubles?

Anyway, I can easily imagine ways to modify an FP32 pipeline to make it also capable of acting like two FP16 pipelines with a few added transistors.
 
Althornin said:
You have yet to say that nVidia should take some blame for their crappy performance - in ANY thread. But you always manage to blame it on someone else, just like you did in this thread.
Because I don't like blame. I think of problems as things to fix. Basically, I think that nVidia's architecture is set, and has been set for a very long time. There is nothing to fix there. Microsoft, however, had a choice. They created a problem, and I think, for no good reason. This is what I don't like.

You may not see a difference between blame and what I stated above, but I see a huge difference.

Basically, nVidia can't do anything to fix the way their architecture has been designed with the NV3x. It is pointless to keep laying out what has become a mistake, in part due to Microsoft's choices. Of course, they had a responsibility to fix the shortcomings of the NV3x architecture for the NV4x, and they definitely gave it a good start with the NV35+.
 
Chalnoth said:
Tridam said:
What would you add in DX9.1 to help NVIDIA's performance?
Actually, the best thing that Microsoft could do is change HLSL. The intermediate assembly format is the biggest problem with HLSL. Essentially, the intermediate assembly means that information is lost when the assembly is generated. This puts more strain on driver developers to develop more optimal compilers for their respective architectures.

I see two possible ways of modifying HLSL:
1. Creating a new intermediate format that retains all of the information of the HLSL.
2. Going the GLSL route of just having IHVs do all of the compiling.

You have to remember that the differences between the architectures are much more extensive than just the precision differences. We've seen examples posted previously on this message board of algorithms you'd want to implement completely differently on the different hardware.

I know the differences between the architectures well enough...

2. Impossible (before a major DX revision -> DX10)
1. Why not

Anyway, the GeFFX architectures have some limitations that won't disappear. Whatever you change in DX, you'll still have a data access limitation. Whatever you change in DX, you'll still lose efficiency because the small ALU can't do every op. Whatever you change in DX, you'll still lose efficiency because the full ALU can't work when you're doing texturing.

Do you think that Microsoft has to bring out a new DX revision to help NVIDIA deal with these architecture limitations? I don't think that Microsoft will ever do something like that... ATI can also ask Microsoft to bring out a new DX revision reintroducing co-issue or reintroducing more modifiers.

AFAIK, ATI has to use a compiler to enable co-issuing and some additional hardware primarily used for some ps1.4 modifiers. ATI also has to ask developers to think about co-issuing when writing shaders.

NVIDIA has to use a compiler to reduce the number of registers used and to make the instruction order more GeFFX-friendly. NVIDIA has to ask developers to think about register counts and precision when writing shaders.

ATI and NVIDIA are equal. They both need a compiler. They both need the help of developers. But because ATI has a shading performance advantage, Microsoft has to help NVIDIA? It's too easy to say that Microsoft is the guilty party responsible for the disappointing GeFFX shading performance. Why do we have the same situation with OpenGL?
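To make the "reduce the number of registers" point concrete, here's a minimal C sketch (toy instruction format, made-up register numbers) that just counts how many distinct temporaries two equivalent schedules touch; the driver compiler's job is essentially to turn the first into the second:

#include <stdio.h>

/* Toy three-address instruction: dst and two sources are temp register
 * indices, -1 means unused.  Purely illustrative, not real ps_2_0. */
typedef struct { int dst, src0, src1; } Ins;

/* Count how many distinct temp registers a sequence touches. */
static int temps_used(const Ins *code, int n)
{
    int used[32] = {0}, count = 0;
    for (int i = 0; i < n; ++i) {
        int regs[3] = { code[i].dst, code[i].src0, code[i].src1 };
        for (int r = 0; r < 3; ++r)
            if (regs[r] >= 0 && !used[regs[r]]) { used[regs[r]] = 1; count++; }
    }
    return count;
}

int main(void)
{
    /* Same computation, r_out = r0*r1 + r0*r2, written two ways. */
    Ins as_compiled[] = {          /* fresh temp for every intermediate */
        { 3, 0, 1 },               /* r3 = r0 * r1 */
        { 4, 0, 2 },               /* r4 = r0 * r2 */
        { 5, 3, 4 },               /* r5 = r3 + r4 */
    };
    Ins reallocated[] = {          /* reuse r3 once its value is dead */
        { 3, 0, 1 },               /* r3 = r0 * r1 */
        { 4, 0, 2 },               /* r4 = r0 * r2 */
        { 3, 3, 4 },               /* r3 = r3 + r4 */
    };
    printf("as compiled: %d temps\n", temps_used(as_compiled, 3));
    printf("reallocated: %d temps\n", temps_used(reallocated, 3));
    return 0;
}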
 
Tridam said:
Anyway, the GeFFX architectures have some limitations that won't disappear. Whatever you change in DX, you'll still have a data access limitation. Whatever you change in DX, you'll still lose efficiency because the small ALU can't do every op. Whatever you change in DX, you'll still lose efficiency because the full ALU can't work when you're doing texturing.

Do you think that Microsoft has to bring out a new DX revision to help NVIDIA deal with these architecture limitations? I don't think that Microsoft will ever do something like that... ATI can also ask Microsoft to bring out a new DX revision reintroducing co-issue or reintroducing more modifiers.

AFAIK, ATI has to use a compiler to enable co-issuing and some additional hardware primarily used for some ps1.4 modifiers. ATI also has to ask developers to think about co-issuing when writing shaders.

NVIDIA has to use a compiler to reduce the number of registers used and to make the instruction order more GeFFX-friendly. NVIDIA has to ask developers to think about register counts and precision when writing shaders.

ATI and NVIDIA are equal. They both need a compiler. They both need the help of developers. But because ATI has a shading performance advantage, Microsoft has to help NVIDIA? It's too easy to say that Microsoft is the guilty party responsible for the disappointing GeFFX shading performance. Why do we have the same situation with OpenGL?

Very well said!

Btw, it was OpenGL that let us see why the shading performance sucks. And that has little to do with MS not exposing enough.
 
Chalnoth said:
Because I don't like blame. I think of problems as things to fix. Basically, I think that nVidia's architecture is set, and has been set for a very long time. There is nothing to fix there. Microsoft, however, had a choice. They created a problem, and I think, for no good reason. This is what I don't like.
"I dont like blame"
"i blame microsoft"
get real.
nVidia had a choice also. You always ignore information that contradicts your inane and hugely fanish assumptions - namely, information that ATI chose its precision AFTER MS set it, according to some of the ATI guys here.
So nVidia chose to do something different.
 
Anyone who read the NV10(!) register combiner specification knows that nVidia's original plan was to convert their register combiners to floating point.

The strange thing is that if you look at what the R300 can do in one clock, it's almost the same as nV's register combiners (MAD + MUL in a single cycle, single-cycle LRP, co-issue, free I/O modifiers).

One of the reasons ATI did this might be exactly that: they thought that was what nV was working on.

And that is what nV should have done! They should have converted their register combiners to FP.

Should we say anything about the precision of the combiner computations?

RESOLUTION: NO. The spec is written as if the computations are
done on floating point values ranging from -1.0 to 1.0 (clamping is
specified where this range is exceeded). The fact that NV10 does
the computations as 9-bit signed fixed point is not mentioned in
the spec. This permits a future design to support more precision
or use a floating point representation.
<...>

Should a dot product be computed in parallel with the sum of products?

RESOLUTION: NO.<...>

The rationale for this is that we want to minimize the number of
adders that are required to ease a transition to floating point.
 
Tridam said:
Anyway, the GeFFX architectures have some limitations that won't disappear.
Every architecture in existence has limitations. It just turns out that the GeForce FX appears to be a processor whose peak performance can be quite high but whose worst-case scenario is incredibly bad.

Do you think that Microsoft has to bring out a new DX revision to help NVIDIA deal with these architecture limitations? I don't think that Microsoft will ever do something like that... ATI can also ask Microsoft to bring out a new DX revision reintroducing co-issue or reintroducing more modifiers.
Microsoft already released a new compiler target to help performance with the NV3x. Beyond this, all that Microsoft can really do to help nVidia is officially support the FX12 format, which doesn't seem like a bad thing to me, but it just won't happen at this juncture (and yes, this still upsets me, because it would be such a tiny change in the software: again, Microsoft can fix the problem easily, which is what upsets me).

The statement I made, however, was vendor-independent. It had nothing to do with which IHV is doing what, it has to do with a rather poor implementation of HLSL by Microsoft (which is, ironically, almost certainly due to nVidia's Cg model).

Yes, making the change I suggested would benefit nVidia more than ATI right now, because nVidia's architecture is in more need of a good compiler. This is simply because compiler differences seem to make a much larger performance difference on the NV3x than they do on the R3xx.
 
I think FP16 isn't so much the problem for this generation as not having FP24 support. You need FP24 for the minimum full precision standard, and any resources devoted to going beyond that are resources you can't apply anywhere else. So Nvidia is stuck with having to go to FP32 to meet the minimum FP24 standard, and that's an unacceptable performance hit for their current cards.
 
Chalnoth said:
akira888 said:
What I don't understand is: if your entire pipeline is running at FP32 precision, why would you ever want to run at anything lower?
Modern x86 processors do all non-SSE FP operations in 80-bit precision. Does that mean that we should all be using long doubles?

Anyway, I can easily imagine ways to modify an FP32 pipeline to make it also capable of acting like two FP16 pipelines with a few added transistors.

Chalnoth, the two situations are not in the least comparable. In my case we have a DSP which is constantly streaming FP32 data "for free" into the FP32 fragment blend ALUs. In your case you have a system which is always memory bandwidth-limited, having to fetch data from external RAM into its register stack. Doing the operation internally on the x86 FPU at single precision would result in zero speedup. Likewise, down-converting FP32 data from trisetup into an FP16 format to execute on 32-bit precision ALUs will also result in zero speedup.
 