FP16 and market support

OpenGL guy said:
You're wrong. ATI has had an optimizing compiler since the R300 launch. It has gone through some major revisions as well and will continue to do so.
Thanks for correcting that, I'd really never heard of it...

Anyway, trying to be objective, could it be possible that ATI's internal instruction set is very close to ps 2.0, so the optimizing compiler just corrects 'stupidities' of the programmer? For example it could reorder instructions to break dependencies, or detect when fewer registers can be used, or when a copy is not required? In that case, I can see the need for it, but it would still be unfair compared to the advanced optimizing compiler Nvidia had to write, which will never be truly optimal.

I'm a software rendering freak, so I'm not looking for any excuses, just explanations we could learn from.
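
To make that kind of optimization concrete, here is a minimal, hypothetical C++ sketch of the register-pressure reduction described above: it walks a toy three-address instruction stream and recycles a register as soon as its value has been read for the last time. The `Instr` representation and `reuse_registers` pass are invented for illustration only and are not ATI's actual compiler.

```cpp
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical three-address "shader" instruction: dst = op(src0, src1).
struct Instr {
    std::string op, dst, src0, src1;
};

// Rename temporaries so a hardware register is recycled as soon as its value
// has been read for the last time. This is the flavour of register-pressure
// reduction a driver-side optimizer can do without changing the math.
std::vector<Instr> reuse_registers(std::vector<Instr> code) {
    // Index of the last instruction that reads each name.
    std::unordered_map<std::string, size_t> last_use;
    for (size_t i = 0; i < code.size(); ++i) {
        last_use[code[i].src0] = i;
        last_use[code[i].src1] = i;
    }

    std::unordered_map<std::string, std::string> rename; // temp -> hw register
    std::vector<std::string> free_regs;                  // recycled registers
    int next_reg = 0;

    for (size_t i = 0; i < code.size(); ++i) {
        const std::string orig0 = code[i].src0, orig1 = code[i].src1;

        // Substitute already-renamed operands (inputs/constants pass through).
        auto subst = [&](std::string& s) {
            auto it = rename.find(s);
            if (it != rename.end()) s = it->second;
        };
        subst(code[i].src0);
        subst(code[i].src1);

        // Release registers whose temporaries die at this instruction.
        auto release = [&](const std::string& name) {
            auto it = rename.find(name);
            if (it != rename.end() && last_use[name] == i) {
                free_regs.push_back(it->second);
                rename.erase(it);
            }
        };
        release(orig0);
        release(orig1);

        // Allocate a register for the result, preferring a recycled one.
        std::string r;
        if (!free_regs.empty()) { r = free_regs.back(); free_regs.pop_back(); }
        else                    { r = "r" + std::to_string(next_reg++); }
        rename[code[i].dst] = r;
        code[i].dst = r;
    }
    return code;
}

int main() {
    // t0..t3 are the programmer's temporaries; v0, c0, c1 are inputs/constants.
    std::vector<Instr> prog = {
        {"mul", "t0", "v0", "c0"},
        {"mul", "t1", "v0", "c1"},
        {"add", "t2", "t0", "t1"},
        {"mul", "t3", "t2", "t2"},
    };
    for (const Instr& in : reuse_registers(prog))
        std::cout << in.op << " " << in.dst << ", " << in.src0 << ", " << in.src1 << "\n";
    // Four programmer temporaries end up in just two hardware registers.
}
```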
 
Althornin said:
I was about to say the same thing. I wonder where the myth that ATI doesn't have an optimizing compiler came from? Uninformed nVidia PR spewing reviews, mayhaps?
Relax! I never read it wasn't so, I just never heard about it, so don't use it as any kind of argument to make Nvidia look like the bad guys or ATI look like the good guys, ok? Now would you be so kind as to tell me where you heard about ATI's optimizing compiler, and if you also know how advanced it is? I'm just trying to find an explanation for why ATI had it from the start while Nvidia, with such a great driver team, needed months for it. Or doesn't it have one?
 
Hyp-X said:
And that is what nV should have done! They should have converted their register combiners to FP.
Who says they haven't? Could that be an explanation for the performance issues if shaders were actually implemented via register combiners? Ok that's probably total crap I'm saying now... it's getting late. :rolleyes:

Like I said before I'm just trying to find a reasonable explanation. I can hardly accept that more transistors and more bandwidth yield less performance. It's possible if you have chimps in the hardware design department, but that seems highly unlikely for any company of this caliber. ;)
 
Nick said:
Althornin said:
I was about to say the same thing. I wonder where the myth that ATI doesn't have an optimizing compiler came from? Uninformed nVidia PR spewing reviews, mayhaps?
Relax! I never read it wasn't so, I just never heard about it, so don't use it as any kind of argument to make Nvidia look like the bad guys or ATI look like the good guys, ok? Now would you be so kind as to tell me where you heard about ATI's optimizing compiler, and if you also know how advanced it is? I'm just trying to find an explanation for why ATI had it from the start while Nvidia, with such a great driver team, needed months for it. Or doesn't it have one?
I think that ATI has worked on a general ASM optimizing compiler from the start. In the tests I did in April I was already seeing that an ASM optimizing compiler was working, even if what I saw was basic stuff. ATI needs an optimizer because DX9 doesn't expose some of ATI's hardware capabilities: co-issue, some modifiers...

I think that NVIDIA hasn't worked on a general ASM optimizing compiler from the start. Instead they've worked on Cg. Maybe they thought that by optimizing the Cg compiler they wouldn't need an ASM optimizer? But even their Cg compiler doesn't really do a great job... Anyway I think that the main reason is that developing an optimizing compiler for the GeFFX is very difficult. NVIDIA had to quickly show good performance. So their priority was manual shader replacement. When you're working on manual optimization of 3DMark you're not working on a compiler...
 
Nick said:
So what you're saying is that ATI's shader design is not an extra and Nvidia's design is, just because the former made it into the API without compromises and the latter was forced to write an optimizing compiler?

ATI has always had a general purpose shader compiler - it does things like reordering shaders to match their hardware. This makes sure the hardware is running the shader at the highest possible speed, whilst still maintaining mathematical equivalence.

The design of the shader engines/optimisers is irrelevant to the API. The API provides a certain standard of output the hardware must meet; how it produces that output doesn't matter. If someone designs something to that standard, that's fine. If someone designs something better that later becomes part of the API, that's great.

What Nvidia did was design something greater than the API standard (FP32) which they then couldn't run fast enough to be usable, and then something less than the API standard (FP16) which allowed their hardware to run faster, but is below the standard API spec.

When APIs are meant to drive the hardware towards better quality rendering (for the benefit of the consumer), it is no surprise that MS and the other IHVs who were part of the API discussion did not want to go backwards and allow the use of older FP16 (which we have been using for many years) to be part of that forward-driving standard.

In fact it appears that even Nvidia, with their trumpeting of 32-bit rendering and "cinematic computing" at the launch of NV30, also did not see FP16 as a future technology, and only switched to marketing it as an "optimisation" when it became obvious that ATI was kicking Nvidia's butt in performance whilst still meeting the API standard spec that everyone is programming to.

Nick said:
(a) New features can always be categorized as either new capabilities or optimizations of other techniques. So I don't think everything is inherently slower.

It appears to be for the much-used parts of the NV3x series when running DX9. Not good for a card strongly marketed as a DX9 part. Are there any benchmarks where Nvidia does not cheat in which it even reaches parity with the slower-clocked ATI equivalent running on a larger process? The fact that Nvidia needs to cheat on all these tests as a matter of course should show you there are real performance issues that they cannot overcome with honest performance.

Nick said:
(b) But the API decisions were taken afterwards. So what you consider 'basic' features is what ATI considered optimal for their architecture. But I don't see that they actually listened much to Nvidia's demands, did they?

Do you really think that Nvidia and ATI invested tens of millions in designing new, cutting edge DX9 parts without knowing far in advance what the major parts of the DX9 spec would be? Don't you think they had an 18-month head start on the rest of us? Do you think either of these companies would spend huge sums of money on this kind of R&D with Microsoft telling them "we have a spec, but we won't tell you what it is for a couple of years"?

Nick said:
And since OpenGL 2.0 isn't expected any time soon Nvidia is quite stuck with specifications that are sub-optimal for their architecture.

That's because Nvidia was publicly and obviously trying to take control of the API and force developers into doing things the Nvidia way. They failed to create an "NVGLide", and are now stuck with the results of this failed gamble - an architecture that does not fit the current APIs very well. Combined with gambles on programmable design and going to a new 0.13 micron process at the same time, Nvidia put all their eggs in one basket. It was Nvidia's bad luck that their gambles failed at the same time as ATI brought out a groundbreaking part which Nvidia had publicly said was impossible to make.

Nick said:
Exactly. That's why I think just as much that ATI shouldn't be making all the decisions now. I could be wrong about all this so please correct me if necessary...

You are wrong. Do you really think Microsoft allowed ATI to "make all the decisions"? Even if they did, don't you think a company with the size, influence and history of Nvidia should have made sure they were working within those available standards?

It's like all the car and oil companies getting together to decide the future of the car/gas industry, and GM takes part and designs a good next generation car, while Ford blows off the conference thinking they are big enough to push the market where they want it to go. Ford then finds that no one wants to buy their substandard car a couple of years later. I'll let you guess which company is Nvidia and which is ATI.
 
Chalnoth said:
akira888 said:
What I don't understand is, if your entire pipeline is running at FP32 precision, why would you ever want to run at anything lower?
Modern x86 processors do all non-SSE FP operations in 80-bit precision. Does that mean that we should all be using long doubles?
That's only reflected in external storage (load from/store to system memory).
FPU registers are always 80 bits wide, as are internal results.
... the FPU can be configured to take some slack on precision, but this doesn't affect the data formats at all; you'll always get 80-bit results. Some of these 80 bits may not be correct, but anyway, you can't store two low precision results in one high precision register. That's ... um ... my point :)

But then there's 3DNow. This does store two lower precision results in one high precision register and, by doing so, also doubles the number of temporaries you can keep in the register file (16 vs 8 ). 3DNow also doubles throughput ('real' SIMD).

None of this is really NV3xish because AFAIK NV3x doesn't double throughput on FP16 vs FP32.

Anyway, I can easily imagine ways to modify an FP32 pipeline to make it also capable of acting like two FP16 pipelines with a few added transistors.
AMD can, so there must surely be ways to do it.

edit because there are rare instances when an eight followed by a closing bracket isn't the same as a pair of sunglasses 8)
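
To illustrate the packing idea in code, here is a minimal, hypothetical C++ sketch of converting two FP32 values to FP16 bits and storing them in one 32-bit word, which is the basic trick behind getting twice the temporaries (or twice the throughput) out of the same register file at lower precision. The `f32_to_f16` conversion is deliberately simplified and is not how any particular GPU or CPU actually does it.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Convert an FP32 value to FP16 bits. Deliberately simplified: truncating
// rounding, denormals flush to zero, NaNs collapse to infinity.
static uint16_t f32_to_f16(float f) {
    uint32_t x;
    std::memcpy(&x, &f, sizeof x);                           // raw IEEE-754 bits
    uint32_t sign = (x >> 16) & 0x8000u;                     // sign bit
    int32_t  exp  = (int32_t)((x >> 23) & 0xFFu) - 127 + 15; // rebias exponent
    uint32_t mant = (x >> 13) & 0x3FFu;                      // top 10 mantissa bits
    if (exp <= 0)  return (uint16_t)sign;                    // underflow -> signed zero
    if (exp >= 31) return (uint16_t)(sign | 0x7C00u);        // overflow -> infinity
    return (uint16_t)(sign | ((uint32_t)exp << 10) | mant);
}

// Pack two FP16 results into one 32-bit "register": the basic idea behind
// doubling the number of temporaries you can keep in the same register file
// when you drop from FP32 to FP16.
static uint32_t pack_half2(float lo, float hi) {
    return (uint32_t)f32_to_f16(lo) | ((uint32_t)f32_to_f16(hi) << 16);
}

int main() {
    std::printf("packed: 0x%08X\n", pack_half2(1.0f, -2.5f)); // prints 0xC1003C00
    return 0;
}
```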
 
Chalnoth said:
The statement I made, however, was vendor-independent. It had nothing to do with which IHV is doing what, it has to do with a rather poor implementation of HLSL by Microsoft (which is, ironically, almost certainly due to nVidia's Cg model).

The DX HLSL compiler runs circles around the Cg compiler in the performance of the generated code regardless of the target, as it simply does a much better job at generic optimizations. The resulting code is pretty close to ideal if the hardware ran PS assembly as such, with no register limitations etc. Different profiles help optimization for real hardware, but obviously giving the high-level code to the driver to optimize is better. Looking at NVidia's track record on compilers, I wouldn't expect miracles though.
 
Nick said:
OpenGL guy said:
You're wrong. ATI has had an optimizing compiler since the R300 launch. It has gone through some major revisions as well and will continue to do so.
Thanks for correcting that, I'd really never heard of it...

Anyway, trying to be objective, could it be possible that ATI's internal instruction set is very close to ps 2.0, so the optimizing compiler just corrects 'stupidities' of the programmer? For example it could reorder instructions to break dependencies, or detect when fewer registers can be used, or when a copy is not required? In that case, I can see the need for it, but it would still be unfair compared to the advanced optimizing compiler Nvidia had to write, which will never be truly optimal.
It does far more than that. As I said, we're working to improve it all the time. *poke Dio*
I'm a software rendering freak, so I'm not looking for any excuses, just explanations we could learn from.
Well I obviously can't go into details on how the compiler or HW work, can I?
 
Please correct me if I'm wrong but... just the fact that Nvidia had to write an optimizing compiler (and it does optimize things), and ATI did not, kind of makes me believe they totally picked ATI's side this time.

You're wrong. ATI has had an optimizing compiler since the R300 launch. It has gone through some major revisions as well and will continue to do so.


Still, I would like to see a performance comparison between the two measuring the increase in performance from no compiler to latest compiler. Somehow I have a feeling the GFFX cards would see a significantly higher % increase, but I could be wrong....
 
jpaana said:
The DX HLSL compiler runs circles around the Cg compiler in the performance of the generated code regardless of the target, as it simply does a much better job at generic optimizations.
Last I saw, the two compilers were pretty close.

But anyway, nVidia has publicly stated that they don't want to be in the compiler business. Of course, they've been forced into it. They've already shown that they're capable of building much better compilers than what Cg had with their new optimized compiler technology (that compiles the DX9 assembly to the hardware). I don't think Cg has been updated recently.

I do find it somewhat silly that nVidia would rather have an intermediate assembly language. Their hardware would be much better off without one (i.e. with a more complete intermediate language, or with no intermediate language).
 
Nick said:
Anyway, trying to be objective, could it be possible that ATI's internal instruction set is very close to ps 2.0, so the optimizing compiler just corrects 'stupidities' of the programmer?
That's part of it, but not, by far, the whole thing.

For example, the R3xx architecture is capable of executing one texture and one FP op per clock (well, sometimes more, depending on the operations... like it can, apparently, execute a separate multiply and add in the same clock, not just a MAD). So, an optimal shader would be written with alternating math and texture operations.

This is not true with the NV3x, where an optimal shader would pair texture instructions. This is one difference that is independent of instruction set (there are others).

I think the main reason that the R3xx performed better from the start is that it seems to be designed more simply. That is, it seems to have much more leeway in the specific ordering and whatnot of instructions. The NV3x architecture is much less forgiving, and, I would say, is much better left to compilers to try to optimize for.

Oh, I'd like to take this last line to interject a point: one person a while back noted that the NV3x architecture may use a RISC instruction set. It is much more likely that the instruction set is actually VLIW, with 128-bit instructions. I think that the Intel Itanium is a decent analogy to the NV3x architecture because while they are extremely different, both depend on a good compiler for optimal performance (The Intel Itanium also performed absolutely horribly in many early benchmarks).
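
As a rough illustration of the scheduling difference described above, here is a hypothetical C++ sketch of a greedy pass that reorders a toy instruction stream so texture and ALU ops alternate whenever the dependencies allow, which is what a co-issue machine in the R3xx mould would want to see. The `Op` representation and `interleave` function are invented for illustration and assume an acyclic dependency graph; they are not taken from any actual driver.

```cpp
#include <cstdio>
#include <vector>

// Toy instruction stream: each op is either a texture fetch or an ALU op and
// lists the indices of the ops whose results it reads.
enum class Kind { Tex, Alu };

struct Op {
    const char* text;
    Kind kind;
    std::vector<int> deps;   // indices of ops this op depends on
};

// Greedy list scheduler: alternate between texture and ALU ops whenever
// dependencies allow, so a machine that can issue one of each per clock
// stays busy. Assumes the dependency graph is acyclic.
std::vector<int> interleave(const std::vector<Op>& ops) {
    std::vector<bool> done(ops.size(), false);
    std::vector<int> order;
    Kind prefer = Kind::Tex;

    while (order.size() < ops.size()) {
        int pick = -1, fallback = -1;
        for (int i = 0; i < (int)ops.size(); ++i) {
            if (done[i]) continue;
            bool ready = true;
            for (int d : ops[i].deps) ready = ready && done[d];
            if (!ready) continue;
            if (ops[i].kind == prefer) { pick = i; break; }
            if (fallback < 0) fallback = i;
        }
        if (pick < 0) pick = fallback;  // nothing of the preferred kind is ready
        done[pick] = true;
        order.push_back(pick);
        prefer = (ops[pick].kind == Kind::Tex) ? Kind::Alu : Kind::Tex;
    }
    return order;
}

int main() {
    // A shader written "texture fetches first, math afterwards".
    std::vector<Op> ops = {
        {"texld r0, t0, s0",      Kind::Tex, {}},
        {"texld r1, t1, s1",      Kind::Tex, {}},
        {"mul   r0, r0, c0",      Kind::Alu, {0}},
        {"mad   r1, r1, c1, r0",  Kind::Alu, {1, 2}},
    };
    for (int i : interleave(ops)) std::printf("%s\n", ops[i].text);
    // Prints the stream reordered as tex, alu, tex, alu.
}
```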
 
I really don't think the Cg compiler is all that hot.
Trivial shader - Cg uses 6 instructions and 3 temps. Can be done with 4 instructions and 2 temps.

More complex shader - Cg uses 28 instructions and 5 temps. Can be done with 18 instructions and 2 temps. Constant count could also be reduced, from three to one - I just didn't bother.

I fail to recognize any quality when I see Cg compiled code :rolleyes:
 
zeckensack said:
I really don't think the Cg compiler is all that hot.
I didn't say it was. They also seem to have stopped working on the compiler. The last release was in July, about five months ago. nVidia has since focused on the compilers in their drivers.

Anyway, if you can test it, it would be interesting to see just how optimal nVidia's new optimized compiler is. That is, if you ran both programs you referenced on an NV3x (particularly in Direct3D), would there be much difference?
 
Wow, 5 pages on the weekly thread: how Nvidia is so nice to lower the standard to fit its interests, and how some zealots should sell that to the masses?

FP16 market support is an imaginary thing. You have card makers that can produce acceptable speed at the DX9 standard, and then there are the others.

Do we need to have this thread every week until the NV40 release? (That's if NV40 does not use FP16; otherwise I guess we will have it until NV50.)
 
Chalnoth said:
Last I saw, the two compilers were pretty close.

But anyway, nVidia has publicly stated that they don't want to be in the compiler business. Of course, they've been forced into it. They've already shown that they're capable of building much better compilers than what Cg had with their new optimized compiler technology (that compiles the DX9 assembly to the hardware). I don't think Cg has been updated recently.

Well the (still) current versions were compared in this thread with some benchmarks: http://www.beyond3d.com/forum/viewtopic.php?t=6864&highlight= and the results were not close by my standards. Would be interesting to see the benchmarks repeated with current drivers on both cards though.
 
If Uttar is to be believed, FP16 will be part of NV40. And what makes you sure that NV50 won't also use FP16?

Just because the fanATics don't like it doesn't mean it's not useful, or that it will go away...
 
Don't expect the speed differences you have seen between FP16 & FP32 in NV3x to be the norm.

IMO 90% of the problems the NV3x series has faced come down to the silicon (for whatever reason) not being able to efficiently split the FP32 registers into FP16 registers as intended. You can rest assured nVidia will have worked hard on resolving that.

I don't wish to peddle rumors, but there is some speculation NV40 will feature FP64 support. If this is true, it is an indication nVidia well and truly has their multi-precision registers working properly in NV40.
 
jpaana said:
Well the (still) current versions were compared in this thread with some benchmarks: http://www.beyond3d.com/forum/viewtopic.php?t=6864&highlight= and the results were not close by my standards. Would be interesting to see the benchmarks repeated with current drivers on both cards though.
It looked pretty close to me:
Tridam said:
It could be interesting to look at the performance of these codes :) The HLSL code uses 6 registers and the Cg code only 4. Your HLSL code has 17% fewer instructions.

...

GeForce FX 5600 HLSL : 11.2 MPix/s
GeForce FX 5600 Cg : 12.4 MPix/s

GeForce FX 5600 HLSL_pp : 14.8 MPix/s
GeForce FX 5600 Cg_pp : 13.8 MPix/s

GeForce FX 5600 HLSL AA/AF : 7.0 MPix/s
GeForce FX 5600 Cg AA/AF : 6.1 MPix/s
Anyway, that was one possibility for a pixel shader. There are others.

Regardless, I don't think Cg is going anywhere. I just hope Microsoft realizes (as they have done many times in the past) that OpenGL is doing things in a much better way.
 
radar1200gs said:
I don't wish to peddle rumors, but there is some speculation NV40 will feature FP64 support. If this is true, it is an indication nVidia well and truly has their multi-precision registers working properly in NV40.
Regarding that other thread, I think the consensus is that this likely isn't talking about pixel shader precision, but rather framebuffer precision (i.e. 64-bits per pixel, or 16-bit FP in a full-featured framebuffer), which would probably be more useful for the time being.
 