Explain To Me The Benefits of SM 3.0 For Nvidia

Recall

Newcomer
Hi, could someone explain the benefits of SM 3.0 for Nvidia to me? I want to know: apart from a few fps here and there, can Nvidia benefit in any other way? Could you tell me exactly what the GF6800 can do that the X800 cannot? I'm here to learn, so be as detailed as possible; I will be asking lots of questions. Hope you don't mind :oops: I have seen this link http://www.elitebastards.com/page.php?pageid=4136&head=1&comments=1 , but a lot of people tell me that the X800 can do pretty much all of that anyway, even the 9800XT can do it. So what is the point of SM 3.0, and if the X800 can do almost everything in SM 3.0, then why was it not made fully SM 3.0? I know the X800 can only do FP24, and I also know that FP32 in hardware (whether it uses it or not) is required for SM 3.0 (whether it uses pp_hints to lower the precision is fine, as long as it's capable of doing FP32). So please fill in the holes for me; I will be extremely grateful. :D

Oh, on a side note: 3Dc, how is that better than DXT5?
 
I'm here to learn, so be as detailed as possible
Good spirit!

To keep it short...

Well, to begin with, there's that thing called a Turing machine (see google.com for details), which basically means that given enough resources (disk space, time...) any computer (or graphics card, which for this purpose is the same thing) can emulate (= run the same programs as) any other computer (/ graphics card).

So yes, you can program anything to do anything, although it's simpler to do it in the (most resources) -> (least resources) direction than in the opposite one (which almost always sees the computer with the fewest resources losing time emulating the more complex functions of the other, because nasty little bits of extra technical work force it to depend on slower devices (system RAM, disks...) to virtualize its resources).

For gamers, I suspect that FP16 blending might end up being the most visible feature of the GF6, because it translates directly into "better/faster HDR", something one could put on a sticker on the box. (Hmm, I don't even know if that's in SM3, but it's a very cool feature IMHO.)

Enough said...

Hope this helped at least a bit... or two, because, you know, when you've got one bit, and given enough resources, you can emulate two. :D

Edit: Oh, and about the benefit for nVidia: gamers with more eye-candy in the eye are happier gamers, therefore happier customers, and that's how we make happily billed people and earn bucks for the employees and shareholders in this world. Happy? ;)
 
Remi said:
I'm here to learn, so be as detailed as possible
Good spirit!

To keep it short...

Well, to begin with, there's that thing called a Turing machine (see google.com for details), which basically means that given enough resources (disk space, time...) any computer (or graphics card, which for this purpose is the same thing) can emulate (= run the same programs as) any other computer (/ graphics card).

So yes, you can program anything to do anything, although it's simpler to do it in the (most resources) -> (least resources) direction than in the opposite one (which almost always sees the computer with the fewest resources losing time emulating the more complex functions of the other, because nasty little bits of extra technical work force it to depend on slower devices (system RAM, disks...) to virtualize its resources).

For gamers, I suspect that FP16 blending might end up being the most visible feature of the GF6, because it translates directly into "better/faster HDR", something one could put on a sticker on the box. (Hmm, I don't even know if that's in SM3, but it's a very cool feature IMHO.)

Enough said...

Hope this helped at least a bit... or two, because, you know, when you've got one bit, and given enough resources, you can emulate two. :D

Thanks :D OK, so what you're saying is that the X800, for example, could emulate pretty much all those features, but it would not necessarily be as efficient? Also, explain HDR to me, it's a bit of a grey area for me. A lot of people tell me that FP16 is a bad thing, and that ATI's ability to do FP24 is much better? Any ideas on how much of a hit ATI could take with FP32? Sorry for all the questions :oops:
 
Recall said:
Thanks :D OK, so what you're saying is that the X800, for example, could emulate pretty much all those features, but it would not necessarily be as efficient? Also, explain HDR to me, it's a bit of a grey area for me. A lot of people tell me that FP16 is a bad thing, and that ATI's ability to do FP24 is much better? Any ideas on how much of a hit ATI could take with FP32? Sorry for all the questions :oops:
FP16 is only bad in instances where it is forced against the developer's wishes or -- at the very least -- in instances where it will cause visual inaccuracies. In the early months of the NV3x series, Nvidia forced FP16 globally. That's probably where most of the "anti-FP16" sentiment comes from.
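
To make that concrete, here's a small hypothetical HLSL fragment (the sampler and inputs are made up) showing the developer-controlled way to use FP16: the half type compiles to the _pp modifier, so the driver may run just those operations at reduced precision instead of having FP16 forced on the whole shader.

sampler2D baseMap : register(s0);

float4 main(float2 uv       : TEXCOORD0,
            float3 lightDir : TEXCOORD1,
            float3 normal   : TEXCOORD2) : COLOR
{
    // Plain colour math rarely needs more than FP16, so 'half' is a
    // reasonable choice here; it becomes a _pp hint in the compiled asm.
    half3 albedo  = tex2D(baseMap, uv).rgb;
    half  diffuse = saturate(dot(normalize((half3)normal),
                                 normalize((half3)lightDir)));

    // Anything feeding texture coordinates or long dependent chains is
    // safer kept at full precision (float), which is the default.
    return float4(albedo * diffuse, 1);
}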

EDIT: I guess it's a bit off topic, though... having nothing to do with SM3.0, directly. *hopes no one replies with more precision arguments*
 
so what you're saying is that the X800, for example, could emulate pretty much all those features, but it would not necessarily be as efficient?
Absolutely.

Also, explain HDR to me, it's a bit of a grey area for me.
HDR is this. It makes bright lights and bright reflections look much more realistic.

A lot of people tell me that FP16 is a bad thing, and that ATI's ability to do FP24 is much better?
Well, I won't enter into that debate; there's far too much passion, and IMHO it generally seriously lacks reasoning, even (or maybe particularly) here. FP24 is a neat idea, I would love it for a console, but for the PC I prefer the FP16/FP32 combo. Let's say it's a question of taste, and let's all remain good friends!

beer.gif
Cheers!
 
Remi said:
so what you're saying is that the X800, for example, could emulate pretty much all those features, but it would not necessarily be as efficient?
Absolutely.

Also, explain HDR to me, it's a bit of a grey area for me.
HDR is this. It makes bright lights and bright reflections look much more realistic.

A lot of people tell me that FP16 is a bad thing, and that ATI's ability to do FP24 is much better?
Well, I won't enter into that debate; there's far too much passion, and IMHO it generally seriously lacks reasoning, even (or maybe particularly) here. FP24 is a neat idea, I would love it for a console, but for the PC I prefer the FP16/FP32 combo. Let's say it's a question of taste, and let's all remain good friends!

beer.gif
Cheers!

Remi, you're a star. :D So what exactly makes SM 3.0 more efficient? :oops:
 
Hey, thanks, but ya know you're not going to get anything cheaper from me like that! :D (just kidding!)

But my download's almost finished, I'll have to go and leave the connection...

So what exactly makes SM 3.0 more efficient?
Pfeeww! Right on time! What? Still 5 minutes to complete? Oh well...

I'm more into OpenGL these days, so I'd have a hard time answering that precisely; I'd better re-read the specs to refresh my memory. I can try to answer for the GF6's hardware capabilities, however, hoping they're all exposed in SM3.0 one way or another.

The most exciting stuff for a developer (edit: = what they'll be able to use for greater efficiency) is probably on the vertex shader side. My top picks would be frequency division (= one 3D model in memory, but many unique copies on screen) and true branching with MIMD (which means those vertices won't have to wait for each other before they can go light the pixels, as can sometimes happen). Of course, there's also the maximum program length, which is just waaaayyy better than with SM2, although it hasn't been too much of a limit (yet) for games.
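
For illustration, here's a rough sketch (purely hypothetical vertex layout and names) of what frequency division, i.e. geometry instancing, looks like from the vertex shader's side: the mesh lives once in stream 0, while a second stream, advanced once per instance via SetStreamSourceFreq, supplies each copy's world matrix.

float4x4 viewProj;

struct VSIn
{
    // Per-vertex data (stream 0): the single shared mesh.
    float4 pos    : POSITION;
    float3 normal : NORMAL;

    // Per-instance data (stream 1): one world matrix per copy,
    // packed as four rows.
    float4 worldRow0 : TEXCOORD4;
    float4 worldRow1 : TEXCOORD5;
    float4 worldRow2 : TEXCOORD6;
    float4 worldRow3 : TEXCOORD7;
};

struct VSOut
{
    float4 pos    : POSITION;
    float3 normal : TEXCOORD0;
};

VSOut main(VSIn v)
{
    // One copy of the model in memory, but every instance gets its own
    // transform, so each copy on screen can be unique.
    float4x4 world = float4x4(v.worldRow0, v.worldRow1,
                              v.worldRow2, v.worldRow3);
    VSOut o;
    o.pos    = mul(mul(v.pos, world), viewProj);
    o.normal = mul(v.normal, (float3x3)world);
    return o;
}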

And of course, there's FP blending... Yum... There are lots of ways to use that, but in short it also makes it easier to program things other than pure graphics rendering on the graphics card (edit: like physics, for instance).

Hope this helped! :)
 
Recall said:
Oh, on a side note: 3Dc, how is that better than DXT5?
3Dc has approximately 3 more bits of precision per pixel than DXT5 for one component of 2-component textures (i.e. the same precision for both components rather than one high precision and one low).
 
Recall said:
Remi, you're a star. :D So what exactly makes SM 3.0 more efficient? :oops:
The basic reason is flexibility. A programmer working with SM3 has more options than with SM2, which lets them implement a wider range of algorithms efficiently.

I'll look at two examples:
1. High Dynamic Range rendering:
Before SM3:
Since you cannot blend with the framebuffer, it becomes infeasible to read a previously calculated floating-point value back into the pixel shader. This means you need to do all HDR calculations in one pass. This limits you to either doing all lighting in one pass (which limits the types of lighting algorithms you can use, particularly on SM2 hardware) or sacrificing the quality of your HDR calculations. You also cannot have any transparent objects in the scene.

With SM3:
With framebuffer blending, you can now do multipass lighting of the scene, and can actually render transparent objects.

2. Dynamic branching:
Before SM3:
If you want the pixel shader output to depend upon an "if" statement in the pixel shader, you need to do one of a few things:
a) Execute all possible paths the pixel shader might take, and select the one you want at the end. The drawback to this is that you have to execute more instructions.
b) Bake the condition into a previous pass, and conditionally render each branch. The drawback is that you need to do multiple passes, one for each possible branch, and you may need to recalculate things each pass.

With SM3:
You can do the above, or you can just use dynamic branching. Here the hardware can throw out instructions it doesn't need to calculate, but there is a performance hit associated with branching.

So, essentially, SM3 gives you more options. Depending upon the algorithm, one way or another might be quicker. Because SM3 gives more options, it will either end up faster or the same speed as the same hardware running SM2. (A rough sketch of the branching case follows below.)
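
To make the branching example concrete, here's a rough ps_3_0 sketch (the lighting model and all names are made up): the "if" lets pixels outside the light's range skip the lighting math, where an SM2 shader would have to compute everything and select the result.

sampler2D diffuseMap : register(s0);

float3 lightPos;
float  lightRange;
float3 eyePos;

float4 main(float3 worldPos : TEXCOORD0,
            float2 uv       : TEXCOORD1,
            float3 normal   : TEXCOORD2) : COLOR
{
    float3 toLight = lightPos - worldPos;
    float  distSq  = dot(toLight, toLight);

    // Texture fetch kept outside the branch to avoid
    // gradient-in-flow-control issues.
    float3 albedo = tex2D(diffuseMap, uv).rgb;
    float3 color  = 0;

    // SM3 path: a real branch. Pixels outside the light volume skip the
    // lighting math entirely (subject to the hardware's branch
    // granularity and branching overhead).
    if (distSq < lightRange * lightRange)
    {
        float3 n = normalize(normal);
        float3 l = toLight * rsqrt(distSq);
        float3 h = normalize(l + normalize(eyePos - worldPos));
        color = albedo * saturate(dot(n, l))
              + pow(saturate(dot(n, h)), 32);
    }

    // The SM2 fallback (option a above) would compute all of this
    // unconditionally and then select, e.g.:
    //   color *= (distSq < lightRange * lightRange);
    return float4(color, 1);
}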
 
Okay, here's my take on SM2.0 vs SM3.0. I am a lowbie 3D graphics coder, working mainly on wireframes, but I do dabble in some of these shaders.

To the developer, this is where the SM2.0 vs SM3.0 differences will matter more. The reasoning is simple. SM2.0 gives a very basic set of instructions to work with. It's complete enough to produce just about any effect we can imagine for now. It might not be efficient, but if you are a genius you can create any effect you like through intelligent coding. SM3.0, on the other hand, introduces more instructions to make coding simpler. That is, any effect we want to generate becomes easier to code for. So developers should have an easier time writing the shader code to achieve what they want.

To the consumer, does SM2.0 vs SM3.0 affect you as much? Maybe not. The real effect on consumers will be: how good are the developers of the software you are buying? If they are really good, anything SM3.0 can achieve can be done through fallbacks or rewritten SM2.0 shader code. If the developers aren't good, then perhaps you only get the best experience with an SM3.0 card and miss out on some features on non-SM3.0-capable cards.
 
Recall said:
Oh, on a side note: 3Dc, how is that better than DXT5?

In two ways. It offers better quality at the same storage space, and it fits right into the same shader as if a regular texture had been used (no swizzles needed).

[Image: 3Dc vs DXT5 comparison (3DcDXT5compare.jpg)]
 
Humus, isn't that a bit disingenuous? 3Dc requires adding additional shader instructions to compute the Z coordinate, so it cannot "fit right into the same shader" any more easily than the DXT5 method. Thus, to support either DXT5 or 3Dc, you need to add extra instructions, and writing ".xyzw" vs ".wzyx" on one of the instructions is equally laborious.

It is true that there is a slight IQ improvement, but the comparison you should be making is low-res uncompressed normal maps vs DXT5/3Dc, and in that comparison (hi-res DXT5 vs uncompressed low-res), DXT5 is clearly a large improvement.

ATI seems to be trying to sell the idea that games should only support non-3Dc low-res maps and hi-res 3Dc maps, in an attempt to kill off one of their own ideas: DXT5 for normal map compression. But it doesn't wash, because the labor to support DXT5 is minimal, while the improvement is large. Both 3Dc and DXT5 should be supported.
 
I've personally found it a lot better to get everything working on your baseline as efficiently as possible first, and then just collapse passes or whatever else it is the higher-end stuff allows, to improve efficiency.

In the current discussion (with what I'm currently working on), that'd mean doing everything to the left of the tone-mapping pass with multiple PS2 passes rendered to FP render targets (with blending done via the pixel-shader/ping-pong method rather than fixed-function blending), and everything after the tone-mapping pass done with PS2 and integer render targets using fixed-function blending. Sure, it's a pain in the ass at first, but once you get everything done, it's done for good; and taking advantage of the more advanced hardware becomes completely obvious, requiring only minutes to hours rather than weeks to months (you can mostly just copy and paste the multipassed PS2 shaders into one big ps2.x/ps3 shader).
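
A minimal sketch of that pixel-shader/ping-pong blend, with hypothetical names: since PS2-class hardware can't run the fixed-function blender on FP16 targets, you keep two FP render targets, read the one written last pass, add this pass's contribution in the shader, and swap the two for the next pass.

sampler2D prevAccum : register(s0);   // the FP16 target written by the previous pass

float3 lightPos;
float3 lightColor;

// Stand-in for whatever a single lighting pass actually computes.
float4 LightContribution(float3 worldPos, float3 normal)
{
    float3 l = normalize(lightPos - worldPos);
    return float4(lightColor * saturate(dot(normalize(normal), l)), 0);
}

float4 main(float2 screenUV : TEXCOORD0,   // screen-space UV from the vertex shader
            float3 worldPos : TEXCOORD1,
            float3 normal   : TEXCOORD2) : COLOR
{
    float4 sum = tex2D(prevAccum, screenUV);           // what has been accumulated so far
    return sum + LightContribution(worldPos, normal);  // the "blend", done manually
    // The application then swaps the two FP targets before the next light.
}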

When I get around to it, I can effortlessly compact several passes into one ps2.a/ps2.b/ps3 pass, and on the 6800 I can just use fixed-function blending throughout (rather than just in the integer RT passes). Nice and easy.


But I know a lot of people who do things the opposite way, where they write their first-draft code on the most advanced system available to them and then try to break it up into many small passes and/or just drop functionality on the lower-end systems. In my experience, this leads to less optimal code (and, usually, reduced visual quality) on both the low *and* high end. There are just so many optimizations you miss out on when you have little to no constraint on the code you write, and you often get so set in the mindset of doing things one way on the advanced hardware that you outright assume things can't possibly be done on less advanced hardware and disable that functionality.



But this is mostly a matter of personal preference... generally speaking, the tighter the constraints, the tighter the code I write. YMMV.
 
DemoCoder said:
ATI seems to be trying to sell the idea that games should only support non-3Dc low-res maps and hi-res 3Dc maps, in an attempt to kill off one of their own ideas: DXT5 for normal map compression. But it doesn't wash, because the labor to support DXT5 is minimal, while the improvement is large. Both 3Dc and DXT5 should be supported.

I have yet to see ATI recommend anything of the sort... their devrel certainly openly suggests DXT5 on any cards that don't support 3Dc (which includes their own previous-generation cards).

Right now they're just championing that people support 3Dc... some (developers too) take this to mean "at the expense of DXT5", which I don't think is their message at all.
 
DemoCoder said:
Humus, isn't that a bit disingenuous? 3Dc requires adding additional shader instructions to compute the Z coordinate, so it cannot "fit right into the same shader" any more easily than the DXT5 method. Thus, to support either DXT5 or 3Dc, you need to add extra instructions, and writing ".xyzw" vs ".wzyx" on one of the instructions is equally laborious.

It is true that there is a slight IQ improvement, but the comparison you should be making is low-res uncompressed normal maps vs DXT5/3Dc, and in that comparison (hi-res DXT5 vs uncompressed low-res), DXT5 is clearly a large improvement.

ATI seems to be trying to sell the idea that games should only support non-3Dc low-res maps and hi-res 3Dc maps, in an attempt to kill off one of their own ideas: DXT5 for normal map compression. But it doesn't wash, because the labor to support DXT5 is minimal, while the improvement is large. Both 3Dc and DXT5 should be supported.

I'm not saying that DXT5 shouldn't be supported, just like I don't say that, for instance, ps1.1 or ps1.4 shouldn't be supported. You do it as well as you can given the hardware available. That doesn't mean 3Dc doesn't have benefits over DXT5, nor does it mean DXT5 isn't preferable to uncompressed bump maps in most cases.

When I say 3Dc fits right into the same shader, I mean that you can use the same shader for a regular RGB texture and for 3Dc and it will "just work". 3Dc expands to (x, y, 1, 1). So you have two options: when you compress to the 3Dc format, you can scale the components such that z = 1 and store x and y; this way it works with regular post-filtering normalization without any change to the shader code. Or you can simply store it the same way as regular normal maps and just compute the third component in the shader, rather than using post-filtering normalization. This is what I did in my 3Dc demo, so the same shader could be used for both 3Dc and RGB, while DXT5 had to use its own shader. Computing the third component and normalizing the normal are equally costly in terms of shader instructions; the only difference is that the first DP3 instruction of the normalization is changed to a DP2ADD.
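
For illustration, here's roughly what those two options look like in HLSL (a sketch with assumed sampler names, not the actual demo code):

sampler2D normalMap : register(s0);

// Option 2: store x and y as usual and rebuild z in the shader.
// This also works for an RGB map (the stored z is simply ignored),
// while a DXT5-packed map would need a different swizzle (.ag).
// Cost is about the same as a normalize(): in asm terms, roughly
// DP2ADD/RSQ/MUL instead of DP3/RSQ/MUL.
float3 FetchNormalComputeZ(float2 uv)
{
    float2 xy = tex2D(normalMap, uv).rg * 2 - 1;
    float  z  = sqrt(saturate(1 - dot(xy, xy)));   // z = sqrt(1 - x^2 - y^2)
    return float3(xy, z);
}

// Option 1: scale the map so z = 1 at compression time. If the missing
// 3Dc channels expand to 1 as described above, the completely standard
// post-filtering normalization works without any shader change.
float3 FetchNormalRenormalize(float2 uv)
{
    return normalize(tex2D(normalMap, uv).xyz * 2 - 1);
}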
 
Chalnoth said:
2. Dynamic branching:
Before SM3:
If you want the pixel shader output to depend upon an "if" statement in the pixel shader, you need to do one of a few things:
a) Execute all possible paths the pixel shader might take, and select the one you want at the end. The drawback to this is that you have to execute more instructions.
b) Bake the condition into a previous pass, and conditionally render each branch. The drawback is that you need to do multiple passes, one for each possible branch, and you may need to recalculate things each pass.

With SM3:
You can do the above, or you can just use dynamic branching. Here the hardware can throw out instructions it doesn't need to calculate, but there is a performance hit associated with branching.

This is another good example of what I was just talking about (where more flexibility can lead to less optimal code): Humus's little stencil-based dynamic branching demo (once corrected for NV cards) was significantly faster than both the ps2.x path (execute all paths, multiplying everything you don't want by 0) and the ps3 path (dynamic branching in the pixel shader). A quick, top-of-the-pipe stencil test is (and probably will be for some time) far more efficient than using the dynamic branching instruction set.

Now, there's likely a point where the number of additional required passes might make ps3 branching faster than stencil-based dynamic branching, but I expect that, for the vast majority of cases in any games coming out in the next 2-3 years, the stencil-based approach (where possible) will be the performance-preferred method for both PS2 and PS3 cards.

But, yeah, it's a bitch and a half to have to do things that way, so I don't really expect people to do it; instead they'll just do 'evaluate all paths' on ps2.x cards and dynamic branching on ps3. Effectively a loss for both IHVs (though one more than the other) and for consumers. And that's exactly why I force myself to get everything working as optimally as possible on the LCD first :)
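
For reference, a rough two-pass sketch of that stencil-based approach (all names are hypothetical, and the stencil render states live on the CPU side, so they're only described in comments):

float3 lightPos;
float  lightRange;

// --- Pass 1: evaluate only the condition. ---
// Assumed render states: colour writes off, stencil enabled,
// ref = 1, pass op = REPLACE. clip() kills the pixels where the
// condition is false, so only the "true" pixels end up with stencil = 1.
float4 MarkPass(float3 worldPos : TEXCOORD0) : COLOR
{
    float3 toLight = lightPos - worldPos;
    clip(lightRange * lightRange - dot(toLight, toLight));
    return 0;
}

// --- Pass 2: the expensive shader. ---
// Assumed render states: stencil func = EQUAL, ref = 1.
// Failing pixels are rejected at the top of the pipe, before the shader
// ever runs, which is why this can beat ps_3_0 dynamic branching.
float4 ExpensivePass(float3 worldPos : TEXCOORD0,
                     float2 uv       : TEXCOORD1) : COLOR
{
    // ... full lighting would go here ...
    return float4(1, 1, 1, 1);   // placeholder body
}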
 
Ilfirin said:
This is another good example of what I was just talking about (where more flexibility can lead to less optimal code): Humus's little stencil-based dynamic branching demo (once corrected for NV cards) was significantly faster than both the ps2.x path (execute all paths, multiplying everything you don't want by 0) and the ps3 path (dynamic branching in the pixel shader). A quick, top-of-the-pipe stencil test is (and probably will be for some time) far more efficient than using the dynamic branching instruction set.
But this isn't necessarily going to be the case in a real game. Beyond the multipass performance hit you mentioned, there may be additional performance improvements nVidia can get out of future driver releases when the shaders are longer. It seemed to me that Humus' shader in that demo was relatively short.
 