SM 3.0, yet again.

Frank

Time for a new thread on this, given the ongoing confusion. So, is there a use for SM 3.0 yet that can't be done in SM 2.0b and that is actually being used?
 
Are we discussing the MS SM3 feature set, or nV's superset with FP blending and PCF?
 
There's vertex texturing, but it doesn't look like anyone is too keen on using it in the near term. Not sure why, 'cuz there are plenty of neat things you can do with it.
 
Mintmaster said:
There's vertex texturing, but it doesn't look like anyone is too keen on using it in the near term. Not sure why, 'cuz there are plenty of neat things you can do with it.
Because NVIDIA's NV4x implementation of it is slower than a dead tortoise? As there's no other card currently available that supports it, there is little incentive to use vertex texturing.

Of course with new hardware coming along, expect to see it used more and more.
 
DiGuru said:
Time for a new thread on this, given the ongoing confusion. So, is there a use for SM 3.0 yet that can't be done in SM 2.0b and that is actually being used?
I don't think it's a matter of the effects that can be created, but of how one goes about rendering those effects. SM3.0 allows certain things to be done much more easily and in some cases potentially faster. HDR and other effects requiring the interaction of the results of multiple floating point render passes are good examples. Then there's the most basic example: handling a variable number of lights in a single render pass. They can all be done in SM2.0(b), but the equivalent implementation in SM3.0 is much more elegant.
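To illustrate that last example, here's a minimal ps_3_0 sketch in HLSL (the constant names and the 4-light cap are my own invention, not from any shipping engine). One wrinkle: ps_3_0 can't dynamically index constant registers, so the loop is unrolled over a fixed maximum and each light is guarded by a dynamic branch on the runtime count:

    // ps_3_0: one pass, light count varies at runtime.
    static const int MAX_LIGHTS = 4;
    float  numLights;                 // hypothetical per-frame constant
    float3 lightPos[MAX_LIGHTS];
    float3 lightColor[MAX_LIGHTS];

    float4 main(float3 worldPos : TEXCOORD0,
                float3 normal   : TEXCOORD1) : COLOR
    {
        float3 result = 0;
        normal = normalize(normal);

        [unroll]
        for (int i = 0; i < MAX_LIGHTS; i++)
        {
            [branch]
            if (i < numLights)        // unused lights are skipped entirely
            {
                float3 toLight = normalize(lightPos[i] - worldPos);
                result += lightColor[i] * saturate(dot(normal, toLight));
            }
        }
        return float4(result, 1);
    }

In SM2.0(b) there's no branch to skip the unused lights, so you end up compiling one shader variant per light count instead.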
 
DiGuru said:
Time for a new thread on this, given the ongoing confusion. So, is there a use for SM 3.0 yet that can't be done in SM 2.0b and that is actually being used?
So far, FarCry makes use of the extra interpolator registers to support one more light source per pass in SM3 than it does with SM 2.0b. The performance difference from this will clearly vary depending on where the limits are in the scene.

FarCry also makes use of floating point blending for high dynamic range rendering, which is GF6-only, but not strictly PS/VS 3.0.

The Chronicles of Riddick uses nVidia's GF6 OpenGL extensions for soft shadowing, but I'm not sure exactly what they're doing.
 
Complex shaders that would run insanely slowly on non-branching HW, such as relief mapping and heavily blurred shadow mapping.
 
DeanoC said:
Mintmaster said:
There's vertex texturing, but it doesn't look like anyone is too keen on using it in the near term. Not sure why, 'cuz there are plenty of neat things you can do with it.
Because NVIDIA's NV4x implementation of it is slower than a dead tortoise? As there's no other card currently available that supports it, there is little incentive to use vertex texturing.

Of course with new hardware coming along, expect to see it used more and more.

What would you speculate could be the real problem behind it? Doesn't vertex texturing come with a sizeable amount of latency anyway?
 
Ailuros said:
Doesn't vertex texturing come with a sizeable amount of latency anyway?
Probably they 'slapped' vertex texturing support into their vertex shader engines without significantly addressing the latency problem. Vertex shaders without vertex textures don't have to hide huge latencies.
 
I think the problem on the NV40 is that you have to think like a RISC compiler when writing your shader and schedule your "loads", because unlike the pixel pipeline, the chip doesn't do it for you. If you think you're going to do a bunch of sampling, you need to overlap those samples with lots of raw shader ops.
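For instance, a vs_3_0 displacement-mapping shader along these lines (a sketch; the setup and names are my own) issues the fetch first and consumes the result last, with unrelated ALU work in between:

    // vs_3_0: displacement mapping via vertex texturing.
    sampler2D heightMap;     // assumed FP32 height texture
    float4x4 worldViewProj;
    float4x4 world;

    void main(float4 pos         : POSITION,
              float3 normal      : NORMAL,
              float2 uv          : TEXCOORD0,
              out float4 oPos    : POSITION,
              out float3 oNormal : TEXCOORD0)
    {
        // Issue the "load" as early as possible. tex2Dlod is required,
        // since the vertex pipe has no derivatives to pick a mip level.
        float4 height = tex2Dlod(heightMap, float4(uv, 0, 0));

        // Independent ALU work that can overlap with the fetch latency.
        oNormal = normalize(mul(normal, (float3x3)world));

        // Only now consume the fetched value.
        pos.xyz += normal * height.x;
        oPos = mul(pos, worldViewProj);
    }

Whether the compiler preserves that ordering is another question, but that's the shape of the scheduling problem at the assembly level.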
 
Probably they 'slapped' vertex texturing support into their vertex shader engines without significantly addressing the latency problem. Vertex shaders without vertex textures don't have to hide huge latencies.

I was under the impression that, in order not to stall the entire rendering process, vertex texturing lets one take advantage of said latency and get a very high number of instructions (not related to the texture fetch) nearly for free, and that this holds regardless of architecture or approach.

Take PowerVR's Cloth demo; it might be limited to a 64*64 grid, but the number of instructions used is extremely high. The NV40 runs said cloth simulation at 85-100 fps (depending on the occasion).

If you think you're going to do a bunch of sampling, you need to overlap those samples with lots of raw shader ops.

I thought that would be a presupposition for VS texturing anyway. Anyone care to shed a bit more light on it, because I'm obviously confused?
 
Ailuros said:
I was under the impression that, in order not to stall the entire rendering process, vertex texturing lets one take advantage of said latency and get a very high number of instructions (not related to the texture fetch) nearly for free, and that this holds regardless of architecture or approach.
Well.. this can be seen as a positive side effect, but if you have a shader with vertex texturing that is short relative to the texture sampling latency, you're still going to be very slow.
That's why NVIDIA is not advocating vertex texturing for stuff like skinning (storing bone matrices in a texture..).
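To make the skinning case concrete, here's a hypothetical vs_3_0 sketch of that idea (the three-texels-per-bone layout and names are my own convention), showing why it's fetch-heavy and ALU-light:

    // vs_3_0: skinning with bone matrices stored in an FP32 texture.
    sampler2D boneTex;
    float texWidth;      // width of boneTex, set by the application
    float4x4 viewProj;

    float4 main(float4 pos       : POSITION,
                float  boneIndex : BLENDINDICES) : POSITION
    {
        // Three texels per bone, one per matrix row.
        float u  = (boneIndex * 3 + 0.5) / texWidth;
        float du = 1.0 / texWidth;
        float4 row0 = tex2Dlod(boneTex, float4(u,          0.5, 0, 0));
        float4 row1 = tex2Dlod(boneTex, float4(u + du,     0.5, 0, 0));
        float4 row2 = tex2Dlod(boneTex, float4(u + 2 * du, 0.5, 0, 0));

        // A handful of dot products is nowhere near enough ALU work
        // to cover three texture fetches' worth of latency.
        float4 skinned = float4(dot(row0, pos),
                                dot(row1, pos),
                                dot(row2, pos), 1);
        return mul(skinned, viewProj);
    }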
 
nAo said:
Ailuros said:
I was under the impression that, in order not to stall the entire rendering process, vertex texturing lets one take advantage of said latency and get a very high number of instructions (not related to the texture fetch) nearly for free, and that this holds regardless of architecture or approach.
Well.. this can be seen as a positive side effect, but if you have a shader with vertex texturing that is short relative to the texture sampling latency, you're still going to be very slow.
That's why NVIDIA is not advocating vertex texturing for stuff like skinning (storing bone matrices in a texture..).

Assuming a simplistic VS (even without vertex texturing) and a very complex PS, wouldn't a stall already be possible while waiting for the PS to complete?

Sweeney on UE:

Our vertex shaders are quite simple nowadays, and just perform skeletal blending and linear interpolant setup on behalf of the pixel shaders. All of the heavy lifting is now on the pixel shader side -- all lighting is per-pixel, all shadowing is per-pixel, and all material effects are per-pixel.

Once you have the hardware power to do everything per-pixel, it becomes undesirable to implement rendering or lighting effects at the vertex level; such effects are tessellation-dependent and difficult to integrate seamlessly with pixel effects.

http://www.beyond3d.com/interviews/sweeneyue3/

Would additional logic actually help in the end or will we see better results with future hardware and future APIs?
 
991060 said:
Complex shaders that would run insanely slowly on non-branching HW, such as relief mapping and heavily blurred shadow mapping.

Would you qualify the 6x00 as branching HW, even if using those branches would make it slower overall than using the SM 2.0 solutions, like unrolling?
 
Ailuros said:
Assuming a simplistic VS (even without vertex texturing) and a very complex PS, wouldn't a stall already be possible while waiting for the PS to complete?
A stall means the VS is doing nothing, and that is bad. Moreover, the VS and PS are decoupled, so the VS doesn't stall waiting for the PS to complete, unless the buffers that sit between the VS and PS are full of to-be-rasterized primitives.
Would additional logic actually help in the end or will we see better results with future hardware and future APIs?
The problem with the current VT implementation can be solved (as it has already been solved, or at least alleviated, in the PS!) by spending more transistors ;)
 
DiGuru said:
991060 said:
Complex shaders that would run insanely slowly on non-branching HW, such as relief mapping and heavily blurred shadow mapping.

Would you qualify the 6x00 as branching HW, even if using those branches would make it slower overall than using the SM 2.0 solutions, like unrolling?

That depends on the situation and on how good a shader programmer you are. ;)

The NV40 is the first generation of HW with true branching ability, which everyone admits comes at a cost. You'll need to be very careful about when and how to use the dynamic branching instructions. nVIDIA has a demo showing how to do that correctly: http://download.developer.nvidia.co..._Samples/DEMOS/OpenGL/simple_soft_shadows.zip

If you have an NV4x, you'll see that the SM3.0 path gives a 200% performance boost compared with the SM2.0 path.

BTW, there's no true equivalent of dynamic branching in SM2.0 (maybe you could call Humus's early-stencil-out trick an exception). Unrolling is just a way to avoid loops; it has nothing to do with runtime instruction jumps.
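Roughly, the pattern in that demo looks like this (my reconstruction in HLSL, not the demo's actual source, which is OpenGL): a cheap test decides most pixels, and only the ones that pass it pay for the big filter kernel:

    // ps_3_0: skip a 16-tap shadow filter for pixels facing away
    // from the light. Pays off when neighboring pixels take the
    // same path.
    sampler2D shadowMap;
    float2 texel;    // 1 / shadow-map size, set by the application

    float4 main(float3 N  : TEXCOORD0,
                float3 L  : TEXCOORD1,
                float4 sc : TEXCOORD2) : COLOR  // projective shadow coords
    {
        float ndotl  = dot(normalize(N), normalize(L));
        float shadow = 0;

        [branch]
        if (ndotl > 0)   // dynamic branch: back-facing pixels skip all 16 taps
        {
            float2 uv = sc.xy / sc.w;
            float  z  = sc.z / sc.w;
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                {
                    float2 offset = (float2(i, j) - 1.5) * texel;
                    float  d = tex2Dlod(shadowMap, float4(uv + offset, 0, 0)).x;
                    shadow += (d < z) ? 0.0 : 1.0;
                }
            shadow /= 16;
        }
        return saturate(ndotl) * shadow;
    }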
 
DiGuru said:
Would you qualify the 6x00 as branching HW, even if using those branches would make it slower overall than using the SM 2.0 solutions, like unrolling?
Unrolling can only be used for static branching. And dynamic branching is certainly not universally slower than the SM 2.0 solutions, either.

With SM 2.0 you have two options:
1. Execute all branches and use compare.
2. Multipass.

In the first situation, the NV4x will be faster for things like conditional loops, or anywhere you end up skipping a large number of instructions. In the second situation, the NV4x will be faster whenever you are geometry-limited (and you are more likely to become geometry-limited when doing multipass rendering, as the pixel shaders become shorter).
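For comparison, option 1 in ps_2_0 looks something like this (a minimal sketch with made-up inputs): both sides always execute and the result is merely selected afterwards:

    // ps_2_0: no dynamic branching, so compute both paths and select.
    sampler2D baseTex;

    float4 main(float2 uv   : TEXCOORD0,
                float  mask : TEXCOORD1) : COLOR
    {
        float4 cheap  = tex2D(baseTex, uv);            // path A
        float4 pricey = cheap * cheap * 0.5 + 0.25;    // path B, also always run

        // Per-pixel select; no instructions were actually skipped.
        return lerp(cheap, pricey, step(0.5, mask));
    }

A true dynamic branch on NV4x can skip the untaken side, which is where the win in the conditional-loop case comes from.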
 
Chalnoth said:
DiGuru said:
Would you qualify the 6x00 as branching HW, even if using those branches would make it slower overall than using the SM 2.0 solutions, like unrolling?
Unrolling can only be used for static branching. And dynamic branching is certainly not universally slower than the SM 2.0 solutions, either.

With SM 2.0 you have two options:
1. Execute all branches and use compare.
2. Multipass.

In the first situation, the NV4x will be faster for things like conditional loops, or anywhere you end up skipping a large number of instructions. In the second situation, the NV4x will be faster whenever you are geometry-limited (and you are more likely to become geometry-limited when doing multipass rendering, as the pixel shaders become shorter).

Well, the last time we discussed this, we came to the conclusion that the dynamic branching of the 6x00 is essentially your first method, with batches of about 1,000 pixels each. The instructions are only skipped if all pixels in such a batch take the same path and this can be determined by the driver. Which doesn't sound very dynamic to me.
 
DiGuru said:
Well, the last time we discussed this, we came to the conclusion that the dynamic branching of the 6x00 is essentially your first method, with batches of about 1,000 pixels each. The instructions are only skipped if all pixels in such a batch take the same path and this can be determined by the driver. Which doesn't sound very dynamic to me.
The driver couldn't make such a fine-grained decision in the shader pipeline. The hardware decides whether to run one branch or both.
 