Pixel Shader 1.4 advantage (or not)?

LeStoffer

Veteran
Okay, I’m going to stick my neck out here, so don’t chop my head right off, okay?

I listened to the whole Carmack/Doom III interview again, and he was talking about the ATI 8500's Pixel Shader 1.4 ability to apply a number of textures in one pass, as opposed to the GF3 (6 vs. 4 texture inputs max). So it got me wondering: what kind of performance increase could you see in real life from this advantage?

One thing is to have the ability – another is to have a pipe/cache design that will make the best of it... Since memory bandwidth is still a major concern, I guess such a chip would need a larger texture cache to really take advantage of this feature (so it doesn't have to fetch texture layer number 5 or 6 from main DDR memory and slow things down).

ATI makes this statement:
DirectX® 8.1 pixel shaders allow up to six textures to be sampled and blended in a single rendering pass. This means effects that required multiple rendering passes in earlier versions of DirectX® can now be processed in fewer passes, and effects that were previously too slow to be useful can become more practical to implement.

It all sounds awesome (doh!), but as we know, one has to look out for the weakest link in the pipeline. So the question is: is the number of textures that can be blended in one pass really the main bottleneck in a modern graphics card?

(And no: I was not paid by nVidia to ask this question!)

Regards, LeStoffer
 
I'm not a coder, but ATI has a pretty in-depth explanation of the differences 1.4 brings.

DirectX® 8.0 Pixel Shaders

Max. Texture Inputs - 4
Max. Program Length - 12 instructions (up to 4 texture sampling, 8 color blending)
Instruction Set - 13 address operations, 8 color operations
Texture Addressing Modes - 40

SMARTSHADER Pixel Shaders (DirectX® 8.1)

Max. Texture Inputs - 6
Max. Program Length - 22 instructions (up to 6 texture sampling, 8 texture addressing, 8 color blending)
Instruction Set - 12 address / color operations
Texture Addressing Modes - Virtually unlimited

http://www.ati.com/na/pages/technology/hardware/smartshader/smartshader_white_paper.html

 
On 2002-02-10 00:39, Doomtrooper wrote:
I'm not a coder, but ATI has a pretty in-depth explanation of the differences 1.4 brings.

Thanks, but, ahmmm, I've been there, read that...

So I stand by my question: does anybody really know whether ATI's PS1.4 would make a big difference over NV's PS1.1 (or PS1.3) if both were optimized by, say, John Carmack?

Regards, LeStoffer
 
So I stand by my question: does anybody really know whether ATI's PS1.4 would make a big difference over NV's PS1.1 (or PS1.3) if both were optimized by, say, John Carmack?

Not for certain, except Carmack himself. We can guess, though. Let's do some approximate math. We can assume 32-bit RGBA, 32-bit Z (compressed to 3 bytes per pixel), 32-bit textures, and a half-decent texture cache that needs to read 1.5 texels from memory per texture per pixel. To give PS1.4 maximum benefit of doubt, we can assume 6 textures total per pixel. We then get the following memory accesses for PS 1.1:
  • 6 textures = 6*1.5*4 = 36 bytes
  • First pass: 1 RGBA write + 1 Z read + 1 Z write = 10 bytes
  • Second pass: 1 RGBA read + 1 Z read + 1 RGBA write = 11 bytes

For PS1.4, we obviously avoid the framebuffer accesses for the 2nd pass. This gives us a total of 57 bytes accessed per pixel for PS1.1 and 46 bytes for PS1.4. This would give PS1.4 a ~24% speed advantage, assuming that memory bandwidth is the bottleneck. Note that for both PS1.1 and PS1.4, the memory bandwidth usage is dominated by texture reads.
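
For anyone who wants to twiddle the assumptions, here is the same accounting as a small Python sketch (the constants and names are mine, chosen only to match the assumptions stated above):

```python
# Back-of-the-envelope memory traffic per pixel for a 6-texture effect.
TEXEL_BYTES = 4    # 32-bit RGBA textures
FETCH_RATIO = 1.5  # texels fetched from memory per texture per pixel
COLOR_BYTES = 4    # 32-bit RGBA framebuffer
Z_BYTES = 3        # 32-bit Z, compressed to ~3 bytes per pixel

texture_traffic = 6 * FETCH_RATIO * TEXEL_BYTES  # 36 bytes in both cases

# PS1.1: 4 texture inputs max, so 6 textures need two passes.
ps11 = (texture_traffic
        + COLOR_BYTES + 2 * Z_BYTES    # pass 1: color write, Z read + write
        + 2 * COLOR_BYTES + Z_BYTES)   # pass 2: color read + write, Z read
# PS1.4: all 6 textures fit in one pass.
ps14 = texture_traffic + COLOR_BYTES + 2 * Z_BYTES

print(ps11, ps14, ps11 / ps14)  # 57 46 ~1.24
```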

Adding further capability to do multiple textures per pass gives diminishing returns as far as performance is concerned, while requiring additional texture cache to be effective, so the optimal number of textures per pass to support is finite.
 
While collapsing passes can certainly improve performance, there is another possibility that may be overlooked in discussions like this: you can simply do things a different way using the flexibility of the shaders. In the Radeon 8500's case, the arbitrary dependent texture reads can certainly be used to cut out some heavy calculations and do them with a texture as a lookup table. Say, going from a 5-texture pass to maybe 2 textures plus a dependent texture read (just an example, though, based on nothing).
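
As a toy illustration of the lookup-table idea (numpy standing in for the texture units; the function baked into the table is an arbitrary example, not anything specific to the 8500):

```python
import numpy as np

# Bake an expensive per-pixel function (here, a specular power curve)
# into a 1D "texture" ahead of time.
LUT_SIZE = 256
lut = np.power(np.linspace(0.0, 1.0, LUT_SIZE), 32.0)

def dependent_read(first_stage_result):
    """Use the result of an earlier stage as the coordinate for a
    second texture read - the dependent-read trick."""
    idx = np.clip((first_stage_result * (LUT_SIZE - 1)).astype(int),
                  0, LUT_SIZE - 1)
    return lut[idx]

# e.g. first_stage_result could be N dot H per pixel, in [0, 1]
n_dot_h = np.random.rand(4, 4)
specular = dependent_read(n_dot_h)  # pow() replaced by a table lookup
```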
 
One simple example I can think of where a higher number of textures per pass may pay off better than just collapsing passes would be this: performing DOT3 with one normal map and multiple light maps. Let Ln be a light source vector for light source n that is interpolated across the polygon as if it were a Gouraud color, Mn be the result of a lightmap or shadow map lookup, and N be the normal map. Simple per-pixel lighting with n light sources may then be performed as follows:

Color = ( (L0 dot3 N)*M0 + (L1 dot3 N)*M1 + (L2 dot3 N)*M2 + ....) * BaseColor

Now, to do this calculation, we obviously need to be able to access at least 2 textures per pass, N and Mn. It may look from the formula as if we need to load N over and over again, but in a reasonably well-written pixel shader program we can just load N once, store it to a pixel shader register and then keep re-using that register until we run out of texture units, thus avoiding some texture reads.
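
To pin down the math, here is the same accumulation in a few lines of numpy (a sketch of the arithmetic only, not of a real pixel shader; all names are made up):

```python
import numpy as np

def dot3(a, b):
    # Per-pixel dot product over the last (xyz) axis.
    return np.sum(a * b, axis=-1, keepdims=True)

def shade(N, lights, base_color):
    """N: normal-map sample per pixel (H, W, 3), loaded once and reused.
    lights: list of (L, M) pairs - interpolated light vector and
    light-/shadow-map sample for each light, both per pixel."""
    acc = np.zeros(N.shape[:-1] + (1,), dtype=np.float32)
    for L, M in lights:
        acc = acc + dot3(L, N) * M   # (Ln dot3 N) * Mn
    return acc * base_color          # single final BaseColor multiply

# Two hypothetical lights over a tiny 4x4 "surface":
N = np.random.rand(4, 4, 3)
lights = [(np.random.rand(4, 4, 3), np.random.rand(4, 4, 1)) for _ in range(2)]
out = shade(N, lights, base_color=np.random.rand(4, 4, 3))
```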

Also, note that if we are forced into doing multiple passes because of lack of texture units, the final BaseColor multiplication must be given one full pass all of its own, adding to the proliferation of passes.

Also, a comment to my previous post: If e.g. multisampling or compressed textures are being used, the numbers (bytes per pixel) given will change, mostly giving PS1.4 some additional benefit.
 
If you are using lightmaps, why would you ever need more than one? If the lightmap is calculated using, say, radiosity, and takes into account all the lights in the scene, what would the other lightmaps be used for, and how would they be calculated? Also, if the lightmaps are monochromatic, you can store 2-3 per texture (alpha, and depending on swizzling/masking, the blue channel, plus another).


Probably the more useful thing is applying more complicated models like BRDFs, or going wild on masking (gloss map, spec map, environment map, dirt map, glow map, detail texture, etc.).
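
A minimal numpy sketch of the packing trick mentioned above (the channel assignment is arbitrary):

```python
import numpy as np

def pack_lightmaps(m0, m1, m2):
    """Pack three monochrome lightmaps (each H x W) into the R, G, B
    channels of one texture; a shader can then pick one out with
    channel masking/swizzling instead of binding three textures."""
    return np.stack([m0, m1, m2], axis=-1)

packed = pack_lightmaps(np.ones((64, 64)),
                        np.zeros((64, 64)),
                        np.full((64, 64), 0.5))
m1 = packed[..., 1]  # "sample" lightmap 1 via the green channel
```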
 
Thanks for the input, arjan de lumens and others...

This is exactly what makes beyond3d great. I also want to extend my thanks to those brainy folks among you who stay here despite all the random rambling people like me “pollute” the forum with. ;)

Regards, LeStoffer
 
IIRC (PcChen can explain it much better), the approximate equation for a single pass is:

BPP = ( TSP x TLP x TCE ) + 2 x FB + ZB

Where:
BPP - Bytes per pixel per pass
TSP - Texture Samples per Pixel (varies with the different filters; bilinear = 4)
TLP - Texture Levels per Pixel
TCE - Texture Cache Efficiency
FB - Frame Buffer (usually 4 bytes)
ZB - Z-Buffer (usually 4 bytes)

The total bandwidth will be:

bandwidth = ( BPP x passes x resolution x overdraw x fps ) / MSE

Where MSE is the Memory Subsystem Efficiency and depends on the kind of memory, latency, statistics of page opens and closes, etc...

Let's say that we are playing Doom3 at 1024x768x32, bilinear, overdraw 3, using a GF2 Pro with 50% texture cache efficiency and 60% MSE. Then for a single pass:

BPP = 4 x 2 x 0.5 + 2 x 4 + 4 = 16 bytes

bandwidth = ( 16 x 5 x 1024x768 x 3 x fps ) / 0.6 = 6.4 GB/s =>

fps = 20 (very slow) :eek:

Now let's do the same with a GF3 Ti 200:

BPP = 4 x 4 x 0.5 + 2 x 4 + 4 = 20 bytes

bandwidth = ( 20 x 2 x 1024x768 x 3 x fps ) / 0.6 = 6.4 GB/s =>

fps = 40 (good, twice the fps of the GF2 Pro) :smile:

Let's do the same for a Radeon 8500, but with 200MHz DDR memory:

BPP = 4 x 6 x 0.5 + 2 x 4 + 4 = 24 bytes

bandwidth = ( 24 x 1 x 1024x768 x 3 x fps ) / 0.6 = 6.4 GB/s =>

fps = 67 (much better now)
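
The same three cases run through the formula, as a small Python sketch (formula and inputs as above; the helper function is made up, and 6.4 GB/s is used for all three cards as in the examples):

```python
RESOLUTION = 1024 * 768
OVERDRAW = 3
MSE = 0.6           # memory subsystem efficiency
BANDWIDTH = 6.4e9   # bytes/s, e.g. 200MHz DDR on a 128-bit bus

def fps(bpp, passes):
    # bandwidth = ( BPP x passes x resolution x overdraw x fps ) / MSE
    bytes_per_frame = bpp * passes * RESOLUTION * OVERDRAW / MSE
    return BANDWIDTH / bytes_per_frame

print(fps(16, 5))  # GF2 Pro: 2 texture units, 5 passes -> ~20.3
print(fps(20, 2))  # GF3 Ti 200: 4 units, 2 passes      -> ~40.7
print(fps(24, 1))  # Radeon 8500: 6 units, 1 pass       -> ~67.8
```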

OK, but there are some uncontrolled variables:
- The texture cache efficiency
- The memory subsystem efficiency
- The filter implementation
- Overdraw

Also, I am not sure if GFs will require a new ZB check in the second pass and beyond.

Well, this is all theory :smile:

 
I think an efficient cache wouldn't load a texel twice within a single polygon. So I suggest calculating the texel/pixel ratio (TPR) instead of TSP x TCE.

With bilinear and mipmapping the highest TPR should be 1, because the GPU switches to the next mipmap level when it goes above that. At the switching point it drops to 0.25, because each mipmap level halves the resolution in both dimensions. The average should be 0.5. (The corresponding average for trilinear is 0.625.)

When a surface is viewed at low angles the TPR should drop, because the mipmap selection chooses "too" low-resolution mipmaps. With anisotropic filtering, on the other hand, TPR can go as high as the maximum ratio of the filter. So with 8x aniso you would peak at TPR = 8, which should occur at 7.18° (assuming no distortion in the mapping). Below that angle TPR would oscillate between 8 and 2.
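
As a sanity check on that angle: for an undistorted mapping the anisotropy ratio at viewing angle a is roughly 1/sin(a), so an 8x filter hits its cap at asin(1/8). A quick sketch under that assumption:

```python
import math

def cap_angle_degrees(max_aniso):
    """Viewing angle (surface vs. view direction, in degrees) at which
    an undistorted mapping reaches the filter's anisotropy cap."""
    return math.degrees(math.asin(1.0 / max_aniso))

print(cap_angle_degrees(8))  # ~7.18 degrees
```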

Pascal seems to ignore texture depth :smile:
It is a tough call with Doom3, as it should (like everyone else) use DXT1 for textures, which is 0.5 bytes/texel, while it will probably need 4 bytes/texel for the normal map. That makes all the difference!

The first pass does not need to read the framebuffer, while the following passes do not need to write the Z value. Anyway, it should be 2 x ZB at least for the first pass.

Also, why do you assume that all the texture stages will be used in all passes, especially on the GF3?



 
BPP = ( TSP x TLP x TCE ) + 2 x FB + ZB
This formula is missing a term for the number of bytes per texture sample. Also, you may want to replace the TCE term with (1 - TCE); otherwise you get the weird result that 0% texture cache efficiency = 0 memory bandwidth used on textures. Texture cache efficiency would normally be closer to 80-84% (for trilinear interpolation or low-resolution lightmaps). The number of FB and ZB accesses per pass will vary from one pass to another: you can normally avoid an FB read in the 1st pass and a ZB write in each pass other than the 1st.
 
Yeah, you are right :smile:

Rewriting:

BPP = ( TSP x TLP x TD x (1 - TCE) ) + 2 x FB + ZB

where TD = Texture depth

Now let's recalculate the GF3 Ti 200 with a higher 75% TCE and 4-byte TD:

pass 1 BPP = 4 x 4 x 4 x 0.25 + 4 + 4 = 24 bytes
pass 2 BPP = 4 x 4 x 4 x 0.25 + 2 x 4 = 24 bytes

bandwidth = ( 48 x 1024x768 x 3 x fps ) / 0.6 = 6.4 GB/s =>

fps = 34 (not too bad) :smile:
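
And the recalculation as a sketch, with the corrected formula (the per-pass FB/ZB traffic is chosen to match the two passes above):

```python
RESOLUTION, OVERDRAW, MSE, BANDWIDTH = 1024 * 768, 3, 0.6, 6.4e9

def bpp(tsp, tlp, td, tce, fb_bytes, zb_bytes):
    # BPP = ( TSP x TLP x TD x (1 - TCE) ) + framebuffer + Z traffic
    return tsp * tlp * td * (1.0 - tce) + fb_bytes + zb_bytes

# GF3 Ti 200, 75% TCE, 4-byte textures, two passes:
pass1 = bpp(4, 4, 4, 0.75, fb_bytes=4, zb_bytes=4)  # FB write, Z write -> 24
pass2 = bpp(4, 4, 4, 0.75, fb_bytes=8, zb_bytes=0)  # FB read + write   -> 24

bytes_per_frame = (pass1 + pass2) * RESOLUTION * OVERDRAW / MSE
print(BANDWIDTH / bytes_per_frame)  # ~34 fps
```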

Of course it is much more complex ;)
 
My thought on this:

When using volumetric shadows on everything, it is quite hard to collapse two lights into one pass, since each light has its own shadow mask. So what's really important is the formula for one light.

I think PS 1.1 can handle most cases of simple and bumped per-pixel lighting. Where PS 1.4 helps is the few cases where more complex lighting requires two or more passes on PS 1.1; there, PS 1.4 can reduce the number of rendering passes required.

Of course, there are some effects that can be done only with PS 1.4. An example is Z-corrected bump mapping (PS 1.3 can also do it). With Z-corrected bump mapping, the shadow edge on bumped walls (and characters) will be jagged and thus more convincing, instead of a straight line. Unfortunately, it can be too expensive to be practical.
 