Pixel Shader 1.4 advantage (or not)?

Discussion in 'Architecture and Products' started by LeStoffer, Feb 9, 2002.

  1. LeStoffer

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,253
    Likes Received:
    13
    Location:
    Land of the 25% VAT
    Okay, I’m going to stick my neck out here, so don’t chop my head right off, okay?

    I listened to the whole Carmack/Doom III interview again, and he was talking about the ability of ATI’s 8500 Pixel Shader 1.4 to apply more textures in one pass than the GF3 (6 vs 4 texture inputs max). So it got me wondering: what kind of performance increase could you see in real life from this advantage?

    One thing is to have the ability – another is to have a pipe/cache design that will make the best of it... Since memory bandwidth is still a major concern I guess that such a chip would need a larger texture cache to really take advantage of this feature... (so it doesn’t have to fetch texture data/layer number 5 or 6 from the main DDR-memory and slow things down).

    ATI makes this statement:
    It all sounds awesome (doh!), but as we know, one has to look out for the weakest link in the pipeline, so the question is whether the number of textures that can be blended in one pass really is the main bottleneck in a modern graphics card?

    (And no: I was not paid by nVidia to ask this question!)

    Regards, LeStoffer
     
  2. Mulciber

    Regular

    Joined:
    Feb 7, 2002
    Messages:
    413
    Likes Received:
    0
    Location:
    Houston
    I've wondered the same thing for a long time.
     
  3. Doomtrooper

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    3,328
    Likes Received:
    0
    Location:
    Ontario, Canada
    I'm not a coder, but ATI has a pretty in-depth explanation of the differences 1.4 brings.

    DirectX® 8.0 Pixel Shaders

    Max. Texture Inputs - 4
    Max. Program Length - 12 instructions (up to 4 texture sampling, 8 color blending)
    Instruction Set - 13 address operations, 8 color operations
    Texture Addressing Modes - 40

    SMARTSHADER Pixel Shaders (DirectX® 8.1)

    Max. Texture Inputs - 6
    Max. Program Length - 22 instructions (up to 6 texture sampling, 8 texture addressing, 8 color blending)
    Instruction Set - 12 address / color operations
    Texture Addressing Modes - Virtually unlimited

    http://www.ati.com/na/pages/technology/hardware/smartshader/smartshader_white_paper.html

    [ This Message was edited by: Doomtrooper on 2002-02-10 00:40 ]
     
  4. LeStoffer

    Thanks, but, ahmmm, I've been there, read that...

    So I stand by my question: does anybody really know whether ATI's PS1.4 would make a big difference over nVidia's PS1.1 (or PS1.3) if both were optimized by, say, John Carmack?

    Regards, LeStoffer
     
  5. Rodéric

    Rodéric a.k.a. Ingenu
    Moderator Veteran

    Joined:
    Feb 6, 2002
    Messages:
    3,980
    Likes Received:
    844
    Location:
    Planet Earth.
    Wait until Doom3; you'll see.
     
  6. Mulciber

    That's exactly what people said about Unreal2,
    but now we are seeing otherwise.
     
  7. arjan de lumens

    Veteran

    Joined:
    Feb 10, 2002
    Messages:
    1,274
    Likes Received:
    50
    Location:
    gjethus, Norway
    Nobody knows for certain, except Carmack himself. We can guess, though. Let's do some approximate math. We can assume 32-bit RGBA, 32-bit Z (compressed to 3 bytes per pixel), 32-bit textures, and a half-decent texture cache that needs to read 1.5 texels from memory per texture per pixel. To give PS1.4 the maximum benefit of the doubt, we can assume 6 textures total per pixel. We then get the following memory accesses for PS 1.1:
    • 6 textures = 6*1.5*4 = 36 bytes
    • First pass: 1 RGBA write + 1 Z read + 1 Z write = 10 bytes
    • Second pass: 1 RGBA read + 1 Z read + 1 RGBA write = 11 bytes

    For PS1.4, we obviously avoid the framebuffer accesses for the 2nd pass. This gives us a total of 57 bytes accessed per pixel for PS1.1 and 46 bytes for PS1.4. This would give PS1.4 a ~24% speed advantage, assuming that memory bandwidth is the bottleneck. Note that for both PS1.1 and PS1.4, the memory bandwidth usage is dominated by texture reads.

    Adding further capability to do multiple textures per pass gives diminishing returns as far as performance is concerned, while requiring additional texture cache to be effective, so the optimal number of textures per pass supported is finite.
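
    The per-pixel byte count in this post can be reproduced with a short Python sketch. The constants are the ones assumed above (4-byte RGBA, 3-byte compressed Z, 4-byte texels, 1.5 texels read per texture per pixel); the function and variable names are made up for illustration.

```python
# Bytes per pixel under the assumptions stated in the post.
RGBA_BYTES = 4            # 32-bit colour
Z_BYTES = 3               # 32-bit Z, compressed to 3 bytes per pixel
TEXEL_BYTES = 4           # 32-bit textures
TEXELS_PER_TEXTURE = 1.5  # texels read from memory per texture per pixel


def texture_bytes(num_textures):
    return num_textures * TEXELS_PER_TEXTURE * TEXEL_BYTES


def first_pass_bytes():
    # 1 RGBA write + 1 Z read + 1 Z write
    return RGBA_BYTES + 2 * Z_BYTES


def extra_pass_bytes():
    # 1 RGBA read + 1 Z read + 1 RGBA write
    return 2 * RGBA_BYTES + Z_BYTES


def bytes_per_pixel(num_textures, passes):
    framebuffer = first_pass_bytes() + (passes - 1) * extra_pass_bytes()
    return texture_bytes(num_textures) + framebuffer


ps11 = bytes_per_pixel(6, passes=2)  # 6 textures split over two passes
ps14 = bytes_per_pixel(6, passes=1)  # 6 textures in a single pass
advantage = ps11 / ps14 - 1.0
print(ps11, ps14, f"{advantage:.1%}")
```

    Texture reads are the same 36 bytes in both cases, so only the framebuffer term changes with the pass count, which is why the gain shrinks as more textures are packed into a pass.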
     
  8. Doomtrooper

    Thx Arjan, that's pretty in-depth :smile:
     
  9. Humus

    Humus Crazy coder
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    3,217
    Likes Received:
    77
    Location:
    Stockholm, Sweden
    While collapsing passes can certainly improve performance, there is another possibility that may be overlooked in discussions like this: you can simply do things another way using the flexibility of the shaders. In the Radeon 8500's case, the arbitrary dependent texture reads can be used to replace some heavy calculations with a texture used as a lookup table. Say, going from a 5-texture pass to maybe 2 textures with a dependent texture read (just an example, though, based on nothing).
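
    As a rough illustration of the lookup-table idea, here is a CPU-side Python sketch, not shader code. It assumes a specular term (N·H)^16 as the "heavy calculation", a 256-entry table standing in for a 1D texture, and nearest-neighbour sampling; none of these specifics come from the post, and all names are hypothetical.

```python
# A 1D "texture" holding a precomputed function, sampled with a computed coordinate.
TABLE_SIZE = 256
EXPONENT = 16  # assumed specular exponent, chosen only for illustration

# Built once, like uploading a lookup texture.
SPECULAR_LUT = [(i / (TABLE_SIZE - 1)) ** EXPONENT for i in range(TABLE_SIZE)]


def dot3(a, b):
    return sum(x * y for x, y in zip(a, b))


def specular_direct(n, h):
    # The arithmetic route: the power function is evaluated per pixel.
    return max(dot3(n, h), 0.0) ** EXPONENT


def specular_lookup(n, h):
    # The dependent-read route: the dot product becomes a texture coordinate.
    coord = min(max(dot3(n, h), 0.0), 1.0)
    return SPECULAR_LUT[round(coord * (TABLE_SIZE - 1))]


normal = (0.0, 0.0, 1.0)
half_vector = (0.0, 0.6, 0.8)
print(specular_direct(normal, half_vector), specular_lookup(normal, half_vector))
```

    The table trades a little precision for a single fetch; with filtering or a larger table the two routes would agree more closely.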
     
  10. Galilee

    Newcomer

    Joined:
    Feb 7, 2002
    Messages:
    241
    Likes Received:
    0
    Location:
    Trondheim, Norway
    I can't wait for some games to actually use this feature. I wonder if I'll have a GF5 or R400 by then :smile:
     
  11. arjan de lumens

    One simple example I can think of where higher numbers of textures per pass may pay off better than just collapsing passes would be this: performing DOT3 with one normal map and multiple light-maps. Let Ln be the light source vector for light source n, interpolated across the polygon as if it were a Gouraud color, Mn be the result of the lightmap or shadow map lookup, and N be the normal map. Simple per-pixel lighting with n light sources may then be performed as follows:

    Color = ( (L0 dot3 N)*M0 + (L1 dot3 N)*M1 + (L2 dot3 N)*M2 + ....) * BaseColor

    Now, to do this calculation, we obviously need to be able to access at least 2 textures per pass, N and Mn. It may look from the formula as if we need to load N over and over again, but in a reasonably well-written pixel shader program we can just load N once, store it to a pixel shader register and then keep re-using that register until we run out of texture units, thus avoiding some texture reads.

    Also, note that if we are forced into doing multiple passes because of lack of texture units, the final BaseColor multiplication must be given one full pass all of its own, adding to the proliferation of passes.

    Also, a comment to my previous post: If e.g. multisampling or compressed textures are being used, the numbers (bytes per pixel) given will change, mostly giving PS1.4 some additional benefit.
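
    The lighting formula in this post can be written out as a small Python sketch. This is a per-pixel illustration on the CPU, not shader code. The clamp at zero and the pass-counting rule (N reloaded in every pass, BaseColor treated as one texture) are my reading of the description, and the function names are hypothetical.

```python
import math


def dot3(a, b):
    # 3-component dot product, clamped at zero (the clamp is an assumption,
    # so that lights behind the surface contribute nothing).
    return max(sum(x * y for x, y in zip(a, b)), 0.0)


def shade_pixel(normal, lights, base_color):
    """Color = ((L0 dot3 N)*M0 + (L1 dot3 N)*M1 + ...) * BaseColor.

    lights is a list of (Ln, Mn) pairs; normal is fetched once and reused.
    """
    intensity = sum(dot3(light_vec, normal) * light_map
                    for light_vec, light_map in lights)
    return tuple(intensity * channel for channel in base_color)


def passes_needed(num_lights, textures_per_pass):
    # Single pass: N + one map per light + BaseColor must all fit.
    if num_lights + 2 <= textures_per_pass:
        return 1
    # Otherwise each pass holds N plus (textures_per_pass - 1) light maps,
    # and BaseColor gets one extra pass of its own.
    return math.ceil(num_lights / (textures_per_pass - 1)) + 1


color = shade_pixel((0.0, 0.0, 1.0),
                    [((0.0, 0.0, 1.0), 0.5), ((0.0, 0.6, 0.8), 1.0)],
                    (1.0, 0.5, 0.25))
print(color, passes_needed(4, 6), passes_needed(4, 4))
```

    Under these assumptions, four lights fit in one pass with 6 textures but take three passes with 4 textures, since the BaseColor multiply can no longer share a pass.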
     
  12. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California
    If you are using lightmaps, why would you ever need more than one? If the lightmap is calculated by using say, radiosity, and takes into account all the lights in the scene, what would the other light maps be used for and how would they be calculated? Also, if the lightmaps are monochromatic, you can store 2-3 per texture (alpha, and depending on swizzling/masking, blue channel, plus another)


    Probably the more useful thing is applying more complicated models like BRDF, or going wild on masking (gloss map, spec map, environment map, dirt map, glow map, detail texture, etc)
     
  13. LeStoffer

    Thanks for the input arjan de lumens and others...

    This is exactly what makes beyond3d great. I also want to extend my thanks to those of you brainy folks who stay here despite all the random rambling that people like me “pollute” the forum with. :wink:

    Regards, LeStoffer
     
  14. pascal

    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    1,830
    Likes Received:
    49
    Location:
    Brasil
    IIRC (PcChen can explain much better) the approximate equation for a single pass is:

    BPP = ( TSP x TLP x TCE ) + 2 x FB + ZB

    Where:
    BPP - Bytes per pixel per pass
    TSP - Texture Samples per Pixel (varies with the filter, bilinear = 4)
    TLP - Texture Levels per Pixel
    TCE - Texture Cache Efficiency
    FB - Frame Buffer (usually 4 bytes)
    ZB - Z Buffer (usually 4 bytes)

    The total bandwidth will be:

    bandwidth = (BPP x passes x resolution x overdraw x fps) / MSE

    Where MSE is the Memory Subsystem Efficiency and depends on the kind of memory, latency, statistics of page open and close, etc...

    Let's say that we are playing Doom3 at 1024x768x32, bilinear, overdraw 3, using a GF2 Pro with 50% texture cache efficiency and 60% MSE. Then for a single pass:

    BPP = 4x2x0.5 + 2x4 + 4 = 16 bytes

    bandwidth = (16 x 5 x 1024x768 x 3 x fps) / 0.6 = 6.4 GB/s =>

    fps = 20 (very slow) :eek:

    Now let's do the same with a GF3 Ti 200:

    BPP = 4x4x0.5 + 2x4 + 4 = 20 bytes

    bandwidth = (20 x 2 x 1024x768 x 3 x fps) / 0.6 = 6.4 GB/s =>

    fps = 40 (good, twice the fps of the GF2 Pro) :smile:

    Let's do the same for a Radeon 8500, but with 200MHz DDR memory:

    BPP = 4x6x0.5 + 2x4 + 4 = 24 bytes

    bandwidth = (24 x 1 x 1024x768 x 3 x fps) / 0.6 = 6.4 GB/s =>

    fps = 67 (much better now)

    OK, but there are some uncontrolled variables:
    - The texture cache efficiency
    - The memory subsystem efficiency
    - The filter implementation
    - Overdraw

    Also, I am not sure if GFs will require a new ZB check in the second pass and beyond.

    Well, this is all theory :smile:
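
    For anyone who wants to play with the numbers in this post, here is a small Python sketch of the formula exactly as written, solved for fps. It assumes 1 GB/s = 10^9 bytes/s, which is what reproduces the figures quoted; the function names and the dictionary layout are hypothetical.

```python
def bytes_per_pixel_per_pass(tsp, tlp, tce, fb=4, zb=4):
    # BPP = (TSP x TLP x TCE) + 2 x FB + ZB, as written in the post
    return tsp * tlp * tce + 2 * fb + zb


def fps_at_bandwidth(bpp, passes, bandwidth=6.4e9, width=1024, height=768,
                     overdraw=3, mse=0.6):
    # Solve bandwidth = (BPP x passes x resolution x overdraw x fps) / MSE for fps
    return bandwidth * mse / (bpp * passes * width * height * overdraw)


# (texture levels per pixel, number of passes) as used in the post
cards = {"GF2 Pro": (2, 5), "GF3 Ti 200": (4, 2), "Radeon 8500": (6, 1)}

results = {}
for name, (tlp, passes) in cards.items():
    bpp = bytes_per_pixel_per_pass(tsp=4, tlp=tlp, tce=0.5)
    results[name] = (bpp, fps_at_bandwidth(bpp, passes))
    print(name, bpp, round(results[name][1], 1))
```

    Changing overdraw, MSE or the cache term in the call shows how sensitive the fps estimate is to the uncontrolled variables listed above.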

    [ This Message was edited by: pascal on 2002-02-10 19:47 ]
     
  15. Hyp-X

    Hyp-X Irregular
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,170
    Likes Received:
    5
    I think an efficient cache wouldn't load a texel twice within a single polygon.
    So I suggest calculating the texel/pixel ratio (TPR) instead of TSP*TCE.

    With bilinear and mipmapping the highest TPR should be 1, because the GPU would switch to the next mipmap when it goes above that. At the switching point it becomes 0.25, because the texture resolution is halved in each direction with each mipmap. The average should be 0.5. (This makes the same average 0.625 for trilinear.)

    When a surface is viewed at lower angles the TPR should drop, because the mipmap selection chooses "too" low resolution mipmaps. With anisotropic filtering, on the other hand, TPR can go as high as the maximum ratio of the filter. So with 8x aniso you would peak at TPR=8, which should occur at 7.18° (assuming no distortion in the mapping). Below that angle TPR would oscillate between 8 and 2.
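
    Two of the figures above can be checked with a few lines of Python. The sketch assumes that the pixel footprint on a surface stretches by 1/sin(angle) at a grazing angle, and that trilinear adds the next smaller mipmap (a quarter of the texels) on top of bilinear; both are my assumptions, chosen because they reproduce the quoted 7.18° and 0.625.

```python
import math


def grazing_angle_degrees(aniso_ratio):
    # Angle between the view direction and the surface at which the
    # footprint stretch 1/sin(angle) reaches the given anisotropy ratio.
    return math.degrees(math.asin(1.0 / aniso_ratio))


BILINEAR_AVG_TPR = 0.5  # average texel/pixel ratio quoted for bilinear

# Trilinear also samples the next smaller mipmap, which has 1/4 of the texels.
trilinear_avg_tpr = BILINEAR_AVG_TPR * (1 + 0.25)

print(round(grazing_angle_degrees(8), 2), trilinear_avg_tpr)
```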

    Pascal seems to ignore texture depth :smile:
    It is a tough call with Doom3, as it should (like everyone) use DXT1 for textures, which is 0.5 bytes/texel, while it will probably need 4 bytes/texel for the normal map. It makes all the difference!

    The first pass does not need to read the framebuffer, while the following passes do not need to write the Z value. Anyway, it should be 2xZB at least for the first pass.

    Also, why do you assume that all the texture stages will be used in all passes, especially on the GF3?



    [ This Message was edited by: Hyp-X on 2002-02-10 22:30 ]
     
  16. arjan de lumens

    This formula is missing a term for the number of bytes per texture sample. Also, you may want to replace the TCE term with (1-TCE), otherwise you get the weird result that 0% texture cache efficiency = 0 memory bandwidth used on textures. Texture cache efficiency would normally be closer to about 80-84% (for trilinear interpolation or low-resolution lightmaps). The number of FB and ZB accesses per pass will vary from one pass to another - you can normally avoid an FB read in 1st pass and a ZB write in each pass other than the 1st one.
     
  17. pascal

    Yeah, you are right :smile:

    Rewriting:

    BPP = ( TSP x TLP x TD x (1 - TCE) ) + 2 x FB + ZB

    where TD = Texture depth

    Now let's recalculate the GF3 Ti 200 with a higher 75% TCE and 4 bytes TD:

    pass 1 BPP = 4x4x4x0.25 + 4 + 4 = 24 bytes
    pass 2 BPP = 4x4x4x0.25 + 2x4 = 24 bytes

    bandwidth = (48 x 1024x768 x 3 x fps) / 0.6 = 6.4 GB/s =>

    fps = 34 (not too bad) :smile:

    Of course it is much more complex :wink:
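
    The rewritten formula, with separate framebuffer and Z-buffer access counts per pass, can be sketched in Python as follows. The access counts (one FB and one ZB access in pass 1, two FB accesses and no ZB access in pass 2) are simply the ones that reproduce the 24-byte figures in this post, and 1 GB/s is taken as 10^9 bytes/s; the function names are hypothetical.

```python
def texture_bytes(tsp, tlp, td, tce):
    # TSP x TLP x TD x (1 - TCE): only cache misses reach memory
    return tsp * tlp * td * (1 - tce)


def pass_bytes(tsp, tlp, td, tce, fb_accesses, zb_accesses, fb=4, zb=4):
    return texture_bytes(tsp, tlp, td, tce) + fb_accesses * fb + zb_accesses * zb


def fps_at_bandwidth(total_bytes, bandwidth=6.4e9, width=1024, height=768,
                     overdraw=3, mse=0.6):
    return bandwidth * mse / (total_bytes * width * height * overdraw)


# GF3 Ti 200 example: 4 samples, 4 texture levels, 4-byte texels, 75% TCE
pass1 = pass_bytes(4, 4, 4, 0.75, fb_accesses=1, zb_accesses=1)
pass2 = pass_bytes(4, 4, 4, 0.75, fb_accesses=2, zb_accesses=0)
total = pass1 + pass2
fps = fps_at_bandwidth(total)
print(pass1, pass2, total, round(fps))
```

    With the (1 - TCE) term, a cache efficiency of 0 gives the full texture traffic and an efficiency of 1 gives none, which is the behaviour the correction was asking for.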
     
  18. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,741
    Likes Received:
    105
    Location:
    Taiwan
    My thought on this:

    When using volumetric shadows on everything, it is quite hard to collapse two lights into one pass, since each light has its own shadow mask. So what is really important is the formula for one light.

    I think PS 1.1 can handle most cases of simple and bumped per-pixel lighting. Where PS 1.4 helps is in the few cases where more complex lighting requires two or more passes on PS 1.1; there PS 1.4 can reduce the number of rendering passes required.

    Of course, there are some effects that PS 1.1 cannot do at all. An example is Z-corrected bump mapping, which PS 1.4 can do (PS 1.3 can also do it). With Z-corrected bump mapping, the shadow edge on bumped walls (and characters) will be jagged and more convincing, instead of a straight line. Unfortunately it can be too expensive to be practical.
     