NV30 fragment processor test results

Discussion in 'Architecture and Products' started by thepkrl, Apr 1, 2003.

  1. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    299
    Location:
    UK
    Dawn's skin shader:

    Code:
    !!FP1.0
    # NV_fragment_program generated by NVIDIA Cg compiler
    # cgc version 1.5.0001, build date Oct 15 2002  01:04:08
    # command line args: -profile fp30
    #vendor NVIDIA Corporation
    #version 1.0.1
    #profile fp30
    #program main
    #semantic main.skinColor_frontSpec
    #semantic main.skin_norm_sideSpec
    #semantic main.g_specular_colorShift
    #semantic main.g_blood_texture
    #semantic main.g_transmission_terms
    #semantic main.g_diffuse_Cube
    #semantic main.g_specular_Cube
    #semantic main.g_nrmalize_Cube
    #semantic main.g_dappleProjection
    #semantic main.g_hilight_Cube
    #semantic main.g_oiliness
    #var float4 v2f.skinColor_frontSpec : $vin.TEXCOORD0 : TEXCOORD0 : 0 : 1
    #var float3 v2f.worldEyeDir	   : $vin.TEXCOORD2 : TEXCOORD2 : 0 : 1
    #var float3 v2f.worldTanMatrixX 	   : $vin.TEXCOORD5 : TEXCOORD5 : 0 : 1
    #var float3 v2f.worldTanMatrixY 	   : $vin.TEXCOORD6 : TEXCOORD6 : 0 : 1
    #var float3 v2f.worldTanMatrixZ 	   : $vin.TEXCOORD7 : TEXCOORD7 : 0 : 1
    #var float4 v2f.SkinSilouetteVec    : $vin.TEXCOORD3 : TEXCOORD3 : 0 : 1
    #var samplerRECT skinColor_frontSpec :  : texunit 0 : 1 : 1
    #var samplerRECT skin_norm_sideSpec  :  : texunit 1 : 2 : 1
    #var samplerRECT g_specular_colorShift :  : texunit 2 : 3 : 1
    #var samplerRECT g_blood_texture 	     :  : texunit 3 : 4 : 1
    #var samplerRECT g_transmission_terms  :  : texunit 4 : 5 : 1
    #var samplerCUBE g_diffuse_Cube 	     :  : texunit 5 : 6 : 1
    #var samplerCUBE g_specular_Cube       :  : texunit 6 : 7 : 1
    #var samplerCUBE g_nrmalize_Cube       :  : texunit 7 : 8 : 1
    #var samplerRECT g_dappleProjection    :  : texunit 8 : 9 : 1
    #var samplerCUBE g_hilight_Cube        :  : texunit 9 : 10 : 1
    #var float2 g_oiliness :  :  : 11 : 1
    #var half4 COL : $vout.COLOR : COLOR : -1 : 1
    
    DEFINE LUMINANCE = {0.299, 0.587, 0.114, 0.0};
    DECLARE g_oiliness;
    
    ############################################################################################
    #  These two blocks slow code down 10%                                                     #
    ############################################################################################
    TEX H0, f[TEX0], TEX1, 2D;                  # store range-compressed normal (and side spec) in H1
    TEX H1, f[TEX0], TEX0, 2D;                  # store skin color in H3
    
    MOVH H2.x, g_oiliness.y;
    MULX H0.xy, H0, H2.x;
    TEX H0, H0, TEX7, CUBE;
    
    MOVH H3, f[TEX5];
    DP3X H2.x, H0, H3;
    MOVH H3, f[TEX6];
    DP3X H2.y, H0, H3;
    MOVH H3, f[TEX7];
    DP3X H2.z, H0, H3;                          # H2 now contains worldNormal - extinguish
    
    MOVH H0.xyz, f[TEX2];                       # store v2f.world_V in H1
    DP3X H2.w, H0, H2;                          # H2.w = dot(Normal, View)
    MULX H2.xyz,-H2, -2;                        # twice the normal
    MADX H3.xyz, -H2, H2.w, H0;                 # H4 = -2*dot(H0, H2)*H2 + H0  = reflection vector.  Normal doesn't need to be fixed, because it's uniformly scaled
    
    TEX H0.xyz, H2, TEX5, CUBE;                 # diffuse lighting
    TEX H2, H2, TEX6, CUBE;                     # side specular
    MULX H0.xyz, H0, H1;                        # skin diffuse*diffuse color
    #MULX H2, H2, H0.w;                         # side_spec*side_spec term (H0.w is the side_spec term)
    MULX H2, H2, H1.w;                          # side_spec*side_spec term (H0.w is the side_spec term)
    
    MOVH H1.xyz, f[TEX3];                       # copy ndotv (H1.w is the spec map)
    MULX H0, H0, H1.x;
    MADX H0, H2, H1.y, H0;                      # diffuse + side_spec (H2 avail)
    
    TEX H2, f[TEX2], TEX9, CUBE;                # fetch hilight
    TEX H3, H3, TEX6, CUBE;                     # fetch direct specular
    MULX H2.xyz, H2, H1.y;                      # hilight by facing ratio
    MULX H2.xyz, H2, H1.x;
    MULX H3, H3, H1.w;                          # spec*spec_map
    MULX H3, H3, 0.02;                          # scale specular
    MOVH H3.w, g_oiliness.x;
    MULH H3, H3, H3.w;		            # scale the oiliness
    MADX H2.xyz, H2, 0.7, H3;                   # 0.7*hilight + direct spec
    
    ADDX H0, H2, H0;                            # diffuse specs and hilight
    #MADX o[COLH], H0, H1.x, H2.w;              # add haze
    ADDX o[COLH], H0, H2.w;                     # add haze
    END
    
    
    # End of program
    Also, note how only H0 through H3 are used. Sounds like nVidia knows about the register performance :D

    Okay, so in that program, there are:
    ADDX: 2
    MULX: 9
    MADX: 3
    DP3X: 4
    TEX: 7
    MULH: 1
    MOVH: 7

    -> 5a, 13m, 3S, 4T, 1M and 7MOVs
    All MOVs seem to do FP->FP
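The tally above can be reproduced mechanically. Here is a minimal Python sketch, assuming the shader body is pasted as a string with one instruction per line. `count_opcodes` and `temp_registers` are hypothetical helper names for illustration, not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Instruction stream of the Dawn skin shader as posted above (comments stripped).
SHADER = """
TEX H0, f[TEX0], TEX1, 2D;
TEX H1, f[TEX0], TEX0, 2D;
MOVH H2.x, g_oiliness.y;
MULX H0.xy, H0, H2.x;
TEX H0, H0, TEX7, CUBE;
MOVH H3, f[TEX5];
DP3X H2.x, H0, H3;
MOVH H3, f[TEX6];
DP3X H2.y, H0, H3;
MOVH H3, f[TEX7];
DP3X H2.z, H0, H3;
MOVH H0.xyz, f[TEX2];
DP3X H2.w, H0, H2;
MULX H2.xyz,-H2, -2;
MADX H3.xyz, -H2, H2.w, H0;
TEX H0.xyz, H2, TEX5, CUBE;
TEX H2, H2, TEX6, CUBE;
MULX H0.xyz, H0, H1;
MULX H2, H2, H1.w;
MOVH H1.xyz, f[TEX3];
MULX H0, H0, H1.x;
MADX H0, H2, H1.y, H0;
TEX H2, f[TEX2], TEX9, CUBE;
TEX H3, H3, TEX6, CUBE;
MULX H2.xyz, H2, H1.y;
MULX H2.xyz, H2, H1.x;
MULX H3, H3, H1.w;
MULX H3, H3, 0.02;
MOVH H3.w, g_oiliness.x;
MULH H3, H3, H3.w;
MADX H2.xyz, H2, 0.7, H3;
ADDX H0, H2, H0;
ADDX o[COLH], H0, H2.w;
END
"""

def count_opcodes(source: str) -> Counter:
    """Count how often each opcode appears, skipping blanks and END."""
    counts = Counter()
    for line in source.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line == "END":
            continue
        counts[line.split()[0]] += 1
    return counts

def temp_registers(source: str) -> list:
    """List the distinct temporary registers (Hn or Rn) referenced."""
    return sorted(set(re.findall(r"\b[HR]\d+\b", source)))

counts = count_opcodes(SHADER)
for opcode in sorted(counts):
    print(f"{opcode}: {counts[opcode]}")
print("temporaries:", temp_registers(SHADER))
```

The register scan also backs up the remark that only H0 to H3 appear in the program.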

    According to thepkrl's data, the float/texture part would take 7 cycles (1.5 + 4 + 1 = 6.5 -> 7).
    I didn't check whether "m"s are dependent.
    If they aren't, then it's possible to do the INT part in 7 cycles too, in parallel to all of the FP/Tex stuff:
    12/3 + 6/2 = 4 + 3 = 7

    Does that make sense?

    BTW, I'd like to know how many rounds SM, SA, TM and TA take. If they only take one, then the FP/Tex part would only take 6 cycles...


    Uttar
     
  2. thepkrl

    Newcomer

    Joined:
    Mar 14, 2003
    Messages:
    12
    Likes Received:
    0
    More numbers below. Looking at them more closely, it seems there is more structure to the slowdown than just every two registers slowing things. I've grouped the numbers a bit to show this.

    Code:
       32.12 rounds   8.48 cycle/fragm:  1 regs, 32 instr
       32.05 rounds   8.46 cycle/fragm:  2 regs, 32 instr
    
       34.29 rounds   9.05 cycle/fragm:  3 regs, 32 instr
       34.27 rounds   9.05 cycle/fragm:  4 regs, 32 instr
    
       43.62 rounds  11.52 cycle/fragm:  5 regs, 32 instr
       43.64 rounds  11.52 cycle/fragm:  6 regs, 32 instr
    
       59.60 rounds  15.73 cycle/fragm:  7 regs, 32 instr
       59.61 rounds  15.74 cycle/fragm:  8 regs, 32 instr
    
       93.42 rounds  24.66 cycle/fragm:  9 regs, 32 instr
       93.36 rounds  24.65 cycle/fragm: 10 regs, 32 instr
       95.97 rounds  25.33 cycle/fragm: 11 regs, 32 instr
       95.97 rounds  25.33 cycle/fragm: 12 regs, 32 instr
    
      129.35 rounds  34.15 cycle/fragm: 13 regs, 32 instr
      129.33 rounds  34.14 cycle/fragm: 14 regs, 32 instr
      132.74 rounds  35.04 cycle/fragm: 15 regs, 32 instr
      132.80 rounds  35.06 cycle/fragm: 16 regs, 32 instr
    
      207.44 rounds  54.76 cycle/fragm: 17 regs, 32 instr
      207.42 rounds  54.76 cycle/fragm: 18 regs, 32 instr
      212.65 rounds  56.14 cycle/fragm: 19 regs, 32 instr
      212.64 rounds  56.14 cycle/fragm: 20 regs, 32 instr
      217.83 rounds  57.50 cycle/fragm: 21 regs, 32 instr
      217.85 rounds  57.51 cycle/fragm: 22 regs, 32 instr
      223.03 rounds  58.88 cycle/fragm: 23 regs, 32 instr
      223.05 rounds  58.88 cycle/fragm: 24 regs, 32 instr
    
      299.50 rounds  79.07 cycle/fragm: 25 regs, 32 instr
      299.49 rounds  79.06 cycle/fragm: 26 regs, 32 instr
      306.32 rounds  80.87 cycle/fragm: 27 regs, 32 instr
      306.32 rounds  80.87 cycle/fragm: 28 regs, 32 instr
      313.12 rounds  82.66 cycle/fragm: 29 regs, 32 instr
      313.15 rounds  82.67 cycle/fragm: 30 regs, 32 instr
      319.91 rounds  84.45 cycle/fragm: 31 regs, 32 instr
      319.95 rounds  84.46 cycle/fragm: 32 regs, 32 instr
    One guess to explain this is that the more registers are used, the fewer fragments are active in the pipeline. This would lower efficiency but would increase the memory available to a single fragment. This might be a sensible strategy, as using many registers means long programs, which might have internal parallelism that could be exploited. This doesn't seem to happen with current drivers though (programs with many registers do not perform paired FP-ops any faster).

    I tried more tests with complicated programs using 8 registers. It seems the numbers above are the best case, and when registers are used in more complicated patterns, things get a bit slower (25% in one case).
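To make the grouping in the table easier to see, here is a small Python sketch that computes the slowdown relative to the 1-2 register case and lists the register counts where the cost jumps sharply. It keeps one cycles/fragment figure per register pair (the even-numbered row), since both rows of a pair agree within noise. The 1.2x threshold and the function names are my own assumptions for illustration.

```python
# cycles/fragment for the 32-instruction test program, keyed by register count
# (even-numbered rows from the table above)
CYCLES = {
    2: 8.46, 4: 9.05, 6: 11.52, 8: 15.74,
    10: 24.65, 12: 25.33, 14: 34.14, 16: 35.06,
    18: 54.76, 20: 56.14, 22: 57.51, 24: 58.88,
    26: 79.06, 28: 80.87, 30: 82.67, 32: 84.46,
}

def slowdown(regs: int) -> float:
    """Cost relative to the cheapest (2-register) case."""
    return CYCLES[regs] / CYCLES[2]

def big_jumps(threshold: float = 1.2) -> list:
    """Register counts whose cost exceeds the previous pair's by more than `threshold`x."""
    keys = sorted(CYCLES)
    return [b for a, b in zip(keys, keys[1:]) if CYCLES[b] / CYCLES[a] > threshold]

for regs in sorted(CYCLES):
    print(f"{regs:2d} regs: {slowdown(regs):5.2f}x")
print("large steps at:", big_jumps())
```

The large steps land where the blank lines in the table are, apart from the small step between 2 and 4 registers, which stays under the threshold.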

    Good observation.

    Looking at the Dawn skin shader you showed, there is an interesting code fragment with extra MOV instructions:

    Code:
    MOVH H3, f[TEX5];
    DP3X H2.x, H0, H3;
    MOVH H3, f[TEX6];
    DP3X H2.y, H0, H3;
    ...
    I tried a similar code segment to see if moving a texture coordinate f[TEX] to a temp reg first is faster than using it directly. And it is! As mentioned earlier, using an f[TEX] register in arithmetic ops seems to cost one extra cycle (for both FP16 and FP32). However, with the extra move, the FP16 operation completes in one cycle, including the MOV, which becomes free.

    This might be related to the NV program specification, which says instructions with f[...] are always performed with FP32 accuracy. So using it directly forces the operation into FP32 which is always 2 cycles in this case, but adding a dummy move that changes the type makes it legal for the driver to ignore the rule.

    The whole Dawn shader takes 17 rounds/pixel (4.25 cycles/pixel) in my testbench. There is dependency between instructions which lowers efficiency quite a bit from the theoretical maximums you calculated. I tried to manually divide the shader to rounds, and got the same number of stages as the tests showed, so this might be close:

    Code:
    TEX H0, f[TEX0], TEX1, 2D;
    TEX H1, f[TEX0], TEX0, 2D;
    --
    MOVH H2.x, g_oiliness.y;
    MULX H0.xy, H0, H2.x;
    --
    TEX H0, H0, TEX7, CUBE;
    --
    MOVH H3, f[TEX5];
    DP3X H2.x, H0, H3;
    --
    MOVH H3, f[TEX6];
    DP3X H2.y, H0, H3;
    --
    MOVH H3, f[TEX7];
    DP3X H2.z, H0, H3;
    --
    MOVH H0.xyz, f[TEX2];
    DP3X H2.w, H0, H2;
    --
    MULX H2.xyz,-H2, -2;
    MADX H3.xyz, -H2, H2.w, H0;
    --
    TEX H0.xyz, H2, TEX5, CUBE;
    --
    TEX H2, H2, TEX6, CUBE;
    --
    MULX H0.xyz, H0, H1;
    MULX H2, H2, H1.w;
    --
    MOVH H1.xyz, f[TEX3];
    MULX H0, H0, H1.x;
    MADX H0, H2, H1.y, H0;
    --
    TEX H2, f[TEX2], TEX9, CUBE;
     MULX H2.xyz, H2, H1.y;
     MULX H2.xyz, H2, H1.x;
    --
    TEX H3, H3, TEX6, CUBE;
     MULX H3, H3, H1.w;
     MULX H3, H3, 0.02;
    --
    MOVH H3.w, g_oiliness.x;
    MULH H3, H3, H3.w;
    --
    MADX H2.xyz, H2, 0.7, H3;
    ADDX H0, H2, H0;
    --
    ADDX o[COLH], H0, H2.w;
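
The manual split above can be checked with a short Python sketch. Each entry lists the opcodes placed in one round, in the same order as the listing. The factor of 4 rounds per cycle is an assumption taken from the "17 rounds/pixel (4.25 cycles/pixel)" figure quoted above, not from any documentation.

```python
# One string per round, opcodes only, in the order of the manual split above.
ROUNDS = [
    "TEX TEX",
    "MOVH MULX",
    "TEX",
    "MOVH DP3X",
    "MOVH DP3X",
    "MOVH DP3X",
    "MOVH DP3X",
    "MULX MADX",
    "TEX",
    "TEX",
    "MULX MULX",
    "MOVH MULX MADX",
    "TEX MULX MULX",
    "TEX MULX MULX",
    "MOVH MULH",
    "MADX ADDX",
    "ADDX",
]

ROUNDS_PER_CYCLE = 4  # assumption: implied by 17 rounds -> 4.25 cycles

num_rounds = len(ROUNDS)
num_instructions = sum(len(r.split()) for r in ROUNDS)
cycles_per_pixel = num_rounds / ROUNDS_PER_CYCLE

print(f"{num_rounds} rounds, {num_instructions} instructions")
print(f"{num_instructions / num_rounds:.2f} instructions per round")
print(f"{cycles_per_pixel} cycles per pixel")
```

So the 33 instructions average just under two per round, which is where the dependencies cost efficiency compared to the theoretical estimate.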
    
    Texture does seem to take the whole round and the Add or Mul goes to the next round.

    Code:
      1.00 rounds, prog: T
      2.02 rounds, prog: TA
      3.01 rounds, prog: TAA
      1.00 rounds, prog: S
      2.01 rounds, prog: SA
      3.01 rounds, prog: SAA
      1.01 rounds, prog: T
      2.00 rounds, prog: TM
      3.03 rounds, prog: TMM
      1.02 rounds, prog: S
      2.03 rounds, prog: SM
      3.04 rounds, prog: SMM
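
A quick Python check of the numbers above: every measured program takes about one round per op, so a texture op is not paired with the arithmetic op that follows it. The 0.05 tolerance is an arbitrary choice of mine to absorb measurement noise.

```python
# (program, measured rounds) pairs copied from the results above
MEASURED = [
    ("T", 1.00), ("TA", 2.02), ("TAA", 3.01),
    ("S", 1.00), ("SA", 2.01), ("SAA", 3.01),
    ("T", 1.01), ("TM", 2.00), ("TMM", 3.03),
    ("S", 1.02), ("SM", 2.03), ("SMM", 3.04),
]

def deviation(prog: str, rounds: float) -> float:
    """Difference between measured rounds and one round per op."""
    return abs(rounds - len(prog))

worst = max(deviation(p, r) for p, r in MEASURED)
one_round_per_op = all(deviation(p, r) < 0.05 for p, r in MEASURED)

print("one round per op:", one_round_per_op)
print(f"worst deviation: {worst:.2f} rounds")
```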
     
  3. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,921
    Likes Received:
    221
    Location:
    Seattle, WA
    Oh, I'd like to post a quick update. I finally found the quote I was looking for.

    With 4 FP ops and 8 Int ops per clock from these tests, the total number of functional units just does not add up.

    From Extremetech:
    (This was a quote by David Kirk)

    Now, if the GeForce FX indeed does have 32 functional units, what are they doing? We only see 12 in use with these benches.
     
  4. Luminescent

    Veteran

    Joined:
    Aug 4, 2002
    Messages:
    1,036
    Likes Received:
    0
    Location:
    Miami, Fl
    Well remember, each fp shader pipeline contains 4 fp subunits (perhaps mads/general fp processors) which compute the color for each individual color component (RGBA). So there are 4 32-bit units per shader pipeline. If there are 4 pipes, this means 16 units. This is only for the fragment shader. What about the 12 integer units?

    Edit:
    I do agree that there is some mystery behind the 125 million transistor count of the NV30; something is not performing. With the NV35 rumored to sport only 5 million more, something smells fishy. Does anyone think NV35 will be able to execute texture ops and fragment shaders concurrently? Why isn't NV30 able to do this? According to thepkrl, the NV30 uses the fragment shader only for dependent texturing. Therefore, why is it not performing alongside the texture units for independent texturing?
     
  5. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,921
    Likes Received:
    221
    Location:
    Seattle, WA
    But if you're going to calculate things like that, you'd get 48 functional units, which also doesn't add up.

    My first guess would be that there's a second FP unit per pipe that isn't made available in current drivers. For what reason, I'm not certain (though parallelism concerns seem to be the most likely issue...possibly with register sharing problems). These 16 functional units may be counted as 32 if each is a MAD unit (multiply+add).
     
  6. demalion

    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    2,024
    Likes Received:
    1
    Location:
    CT
    :?:

    OK, MAD can be viewed as 2 units for your purposes?

    If you'll look back here, you'll see the argument I made for 8 FP units and subsequent discussion about it (if you look at the other thread, I think you'll see why I consider that important for the NV35 when not performing texture ops for long sequences of instructions).

    From that and the other detailed discussion thepkrl provided, 8 * (FP MUL + FP ADD) + 4 * (2 MUL + 2 ADD) = 32.

    OK, starting from this, and pointing out the flaws in it and my past discussion, what is your point Chalnoth? I'm not trying to flame, I just don't see how your discussion of a 32 count calls the earlier discussion into question.
     
  7. antlers

    Regular

    Joined:
    Aug 14, 2002
    Messages:
    457
    Likes Received:
    0
    I don't think there are any "missing" calc units there. David Kirk was just in total marketing mode. I mean, he said that they can never write more than 8 pixels per cycle, when the truth of the matter is they can never write more than four.
     
  8. Saem

    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    1,532
    Likes Received:
    6
    Argh, there is a significant difference between architecture and design. I'm guessing you mean mistakes in the design. The architecture is likely quite good, just this implementation (Read: design) sucks when compared to the R300.
     
  9. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,921
    Likes Received:
    221
    Location:
    Seattle, WA
    I don't think it sucks compared to the R300 (speaking solely about the shader implementation).

    The only problem is, you need to make use of integer calculations whenever possible for the NV30 architecture to have high performance. If enough of your calculations can use integer precision (2/3 of all calculations), then the NV30 can, going by the benchmarks in this thread, have performance as much as 50% above the R300 on a per-clock basis.

    Now, this certainly makes the NV30 a little bit harder to program for, but this is probably why nVidia made Cg. With optimized compiling, the vast majority of the quirkiness of the NV30 architecture need not be made visible to the programmer. Still, Direct3D remains a problem for the FX.

    Now, if you really want to say that the NV30's shader design is fundamentally inferior to the R300's, you will need to have more information about specific shaders that would require higher than integer precision all throughout the pipeline. And remember, even if a piece of software has some shaders that require floating-point precision all throughout the shader, the NV30 may still end up with higher performance if enough shaders need mostly integer precision.
     
  10. demalion

    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    2,024
    Likes Received:
    1
    Location:
    CT
    Are you including that one has 8 pipes and another has 4 for that "50%" higher per clock figure?
     
  11. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,921
    Likes Received:
    221
    Location:
    Seattle, WA
    Well, isn't the Radeon 9700 capable of one FP shader op per clock per pixel pipeline? That's 8 fp ops per clock.

    The GeForce FX is, according to the benches in this thread, capable of 4 fp ops and 8 int ops per clock. That's 50% more pixel shading power, if enough ops use integer precision.

    Of course, if lots of scalar ops are issued, the R300 can do even better...but I'm just trying to say that the FX shader architecture is not fundamentally inferior to the R300 shader architecture (Of course, this depends on how many calculations can run at integer precision...but current information points to the FX as generally being better if as many ops as possible use integer precision).
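
The per-clock argument can be put into a rough Python model. It uses only the figures quoted in this thread (8 FP ops/clock for the R300, 4 FP plus 8 int ops/clock for the NV30) and makes two simplifying assumptions of my own: the FP and int units overlap perfectly, and an op that is allowed to run at integer precision can also be issued to an FP unit. Texture ops, scalar issue and register pressure are ignored.

```python
R300_OPS_PER_CLOCK = 8   # 8 pipes x 1 FP op per clock, as quoted above
NV30_FP_PER_CLOCK = 4    # from the benchmarks in this thread
NV30_INT_PER_CLOCK = 8   # from the benchmarks in this thread

def nv30_ops_per_clock(int_fraction: float) -> float:
    """Idealized NV30 throughput when `int_fraction` of ops may use int precision.

    Two bounds apply: ops that need FP must fit on the FP units, and all ops
    together cannot exceed the combined FP + int issue rate.
    """
    fp_time = (1.0 - int_fraction) / NV30_FP_PER_CLOCK
    total_time = 1.0 / (NV30_FP_PER_CLOCK + NV30_INT_PER_CLOCK)
    return 1.0 / max(fp_time, total_time)

for fraction in (0.0, 0.25, 0.5, 2 / 3, 1.0):
    ratio = nv30_ops_per_clock(fraction) / R300_OPS_PER_CLOCK
    print(f"int fraction {fraction:.2f}: {ratio:.2f}x R300 per clock")
```

Under these assumptions the NV30 only breaks even once half the ops can run at integer precision, and it reaches the 50% advantage at two thirds.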
     
  12. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    690
    Likes Received:
    425
    Location:
    Slovenia
    No matter how you twist it: ps_2_x and ARB_fragment_program DO NOT expose anything to support integer calculations. Only NV_fragment_program exposes such functionality, and only there can Cg help the GeForce FX. Generally you ARE NOT ALLOWED to run below fp16 in ps_2_x and ARB_fragment_program, meaning that NVIDIA drivers will have to hack around, trying to figure out whether a shader actually needs (full) float precision for a given instruction or not.
    I don't know what NVIDIA was actually trying to do with such an approach. They are definitely not making their driver writers' lives any easier. They also don't seem to save any transistors. So what the hell are they saving?
     
  13. LeStoffer

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,253
    Likes Received:
    13
    Location:
    Land of the 25% VAT
    Yes, this very high dependence on int12 instructions is obviously key to NV30 shader performance. I would also agree that this is in part why they created Cg, but it seems increasingly clear to me that the hardware architecture guys were too heavily influenced by those within nVidia who primarily wanted a strong NV30GL (Quadro FX) for the professional GL apps.

    That market probably loves that the good old register combiners stayed in there, and they should like the fact that they can still work with the NV extensions they are used to. A more capable FP16/32 shader glued on top is just the way a relatively conservative market segment likes it.

    This line of thinking is IMHO why nVidia never considered the R300 approach.

    Yes, and I see this as partial evidence for my line of thinking.

    Good point ... and how the hell do you figure that out? I had the impression that the register combiners can only receive a fragment from the fragment program but can't send it back for further FP instructions. Is this right?
     
  14. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    690
    Likes Received:
    425
    Location:
    Slovenia
    You can freely mix int, fp16 and fp32 instructions in NV_fragment_program. If you look at Dawn's skin shader above, you'll notice two kinds of instructions, *X and *H. *X are integer (fiXed point), *H are half float, and there are also *R instructions, which are full float.
     
  15. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    299
    Location:
    UK
    Chalnoth: At the link you've given, in a later page, David Kirk actually suggests that you can have 32 pixels in flight...


    If you actually had to have dedicated registers for 32 units, it would explain the transistor count for sure.
    And if you didn't have all of the 32 FP32 registers in each of the 32 units, it would explain the register slowdown.

    Another possibility is that the NV30 architecture is fundamentally different from what we think ( and that's what I'd bet on )
    32 units, which can do either:
    1FP in 8 cycles
    1TEX in 4 cycles ( or 8 cycles, if dependent )
    1FX ADD in 4 cycles
    1FX MUL in 2 cycles

    ( note that those numbers are speculative, and assume 32 functional units. If some units were disabled for thermal issues or because they were buggy, those numbers might have to be divided by 2 )

    You'd then have a register pool, which would be bigger than the 32 FP32 limit. And you'd have an instruction pool, which would be shared among all pipelines.

    So, this is my guess for the whole NV30 disaster: There was so darn much headroom and parallelism they had to disable some registers, some units, and stuff just to be able to survive the thermal problems, and it requires Flow FX and a ridiculous amount of power... At the same time, GDDR-2 also produces more heat than expected.

    All nVidia would have to do with the NV35 is reduce headroom, increase per-clock efficiency by reactivating the units which were disabled for thermal reasons, and finally use DDR-I or cooler GDDR-II.
    Sounds good, doesn't it?


    Uttar
     
  16. LeStoffer

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,253
    Likes Received:
    13
    Location:
    Land of the 25% VAT
    Mix, yes, but can you within the same shader do a bit of FP (in the fragment processor), then some int (in the register combiners) and then again some FP (back in the fragment processor) going back and forth freely?

    I'm asking because the NV30 OpenGL specs only explain how you should output from the fragment processor to the register combiners (using registers TEX0 - TEX3) but not the other way back. I don't know whether this is an issue at all, but it would imply a limitation to me if you can't 'go back' to FP ins after using a int ins.
     
  17. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    690
    Likes Received:
    425
    Location:
    Slovenia
    You don't even have to jump out of NV_fragment_program at all to use ints.

    Edit: But of course you can't jump from NV_fragment_program to register combiners and back.

    Edit 2:
    Code:
    !!FP1.0
    MOVH H0, f[TEX3];
    DP4H H0.x, f[TEX2], H0;
    MULH H0, f[TEX2], H0.x;
    MOVX H1, f[COL1];
    DP4X H1.x, f[COL0], H1;
    MULH H1, f[TEX2], H1.x;
    MOVR R0, f[TEX1];
    DP4R R0.x, f[TEX0], R0;
    MADR R0, f[TEX2], R0.x, H0;
    ADDR o[COLR], R0, H1;
    This is a completely valid NV_fragment_program shader that uses int (*X), half float (*H) and full float (*R) instructions. No register combiners.
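
To make the suffix convention concrete, here is a Python sketch that classifies each instruction of the shader above by its precision suffix. The list of base opcodes is deliberately limited to the ones that appear in this thread, and `precision_of` is a hypothetical helper. Matching against known base opcodes matters because TEX ends in X without being a fixed-point instruction.

```python
from collections import Counter

PRECISION = {"X": "fixed point", "H": "fp16", "R": "fp32"}

# Only the base opcodes that appear in this thread; not the full instruction set.
BASE_OPCODES = ("MOV", "MUL", "ADD", "MAD", "DP3", "DP4", "TEX")

def precision_of(instruction: str) -> str:
    """Return the precision implied by the opcode suffix of one instruction."""
    opcode = instruction.split()[0]
    for base in BASE_OPCODES:
        if opcode.startswith(base):
            suffix = opcode[len(base):]
            return PRECISION.get(suffix, "unspecified")
    raise ValueError(f"unknown opcode: {opcode}")

SHADER = [
    "MOVH H0, f[TEX3];",
    "DP4H H0.x, f[TEX2], H0;",
    "MULH H0, f[TEX2], H0.x;",
    "MOVX H1, f[COL1];",
    "DP4X H1.x, f[COL0], H1;",
    "MULH H1, f[TEX2], H1.x;",
    "MOVR R0, f[TEX1];",
    "DP4R R0.x, f[TEX0], R0;",
    "MADR R0, f[TEX2], R0.x, H0;",
    "ADDR o[COLR], R0, H1;",
]

mix = Counter(precision_of(line) for line in SHADER)
for name in sorted(mix):
    print(f"{name}: {mix[name]}")
```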
     
  18. demalion

    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    2,024
    Likes Received:
    1
    Location:
    CT
    Hmm...I got confused by your implying shader functionality parity (aside from integer processing), and didn't realize you were excluding texture ops and complex operations.

    That's ps 1.3 functionality with no dependent texture reads or complex instructions occurring, but you're right that's a 50% per clock advantage in that circumstance. For what length of a shader is that a realistic set of limitations? Where is the data for these operations coming from?

    When running with less functionality than R300 shaders, using integer precision, using multiply operations, not using any texture ops, and using 4 component vectors alone. Did I forget instructions that allow the nv30 to exhibit similar execution per clock characteristics?

    I would characterize the nv30 design as being either fundamentally inferior in features offered or fundamentally inferior in speed with regards to shaders. I concur with your 50% clock speed advantage special case, but don't see how it refutes this description.

    EDIT: error.
     
  19. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    299
    Location:
    UK
    The NV30 is superior to the R300 if the following is true:
    1. Both INT & FP are used in the same program
    2. Few registers are used
    3. There's little scalar

    It's not THAT hard, now, is it?
    2 is Cg's job.
    3 is true in many cases.

    1 is the big problem, because DX9 doesn't support INT. Not even through extensions. I'm sure nVidia would gladly pay them $25M "under the table", or maybe even more, to get full integer support in DX9.1 extensions and the right to use FP16 registers for most operations...


    Uttar
     
  20. Luminescent

    Veteran

    Joined:
    Aug 4, 2002
    Messages:
    1,036
    Likes Received:
    0
    Location:
    Miami, Fl
    But why is Nvidia so stuck on int and not fp!!
     