Asking Tim Sweeney about NVIDIA and more

Discussion in 'Beyond3D News' started by Reverend, Sep 29, 2003.

  1. Simon F

    Simon F Tea maker
    Moderator Veteran

    Joined:
    Feb 8, 2002
    Messages:
    4,560
    Likes Received:
    157
    Location:
    In the Island of Sodor, where the steam trains lie
    "Curiouser and Curiouser". One wonders, then, what the point was of a certain spread-sheet that listed maximum instruction slots <shrug>.
    I'll have to leave that to others. Recently, I've only really looked in-depth at the pre-rasteriser stages. ... (well... apart from Texture Comp of course).
     
  2. Entropy

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,062
    Likes Received:
    1,024
    Actually, for a graphics ASIC, perhaps it should be.
    This topic was discussed at length in here:
    http://www.beyond3d.com/forum/viewtopic.php?t=6223&highlight=precision

    Basically, the integer alternative is marginal in precision and lacks dynamic range; fp16 has sufficient dynamic range but still offers marginal precision; and fp24 offers both precision and dynamic range to spare. fp32 is better still, but taking the limitations of the target output (displays) into account, as well as the intended use of the ASIC, the extra precision on offer will go unused except for research type applications, and then primarily research into precision effects in rendering. Games care not.

    You could make a rather convincing argument in favour of fp16 being enough of an incremental step up from the 24bit integer (8bit integer actually if we want consistent nomenclature) that we used up to DX9, since fp16 offered both improved precision and the desired flexibility in terms of dynamic range. Especially since the DX9 API is primarily targeted towards gaming. The step up to fp24 in order to alleviate the remaining precision anxiety was quite ambitious.

    That is not to say that you can't construct cases where fp24 does not suffice in terms of precision. Of course you can. But are these examples relevant for the target application? If you manage to actually find such an example, can't you reformulate your algorithm in a numerically more stable form? If this isn't possible, couldn't the algorithm be replaced with something functionally equivalent, but less demanding in terms of numerical precision? And if all these questions are given their absolute worst case answers, is the trade-off of 32bit fp in either performance (less parallelism, possibly lower clock) or cost (larger dies, plus a higher ratio of defective dies) actually worth it in the greater scheme of things?

    I think you'll find fp24 to be a quite natural choice.
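
The precision versus dynamic range comparison above can be put into rough numbers. The sketch below is only an illustration: the bit layouts (fp16 with 5 exponent and 10 mantissa bits, fp24 with 7 and 16, fp32 with 8 and 23) are the commonly cited ones rather than anything stated in this thread, and the range figure assumes IEEE-style reserved exponent codes, which real hardware may not follow.

```python
import math

# Assumed (exponent bits, mantissa bits) per format; not taken from the thread.
FLOAT_FORMATS = {
    "fp16": (5, 10),
    "fp24": (7, 16),
    "fp32": (8, 23),
}

def float_summary(exp_bits, man_bits):
    """Return (relative step just above 1.0, dynamic range in powers of two)."""
    rel_step = 2.0 ** -man_bits
    # Assumption: all-zeros and all-ones exponent codes are reserved, IEEE style,
    # so the ratio of largest to smallest normal value is about 2**(2**e - 2).
    range_stops = (2 ** exp_bits) - 2
    return rel_step, range_stops

def int_summary(bits):
    """Fixed point over [0, 1]: uniform step, range from one step up to 1.0."""
    levels = 2 ** bits - 1
    return 1.0 / levels, math.log2(levels)

def report():
    rows = {"int8": int_summary(8)}
    for name, (exp_bits, man_bits) in FLOAT_FORMATS.items():
        rows[name] = float_summary(exp_bits, man_bits)
    return rows

if __name__ == "__main__":
    for name, (step, stops) in report().items():
        print(f"{name}: step ~ {step:.3g}, range ~ {stops:.0f} powers of two")
```

Under these assumptions fp16 already dwarfs 8-bit integer in range while its step near 1.0 is only a few times finer, which is the "sufficient range, marginal precision" point; fp24 then adds six more mantissa bits.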
     
  3. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,079
    Likes Received:
    648
    Location:
    O Canada!
    I'm still a little annoyed that I haven't got the MS presentation from Shader Day - they had an excellent slide explaining why FP24 is a great choice. Basically the range that FP24 can cover more or less matches the range that the rods and cones of the human eye perceive.
     
  4. Entropy

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,062
    Likes Received:
    1,024
    If only our output devices were anywhere close to that.....
     
  5. Dio

    Dio
    Veteran

    Joined:
    Jul 1, 2002
    Messages:
    1,758
    Likes Received:
    8
    Location:
    UK
    Are you arguing that if R3xx did not have 24-bit precision, the standard for DX9 would be 32-bit?
     
  6. vb

    vb
    Regular

    Joined:
    Jun 5, 2003
    Messages:
    367
    Likes Received:
    2
    and let's not forget that 8 bit integer was actually 10 bit, 12 bit and even 16 bit integer in former and current hardware, so fp16 doesn't sound like such an improvement (wrt precision at least)
     
  7. jimbob0i0

    Newcomer

    Joined:
    Jul 26, 2003
    Messages:
    115
    Likes Received:
    0
    Location:
    London, UK
    hmm I think that if NV had been honest with performance figures 2 years ago 16bit would be DX9 minimum standard...

    if NV behaved as they have done and ATi managed to fit a 32bit pipeline into their .15 transistor budget instead of the 24bit (i know... highly unlikely but work with me here...) then the minimum would have been 32bit - which would have <really> screwed NV

    if ATi had gone with a 16bit setup then again the DX-minimum would have been 16bit - all IMHO of course.
     
  8. Entropy

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,062
    Likes Received:
    1,024
    Well, fp16 isn't much of an improvement wrt precision. Note that I'm assuming here that just as IHVs used higher internal precision for their calculations in order to reduce error propagation problems, that nVidia does the same for fp16. It would really surprise me if they didn't.
    DX9 was designed with fp24 as a minimum requirement though, which made some things possible which wouldn't have been with fp16 as a minimum requirement, and giving comfortable margins in other places.

    When discussing calculational precision, practicality always rules. Striving for ever higher precision for idealistic reasons is silly and pointless. The list of successive questions I put above is used without qualms in the field of scientific codes. I don't see why games should be held to higher standards.
     
  9. Simon F

    Simon F Tea maker
    Moderator Veteran

    Joined:
    Feb 8, 2002
    Messages:
    4,560
    Likes Received:
    157
    Location:
    In the Island of Sodor, where the steam trains lie
    Damn it! I demand support for the GNU MP library in DX10 HW! :D
     
  10. Sxotty

    Veteran

    Joined:
    Dec 11, 2002
    Messages:
    4,895
    Likes Received:
    344
    Location:
    PA USA
    Well they are completely full of crap then, but that is not surprising. Look at the reason Carmack originally asked for HW vendors to include higher precision: when math is done, rounding error leads to visual artifacts. We do not need fp24 to get the range the human eye can see. We need whatever precision will, after all the calculations are done, give the correct visual range, and this varies based on the calculations. The best you could say is that in 90% (or some arbitrary number) of the calculations used in games, fp24 gives the dynamic range the eye can perceive.

    But I guarantee that no one has gone thru and checked every shader in every game to come up with this

    Other sources claim 16.7 million, I suppose there is a multiple of gray shades by color hues, this is possible to represent on a tnt2. In any case I am not trying to be a troll, just saying that the reasoning on why we have the fp24 format is just that it is good enough most of the time and not some omnipotent standard that is for some intrinsic reason "good".
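
The point about rounding error building up over a chain of calculations can be shown with a toy experiment. This is a minimal sketch, not a model of any real GPU: it assumes mantissa widths of 10, 16 and 23 bits for fp16, fp24 and fp32, ignores exponent range limits, and uses a made-up workload (adding 0.1 a thousand times) purely for illustration.

```python
import math

def round_to_mantissa(x, man_bits):
    """Round x to man_bits explicit mantissa bits (exponent range is ignored)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)              # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2.0 ** (man_bits + 1)     # +1 accounts for the implicit leading bit
    return math.ldexp(round(m * scale) / scale, e)

def accumulate(step, count, man_bits):
    """Add step to a running total count times, rounding after every add."""
    step = round_to_mantissa(step, man_bits)
    total = 0.0
    for _ in range(count):
        total = round_to_mantissa(total + step, man_bits)
    return total

def errors(step=0.1, count=1000):
    """Absolute error of the rounded running sum for each assumed format."""
    exact = step * count
    formats = (("fp16", 10), ("fp24", 16), ("fp32", 23))
    return {name: abs(accumulate(step, count, bits) - exact)
            for name, bits in formats}

if __name__ == "__main__":
    for name, err in errors().items():
        print(f"{name}: absolute error after 1000 adds = {err:.6g}")
```

How much error is tolerable depends entirely on the length and shape of the calculation, which is why no single format can be declared sufficient for every shader.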
     
  11. Entropy

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,062
    Likes Received:
    1,024
    :D
    Ok, ok, I'm talking from the perspective of science. Us lowly grunts have no true appreciation of mathematics and the purity of numbers. Seriously, if the highest level of fast precision isn't enough, in my field we go back to the drawing board and find another approach to the problem. If compromises are required, well, that's what will be used. It works as long as people know what they are doing. I don't think I've ever used higher than fp72 (on old Cybers that had a 36 bit standard word length). There always seem to be ways around problems with precision.
     
  12. DeanoC

    DeanoC Trust me, I'm a renderer person!
    Veteran Subscriber

    Joined:
    Feb 6, 2003
    Messages:
    1,469
    Likes Received:
    185
    Location:
    Viking lands
    Rubbish, that statement is simply untrue. And I actually checked rather than trusting your or anyone else's word.

    The HLSL compiler (I haven't got the SDK update on this machine, so it may have changed) does do as you say and expand the sin intrinsic to the approximation when compiling. See Example 1 below.

    The assembler, on the other hand, produces the sincos macro in the object code when asked. This can be processed however the driver chooses; it is a single token exactly the same as any other instruction. See Example 2.

    Notice the 'object' code in the comments at the end of both listings; this is the EXACT thing that the driver gets, unprocessed and unchanged. Clearly shorter in the second case, but maybe you still don't believe me. So let's create a shader of 13 simple instructions (the same length as Example 2) and see how long the object code is. See Example 3.

    Longer in Example 3. WHY? Because SINCOS is passed to the driver as a single instruction that can take UP TO 8 instruction slots.

    Case Closed.

    ------------------------------------------------------------------
    Example 1:
    Save the following code to test.hlsl and compile this with
    fxc.exe /Tps_2_0 /Fccode.txt test.hlsl

    Code:
    float4 main(float2 stuff : TEXCOORD0) : COLOR0
    {
      return sin( stuff.x );
    }
    
    code.txt will have the following body
    Code:
    //
    // Generated by Microsoft (R) D3DX9 Shader Compiler
    //
    //  Source: test.psh
    //  Flags: /E:main /T:ps_2_0 
    //
    
        ps_2_0
        def c0, -0.5, 0, 0, 1
        def c1, 0.159155, 6.28319, -3.14159, 0.25
        def c2, -2.52399e-007, -0.00138884, 0.0416666, 2.47609e-005
        dcl t0.x
        mad r7.w, t0.x, c1.x, c1.w
        frc r2.w, r7.w
        mad r4.w, r2.w, c1.y, c1.z
        mul r11.w, r4.w, r4.w
        mad r1.w, r11.w, c2.x, c2.w
        mad r3.w, r11.w, r1.w, c2.y
        mad r5.w, r11.w, r3.w, c2.z
        mad r7.w, r11.w, r5.w, c0.x
        mad r9, r11.w, r7.w, c0.w
        mov oC0, r9
    
    // approximately 10 instruction slots used
    
    
    // 0000:  ffff0200  000cfffe  42415443  00000014  _......_CTAB.___
    // 0010:  00000014  ffff0200  00000000  00000000  .____...________
    // 0020:  58443344  68532039  72656461  6d6f4320  D3DX9 Shader Com
    // 0030:  656c6970  abab0072  05000051  a00f0000  piler_..Q__.__..
    // 0040:  bf000000  00000000  00000000  3f800000  ___.__________.?
    // 0050:  05000051  a00f0001  3e22f983  40c90fdb  Q__.._....">...@
    // 0060:  c0490fdb  3e800000  05000051  a00f0002  ..I.__.>Q__.._..
    // 0070:  b4878163  bab609ba  3d2aaaa4  37cfb5a1  c.........*=...7
    // 0080:  0200001f  80000000  b0010000  04000004  .__.___.__...__.
    // 0090:  80080007  b0000000  a0000001  a0ff0001  ._..___..__.._..
    // 00a0:  02000013  80080002  80ff0007  04000004  .__.._..._...__.
    // 00b0:  80080004  80ff0002  a0550001  a0aa0001  ._..._..._U.._..
    // 00c0:  03000005  8008000b  80ff0004  80ff0004  .__.._..._..._..
    // 00d0:  04000004  80080001  80ff000b  a0000002  .__.._..._...__.
    // 00e0:  a0ff0002  04000004  80080003  80ff000b  ._...__.._..._..
    // 00f0:  80ff0001  a0550002  04000004  80080005  ._..._U..__.._..
    // 0100:  80ff000b  80ff0003  a0aa0002  04000004  ._..._..._...__.
    // 0110:  80080007  80ff000b  80ff0005  a0000000  ._..._..._..___.
    // 0120:  04000004  800f0009  80ff000b  80ff0007  .__.._..._..._..
    // 0130:  a0ff0000  02000001  800f0800  80e40009  __...__._...._..
    // 0140:  0000ffff                                ..__
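
To see what the expanded listing in Example 1 actually computes, the instruction stream can be replayed on the CPU. The sketch below copies the constants from the `def c0`/`c1`/`c2` lines and follows the listing one instruction at a time; it runs in double precision, so it says nothing about what fp16 or fp24 hardware would return, and the function names are my own.

```python
import math

# Constants copied from the def c0/c1/c2 lines of the Example 1 listing.
C0 = (-0.5, 0.0, 0.0, 1.0)
C1 = (0.159155, 6.28319, -3.14159, 0.25)
C2 = (-2.52399e-07, -0.00138884, 0.0416666, 2.47609e-05)

def frc(v):
    """Fractional part, taken here as v - floor(v)."""
    return v - math.floor(v)

def shader_sin(x):
    """Replay the Example 1 instructions; each line mirrors one shader op."""
    r7 = x * C1[0] + C1[3]       # mad r7.w, t0.x, c1.x, c1.w   (x/2pi + 0.25)
    r2 = frc(r7)                 # frc r2.w, r7.w               (wrap to [0, 1))
    r4 = r2 * C1[1] + C1[2]      # mad r4.w, r2.w, c1.y, c1.z   (angle in [-pi, pi))
    r11 = r4 * r4                # mul r11.w, r4.w, r4.w
    r1 = r11 * C2[0] + C2[3]     # mad r1.w, r11.w, c2.x, c2.w
    r3 = r11 * r1 + C2[1]        # mad r3.w, r11.w, r1.w, c2.y
    r5 = r11 * r3 + C2[2]        # mad r5.w, r11.w, r3.w, c2.z
    r7 = r11 * r5 + C0[0]        # mad r7.w, r11.w, r5.w, c0.x
    return r11 * r7 + C0[3]      # mad r9, r11.w, r7.w, c0.w

if __name__ == "__main__":
    for x in (0.0, 0.5, 1.0, 2.0, -3.0):
        print(f"x={x:+.1f}  shader={shader_sin(x):+.6f}  math.sin={math.sin(x):+.6f}")
```

Read this way, the first three instructions are a range reduction that shifts the angle by a quarter turn, and the remaining mul/mad chain is an even polynomial in the reduced angle, i.e. a cosine-style series evaluated at x minus pi/2.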
    
    Example 2:
    Save the following code to test.psh and assemble this with
    psa.exe /Fccode.txt test.psh
    Code:
    ;Note the other instructions are purely to get it to assemble, sincos is a fussy instruction and I didn't play for long
    ps_2_0
    
    dcl t0.x
    mov r1.xyzw, t0.xxxx
    mov r0.xyzw, r1
    sincos r0.x, r1.x, c0, c1
    mov r1.xyzw, t0.xxxx
    mov r1.x, r0.x
    mov oC0, r1
    
    code.txt then has the following
    Code:
    
        ps_2_0
        dcl t0.x
        mov r1, t0.x
        mov r0, r1
        sincos r0.x, r1.x, c0, c1
        mov r1, t0.x
        mov r1.x, r0.x
        mov oC0, r1
    
    // approximately 13 instruction slots used
    
    
    // 0000:  ffff0200  0200001f  80000000  b0010000  _....__.___.__..
    // 0010:  02000001  800f0001  b0000000  02000001  .__.._..___..__.
    // 0020:  800f0000  80e40001  04000025  80010000  __..._..%__.__..
    // 0030:  80000001  a0e40000  a0e40001  02000001  .__.__..._...__.
    // 0040:  800f0001  b0000000  02000001  80010001  ._..___..__.._..
    // 0050:  80000000  02000001  800f0800  80e40001  ___..__._...._..
    // 0060:  0000ffff                                .._
    
    Example 3:
    Exactly the same as Example 2 but replace the sincos with 8 mov r1,t0.x

    Code:
        ps_2_0
        dcl t0.x
        mov r1, t0.x
        mov r0, r1
        mov r1, t0.x
        mov r1, t0.x
        mov r1, t0.x
        mov r1, t0.x
        mov r1, t0.x
        mov r1, t0.x
        mov r1, t0.x
        mov r1, t0.x
        mov r1, t0.x
        mov r1.x, r0.x
        mov oC0, r1
    
    // approximately 13 instruction slots used
    
    
    // 0000:  ffff0200  0200001f  80000000  b0010000  _....__.___.__..
    // 0010:  02000001  800f0001  b0000000  02000001  .__.._..___..__.
    // 0020:  800f0000  80e40001  02000001  800f0001  __..._...__.._..
    // 0030:  b0000000  02000001  800f0001  b0000000  ___..__.._..___.
    // 0040:  02000001  800f0001  b0000000  02000001  .__.._..___..__.
    // 0050:  800f0001  b0000000  02000001  800f0001  ._..___..__.._..
    // 0060:  b0000000  02000001  800f0001  b0000000  ___..__.._..___.
    // 0070:  02000001  800f0001  b0000000  02000001  .__.._..___..__.
    // 0080:  800f0001  b0000000  02000001  800f0001  ._..___..__.._..
    // 0090:  b0000000  02000001  80010001  80000000  ___..__.._..___.
    // 00a0:  02000001  800f0800  80e40001  0000ffff  .__._...._....__
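
The dumps above can also be checked mechanically rather than by eye. The sketch below walks the Example 2 object code token by token. The opcode numbers and the bit positions of the length fields reflect my understanding of the D3D9 shader token format and should be treated as assumptions; the word list itself is copied from the Example 2 dump.

```python
# Example 2 object code, copied word for word from the dump above.
EXAMPLE_2 = """
ffff0200 0200001f 80000000 b0010000
02000001 800f0001 b0000000 02000001
800f0000 80e40001 04000025 80010000
80000001 a0e40000 a0e40001 02000001
800f0001 b0000000 02000001 80010001
80000000 02000001 800f0800 80e40001
0000ffff
"""

# Assumed opcode values (low 16 bits of an instruction token).
OPCODES = {0x01: "mov", 0x04: "mad", 0x05: "mul", 0x13: "frc",
           0x1F: "dcl", 0x25: "sincos", 0x51: "def"}
END_TOKEN = 0x0000FFFF
COMMENT_OPCODE = 0xFFFE

def parse(dump):
    """Turn whitespace-separated hex words into a list of integers."""
    return [int(word, 16) for word in dump.split()]

def walk(words):
    """Return the instruction names found in a ps_2_0 token stream."""
    names = []
    i = 1                                    # words[0] is the version token
    while i < len(words) and words[i] != END_TOKEN:
        token = words[i]
        opcode = token & 0xFFFF
        if opcode == COMMENT_OPCODE:
            # Assumed: bits 16-30 hold the comment length in words.
            i += 1 + ((token >> 16) & 0x7FFF)
            continue
        names.append(OPCODES.get(opcode, hex(opcode)))
        # Assumed: bits 24-27 hold the number of operand words that follow.
        i += 1 + ((token >> 24) & 0xF)
    return names

if __name__ == "__main__":
    words = parse(EXAMPLE_2)
    print(len(words), "words:", " ".join(walk(words)))
```

On this reading, sincos appears in the stream as one instruction token followed by four operand words, and the whole Example 2 dump is 25 words against 44 for Example 3, which matches the argument that the macro reaches the driver unexpanded.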
    
     
  13. DeanoC

    DeanoC Trust me, I'm a renderer person!
    Veteran Subscriber

    Joined:
    Feb 6, 2003
    Messages:
    1,469
    Likes Received:
    185
    Location:
    Viking lands
    Oops Double Post
     
  14. RussSchultz

    RussSchultz Professional Malcontent
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,855
    Likes Received:
    55
    Location:
    HTTP 404
    So you're saying the HLSL compiler doesn't use the sincos ASM token, it always expands it, but if you're programming in PS2.0 ASM, you're ok and the token will be re-interpreted by the driver if necessary.

    Sounds to me like the HLSL compiler needs a kick in the pants.
     
  15. vb

    vb
    Regular

    Joined:
    Jun 5, 2003
    Messages:
    367
    Likes Received:
    2
    Well, the operation may be done at fp32, but in case of nv3x, the register storing the result will always be fp16 (due to well known limitations) which rather cancels any advantage.
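
If the result register really is fp16, as claimed above, the effect of computing at higher precision and then storing narrow can be imitated on the CPU. This is a minimal sketch using Python's `struct` half-precision format as a stand-in for an fp16 register; the texture-coordinate scenario and the function names are hypothetical, and it makes no claim about how nv3x hardware actually behaves.

```python
import struct

def store_fp16(value):
    """Round-trip through IEEE half precision, standing in for an fp16 register."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

def store_fp32(value):
    """Round-trip through IEEE single precision, standing in for an fp32 register."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

def scaled_coordinate(u, texture_size, store):
    """Multiply at full (double) precision, then keep only what the register holds."""
    return store(u * texture_size)

if __name__ == "__main__":
    for store in (store_fp16, store_fp32):
        texel = scaled_coordinate(0.3, 2048, store)
        print(f"{store.__name__}: 0.3 * 2048 -> {texel!r}")
```

Between 512 and 1024 an fp16 value can only land on multiples of 0.5, so the stored result snaps to the nearest half texel no matter how precisely the multiply itself was carried out.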
     
  16. DeanoC

    DeanoC Trust me, I'm a renderer person!
    Veteran Subscriber

    Joined:
    Feb 6, 2003
    Messages:
    1,469
    Likes Received:
    185
    Location:
    Viking lands
    Agreed, it's very odd; it looks like the HLSL compiler doesn't know about sincos. But note I'm not using the latest compiler and it has been upgraded significantly. I'll install it later.

    I'm going to have a better play later to see a) what the new SDK update does in both ps_2_0 and ps_2_0a (GFFX specific(ish)) and b) whether it is caused by not forcing the input into range beforehand (the HLSL compiler has been known to do similar silly 'optimisation' in similar cases).

    If I can't get better results with some fiddling, I'll see what MS have to say about the matter.
     
  17. demalion

    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    2,024
    Likes Received:
    1
    Location:
    CT
    Your method of "flipping" has no logical relation to my questions and scenario...as long as you aren't proposing that you've answered them by some principle of logical converse with your series of questions, that's perfectly fine. I would be interested in those answers at some point, though. :?:

    This seems rhetorical, as apparently it does seem strange to you. To answer: no, it doesn't seem strange to me. Do you have a particular reason that it seems strange to you that we could discuss?

    You've gotten some mathematical/scientific answers relating to the use of it for DX 9. Why are they not enough?

    Are you making a silent assumption that ATI picked fp24, and then dictated fp24 would be used to Microsoft, and thus, this was "strange"?
    It seems obvious that if fp24 was determined to be enough in discussion, and then ATI decided to implement it, that this would not be strange, right? What alternative do you propose to this, and why...or do you really think this latter explanation is "strange"?

    Was it "strange" that 24-bit floating point was used for this (pdf) new mobile phone 3D accelerator? How about this signal processing chip design from 1995? How about this (pdf) discussion from 1997?

    My question remains: does fp24 make sense for a minimum requirement for hardware processing precision? Practical, scientific, and mathematical discussion seem to support that it does, even when called "strange". Do you have more thoughts on the matter besides that label?

    No, it is not "purely" a coincidence when a standard and an IHV both do something that makes sense. Their making sense would be the relationship that prevents that. This is not to say that another relationship, like "picking on nVidia", can be arbitrarily inserted after this "not being a coincidence" is answered, nor that "not being a coincidence" makes it "strange".

    The answer seems to be that the rest of the spec and discussions revolved around having higher precision than fp16 for full precision...you realize that the 'clarification' in response to nVidia's assumption did not move down from fp32->fp24, right?

    This answer seems to make sense given the limitations of fp16, what fp24 allows, and that fp16 is still offered with the partial (aka, not "full") precision hint...right? :shock:
    Or are you proposing that HDR implementations shouldn't have been allowed to be expressed in a straightforward fashion in the spec as the result of no programmer being able to take more than fp16 full precision for granted? This would seem "strange" to me.

    "Which IHV thought they could lock down the graphics industry within their own standard that they had the final word on?"

    Both of these are implications. Holding a conversation based solely on them seems a silly way to go...because then we could just go back and forth, and "prove" that MS or nVidia killed Kennedy. :-? (Alright, I'll concede that MS could probably afford a time machine, but still...)
    If you have some logical relationship related to your implication that answers the other logical commentary presented in this discussion, please provide it.

    You see, logically, this supports my implication concerning nVidia's problems, not yours. The only thing that ties it to yours is a "silent" assumption of the nature that "MS based their decisions solely on ensuring that nVidia was at a disadvantage".

    A major problem with such an assumption (when you bring logic into it) is that nVidia would still seem to have been put at a disadvantage if the spec specified FP16 for full precision (there were drivers like this...remember how the NV3x fared with them?), and that, aside from allowing the NV3x to use more registers before choking, it would only have resulted in impeding things like HDR.

    How is it logical to conclude that it makes more sense that minimum full precision spec is fp24 to prevent nVidia from using more registers, even when it wasn't the mechanism for achieving that, rather than the minimum full precision spec is fp24 to allow things like facilitating HDR?

    Is your entire premise: "It isn't a coincidence, so it must be a conspiracy to pick on nVidia"?

    EDIT: quotation typo
     
  18. Simon F

    Simon F Tea maker
    Moderator Veteran

    Joined:
    Feb 8, 2002
    Messages:
    4,560
    Likes Received:
    157
    Location:
    In the Island of Sodor, where the steam trains lie
    Thanks Dean,
    I thought that was the case but I wasn't certain. Odd that, for some reason HLSL seems to auto-expand several macros (although, AFAIK, it doesn't do that universally) <shrug>.
     
  19. Anonymous

    Veteran

    Joined:
    May 12, 1978
    Messages:
    3,263
    Likes Received:
    0
    What might Anthony be implying? Is he saying that Tim is there with him right now? Does this mean their relationship is more intimate than anyone realized? Or could it be they are working together on a plot to topple FutureMark? Does this confirm what we've been speculating all along, Anand really is the love child of Tom Pabst and Asia Carrera?

    As the 3D world turns...
     
  20. digitalwanderer

    digitalwanderer Dangerously Mirthful
    Legend

    Joined:
    Feb 19, 2002
    Messages:
    17,276
    Likes Received:
    1,788
    Location:
    Winfield, IN USA
    Kyle? Is this why you haven't replied yet in that other thread? :|
     