Can someone tell me why ATI's PS 3.0 is better than Nvidia's?

Discussion in 'Architecture and Products' started by Redeemer, Oct 11, 2005.

  1. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,344
    Likes Received:
    176
    Location:
    On the path to wisdom
    Apart from the branching, does ATI have another advantage? I'm really curious, how well does R520 handle texldd, texldl, gradients, dependent reads, lots of live registers? Are arbitrary swizzles free now, or just resolved by the compiler? G70 has the higher raw arithmetic throughput, especially with MUL, DP and sin/cos. ATI really has to have high efficiency to make up for that with shaders that have no use for branching.
     
  2. Redeemer

    Newcomer

    Joined:
    Feb 11, 2005
    Messages:
    34
    Likes Received:
    1
  3. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
    IIRC the difference in GROMACS performance between G70 and R520 is attributed to the register space that R520 has.
     
  4. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,344
    Likes Received:
    176
    Location:
    On the path to wisdom
    It would be interesting to see how many live registers you can have before performance degrades because not enough threads can be kept in flight.
     
  5. stepz

    Newcomer

    Joined:
    Dec 11, 2003
    Messages:
    66
    Likes Received:
    3
    According to GPUBench, R520 handles dependent texturing better than G70.
     
  6. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,344
    Likes Received:
    176
    Location:
    On the path to wisdom
    How do you figure from those results?
     
  7. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
    As best I can fathom from one of Eric's messages, there is enough space for 32 registers per pixel, per thread. When you exceed that the threads start dropping, but that still doesn't give us much indication as to when performance starts dropping as well.
     
  8. Tridam

    Regular Subscriber

    Joined:
    Apr 14, 2003
    Messages:
    541
    Likes Received:
    47
    Location:
    Louvain-la-Neuve, Belgium
    I think there are 2, 3 or max 4 physical registers available when 512 threads are used.
     
  9. sireric

    Regular

    Joined:
    Jul 26, 2002
    Messages:
    348
    Likes Received:
    22
    Location:
    Santa Clara, CA
    Run GPUBench. You will see that the R520 performance is perfectly linear with the number of instructions, regardless of the number of GPRs used, up to 32. There is no fall off, since fatter threads cover more latency, and so fewer are required. As long as the product of the 'thread cycle count' times 'the number of threads' is larger than the number of cycles of latency you are hiding, all is well and GPRs are free.
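
The condition described above can be written down as a small model. This is only a sketch: the register file size, the thread limit and the latency figure are hypothetical numbers picked for illustration, not R520 specifications, and the function names are made up here.

```python
# Hypothetical figures for illustration only; they are NOT R520 specifications.
REGISTER_FILE_SLOTS = 2048   # assumed total register slots per pixel position
MAX_THREADS = 512            # assumed upper limit on threads in flight


def threads_available(regs_per_thread):
    """Threads that fit in the register file for a given register count."""
    return min(MAX_THREADS, REGISTER_FILE_SLOTS // regs_per_thread)


def latency_hidden(thread_cycles, num_threads, latency_cycles):
    """The rule from the post: cycles per thread times thread count must
    be larger than the latency being hidden."""
    return thread_cycles * num_threads > latency_cycles


def min_threads_needed(thread_cycles, latency_cycles):
    """Smallest thread count that satisfies latency_hidden()."""
    return latency_cycles // thread_cycles + 1


if __name__ == "__main__":
    latency = 200  # assumed memory latency in cycles
    for regs, cycles in [(4, 4), (32, 40)]:
        n = threads_available(regs)
        print(regs, "regs:", n, "threads, hidden:",
              latency_hidden(cycles, n, latency))
```

In this model a shader that uses more registers leaves room for fewer threads, but if each thread also runs for more cycles between fetches, fewer threads are needed, which is the "fatter threads cover more latency" argument.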
     
    Geo and Jawed like this.
  10. FX5900

    Newcomer

    Joined:
    Jan 19, 2005
    Messages:
    57
    Likes Received:
    0
    Weird how I keep reading stuff online about how the older cards will not be able to handle newer games... blah, blah... My FX5900 flashed to an FX5950 Ultra can run any game today (and probably the upcoming ones) smoothly. Until my FX5900 runs some of my favorite games (max details) at less than 45 FPS, I'm not upgrading.
     
  11. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    This board is packed with developers and professionals; can't anyone attempt to lay out the whole truth about the branching stuff?

    The layman here will try, so you may count the bullets in my feet afterwards...*ahem*.

    1. There's no doubt that dynamic branching performance is excellent on R520 and lackluster on NV4x/G7x.

    2. As long as dynamic branching is NOT a requirement, does the driver itself decide whether to use dynamic or static branching in the end (e.g. unroll the loop if the HW supports it)?

    3. Is dynamic branching really, on all occasions (except the cases where it becomes a necessity), an absolute eulogy and never ever a panacea for SIMD architectures? (Take eulogy/panacea in a relative sense.)

    4. Can anyone predict how often dynamic branching will be a necessity and how often those cases will make it into games?

    5. Does a dynamic branching performance advantage really compensate for higher ALU throughput, on an average that is as objective as possible? How many instructions per shader are we really talking about, and what size of render targets anyway? Are there any dynamic branches in tech demos for the entire screen or just a fraction of it?
     
  12. Humus

    Humus Crazy coder
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    3,217
    Likes Received:
    77
    Location:
    Stockholm, Sweden
    Gradients should be single-cycle AFAIK. Dependent texture reads I would assume are faster than previous generation due to the improved cache. I haven't analysed that myself though. Arbitrary swizzles are free.
     
  13. Humus

    Humus Crazy coder
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    3,217
    Likes Received:
    77
    Location:
    Stockholm, Sweden
    It can do so if it decides to do so. For really short branches predication will be faster on ATI too, though a short branch around an expensive texture lookup can still be beneficial. But if you're branching over say two ALU instructions, the driver will most likely replace it with predication. If there's a loop and the loop count is known at compile time, and the unrolled loop fits within the instruction slot limit, then the driver will most likely choose to unroll it.
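
The decisions described above can be sketched roughly as follows. This is not how any actual driver is written: the two-instruction threshold is simply taken from the "say two ALU instructions" remark, the 512-slot figure is the minimum instruction slot count that ps_3_0 guarantees, and all function names are hypothetical.

```python
PS30_MIN_INSTRUCTION_SLOTS = 512  # minimum slot count guaranteed by ps_3_0
SHORT_BRANCH_ALU_LIMIT = 2        # hypothetical threshold ("say two ALU instructions")


def choose_branch_strategy(alu_instructions, texture_lookups):
    """Pick predication for tiny ALU-only branches, a real branch otherwise."""
    if texture_lookups == 0 and alu_instructions <= SHORT_BRANCH_ALU_LIMIT:
        return "predicate"
    # A short branch around an expensive texture lookup can still pay off.
    return "branch"


def should_unroll(trip_count, body_instructions,
                  slot_limit=PS30_MIN_INSTRUCTION_SLOTS):
    """Unroll only if the trip count is known at compile time (not None)
    and the unrolled loop fits within the instruction slot limit."""
    if trip_count is None:
        return False
    return trip_count * body_instructions <= slot_limit


def predicated_select(condition, then_value, else_value):
    """Predication: both sides are already computed, a mask picks the result."""
    mask = 1.0 if condition else 0.0
    return mask * then_value + (1.0 - mask) * else_value
```

For example, `should_unroll(129, 3)` is true under these assumptions (387 instructions), while a loop whose count is only known at run time would be left as a loop.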
     
  14. zgemboandislic

    Newcomer

    Joined:
    Sep 15, 2005
    Messages:
    135
    Likes Received:
    0
    Have you tried the F.E.A.R. demo?
     
  15. stepz

    Newcomer

    Joined:
    Dec 11, 2003
    Messages:
    66
    Likes Received:
    3
    R520 has 1.5x G70's DEP-RAND bandwidth. As the dependent access is completely random, the caches probably don't get used. Do you think that the cache behaviour would change the results, or that the page faults kill the bandwidth enough that some other bottleneck doesn't reveal itself?
     
  16. neliz

    neliz GIGABYTE Man
    Veteran

    Joined:
    Mar 30, 2005
    Messages:
    4,904
    Likes Received:
    23
    Location:
    In the know
    Don't bother with the green goblin with his über nv35 hardware, it can still run UT2003 perfectly...
     
  17. Tridam

    Regular Subscriber

    Joined:
    Apr 14, 2003
    Messages:
    541
    Likes Received:
    47
    Location:
    Louvain-la-Neuve, Belgium
    A quick example I just tried: a Mandelbrot set algorithm rendered on moving mid-size triangles. The G7x architecture really likes it (lots of MADs, scalar and vec2 instructions). I compute 129 iterations (-> ~400 instructions):

    7800GTX : 35.6 MPix/s
    X1800XL (my XT is back to ATI) : 13.1 MPix/s

    Now I use a loop with a conditional break to exit early when more iterations are not needed:

    7800GTX : 17.7 MPix/s
    X1800XL : 29.4 MPix/s


    So yes, the dynamic branching advantage can compensate for the ALU throughput, but only in specific cases. There is no objective average here.
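
The two shader variants compared above can be sketched on the CPU like this. It is not the shader that was benchmarked: the escape test, the function names and the batch model are assumptions made for illustration. The batch helper models hardware that runs a group of pixels in lockstep, so each batch costs as many iterations as its slowest pixel.

```python
MAX_ITER = 129  # iteration count quoted in the post


def mandelbrot_fixed(cx, cy, max_iter=MAX_ITER):
    """Run every iteration; return (escape_iteration, iterations_executed)."""
    x = y = 0.0
    escape_iter = max_iter
    for i in range(max_iter):
        escaped = x * x + y * y > 4.0
        escape_iter = min(escape_iter, i) if escaped else escape_iter
        # Predicated update: an escaped point keeps its old value.
        next_x = x * x - y * y + cx
        next_y = 2.0 * x * y + cy
        x = x if escaped else next_x
        y = y if escaped else next_y
    return escape_iter, max_iter


def mandelbrot_early_out(cx, cy, max_iter=MAX_ITER):
    """Break out of the loop as soon as the point has escaped."""
    x = y = 0.0
    executed = 0
    for i in range(max_iter):
        executed += 1
        if x * x + y * y > 4.0:
            return i, executed
        x, y = x * x - y * y + cx, 2.0 * x * y + cy
    return max_iter, executed


def simd_batch_cost(points, batch_size, kernel):
    """Iterations paid when each batch runs as long as its slowest point."""
    total = 0
    for start in range(0, len(points), batch_size):
        batch = points[start:start + batch_size]
        total += max(kernel(cx, cy)[1] for cx, cy in batch)
    return total
```

In this model a batch made only of quickly escaping points gets cheap with the early-out version, while a single point inside the set keeps the whole batch running for all 129 iterations, which is one way to see why the gain shows up "only in specific cases".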
     
    #57 Tridam, Oct 12, 2005
    Last edited by a moderator: Oct 12, 2005
    Jawed likes this.
  18. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    It just occurred to me that I don't particularly understand why dynamic branching is now getting so much attention. I mean, branching and looping are fundamental programming concepts - why wasn't this built into the earliest shader models, back in the GF3 days?

    Or was it just a matter of transistor budget?
     
  19. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
    Shader Model 3 was the first time Dynamic Branching was mandated as a requirement in the pixel pipeline.

    Previous shader models were very limited in their programming constructs and capabilities - DX8 could only execute 8 instructions, which doesn't really leave much room for looping and branching.
     
  20. Frank

    Frank Certified not a majority
    Veteran

    Joined:
    Sep 21, 2003
    Messages:
    3,187
    Likes Received:
    59
    Location:
    Sittard, the Netherlands
    If dynamic branching works well, it opens up the possibility for more general algorithms, which are closer to how developers think it ought to work. And it allows them to implement things the way CGI does it, without having to reinvent the wheel all over again. And it offers the possibility of using real libraries. So, it's easier, faster to develop and offers more possibilities.

    As for the speed: it depends entirely on what you do and how you do it. It's just an additional tool, not a speed-up by itself.
     