R520 info thread #(9!)^9

Discussion in 'Pre-release GPU Speculation' started by Geeforcer, Sep 30, 2005.

  1. Hubert

    Newcomer

    Joined:
    Sep 16, 2003
    Messages:
    151
    Likes Received:
    0
    Location:
    Transsylvania
    The performance loss in 2048x1536 (no AA or AF) is very interesting.

    http://www.anandtech.com/video/showdoc.aspx?i=2552&p=10
     
  2. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    But when are batches going to be coherent in exact multiples of 1024 pixels (in G70) or 4096 pixels (in NV4x)?

    The problem in those architectures is the batch size. See the tree shadow efficiency example:

    http://common.ziffdavisinternet.com/util_get_image/10/0,1425,sz=1&i=108132,00.jpg

    from: http://www.extremetech.com/article2/0,1697,1867119,00.asp

    (better presentation of this than B3D's)

    Apparently the batch size is 16 in R520. 16 is just so small I'm having a really hard time believing it. Never in my wildest dreams about small batches did I think they'd get that small.

    Jawed
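A rough way to see why batch size matters for dynamic branching is to count how many batches straddle a branch boundary and therefore have to execute both sides. The sketch below is only an illustration: the disc-shaped "shadow" mask, the 256x256 render target and the square tile shapes (4x4 = 16, 32x32 = 1024 and 64x64 = 4096 pixels) are assumptions made for the example, not a description of how R520, G70 or NV4x actually form batches.

```python
def disc_mask(size, radius):
    """Per-pixel branch outcome: True inside a centred disc (a stand-in for a shadow)."""
    centre = size / 2
    return [[(x + 0.5 - centre) ** 2 + (y + 0.5 - centre) ** 2 <= radius ** 2
             for x in range(size)]
            for y in range(size)]


def divergent_fraction(mask, tile):
    """Fraction of tile x tile batches whose pixels do not all take the same branch."""
    height, width = len(mask), len(mask[0])
    divergent = 0
    total = 0
    for y0 in range(0, height, tile):
        for x0 in range(0, width, tile):
            outcomes = {mask[y][x]
                        for y in range(y0, min(y0 + tile, height))
                        for x in range(x0, min(x0 + tile, width))}
            total += 1
            if len(outcomes) > 1:
                divergent += 1  # mixed outcomes: the batch pays for both paths
    return divergent / total


if __name__ == "__main__":
    mask = disc_mask(256, 100)  # hypothetical 256x256 target with a round shadow
    for tile in (4, 32, 64):    # 16, 1024 and 4096 pixels per batch
        fraction = divergent_fraction(mask, tile)
        print(f"{tile * tile:5d}-pixel batches: {fraction:.1%} divergent")
```

With equal-sized tiles, the divergent fraction is also the fraction of pixels that pay for both branch paths, which is the inefficiency the tree shadow example illustrates.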
     
  3. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    Static branching performance means nothing, because static branches are unrolled by the shader compiler and the hardware never sees a "static" branch.
     
  4. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    Well, in principle it should be easy with a multithreaded architecture to have batch sizes of one pixel. But it's not feasible in reality because you want a quad architecture, and I think that in ATI's case, they want to run the same instruction on all four quads.

    The one thing you have to keep in mind, however, is that the small batch size cannot be maintained over a large number of batches without stalling. You need some big batches to keep the pipelines full.
     
  5. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    Explain your reasoning for that.

    All batches are the same size so I don't know where you get the idea that batch sizes will vary.

    Jawed
     
  6. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
    Chalnoth, like R420, the quads are still working on separate tiled regions - they are independent of one another (which is why they go down to one quad in the low end, but still have the same batch sizes (when they are only one shader pixel!)). More likely the reason they work on 4x4 pixels (in fact, it should probably be represented as 4x(2x2)) is due to the shader instruction latency - each batch is probably executed over 4 cycles to cope with, and hide, the inherent ALU operation latency.
     
  7. Rys

    Rys Graphics @ AMD
    Moderator Veteran Alpha

    Joined:
    Oct 9, 2003
    Messages:
    4,182
    Likes Received:
    1,579
    Location:
    Beyond3D HQ
    I think it's (2x2) x 4 x 4 (queue depth) for 64 pixel batches, not 16. Could be wrong though.
     
  8. RoOoBo

    Regular

    Joined:
    Jun 12, 2002
    Messages:
    308
    Likes Received:
    31
    Another reason is to reduce the amount of required thread state. One PC (program counter) per 16 fragments is relatively cheap for 2048 fragments; one PC per fragment is expensive if you have 2048 fragments. The size of the thread window is also limited if, as it would seem, they implement out-of-order execution of those threads.
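The arithmetic behind that state argument can be sketched as follows. The function name is made up for the example, and the only numbers used are the 2048 fragments and the 16-fragment batch quoted in the post; the width of a program counter is left out because the thread does not state it.

```python
def program_counters_needed(fragments_in_flight, fragments_per_batch):
    """Number of program counters if every batch shares a single PC."""
    # Ceiling division: a partially filled batch still needs its own PC.
    return -(-fragments_in_flight // fragments_per_batch)


if __name__ == "__main__":
    for batch in (16, 1):
        count = program_counters_needed(2048, batch)
        print(f"{batch:2d} fragment(s) per batch -> {count} program counters")
```

Under these assumptions, 16-fragment batches need 128 program counters for 2048 fragments in flight, against 2048 if every fragment were tracked separately.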
     
  9. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
    It's 16 pixels per batch, not 16 quads per batch. Each batch will operate on a single quad over 4 cycles.
     
  10. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    They're all the same minimum size. The hardware won't have the local storage required to hide texture read latencies if not enough pixels are in-flight, and if there are too many small batches, those batches will require too much storage individually to have enough pixels in-flight for texture latency hiding.
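The latency-hiding argument can be put into a back-of-the-envelope model. This is a simplified sketch, not a description of R520's scheduler: it assumes a single ALU shared round-robin between batches, a fixed texture latency, and a fixed number of ALU cycles each batch can run between texture fetches. The 200-cycle latency and the 20 ALU cycles in the usage example are hypothetical values chosen only to make the arithmetic concrete.

```python
def batches_to_hide_latency(texture_latency_cycles, alu_cycles_between_fetches):
    """Batches needed so the ALU always has work while one batch waits on a texture.

    While one batch waits, the other (n - 1) batches must supply at least
    texture_latency_cycles of ALU work between them.
    """
    others = -(-texture_latency_cycles // alu_cycles_between_fetches)  # ceiling
    return others + 1


if __name__ == "__main__":
    PIXELS_PER_BATCH = 16        # batch size discussed in the thread
    latency = 200                # hypothetical texture latency, in cycles
    alu_cycles = 20              # hypothetical ALU work per batch between fetches
    batches = batches_to_hide_latency(latency, alu_cycles)
    print(batches, "batches,", batches * PIXELS_PER_BATCH, "pixels in flight")
```

In this model the number of pixels that must be in flight depends on the latency and on the work available per batch, so smaller batches mean more batches, and more per-batch state, for the same latency.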
     
  11. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    Ah, okay, that makes some sense. But wouldn't that also mean that the pipelines are empty for four cycles at each thread switch? That's not a big price to pay for "normal" operation, of course, but would be another reason that lots of small batches would be bad.
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    Aha! So each batch requires four phases.

    So four distinct batches are executing simultaneously in R520.

    So, do you know the texture pipe organisation of batches? Is one batch executing in parallel across all 16 texture pipes, or does a 16-pixel batch get textured in four phases by 4 texture pipes?

    Jawed
     
  13. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,058
    Likes Received:
    3,116
    Location:
    New York
    It's the latter according to Dave:

     
  14. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    Thanks.

    It seems the scheduler in R520 is smarter than I was expecting.

    I wish it was like Xenos's scheduler. Maybe it is, and I've just badly, badly, misinterpreted Xenos.
    Jawed
     
  15. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
    No. If the ALU latency is 4 cycles then by the time it's finished the first instruction on the batch the first quad of that batch is ready for the second instruction - alternatively if the latency is higher then other threads are interleaved in order to always keep the ALUs active whilst also taking into account the known ALU latencies (this is what Xenos does).
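The point about four quads covering a 4-cycle ALU latency can be checked with a toy issue model. The sketch assumes one quad is issued per cycle, every instruction depends on the previous one for the same quad, and only a single batch is running, with no interleaving of other threads. These are assumptions for illustration, not confirmed details of R520.

```python
def stall_cycles(quads_per_batch, alu_latency, instructions):
    """Count idle ALU cycles when one batch runs a chain of dependent instructions."""
    ready = [0] * quads_per_batch  # cycle at which each quad's last result is available
    cycle = 0
    stalls = 0
    for _ in range(instructions):
        for quad in range(quads_per_batch):
            if ready[quad] > cycle:
                # The previous result for this quad is not back yet: the ALU idles.
                stalls += ready[quad] - cycle
                cycle = ready[quad]
            ready[quad] = cycle + alu_latency
            cycle += 1  # one quad issued per cycle
    return stalls


if __name__ == "__main__":
    print("4 quads, latency 4:", stall_cycles(4, 4, 10), "stall cycles")
    print("4 quads, latency 8:", stall_cycles(4, 8, 10), "stall cycles")
```

In this model a 4-quad batch never stalls when the latency is 4 cycles, while a longer latency leaves gaps that would have to be filled by interleaving other threads, which is the alternative described in the post.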
     
  16. Tridam

    Regular Subscriber

    Joined:
    Apr 14, 2003
    Messages:
    541
    Likes Received:
    47
    Location:
    Louvain-la-Neuve, Belgium
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    Now that's the kind of data I like to see.

    That alone is what sets R520 apart from G70, in my view.

    Now, all we need is someone to analyse FEAR's shaders to see if it is actually using SM3 dynamic branching, or if the performance advantage for R520 is coming from somewhere else...

    Jawed
     
  18. Moloch

    Moloch God of Wicked Games
    Veteran

    Joined:
    Jun 20, 2002
    Messages:
    2,981
    Likes Received:
    72
    QFT :cool:
     
  19. Maintank

    Regular

    Joined:
    Apr 13, 2004
    Messages:
    463
    Likes Received:
    2

    I believe Anand's review mentioned something about FEAR not being optimized yet for Nvidia's architecture. I would probably wait until the final product before worrying about where the crazy performance advantage came from for ATI.
     
  20. ChronoReverse

    Newcomer

    Joined:
    Apr 14, 2004
    Messages:
    245
    Likes Received:
    1
    Considering that FEAR doesn't look particularly better than Doom3, I'd be willing to wager that either there's something funny about the way it renders or that it's simply not well optimized. Of course, this could also mean that ATI's hardware is better at taking unoptimized input and making something out of it.
     