AMD: R7xx Speculation

Discussion in 'Architecture and Products' started by Unknown Soldier, May 18, 2007.

Thread Status:
Not open for further replies.
  1. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    Maybe I misinterpreted what design you are proposing.

    You are saying that design B has 64x1D +16x1D in a SIMD.
    How many individual elements are processed per clock?
    Design A has 16 elements being processed in a given clock cycle, hence why the 80 units per SIMD are divided up into 5 ALU processor groups.
    What has design B changed, exactly, other than distributing the 16 over the terms in the parentheses?
    By your doing so, I interpret it as meaning that all 64 elements have one component evaluated per clock.
    If not, why did you change the 16-unit division of ALUs?
    I don't see how it's related to the thread-switching scheme that follows.

    Design A has to have enough registers to handle two 64-thread batches.
    Design B needs enough to handle eight.
    Either that, or each clause is 1/4 the size of those found in A, and clause setup overhead is quadruple that of A. The absolute amount of overhead is not something I'm aware of.

    Each thread gets a sequencer in the SIMD's control logic.
    Design A has two.
    Why wouldn't B have eight?
    The live set of registers is also 8 times as large, over most of the 32 clocks of execution.
    The most design A will have to worry about is 2 clauses' worth.
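    The bookkeeping in this post can be written out as a rough sketch. The 80 units, 16 elements per clock, 64-thread batches, two versus eight batches and the 32-clock window come from the discussion; `REGS_PER_THREAD` is a purely hypothetical figure added for illustration, and the assumption that a batch still occupies the SIMD for 4 clocks per turn in design B is mine.

```python
# Figures quoted in the thread.
ALUS_PER_SIMD = 80
ELEMENTS_PER_CLOCK_A = 16      # design A issues 16 elements per clock
BATCH_SIZE = 64                # threads per batch
BATCHES_A, BATCHES_B = 2, 8    # batches each design keeps in flight

# Hypothetical value, NOT from the thread: vec4 registers held per thread.
REGS_PER_THREAD = 8

alus_per_element = ALUS_PER_SIMD // ELEMENTS_PER_CLOCK_A   # the "5 ALU groups"
clocks_per_batch = BATCH_SIZE // ELEMENTS_PER_CLOCK_A      # clocks to walk one batch

# Window covered by round-robin over the resident batches
# (assumes 4 clocks per batch turn in both designs).
window_a = BATCHES_A * clocks_per_batch
window_b = BATCHES_B * clocks_per_batch

# Registers needed if every resident batch keeps a full clause's state live.
live_regs_a = BATCHES_A * BATCH_SIZE * REGS_PER_THREAD
live_regs_b = BATCHES_B * BATCH_SIZE * REGS_PER_THREAD
resident_ratio = live_regs_b / live_regs_a

print(alus_per_element, clocks_per_batch, window_a, window_b, resident_ratio)
```

    Under these assumptions design B either needs four times the live register state, or each batch gets a quarter of the registers, which is the trade-off raised above.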
     
  2. ECH

    ECH
    Regular

    Joined:
    May 24, 2007
    Messages:
    682
    Likes Received:
    7
    I thought this was the intent of the CPU2 test.
     
  3. Arty

    Arty KEPLER
    Veteran

    Joined:
    Jun 16, 2005
    Messages:
    1,906
    Likes Received:
    55
    It's not Nvidia's duty to disable it, but wouldn't the scores be termed useless if they don't meet the testing policy?
     
  4. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Branches are never resolved without switching hardware thread (context) - that's because the Sequencer evaluates the branch (i.e. decides where to jump next).

    So ALU bubbles can only occur when the ALU SIMD runs out of threads, e.g. when all threads are waiting for TEX results.

    This is never an issue because the pipeline has a "previous" register that holds a copy of the last register result for each of the five ALU lanes. This register (seems to be 2 in fact, vec4 + scalar) can be sampled in any successive instruction (its lifetime is until it's overwritten).

    There are register file bandwidth constraints but it's easier to point you at the R600 ISA document which spends pages on the subject.
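    A toy model of the "previous" register idea, as a hedged sketch only. The class, lane names and operand encoding below are all made up for illustration and are not the R600 encoding; the only behaviour taken from the post is that each of the five lanes keeps its last result and a later instruction can read it without going back to the register file.

```python
class AluLanes:
    """Toy 5-lane ALU with a per-lane 'previous result' latch (illustrative only)."""

    LANES = ("x", "y", "z", "w", "t")

    def __init__(self):
        self.gpr = {}                                # register file: name -> value
        self.prev = {lane: 0.0 for lane in self.LANES}
        self.gpr_reads = 0                           # counts register-file reads

    def _read(self, src):
        kind, key = src
        if kind == "prev":                           # forwarded result, no GPR read
            return self.prev[key]
        if kind == "gpr":
            self.gpr_reads += 1
            return self.gpr[key]
        return key                                   # ("const", value)

    def issue(self, bundle):
        """bundle maps lane -> (a, b, c, dst); each lane computes a * b + c."""
        # Read all operands first, so every lane sees the *previous* bundle's results.
        results = {
            lane: (self._read(a) * self._read(b) + self._read(c), dst)
            for lane, (a, b, c, dst) in bundle.items()
        }
        for lane, (value, dst) in results.items():
            self.prev[lane] = value                  # lives until overwritten
            if dst is not None:
                self.gpr[dst] = value


alu = AluLanes()
alu.gpr = {"r0": 2.0, "r1": 3.0}
# First bundle: x = r0 * r1 + 1, result kept only in the latch.
alu.issue({"x": (("gpr", "r0"), ("gpr", "r1"), ("const", 1.0), None)})
# Second bundle: dependent op squares the previous x result with no GPR reads.
alu.issue({"x": (("prev", "x"), ("prev", "x"), ("const", 0.0), "r2")})
print(alu.gpr["r2"], alu.gpr_reads)
```

    The dependent instruction consumes the first result straight from the latch, so the register-file read count does not grow between the two bundles.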

    Jawed
     
  5. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,426
    Likes Received:
    423
    Location:
    New York
    Well technically they'd be even more useless...but yeah :)

    Looking at some of the numbers these cards are putting up nowadays maybe it would be a good thing to grab some of those cycles for physics acceleration. As always we don't have the software to evaluate that approach so we're just jerking off into the wind as usual.
     
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    It seems ATI wanted to build a vec4-structured register file. Though with the irony that any 32-bit aligned word can be fetched.

    It seems to me that after a certain point register file layout/porting/bandwidth/read-ordering trumps a lot of other things when you're building ALUs.

    Also, you can't ignore the requirement to build ALUs that do more than just MAD/MUL/ADD. Transcendentals need to be proportionate (i.e. 1/4 MAD rate) and then the myriad of SM4's new instructions need to be distributed across the widths of the ALU types, without adding too much specialisation.

    Once you add these instruction types you then get into the "co-issue" problem and serial dependency issues. ATI decided to go with an entirely static, compiled, solution. Doing that they presumably then decided that for now 4 MADs + 1 T was the right way to go, instead of 1 MAD + 1 T that runs at 1/4 speed, or whatever...
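    To make the co-issue and serial dependency point concrete, here is a minimal sketch of static packing. It is a toy greedy packer, not ATI's compiler: the `Op` type, the `pack` function and the rule "up to 5 ops per bundle, at most one transcendental" are simplifying assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    deps: tuple = ()              # names of ops whose results this op consumes
    transcendental: bool = False


def pack(ops, width=5, max_trans=1):
    """Greedy in-order packing; `ops` must already be in dependency order."""
    bundles, current, done = [], [], set()
    for op in ops:
        trans_used = sum(o.transcendental for o in current)
        fits = (
            len(current) < width
            and all(d in done for d in op.deps)      # no dependency inside a bundle
            and (not op.transcendental or trans_used < max_trans)
        )
        if not fits and current:
            bundles.append(current)                  # close the bundle, start a new one
            done.update(o.name for o in current)
            current = []
        current.append(op)
    if current:
        bundles.append(current)
    return bundles


# Four independent MULs plus one transcendental fill a single bundle.
independent = [Op("mul_x"), Op("mul_y"), Op("mul_z"), Op("mul_w"),
               Op("rcp", transcendental=True)]
# A serial chain wastes four of the five slots in every bundle.
chain = [Op("a"), Op("b", deps=("a",)), Op("c", deps=("b",))]

print(len(pack(independent)), len(pack(chain)))
```

    With independent work the five slots fill in one bundle, while a three-op dependent chain needs three bundles, which is the utilisation cost a purely static scheme has to accept.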

    Jawed
     
  7. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Okay, here's some more proof:
    http://www.anandtech.com/video/showdoc.aspx?i=3275&p=4

    The 9800 GTX has 10% higher core clock, 13% higher shader clock, and over 120% higher bilinear texturing rate than the 8800 Ultra. However, in Crysis they perform the same, because the Ultra has 47% more BW/ROPs.
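    The percentages can be cross-checked with a short calculation. The clocks, bus widths and bilinear texels per clock below are assumed from commonly published reference specs for the two cards, not taken from the linked article, so treat them as illustrative inputs.

```python
# Assumed reference specs (MHz, bits); not from the post itself.
SPECS = {
    "9800 GTX":   {"core": 675, "shader": 1688, "bilinear_per_clk": 64,
                   "mem": 1100, "bus_bits": 256},
    "8800 Ultra": {"core": 612, "shader": 1500, "bilinear_per_clk": 32,
                   "mem": 1080, "bus_bits": 384},
}


def bilinear_rate(card):
    """Peak bilinear rate in Mtexels/s."""
    return card["core"] * card["bilinear_per_clk"]


def bandwidth(card):
    """Peak memory bandwidth in MB/s (double data rate memory)."""
    return card["mem"] * 2 * card["bus_bits"] // 8


def percent_more(a, b):
    return (a / b - 1.0) * 100.0


gtx, ultra = SPECS["9800 GTX"], SPECS["8800 Ultra"]
deltas = {
    "9800 GTX core clock":   percent_more(gtx["core"], ultra["core"]),
    "9800 GTX shader clock": percent_more(gtx["shader"], ultra["shader"]),
    "9800 GTX bilinear":     percent_more(bilinear_rate(gtx), bilinear_rate(ultra)),
    "8800 Ultra bandwidth":  percent_more(bandwidth(ultra), bandwidth(gtx)),
}
for name, pct in deltas.items():
    print(f"{name}: {pct:+.1f}%")
```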
     
  8. Dio

    Dio
    Veteran

    Joined:
    Jul 1, 2002
    Messages:
    1,758
    Likes Received:
    8
    Location:
    UK
    Wise words.
     
  9. Arty

    Arty KEPLER
    Veteran

    Joined:
    Jun 16, 2005
    Messages:
    1,906
    Likes Received:
    55
    :twisted:
     
  10. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY

    http://www.anandtech.com/video/showdoc.aspx?i=3338&p=6

    More recent benchmarks: the 9800 GTX has considerably less bandwidth than the 8800 GTX, but it outperforms it by 10%+.

    Considering the GTX 260 is also in those charts, it has a huge bandwidth advantage, but we don't see any of that.
     
  11. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    Somehow I've never run across the pdf. My google-fu is weak. :sad:
    I think I've found it now, so I'll be going through it.
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    :lol: I installed the CAL SDK to get at it and other stuff.

    Jawed
     
  13. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    This is where the SOA and AOS mentality comes in. The register file doesn't need to access all 4 channels of all 64 elements in design B. It's organized in groups of 64 by channel, instead of 16 pixels.

    Assuming two register operands allowed to be read per instruction, A's SIMD reads 10 groups of 16 FP32's per clock. In B, the SIMD reads 2 blocks of 64 FP32's every clock and 2 more every 4 clocks for the transcendental units. "Live registers" is the same.

    This actually makes B's register file design simpler, because you don't need as much granularity.

    Yes.
    Otherwise I can't make the same claims as before. I guess it could be 4 groups of 16, but isn't that the same thing?

    Not sure what you're talking about. Register files are the same size. FP32's in flight in the ALUs is the same as well.

    A's sequencer can be used in B for the most part. It can handle 2 per 8 cycles in A, so 8 per 32 cycles in B isn't a problem. It's still updating instruction pointers and loading instruction clauses and dealing with branches at the same rate. Yeah, insanely long instruction sequences that don't fit in any cache could thrash worse in this design, but it's a corner case.

    For clarity, how big is a clause in the way you use the term? Are you talking about a 5x1D instruction packet, or something variable and longer? (Well, not always 5x1D, as tex or branch instructions are possible too, but they don't go into the SIMDs, obviously)
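    The register-read and sequencer numbers in this post can be tallied directly. This is just the arithmetic from the post written out; the variable names are mine and nothing here models the actual hardware.

```python
from fractions import Fraction

OPERANDS = 2   # register operands read per instruction, as assumed in the post

# Design A: 5 ALUs x 2 operands = 10 groups of 16 FP32 values every clock.
reads_a = 5 * OPERANDS * 16

# Design B: 2 blocks of 64 every clock for the MADs,
# plus 2 more blocks of 64 every 4 clocks for the transcendental units.
mad_reads_b = OPERANDS * 64
trans_reads_b = Fraction(OPERANDS * 64, 4)
reads_b = mad_reads_b + trans_reads_b

# Sequencer load: batches serviced per clock.
seq_rate_a = Fraction(2, 8)     # 2 batches per 8 cycles
seq_rate_b = Fraction(8, 32)    # 8 batches per 32 cycles

print(reads_a, reads_b, seq_rate_a, seq_rate_b)
```

    Both designs average the same number of FP32 reads per clock and the same sequencer rate; what changes is the granularity of each read.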
     
    #4573 Mintmaster, Jun 23, 2008
    Last edited by a moderator: Jun 23, 2008
  14. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    That's just because the 8800 GTX's other deficits are too big now. The 9800 GTX has 25% higher shader clock, 17% higher core, 135% higher bilinear, etc.

    My point is that, all else being equal, BW helps in Crysis even without AA. You said it doesn't.

    Take a 4850 and bump clocks (mem and core) by 20%, and you should get a 20% boost if CPU/PCI-e isn't a factor at the testing resolution. Now increase just the mem clock another 50% to match the 4870, and you'll get a further boost. Not another 50%, of course, but a boost nonetheless.
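    A crude way to picture that thought experiment is a two-term frame-time model. Everything in it is an assumption for illustration: the additive model itself, and especially the 70/30 split between core-bound and bandwidth-bound time, which is a made-up number and not a Crysis measurement.

```python
# Hypothetical split of baseline frame time; NOT measured data.
CORE_BOUND = 0.7        # fraction of frame time that scales with core clock
BANDWIDTH_BOUND = 0.3   # fraction that scales with memory clock


def speedup(core_scale, mem_scale):
    """Speedup over baseline when each portion scales with its own clock."""
    new_time = CORE_BOUND / core_scale + BANDWIDTH_BOUND / mem_scale
    return 1.0 / new_time


both_up_20 = speedup(1.2, 1.2)          # core and memory both +20%
mem_up_more = speedup(1.2, 1.2 * 1.5)   # then memory alone another +50%

print(f"core+mem +20%:        {both_up_20:.3f}x")
print(f"then mem another 50%: {mem_up_more:.3f}x")
print(f"extra gain from BW:   {(mem_up_more / both_up_20 - 1) * 100:.1f}%")
```

    With any non-zero bandwidth-bound share, the second step gives a gain that is larger than nothing and smaller than 50%, which is all the argument needs.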
     
  15. bridgman

    Newcomer Subscriber

    Joined:
    Dec 1, 2007
    Messages:
    58
    Likes Received:
    102
    Location:
    Toronto-ish
    #4575 bridgman, Jun 23, 2008
    Last edited by a moderator: Jun 23, 2008
  16. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
  17. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY

    How big are those deficits? The 8800 GT is also in that graph; it has ~35% less bandwidth than the 8800 GTX, and its performance is only 5% lower at the highest resolution tested. To expect miracles from the 4870 just because of bandwidth in Crysis (edit: without AA active), I just can't see where that is coming from.
     
    #4577 Razor1, Jun 23, 2008
    Last edited by a moderator: Jun 23, 2008
  18. randomhack

    Newcomer

    Joined:
    Apr 4, 2008
    Messages:
    41
    Likes Received:
    0
    Umm .. CAL is available on Linux too and that's how I got the R600 ISA docs :)
     
  19. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    I totally agree, but as I mentioned above, this modification simplifies register layout a bit. Porting remains the same.

    I included this in more detail after the post you replied to. Co-issue is a problem that exists with NVidia's architecture, too, but if you look again, my solution simply modifies an instruction packet (i.e. what the SIMD can do in a clock) to allow dependent MADs. Transcendentals are scheduled in the same way as they are now.

    My solution is the same in that sense. Same number of MADs, same number of T's. It just operates on elements in a different order, thus allowing dependencies and easier register file access, but has a few costs that IMHO are small.
     
  20. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    I would caution you to get your head round sections 4.6 and 4.7 of R600 ISA :twisted:

    My eyes glazed over half way through 4.7.4 :lol:

    Jawed
     
  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.