G80 programmable power

Discussion in 'Architecture and Products' started by Pigman BABY!!!, Jan 25, 2007.

  1. Pigman BABY!!!

    Newcomer

    Joined:
    Jan 7, 2007
    Messages:
    24
    Likes Received:
    1
    How many shader operations can the G80 do per unified shader ALU?
    I know the Xenos can do 10 x 48 ALUs x 500MHz = 240GFlop/s

    Taking into account that the G80 does 10 as well, that gives it:
    10 x 128 ALUs x 1350MHz = 1728GFlop/s (1.728TFlop/s)

    This would mean it's a very big leap up from the G7x and R580, something which isn't shown nearly as much by games.
    Why is that?
     
  2. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,436
    Likes Received:
    264
    Your math is completely off, thanks to the confusion caused by mixed marketing messages. Plus you're asking about shader operations, but quoting FLOPS. Xenos' ALUs do more than G80's ALUs. Maybe someone can quickly supply a link to previous discussions about this, but you might want to search for old threads.
     
  3. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    299
    Location:
    UK
    G80 does 2 FLOPs (arguably sometimes 3, if you count that good ole Missing MUL...) per "shader ALU" per clock. Keep in mind, however, that a fully scalar design will be more efficient per FLOP than a Vec4+Scalar one such as Xenos. And please note that this response is a MASSIVE oversimplification - indeed, please just use that good ole Search Button instead if possible! :)


    Uttar
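As a rough sketch of the peak-rate arithmetic being discussed: the figures below assume one multiply-add (2 FLOPs) per lane per clock, 5 lanes per Xenos ALU (Vec4 + scalar) and 1 lane per G80 ALU. The helper name `peak_gflops` is made up for illustration, and these are theoretical peaks only, not measured throughput.

```python
def peak_gflops(alus, lanes_per_alu, flops_per_lane, clock_mhz):
    """Theoretical peak in GFLOP/s: ALUs x lanes x FLOPs per lane per clock x clock."""
    return alus * lanes_per_alu * flops_per_lane * clock_mhz / 1000.0

# Xenos: 48 ALUs, Vec4 + scalar = 5 lanes, MADD = 2 FLOPs, 500 MHz
xenos = peak_gflops(48, 5, 2, 500)

# G80: 128 scalar ALUs, MADD = 2 FLOPs, 1350 MHz
g80_madd = peak_gflops(128, 1, 2, 1350)

# G80 if the "missing MUL" is counted as a third FLOP (an optimistic assumption)
g80_madd_mul = peak_gflops(128, 1, 3, 1350)

print(f"Xenos:          {xenos:.1f} GFLOP/s")
print(f"G80 (MADD):     {g80_madd:.1f} GFLOP/s")
print(f"G80 (MADD+MUL): {g80_madd_mul:.1f} GFLOP/s")
```

The "10" in the opening post is the Xenos per-ALU figure (5 lanes x 2 FLOPs), which is why reusing it for a scalar G80 ALU inflates the result.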
     
  4. rwolf

    rwolf Rock Star
    Regular

    Joined:
    Oct 25, 2002
    Messages:
    968
    Likes Received:
    53
    Location:
    Canada
    Proof?
     
  5. Simon F

    Simon F Tea maker
    Moderator Veteran

    Joined:
    Feb 8, 2002
    Messages:
    4,560
    Likes Received:
    157
    Location:
    In the Island of Sodor, where the steam trains lie
    It's a simple bin-packing issue. You end up with occasional "holes" in the fpu utilisation.
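The "holes" can be illustrated with a toy model. This is only a sketch under simplifying assumptions: every instruction occupies one whole Vec4 issue slot regardless of how many components it uses, there is no co-issue, and the scalar machine is assumed to pack independent components perfectly. The function names and the instruction mix are hypothetical.

```python
def vec4_utilisation(op_widths):
    """Fraction of Vec4 lanes doing useful work when each op takes one issue slot."""
    for width in op_widths:
        if not 1 <= width <= 4:
            raise ValueError("component count must be between 1 and 4")
    return sum(op_widths) / (4 * len(op_widths))

def scalar_cycles(op_widths, lanes=4):
    """Clocks for `lanes` scalar ALUs to retire the same component work,
    assuming perfect packing of independent components."""
    total_components = sum(op_widths)
    return -(-total_components // lanes)  # ceiling division

# Hypothetical shader: number of components each instruction actually uses
ops = [3, 1, 4, 2, 1, 3]

vec4_clocks = len(ops)              # one instruction per clock on the Vec4 unit
scalar_clocks = scalar_cycles(ops)  # same work spread over 4 scalar lanes

print(f"Vec4 clocks: {vec4_clocks}, lane utilisation: {vec4_utilisation(ops):.0%}")
print(f"Scalar clocks (4 lanes): {scalar_clocks}")
```

Only a stream made entirely of 4-component instructions reaches full utilisation on the Vec4 unit; any narrower instruction leaves idle lanes.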
     
  6. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,436
    Likes Received:
    264
    You can construct cases where Vec4 is just as efficient as scalar, but Uttar is correct, in real code scalar will be more efficient. The question is how much more efficient vs. any die area costs. This is something consumers won't be able to tell due to too many variables.
     
  7. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    It's a least common denominator sort of thing. You need to deal with nice even numbers and x/1 is a lot simpler than x/4 when you're dealing with numbers between 1 and 4.

    Gets a bit more complicated when you start looking at how many processors fit in a given area.
     
  8. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    There are other issues with vector designs as well such as data rotation. Even if one has a perfectly 4-vector series of operations, the data may need a "transposition" due to stuff like matrix layout, texture reads (Fetch4 complicates this even more). Scalar processors will thus generally guarantee fewer idle ALUs and thus more efficient use of the hardware. However I don't know anything about the manufacturing difficulty of each design... if they were the same size/cost, of course 128 vec4 ALUs would be better than 128 scalar ones ;)
     
  9. rwolf

    rwolf Rock Star
    Regular

    Joined:
    Oct 25, 2002
    Messages:
    968
    Likes Received:
    53
    Location:
    Canada
    Agreed, but you have to duplicate a great deal of logic to get that efficiency, which leaves you with fewer scalar units.

    But if the scalar ALU takes 4 clocks to execute an instruction and the vec4+scalar takes 1 clock, the scalar is not as efficient (or at least this implementation isn't). Now in the case of G80 this would only apply to rarely used instructions, and we don't have any info on R600 yet.
     
  10. Panajev2001a

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,187
    Likes Received:
    8
    I think it pretty much is a similar idea to using AoS (.xyzw) or SoA (.xxxx, .yyyy, etc....) organization for your data on vector units such as PlayStation 2's VUs: in the end, even though the unit was designed with AoS usage in mind and had support for horizontal math and broadcast operations, developers wanting to get better utilization of the two vector units re-arranged their data and processed vertices/data vectors in parallel (4 vertices in parallel) and it worked out pretty well ;).
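A minimal sketch of the AoS-to-SoA idea, using plain Python lists to stand in for 4-wide vector registers. The function names and sample data are made up for illustration, and nothing here reflects actual VU code. Each list comprehension corresponds to one 4-wide operation applied across four vertices at once, so every lane carries useful work.

```python
def aos_to_soa(vertices):
    """Transpose [(x, y, z, w), ...] into ([x...], [y...], [z...], [w...])."""
    xs, ys, zs, ws = (list(column) for column in zip(*vertices))
    return xs, ys, zs, ws

def dot4_soa(soa, plane):
    """Dot every vertex with `plane`, one component per step across all vertices."""
    xs, ys, zs, ws = soa
    a, b, c, d = plane
    acc = [a * x for x in xs]                       # 4-wide MUL
    acc = [s + b * y for s, y in zip(acc, ys)]      # 4-wide MADD
    acc = [s + c * z for s, z in zip(acc, zs)]      # 4-wide MADD
    acc = [s + d * w for s, w in zip(acc, ws)]      # 4-wide MADD
    return acc

# Four vertices stored AoS (.xyzw each)
vertices = [(1, 0, 0, 1), (0, 2, 0, 1), (0, 0, 3, 1), (1, 1, 1, 1)]
plane = (1, 1, 1, -1)

soa = aos_to_soa(vertices)
distances = dot4_soa(soa, plane)
print(distances)
```

In AoS form the same dot product needs a horizontal add across .xyzw for each vertex; in SoA form there is no horizontal step, just four full-width operations for four results.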
     
  11. icecold1983

    Banned

    Joined:
    Aug 4, 2006
    Messages:
    649
    Likes Received:
    4
    rwolf, do you think a vec4 approach is overall better than scalar? A more direct question: do you think G80 would be a better product had Nvidia designed it with vec4 in mind?
     
  12. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY

    Hasn't the G80 shown the efficiency? Let's compare a vec3+scalar, 48-ALU R580. It could theoretically do 192 scalar operations per clock at 650 MHz.

    The G80 does 128 scalar operations per clock at 1350 MHz.

    Let's say we even out the clocks to get comparable figures: that's 265ish scalar operations per clock for a G80 at 650 MHz.

    Still, the G80 is much more efficient overall. When it comes down to end performance it's more than 25% faster on most occasions (well, all occasions I can think of), so its efficiency is higher, at least compared to the way the vec ALUs are set up in the R580.
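The clock normalisation above can be written out as a quick check. This is only a sketch of the arithmetic in the post; the function name is hypothetical, and "scalar operations per clock" is the post's own simplified metric, not a measured quantity.

```python
def scalar_ops_at_clock(ops_per_clock, native_mhz, reference_mhz):
    """Rescale a per-clock op count as if the chip ran at `reference_mhz`."""
    return ops_per_clock * native_mhz / reference_mhz

# R580: 48 ALUs x (vec3 + scalar) = 4 components each, already at 650 MHz
r580_ops = 48 * 4

# G80: 128 scalar ALUs at 1350 MHz, expressed in 650 MHz-equivalent ops
g80_ops = scalar_ops_at_clock(128, 1350, 650)

theoretical_ratio = g80_ops / r580_ops

print(f"R580: {r580_ops} ops/clock at 650 MHz")
print(f"G80:  {g80_ops:.1f} ops/clock at 650 MHz equivalent")
print(f"Theoretical G80 advantage: {theoretical_ratio:.2f}x")
```

On these numbers the theoretical gap is roughly 1.38x, so a real-world lead would need to exceed that to demonstrate better per-operation efficiency on this metric alone.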
     
    #12 Razor1, Jan 25, 2007
    Last edited by a moderator: Jan 26, 2007
  13. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,436
    Likes Received:
    264
    If you can make a vec4+scalar that executes in 1 clock then your scalar shouldn't take 4 clocks. Either way it doesn't matter if they are pipelined. Just work on other pixels while you wait for the result to be available.

    Hmm. It seems you're talking about the special function unit with the 4 clocks comment. That's a separate issue from the vector vs. scalar debate.
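The pipelining point can be sketched with a simple cycle count. The model assumes the operations are independent (for example, one per pixel) and that the unit accepts a new operation every clock; the function names and the 4-clock latency are illustrative assumptions, not a description of any specific GPU.

```python
def cycles_unpipelined(n_ops, latency):
    """Each op must finish before the next one starts."""
    return n_ops * latency

def cycles_pipelined(n_ops, latency):
    """One independent op issued per clock; the first result appears after `latency` clocks."""
    if n_ops == 0:
        return 0
    return latency + n_ops - 1

pixels = 1000
latency = 4  # clocks per instruction, as in the hypothetical above

serial = cycles_unpipelined(pixels, latency)
overlapped = cycles_pipelined(pixels, latency)

print(f"Unpipelined: {serial} clocks")
print(f"Pipelined:   {overlapped} clocks ({pixels / overlapped:.3f} ops/clock)")
```

With enough independent pixels in flight, throughput approaches one result per clock regardless of the latency of an individual instruction.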
     
  14. Rangers

    Legend

    Joined:
    Aug 4, 2006
    Messages:
    12,322
    Likes Received:
    1,120
    Meh, that's not impressive the exact way you lay it out. G80 probably doesn't improve much on its theoretical increase.

    Plus, R580 was probably quite a bit texture limited. What if you did the theoretical comparison with a G71? It would probably look even worse for G80.

    And, not to mention that G80's die is much bigger per ALU/shader op, so you really are in trouble when trying to say it's an efficiency upgrade.

    That said, I don't see why Nvidia would have done it if they didn't see a benefit, plus there are so many other issues here (like DX10, CUDA capability...). It seems ALUs as a portion of the die are declining right now (the 64 rumored for R600 isn't impressive by past-gen standards either).
     
  15. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    I like to use this awkward snippet of code to show how pipeline utilisation can fall off:

    [IMG]

    This is how I think it executes on R580 (I've corrected an error that was on prior postings of this):

    [IMG]

    And this is how I think it executes on G80:

    [IMG]

    Jawed
     
  16. rwolf

    rwolf Rock Star
    Regular

    Joined:
    Oct 25, 2002
    Messages:
    968
    Likes Received:
    53
    Location:
    Canada
    That's a good point but R580 is half the transistors and half the speed with substantially less bandwidth.
     
  17. PeterAce

    Regular

    Joined:
    Sep 15, 2003
    Messages:
    489
    Likes Received:
    6
    Location:
    UK, Bedfordshire
    Bold mine. Does this mean that when the MADD Vec3 and the ADD Vec3 are each only processing a scalar, the other potential FLOPs in that ALU are wasted?

    Seems scalar ALUs are more optimal.

    *Edit: Ah, Jawed's diagrams were not there when I posted this; they seem to concur with what I wrote wrt wasted FLOPs.
     
    #17 PeterAce, Jan 26, 2007
    Last edited by a moderator: Jan 26, 2007
  18. rwolf

    rwolf Rock Star
    Regular

    Joined:
    Oct 25, 2002
    Messages:
    968
    Likes Received:
    53
    Location:
    Canada
    Does your diagram take into account optimizations by the shader compiler? Also, comparing R580 to G80 is not comparing vec4 to scalar. R580 is bottlenecked by the fact that three ALUs and three half ALUs service a single pixel. R600 is supposed to be implemented more like Xenos, and Xenos is much more efficient, isn't it?
     
    #18 rwolf, Jan 26, 2007
    Last edited by a moderator: Jan 26, 2007
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    I suppose I might as well post how I think G71 executes this:

    [IMG]

    Ooh, and Xenos, too:

    [IMG]

    Jawed
     
  20. rwolf

    rwolf Rock Star
    Regular

    Joined:
    Oct 25, 2002
    Messages:
    968
    Likes Received:
    53
    Location:
    Canada
    Yes and no, I would have to answer.
     