G80 Shader Core @ 1.35GHz, How'd They Do That?

Discussion in 'Architecture and Products' started by ^eMpTy^, Jan 15, 2007.

  1. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,380
    May I kindly suggest you peruse the following pamphlet?
     
  2. CouldntResist

    Regular

    Joined:
    Aug 16, 2004
    Messages:
    264
    Likes Received:
    7
    Thanks, but I have sentiments for this one.

    P4's ALU simply doesn't deserve to be called an "ALU operating at double frequency", because if it did, you'd expect it to: a) process 2 independent instructions just as fast as 2 dependent ones, and b) expose results of these 2 instructions at the higher clock rate. P4's ALU does neither.

    It would more accurately describe the computational power of the P4 ALU to say that, in some circumstances, the ALU can take 2 dependent instructions, fuse them, and execute them as a monolithic 3-operand instruction. The tricks with the rising clock edge, splitting data into 16-bit halves, etc. are just implementation details. Of course, "double frequency" looked better on paper, and at the peak of the Megahertz Wars it looked like pure pwnage.
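
    The 16-bit split mentioned above can be sketched in a few lines. This is only an arithmetic model under the assumption that the low half is processed first and its carry feeds the high half; it says nothing about real timing, and the function names are hypothetical.

```python
MASK16 = 0xFFFF

def staggered_add32(a, b):
    """Add two 32-bit values as two 16-bit half-adds, low half first."""
    # Half-step 1: low 16 bits; the carry-out feeds the next half-step.
    low = (a & MASK16) + (b & MASK16)
    carry = low >> 16
    # Half-step 2: high 16 bits plus the carry from the low half.
    high = ((a >> 16) & MASK16) + ((b >> 16) & MASK16) + carry
    # Recombine, wrapping around at 32 bits like a hardware adder.
    return ((high & MASK16) << 16) | (low & MASK16)

def dependent_pair(a, b, c):
    """Two dependent adds, (a + b) + c, chained through the same model."""
    return staggered_add32(staggered_add32(a, b), c)
```

    In this model the second add can start on the low half of the first result as soon as that half exists, which is why a dependent pair is the case that benefits.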
     
    Gubbi likes this.
  3. rwolf

    rwolf Rock Star
    Regular

    Joined:
    Oct 25, 2002
    Messages:
    968
    Likes Received:
    54
    Location:
    Canada
    Well said, Chalnoth. I think that the design cycle has something to do with it as well. GPU designers have less time to spend optimizing individual circuits because they are looking for the fastest execution time of as many threads as possible, while CPU designers spend a great deal of time optimizing single-thread performance. It looks like G80 had about a four-year development cycle (I believe), which might be a bit longer than other designs they have worked on.
     
  4. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,380
    Wow, you're right. That's a fascinating paper. Thanks for sharing this and apologies for the sarcastic tone of my earlier reply.
     
  5. Acert93

    Acert93 Artist formerly known as Acert93
    Legend

    Joined:
    Dec 9, 2004
    Messages:
    7,782
    Likes Received:
    162
    Location:
    Seattle
    Yeah. I think it was a smart design choice as well. The silicon obviously can bear higher frequencies than we have seen on GPUs, so the question was whether they could design part of the pipeline to hit higher frequencies while being an overall win. I think the fact that G80 has 128 smaller ALUs opened the door for getting their hands dirty and hand-tweaking the layout. The work spent on getting 1 ALU design to a high frequency benefits the others.

    I am curious how long it takes other parts of the GPU to get this sort of TLC.
     
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Historically it seems there was little point in increasing the clock of the ALUs, since the TMUs, being inline, couldn't keep up without investing a lot of effort in them, too. TMUs seem to consume at least as much die, maybe doubly so, as ALUs.

    Additionally, every time you increase the pipeline clock, you have to take account of the fact that texture fetch latency (in terms of absolute time) remains roughly constant. So if you clock the pipeline faster, you need to introduce more FIFO stages, which costs area. It seems to me that NV4x-G7x were designed with enough FIFO stages to cater for a range of GPUs from craptastic to enthusiast, where the percentage of texture fetch latency, expressed in terms of pipeline stages, varies by X. I dunno what X is though, or how close G71 comes to its latency-tolerance ceiling for bilinear texturing.
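
    The FIFO argument is just arithmetic: a fixed latency in nanoseconds covers more cycles as the clock goes up. A rough sketch, where the function name and the 200 ns latency are made-up illustrative values and not measured figures for any GPU:

```python
import math

def fifo_stages_needed(fetch_latency_ns, pipeline_clock_mhz):
    """Pipeline cycles (hence FIFO slots) needed to cover a fixed fetch latency."""
    # cycles = latency / cycle_time = latency_ns * clock_MHz / 1000
    return math.ceil(fetch_latency_ns * pipeline_clock_mhz / 1000)

# Same hypothetical 200 ns fetch latency at two pipeline clocks.
stages_slow = fifo_stages_needed(200, 650)    # 130 cycles to hide
stages_fast = fifo_stages_needed(200, 1350)   # 270 cycles to hide
```

    Roughly doubling the clock roughly doubles the number of stages needed to hide the same fetch, which is the area cost described above.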

    So, it seems to me that NVidia had to wait until the TMUs were decoupled before it could play with the ALU clocks. Then, for trilinear/anisotropic filtering, the relative throughput of the TMUs is such that it's advantageous to double-up the filtering section. So, in a sense, the TMUs are "double-clocked", except of course this is achieved by doubling their area.

    Jawed
     
    Acert93 likes this.
  7. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    Well, I didn't say it was a smart design choice. I just said that it will be the biggest reason, I think, for different performance between the parts (whatever that different performance may be). Whether or not it was the better design choice remains to be seen (and I don't think it'll be evident immediately upon the release of the R600, either).
     
  8. JoshMST

    Regular

    Joined:
    Sep 2, 2002
    Messages:
    466
    Likes Received:
    22
    When I chatted with NVIDIA about G80, I was told directly that the shader units were full custom (or as close to full custom as anyone has done in the graphics industry). Now, as to what exact level of "custom" they were referring to, I couldn't say.
     
  9. nonamer

    Banned

    Joined:
    May 25, 2002
    Messages:
    564
    Likes Received:
    7
    I've heard that they used their own libraries or whatever.
     
  10. stepz

    Newcomer

    Joined:
    Dec 11, 2003
    Messages:
    66
    Likes Received:
    3
    Those 128 smaller ALUs are marketing bullcrap. The G80 is still doing the same 16way SIMD, just organized a bit differently.
     
  11. Geo

    Geo Mostly Harmless
    Legend

    Joined:
    Apr 22, 2002
    Messages:
    9,116
    Likes Received:
    213
    Location:
    Uffda-land
    The "same 16 way SIMD" as what? :lol:
     
  12. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    It's not bullcrap, it's AoS vs SoA. And the ALUs are 8way now fyi, not 16way.


    Uttar
     
  13. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,429
    Likes Received:
    181
    Location:
    Chania
    You don't get cookies for asking a stupid question over an idiotic statement :p
     
  14. stepz

    Newcomer

    Joined:
    Dec 11, 2003
    Messages:
    66
    Likes Received:
    3
    Aside from eliminating the need for horizontal ops, such as dot product, I don't see how AoS vs. SoA makes any difference for the ALUs. You're still executing a single instruction in SIMD fashion. Register file access is simplified a bit though by eliminating the need for a layer of muxes to support swizzles.
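
    To make the horizontal-op point concrete, here is a minimal sketch of a 3-component dot product in both layouts. Plain Python lists stand in for SIMD lanes, and the function names are hypothetical.

```python
def dot3_aos(pixels_a, pixels_b):
    """AoS: each pixel is an (x, y, z) tuple handled by one vector unit,
    which multiplies component-wise and then adds across its own lanes."""
    out = []
    for (ax, ay, az), (bx, by, bz) in zip(pixels_a, pixels_b):
        products = (ax * bx, ay * by, az * bz)                # vertical multiply
        out.append(products[0] + products[1] + products[2])   # horizontal add
    return out

def dot3_soa(ax, ay, az, bx, by, bz):
    """SoA: each argument holds one component for all pixels. Every lane
    owns one pixel and runs a scalar multiply followed by two multiply-adds."""
    acc = [x * y for x, y in zip(ax, bx)]                # MUL
    acc = [a + x * y for a, x, y in zip(acc, ay, by)]    # MAD
    acc = [a + x * y for a, x, y in zip(acc, az, bz)]    # MAD
    return acc
```

    Both give the same result; the SoA version never needs to combine values across lanes, at the cost of three issued instructions instead of one.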

    And I still believe the G80 ALUs are 16way SIMD. You have 128 execution units total, arranged into 8 processors, each capable of executing a single instruction over 16 pixels. AFAIK there is no way to issue different instructions in smaller than 16-pixel granularity. If you're talking about the previous generation, then you're correct, it was (something like) 8way SIMD thanks to the 1+3/2+2 issuing schemes.
     
  15. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,326
    Likes Received:
    107
    Location:
    San Francisco
    Think about 8 ALUs working on 16 pixels over 2 clock cycles.
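
    As a toy model of that suggestion (a sketch only; the constants come from the numbers in this thread and the `issue` helper is hypothetical):

```python
LANES = 8    # physical ALUs per SIMD unit
BATCH = 16   # pixels covered by one issued instruction

def issue(op, batch):
    """Run one instruction over a 16-pixel batch on 8 lanes.
    Returns (results, fast_clocks_used)."""
    assert len(batch) == BATCH
    results = []
    clocks = 0
    for start in range(0, BATCH, LANES):
        # One fast clock: all 8 lanes apply the same op to 8 pixels.
        results.extend(op(p) for p in batch[start:start + LANES])
        clocks += 1
    return results, clocks
```

    One issued instruction occupies the 8 lanes for 2 fast clocks, so from the scheduler's side the granularity is 16 pixels even though only 8 ALUs exist.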
     
    #35 nAo, Jan 31, 2007
    Last edited: Jan 31, 2007
  16. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,537
    Likes Received:
    589
    Location:
    New York
    A bit differently huh? So based on the "old" marketing how many ALU's are in G80 exactly?
     
  17. _xxx_

    Banned

    Joined:
    Aug 3, 2004
    Messages:
    5,008
    Likes Received:
    86
    Location:
    Stuttgart, Germany
    Stupidly compared, I'd say it would be about like 128/4 = 32 G70 ALU's. Though since some stuff is beefed up, more like 48 or so.
     
  18. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,537
    Likes Received:
    589
    Location:
    New York
    Well, I was really trying to ask stepz for his definition of an ALU. For example, why are we dividing by 4? Is it just because older architectures were vec4? It can't be based on the SIMD configuration, 'cause there was a time that the entire shader pipeline was a single SIMD array.

    He claims that 128 ALU's is just marketing speak - I'm trying to understand why each ALU should not be counted separately.
     
  19. stepz

    Newcomer

    Joined:
    Dec 11, 2003
    Messages:
    66
    Likes Received:
    3
    Re-reading the B3D G80 review (which is excellent btw.) it seems so, though I can't see any reason why it should be like this. So it looks like 8 processors, each with two 8way SIMD units that are clocked at double the rate of the control logic, making them effectively 16way SIMD. Can the two 8way SIMDs be executing different instructions? If not, I can't see any difference from 16way SIMD. If yes, it would be interesting to know what parts are shared. Constant registers, shader code, anything else?

    trinibwoy: I'd define ALUs like they do in CPU land, by control unit outputs. If two would-be ALUs can't be executing different instructions, then they're not two different ALUs, but a single SIMD ALU. Although I must say that the terminology for talking about SIMD ALUs and their constituent units is pretty fuzzy. But saying that a subunit of a SIMD ALU is a scalar unit is in my opinion seriously misleading. Normally, if you hear 128 scalar ALUs you'd think that the processor can be executing 128 different instructions at once, i.e. 128-way superscalar (which would of course require ridiculous amounts of control logic).
    Oh, and for NVidia's previous generation, I'd say 6 processors of 2way superscalar 8way(ish) SIMD. I'm not sure how decoupled the 6 processors are though. We really need some decent overviews of different GPU microarchitectures without all the marketing buzzwords, using established microprocessor design terminology where possible. The G80 B3D review is a good step in that direction.
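
    The two ways of counting can be put side by side. This is a sketch using only the figures quoted in this thread (8 processors, two 8way units each); whether the two units in a processor can run different instructions is left as a parameter because the thread does not settle it, and `count_alus` is a hypothetical helper.

```python
def count_alus(processors, units_per_processor, lanes_per_unit, independent_units):
    """Count 'ALUs' by lane and by independent instruction stream."""
    lanes = processors * units_per_processor * lanes_per_unit
    # By control-unit outputs: one ALU per instruction stream that can differ.
    streams = processors * (units_per_processor if independent_units else 1)
    return {"by_lane": lanes, "by_control_output": streams}

g80_if_coupled = count_alus(8, 2, 8, independent_units=False)
g80_if_independent = count_alus(8, 2, 8, independent_units=True)
```

    Counting by lane gives 128 either way; counting by control outputs gives 8 or 16 depending on that open question.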
     
    #39 stepz, Jan 31, 2007
    Last edited by a moderator: Jan 31, 2007
  20. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,326
    Likes Received:
    107
    Location:
    San Francisco
    Yes..
    It's likely that at least 2 different groups of threads can be executed, different code.. some constants (or better..some constants cache :) )
    Nothing wrong with your definition, but I can't see why you require each ALU to execute different code.
     