G70 fragment pipeline break down?

Discussion in 'Architecture and Products' started by j^aws, Jul 15, 2005.

  1. j^aws

    Veteran

    Joined:
    Jun 1, 2004
    Messages:
    1,909
    Likes Received:
    8
    Each G70 fragment pipeline can issue 5 instructions/cycle. Can anyone give a breakdown? Here's my guess:

    5 maths
    4 maths + 1 texture
    4 maths + 1 norm
    3 maths + 1 texture + 1 norm

    Would that be correct? IIRC, the vec4 ALUs also work on texture data with the texture ALUs?

    Thanks...
     
  2. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    I might as well post what I posted in the other thread:

    Sadly, this breaks down extremely quickly. One thing I learnt from the efficiency thread is that NV40/G70 has "combinations" of instructions that it can execute in one cycle.

    There might be 50 or 100 of these different combinations (just a guess). Some combinations will amount to only 2 instructions per clock. Others might amount to 6 or 8 or more instructions.

    It's hideously complicated.

    Simple texture operations appear not to block ALU1 - I don't know what defines this. But I don't think it's possible to say that a texture operation on ALU1 always blocks that ALU from other instructions.

    Jawed
     
  3. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    This is the code snippet I posted in that thread:

    Code:
    # clock 1
    texld r0, t0, s0; # tex fetch
    madr r0, r0, c1.r, c1.g # _bx2 in tex
    nrm_pp r1.rgb, t4 # nrm in shader 0
    dp3 r1.r, r1, r0 # 3D dot product in shader 1
    mul r0.a, r0, r0 # dual issue in shader 1
    
    # clock 2
    mul r1.a, r0.a, c2.a # dual issue in shader 0
    mul r0.rgb, r1.r, r0 # dual issue in shader 0
    add r0.a, r1.r, r1.r # fx2 in shader 0
    mad r0.rg, r0.a, c1, c1.a # mad w/2 const in shader 1
    mul r1.ba, r1.a, r0.a, c2 # dual issue in shader 1
    
    # clock 3
    rcp r0.a, r0.a # reciprocal in shader 0
    mul r0.rg r0, r0.a # div instruction in shader 0
    mul r0.a, r0.a, r1.a # dual issue in shader 0
    texld r2, r0, s1 # texture fetch
    mad r2.rgb, r0.a, r2, c5 # mad in shader 1
    abs r0.a, r0.a # abs in shader 1
    log r0.a, r0.a # log in shader 1
    
    # clock 4
    rcp r0.a, t1.a # reciprocal in shader 0
    mul r0.rg, t1, r0.a # div instruction in shader 0
    mul r0.a, r0.a, c2.g # dual issue in shader 0
    texld r1, r0, s3 # tex fetch
    mad r1.rgb, r1, c4, -r2 # mad in shader 1
    exp r0.a, r0.a # dual issue in shader 1
    
    # clock 5
    texld r0, r1.bar, s2 # texture coordinates swizzle
    mad r0.rgb, r0, v0, r1 # color calculation in shader 1
    mul r0.a, r1, v0 # dual issue in shader 1
    
    # clock 6
    mul r1.rgb, r0.a, c5.a # mul in shader 0
    mad r0.rgb, r1, r0.a, r0 # mad in shader 1
    mov r0.a, c3.a # move in shader 1
    mov oC0, r0 # move in shader 1
    Anything from 3 to 7 instructions per clock.
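
    For anyone who wants to check the tally, here is a minimal sketch that counts the instructions listed under each "# clock" heading in the snippet above. The opcode lists are transcribed by hand from the listing; the names CLOCKS and instructions_per_clock are made up for illustration, and the counts say nothing about what the hardware really does.

```python
# Opcodes listed under each "# clock N" heading, transcribed from the snippet.
CLOCKS = {
    1: ["texld", "madr", "nrm_pp", "dp3", "mul"],
    2: ["mul", "mul", "add", "mad", "mul"],
    3: ["rcp", "mul", "mul", "texld", "mad", "abs", "log"],
    4: ["rcp", "mul", "mul", "texld", "mad", "exp"],
    5: ["texld", "mad", "mul"],
    6: ["mul", "mad", "mov", "mov"],
}


def instructions_per_clock(clocks):
    """Return {clock: number of instructions listed for that clock}."""
    return {clock: len(ops) for clock, ops in clocks.items()}


counts = instructions_per_clock(CLOCKS)
total = sum(counts.values())

print(counts)                                      # per-clock tallies
print(min(counts.values()), max(counts.values()))  # lowest and highest clock
print(total, total / len(counts))                  # total and mean per clock
```

    This only counts lines of the listing as written; whether each line deserves to be called an instruction is a separate question.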

    Jawed
     
  4. Tridam

    Regular Subscriber

    Joined:
    Apr 14, 2003
    Messages:
    541
    Likes Received:
    47
    Location:
    Louvain-la-Neuve, Belgium
    Looks like "the not-exactly-true thing" that Nvidia showed at the 6800 launch ;)
     
  5. Demirug

    Veteran

    Joined:
    Dec 8, 2002
    Messages:
    1,326
    Likes Received:
    69
    I am sorry but this shader needs 9 clocks

     
  6. Tridam

    Regular Subscriber

    Joined:
    Apr 14, 2003
    Messages:
    541
    Likes Received:
    47
    Location:
    Louvain-la-Neuve, Belgium
    That shader is a piece of code that Nvidia showed to the press to highlight the 6800 pipeline. I don't expect it to run the way Nvidia says it does, even on a 7800 GTX.

    And you'll notice some stupid things in it. Nvidia wanted to have as many instructions as possible, so it looks like they added useless things, for example ABS before LOG. Nvidia told me a year ago that there were errors in it because the guy who wrote it didn't know D3D.
     
  7. j^aws

    Veteran

    Joined:
    Jun 1, 2004
    Messages:
    1,909
    Likes Received:
    8
    Okay, I'm even more confused by this varying issue rate!

    So a single G70 fragment pipeline can vary between 3-9 instructions/cycle then?

    I'm just guessing here, but could this be local to a 'single' pipeline, while overall for a 'quad' pipeline it would be constant at 20 instructions/cycle?

    i.e. a quad can be considered an independent SIMD processor that issues 20 inst./cycle; a single pipe within the quad can vary from 5 inst./cycle, but the quad stays constant at 20 inst./cycle?
     
  8. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,298
    Likes Received:
    137
    Location:
    On the path to wisdom
    No, the same instructions are executed for every pixel in the quad.

    Where did you get 9 instructions/clock from?
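
    To illustrate the lockstep point, here is a rough sketch. It assumes a quad is 4 pixels (implied by the 20 = 4 x 5 figure above) and takes the per-clock counts from the listing Jawed posted earlier in the thread; PIXELS_PER_QUAD, per_pixel and per_quad are names made up for this example.

```python
# Assumption: a quad is 4 pixels, all executing the same instructions each clock.
PIXELS_PER_QUAD = 4

# Instructions listed per clock in Jawed's snippet (clocks 1 to 6).
per_pixel = [5, 5, 7, 6, 3, 4]

# In lockstep, the quad-wide count is just the per-pixel count times four,
# so it rises and falls with the per-pixel count instead of staying at 20.
per_quad = [n * PIXELS_PER_QUAD for n in per_pixel]

print(per_quad)
print(all(q == 20 for q in per_quad))
```

    So under this reading, one pipe cannot run fewer instructions while a neighbour in the same quad makes up the difference.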
     
  9. j^aws

    Veteran

    Joined:
    Jun 1, 2004
    Messages:
    1,909
    Likes Received:
    8
    Jawed's code snippet shows up to 7 instructions/cycle, and the other snippet up to 9, or am I missing something here? This is what's confusing, because it's quoted at 5 inst./cycle.
     
  10. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,298
    Likes Received:
    137
    Location:
    On the path to wisdom
    Who stated 5 instructions/cycle?

    The code snippet from Jawed shows 7 inst/cycle, yes, but you have to be careful about what you count as instructions, and what could be optimized away.
    For example, ABS before LOG is unnecessary, because LOG implies ABS. Also, ABS is just the simple operation of clearing the sign bit, and certainly implemented as an input modifier. So you can have as many ABS per clock as there are input registers used. Does it make sense to count MOVs? Because, depending on the architecture, MOV could be just input register selection. Likewise, the final MOV oC0, r0 probably does nothing at all.
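
    As a rough sketch of how much the count depends on this, the tally below treats ABS and MOV as free. That is only an assumption for illustration, following the reasoning above; it is not a statement about how NV40/G70 really handles them. The opcode lists are transcribed from Jawed's snippet, and FREE_OPS and counted_per_clock are made-up names.

```python
# Opcodes per clock, transcribed from Jawed's snippet (clocks 1 to 6).
CLOCKS = [
    ["texld", "madr", "nrm_pp", "dp3", "mul"],
    ["mul", "mul", "add", "mad", "mul"],
    ["rcp", "mul", "mul", "texld", "mad", "abs", "log"],
    ["rcp", "mul", "mul", "texld", "mad", "exp"],
    ["texld", "mad", "mul"],
    ["mul", "mad", "mov", "mov"],
]

# Assumption: ABS folds into an input modifier and MOV into register selection.
FREE_OPS = {"abs", "mov"}


def counted_per_clock(clocks, free_ops):
    """Count the ops in each clock, skipping those treated as free."""
    return [sum(1 for op in ops if op not in free_ops) for ops in clocks]


raw = counted_per_clock(CLOCKS, set())
adjusted = counted_per_clock(CLOCKS, FREE_OPS)

print(raw)       # every listed line counted
print(adjusted)  # ABS and MOV not counted
```

    With those two opcodes excluded, the busiest clock drops from 7 to 6 and the last clock from 4 to 2.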

    The snippet posted by Demirug only shows 3 inst/cycle.
     
  11. j^aws

    Veteran

    Joined:
    Jun 1, 2004
    Messages:
    1,909
    Likes Received:
    8
    [​IMG]

    http://www.hardocp.com/article.html?art=Nzg0LDM=

    This diagram shows 5 inst./cycle? It sounds like they're including the 16bit normalise instruction with this number?

    I'm trying to get an idea of what should be counted. It's like the IBM G5s: IIRC, they say 8 or 5 instructions/cycle depending on whether you include branches/loads/stores etc. So this seems similarly confusing, depending on what you count?
     
  12. ERP

    ERP Moderator
    Moderator Veteran

    Joined:
    Feb 11, 2002
    Messages:
    3,669
    Likes Received:
    49
    Location:
    Redmond, WA
    It's actually even more complex than that.

    There are limits on input and output register counts, how the ops use the execution units and how they are paired.

    Without very specific details on the pipeline, which NVidia are unlikely to make available, numbers like 3 or 7 are just simplifications.
     
  13. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,298
    Likes Received:
    137
    Location:
    On the path to wisdom
    It seems to me they count 4 madd/dp (dual- and co-issue) plus FP16 normalize, and the special functions "replacing" one of the co-issued instructions each (which is not entirely true, as you can still perform a madd on the SF result).
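
    Spelled out as arithmetic, that reading of the diagram looks like the sketch below. The split into two shader units with two co-issued instructions each is my interpretation of "dual- and co-issue", not a confirmed hardware spec, and all the names are made up for illustration.

```python
# Assumption: "dual-issue" means two shader units per fragment pipeline.
SHADER_UNITS = 2
# Assumption: "co-issue" means each unit can issue two madd/dp-class ops at once.
CO_ISSUE_PER_UNIT = 2
# The FP16 normalize is counted as one extra instruction on top.
FP16_NORMALIZE = 1

madd_dp_slots = SHADER_UNITS * CO_ISSUE_PER_UNIT
quoted_peak = madd_dp_slots + FP16_NORMALIZE

print(madd_dp_slots, quoted_peak)
```

    A special function would take the place of one of the four co-issued slots in this count, which leaves the total at 5.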
     
  14. overclocked

    Veteran

    Joined:
    Oct 25, 2002
    Messages:
    1,317
    Likes Received:
    6
    Location:
    Sweden
    What exactly can the "fog-ALU" do, and what are the pros/cons of having one of the ALUs tied to texture instructions vs. not, as ATI does?

    I know it's a matter of tradeoffs in transistor budgets and the like, but shouldn't a G70 with 2 twin ALUs + a third identical one with texture be more transistor-efficient than adding 2 quads as the current G70 does?
     
  15. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    It seems not all texture operations block ALU1 in NV40/G70 and even when they do, the block is only for a limited time (1 cycle?).

    Will have to wait until one of the real experts comes along...

    Jawed
     
  16. overclocked

    Veteran

    Joined:
    Oct 25, 2002
    Messages:
    1,317
    Likes Received:
    6
    Location:
    Sweden
    Yeah Jawed, and regarding the fog-ALU, I hope someone can answer.

    It just seems better from a transistor-budget standpoint to add one ALU to each pipe instead of adding quads.
     
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
  18. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    little OT: so I assume they started to release some in depth info about RSX..:)
     