The NEXT LAST R600 Rumours & Speculation Thread

Discussion in 'Pre-release GPU Speculation' started by Geo, Mar 1, 2007.

Thread Status:
Not open for further replies.
  1. Geeforcer

    Geeforcer Harmlessly Evil
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,320
    Likes Received:
    525
    I am pretty sure that's the number they gave for "DX10" cards, which presumably included 8400/8600 as well.
     
  2. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,245
    Likes Received:
    4,465
    Location:
    Finland
    Well, the 8400/8600 have only been available for less than a month if I'm not mistaken, so they couldn't have sold too many of 'em yet?
     
  3. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY

    Dell, HP and others have pretty big deals with those cards :wink:
     
  4. AnarchX

    Veteran

    Joined:
    Apr 19, 2007
    Messages:
    1,559
    Likes Received:
    34
    OEM deals could be the answer... :wink:
     
  5. IbaneZ

    Regular

    Joined:
    Apr 15, 2003
    Messages:
    743
    Likes Received:
    17
  6. Arty

    Arty KEPLER
    Veteran

    Joined:
    Jun 16, 2005
    Messages:
    1,906
    Likes Received:
    55
    Err..
    Yup, you didn't mention anything related to performance. And the bet is already invalid?

    :lol:
     
  7. ants

    Newcomer

    Joined:
    Feb 10, 2006
    Messages:
    44
    Likes Received:
    3
    Didn't see this posted yet, sorry if a repost.

    http://www.pcadvisor.co.uk/reviews/index.cfm?reviewid=834

    PC Advisor semi review, no scores but a very odd conclusion...

    Something is very broken it seems...

    EDIT: I'm also seeing the R600 pop up in Canada but for about $550...
     
    #5187 ants, May 11, 2007
    Last edited by a moderator: May 11, 2007
  8. Geeforcer

    Geeforcer Harmlessly Evil
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,320
    Likes Received:
    525
    The price comment (cheaper than GTS 640) seems strange, but then again, I don't know how much the GTS 640 costs in the UK.
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    There's no data movement, as such, though. The GPU merely needs to track the status of each thread (meaning batch). Flags and predicate registers are persistently located in batch-specific areas, regardless of whether a thread is in-ALU or waiting its turn. A copy will be brought into the ALU pipeline, but then everything is copied into an ALU pipeline in order for it to function.

    R600 may be different of course, because I'm talking mostly in terms of how I understand R5xx.

    For example D3D10 requires a GPU to support 4096 temporary registers per pixel. It's an insane number which to me implies that at some point that degree of swapping requires registers to be shunted out to video memory. This is something that R5xx doesn't do - the hard limit on registers per pixel (128) is traded-off against the number of batches and that's the end of it.
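
    A rough sketch of that registers-versus-batches trade-off. Only the 128-register limit comes from the text above; the register-file capacity and the batch size are made-up numbers for illustration, not R5xx specifications:

```python
# Hypothetical figures for illustration only; not taken from any ATI document.
REGISTER_FILE_ENTRIES = 32768   # assumed total per-pixel register slots on chip
PIXELS_PER_BATCH = 48           # assumed batch size
MAX_REGISTERS_PER_PIXEL = 128   # the R5xx hard limit quoted in the post


def batches_in_flight(registers_per_pixel: int) -> int:
    """Return how many batches fit in the register file at once."""
    if not 1 <= registers_per_pixel <= MAX_REGISTERS_PER_PIXEL:
        raise ValueError("register count outside the hardware limit")
    registers_per_batch = registers_per_pixel * PIXELS_PER_BATCH
    return REGISTER_FILE_ENTRIES // registers_per_batch


if __name__ == "__main__":
    for regs in (4, 16, 64, 128):
        print(regs, "registers/pixel ->", batches_in_flight(regs), "batches")
```

    With these assumed sizes, a shader using the full 128 registers leaves room for only a handful of batches, which is the "end of it" case: nothing spills to video memory, there are simply fewer batches available to hide latency.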

    I think you'll find batch-swapping is as fine-grained as the code and population of batches requires. e.g. if you calculate a texture address, look-up that texture and then use that texture result to calculate another texture address, before looking up the second texture, there's no way to avoid those 4 fine-grained swap events. If you don't swap you're facing a monster amount of texturing latency.

    SM2 allows 32 texture fetches and 64 ALU instructions. It's 2.0a/b and SM3 that went dramatically beyond this. (Though there's the f-buffer in R3xx and R4xx GPUs which meant you could run SM2 programs one after the other, whilst keeping the register values from one program into the next - erm, that's the way I understand it, anyway.)

    The swap is much less arduous than you imagine!

    I tend to refer to these as "clauses", indicating portions of code that can execute without interruption. If you play with ATI's GPU Shader Analyzer you can see how code gets broken up into sections of ALU and TEX operations:

    Code:
    Shader stats:
         RS Instructions:         8
         TEX Instructions:        7
         ALU Instructions:       52
         ALU Instruction slots:  52
         CF Instructions:         0
         Pix Size:               12
         Highest Const:          31
         Start Addr:              0
         End Addr:               58
     
     RS Instructions:
     
       rs 00:                            r00.rg-- = txc00
       rs 01:                            r01.rg-- = txc01
       rs 02:                            r02.rgba = txc02 adjusted
       rs 03:                            r03.rgb- = txc03
       rs 04:                            r04.rgb- = txc04
       rs 05:                            r05.rgb- = txc05
       rs 06:                            r06.rgb- = txc06
       rs 07:                            r07.rgb- = txc07 adjusted
     
     US Program:
     
      0 tex 00    :  r01.rgb_ = lookup(r01.rgrr, tex01) ign_unc
      1 tex 01    :  r00.rgba = lookup(r00.rgrr, tex00) ign_unc
      2 tex 02    :  r07.rgb_ = lookup(r07.rgbr, tex03) sem_wait sem_grab ign_unc
      3 alu 00 rgb:             r08.--b = dp3(r04.rgb, r04.rgb)
             alpha:             r01.a   = clamped mad(r02.a, 1.0, 0.0)  
      4 alu 01 rgb:             r08.--b = dp3(r03.rgb, r03.rgb)
             alpha:             r02.a   = rsq(abs(r08.b))  
      5 alu 02 rgb:             r08.r-- = dp3(r05.rgb, r05.rgb)
             alpha:             r03.a   = rsq(abs(r08.b))  
      6 alu 03 rgb:             r08.-g- = dp3(r06.rgb, r06.rgb)
             alpha:             r04.a   = rsq(abs(r08.r))  
      7 alu 04 rgb:             r08.rgb = mad(r01.rgb, 1.0, c11.rrr)*2 sem_wait
             alpha:             r05.a   = rsq(abs(r08.g))  
      8 alu 05 pre:  srcp.rgb = 1.0-2.0*r01.rgb
      8 alu 05 rgb:             r01.rgb = cmp(neg(srcp.rgb), r08.rgb, nab(c10.rrr))
             alpha:             r06.a   = rcp(r02.a)  
      9 alu 06 rgb:             r08.rg- = mad(r06.a0.0r, c01.g0.00.0, (+2.5000000E-01).0.0ar)
             alpha:             r06.a   = rcp(r03.a)  
      10 alu 07 rgb:             r09.rg- = mad(r06.a0.0r, c01.r0.00.0, (+2.5000000E-01).0.0ar)
             alpha:             r06.a   = rcp(r04.a)  
      11 alu 08 rgb:             r10.rg- = mad(r06.a0.0r, c01.b0.00.0, (+2.5000000E-01).0.0ar)
             alpha:             r06.a   = rcp(r05.a)  
      12 alu 09 rgb:             r11.rg- = mad(r06.a0.0r, c01.a0.0r, (+2.5000000E-01).0.0ar)
             alpha:             r06.a   = mad(r01.b, r01.b, 0.0)  
      13 tex 03    :  r08.r___ = lookup(r08.rgrr, tex02) alu_wait ign_unc
      14 tex 04    :  r09.r___ = lookup(r09.rgrr, tex02) ign_unc
      15 tex 05    :  r10.r___ = lookup(r10.rgrr, tex02) ign_unc
      16 tex 06    :  r11.rrr_ = lookup(r11.rgrr, tex02) sem_wait sem_grab ign_unc
      17 alu 10 rgb:             r12.r-- = d2a(r01.rg0.0, r01.rg0.0, r06.rra)
             alpha:             r06.a   = mad(r02.b, r02.b, 0.0)  
      18 alu 11 rgb:             r04.rgb = mad(r04.rgb, r02.aaa, 0.0)
             alpha:             r02.a   = rsq(r12.r)  
      19 alu 12 rgb:             r01.rgb = mad(r01.rgb, r02.aaa, 0.0)
             alpha:             r02.a   = mad(r02.g, r02.g, r06.a)  
      20 alu 13 rgb:             r08.-g- = dp3(r04.rgb, r01.rgb)*2
             alpha:             r02.a   = mad(r02.r, r02.r, r02.a)  
      21 alu 14 rgb:             r03.rgb = mad(r03.rgb, r03.aaa, 0.0)
             alpha:             r02.a   = rsq(r02.a)  
      22 alu 15 rgb:             r04.rgb = mad(r01.rgb, r08.ggg, neg(r04.rgb))
             alpha:             r03.a   = clamped mad(r08.g, 1.0, 0.0)/2 
      23 alu 16 rgb:             r02.rgb = mad(r02.rgb, r02.aaa, 0.0) sem_wait
             alpha:             r02.a   = mad(r08.r, r03.a, 0.0)*2 
      24 alu 17 rgb:             r08.-g- = dp3(r01.rgb, r03.rgb)*2
             alpha:                       mad(0.0, 0.0, 0.0)  
      25 alu 18 rgb:             r05.rgb = mad(r05.rgb, r04.aaa, 0.0)
             alpha:             r03.a   = clamped mad(r08.g, 1.0, 0.0)/2 
      26 alu 19 rgb:             r04.r-- = clamped dp3(r04.rgb, r02.rgb)
             alpha:                       mad(0.0, 0.0, 0.0)  
      27 alu 20 rgb:             r03.rgb = mad(r01.rgb, r08.ggg, neg(r03.rgb))
             alpha:                       mad(0.0, 0.0, 0.0)  
      28 alu 21 rgb:             r08.--b = dp3(r01.rgb, r05.rgb)*2
             alpha:             r06.a   = ln2(r04.r)  
      29 alu 22 rgb:             r04.rgb = mad(r06.rgb, r05.aaa, 0.0)
             alpha:             r04.a   = mad(r06.a, c14.r, 0.0)  
      30 alu 23 rgb:             r03.--b = clamped dp3(r02.rgb, r03.rgb)
             alpha:             r04.a   = ex2(r04.a)*2 
      31 alu 24 rgb:             r05.rgb = mad(r01.rgb, r08.bbb, neg(r05.rgb))
             alpha:             r04.a   = mad(r08.r, r04.a, 0.0)  
      32 alu 25 rgb:             r03.r-- = dp3(r01.rgb, r04.rgb)*2
             alpha:             r05.a   = ln2(r03.b)  
      33 alu 26 rgb:             r03.--b = clamped dp3(r02.rgb, r05.rgb)
             alpha:             r07.a   = mad(r05.a, c14.r, 0.0)  
      34 alu 27 rgb:             r01.rgb = mad(r01.rgb, r03.rrr, neg(r04.rgb))
             alpha:             r11.a   = ln2(r03.b)  
      35 alu 28 rgb:             r03.--b = clamped dp3(r02.rgb, r01.rgb)
             alpha:             r07.a   = ex2(r07.a)  
      36 alu 29 rgb:             r01.rgb = mad(r00.rgb, c03.rgb, 0.0)
             alpha:             r11.a   = mad(r11.a, c14.r, 0.0)  
      37 alu 30 rgb:             r02.rgb = mad(r00.rgb, r03.aaa, 0.0)
             alpha:             r03.a   = clamped mad(r08.b, 1.0, 0.0)/2 
      38 alu 31 rgb:             r01.rgb = mad(r02.aaa, r01.rgb, 0.0)
             alpha:             r02.a   = ln2(r03.b)  
      39 alu 32 rgb:             r02.rgb = mad(r02.rgb, c02.rgb, 0.0)*2
             alpha:             r09.a   = mad(r04.a, r00.a, 0.0)  
      40 alu 33 rgb:             r04.rgb = mad(r00.rgb, r03.aaa, 0.0)
             alpha:             r04.a   = mad(r09.r, r07.a, 0.0)  
      41 alu 34 rgb:             r05.rgb = mad(c07.rgb, r09.aaa, 0.0)
             alpha:             r05.a   = ex2(r11.a)  
      42 alu 35 rgb:             r06.rgb = mad(c06.rgb, r04.aaa, 0.0)*2
             alpha:             r02.a   = mad(r02.a, c14.r, 0.0)  
      43 alu 36 rgb:             r01.rgb = mad(r09.rrr, r02.rgb, r01.rgb)
             alpha:             r04.a   = ex2(r02.a)  
      44 alu 37 rgb:             r04.rgb = mad(r04.rgb, c04.rgb, 0.0)*2
             alpha:             r06.a   = clamped mad(r03.r, 1.0, 0.0)/2 
      45 alu 38 rgb:             r02.rgb = mad(r00.rgb, r06.aaa, 0.0)
             alpha:             r02.a   = mad(r10.r, r05.a, 0.0)  
      46 alu 39 rgb:             r03.rgb = mad(r00.aaa, r06.rgb, r05.rgb)
             alpha:                       mad(0.0, 0.0, 0.0)  
      47 alu 40 rgb:             r05.rgb = mad(c08.rgb, r02.aaa, 0.0)*2
             alpha:                       mad(0.0, 0.0, 0.0)  
      48 alu 41 rgb:             r06.rgb = mad(r04.aaa, c09.rgb, 0.0)
             alpha:                       mad(0.0, 0.0, 0.0)  
      49 alu 42 rgb:             r07.rgb = mad(r11.rgb, r07.rgb, 0.0)
             alpha:                       mad(0.0, 0.0, 0.0)  
      50 alu 43 rgb:             r02.rgb = mad(r02.rgb, c05.rgb, 0.0)*2
             alpha:                       mad(0.0, 0.0, 0.0)  
      51 alu 44 rgb:             r01.rgb = mad(r10.rrr, r04.rgb, r01.rgb)
             alpha:                       mad(0.0, 0.0, 0.0)  
      52 alu 45 rgb:             r03.rgb = mad(r00.aaa, r05.rgb, r03.rgb)
             alpha:                       mad(0.0, 0.0, 0.0)  
      53 alu 46 rgb:             r04.rgb = mad(r07.rgb, r06.rgb, 0.0)*2
             alpha:                       mad(0.0, 0.0, 0.0)  
      54 alu 47 rgb:             r01.rgb = mad(r02.rgb, r07.rgb, r01.rgb)
             alpha:                       mad(0.0, 0.0, 0.0)  
      55 alu 48 rgb:             r02.rgb = mad(r00.aaa, r04.rgb, r03.rgb)
             alpha:                       mad(0.0, 0.0, 0.0)  
      56 alu 49 rgb:             r01.rgb = mad(r00.rgb, c00.rgb, r01.rgb)
             alpha:                       mad(0.0, 0.0, 0.0)  
      57 alu 50 rgb:             r01.rgb = clamped mad(r02.rgb, 1.0, r01.rgb)
             alpha:                       mad(0.0, 0.0, 0.0)  
       alu 50 post-NOP
      58 alu 51 pre:  srcp.rgb = r01.rgb-c31.rgb
      58 alu 51 rgb:  out0.rgb =           mad(srcp.rgb, r01.aaa, c31.rgb)
             alpha:  out0.a   =           mad(c00.a, 1.0, 0.0)  last
    The RS instructions (rasteriser) are where vertex attributes are interpolated per pixel. So that's custom code that's executed to generate those 8 texture coordinates listed there. This is run in the shader pipe interpolator.

    The US program (user specified, I guess that means) is the meat of it. You can see there are two periods of texture-fetching, the first right at the beginning before any ALU instructions. Looking at the code, though, it appears that the TU and ALU can start working concurrently, because it's not until ALU instruction 04 that a texture result is required.
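
    To make the clause idea concrete, here is a small sketch that groups the instruction types from the listing above into runs. It is only a toy: it splits purely on instruction type, whereas the real sequencer works from the wait/semaphore flags in the code, and the function name is invented for this example.

```python
from itertools import groupby

# Instruction types in program order, transcribed from the US program listing:
# tex 00-02, alu 00-09, tex 03-06, alu 10-51.
PROGRAM = ["tex"] * 3 + ["alu"] * 10 + ["tex"] * 4 + ["alu"] * 42


def split_into_clauses(kinds):
    """Group consecutive instructions of the same type into (type, count) runs."""
    return [(kind, len(list(run))) for kind, run in groupby(kinds)]


if __name__ == "__main__":
    for kind, count in split_into_clauses(PROGRAM):
        print(f"{kind.upper()} clause: {count} instructions")
```

    The four runs add up to the 7 TEX and 52 ALU instructions reported in the shader stats.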

    I expect R600 will use predication, just like R5xx does - see the CTM document, it quickly gets hairy.

    G80 seems to use predication for very short clauses of code (6/7 instructions seems to be the limit if I remember right from the CUDA documentation). Thereafter, it splits the batch up into separate instruction streams. So if half the batch wants to loop 12 times and the other half only once, then that'll cause two instruction streams to be forked. The warp size remains unchanged, so half of the pixels in each will be empty (logically speaking: predicated-out).
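
    A toy model of the forking described above, to put a number on the wasted lanes. This is a sketch of the behaviour as described in this post, not of G80's actual mechanism; the warp width of 32 and both function names are assumptions made for the example.

```python
from collections import Counter

WARP_SIZE = 32  # assumed width; the post gives no figure


def fork_streams(trip_counts):
    """Fork one instruction stream per distinct loop trip count.

    Each stream still occupies the full warp width; lanes that belong to a
    different stream are predicated out. Returns a list of
    (trip_count, active_lanes) pairs, longest loop first.
    """
    if len(trip_counts) != WARP_SIZE:
        raise ValueError("expected one trip count per lane")
    if min(trip_counts) < 1:
        raise ValueError("every lane must run the loop at least once")
    return sorted(Counter(trip_counts).items(), reverse=True)


def lane_utilisation(streams):
    """Fraction of issued lane slots that do useful work across all streams."""
    issued = sum(trips * WARP_SIZE for trips, _ in streams)
    useful = sum(trips * active for trips, active in streams)
    return useful / issued


if __name__ == "__main__":
    # Half the batch loops 12 times, the other half only once.
    streams = fork_streams([12] * 16 + [1] * 16)
    print(streams, lane_utilisation(streams))
```

    For the 12-versus-1 case this gives two streams with 16 live lanes each, so half of every issued warp is predicated out.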

    I don't understand the reason for G80 to do this. I'd thought about starting a thread on the topic. It seems to be very much about this problem of instruction-caching/paging and the associated work for constants (remembering that constants in a loop might be indexed by the loop counter).

    Don't understand the question.

    That code snippet above has some interesting "meta" stuff in it. See the "*_wait" and "sem_*" instructions, for example. Those are instructions to the sequencer announcing clause boundaries, effectively. But also capable of announcing what instructions (in both the ALU and TU) can overlap.

    This rambling is fun.

    Jawed
     
  10. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY

    There shouldn't be much advantage to doing scalar-only shaders, as the compiler should be able to take care of that, but the point you and Kaotik make is very valid.
     
  11. Frank

    Frank Certified not a majority
    Veteran

    Joined:
    Sep 21, 2003
    Messages:
    3,187
    Likes Received:
    59
    Location:
    Sittard, the Netherlands
    Yes, I'm busy reading them myself, but I didn't get to the interesting stuff yet.

    Thanks. So, pixel targets are still bound to execution units.
     
  12. Unknown Soldier

    Veteran

    Joined:
    Jul 28, 2002
    Messages:
    4,047
    Likes Received:
    1,670
  13. IbaneZ

    Regular

    Joined:
    Apr 15, 2003
    Messages:
    743
    Likes Received:
    17
    Where's the chart? :smile:
     
  14. Frank

    Frank Certified not a majority
    Veteran

    Joined:
    Sep 21, 2003
    Messages:
    3,187
    Likes Received:
    59
    Location:
    Sittard, the Netherlands
    In what way?

    I'm trying, in between stuff.

    In R3x0-R5x0, all pixels are divided into quads/areas, in which each pixel is always calculated by the exact same execution unit (single fragment SIMD pipe). Or, in other words, all the shader pipe outputs are always written to the same locations in the framebuffer, with only a single offset for the whole block.

    So, pixel (0,0) on the screen is always calculated by fragment pipe 0.
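
    A minimal sketch of what such a fixed mapping could look like. The tile size, the number of quad pipes and the interleave pattern are all invented for illustration and are not the actual R3x0-R5x0 layout; the only point is that the pipe is a pure function of screen position.

```python
# Hypothetical layout for illustration; real tile sizes and pipe counts differ.
TILE_SIZE = 16      # assumed screen tile edge, in pixels
NUM_QUAD_PIPES = 4  # assumed number of quad pipes


def pipe_for_pixel(x: int, y: int):
    """Return (quad_pipe, lane) for a screen pixel under a fixed tiled mapping.

    Tiles are dealt out to quad pipes in a repeating pattern, and the lane
    within the pipe depends only on the pixel's position inside its 2x2 quad.
    """
    tile_x, tile_y = x // TILE_SIZE, y // TILE_SIZE
    quad_pipe = (tile_x + 2 * tile_y) % NUM_QUAD_PIPES
    lane = (x % 2) + 2 * (y % 2)
    return quad_pipe, lane


if __name__ == "__main__":
    for pixel in [(0, 0), (16, 0), (0, 16), (17, 17)]:
        print(pixel, "->", pipe_for_pixel(*pixel))
```

    Because nothing in the function depends on the workload, the same pixel lands on the same unit every frame, which is the "bound to execution units" property being discussed.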

    It sure is! ;)
     
  15. XMAN26

    Banned

    Joined:
    Feb 17, 2003
    Messages:
    702
    Likes Received:
    1
    Monday is going to be such a nice day, so we can see real game benches and not leaked crapmark scores. IMHO, crapmark has been useless as a gauge of gaming performance for about 3 versions now.
     
  16. Unknown Soldier

    Veteran

    Joined:
    Jul 28, 2002
    Messages:
    4,047
    Likes Received:
    1,670
  17. Evildeus

    Veteran

    Joined:
    May 24, 2002
    Messages:
    2,657
    Likes Received:
    2
    It's a magazine, no? So you should find them in the magazine.
     
  18. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,245
    Likes Received:
    4,465
    Location:
    Finland
  19. vertex_shader

    Banned

    Joined:
    Sep 8, 2006
    Messages:
    961
    Likes Received:
    14
    Location:
    Far far away
  20. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Every clock the ALU pipe has to get the operands and constants as sources for its operations. Every clock the result has to be written to a register.

    So, when a batch swap is done, it's transparent.

    Now, it might be that R600 uses a "localised" memory system.

    In fact, thinking about this has reminded me of the rather tortuous patent application:

    Method and apparatus for managing tasks in a multiprocessor system

    I can think of too many different interpretations of this to really come to any decent conclusions. But note that "blocks of code" and "blocks of operands" seem to be the order of the day.

    So, batches may actually carry more context into the ALU (and TU) pipes than I've been surmising. The document certainly describes a degree of context that amounts to a portion of code. You have to be careful reading this because the task control processors, themselves, are programmable (what I've tended to refer to as sequencers), having their own registers, instruction cache and flags :!:

    See if you can make out anything specific.

    Gotcha.

    Jawed
     
