AMD: R7xx Speculation

Discussion in 'Architecture and Products' started by Unknown Soldier, May 18, 2007.

Thread Status:
Not open for further replies.
  1. Sunday

    Newcomer

    Joined:
    Feb 6, 2002
    Messages:
    194
    Likes Received:
    6
    Location:
    GMT+1
    The only excuse I can find for people who use the default Crysis flyby benchmark to evaluate GPU performance in this game is total ignorance!

    It is a pure texture-streaming demo, nowhere near the experience of the actual game.
     
  2. ChronoReverse

    Newcomer

    Joined:
    Apr 14, 2004
    Messages:
    245
    Likes Received:
    1
    Which is what makes the numbers even more interesting IF REAL. It's already known that the ATI solutions are more texture limited than the Nvidia ones.
     
  3. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    True, but it does make for even more architecture-specific code.

    And branch coherence goes *poof* :smile:
     
  4. A.L.M.

    Newcomer

    Joined:
    Jun 2, 2008
    Messages:
    144
    Likes Received:
    0
    Location:
    Looking for a place to call home
    I don't think R600/RV670 performance matters in this case...
    Well, if RV770 is going to be a great part, I would rather say that R600 was simply a poor implementation of an extremely good architecture, because I suppose it's easier to create a good GPU than a scalable architecture.
    The huge problem of R600 was the timing, actually. Think about it: NVidia managed to get good reviews for G200, even though within maybe 10 days it will be outperformed by two upper-mid-range cards that cost almost half as much as one GTX280... Imagine if G200 had launched in July. Maybe no one would even have cared about it. :wink:
     
  5. A.L.M.

    Newcomer

    Joined:
    Jun 2, 2008
    Messages:
    144
    Likes Received:
    0
    Location:
    Looking for a place to call home
    But it makes two cards comparable, in a way that is maybe nearer to real gaming than any other synthetic bench.
    No one said that Crysis will be playable if a card (or two) scores 23fps in the bench. :wink:
     
  6. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    At a high level, it would ignore ILP considerations entirely. Just write a program for one element, then try to throw as many of them as you can into a bucket and have at it, no visible SIMD or ILP at all.

    The one branch unit per 5 ALUs would be a problem in addition to other already outlined conditions, though.

    It was claimed in an earlier ATI interview that large batch sizes didn't hurt performance all that much. Maybe we should put that claim to the test. ;)
     
  7. Arty

    Arty KEPLER
    Veteran

    Joined:
    Jun 16, 2005
    Messages:
    1,906
    Likes Received:
    55
    I feel that the 280 will drop in price by $50-100 right before the launch of R700. Until then, they can milk it, even though it might have meant getting a bad rap in the initial reviews.
     
  8. jimmyjames123

    Regular

    Joined:
    Apr 14, 2004
    Messages:
    810
    Likes Received:
    3
    Again you are missing the point. If I can get up to 2x the performance with GTX 280 vs HD 4850, and only 10 or 20% higher idle power consumption (or whatever it is), that means that the GTX 280 is actually more efficient in this sense. :)
     
    #3808 jimmyjames123, Jun 17, 2008
    Last edited by a moderator: Jun 17, 2008
  9. Sunday

    Newcomer

    Joined:
    Feb 6, 2002
    Messages:
    194
    Likes Received:
    6
    Location:
    GMT+1
    Yeah, but with this approach the review media are spreading a skewed picture of Crysis playability on a particular card.
    So far I have only noticed TR adopting "in-game" performance measurement, and the scores from their GTX 280 review are extremely interesting if you compare GTX260 vs. HD3870 X2!
    [IMG]
    With all the info about RV770 that we know so far, HD4850 could very well be faster than HD3870 X2! This means that a 200 USD HD4850 could be a better choice for Crysis than a 400 USD GTX260! Wouldn't that be a TWI-NOT-MTBP :D
     
  10. leoneazzurro

    Regular

    Joined:
    Nov 3, 2005
    Messages:
    518
    Likes Received:
    25
    Location:
    Rome, Italy
    Then we have 80 TU total on both sides (80 in the GTX280 @ 602 MHz, 40+40 in the two 4850s @ 625 MHz). Aggregate bandwidth is higher on the GTX280, yet the result favors the 4850s by more than the slight clock difference would suggest, and CF scaling is surely not 100%. If this is a texture streaming bench, then it seems ATI got their texturing capabilities really right this time.
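
    As a rough check of the clock argument above, here is a minimal sketch of peak texel rate. It assumes one texel per texture unit per clock and perfect CF scaling, both simplifications rather than measured behavior; the function name is made up for illustration.

```python
def peak_texel_rate_mtexels(texture_units: int, core_clock_mhz: float) -> float:
    """Peak texel rate in Mtexels/s, ASSUMING one texel per unit per clock."""
    return texture_units * core_clock_mhz


# Unit counts and clocks as quoted in the post.
gtx280 = peak_texel_rate_mtexels(80, 602)
hd4850_cf = peak_texel_rate_mtexels(40 + 40, 625)  # ASSUMES 100% CF scaling

# How much of an edge the clock difference alone would give the 4850 pair.
clock_advantage = hd4850_cf / gtx280

print(f"GTX280:     {gtx280:.0f} Mtexels/s")
print(f"2x HD4850:  {hd4850_cf:.0f} Mtexels/s")
print(f"Clock-only advantage: {(clock_advantage - 1) * 100:.1f}%")
```

    Under those assumptions the clock difference alone is worth under 4%, so a larger gap in the bench would have to come from somewhere other than raw unit count times clock.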
     
  11. jimmyjames123

    Regular

    Joined:
    Apr 14, 2004
    Messages:
    810
    Likes Received:
    3
    If you don't understand it, then you go away, thanks. This isn't rocket science here buddy :)
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    As Brook+ gets better I presume this kind of packing will take place without the developer having to futz so much. It's pretty hard though, maybe it won't work out like that...

    It's about operand bandwidth I think. Basically only 4 different scalars can be read from the register file (r0.w, r2.x, r3.z, r3.w, say) per clock (and 3 clocks are spent fetching operands). The remaining operands must be either literals, constants, cache-constants or "previous" resultants that have been retained within the ALU pipe from a prior instruction.

    There are more restrictions once you take account of all the operands that can be fetched over 3 clocks to feed the ALUs - but those rules are way too hairy - the R600 ISA document spends pages on it, and appears to come to the conclusion that "most of the time you won't notice a problem".

    The operands for the transcendental lane, if they're from the register file, must be in the population fetched for the other lanes, or the transcendental instruction can't be issued.

    Otherwise, swizzling, per se, appears to be completely free of constraints.
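
    To make the headline operand limit concrete, here is a deliberately crude sketch. It models only the "4 register-file scalars per clock over 3 fetch clocks" budget described above and ignores all the hairier rules, so it is not the real R600 scheduling check. The function name and the operand-string convention are assumptions for illustration.

```python
def fits_gpr_read_budget(operands, scalars_per_clock=4, fetch_clocks=3):
    """Return True if the distinct register-file scalars fit the read budget.

    operands: iterable of strings such as "r0.w", "r3.z", "cb0[0].x", "l0.x".
    ASSUMPTION: only names starting with "r" count as register-file reads;
    literals, constants and forwarded results are treated as free here.
    """
    gpr_scalars = {op for op in operands if op.startswith("r")}
    return len(gpr_scalars) <= scalars_per_clock * fetch_clocks


# The four scalars used as an example in the post, plus a constant and a literal.
ok = fits_gpr_read_budget(["r0.w", "r2.x", "r3.z", "r3.w", "cb0[0].y", "l0.x"])

# Thirteen distinct register-file scalars exceed the 4 x 3 budget in this model.
too_many = fits_gpr_read_budget([f"r{i}.x" for i in range(13)])

print(ok, too_many)
```

    Repeated reads of the same scalar are counted once, which matches the "different scalars" wording but is otherwise a guess at how the budget is applied.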

    I'm not really sure what you're saying here. When R6xx issues a batch, at minimum it allocates two vec4 fp32 registers (256 bits) of register file. So, ahem, if you choose not to use all 8 of those scalars for your kernel then you're wasting register file.

    Beyond that it's really a matter of not using so many vec4 registers per batch that you no longer have enough batches in flight to hide memory latency.

    But yeah, the code does get a bit hairy and long if you have a kernel that calculates 10s of elements per invocation.

    This is the optimised double precision matrix multiply CAL source:

    Code:
     
    il_ps_2_0
    dcl_cb cb0[2]
    dcl_input_position_interp(linear_noperspective)_centered_center vWinCoord0.xy__
    dcl_resource_id(0)_type(2d)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
    dcl_resource_id(1)_type(2d)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
    dcl_resource_id(2)_type(2d)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
    dcl_resource_id(3)_type(2d)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
    dcl_resource_id(4)_type(2d)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
    dcl_resource_id(5)_type(2d)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
    dcl_resource_id(6)_type(2d)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
    dcl_resource_id(7)_type(2d)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
    dcl_resource_id(8)_type(2d)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
    dcl_resource_id(9)_type(2d)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
    dcl_literal l0, 0x00000000, 0x00000000, 0x00000000, 0x00000000
    mov r3, l0
    mov r4, l0
    mov r5, l0
    mov r6, l0
    mov r7, l0
    mov r8, l0
    mov r9, l0
    mov r10, l0
    mul r0.xyz_, vWinCoord0.xyyx, cb0[0].zwxz
    mov r1.__zw, r0.zzzx
    mov r2._yz_, r1.zzwz
    mov r2.___w, r0.y
    mov r11.x___, l0
    whileloop
        itof r11._y__, r11.x
        ge r11._y__, r11.y, cb0[1].x
        break_logicalnz r11.y
        mov r0.___w, r2.w
        add r1._y__, r2.w, cb0[0].y
        add r1.x___, r1.y, cb0[0].y
        add r2.x___, r1.x, cb0[0].y
        add r2.___w, r2.x, cb0[0].y
        sample_resource(8)_sampler(8) r21, r0.xwww
        sample_resource(9)_sampler(9) r22, r0.xwww
        sample_resource(8)_sampler(8) r23, r1.wyww
        sample_resource(9)_sampler(9) r24, r1.wyww
        sample_resource(8)_sampler(8) r25, r1.wxww
        sample_resource(9)_sampler(9) r26, r1.wxww
        sample_resource(8)_sampler(8) r27, r2.zxzz
        sample_resource(9)_sampler(9) r28, r2.zxzz
        sample_resource(0)_sampler(0) r13, r0.wzww
        sample_resource(1)_sampler(1) r14, r0.wzww
        sample_resource(2)_sampler(2) r15, r0.wzww
        sample_resource(3)_sampler(3) r16, r0.wzww
        sample_resource(4)_sampler(4) r17, r0.wzww
        sample_resource(5)_sampler(5) r18, r0.wzww
        sample_resource(6)_sampler(6) r19, r0.wzww
        sample_resource(7)_sampler(7) r20, r0.wzww
        dmad r12.xy, r13.zw, r22.xy, r3.xy
        dmad r12.zw, r13.zw, r22.zw, r3.zw
        dmad r3.zw, r13.xy, r21.zw, r12.zw
        dmad r3.xy, r13.xy, r21.xy, r12.xy
        dmad r12.xy, r14.zw, r22.xy, r4.xy
        dmad r12.zw, r14.zw, r22.zw, r4.zw
        dmad r4.zw, r14.xy, r21.zw, r12.zw
        dmad r4.xy, r14.xy, r21.xy, r12.xy
        dmad r12.xy, r15.zw, r22.xy, r5.xy
        dmad r12.zw, r15.zw, r22.zw, r5.zw
        dmad r5.zw, r15.xy, r21.zw, r12.zw
        dmad r5.xy, r15.xy, r21.xy, r12.xy
        dmad r12.xy, r16.zw, r22.xy, r6.xy
        dmad r12.zw, r16.zw, r22.zw, r6.zw
        dmad r6.zw, r16.xy, r21.zw, r12.zw
        dmad r6.xy, r16.xy, r21.xy, r12.xy
        dmad r12.xy, r17.zw, r22.xy, r7.xy
        dmad r12.zw, r17.zw, r22.zw, r7.zw
        dmad r7.zw, r17.xy, r21.zw, r12.zw
        dmad r7.xy, r17.xy, r21.xy, r12.xy
        dmad r12.xy, r18.zw, r22.xy, r8.xy
        dmad r12.zw, r18.zw, r22.zw, r8.zw
        dmad r8.zw, r18.xy, r21.zw, r12.zw
        dmad r8.xy, r18.xy, r21.xy, r12.xy
        dmad r12.xy, r19.zw, r22.xy, r9.xy
        dmad r12.zw, r19.zw, r22.zw, r9.zw
        dmad r9.zw, r19.xy, r21.zw, r12.zw
        dmad r9.xy, r19.xy, r21.xy, r12.xy
        dmad r12.xy, r20.zw, r22.xy, r10.xy
        dmad r12.zw, r20.zw, r22.zw, r10.zw
        dmad r10.zw, r20.xy, r21.zw, r12.zw
        dmad r10.xy, r20.xy, r21.xy, r12.xy
        sample_resource(0)_sampler(0) r13, r1.yzyy
        sample_resource(1)_sampler(1) r14, r1.yzyy
        sample_resource(2)_sampler(2) r15, r1.yzyy
        sample_resource(3)_sampler(3) r16, r1.yzyy
        sample_resource(4)_sampler(4) r17, r1.yzyy
        sample_resource(5)_sampler(5) r18, r1.yzyy
        sample_resource(6)_sampler(6) r19, r1.yzyy
        sample_resource(7)_sampler(7) r20, r1.yzyy
        dmad r12.xy, r13.zw, r24.xy, r3.xy
        dmad r12.zw, r13.zw, r24.zw, r3.zw
        dmad r3.zw, r13.xy, r23.zw, r12.zw
        dmad r3.xy, r13.xy, r23.xy, r12.xy
        dmad r12.xy, r14.zw, r24.xy, r4.xy
        dmad r12.zw, r14.zw, r24.zw, r4.zw
        dmad r4.zw, r14.xy, r23.zw, r12.zw
        dmad r4.xy, r14.xy, r23.xy, r12.xy
        dmad r12.xy, r15.zw, r24.xy, r5.xy
        dmad r12.zw, r15.zw, r24.zw, r5.zw
        dmad r5.zw, r15.xy, r23.zw, r12.zw
        dmad r5.xy, r15.xy, r23.xy, r12.xy
        dmad r12.xy, r16.zw, r24.xy, r6.xy
        dmad r12.zw, r16.zw, r24.zw, r6.zw
        dmad r6.zw, r16.xy, r23.zw, r12.zw
        dmad r6.xy, r16.xy, r23.xy, r12.xy
        dmad r12.xy, r17.zw, r24.xy, r7.xy
        dmad r12.zw, r17.zw, r24.zw, r7.zw
        dmad r7.zw, r17.xy, r23.zw, r12.zw
        dmad r7.xy, r17.xy, r23.xy, r12.xy
        dmad r12.xy, r18.zw, r24.xy, r8.xy
        dmad r12.zw, r18.zw, r24.zw, r8.zw
        dmad r8.zw, r18.xy, r23.zw, r12.zw
        dmad r8.xy, r18.xy, r23.xy, r12.xy
        dmad r12.xy, r19.zw, r24.xy, r9.xy
        dmad r12.zw, r19.zw, r24.zw, r9.zw
        dmad r9.zw, r19.xy, r23.zw, r12.zw
        dmad r9.xy, r19.xy, r23.xy, r12.xy
        dmad r12.xy, r20.zw, r24.xy, r10.xy
        dmad r12.zw, r20.zw, r24.zw, r10.zw
        dmad r10.zw, r20.xy, r23.zw, r12.zw
        dmad r10.xy, r20.xy, r23.xy, r12.xy
        sample_resource(0)_sampler(0) r13, r1.xzxx
        sample_resource(1)_sampler(1) r14, r1.xzxx
        sample_resource(2)_sampler(2) r15, r1.xzxx
        sample_resource(3)_sampler(3) r16, r1.xzxx
        sample_resource(4)_sampler(4) r17, r1.xzxx
        sample_resource(5)_sampler(5) r18, r1.xzxx
        sample_resource(6)_sampler(6) r19, r1.xzxx
        sample_resource(7)_sampler(7) r20, r1.xzxx
        dmad r12.xy, r13.zw, r26.xy, r3.xy
        dmad r12.zw, r13.zw, r26.zw, r3.zw
        dmad r3.zw, r13.xy, r25.zw, r12.zw
        dmad r3.xy, r13.xy, r25.xy, r12.xy
        dmad r12.xy, r14.zw, r26.xy, r4.xy
        dmad r12.zw, r14.zw, r26.zw, r4.zw
        dmad r4.zw, r14.xy, r25.zw, r12.zw
        dmad r4.xy, r14.xy, r25.xy, r12.xy
        dmad r12.xy, r15.zw, r26.xy, r5.xy
        dmad r12.zw, r15.zw, r26.zw, r5.zw
        dmad r5.zw, r15.xy, r25.zw, r12.zw
        dmad r5.xy, r15.xy, r25.xy, r12.xy
        dmad r12.xy, r16.zw, r26.xy, r6.xy
        dmad r12.zw, r16.zw, r26.zw, r6.zw
        dmad r6.zw, r16.xy, r25.zw, r12.zw
        dmad r6.xy, r16.xy, r25.xy, r12.xy
        dmad r12.xy, r17.zw, r26.xy, r7.xy
        dmad r12.zw, r17.zw, r26.zw, r7.zw
        dmad r7.zw, r17.xy, r25.zw, r12.zw
        dmad r7.xy, r17.xy, r25.xy, r12.xy
        dmad r12.xy, r18.zw, r26.xy, r8.xy
        dmad r12.zw, r18.zw, r26.zw, r8.zw
        dmad r8.zw, r18.xy, r25.zw, r12.zw
        dmad r8.xy, r18.xy, r25.xy, r12.xy
        dmad r12.xy, r19.zw, r26.xy, r9.xy
        dmad r12.zw, r19.zw, r26.zw, r9.zw
        dmad r9.zw, r19.xy, r25.zw, r12.zw
        dmad r9.xy, r19.xy, r25.xy, r12.xy
        dmad r12.xy, r20.zw, r26.xy, r10.xy
        dmad r12.zw, r20.zw, r26.zw, r10.zw
        dmad r10.zw, r20.xy, r25.zw, r12.zw
        dmad r10.xy, r20.xy, r25.xy, r12.xy
        sample_resource(0)_sampler(0) r13, r2.xyxx
        sample_resource(1)_sampler(1) r14, r2.xyxx
        sample_resource(2)_sampler(2) r15, r2.xyxx
        sample_resource(3)_sampler(3) r16, r2.xyxx
        sample_resource(4)_sampler(4) r17, r2.xyxx
        sample_resource(5)_sampler(5) r18, r2.xyxx
        sample_resource(6)_sampler(6) r19, r2.xyxx
        sample_resource(7)_sampler(7) r20, r2.xyxx
        dmad r12.xy, r13.zw, r28.xy, r3.xy
        dmad r12.zw, r13.zw, r28.zw, r3.zw
        dmad r3.zw, r13.xy, r27.zw, r12.zw
        dmad r3.xy, r13.xy, r27.xy, r12.xy
        dmad r12.xy, r14.zw, r28.xy, r4.xy
        dmad r12.zw, r14.zw, r28.zw, r4.zw
        dmad r4.zw, r14.xy, r27.zw, r12.zw
        dmad r4.xy, r14.xy, r27.xy, r12.xy
        dmad r12.xy, r15.zw, r28.xy, r5.xy
        dmad r12.zw, r15.zw, r28.zw, r5.zw
        dmad r5.zw, r15.xy, r27.zw, r12.zw
        dmad r5.xy, r15.xy, r27.xy, r12.xy
        dmad r12.xy, r16.zw, r28.xy, r6.xy
        dmad r12.zw, r16.zw, r28.zw, r6.zw
        dmad r6.zw, r16.xy, r27.zw, r12.zw
        dmad r6.xy, r16.xy, r27.xy, r12.xy
        dmad r12.xy, r17.zw, r28.xy, r7.xy
        dmad r12.zw, r17.zw, r28.zw, r7.zw
        dmad r7.zw, r17.xy, r27.zw, r12.zw
        dmad r7.xy, r17.xy, r27.xy, r12.xy
        dmad r12.xy, r18.zw, r28.xy, r8.xy
        dmad r12.zw, r18.zw, r28.zw, r8.zw
        dmad r8.zw, r18.xy, r27.zw, r12.zw
        dmad r8.xy, r18.xy, r27.xy, r12.xy
        dmad r12.xy, r19.zw, r28.xy, r9.xy
        dmad r12.zw, r19.zw, r28.zw, r9.zw
        dmad r9.zw, r19.xy, r27.zw, r12.zw
        dmad r9.xy, r19.xy, r27.xy, r12.xy
        dmad r12.xy, r20.zw, r28.xy, r10.xy
        dmad r12.zw, r20.zw, r28.zw, r10.zw
        dmad r10.zw, r20.xy, r27.zw, r12.zw
        dmad r10.xy, r20.xy, r27.xy, r12.xy
        dcl_literal l1, 0x00000001, 0x00000001, 0x00000001, 0x00000001
        iadd r11.x___, r11.x, l1
    endloop
    dcl_output_generic o0
    dcl_output_generic o1
    dcl_output_generic o2
    dcl_output_generic o3
    dcl_output_generic o4
    dcl_output_generic o5
    dcl_output_generic o6
    dcl_output_generic o7
    mov o0, r3
    mov o1, r4
    mov o2, r5
    mov o3, r6
    mov o4, r7
    mov o5, r8
    mov o6, r9
    mov o7, r10
    ret_dyn
    end;
    
    It calculates 16 doubles per invocation and uses 30 vec4 registers (and one temporary vec4 register). That's 480 bytes per "element", or each batch has a state of 30720 bytes, which I think means there are 8 batches in flight at any one time (256KB of register file per SIMD). In R6xx assembly the ALU:TEX ratio of the compute loop is 3.3:1 (173 instruction groups: 133 ALU + 40 TEX).
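
    The arithmetic above can be reproduced with a short sketch. It assumes 16 bytes per vec4 fp32 register, 64 elements per batch, and 256KB of register file per SIMD, and that nothing other than register space limits batches in flight; the names are illustrative only.

```python
VEC4_BYTES = 16                    # four fp32 components per register
BATCH_ELEMENTS = 64                # ASSUMED elements per batch
REGISTER_FILE_BYTES = 256 * 1024   # ASSUMED register file per SIMD


def register_budget(vec4_regs_per_element: int):
    """Return (bytes per element, bytes per batch, whole batches that fit)."""
    bytes_per_element = vec4_regs_per_element * VEC4_BYTES
    bytes_per_batch = bytes_per_element * BATCH_ELEMENTS
    batches = REGISTER_FILE_BYTES // bytes_per_batch
    return bytes_per_element, bytes_per_batch, batches


per_element, per_batch, batches = register_budget(30)
alu_tex_ratio = 133 / 40  # ALU vs TEX instruction groups in the compute loop

print(per_element, per_batch, batches)  # 480 30720 8
print(f"ALU:TEX = {alu_tex_ratio:.1f}:1")
```

    By the same arithmetic, a kernel at the two-register minimum would leave room for 128 batches, which is the trade-off between registers per element and latency hiding described earlier in the post.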

    Jawed
     
  13. kyniskos

    Newcomer

    Joined:
    Jan 8, 2006
    Messages:
    55
    Likes Received:
    2
    Can it be that the yields are so good for AMD that they are in fact releasing the very same core for HD4870 and HD4850, with only a core voltage bump and MHz increase on specially picked cores? That way they can match somewhat rare cores to the apparently rare GDDR5, and have HD4850 in a very good performance bracket (with respect to price). I think we need to wait for the data on the HD4870 before we know what AMD is really up to.
     
  14. kyetech

    Regular

    Joined:
    Sep 10, 2004
    Messages:
    532
    Likes Received:
    0
    Quote of the day. +1. Did you make that up? That's good.

    It's official: Nvidia's new initiative... TWI-NOT-MTBP
     
  15. digitalwanderer

    digitalwanderer Dangerously Mirthful
    Legend

    Joined:
    Feb 19, 2002
    Messages:
    18,992
    Likes Received:
    3,532
    Location:
    Winfield, IN USA
    What makes you think the 280 will double the 4850's performance would be my question? :|
     
  16. ZerazaX

    Regular

    Joined:
    Oct 29, 2007
    Messages:
    280
    Likes Received:
    0
    I'm pretty sure ATI doesn't do deactivating of parts for the RV770 like Nvidia does w/ their cards unless they've switched redundancy methods.

    From the EE Times article
    Sounds like they aren't keen on disabling parts of the GPU
     
  17. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    I misworded my sentence.
    By scalar values I meant scalar threads.
    I was running out of words for what to call things once I nested elements within a batch and nested items within the groups of 5.

    By your description of the operand restrictions, it would be more likely that it would be 4 threads inside each batch item in a 64 element batch, hence emulating a 256-length vector machine.
     
  18. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    Why not? That's exactly what they did with 3850/3870.
     
  19. Love_In_Rio

    Veteran

    Joined:
    Apr 21, 2004
    Messages:
    1,627
    Likes Received:
    226
    They haven't done it since the ATI 9500, so I also suppose they are not doing it again now.
     
  20. CJ

    CJ
    Regular

    Joined:
    Apr 28, 2004
    Messages:
    816
    Likes Received:
    40
    Location:
    MSI Europe HQ
    So what are X800GT, X800Pro, X1800GTO, X1900GT, HD2900GT and HD3690/HD3830?
     