AMD: R7xx Speculation

Status
Not open for further replies.
Crysis Very High DX10 @ 1920x1200 4xAA ~ 23 FPS on the 4850 CF
on GTX 280 at same settings (from computerbase.de) ~ 16 FPS

Interesting..
The only excuse I can find for people who use the default Crysis flyby benchmark to evaluate GPU performance in this game is total ignorance!

This is a pure texture-streaming demo, nowhere near the experience in the actual game.
 
Which is what makes the numbers even more interesting IF REAL. It's already known that the ATI solutions are more texture limited than the Nvidia ones.
 
What's to stop a CTM implementation where computational entities are packed 5 to a primitive?

True, but it does make for even more architecture specific code.

Well, besides likely blowing out the register file, code expansion, and assuming you only want the subset of ops the slim ALUs currently offer?
That and the insanely expanded effective batch size.
64 items per batch, with each item being a packet of 5 scalar values is 320 elements.

And branch coherence goes *poof* :smile:
 
Scaling is always easier when you have the luxury of a lower starting point.
Not that I don't find it interesting how great the improvements are, but they are somewhat magnified by the lower bar set by their predecessors.

I don't think that R600/RV670 performance matters in this case...
Well, if RV770 is going to be a great part, I would rather say that R600 was simply a poor implementation of an extremely good architecture, because it's easier to create a good GPU than a scalable architecture, I suppose.
The huge problem with R600 was the timing, actually. Think about it: NVIDIA managed to get good reviews for G200, even though within maybe 10 days it will be outperformed by 2 upper mid-range cards that cost almost half as much as one GTX 280... Imagine if G200 launched in July. No one would have even cared about it, maybe. ;)
 
The only excuse I can find for people who use the default Crysis flyby benchmark to evaluate GPU performance in this game is total ignorance!

This is a pure texture-streaming demo, nowhere near the experience in the actual game.

But it makes two cards comparable, in a way that is maybe nearer to real gaming than any other synthetic bench.
No one said that Crysis will be playable if a card (or 2) scores 23fps in the bench. ;)
 
True, but it does make for even more architecture specific code.
At a high level, it would ignore ILP considerations entirely. Just write a program for one element, then try to throw as many of them as you can into a bucket and have at it, no visible SIMD or ILP at all.

The one branch unit per 5 ALUs would be a problem in addition to other already outlined conditions, though.

And branch coherence goes *poof* :smile:
It was claimed in an earlier ATI interview that large batch sizes didn't hurt performance all that much. Maybe we should put that claim to the test. ;)
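
One rough way to put numbers on the batch-size worry: if every element in a batch took a branch independently with probability p, the chance that the whole batch agrees is p^N + (1-p)^N. The sketch below compares the 64-item batch with the 320-element packing discussed above. Independent branching is a deliberately pessimistic assumption (real workloads are usually more coherent), and `coherent_probability` is a hypothetical helper, not part of CTM.

```python
def coherent_probability(p_taken: float, batch_elements: int) -> float:
    """Chance that every element in a batch takes the same side of a branch.

    Assumes each element branches independently, which is an illustrative
    simplification and not how real shader workloads behave.
    """
    return p_taken ** batch_elements + (1.0 - p_taken) ** batch_elements


NATIVE_BATCH = 64           # items per batch, as quoted in the thread
PACKED_BATCH = 64 * 5       # 5 scalar threads packed into each item

for p in (0.5, 0.9, 0.99, 0.999):
    native = coherent_probability(p, NATIVE_BATCH)
    packed = coherent_probability(p, PACKED_BATCH)
    print(f"p={p}: 64-wide {native:.4f}, 320-wide {packed:.4f}")
```

Under this toy model the wider batch is never more likely to stay coherent than the narrow one, which is the "poof" in a nutshell; whether it matters in practice depends on how coherent the real data is.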
 
Imagine if G200 launched in July. No one would have even cared about it, maybe. ;)
I feel that the 280 will drop in price by $50-100 right before the launch of R700. Until then, they can milk it even though it might have meant getting a rap in the initial reviews.
 
Well the performance of a graphics card while idle is zero, but I bet you the 4850 uses less power than the 280 while it's idle also, so it'll still be more efficient there. ;)

Again you are missing the point. If I can get up to 2x the performance with GTX 280 vs HD 4850, and only 10 or 20% higher idle power consumption (or whatever it is), that means that the GTX 280 is actually more efficient in this sense. :)
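
The arithmetic behind that claim, as a minimal sketch. The 2x performance figure and the 10-20% idle power figure are the poster's own hypotheticals ("or whatever it is"), not measurements, and dividing load performance by idle power is the poster's framing rather than a standard efficiency metric. `perf_per_watt_ratio` is a name made up for illustration.

```python
def perf_per_watt_ratio(perf_ratio: float, power_ratio: float) -> float:
    """Relative efficiency of card A vs card B, given A/B performance and A/B power."""
    return perf_ratio / power_ratio


HYPOTHETICAL_PERF_RATIO = 2.0   # "up to 2x the performance" (poster's assumption)

for extra_power in (0.10, 0.20):  # "10 or 20% higher idle power" (poster's assumption)
    ratio = perf_per_watt_ratio(HYPOTHETICAL_PERF_RATIO, 1.0 + extra_power)
    print(f"+{extra_power:.0%} idle power -> {ratio:.2f}x relative efficiency")
```

The conclusion stands or falls with the 2x input, which is exactly what gets questioned further down the thread.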
 
But it makes two cards comparable, in a way that is maybe nearer to real gaming than any other synthetic bench.
No one said that Crysis will be playable if a card (or 2) scores 23fps in the bench. ;)
Yeah, but with this approach the review media are spreading a skewed picture of Crysis playability on a particular card.
So far I have only noticed that TR has adopted “in-game” performance measurement, and the scores from its GTX 280 review are extremely interesting if you compare GTX260 vs. HD3870 X2!
[Image: crysis-high-1920.gif]

With all the info about RV770 that we know so far, HD4850 could very well be faster than HD3870 X2! This means that the 200 USD HD4850 could be a better choice for Crysis than the 400 USD GTX260! Wouldn’t that be a TWI-NOT-MTBP :D
 
The only excuse I can find for people who use the default Crysis flyby benchmark to evaluate GPU performance in this game is total ignorance!

This is a pure texture-streaming demo, nowhere near the experience in the actual game.

Then we have 80 TUs total on both sides (80 in the GTX 280 @ 602 MHz, 40+40 in the two 4850s @ 625 MHz). Aggregate bandwidth is higher on the GTX 280, yet the 4850 pair leads by more than the slight clock difference would suggest, and CF scaling is surely not 100%. If this is a texture-streaming bench, then it seems ATI got their texturing capabilities really right this time.
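
To make the comparison concrete, here is a back-of-the-envelope sketch using only the unit counts and clocks quoted above plus the FPS figures from the top of the thread. It assumes one texel per texture unit per clock and perfect CrossFire scaling, both of which are simplifications; `texel_rate_gtexels` is a hypothetical helper.

```python
def texel_rate_gtexels(texture_units: int, core_mhz: float) -> float:
    """Peak texel rate in Gtexels/s, assuming one texel per unit per clock."""
    return texture_units * core_mhz / 1000.0


gtx280 = texel_rate_gtexels(80, 602)          # single GTX 280
hd4850_cf = texel_rate_gtexels(40 + 40, 625)  # two HD 4850s, ideal CF scaling

theoretical_advantage = hd4850_cf / gtx280    # from clocks alone
measured_advantage = 23 / 16                  # FPS figures quoted in the thread

print(f"GTX 280:    {gtx280:.2f} Gtexels/s")
print(f"HD 4850 CF: {hd4850_cf:.2f} Gtexels/s")
print(f"theoretical {theoretical_advantage:.3f}x vs measured {measured_advantage:.3f}x")
```

The clock difference accounts for only a few percent, so if the FPS numbers are real, most of the gap has to come from somewhere other than raw unit count times clock.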
 
What's to stop a CTM implementation where computational entities are packed 5 to a primitive?
As Brook+ gets better I presume this kind of packing will take place without the developer having to futz so much. It's pretty hard though, maybe it won't work out like that...

It's not like the VLIW should care, though it would waste any form of swizzling across lanes. (I can't remember where I read it, but isn't it also the case that the slim ALU lanes can interchange results with each other easily while the fat ALU is off on its own?)
It's about operand bandwidth I think. Basically only 4 different scalars can be read from the register file (r0.w, r2.x, r3.z, r3.w, say) per clock (and 3 clocks are spent fetching operands). The remaining operands must be either literals, constants, cache-constants or "previous" resultants that have been retained within the ALU pipe from a prior instruction.

There are more restrictions once you take account of all the operands that can be fetched over 3 clocks to feed the ALUs - but those rules are way too hairy - the R600 ISA document spends pages on it, and appears to come to the conclusion that "most of the time you won't notice a problem".

The operands for the transcendental lane, if they're from the register file, must be in the population fetched for the other lanes, or the transcendental instruction can't be issued.

Otherwise, swizzling, per se, appears to be completely free of constraints.
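
A toy model of the two rules described above, for illustration only. It takes the figures in this post at face value (4 distinct register-file scalars per clock, 3 fetch clocks) and treats their product as an upper bound; it ignores the hairier per-clock rules the ISA document covers, and it does not count literals, constants or retained previous results. `can_issue` and its arguments are hypothetical names, not anything from the R600 toolchain.

```python
READS_PER_CLOCK = 4   # distinct register-file scalars per clock (from the post)
FETCH_CLOCKS = 3      # clocks spent fetching operands (from the post)
MAX_GPR_SCALARS = READS_PER_CLOCK * FETCH_CLOCKS  # upper bound in this toy model


def can_issue(slim_lane_reads, trans_lane_reads=()):
    """Check one instruction group against the simplified operand rules.

    slim_lane_reads:  register-file scalars read by the four slim lanes,
                      written like "r3.z".
    trans_lane_reads: register-file scalars read by the transcendental lane.
    """
    slim = set(slim_lane_reads)
    if len(slim) > MAX_GPR_SCALARS:
        return False
    # The transcendental lane may only reuse scalars already fetched
    # for the other lanes.
    return set(trans_lane_reads) <= slim


# The example scalars from the post, with the transcendental lane reusing one.
print(can_issue(["r0.w", "r2.x", "r3.z", "r3.w"], ["r3.z"]))
# The transcendental lane wants a scalar nobody else fetched.
print(can_issue(["r0.w", "r2.x"], ["r5.y"]))
```

A real scheduler would also have to place each read in a specific fetch clock, which is where the pages of rules come in.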

Well, besides likely blowing out the register file, code expansion, and assuming you only want the subset of ops the slim ALUs currently offer?
That and the insanely expanded effective batch size.
64 items per batch, with each item being a packet of 5 scalar values is 320 elements.
I'm not really sure what you're saying here. When R6xx issues a batch, at minimum it allocates two vec4 fp32 registers (256 bits) of register file. So, ahem, if you choose not to use all 8 of those scalars for your kernel then you're wasting register file.

Beyond that it's really a matter of not using so many vec4 registers per batch that you no longer have enough batches in flight to hide memory latency.

But yeah, the code does get a bit hairy and long if you have a kernel that calculates 10s of elements per invocation.

This is the optimised double precision matrix multiply CAL source:

Code:
il_ps_2_0
dcl_cb cb0[2]
dcl_input_position_interp(linear_noperspective)_centered_center vWinCoord0.xy__
dcl_resource_id(0)_type(2d)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(1)_type(2d)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(2)_type(2d)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(3)_type(2d)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(4)_type(2d)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(5)_type(2d)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(6)_type(2d)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(7)_type(2d)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(8)_type(2d)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(9)_type(2d)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_literal l0, 0x00000000, 0x00000000, 0x00000000, 0x00000000
mov r3, l0
mov r4, l0
mov r5, l0
mov r6, l0
mov r7, l0
mov r8, l0
mov r9, l0
mov r10, l0
mul r0.xyz_, vWinCoord0.xyyx, cb0[0].zwxz
mov r1.__zw, r0.zzzx
mov r2._yz_, r1.zzwz
mov r2.___w, r0.y
mov r11.x___, l0
whileloop
    itof r11._y__, r11.x
    ge r11._y__, r11.y, cb0[1].x
    break_logicalnz r11.y
    mov r0.___w, r2.w
    add r1._y__, r2.w, cb0[0].y
    add r1.x___, r1.y, cb0[0].y
    add r2.x___, r1.x, cb0[0].y
    add r2.___w, r2.x, cb0[0].y
    sample_resource(8)_sampler(8) r21, r0.xwww
    sample_resource(9)_sampler(9) r22, r0.xwww
    sample_resource(8)_sampler(8) r23, r1.wyww
    sample_resource(9)_sampler(9) r24, r1.wyww
    sample_resource(8)_sampler(8) r25, r1.wxww
    sample_resource(9)_sampler(9) r26, r1.wxww
    sample_resource(8)_sampler(8) r27, r2.zxzz
    sample_resource(9)_sampler(9) r28, r2.zxzz
    sample_resource(0)_sampler(0) r13, r0.wzww
    sample_resource(1)_sampler(1) r14, r0.wzww
    sample_resource(2)_sampler(2) r15, r0.wzww
    sample_resource(3)_sampler(3) r16, r0.wzww
    sample_resource(4)_sampler(4) r17, r0.wzww
    sample_resource(5)_sampler(5) r18, r0.wzww
    sample_resource(6)_sampler(6) r19, r0.wzww
    sample_resource(7)_sampler(7) r20, r0.wzww
    dmad r12.xy, r13.zw, r22.xy, r3.xy
    dmad r12.zw, r13.zw, r22.zw, r3.zw
    dmad r3.zw, r13.xy, r21.zw, r12.zw
    dmad r3.xy, r13.xy, r21.xy, r12.xy
    dmad r12.xy, r14.zw, r22.xy, r4.xy
    dmad r12.zw, r14.zw, r22.zw, r4.zw
    dmad r4.zw, r14.xy, r21.zw, r12.zw
    dmad r4.xy, r14.xy, r21.xy, r12.xy
    dmad r12.xy, r15.zw, r22.xy, r5.xy
    dmad r12.zw, r15.zw, r22.zw, r5.zw
    dmad r5.zw, r15.xy, r21.zw, r12.zw
    dmad r5.xy, r15.xy, r21.xy, r12.xy
    dmad r12.xy, r16.zw, r22.xy, r6.xy
    dmad r12.zw, r16.zw, r22.zw, r6.zw
    dmad r6.zw, r16.xy, r21.zw, r12.zw
    dmad r6.xy, r16.xy, r21.xy, r12.xy
    dmad r12.xy, r17.zw, r22.xy, r7.xy
    dmad r12.zw, r17.zw, r22.zw, r7.zw
    dmad r7.zw, r17.xy, r21.zw, r12.zw
    dmad r7.xy, r17.xy, r21.xy, r12.xy
    dmad r12.xy, r18.zw, r22.xy, r8.xy
    dmad r12.zw, r18.zw, r22.zw, r8.zw
    dmad r8.zw, r18.xy, r21.zw, r12.zw
    dmad r8.xy, r18.xy, r21.xy, r12.xy
    dmad r12.xy, r19.zw, r22.xy, r9.xy
    dmad r12.zw, r19.zw, r22.zw, r9.zw
    dmad r9.zw, r19.xy, r21.zw, r12.zw
    dmad r9.xy, r19.xy, r21.xy, r12.xy
    dmad r12.xy, r20.zw, r22.xy, r10.xy
    dmad r12.zw, r20.zw, r22.zw, r10.zw
    dmad r10.zw, r20.xy, r21.zw, r12.zw
    dmad r10.xy, r20.xy, r21.xy, r12.xy
    sample_resource(0)_sampler(0) r13, r1.yzyy
    sample_resource(1)_sampler(1) r14, r1.yzyy
    sample_resource(2)_sampler(2) r15, r1.yzyy
    sample_resource(3)_sampler(3) r16, r1.yzyy
    sample_resource(4)_sampler(4) r17, r1.yzyy
    sample_resource(5)_sampler(5) r18, r1.yzyy
    sample_resource(6)_sampler(6) r19, r1.yzyy
    sample_resource(7)_sampler(7) r20, r1.yzyy
    dmad r12.xy, r13.zw, r24.xy, r3.xy
    dmad r12.zw, r13.zw, r24.zw, r3.zw
    dmad r3.zw, r13.xy, r23.zw, r12.zw
    dmad r3.xy, r13.xy, r23.xy, r12.xy
    dmad r12.xy, r14.zw, r24.xy, r4.xy
    dmad r12.zw, r14.zw, r24.zw, r4.zw
    dmad r4.zw, r14.xy, r23.zw, r12.zw
    dmad r4.xy, r14.xy, r23.xy, r12.xy
    dmad r12.xy, r15.zw, r24.xy, r5.xy
    dmad r12.zw, r15.zw, r24.zw, r5.zw
    dmad r5.zw, r15.xy, r23.zw, r12.zw
    dmad r5.xy, r15.xy, r23.xy, r12.xy
    dmad r12.xy, r16.zw, r24.xy, r6.xy
    dmad r12.zw, r16.zw, r24.zw, r6.zw
    dmad r6.zw, r16.xy, r23.zw, r12.zw
    dmad r6.xy, r16.xy, r23.xy, r12.xy
    dmad r12.xy, r17.zw, r24.xy, r7.xy
    dmad r12.zw, r17.zw, r24.zw, r7.zw
    dmad r7.zw, r17.xy, r23.zw, r12.zw
    dmad r7.xy, r17.xy, r23.xy, r12.xy
    dmad r12.xy, r18.zw, r24.xy, r8.xy
    dmad r12.zw, r18.zw, r24.zw, r8.zw
    dmad r8.zw, r18.xy, r23.zw, r12.zw
    dmad r8.xy, r18.xy, r23.xy, r12.xy
    dmad r12.xy, r19.zw, r24.xy, r9.xy
    dmad r12.zw, r19.zw, r24.zw, r9.zw
    dmad r9.zw, r19.xy, r23.zw, r12.zw
    dmad r9.xy, r19.xy, r23.xy, r12.xy
    dmad r12.xy, r20.zw, r24.xy, r10.xy
    dmad r12.zw, r20.zw, r24.zw, r10.zw
    dmad r10.zw, r20.xy, r23.zw, r12.zw
    dmad r10.xy, r20.xy, r23.xy, r12.xy
    sample_resource(0)_sampler(0) r13, r1.xzxx
    sample_resource(1)_sampler(1) r14, r1.xzxx
    sample_resource(2)_sampler(2) r15, r1.xzxx
    sample_resource(3)_sampler(3) r16, r1.xzxx
    sample_resource(4)_sampler(4) r17, r1.xzxx
    sample_resource(5)_sampler(5) r18, r1.xzxx
    sample_resource(6)_sampler(6) r19, r1.xzxx
    sample_resource(7)_sampler(7) r20, r1.xzxx
    dmad r12.xy, r13.zw, r26.xy, r3.xy
    dmad r12.zw, r13.zw, r26.zw, r3.zw
    dmad r3.zw, r13.xy, r25.zw, r12.zw
    dmad r3.xy, r13.xy, r25.xy, r12.xy
    dmad r12.xy, r14.zw, r26.xy, r4.xy
    dmad r12.zw, r14.zw, r26.zw, r4.zw
    dmad r4.zw, r14.xy, r25.zw, r12.zw
    dmad r4.xy, r14.xy, r25.xy, r12.xy
    dmad r12.xy, r15.zw, r26.xy, r5.xy
    dmad r12.zw, r15.zw, r26.zw, r5.zw
    dmad r5.zw, r15.xy, r25.zw, r12.zw
    dmad r5.xy, r15.xy, r25.xy, r12.xy
    dmad r12.xy, r16.zw, r26.xy, r6.xy
    dmad r12.zw, r16.zw, r26.zw, r6.zw
    dmad r6.zw, r16.xy, r25.zw, r12.zw
    dmad r6.xy, r16.xy, r25.xy, r12.xy
    dmad r12.xy, r17.zw, r26.xy, r7.xy
    dmad r12.zw, r17.zw, r26.zw, r7.zw
    dmad r7.zw, r17.xy, r25.zw, r12.zw
    dmad r7.xy, r17.xy, r25.xy, r12.xy
    dmad r12.xy, r18.zw, r26.xy, r8.xy
    dmad r12.zw, r18.zw, r26.zw, r8.zw
    dmad r8.zw, r18.xy, r25.zw, r12.zw
    dmad r8.xy, r18.xy, r25.xy, r12.xy
    dmad r12.xy, r19.zw, r26.xy, r9.xy
    dmad r12.zw, r19.zw, r26.zw, r9.zw
    dmad r9.zw, r19.xy, r25.zw, r12.zw
    dmad r9.xy, r19.xy, r25.xy, r12.xy
    dmad r12.xy, r20.zw, r26.xy, r10.xy
    dmad r12.zw, r20.zw, r26.zw, r10.zw
    dmad r10.zw, r20.xy, r25.zw, r12.zw
    dmad r10.xy, r20.xy, r25.xy, r12.xy
    sample_resource(0)_sampler(0) r13, r2.xyxx
    sample_resource(1)_sampler(1) r14, r2.xyxx
    sample_resource(2)_sampler(2) r15, r2.xyxx
    sample_resource(3)_sampler(3) r16, r2.xyxx
    sample_resource(4)_sampler(4) r17, r2.xyxx
    sample_resource(5)_sampler(5) r18, r2.xyxx
    sample_resource(6)_sampler(6) r19, r2.xyxx
    sample_resource(7)_sampler(7) r20, r2.xyxx
    dmad r12.xy, r13.zw, r28.xy, r3.xy
    dmad r12.zw, r13.zw, r28.zw, r3.zw
    dmad r3.zw, r13.xy, r27.zw, r12.zw
    dmad r3.xy, r13.xy, r27.xy, r12.xy
    dmad r12.xy, r14.zw, r28.xy, r4.xy
    dmad r12.zw, r14.zw, r28.zw, r4.zw
    dmad r4.zw, r14.xy, r27.zw, r12.zw
    dmad r4.xy, r14.xy, r27.xy, r12.xy
    dmad r12.xy, r15.zw, r28.xy, r5.xy
    dmad r12.zw, r15.zw, r28.zw, r5.zw
    dmad r5.zw, r15.xy, r27.zw, r12.zw
    dmad r5.xy, r15.xy, r27.xy, r12.xy
    dmad r12.xy, r16.zw, r28.xy, r6.xy
    dmad r12.zw, r16.zw, r28.zw, r6.zw
    dmad r6.zw, r16.xy, r27.zw, r12.zw
    dmad r6.xy, r16.xy, r27.xy, r12.xy
    dmad r12.xy, r17.zw, r28.xy, r7.xy
    dmad r12.zw, r17.zw, r28.zw, r7.zw
    dmad r7.zw, r17.xy, r27.zw, r12.zw
    dmad r7.xy, r17.xy, r27.xy, r12.xy
    dmad r12.xy, r18.zw, r28.xy, r8.xy
    dmad r12.zw, r18.zw, r28.zw, r8.zw
    dmad r8.zw, r18.xy, r27.zw, r12.zw
    dmad r8.xy, r18.xy, r27.xy, r12.xy
    dmad r12.xy, r19.zw, r28.xy, r9.xy
    dmad r12.zw, r19.zw, r28.zw, r9.zw
    dmad r9.zw, r19.xy, r27.zw, r12.zw
    dmad r9.xy, r19.xy, r27.xy, r12.xy
    dmad r12.xy, r20.zw, r28.xy, r10.xy
    dmad r12.zw, r20.zw, r28.zw, r10.zw
    dmad r10.zw, r20.xy, r27.zw, r12.zw
    dmad r10.xy, r20.xy, r27.xy, r12.xy
    dcl_literal l1, 0x00000001, 0x00000001, 0x00000001, 0x00000001
    iadd r11.x___, r11.x, l1
endloop
dcl_output_generic o0
dcl_output_generic o1
dcl_output_generic o2
dcl_output_generic o3
dcl_output_generic o4
dcl_output_generic o5
dcl_output_generic o6
dcl_output_generic o7
mov o0, r3
mov o1, r4
mov o2, r5
mov o3, r6
mov o4, r7
mov o5, r8
mov o6, r9
mov o7, r10
ret_dyn
end;

It calculates 16 doubles per invocation and uses 30 vec4 registers (and one temporary vec4 register). That's 480 bytes per "element", or each batch has a state of 30720 bytes, which I think means there are 8 batches in flight at any one time (256KB of register file per SIMD). In R6xx assembly the ALU:TEX ratio of the compute loop is 3.3:1 (173 instruction groups: 133 ALU + 40 TEX).
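
The register-file arithmetic above can be reproduced with a short sketch. The 16-byte vec4 fp32 register, 64 elements per batch and 256KB of register file per SIMD are the figures used in this thread; `batches_in_flight` is a hypothetical helper, and real hardware may reserve or round register space in ways this ignores.

```python
VEC4_BYTES = 16                    # one vec4 of fp32 values
BATCH_ELEMENTS = 64                # elements per batch
REGISTER_FILE_BYTES = 256 * 1024   # register file per SIMD, as stated in the post


def batches_in_flight(vec4_registers: int):
    """Return (bytes per element, bytes per batch, whole batches that fit)."""
    per_element = vec4_registers * VEC4_BYTES
    per_batch = per_element * BATCH_ELEMENTS
    return per_element, per_batch, REGISTER_FILE_BYTES // per_batch


per_element, per_batch, in_flight = batches_in_flight(30)
print(per_element, per_batch, in_flight)

# ALU:TEX ratio of the compute loop from the instruction-group counts.
ALU_GROUPS, TEX_GROUPS = 133, 40
print(f"{ALU_GROUPS / TEX_GROUPS:.1f}:1 over {ALU_GROUPS + TEX_GROUPS} groups")
```

Plugging in the minimum allocation of two vec4 registers instead of 30 shows how much latency-hiding headroom a kernel this register-hungry gives up.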

Jawed
 
Can it be that the yields are so good for AMD that they are in fact releasing the very same core for HD4870 and HD4850, with only a core voltage bump and MHz increase on specially picked cores? That way they can match somewhat rare cores to the apparently rare GDDR5, and have HD4850 in a very good performance bracket (with respect to price). I think we need to wait for the data of the HD4870 before we know what AMD is really up to.
 
Again you are missing the point. If I can get up to 2x the performance with GTX 280 vs HD 4850, and only 10 or 20% higher idle power consumption (or whatever it is), that means that the GTX 280 is actually more efficient in this sense. :)
What makes you think the 280 will double the 4850's performance would be my question? :|
 
I'm pretty sure ATI doesn't do deactivating of parts for the RV770 like Nvidia does w/ their cards unless they've switched redundancy methods.

From the EE Times article:
"We didn't want to come out with one monolithic GPU and then disable parts of it for different markets," said an AMD spokesman prior to a full disclosure of the part in a briefing in San Francisco on June 16.

Sounds like they aren't keen on disabling parts of the GPU
 
I'm not really sure what you're saying here. When R6xx issues a batch, at minimum it allocates two vec4 fp32 registers (256 bits) of register file. So, ahem, if you choose not to use all 8 of those scalars for your kernel then you're wasting register file.
I misworded my sentence.
By scalar values I meant scalar threads.
I was running out of words for what to call things once I nested elements within a batch and nested items within the groups of 5.

By your description of the operand restrictions, it would be more likely that it would be 4 threads inside each batch item in a 64 element batch, hence emulating a 256-length vector machine.
 
Can it be that the yields are so good for AMD that they are in fact releasing the very same core for HD4870 and HD4850, with only a core voltage bump and MHz increase on specially picked cores? That way they can match somewhat rare cores to the apparently rare GDDR5, and have HD4850 in a very good performance bracket (with respect to price). I think we need to wait for the data of the HD4870 before we know what AMD is really up to.

Why not? That's exactly what they did with 3850/3870.
 