AMD: R9xx Speculation

I think the performance of the cards would need to be seen 1st.
Indeed, but if AMD hasn't screwed up, it's possible to extrapolate rough estimates. I would be mighty surprised if, with such stats, the 6970 isn't faster than the GTX580. While the 6990, as someone joked, would provide a "useless" amount of firepower ;)
 
I guess I meant 'no fullspeed 32bit INT ops'.
The slide says:
4* 24bit MUL, ADD or MAD
2* 32bit ADD
1* 32bit MUL

I was hoping they would be doing fullrate 32bit rather than ganging the SPs.
If 32bit INT isn't used all that much this should be OK though.
Don't forget even with only one 32bit int mul per clock, the absolute number is still (somewhat) higher than what a GTX580 can do (which has half-rate 32bit int rate). For 32bit int adds that's more than twice as fast as GTX 580 (unless, of course, that's scalar, in which case it'll drop to 1 32bit int add as usual). So that still looks plenty fast to me.
 
It's not going to be an R300 because GTX580 is already out, and it's not NV30.
And it can't be a new R300 (imho) anyway, since R300 was such a big leap in all areas - not only performance but also feature-wise. Cayman is probably a nice improvement in performance (and it could be a decent improvement in perf/W too, which is getting more important), but it doesn't really bring anything new to the table feature-wise, methinks.
 
4* 24bit MUL, ADD or MAD
2* 32bit ADD
1* 32bit MUL

I was hoping they would be doing fullrate 32bit rather than ganging the SPs.
If 32bit INT isn't used all that much this should be OK though.
That's a mistake. It can do four 32bit integer adds per cycle. An add can be done in each VLIW slot, same as in Cypress (and everything since R600).

Edit:
ISA looks like that (no difference between different generations):
Code:
x: ADD_INT     R0.x,  R1.x,  R2.x
y: ADD_INT     R0.y,  R1.y,  R2.y
z: ADD_INT     R0.z,  R1.z,  R2.z
w: ADD_INT     R0.w,  R1.w,  R2.w
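
As a rough illustration of the bundle above, here is a minimal Python sketch that models one VLIW bundle issuing an independent ADD_INT in each of the four slots (x, y, z, w). The function name `vliw_add_int` and the dictionary register model are hypothetical, chosen only for illustration; this is not how the hardware or the shader compiler represents registers.

```python
MASK32 = 0xFFFFFFFF  # 32-bit wraparound, as for a 32-bit integer add

def vliw_add_int(r1, r2):
    """Model one bundle: an independent 32-bit add in each of the x, y, z, w slots."""
    return {lane: (r1[lane] + r2[lane]) & MASK32 for lane in "xyzw"}

# Hypothetical register contents, one value per lane.
r1 = {"x": 1, "y": 2, "z": 0xFFFFFFFF, "w": 40}
r2 = {"x": 10, "y": 20, "z": 1, "w": 2}

r0 = vliw_add_int(r1, r2)
print(r0)  # four results produced by a single bundle
```

The point of the sketch is only that the four adds have no dependency on each other, so nothing forces them to share a slot.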
 
the absolute number is still (somewhat) higher than what a GTX580 can do (which has half-rate 32bit int rate).
Hmm, somehow I'd gotten into my mind that NV is doing fullspeed 32bit INTs :eek:
 
That's a mistake. It can do four 32bit integer adds per cycle. An add can be done in each VLIW slot, same as in Cypress (and everything since R600).

Edit:
ISA looks like that (no difference between different generations):
Code:
x: ADD_INT     R0.x,  R1.x,  R2.x
y: ADD_INT     R0.y,  R1.y,  R2.y
z: ADD_INT     R0.z,  R1.z,  R2.z
w: ADD_INT     R0.w,  R1.w,  R2.w

Well, I hadn't even noticed that one! If it can do 4 FMAs/cycle, there's really no reason it shouldn't be able to do 4 ADDs, anyway.

By the way, does anyone know when the NDA actually expires? I mean the NDA for this presentation, not benchmarks.
 
Indeed, but if AMD hasn't screwed up, it's possible to extrapolate rough estimates. I would be mighty surprised if, with such stats, the 6970 isn't faster than the GTX580. While the 6990, as someone joked, would provide a "useless" amount of firepower ;)

That's my thinking as well. It appears that the 6950 should be on par (win some/lose some) with the competing current-gen card. But that remains to be seen. Another point of contention is the improved IQ with MLAA along with EQAA, and what kind of performance one will get with those enabled on these cards.
 
That's a very interesting claim, 'cos people like me don't upgrade ev'ry generation. So, this "power" will be helpful for the upcoming game releases.

Future proofing doesn't exist. GF 1x0 may be fast with tessellation, but in upcoming games another part of the hardware may be bottlenecked, resulting in those games performing as badly as they might on the Cypress platform. Or Cypress, leading in other areas of performance, might pull away.

Anyway there is a thread on tessellation
 
That's what they said, tests have shown something else iirc.
Oh you're right. The whitepaper said only DP can't be dual issued but two int instructions can. Either that's just not true or it could be artificially limited for consumer parts?
 
That's a mistake. It can do four 32bit integer adds per cycle.
How and why would they make such a mistake?
If they use only the mantissa part of the FP unit, it can only do 24bit INT, unless they have 48bit FP capability.
 
I know in compute, 32bit int is used often for indices into data structures.

And in many cases it is used throughout the entire kernel, so only half-rate 32 bit add would be unacceptable; fortunately that's not the case.

32 bit mul at quarter rate is OK but not good; at half rate it would be good. Since the hardware is capable of 52 bit multiplies at quarter rate, couldn't it be modified a little to allow 32 bit mul at half rate? :smile:
 
How and why would they make such a mistake?
If they use only the mantissa part of the FP unit, it can only do 24bit INT, unless they have 48bit FP capability.

With FMA they need 48 bit adders for correct results.

Anyway, 32 bit adders are way too cheap not to include...
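
The 24-bit and 48-bit figures in this exchange follow from the single-precision format, which has a 24-bit significand (23 stored bits plus an implicit leading one). A minimal Python sketch of that arithmetic, under the assumption that the integer path reuses the FP significand datapath as the posts suggest (the helper name `roundtrip_f32` is made up for illustration, and nothing here describes how Cayman is actually wired):

```python
import struct

def roundtrip_f32(n):
    """Convert an integer to IEEE single precision and back again."""
    return int(struct.unpack("<f", struct.pack("<f", float(n)))[0])

max24 = (1 << 24) - 1                       # largest 24-bit unsigned integer

# The full product of two 24-bit values needs up to 48 bits,
# which an FMA has to keep before the add and the final rounding.
product_bits = (max24 * max24).bit_length()

# 24-bit integers survive a trip through single precision unchanged...
exact_24 = roundtrip_f32(max24) == max24
# ...but a value needing 25 significant bits does not.
exact_25 = roundtrip_f32((1 << 24) + 1) == (1 << 24) + 1

print(product_bits, exact_24, exact_25)     # 48 True False
```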
 
Don't forget even with only one 32bit int mul per clock, the absolute number is still (somewhat) higher than what a GTX580 can do (which has half-rate 32bit int rate). For 32bit int adds that's more than twice as fast as GTX 580 (unless, of course, that's scalar, in which case it'll drop to 1 32bit int add as usual). So that still looks plenty fast to me.

Actually, Fermi has full-rate 32-bit int add operations. I just wrote a CUDA kernel to test it out on my GTX 480, and got 644 Giga integer adds/second. The full-rate peak would be 1.4 GHz * 480 CUDA cores = 672 Giga integer adds/second.

Trying the same kernel out with 32-bit int mul operations gave 331 Giga integer muls/second, which does appear to be half rate.
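
The arithmetic behind those two conclusions can be checked with a short Python sketch. It uses only the figures quoted in this post; the variable names are arbitrary, and the original CUDA kernel itself is not reproduced here.

```python
clock_ghz = 1.4            # shader clock quoted in the post
lanes = 480                # the multiplier used in the post for the GTX 480
peak = clock_ghz * lanes   # full-rate peak, in Giga ops/second

measured_add = 644.0       # Giga integer adds/second, as reported
measured_mul = 331.0       # Giga integer muls/second, as reported

add_fraction = measured_add / peak
mul_fraction = measured_mul / peak

print(f"peak={peak:.0f} add={add_fraction:.2f} mul={mul_fraction:.2f}")
```

The adds land at roughly 96% of the full-rate peak, while the muls land just under 50%, which is what you would expect from a half-rate operation.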
 