AMD: R9xx Speculation

:D 10% area saving for the SIMDs with the VLIW-4 architecture. So it wasn't just for utilisation.

Interestingly double-precision throughput is improved (I'm a little sceptical, though) with 2 MULs per clock, same as ADD. MAD/FMA are still 1 per clock.

Fetch directly from LDS :devilish: This means it's not a feature of Cypress, as I was expecting, waiting for the compiler to improve. Fingers-crossed it's good...

This slide:

http://www.dyn-wp.frazpc.pl/wp-content/uploads/2010/11/Fot_018.jpg

appears to show that each SIMD engine has an "octo" fetch unit. EDIT: er, no, they're just 4 "tall" blocks.
 
Not a bad move to VLIW-4 if they got 10% die space saved AND similar performance.
I guess it is quite hard for a compiler to fill all 5 VLIW lanes: such cases are quite limited, and naturally the performance drop in the VLIW-4 case is equally rare.
May suck for hand-optimized cases though. Yet another reason to avoid these and rely on the compiler.
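
To put rough numbers on the lane-filling argument, here is a toy sketch. The ILP figures are invented for illustration (not measured from any real shader), and `bundles_needed` / `utilisation` are hypothetical helper names, not anything from AMD's compiler. It only assumes that a group of n independent ops needs ceil(n / width) bundles.

```python
# Toy model of VLIW slot filling. The ILP numbers below are invented for
# illustration; they are not measurements from any real shader or compiler.

def bundles_needed(groups, width):
    """Each entry in `groups` is a count of mutually independent ops.
    A group of n ops needs ceil(n / width) bundles on a `width`-lane unit."""
    return sum(-(-n // width) for n in groups)

def utilisation(groups, width):
    """Fraction of issue slots that carry real work."""
    return sum(groups) / (bundles_needed(groups, width) * width)

# Hypothetical shader: mostly 1-4 independent ops per step, one 5-wide step.
groups = [1, 2, 3, 4, 5, 4, 3, 2]

for width in (5, 4):
    print(width, bundles_needed(groups, width), round(utilisation(groups, width), 3))
```

With this made-up workload only the single 5-wide step costs an extra bundle on the 4-lane unit, which is the "equally rare" case; hand-tuned code that fills all 5 lanes every step would lose more.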
 
User controllable power containment sounds nice, and the slide also suggests clockspeeds are quite a bit up, since there is no need to limit them just to fit a couple of apps within TDP.
 
Interestingly double-precision throughput is improved (I'm a little sceptical, though) with 2 MULs per clock, same as ADD.
If you look in some Cypress presentations, they had the exact same error in there, too ;)
This slide:

http://www.dyn-wp.frazpc.pl/wp-content/uploads/2010/11/Fot_018.jpg

appears to show that each SIMD engine has an "octo" fetch unit.
That may reuse the texture L1 (probably still 8kB) and may mimic GF100's doubled throughput with 4offset_gather4.
 
User controllable power containment sounds nice, and the slide also suggests clockspeeds are quite a bit up, since there is no need to limit them just to fit a couple of apps within TDP.

Indeed.

Aside from providing smoother motion and tearing elimination, a first level of power containment is to keep vsync on.

Still, not all cards have the same power consumption, even if you play a light game like Torchlight, vsynced at 60Hz. If I understand it correctly, this is where AMD's user-selected power containment could come into play.

I imagine it as a beast that can eat Crysis for dinner, but which can also play light games at a fraction of the max TDP. That, coupled with great idle power consumption, would result in an awesome overall card.

I have also uploaded the slides here. Credit to the original reporters has been duly noted.
 
If you look in some Cypress presentations, they had the exact same error in there, too ;)
That may reuse the texture L1 (probably still 8kB) and may mimic GF100's doubled throughput with 4offset_gather4.
Sorry, I realised I made a mistake there and added a note.

The diagram does show 8KB L1s though.
 
Still no 32bit INT ops
:?:
Of course there are 32bit integer operations. Even GPUs that are several years old can do them. The only thing that is a bit slow is 32bit integer multiplication (previously done only in the t unit, and now by a combination of all 4 slots). Everything else should be quite fast.
 
It's likely these slides were from the October Barts presentation.

The slides say October at the bottom, and we know from numerous reviews that Cayman's features were briefly presented in October.

So that 6990 November leak may be true... AMD was due to reveal more about Cayman in late November, and that's when the actual specs may have been filled in instead of the TBD.
 
Not a bad move to VLIW-4 if they got 10% die space saved AND similar performance
"similar performance" is likely ~95% performance, which would mean that the 5D->4D transition made the unified core about 5% more area-efficient... I think the main reason was DP, where the per-ALU performance stays exactly the same with a 10% area saving.
Marketing talk... it only happens at tess level 9. Otoh, there was the off die buffer mentioned, to improve high tess level performance.
This is hardly marketing talk. Using this logic would place Barts at the same position as Cypress, while in reality Barts is 23% faster than Cypress in HAWX2 (hardware.fr) despite Cypress's 36% arithmetic and texturing advantage. It's plausible that Cayman will perform at least 2.5 times better than Cypress in this game...
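
On the VLIW-4 area point earlier in this post, the arithmetic can be sketched as below. The 95% figure is only the guess made above, the 10% area saving is the number from the slide, and `perf_per_area` is a hypothetical helper name.

```python
# Back-of-envelope perf-per-area for VLIW-4 relative to VLIW-5 (= 1.0).
# 0.95 is a guess from this thread, 0.90 is the 10% SIMD area saving
# quoted on the slide. Neither is a measured result.

def perf_per_area(relative_perf, relative_area):
    """Performance per unit of area, relative to the baseline design."""
    return relative_perf / relative_area

general = perf_per_area(0.95, 0.90)           # guessed ~95% general performance
double_precision = perf_per_area(1.00, 0.90)  # DP per-ALU rate assumed unchanged

print(f"general: {general:.3f}  double precision: {double_precision:.3f}")
```

So under these assumptions the gain is roughly 5-6% for general shader work and about 11% for DP, which is why DP looks like the bigger motivation.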
 
2 Graphics Engines ? same as a GPC :p ?

I noticed an upgrade in ROP processing speed for Int16 and FP32; I hope that translates into faster AA.
 
From a Polish site, but the slides do speak for themselves:
So only 2 SIMD groups? Probably 2x15 SIMDs then?
Overall this looks quite similar to Evergreen:
- SIMDs are now VLIW-4.
- tessellator, vertex assembly and geometry assembly have moved to the "graphics engine", hence theoretically twice the polygon/tessellation throughput.
- ROPs got the ability for full-speed 1-channel / half-speed 2-channel fp32 (just like NVIDIA; honestly I don't know why they didn't do this with earlier chips, it should be dead easy to implement) plus some faster int16 handling.
- There is nothing like the rumored decoupled TMUs or anything like that.
 
That's low :LOL:
I was over-doing it a bit (matching may have been better) but the 6950 should easily match the 5870 in raw power (it would only take 1280 ALUs) and by the slide the arch has gone through plenty of enhancements.
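
For anyone wondering where 1280 comes from, a quick sketch, assuming it means the same number of VLIW units as the 5870 but with 4 lanes each instead of 5. The 5870 layout (20 SIMDs x 16 units x 5 lanes = 1600 ALUs) is the known Cypress configuration; applying it to the 6950 is pure speculation, and `alu_count` is a hypothetical helper name.

```python
# ALU count = SIMDs x VLIW units per SIMD x lanes per unit.
# The Cypress (HD 5870) figures are the shipped configuration; the VLIW-4
# line simply reuses the same unit count, which is an assumption.

def alu_count(simds, units_per_simd, lanes):
    return simds * units_per_simd * lanes

cypress_alus = alu_count(simds=20, units_per_simd=16, lanes=5)
same_units_vliw4 = alu_count(simds=20, units_per_simd=16, lanes=4)

print(cypress_alus, same_units_vliw4)
```

Same number of VLIW units, one lane fewer each, so per-clock throughput only matches if the fifth lane was rarely filled anyway.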

My bet is that the difference between the 6950 and the GTX 580 may not be significant (anything within +/- 5% is not significant for me when so many games already run at crazy high fps; even 10% and more is completely transparent to the end user in some cases/games).

I'm confident that the 6970 will beat the GTX 580 (I'd put the odds at 80%). Antilles should offer a "useless" amount of power :devilish:.
 