AMD: R9xx Speculation

ZerazaX · Nov 22, 2010

If you look at the graph... it never dips below 1.5x Cypress and is at 2x or higher for most of it

http://www.dyn-wp.frazpc.pl/wp-content/uploads/2010/11/Fot_014.jpg

Jawed · Nov 22, 2010

10% area saving for the SIMDs with the VLIW-4 architecture. So it wasn't just for utilisation.

Interestingly double-precision throughput is improved (I'm a little sceptical, though) with 2 MULs per clock, same as ADD. MAD/FMA are still 1 per clock.

Fetch directly from LDS

This means it's not a feature of Cypress, as I had been expecting while waiting for the compiler to improve. Fingers crossed it's good...

This slide:

http://www.dyn-wp.frazpc.pl/wp-content/uploads/2010/11/Fot_018.jpg

appears to show that each SIMD engine has an "octo" fetch unit. EDIT: er, no, they're just 4 "tall" blocks.

ZerazaX · Nov 22, 2010

Not a bad move to VLIW-4 if they got 10% die space saved AND similar performance

hoom · Nov 22, 2010

Still no 32bit INT ops

Lux_ · Nov 22, 2010

ZerazaX said:
Not a bad move to VLIW-4 if they got 10% die space saved AND similar performance

I guess it is quite hard for a compiler to fill all 5 VLIW lanes: such cases are quite limited, so naturally a performance drop in the VLIW-4 case is equally rare.
May suck for hand-optimized cases, though. Yet another reason to avoid these and rely on the compiler.
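
A toy model of this argument, as a rough sketch only. The distribution of group widths below is made up for illustration (it is not measured from any shader), and the model crudely assumes a group of n independent ops simply needs ceil(n / width) bundles, ignoring any repacking a real compiler would do.

```python
from math import ceil

# Hypothetical share of instruction groups by number of independent ops
# the compiler managed to pack together (assumed numbers, not measured).
bundle_mix = {1: 0.15, 2: 0.25, 3: 0.30, 4: 0.25, 5: 0.05}

def avg_cycles(mix, width):
    """Average issue cycles per group if n ops need ceil(n / width) bundles."""
    return sum(share * ceil(n / width) for n, share in mix.items())

vliw5 = avg_cycles(bundle_mix, 5)  # every group fits in one bundle
vliw4 = avg_cycles(bundle_mix, 4)  # only the 5-wide groups spill into a second bundle
slowdown = vliw4 / vliw5 - 1

print(f"VLIW-5: {vliw5:.2f} cycles/group, VLIW-4: {vliw4:.2f} cycles/group, "
      f"slowdown: {slowdown:.1%}")
```

Under these assumed numbers only the groups that actually fill all 5 lanes cost anything extra, so the slowdown tracks how often that happens.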

Kaotik · Nov 22, 2010

User-controllable power containment sounds nice, and the slide also suggests clockspeeds are quite a bit up, since there is no need to limit them just to fit a couple of apps within TDP

DarthShader · Nov 22, 2010

And the article is pulled!

no-X said:
Up-to 3-times faster compared to Cypress... says the slide.

Marketing talk... it only happens at tess level 9. Otoh, there was the off die buffer mentioned, to improve high tess level performance.

kresek · Nov 22, 2010

DarthShader said:
And the article is pulled!

Grab the slides:

http://www.dyn-wp.frazpc.pl/wp-content/uploads/2010/11/Fot_001.jpg
... to ...
http://www.dyn-wp.frazpc.pl/wp-content/uploads/2010/11/Fot_024.jpg

Kef · Nov 22, 2010

kresek said:
Grab the slides:

http://www.dyn-wp.frazpc.pl/wp-content/uploads/2010/11/Fot_001.jpg
... to ...
http://www.dyn-wp.frazpc.pl/wp-content/uploads/2010/11/Fot_024.jpg

And here: http://www.computerbase.de/forum/showpost.php?p=8860219&postcount=92

Gipsel · Nov 22, 2010

Jawed said:
Interestingly double-precision throughput is improved (I'm a little sceptical, though) with 2 MULs per clock, same as ADD.

If you look in some Cypress presentations, they had the exact same error in there, too

Jawed said:
This slide:

http://www.dyn-wp.frazpc.pl/wp-content/uploads/2010/11/Fot_018.jpg

appears to show that each SIMD engine has an "octo" fetch unit.

That may reuse the texture L1 (probably still 8kB) and may mimic GF100's doubled throughput with 4offset_gather4.

psolord · Nov 22, 2010

Kaotik said:
User-controllable power containment sounds nice, and the slide also suggests clockspeeds are quite a bit up, since there is no need to limit them just to fit a couple of apps within TDP

Indeed.

Aside from providing smoother motion and tearing elimination, a first level of power containment is to keep vsync on.

Still, not all cards have the same power consumption, even if you try to play a light game like Torchlight, vsynced at 60Hz. If I understand it correctly, this is where AMD's user-selected power containment could come into play.

I imagine it as a beast that can eat Crysis for dinner, which also can play light games with a fraction of the max tdp. That, coupled with great idle power consumption, would result in an awesome overall card.

I have also uploaded the slides here. Credits to the original reporters have been duly noted.

Jawed · Nov 22, 2010

Gipsel said:
If you look in some Cypress presentations, they had the exact same error in there, too
That may reuse the texture L1 (probably still 8kB) and may mimic GF100's doubled throughput with 4offset_gather4.

Sorry, I realised I made a mistake there and added a note.

The diagram does show 8KB L1s though.

Gipsel · Nov 22, 2010

hoom said:
Still no 32bit INT ops

Of course there are 32-bit integer operations. Even GPUs that are several years old can do them. The only thing that is a bit slow is 32-bit integer multiplication (previously done only in the t unit, now by a combination of all 4 slots). Everything else should be quite fast.
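
To illustrate why a full 32-bit multiply can cost more than other integer ops, here is a textbook decomposition into narrower partial products. This is only a sketch of the general idea: the 16-bit split and the function name are assumptions for illustration, not a description of how the hardware actually combines its slots.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def mul32_lo_from_16bit_parts(a, b):
    """Low 32 bits of a * b, built from 16-bit x 16-bit partial products.

    Hypothetical helper: the 16-bit split is assumed for illustration only.
    """
    a_lo, a_hi = a & MASK16, (a >> 16) & MASK16
    b_lo, b_hi = b & MASK16, (b >> 16) & MASK16

    p0 = a_lo * b_lo   # contributes at bit 0
    p1 = a_lo * b_hi   # contributes at bit 16
    p2 = a_hi * b_lo   # contributes at bit 16
    # a_hi * b_hi would contribute at bit 32 and up, so it never
    # affects the low 32 bits and can be skipped here.

    return (p0 + ((p1 + p2) << 16)) & MASK32

print(hex(mul32_lo_from_16bit_parts(123456789, 987654321)))
```

Three narrower multiplies plus shifts and adds are needed for just the low half of the result, which is the kind of work that has to be spread over several units or cycles.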

ZerazaX · Nov 22, 2010

It's likely these slides were from the October Barts presentation

The slides say October at the bottom, and we know that Cayman's features were briefly presented in October according to numerous reviews

So that 6990 November leak may be true... AMD was due to reveal more about Cayman in late November, and that's when the actual specs may have been filled in instead of the TBD

no-X · Nov 22, 2010

ZerazaX said:
Not a bad move to VLIW-4 if they got 10% die space saved AND similar performance

"similar performance" is likely ~95% performance, which would mean that the 4D->5D transition made the unified core 5% more area-efficient... I think the main reason was DP, where the per-ALU performance stays exactly the same with 10% area saving.
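
A quick check of the arithmetic, as a sketch. The 10% area saving is the slide's figure as reported in this thread; the 95% performance figure is the guess from the post above, not a measurement.

```python
def perf_per_area(relative_perf, relative_area):
    """Performance per unit of SIMD area, relative to the VLIW-5 design (= 1.0)."""
    return relative_perf / relative_area

AREA_VLIW4 = 0.90  # 10% SIMD area saving, per the slide

general = perf_per_area(0.95, AREA_VLIW4)           # 0.95 is a guess from the thread
double_precision = perf_per_area(1.00, AREA_VLIW4)  # per-ALU DP rate assumed unchanged

print(f"general: {general:.3f}x, double precision: {double_precision:.3f}x")
```

With these inputs the general-case gain in performance per area comes out at roughly 5 to 6%, and the double-precision case at roughly 11%.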

DarthShader said:
Marketing talk... it only happens at tess level 9. Otoh, there was the off die buffer mentioned, to improve high tess level performance.

This is hardly marketing talk. Using this logic would place Barts at the same position as Cypress, while in reality Barts is 23% faster in HAWX2 than Cypress (hardware.fr) despite Cypress's 36% arithmetic and texturing advantage. It's plausible that Cayman will perform at least 2.5 times better than Cypress in this game...
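
Putting the two quoted figures together, as a rough sketch. Both numbers are taken from the post as stated and are not independently verified; the variable names are just for illustration.

```python
barts_fps_vs_cypress = 1.23   # Barts is 23% faster in HAWX2 (figure quoted in the post)
cypress_alu_vs_barts = 1.36   # Cypress has 36% more arithmetic/texturing throughput

# Frame rate Barts delivers per unit of arithmetic throughput, relative to Cypress.
barts_fps_per_throughput = barts_fps_vs_cypress * cypress_alu_vs_barts

print(f"Barts delivers about {barts_fps_per_throughput:.2f}x the HAWX2 frame rate "
      f"per unit of arithmetic throughput")
```

If the quoted figures hold, the result suggests this game is limited by something other than shader and texture throughput, which is the point being made about tessellation.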

liolio · Nov 22, 2010

What is the likelihood of the HD6950 actually beating the GTX 580?

chavvdarrr · Nov 22, 2010

liolio said:
What is the likelihood of the HD6950 actually beating the GTX 580?

~ 0

DavidGraham · Nov 22, 2010

2 Graphics Engines? Same as a GPC?

I noticed an upgrade in ROP processing speed for Int16 and FP32; I hope that translates into faster AA.

mczak · Nov 22, 2010

kresek said:
From a Polish site, but the slides do speak for themselves:

So 2 simd groups only? Probably 2x15 simds then?
Overall this looks quite similar to Evergreen:
- simds are now vliw-4.
- tessellator, vertex assembly, geometry assembly have moved to the "graphics engine", hence theoretically twice the polygon/tessellation throughput.
- rops got the ability for full-speed 1-channel / half-speed 2-channel fp32 (just like nvidia - honestly I don't know why they didn't do this with earlier chips, it should be dead easy to implement) plus some faster int16 handling.
- There is nothing like the rumored decoupled tmus or anything like that.

liolio · Nov 22, 2010

chavvdarrr said:
~ 0

That's low

I was over-doing it a bit (matching may have been better) but the 6950 should easily match the 5870 in raw power (it would only take 1280 ALUs) and by the slide the arch has gone through plenty of enhancements.
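
Where the 1280 figure plausibly comes from, as a sketch. The HD 5870's 1600 ALUs are arranged as 5-wide VLIW units; the assumption here is that the 1280 number keeps the same count of VLIW units and narrows each to 4 lanes. Clocks and per-lane capability are ignored.

```python
CYPRESS_ALUS = 1600        # HD 5870 stream processors
VLIW5_WIDTH = 5
VLIW4_WIDTH = 4

vliw_units = CYPRESS_ALUS // VLIW5_WIDTH   # number of 5-wide units on the HD 5870
vliw4_alus = vliw_units * VLIW4_WIDTH      # same unit count, 4 lanes each (assumed reading)

print(f"{vliw_units} VLIW units -> {vliw4_alus} ALUs at 4 lanes per unit")
```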

My bet is that the difference between the 6950 and the GTX 580 may not be significant (between 0-5% +/- is not significant for me when so many games already run at crazy high fps; even 10% and more in some cases/games is completely transparent to the end user).

I'm confident that the 6970 will beat the GTX 580 (I'd put the odds at 80%). Antilles should offer a "useless" amount of power
