AMD: R9xx Speculation

Mianca · Nov 26, 2010

UniversalTruth said:
And that reason is... simply... best case scenario... or something like up to X1680.

HD 6970 being 50% faster than HD 6870 isn't the best case scenario ... especially not in 3DMark 11

Squilliam · Nov 26, 2010

Ok, we're all talking about best case scenarios here. What kind of performance is best case scenario? Oh and I look at the clock and see it is almost December. December means presents and fast graphics cards to play with? Even with the delay it is still kind of hard to get concrete information out. We don't even have a die size, do we?

Lastly are we looking at something which has a significant performance per mm^2 improvement or are we just looking at an increase in functional units mainly proportional to the increase in the number of transistors and therefore performance through size?

OgrEGT · Nov 26, 2010

Speculations are around

370-400mm2
>2.5 billion Transistors (more like 2.7?)
850-900MHz
appr. 2.6TFlops
appr. 230W Power consumption (games)
Between
2x15 or 3x10 SIMDs, each with 16 x 4D VLIWs / 120 TMUs
and
2x12 or 3x8 SIMDs, each with 16 x 4D VLIWs / 96 TMUs

30-50% more performance than HD6870

Edit: Sounds still nice for me
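
A rough sketch of how the rumoured SIMD layouts above map to lane and TMU counts. It assumes 16 VLIW units per SIMD with 4 lanes each, 4 TMUs per SIMD (which is what 120/30 and 96/24 imply), and 2 FLOPs per lane per clock; the function name `describe` is made up for illustration.

```python
# Rumoured layouts: (groups, SIMDs per group). All constants are assumptions
# read off the speculation list, not confirmed specs.
LANES_PER_SIMD = 16 * 4          # 16 VLIW units x 4 lanes (VLIW4)
TMUS_PER_SIMD = 4                # implied by 120 TMUs / 30 SIMDs
FLOPS_PER_LANE_PER_CLOCK = 2     # one multiply-add per lane per clock


def describe(groups, simds_per_group, clock_mhz):
    """Return (SIMDs, lanes, TMUs, peak TFlops) for one layout."""
    simds = groups * simds_per_group
    lanes = simds * LANES_PER_SIMD
    tmus = simds * TMUS_PER_SIMD
    peak_tflops = lanes * FLOPS_PER_LANE_PER_CLOCK * clock_mhz / 1e6
    return simds, lanes, tmus, peak_tflops


for groups, per_group in [(2, 15), (3, 10), (2, 12), (3, 8)]:
    for clock in (850, 900):
        simds, lanes, tmus, tflops = describe(groups, per_group, clock)
        print(f"{groups}x{per_group} @ {clock} MHz: {lanes} lanes, "
              f"{tmus} TMUs, {tflops:.2f} TFlops peak")
```

Under these assumptions the "appr. 2.6TFlops" figure lines up with the smaller 24-SIMD layout at 850MHz, while the 30-SIMD layout lands above 3 TFlops.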

no-X · Nov 26, 2010

According to the leaked slides, the SIMDs should be in two groups, so from 2x12 to 2x...:smile:

LordEC911 · Nov 26, 2010

~3.2-3.5TFlops.

Best case would technically be ~1.8-2x the performance of 6870, depending on exact specs, but as we know that won't translate into a real-world performance increase, so 1.4-1.6x would be more reasonable in the majority of cases.

Mianca · Nov 26, 2010

@OgrEGT:

30-50% faster than HD 6870 seems rather conservative.

Given that it's a new and improved architecture, you'd expect perf/mm² to stay at least at the same level as Barts.

I really don't see a next-gen chip that's supposed to be ~50% bigger than Barts achieving less than 50% increase in overall performance ...

HD 6970 will most likely be somewhere between 50-60% faster (in general) than HD 6870 - and thus end up ~5-10% faster (in general) than GTX580.

DX11 performance could/should see an even bigger jump in relative performance. 3DMark11 will surely yield some very interesting results in that respect.

no-X · Nov 26, 2010

Mianca said:
Given that it's a new and improved architecture, you'd expect perf/mm² to stay at least at the same level as Barts

I don't think so. Barts doesn't support DP, Cayman does. Those are additional transistors, which won't be utilized in rendering. Another thing is the dual geometry engine - it also consumes transistors, but it won't impact performance in many games (because the majority of games aren't limited by geometry performance). The 4D thing also seems to be targeted at HPC (better DP:SP ratio), and the functionality transferred from the T-unit to X/Y/Z/W is also very HPC-oriented... it costs transistors, which won't be utilized in 3D. I'd be very surprised if Cayman brings better performance per transistor than Barts (at the same clock, of course), because it appears to me that this GPU is oriented towards achieving the best HPC performance per transistor - not the best 3D performance per transistor (that was Barts' job). And the difference in this respect seems to be significantly higher than between Cypress and Juniper.

OgrEGT · Nov 27, 2010

LordEC911 said:
~3.2-3.5TFlops.

Best case would technically be ~1.8-2x the performance of 6870, depending on exact specs, but as we know that won't translate into a real-world performance increase, so 1.4-1.6x would be more reasonable in the majority of cases.

3.3TFlops if 1920VLIWs at 100% Utilization at 850MHz.

Edit: Earlier, Gipsel suggested some 3.2 of 4 so 80% of that.
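
The arithmetic behind those two numbers, as a minimal sketch. It assumes 2 FLOPs per lane per clock and treats "3.2 of 4" as the average number of VLIW slots filled per instruction; both helper names are hypothetical.

```python
def peak_tflops(lanes, clock_mhz, flops_per_lane_per_clock=2):
    """Theoretical peak, assuming every lane issues a multiply-add each clock."""
    return lanes * flops_per_lane_per_clock * clock_mhz / 1e6


def effective_tflops(peak, slots_filled, slots_total=4):
    """Scale the peak by average VLIW slot occupancy (an assumed model)."""
    return peak * slots_filled / slots_total


peak = peak_tflops(1920, 850)
effective = effective_tflops(peak, 3.2)
print(f"peak {peak:.2f} TFlops, effective {effective:.2f} TFlops")
```

With 1920 lanes at 850MHz this gives roughly 3.26 TFlops peak and about 2.61 TFlops at 80% slot occupancy.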

chavvdarrr · Nov 27, 2010

Mianca said:
@OgrEGT:

30-50% faster than HD 6870 seems rather conservative.

Given that it's a new and improved architecture, you'd expect perf/mm² to stay at least at the same level as Barts.

What about drivers? These won't be as well optimised for the new architecture, so we can expect some 10-20% better performance during the first 6 months compared to launch.
Of course, I bet the "popular" benchmarks will be heavily optimised at launch, but still ...

Jawed · Nov 27, 2010

One of the slides that was leaked says:

Upgraded Render Back-Ends

Coalescing of write ops [thought that was already in there]
16-bit integer (snorm/unorm) ops are 2x faster
32-bit FP (single/double component) ops are 2x-4x faster

Here:

http://www.hardware.fr/articles/806-4/dossier-nvidia-geforce-gtx-580-sli.html

we can see that single-component fp32 fillrate is half speed on HD5870. So I guess that'll become full speed.

Blending might be 4x faster? Is there much need for blending of fp32 single-/dual-channel pixels though?

Or perhaps dual-component fp32 fillrate will be 4x faster than it currently is (however fast that is).

The EQAA modes with the extra coverage samples would appear to be partly dependent upon blending speeds, so perhaps blending speeds are boosted in the relevant places to make performance adequate here.

keritto · Nov 27, 2010

Kaotik said:
There was one theory posted in the "HD5 AF broken" thread: AMD/ATI use more detailed LOD values by default. By "softening" the LOD by +0.65, the shimmering disappears on Radeons; incidentally, "sharpening" the LOD by -0.65 makes the shimmering appear on GeForces.

Thanks. You could already say that

Which tool do you use to set these "fractional values" for LOD? How does this translate to the DX/OGL tweaks in ATT (LOD range: -10>-<10)?

Gipsel said:
And you can forget about 1.5 GHz GDDR5 with just 6 GBit/s chips. It will be a bit lower (I guess 1.4 GHz maximum).

I'd say you're drawing the wrong conclusion. 6Gbps should usually mean exactly that, 750MHz(x8), and just as the 5Gbps chips used on the HD5700/5800 series could easily be raised to 5.6Gbps, these 6Gbps puppies should have a similar 66.6-75MHz(x8) of headroom. And as official AMD card specs go, they could declare them as 1450M(x4) parts (or, as I'd more correctly put it, 725M(x8)) by just lowering the clock 25MHz.
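
For anyone juggling the x4 and x8 figures, a small conversion sketch. It assumes the per-pin data rate is the quoted clock times the multiplier (x4 for the clock AMD quotes on spec sheets, x8 for the base clock used above); the function names are made up.

```python
def gbps_from_clock(clock_mhz, multiplier):
    """Per-pin data rate in Gbps for a clock in MHz and an x4 or x8 multiplier."""
    return clock_mhz * multiplier / 1000


def clock_from_gbps(gbps, multiplier):
    """Clock in MHz needed to reach a per-pin data rate in Gbps."""
    return gbps * 1000 / multiplier


print(gbps_from_clock(750, 8))    # rated speed of the 6Gbps chips
print(gbps_from_clock(1450, 4))   # the suggested 1450M(x4) spec
print(gbps_from_clock(725, 8))    # same thing expressed as x8
print(gbps_from_clock(1500, 4))   # Gipsel's 1.5 GHz case
print(gbps_from_clock(1400, 4))   # Gipsel's 1.4 GHz guess

# Headroom on the older 5Gbps chips when pushed to 5.6Gbps, in MHz(x8)
print(clock_from_gbps(5.6, 8) - clock_from_gbps(5.0, 8))
```

On these numbers 1450M(x4) and 725M(x8) are the same 5.8Gbps, and the 5Gbps to 5.6Gbps bump on the older chips works out to about 75MHz(x8).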

And as far as spec(ulation)s go, I'd agree Shtal is wrong

and here's my compiled wet dreams

6990 (XTX) 775MHz 3840SPs 6.0GFlops (310W)
6970 (XT) 1025M 1920SPs 4.0GFlops (232W)
6950 (Pro) 875M 1536SPs 2.7GFlops (188W)

cenit · Nov 27, 2010

keritto said:
6990 (XTX) 775MHz 3840SPs 6.0GFlops (310W)
6970 (XT) 1025M 1920SPs 4.0GFlops (232W)
6950 (Pro) 875M 1536SPs 2.7GFlops (188W)

I hope you're wet dreaming TFlops, not GFlops...

wishiknew · Nov 27, 2010

Those EQAA modes, is that basically AMD's version of CSAA?

And no slides of increased performance per watt or area except for that VLIW4 slide. No more architectural magic left in this round?

mczak · Nov 27, 2010

Jawed said:
Blending might be 4x faster? Is there much need for blending of fp32 single-/dual-channel pixels though?

I don't know if there's really much need for that, but imho it makes a lot of sense. Currently dual-channel and single-channel fp32 blending are performed at the same speed as quad-channel fp32 blending (well, apart from memory bandwidth requirements), at 1/4 the rate of 8-bit int blending. Clearly, faster quad-channel fp32 blending wouldn't be helpful (there's not enough memory bandwidth even at quarter rate already...), but this means that for single-channel fp32 blending the hw currently apparently uses only 1 of the 4 blend units of a ROP; the rest are just idling. So by using all of them (you just need to feed 4 consecutive pixels to the 4 rgba blend units), single-channel fp32 blending performance should increase by a factor of 4 (well, not quite, as it will hit memory bandwidth limits) and dual-channel fp32 blending by a factor of 2, with minimal hardware changes. (nvidia has been doing this for a while now.)
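
A toy model of that argument, ignoring memory bandwidth entirely. It assumes 4 blend units per ROP and that quad-channel fp32 blending runs at 1/4 of the int8 rate, as described above; `fp32_blend_rate` and its `pack_pixels` flag are hypothetical names.

```python
from fractions import Fraction

BLEND_UNITS_PER_ROP = 4                 # one per r/g/b/a channel (assumed)
QUAD_FP32_VS_INT8 = Fraction(1, 4)      # quad-channel fp32 rate relative to int8


def fp32_blend_rate(channels, pack_pixels):
    """fp32 blend rate relative to int8 blending, bandwidth limits ignored.

    pack_pixels=False: one pixel per pass, idle units stay idle.
    pack_pixels=True:  consecutive pixels fill the otherwise idle units.
    """
    pixels_per_pass = BLEND_UNITS_PER_ROP // channels if pack_pixels else 1
    return QUAD_FP32_VS_INT8 * pixels_per_pass


for channels in (1, 2, 4):
    before = fp32_blend_rate(channels, pack_pixels=False)
    after = fp32_blend_rate(channels, pack_pixels=True)
    print(f"{channels}-channel fp32: {before} -> {after} of int8 rate "
          f"({after / before}x)")
```

In this model single-channel goes from 1/4 to full int8 rate, dual-channel from 1/4 to 1/2, and quad-channel stays put, which matches the 4x and 2x factors above before bandwidth gets in the way.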

PSU-failure · Nov 27, 2010

Even considering bandwidth constraint, it could be quite a good improvement in some pathological cases.

Mianca · Nov 27, 2010

neliz said:
It might launch (really) close to the 570?

With GTX570 rumored to launch on December 7th, it sure would be a nice move to launch HD 69** just one day before that

no-X · Nov 27, 2010

mczak said:
(nvidia has been doing this for a while now.)

Since Fermi...? I think GT200 didn't support it.

Thalb · Nov 27, 2010

no-X said:
I don't think so. Barts doesn't support DP, Cayman does. Those are additional transistors, which won't be utilized in rendering. Another thing is the dual geometry engine - it also consumes transistors, but it won't impact performance in many games (because the majority of games aren't limited by geometry performance). The 4D thing also seems to be targeted at HPC (better DP:SP ratio), and the functionality transferred from the T-unit to X/Y/Z/W is also very HPC-oriented... it costs transistors, which won't be utilized in 3D. I'd be very surprised if Cayman brings better performance per transistor than Barts (at the same clock, of course), because it appears to me that this GPU is oriented towards achieving the best HPC performance per transistor - not the best 3D performance per transistor (that was Barts' job). And the difference in this respect seems to be significantly higher than between Cypress and Juniper.

This last sentence is trivial, as Juniper was just a Cypress cut in half, with some GPGPU functionality removed to save space. So if Cayman is anything other than Barts X2 (and I bet it is!), it has to be more fundamentally different from Barts than Cypress was from Juniper.

Otherwise, I agree with your post. But don't forget that Cayman has a different target market than Barts:
- Cayman is intended for the enthusiast (hence the x9xx designation), who cares little about price, power consumption, etc. All that matters is performing better than the competition.
- Barts is intended for gamers who prefer to spend no more than $200 on a graphics card and do not care about GPGPU functionality, the single-GPU performance crown, etc., as long as they can play the newest games without stuttering. The point of the x850 is to get 80% of the performance of the top GPU at 50% of the price.

mczak · Nov 28, 2010

no-X said:
Since Fermi...? I think GT200 didn't support it.

GT200 worked the same.
http://www.hardware.fr/articles/787-7/dossier-nvidia-geforce-gtx-480-470.html
Note though that nvidia doesn't have 1/4 4-channel fp32 blend. It actually looks like 1/4 per channel, so with 4-channel fp32 blend you get 1/16 of the int8 blend rate - same for gt200 and gf100 (but for gf100 it appears as more than 1/16 because int8 blend is limited by the 64bit per clock export limit of the SMs).
Obviously, that would give Cayman a huge advantage over GF110 there. But I don't know if fp32 blending is really used anywhere - probably not...
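
One way to read the "1/4 per channel" observation, as a sketch only: each fp32 channel is assumed to cost four int8 blend slots, so the rate falls with channel count. The function name is made up, and the SM export limit mentioned above is not modelled.

```python
from fractions import Fraction


def nvidia_fp32_blend_rate(channels):
    """fp32 blend rate relative to int8, assuming 1/4 rate per channel."""
    return Fraction(1, 4 * channels)


for channels in (1, 2, 4):
    print(f"{channels}-channel fp32 blend: "
          f"{nvidia_fp32_blend_rate(channels)} of int8 rate")
```

Under that reading, 4-channel fp32 blending comes out at 1/16 of the int8 rate, as measured for gt200 and gf100.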

Ethatron · Nov 28, 2010

mczak said:
But I don't know if fp32 blending is really used anywhere - probably not...

OpenCL image extension?
