AMD: Southern Islands (7*** series) Speculation/Rumour Thread

Looking at the prior generation, I see ATI packed 2.6 billion transistors into the HD 6970's 378 mm², while Nvidia packed 3 billion into 529 mm². Is this difference in density due to Nvidia's hot clock?

First, AMD and Nvidia count transistors differently, so you can't just directly compare transistor counts.

And yes, AMD packs more AMD transistors per mm² than Nvidia does with Nvidia transistors.

And different structures can be packed more densely than others, so architectural differences alone can mean one IHV packs more of a certain "thing" per mm² than the other.

As one example, I believe AMD packs its ALUs much more densely than Nvidia does, but I also believe each ALU is less capable.

So really, transistor counts are largely meaningless when comparing the two. Die size is more meaningful, but even that isn't terribly meaningful: part of why GF110 is so much larger than Cayman is that it devotes a lot more die area to compute capabilities and other things that contribute relatively little to graphics workloads.
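As a rough back-of-the-envelope check on that density gap, taking the figures quoted above at face value (and with the caveat, per the counting-methods point, that the two numbers aren't strictly comparable):

```python
# Back-of-the-envelope density comparison from the figures quoted above.
chips = {
    "Cayman (HD 6970)": (2.6e9, 378),  # (transistors, die area in mm^2)
    "GF110 (GTX 580)":  (3.0e9, 529),
}

for name, (transistors, area_mm2) in chips.items():
    density = transistors / area_mm2 / 1e6  # millions of transistors per mm^2
    print(f"{name}: {density:.1f} Mtransistors/mm^2")

# Prints ~6.9 for Cayman vs ~5.7 for GF110: roughly a 20% density edge,
# with the caveat that the two vendors count transistors differently.
```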

Regards,
SB
 
Looking at the prior generation, I see ATI packed 2.6 billion transistors into the HD 6970's 378 mm², while Nvidia packed 3 billion into 529 mm². Is this difference in density due to Nvidia's hot clock?
Basically, if you want a higher clock, you have to increase your pipeline length, because you have less time to process each pipeline stage. A longer pipeline of course requires more transistors.

A really simple example:
Building a car takes 16 hours to complete. If you have a "pipeline" with a single stage, you can clock the pipeline at one clock per 16 hours. If you now split car manufacturing into, for example, 5 stages (structure = 4 hours, tires/drivetrain/suspension = 3 hours, internals = 4 hours, windows/lights = 3 hours, painting = 2 hours), you can clock the pipeline at one clock per 4 hours (the time required to complete the longest stage). So in this case the longer pipeline gives you 4x throughput (4x more cars finished in the same time).

Increased pipeline length causes additional latency, however. In the 5-stage example, one car takes 4*5 = 20 hours to complete (from start to finish). Moving work from one pipeline stage to the next also takes some time, so pipeline stages have slightly less than a full clock to do real work. More stages = more wasted time (and wasted transistors) on things that do not contribute to the result.
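The same arithmetic as a minimal sketch, with stage times taken straight from the example:

```python
# The car-assembly pipeline from the example above, in code.
# Stage times in hours; the slowest stage sets the clock period.
stages = {
    "structure": 4,
    "tires/drivetrain/suspension": 3,
    "internals": 4,
    "windows/lights": 3,
    "painting": 2,
}

cycle_time = max(stages.values())    # 4 h: one finished car per clock
latency = cycle_time * len(stages)   # 20 h: one car from start to finish
single_stage = sum(stages.values())  # 16 h: the unpipelined baseline

print(f"cycle time: {cycle_time} h")
print(f"latency:    {latency} h")
print(f"throughput: {single_stage / cycle_time:.0f}x the single-stage rate")
```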

Pipeline length is only one factor in the transistor budget. Nvidia's cards have considerably higher 64-bit floating point throughput than last-generation AMD products, and have more sophisticated caches targeted at high-performance GPU computing. Things like these take transistors as well.
 
It is already 40%-50% faster than its predecessor, 35% in the title in question. That's no small gap by any means, and again, no reason to push things out. And the example given is not even a "driver" issue, as it's something implemented in the game.

No surprise there given the 50% bandwidth increase. It's only natural that some of us expected more with a new architecture.
 
No surprise there given the 50% bandwidth increase. It's only natural that some of us expected more with a new architecture.

I would assume that if compute weren't needed they could have built a faster card, but then they'd face a lack of features for the professional arena down the line.
A ton of money is made there, and if the new GCN series addresses that field with good tech and features, the series is a winner no matter what we gamers think ;)
 
Basically, if you want a higher clock, you have to increase your pipeline length, because you have less time to process each pipeline stage. A longer pipeline of course requires more transistors.

A really simple example:
Building a car takes 16 hours to complete. If you have a "pipeline" with a single stage, you can clock the pipeline at one clock per 16 hours. If you now split car manufacturing into, for example, 5 stages (structure = 4 hours, tires/drivetrain/suspension = 3 hours, internals = 4 hours, windows/lights = 3 hours, painting = 2 hours), you can clock the pipeline at one clock per 4 hours (the time required to complete the longest stage). So in this case the longer pipeline gives you 4x throughput (4x more cars finished in the same time).

Increased pipeline length causes additional latency, however. In the 5-stage example, one car takes 4*5 = 20 hours to complete (from start to finish). Moving work from one pipeline stage to the next also takes some time, so pipeline stages have slightly less than a full clock to do real work. More stages = more wasted time (and wasted transistors) on things that do not contribute to the result.

Pipeline length is only one factor in the transistor budget. Nvidia's cards have considerably higher 64-bit floating point throughput than last-generation AMD products, and have more sophisticated caches targeted at high-performance GPU computing. Things like these take transistors as well.
Small addendum: optimum pipeline length will depend heavily on the demands of the software you're running, and on the amount of resources you dedicate to alleviating the effects of pipeline stalls/flushes. Not only that, but as Sebbbi implied above, the silicon process is also a major factor. Still, the x86 CPUs we've seen from the Pentium Pro in 1995 onwards have fluctuated back and forth between 10 and 25 pipeline stages. Which, all things considered, isn't much, given that the number of transistors on a CPU has grown by almost a factor of 1000 over the same period.

Specific application areas, such as graphics used to be, can have drastically different optimum pipeline lengths due to the limited scope of the target code, allowing the processor to be tailored to its task.
 
I would assume that if compute weren't needed they could have built a faster card, but then they'd face a lack of features for the professional arena down the line.
A ton of money is made there, and if the new GCN series addresses that field with good tech and features, the series is a winner no matter what we gamers think ;)

Uhm, no, not really. Nvidia, as opposed to AMD, does get some nice revenue from its Quadro/Tesla lines, to the tune of $200 million per quarter. However, that seems to be strongly dominated by the Quadro (graphics) products rather than Tesla (computation). Compared to the overall computer market, or even the graphics market, computation revenue is very, very small. There just isn't that much gold in those hills. It's a pretty safe market, but volumes are pitiful, and there is no way that computation has come even close to paying for its engineering and software costs for the graphics vendors. It's an investment. (Paid for by gamers, ironically.)
 
Eric: Y'know, we didn't even put in [the presentation] all the changes we put in, we put in a lot of stuff but - I'll be honest with you, we never even saw the problems on Cayman until people started looking at it. Then we went back and we have about three or four guys that worked on this pretty consistently, a whole bunch of guys and the whole texture team. We found some problems and we fixed those and we found other things that we're going to fix over time as well. The one promise we can say is 'better'. For sure, better. I don't know if you guys could see it from there?
I presume the problem is not entirely fixed, not yet anyway.
 
If AMD can't reproduce the issue reliably, then they really wouldn't have a clue what to fix or how, no matter the volume of user complaints.
 
No surprise there given the 50% bandwidth increase. It's only natural that some of us expected more with a new architecture.
[How many years have we been doing this...?] In the absence of other changes, a 50% improvement in bandwidth or core clock rarely translates into linear scaling; in fact, the gain is generally half, or a little more, of the improvement. Given that there is only a 33% improvement in execution units, seeing this performance differential shows that the architecture is already benefiting games very well. Add to this the cases where things are genuinely doubled or more, such as tessellation and various compute cases, and it's clear that the architecture is doing what it says on the tin; the fact that there is more to come through better understanding is just gravy.
 
[How many years have we been doing this...?] In the absence of other changes, a 50% improvement in bandwidth or core clock rarely translates into linear scaling; in fact, the gain is generally half, or a little more, of the improvement.

Rephrasing a bit: what you are saying is that on modern benchmarking code, the cards are not wholly limited by either memory bandwidth or ALU capability, so improvements in either area typically do not achieve linear scaling. This further implies that the cards are at least reasonably balanced; to yield linear scaling they would have to be totally limited by that one resource, which in turn would indicate a design that was quite unbalanced relative to the benchmarking load.
Makes sense.

However, if the improvement is less than either of the improvements in base capabilities would suggest, that would imply either that other architectural factors are coming into play which a very simplistic analysis cannot take into account, or that the software layer still isn't quite as mature as it is for the previous architecture. To be honest, I haven't seen much of that, but those would be the interesting cases to dig deeper into. Or, on a positive note, the cases where the improvement is greater than either! :)
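As a toy illustration of that bottleneck mix (the 50/50 split between bandwidth-bound and ALU-bound time below is purely an assumption for illustration, not a measured workload profile):

```python
# Toy bottleneck-mix model: a frame's time is split between
# bandwidth-bound and ALU-bound work. The 50/50 default split is an
# assumed figure for illustration only.
def speedup(bw_gain, alu_gain, bw_fraction=0.5):
    alu_fraction = 1.0 - bw_fraction
    # New frame time: each portion shrinks by its own resource's gain.
    return 1.0 / (bw_fraction / bw_gain + alu_fraction / alu_gain)

# Tahiti vs Cayman big-ticket figures quoted in the thread:
print(f"mixed load:     {speedup(1.50, 1.33):.2f}x")       # ~1.41x
print(f"pure BW-bound:  {speedup(1.50, 1.33, 1.0):.2f}x")  # full 1.50x
print(f"pure ALU-bound: {speedup(1.50, 1.33, 0.0):.2f}x")  # only 1.33x
```

Under that assumed even split, the quoted +50% bandwidth and +33% execution units land at roughly +41%, right around the 35-40% gains observed in games.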
 
We've already seen examples of massive improvements in compute workloads that far exceed the theoretical numbers. I don't share Dave's view that Tahiti's current scaling in games is that great, though. A 35-40% improvement from a 40-50% bump in theoreticals on a brand-new architecture isn't something to brag about. Geometry got an even bigger boost. GCN could mature to 50-60% faster than Cayman on average.
 
Geometry got an even bigger boost. GCN could mature to 50-60% faster than Cayman on average.
I have a strong suspicion that all those Heaven benchmarks are somewhat skewed by the driver's tessellation profiling. I saw one test with Nvidia's Island demo, and the numbers were much less optimistic. Does Catalyst already ship with TS factor profiles, anyway?
 
With AMD's current GPU pricing strategy, I dread to think what the 7990 is going to cost in the UK. I'm already expecting the 7970 to be around £450-500, which would put the 7990 at around £800-1000. Stupid pricing.

I guess the only hope is that Nvidia drop the price on the GTX 580 and force AMD to do the same.
 
Trini - from a "big ticket" point of view, only bandwidth increases by 50%. CUs are up 33%, the ROP count is unchanged, the vertex engine count is unchanged, and the clock delta is small.

 
I have a strong suspicion that all those Heaven benchmarks are somewhat skewed by the driver's tessellation profiling. I saw one test with Nvidia's Island demo, and the numbers were much less optimistic. Does Catalyst already ship with TS factor profiles, anyway?

When I last ran Heaven (2.5), Tahiti was faster than GF110 - and we had the tessellation switch moved away from "AMD optimized" (I actually think "application decides" and "64x max" are about the same - you get what the app requests).
 
For the overclockers out there, this was with stock voltage to the core. Looking forward to the next Afterburner controlling that CHiL regulator. At 1.2 GHz it's just under a 590/6990.

Mmm, with my sample I could also reach quite high clocks... but it would be stable only in very high power-consumption cases such as 3DMark and Furmark, which means a reduced clock through PowerTune (even when set to +20%). To get stability in games where power consumption was not hitting the limit, I had to go back to 1075 MHz, which is still nice.
 