AMD: Southern Islands (7*** series) Speculation/Rumour Thread

It's like a modular design now.

Exactly

Yeah, I'm thinking the GCN high-end part's gaming performance will be around 6970/GTX 580 level. At most it'll be a Cypress-to-Cayman-sized increase. Sounds lame, but then the 20% Cayman has over Cypress is lame too. A 3870-to-4870-style increase would be AWESOME... but keep dreaming. They already said GCN isn't performance-centric & heads are rolling there....

I think they should take VLIW5 Cypress + a reworked scheduler, double or triple the ALU count, and put it on a Cayman-sized 28nm die. Like AMD could have done with a 32nm 1100T or a 32nm 965BE instead of BD.
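For a rough sanity check on whether that would even fit, here's a back-of-the-envelope sketch (Python; the die sizes are the public ~334 mm² Cypress and ~389 mm² Cayman figures, and the ideal 40nm-to-28nm area scaling assumed here is optimistic, so treat it as a best case rather than a prediction):

```python
# Back-of-the-envelope check: could a doubled-ALU Cypress fit a Cayman-sized
# 28 nm die? Assumes ideal (node^2) area scaling and that doubling the ALUs
# roughly doubles the whole die -- both simplifications.

CYPRESS_AREA_40NM = 334.0   # mm^2, HD 5870 die
CAYMAN_AREA_40NM = 389.0    # mm^2, HD 6970 die

# Ideal density gain going from 40 nm to 28 nm
density_gain = (40.0 / 28.0) ** 2            # ~2.04x

doubled_cypress_40nm = CYPRESS_AREA_40NM * 2  # crude upper bound at 40 nm
doubled_cypress_28nm = doubled_cypress_40nm / density_gain

print(f"Ideal 40nm->28nm density gain:   {density_gain:.2f}x")
print(f"Doubled Cypress, shrunk to 28nm: ~{doubled_cypress_28nm:.0f} mm^2")
print(f"Cayman-sized budget:             ~{CAYMAN_AREA_40NM:.0f} mm^2")
```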
 

:unsure::unsure:

Why are you bringing a backported GPU into the discussion of a shrink and a new architecture?
 
@jaredpace:

I would actually bet that the high-end Tahiti is a solid (at least) 50% faster than Cayman on average. Otherwise they would have screwed up quite badly.
And that 50% is more a question of how well the new front end works and whether the TMUs/ROPs/memory controller can keep up with the ALUs. From all we know about GCN, the real-world shader performance should easily be quite a bit above Cayman for all intents and purposes. And nVidia shows today that one can trump the VLIW architectures with just about half of the theoretical peak.
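To put rough numbers on that last point, a quick sketch of the theoretical single-precision peaks at reference clocks (a simplification, of course; actual game performance depends on much more than peak FLOPS):

```python
# Rough single-precision peak-FLOPS comparison behind the "half the
# theoretical peak" remark (reference clocks, 2 FLOPs/ALU/clock for FMA/MAD).

def peak_tflops(alus, clock_mhz, flops_per_clock=2):
    return alus * clock_mhz * 1e6 * flops_per_clock / 1e12

gtx580 = peak_tflops(512, 1544)    # GF110, 1544 MHz shader (hot) clock
hd6970 = peak_tflops(1536, 880)    # Cayman VLIW4, 880 MHz

print(f"GTX 580:  {gtx580:.2f} TFLOPS")
print(f"HD 6970:  {hd6970:.2f} TFLOPS")
print(f"Ratio:    {gtx580 / hd6970:.0%}")   # ~58%, yet similar game performance
```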
 

I don't know, I mean Cypress was "only" 40~50% faster than RV790, and that's in spite of a pretty significant increase in die size on a full node shrink. I think 40% would be quite good; unless Tahiti is 400mm², of course.
 
I don't know, I mean Cypress was "only" 40~50% faster than RV790
Make it 50-60%. ;)
And that's the average. The extremes were of course above that. See this for instance:

[Image: 3DMark Vantage Extreme comparison chart]


And this was achieved with basically the same memory interface and the same L2 bandwidth (i.e. bandwidth per CU was halved).
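A rough sketch of that bandwidth-versus-compute point, using the reference specs (the effective memory clock did go up from 3.9 to 4.8 Gbps, so it is not an exact halving per unit of shader throughput, but it is close):

```python
# Bandwidth-vs-compute sketch for the HD 4890 -> HD 5870 step (reference
# specs). ALU throughput doubled while the 256-bit bus only gained a faster
# memory clock, so bandwidth per unit of shader throughput dropped sharply.

def specs(alus, core_mhz, bus_bits, mem_mtps_effective):
    tflops = alus * core_mhz * 1e6 * 2 / 1e12
    gbps = bus_bits / 8 * mem_mtps_effective * 1e6 / 1e9
    return tflops, gbps

for name, (tflops, gbps) in {
    "HD 4890": specs(800, 850, 256, 3900),   # RV790, 975 MHz GDDR5
    "HD 5870": specs(1600, 850, 256, 4800),  # Cypress, 1200 MHz GDDR5
}.items():
    print(f"{name}: {tflops:.2f} TFLOPS, {gbps:.0f} GB/s, "
          f"{gbps / tflops:.0f} GB/s per TFLOP")
```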
 
I think the main problem of the current architecture is scaling. Gaming performance scales quite badly with the number of SIMDs. Even Juniper with just 8 SIMDs enabled performs almost identically to the fully active model (at the same clocks, of course). It seems that Barts with two blocks of 7 SIMDs was the sweet spot (maybe even two blocks of 8 SIMDs would be fine). But any configuration exceeding this limit offers a significantly worse performance/transistor ratio.

I believe this is the main reason why AMD decided to use GCN for the high-end parts in particular. Even if individual CU blocks of the GCN architecture aren't as effective as the current SIMDs, proper scaling should provide a performance/transistor ratio (at least) comparable to the current high-end. It is also possible that the current VLIW-5/VLIW-4 architecture will remain more efficient than GCN for low-end (or lower mainstream) parts.
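As a purely illustrative toy model of that scaling argument (the 35% front-end share below is an arbitrary made-up number, not a measurement), splitting a frame into a part that scales with SIMD count and a part that does not shows the diminishing returns quite clearly:

```python
# Toy illustration (made-up numbers, not measurements) of why adding SIMDs
# behind a fixed front end gives diminishing returns: treat a frame as part
# shader work that scales with SIMD count and part setup/front-end work
# that does not (an Amdahl's-law-style split).

FRONTEND_SHARE = 0.35   # hypothetical fraction of frame time not helped by more SIMDs

def relative_fps(simds, baseline_simds=8):
    scale = simds / baseline_simds
    frame_time = FRONTEND_SHARE + (1 - FRONTEND_SHARE) / scale
    return 1.0 / frame_time

for simds in (8, 10, 14, 20, 24):
    print(f"{simds:2d} SIMDs: {relative_fps(simds):.2f}x the 8-SIMD baseline")
```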
 

A lot of that is dependent on workload. With so few compute workloads in mainstream use, VLIW5 is going to be close to VLIW4 in performance terms at the same 'sizing'. When the workload changes, the more compute-oriented architectures will show more benefit.

I don't think we've hit the inflection point for 'vector'/GPGPU compute; the workloads are lagging the hardware capabilities quite a bit for mainstream and even off-the-shelf niche use cases.
 
Not generally, not in real games at the time of release, IIRC.
You have to look for non-CPU-limited scenarios of course (it doesn't make much sense otherwise, right?), for instance higher resolutions. Then the linked hardware.fr comparison also shows an average advantage north of 50% for the HD5870 compared to the HD4890.
 
What about the fact that we continue to get D3D9 games that are ever more complex? D3D10/11 have various features to improve GPU utilization. I imagine GCN will be stuck running its share of D3D9 games.
 
You have to look for non-CPU-limited scenarios of course (it doesn't make much sense otherwise, right?), for instance higher resolutions. Then the linked hardware.fr comparison also shows an average advantage north of 50% for the HD5870 compared to the HD4890.

As long as you benchmark real-world games, you'll end up with a portion of your test being system-limited almost every time. And if the driver or front end is inefficient, so that you can only make your architecture shine at ultra-high resolutions, then that's an aspect that should be part of an evaluation as long as only few people have access to such displays.

Nice example: 8x AA - is that primarily upping the demand on the graphics processor and video RAM at high resolution? Or is another system limitation starting to creep in, namely a lack of VRAM?

IMHO you need to look at the whole picture to not end up cherry-picking.
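For a feel of how much 8x MSAA alone can cost at 2560x1600, a crude upper-bound sketch (it ignores the colour/Z compression real GPUs apply, so actual usage is lower, but the order of magnitude is the point):

```python
# Rough upper bound on the render-target footprint of 8x MSAA at 2560x1600
# (32-bit colour + 32-bit depth/stencil per sample, no framebuffer
# compression assumed).

W, H, SAMPLES, BPP = 2560, 1600, 8, 4

color_ms = W * H * SAMPLES * BPP       # multisampled colour buffer
depth_ms = W * H * SAMPLES * BPP       # multisampled depth/stencil
resolved = W * H * BPP * 2             # resolved front + back buffer

total_mb = (color_ms + depth_ms + resolved) / 2**20
print(f"~{total_mb:.0f} MB just for MSAA render targets")   # roughly 281 MB
```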
 
That's all true. But I think if you want to see the scaling of a certain architecture, you should minimize the effect of the CPU (by using a very fast one) and of the VRAM size, for instance. The hardware.fr test (and most other articles) compare the HD4890 and HD5870 both with 1 GB. Of course, if you run them at 2560x1600 with 8xAA, the VRAM size will limit in a lot of cases. But that tells you nothing about the scaling of the architecture. If you want to look at that, you have to find the right scenarios (neither CPU-limited nor VRAM-size-limited; other limits of the architecture like the front end or bandwidth are completely okay, those are the ones you are actually looking for). And that is not cherry-picking, it's choosing the right tool for the job. ;)
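In sketch form, that selection process is basically a filter; the scenario records and the 1 GB budget below are just hypothetical placeholders to show the idea:

```python
# Sketch of the scenario-selection idea: keep only benchmark data points that
# are neither CPU-limited nor VRAM-limited before judging architectural
# scaling. The records and thresholds here are hypothetical placeholders.

scenarios = [
    {"name": "1680x1050 noAA", "cpu_limited": True,  "vram_mb": 600},
    {"name": "1920x1200 4xAA", "cpu_limited": False, "vram_mb": 800},
    {"name": "2560x1600 8xAA", "cpu_limited": False, "vram_mb": 1100},
]

VRAM_BUDGET_MB = 1024   # both cards in the comparison carry 1 GB

usable = [s for s in scenarios
          if not s["cpu_limited"] and s["vram_mb"] <= VRAM_BUDGET_MB]

for s in usable:
    print("keep:", s["name"])   # only the 1920x1200 4xAA point survives
```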
 
Actually, graphics cards themselves are tools for a job. Mostly running games, with all their various limitations and their different emphasis on a variety of aspects of an architecture. If your architecture of choice as a whole cannot scale well with a high number of SIMDs, but is able to hide that the more emphasis you put on pixel pushing, then both of these aspects are worth considering when evaluating the architecture. By limiting your scope to only one of those aspects, you will most certainly fail to see the other(s).

So, while I think that you of course should omit an outright CPU-bound case, you should not discard (bad) scaling for other reasons, as some reviews (and readers) do by focusing solely on very high resolutions, limiting themselves to only part of an architecture.
 
I guess that means we agree, as I said that GPU-architecture-related bottlenecks are the ones one should look for, and of course one has to evaluate them with somewhat modern real-world workloads. :D
The higher-resolution argument for the hardware.fr test was just one way to lower the CPU limitation; it was not meant to represent the complete picture, of course. At the lower resolutions you also find quite a few examples of the different limits one may encounter.
 