AMD: Volcanic Islands R1100/1200 (8***/9*** series) Speculation/ Rumour Thread

Bah, they won't dare to go 512-bit interface (again)... unless there's a strategy to overtake NV in the HPC/Workstation markets. 64xROPs is plausible, if the setup pipes are to double up (16 fragments x 4 pipes). AMD has been more consistent in keeping the setup : pixel ratio equal for the last few generations.
 
I wouldn't be surprised either way, but Dave B's statement that Pitcairn's memory controllers are twice as area efficient as Tahiti's seems like a massive hint towards 512-bit.

Bah, they won't dare to go 512-bit interface (again)

You make it sound like R600's problem was a 512-bit memory controller. There's nothing inherently wrong with a large memory controller though.
 
To guess what they'll do, it would be really important to know what they're aiming for. If it's about gaming, then they probably won't pay for 512-bit if it isn't needed; their competition currently keeps up with just 256-bit. I'd think there are places better suited for investment, to make better use of what they already have, and you can still extend the raw power by squeezing some more MHz out of the chips. I could even imagine they'll sacrifice some HPC parts (reduced cache sizes, a lower DP ratio, etc.).

I bet on DX11 polishing -> tessellation/rasterization, geometry shader and probably some compute improvements (e.g. faster atomics).
 
I guess I should have asked here, since asking on another forum got one of the few B3Ders to respond.

In Anand's review of Tahiti they mentioned specifically that Cayman's ROPs were bandwidth starved and that for Tahiti they didn't really need to do anything but increase bandwidth to get the performance they wanted out of them.

Which leads me to believe, if they didn't oversupply Tahiti with bandwidth, that they will only go to 40ROPs + 7Gbps if they don't move to 512bit.

Why does it seem that AMD's ROP config is so much more dependent on bandwidth than Nvidia's, specifically GTX680 vs 7970? They're roughly equal spec-wise, the 7970 has +40% bandwidth, yet they score roughly the same on 3DMark Vantage Pixel Fill. Is that synthetic just not accurate? Is there some inherent difference between their ROPs?
 
Why does it seem that AMD's ROP config is so much more dependent on bandwidth than Nvidia's, specifically GTX680 vs 7970? They're roughly equal spec-wise, the 7970 has +40% bandwidth, yet they score roughly the same on 3DMark Vantage Pixel Fill. Is that synthetic just not accurate? Is there some inherent difference between their ROPs?
This particular sub-test is using alpha blending and writes the result in a R16G16B16A16 render target. There could be several limiting key factors, depending on the architecture, though ALU throughput is not an issue here.
 
Well, I guess there are advantages and disadvantages to a low-clocked 512-bit vs. a high-clocked 384-bit interface. 384-bit at 1.7 GHz (i.e. 7 Gbps chips slightly underclocked) would have just about the same bandwidth as 512-bit at 1.25 GHz (5 Gbps), and the area needed for the memory controller might be similar too (hard to say; it seems safe to assume part of why the MC is so much bigger in Tahiti compared to Pitcairn is the higher supported clock, but it's quite possible it could be a bit more optimized and still support higher clocks).
- power consumption:
I suspect the 512-bit version would be way better there. At least the pro cards could use 1.35V chips, whereas AFAIK 7 Gbps chips are still only sold as factory-overclocked versions (that is, 1.55V or 1.6V instead of the nominal 1.5V). Obviously there are more chips with a 512-bit interface, but the lower voltage should more than make up for this.
- memory size:
I guess 512-bit would have a "problem" of sorts, since even the lowest-end configuration cannot be less than 4GB (with 2Gb chips). That is, unless there exist versions with disabled memory channels.
OTOH the pro version could easily go up to 16GB (instead of 12) with 4Gb chips in clamshell mode, and more memory is more than just "nice to have" for some HPC applications.
- cost:
512-bit obviously needs more pins and, at least at first sight, a more complex PCB, but OTOH timing is less critical, so it might not really make much of a difference in the end.
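For what it's worth, the bandwidth and capacity arithmetic behind this comparison can be sanity-checked in a few lines (a sketch using the clocks and chip densities assumed in this post, not confirmed specs):

```python
# GDDR5 transfers 4 bits per pin per command-clock cycle,
# so e.g. a 1.25 GHz command clock = 5 Gbps per pin.
def bandwidth_gbs(bus_width_bits, data_rate_gbps):
    """Peak bandwidth of a GDDR5 interface in GB/s."""
    return bus_width_bits / 8 * data_rate_gbps

print(bandwidth_gbs(384, 6.8))  # 384-bit @ 1.7 GHz -> ~326 GB/s
print(bandwidth_gbs(512, 5.0))  # 512-bit @ 1.25 GHz -> 320 GB/s

# Capacity: one chip per 32-bit channel, two per channel in clamshell mode.
def capacity_gb(bus_width_bits, chip_gbit, clamshell=False):
    chips = bus_width_bits // 32 * (2 if clamshell else 1)
    return chips * chip_gbit / 8

print(capacity_gb(512, 2))                  # smallest 512-bit config: 4.0 GB
print(capacity_gb(512, 4, clamshell=True))  # pro clamshell config: 16.0 GB
```

So the two interfaces really do land within a couple of percent of each other in bandwidth, and the 4GB floor / 16GB ceiling follow directly from the channel count.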
 
Bah, they won't dare to go 512-bit interface (again)... unless there's a strategy to overtake NV in the HPC/Workstation markets. 64xROPs is plausible, if the setup pipes are to double up (16 fragments x 4 pipes). AMD has been more consistent in keeping the setup : pixel ratio equal for the last few generations.
What about Bonaire? Dual setup/rasterizer like Pitcairn and Tahiti, but just 16 ROPs like Cape Verde.
This particular sub-test is using alpha blending and writes the result in a R16G16B16A16 render target. There could be several limiting key factors, depending on the architecture, though ALU throughput is not an issue here.
Kepler somehow manages to exceed the fillrate allowed by the external bandwidth for blending and 4xfp16 targets. Must be a caching phenomenon.
 
In Anand's review of Tahiti they mentioned specifically that Cayman's ROPs were bandwidth starved and that for Tahiti they didn't really need to do anything but increase bandwidth to get the performance they wanted out of them.
They actually say the opposite: in Cayman the peak ROP throughput could not be reached, so increasing their count would not be that useful; the solution was to increase their efficiency. AMD did that by decoupling the ROPs from the memory controllers. So they kept the ROP count but increased the efficiency (and the bandwidth).
"With Tahiti AMD would need to improve their ROP throughput one way or another to keep pace with future games, but because of the low efficiency of their existing ROPs they didn’t need to add any more ROP hardware, they merely needed to improve the efficiency of what they already had."

Why does it seem that AMD's ROP config is so much more dependent on bandwidth than Nvidia's, specifically GTX680 vs 7970? They're roughly equal spec-wise, the 7970 has +40% bandwidth, yet they score roughly the same on 3DMark Vantage Pixel Fill. Is that synthetic just not accurate? Is there some inherent difference between their ROPs?
So you're saying AMD's ROPs are bandwidth limited, yet 40% more bandwidth doesn't result in 40% better fillrate? Sounds like you cancel out your own argument ;)

You can't really compare ROPs 1:1, but I'd think in some artificial benchmark that goes for the simplest and fastest rendering they are quite the same, and then those 32 ROPs of the 7970@1GHz match pretty much those of the GTX680@1GHz (the 7970 is slightly faster, and I'd blame that on bandwidth, indeed).
 
They actually say the opposite: in Cayman the peak ROP throughput could not be reached, so increasing their count would not be that useful; the solution was to increase their efficiency. AMD did that by decoupling the ROPs from the memory controllers. So they kept the ROP count but increased the efficiency (and the bandwidth).
"With Tahiti AMD would need to improve their ROP throughput one way or another to keep pace with future games, but because of the low efficiency of their existing ROPs they didn’t need to add any more ROP hardware, they merely needed to improve the efficiency of what they already had."
That's what LordEC911 said. ;)
 
That's what LordEC911 said. ;)
He said they didn't need to do anything but increase the bandwidth, while the article says they didn't need to do anything other than increase the efficiency of the bandwidth usage. (I know that sounds picky, but with regard to why the GTX680 and the 7970 perform the same, it might be important to note that both might be purely ROP limited.)


Seems NVidia did something similar with the GTX680 vs GTX580: same bandwidth, slightly lower theoretical ROP throughput, yet about 30% more efficient (at least those are the results).

Well, something that might prove him right and me wrong is that the 7870 also has 32 ROPs@1GHz but performs 30% worse in the fillrate benchmark (though that might also be due to smaller caches or something).

Someone would need to run the benchmark while under- and overclocking their 7970 memory by ±30%.
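If the ROPs really are bandwidth limited, the expected outcome of such a clocking experiment follows from a simple min() model of the two limits (a sketch; the 32 ROPs @ 1 GHz, 288 GB/s and 8 bytes/pixel figures are illustrative assumptions, not measurements):

```python
def fillrate_gpix(rops, core_ghz, bw_gbs, bytes_per_pixel):
    """Predicted fillrate: the lesser of raw ROP throughput and what
    memory bandwidth sustains (blending reads and writes each pixel)."""
    rop_limit = rops * core_ghz          # GPixel/s
    bw_limit = bw_gbs / bytes_per_pixel  # GPixel/s
    return min(rop_limit, bw_limit)

# RGBA8 blending: 4B read + 4B write = 8 bytes of traffic per pixel
for scale in (0.7, 1.0, 1.3):
    print(scale, fillrate_gpix(32, 1.0, 288 * scale, 8))
```

With these numbers the -30% run drops into the bandwidth-limited regime (~25 GPixel/s) while the +30% run changes nothing, because the 32 GPixel/s ROP limit already binds; that asymmetry is exactly what the under/overclock test would reveal.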
 
Per 128b chunk the Pitcairn PHY is about half the size and Tahiti has 3 of them.

I recall this came up with Barts and Cayman, where stepping back from the bleeding edge brought some very significant area efficiencies. (1050 vs 1375)

Unlike Barts and Cayman, Pitcairn and Tahiti, at least initially, had much less return for the area investment per 128b chunk (1200 vs 1375 for 2x the area).
It wasn't until the GHz Edition, when Tahiti's memory clock went up to 1500, that more of a payoff was realized, although still not as much as with Barts vs Cayman.

Was there some expectation that the Tahiti controller was going to plug into an even higher GDDR5 speed? It seems some other chips don't need that much of an area cost to get the same data rate.
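Taking those numbers at face value (1200 MHz vs 1375 MHz, i.e. 4.8 vs 5.5 Gbps, with 2x the area per 128-bit chunk), the per-area payoff can be put into rough figures (a back-of-the-envelope sketch, not die measurements):

```python
def phy_bw_per_area(data_rate_gbps, rel_area):
    """Bandwidth of one 128-bit PHY chunk per (relative) unit of area."""
    return 128 / 8 * data_rate_gbps / rel_area

pitcairn = phy_bw_per_area(4.8, 1.0)  # 76.8 GB/s per area unit
tahiti = phy_bw_per_area(5.5, 2.0)    # 44.0 GB/s per area unit
print(pitcairn / tahiti)              # ~1.75x in Pitcairn's favour
```

That ~1.75x is at least roughly consistent with the "twice as area efficient" remark mentioned earlier in the thread.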

In Anand's review of Tahiti they mentioned specifically that Cayman's ROPs were bandwidth starved and that for Tahiti they didn't really need to do anything but increase bandwidth to get the performance they wanted out of them.
There was a bandwidth increase, but also a change in how many memory controllers the ROPs could send data to, which made them much more flexible in utilizing the bandwidth the chip had.
 
Seems NVidia did something similar with the GTX680 vs GTX580: same bandwidth, slightly lower theoretical ROP throughput, yet about 30% more efficient (at least those are the results).
FP10 and FP16 RT pixel writes are full-rate on Kepler's ROPs, though with blending enabled the rate falls to Fermi levels (half).

Texture fill-rates and L1 and L2 bandwidth are all doubled over the previous generation.
Also, global atomic ops are an order of magnitude faster.
 
Or the test metric is probably botched anyway.
An alternative to a really botched test (hardware.fr data shows this too, btw.) would be that the setup/raster stuff works differently on nV GPUs. AFAIK fillrate tests basically draw a lot of screen-filling quads on top of each other. All that would be needed is for Kepler to keep 4 triangles in flight (i.e. two of the quads) while ensuring by some means that the ROPs carry out the writes to the render target in the right order. That way one would get some reuse of the data in the cache. One could test for this by using more than just two triangles to fill the screen.

edit:
FP10 and FP16 RT pixel writes are full-rate on Kepler's ROPs, though with blending enabled the rate falls to Fermi levels (half).

Texture fill-rates and L1 and L2 bandwidth are all doubled over the previous generation.
Also, global atomic ops are an order of magnitude faster.
But just have a look here:
http://www.hardware.fr/medias/photos_news/00/35/IMG0035598.gif

The GTX680 manages 33.3 GPixel/s with RGBA8 blending and 16 GPixel/s with 4xFP16 blending. This requires a bandwidth of 266 GB/s or 256 GB/s, which the GTX680 clearly doesn't have (192 GB/s).
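The required-bandwidth figures here just count the bytes a blend touches per pixel, i.e. a framebuffer read plus a write (a quick check of the arithmetic in this post):

```python
def blend_bandwidth_gbs(gpixels_per_s, bytes_per_pixel):
    """Bandwidth alpha blending needs at a given fillrate: every pixel
    is read from the render target and written back."""
    return gpixels_per_s * bytes_per_pixel * 2  # read + write

print(blend_bandwidth_gbs(33.3, 4))  # RGBA8:  ~266 GB/s
print(blend_bandwidth_gbs(16.0, 8))  # 4xFP16: 256 GB/s
```

Both figures sit well above the GTX680's 192 GB/s, which is the puzzle being discussed.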
 
But just have a look here:
http://www.hardware.fr/medias/photos_news/00/35/IMG0035598.gif

The GTX680 manages 33.3 GPixel/s with RGBA8 blending and 16 GPixel/s with 4xFP16 blending. This requires a bandwidth of 266 GB/s or 256 GB/s, which the GTX680 clearly doesn't have (192 GB/s).
I think this is a proprietary benchmark they are testing with, and it could be using an optimized data set to align/fit the screen tiles nicely into the cache? :???:
 
He said they didn't need to do anything but increase the bandwidth, while the article says they didn't need to do anything other than increase the efficiency of the bandwidth usage. (I know that sounds picky, but with regard to why the GTX680 and the 7970 perform the same, it might be important to note that both might be purely ROP limited.)

By increasing the efficiency of the ROPs through increased bandwidth, not by increasing the efficiency of the bandwidth usage.

Anandtech said:
As it turns out, there’s a very good reason that AMD went this route. ROP operations are extremely bandwidth intensive, so much so that even when pairing up ROPs with memory controllers, the ROPs are often still starved of memory bandwidth. With Cayman AMD was not able to reach their peak theoretical ROP throughput even in synthetic tests, never mind in real-world usage. With Tahiti AMD would need to improve their ROP throughput one way or another to keep pace with future games, but because of the low efficiency of their existing ROPs they didn’t need to add any more ROP hardware, they merely needed to improve the efficiency of what they already had.(ROPs)

The solution to that was rather counter-intuitive: decouple the ROPs from the memory controllers. By servicing the ROPs through a crossbar AMD can hold the number of ROPs constant at 32 while increasing the width of the memory bus by 50%. The end result is that the same number of ROPs perform better by having access to the additional bandwidth they need.
http://www.anandtech.com/show/5261/amd-radeon-hd-7970-review/4
 