AMD: Southern Islands (7*** series) Speculation/ Rumour Thread

hardware.fr has new presentation available:
http://www.hardware.fr/news/11656/afds-retour-futur-gpu-amd.html
It looks set to impress :)

mFa, do you take this as a hint that AMD got the deal for the next Xbox?
IMG0032715_1.jpg


"compliant with next APIs"
 
GZ007: nVidia will probably introduce a new marketing "theonlyimportantfeature" with their new product, so tessellation will move to the "dry cow" list, directly under HDR and PhysX. No one will care about it at that time...
 
Any indication what the cache latency and/or the minimum number of workgroups needed to cover it is?
Probably less than 10 wavefronts needed (this number per SIMD is mentioned several times, and is probably on the safe side).

I just wonder what the execution latency on the vector ALUs itself is. In the Evergreen and Cayman architecture manuals some dependent operations within a VLIW appear, indicating the pure execution latency is <=4 cycles. One of the slides comparing Cayman's VLIW with the new CU architecture mentions "interleaved wavefronts required" for VLIW (resulting in the familiar 8 cycles we know for the VLIWs) on one hand, while it lists "vector back-to-back wavefront instruction issue" as an advantage of the new architecture. Does that mean the cleaned-up register files (operand collection is much easier) enabled it to reduce the latency to 4 cycles, so it can issue dependent vector ops every four cycles on one SIMD?
That would really simplify the whole scheduling, as you don't have to track those dependencies (as Fermi needs to do with its 18 to roughly 40 cycles of latency and instruction issue every 2 cycles). Okay, the VLIW scheduling does almost nothing in the SIMD engines themselves, which is why it can be inefficient: changing clauses costs quite some time. And it would also fit perfectly with the description of the instruction arbitration part. I would like it. But I don't know how feasible it is when looking at the latencies of nvidia GPUs. On the other hand, compared with CPUs, Fermi has a factor of 10 higher latency for floating point instructions (DP may be worse, integer is just ridiculous), and Cayman is not that much better. Reducing this distance a bit may be possible even considering the die size and power budget of an ALU in a GPU. The CPU guys at AMD should have plenty of experience designing fast register files, operand and result networks and such, shouldn't they? Because as said, the Cayman ALUs probably have just 4 cycles of latency (maybe except for FMA and double precision stuff).
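Treating those numbers as pure assumptions (4-cycle ALU latency with issue every 4 cycles on the proposed SIMDs, versus roughly 18-cycle latency and issue every 2 cycles on Fermi), the scheduling difference can be sketched like this:

```python
# Rough latency-hiding sketch. All figures are assumptions taken from
# the discussion above, not confirmed specifications.
def wavefronts_to_hide(alu_latency_cycles, issue_interval_cycles):
    """Minimum wavefronts a SIMD must keep in flight so that results
    are ready by the time a dependent instruction would issue."""
    # Ceiling division: latency divided by the issue interval.
    return -(-alu_latency_cycles // issue_interval_cycles)

# Proposed new CU SIMD: 4-cycle latency, one vector issue per 4 cycles.
print(wavefronts_to_hide(4, 4))   # -> 1, back-to-back dependent issue works
# Fermi-style: ~18-cycle latency, issue every 2 cycles.
print(wavefronts_to_hide(18, 2))  # -> 9 warps needed to cover the latency
```

If the first number really is 1, a single wavefront could issue dependent vector ops every four cycles with no dependency tracking at all, which is exactly the simplification described above.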
 
With multiple primitive pipes and a R/W L2, the new radeon will probably catch up with Fermi in tessellation too. :p

If it ends up better at it than Fermi, I'll bet NVIDIA will claim that tessellation is an unimportant gimmick… :D
 
Looks like the TS stage will still be tied to the setup pipeline like in the current architecture, and not distributed among the SIMD multiprocessors like in Fermi. But with (at least) four of those primitive pipes and the coherent L2, I think AMD can catch up with NV in heavy tessellation performance.
 
Don't you think the L1 bandwidth is a bit underwhelming? Fermi can fetch twice that, per SM.

By the way, Eric slipped an aggregate BW estimate of 1.5 TB/s for the cache. With a conservative estimate for the chip clock rate (~850 MHz), that would yield between 26 and 30 CUs for the flagship SKU.
 
Not per clock. A GF100 can also only fetch 64 bytes/clock in the best case, can't it?
Edit:
And AMD's new CU architecture has a separate path to the LDS with 128 bytes/clock of bandwidth. Fermi has to share the 64 bytes/clock between cache and local memory, afaik.

Let's say 800 MHz (which would fit the 4-cycle vector pipeline latency proposed above ;)), and that means there are 32 CUs, at least if that is 1.5 TiB/s. Otherwise 28 CUs at 850 MHz is also quite close. The number of CUs needs to be divisible by 4.
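The arithmetic behind these estimates, assuming 64 bytes/clock of L1 bandwidth per CU (a speculative figure from the slides, not a confirmed spec):

```python
# CU count estimate from the ~1.5 TB/s aggregate cache bandwidth figure,
# assuming 64 bytes/clock of L1 bandwidth per CU (speculative).
def cu_count(aggregate_bw_bytes_per_s, clock_hz, bytes_per_clock_per_cu=64):
    return aggregate_bw_bytes_per_s / (clock_hz * bytes_per_clock_per_cu)

print(cu_count(1.5e12, 850e6))       # ~27.6 -> 28 CUs if 1.5 TB/s at 850 MHz
print(cu_count(1.5 * 2**40, 800e6))  # ~32.2 -> 32 CUs if 1.5 TiB/s at 800 MHz
```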
 
Sure, but designing your hw assuming devs will use little LDS when the spec exposes 32K is a poor design choice, although it will work.
The poor choice is designing hardware assuming the maximum amount of memory is always needed.
 
Not per clock. A GF100 can also only fetch 64 bytes/clock in the best case, can't it?
Edit:
And AMD's new CU architecture has a separate path to the LDS with 128 bytes/clock of bandwidth. Fermi has to share the 64 bytes/clock between cache and local memory, afaik.

True, but a CU looks like it has about 4x the GF100 SM compute resources. So it had better have higher bandwidth to memory, or it will fall down on kernels with low arithmetic-to-mem-op ratios relative to GF100.
 
True, but a CU looks like it has about 4x the GF100 SM compute resources.
No, just twice per clock peak (64 fma vs. 32 fma per clock) vs. GF100/110 and only 33% more (64 vs. 48) vs. the SMs of the GF104 type.
 
Ah sorry. For some reason I was thinking a GF100 SM did 16 FMAs per clock - should have double checked. Still (GF104 excepted), it's not quite as big a difference as it looks at first.
 
True, but a CU looks like it has about 4x the GF100 SM compute resources. So it better have higher bandwidth to memory or it will fall down on kernels with low arithmetic to mem op ratios relative to GF100.

(Twice, as has already been pointed out.) And it should also be running at ~40% lower clocks.

2×0.6 = 1.2, or 20% more compute power per CU versus Fermi's SM.
 
Fermi is also 64 FMAs per core clock. The ALUs run at 2x core clock. A Fermi SM and a CU have the same throughput per core clock. Register file bandwidth per SIMD per clock is equivalent as well.
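That parity is easy to check with the numbers stated in this post (one FMA per ALU per ALU clock):

```python
# Peak FMAs per core clock: ALU count times the ALU-clock multiplier.
def fma_per_core_clock(alus, alu_clock_multiplier):
    return alus * alu_clock_multiplier

gf100_sm = fma_per_core_clock(32, 2)  # 32 hot-clocked ALUs per SM
gcn_cu   = fma_per_core_clock(64, 1)  # 64 ALUs at core clock per CU
print(gf100_sm, gcn_cu)  # 64 64 -- identical per core clock
```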

With respect to nVidia's response, there's no new API to shout about so they'll have to try harder.
 
Are you referring to me? Because that's not what I was saying. I was wondering how well the chip could handle a software rendering model.
If not... well, just ignore my post ;)

Unless you're Finnish, then no ;)
 
Yeah, it shares a lot with Fermi but takes it another step further. A CU resembles an SM and even has similar throughput (64 scalar ALUs @ core clock vs 32 scalar ALUs @ 2x core clock). Should be much easier to compare the two architectures going forward.

In what sense does it take it another step further?
 