"Any indication what the cache latency and/or the minimum number of workgroups needed to cover it is?"
Probably fewer than 10 wavefronts are needed (this number per SIMD is mentioned several times and is probably on the safe side).
With multiple primitive pipes and a R/W L2, the new Radeon will probably catch up with Fermi in tessellation too.
hardware.fr has a new presentation available:
http://www.hardware.fr/news/11656/afds-retour-futur-gpu-amd.html
It looks set to impress.
mFa, do you take this as a hint that AMD got the deal for the next Xbox:
"compliant with next APIs"
"Don't you think the L1 bandwidth is a bit underwhelming? Fermi can fetch twice that, per SM."
Not per clock. A GF100 can also only fetch 64 Bytes/clock in the best case, can't it?
By the way, Eric let slip an aggregate bandwidth estimate of 1.5 TB/s for the cache. With a conservative estimate for the chip clock rate (~850 MHz), that would yield between 26 and 30 CUs for the flagship SKU.
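The CU-count estimate can be reproduced with a quick back-of-the-envelope calculation. This is only a rough sketch: it assumes the 1.5 TB/s figure is the sum of per-CU L1 bandwidth and that each CU's L1 delivers 64 bytes/clock (the number discussed in this thread). The function name and the 800-900 MHz clock range are illustrative assumptions, not AMD figures.

```python
def estimate_cu_count(aggregate_bw_bytes_per_s, clock_hz, bytes_per_clock_per_cu=64):
    """Number of CUs implied by an aggregate cache bandwidth figure.

    Assumes (hypothetically) that the aggregate is simply the sum of
    per-CU L1 bandwidth, with every CU fetching bytes_per_clock_per_cu.
    """
    per_cu_bw = bytes_per_clock_per_cu * clock_hz  # bytes per second per CU
    return aggregate_bw_bytes_per_s / per_cu_bw


AGGREGATE_BW = 1.5e12  # 1.5 TB/s, the figure quoted in the post

# Sweep a few plausible core clocks (assumed range, not confirmed).
for clock_mhz in (800, 850, 900):
    cus = estimate_cu_count(AGGREGATE_BW, clock_mhz * 1e6)
    print(f"{clock_mhz} MHz -> ~{cus:.1f} CUs")
```

Under these assumptions, the estimate moves from about 30 CUs at the low end of the clock range to about 26 at the high end, which matches the range given in the post.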
"Sure, but designing your hw assuming devs will use little LDS when the spec exposes 32K is a poor design choice. Although it is something that will work."
The poor choice is designing hardware assuming the maximum amount of memory is always needed.
Not per clock. A GF100 can also only fetch 64 Bytes/clock in the best case, can't it?
Edit:
And AMD's new CU architecture has separate access to the LDS with 128 Bytes/clock bandwidth. Fermi has to share the 64 Bytes/clock between cache and local memory, afaik.
"True, but a CU looks like it has about 4x the GF100 SM compute resources."
No, just twice the per-clock peak (64 FMA vs. 32 FMA per clock) vs. GF100/110, and only 33% more (64 vs. 48) vs. the SMs of the GF104 type.
True, but a CU looks like it has about 4x the GF100 SM compute resources. So it had better have higher bandwidth to memory, or it will fall down on kernels with low arithmetic-to-memory-op ratios relative to GF100.
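To make the bandwidth-versus-compute argument concrete, here is a small sketch that divides per-clock fetch bandwidth by per-clock FMA throughput, using only the figures quoted in this thread (64 FMA and 64 Bytes/clock of L1 plus a separate 128 Bytes/clock LDS path for a CU; 32 FMA and a shared 64 Bytes/clock for a GF100 SM). It treats "per clock" as the same clock on both chips, which is a simplifying assumption, and the helper name is made up for illustration.

```python
def bytes_per_fma(fetch_bytes_per_clock, fma_per_clock):
    """Bytes that can be fetched per FMA issued, at peak rates."""
    return fetch_bytes_per_clock / fma_per_clock


# (fetch bytes/clock, FMA/clock), taken from the posts in this thread.
units = {
    "CU, L1 only": (64, 64),
    "CU, L1 + separate LDS": (64 + 128, 64),
    "GF100 SM, shared L1/local memory": (64, 32),
}

for name, (bandwidth, fma) in units.items():
    print(f"{name}: {bytes_per_fma(bandwidth, fma):.2f} bytes per FMA")
```

Which row is the fair comparison depends on how much of a kernel's traffic goes through the LDS rather than the L1, so this is a way to frame the question rather than an answer to it.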
Are you referring to me? Because that's not what I was saying. I wonder how well the chip could handle a software rendering model.
If not... well, just ignore my post.
Yeah, it shares a lot with Fermi but takes it a step further. A CU resembles an SM and even has similar throughput (64 scalar ALUs @ core clock vs. 32 scalar ALUs @ 2x core clock). It should be much easier to compare the two architectures going forward.