AMD: Southern Islands (7*** series) Speculation/Rumour Thread

Fermi also does 64 FMAs per core clock; the ALUs run at 2x the core clock. A Fermi SM and a CU have the same throughput per core clock. Register file bandwidth per SIMD per clock is equivalent as well.
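To put rough numbers on that (a back-of-the-envelope sketch; the 32-lane SM at 2x hot clock and the 4x16-wide CU layout are my assumptions, based on a GF100-style SM and the rumoured CU organisation):

```
# FMA throughput per base ("core") clock, under the assumptions above.
fermi_sm_lanes = 32          # assumed CUDA cores per GF100-style SM
fermi_hot_clock_ratio = 2    # ALU (hot) clock = 2x base clock
gcn_simds_per_cu = 4         # assumed SIMDs per CU
gcn_lanes_per_simd = 16      # assumed lanes per SIMD

fermi_fma_per_core_clock = fermi_sm_lanes * fermi_hot_clock_ratio   # 64
cu_fma_per_clock = gcn_simds_per_cu * gcn_lanes_per_simd            # 64
print(fermi_fma_per_core_clock, cu_fma_per_clock)                   # 64 64
```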
A CU is expected to be much denser than an SM.

A big feature that is being overlooked is coherency. It seems AMD will be offering full coherency while NV only offers a semi-coherent L1.
With respect to nVidia's response, there's no new API to shout about, so they'll have to try harder.
There will be a DX11.1 or 12.
 
A CU is expected to be much denser than an SM.

Not sure how that relates to what I stated.

A big feature that is being overlooked is coherency. It seems AMD will be offering full coherency while NV only offers a semi-coherent L1.

Yep, and it looks like they have an option to disable full coherency too: the GLC bit.

There will be a DX11.1 or 12.

Oh, I haven't heard a peep about either.
 
So far:

Modularity - graphics and compute hardware are decoupled.
Parallelism - concurrent kernels from multiple application contexts.

I think the compute core and the fixed-function hardware were decoupled earlier as well. What has changed in this context?

Cayman can do 2 kernels simultaneously.
 
A CU can now do 4 threads from 4 different apps as well. I may not have grasped the idea fully, or correctly, but does anyone else see a point in bunching different threads into a CU? IMHO, MIMD over CUs would have been better than MIMD within a CU.

For a start, you can kiss your L1 goodbye, as each will get just 4K on average?
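Rough arithmetic behind that 4K remark, a sketch assuming a 16 KB L1 per CU (my assumption) split evenly across 4 resident kernels:

```
# L1 capacity per kernel if 4 kernels share one CU's L1 evenly.
l1_per_cu_kb = 16        # assumed L1 size per CU
resident_kernels = 4
print(l1_per_cu_kb / resident_kernels)   # 4.0 KB per kernel on average
```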
 
A CU can now do 4 threads from 4 different apps as well. I may not have grasped the idea fully, or correctly, but does anyone else see a point in bunching different threads into a CU? IMHO, MIMD over CUs would have been better than MIMD within a CU.

For a start, you can kiss your L1 goodbye, as each will get just 4K on average?
The cost of giving up VLIW ... and such a nasty cost it is.
 
From B3D's Fermi article:
That is what I meant.
We (well, at least I) referred to an SM with a peak rate of 32 fma/clock. That clock is then of course the hot clock.

Fermi-SM: 64 bytes/hot clock aggregate bandwidth, 32 fma/hot clock (or 48 fma/hot clock)
AMD's CU: 64 bytes/clock from cache/memory, 128 bytes/clock from LDS, 64 fma/clock
 
A CU can now do 4 threads from 4 different apps as well. I may not have grasped the idea fully, or correctly, but does anyone else see a point in bunching different threads into a CU? IMHO, MIMD over CUs would have been better than MIMD within a CU.

For a start, you can kiss your L1 goodbye, as each will get just 4K on average?
Just because it can be done doesn't mean it needs to be, or even will be, done at all as default behaviour.
 
That is what I meant.
Fermi-SM: 64 bytes/hot clock aggregate bandwidth, 32 fma/hot clock (or 48 fma/hot clock)
AMD's CU: 64 bytes/clock from cache/memory, 128 bytes/clock from LDS, 64 fma/clock
The L1/LDS combo in Fermi runs at the base clock.
 
The L1/LDS combo in Fermi runs at the base clock.
That does not matter for the flops/bandwidth question. One just has to use the same clock for flops/cycle and bandwidth/cycle so that the division makes sense. If you take the base clock, the Fermi-SM would do 64 fma/cycle; the ratio does not change.
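A quick sanity check of that clock-normalisation point, using the Fermi-SM figures quoted above:

```
# Fermi SM bandwidth-to-flops ratio, expressed at either clock.
fma_hot, bw_hot = 32, 64                       # per hot clock
fma_base, bw_base = fma_hot * 2, bw_hot * 2    # per base clock (base clock = hot clock / 2)

print(bw_hot / fma_hot)     # 2.0 bytes per fma
print(bw_base / fma_base)   # 2.0 bytes per fma, same ratio either way
```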
 
So can someone explain what all this means for someone who is humbly ignorant of what has transpired recently? :mrgreen:
 
I think the compute core and the fixed-function hardware were decoupled earlier as well. What has changed in this context?

Cayman can do 2 kernels simultaneously.

What's changed is that they've caught up to Fermi in several important areas, and those advantages actually matter now (and have been further generalized). Also, texture units were coupled to a SIMD in Cayman; that's no longer the case.

That does not matter for the flops/bandwidth question. One just has to use the same clock for flops/cycle and bandwidth/cycle so that the division makes sense. If you take the base clock, the Fermi-SM would do 64 fma/cycle; the ratio does not change.

Yes, the ratio doesn't change, but normalizing to the base clock makes the comparison to AMD's architecture easier. Scheduling, instruction issue, operand fetch, and caches all run at the base clock on Fermi, and halving the throughput of those processes based on the ALU clock is just confusing/misleading.

flops/clock and bandwidth/clock comparisons are easier when you're talking about similar clock speeds, no? :)
 
Just because it can be done doesn't mean it needs to be, or even will be, done at all as default behaviour.

I can't see any reason why they would crow about it if they weren't planning to implement it, and I can't see why they would implement it if they didn't want to use it quite a bit.
 
Also, texture units were coupled to a SIMD in Cayman; that's no longer the case.
The texture units (still a quad TMU, probably modified) are an integral part of the L1 cache block, as can be seen here:

[image: img0032683_1rjgu.jpg]


It was also mentioned explicitly in the Fusion System Architecture talk that the usual filtering hardware is in there.
flops/clock and bandwidth/clock comparisons are easier when you're talking about similar clock speeds, no? :)
So the comparison at Fermi's base clock:

nvidia SM: 128 bytes/cycle aggregate bandwidth, 64 fma/cycle (96 for a GF104-type SM)
AMD CU: 64 bytes/cycle cache/memory + 128 bytes/cycle LDS bandwidth, 64 fma/cycle
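The same figures expressed as bytes of on-chip bandwidth per fma at the base clock (just restating the numbers above; I keep the two AMD paths separate since, unlike Fermi's unified L1/shared-memory array, they are distinct):

```
# Bytes of bandwidth per fma at the base clock, from the figures above.
nv_fma, nv_bw = 64, 128                          # GF100-type SM, aggregate L1/shared bandwidth
amd_fma, amd_cache_bw, amd_lds_bw = 64, 64, 128  # CU: cache/memory path and LDS path

print(nv_bw / nv_fma)           # 2.0 bytes/fma aggregate
print(amd_cache_bw / amd_fma)   # 1.0 bytes/fma from cache/memory
print(amd_lds_bw / amd_fma)     # 2.0 bytes/fma from the LDS
```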
 
What's changed is that they've caught up to Fermi in several important areas, and those advantages actually matter now (and have been further generalized). Also, texture units were coupled to a SIMD in Cayman; that's no longer the case.

I don't remember TMU organization being discussed. My guess is that TMUs are still coupled to a CU.
 
The texture units (still a quad TMU, probably modified) are an integral part of the L1 cache block as can be seen here:

It was also mentioned explicitly in the talk about the fusion system architecture that there is the usual filtering hardware in there.

I wasn't saying they aren't there. Just that the SIMDs can be scaled independently of TMUs now. See my earlier posts in this thread regarding AMD's ALU:TEX ratio.

So the comparison at Fermi's base clock:

nvidia SM: 128 bytes/cycle aggregate bandwidth, 64 fma/cycle (96 for a GF104-type SM)
AMD CU: 64 bytes/cycle cache/memory + 128 bytes/cycle LDS bandwidth, 64 fma/cycle

Yeah.

I don't remember TMU organization being discussed. My guess is that TMUs are still coupled to a CU.

Yeah to a CU, not to a SIMD.
 