The poor choice is designing hardware assuming the maximum amount of memory is always needed.
Could you explain that assertion?
> A CU is expected to be much denser than an SM.
Fermi is also 64 FMAs per core clock. ALUs run at 2x the core clock. A Fermi SM and a CU have the same throughput per core clock. Register file bandwidth per SIMD per clock is equivalent as well.
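To make the parity claim concrete, here is a minimal sketch, assuming the usual 32-core Fermi SM with ALUs at 2x the core clock; the 4 SIMD x 16 lane CU layout is an assumption consistent with the 64 fma/clock figure cited later in the thread:

```python
# Sketch: FMA throughput per *core* (base) clock, Fermi SM vs. GCN CU.
# Assumes 32 CUDA cores per GF100-class SM running at the hot clock
# (2x core clock), and 4 SIMDs x 16 lanes per CU (assumed layout).

fermi_cores_per_sm = 32
hot_clock_multiplier = 2                  # ALUs run at 2x the core clock
fermi_fma_per_core_clock = fermi_cores_per_sm * hot_clock_multiplier  # 64

cu_simds = 4                              # assumed CU layout
cu_lanes_per_simd = 16
cu_fma_per_clock = cu_simds * cu_lanes_per_simd                       # 64

print(fermi_fma_per_core_clock, cu_fma_per_clock)  # 64 64 -> equal per core clock
```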
> There will be a DX11.1 or 12.
With respect to nVidia's response, there's no new API to shout about, so they'll have to try harder.
In what sense does it take it another step further?
A big feature that is being overlooked is coherency. It seems AMD will be offering full coherency, while NV only offers a semi-coherent L1.
So far:
Modularity - graphics and compute hardware are decoupled.
Parallelism - concurrent kernels from multiple application contexts.
> I'd expect noise about that ~12 months before xb720.
Oh, I haven't heard a peep about either.
From the B3D's Fermi article:
How is that supposed to work with just 16 L/S units?
128.
16 * 4 = 64, i.e. 16 LSUs at 4 bytes each.
Unless you are not referring to the clock those units run at (hot clock), of course.
The 16 LSUs run at hot-clock too, with all address generation happening here.
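A minimal sketch of the arithmetic being argued over, assuming each LSU services one 32-bit (4-byte) access per hot clock and that the hot clock runs at 2x the base clock:

```python
# Sketch: how 16 L/S units can still deliver 128 bytes per base clock.
# Assumes one 32-bit (4-byte) access per LSU per hot clock,
# with the hot clock at 2x the base clock.

lsu_count = 16
bytes_per_access = 4
hot_per_base = 2

bytes_per_hot_clock = lsu_count * bytes_per_access         # 16 * 4 = 64
bytes_per_base_clock = bytes_per_hot_clock * hot_per_base  # 128

print(bytes_per_hot_clock, bytes_per_base_clock)  # 64 128
```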
> The cost of giving up VLIW ... and such a nasty cost it is.
A CU can now do 4 threads from 4 different apps as well. I may not have grasped the idea fully, or correctly, but does anyone else see a point in bunching different threads into a CU? IMHO, MIMD over CUs would have been better than MIMD within a CU.
For a start, you can kiss your L1 goodbye, as each will get just 4K on average?
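The arithmetic behind that worry, as a minimal sketch; the 16 KB L1 per CU is an assumption implied by the "4K on average" figure:

```python
# Sketch: effective L1 per context if a CU's cache is split evenly
# among 4 application contexts. The 16 KB L1 per CU is an assumption.

l1_per_cu_kb = 16
contexts = 4

print(l1_per_cu_kb / contexts, "KB per context on average")  # 4.0 KB
```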
That is what I meant.
> A CU can now do 4 threads from 4 different apps as well.
Just because it can be done doesn't mean it needs to be, or even will be, done at all as default behaviour.
The L1/LDS combo in Fermi runs at the base clock.
Fermi SM: 64 bytes/hot clock aggregate bandwidth, 32 fma/hot clock (or 48 fma/hot clock)
AMD's CU: 64 bytes/clock from cache/memory, 128 bytes/clock from LDS, 64 fma/clock
> The L1/LDS combo in Fermi runs at the base clock.
That does not matter for the flops/bandwidth question. One just has to use the same clock for flops/cycle and bandwidth/cycle so that the division makes sense. If you take the base clock, the Fermi SM would do 64 fma/cycle; the ratio does not change.
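A minimal sketch of that invariance, using the Fermi figures above and counting one FMA as 2 flops:

```python
# Sketch: the flops/byte ratio does not depend on which clock you
# normalize to, as long as flops and bytes use the *same* clock.
# Counts one FMA as 2 flops.

fma_hot, bytes_hot = 32, 64            # Fermi SM per hot clock
fma_base, bytes_base = 32 * 2, 64 * 2  # same SM per base clock (hot = 2x base)

ratio_hot = (fma_hot * 2) / bytes_hot
ratio_base = (fma_base * 2) / bytes_base

print(ratio_hot, ratio_base)  # 1.0 1.0 -> identical either way
```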
I think the compute cores and the fixed-function hardware were decoupled earlier as well. What has changed in this context?
Cayman can do 2 kernels simultaneously.
What's changed is that they've caught up to Fermi in several important areas, and those advantages actually matter now (and have been further generalized). Also, texture units were coupled to a SIMD in Cayman; that's no longer the case.
> Also, texture units were coupled to a SIMD in Cayman; that's no longer the case.
The texture units (still a quad TMU, probably modified) are an integral part of the L1 cache block, as can be seen here:
It was also mentioned explicitly in the talk about the Fusion System Architecture that the usual filtering hardware is in there.
> flops/clock and bandwidth/clock comparisons are easier when you're talking about similar clock speeds, no?
So the comparison at Fermi's base clock (ratios sketched after the list):
Nvidia SM: 128 bytes/cycle aggregate bandwidth, 64 fma/cycle (96 for a GF104-type SM)
AMD CU: 64 bytes/cycle cache/memory + 128 bytes/cycle LDS bandwidth, 64 fma/cycle
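Turning those figures into flops/byte, as a minimal sketch; one FMA is counted as 2 flops, and AMD's cache/memory and LDS paths are kept separate rather than summed:

```python
# Sketch: flops per byte at Fermi's base clock, from the numbers above.
# One FMA = 2 flops; AMD's two bandwidth paths are evaluated separately.

def flops_per_byte(fma_per_cycle, bytes_per_cycle):
    return (fma_per_cycle * 2) / bytes_per_cycle

print(flops_per_byte(64, 128))  # Fermi SM, aggregate: 1.0
print(flops_per_byte(64, 64))   # AMD CU vs. cache/memory: 2.0
print(flops_per_byte(64, 128))  # AMD CU vs. LDS: 1.0
```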
I don't remember the TMU organization being discussed. My guess is that TMUs are still coupled to a CU.