Compare & Contrast Architectures: Nvidia Fermi and AMD GCN

fellix · Dec 26, 2011

rpg.314 said:
I am curious about how the int32 multiplication is handled. Is it still slower than int24?

Judging from the 1/4 DP rate, most likely there isn't enough mantissa range for full-speed INT32 multiplication.

DarthShader · Dec 26, 2011

AlexV said:
Directed tests take away almost all of the burden from the software stack, since it gets fed simplistic and often optimised code, so it's far harder for it to flounder, hence why they end up more accurately following hardware differences.

Couldn't a similar argument be made for the hardware schedulers not really having the opportunity to show their mojo in simplistic code? (Unless the test addresses them specifically, of course.)

silent_guy · Dec 26, 2011

DarthShader said:
Couldn't a similar argument be made for the hardware schedulers not really having the opportunity to show their mojo in simplistic code? (Unless the test addresses them specifically, of course.)

It is usually pretty easy to get peak rates out of directed tests, so the situation you describe is unlikely. The real problem is creating complicated test cases that trigger secondary effects at the system level.
Typical examples are cache thrashing (the more caches there are, the harder it becomes) and SDRAM transaction scheduling.
When AMD moved from VLIW5 to VLIW4, they said that the former was a bit more architecturally efficient than the latter (for graphics workloads), but that the smaller size of the latter made it possible to compensate for the efficiency loss by putting more instances on the die. On average, that's the right decision, but it makes you vulnerable to cases where this doesn't work.

E.g. adding more parallel resources can make performance dramatically worse if it tips the cache into a thrashing mode.
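
As a rough illustration of that tipping point, here is a toy simulation of a shared LRU cache. It is only a sketch: the function name `hit_rate`, the round-robin access pattern, and all sizes are made up for the example and do not model any real GPU.

```python
from collections import OrderedDict

def hit_rate(workers, lines_per_worker, cache_lines, rounds=50):
    """Hit rate of one shared LRU cache when `workers` parallel units each
    loop over a private working set of `lines_per_worker` cache lines.
    All parameters are hypothetical; accesses are interleaved round-robin."""
    cache = OrderedDict()  # keys are (worker, line); order tracks recency
    hits = accesses = 0
    for _ in range(rounds):
        for line in range(lines_per_worker):
            for worker in range(workers):
                addr = (worker, line)
                accesses += 1
                if addr in cache:
                    hits += 1
                    cache.move_to_end(addr)        # mark as most recently used
                else:
                    cache[addr] = True
                    if len(cache) > cache_lines:
                        cache.popitem(last=False)  # evict least recently used
    return hits / accesses

# 4 workers x 16 lines fit exactly in a 64-line cache; 5 workers do not.
fits = hit_rate(workers=4, lines_per_worker=16, cache_lines=64)
thrashes = hit_rate(workers=5, lines_per_worker=16, cache_lines=64)
print(f"4 workers: {fits:.2f}, 5 workers: {thrashes:.2f}")
```

In this toy model, going from four to five workers pushes the combined working set past the cache capacity, and because the access pattern is cyclic, LRU evicts every line just before it is reused. Real workloads are rarely this pathological, but the cliff is the same kind of effect.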

(This is just a general observation, I'm not saying that this is specifically the case for GCN.)

Tridam · Dec 27, 2011

fellix said:
Judging from the 1/4 DP rate, most likely there isn't enough mantissa range for full-speed INT32 multiplication.

AMD talked about it at AFDS back in June:

24 BIT INT MUL/MULADD/LOGICAL/SPECIAL @ full SP rates
32-bit Integer MUL/MULADD @ DPFP Mul/FMA rate
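
To see why operand width matters here, consider a multiplier that only accepts 24-bit operands (the mantissa width of single precision). The sketch below builds the low 32 bits of a 32x32-bit product from several such narrow multiplies. The helper names `mul24` and `mul32_lo` are invented for the illustration, and this is not a claim about how GCN actually implements the instruction, only about why it cannot be a single 24-bit pass.

```python
MASK32 = 0xFFFFFFFF

def mul24(a, b):
    """Stand-in for one pass through a 24-bit-wide multiplier (hypothetical)."""
    assert 0 <= a < (1 << 24) and 0 <= b < (1 << 24), "operand exceeds 24 bits"
    return a * b

def mul32_lo(a, b):
    """Low 32 bits of a 32x32-bit product, composed from 16-bit halves
    so that every partial product fits the 24-bit multiplier."""
    a_lo, a_hi = a & 0xFFFF, (a >> 16) & 0xFFFF
    b_lo, b_hi = b & 0xFFFF, (b >> 16) & 0xFFFF
    low = mul24(a_lo, b_lo)                         # contributes bits 0..31
    cross = mul24(a_hi, b_lo) + mul24(a_lo, b_hi)   # contributes bits 16 and up
    # a_hi * b_hi only affects bits 32 and above, so it is not needed here.
    return (low + ((cross & 0xFFFF) << 16)) & MASK32

a, b = 123456789, 987654321
print(hex(mul32_lo(a, b)), hex((a * b) & MASK32))
```

Three narrow multiplies plus shifts and adds are needed in this scheme, which is one way to picture why a full 32-bit multiply ends up at a fraction of the 24-bit rate.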

3dilettante · Dec 27, 2011

Fermi's subdivision of the geometry pipeline may also be different from GCN.

Fermi has the polymorph engine and raster engine blocks, and devotes a fabric to keeping the polymorph engines in each SM in sync with one another. Outside of cases where there is an ordering constraint, it allows for more parallel setup work.

AMD has kept the geometry engine confined outside of the CU block, which may mean that it is more conservative about how it sets up primitives and geometry.
The division is also different because the pixel pipe contains both the scan conversion and render backend, while the primitive pipe contains the tessellation and geometry.
Nvidia pairs edge setup, rasterization, and culling in one block, with the other functions placed in the polymorph block.

I'm curious now as to the specialized bus in GCN for the ROPs and GDS.
Is it to save bandwidth? Or is it also because the GDS and ROPs are part of a pipeline with rather strict ordering, and the arrays of CUs and their R/W subsystem are not consistent enough to maintain it?

denev2004 · Dec 28, 2011

Tridam said:
AMD talked about it at AFDS back in June:

24 BIT INT MUL/MULADD/LOGICAL/SPECIAL @ full SP rates
32-bit Integer MUL/MULADD @ DPFP Mul/FMA rate

That's so bad. Sounds like GCN's integer performance is inferior to Fermi's.

cal_guy · Dec 28, 2011

denev2004 said:
That's so bad. Sounds like GCN's integer performance is inferior to Fermi's.

For 32-bit integer, Fermi is double the rate of GCN. However, the HD 7970 still has a 20% advantage over the GTX 580 in 32-bit operations because of its shader count.
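
A back-of-the-envelope check of that 20% figure, as a sketch. The shader counts and reference clocks below (2048 ALUs at 925 MHz for the HD 7970, 512 ALUs at a 1544 MHz shader clock for the GTX 580) are the published specifications, not numbers given in this thread, and the per-ALU rates (1/4 and 1/2) are the ones discussed above for 32-bit integer multiplies. The function name is made up for the example.

```python
def int32_mul_rate(alus, clock_mhz, rate_fraction):
    """Peak 32-bit integer multiplies per second, in billions.
    rate_fraction is the per-ALU issue rate relative to full SP rate."""
    return alus * clock_mhz * rate_fraction / 1000.0

# Assumed reference specs (not from the thread); rates as discussed above.
hd7970 = int32_mul_rate(alus=2048, clock_mhz=925, rate_fraction=1 / 4)
gtx580 = int32_mul_rate(alus=512, clock_mhz=1544, rate_fraction=1 / 2)

print(f"HD 7970: {hd7970:.1f} G/s, GTX 580: {gtx580:.1f} G/s, "
      f"ratio: {hd7970 / gtx580:.2f}")
```

With these inputs the ratio comes out just under 1.2, so the 20% figure holds once both ALU count and clock are taken into account.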

denev2004 · Dec 28, 2011

cal_guy said:
For 32-bit integer, Fermi is double the rate of GCN. However, the HD 7970 still has a 20% advantage over the GTX 580 in 32-bit operations because of its shader count.

Sounds like it's based more on its frequency... Anyway, the difference is smaller than the one in SP performance.

Gipsel · Dec 28, 2011

cal_guy said:
For 32-bit integer, Fermi is double the rate of GCN. However, the HD 7970 still has a 20% advantage over the GTX 580 in 32-bit operations because of its shader count.

Are you sure the smaller Fermis (GF104/114 and below) do 32-bit integer multiplication at half rate as GF100/110 does (or is it the same as GCN's "@ DPFP rate")?
And all other (simpler) 32-bit integer operations have been full rate since the HD4000 series anyway (and traditionally quite a bit faster than on nV GPUs). That's where some of the advantage in cryptographic workloads such as bitcoin comes from (the fast bit-manipulation instructions of AMD GPUs also help, of course).

fellix · Dec 28, 2011

Both 32-bit addition and bitwise ops are full-rate on AMD since Cayman.

Gipsel · Dec 28, 2011

fellix said:
Both 32-bit addition and bitwise ops are full-rate on AMD since Cayman.

Since Wekiva aka Spartan aka Troy aka Makedon aka RV770

AMD actually presented RV770 as having 12.5 times the bit shift performance of RV670. And I tested it: the HD4000 series does indeed have full-rate bitwise ops (and additions were already full rate even with R600, iirc).

Edit:
Cypress basically added full-rate 64-bit bit shifts (only 32 bits of the result can be written, but with the 3 source operands of the bitalign instruction, one can supply a 64-bit source) or full-rate 32-bit rotates.
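
For readers unfamiliar with that kind of instruction, here is a small software model of the idea: two 32-bit sources form a 64-bit value, it is shifted right, and only the low 32 bits are kept. The operand order (high word first) and the masking of the shift amount to 0..31 are assumptions of this sketch, not a statement of the exact ISA definition.

```python
MASK32 = 0xFFFFFFFF

def bitalign(hi, lo, shift):
    """Model of a 3-operand funnel shift: treat hi:lo as one 64-bit value,
    shift right by `shift` (masked to 0..31 here, an assumption), and
    return the low 32 bits."""
    value = ((hi & MASK32) << 32) | (lo & MASK32)
    return (value >> (shift & 31)) & MASK32

def rotr32(x, n):
    """32-bit rotate right: feed the same word in as both halves."""
    return bitalign(x, x, n)

def rotl32(x, n):
    """32-bit rotate left, expressed as a rotate right by (32 - n)."""
    return bitalign(x, x, (32 - n) & 31)

print(hex(rotr32(0x00000001, 1)))             # bit 0 wraps around to bit 31
print(hex(bitalign(0x00000001, 0x00000000, 4)))
```

Passing the same register as both halves is what turns the 64-bit shift into a 32-bit rotate, which is the operation hash functions use heavily.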

Acert93 · Jan 5, 2012

Thanks for the feedback folks. Much appreciated.
