Compare & Contrast Architectures: Nvidia Fermi and AMD GCN

Directed tests take away almost all of the burden from the software stack: it gets fed simplistic, often pre-optimised code, so it is far less likely to flounder, which is why such tests tend to track hardware differences more accurately.
Couldn't a similar argument be made that the hardware schedulers don't really get the opportunity to show their mojo in simplistic code? (Unless the test addresses them specifically, of course.)
 
Couldn't a similar argument be made that the hardware schedulers don't really get the opportunity to show their mojo in simplistic code? (Unless the test addresses them specifically, of course.)
It is usually pretty easy to get peak rates out of directed tests, so the situation you describe is unlikely. The real problem is to create complicated test cases that trigger secondary effects at the system level.
Typical examples are cache thrashing (the more caches there are, the harder it becomes) and SDRAM transaction scheduling.
When AMD moved from VLIW5 to VLIW4, they said that the former was a bit more architecturally efficient than the latter (for graphics workloads), but that the smaller size of the latter made it possible to compensate for the efficiency loss by putting more instances on the die. On average, that's the right decision, but it makes you vulnerable to cases where it doesn't work.

E.g. adding more parallel resources can make performance dramatically worse if it tips the cache into a thrashing mode.

(This is just a general observation, I'm not saying that this is specifically the case for GCN.)
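To make the thrashing point concrete, here is a minimal CPU-side sketch (toy sizes, nothing GCN-specific, and the names are just for illustration): each added "worker" streams its own slice of memory through a shared cache, and once the combined working set exceeds capacity the workers evict each other's lines instead of re-hitting their own.

```c
/* Toy model of the effect: N parallel "workers" each stream through their
 * own slice of memory via a shared cache. Sizes are made up, not GCN's.  */
#include <stdint.h>
#include <stddef.h>

#define LINE   64                 /* cache line size in bytes             */
#define SLICE  (256 * 1024)       /* per-worker working set, 256 KiB      */

/* One pass over all slices, interleaving the workers the way wider
 * hardware interleaves more wavefronts over the same cache.              */
uint64_t sweep(uint8_t *slices[], int workers, int passes)
{
    uint64_t sum = 0;
    for (int p = 0; p < passes; ++p)
        for (size_t i = 0; i < SLICE; i += LINE)
            for (int w = 0; w < workers; ++w)
                sum += slices[w][i];       /* one access per cache line   */
    /* With, say, a 512 KiB shared cache: 2 workers fit and re-hit their
     * lines on every pass; 4 workers keep evicting each other, so doubling
     * the parallelism can more than halve the effective throughput.      */
    return sum;
}
```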
 
Judging from the 1/4 DP rate, most likely there isn't enough mantissa width for full-speed INT32 multiplication.

AMD talked about it during AFDS back in June:

24 BIT INT MUL/MULADD/LOGICAL/SPECIAL @ full SP rates
32-bit Integer MUL/MULADD @ DPFP Mul/FMA rate
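That split lines up with the datapaths: a single-precision FMA unit multiplies 24-bit mantissas, so a 24x24 integer product maps onto it directly, while a full 32x32 product either needs the wider DP multiplier or has to be composed from narrower multiplies. A rough sketch in C, with mul24 standing in as a model of the hardware's 24-bit multiply (the names are just for illustration):

```c
#include <stdint.h>

/* Model of a hardware 24-bit multiply: the SP mantissa datapath is 24 bits
 * wide, so this product can be produced at full single-precision rate.   */
static uint32_t mul24(uint32_t a, uint32_t b)
{
    return (a & 0xFFFFFF) * (b & 0xFFFFFF); /* low 32 bits of a 24x24 mul */
}

/* A full 32-bit multiply does not fit the SP datapath; it has to be built
 * from narrower multiplies (three here, on 16-bit halves, which are exact
 * inside a 24-bit multiplier) or issued on the wider DP multiplier, hence
 * "32-bit MUL @ DPFP rate".                                               */
uint32_t mul32_from_mul24(uint32_t a, uint32_t b)
{
    uint32_t lo  = mul24(a & 0xFFFF, b & 0xFFFF);
    uint32_t mid = mul24(a >> 16,    b & 0xFFFF) + mul24(a & 0xFFFF, b >> 16);
    return lo + (mid << 16);                /* == (uint32_t)(a * b)        */
}
```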
 
Fermi's subdivision of the geometry pipeline may also be different from GCN.

Fermi has the polymorph engine and raster engine blocks, and devotes a fabric to keeping the polymorph engines in each SM in sync with one another. Outside of cases where there is an ordering constraint, it allows for more parallel setup work.

AMD has kept the geometry engine confined outside of the CU block, which may mean that it is more conservative about how it sets up primitives and geometry.
The division is also different because the pixel pipe contains both the scan conversion and render backend, while the primitive pipe contains the tessellation and geometry.
Nvidia pairs edge setup, rasterization, and culling in one block, with the other functions placed in the polymorph block.

I'm curious now as to the specialized bus in GCN for the ROPs and GDS.
Is it to save bandwidth? Or is it also because the GDS and ROPs are part of a pipeline with rather strict ordering, and the array of CUs and their R/W subsystem isn't consistent enough to maintain it?
 
For 32-bit integer Fermi is double the rate of GCN. However, the HD 7970 still has a 20% advantage over the GTX 580 in 32-bit operations because of its shader count.
Sounds like it's based more on frequency... Anyway, the difference is smaller than the one in SP performance.
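For what it's worth, plugging in the commonly quoted specs, and assuming 32-bit IMUL runs at half rate on GF110 and at the 1/4 DP rate on Tahiti as discussed above: HD 7970 gives 2048/4 x 0.925 GHz ~ 474 G muls/s, GTX 580 gives 512/2 x 1.544 GHz ~ 395 G muls/s, which is roughly the 20% figure. Per clock the advantage comes from the ALU count (512 vs. 256 muls per clock); the frequency difference actually works against Tahiti here.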
 
For 32-bit integer Fermi is double the rate of GCN. However, the HD 7970 still has a 20% advantage over the GTX 580 in 32-bit operations because of its shader count.
Are you sure the smaller Fermis (GF104/114 and below) do 32-bit integer multiplication at the same half rate as GF100/110 does (or is it the same as GCN's "@ DPFP rate")?
And all other (simpler) 32-bit integer operations have been full rate starting with the HD 4000 series anyway (and traditionally quite a bit faster than on nV GPUs). That's where some of the advantage for cryptographic stuff comes from (like bitcoin; the fast bit-manipulation instructions of AMD GPUs also help, of course).
 
Both 32-bit addition and bitwise ops are full-rate on AMD since Cayman.
Since Wekiva aka Spartan aka Troy aka Makedon aka RV770 ;)

AMD actually presented RV770 as having 12.5 times the bitshift performance of RV670. And I tested it: the HD 4000 series does indeed have full-rate bitwise ops (and additions were already full rate even with R600, iirc).

Edit:
Cypress basically added full-rate 64-bit bit shifts (only 32 bits of the result can be written, but with the three source operands of the bitalign instruction one can supply a 64-bit source) or full-rate 32-bit rotates.
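For reference, a small C model of what such an align/funnel-shift operation does (bit_align here is just an illustrative stand-in, not the exact ISA semantics): two 32-bit sources form a 64-bit value, the third operand picks the shift amount, and only a 32-bit window of the result is written back; feeding the same word in as both halves gives a 32-bit rotate for free.

```c
#include <stdint.h>

/* Model of a bitalign-style op: hi:lo form a 64-bit source and shift
 * selects a 32-bit window of it, i.e. a 64-bit shift whose result is
 * only written back as 32 bits.                                        */
uint32_t bit_align(uint32_t hi, uint32_t lo, uint32_t shift)
{
    uint64_t src = ((uint64_t)hi << 32) | lo;
    return (uint32_t)(src >> (shift & 31));
}

/* A 32-bit rotate right falls out for free: use the same word as both
 * halves of the 64-bit source.                                          */
uint32_t rotr32(uint32_t x, uint32_t s)
{
    return bit_align(x, x, s);
}
```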
 