So are we looking at:
Navi ALU (as in the patent) = 2 MUL + ADD (3 flops per ALU)
GCN ALU = MADD (2 flops)
The claims are kept broad so as to cover similar implementations, but one of the examples may be more like 2 FMA units and a side ALU that has an ADD plus miscellaneous logic.
The two core ALUs with FMA units support the most common operations, while the side ALU with no multiplier may work in concert with one or the other to implement part of a complex operation like a transcendental instruction.
Though there are 3 ALUs, there are only enough inputs for six operands--matching the two FMA units and a sustained throughput of 2 FMAs per clock. The side ALU needs to hope for an unused or shared input from a neighboring ALU, and the three ALUs arbitrate for two result outputs. While it might be possible to create an access pattern that can feed the 3 ALUs with read operands, there's no leeway for the result ports.
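To make the port arithmetic above concrete, here is a rough sketch in Python. The constant names are my own, and the counts are just the figures quoted in this post (six operand inputs, two result outputs, two FMA ALUs plus one side ALU), not anything lifted from the patent text itself.

```python
# Back-of-the-envelope port budget for the example block described above.
# All names are illustrative; the counts are the ones quoted in the post.
FMA_OPERANDS = 3    # an FMA (a * b + c) reads three inputs
CORE_ALUS = 2       # full ALUs with FMA units
SIDE_ALUS = 1       # ADD + miscellaneous logic, no multiplier
READ_PORTS = 6      # operand inputs available per clock
RESULT_PORTS = 2    # result outputs available per clock

# Operands consumed when both core ALUs issue a full FMA.
core_reads = CORE_ALUS * FMA_OPERANDS

# Inputs left for the side ALU unless a core ALU leaves one unused or shares it.
spare_reads = READ_PORTS - core_reads

# Results that can be written back per clock, however many ALUs are busy.
max_results_per_clock = min(CORE_ALUS + SIDE_ALUS, RESULT_PORTS)

print("reads used by two FMAs:", core_reads)
print("spare reads for side ALU:", spare_reads)
print("results retired per clock:", max_results_per_clock)
```

With two full FMAs in flight there are no spare inputs, and the result ports cap things at two per clock regardless of how the reads are arranged.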
Another curious point I am not sure how to reconcile is that while the patent doesn't say a given SIMD width is necessary, the example given has units narrower than GCN's traditional 16-wide SIMD. The description of the registers and ports for the example SIMD block gives widths and outputs that are 4-wide, with the output cache capable of supplying two SIMD4 operations per clock.
Having SIMD blocks and registers sized for 4-wide paths, and then having two core/full ALUs gets the output up to half of a GCN SIMD. I'm having some trouble parsing some of the language, and there's one summary line saying the number of units is equivalent, but I don't know how to get back to the throughput of SIMD16 with the numbers given.
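Spelling out the lane count I'm working from, as a quick Python sketch. The names are made up for illustration, and it assumes each of the two core ALUs is 4 lanes wide, which is only my reading of the example.

```python
# Lane arithmetic for the example block versus a GCN SIMD16.
# Assumption (my reading of the example): each core ALU is 4 lanes wide.
GCN_SIMD_WIDTH = 16
EXAMPLE_SIMD_WIDTH = 4
CORE_ALUS = 2

# FMA lanes per clock if both core ALUs are kept busy.
lanes_per_clock = EXAMPLE_SIMD_WIDTH * CORE_ALUS

# Fraction of a GCN SIMD16 that this covers.
fraction_of_simd16 = lanes_per_clock / GCN_SIMD_WIDTH

# Example blocks needed to match one SIMD16 under this reading.
blocks_to_match = GCN_SIMD_WIDTH // lanes_per_clock

print(lanes_per_clock, fraction_of_simd16, blocks_to_match)
```

That is where the "half of a GCN SIMD" figure comes from: it would take two such blocks to match one SIMD16, and nothing in the numbers I can find says they are doubled up.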
I believe the patent also goes into the number of tex units per unit, though my brain isn't quite a fully operational battle station this morning.
It seems to point to a large CU type with two texture units and L1 caches, and a small one with one texture block and L1.
I'm not entirely sure how to reconcile this with some of the claims of keeping the ALU/TEX ratio the same as with other GPUs. Going by some of the suggested math throughput, having even one texture unit with GCN's 4-address capability would make the ALU/TEX ratio lower, let alone doubling the number of texture units. I'm not sure of the purpose of having two adjacent L1s.
One possible interpretation is that the designers in this case aren't looking at the texture portion as a monolith, but rather each independent address processing block and filter unit as a texture unit in its own right. This might mean there could be multiple narrower texture units.
Ergo, a compute unit:
GCN CU = 4 x (16 x MADD)
Navi CU = 4 x 16 x (2 MUL + 1 ADD)
Not sure about the 16, and it looks to be sized to have 1 FMA per core/full ALU.
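For comparison, here is the per-clock arithmetic for both readings as a small Python sketch. The helper `flops_per_clock` and the constants are hypothetical names of mine. The "guessed" CU is simply the formula from the question, and the per-block figure assumes the 4-wide, two-FMA example discussed earlier, so treat all of it as a sketch under those assumptions rather than a statement about real hardware.

```python
def flops_per_clock(simds, lanes, flops_per_lane):
    """Peak flops per clock for `simds` units of `lanes` lanes each."""
    return simds * lanes * flops_per_lane

MADD_FLOPS = 2  # one multiply-add counts as two flops

# GCN CU as written in the question: 4 x (16 x MADD).
gcn_cu = flops_per_clock(4, 16, MADD_FLOPS)

# The question's guess: 4 x 16 x (2 MUL + 1 ADD), i.e. 3 flops per lane.
guessed_cu = flops_per_clock(4, 16, 3)

# My reading of the example: one 4-wide block with two FMA-capable ALUs.
example_block = flops_per_clock(1, 4, 2 * MADD_FLOPS)

# One GCN SIMD16 for comparison.
gcn_simd16 = flops_per_clock(1, 16, MADD_FLOPS)

print("GCN CU:", gcn_cu)
print("guessed CU:", guessed_cu)
print("example block vs SIMD16:", example_block, "vs", gcn_simd16)
```

Under the second reading the example block lands at half of a SIMD16, which is why I'm not sure the factor of 16 carries over.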