If games are compute-bound, then GCN compute efficiency in these games must be terrible - worse than the VLIW GPUs of yore.
ALU was a big bottleneck on last gen consoles, so it takes a while to change your code base and habits to be a perfect fit for modern GPUs. GCN is not ALU bound in most shaders, but that doesn't mean that adding CUs shouldn't improve performance (almost linearly), since additional CUs give a linear increase in total registers, L1 cache, LDS, etc. (all of which are other potential bottlenecks in compute code).
As for what I said about GCN and tiled lighting, I think I'm misremembering, now that I've bothered to rummage:
https://forum.beyond3d.com/posts/1611685/
Unfortunately the link to the Intel page doesn't work, so I can't be 100% sure it's the comparison I was thinking of. Wrong version of Battlefield, too.
EDIT: found another post, which has a working link to the Intel site:
https://forum.beyond3d.com/posts/1638737/
where HD7970 has a serious problem with MSAA. That does seem to be what I was remembering, though there is no number presented for Tahiti.
Andrew's comparison was between 5000 series (VLIW5) Radeon and 400 series (Fermi) GeForce. Old VLIW5 Radeons had big performance bottlenecks in dynamic array indexing (of memory and/or LDS). I remember measuring constant buffer array indexing to be roughly 2x faster than LDS array indexing (and indexing structured buffers was roughly 6x slower than a constant buffer). We used CPU based tile light binning on consoles (older AMD GPUs) and DX10 era PCs, because dynamic loops (that indexed the light lists using the loop counter) were much slower than unrolled code (multiple shader permutations based on light counts). VLIW Radeons also had some severe bank conflict scenarios on registers and LDS.
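For readers unfamiliar with CPU-side tile light binning, here is a minimal sketch of the general idea. It is not the implementation described above: the function name `bin_lights`, the 16-pixel tile size, and the representation of a light as a screen-space circle `(cx, cy, radius)` are all assumptions made for illustration. Each light is conservatively added to every tile its bounding box touches.

```python
def bin_lights(lights, width, height, tile=16):
    """Assign each light to the screen tiles its bounding box overlaps.

    lights: list of (cx, cy, radius) in pixels (hypothetical representation).
    Returns (bins, tiles_x), where bins[ty * tiles_x + tx] lists light indices.
    """
    tiles_x = (width + tile - 1) // tile
    tiles_y = (height + tile - 1) // tile
    bins = [[] for _ in range(tiles_x * tiles_y)]
    for index, (cx, cy, radius) in enumerate(lights):
        # Conservative bounds: the circle's bounding box, clamped to the screen.
        x0 = max(int((cx - radius) // tile), 0)
        x1 = min(int((cx + radius) // tile), tiles_x - 1)
        y0 = max(int((cy - radius) // tile), 0)
        y1 = min(int((cy + radius) // tile), tiles_y - 1)
        # Fully off-screen lights produce empty ranges and are skipped.
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                bins[ty * tiles_x + tx].append(index)
    return bins, tiles_x
```

The per-tile light count that falls out of this is what would let a renderer pick an unrolled shader permutation per tile instead of running a dynamic loop.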
GCN doesn't have any of these problems. Bad GCN multisampled performance in the tiled lighting shader is most likely explained by bad occupancy. If you don't use all the tricks in your book to force the AMD shader compiler to behave properly, you will end up with high VGPR usage in the complex tiled lighting shaders. Multisampled versions are much more prone to VGPR pressure, since the shader is more complex. I think we spent at least a month optimizing the VGPR usage of our tiled lighting shader. But the end result is nice. We can push 16k visible lights at a locked 60 fps (on a mid-range GCN 1.1 GPU). It is silly how even simple things such as changing the order of two lines of shader code can cut the VGPR usage down by 2-3 (giving up to 10% extra performance for a shader that has poor occupancy).
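To see why 2-3 VGPRs can matter, here is a rough occupancy calculation. The constants are the commonly cited GCN figures and are not taken from this post: 256 VGPRs per lane per SIMD, at most 10 wavefronts per SIMD, and VGPRs allocated in blocks of 4. The function name is made up for the example, and it ignores the other limits (SGPRs, LDS) that can also cap occupancy.

```python
VGPRS_PER_LANE = 256      # assumed: VGPR budget per SIMD lane on GCN
MAX_WAVES_PER_SIMD = 10   # assumed: hardware cap on resident waves per SIMD
VGPR_GRANULARITY = 4      # assumed: VGPRs are allocated in blocks of 4


def waves_per_simd(vgprs):
    """Resident wavefronts per SIMD when VGPRs are the only limit."""
    if not 1 <= vgprs <= VGPRS_PER_LANE:
        raise ValueError("VGPR count out of range")
    # Round the shader's VGPR count up to the allocation granularity.
    allocated = -(-vgprs // VGPR_GRANULARITY) * VGPR_GRANULARITY
    return min(MAX_WAVES_PER_SIMD, VGPRS_PER_LANE // allocated)


for count in (66, 64, 36, 32, 24):
    print(count, "VGPRs ->", waves_per_simd(count), "waves per SIMD")
```

Under these assumptions a shader at 66 VGPRs fits 3 waves per SIMD, while 64 VGPRs fits 4, so shaving two registers at the right threshold buys a whole extra wave for latency hiding.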
Compute shader optimizations in general don't "port" well across architectures. You get big performance differences simply by changing your thread group size (128 = 16x8, 256 = 16x16, 512 = 32x16, 1024 = 32x32 threads). Less than 256 threads per group doesn't suit GCN well. And with 1024 threads per group (32x32 tile) it is hopeless to reach high enough occupancy (for any complex multisampled tiled lighting implementation). The optimal group size depends on the GPU's resource bottlenecks. If the same shader code is used on multiple generations of AMD and NVIDIA GPUs, the thread group size will likely not be perfectly optimal for each. A wrong thread group size alone is enough to severely hamper performance on some GPUs.
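The 1024-thread case can be illustrated with some back-of-the-envelope arithmetic. This sketch assumes the usual GCN figures (64-thread wavefronts, 4 SIMDs per CU, 256 VGPRs per lane, at most 10 waves per SIMD, allocation in blocks of 4), none of which come from this post, and it assumes a group's waves are spread evenly over the SIMDs of one CU. `vgpr_budget` is a hypothetical helper name.

```python
import math

WAVE_SIZE = 64            # assumed: threads per GCN wavefront
SIMDS_PER_CU = 4          # assumed: SIMDs per compute unit
VGPRS_PER_LANE = 256      # assumed: VGPR budget per SIMD lane
MAX_WAVES_PER_SIMD = 10   # assumed: hardware cap on resident waves per SIMD
VGPR_GRANULARITY = 4      # assumed: VGPRs are allocated in blocks of 4


def vgpr_budget(threads_per_group, resident_groups=1):
    """Max VGPRs per thread that still lets `resident_groups` groups share a CU."""
    waves = math.ceil(threads_per_group / WAVE_SIZE) * resident_groups
    per_simd = math.ceil(waves / SIMDS_PER_CU)
    if per_simd > MAX_WAVES_PER_SIMD:
        raise ValueError("too many waves for one CU")
    budget = VGPRS_PER_LANE // per_simd
    # Round down to the allocation granularity.
    return budget - budget % VGPR_GRANULARITY


for size in (256, 512, 1024):
    print(size, "threads:", vgpr_budget(size), "VGPRs for 1 group,",
          vgpr_budget(size, 2), "VGPRs for 2 groups")
```

With these numbers a 32x32 group is 16 waves, so a single group already needs the shader to stay at or below 64 VGPRs, and keeping two groups resident needs 32 or fewer. A 16x16 group has far more headroom.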
Because GCN is starved for register space, it is slightly more reliant on a good shader compiler than some other GPUs. The GCN scalar unit could be a big help for register pressure, but utilizing it perfectly would require even more sophisticated compiler logic. Tile based lighting is a perfect candidate for scalar optimizations, since you can split the thread group into 8x8 subgroups (sub tiles), and offload all subgroup calculations and data loads (and the registers to hold that data) to the scalar unit. This saves lots of ALU and registers. The OpenCL 2.0 shading language has subgroup operations that would help the GCN shader compiler to use the scalar unit (and lane swizzles) better. Our PC versions have always been DirectX-based, so I don't know how well this works in practice on PC.
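An 8x8 sub tile is 64 threads, which matches one GCN wavefront, so anything computed per sub tile is uniform across the wave and is a natural fit for scalar registers. The sketch below shows one possible thread-to-pixel mapping that makes each wave cover exactly one 8x8 sub tile of a 16x16 group. It assumes the hardware packs consecutive linear thread indices into wavefronts of 64; the function name and parameters are illustrative only.

```python
def thread_to_pixel(thread_index, group_width=16, sub=8):
    """Map a linear thread index to (x, y) inside the group's tile so that
    each run of sub*sub consecutive threads covers one sub x sub sub tile."""
    wave, lane = divmod(thread_index, sub * sub)
    subtiles_per_row = group_width // sub
    sub_y, sub_x = divmod(wave, subtiles_per_row)
    lane_y, lane_x = divmod(lane, sub)
    return sub_x * sub + lane_x, sub_y * sub + lane_y


# Every thread of a given wave lands in the same 8x8 sub tile.
for wave in range(4):
    corners = {(x // 8, y // 8)
               for x, y in (thread_to_pixel(wave * 64 + lane) for lane in range(64))}
    print("wave", wave, "-> sub tile", corners)
```

With a mapping like this, per-sub-tile data such as a culled light list could in principle be loaded once per wave rather than once per thread, though whether the compiler actually emits scalar loads for it is exactly the open question raised above.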