That's my point. Even the GCN3 -> Polaris change was smaller than the Kepler -> Maxwell change.
Kepler -> Maxwell was a HUGE change for Nvidia.
1. Brand new SM design. Split into quadrants. Less shared hardware (less communication/distance overhead). Doubled the max thread group count per SM. Split the LDS / L1 cache and increased the LDS amount. Combined the L1 and texture caches, etc. Source:
https://devblogs.nvidia.com/paralle...ould-know-about-new-maxwell-gpu-architecture/
2. New register files. Simpler banking and a register operand reuse cache. One of the potential reasons Maxwell/Pascal reach very high clocks. Source:
https://github.com/NervanaSystems/maxas/wiki/SGEMM.
On Maxwell there are 4 register banks, but unlike on Kepler (also with 4 banks) the assignment of banks to numbers is very simple. The Maxwell assignment is just the register number modulo 4. On Kepler it is possible to arrange the 64 FFMA instructions to eliminate all bank conflicts. On Maxwell this is no longer possible. Maxwell, however, provides something to make up for this and at the same time offers the capability to significantly reduce register bank traffic and overall chip power draw. This is the operand reuse cache. The operand reuse cache has 8 bytes of data per source operand slot. An instruction like FFMA has 3 source operand slots. Each time you issue an instruction there is a flag you can use to specify if each of the operands is going to be used again. So the next instruction that uses the same register in the same operand slot will not have to go to the register bank to fetch its value. And with this feature you can see how a register bank conflict can be averted.
3. Tiled rasterizer. Records big chunks of vertex shader output to a temporary on-chip buffer (L2 cache), splits the output into big screen tiles and executes per tile. Big savings in memory bandwidth. Source:
http://www.realworldtech.com/tile-based-rasterization-nvidia-gpus/
4. Kepler was feature level 11_0 hardware. Behind even AMD GCN 1. Maxwell added all feature level 11_1, 12_0 and 12_1 features: 3D tiled resources, typed UAV loads, conservative rasterization, rasterizer ordered views.
5. Maxwell added hardware thread group shared memory atomics. I have personally measured up to a 3x perf boost in some of my shaders (Kepler -> Maxwell). When combined with the other improvements, Maxwell is finally faster than GCN in compute shaders.
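To make point 5 concrete, here is a minimal CUDA sketch (my own illustration, not one of sebbbi's shaders; the kernel name and bin count are made up) of the kind of shader that benefits: every thread in the group hammers shared memory atomics. Kepler emulated these with a lock/retry loop, Maxwell runs them in hardware.

    // Per-thread-group histogram: the whole group hammers shared memory atomics.
    // Hardware atomics on Maxwell, lock/retry emulation on Kepler.
    __global__ void blockHistogram(const unsigned char* data, int n, unsigned int* histOut)
    {
        __shared__ unsigned int bins[256];

        for (int i = threadIdx.x; i < 256; i += blockDim.x)
            bins[i] = 0;                      // clear shared bins cooperatively
        __syncthreads();

        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
            atomicAdd(&bins[data[i]], 1u);    // thread group shared memory atomic
        __syncthreads();

        for (int i = threadIdx.x; i < 256; i += blockDim.x)
            atomicAdd(&histOut[i], bins[i]);  // flush block result to global memory
    }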
Fully programmable, but a different performance tier, as indicated by the GPUOpen documentation (8.6 TB/s vs 51.6 TB/s for Fiji) for DPP and permute. Not exactly slow, but much more optimal paths could exist. That bandwidth difference will likely be tied to power. The theorized scalar would have its own separate pool nearby, along with the ability to run cross-lane instructions at less than optimal wave sizes. Depending on the algorithm that may or may not be more efficient, and it would at the very least leave LDS free for other commands.
8.6 TB/s is the LDS memory bandwidth. This is what you get when you communicate over LDS instead of using cross-lane ops such as DPP or ds_permute. Cross-lane ops do not touch LDS memory at all. LDS data sharing between threads needs both an LDS write and an LDS read: two instructions and two waits. The GPUOpen documentation doesn't describe the "bandwidth" of the LDS crossbar that is used by ds_permute. It describes the memory bandwidth, and ds_permute doesn't touch the memory at all (and doesn't suffer from bank conflicts).
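A minimal CUDA sketch of the difference (warp shuffle standing in for DPP/ds_permute; the function names are mine): sharing through LDS/shared memory costs a write, a wait and a read, while the cross-lane version is a single instruction that never touches the memory banks.

    // Exchanging a value with a neighboring lane, two ways (illustrative CUDA).
    __device__ float viaSharedMemory(float v, float* lds)   // lds: blockDim.x floats
    {
        lds[threadIdx.x] = v;          // LDS write: one instruction, one wait
        __syncthreads();
        return lds[threadIdx.x ^ 1];   // LDS read: one instruction, one wait
    }

    __device__ float viaCrossLane(float v)
    {
        // Single cross-lane instruction; no LDS traffic, no bank conflicts.
        return __shfl_xor_sync(0xffffffffu, v, 1);
    }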
DPP is definitely faster in cases where you need multiple back-to-back operations, such as prefix sum (7 DPP in a row). But I can't come up with any good examples (*) where you need a huge amount of random ds_permutes (DPP handles common cases such as prefix sums and reductions just fine). Usually you have enough math to completely hide the LDS latency (even for real LDS instructions with bank conflicts). ds_permute shouldn't be a problem. The current scalar unit is also good enough for fast votes. A compare generates a 64-bit scalar register mask. You can read it back a few cycles later (constant latency, no waitcnt needed).
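In CUDA terms, that back-to-back prefix sum chain looks like the sketch below (warp shuffle standing in for DPP; a 32-wide warp needs log2(32) = 5 dependent steps, versus the 7 DPP quoted above for a 64-wide GCN wave).

    // Warp-wide inclusive prefix sum built purely from cross-lane ops.
    // Each iteration is a dependent back-to-back step, like the DPP chain above.
    __device__ float warpInclusiveScan(float v)
    {
        int lane = threadIdx.x & 31;
        for (int offset = 1; offset < 32; offset <<= 1)   // 5 dependent steps
        {
            float n = __shfl_up_sync(0xffffffffu, v, offset);
            if (lane >= offset)
                v += n;
        }
        return v;
    }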
(*) One example where you need lots of DPP quad permutes is "emulating" vec4 code (4 threads run one vec4 "thread"). The 2 cycle extra latency might be problematic in this particular case. Intel's GPU supports a vec4 execution mode. They are using it for vertex, geometry, hull and domain shaders. One advantage of vec4 emulation is that it reduces your branching granularity to 16 "threads". It is also easier to fill the GPU, as you need 4x fewer "threads". A fun idea that could be fast enough with DPP, but definitely not fast enough with ds_permute.
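A CUDA-flavored sketch of that vec4 emulation idea (my illustration of the concept; quad-wide shuffles standing in for DPP quad permutes): four consecutive threads each hold one component, and a vec4 dot product becomes a 2-step cross-lane reduction inside the quad.

    // Four consecutive threads emulate one vec4 "thread", one component each.
    // dot(a, b) then needs only 2 quad-wide cross-lane steps.
    __device__ float quadDot(float a, float b)   // a, b: this thread's component
    {
        float p = a * b;
        p += __shfl_xor_sync(0xffffffffu, p, 1, 4);   // width=4 keeps it in the quad
        p += __shfl_xor_sync(0xffffffffu, p, 2, 4);
        return p;   // every lane of the quad now holds the full dot product
    }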
I guess the biggest question is how busy that scalar unit is in your experience? If that utilization is already high, making it more flexible isn't likely to help very much.
I have never seen it more than 40% busy. However, if the scalar unit supported floating point code, the compiler could offload much more work to it automatically. Currently the scalar unit and scalar registers are most commonly used for constant buffer loads, branching and resource descriptors. I am not a compiler writer, so I can't estimate how big a refactoring really good automatic scalar offloading would require.