Summer/Autumn 2015?
So, erm, are we looking at a re-run of R600? Ambitious memory architecture (with theoretical peaks to drown in), a new process, delays upon delays, and awful performance due to terrible balance.
The memory architecture is a sort-of parallel. The possibility is there that there will be an HBM device or two on-package, but the memory controllers should help insulate the internals of the chip from that change. I haven't seen news about changes to that portion, whereas R600 re-engineered things on-die as well.
The new process is a maybe.
Terrible balance doesn't seem to be in the cards, at least not to the same degree R600 emphasized its ALU capabilities over TEX and ROP capability.
But I do wonder if the ALU architecture is working too hard for the results it generates, and whether it was deliberately built to spend lots of power to achieve its very low internal latencies (the insanely short ALU pipeline, back-to-back read-after-write, LDS reads/writes, branching).
I doubt it deliberately spent lots of power, but I suspect it prioritized certain other features.
GCN attempts to provide a straightforward machine for compute, one more consistent with the CPU platform it is meant to integrate with.
This means no longer exposing a number of low-level details AMD once left out there, like forwarding considerations within a single domain (it keeps things straightforward, and serves as some kind of aid to an HSA finalizer), and continuing to strip down the amount of hidden state maintained for a CU context (all the better to help virtualize resources and get CPU-type QoS measures implemented).
However, this comes with the constraint of leveraging a large amount of the existing execution paradigm, and a general emphasis on using simplistic hardware.
I'm wondering if the CUs are spinning like demons even though there's nothing for the ALUs to do, because register over-allocation has cut the number of wavefronts per SIMD.
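For a sense of scale, here's the occupancy arithmetic, a minimal sketch using GCN's documented limits (256 vector registers per SIMD shared by all resident wavefronts, at most 10 waves per SIMD); the per-kernel register counts are invented:

```c
#include <stdio.h>

/* Wavefronts resident on one GCN SIMD, given a kernel's VGPR usage.
   Per the GCN docs: 256 vector registers per SIMD, shared by all
   resident waves, with a hard cap of 10 waves per SIMD. */
static int waves_per_simd(int vgprs_per_wave) {
    int waves = 256 / vgprs_per_wave;
    return waves > 10 ? 10 : waves;
}

int main(void) {
    printf("%d waves\n", waves_per_simd(24));  /* 10: full occupancy */
    printf("%d waves\n", waves_per_simd(84));  /*  3: thin latency hiding */
    printf("%d waves\n", waves_per_simd(128)); /*  2: SIMD idles on every miss */
    return 0;
}
```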
This seems unlikely, or should be unlikely. AMD's clock gating should be better than that, and if there's no work for the units to do they shouldn't be doing much.
Anecdotally I've heard of very recent drivers causing vast increases in power consumption at slight gains in performance on intensive compute kernels.
Does this include changes to the compiler, and for which hardware? I'd be curious what would have changed. A change in cache policy might be forcing more broadcasts than there once were, and the unevolved cache hierarchy could be a pain point.
So even if I'd like to think GCN is in many ways a sane compute architecture, it might have a rotten core.
I think it was a decent start at its introduction, and then apparently it was decided that GCN and its warts were sufficiently comfortable laurels to rest on.
Which might mean AMD tries to tackle this problem with the new chip.
That's hardly guaranteed. Let's note that the original comparison is to R600, after all: dodgy drivers, fighting with the compiler, failing to reach peak...
It is possible AMD will do something, but some of the most notable promised changes have little emphasis on retuning the VALU execution balance, and AMD has not promised better software.
They tend to focus more on QoS and integration, which do not directly help the CUs. Preemption and prioritization schemes tend to make things at least somewhat worse at the CU level.
Some amount of compute context switching is present, and it has been indicated that graphics preemption is coming up next.
One set of outside possibilities involves finding some way to handle divergence: one measure is promoting the SALU to be capable of running a scalar variant of VALU code; another is the sporadically discussed wavefront repacking scheme. There was a paper on promoting the scalar unit, and some research/patents that might cover the repacking (there was some throwaway commentary from watchimpress about this when Tonga launched). A common refrain is leveraging the LDS and its cross-lane capability, and for what it's worth, AMD barely mentioned the introduction of some cross-lane capability with the latest hardware.
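To make the SALU-promotion idea concrete: a wavefront's branch can only be run once on a scalar unit if every active lane agrees on the condition. A toy uniformity check in C; the Wave struct and field names are my invention, not anything from AMD:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical wavefront state: a 64-bit active-lane mask plus the
   per-lane branch condition, one bit per lane. */
typedef struct {
    uint64_t exec;  /* which of the 64 lanes are active */
    uint64_t cond;  /* each lane's branch outcome       */
} Wave;

/* If all active lanes took the same side of the branch, the region is
   uniform and could in principle run once on a scalar unit instead of
   occupying the VALU for the whole wavefront. */
static bool branch_is_uniform(const Wave *w) {
    uint64_t taken = w->cond & w->exec;
    return taken == 0 || taken == w->exec;
}

int main(void) {
    Wave w = { UINT64_MAX, UINT64_MAX };      /* all lanes agree   */
    printf("%d\n", branch_is_uniform(&w));    /* 1                 */
    w.cond = 1;                               /* one lane diverges */
    printf("%d\n", branch_is_uniform(&w));    /* 0                 */
    return 0;
}
```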
My theory is that GCN is engineered very precisely for very low intra-CU latencies, but does so without regard to power consumption. Or, at the very least, that some slackening would lead to substantial power savings. But slackening something this tightly bound ends up being a fundamental design change.
There has been for a very long time a 4-cycle cadence to part of the execution loop, and AMD kept it.
I think there is a desire for very low apparent latency, but it is the simplistic hardware and throwback execution loop that make it a problem for the implementation.
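For anyone who hasn't internalized the cadence, the arithmetic behind it (standard GCN numbers: 64-lane wavefronts, four 16-lane SIMDs per CU):

```c
#include <stdio.h>

int main(void) {
    const int wave_lanes   = 64;  /* work-items per wavefront */
    const int simd_width   = 16;  /* physical lanes per SIMD  */
    const int simds_per_cu = 4;

    /* One VALU instruction occupies its SIMD for 64/16 = 4 cycles.
       The CU front end rotates across the 4 SIMDs, so each SIMD is
       offered a new instruction every 4 cycles, exactly as the
       previous one drains. A dependent instruction can thus issue
       back-to-back, and no ALU latency is ever visible to software. */
    printf("cycles per VALU op: %d\n", wave_lanes / simd_width);
    printf("issue period per SIMD: %d cycles\n", simds_per_cu);
    return 0;
}
```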
The ALUs, per se, don't benefit. But the compiler likes to re-order memory accesses so that they are in the tightest sequences, making as many accesses as early as possible, preferably contiguously. Additionally, it likes to unroll loops without being asked to do so.
Both of those things generate ILP. You could say it's a hangover from the R600 VLIW ISA.
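A sketch of what that looks like, with a made-up OpenCL kernel; nothing here is actual compiler output, just the shape of the transformation:

```c
/* As written: one load per loop iteration. */
__kernel void sum4(__global const float *in, __global float *out, int n) {
    int i = get_global_id(0);
    float acc = 0.0f;
    for (int j = 0; j < 4; ++j)
        acc += in[i + j * n];
    out[i] = acc;
}

/* What the compiler effectively turns it into: the loop is unrolled
   and all four loads are hoisted to the top, issued back to back, so
   their latencies overlap and the memory pipeline can coalesce them.
   The price is four values live at once instead of one, i.e. more
   VGPRs, which feeds straight back into the occupancy math above. */
__kernel void sum4_unrolled(__global const float *in, __global float *out,
                            int n) {
    int i = get_global_id(0);
    float a = in[i];
    float b = in[i + n];
    float c = in[i + 2 * n];
    float d = in[i + 3 * n];
    out[i] = (a + b) + (c + d);
}
```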
This may play into a simplified heuristic that greedily generates memory-level parallelism and avoids control overhead.
It does leverage an aggressively coalescing memory pipeline and reduces the burden on GCN's less than ideal branching capability.
It does make sense from the perspective of an optimizer for a single wavefront, but then it seems attention went elsewhere once that first step was taken.
I've been musing over whether this has ramifications for AMD's preemption and context switching, as well as other possible design enhancements. Vector memory operations are perhaps the most painful ones to cut off or discard, so the architecture might try to break just before or after long runs of vector memory traffic.
Within a GCN CU, VALU, SALU, LDS, and memory instructions are all candidates for issue. Anything that's not VALU can cause the VALU to stall if there's too much of it. So, for example, a kernel with lots of scalar instructions on the SALU will cap VALU throughput.
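A made-up kernel to illustrate; the SALU/VALU mapping in the comments is the typical one for uniform vs. per-lane work, not a guaranteed compiler decision:

```c
__kernel void gather_scale(__global const float *in, __global float *out,
                           __constant int *table, int n) {
    int i = get_global_id(0);
    float acc = 0.0f;
    for (int j = 0; j < n; ++j) {  /* s_add/s_cmp/s_branch: SALU     */
        int k = table[j];          /* uniform load: scalar data path */
        acc += in[i + k] * 0.5f;   /* per-lane load + madd: the only
                                      VMEM/VALU work per trip        */
    }
    out[i] = acc;
}
/* A wavefront issues at most one instruction per issue opportunity,
   so at roughly three scalar instructions per vector one, a lone
   wavefront spends three of every four slots off the VALU; only
   other resident waves can fill the idle VALU cycles. */
```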
The mention of the other types also brings to mind that a CU is not so much a unified processor as it is a bundle of independent pipelines, which are either architecturally constrained from going too far afield (4-cycle loop, single instruction issue per wavefront), or dependent on explicit waitcnt instructions. The compiler heuristic might be tuned to generate work for all these pipelines at the expense of occupancy.
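And a toy model of the explicit-wait side of that; the counter names follow the ISA documentation's vmcnt/lgkmcnt, but the code is purely conceptual, not real hardware behavior:

```c
#include <stdio.h>

/* Each wavefront tracks counts of its own outstanding long-latency
   ops. Issue is fire-and-forget; only an explicit s_waitcnt blocks. */
typedef struct {
    int vmcnt;    /* outstanding vector memory ops         */
    int lgkmcnt;  /* outstanding LDS/GDS/scalar memory ops */
} WaitCounters;

static void issue_vector_load(WaitCounters *c)    { c->vmcnt++; }
static void complete_vector_load(WaitCounters *c) { c->vmcnt--; }

/* s_waitcnt vmcnt(n): stall this wave until at most n vector memory
   ops remain in flight. Results return in order, so waiting for "all
   but n" is enough to safely consume the oldest result while the
   newer n loads keep overlapping with ALU work. */
static int can_proceed_vmcnt(const WaitCounters *c, int n) {
    return c->vmcnt <= n;
}

int main(void) {
    WaitCounters c = { 0, 0 };
    issue_vector_load(&c);
    issue_vector_load(&c);                     /* two loads in flight */
    printf("%d\n", can_proceed_vmcnt(&c, 1));  /* 0: must wait        */
    complete_vector_load(&c);                  /* oldest result lands */
    printf("%d\n", can_proceed_vmcnt(&c, 1));  /* 1: may consume it   */
    return 0;
}
```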
There's still a vector wait counter without a matching wait instruction, at least as of Sea Islands.