I would actually prefer variable wavefront sizes realized by a variable amount of looping over a narrower SIMD (like vec4). Okay, the granularity stays a bit coarse (if one keeps latency = throughput = 4 cycles, one gets wavefront sizes of at least 16), but one could keep a lot of the other stuff intact. For the smaller wavefronts one needs relatively more scalar ALUs in the CU (optimally still one per 4 vALUs), but that should be a relatively small investment.
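A rough sketch of that arithmetic, assuming a vec4 vALU issued on the usual 4-cycle cadence (numbers are just illustrative):

```python
# Minimum wavefront size if latency = throughput = 4 cycles on a vec4 SIMD,
# plus the loop length for larger wavefronts on the same narrow unit.
simd_width = 4          # vec4 vALU
issue_cadence = 4       # one instruction per wavefront every 4 cycles, as today
print("minimum wavefront:", simd_width * issue_cadence)   # 4 * 4 = 16 lanes

for wavefront in (16, 32, 64):
    print(f"{wavefront}-wide wavefront -> {wavefront // simd_width} cycles on the vec4 unit")
```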
Based on the recent Polaris die shot, if the scalar portion is the block next to the shared instruction fetch/scalar cache of a CU group and below what appears to be the LDS, then quadrupling that band comes to roughly the area of one SIMD partition (2 SIMDs).
However, some of that area might be due to the scalar portion being integrated with part of the overall scheduling pipeline, which ties into the scheduling point below.
One needs to increase the scheduling capacity per SP though, as each small vALU needs its own instructions. But it could work out in terms of power consumption, as larger wavefronts should still dominate and one could gate the scheduling logic 75% of the time for old-fashioned 64-element wavefronts. For smaller 16- or 32-element wavefronts, the increased throughput (potentially a factor of 4) justifies the scheduler's increased power consumption.
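A quick back-of-the-envelope for the 75% figure, assuming each vec4 vALU can accept a new instruction every 4 cycles (again, illustrative only):

```python
# Scheduler occupancy per vec4 vALU, assuming it can accept one instruction every 4 cycles.
simd_width = 4
issue_interval = 4                      # cycles between possible issue slots
for wf in (16, 32, 64):
    busy_cycles = wf // simd_width      # cycles one instruction keeps the vALU busy
    issue_slots = busy_cycles // issue_interval
    gated = 1 - 1 / issue_slots         # fraction of issue slots with nothing to issue
    print(f"{wf}-element wavefront: issue logic idle {gated:.0%} of its slots")
```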
Breaking that 16-wide SIMD into 4 independent quad-width units creates 4x the peak scheduling demand and area, but that applies to only 1/4 of a current CU's ALU capacity.
What happens with the vector register file might be interesting, since the file's depth per lane would increase in order to house a 64-wide wavefront's context in a narrower SIMD.
If GCN optimized its register read process for the current configuration, there's possibly a bit more math: finding the physical register now has to take the wavefront width into account, since a register ID may map to 1, 2, or 4 rows, before even considering 64-bit operands.
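A toy version of that indexing math, assuming the physical register file stays 16 lanes wide so that 16/32/64-wide wavefronts map a register ID to 1/2/4 rows; this is not how GCN actually addresses its file:

```python
# Hypothetical indexing, not GCN's real scheme: with a 16-lane-wide register file,
# one architectural register ID spans 1, 2 or 4 physical rows depending on the
# wavefront width (and twice that again for 64-bit operands).
RF_WIDTH = 16

def physical_row(reg_id, wavefront_width, lane, dwords=1):
    rows_per_reg = (wavefront_width // RF_WIDTH) * dwords  # 1, 2 or 4 (x2 for 64-bit)
    row_in_reg = (lane // RF_WIDTH) * dwords
    return reg_id * rows_per_reg + row_in_reg

for wf in (16, 32, 64):
    print(f"{wf}-wide wavefront: v5 starts at row {physical_row(5, wf, lane=0)}")
# -> rows 5, 10, 20: the same register ID lands on different rows per wavefront width.
```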
When I first saw that diagram, I did not think of asymmetrical SIMDs but of lane gating, which can help quite a bit if you expect to run into power limits or have a rather aggressive clock boost in place.
I'm curious whether lane clock gating isn't already being done for inactive lanes even without physically different SIMDs.
Unless a wavefront monopolizes a SIMD for multiple vector cycles, shutting down lanes based on one wavefront's mask is not going to realize much of a savings, because the next issue cycle may come from another wavefront with a contradictory mask, unless there's fine-grained detection of lane use. Once you have that level of detection, it should work fine with or without the diagram's method.
The diagram's claim of space savings doesn't make sense if it's just gating. Inactive lanes don't go away in that case, barring some other change in the relationship between lanes and storage.
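To make that concrete, a made-up example with two wavefronts whose masks contradict each other on a shared SIMD:

```python
# Made-up example: two wavefronts with complementary execution masks share a 4-lane SIMD.
# Coarse gating can only power down a lane that stays idle across the whole window;
# per-issue detection acts on each instruction's mask.
issues = [0b0011, 0b1100, 0b0011, 0b1100]    # alternating wavefronts' exec masks
lanes = 4

coarse = sum(1 for lane in range(lanes)
             if all(not (m >> lane) & 1 for m in issues))   # lanes idle the whole window
fine = sum(1 for m in issues for lane in range(lanes)
           if not (m >> lane) & 1)                          # idle lane-cycles, per issue
print(f"coarse gating: {coarse} lanes; fine-grained: {fine} lane-cycles")
# -> coarse gating: 0 lanes; fine-grained: 8 lane-cycles
```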
About time! I have been fixing and improving console GCN2 cache management code for the past two weeks. I am happy to hear that the L2 now handles ROPs as well (even if there are still some tiny L1 ROP caches). Much less L2 flushing needed. Should be good for async compute as well.
Before binning made it possible to reorder execution/export, the pipelined import/export of ROP tiles may have posed a risk of thrashing the L2 more than it saved ROP traffic back to memory. Shading and exporting on the basis of a bin's consolidated lifetime, rather than a mix of fragments with conflicting exports, might have helped. It could also help reduce or avoid thrashing of the compression pipeline, and possibly give a point of consistency for an in-frame read-after-write to produce valid data.
That might depend on where the compression pipeline and its own cached DCC data reside.
And we still don't know if it supports ROVs and conservative rasterization.
Part of the ROV process might fall out of binning. ROV mode could make a bin terminate upon detecting an overlap, then start a new bin with the most recent fragment. If the binning process has multiple bins buffered for each tile, the front end might be able to switch over to another tile if the next bin also hits a conflict.
Could be, but this slide indicates both twice the clock (which I doubt should be taken literally) and twice the ops (/units) per CU.
The Vega MI25 is supposedly 25 TF of FP16. Unless there are half as many CUs, taking both doublings literally would put it higher than 25. 4096 SPs x 2 (FMA) x 2 (FP16) x ~1.5 GHz (clock) gives ~25 TF.
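For what it's worth, the arithmetic as a quick sanity check (the ~1.5 GHz clock is only a guess):

```python
# Back-of-the-envelope FP16 rate from the post's numbers; the ~1.5 GHz clock is an assumption.
sps = 4096            # shader processors (e.g. 64 CUs x 64 lanes)
flops_per_fma = 2     # an FMA counts as two FLOPs
fp16_packing = 2      # packed FP16 at twice the FP32 rate
clock_ghz = 1.5       # assumed, not confirmed
tflops = sps * flops_per_fma * fp16_packing * clock_ghz / 1000
print(f"~{tflops:.1f} TFLOPS FP16")   # -> ~24.6, i.e. roughly 25 TF
```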