It doesn't seem unreasonable for a primitive shader to pass geometry bins, as opposed to individual primitives, to the rasterizer, with the rasterizer then packing waves as it sees fit. I'd just need some clarity on where these stages begin and end.
AMD's patents don't mention that form of involvement, and seem to task hardware after the setup pipeline with the various bin checks, placing them after the early culling posited for primitive shaders. However, AMD's claims are likely general enough to allow an interpretation that spans this.
As for on-die storage, I'm assuming a majority of the missing 23MB of SRAM, acting as a victim cache, could do it, using otherwise undisclosed hardware.
We don't have a ready data point for how many MB of SRAM fall outside the externally noted storage pools on prior GPUs. While storage related to the expanded geometry and rasterization paths could be a notable part of it, there's a lot of miscellaneous storage on every GPU that AMD didn't make a distinction for, and potentially a swath of extra buffers throughout for latency compensation.
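For what it's worth, a rough tally of the externally visible pools (my own numbers, using standard GCN sizes for a 64-CU Vega 10): vector register files at 64 KB per SIMD x 4 SIMDs x 64 CUs ≈ 16 MB, LDS at 64 KB x 64 CUs = 4 MB, L2 at 4 MB, and vector L1 at 16 KB x 64 CUs = 1 MB, which lands around 25 MB before counting scalar register files, instruction and scalar caches, the parameter cache, and so on. If the roughly 45 MB of total SRAM AMD has cited for Vega 10 (going from memory) is what the 23 MB figure was measured against, that leaves on the order of 20 MB unaccounted for, so the ballpark checks out even if my tally misses a few pools.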
The comments section mentions 1SE and 2SE possibilities.
The 1SE case seems trivial. The 2SE case may be derived from whether any of the [5:6] bits of the bounding box coordinates flip, once the required prior extent checks have passed.
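To make that guess concrete, here's a rough sketch in C of the kind of check I'm imagining. The 32-pixel tile granularity, the exact bit positions, and the fallback when the box is too large are my assumptions for illustration, not anything taken from AMD's documentation:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed for illustration: bits [6:5] of a pixel coordinate select the
 * shader-engine tile, i.e. 32-pixel tiles in a pattern that repeats every
 * 128 pixels. */
static uint32_t se_bits(uint32_t coord)
{
    return (coord >> 5) & 0x3;
}

/* Guess at how many SEs a screen-space bounding box touches.  The prior
 * extent check is what guarantees a passing box spans at most two tiles
 * per axis, so comparing the min/max bits is enough. */
static int se_coverage(uint32_t x0, uint32_t y0, uint32_t x1, uint32_t y1,
                       uint32_t max_extent)
{
    if ((x1 - x0) > max_extent || (y1 - y0) > max_extent)
        return 4;                        /* too large: broadcast to all SEs */

    bool x_flips = se_bits(x0) != se_bits(x1);
    bool y_flips = se_bits(y0) != se_bits(y1);

    if (!x_flips && !y_flips) return 1;  /* trivial 1SE case */
    if (x_flips != y_flips)   return 2;  /* exactly one axis crosses a boundary */
    return 4;                            /* both axes cross */
}
```

The actual instruction presumably folds the equivalent of those two flip tests, plus the extent checks, into its single 8-bit LUT lookup.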
Which makes me wonder: could all future AMD GPUs always have exactly four shader engines?
The instruction documentation notes that less complex cases use normal instructions.
Nothing would prevent AMD from introducing an instruction for a different SE count at some later date. Either way, it doesn't strike me as a particularly clean ISA decision, although it's a minor issue given that it's only one instruction so far.
If AMD decided to change any of the semantics in the future, it would need some additional instructions or mode settings. Tile size, assignment patterns, and max extents seem like ready candidates for HWID context values, which would potentially save on setup code or reduce the chance of an architectural mismatch if the instruction were capable of a somewhat more complex evaluation than an 8-bit LUT.
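As a sketch of what I mean by context values, the same sort of check could pull its parameters from some hypothetical per-chip state rather than baking Vega-10-specific constants into the shader. The struct and its fields are invented purely to illustrate the idea; nothing like this exists in any published ISA:

```c
#include <stdint.h>

/* Hypothetical per-chip screen-partitioning parameters that setup code
 * could read from a HWID-style context register instead of hard-coding. */
struct screen_partition_info {
    uint32_t tile_shift;   /* log2 of the SE tile size in pixels */
    uint32_t se_mask;      /* mask for the SE-selection bits (0x3 for four engines) */
    uint32_t max_extent;   /* largest bounding box the fast path handles */
};

static uint32_t se_bits_param(uint32_t coord,
                              const struct screen_partition_info *info)
{
    return (coord >> info->tile_shift) & info->se_mask;
}
```

Setup code would query something like this once per dispatch instead of hard-coding the tile pattern, which is roughly the mismatch avoidance I'm picturing, assuming the hardware exposed such values at all.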
One possible alternative, which AMD has done before, would be to add new instructions and rename the current one as legacy.
It might be a mark against Navi's "scalability" if that limit is maintained. AMD's HPC chiplet configurations sound like they could readily hit the limits of this instruction with a single chiplet, which would leave just as much software work undone if Navi or the next generation actually did ship any MCM graphics products.
If AMD decides to move beyond the constraints of this form of fixed-function pipeline, that calls the value of the instruction into question anyway.
On the other hand, adding more instructions targeted at primitive shaders might mean future implementations will have a more impressive showing of the technique.
Otherwise, if AMD added the instruction because it decided it wouldn't matter much to tack a kludge onto an architecture it was going to revamp or cull, that might be reasonable as well.
But weren't we once "promised" that AMD would break past the "limit" of four shader engines? Dunno.
Anandtech's Vega review indicates they revisited the topic. AMD stated it could have gone past four engines if it had chosen to, but it chose not to.
"Talking to AMD’s engineers about the matter, they haven’t taken any steps with Vega to change this. They have made it clear that 4 compute engines is not a fundamental limitation – they know how to build a design with more engines – however to do so would require additional work. In other words, the usual engineering trade-offs apply, with AMD’s engineers focusing on addressing things like HBCC and rasterization as opposed to doing the replumbing necessary for additional compute engines in Vega 10."
http://www.anandtech.com/show/11717/the-amd-radeon-rx-vega-64-and-56-review/2
One thing that gives me yet more pause for thought is that 16 CUs (on 14nm) appear to be about the same area as an HBM2 module, and HBM2 modules seem to be about double the size of first-generation HBM modules. The power usage of 16 CUs (let's say 50W for the sake of argument, though perhaps as much as 70W?) is way too high to put under memory.
It would probably need more margin unless that calculation includes the effectively dead area belonging to the stack's power and I/O. Currently, HBM's DRAM layers lose something like 1/5 or more of their area to the stack's TSVs. The base die loses that same area, and then more to the PHY and the ballout to the interposer below.
Perhaps if a cluster of Vega-like CUs were customized to the practical clock ceiling for silicon under DRAM, they could drop the majority of the transistor bloat Vega had over Fiji.
AMD's estimated power ceiling is about 5-7x lower than those estimates, and given what they say about DRAM's efficiency at high temperatures, some of the measurements of Vega's HBM2 voltages and temperatures are potential signs of trouble.
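Putting rough numbers on that, using my own figures from above: 50-70 W divided by 5-7x works out to somewhere around 7-14 W, call it on the order of 10 W, for whatever logic could sit under a stack. That is a long way below a 16-CU cluster even after aggressive downclocking and voltage reduction.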
Perhaps my starting assumption is wrong: that the base logic die of HBM2 is mostly empty. Perhaps there isn't much room for non-DRAM-controller logic?
For HBM1 at least, Hynix indicates the base logic layer has a role in distributing connections from the interposer to the rest of the stack, decoupling capacitors, as well as built-in testing and fault recovery. I'm unclear if there are other functions such as some of the thermal monitoring or DRAM stack functions that might be handled in the logic layer.
http://www.semicontaiwan.org/zh/sit..._taiwan_2015_ppt_template_sk_hynix_hbm_r5.pdf
For any practical solution, that needs to go somewhere in the system, and likely somewhere in the stack. Samsung had that proposed cheap version of HBM that would remove the base die, but that leaves the question of which other layers would take the area hit.
I'm not sure if HBM2 populates the base die more, or needed more capacitors. For thermal reasons, the stack may be broader so that more thermal bumps could fit.