Anarchist4000
Veteran
Maybe, but an independent command stream was one of the requirements for the flexible scalar. That would entail increasing the instruction buffer by one "per wave", instead of using the first lane for wave-level operations. It's possible the ALUs are 4x(16+1)+1. That could give each ALU/executing wave a scalar to prefetch and run wave-level ops, in addition to a dedicated scalar for the CU. It might also explain part of the transistor disparity and why we aren't seeing more CUs. It seems odd they made all those slides and didn't mention how many processors were in there. They gave quantities for ACEs, HWS, TMUs, ROPs, etc.: everything but the actual processor count, which is the big number everyone likes to show.

The improved single-thread performance is a sub-item of the larger instruction buffer item. My interpretation is that the larger bullet point gives a feature, and then the indented items give the effect or a detail about it.