Are you basically suggesting in-order execution with n-way SMT for what is currently the FlexFP unit?
The two could coexist, or one could be modified to take on the tasks of the other. The designs are far enough apart that, for now, I think they are better kept separate.
AMD's coprocessor model allows for this. There's a latency cost, which AMD has accepted for over a decade, but it also frees the FPU to implement things however it wants; the integer side only needs to know the completion status.
If the CU keeps its own branching ability, it could free the integer pipe even more than the FlexFP does, since the ROB still has to track FP instruction completion status. The handoff would be its own instruction that could retire immediately, leaving the CU to worry about further fetches on its own.
The CU itself is already designed to serve multiple masters, whether from multiple apps or potentially from the compute control and graphics pipelines as well. A CPU would be just another client.
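To make that client relationship concrete, here's a minimal sketch of a shared-memory handoff, assuming a doorbell-plus-completion-flag protocol. The packet layout and names are hypothetical placeholders, not AMD's actual dispatch format:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical dispatch packet: the integer core fills this in and
 * then stops caring about anything except the completion flag. */
struct dispatch_packet {
    uint64_t kernel_addr;     /* CU fetches its own instructions from here */
    uint64_t arg_addr;        /* kernel arguments in shared memory */
    _Atomic uint32_t done;    /* completion status, written by the CU */
};

/* CPU side: hand off the work and retire the "handoff" immediately.
 * The CU worries about further fetches and its own branching. */
static void cpu_handoff(struct dispatch_packet *pkt,
                        volatile uint32_t *doorbell,
                        uint64_t kernel, uint64_t args)
{
    pkt->kernel_addr = kernel;
    pkt->arg_addr = args;
    atomic_store_explicit(&pkt->done, 0, memory_order_release);
    *doorbell = 1;   /* from here on, the packet is the CU's problem */
}

/* The only CU state the integer pipe ever has to track. */
static int cpu_is_done(struct dispatch_packet *pkt)
{
    return atomic_load_explicit(&pkt->done, memory_order_acquire);
}
```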
This makes me wonder, how is wavefront or thread scheduling fundamentally different from out-of-order scheduling? Isn't it simply scoreboarding versus Tomasulo?
There is no dependence checking within the buffer of wavefronts, just a readiness check at scalar issue (basically: did the last instruction finish?). The CU is described as being threaded in such a way that, on back-to-back issue, it knows the result of the previous instruction is ready.
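To make the contrast with Tomasulo concrete, here's a toy sketch of that readiness check, assuming one in-flight scalar instruction per wavefront (the structure and field names are mine, not AMD's):

```c
#include <stdbool.h>

#define NUM_WAVEFRONTS 40

/* Toy wavefront scoreboard: no dependence matrix, no tag broadcast,
 * no wakeup/select CAM as in a Tomasulo-style scheduler. */
struct wavefront {
    bool last_insn_done;   /* single readiness bit per wavefront */
    bool has_work;
};

/* Pick any wavefront whose previous instruction has completed.
 * Within one wavefront, issue stays strictly in order. */
static int pick_wavefront(struct wavefront wf[NUM_WAVEFRONTS])
{
    for (int i = 0; i < NUM_WAVEFRONTS; i++)
        if (wf[i].has_work && wf[i].last_insn_done)
            return i;      /* ready: issue its next instruction */
    return -1;             /* everyone is waiting; stall this cycle */
}
```

The latency hiding comes from having many wavefronts to choose from, not from reordering within one instruction stream, which is why the hardware cost is roughly a readiness bit per wavefront rather than per-instruction tag matching.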
What was the second way?
Well, I suppose there are four: CPU-only, CPU plus discrete, a hybrid chip, and hybrid plus discrete.
But it still means three paths can be taken: pure software using AVX2+, integrated CUs, or discrete CUs. The performance of each can be very different. I'm particularly "concerned" that a homogeneous Intel CPU taking the software path would outperform AMD's integrated CUs.
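For illustration, the fork a runtime would face might look like this; the probe functions and the size threshold are hypothetical placeholders, not a real API:

```c
#include <stddef.h>

enum compute_path { PATH_AVX_SOFTWARE, PATH_INTEGRATED_CU, PATH_DISCRETE_CU };

/* Hypothetical capability probes; a real runtime would query CPUID,
 * the driver, and the device topology. */
extern int has_discrete_cu(void);
extern int has_integrated_cu(void);

/* The same kernel can land on any of the three paths, and the
 * performance of each can be very different. */
static enum compute_path choose_path(size_t work_items)
{
    if (has_discrete_cu() && work_items > (size_t)1 << 20)
        return PATH_DISCRETE_CU;    /* big enough to pay the transfer toll */
    if (has_integrated_cu())
        return PATH_INTEGRATED_CU;
    return PATH_AVX_SOFTWARE;       /* pure software using AVX2+ */
}
```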
That potential exists, and the evaluation would depend on implementation details and workloads that won't be available for years, so I do not have a verdict to render.
Note that AMD has to invest a lot more transistors to make this approach work, while a homogeneous architecture would have the entire die at its disposal.
I would still argue that nobody has an entire die at their disposal at any given time, unless that die is a fraction of the size it would otherwise be, perhaps 1/3 or 1/4.
Even if Intel's "2 node" advantage applied to power ranges above ultramobile or embedded (it doesn't), that at most promises the design won't be more heavily gated and throttled than Sandy Bridge. The best case, 50% power scaling per node, is merely what was taken for granted prior to 90nm.
It's a case of having decent to good scaling versus an industry average of "meh" to mediocre.
It's very helpful, but it's just one factor of many.
AVX-1024 can lower the power consumption, and Intel has a process advantage. I'm doubtful AMD has the right formula here.
A single AVX-1024 operation must consume at least as much power as a single AVX-256 or AVX-128 operation. In an absolute sense it just constrains the growth.
If a workload is amenable to AVX-1024, then power consumption could be lower.
I see savings in I-cache and decode power. That leaves the execution units, register file, and writeback unchanged to slightly worse.
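As a back-of-the-envelope model of where those savings land, assume a 1024-bit op is sequenced over four 256-bit execution passes. The energy figures below are made-up placeholders purely to show which terms move and which don't:

```c
#include <stdio.h>

/* Made-up per-event energy figures (arbitrary units), just to show
 * which terms AVX-1024 amortizes and which it leaves untouched. */
#define E_FETCH_DECODE  4.0   /* I-cache + decode, paid once per instruction */
#define E_EXEC_256      10.0  /* one 256-bit pass: RF read, ALU, writeback */

int main(void)
{
    /* Four AVX-256 instructions: four front-end events, four passes. */
    double avx256 = 4 * E_FETCH_DECODE + 4 * E_EXEC_256;

    /* One AVX-1024 instruction cracked into four passes: one front-end
     * event, the same four execution passes. Execution, register file
     * and writeback energy are unchanged; only fetch/decode shrinks. */
    double avx1024 = 1 * E_FETCH_DECODE + 4 * E_EXEC_256;

    printf("AVX-256 x4: %.0f  AVX-1024 x1: %.0f (%.0f%% saved)\n",
           avx256, avx1024, 100.0 * (avx256 - avx1024) / avx256);
    return 0;
}
```

With these placeholder numbers the saving is around 20%, all of it in the front end, which matches the point above: the back half of the pipe is unchanged at best.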
You've claimed that the scheduler can be powered down, but I'm not so certain it's that easy. I don't know Sandy Bridge's internal scheduler implementation, though it is described as being unified across the whole core. Certainly, it can't be completely powered down for a single AVX instruction.
Another concern is that I'm not certain which unit broadcasts the register ID, tracks exceptions and interrupts, and updates completion status in the ROB. If any part of this lives in the scheduler, it needs to either stay powered or be duplicated further down the pipeline, where it would remain active.
Since results are written back to the register file per 256-bit chunk, there are windows in which the intermediate register state of a single instruction can become visible.
There are ways around this, and implementation details would be interesting.
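Here's a sketch of the hazard, modeling the destination as four 256-bit chunks committed one per pass. This is a toy model, not Sandy Bridge's actual writeback path:

```c
#include <stdint.h>
#include <string.h>

/* Toy 1024-bit architectural register, committed 256 bits at a time. */
struct reg1024 { uint64_t chunk[4][4]; };

/* Sequenced writeback: if an interrupt or fault arrives between passes,
 * the architectural register holds a half-old, half-new value unless
 * the implementation buffers all four chunks and commits atomically,
 * or the ISA defines the partial state as architecturally visible. */
static int writeback(struct reg1024 *dst, const struct reg1024 *src,
                     int (*pending_fault)(void))
{
    for (int pass = 0; pass < 4; pass++) {
        if (pending_fault())
            return pass;           /* chunks [0, pass) already visible */
        memcpy(dst->chunk[pass], src->chunk[pass], sizeof dst->chunk[pass]);
    }
    return 4;                      /* fully committed */
}
```

Buffering the chunks and committing in one shot, or making the instruction restartable at a chunk boundary, would be among those ways around it.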
BD kind of skips out on the debate by cracking its ops, so the existing logic applies at all times.
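For comparison, a rough picture of that cracking approach, where decode turns a 256-bit op into two ordinary 128-bit macro-ops (the structure is illustrative, not AMD's actual tracking format):

```c
#include <stdbool.h>

/* Bulldozer-style cracking: a 256-bit AVX op becomes two 128-bit
 * macro-ops at decode. Each half is scheduled, executed, and written
 * back by the existing 128-bit machinery, and the instruction retires
 * only when both halves are done, so no new partial-state rules are
 * needed anywhere in the pipeline. */
struct avx256_entry {
    bool half_done[2];   /* low and high 128-bit halves */
};

static bool can_retire(const struct avx256_entry *e)
{
    return e->half_done[0] && e->half_done[1];
}
```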