Nope, you can still get all the goodies of a single uop. With 3 arithmetic units, the highest instruction rate of code containing only AVX-1024 instructions would be 0.75 per cycle. But the front-end can deliver 4 instructions per cycle. Hence the front-end only has to be active for less than 1/5 of the time!
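To make the arithmetic concrete, here's a rough back-of-the-envelope sketch, assuming each AVX-1024 instruction holds its arithmetic port for 4 cycles (four 256-bit passes) and a 4-wide front end:

```python
# Back-of-the-envelope duty cycle for the all-AVX-1024 case.
AVX_PORTS = 3        # arithmetic units
PORT_OCCUPANCY = 4   # cycles each AVX-1024 op holds its port (four 256-bit passes, assumed)
FRONTEND_WIDTH = 4   # instructions the front end can deliver per cycle

backend_rate = AVX_PORTS / PORT_OCCUPANCY    # 0.75 instructions per cycle
duty_cycle = backend_rate / FRONTEND_WIDTH   # 0.1875, i.e. active less than 1/5 of the time
print(backend_rate, duty_cycle)
```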
This focuses on a workload with 100% AVX-1024 instructions. There are vectorizable workloads that spend over 90% of their time in vector execution, and this small subset may reach a saturation point on occasion, so long as the core doesn't pick up a second integer thread.
For workloads with more book-keeping code, branching, and integer ops, the threshold is reached less often.
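To put a rough number on that threshold, here's a toy model, assuming the AVX ports remain the only back-end bottleneck and ignoring scalar-port and memory limits; f is the fraction of the dynamic instruction stream that is AVX-1024:

```python
# Toy model: front-end duty cycle vs. the fraction f of AVX-1024 instructions,
# assuming the AVX ports are the only back-end bottleneck (scalar ports ignored).
def frontend_duty(f, avx_ports=3, occupancy=4, frontend_width=4):
    if f <= 0:
        return 1.0                       # no AVX-1024 work: nothing throttles the front end here
    total_ipc = min(frontend_width, (avx_ports / occupancy) / f)
    return total_ipc / frontend_width

for f in (1.0, 0.5, 0.25, 0.1875, 0.1):
    print(f, round(frontend_duty(f), 3))  # 0.188, 0.375, 0.75, 1.0, 1.0
```

In this simplified model the front end loses any opportunity to idle once AVX-1024 instructions fall below roughly 19% of the stream.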
I would tentatively suggest that there may also be workloads whose data granularity is such that the wider vectors cut the total number of vector instructions without a matching drop in non-vector instructions, leading to a worsened ratio.
I would consider a vector-heavy thread reducing the utilization of other units a sign of an unbalanced design.
With mixed code, you get mixed results. Nothing wrong with that.
Pure vector code is a minority, and with multithreading, not guaranteed even if an application is written that way.
Intel's hyperthreading solutions have led to more consistent performance and fewer cases of negative scaling with each generation. A design that expects instruction execution to be a source of front-end stalls would be a regression.
Every AVX-1024 instruction would save power in the front-end.
Some power would be saved. Allowing the front end to be throttled or gated runs counter to the goals of an OoO core, which in the general case is limited by instruction throughput, not data or execution latency. If non-1024-bit ops are delayed in issue due to congestion caused by AVX, the latency cannot be hidden, because the front end and OoO engine are precisely what would be needed to hide it.
You seem to fail to understand that instructions still have to retire in order. When one execution port is clogged up, it doesn't matter how many other free ports there are; you'll quickly run out of instructions within the scheduling window that either don't need the clogged port or don't depend on an instruction that does.
This is untrue in the multithreaded case, which involves explicitly independent code. A stalled front-end is a global penalty.
In the single-threaded case, the front end and OoO engine in modern designs actually go to a lot of effort, using renaming tricks and sideband stack units, to reduce the number of ops sent to the back end. A throttled front end encounters those opportunities at a lower rate, and any ops that cannot issue to non-AVX units can lead to a detectable drop in throughput outside of the AVX unit.
You have to seriously change/widen your perspective on this. All power consumed by handling instructions is wasted power. So the aim is to minimize that while maximizing actual useful work. If any part of the out-of-order execution engine can be made idle while the ALUs stay active, that's definitely a very good thing.
My first contention is that the justification for that OoO core is that it prioritizes execution speed in the general case. An OoO core that regresses in that performance in favor of 1024-bit throughput is a weaker proposition over a broader range of workloads.
The OoO engine can be turned off for power, imperfectly. The broad front end can be turned off for power, imperfectly.
If those general purpose elements all too often are not needed, it should be noted that not having them at all in an execution core is a very effective form of power gating.
Please explain your sentiment.
4-cycle ops complicate instruction scheduling, since they add an extra 3 cycles of latency per operation. The fact that the port itself cannot issue for an additional 3 cycles can complicate the heuristics in the scheduler, and the buildup in the ROB and other buffers can leave non-AVX ports starved for want of scheduler space.
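As a crude illustration of how quickly that buildup happens in the degenerate all-AVX-1024 case, assuming 4-wide delivery, the 0.75/cycle issue figure from above, and roughly Sandy Bridge-sized buffers (54-entry scheduler, 168-entry ROB):

```python
# Crude fill-rate estimate: 4 instructions delivered per cycle, only 0.75 issued
# per cycle, against roughly Sandy Bridge-sized buffers (assumed sizes).
def cycles_to_fill(entries, delivery_rate, issue_rate):
    backlog_per_cycle = delivery_rate - issue_rate
    return float('inf') if backlog_per_cycle <= 0 else entries / backlog_per_cycle

print(cycles_to_fill(54, 4.0, 0.75))   # ~17 cycles until a 54-entry scheduler is full
print(cycles_to_fill(168, 4.0, 0.75))  # ~52 cycles until a 168-entry ROB is full
```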
Sandy Bridge already arranged its units in such a way that each port had consistent execution latencies internally to help reduce contention.
With 1-, 3-, and 5-cycle latency ops possible now, overlaying 6-, 8-, or more-cycle operations is a complex problem. I'm curious whether they'd be given their own domain to keep them from interfering with the other ports. It might be that instead of overloading existing ports and reducing access to all their functionality, the bulkier ops would be issued to new ones.
Just like on a GPU, that completely depends on the ratio of memory accesses to arithmetic instructions. An L2 cache miss takes only 11 cycles on Sandy Bridge, while two threads can hide 8 cycles with a single AVX-1024 instruction.
An L2 cache hit is 11 cycles. A miss to the L3 is roughly in the upper 20s.
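For reference, a minimal comparison of the 2 × 4 = 8 cycles two threads can cover with one in-flight AVX-1024 op each against those latencies (the ~4-cycle L1 figure is an assumption):

```python
# Cycles of latency two threads can cover with one in-flight AVX-1024 op each,
# versus the latencies quoted above (the ~4-cycle L1 figure is assumed).
threads, occupancy = 2, 4
hidden = threads * occupancy                              # 8 cycles
for level, latency in (("L1 hit", 4), ("L2 hit", 11), ("L3 hit", 28)):
    print(level, latency, "covered" if hidden >= latency else "not covered")
```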
The latency of actually serving an AVX-1024 memory access is uncertain without defining how many cache ports of what width we are talking about, the cache line length, the width of the L1-L2 bus, and the additional latency of consecutive line fills.
The cache would need to be rearchitected significantly. Sandy Bridge's cache, for example, could not cleanly service two threads, each with an AVX-1024 memory access hitting the L2.
Bandwidth is insufficient even for an L1 hit.
Bank conflicts would take their toll with at most one cache line per cycle.
If the loads were broken up into separate 128-bit accesses, the L1 could not track more than 10 misses to higher levels of memory, meaning one of the instructions would be stuck pending the completion of two or more L1 cache line fills.
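Counting it out under that split-access scenario, with both threads missing the L1 on every 128-bit piece:

```python
# Two AVX-1024 loads broken into 128-bit pieces, all missing the L1,
# against a 10-entry limit on outstanding misses.
bits_per_load, piece_bits, threads, miss_limit = 1024, 128, 2, 10
pieces = (bits_per_load // piece_bits) * threads   # 8 per thread, 16 in total
waiting = max(0, pieces - miss_limit)              # 6 pieces must wait for earlier fills
print(pieces, waiting)
```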
edit:
The cache controller may be able to pick up on the sequential accesses and coalesce them into a smaller number of L1 cache line misses to the L2. If so, then somewhere between 3 and 5 of those misses may be handled somewhat cleanly by the L2.
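For the raw line counts behind that coalescing, assuming 64-byte cache lines:

```python
# Raw line counts if the pieces are coalesced to cache-line granularity instead.
line_bytes, load_bytes, threads = 64, 1024 // 8, 2
lines_aligned = load_bytes // line_bytes       # 2 lines per naturally aligned 1024-bit load
lines_misaligned = lines_aligned + 1           # 3 if the load straddles an extra line
print(threads * lines_aligned, threads * lines_misaligned)   # 4 to 6 line fills in total
```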
The atomicity of the ops may need additional provisions. AVX-1024 would always cross at least one cache line boundary, which may insert additional latency to prevent an update to one of the lines while the op is in flight.