It doesn't stall execution, which is all we really care about.
An OoO CPU tends to have instruction issue as a limiting factor, rather than data latency. Halting the front end stalls further instruction fetch, branch prediction, and the supply of new instructions for reordering. This can have an impact on other instruction types or other threads, since neither the OoO engine nor its multithreading can easily reorder around a stall in the front end.
If the supply of non-AVX ops in flight dwindles, the bubble can be exposed.
While executing AVX-1024 code, the reservation station will quickly become full. This means the logic to receive new uops from the front-end can be clock gated for a while (and as mentioned before, once the ROB becomes full the entire front-end can be clock gated).
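To make the timing concrete, here is a toy cycle model (my own sketch, not a description of Sandy Bridge internals). It assumes a 54-entry unified scheduler fed 4 uops per cycle, with a single modeled port where a hypothetical AVX-1024 uop occupies the port for 4 back-to-back passes, and simply counts how many cycles uop delivery could be gated once the scheduler fills.

    #include <stdio.h>

    int main(void) {
        int rs_entries = 0;
        const int rs_capacity = 54;   /* assumed SNB-like unified scheduler size */
        int port_busy = 0;            /* cycles remaining on the single modeled AVX port */
        int gated_cycles = 0;

        for (int cycle = 0; cycle < 64; cycle++) {
            /* Issue: pick one AVX-1024 uop; it occupies the port for 4 passes. */
            if (port_busy == 0 && rs_entries > 0) {
                rs_entries--;
                port_busy = 4;
            }
            if (port_busy > 0)
                port_busy--;

            /* Rename/dispatch: deliver 4 uops per cycle unless the scheduler is full. */
            if (rs_entries + 4 <= rs_capacity)
                rs_entries += 4;
            else
                gated_cycles++;   /* uop delivery (and the stages feeding it) could be gated */
        }
        printf("uop delivery gated for %d of 64 cycles\n", gated_cycles);
        return 0;
    }

With these assumed numbers the scheduler fills within roughly 15 cycles and stays full, so the front end sits gated for most of the window; the exact figures obviously depend on the real structure sizes.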
I'm not sure if reservation station is the correct term for what Sandy Bridge uses. I think the physical register file means there isn't a collector of operands at the front of each issue port. Going with the reservation station idea, a full reservation station can cause the in-order front end to stall before the ROB or other ports are saturated.
Exception handling would essentially be no different than with cracked vector operations. Note that with the Pentium 4's cracked vector operations you got two identical uops, except for the physical register numbers. These can be fused together, and you only need a tiny bit of logic in a few places to sequence the register numbers.
I do not see how this would be compatible with the desire to gate the scheduler. The P4 cracked vector operations into separate uops that were actively issued by the scheduler as if they came from separate instructions. I was under the impression that an AVX-1024 op would not be cracked into multiple uops. Deferring the writeback of the exception status until the last cycle would allow the data path that updates the ROB status entries to be gated for several cycles.
Catching the exception status after every cycle would leave that data path active.
The uop would probably have an exception status section 4x as wide, which could mean the path itself is 4x as wide. The plus side is that this same path could then be 3/4 gated off during normal-mode AVX.
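A minimal sketch of the deferred write-back idea, assuming the four 256-bit passes of one AVX-1024 uop each produce sticky MXCSR-style flags (the flag names and the execute_quarter() helper are hypothetical): the flags are ORed into an accumulator and only the last pass drives the ROB status update, so that path can stay idle for three of the four cycles.

    #include <stdio.h>

    /* Hypothetical sticky flag encoding, loosely modeled on MXCSR exception bits. */
    #define FLAG_INVALID   (1u << 0)
    #define FLAG_DENORMAL  (1u << 1)
    #define FLAG_DIVZERO   (1u << 2)
    #define FLAG_OVERFLOW  (1u << 3)

    /* Stand-in for one 256-bit execution pass; here pass 2 happens to overflow. */
    static unsigned execute_quarter(int pass) {
        return (pass == 2) ? FLAG_OVERFLOW : 0;
    }

    /* Execute one AVX-1024 uop as four passes; write status back once at the end. */
    static unsigned execute_avx1024_uop(void) {
        unsigned acc = 0;
        for (int pass = 0; pass < 4; pass++)
            acc |= execute_quarter(pass);   /* flags are sticky, so OR-ing loses nothing */
        /* Single combined update of the ROB entry's exception status; the path
         * feeding it stays idle (and could be gated) during the first 3 passes. */
        return acc;
    }

    int main(void) {
        printf("combined exception status: 0x%x\n", execute_avx1024_uop());
        return 0;
    }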
A potentially simpler alternative would be to drain the execution pipelines and put the entire core on a 1/4 clock regime, except for the ALU, register file and caches, once an AVX-1024 instruction reaches the scheduler.
That sounds like it would too seriously compromise the scalar performance of the core. Also, the L/S units and AGUs would need to be kept at speed if the cache is kept at speed.
Sandy Bridge has 144 x 256-bit registers. Note that with two threads per core and 16 logical 1024-bit registers, you need at most 128 physical rename registers. So there's no strict need to increase the number of registers if you want to cover the same latency. And as long as there are independent instructions, those few extra registers will suffice to continue progress.
144 256-bit registers amount to 36 1024-bit registers. The physical register file hosts both architectural and rename registers.
32 of them would be the architectural state registers for two threads.
That leaves 4 rename registers. For an AVX instruction with a 5+ cycle latency, the 4-cycle writeback phase would not occur immediately, so the pipeline would need independent instructions to hide the latency. Each new instruction takes a rename register, of which there are only 4 remaining.
Keeping a SB-sized register file would bring a GPU-like occupancy problem to the OoO engine, and create acute software scheduling pressures that the OoO engine used to handle transparently.
Also, once the 4 rename registers are taken, the front end will stall if it encounters another AVX instruction.
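To put numbers on that, here is a small back-of-the-envelope sketch using the figures above: 144 x 256-bit physical registers amount to 36 1024-bit registers, 32 of which hold architectural state for two threads, leaving a 4-entry free list. The toy rename loop then shows a fifth independent AVX-1024 op stalling before any earlier one has retired.

    #include <stdio.h>

    int main(void) {
        const int prf_bits  = 144 * 256;             /* SNB vector PRF capacity in bits */
        const int regs_1024 = prf_bits / 1024;       /* = 36 registers at 1024 bits each */
        const int arch_regs = 2 * 16;                /* two threads x 16 architectural regs */
        int free_regs       = regs_1024 - arch_regs; /* = 4 left over for renaming */

        /* Try to rename 6 independent AVX-1024 ops before any of them retires;
         * each op with a new destination holds a physical register until retirement. */
        for (int op = 0; op < 6; op++) {
            if (free_regs == 0) {
                printf("op %d: rename stalls, free list empty\n", op);
            } else {
                free_regs--;
                printf("op %d: renamed, %d physical registers free\n", op, free_regs);
            }
        }
        return 0;
    }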
This could be avoided by expanding register capacity in some way, possibly by having native 1024-bit registers or more of the smaller registers, to prevent a pretty significant regression in scheduling capability.
Other architectures may segregate these extra-wide register instructions onto a separate set of ports, or adopt a coprocessor model like BD or its APU descendants.