The problem is you can't assume all else to be equal.
We are debating a few new instructions in an ISA while discarding an entire ISA change as being insufficient. We can't debate much at all if we give one side an implied process node and physical design advantage. I've already noted that ISA does not override implementation, but even apparently minor implementation decisions could swing it either way.
Suffice to say I won't settle for a "guesstimate".
None of the parties with the most detailed knowledge has an interest in giving an honest assessment, even if comparable designs existed.
Medfield and the upcoming optimized 28nm ARM cores from Qualcomm and others might bring some clarity in the tablet/phone range. At least then we'd have cores with the same target market and a few competitors to Intel that devote at least some effort to optimized physical design.
For argument's sake let's say that an AVX-256 instruction takes 120% of the power consumption of an equivalent 256-bit ARM instruction, of which 70% in the front-end and 50% in the back-end.
I see we might have started guesstimating again.
The numerical basis seems a little muddled, since we have a value that is 120% of an unknown base and then two percentages that don't add up to 100%. What are the 70% and 50% percentages of?
Does the following interpretation sound correct to you?
Assume an ARM 256-bit op costs 100 units of energy.
Assume an AVX 256-bit op costs 120 units.
Your next assumption is that for AVX, 70 units are expended by the front end and 50 by the back end.
In the 1024-bit comparison:
4 ARM ops × 100 = 400 units.
1 AVX-1024 op = 70 + 50 + 50 + 50 + 50 = 270 units (front end paid once, back end paid on each of the 4 cycles).
This would make the 1024-bit x86 op use about 33% less energy than 4 ARM ops and 44% less than 4 AVX-256 ops, using proportions I cannot verify as accurate for either ARM or x86.
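For concreteness, here is the same model in a few lines of Python; every number in it is one of the assumed, unverifiable unit costs from above, not a measurement:

```python
# A minimal sketch of the energy model above. The unit costs are the
# assumed (unverifiable) values from this thread, not measured figures.
FRONT_END = 70   # assumed front-end cost of one AVX-256 op, arbitrary units
BACK_END  = 50   # assumed back-end cost of one AVX-256 op
ARM_OP    = 100  # assumed cost of one ARM 256-bit op

arm_total     = 4 * ARM_OP                   # four ARM 256-bit ops -> 400
avx256_total  = 4 * (FRONT_END + BACK_END)   # four AVX-256 ops     -> 480
# AVX-1024 cracked over 4 cycles: front end paid once, back end four times
avx1024_total = FRONT_END + 4 * BACK_END     #                      -> 270

print(f"vs 4 ARM ops:     {1 - avx1024_total / arm_total:.1%} less energy")     # 32.5%
print(f"vs 4 AVX-256 ops: {1 - avx1024_total / avx256_total:.1%} less energy")  # 43.8%
```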
This rests on certain assumptions, such as it being that simple to carve roughly 60% of an AVX op's power cost into a "front end" bucket that can be driven to 0 units for 3 of the 4 cycles.
Parts of the front end can be very difficult to truly turn off, so there is some undetermined additive factor that persists across all 4 cycles. Also, the usual definition of the front end is instruction fetch and decode, but are you including the scheduling stage in the front end? The back end's work is also a little more complex, so what happens in the 1024-bit case is not quite the same as what is done in the 256-bit case.
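To show why that residual matters, here is the same model with a hypothetical ungated front-end floor that persists across the 3 "idle" cycles; the floor values are pure invention, the point is only the sensitivity:

```python
# Same model plus a hypothetical front-end floor that cannot be gated off.
# The floor values are invented; nobody in this thread knows the real one.
FRONT_END, BACK_END, ARM_TOTAL = 70, 50, 400
for fe_floor in (0, 10, 20, 30):
    # Full front-end cost once, then just the floor for the other 3 cycles.
    avx1024 = FRONT_END + 3 * fe_floor + 4 * BACK_END
    print(f"floor={fe_floor:2}: total={avx1024}, "
          f"{1 - avx1024 / ARM_TOTAL:.1%} saved vs 4 ARM ops")
# floor=0 -> 32.5% saved; floor=30 -> only 10.0% saved
```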
The fact that our base numbers are unverifiable is the first source of uncertainty.
The second is that the units of power devoted to each portion will differ between an AVX-256 and an AVX-1024 implementation, since the pipeline behaves slightly differently in each case.
My earlier arguments concerned the context of these instructions. Are we assuming aggressively multithreaded OoO superscalar cores in each case? There are costs associated with that which I would treat as a floor on power consumption, one that becomes more significant as everything else becomes more efficient.
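As a crude illustration of that floor effect (numbers invented once again): suppose the OoO/scheduling machinery burns a fixed amount per cycle regardless of the vector work. As the datapath gets more efficient, the floor takes over:

```python
# Invented numbers: a fixed per-cycle cost for OoO bookkeeping dominates
# as the vector datapath's share of the energy shrinks.
OOO_FLOOR = 40                       # hypothetical fixed units per cycle
for vector_cost in (100, 50, 25):    # vector work per cycle, getting cheaper
    total = OOO_FLOOR + vector_cost
    print(f"vector={vector_cost:3}: floor is {OOO_FLOOR / total:.0%} of total")
# vector=100 -> 29% of total; vector=25 -> 62% of total
```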
My other point was that these cores, by their nature, have a policy of keeping the front end and scheduler awake, but that has been covered already.
This then goes back to my post at the start of this chain on the idea of an AVX-1024 core approaching the power efficiency of an in-order throughput core. The unit costs for the front end and back end are very different when comparing either ARM or x86 fat cores to a power-optimized simple core.
The thing I was really most concerned about is whether it's a viable idea. And I'm glad to see that you appear to agree it's worth considering.
I agree that it is doable and that it can be an improvement. I believe we have been discussing as best we can the question of how much.
That's an interesting idea! I'm pretty certain that Haswell will have two FMA ports. I wonder how you could detect a dependency chain though (or if you even should). In any case it's a detail that shouldn't affect the overall viability.
With cracked ops, it would be handled in the rename stage as the 256-bit registers are allocated. The dependence would be indicated by the rename registers used, and the scheduler would spread the ops across multiple ports as naturally as if they were separate 256-bit ops.
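A loose sketch of what I have in mind, with every name and structure invented for illustration (this is not a claim about any real renamer):

```python
from itertools import count

free_regs = count(64)   # hypothetical free list of physical registers
rat = {}                # rename table: (arch_reg, 256-bit slice) -> phys reg

def crack_1024(op, dst, srcs):
    """Crack one 1024-bit op into four 256-bit uops, one per slice.
    Sources read the current mapping and each destination slice gets a
    fresh physical register, so dependences between 1024-bit ops fall out
    of the rename registers and the scheduler can spread the slices across
    ports exactly as if they were independent 256-bit ops."""
    uops = []
    for s in range(4):
        phys_srcs = tuple(rat.get((r, s), ("arch", r, s)) for r in srcs)
        rat[(dst, s)] = next(free_regs)
        uops.append((op, rat[(dst, s)], phys_srcs))
    return uops

# A dependent pair: the second op's slices name the first op's rename regs.
crack_1024("fma", dst=0, srcs=(1, 2))
print(crack_1024("fma", dst=3, srcs=(0, 2)))
```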
If not cracked, it would still be possible. One port would need the ability to send control signals to a unit on the other, and the scheduler would, in this scenario, suppress instruction issue in the same domain on the secondary port. The scheduler would take a more active role in tracking how often it has an opportunity to gang the ports together, so it's not as transparent to the back end in that case.
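Roughly, the extra bookkeeping might look like this (again entirely invented, just to show the scheduler's role in suppressing the secondary port):

```python
# Hypothetical issue selection for the un-cracked, ganged case. Invented
# names; the point is only that the secondary port gets suppressed.
def select_fma_issue(ready_ops):
    issued, ports = [], {"p0", "p1"}
    for op in ready_ops:
        if op["width"] == 1024 and ports == {"p0", "p1"}:
            issued.append((op["id"], "p0+p1"))  # p0 drives p1's unit
            ports.clear()                       # p1 slaved: nothing else issues
        elif op["width"] <= 256 and ports:
            issued.append((op["id"], ports.pop()))
    return issued

print(select_fma_issue([{"id": "a", "width": 1024},
                        {"id": "b", "width": 256}]))  # b waits this cycle
```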